
Error for training landscape dataset #6

Open
jiajiaxiaoskx opened this issue Apr 23, 2024 · 8 comments

@jiajiaxiaoskx

When I run the training code on the landscape dataset, I encounter an error. How should I solve it?

```
LoRA rank 16 is too large. setting to: 4
Traceback (most recent call last):
  File "train.py", line 1221, in <module>
    main(**config)
  File "train.py", line 770, in main
    unet_lora_params, unet_negation = inject_lora(
  File "train.py", line 293, in inject_lora
    params, negation = injector(**injector_args)
  File "/home/TempoTokens/utils/lora.py", line 461, in inject_trainable_lora_extended
    _tmp.to(_child_module.bias.device).to(_child_module.bias.dtype)
AttributeError: 'NoneType' object has no attribute 'device'
```
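For context, the crash occurs because `_child_module.bias` is `None` for layers created without a bias term. A minimal guard for `inject_trainable_lora_extended` in utils/lora.py might look like the sketch below (an assumption based on the traceback, not the repository's official fix):

```python
# Sketch: take the device/dtype from the bias only when it exists;
# otherwise fall back to the weight tensor of the original module.
if _child_module.bias is not None:
    _tmp.to(_child_module.bias.device).to(_child_module.bias.dtype)
else:
    _tmp.to(_child_module.weight.device).to(_child_module.weight.dtype)
```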

Thank you for your answer!

@guyyariv
Owner

Hi, our pre-trained models were not trained with LoRA, so I have not encountered this error. Try disabling LoRA in the config file and training the adapter only.

@jiajiaxiaoskx
Author

Thank you! I still don't quite understand the adapter you mentioned. For the landscape and audioset-drum datasets, could you tell me which training modules should be set to True in the config file?

@guyyariv
Owner

Be sure to set `use_unet_lora: False` in the config file to disable LoRA training. The adapter, however, is required for training, so you cannot disable it from the config file. Good luck! :)

@jiajiaxiaoskx
Author

Thank you for your patient reply and excellent work! I encountered several errors while running your training code, including issues with the default parameter settings and the dataset input. I'm not sure whether it's because I don't fully understand the code framework or because there are issues with the code itself. For example:

File "train.py", line 1056, in main
for step, batch in enumerate(train_dataloader):
File "/home/anaconda3/envs/protagonist-113/lib/python3.8/site-packages/accelerate/data_loader.py", line 384, in iter
current_batch = next(dataloader_iter)
File "/home/anaconda3/envs/protagonist-113/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 681, in next
data = self._next_data()
File "/home/anaconda3/envs/protagonist-113/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 721, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/anaconda3/envs/protagonist-113/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/anaconda3/envs/protagonist-113/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/TempoTokens/utils/dataset.py", line 666, in getitem
index = random.choice(list(self.valid_videos))
File "/home/anaconda3/envs/protagonist-113/lib/python3.8/random.py", line 290, in choice
raise IndexError('Cannot choose from an empty sequence') from None
IndexError: Cannot choose from an empty sequence

Could you possibly provide a final version of the code you used for training on the audioset-drum or landscape datasets (including the config file)? I would be very grateful!
If it's not convenient, I can communicate with you via email. Thank you!

@guyyariv
Owner

It looks like you didn't load the datasets as required; they should be split into an audio folder and a video folder. The IndexError means the dataset found no valid videos, hence the empty sequence. The current code should run without any issues. Feel free to reach out to me via email at guyyariv.mail at gmail dot com. A quick sanity check is sketched below.
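A quick way to verify the layout (a sketch; the folder names here are assumptions based on this thread, not the repository's exact expectations):

```python
# Count the files the loader would see in each split folder.
from pathlib import Path

video_files = sorted(Path("landscape/video").iterdir())
audio_files = sorted(Path("landscape/audio").iterdir())
print(f"{len(video_files)} videos, {len(audio_files)} audio files")
assert video_files and audio_files, "both dataset folders must be non-empty"
```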

@jiajiaxiaoskx
Author

Following your suggestions, I attempted to replicate the process and ran experiments on the Landscape and Audioset-Drum datasets (without changing the provided config files). However, my results have been unsatisfactory. Below are the changes I made to the code:

1. The audio data is stereo, so the input shape is [2, 16000]. Since your dataset code expects an audio shape of [1, 16000], I averaged the two channels (see the sketch after this list).
2. I imported randn_tensor from diffusers.utils.torch_utils.
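Regarding point 1, the averaging can be a one-liner; a minimal sketch, assuming `waveform` is a float tensor of shape [2, 16000] (channels, samples):

```python
import torch

def to_mono(waveform: torch.Tensor) -> torch.Tensor:
    # Average the two channels, keeping a channel dimension: [2, 16000] -> [1, 16000]
    return waveform.mean(dim=0, keepdim=True)
```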
After making these changes, I was able to complete training, but the validation results were very poor: significantly inferior to the results showcased on your project page and to those generated with your provided pretrained model. I'm not sure whether I made a mistake somewhere, whether you employed additional strategies during training, or whether I should change the parameters in the config file. I would like to seek your advice on this matter.

@guyyariv
Owner

guyyariv commented May 4, 2024

Hi, I'm not sure why you cannot reproduce the Landscape and Audioset-Drum results. These are both easy datasets (less challenging than VGGSound, for example), and the model should converge quickly and to high quality on them. I used the provided versions of those datasets (as mentioned in the README; for example, https://drive.google.com/drive/folders/14A1zaQI5EfShlv3QirgCGeNFzZBzQ3lq is Landscape) and split them into video-only and audio-only (mono) folders, then used the provided config file. You can ask ChatGPT to write a script that splits them into two folders for you (a sketch of such a script follows). Then try training again, and let me know if it improves.
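A minimal sketch of such a split script, assuming the raw clips are .mp4 files with embedded audio and that ffmpeg is installed (the paths and the 16 kHz mono target are assumptions from this thread, not the repository's official preprocessing):

```python
import subprocess
from pathlib import Path

SRC = Path("landscape_raw")          # hypothetical folder of original clips
VIDEO_DIR = Path("landscape/video")  # video-only output
AUDIO_DIR = Path("landscape/audio")  # audio-only (mono) output
VIDEO_DIR.mkdir(parents=True, exist_ok=True)
AUDIO_DIR.mkdir(parents=True, exist_ok=True)

for clip in sorted(SRC.glob("*.mp4")):
    # Copy the video stream without its audio track.
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(clip), "-an", "-c:v", "copy",
         str(VIDEO_DIR / clip.name)],
        check=True,
    )
    # Extract the audio as 16 kHz mono WAV; "-ac 1" downmixes stereo to mono.
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(clip), "-vn", "-ac", "1", "-ar", "16000",
         str(AUDIO_DIR / (clip.stem + ".wav"))],
        check=True,
    )
```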

@jiajiaxiaoskx
Copy link
Author

Hello, thank you for your patient reply! I still have a few questions about the code implementation that I would like to confirm with you:

1. The original video sizes of the three datasets differ (landscape is 288x512, audioset-drum is 96x96, vggsound is 360x212), yet the three config files set the video size for training and inference to 384x384 and use a bucketing strategy. Should I change this parameter, or follow your setting and standardize to 384x384?
2. The audio data for these datasets is stereo. When converting it to mono, should I average the two channels, or use only the first (or second) channel?
3. Does changing the batch size during training affect the results (can I change it to 2)?

I look forward to your reply!
