Process got stuck when trying to optimize different groups of parameters using different types of data #584

Hi,
I'm adding a new linear projection layer (nn.Linear) to the original Llama-3 architecture to process a new type of data. During training I use two types of data: language-only and multimodal. With language-only data, all of the Llama-3 parameters are finetuned; with multimodal data, both the Llama-3 parameters and the parameters of the added linear layer are finetuned. Each setup works fine on its own.
However, when I combine the two types of data for multi-task learning, the process just gets stuck without any further output. Does the current torchtitan not support this kind of use case? Thanks.
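For illustration, a minimal sketch of the setup being described. Everything here is hypothetical (the wrapper class, the extra_embeds hook, the argument names); none of it is taken from the poster's code:

```python
import torch.nn as nn

class MultimodalWrapper(nn.Module):
    """Llama-3 backbone plus one extra projection for the new modality."""

    def __init__(self, llama: nn.Module, extra_dim: int, hidden_dim: int):
        super().__init__()
        self.llama = llama                               # original Llama-3 model
        self.mm_proj = nn.Linear(extra_dim, hidden_dim)  # the newly added layer

    def forward(self, tokens, mm_features=None):
        if mm_features is not None:
            # Multimodal batch: project the new modality into the hidden space.
            mm_embed = self.mm_proj(mm_features)
            return self.llama(tokens, extra_embeds=mm_embed)  # hypothetical hook
        # Language-only batch: mm_proj is never called on this rank. This is
        # the conditional computation discussed in the comments below.
        return self.llama(tokens)
```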
Comments
For some further information: I use single-node, multi-GPU distributed training. After waiting for a long time, I eventually received error messages.
It may help if you can provide a repro of some kind and/or give some more information about what parallelism you are using.
Hi, the relevant parallelism settings are in the [training] and [experimental] sections of my config.
Is it possible that you have any kind of conditional computation? For example, one data parallel rank does not receive the multimodal data, so the linear layer does not get used? It also depends a bit on how you applied FSDP to the modules. Is it difficult to provide a way to repro the issue so that we can help debug? (I understand it might be very hard but just wanted to ask.)
Yes, that can happen (one data-parallel rank uses the linear layer and the others do not). So it seems the current implementation doesn't support such a use case, right? And yes 😂 it is still an ongoing project, so we haven't open-sourced the code yet.
Yea... you might need to feed some dummy data through the unused layer on the ranks that skip it.
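One way to read that suggestion, as a sketch only, reusing the hypothetical MultimodalWrapper from above: on language-only batches, still push a zero tensor through mm_proj so every rank issues the same FSDP all-gather, and fold the result into the output with zero weight so the backward side matches too.

```python
import torch

class DummyFeedWrapper(MultimodalWrapper):  # builds on the sketch above
    def forward(self, tokens, mm_features=None):
        if mm_features is not None:
            mm_embed = self.mm_proj(mm_features)
            return self.llama(tokens, extra_embeds=mm_embed)  # hypothetical hook

        # Language-only batch: run a dummy input through mm_proj anyway, so
        # this rank issues the same FSDP all-gather as the ranks holding
        # multimodal data in the current step.
        dummy_in = torch.zeros(
            1, self.mm_proj.in_features,
            device=tokens.device, dtype=self.mm_proj.weight.dtype,
        )
        dummy_out = self.mm_proj(dummy_in)
        logits = self.llama(tokens)
        # Folding the dummy activation in with zero weight leaves the loss
        # unchanged but records mm_proj in the autograd graph, so the
        # gradient (reduce-scatter) collectives also line up in backward
        # (this bears on the gradient question asked further down).
        return logits + 0.0 * dummy_out.sum()
```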
I see. Thanks for your help! |
Just one quick question: when we run the dummy input through the added linear layer, do we need to compute gradients for the linear layer with respect to this dummy part? Or would just running the dummy input through the entire model (the added linear layer plus the whole Transformer) be enough?
I think there is a bit of nuance depending on how you apply FSDP to the model. If you are not directly calling
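The nuance here is presumably about where the added layer lands in the FSDP wrapping. A sketch of the two placements, assuming the FSDP2-style fully_shard API; build_llama3, the layers attribute, and the dimensions are placeholders carried over from the earlier sketches:

```python
from torch.distributed._composable.fsdp import fully_shard

llama = build_llama3()  # placeholder: however the base model is constructed
model = MultimodalWrapper(llama, extra_dim=1024, hidden_dim=4096)  # dims made up

# Shard each transformer block as its own unit (the usual torchtitan pattern).
for block in model.llama.layers:  # hypothetical attribute name
    fully_shard(block)

# Option A: give mm_proj its own FSDP unit. It then owns its own all-gather
# and reduce-scatter, which only fire on ranks whose forward and backward
# actually visit the layer, which is exactly the mismatch that can hang:
# fully_shard(model.mm_proj)

# Option B: do not shard mm_proj separately. The root call sweeps up all
# unclaimed parameters (mm_proj included), so its forward all-gather runs on
# every rank each step; combined with the zero-weighted dummy loss sketched
# above, the backward side stays uniform across ranks as well.
fully_shard(model)
```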
Thanks for the clarification! |