Cannot train Seeker with batch size > 1 #4531
Comments
Can you report your fairscale version?
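(For anyone following along: assuming a standard pip install, the installed fairscale version can be printed with a one-liner, since the package exposes `__version__`.)

```bash
python -c "import fairscale; print(fairscale.__version__)"
```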
The above error message was from fairscale 0.3.7. Also tried fairscale 0.4.6 and got a similar error:
I wonder if it's multiprocessing train... Does that work with a
No. We just tried running this (fairscale 0.4.6):

And got this:
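(To make the multiprocessing theory concrete: one way to isolate it is to run the same configuration through both entry points and compare. This is an illustrative sketch only; the task and model below are stand-ins, not the reporter's actual flags.)

```bash
# Multi-GPU on one host via spawned workers, with zero2 (FSDP) sharding:
parlai multiprocessing_train -t convai2 -m transformer/generator -bs 2 --ddp-backend zero2

# Same configuration in a single process; if this succeeds, the
# multiprocessing path is the likely suspect:
parlai train_model -t convai2 -m transformer/generator -bs 2
```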
Sorry, one more thing. Can you roll back to 0.4.4?
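(Pinning the version is just:)

```bash
pip install fairscale==0.4.4
```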
Sorry, it turns out that transformer/generator works fine with bs > 1. We ran into the above error because we turned off flatten_parameter (which is also strange, but I suppose this is a fairscale problem). We still couldn't train Seeker with bs > 1 with fairscale 0.4.6. We're trying 0.4.4 now and will report back.
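(For context on the knob being discussed: fairscale's FSDP wrapper exposes it as `flatten_parameters`, which defaults to True and coalesces all wrapped parameters into one contiguous flat buffer. A minimal CPU-only sketch of toggling it, independent of ParlAI; the single-process gloo group here is just to make the example self-contained.)

```python
import os
import torch
import torch.distributed as dist
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# FSDP needs an initialized process group, even for a world of size 1.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = torch.nn.Linear(16, 16)

# flatten_parameters=False is the "turned off" setting mentioned above;
# the default (True) stores all parameters in one flat buffer.
sharded = FSDP(model, flatten_parameters=False)
out = sharded(torch.randn(2, 16))
print(out.shape)
```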
So we just tried training Seeker with fairscale 0.4.4 and got the same error.
I'm able to repro on my end, so I'll try to look into it a bit more and report back here with findings.

Update 1: The model is able to train with
Passing command:

Failing command:

Update 2: This fails with the gold-doc standard FiD agent as well.
Should we try with Slurm to rule out it being multiprocessing?
Tried this, still fails. Something is hanging somewhere...
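(For reference, ParlAI's Slurm path goes through `parlai.scripts.distributed_train` rather than `multiprocessing_train`. A sketch of the launch; the task and model flags are placeholders, not the actual Seeker configuration, and the surrounding sbatch allocation is assumed.)

```bash
# Inside an sbatch allocation; srun starts one worker per Slurm task.
srun python -m parlai.scripts.distributed_train \
    -t convai2 -m transformer/generator -bs 2 --ddp-backend zero2
```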
This issue has not had activity in 30 days. Please feel free to reopen if you have more issues. You may apply the "never-stale" tag to prevent this from happening.
Bug description
It seems that the Seeker training command does not support batch size > 1. I ran into an FSDP error when training Seeker-400M with -bs 2.

Reproduction steps
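(The exact command was lost in extraction; from the report above, its shape was a Seeker-400M fine-tune with batch size 2 under FSDP. A hypothetical reconstruction follows; the agent class, zoo path, and task are assumptions standing in for the reporter's actual flags.)

```bash
parlai multiprocessing_train \
    -m projects.seeker.agents.seeker:ComboFidAgent \
    --init-model zoo:seeker/seeker_dialogue_400M/model \
    -t wizard_of_internet \
    -bs 2 --ddp-backend zero2
```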
Expected behavior
I was hoping that training could succeed.
Logs
Please paste the command line output:
Additional context
Not sure if this is a bug or a feature request.