Accelerate failing on multi-gpu rng synchronization #209
Comments
@LWprogramming I can't tell at first glance; the code looks OK from a quick scan. I may be getting back to audio stuff / TTS later this week, so I can help with this issue then. Are you using Encodec?
Yeah, using Encodec. Do you suspect that the codec might be the issue somehow? I also notice (after adding some more prints) that we see some weird behavior:
I don't really know, but it's probably good to rule out an external library as the issue. I'll get back to this either at the end of this week or next Monday; going all out on audio again soon.
OK, this is pretty baffling. I tried rearranging the order in which I train semantic, coarse, and fine (starting with coarse and then semantic) and it ran fine; I was actually able to get samples! Still using my script. Gotta run now, but I'll take a look in a bit. Not sure why it reliably breaks down immediately at the start of coarse if I train semantic, coarse, then fine in that order?
Are you training them all at once?
Yeah, the setup is something like: given some configurable integer, train semantic for that many steps, then coarse, then fine, all in the same script (roughly the shape of the sketch below).
Ohh! Yeah, that's the issue then. You can only train one network per training script.
I can add some logic to prevent this issue in the future, with an informative error |
Wait, what? haha. Does the accelerator do something weird that can only happen once per call? (Also, are you defining "training script" as a single Python script? E.g., can you only prepare the accelerator once per execution of the thing I call with `accelerate launch`?)
You'd need the training script to be executed three times separately, once for each network, with each run terminating before the next starts. Then you put all the models together.
Oh interesting, is this something that's built into accelerate, or is it specific to your code? I don't recall seeing any warnings about this in the Hugging Face docs. Or, if it's your code, which part assumes that? haha
@LWprogramming this is just how neural network training is generally done today, if you have multiple big networks to train |
I can add the error message later today! This is a common gotcha, which I handled before over at imagen-pytorch (which also has multiple networks).
Ahh, ok! I'll have to rewrite some of my code haha. (For anyone looking at this in the future: I just talked to a friend of mine and they pointed out that training multiple models in parallel either requires moving parameters on and off the GPU a lot more, or, if the models are small enough to all fit in memory at once, the batch size is necessarily smaller. I guess I still don't know what exactly caused things to break, but it doesn't matter so much now.) Thanks so much!
Haha yeah, we are still in the mainframe days of deep learning. A century from now, maybe it won't even matter.
I can do semantic but not coarse transformer training right now. Here's what the error message looks like:
This is in the trainer.py file. I don't think the dataloaders are constructed any differently, so I'm confused about whether this is expected (it also wasn't clear what the "generator" rng type means vs. "cuda" or the others). Do you have any ideas for why this might be failing only on coarse but not semantic?
I found this issue with the same error message, but unfortunately it never got resolved, and I didn't find any similar issues besides that one.