
Accelerate failing on multi-gpu rng synchronization #209

Closed
LWprogramming opened this issue Jul 18, 2023 · 15 comments

@LWprogramming
Contributor

I can train the semantic transformer but not the coarse transformer right now. Here's what the error message looks like:

File "/path/to/trainer.py", line 999, in train_step
data_kwargs = dict(zip(self.ds_fields, next(self.dl_iter)))
File "/path/to/trainer.py", line 78, in cycle
for data in dl:
File "/path/to/venv/site-packages/accelerate/data_loader.py", line 367, in iter
synchronize_rng_states(self.rng_types, self.synchronized_generator)
File "/path/to/venv/site-packages/accelerate/utils/random.py", line 100, in synchronize_rng_states
synchronize_rng_state(RNGType(rng_type), generator=generator)
File "/path/to/venv/site-packages/accelerate/utils/random.py", line 95, in synchronize_rng_state
generator.set_state(rng_state)
RuntimeError: Invalid mt19937 state

This happens in trainer.py. I don't think the coarse dataloader is constructed any differently from the semantic one, so I'm confused about whether this is expected (it also wasn't clear to me what the generator is versus an rng type like cuda). Do you have any ideas why this might fail only for coarse but not semantic?

I found this issue with the same error message but it never got resolved unfortunately, and didn't find any similar issues besides that one.
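For context on my second question: my rough mental model (not the exact Accelerate internals) is that the prepared DataLoader carries a torch.Generator, and at the start of iteration every process resets that generator to the main process's state so all ranks shuffle identically; the error means set_state() was handed a tensor that isn't a valid MT19937 state. A minimal sketch of that general idea with plain torch.distributed (function name and broadcast details are mine, not Accelerate's):

```python
import torch
import torch.distributed as dist

def sync_generator_with_rank_zero(generator: torch.Generator):
    """Broadcast rank 0's RNG state so every process shuffles identically.

    Rough illustration of the general mechanism only; assumes a backend that
    can broadcast CPU tensors (e.g. gloo) and that dist is already initialized.
    """
    state = generator.get_state()        # ByteTensor holding the mt19937 state
    if dist.get_rank() != 0:
        state = torch.empty_like(state)  # placeholder that will receive rank 0's state
    dist.broadcast(state, src=0)
    generator.set_state(state)           # "Invalid mt19937 state" if this tensor is malformed
```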

@lucidrains
Owner

lucidrains commented Jul 18, 2023

@LWprogramming i can't tell from first glance; code looks ok from a quick scan

i may be getting back to audio stuff / TTS later this week, so can help with this issue then

are you using Encodec?

@LWprogramming
Contributor Author

Yeah, using Encodec. Do you suspect that the codec might be the issue somehow?

I also noticed (after adding some more prints) some weird behavior:

  • all the GPUs make it to the on device {device}: accelerator has... print
  • only the main GPU makes it to device {device} arrived at 2
  • the main GPU crashes at the wait_for_everyone() shortly after point 2, so it never arrives at 3. That suggests wait_for_everyone() is either causing an issue or exposing one, if the other GPUs are already unable to train properly. (See the sketch after this list for where the markers sit.)
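
To make the ordering concrete, here's roughly how the prints sit in my train step (self.accelerator, self.ds_fields, and self.dl_iter follow the traceback above; the numbered markers are just my own debug prints, not library output):

```python
def train_step(self):
    # Simplified sketch of my instrumented train step; not the exact trainer code.
    device = self.accelerator.device

    print(f"on device {device}: accelerator has prepared everything")  # 1: all GPUs reach this

    # rng synchronization happens inside the prepared dataloader's __iter__
    data_kwargs = dict(zip(self.ds_fields, next(self.dl_iter)))
    print(f"device {device} arrived at 2")                             # 2: only the main GPU reaches this

    self.accelerator.wait_for_everyone()                               # main GPU crashes/hangs here...
    print(f"device {device} arrived at 3")                             # 3: ...so this never prints
```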

@lucidrains
Owner

i don't really know, but probably good to rule out an external library as the issue

will get back to this either end of this week or next Monday. going all out on audio again soon

@LWprogramming
Contributor Author

OK, this is pretty baffling. I tried rearranging the order in which I train semantic, coarse, and fine (starting with coarse and then semantic) and it ran fine, and I was actually able to get samples! Still using my script; gotta run now but I'll take a closer look in a bit. Not sure why it reliably breaks down immediately at the start of coarse if I train semantic, coarse, then fine in that order.

@lucidrains
Owner

are you training them all at once?

@LWprogramming
Contributor Author

LWprogramming commented Jul 18, 2023

Yeah, the setup is something like (given some configurable integer save_every):

train semantic for save_every steps, then train coarse for save_every steps, then fine. Then try sampling, do another save_every steps per trainer, and repeat. This way we can see what the samples look like as the transformers gradually train.
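
In pseudocode the driver loop is roughly this (trainer objects and method names are placeholders for my actual script, not the library's exact API):

```python
# Round-robin schedule that triggers the crash: all three trainers live in the
# same `accelerate launch` process and are stepped in turn.
while not done:
    for _ in range(save_every):
        semantic_trainer.train_step()   # works fine
    for _ in range(save_every):
        coarse_trainer.train_step()     # first step here hits the mt19937 error
    for _ in range(save_every):
        fine_trainer.train_step()
    generate_and_save_samples()         # listen to intermediate results, then repeat
```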

@lucidrains
Owner

ohh! yeah that's the issue then

you can only train one network per training script

@lucidrains
Owner

I can add some logic to prevent this issue in the future, with an informative error

@LWprogramming
Contributor Author

wait what haha

does accelerator do something weird that can only happen once per call?

(also, are you defining "training script" as a single python script? i.e. can I only prepare the accelerator once per execution of the thing I call with accelerate launch?)

@lucidrains
Owner

you'd need the training script to be executed 3 times separately, training one network each time, with each run terminating before the next starts. Then you put all the models together
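
something like this, roughly (the --stage flag and builder helpers are just illustrative, not an API from accelerate or this repo):

```python
# train.py -- launched once per network, e.g.
#   accelerate launch train.py --stage semantic
#   accelerate launch train.py --stage coarse
#   accelerate launch train.py --stage fine
# each process builds and prepares exactly one trainer, then exits before the
# next stage is launched
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--stage", choices=("semantic", "coarse", "fine"), required=True)
args = parser.parse_args()

if args.stage == "semantic":
    trainer = build_semantic_trainer()   # hypothetical helpers that construct the
elif args.stage == "coarse":             # corresponding transformer + trainer
    trainer = build_coarse_trainer()
else:
    trainer = build_fine_trainer()

trainer.train()
```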

@LWprogramming
Contributor Author

Oh interesting, is this something that's built into accelerate, or is it specific to your code? I don't recall seeing any warnings about this in the huggingface docs; if it's your code, which part assumes that? haha

@lucidrains
Owner

lucidrains commented Jul 18, 2023

@LWprogramming this is just how neural network training is generally done today, if you have multiple big networks to train

@lucidrains
Owner

i can add the error message later today! this is a common gotcha, which i handled before over at imagen-pytorch (which is also multiple networks)
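
rough sketch of the kind of guard i mean (not the final code):

```python
# sketch only: remember that a trainer was already created in this process and
# refuse to build a second one
_trainer_already_created = False

def check_one_trainer_per_process():
    global _trainer_already_created
    assert not _trainer_already_created, (
        "only one trainer (semantic, coarse, or fine) can be instantiated per "
        "training script -- run a separate `accelerate launch` for each network"
    )
    _trainer_already_created = True

# each trainer's __init__ would call check_one_trainer_per_process() before
# preparing anything with accelerate
```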

@LWprogramming
Contributor Author

Ahh ok! I'll have to rewrite some of my code haha

(For anyone looking at this in the future: I just talked to a friend and they pointed out that training multiple models in parallel either requires moving parameters on and off the GPU a lot more, or, if the models are small enough to all fit in memory at once, the batch size necessarily gets smaller. I guess I still don't know what exactly caused things to break, but it doesn't matter so much now.)

Thanks so much!

@lucidrains
Owner

haha yea, we are still in the mainframe days of deep learning. A century from now, maybe it won't even matter
