
Accelerate failing on multi-gpu rng synchronization #209

Closed
LWprogramming opened this issue Jul 18, 2023 · 15 comments

@LWprogramming
Contributor

I can train the semantic transformer but not the coarse transformer right now. Here's what the error message looks like:

File "/path/to/trainer.py", line 999, in train_step
data_kwargs = dict(zip(self.ds_fields, next(self.dl_iter)))
File "/path/to/trainer.py", line 78, in cycle
for data in dl:
File "/path/to/venv/site-packages/accelerate/data_loader.py", line 367, in iter
synchronize_rng_states(self.rng_types, self.synchronized_generator)
File "/path/to/venv/site-packages/accelerate/utils/random.py", line 100, in synchronize_rng_states
synchronize_rng_state(RNGType(rng_type), generator=generator)
File "/path/to/venv/site-packages/accelerate/utils/random.py", line 95, in synchronize_rng_state
generator.set_state(rng_state)
RuntimeError: Invalid mt19937 state

This happens in trainer.py. I don't think the coarse dataloader is constructed any differently from the semantic one, so I'm confused about whether this is expected (it also wasn't clear to me what the generator is versus an rng type like cuda). Do you have any ideas why this might fail only for coarse but not semantic?

I found this issue with the same error message but it never got resolved unfortunately, and didn't find any similar issues besides that one.
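For context on my second question: my rough mental model (not the exact Accelerate internals) is that the prepared DataLoader carries a torch.Generator, and at the start of iteration every process resets that generator to the main process's state so all ranks shuffle identically; the error means set_state() was handed a tensor that isn't a valid MT19937 state. A minimal sketch of that general idea with plain torch.distributed (function name and broadcast details are mine, not Accelerate's):

```python
import torch
import torch.distributed as dist

def sync_generator_with_rank_zero(generator: torch.Generator):
    """Broadcast rank 0's RNG state so every process shuffles identically.

    Rough illustration of the general mechanism only; assumes a backend that
    can broadcast CPU tensors (e.g. gloo) and that dist is already initialized.
    """
    state = generator.get_state()        # ByteTensor holding the mt19937 state
    if dist.get_rank() != 0:
        state = torch.empty_like(state)  # placeholder that will receive rank 0's state
    dist.broadcast(state, src=0)
    generator.set_state(state)           # "Invalid mt19937 state" if this tensor is malformed
```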

@lucidrains
Owner

lucidrains commented Jul 18, 2023

@LWprogramming i can't tell from first glance; code looks ok from a quick scan

i may be getting back to audio stuff / TTS later this week, so can help with this issue then

are you using Encodec?

@LWprogramming
Contributor Author

Yeah, using Encodec. Do you suspect that the codec might be the issue somehow?

I also noticed (after adding some more prints) some weird behavior:

  • all the GPUs make it to the on device {device}: accelerator has... print
  • only the main GPU makes it to device {device} arrived at 2
  • the main GPU crashes at the wait_for_everyone() shortly after point 2, so it never arrives at 3. That suggests wait_for_everyone() is either causing an issue or exposing one, if the other GPUs are already unable to train properly. (See the sketch after this list for where the markers sit.)
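
To make the ordering concrete, here's roughly how the prints sit in my train step (self.accelerator, self.ds_fields, and self.dl_iter follow the traceback above; the numbered markers are just my own debug prints, not library output):

```python
def train_step(self):
    # Simplified sketch of my instrumented train step; not the exact trainer code.
    device = self.accelerator.device

    print(f"on device {device}: accelerator has prepared everything")  # 1: all GPUs reach this

    # rng synchronization happens inside the prepared dataloader's __iter__
    data_kwargs = dict(zip(self.ds_fields, next(self.dl_iter)))
    print(f"device {device} arrived at 2")                             # 2: only the main GPU reaches this

    self.accelerator.wait_for_everyone()                               # main GPU crashes/hangs here...
    print(f"device {device} arrived at 3")                             # 3: ...so this never prints
```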

@lucidrains
Owner

i don't really know, but probably good to rule out an external library as the issue

will get back to this either end of this week or next Monday. going all out on audio again soon

@LWprogramming
Contributor Author

OK, this is pretty baffling. I tried rearranging the order in which I train semantic, coarse, and fine (starting with coarse and then semantic) and it ran fine, and I was actually able to get samples! Still using my script; gotta run now but I'll take a closer look in a bit. Not sure why it reliably breaks down immediately at the start of coarse if I train semantic, coarse, then fine in that order.

@lucidrains
Owner

are you training them all at once?

@LWprogramming
Contributor Author

LWprogramming commented Jul 18, 2023

Yeah, the setup is something like (given some configurable integer save_every):

train semantic for save_every steps, then train coarse for save_every steps, then fine. Then try sampling, do another save_every steps per trainer, and repeat. This way we can see what the samples look like as the transformers gradually train.
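
In pseudocode the driver loop is roughly this (trainer objects and method names are placeholders for my actual script, not the library's exact API):

```python
# Round-robin schedule that triggers the crash: all three trainers live in the
# same `accelerate launch` process and are stepped in turn.
while not done:
    for _ in range(save_every):
        semantic_trainer.train_step()   # works fine
    for _ in range(save_every):
        coarse_trainer.train_step()     # first step here hits the mt19937 error
    for _ in range(save_every):
        fine_trainer.train_step()
    generate_and_save_samples()         # listen to intermediate results, then repeat
```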

@lucidrains
Owner

ohh! yeah that's the issue then

you can only train one network per training script

@lucidrains
Owner

I can add some logic to prevent this issue in the future, with an informative error

@LWprogramming
Contributor Author

wait what haha

does accelerator do something weird that can only happen once per call?

(also, are you defining "training script" as a single python script? i.e. can I only prepare the accelerator once per execution of the thing I call with accelerate launch?)

@lucidrains
Owner

you'd need the training script to be executed 3 times separately, training one network each time, with each run terminating before the next starts. Then you put all the models together
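
something like this, roughly (the --stage flag and builder helpers are just illustrative, not an API from accelerate or this repo):

```python
# train.py -- launched once per network, e.g.
#   accelerate launch train.py --stage semantic
#   accelerate launch train.py --stage coarse
#   accelerate launch train.py --stage fine
# each process builds and prepares exactly one trainer, then exits before the
# next stage is launched
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--stage", choices=("semantic", "coarse", "fine"), required=True)
args = parser.parse_args()

if args.stage == "semantic":
    trainer = build_semantic_trainer()   # hypothetical helpers that construct the
elif args.stage == "coarse":             # corresponding transformer + trainer
    trainer = build_coarse_trainer()
else:
    trainer = build_fine_trainer()

trainer.train()
```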

@LWprogramming
Contributor Author

Oh interesting, is this something that's built into accelerate, or is it specific to your code? I don't recall seeing any warnings about this in the huggingface docs; if it's your code, which part assumes that? haha

@lucidrains
Owner

lucidrains commented Jul 18, 2023

@LWprogramming this is just how neural network training is generally done today, if you have multiple big networks to train

@lucidrains
Owner

i can add the error message later today! this is a common gotcha, which i handled before over at imagen-pytorch (which is also multiple networks)
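
rough sketch of the kind of guard i mean (not the final code):

```python
# sketch only: remember that a trainer was already created in this process and
# refuse to build a second one
_trainer_already_created = False

def check_one_trainer_per_process():
    global _trainer_already_created
    assert not _trainer_already_created, (
        "only one trainer (semantic, coarse, or fine) can be instantiated per "
        "training script -- run a separate `accelerate launch` for each network"
    )
    _trainer_already_created = True

# each trainer's __init__ would call check_one_trainer_per_process() before
# preparing anything with accelerate
```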

@LWprogramming
Contributor Author

Ahh ok! I'll have to rewrite some of my code haha

(For anyone looking at this in the future: I just talked to a friend and they pointed out that training multiple models in parallel either requires moving parameters on and off the GPU a lot more, or, if the models are small enough to all fit in memory at once, the batch size necessarily gets smaller. I guess I still don't know what exactly caused things to break, but it doesn't matter so much now.)

Thanks so much!

@lucidrains
Owner

haha yea, we are still in the mainframe days of deep learning. A century from now, maybe it won't even matter
