Long Pauses Between Epochs #36
Comments
This is something related to the latest accelerate version... I can't do much about it until kohya upgrades accelerate to another version as part of his supported code base... but glad you raised it so others are not taken aback by it ;-)
So I'm a bit confused. The latest version of …
It is using the CPU far too much: memory usage ramps up slowly, then it dumps to CUDA, then the epoch runs and the cycle starts over. Considering Automatic1111 works fine and uses the GPU (1 GB more of it), I can say it is something in Kohya's code.
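One way to check the ramp-up-then-dump pattern described above is to log host and GPU memory once per epoch. A minimal sketch, assuming `psutil` is installed; the `log_memory` helper is hypothetical and not part of kohya's code:

```python
import os

import psutil  # assumption: installed via `pip install psutil`
import torch


def log_memory(epoch: int) -> None:
    """Print host RSS and CUDA-allocated memory to make the per-epoch cycle visible."""
    rss_gib = psutil.Process(os.getpid()).memory_info().rss / 1024**3
    cuda_gib = (
        torch.cuda.memory_allocated() / 1024**3 if torch.cuda.is_available() else 0.0
    )
    print(f"epoch {epoch}: host RSS {rss_gib:.2f} GiB, CUDA allocated {cuda_gib:.2f} GiB")


log_memory(0)  # would be called at the top of each epoch in the training loop
```

If the RSS figure climbs through the pause while the CUDA figure stays flat, that would support the CPU-side explanation above.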
I definitely agree. I also do not want to use repeats, as epochs are a better measure of progress, and repeats produce different results than just using epochs. It almost seems like a workaround for whatever is wrong with the code.
I noticed that there are long pauses between epochs, around 35-40 seconds. What is the reasoning behind this, and is there a way to lower it significantly or disable it entirely?
I looked through the code and was not able to find anything specific to pausing between epochs. The other DreamBooth extension (the one for Automatic1111's UI) has an option to pause between epochs (default 60 s), which can be lowered to 1 s if desired.
This also applies to native training.
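For anyone hitting similar pauses in their own PyTorch loops: a stall at every epoch boundary is commonly the DataLoader tearing down its worker processes and respawning them. Whether that is what happens here is not confirmed, but as a point of reference, here is a minimal sketch of the standard mitigation, `persistent_workers` (available since PyTorch 1.7); the toy dataset is just a stand-in:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the real training set.
dataset = TensorDataset(torch.randn(256, 3, 64, 64))

# persistent_workers=True keeps worker processes alive across epochs,
# so the spawn cost is paid once instead of at every epoch boundary.
loader = DataLoader(
    dataset,
    batch_size=8,
    num_workers=2,
    persistent_workers=True,  # requires num_workers > 0
)

for epoch in range(3):
    for (batch,) in loader:
        pass  # training step would go here
```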
Edit: Looks like it hangs on line 206 in train_db.py every epoch. Thanks.
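To confirm where the 35-40 seconds goes, the suspect spot can be bracketed with timers. A hedged sketch of generic instrumentation; the loop and the stand-in dataloader are illustrative, not copied from train_db.py:

```python
import time

num_epochs = 3
train_dataloader = [object()] * 10  # stand-in for the real dataloader

for epoch in range(num_epochs):
    t0 = time.perf_counter()
    for step, batch in enumerate(train_dataloader):
        if step == 0:
            # Time from epoch start to the first batch captures any
            # inter-epoch stall in the dataloader machinery.
            print(f"epoch {epoch}: {time.perf_counter() - t0:.2f}s to first batch")
        # training step would run here
```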