Describe the bug
Running train_controlnet_flux.py with multiple GPUs results in an NCCL timeout error after N iterations of train_dataset.map(). The error can be partially worked around by initializing Accelerator with a larger timeout argument, like so:
from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs

# raise the process-group (NCCL) timeout from its default to N seconds
x = InitProcessGroupKwargs(timeout=timedelta(seconds=N))
accelerator = Accelerator(
    ...,
    kwargs_handlers=[x],
)
However, the NCCL timeout error recurs at a later iteration of train_dataset.map().
Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
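For context, this message is the standard NCCL watchdog behaviour: any rank that sits in a collective for longer than the process-group timeout is taken down. The sketch below is not part of the original report; it is a minimal, hypothetical illustration (the file name and the 30 s / 120 s durations are made up) of the same failure mode, where one rank stays busy while the others wait at a barrier:

# repro_nccl_timeout.py -- hypothetical standalone illustration, not the
# reporter's script. Rank 0 stays busy for longer than the process-group
# timeout while the other ranks wait at a barrier, so the NCCL watchdog
# takes them down with the error quoted above.
# Launch with: torchrun --nproc_per_node=2 repro_nccl_timeout.py
import os
import time
from datetime import timedelta

import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# The same knob that Accelerate's InitProcessGroupKwargs(timeout=...) controls.
dist.init_process_group("nccl", timeout=timedelta(seconds=30))

if rank == 0:
    time.sleep(120)  # stand-in for a long train_dataset.map() on the main process

dist.barrier()  # the other ranks time out here after ~30 seconds
dist.destroy_process_group()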
@sayakpaul I did, by passing the timeout argument when initializing the Accelerator object. Increasing it to a reasonable value delays the error to a later iteration; increasing it to too large a value causes a timeout of its own.
I trained an SD3 ControlNet and hit the same issue. I also found that during multi-GPU training, the text embeddings are only computed on one GPU, and I don't know why.
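Not part of the original comments, but possibly the explanation for both observations: if train_controlnet_flux.py follows the same pattern as the other diffusers ControlNet scripts (an assumption, not verified here), the text-embedding precomputation is wrapped in accelerator.main_process_first(), so the main process runs train_dataset.map() first (hence the work landing on one GPU) while the other ranks block at a barrier, and that barrier is exactly where the NCCL timeout fires when the map takes too long. A minimal, self-contained sketch of that pattern (names and the sleep are illustrative, not copied from the training script):

# demo_main_process_first.py -- hypothetical sketch of the pattern, not the
# actual training script. Launch with:
#   accelerate launch --num_processes 2 demo_main_process_first.py
import time
from accelerate import Accelerator

accelerator = Accelerator()

with accelerator.main_process_first():
    if accelerator.is_main_process:
        # stand-in for the long train_dataset.map(compute_embeddings_fn, ...)
        time.sleep(10)
    # the other ranks only enter this block once the main process leaves it;
    # in the real script they then reload the already-cached dataset

accelerator.print("all ranks resume here once the main process has finished")

If the map on the main process takes longer than the process-group timeout, the waiting ranks hit the same NCCL error as above, which would explain why raising the timeout only postpones it.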
Reproduction
accelerate launch --config_file configs/distributed train_controlnet_flux.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
  --conditioning_image_column=conditioning_image \
  --image_column=image \
  --caption_column=text \
  --output_dir="path" \
  --mixed_precision="bf16" \
  --resolution=1024 \
  --learning_rate=5e-6 \
  --max_train_steps=100000 \
  --validation_steps=1000 \
  --checkpointing_steps=25000 \
  --validation_image "placeholder" \
  --validation_prompt "placeholder" \
  --train_batch_size=4 \
  --gradient_accumulation_steps=1 \
  --report_to="tensorboard" \
  --seed=42 \
  --jsonl_for_train="path" \
  --cache_dir="path"
Contents of the accelerate config passed via --config_file (configs/distributed):
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
use_cpu: false
Logs
System Info
diffusers from source
accelerate == 1.1.1
datasets == 3.1.0
transformers == 4.46.2
Who can help?
No response