NCCL timeout on train_controlnet_flux.py when doing multi-GPU training #9936

Open
neuron-party opened this issue Nov 15, 2024 · 4 comments
Labels: bug (Something isn't working)

Comments

@neuron-party
Contributor

Describe the bug

Running train_controlnet_flux.py on multiple GPUs results in an NCCL timeout error after N iterations of train_dataset.map(). The error can be partially worked around by initializing Accelerator with a larger timeout, as follows:

from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs

# N is the desired timeout in seconds
x = InitProcessGroupKwargs(timeout=timedelta(seconds=N))

accelerator = Accelerator(
    ...,
    kwargs_handlers=[x],
)

However, the NCCL timeout error recurs at a later iteration of train_dataset.map().

Reproduction

accelerate launch --config_file configs/distributed train_controlnet_flux.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
  --conditioning_image_column=conditioning_image \
  --image_column=image \
  --caption_column=text \
  --output_dir="path" \
  --mixed_precision="bf16" \
  --resolution=1024 \
  --learning_rate=5e-6 \
  --max_train_steps=100000 \
  --validation_steps=1000 \
  --checkpointing_steps=25000 \
  --validation_image "placeholder" \
  --validation_prompt "placeholder" \
  --train_batch_size=4 \
  --gradient_accumulation_steps=1 \
  --report_to="tensorboard" \
  --seed=42 \
  --jsonl_for_train="path" \
  --cache_dir="path"

The accelerate config (configs/distributed):

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
use_cpu: false

Logs

Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

System Info

diffusers from source
accelerate == 1.1.1
datasets == 3.1.0
transformers == 4.46.2

Who can help?

No response

@neuron-party added the bug label on Nov 15, 2024
@sayakpaul
Member

Can you try to increase the NCCL timeout value and see if that helps?

@neuron-party
Contributor Author

@sayakpaul I did, by passing the timeout arg when initializing the Accelerator object. Increasing it to a reasonable value delays the error to a later iteration; increasing it to too large a value causes a timeout of its own.

@sayakpaul
Member

Okay. Then maybe precomputing the outputs of the dataset processing step would be more useful in this setup?
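
One way to do this (a rough sketch; compute_embeddings_fn and the cache path below are placeholders, not the script's actual names) is to run the expensive map() once in a standalone script with no process group initialized, then have every rank load the cached result at train time:

from datasets import load_dataset, load_from_disk

CACHE_PATH = "path/to/precomputed_dataset"  # placeholder

def precompute(jsonl_path, compute_embeddings_fn):
    # Run the expensive map() once, outside any NCCL process group,
    # so no collective call can time out while embeddings are computed.
    dataset = load_dataset("json", data_files=jsonl_path, split="train")
    dataset = dataset.map(compute_embeddings_fn, batched=True, batch_size=16)
    dataset.save_to_disk(CACHE_PATH)

def load_precomputed():
    # At train time, every rank simply reads the cached Arrow files.
    return load_from_disk(CACHE_PATH)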

@xduzhangjiayu
Contributor

I hit the same issue while training an SD3 ControlNet. I also noticed that during multi-GPU training the text embeddings are only computed on one GPU, and I don't know why.
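
If the script wraps that map() call in accelerator.main_process_first() (a common pattern in the diffusers training scripts), the single-GPU behavior is expected: only the main process actually runs the embedding computation while the other ranks block at a barrier and later reuse the datasets cache, and that barrier is exactly the wait that hits the NCCL timeout. A minimal sketch of the pattern, assuming that setup (the compute_embeddings function and dataset below are placeholders):

from datetime import timedelta
from datasets import Dataset
from accelerate import Accelerator, InitProcessGroupKwargs

def compute_embeddings(batch):
    # placeholder for the real text-encoder forward pass
    return batch

train_dataset = Dataset.from_dict({"text": ["placeholder"]})

# Raise the collective timeout so the idle ranks survive the long map().
kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=3))
accelerator = Accelerator(kwargs_handlers=[kwargs])

with accelerator.main_process_first():
    # Only the main process runs the expensive map(); the other ranks wait
    # here and then load the cached Arrow files, which is why the work
    # appears on a single GPU.
    train_dataset = train_dataset.map(compute_embeddings, batched=True)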
