Flux Controlnet Train Example, will run out of memory on validation step #9546

Closed
Night1099 opened this issue Sep 28, 2024 · 16 comments
Labels
bug Something isn't working

Comments

@Night1099

Night1099 commented Sep 28, 2024

Describe the bug

With the default settings provided in the Flux ControlNet training example README and 10 validation images, training errors out with an out-of-memory error during the validation step, on an A100 80GB.

09/28/2024 00:34:14 - INFO - __main__ - Running validation... 
model_index.json: 100%|██████████| 536/536 [00:00<00:00, 1.54MB/s]
{'controlnet'} was not found in config. Values will be initialized to default values.
Loaded tokenizer_2 as T5TokenizerFast from `tokenizer_2` subfolder of black-forest-labs/FLUX.1-dev.
Loaded scheduler as FlowMatchEulerDiscreteScheduler from `scheduler` subfolder of black-forest-labs/FLUX.1-dev.
Loaded vae as AutoencoderKL from `vae` subfolder of black-forest-labs/FLUX.1-dev.
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of black-forest-labs/FLUX.1-dev.
Loaded text_encoder as CLIPTextModel from `text_encoder` subfolder of black-forest-labs/FLUX.1-dev.
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  2.82it/s]
Loaded text_encoder_2 as T5EncoderModel from `text_encoder_2` subfolder of black-forest-labs/FLUX.1-dev.
Loading pipeline components...: 100%|██████████| 7/7 [00:01<00:00,  4.41it/s]
Traceback (most recent call last):
  File "/workspace/diffusers/examples/controlnet/train_controlnet_flux.py", line 1434, in <module>
    main(args)
  File "/workspace/diffusers/examples/controlnet/train_controlnet_flux.py", line 1370, in main
    image_logs = log_validation(
                 ^^^^^^^^^^^^^^^
  File "/workspace/diffusers/examples/controlnet/train_controlnet_flux.py", line 146, in log_validation
    image = pipeline(
            ^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/diffusers/src/diffusers/pipelines/flux/pipeline_flux_controlnet.py", line 860, in __call__
    controlnet_block_samples, controlnet_single_block_samples = self.controlnet(
                                                                ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/accelerate/utils/operations.py", line 820, in forward
    return model_forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/accelerate/utils/operations.py", line 808, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/diffusers/src/diffusers/models/controlnet_flux.py", line 336, in forward
    encoder_hidden_states, hidden_states = block(
                                           ^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/diffusers/src/diffusers/models/transformers/transformer_flux.py", line 172, in forward
    attn_output, context_attn_output = self.attn(
                                       ^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/diffusers/src/diffusers/models/attention_processor.py", line 490, in forward
    return self.processor(
           ^^^^^^^^^^^^^^^
  File "/workspace/diffusers/src/diffusers/models/attention_processor.py", line 1762, in __call__
    query = apply_rotary_emb(query, image_rotary_emb)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/diffusers/src/diffusers/models/embeddings.py", line 680, in apply_rotary_emb
    out = (x.float() * cos + x_rotated.float() * sin).to(x.dtype)
           ~~~~~~~~~~^~~~~
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 54.00 MiB. GPU 0 has a total capacity of 79.14 GiB of which 52.75 MiB is free. Process 2301333 has 79.08 GiB memory in use. Of the allocated memory 78.35 GiB is allocated by PyTorch, and 217.84 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Reproduction

Run the Flux ControlNet training example with the default args from the Flux README and 10 validation images.

Logs

No response

System Info

  • 🤗 Diffusers version: 0.31.0.dev0
  • Platform: Linux-6.5.0-41-generic-x86_64-with-glibc2.35
  • Running on Google Colab?: No
  • Python version: 3.11.10
  • PyTorch version (GPU?): 2.4.1+cu124 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Huggingface_hub version: 0.25.1
  • Transformers version: 4.45.1
  • Accelerate version: 0.34.2
  • PEFT version: not installed
  • Bitsandbytes version: not installed
  • Safetensors version: 0.4.5
  • xFormers version: not installed
  • Accelerator: NVIDIA A100 80GB PCIe, 81920 MiB
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@sayakpaul @PromeAIpro

@Night1099 Night1099 added the bug Something isn't working label Sep 28, 2024
@sayakpaul
Member

Can you try the following?

In place of

clear_objs_and_retain_memory([pipeline])

do:

del pipeline
gc.collect()
torch.cuda.empty_cache()

There is a problem with clear_objs_and_retain_memory() which is being fixed in:
#9543
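
For reference, here is the same teardown with the imports it needs, as it might sit at the end of `log_validation` right after the validation images are generated (the variable name `pipeline` and the exact placement are assumptions based on the snippet above, not the merged fix):

```python
import gc

import torch

# ... inside log_validation, after the validation images have been generated ...
del pipeline              # drop the last reference to the validation pipeline
gc.collect()              # let Python reclaim the pipeline's modules and tensors
torch.cuda.empty_cache()  # return cached CUDA blocks so training can reuse the memory
```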

@Night1099
Author

Night1099 commented Sep 28, 2024

@sayakpaul I tried the newest commit and now Flux training crashes before training starts.

09/28/2024 20:53:43 - INFO - __main__ - Initializing controlnet weights from transformer
09/28/2024 20:53:54 - INFO - __main__ - all models loaded successfully
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 1174, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 769, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python', 'train_controlnet_flux.py', '--pretrained_model_name_or_path=black-forest-labs/FLUX.1-dev', '--dataset_name=NightRaven109/DeBaketest', '--conditioning_image_column=conditioning_image', '--image_column=ground_truth_image', '--caption_column=caption', '--output_dir=./flux', '--mixed_precision=bf16', '--resolution=512', '--learning_rate=1e-5', '--max_train_steps=15000', '--validation_steps=5', '--checkpointing_steps=500', '--validation_image', '/workspace/val/0009_particle_board_baked_rot0.png', '/workspace/val/0015_shingles_baked_rot270.png', '/workspace/val/KB3D_WRK_CeramicTileFloorDamagedA_baked_rot0.png', '/workspace/val/patterned_shiny_metal_04_baked_rot90.png', '/workspace/val/realistic_ground_forest_sbhipwp0_baked_rot180.png', '/workspace/val/realistic_stone_floor_tkqhbcldy_baked_rot0.png', '/workspace/val/realistic_WoodPlanks_014_WoodPlanks 014_baked_rot0.png', '/workspace/val/rusty_patterned_metal_04_baked_rot0.png', '/workspace/val/Sci-fi_Armor_001_4K_baked_rot0.png', '/workspace/val/Misc_MorgueStorage_baked_rot0.png', '--validation_prompt', 'texture, basketball backboard, grass', 'texture, scale, pattern, close-up, metal', 'texture, tile wall, bathroom, crack, tile', 'texture, metal, pattern', 'texture, vegetation, weed, fairground, land, autumn forest, fern, stone, greenery, plant, patch', 'texture, dance floor, stone, gravestone, brick', 'texture, fence, wood, plank, wood wall, board', 'texture, square, pattern, rust, brick, metal', 'texture, design, pattern, action film, material, armor, disciple', 'texture, tile wall, lid, briefcase, lock, hinge, silver', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--report_to=wandb', '--num_double_layers=4', '--num_single_layers=0', '--seed=42', '--push_to_hub']' died with <Signals.SIGSEGV: 11>.

@sayakpaul
Member

Cc @PromeAIpro in that case.

@illyafan

> @sayakpaul I tried the newest commit and now Flux training crashes before training starts.
> (same SIGSEGV traceback as above)

Same here, did you find the reason?

@PromeAIpro
Contributor

Sorry for the late reply, I just tested a few minutes ago. I tried with 10 validation images and didn't need more CUDA memory, so can you provide a more detailed config? Especially resolution, num_double_layers, and num_single_layers. @Night1099
As for the crash, I tested on commit 8e7d6c0 (#9543) today and it works fine. I'm guessing it was fixed by the clear_objs_and_retain_memory change @sayakpaul mentioned.

@Night1099
Author

@PromeAIpro Training no longer crashes on start, but the OOM still happens even with a single 512x512 validation image.

I follow the Flux README to the T.

Here are the readout and launch params:

root@6a12e343cd3d:/workspace/diffusers/examples/controlnet# accelerate launch train_controlnet_flux.py \
    --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
    --dataset_name=NightRaven109/DeBaketest \
    --conditioning_image_column=conditioning_image \
    --image_column=ground_truth_image \
    --caption_column=caption \
    --output_dir="./flux" \
    --mixed_precision="bf16" \
    --resolution=512 \
    --learning_rate=1e-5 \
    --max_train_steps=15000 \
    --validation_steps=5 \
    --checkpointing_steps=500 \
    --validation_image "/workspace/val/22.png" \
    --validation_prompt "texture, bark, tree" \
    --train_batch_size=1 \
    --gradient_accumulation_steps=4 \
    --report_to="wandb" \
    --num_double_layers=4 \
    --num_single_layers=0 \
    --seed=42 \
    --push_to_hub
09/30/2024 16:55:32 - INFO - __main__ - Distributed environment: DistributedType.NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: bf16

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 7981.55it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00,  5.11s/it]
Fetching 3 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 6502.80it/s]
{'axes_dims_rope'} was not found in config. Values will be initialized to default values.
09/30/2024 16:56:10 - INFO - __main__ - Initializing controlnet weights from transformer
09/30/2024 16:56:18 - INFO - __main__ - all models loaded successfully
Map: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00,  2.86 examples/s]
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: ben10gregg. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.18.2
wandb: Run data is saved locally in /workspace/diffusers/examples/controlnet/wandb/run-20240930_165638-46hnu2ap
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run northern-elevator-13
wandb: ⭐️ View project at https://wandb.ai/ben10gregg/flux_train_controlnet
wandb: 🚀 View run at https://wandb.ai/ben10gregg/flux_train_controlnet/runs/46hnu2ap
09/30/2024 16:56:38 - INFO - __main__ - ***** Running training *****
09/30/2024 16:56:38 - INFO - __main__ -   Num examples = 8
09/30/2024 16:56:38 - INFO - __main__ -   Num batches each epoch = 8
09/30/2024 16:56:38 - INFO - __main__ -   Num Epochs = 7500
09/30/2024 16:56:38 - INFO - __main__ -   Instantaneous batch size per device = 1
09/30/2024 16:56:38 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 4
09/30/2024 16:56:38 - INFO - __main__ -   Gradient Accumulation steps = 4
09/30/2024 16:56:38 - INFO - __main__ -   Total optimization steps = 15000
Steps:   0%|          | 5/15000 [00:27<22:47:18,  5.47s/it, loss=0.64, lr=1e-5]
09/30/2024 16:57:06 - INFO - __main__ - Running validation... 
{'controlnet'} was not found in config. Values will be initialized to default values.
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  6.71it/s]
Loaded text_encoder_2 as T5EncoderModel from `text_encoder_2` subfolder of black-forest-labs/FLUX.1-dev.
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  6.85it/s]
Loaded text_encoder as CLIPTextModel from `text_encoder` subfolder of black-forest-labs/FLUX.1-dev.
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of black-forest-labs/FLUX.1-dev.
Loaded tokenizer_2 as T5TokenizerFast from `tokenizer_2` subfolder of black-forest-labs/FLUX.1-dev.
Loaded vae as AutoencoderKL from `vae` subfolder of black-forest-labs/FLUX.1-dev.
Loaded scheduler as FlowMatchEulerDiscreteScheduler from `scheduler` subfolder of black-forest-labs/FLUX.1-dev.
Loading pipeline components...: 100%|██████████| 7/7 [00:00<00:00,  9.99it/s]
Traceback (most recent call last):
  File "/workspace/diffusers/examples/controlnet/train_controlnet_flux.py", line 1436, in <module>
    main(args)
  File "/workspace/diffusers/examples/controlnet/train_controlnet_flux.py", line 1372, in main
    image_logs = log_validation(
                 ^^^^^^^^^^^^^^^
  File "/workspace/diffusers/examples/controlnet/train_controlnet_flux.py", line 146, in log_validation
    image = pipeline(
            ^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/diffusers/src/diffusers/pipelines/flux/pipeline_flux_controlnet.py", line 860, in __call__
    controlnet_block_samples, controlnet_single_block_samples = self.controlnet(
                                                                ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/accelerate/utils/operations.py", line 820, in forward
    return model_forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/accelerate/utils/operations.py", line 808, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/diffusers/src/diffusers/models/controlnet_flux.py", line 336, in forward
    encoder_hidden_states, hidden_states = block(
                                           ^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/diffusers/src/diffusers/models/transformers/transformer_flux.py", line 185, in forward
    ff_output = self.ff(norm_hidden_states)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/diffusers/src/diffusers/models/attention.py", line 1201, in forward
    hidden_states = module(hidden_states)
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/diffusers/src/diffusers/models/activations.py", line 89, in forward
    hidden_states = self.gelu(hidden_states)
                    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/diffusers/src/diffusers/models/activations.py", line 83, in gelu
    return F.gelu(gate, approximate=self.approximate)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 79.26 GiB of which 74.75 MiB is free. Process 3970744 has 79.17 GiB memory in use. Of the allocated memory 78.40 GiB is allocated by PyTorch, and 263.55 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

@PromeAIpro
Contributor

This really confuses me, I've tried installing the same transformers/accelerate versions as yours and it works fine.
May I ask how many GPUs you have in total? Are you training on multiple GPUs?
@Night1099

@Night1099
Author

Night1099 commented Oct 1, 2024

@PromeAIpro 1 A100 on runpod, pytorch 2.4

@PromeAIpro
Contributor

PromeAIpro commented Oct 1, 2024

> @PromeAIpro 1 A100 on runpod, pytorch 2.4

Maybe it is about precision. I guess accelerate tries to convert the params to bf16 but fails and they remain fp32? That's only a guess that it could be a device-related issue; it would need testing on a RunPod machine. Can you do a simple test to check whether the bf16 conversion takes effect, and paste the nvidia-smi results?
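
A quick way to run that check is a snippet like the one below, placed right after the models are loaded in train_controlnet_flux.py (the variable names `flux_transformer` and `flux_controlnet` are placeholders, substitute whatever the script actually calls its models):

```python
from collections import Counter

import torch

def dtype_summary(model: torch.nn.Module) -> Counter:
    """Count parameters by dtype so a failed bf16 cast is easy to spot."""
    return Counter(str(p.dtype) for p in model.parameters())

# Placeholder names -- use the variables defined in the training script.
for name, model in [("transformer", flux_transformer), ("controlnet", flux_controlnet)]:
    print(name, dict(dtype_summary(model)))
```

If the `--mixed_precision=bf16` cast took effect, the frozen models should report torch.bfloat16; a large torch.float32 count would point to the precision issue guessed at above.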

@Trav1slaflame

Dear team (@sayakpaul @PromeAIpro):

I followed the instructions in #9543 but still get an OOM issue when running train_controlnet_sd3.py on 1 A100 80GB GPU. Any ideas?

@Trav1slaflame

> I followed the instructions in #9543 but still get an OOM issue when running train_controlnet_sd3.py on 1 A100 80GB GPU. Any ideas?

I also hit this OOM issue during the validation stage.

@PromeAIpro
Contributor

Consider using these two options: `--use_adafactor` and `--gradient_checkpointing`. @Night1099 @Trav1slaflame
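
For anyone unsure what those flags change, a rough sketch of what they typically correspond to inside a training script like this one (the mapping below is my reading of the flags, not a quote of the example code):

```python
from transformers.optimization import Adafactor

def configure_memory_savers(controlnet, learning_rate: float = 1e-5):
    """Enable the two memory savers suggested above (sketch, not the exact example code)."""
    # --gradient_checkpointing: recompute activations during backward instead of storing them.
    controlnet.enable_gradient_checkpointing()

    # --use_adafactor: Adafactor keeps factored second-moment estimates, so its
    # optimizer state is far smaller than AdamW's per-parameter moments.
    optimizer = Adafactor(
        controlnet.parameters(),
        lr=learning_rate,
        scale_parameter=False,
        relative_step=False,
        warmup_init=False,
    )
    return optimizer
```

Both options trade extra compute for a smaller training-time footprint, which leaves more headroom when the validation pipeline is built.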

@universewill

universewill commented Oct 16, 2024

Same problem here, Flux ControlNet always OOMs when running log_validation. I tested it on an A100 (80GB). How can this be solved?
@sayakpaul @PromeAIpro

@PromeAIpro
Contributor

Consider using these two options: `--use_adafactor` and `--gradient_checkpointing`. @universewill


github-actions bot commented Nov 9, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Nov 9, 2024
@sayakpaul sayakpaul removed the stale Issues that haven't received updates label Nov 9, 2024
@sayakpaul
Member

Closing due to inactivity.
