T5LayerNorm error #673

Closed
lweingart opened this issue Aug 17, 2024 · 6 comments
@lweingart

Hi guys,

I'm finally able to start the training, but I'm encountering these errors:

AssertionError: Recovering T5LayerNorm requires the original layer to be apex's Fused RMS Norm.Apex's fused norm is automatically used by Hugging Face Transformers https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L265C5-L265C48

and the following as well:

RuntimeError: Failed to replace layer_norm of type T5LayerNorm with T5LayerNorm with the exception: Recovering T5LayerNorm requires the original layer to be apex's Fused RMS Norm.Apex's fused norm is automatically used by Hugging Face Transformers https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L265C5-L265C48. Please check your model configuration or sharding policy, you can set up an issue for us to help you as well.
E0817 18:46:55.563000 134001257882240 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 7268) of binary: /usr/bin/python3

Would you have any idea what could be done here, by chance?
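
From what I understand of the assertion (I may be wrong), the shardformer T5 policy expects the model's layer norms to already be apex's FusedRMSNorm, which Transformers only swaps in when apex is installed. A quick standalone check like this sketch (only apex is imported, nothing from Open-Sora) shows what an environment actually provides:

# Sketch: check whether apex's fused RMS norm is importable, since the
# shardformer T5 policy asserts the original layer is apex's FusedRMSNorm.
try:
    from apex.normalization import FusedRMSNorm  # requires apex built from source
    print("apex FusedRMSNorm is available:", FusedRMSNorm)
except ImportError:
    # Without apex, transformers falls back to its plain T5LayerNorm,
    # which is exactly the class name the assertion rejects.
    print("apex is not installed; transformers will use plain T5LayerNorm")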

Here is the command, followed by the log trace:

torchrun --standalone --nproc_per_node 1 -m scripts.train \
    configs/opensora-v1-2/train/stage1.py \
    --data-path {ROOT_META}/meta_clips_caption_cleaned.csv \
    --ckpt-path {MODEL_OUTPUT}/my_sora.pt
/usr/local/lib/python3.10/dist-packages/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/usr/local/lib/python3.10/dist-packages/torch/utils/_pytree.py:300: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel
  warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel")
[2024-08-17 18:45:05] Experiment directory created at outputs/005-STDiT3-XL-2
[2024-08-17 18:45:05] Training configuration:
 {'adam_eps': 1e-15,
 'bucket_config': {'1024': {1: (0.05, 36)},
                   '1080p': {1: (0.1, 5)},
                   '144p': {1: (1.0, 475),
                            51: (1.0, 51),
                            102: ((1.0, 0.33), 27),
                            204: ((1.0, 0.1), 13),
                            408: ((1.0, 0.1), 6)},
                   '2048': {1: (0.1, 5)},
                   '240p': {1: (0.3, 297),
                            51: (0.4, 20),
                            102: ((0.4, 0.33), 10),
                            204: ((0.4, 0.1), 5),
                            408: ((0.4, 0.1), 2)},
                   '256': {1: (0.4, 297),
                           51: (0.5, 20),
                           102: ((0.5, 0.33), 10),
                           204: ((0.5, 0.1), 5),
                           408: ((0.5, 0.1), 2)},
                   '360p': {1: (0.2, 141),
                            51: (0.15, 8),
                            102: ((0.15, 0.33), 4),
                            204: ((0.15, 0.1), 2),
                            408: ((0.15, 0.1), 1)},
                   '480p': {1: (0.1, 89)},
                   '512': {1: (0.1, 141)},
                   '720p': {1: (0.05, 36)}},
 'ckpt_every': 200,
 'config': 'configs/opensora-v1-2/train/stage1.py',
 'dataset': {'data_path': '/content/drive/MyDrive/Open-Sora/opensora/data/meta/meta_clips_caption_cleaned.csv',
             'transform_name': 'resize_crop',
             'type': 'VariableVideoTextDataset'},
 'dtype': 'bf16',
 'ema_decay': 0.99,
 'epochs': 1000,
 'grad_checkpoint': True,
 'grad_clip': 1.0,
 'load': None,
 'log_every': 10,
 'lr': 0.0001,
 'mask_ratios': {'image_head': 0.05,
                 'image_head_tail': 0.025,
                 'image_random': 0.025,
                 'image_tail': 0.025,
                 'intepolate': 0.005,
                 'quarter_head': 0.005,
                 'quarter_head_tail': 0.005,
                 'quarter_random': 0.005,
                 'quarter_tail': 0.005,
                 'random': 0.05},
 'model': {'enable_flash_attn': True,
           'enable_layernorm_kernel': True,
           'freeze_y_embedder': True,
           'from_pretrained': '/content/drive/MyDrive/Open-Sora/opensora/output/my_sora.pt',
           'qk_norm': True,
           'type': 'STDiT3-XL/2'},
 'num_bucket_build_workers': 16,
 'num_workers': 8,
 'outputs': 'outputs',
 'plugin': 'zero2',
 'record_time': False,
 'scheduler': {'sample_method': 'logit-normal',
               'type': 'rflow',
               'use_timestep_transform': True},
 'seed': 42,
 'start_from_scratch': False,
 'text_encoder': {'from_pretrained': 'DeepFloyd/t5-v1_1-xxl',
                  'model_max_length': 300,
                  'shardformer': True,
                  'type': 't5'},
 'vae': {'from_pretrained': 'hpcai-tech/OpenSora-VAE-v1.2',
         'micro_batch_size': 4,
         'micro_frame_size': 17,
         'type': 'OpenSoraVAE_V1_2'},
 'wandb': False,
 'warmup_steps': 1000}
2024-08-17 18:45:05.718442: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-17 18:45:05.740201: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-17 18:45:05.746831: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-17 18:45:06.875860: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[2024-08-17 18:45:07] Building dataset...
[2024-08-17 18:45:07] Dataset contains 941 samples.
[2024-08-17 18:45:07] Number of buckets: 626
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-08-17 18:45:07] Building buckets...
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
[2024-08-17 18:45:08] Bucket Info:
[2024-08-17 18:45:08] Bucket [#sample, #batch] by aspect ratio:
{'0.56': [614, 13]}
[2024-08-17 18:45:08] Image Bucket [#sample, #batch] by HxWxT:
{}
[2024-08-17 18:45:08] Video Bucket [#sample, #batch] by HxWxT:
{('144p', 408): [1, 0],
 ('144p', 204): [10, 0],
 ('144p', 102): [126, 4],
 ('144p', 51): [477, 9]}
[2024-08-17 18:45:08] #training batch: 13, #training sample: 614, #non empty bucket: 4
[2024-08-17 18:45:08] Building models...
[2024-08-17 18:45:08] WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.4.0+cu121 with CUDA 1201 (you have 2.3.0+cu121)
    Python  3.10.14 (you have 3.10.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/transformer_2d.py:34: FutureWarning: `Transformer2DModelOutput` is deprecated and will be removed in version 1.0.0. Importing `Transformer2DModelOutput` from `diffusers.models.transformer_2d` is deprecated and this will be removed in a future version. Please use `from diffusers.models.modeling_outputs import Transformer2DModelOutput`, instead.
  deprecate("Transformer2DModelOutput", "1.0.0", deprecation_message)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
tokenizer_config.json: 100% 1.86k/1.86k [00:00<00:00, 12.5MB/s]
config.json: 100% 752/752 [00:00<00:00, 5.11MB/s]
spiece.model: 100% 792k/792k [00:00<00:00, 42.9MB/s]
special_tokens_map.json: 100% 1.79k/1.79k [00:00<00:00, 13.0MB/s]
pytorch_model.bin.index.json: 100% 20.0k/20.0k [00:00<00:00, 70.0MB/s]
Downloading shards:   0% 0/2 [00:00<?, ?it/s]
pytorch_model-00001-of-00002.bin:   0% 0.00/9.45G [00:00<?, ?B/s]
pytorch_model-00001-of-00002.bin:   0% 31.5M/9.45G [00:00<00:34, 270MB/s]
pytorch_model-00001-of-00002.bin:   1% 83.9M/9.45G [00:00<00:24, 389MB/s]
...
pytorch_model-00002-of-00002.bin:  99% 9.52G/9.60G [00:37<00:01, 60.6MB/s]
pytorch_model-00002-of-00002.bin: 100% 9.60G/9.60G [00:38<00:00, 252MB/s] 
Downloading shards: 100% 2/2 [01:07<00:00, 33.70s/it]
Loading checkpoint shards: 100% 2/2 [00:23<00:00, 11.86s/it]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 197, in _replace_sub_module
[rank0]:     replace_layer = target_module.from_native_module(
[rank0]:   File "/content/drive/MyDrive/Open-Sora/opensora/opensora/acceleration/shardformer/modeling/t5.py", line 31, in from_native_module
[rank0]:     assert module.__class__.__name__ == "FusedRMSNorm", (
[rank0]: AssertionError: Recovering T5LayerNorm requires the original layer to be apex's Fused RMS Norm.Apex's fused norm is automatically used by Hugging Face Transformers https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L265C5-L265C48

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/content/drive/MyDrive/Open-Sora/opensora/scripts/train.py", line 412, in <module>
[rank0]:     main()
[rank0]:   File "/content/drive/MyDrive/Open-Sora/opensora/scripts/train.py", line 118, in main
[rank0]:     text_encoder = build_module(cfg.get("text_encoder", None), MODELS, device=device, dtype=dtype)
[rank0]:   File "/content/drive/MyDrive/Open-Sora/opensora/opensora/registry.py", line 24, in build_module
[rank0]:     return builder.build(cfg)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/registry.py", line 570, in build
[rank0]:     return self.build_func(cfg, *args, **kwargs, registry=self)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
[rank0]:     obj = obj_cls(**args)  # type: ignore
[rank0]:   File "/content/drive/MyDrive/Open-Sora/opensora/opensora/models/text_encoder/t5.py", line 164, in __init__
[rank0]:     self.shardformer_t5()
[rank0]:   File "/content/drive/MyDrive/Open-Sora/opensora/opensora/models/text_encoder/t5.py", line 183, in shardformer_t5
[rank0]:     optim_model, _ = shard_former.optimize(self.t5.model, policy=T5EncoderPolicy())
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/shardformer.py", line 55, in optimize
[rank0]:     shared_params = sharder.shard()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 43, in shard
[rank0]:     self._replace_module(include=held_layers)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 67, in _replace_module
[rank0]:     self._recursive_replace_layer(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 115, in _recursive_replace_layer
[rank0]:     self._recursive_replace_layer(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 115, in _recursive_replace_layer
[rank0]:     self._recursive_replace_layer(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 115, in _recursive_replace_layer
[rank0]:     self._recursive_replace_layer(
[rank0]:   [Previous line repeated 2 more times]
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 112, in _recursive_replace_layer
[rank0]:     self._replace_sub_module(module, sub_module_replacement, include)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 201, in _replace_sub_module
[rank0]:     raise RuntimeError(
[rank0]: RuntimeError: Failed to replace layer_norm of type T5LayerNorm with T5LayerNorm with the exception: Recovering T5LayerNorm requires the original layer to be apex's Fused RMS Norm.Apex's fused norm is automatically used by Hugging Face Transformers https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L265C5-L265C48. Please check your model configuration or sharding policy, you can set up an issue for us to help you as well.
E0817 18:46:55.563000 134001257882240 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 7268) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
scripts.train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-17_18:46:55
  host      : 33ca7b3d91f9
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 7268)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
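
In case it helps anyone reading this: the config dump above shows shardformer: True for the text encoder, while the warning near the top of the log says apex is not installed. A possible workaround I'm considering (just a guess on my part, not verified) is to either install apex from source as the warning suggests, or switch shardformer off for the text encoder in configs/opensora-v1-2/train/stage1.py, roughly like this:

# Sketch of a possible workaround (unverified): keep the same text_encoder
# settings from the config dump above, but disable shardformer so the
# T5 policy that requires apex's FusedRMSNorm is never applied.
text_encoder = dict(
    type="t5",
    from_pretrained="DeepFloyd/t5-v1_1-xxl",
    model_max_length=300,
    shardformer=False,  # was True; True expects apex's fused RMS norm
)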

This issue is stale because it has been open for 7 days with no activity.

@github-actions github-actions bot added the stale label Aug 25, 2024
@luthandomaqondo-95

I'm faced with the same issue.

@github-actions github-actions bot removed the stale label Aug 26, 2024

github-actions bot commented Sep 2, 2024

This issue is stale because it has been open for 7 days with no activity.

@github-actions github-actions bot added the stale label Sep 2, 2024
@lweingart
Author

Hi, has anyone been able to overcome this issue yet?

Cheers

@github-actions github-actions bot removed the stale label Sep 9, 2024

This issue is stale because it has been open for 7 days with no activity.

@github-actions github-actions bot added the stale label Sep 16, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this as not planned Sep 24, 2024