[rank1]: AttributeError: 'NoneType' object has no attribute 'get' (finetuning Mamba Hybrid) #10285

Open
SkanderBS2024 opened this issue Aug 28, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@SkanderBS2024

Describe the bug

As described in the title, an AttributeError is raised when launching the fine-tuning script linked here (examples/nlp/language_modeling/tuning/megatron_mamba_finetuning.py).

Steps/Code to reproduce bug

Please list minimal steps or code snippet for us to be able to reproduce the bug.

A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.

Epoch 0: :   0%|                                                                                                                                                                                                         | 0/700 [00:00<?]Error executing job with overrides: ['trainer.devices=2', 'trainer.precision=bf16', 'trainer.accelerator=gpu', 'trainer.log_every_n_steps=1', 'trainer.val_check_interval=100', 'trainer.limit_val_batches=50', '+trainer.num_sanity_val_steps=0', '+trainer.accumulate_grad_batches=1', 'trainer.max_steps=700', 'trainer.gradient_clip_val=1.0', 'exp_manager.exp_dir=/workspace/checkpoints/finetuned', 'exp_manager.resume_if_exists=True', 'exp_manager.create_checkpoint_callback=True', 'exp_manager.create_wandb_logger=False', '++model.hidden_dropout=0.1', '++model.attention_dropout=0.1', 'model.tensor_model_parallel_size=1', 'model.sequence_parallel=False', 'model.peft.peft_scheme=none', 'model.megatron_amp_O2=True', 'model.encoder_seq_length=2048', 'model.data.validation_ds.pad_to_max_length=True', 'model.data.train_ds.pad_to_max_length=True', 'model.optim.name=distributed_fused_adam', 'model.data.train_ds.max_seq_length=2048', 'model.data.validation_ds.max_seq_length=2048', 'model.micro_batch_size=4', 'model.global_batch_size=128', '++data.micro_batch_size=4', '++data.global_batch_size=128', 'model.restore_from_path=/workspace/checkpoints/mamba-7b.nemo', 'model.data.train_ds.file_names=[/workspace/nemo/NeMo/databricks-dolly-15k/training.jsonl]', 'model.data.validation_ds.file_names=[/workspace/nemo/NeMo/databricks-dolly-15k/validation.jsonl]', 'model.optim.lr=5e-6', 'model.optim.sched.min_lr=1e-7']
[rank0]: Traceback (most recent call last):
[rank0]:   File "/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_mamba_finetuning.py", line 60, in <module>
[rank0]:     main()
[rank0]:   File "/workspace/nemo/NeMo/nemo/core/config/hydra_runner.py", line 129, in wrapper
[rank0]:     _run_hydra(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
[rank0]:     _run_app(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
[rank0]:     run_and_report(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
[rank0]:     raise ex
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
[rank0]:     return func()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
[rank0]:     lambda: hydra.run(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
[rank0]:     _ = ret.return_value
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
[rank0]:     raise self._return_value
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
[rank0]:     ret.return_value = task_function(task_cfg)
[rank0]:   File "/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_mamba_finetuning.py", line 56, in main
[rank0]:     trainer.fit(model)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
[rank0]:     call._call_and_handle_interrupt(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank0]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank0]:     return function(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
[rank0]:     self._run(model, ckpt_path=ckpt_path)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 986, in _run
[rank0]:     results = self._run_stage()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1030, in _run_stage
[rank0]:     self.fit_loop.run()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
[rank0]:     self.advance()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
[rank0]:     self.epoch_loop.run(self._data_fetcher)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 140, in run
[rank0]:     self.advance(data_fetcher)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 250, in advance
[rank0]:     batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 190, in run
[rank0]:     self._optimizer_step(batch_idx, closure)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 268, in _optimizer_step
[rank0]:     call._call_lightning_module_hook(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 159, in _call_lightning_module_hook
[rank0]:     output = fn(*args, **kwargs)
[rank0]:   File "/workspace/nemo/NeMo/nemo/collections/nlp/models/language_modeling/megatron_base_model.py", line 1297, in optimizer_step
[rank0]:     super().optimizer_step(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/module.py", line 1308, in optimizer_step
[rank0]:     optimizer.step(closure=optimizer_closure)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/optimizer.py", line 153, in step
[rank0]:     step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 270, in optimizer_step
[rank0]:     optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 238, in optimizer_step
[rank0]:     return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/plugins/precision/amp.py", line 74, in optimizer_step
[rank0]:     return super().optimizer_step(optimizer, model=model, closure=closure, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/plugins/precision/precision.py", line 122, in optimizer_step
[rank0]:     return optimizer.step(closure=closure, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
[rank0]:     return wrapped(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank0]:     out = func(*args, **kwargs)
[rank0]:   File "/opt/apex/apex/contrib/optimizers/distributed_fused_adam.py", line 2292, in step
[rank0]:     loss = closure()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/plugins/precision/precision.py", line 108, in _wrap_closure
[rank0]:     closure_result = closure()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 144, in __call__
[rank0]:     self._result = self.closure(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 129, in closure
[rank0]:     step_output = self._step_fn()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 317, in _training_step
[rank0]:     training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 311, in _call_strategy_hook
[rank0]:     output = fn(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 390, in training_step
[rank0]:     return self.lightning_module.training_step(*args, **kwargs)
[rank0]:   File "/workspace/nemo/NeMo/nemo/utils/model_utils.py", line 434, in wrap_training_step
[rank0]:     output_dict = wrapped(*args, **kwargs)
[rank0]:   File "/workspace/nemo/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 860, in training_step
[rank0]:     loss_mean = self.training_step_fwd_bwd_step_call(dataloader_iter, forward_only=False)
[rank0]:   File "/workspace/nemo/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 781, in training_step_fwd_bwd_step_call
[rank0]:     loss_mean = self.fwd_bwd_step(dataloader_iter, forward_only)
[rank0]:   File "/workspace/nemo/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py", line 365, in fwd_bwd_step
[rank0]:     data_iter = get_iterator_k_split(batch, get_num_microbatches(), self.enforce_divisible_batch)
[rank0]:   File "/opt/apex/apex/transformer/pipeline_parallel/utils.py", line 93, in get_num_microbatches
[rank0]:     return _GLOBAL_NUM_MICROBATCHES_CALCULATOR.get()
[rank0]: AttributeError: 'NoneType' object has no attribute 'get'
Error executing job with overrides: ['trainer.devices=2', 'trainer.precision=bf16', 'trainer.accelerator=gpu', 'trainer.log_every_n_steps=1', 'trainer.val_check_interval=100', 'trainer.limit_val_batches=50', '+trainer.num_sanity_val_steps=0', '+trainer.accumulate_grad_batches=1', 'trainer.max_steps=700', 'trainer.gradient_clip_val=1.0', 'exp_manager.exp_dir=/workspace/checkpoints/finetuned', 'exp_manager.resume_if_exists=True', 'exp_manager.create_checkpoint_callback=True', 'exp_manager.create_wandb_logger=False', '++model.hidden_dropout=0.1', '++model.attention_dropout=0.1', 'model.tensor_model_parallel_size=1', 'model.sequence_parallel=False', 'model.peft.peft_scheme=none', 'model.megatron_amp_O2=True', 'model.encoder_seq_length=2048', 'model.data.validation_ds.pad_to_max_length=True', 'model.data.train_ds.pad_to_max_length=True', 'model.optim.name=distributed_fused_adam', 'model.data.train_ds.max_seq_length=2048', 'model.data.validation_ds.max_seq_length=2048', 'model.micro_batch_size=4', 'model.global_batch_size=128', '++data.micro_batch_size=4', '++data.global_batch_size=128', 'model.restore_from_path=/workspace/checkpoints/mamba-7b.nemo', 'model.data.train_ds.file_names=[/workspace/nemo/NeMo/databricks-dolly-15k/training.jsonl]', 'model.data.validation_ds.file_names=[/workspace/nemo/NeMo/databricks-dolly-15k/validation.jsonl]', 'model.optim.lr=5e-6', 'model.optim.sched.min_lr=1e-7']
[rank1]: Traceback (most recent call last):
[rank1]:   File "/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_mamba_finetuning.py", line 60, in <module>
[rank1]:     main()
[rank1]:   File "/workspace/nemo/NeMo/nemo/core/config/hydra_runner.py", line 129, in wrapper
[rank1]:     _run_hydra(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
[rank1]:     _run_app(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
[rank1]:     run_and_report(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
[rank1]:     raise ex
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
[rank1]:     return func()
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
[rank1]:     lambda: hydra.run(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
[rank1]:     _ = ret.return_value
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
[rank1]:     raise self._return_value
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
[rank1]:     ret.return_value = task_function(task_cfg)
[rank1]:   File "/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_mamba_finetuning.py", line 56, in main
[rank1]:     trainer.fit(model)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
[rank1]:     call._call_and_handle_interrupt(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank1]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank1]:     return function(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
[rank1]:     self._run(model, ckpt_path=ckpt_path)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 986, in _run
[rank1]:     results = self._run_stage()
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1030, in _run_stage
[rank1]:     self.fit_loop.run()
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
[rank1]:     self.advance()
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
[rank1]:     self.epoch_loop.run(self._data_fetcher)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 140, in run
[rank1]:     self.advance(data_fetcher)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 250, in advance
[rank1]:     batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 190, in run
[rank1]:     self._optimizer_step(batch_idx, closure)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 268, in _optimizer_step
[rank1]:     call._call_lightning_module_hook(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 159, in _call_lightning_module_hook
[rank1]:     output = fn(*args, **kwargs)
[rank1]:   File "/workspace/nemo/NeMo/nemo/collections/nlp/models/language_modeling/megatron_base_model.py", line 1297, in optimizer_step
[rank1]:     super().optimizer_step(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/module.py", line 1308, in optimizer_step
[rank1]:     optimizer.step(closure=optimizer_closure)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/optimizer.py", line 153, in step
[rank1]:     step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 270, in optimizer_step
[rank1]:     optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 238, in optimizer_step
[rank1]:     return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/plugins/precision/amp.py", line 74, in optimizer_step
[rank1]:     return super().optimizer_step(optimizer, model=model, closure=closure, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/plugins/precision/precision.py", line 122, in optimizer_step
[rank1]:     return optimizer.step(closure=closure, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
[rank1]:     return wrapped(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank1]:     out = func(*args, **kwargs)
[rank1]:   File "/opt/apex/apex/contrib/optimizers/distributed_fused_adam.py", line 2292, in step
[rank1]:     loss = closure()
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/plugins/precision/precision.py", line 108, in _wrap_closure
[rank1]:     closure_result = closure()
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 144, in __call__
[rank1]:     self._result = self.closure(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 129, in closure
[rank1]:     step_output = self._step_fn()
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 317, in _training_step
[rank1]:     training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 311, in _call_strategy_hook
[rank1]:     output = fn(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 390, in training_step
[rank1]:     return self.lightning_module.training_step(*args, **kwargs)
[rank1]:   File "/workspace/nemo/NeMo/nemo/utils/model_utils.py", line 434, in wrap_training_step
[rank1]:     output_dict = wrapped(*args, **kwargs)
[rank1]:   File "/workspace/nemo/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 860, in training_step
[rank1]:     loss_mean = self.training_step_fwd_bwd_step_call(dataloader_iter, forward_only=False)
[rank1]:   File "/workspace/nemo/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 781, in training_step_fwd_bwd_step_call
[rank1]:     loss_mean = self.fwd_bwd_step(dataloader_iter, forward_only)
[rank1]:   File "/workspace/nemo/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py", line 365, in fwd_bwd_step
[rank1]:     data_iter = get_iterator_k_split(batch, get_num_microbatches(), self.enforce_divisible_batch)
[rank1]:   File "/opt/apex/apex/transformer/pipeline_parallel/utils.py", line 93, in get_num_microbatches
[rank1]:     return _GLOBAL_NUM_MICROBATCHES_CALCULATOR.get()
[rank1]: AttributeError: 'NoneType' object has no attribute 'get'
Epoch 0: :   0%|          | 0/700 [00:09<?]
[2024-08-28 13:45:50,925] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 5090) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.3.0a0+40ec155e58.nv24.3', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 834, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 825, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 137, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 271, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_mamba_finetuning.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-08-28_13:45:50
  host      : fs-api-66209ad2-6222-4e62-bbbf-e3de667aab8e
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 5091)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-28_13:45:50
  host      : fs-api-66209ad2-6222-4e62-bbbf-e3de667aab8e
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 5090)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
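
For context, the failing call dereferences apex's module-level _GLOBAL_NUM_MICROBATCHES_CALCULATOR, which is still None at this point, i.e. the microbatch calculator was never initialized during Megatron setup on this code path. Below is a minimal sketch of that initialization, assuming the apex API shipped in the 24.07 container (setup_microbatch_calculator and its argument names may differ between versions), using the batch sizes from the overrides above:

# Hypothetical diagnostic sketch, not part of the original run.
from apex.transformer.pipeline_parallel import utils as pp_utils

print(pp_utils._GLOBAL_NUM_MICROBATCHES_CALCULATOR)  # None here, so .get() raises AttributeError

# NeMo normally builds this calculator during Megatron initialization, roughly:
pp_utils.setup_microbatch_calculator(
    rank=0,                  # global rank
    rampup_batch_size=None,  # no batch-size ramp-up in this run
    global_batch_size=128,   # model.global_batch_size from the overrides
    micro_batch_size=4,      # model.micro_batch_size from the overrides
    data_parallel_size=2,    # trainer.devices=2 with tensor_model_parallel_size=1
)
print(pp_utils.get_num_microbatches())  # 128 / (4 * 2) = 16 once the calculator exists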

Expected behavior

Fine-tuning was expected to start directly; instead, the error above appears as soon as the first training step runs.

Environment overview (please complete the following information)

  • Environment location: Docker, VM Cloud
  • Method of NeMo install: Docker image nvcr.io/nvidia/nemo:24.07
  • Docker run command used:
docker run --gpus all --shm-size=80g --net=host --ulimit memlock=-1 --rm -it \
    -v /ephemeral/navinenv/work/megatron:/workspace/megatron \
    -v /ephemeral/navinenv/work/data:/workspace/dataset/data \
    -v /ephemeral/navinenv/work/outfix:/workspace/dataset/outfix \
    -v /ephemeral/navinenv/work/tok:/workspace/dataset/tok \
    -v /ephemeral/navinenv/work/checkpoints:/workspace/checkpoints \
    -v /ephemeral/navinenv/work/nemo:/workspace/nemo \
    -v /ephemeral/navinenv/work/tmp:/tmp \
    nvcr.io/nvidia/nemo:24.07

Environment details

If an NVIDIA Docker image is used, you don't need to specify these.
Otherwise, please provide:

  • OS version
  • PyTorch version
  • Python version

Additional context

GPUs: 2x A100 80 GB

SkanderBS2024 added the bug label on Aug 28, 2024
@longxudou

Encountering the same problem here with the latest main branch.
It works well with the r2.0.0rc1 version: https://github.com/NVIDIA/NeMo/tree/r2.0.0rc1
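
A quick, hypothetical check of which microbatch-calculator implementation the installed stack exposes (the module and attribute names below are an assumption, based on recent Megatron-Core moving the calculator out of apex, which would leave the apex global unset):

# Hypothetical check; these imports only exist in sufficiently recent versions.
import nemo
print("NeMo version:", nemo.__version__)

try:
    # Recent Megatron-Core ships its own calculator module.
    from megatron.core import num_microbatches_calculator as mcore_calc
    print("megatron.core calculator available:", hasattr(mcore_calc, "get_num_microbatches"))
except ImportError:
    print("megatron.core num_microbatches_calculator not available in this container")

from apex.transformer.pipeline_parallel import utils as apex_utils
print("apex global calculator:", apex_utils._GLOBAL_NUM_MICROBATCHES_CALCULATOR)  # None reproduces the error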
