Describe the bug
As described in the title, an error occurs when launching the fine-tuning script linked here.
Steps/Code to reproduce bug
Please list minimal steps or a code snippet for us to be able to reproduce the bug.
A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.
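The exact launch command is not shown in the log, so the following is a best-effort reconstruction: the `torchrun --nproc_per_node=2` prefix is inferred from the elastic-launcher output and `trainer.devices=2`, and the overrides are copied verbatim from the Hydra error message below.

```bash
# Reconstructed launch command (approximate; torchrun invocation inferred from the log)
torchrun --nproc_per_node=2 /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_mamba_finetuning.py \
  trainer.devices=2 \
  trainer.precision=bf16 \
  trainer.accelerator=gpu \
  trainer.log_every_n_steps=1 \
  trainer.val_check_interval=100 \
  trainer.limit_val_batches=50 \
  +trainer.num_sanity_val_steps=0 \
  +trainer.accumulate_grad_batches=1 \
  trainer.max_steps=700 \
  trainer.gradient_clip_val=1.0 \
  exp_manager.exp_dir=/workspace/checkpoints/finetuned \
  exp_manager.resume_if_exists=True \
  exp_manager.create_checkpoint_callback=True \
  exp_manager.create_wandb_logger=False \
  ++model.hidden_dropout=0.1 \
  ++model.attention_dropout=0.1 \
  model.tensor_model_parallel_size=1 \
  model.sequence_parallel=False \
  model.peft.peft_scheme=none \
  model.megatron_amp_O2=True \
  model.encoder_seq_length=2048 \
  model.data.validation_ds.pad_to_max_length=True \
  model.data.train_ds.pad_to_max_length=True \
  model.optim.name=distributed_fused_adam \
  model.data.train_ds.max_seq_length=2048 \
  model.data.validation_ds.max_seq_length=2048 \
  model.micro_batch_size=4 \
  model.global_batch_size=128 \
  ++data.micro_batch_size=4 \
  ++data.global_batch_size=128 \
  model.restore_from_path=/workspace/checkpoints/mamba-7b.nemo \
  model.data.train_ds.file_names=[/workspace/nemo/NeMo/databricks-dolly-15k/training.jsonl] \
  model.data.validation_ds.file_names=[/workspace/nemo/NeMo/databricks-dolly-15k/validation.jsonl] \
  model.optim.lr=5e-6 \
  model.optim.sched.min_lr=1e-7
```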
```
Epoch 0: : 0%| | 0/700 [00:00<?]
Error executing job with overrides: ['trainer.devices=2', 'trainer.precision=bf16', 'trainer.accelerator=gpu', 'trainer.log_every_n_steps=1', 'trainer.val_check_interval=100', 'trainer.limit_val_batches=50', '+trainer.num_sanity_val_steps=0', '+trainer.accumulate_grad_batches=1', 'trainer.max_steps=700', 'trainer.gradient_clip_val=1.0', 'exp_manager.exp_dir=/workspace/checkpoints/finetuned', 'exp_manager.resume_if_exists=True', 'exp_manager.create_checkpoint_callback=True', 'exp_manager.create_wandb_logger=False', '++model.hidden_dropout=0.1', '++model.attention_dropout=0.1', 'model.tensor_model_parallel_size=1', 'model.sequence_parallel=False', 'model.peft.peft_scheme=none', 'model.megatron_amp_O2=True', 'model.encoder_seq_length=2048', 'model.data.validation_ds.pad_to_max_length=True', 'model.data.train_ds.pad_to_max_length=True', 'model.optim.name=distributed_fused_adam', 'model.data.train_ds.max_seq_length=2048', 'model.data.validation_ds.max_seq_length=2048', 'model.micro_batch_size=4', 'model.global_batch_size=128', '++data.micro_batch_size=4', '++data.global_batch_size=128', 'model.restore_from_path=/workspace/checkpoints/mamba-7b.nemo', 'model.data.train_ds.file_names=[/workspace/nemo/NeMo/databricks-dolly-15k/training.jsonl]', 'model.data.validation_ds.file_names=[/workspace/nemo/NeMo/databricks-dolly-15k/validation.jsonl]', 'model.optim.lr=5e-6', 'model.optim.sched.min_lr=1e-7']
[rank0]: Traceback (most recent call last):
[rank0]:   File "/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_mamba_finetuning.py", line 60, in <module>
[rank0]:     main()
[rank0]:   File "/workspace/nemo/NeMo/nemo/core/config/hydra_runner.py", line 129, in wrapper
[rank0]:     _run_hydra(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
[rank0]:     _run_app(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
[rank0]:     run_and_report(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
[rank0]:     raise ex
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
[rank0]:     return func()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
[rank0]:     lambda: hydra.run(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
[rank0]:     _ = ret.return_value
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
[rank0]:     raise self._return_value
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
[rank0]:     ret.return_value = task_function(task_cfg)
[rank0]:   File "/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_mamba_finetuning.py", line 56, in main
[rank0]:     trainer.fit(model)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
[rank0]:     call._call_and_handle_interrupt(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank0]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank0]:     return function(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
[rank0]:     self._run(model, ckpt_path=ckpt_path)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 986, in _run
[rank0]:     results = self._run_stage()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1030, in _run_stage
[rank0]:     self.fit_loop.run()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 205, in run
[rank0]:     self.advance()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 363, in advance
[rank0]:     self.epoch_loop.run(self._data_fetcher)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 140, in run
[rank0]:     self.advance(data_fetcher)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/training_epoch_loop.py", line 250, in advance
[rank0]:     batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 190, in run
[rank0]:     self._optimizer_step(batch_idx, closure)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 268, in _optimizer_step
[rank0]:     call._call_lightning_module_hook(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 159, in _call_lightning_module_hook
[rank0]:     output = fn(*args, **kwargs)
[rank0]:   File "/workspace/nemo/NeMo/nemo/collections/nlp/models/language_modeling/megatron_base_model.py", line 1297, in optimizer_step
[rank0]:     super().optimizer_step(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/module.py", line 1308, in optimizer_step
[rank0]:     optimizer.step(closure=optimizer_closure)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/optimizer.py", line 153, in step
[rank0]:     step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 270, in optimizer_step
[rank0]:     optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 238, in optimizer_step
[rank0]:     return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/plugins/precision/amp.py", line 74, in optimizer_step
[rank0]:     return super().optimizer_step(optimizer, model=model, closure=closure, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/plugins/precision/precision.py", line 122, in optimizer_step
[rank0]:     return optimizer.step(closure=closure, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
[rank0]:     return wrapped(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank0]:     out = func(*args, **kwargs)
[rank0]:   File "/opt/apex/apex/contrib/optimizers/distributed_fused_adam.py", line 2292, in step
[rank0]:     loss = closure()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/plugins/precision/precision.py", line 108, in _wrap_closure
[rank0]:     closure_result = closure()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 144, in __call__
[rank0]:     self._result = self.closure(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 129, in closure
[rank0]:     step_output = self._step_fn()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/automatic.py", line 317, in _training_step
[rank0]:     training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 311, in _call_strategy_hook
[rank0]:     output = fn(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 390, in training_step
[rank0]:     return self.lightning_module.training_step(*args, **kwargs)
[rank0]:   File "/workspace/nemo/NeMo/nemo/utils/model_utils.py", line 434, in wrap_training_step
[rank0]:     output_dict = wrapped(*args, **kwargs)
[rank0]:   File "/workspace/nemo/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 860, in training_step
[rank0]:     loss_mean = self.training_step_fwd_bwd_step_call(dataloader_iter, forward_only=False)
[rank0]:   File "/workspace/nemo/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 781, in training_step_fwd_bwd_step_call
[rank0]:     loss_mean = self.fwd_bwd_step(dataloader_iter, forward_only)
[rank0]:   File "/workspace/nemo/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py", line 365, in fwd_bwd_step
[rank0]:     data_iter = get_iterator_k_split(batch, get_num_microbatches(), self.enforce_divisible_batch)
[rank0]:   File "/opt/apex/apex/transformer/pipeline_parallel/utils.py", line 93, in get_num_microbatches
[rank0]:     return _GLOBAL_NUM_MICROBATCHES_CALCULATOR.get()
[rank0]: AttributeError: 'NoneType' object has no attribute 'get'
```

Rank 1 then fails with the identical traceback (repeated verbatim in the original log with `[rank1]` in place of `[rank0]`, so it is elided here), and the launcher exits:

```
Epoch 0: : 0%| | 0/700 [00:09<?]
[2024-08-28 13:45:50,925] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 5090) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.3.0a0+40ec155e58.nv24.3', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 834, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 825, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 137, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 271, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_mamba_finetuning.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-08-28_13:45:50
  host      : fs-api-66209ad2-6222-4e62-bbbf-e3de667aab8e
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 5091)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-28_13:45:50
  host      : fs-api-66209ad2-6222-4e62-bbbf-e3de667aab8e
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 5090)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
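The immediate failure is that apex's module-level `_GLOBAL_NUM_MICROBATCHES_CALCULATOR` is still `None` when `get_num_microbatches()` is first called, i.e. nothing initialized the microbatch calculator before the first optimizer step. Note also that the traceback mixes two NeMo trees: the example script runs from `/opt/NeMo`, while the imported `nemo` package resolves to `/workspace/nemo/NeMo`, so initialization done by one copy of the code may not be visible to the module the other copy queries.

As a diagnostic sketch (untested, not a confirmed fix), one could initialize apex's calculator explicitly before `trainer.fit()` is reached. The `setup_microbatch_calculator` name and signature are taken from `apex/transformer/pipeline_parallel/utils.py` in the container's apex, and `ensure_microbatch_calculator` is a hypothetical helper; verify both against your installed apex before relying on this.

```python
# Workaround sketch (assumption, untested): explicitly initialize apex's
# global microbatch calculator before trainer.fit(), so that
# get_num_microbatches() no longer reads a None global.
import torch.distributed as dist
from apex.transformer.pipeline_parallel import utils as apex_pp_utils


def ensure_microbatch_calculator(global_batch_size: int = 128,
                                 micro_batch_size: int = 4,
                                 data_parallel_size: int = 2) -> None:
    """Set up the calculator that get_num_microbatches() reads, if still unset.

    Defaults mirror the overrides above: global batch 128, micro batch 4,
    and data-parallel size 2 (two GPUs with tensor_model_parallel_size=1).
    """
    if apex_pp_utils._GLOBAL_NUM_MICROBATCHES_CALCULATOR is None:
        apex_pp_utils.setup_microbatch_calculator(
            rank=dist.get_rank() if dist.is_initialized() else 0,
            rampup_batch_size=None,  # no batch-size ramp-up configured
            global_batch_size=global_batch_size,
            micro_batch_size=micro_batch_size,
            data_parallel_size=data_parallel_size,
        )
```

One caveat on this sketch: if the calculator is initialized elsewhere under a different module instance (plausible given the two NeMo trees above), this check would not see it, which is why confirming which paths Python actually imports matters.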
Expected behavior
Fine-tuning was expected to start and proceed normally; instead, the error above is raised as soon as the first training step begins.
Environment overview (please complete the following information)
Method of install: the NGC NeMo container, obtained with `docker pull` and launched with the `docker run` command below.
```bash
docker run --gpus all --shm-size=80g --net=host --ulimit memlock=-1 --rm -it \
  -v /ephemeral/navinenv/work/megatron:/workspace/megatron \
  -v /ephemeral/navinenv/work/data:/workspace/dataset/data \
  -v /ephemeral/navinenv/work/outfix:/workspace/dataset/outfix \
  -v /ephemeral/navinenv/work/tok:/workspace/dataset/tok \
  -v /ephemeral/navinenv/work/checkpoints:/workspace/checkpoints \
  -v /ephemeral/navinenv/work/nemo:/workspace/nemo \
  -v /ephemeral/navinenv/work/tmp:/tmp \
  nvcr.io/nvidia/nemo:24.07
```
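Because the traceback shows the example script running from `/opt/NeMo` while the `nemo` package imports from the mounted `/workspace/nemo/NeMo` tree, it may help to record which tree and version Python actually picks up inside the container. A quick check, assuming the package exposes `__version__` as recent NeMo releases do:

```bash
# Inside the container: confirm which NeMo tree and version get imported.
python -c "import nemo; print(nemo.__version__, nemo.__file__)"
```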
Environment details
If an NVIDIA Docker image is used, these don't need to be specified; otherwise, please provide them. (The NVIDIA NeMo 24.07 image above is used here.)
Additional context
GPUs: 2× A100 80 GB
Encountering the same problem here with the latest main branch. It works fine with the r2.0.0rc1 version: https://github.com/NVIDIA/NeMo/tree/r2.0.0rc1.
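If the regression is indeed on main, a possible interim workaround (an assumption, not a confirmed fix) is to pin the mounted tree to that tag. Since the traceback imports `nemo` straight from `/workspace/nemo/NeMo`, checking out the tag there may be enough:

```bash
# Hypothetical interim workaround: pin the mounted NeMo tree to r2.0.0rc1.
git -C /workspace/nemo/NeMo checkout r2.0.0rc1
# Only needed if nemo is pip-installed from this tree rather than on PYTHONPATH:
pip install -e /workspace/nemo/NeMo
```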