
create_communicator_grouped2 may trigger an uninitialized-value memory issue (random crash) after more training iterations #959

Closed
anderson101866 opened this issue Jun 24, 2024 · 3 comments


anderson101866 commented Jun 24, 2024

Container:

nvcr.io/nvidia/pytorch:24.05-py3

Machine:

x86 CPU with A100 node

Reproduce:

```shell
python -m torch.distributed.run --nproc-per-node=2 examples/pytorch/comm_gemm_overlap/ln_mlp_with_overlap.py --num-iters=1000
```

It may or may not crash: at line 102, operator= first destroys the old std::function value held in _alloc_copy_allgather, but because the communicator struct is allocated with malloc, that "old value" is uninitialized garbage, so the destructor call is undefined behavior.
See also the definition of struct communicator:

```cpp
struct communicator {
  ...
  std::function<void(void **, void *, size_t, ExtComm)> _alloc_copy_allgather; // not initialized by malloc
  std::function<void(ExtComm)> _barrier;                                       // not initialized by malloc
  std::function<void(void *)> _free;                                           // not initialized by malloc
};
```

Hope this hint helps move things along.

@anderson101866
Author

The commit denera@7a9522b on this fork fixes it.

@denera
Collaborator

denera commented Aug 16, 2024

@anderson101866 This should be fixed now in TE/main as of PR #1087. Could you check and close the issue if resolved?

@anderson101866
Author

Yes, it’s resolved. Thanks a lot for your help!
