
create_communicator_grouped2 may trigger an uninitialized-value memory issue (random crash) after more training iterations #959

Closed
anderson101866 opened this issue Jun 24, 2024 · 3 comments


anderson101866 commented Jun 24, 2024

Container:

nvcr.io/nvidia/pytorch:24.05-py3

Machine:

x86 CPU with A100 node

Reproduce:

```shell
python -m torch.distributed.run --nproc-per-node=2 examples/pytorch/comm_gemm_overlap/ln_mlp_with_overlap.py --num-iters=1000
```

It may or may not crash: at line 102, operator= first destroys the old std::function value held in _alloc_copy_allgather, but because the communicator struct is allocated with malloc, that "old value" is uninitialized garbage, so the destructor call is undefined behavior.
See also the definition of struct communicator:

```cpp
struct communicator {
  ...
  std::function<void(void **, void *, size_t, ExtComm)> _alloc_copy_allgather; // not initialized by malloc
  std::function<void(ExtComm)> _barrier;                                       // not initialized by malloc
  std::function<void(void *)> _free;                                           // not initialized by malloc
};
```

Hope this hint helps move things along.

@anderson101866
Author

The commit denera@7a9522b on this fork fixes it.

@denera
Collaborator

denera commented Aug 16, 2024

@anderson101866 This should be fixed now in TE/main as of PR #1087. Could you check and close the issue if resolved?

@anderson101866
Author

Yes, it’s resolved. Thanks a lot for your help!
