[BUG] all_reduce_grads() fails with a Transformer model for number of nodes > 1 #1226
Comments
Could you please share the code or some simplified version to reproduce this?
Did that via direct email (code too complex to provide context here).
One more piece of information. I wanted to make sure that the dictionaries of batch_grads that are fed to all_reduce_grads() are valid even when there are multiple nodes. I put together a little check and ran it inside the training loop; the result looks clean for both nodes.
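The check itself is not shown above; a hypothetical sketch of what such a per-node gradient validity check could look like (check_grads_valid is an illustrative name, not from the original report):

```python
import mlx.core as mx
from mlx.utils import tree_flatten

def check_grads_valid(batch_grads, rank):
    # Hypothetical sketch: walk the gradient tree and flag any NaN/Inf
    # entries on this node before they are fed to all_reduce_grads().
    for name, g in tree_flatten(batch_grads):
        if mx.any(mx.isnan(g)).item() or mx.any(mx.isinf(g)).item():
            print(f"[rank {rank}] non-finite gradient in {name}")
            return False
    return True

# Inside the training loop, before the reduction:
#   rank = mx.distributed.init().rank()
#   assert check_grads_valid(batch_grads, rank)
#   grads = all_reduce_grads(batch_grads)
```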
I don't know if it's related, but the code @sck-at-ucy shared does not run on a single machine (M2 Ultra) for the multiprocess case. I believe it's a GPU timeout issue. I haven't tried it on multiple machines yet, though. See related issue #1231.
I have made some progress with the code. It is still the case that …
@sck-at-ucy I think this may be network instability or something along those lines. I can train over Ethernet for hundreds of thousands of iterations, so I can't reproduce this. Let me know if there is more information on the original bug report; otherwise I am inclined to close this.
@angeloskath My attention has been on the issues I have been having with continuing training after reloading states from file, so I have been running on a single node trying to understand that. The status with distributed computing as of a couple of weeks ago was the following for me: if I significantly decreased the size of the Transformer model …
I think this failure had to do with IP over Thunderbolt across many machines, so it is not something we can fix from MLX. Over Ethernet it works quite reliably. If this is not the case, feel free to reopen.
Describe the bug
While all_reduce_grads, defined as per the documentation example, worked for a simple model, when I try to use it with a fairly large Transformer model it only works if I am running on a single node (i.e. N=1). If I try to use it with N>1, the code goes into a zombie state: the GPU stops being utilized effectively and the code never recovers from that state. To debug, I tried to print out the reduced grads returned by all_reduce_grads(). With a single node, I get what I expected and it looks fine; see attached file: GradsforSingleNode.txt. With N>2, the print operation itself causes the code to crash.
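The documentation example referred to above is not reproduced in this report; a sketch of that style of definition, assuming the mx.distributed API (init, all_sum) and mlx.utils.tree_map, is:

```python
import mlx.core as mx
from mlx.utils import tree_map

def all_reduce_grads(grads):
    # Average the gradient tree across all nodes in the distributed group.
    N = mx.distributed.init().size()
    if N == 1:
        return grads
    return tree_map(lambda g: mx.distributed.all_sum(g) / N, grads)
```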
To Reproduce
Include code snippet
Expected behavior
I expected to get back the averaged grads; the next step would have been to use optimizer.update() with the all-reduced grads.
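A sketch of that intended training step, using the all_reduce_grads definition sketched above and toy stand-ins for the model, loss, optimizer, and data (none of which are part of the original report):

```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

# Toy placeholders standing in for the reporter's Transformer, loss, and batch.
model = nn.Linear(16, 1)
optimizer = optim.Adam(learning_rate=1e-3)
X, y = mx.random.normal((8, 16)), mx.random.normal((8, 1))

def loss_fn(model, X, y):
    return nn.losses.mse_loss(model(X), y)

loss_and_grad_fn = nn.value_and_grad(model, loss_fn)
loss, batch_grads = loss_and_grad_fn(model, X, y)

grads = all_reduce_grads(batch_grads)   # average grads across nodes
optimizer.update(model, grads)          # apply the averaged grads
mx.eval(model.parameters(), optimizer.state)
```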
Desktop (please complete the following information):
Additional context
After successfully running a simpler model on up to 4 nodes, I tried to do the same with the Transformer model, but then I ran into this trouble that prevents me from reducing the grads across nodes. The issue happens with any number of nodes > 2.