Some testing from me #407
Comments
This might be because internet access is not available?
You can try this
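Purely as an illustration (not necessarily what was being suggested here): if the compute nodes really cannot reach the internet, one common workaround is to pre-populate the Hugging Face cache on a machine that can, then run the job in offline mode. A rough sketch, assuming the datasets library and the allenai/c4 dataset name; the shard path is hypothetical and only there to keep the download small:

```python
# Run on a login node (or anywhere with internet) to warm the HF cache.
from datasets import load_dataset

load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00000-of-01024.json.gz",  # hypothetical single shard; full c4 is huge
    split="train",
)

# Then point the slurm job at the same cache (same HF_HOME as above) and force
# offline mode, e.g.:
#   export HF_HOME=/shared/hf_cache
#   export HF_DATASETS_OFFLINE=1
#   export HF_HUB_OFFLINE=1
```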
This shouldn't be the case for steps on the same rank.
This seems OK. For issues 4, 5, and 6, I'm not sure what's going on. I wonder if @lessw2020 has any insights on these slurm jobs.
The node has a connection to the internet, since it's able to log things to the wandb cloud (at least, rank 0 is). I don't think it was a blip in the connection, since I tried it a few times over 12 hours. My cluster is full, so I can't test it at the moment. It could be some random HF issue, in which case it's somebody else's problem.
The trick used in our internal codebase is: information is logged only once.
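A minimal sketch of one way to do that (the actual internal mechanism isn't shown above, so this is an assumption), gating the log call on the torch.distributed rank:

```python
import logging

import torch.distributed as dist

logger = logging.getLogger(__name__)


def log_once(msg: str) -> None:
    # Emit the message from rank 0 only, so it shows up a single time in the
    # combined job log instead of once per rank.
    if not dist.is_available() or not dist.is_initialized() or dist.get_rank() == 0:
        logger.info(msg)
```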
The C4 HuggingFace issues are related to multi-GPU jobs in some way. Single GPU works: torchtitan_multi_node5885.txt. Multi GPU errors: torchtitan_multi_node5886.txt. I don't personally care about this HF issue, so it's up to you whether fixing it is worth anyone's time.
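One common source of HF dataset errors that only show up in multi-process jobs is every rank downloading/preparing the dataset at the same time. Whether that is what's happening here is a guess, but the usual guard looks roughly like this (assumes torch.distributed is already initialized and a shared cache; the dataset call and shard path are illustrative only):

```python
import torch.distributed as dist
from datasets import load_dataset


def load_c4_rank0_first():
    # Let global rank 0 download/prepare the dataset while everyone else waits,
    # then let the remaining ranks read it from the (shared) cache.
    if dist.get_rank() != 0:
        dist.barrier()  # wait for rank 0 to finish downloading
    ds = load_dataset(
        "allenai/c4",
        data_files="en/c4-train.00000-of-01024.json.gz",  # hypothetical small shard
        split="train",
    )
    if dist.get_rank() == 0:
        dist.barrier()  # release the other ranks
    return ds
```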
I tried torchtitan. Here's a grab bag of issues. My setup is a CoreWeave-provided PyTorch nightly image, on a CW-hosted HGX under slurm. The PyTorch and torchtitan builds are nightlies from a few days ago.
Build information
This PyTorch nightly does not work with my own codebase; the loss is a flat line. However, torchtitan does not experience this issue and is able to train. (I tried PyTorch nightly 7 times over a few months, but it failed with a different issue each time, so this isn't unusual.)
Using the c4 dataset doesn't work. Using c4_mini does.
Error message with c4
For my final run, I changed these in the llama3 8B config:
Here's the slurm script I used, a merge of torchtitan's multinode_trainer.slurm and CoreWeave's slurm script:
multinode_trainer.txt
Some changes were important on my setup, but I'm not sure which ones matter or would cause problems.
rdzv_endpoint=localhost:0 was needed, and so was disabling the dcgmi calls. I understand almost nothing about the changes I made to this slurm script. It runs on one HGX (8 H100s).
Here's the output log: torchtitan_multi_node5645.txt
Observations:
- PYTHONUNBUFFERED=1, but this seems like a bad idea. This may be a reason why (3) is happening. (One possible alternative is sketched after this list.)
- If flavor = "8B", then ranks 1-7 OOM, but root rank 0 trains properly after the slew of error messages: 2024-06-17 04:57:31,148 - root - INFO - step: 40 loss: 7.3387 memory: 41.62GiB(52.61%) wps: 6,268 mfu: 36.71%. (It's with full AC; remember the config change above.)
- If compile = false, then root rank 0 also crashes, so I can't see any MFU numbers. (In my own codebase, toggling torch.compile on/off causes no change in TFLOPS when activation checkpointing is active, which is the issue I was hoping to investigate using torchtitan.)
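On the PYTHONUNBUFFERED point above: as a hedged alternative (assuming the underlying concern is that per-rank stdout is block-buffered under slurm, which is a guess), line buffering can be turned on from inside the script instead of disabling buffering globally:

```python
import sys

# Switch stdout/stderr to line buffering so log lines show up promptly,
# without setting PYTHONUNBUFFERED=1 for the whole process tree.
# TextIOWrapper.reconfigure() is available on Python 3.7+.
sys.stdout.reconfigure(line_buffering=True)
sys.stderr.reconfigure(line_buffering=True)
```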