Training fails with Signal 7 (SIGBUS) when running on A10G GPU #392
Comments
Thank you for reporting. As the A10G has significantly less memory than the A100, this could be a simple OOM error causing the issue. Did you try to run the experiment on a single GPU?
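Not part of the original comment, but for reference, a minimal sketch of restricting a run to a single GPU. CUDA_VISIBLE_DEVICES is the standard CUDA mechanism; the check at the end is illustrative only.

```python
import os

# Expose only the first GPU to this process; this must be set before
# any CUDA context is created (e.g. before the first torch.cuda call).
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

# With the mask above, only one device should be visible.
print(torch.cuda.device_count())  # expected: 1
```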
A bus error suggests a memory access violation. Could you please add the commit that you are using and/or the Docker image version?
The LLM Studio version is 0.0.6. Still working on running the 1 x GPU test, as he is traveling this week.
Success!! I got a successful run with tiiuae/falcon-7b and meta-llama/Llama-2-7b-chat-hf. However, I still can't get meta-llama/Llama-2-13b-chat-hf to run (I get an OOM), because it looks like the whole model needs to fit into a single GPU's memory. Does LLM Studio support splitting a model across the memory of multiple GPUs, i.e. true distributed training? Nonetheless, this is great progress!!
Thank you for reporting back. Model sharding is not supported yet; nevertheless, we are actively working on actual model sharding using DeepSpeed. This will only support float16 and requires a good GPU interconnect (NVLink), though.
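For context only: a minimal sketch of what DeepSpeed ZeRO stage-3 model sharding typically looks like. This is not the LLM Studio implementation; the model name, config values, and launch command are placeholder assumptions.

```python
import deepspeed
from transformers import AutoModelForCausalLM

# Placeholder model; sharding targets the case where one full copy
# does not fit into a single GPU's memory.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-chat-hf")

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},          # the sharding path assumes float16
    "zero_optimization": {"stage": 3},  # shard params, grads, optimizer state
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
}

# Typically launched with something like: deepspeed --num_gpus=4 train.py
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

With ZeRO stage 3, parameters are partitioned across ranks and gathered on the fly, which is why fast GPU interconnect such as NVLink matters.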
When running on A10G GPUs, training fails.
Please see the attached log.
The A10G does not support P2P, so we disabled that using environment variables (a sketch of this is included after the attachment below), but it fails with the same error.
This does work on the A100, but we want to get it running on A10Gs due to the cost difference.
Attached is the A10G and A100 nvidia-smi output.
Error-withNCCL_P2P_DISABLED.txt
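Not shown in the thread: a minimal sketch of disabling NCCL peer-to-peer via environment variables before NCCL is initialized. NCCL_P2P_DISABLE is NCCL's documented switch; the exact variables used for this report are not listed, so treat the setup below as an assumption.

```python
import os

# Disable NCCL peer-to-peer transfers (the A10G lacks P2P support),
# forcing shared-memory / PCIe copies instead.
os.environ["NCCL_P2P_DISABLE"] = "1"

# Optional: verbose NCCL logging to confirm which transport is selected.
os.environ["NCCL_DEBUG"] = "INFO"

# ...start the usual multi-GPU training entry point after this point...
```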