
Training fails with Signal 7 (SIGBUS) when running on A10G GPU #392

Closed
EricGudgion opened this issue Aug 25, 2023 · 5 comments · Fixed by #288
Labels
type/bug Bug in code

Comments

@EricGudgion

When running on A10G GPUs, training fails.

Please see attached log.

The A10G does not support P2P, so I disabled it via environment variables, but training fails with the same error.
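For reference, a minimal sketch of how P2P can be disabled before training starts; NCCL_P2P_DISABLE is the standard NCCL variable, and setting it from Python is just one option (the exact variables used in the failing run may differ):

```python
import os

# NCCL_P2P_DISABLE=1 tells NCCL to skip peer-to-peer GPU transfers,
# which the A10G does not support. It must be set before the first
# CUDA/NCCL initialization (i.e. before torch.distributed is set up).
os.environ["NCCL_P2P_DISABLE"] = "1"
# Optional: verbose NCCL logging to confirm which transport is actually picked.
os.environ["NCCL_DEBUG"] = "INFO"
```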

This does work on the A100, but we want to get it running on A10Gs due to the cost difference.

Attached are the A10G and A100 nvidia-smi outputs.
A10G-p2P-not-supported
A100G-p2p-supported
Error-withNCCL_P2P_DISABLED.txt

@EricGudgion added the type/bug (Bug in code) label Aug 25, 2023
@pascal-pfeiffer
Collaborator

Thank you for reporting.

As the A10G has significantly less memory than the A100, this could be a simple OOM error. Did you try running the experiment on a single GPU?
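One way to force a single-GPU run without touching the experiment config is to hide all but one device via CUDA_VISIBLE_DEVICES (a sketch; the variable must be set before CUDA is initialized):

```python
import os

# Expose only GPU 0 to the process; the framework then sees a single device,
# so no NCCL peer-to-peer communication is attempted at all.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # should print 1
```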

@pascal-pfeiffer
Collaborator

A bus error (SIGBUS) suggests an invalid memory access; when running in Docker, a too-small shared memory segment (/dev/shm) is a common cause.

Could you please add the commit that you are using and/or the Docker image version?
If running with Docker, did you set the shm-size parameter as shown in the README?
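For illustration, a hedged sketch of the shared-memory setting using the Docker Python SDK; the image tag and size below are placeholders, not the values from the README (the CLI equivalent is the --shm-size flag on docker run):

```python
import docker

# Placeholder image name and size; use the exact values from the LLM Studio README.
# A too-small /dev/shm inside the container is a classic cause of SIGBUS (signal 7)
# when PyTorch DataLoader workers exchange tensors via shared memory.
client = docker.from_env()
client.containers.run(
    "<llm-studio-image>:<tag>",   # hypothetical placeholder
    shm_size="2g",                # same effect as `docker run --shm-size=2g`
    device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
    detach=True,
)
```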

@EricGudgion
Author

The LLM Studio version is 0.0.6.

Still working on running the single-GPU test, as the person running it is traveling this week.

@EricGudgion
Author

Success!! I got a successful run with tiiuae/falcon-7b and meta-llama/Llama-2-7b-chat-hf. However, I still can't run meta-llama/Llama-2-13b-chat-hf (it hits OOM), since it looks like the whole model needs to fit into a single GPU's memory. Does LLM Studio support splitting a model across the memory of multiple GPUs, i.e. true distributed training?

Nonetheless, this is great progress!!

@pascal-pfeiffer
Collaborator

Thank you for reporting.
With 4-bit quantization and a low batch size, experiments with 13B Llama 2 should also run on the A10G.
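As an illustration of the general idea (not LLM Studio's internal settings), 4-bit loading with transformers/bitsandbytes looks roughly like this; the model name is taken from the thread, the other values are common defaults:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization keeps the 13B weights at roughly 7 GB,
# which fits a 24 GB A10G with room for activations at small batch sizes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
```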

Nevertheless, we are actively working on true model sharding using DeepSpeed. It will only support float16 and requires a fast GPU interconnect (NVLink), though.
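For context, a generic DeepSpeed ZeRO stage 3 snippet (a sketch of the sharding idea only, not the actual configuration in PR #288):

```python
# ZeRO stage 3 partitions parameters, gradients and optimizer states
# across GPUs, so the full model no longer needs to fit on a single device.
ds_config = {
    "fp16": {"enabled": True},            # the sharding path supports float16 only
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": 1,
}
# Passed via deepspeed.initialize(model=..., config=ds_config) when launching
# with the `deepspeed` launcher across the available GPUs.
```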
related:
PR: #288
issue #390
issue #98
issue #239

@pascal-pfeiffer linked a pull request on Sep 5, 2023 that will close this issue