
Training fails with Signal 7 (SIGBUS) when running on A10G GPU #392

Closed
EricGudgion opened this issue Aug 25, 2023 · 5 comments · Fixed by #288
Labels
type/bug Bug in code

Comments

@EricGudgion

When running on A10G GPUs, training fails.

Please see attached log.

The A10G does not support P2P, so I disabled it via environment variables, but training fails with the same error.
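For reference, a minimal sketch of how P2P can be disabled before training starts; NCCL_P2P_DISABLE is the standard NCCL variable, and setting it from Python is just one option (the exact variables used in the failing run may differ):

```python
import os

# NCCL_P2P_DISABLE=1 tells NCCL to skip peer-to-peer GPU transfers,
# which the A10G does not support. It must be set before the first
# CUDA/NCCL initialization (i.e. before torch.distributed is set up).
os.environ["NCCL_P2P_DISABLE"] = "1"
# Optional: verbose NCCL logging to confirm which transport is actually picked.
os.environ["NCCL_DEBUG"] = "INFO"
```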

This does work on the A100, but we want to get it running on A10Gs due to the cost difference.

Attached are the A10G and A100 nvidia-smi outputs.
A10G-p2P-not-supported
A100G-p2p-supported
Error-withNCCL_P2P_DISABLED.txt

@EricGudgion added the type/bug (Bug in code) label Aug 25, 2023
@pascal-pfeiffer
Collaborator

Thank you for reporting.

As the A10G has significantly less memory than the A100, this could be a simple OOM error. Did you try running the experiment on a single GPU?
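One way to force a single-GPU run without touching the experiment config is to hide all but one device via CUDA_VISIBLE_DEVICES (a sketch; the variable must be set before CUDA is initialized):

```python
import os

# Expose only GPU 0 to the process; the framework then sees a single device,
# so no NCCL peer-to-peer communication is attempted at all.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # should print 1
```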

@pascal-pfeiffer
Collaborator

A bus error (SIGBUS) suggests an invalid memory access; when running in Docker, a too-small shared memory segment (/dev/shm) is a common cause.

Could you please add the commit that you are using and/or the Docker image version?
If running with Docker, did you set the shm-size parameter as shown in the README?
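For illustration, a hedged sketch of the shared-memory setting using the Docker Python SDK; the image tag and size below are placeholders, not the values from the README (the CLI equivalent is the --shm-size flag on docker run):

```python
import docker

# Placeholder image name and size; use the exact values from the LLM Studio README.
# A too-small /dev/shm inside the container is a classic cause of SIGBUS (signal 7)
# when PyTorch DataLoader workers exchange tensors via shared memory.
client = docker.from_env()
client.containers.run(
    "<llm-studio-image>:<tag>",   # hypothetical placeholder
    shm_size="2g",                # same effect as `docker run --shm-size=2g`
    device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
    detach=True,
)
```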

@EricGudgion
Author

The LLM Studio version is 0.0.6.

Still working on running the single-GPU test, as the person running it is traveling this week.

@EricGudgion
Author

Success!! I got a successful run with tiiuae/falcon-7b and meta-llama/Llama-2-7b-chat-hf. However, I still can't run meta-llama/Llama-2-13b-chat-hf (it hits OOM), since it looks like the whole model needs to fit into a single GPU's memory. Does LLM Studio support splitting a model across the memory of multiple GPUs, i.e. true distributed training?

Nonetheless, this is great progress!!

@pascal-pfeiffer
Collaborator

Thank you for reporting.
With 4-bit quantization and a low batch size, experiments with 13B Llama 2 should also run on the A10G.
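As an illustration of the general idea (not LLM Studio's internal settings), 4-bit loading with transformers/bitsandbytes looks roughly like this; the model name is taken from the thread, the other values are common defaults:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization keeps the 13B weights at roughly 7 GB,
# which fits a 24 GB A10G with room for activations at small batch sizes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
```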

Nevertheless, we are actively working on true model sharding using DeepSpeed. It will only support float16 and requires a fast GPU interconnect (NVLink), though.
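For context, a generic DeepSpeed ZeRO stage 3 snippet (a sketch of the sharding idea only, not the actual configuration in PR #288):

```python
# ZeRO stage 3 partitions parameters, gradients and optimizer states
# across GPUs, so the full model no longer needs to fit on a single device.
ds_config = {
    "fp16": {"enabled": True},            # the sharding path supports float16 only
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": 1,
}
# Passed via deepspeed.initialize(model=..., config=ds_config) when launching
# with the `deepspeed` launcher across the available GPUs.
```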
related:
PR: #288
issue #390
issue #98
issue #239

@pascal-pfeiffer linked a pull request on Sep 5, 2023 that will close this issue