deepspeed #288

Merged: 67 commits merged into main on Oct 24, 2023

Conversation

@haqishen (Contributor) commented on Jul 18, 2023

Feature

  • deepspeed zero3 training
  • deepspeed zero3 w/ lora training
  • deepspeed zero3 w/ offload optimizer training (see the config sketch below)
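
For reference, the snippet below is a minimal sketch of what a ZeRO-3 config with optional CPU optimizer offload can look like when passed to `deepspeed.initialize` as a Python dict. It is an illustration only, not the exact config used in this PR; the tiny `torch.nn.Linear` model and all values are placeholders (the learning rate and "live param" value simply mirror the benchmark tables below).

```python
import torch
import deepspeed

# Placeholder model so the sketch is self-contained; in practice this would
# be the causal LM backbone selected in LLM Studio.
model = torch.nn.Linear(16, 16)

ds_config = {
    "train_micro_batch_size_per_gpu": 1,       # matches the batch size used in the benchmarks below
    "bf16": {"enabled": True},                 # or "fp16": {"enabled": True}
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "zero_optimization": {
        "stage": 3,                            # ZeRO stage 3: shard params, grads and optimizer states
        "stage3_max_live_parameters": 1e9,     # the "live param 1e9" setting in the tables below
        # Uncomment to offload optimizer states to CPU (lower GPU memory, slower):
        # "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

# Needs to be launched with the `deepspeed` launcher (or an equivalent
# torch.distributed setup) so the process group is initialized.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```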

Experiment Result

Using 3 x RTX 6000 (24GB), batch size = 1

Full params experiments

| -- | backbone | dtype | LORA | deepspeed | offload optimizer (live param 1e9) | RAM usage per GPU | runtime (data sample 0.1) | Perplexity |
|----|----------|-------|------|-----------|------------------------------------|-------------------|---------------------------|------------|
| exp1 | EleutherAI/pythia-1b-deduped | bfloat16 | False | False | False | 15GB | 00:03:28 | 11.9258 |
| exp2 | EleutherAI/pythia-1b-deduped (lr 0.00001) | float16 | False | True | False | 11.5GB | 00:04:31 | 10.1721 |
| exp3 | EleutherAI/pythia-1b-deduped (lr 0.00001) | float16 | False | True | True | 5GB | 00:29:07 | 10.1965 |
| exp4 | EleutherAI/pythia-2.8b-deduped | bfloat16 | False | False | False | OOM | N/A | N/A |
| exp5 | EleutherAI/pythia-2.8b-deduped | float16 | False | True | False | OOM | N/A | N/A |
| exp6 | EleutherAI/pythia-2.8b-deduped | float16 | False | True | True | 10.5GB | 01:18:44 | 8.7803 |
| exp7 | EleutherAI/pythia-6.9b-deduped | float16 | False | True | True (live param 1e10) | 23GB | OOM (cpu) | OOM (cpu) |

LORA experiments

| -- | backbone | dtype | LORA | deepspeed | RAM usage per GPU | runtime (data sample 0.1) | Perplexity |
|----|----------|-------|------|-----------|-------------------|---------------------------|------------|
| exp1 | EleutherAI/pythia-2.8b-deduped | float16 | True | False | 11.5GB | 00:00:57 | 9.7002 |
| exp2 | EleutherAI/pythia-2.8b-deduped | float16 | True | True | 4.5GB | 00:08:19 | 9.8184 |
| exp3 | EleutherAI/pythia-6.9b-deduped | float16 | True | False | 16.5GB | 00:01:40 | 9.4829 |
| exp4 | EleutherAI/pythia-6.9b-deduped | float16 | True | True | 8.5GB | 00:23:35 | 9.6707 |
| exp5 | EleutherAI/pythia-12b-deduped | float16 | True | False | OOM | N/A | N/A |
| exp6 | EleutherAI/pythia-12b-deduped | float16 | True | True | 12.5GB | 00:46:32 | 9.1973 |
| exp7 | EleutherAI/pythia-12b-deduped | int8 | True | False | 17GB | 00:06:58 | 10.1232 |
| exp8 | EleutherAI/pythia-20b-deduped | float16 | True | True | 18.5GB | 00:56:51 | 8.4031 |

Using 8 x V100 w/ NVLink (16GB), batch size = 1

| -- | backbone | dtype | LORA | deepspeed | RAM usage per GPU | runtime (data sample 0.1) | Perplexity |
|----|----------|-------|------|-----------|-------------------|---------------------------|------------|
| exp1 | EleutherAI/pythia-20b-deduped | int4 | True | False | 15.5GB | 00:02:29 | 9.3201 |
| exp2 | EleutherAI/pythia-20b-deduped | float16 | True | True | 10.5GB | 00:04:19 | 8.7182 |

Using 8 x A6000 w/o NVLink (48GB), batch size = 1

| -- | backbone | dtype | LORA | deepspeed | RAM usage per GPU | runtime (data sample 0.1) | Perplexity |
|----|----------|-------|------|-----------|-------------------|---------------------------|------------|
| exp1 | tiiuae/falcon-40b | int4 | True | False | 45GB | 00:12:41 | 5.7722 |
| exp2 | tiiuae/falcon-40b | float16 | True | True | 22GB | 02:30:52 | 5.7743 |
| exp3 | TheBloke/Llama-2-70B-Chat-fp16 | int4 | True | False | 46GB | 00:20:56 | 4.4524 |
| exp4 | TheBloke/Llama-2-70B-Chat-fp16 | float16 | True | True | 28GB | 04:30:58 | 4.4221 |

NVLink makes a clear difference:

| -- | backbone | dtype | LORA | deepspeed | RAM usage per GPU | runtime (data sample 0.1) |
|----|----------|-------|------|-----------|-------------------|---------------------------|
| w/ NVLink | EleutherAI/pythia-20b-deduped | float16 | True | True | 10.5GB | 00:04:19 |
| w/o NVLink | EleutherAI/pythia-20b-deduped | float16 | True | True | 10.5GB | 00:45:37 |

Using 8 x A100 SXM4

| -- | backbone | dtype | LORA | deepspeed | RAM usage per GPU | runtime |
|----|----------|-------|------|-----------|-------------------|---------|
| exp1 | TheBloke/Llama-2-70B-Chat-fp16 (4k) | int4 | True | False | ~80GB | 35h |
| exp2 | TheBloke/Llama-2-70B-Chat-fp16 (4k) | float16 | True | True | ~80GB | 6.5h |
| exp3 | h2oai/h2ogpt-4096-llama2-13b-chat | float16 | True | True | 11GB | 16min |
| exp4 | h2oai/h2ogpt-4096-llama2-13b-chat | float16 | True | False | 38GB | 13min |

Check

  • Chat
  • Upload model weight

Future Work

  • ZeRO++
  • zero3 w/ lora and offload optimizer
  • zero3 w/ offload params (see the config sketch below)
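
For the last item, a ZeRO-3 parameter offload would presumably just extend the `zero_optimization` section. The fragment below is an illustration of the standard DeepSpeed options, not something implemented in this PR:

```python
# Illustration only: ZeRO-3 with both optimizer-state and parameter offload
# to CPU. "offload_param" moves the sharded parameters themselves to CPU,
# on top of the optimizer-state offload benchmarked above.
zero3_full_offload = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    }
}
```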

@pascal-pfeiffer (Collaborator) left a comment

Thanks a lot for working on this tricky topic, @haqishen

I think we can remove offloading to CPU for the time being, as the speed apparently drops way too much.

I already added a few comments; I will need to do some proper testing on a multi-GPU setup and also with continued training, upload of weights, etc. I assume you checked these already?

(Review threads on train.py, llm_studio/src/utils/modeling_utils.py, and Pipfile; all resolved.)
@psinger (Collaborator) commented on Sep 11, 2023

One more find, could this be useful for saving?
https://deepspeed.readthedocs.io/en/stable/zero3.html#deepspeed.runtime.zero.config.DeepSpeedZeroConfig.gather_16bit_weights_on_model_save
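
If it helps, this is roughly how that option appears in a dict-style config. A sketch only: in the JSON/dict form the field carries a `stage3_` prefix, and whether it fits LLM Studio's saving path is exactly the open question above.

```python
# Sketch only, not the PR's code.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # Gather the full partitioned weights to rank 0 as a 16-bit state
        # dict when the model is saved, instead of leaving per-rank shards.
        "stage3_gather_16bit_weights_on_model_save": True,
    }
}

# With the flag enabled, the engine can then write one consolidated file, e.g.
# model_engine.save_16bit_model("output_dir", "pytorch_model.bin")
```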

@haqishen (Contributor, Author) commented

> Do I read the table right, that after NVLink / SXM4 the runtime goes down a lot with deepspeed? Does it match ddp runtime then?

Just updated the table with some experiment results that compare ddp and deepspeed runtime. It shows that deepspeed is still slower than ddp fp16 by around 15~20%, but much faster than ddp int8 or int4.

> Can we make these sliders?

How do I do this? I searched for the keyword data_sample but cannot find why it is rendered as a slider in the web UI.

> Let's fully remove FSDP in favor of Deepspeed

Let's do it in a new PR.

@psinger (Collaborator) commented on Oct 9, 2023

@haqishen I believe we still have not solved the desync issue when a long-running generate is in progress and the checkpoint is saved afterwards. I believe we should change the order there as discussed earlier: save the checkpoint before running eval, unless best-epoch saving is enabled.
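
A rough sketch of the suggested ordering; all of the names below are placeholders, not LLM Studio's actual API.

```python
# Hypothetical sketch of the ordering suggested above; none of these names
# come from the LLM Studio code base.
from dataclasses import dataclass

@dataclass
class Cfg:
    save_best_checkpoint: bool = False
    output_directory: str = "output"

def save_checkpoint(model, directory):
    """Placeholder for the (collective) checkpoint save."""
    print(f"saving checkpoint to {directory}")

def run_validation(model, dataloader):
    """Placeholder for eval; may call generate() and run for a long time."""
    return 0.0

def on_epoch_end(cfg, model, dataloader, best_metric=float("inf")):
    if not cfg.save_best_checkpoint:
        # Save before the long-running eval so every rank reaches the
        # collective save at the same point and cannot desync.
        save_checkpoint(model, cfg.output_directory)
    val_metric = run_validation(model, dataloader)
    if cfg.save_best_checkpoint and val_metric < best_metric:
        save_checkpoint(model, cfg.output_directory)
    return val_metric
```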

@haqishen (Contributor, Author) commented

> I believe we still have not solved the desync issue when a long-running generate is in progress and the checkpoint is saved afterwards.

btw what's your experiment setting?

@psinger (Collaborator) commented on Oct 12, 2023

> btw what's your experiment setting?

default with gpt metric

@psinger (Collaborator) left a comment

I fixed another issue.

We need to merge main and resolve the conflicts, and then we can merge.

After that, please open follow-up issues for the things not yet tackled in this PR and for potential future improvements.

We also potentially need a section in the README / docs, and it might also be useful to share your benchmarks there.

Thanks!

@haqishen merged commit 67d3a3c into main on Oct 24, 2023 (5 checks passed) and deleted the deepspeed branch on October 24, 2023 at 07:54.