deepspeed #288
Conversation
Thanks a lot for working on this tricky topic, @haqishen
I think we can remove offloading to CPU for the time being, as the speed apparently drops way too much.
I already added a few comments. I will need to do some proper testing on a multi-GPU setup, and also with continued training, upload of weights, etc. I assume you checked these already?
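For reference, a minimal sketch of what a ZeRO-3 setup without CPU offloading could look like. The keys below are standard DeepSpeed config options, but the exact config this PR wires up may differ, and the toy model is only for illustration:

```python
import torch
import deepspeed

model = torch.nn.Linear(8, 8)  # toy model, stand-in for the real LLM

# ZeRO stage 3 without CPU offload: offload is disabled simply by
# omitting the "offload_optimizer" / "offload_param" sections.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        # To bring offloading back later, one would add e.g.:
        # "offload_optimizer": {"device": "cpu", "pin_memory": True},
        # "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

# Requires a distributed launch (e.g. via the `deepspeed` launcher).
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```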
One more find: could this be useful for saving?
Just updated the table with some experiment results comparing DDP and DeepSpeed runtime. It shows that DeepSpeed is still slower than DDP fp16 by around 15-20%, but much faster than DDP int8 or int4.
How do I do this? I searched the keyword
Let's do it in a new PR.
@haqishen I believe we still have not solved the desync issue that occurs when a long-running generate is followed by checkpoint saving. We should change the order as discussed earlier: save the checkpoint before running eval, if best-epoch saving is not enabled.
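A rough sketch of the proposed ordering, with hypothetical hook, flag, and helper names (not the actual code in this PR):

```python
import torch
import torch.distributed as dist

def save_checkpoint(model, epoch):  # hypothetical helper
    torch.save(model.state_dict(), f"checkpoint_{epoch}.pt")

def run_eval(model):  # hypothetical helper; may call model.generate()
    ...

def on_epoch_end(cfg, model, epoch):
    # Proposed order: checkpoint first, evaluation second, so that no
    # rank desyncs while rank 0 is still inside a long-running generate().
    if not cfg.save_best_checkpoint:  # hypothetical flag
        save_checkpoint(model, epoch)
        dist.barrier()  # all ranks sync before evaluation starts
    run_eval(model)
    dist.barrier()  # sync again before training resumes
```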
btw what's your experiment setting?
default with gpt metric |
I fixed another issue.
Need to merge main and resolve conflicts, then we can merge.
After that, please open follow-up issues for things not tackled yet in this PR and for potential future improvements.
We also potentially need a section in the README / docs, and it might also be useful to share your benchmarks there.
Thanks!
Feature
Experiment Results
Using 3 x RTX6000 (24GB), batch size = 1
Full-parameter experiments
LoRA experiments
Using 8 x V100 w/ NVLink (16GB), batch size = 1
Using 8 x A6000 w/o NVLink (48GB), batch size = 1
NVLink works:
Using 8 x A100 SXM4
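As an aside, the w/ vs w/o NVLink distinction above can be verified on a node with the `nvidia-smi topo -m` command; a small Python wrapper for convenience (the command is standard, the wrapper is just a sketch):

```python
import subprocess

# Print the GPU interconnect matrix; entries like NV1/NV2 indicate
# NVLink between a GPU pair, while PHB/PXB/SYS indicate PCIe paths.
print(subprocess.run(
    ["nvidia-smi", "topo", "-m"],
    capture_output=True, text=True, check=True,
).stdout)
```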
Check
Future Work