
[TASK] Seperate AutoTP workflow #4894

Open
delock opened this issue Jan 4, 2024 · 7 comments
Assignees
Labels
enhancement New feature or request

Comments

@delock
Collaborator

delock commented Jan 4, 2024

As discussed in PR #4721, we need to increase test coverage for AutoTP to cover more models. Such a workflow can help avoid regressions like #4774.

This is a challenge within the current UT scope for the following reasons:

  1. Popular models have very large checkpoints (~6B to ~180B parameters), so we need an instance large enough to download and run these models.
  2. To verify the effectiveness of AutoTP, a metric such as accuracy or perplexity is needed.
  3. The workflow needs to be extensible to new models supported by DeepSpeed.

The workflow may also run the following variants:

  1. Load the checkpoint with from_config as a memory-efficient form.
  2. A quantized form of the model.
  3. 3 devices, to test uneven sharding.
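The uneven-sharding case in variant 3 comes from the head count not dividing evenly by the device count. A minimal sketch of the partitioning arithmetic, assuming attention heads are split across ranks as evenly as possible (the helper name is hypothetical, not DeepSpeed's actual API):

```python
# Sketch (assumption): AutoTP-style even-as-possible partition of
# attention heads across tensor-parallel ranks. Illustrative only.
def shard_sizes(num_heads: int, world_size: int) -> list[int]:
    """Partition num_heads across world_size ranks as evenly as possible."""
    base, rem = divmod(num_heads, world_size)
    return [base + (1 if r < rem else 0) for r in range(world_size)]

# 32 heads on 3 devices gives an uneven split -- the case variant 3 exercises.
print(shard_sizes(32, 3))  # → [11, 11, 10]
```

With 2 devices the split is even, so only a 3-device (or other non-divisor) run exposes bugs in uneven sharding.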

The expected results of this task are:

  1. A workflow that regularly tests the AutoTP status of each model and posts test results (pass/fail, accuracy, etc.). This can complement the manually maintained list (https://github.com/microsoft/DeepSpeed/blob/master/docs/_tutorials/automatic-tensor-parallelism.md), which is often out of sync.
  2. A test script that people can use to reproduce and report AutoTP-related issues.
  3. A better integration process for new model AutoTP support (what was broken, which PR fixed it, etc.).
@delock delock added the enhancement New feature or request label Jan 4, 2024
@delock
Collaborator Author

delock commented Jan 9, 2024

@delock
Collaborator Author

delock commented Jan 9, 2024

One learning is that there needs to be persistent storage for the downloaded models; otherwise they are downloaded again and again, which wastes time.
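One way to get that persistence is to point the Hugging Face cache at durable storage via HF_HOME, as discussed later in this thread. A sketch, assuming /blob is the persistent mount on the self-hosted runner:

```shell
# Sketch: persist model downloads across CI runs by pointing the
# Hugging Face cache at durable storage. "/blob" is the assumed
# persistent mount point on the self-hosted runner.
PERSISTENT_DIR="${PERSISTENT_DIR:-/blob}"
export HF_HOME="$PERSISTENT_DIR/hf_home"
echo "Models will be cached under $HF_HOME"
```

Subsequent runs with the same HF_HOME then reuse the cached checkpoints instead of re-downloading them.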

@mrwyattii mrwyattii self-assigned this Jan 9, 2024
@delock
Collaborator Author

delock commented Jan 10, 2024

Latest workflow run result: with the model checkpoints fully cached, testing two models (opt-1.3b and bloom-3b) took around 5 minutes.
https://github.com/delock/DeepSpeedSYCLSupport/actions/runs/7472909814

@delock
Collaborator Author

delock commented Jan 11, 2024

The next step is to explore an accuracy metric. There are two choices: perplexity, or accuracy on a specific task.
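Of the two, perplexity is the cheaper to compute, since it falls directly out of the model's per-token loss. A minimal sketch of the arithmetic (a real evaluation would take the negative log-likelihoods from the model's forward pass; the values here are illustrative):

```python
import math

# Sketch: perplexity is exp of the mean per-token negative log-likelihood
# (natural log). In a real eval the NLLs come from the model's loss;
# the inputs below are illustrative.
def perplexity(token_nlls: list[float]) -> float:
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model assigning probability 1/2 to every token has perplexity 2.
print(round(perplexity([math.log(2.0)] * 8), 6))  # → 2.0
```

For an AutoTP regression check, comparing the sharded model's perplexity against the single-device baseline within a small tolerance would flag correctness breaks without needing a task-specific harness.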

@delock
Collaborator Author

delock commented Jan 11, 2024

For correctness checking, DeepSpeedExamples has ds-hf-compare, which should be a good starting point. The script sets inference to kernel injection, however, so it will need to be modified to fit this usage.

@delock
Collaborator Author

delock commented Jan 15, 2024

A workflow with the ds-hf-compare.py test. The script needs some changes to run with two ranks, however.
https://github.com/delock/DeepSpeedSYCLSupport/actions/runs/7519543233/job/20468193515.

This workflow is ready to be tested on the DeepSpeed self-hosted runner. @mrwyattii I wonder how persistency works: if I set HF_HOME to /blob, will model checkpoints be downloaded to persistent storage and reused on the next run?

@delock
Collaborator Author

delock commented Jan 16, 2024

@mrwyattii PR #4961 has been added. Initially there are two models in it, but I plan to add more to the list. (I can't run more models on my desktop runner because of limited memory.)
