
Satyaog/feature/covalent #217

Open · satyaog wants to merge 3 commits into master from satyaog/feature/covalent
Conversation

satyaog (Member) commented May 22, 2024

milabench cloud --setup

It creates a system config file; the target cloud platform is selected with --run-on.

This starts a local covalent server, which is used to manage the Python code executed on the remote. For now this is only somewhat useful: milabench mostly issues ssh commands anyway, and refactoring the pipeline to run code through the covalent interface instead would take some time. I think that could be an interesting approach, but it's a nice-to-have for now.

So milabench cloud --setup sets up the remote and installs the basics on it: the correct Python version (necessary to ensure correct serialization/deserialization of Python objects between the local and remote machines), pip and venv. venv is used to separate the covalent env from the milabench env, which have incompatible package version requirements (sqlalchemy caused problems). Once this is done, the covalent server is no longer needed.
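
The venv separation described above can be sketched roughly as follows; the env names and the sqlalchemy pins in the comments are illustrative, not milabench's actual layout:

```python
import sys
import tempfile
import venv
from pathlib import Path

root = Path(tempfile.mkdtemp())

def make_env(name: str) -> Path:
    """Create an isolated venv under `root` and return its python binary."""
    env_dir = root / name
    venv.EnvBuilder(with_pip=False).create(env_dir)
    bin_dir = "Scripts" if sys.platform == "win32" else "bin"
    return env_dir / bin_dir / "python"

# Two separate envs so the incompatible pins never meet:
covalent_py = make_env("covalent")    # would hold sqlalchemy<2.0.0
milabench_py = make_env("milabench")  # would hold milabench's own pins
```

Each tool's commands are then run with that env's interpreter, so the conflicting sqlalchemy requirements never share a site-packages.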

The system config file should then be used in the install, prepare and run commands. Each of these commands creates a new standalone config for the tests to be executed and copies it to the remote before the rest of the pipeline runs.
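
The standalone-config step can be pictured as inlining the shared defaults into each selected benchmark before the result is copied to the remote. The schema below is a made-up illustration, not milabench's real config format:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical base config; milabench's real configs are yaml files
# with their own schema.
base_config = {
    "defaults": {"max_duration": 600},
    "benchmarks": {
        "resnet50": {"weight": 1.0},
        "llama": {"weight": 1.0},
        "bert-tf32-fp16": {"weight": 3.0},
    },
}

def standalone_config(base: dict, selected: list) -> dict:
    """Inline the defaults into each selected benchmark so the result
    is self-contained and can be shipped to the remote as-is."""
    return {
        name: {**base["defaults"], **base["benchmarks"][name]}
        for name in selected
    }

out = Path(tempfile.mkdtemp()) / "standalone.json"
out.write_text(json.dumps(standalone_config(base_config, ["resnet50", "llama"])))
```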

At the end of the run command, the results are copied back to the local machine so a report can be generated.

At the very end, milabench cloud --teardown should be used to release the cloud resources. The --all argument releases all resources of the target cloud platform specified with --run-on.

Check docs/usage.rst for more info

milabench with slurm

milabench cloud --setup also works with a slurm system configuration, but in that case milabench cloud --teardown does not support the --all argument.

Check docs/usage.rst for more info

milabench report --push

Push the results to a reports branch, which also stores the status svg and summary.

Example of reports: #210

@satyaog satyaog mentioned this pull request May 22, 2024
@satyaog satyaog force-pushed the satyaog/feature/covalent branch 3 times, most recently from f9a8c6e to 89898ee on May 23, 2024 15:11
@satyaog satyaog requested a deployment to cloud-ci August 21, 2024 07:34 — with GitHub Actions Abandoned
@satyaog satyaog changed the base branch from master to staging September 6, 2024 03:28
@satyaog satyaog requested a deployment to cloud-ci September 6, 2024 04:48 — with GitHub Actions Abandoned
@satyaog satyaog requested a deployment to cloud-ci September 6, 2024 08:08 — with GitHub Actions Abandoned
covalent is not compatible with milabench as it requires sqlalchemy<2.0.0

Update .github/workflows/cloud-ci.yml
Apply suggestions from code review
Update .github/workflows/cloud-ci.yml
Add azure covalent cloud infra

Add multi-node on cloud

* VMs on the cloud might not have enough space on all partitions. Add a workaround which should cover most cases
* Use branch and commit name to version the reports directories
* Fix a parsing error when temperature is not available in nvidia-smi output
* Export MILABENCH_* env vars to the remote
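
The temperature fix in the list above amounts to tolerating missing values when parsing nvidia-smi's csv output. A minimal sketch, with illustrative sample rows rather than output captured from a real run:

```python
def parse_temperature(field: str):
    """Parse the temperature column of an nvidia-smi csv row,
    returning None instead of raising when the value is unavailable."""
    field = field.strip()
    if not field or field in {"N/A", "[N/A]", "[Not Supported]"}:
        return None
    return int(field)

# Illustrative rows for --query-gpu=index,temperature.gpu --format=csv,noheader
rows = ["0, 41", "1, N/A"]
temps = [parse_temperature(row.split(",")[1]) for row in rows]
```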

Add docs

Fix cloud instance name conflict

The conflict would prevent the CI or multiple contributors from running tests with the same config
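
One way to avoid such name conflicts is to derive the instance name from the config plus a short per-user, per-run tag; the scheme below is a sketch, not the actual naming milabench uses:

```python
import hashlib
import os

def instance_name(config_name: str, run_id: str) -> str:
    """Append a short hash of (user, run) so the CI and multiple
    contributors can launch instances from the same config without
    colliding."""
    user = os.getenv("USER") or os.getenv("USERNAME") or "ci"
    tag = hashlib.sha1(f"{user}:{run_id}".encode()).hexdigest()[:8]
    return f"{config_name}-{tag}"

# Hypothetical config name and run ids:
a = instance_name("azure__a100", "run-1")
b = instance_name("azure__a100", "run-2")
```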
Fix github push in CI

* Copy ssh key to allow connections from master to workers
* Use local ip for manager's ip such that workers can find it and connect to it
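
Picking the manager's local address (rather than 127.0.0.1, which workers on other nodes cannot reach) is commonly done with the UDP-connect trick; a sketch, independent of milabench's actual code:

```python
import socket

def local_ip() -> str:
    """Return this machine's outbound LAN address. connect() on a UDP
    socket sends nothing; it only selects the outgoing interface."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(("10.255.255.255", 1))  # any non-loopback target works
        return s.getsockname()[0]
    except OSError:
        return "127.0.0.1"  # no route available: fall back to loopback
    finally:
        s.close()

ip = local_ip()
```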

satyaog (Member, Author) commented Sep 20, 2024

Added the slurm covalent plugin to help debug the cloud setups


satyaog (Member, Author) commented Sep 24, 2024

Tested slurm with:

=================
Benchmark results
=================

System
------
cpu:      AMD EPYC 7543 32-Core Processor
n_cpu:    64
product:  NVIDIA A100-SXM4-80GB
n_gpu:    1
memory:   81920.0

Breakdown
---------
bench                    | fail |   n | ngpu |           perf |   sem% |   std% | peak_memory |          score | weight
diffusion-single         |    0 |   1 |    1 |          28.13 |   0.1% |   0.9% |       53815 |          28.13 |   1.00
dimenet                  |    0 |   1 |    1 |         482.46 |   1.8% |   5.4% |         nan |         482.46 |   1.00
dinov2-giant-single      |    0 |   1 |    1 |          54.12 |   0.6% |   2.1% |       69569 |          54.12 |   1.00
dqn                      |    0 |   1 |    1 | 22934535905.03 |   3.3% |  91.1% |         nan | 22934535905.03 |   1.00
bf16                     |    0 |   1 |    1 |         296.65 |   0.0% |   0.2% |        1609 |         296.65 |   0.00
fp16                     |    0 |   1 |    1 |         295.35 |   0.0% |   0.3% |        1609 |         295.35 |   0.00
fp32                     |    0 |   1 |    1 |          19.17 |   0.0% |   0.0% |        1987 |          19.17 |   0.00
tf32                     |    0 |   1 |    1 |         148.64 |   0.0% |   0.1% |        1987 |         148.64 |   0.00
bert-fp16                |    0 |   1 |    1 |         275.25 |   0.0% |   0.2% |         nan |         275.25 |   0.00
bert-fp32                |    0 |   1 |    1 |          45.64 |   0.0% |   0.1% |       20991 |          45.64 |   0.00
bert-tf32                |    0 |   1 |    1 |         147.32 |   0.1% |   0.4% |         nan |         147.32 |   0.00
bert-tf32-fp16           |    0 |   1 |    1 |         274.37 |   0.2% |   1.3% |         nan |         274.37 |   3.00
reformer                 |    0 |   1 |    1 |          62.86 |   0.1% |   0.4% |         nan |          62.86 |   1.00
t5                       |    0 |   1 |    1 |          52.16 |   0.3% |   0.8% |         nan |          52.16 |   2.00
whisper                  |    0 |   1 |    1 |         520.24 |   1.0% |   3.0% |         nan |         520.24 |   1.00
lightning                |    0 |   1 |    1 |         712.70 |   0.5% |   5.0% |       27183 |         712.70 |   1.00
llava-single             |    0 |   1 |    1 |           2.39 |   0.2% |   1.6% |       72377 |           2.39 |   1.00
llama                    |    0 |   1 |    1 |         466.14 |  11.5% |  72.0% |       27641 |         466.14 |   1.00
llm-lora-single          |    0 |   1 |    1 |        3517.85 |   0.1% |   0.7% |       32995 |        3517.85 |   1.00
pna                      |    0 |   1 |    1 |        5079.10 |   1.9% |   5.6% |       39543 |        5079.10 |   1.00
ppo                      |    0 |   1 |    1 |    32372024.27 |   1.5% |  57.6% |       62159 |    32372024.27 |   1.00
recursiongfn             |    0 |   1 |    1 |        9035.14 |   3.5% |  10.5% |        6935 |        9035.14 |   1.00
rlhf-single              |    0 |   1 |    1 |        2573.66 |   0.3% |   2.8% |       19181 |        2573.66 |   1.00
focalnet                 |    0 |   1 |    1 |         389.95 |   0.7% |   2.3% |        3847 |         389.95 |   2.00
torchatari               |    0 |   1 |    1 |        3592.50 |   1.4% |   5.0% |        3655 |        3592.50 |   1.00
convnext_large-fp16      |    0 |   1 |    1 |         354.76 |   0.5% |   2.6% |         nan |         354.76 |   0.00
convnext_large-fp32      |    0 |   1 |    1 |          60.63 |   0.1% |   0.3% |       55771 |          60.63 |   0.00
convnext_large-tf32      |    0 |   1 |    1 |         160.49 |   0.0% |   0.1% |       49471 |         160.49 |   0.00
convnext_large-tf32-fp16 |    0 |   1 |    1 |         357.23 |   0.2% |   1.2% |         nan |         357.23 |   3.00
regnet_y_128gf           |    0 |   1 |    1 |         123.15 |   0.3% |   0.9% |         nan |         123.15 |   2.00
resnet50                 |    0 |   1 |    1 |        1199.53 |   2.4% |   7.3% |         nan |        1199.53 |   1.00
resnet50-noio            |    0 |   1 |    1 |        1177.09 |   0.0% |   0.2% |       27301 |        1177.09 |   0.00
vjepa-single             |    0 |   1 |    1 |          22.22 |   1.8% |  14.0% |       56005 |          22.22 |   1.00

Scores
------
Failure rate:       0.00% (PASS)
Score:             821.42

=================
Benchmark results
=================

System
------
cpu:      AMD EPYC 7543 32-Core Processor
n_cpu:    64
product:  NVIDIA A100-SXM4-80GB
n_gpu:    4
memory:   81920.0

Breakdown
---------
bench              | fail |   n | ngpu |       perf |   sem% |   std% | peak_memory |      score | weight
brax               |    0 |   1 |    4 |  636209.06 |   0.3% |   0.8% |        2609 |  636209.06 |   1.00
diffusion-gpus     |    0 |   1 |    4 |     109.52 |   0.1% |   0.5% |       58283 |     109.52 |   1.00
dinov2-giant-gpus  |    0 |   1 |    4 |     229.23 |   0.3% |   0.9% |       70961 |     229.23 |   1.00
lightning-gpus     |    0 |   1 |    4 |    2898.55 |   0.3% |   2.6% |       28055 |    2898.55 |   1.00
llm-lora-ddp-gpus  |    0 |   1 |    4 |   10472.82 |   0.6% |   3.1% |       36227 |   10472.82 |   1.00
rlhf-gpus          |    0 |   1 |    4 |    7560.51 |   0.3% |   2.4% |       21489 |    7560.51 |   1.00
resnet152-ddp-gpus |    0 |   1 |    4 |    2438.15 |   0.0% |   0.4% |       27849 |    2438.15 |   0.00
vjepa-gpus         |    0 |   1 |    4 |      78.81 |   3.6% |  28.9% |       63831 |      78.81 |   1.00

Scores
------
Failure rate:       0.00% (PASS)
Score:            2246.64

=================
Benchmark results
=================

System
------
cpu:      AMD EPYC 7543 32-Core Processor
n_cpu:    64
product:  NVIDIA A100-SXM4-80GB
n_gpu:    2
memory:   81920.0

Breakdown
---------
bench              | fail |   n | ngpu |       perf |   sem% |   std% | peak_memory |      score | weight
diffusion-nodes    |    0 |   2 |    4 |      23.50 |   0.5% |   3.7% |       57299 |      23.50 |   1.00
llm-lora-ddp-nodes |    0 |   2 |    4 |    1043.47 |   0.6% |   3.4% |       35199 |    1043.47 |   1.00

Scores
------
Failure rate:       0.00% (PASS)
Score:             156.58

Large llm models (llama3 70B) have been excluded, as I don't yet have the resources to test them.

It should also work on azure, which I'll test next week.

Base automatically changed from staging to master October 2, 2024 17:00
@satyaog satyaog requested a deployment to cloud-ci October 3, 2024 20:47 — with GitHub Actions Abandoned