
Add distributed training tutorial #22

Merged: 6 commits from ak-distributed-training-tutorial into main on Mar 31, 2022

Conversation

@axkoenig (Contributor) commented Mar 29, 2022

Description

This is a tutorial that showcases how squirrel can be used in a distributed setting with multiple GPUs. @jotterbach originally authored this PR; I made some slight adjustments, mainly to the documentation. How to run this script, and the exact system setup it was tested with, is documented in the script after line 220.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring including code style reformatting
  • Other (please describe):

Checklist:

  • I have read the contributing guideline doc (external only)
  • I have signed the CLA (external only)
  • Lint and unit tests pass locally with my changes
  • I have kept the PR small so that it can be easily reviewed
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • All dependency changes have been reflected in the pip requirement files.

Output

root@debug-pod:/workspace# torchrun --nproc_per_node=2 10.Distributed_MNIST.py 
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
<rst-doc>:45: (WARNING/2) Cannot analyze code. Pygments package not found.
<rst-doc>:45: (WARNING/2) Cannot analyze code. Pygments package not found.
<rst-doc>:43: (WARNING/2) Cannot analyze code. Pygments package not found.
<rst-doc>:43: (WARNING/2) Cannot analyze code. Pygments package not found.
initing ... 0/2
initing ... 1/2
done initing rank 0
Rank 0 initialized? True
done initing rank 1
Rank 1 initialized? True
rank: 0, step: 000, accuracy: 0.099
rank: 1, step: 000, accuracy: 0.081
Reducer buckets have been rebuilt in this iteration.
Reducer buckets have been rebuilt in this iteration.
rank: 1, step: 020, accuracy: 0.582
rank: 0, step: 020, accuracy: 0.608
...
rank: 0, step: 480, accuracy: 0.916
rank: 1, step: 499, accuracy: 0.93
rank: 0, step: 499, accuracy: 0.911
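
(For readers unfamiliar with the log above: the `initing ...` and `Rank ... initialized?` lines come from the script's process-group setup. A minimal sketch of that pattern is below; the function and variable names are illustrative and need not match the tutorial script.)

```python
import os

import torch
import torch.distributed as dist


def init_distributed() -> int:
    # torchrun exports RANK, WORLD_SIZE and LOCAL_RANK for every process it spawns
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    print(f"initing ... {rank}/{world_size}")
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    print(f"done initing rank {rank}")
    print(f"Rank {rank} initialized? {dist.is_initialized()}")
    return rank
```

Each rank then wraps its model in `torch.nn.parallel.DistributedDataParallel`, which is what produces the `Reducer buckets have been rebuilt` messages and lets every rank report its own accuracy.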

@winfried-ripken (Contributor) commented:

Thanks a lot Alex!

Feel free to copy the pre-commit config from here. It worked when I tested it locally.

I'm wondering if we should add a few lines of documentation for a better understanding of the tutorial? Maybe not strictly needed since the code looks very clean, but wdyt?

@axkoenig force-pushed the ak-distributed-training-tutorial branch from f01018d to 2b62d10 on March 29, 2022 at 16:00
- torch 1.10.2+cu113, torchvision 0.11.3+cu113 installed with
`pip install torch==1.10.2+cu113 torchvision==0.11.3+cu113
-f https://download.pytorch.org/whl/cu113/torch_stable.html`
- squirrel-core==0.11.1, squirrel-datasets-core==0.0.1
Review comment (Contributor):
merantix-momentum/squirrel-core starts with version 0.12.x. I think it is worth testing with that version and updating the note here

Reply from @axkoenig (Author):

Nice catch. I updated this comment to 0.12 and tested again; it still works.

@axkoenig force-pushed the ak-distributed-training-tutorial branch from 34b1c49 to af35f8a on March 30, 2022 at 15:23
@github-actions bot commented Mar 30, 2022

CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅

@axkoenig (Author) commented:

I have read the CLA Document and I hereby sign the CLA

github-actions bot added a commit that referenced this pull request Mar 30, 2022
@winfried-ripken (Contributor) left a review comment:

Thanks Alex, this looks very good now. Feel free to merge once the lint check passes.

@axkoenig (Author) commented:

Hi all,
I addressed all of your comments and added a bit more documentation to the script, in the `if __name__ == "__main__"` part and at the `.compose(...)` calls. Feel free to leave more feedback, and please double-check whether I used the terms Rank and Worker correctly.
Thanks,
Alex
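
(For readers following the `.compose(...)` discussion: below is a minimal sketch of how a squirrel iterstream can be sharded by rank and worker. It assumes the `SplitByRank`, `SplitByWorker`, and `TorchIterable` composables from `squirrel.iterstream.torch_composables`; the class names, the `batched` keyword argument, and the surrounding function are assumptions and may differ from the tutorial script.)

```python
import typing as t

from squirrel.iterstream import IterableSource
from squirrel.iterstream.torch_composables import SplitByRank, SplitByWorker, TorchIterable


def build_stream(
    samples: t.Iterable[t.Dict[str, t.Any]],
    collate_fn: t.Callable,
    batch_size: int = 64,
):
    """Shard a stream of sample dicts across ranks and DataLoader workers, then batch it."""
    return (
        IterableSource(samples)
        .compose(SplitByRank)    # each process (rank) consumes a disjoint shard
        .compose(SplitByWorker)  # each DataLoader worker within a rank consumes a disjoint sub-shard
        .batched(batch_size, collation_fn=collate_fn)
        .compose(TorchIterable)  # expose the stream as a torch IterableDataset
    )
```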

@axkoenig (Author) commented Mar 31, 2022:

Thanks @winfried-loetzsch, will merge once @AlirezaSohofi gives the ok. :)

@AlirezaSohofi (Contributor) left a review comment:

Looks very good, thanks 👍

CATALOG = Catalog.from_plugins()


def collate(records: t.List[t.Dict[str, t.Any]]) -> t.Dict[str, t.List[t.Any]]:
@AlirezaSohofi (Contributor) commented:

Nitpicking: `Any` could be replaced with a more specific type, I guess.

@axkoenig (Author) commented Mar 31, 2022:

Yes, it could be made more concrete here because we know the dtypes of MNIST, but this function is actually more generic than that: it works with multiple input datatypes (e.g. int, lists, NumPy arrays, torch tensors, torch int, ...), and I think listing them all would be a bit cluttered.
I did add the type of the output dict, `t.Dict[str, t.List[torch.Tensor]]`, because we know that for sure due to the `torch.from_numpy` call.
@AlirezaSohofi let me know your opinion on the input types and whether I can merge now :)

@AlirezaSohofi (Contributor) commented:

Of course, it was just a nitpick anyway :))
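
(For reference, here is a minimal sketch of a collate function with the signature discussed in this thread; the per-key handling is illustrative and not necessarily the tutorial's implementation.)

```python
import typing as t

import numpy as np
import torch


def collate(records: t.List[t.Dict[str, t.Any]]) -> t.Dict[str, t.List[torch.Tensor]]:
    """Regroup a list of sample dicts into a dict mapping each key to a list of tensors."""
    out: t.Dict[str, t.List[torch.Tensor]] = {}
    for record in records:
        for key, value in record.items():
            # torch.from_numpy expects an ndarray, so coerce scalars/lists/arrays first
            out.setdefault(key, []).append(torch.from_numpy(np.asarray(value)))
    return out
```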

@axkoenig axkoenig merged commit 9507acf into main Mar 31, 2022
@axkoenig axkoenig deleted the ak-distributed-training-tutorial branch March 31, 2022 12:29
@github-actions github-actions bot locked and limited conversation to collaborators Mar 31, 2022

4 participants