
Add distributed training tutorial #22

Merged: 6 commits from ak-distributed-training-tutorial into main on Mar 31, 2022

Conversation

@axkoenig (Contributor) commented Mar 29, 2022

Description

This is a tutorial that showcases how squirrel can be used in a distributed setting with multiple GPUs. @jotterbach originally authored this PR; I made some slight adjustments, mainly to the documentation. How to run this script, and the exact system setup it was tested with, is documented in the script after line 220.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring including code style reformatting
  • Other (please describe):

Checklist:

  • I have read the contributing guideline doc (external only)
  • I have signed the CLA (external only)
  • Lint and unit tests pass locally with my changes
  • I have kept the PR small so that it can be easily reviewed
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • All dependency changes have been reflected in the pip requirement files.

Output

root@debug-pod:/workspace# torchrun --nproc_per_node=2 10.Distributed_MNIST.py 
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
<rst-doc>:45: (WARNING/2) Cannot analyze code. Pygments package not found.
<rst-doc>:45: (WARNING/2) Cannot analyze code. Pygments package not found.
<rst-doc>:43: (WARNING/2) Cannot analyze code. Pygments package not found.
<rst-doc>:43: (WARNING/2) Cannot analyze code. Pygments package not found.
initing ... 0/2
initing ... 1/2
done initing rank 0
Rank 0 initialized? True
done initing rank 1
Rank 1 initialized? True
rank: 0, step: 000, accuracy: 0.099
rank: 1, step: 000, accuracy: 0.081
Reducer buckets have been rebuilt in this iteration.
Reducer buckets have been rebuilt in this iteration.
rank: 1, step: 020, accuracy: 0.582
rank: 0, step: 020, accuracy: 0.608
...
rank: 0, step: 480, accuracy: 0.916
rank: 1, step: 499, accuracy: 0.93
rank: 0, step: 499, accuracy: 0.911
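
(For readers unfamiliar with the log above: the `initing ...` and `Rank ... initialized?` lines come from the script's process-group setup. A minimal sketch of that pattern is below; the function and variable names are illustrative and need not match the tutorial script.)

```python
import os

import torch
import torch.distributed as dist


def init_distributed() -> int:
    # torchrun exports RANK, WORLD_SIZE and LOCAL_RANK for every process it spawns
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    print(f"initing ... {rank}/{world_size}")
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    print(f"done initing rank {rank}")
    print(f"Rank {rank} initialized? {dist.is_initialized()}")
    return rank
```

Each rank then wraps its model in `torch.nn.parallel.DistributedDataParallel`, which is what produces the `Reducer buckets have been rebuilt` messages and lets every rank report its own accuracy.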

@winfried-ripken (Contributor) commented:

Thanks a lot Alex!

Feel free to copy the pre-commit config from here. It worked when I tested it locally.

I'm wondering if we should add a few lines of documentation for a better understanding of the tutorial? Maybe not strictly needed since the code looks very clean, but wdyt?

@axkoenig force-pushed the ak-distributed-training-tutorial branch from f01018d to 2b62d10 on March 29, 2022 at 16:00
- torch 1.10.2+cu113, torchvision 0.11.3+cu113 installed with
`pip install torch==1.10.2+cu113 torchvision==0.11.3+cu113
-f https://download.pytorch.org/whl/cu113/torch_stable.html`
- squirrel-core==0.11.1, squirrel-datasets-core==0.0.1
Review comment (Contributor):
merantix-momentum/squirrel-core starts with version 0.12.x. I think it is worth testing with that version and updating the note here

Reply from @axkoenig (Author):

Nice catch. I updated this comment to 0.12 and tested again; it still works.

@axkoenig force-pushed the ak-distributed-training-tutorial branch from 34b1c49 to af35f8a on March 30, 2022 at 15:23
@github-actions bot commented Mar 30, 2022

CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅

@axkoenig (Author) commented:

I have read the CLA Document and I hereby sign the CLA

github-actions bot added a commit that referenced this pull request Mar 30, 2022
@winfried-ripken (Contributor) left a review comment:

Thanks Alex, this looks very good now. Feel free to merge once the lint check passes.

@axkoenig (Author) commented:

Hi all,
I addressed all of your comments and added a bit more documentation to the script, in the `if __name__ == "__main__"` part and at the `.compose(...)` calls. Feel free to leave more feedback, and please double-check whether I used the terms Rank and Worker correctly.
Thanks,
Alex
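
(For readers following the `.compose(...)` discussion: below is a minimal sketch of how a squirrel iterstream can be sharded by rank and worker. It assumes the `SplitByRank`, `SplitByWorker`, and `TorchIterable` composables from `squirrel.iterstream.torch_composables`; the class names, the `batched` keyword argument, and the surrounding function are assumptions and may differ from the tutorial script.)

```python
import typing as t

from squirrel.iterstream import IterableSource
from squirrel.iterstream.torch_composables import SplitByRank, SplitByWorker, TorchIterable


def build_stream(
    samples: t.Iterable[t.Dict[str, t.Any]],
    collate_fn: t.Callable,
    batch_size: int = 64,
):
    """Shard a stream of sample dicts across ranks and DataLoader workers, then batch it."""
    return (
        IterableSource(samples)
        .compose(SplitByRank)    # each process (rank) consumes a disjoint shard
        .compose(SplitByWorker)  # each DataLoader worker within a rank consumes a disjoint sub-shard
        .batched(batch_size, collation_fn=collate_fn)
        .compose(TorchIterable)  # expose the stream as a torch IterableDataset
    )
```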

@axkoenig (Author) commented Mar 31, 2022:

Thanks @winfried-loetzsch, will merge once @AlirezaSohofi gives the ok. :)

@AlirezaSohofi (Contributor) left a review comment:

Looks very good, thanks 👍

CATALOG = Catalog.from_plugins()


def collate(records: t.List[t.Dict[str, t.Any]]) -> t.Dict[str, t.List[t.Any]]:
@AlirezaSohofi (Contributor) commented:

Nitpicking: `Any` could be replaced with a more specific type, I guess.

@axkoenig (Author) commented Mar 31, 2022:

Yes, it could be made more concrete here because we know the dtypes of MNIST, but this function is actually more generic than that: it works with multiple input datatypes (e.g. int, lists, NumPy arrays, torch tensors, torch int, ...), and I think listing them all would be a bit cluttered.
I did add the type of the output dict, `t.Dict[str, t.List[torch.Tensor]]`, because we know that for sure due to the `torch.from_numpy` call.
@AlirezaSohofi let me know your opinion on the input types and whether I can merge now :)

@AlirezaSohofi (Contributor) commented:

Of course, it was just a nitpick anyway :))
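
(For reference, here is a minimal sketch of a collate function with the signature discussed in this thread; the per-key handling is illustrative and not necessarily the tutorial's implementation.)

```python
import typing as t

import numpy as np
import torch


def collate(records: t.List[t.Dict[str, t.Any]]) -> t.Dict[str, t.List[torch.Tensor]]:
    """Regroup a list of sample dicts into a dict mapping each key to a list of tensors."""
    out: t.Dict[str, t.List[torch.Tensor]] = {}
    for record in records:
        for key, value in record.items():
            # torch.from_numpy expects an ndarray, so coerce scalars/lists/arrays first
            out.setdefault(key, []).append(torch.from_numpy(np.asarray(value)))
    return out
```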

@axkoenig axkoenig merged commit 9507acf into main Mar 31, 2022
@axkoenig axkoenig deleted the ak-distributed-training-tutorial branch March 31, 2022 12:29
@github-actions github-actions bot locked and limited conversation to collaborators Mar 31, 2022

4 participants