Add distributed training tutorial #22
Conversation
Thanks a lot Alex! Feel free to copy the pre-commit config from here; it worked when I tested it locally. I am wondering if we should add a few lines of documentation for a better understanding of the tutorial? Maybe it's not really needed since the code looks very clean, but wdyt?
Force-pushed from f01018d to 2b62d10
examples/10.Distributed_MNIST.py (Outdated)
- torch 1.10.2+cu113, torchvision 0.11.3+cu113 installed with
  `pip install torch==1.10.2+cu113 torchvision==0.11.3+cu113
  -f https://download.pytorch.org/whl/cu113/torch_stable.html`
- squirrel-core==0.11.1, squirrel-datasets-core==0.0.1
merantix-momentum/squirrel-core starts with version 0.12.x. I think it is worth testing with that version and updating the note here.
Nice catch, I updated this comment to 0.12 and tested again; it still works.
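(A hedged aside: with that bump, the tested pin would presumably read something like `pip install squirrel-core==0.12.0 squirrel-datasets-core==0.0.1` — the exact patch versions are an assumption, not taken from the PR.)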
Force-pushed from 34b1c49 to af35f8a
CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅
I have read the CLA Document and I hereby sign the CLA
Thanks Alex, looks very good now. Feel free to merge once the lint check has passed.
Hi all, thanks @winfried-loetzsch, will merge once @AlirezaSohofi gives the OK. :)
Looks very good, thanks 👍
examples/10.Distributed_MNIST.py (Outdated)
CATALOG = Catalog.from_plugins()

def collate(records: t.List[t.Dict[str, t.Any]]) -> t.Dict[str, t.List[t.Any]]:
Nitpicking: `Any` could be replaced with a more specific type, I guess.
Yes, it could be made more concrete here because we know the dtypes of MNIST. But this function is actually more generic than that: it works with multiple input data types (e.g. int, lists, NumPy arrays, torch tensors, torch ints, ...), and I think listing them all would be a bit cluttered.
I did add the output type `t.Dict[str, t.List[torch.Tensor]]`, because we know that for sure due to the `torch.from_numpy` call.
@AlirezaSohofi let me know your opinion on the input types and whether I can merge now :)
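For readers following along, here is a minimal sketch of what such a generic collate function could look like. It is illustrative only, not the tutorial's actual code; the per-type dispatch below is an assumption based on the input types listed in the comment above:

```python
import typing as t

import numpy as np
import torch


def collate(records: t.List[t.Dict[str, t.Any]]) -> t.Dict[str, t.List[torch.Tensor]]:
    """Group a list of records into per-key lists of tensors.

    Illustrative sketch: accepts ints, lists, NumPy arrays, and torch
    tensors, converting each value to a torch.Tensor (via torch.from_numpy
    for arrays, torch.tensor for plain Python values).
    """
    out: t.Dict[str, t.List[torch.Tensor]] = {}
    for record in records:
        for key, value in record.items():
            if isinstance(value, torch.Tensor):
                tensor = value
            elif isinstance(value, np.ndarray):
                tensor = torch.from_numpy(value)
            else:  # ints, floats, lists of numbers, ...
                tensor = torch.tensor(value)
            out.setdefault(key, []).append(tensor)
    return out
```

The `setdefault` pattern keeps the function agnostic to the record keys, which is what makes it reusable beyond MNIST.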
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course, it was just a nitpick anyway :))
Description
This is a tutorial that showcases how squirrel can be used in a distributed setting with multiple GPUs. Originally, @jotterbach authored this PR; I made some slight adjustments, mainly to the documentation. Details on how to run this script, and the exact system setup it was tested with, are given in the script after line 220.
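As a rough orientation for what the tutorial covers, below is a hedged sketch of the overall pattern: plain PyTorch DDP around a squirrel data stream. Only `Catalog.from_plugins()` appears in the diff above; the catalog key `"mnist"`, the `get_driver().get_iter(...)` call, and the record keys `"img"`/`"label"` are illustrative assumptions, not necessarily the script's exact API usage:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

from squirrel.catalog import Catalog  # squirrel-core


def train() -> None:
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy MNIST classifier wrapped in DDP so gradients are all-reduced.
    model = DDP(
        torch.nn.Linear(28 * 28, 10).cuda(local_rank),
        device_ids=[local_rank],
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    catalog = Catalog.from_plugins()  # as in the diff above
    # Hypothetical data path; in a real run each rank should also read
    # only its own shard of the stream, which this sketch does not show.
    for record in catalog["mnist"].get_driver().get_iter("train"):
        x = torch.as_tensor(record["img"], dtype=torch.float32)
        x = x.view(1, -1).cuda(local_rank)
        y = torch.as_tensor([record["label"]]).cuda(local_rank)

        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    train()
```

A script following this pattern would be launched with e.g. `torchrun --nproc_per_node=2 train.py`; the actual tutorial pins torch 1.10.2+cu113, as noted in the review thread above.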
Type of change
Checklist:
Output