Fix Distributed Data Samplers in PyTorch Examples #2012
Signed-off-by: Andrey Velichkevich <[email protected]>
bc251a1 to 537ce7e
Pull Request Test Coverage Report for Build 8163673973 (Coveralls)
Thanks!
@@ -10,141 +10,216 @@
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

WORLD_SIZE = int(os.environ.get('WORLD_SIZE', 1))
from torch.utils.data import DistributedSampler
Could we apply the same approach to Katib?
Sure, I will submit a separate PR.
I'd like to run these examples in CI to verify that they are valid, but we can track that in a separate issue.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: andreyvelich, tenzen-y
Signed-off-by: Andrey Velichkevich <[email protected]> (cherry picked from commit 57aa34d)
Signed-off-by: Andrey Velichkevich <[email protected]> Signed-off-by: deepanker13 <[email protected]>
I fixed the PyTorch training examples where we forgot to distribute data across PyTorch workers using DistributedSampler(dataset). After this change, each PyTorch worker will correctly process its own chunk of the training data.
Also, for the FashionMNIST example I removed the check for whether the script is running in distributed mode, since for PyTorch we can just set arbitrary values for the env variables and run the script with 1 worker.
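To illustrate why the sampler matters, here is a minimal pure-Python sketch of the kind of index sharding that a distributed sampler performs when shuffling is off. It is not the code from this PR and not PyTorch's implementation: the helper names `shard_indices` and `shard_from_env` are hypothetical, and reading the `RANK` env variable (alongside `WORLD_SIZE`, which appears in the diff) is an assumption about how the worker is launched.

```python
import os


def shard_indices(dataset_len, num_replicas, rank):
    """Return the sample indices one worker would read (no shuffling).

    Hypothetical helper for illustration only.
    """
    if dataset_len <= 0:
        raise ValueError("dataset_len must be positive")
    if not 0 <= rank < num_replicas:
        raise ValueError("rank must be in [0, num_replicas)")
    # Round up so that every worker gets the same number of samples.
    per_worker = -(-dataset_len // num_replicas)
    total = per_worker * num_replicas
    # Pad by wrapping around to the start of the dataset.
    padded = [i % dataset_len for i in range(total)]
    # Worker `rank` takes every num_replicas-th index, starting at `rank`.
    return padded[rank:total:num_replicas]


def shard_from_env(dataset_len):
    """Pick the shard from env variables (assumed names: RANK, WORLD_SIZE).

    With the defaults (rank 0, world size 1) the single worker
    simply reads the whole dataset.
    """
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    return shard_indices(dataset_len, world_size, rank)
```

With 10 samples and 4 workers, `shard_indices(10, 4, 0)` gives `[0, 4, 8]` and `shard_indices(10, 4, 2)` gives `[2, 6, 0]`, so the workers read different chunks instead of each iterating over the full dataset. With a world size of 1 the shard is the full index range, which is consistent with running the same script on a single worker without a separate non-distributed code path.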
/assign @johnugeorge @tenzen-y @kuizhiqing