chore: port over PyTorch example to use Trainer API [MLG-1181] #8292
Conversation
```python
# Use a file lock so that only one worker on each node downloads.
with filelock.FileLock(str(data_path / "lock")):
    return datasets.MNIST(
        root=str(data_dir),
```
already a `str`

The number of `str` -> `pathlib` -> `str` conversions here could use some reconsideration.
passing in a `pathlib.Path` now... not much of an improvement, but looks slightly better.
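For illustration, the pattern under discussion could look like this (a sketch; `download_root` is a hypothetical helper name, and whether a final `str(...)` conversion is still needed depends on the torchvision version in use):

```python
from pathlib import Path


def download_root(data_dir: Path) -> Path:
    """Keep a Path end-to-end instead of str -> pathlib -> str round-trips."""
    root = data_dir / "MNIST"
    root.mkdir(parents=True, exist_ok=True)
    return root

# At the call site, pass the Path straight through; recent torchvision
# versions accept a path-like root, otherwise convert once with str(root).
```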
my take is that this probably doesn't need its own file anymore, but up to you
i kept it separate to demonstrate separating out the different parts of your workflow, but yeah, in this example it's rather useless.
```diff
@@ -29,5 +26,5 @@ searcher:
   smaller_is_better: true
   max_trials: 16
   max_length:
     batches: 937  # 60,000 training images with batch size 64
```
pretty sure that asha + max length of 1 is going to behave poorly. The master doesn't do fractions, so asha will just emit everything with epoch length of 1.
Use batches here instead, I think.
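A sketch of the suggested searcher config (field names follow Determined's experiment config schema; the exact counts are illustrative, taken from the 60,000-image / batch-size-64 arithmetic above):

```yaml
searcher:
  name: adaptive_asha
  metric: validation_loss
  smaller_is_better: true
  max_trials: 16
  max_length:
    batches: 937  # one pass over 60,000 training images at batch size 64
```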
for on-cluster experiments (several examples are included in the directory).

Then the code can be submitted to Determined for on-cluster training by running this command from the current directory:
`det experiment create const.yaml .`. The other configurations can be run by specifying the desired
configuration file in place of `const.yaml`.
Since you are steering people towards torch.distributed (a decision I agree with), you might want to document what entrypoint changes are needed there.

Also, I don't actually know: does `det.launch.torch_distributed` work with `slots_per_trial=1`?
added a section for experiment config changes.

> does `det.launch.torch_distributed` work with `slots_per_trial=1`?

yeah, it does. don't really want to put it in the `const.yaml` example though, cause it seems like a waste to start torch distributed if it's not necessary.
At least document that ddp works with a single slot too. "can I have one entrypoint to rule them all" is like an immediate question for the end user, I think.
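For illustration, such a single entrypoint could look like this in the experiment config (a sketch: `train.py` is a placeholder script name; the launcher module path follows Determined's documented `determined.launch.torch_distributed` wrapper):

```yaml
# Works for slots_per_trial: 1 as well as multi-slot trials.
entrypoint: >-
  python3 -m determined.launch.torch_distributed
  python3 train.py
```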
Force-pushed from 0f60cf3 to 79db082.
Description
Since the introduction of the Trainer API in 0.21.0, harness codepaths are considered "legacy" for PyTorchTrial. This ports the existing PyTorch example (only one now, after examples pruning) in the examples repository over to showcase the new Trainer API codepath.
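As a rough sketch of the Trainer API codepath this PR moves to (`MNistTrial` is a placeholder for the example's `PyTorchTrial` subclass; the `det.pytorch` calls follow the Trainer API introduced in 0.21.0, but exact signatures should be checked against the current docs):

```python
# Sketch only: requires a Determined installation, and is normally
# launched on-cluster via `det experiment create <config> .`.
import determined as det
from determined import pytorch


def main():
    # pytorch.init() sets up the training context (distributed info,
    # checkpointing, metrics reporting) for both local and on-cluster runs.
    with pytorch.init() as train_context:
        trial = MNistTrial(train_context)  # placeholder trial class
        trainer = pytorch.Trainer(trial, train_context)
        trainer.fit(max_length=pytorch.Epoch(1))


if __name__ == "__main__":
    main()
```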
Test Plan
Commentary (optional)
Checklist
Release notes have been added as a separate file under `docs/release-notes/`. See Release Note for details.
Ticket