
chore: port over PyTorch example to use Trainer API [MLG-1181] #8292

Merged: 18 commits from examples-pytorch-trainer merged into main on Nov 7, 2023

Conversation

azhou-determined
Contributor

Description

Since the introduction of the Trainer API in 0.21.0, the harness codepaths are considered legacy for PyTorchTrial. This ports the existing PyTorch example (only one remains after the examples pruning) in the examples repository over to showcase the new Trainer API codepath.
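For readers unfamiliar with the new codepath, here is a rough, self-contained sketch of what a Trainer API entrypoint looks like. This is not the ported example itself: the trial class, hyperparameters, and `fit()` arguments below are illustrative and may differ slightly between Determined versions.

```python
import torch
from torch import nn
from torch.utils.data import TensorDataset

from determined import pytorch


class TinyTrial(pytorch.PyTorchTrial):
    """A deliberately tiny trial, here only to illustrate the Trainer API flow."""

    def __init__(self, context: pytorch.PyTorchTrialContext) -> None:
        self.context = context
        self.model = context.wrap_model(nn.Linear(10, 1))
        self.optimizer = context.wrap_optimizer(
            torch.optim.SGD(self.model.parameters(), lr=0.01)
        )

    def build_training_data_loader(self) -> pytorch.DataLoader:
        dataset = TensorDataset(torch.randn(640, 10), torch.randn(640, 1))
        return pytorch.DataLoader(dataset, batch_size=64)

    def build_validation_data_loader(self) -> pytorch.DataLoader:
        dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
        return pytorch.DataLoader(dataset, batch_size=64)

    def train_batch(self, batch, epoch_idx, batch_idx):
        data, labels = batch
        loss = nn.functional.mse_loss(self.model(data), labels)
        self.context.backward(loss)
        self.context.step_optimizer(self.optimizer)
        return {"loss": loss}

    def evaluate_batch(self, batch):
        data, labels = batch
        return {"validation_loss": nn.functional.mse_loss(self.model(data), labels)}


def main() -> None:
    # pytorch.init() sets up the trial context whether the script runs locally or
    # on-cluster; the Trainer then drives training, validation, and checkpointing.
    with pytorch.init() as train_context:
        trial = TinyTrial(train_context)
        trainer = pytorch.Trainer(trial, train_context)
        trainer.fit(max_length=pytorch.Epoch(1))


if __name__ == "__main__":
    main()
```

The same script can serve as the on-cluster entrypoint; the launch-layer discussion further down covers the torch.distributed variant.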

Test Plan

Commentary (optional)

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket


netlify bot commented Nov 1, 2023

Deploy Preview for determined-ui canceled.

🔨 Latest commit: b8f97ed
🔍 Latest deploy log: https://app.netlify.com/sites/determined-ui/deploys/654aa16e7062560008c5fc96

# Use a file lock so that only one worker on each node downloads.
with filelock.FileLock(str(data_path / "lock")):
    return datasets.MNIST(
        root=str(data_dir),
Member

already a str

The number of str -> pathlib -> str conversions here could use some reconsideration

Contributor Author

Passing in a pathlib.Path now... not much of an improvement, but it looks slightly better.
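For concreteness, here is a sketch of what the pathlib-based helper under discussion might look like; the function name and arguments are illustrative, not copied from the PR.

```python
import pathlib

import filelock
from torchvision import datasets


def download_mnist(data_dir: pathlib.Path) -> datasets.MNIST:
    data_dir.mkdir(parents=True, exist_ok=True)
    # Use a file lock so that only one worker on each node downloads the dataset.
    with filelock.FileLock(str(data_dir / "lock")):
        return datasets.MNIST(root=str(data_dir), train=True, download=True)
```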

Member

My take is that this probably doesn't need its own file anymore, but up to you.

Contributor Author

I kept it separate to demonstrate splitting out the different parts of your workflow, but yeah, in this example it's rather useless.

@@ -29,5 +26,5 @@ searcher:
  smaller_is_better: true
  max_trials: 16
  max_length:
    batches: 937  # 60,000 training images with batch size 64
Member

Pretty sure that ASHA + a max length of 1 is going to behave poorly. The master doesn't do fractions, so ASHA will just emit everything with an epoch length of 1.

Use batches here instead, I think.

for on-cluster experiments (several examples are included in the directory).

Then the code can be submitted to Determined for on-cluster training by running this command from the current directory:
`det experiment create const.yaml .`. The other configurations can be run by specifying the desired
configuration file in place of `const.yaml`.

Member

Since you are steering people towards torch.distributed (a decision I agree with), you might want to document what entrypoint changes are needed there.

Also, I don't actually know: does det.launch.torch_distributed work with slots_per_trial=1?

Contributor Author

Added a section for experiment config changes.

> does det.launch.torch_distributed work with slots_per_trial=1?

Yeah, it does. I don't really want to put it in the `const.yaml` example, though, because it seems like a waste to start torch distributed when it's not necessary.

Member

At least document that DDP works with a single slot too. "Can I have one entrypoint to rule them all?" is an immediate question for the end user, I think.
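For illustration only, the kind of entrypoint change being discussed might look roughly like the snippet below. The file name and slot count are hypothetical, and per the thread above the same torch_distributed entrypoint also runs with a single slot.

```yaml
# distributed.yaml (hypothetical file name)
resources:
  slots_per_trial: 4  # the same entrypoint also works with slots_per_trial: 1
entrypoint: >-
  python3 -m determined.launch.torch_distributed
  python3 train.py
```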

@azhou-determined azhou-determined requested a review from a team as a code owner November 1, 2023 20:42
@azhou-determined azhou-determined merged commit baf5c96 into main Nov 7, 2023
71 of 81 checks passed
@azhou-determined azhou-determined deleted the examples-pytorch-trainer branch November 7, 2023 21:27
@dannysauer dannysauer added this to the 0.26.4 milestone Feb 6, 2024