
chore: port over PyTorch example to use Trainer API [MLG-1181] #8292

Merged: 18 commits from examples-pytorch-trainer merged into main on Nov 7, 2023

Conversation

azhou-determined
Contributor

Description

Since the introduction of the Trainer API in 0.21.0, the harness codepaths are considered legacy for PyTorchTrial. This ports the existing PyTorch example (only one remains after the examples pruning) in the examples repository over to showcase the new Trainer API codepath.
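For readers unfamiliar with the new codepath, here is a rough, self-contained sketch of what a Trainer API entrypoint looks like. This is not the ported example itself: the trial class, hyperparameters, and `fit()` arguments below are illustrative and may differ slightly between Determined versions.

```python
import torch
from torch import nn
from torch.utils.data import TensorDataset

from determined import pytorch


class TinyTrial(pytorch.PyTorchTrial):
    """A deliberately tiny trial, here only to illustrate the Trainer API flow."""

    def __init__(self, context: pytorch.PyTorchTrialContext) -> None:
        self.context = context
        self.model = context.wrap_model(nn.Linear(10, 1))
        self.optimizer = context.wrap_optimizer(
            torch.optim.SGD(self.model.parameters(), lr=0.01)
        )

    def build_training_data_loader(self) -> pytorch.DataLoader:
        dataset = TensorDataset(torch.randn(640, 10), torch.randn(640, 1))
        return pytorch.DataLoader(dataset, batch_size=64)

    def build_validation_data_loader(self) -> pytorch.DataLoader:
        dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
        return pytorch.DataLoader(dataset, batch_size=64)

    def train_batch(self, batch, epoch_idx, batch_idx):
        data, labels = batch
        loss = nn.functional.mse_loss(self.model(data), labels)
        self.context.backward(loss)
        self.context.step_optimizer(self.optimizer)
        return {"loss": loss}

    def evaluate_batch(self, batch):
        data, labels = batch
        return {"validation_loss": nn.functional.mse_loss(self.model(data), labels)}


def main() -> None:
    # pytorch.init() sets up the trial context whether the script runs locally or
    # on-cluster; the Trainer then drives training, validation, and checkpointing.
    with pytorch.init() as train_context:
        trial = TinyTrial(train_context)
        trainer = pytorch.Trainer(trial, train_context)
        trainer.fit(max_length=pytorch.Epoch(1))


if __name__ == "__main__":
    main()
```

The same script can serve as the on-cluster entrypoint; the launch-layer discussion further down covers the torch.distributed variant.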

Test Plan

Commentary (optional)

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket


netlify bot commented Nov 1, 2023

Deploy Preview for determined-ui canceled.

🔨 Latest commit: b8f97ed
🔍 Latest deploy log: https://app.netlify.com/sites/determined-ui/deploys/654aa16e7062560008c5fc96

# Use a file lock so that only one worker on each node downloads.
with filelock.FileLock(str(data_path / "lock")):
    return datasets.MNIST(
        root=str(data_dir),
Member

already a str

The number of str -> pathlib -> str conversions here could use some reconsideration

Contributor Author

Passing in a pathlib.Path now... not much of an improvement, but it looks slightly better.
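For concreteness, here is a sketch of what the pathlib-based helper under discussion might look like; the function name and arguments are illustrative, not copied from the PR.

```python
import pathlib

import filelock
from torchvision import datasets


def download_mnist(data_dir: pathlib.Path) -> datasets.MNIST:
    data_dir.mkdir(parents=True, exist_ok=True)
    # Use a file lock so that only one worker on each node downloads the dataset.
    with filelock.FileLock(str(data_dir / "lock")):
        return datasets.MNIST(root=str(data_dir), train=True, download=True)
```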

Member

My take is that this probably doesn't need its own file anymore, but up to you.

Contributor Author

I kept it separate to demonstrate splitting out the different parts of your workflow, but yeah, in this example it's rather useless.

@@ -29,5 +26,5 @@ searcher:
  smaller_is_better: true
  max_trials: 16
  max_length:
    batches: 937  # 60,000 training images with batch size 64
Member

Pretty sure that ASHA + a max length of 1 is going to behave poorly. The master doesn't do fractions, so ASHA will just emit everything with an epoch length of 1.

Use batches here instead, I think.

for on-cluster experiments (several examples are included in the directory).

Then the code can be submitted to Determined for on-cluster training by running this command from the current directory:
`det experiment create const.yaml .`. The other configurations can be run by specifying the desired
configuration file in place of `const.yaml`.

Member

Since you are steering people towards torch.distributed (a decision I agree with), you might want to document what entrypoint changes are needed there.

Also, I don't actually know: does det.launch.torch_distributed work with slots_per_trial=1?

Contributor Author

Added a section for experiment config changes.

> does det.launch.torch_distributed work with slots_per_trial=1?

Yeah, it does. I don't really want to put it in the `const.yaml` example, though, because it seems like a waste to start torch distributed when it's not necessary.

Member

At least document that DDP works with a single slot too. "Can I have one entrypoint to rule them all?" is an immediate question for the end user, I think.
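For illustration only, the kind of entrypoint change being discussed might look roughly like the snippet below. The file name and slot count are hypothetical, and per the thread above the same torch_distributed entrypoint also runs with a single slot.

```yaml
# distributed.yaml (hypothetical file name)
resources:
  slots_per_trial: 4  # the same entrypoint also works with slots_per_trial: 1
entrypoint: >-
  python3 -m determined.launch.torch_distributed
  python3 train.py
```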

@azhou-determined azhou-determined requested a review from a team as a code owner November 1, 2023 20:42
@azhou-determined azhou-determined merged commit baf5c96 into main Nov 7, 2023
71 of 81 checks passed
@azhou-determined azhou-determined deleted the examples-pytorch-trainer branch November 7, 2023 21:27
@dannysauer dannysauer added this to the 0.26.4 milestone Feb 6, 2024