[RAY AIR][DOC][TorchTrainer] Rewrote the TorchTrainer code snippet as a working example #30492
Conversation
…nd use testcode, and ignore long output from train Signed-off-by: Jules Damji <[email protected]>
@@ -22,13 +22,14 @@ class TorchTrainer(DataParallelTrainer):
The ``train_loop_per_worker`` function is expected to take in either 0 or 1
In the paragraph above this, it says "already" twice in the sentence -- it would be great to also fix this :)
Good catch. Fixed.
from typing import Dict
def train_loop_per_worker(config: Dict):
Ideally this would have a bit more typing, like Dict[str, Any] (not sure what exactly the format here is), and also link to the format of the dict if possible :)
Yeah, we can add some typing.
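A minimal sketch of what the more explicit annotation could look like; the config keys and defaults below are illustrative assumptions, not part of the snippet under review:

from typing import Any, Dict

def train_loop_per_worker(config: Dict[str, Any]):
    # `config` is the dict passed to TorchTrainer via `train_loop_config`.
    # The keys and defaults here are placeholders.
    lr = config.get("lr", 1e-3)
    num_epochs = config.get("num_epochs", 10)
    # ... training logic would go here ...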
@@ -45,32 +46,33 @@ def train_loop_per_worker(config: Dict):
Inside the ``train_loop_per_worker`` function, you can use any of the
Ideally there would also be an example for the above paragraph somewhere; we can do that in another PR.
(You can discard this; I saw the usage is already shown in the example below -- maybe add "(see example below)".)
def train_loop_per_worker():
    # Report intermediate results for callbacks or logging and
    # checkpoint data.
    #
I feel like it was better without this line but if you prefer feel free to keep it :)
Since the code is incomplete (session.report(...) and session.get_checkpoint()), it's nice to explain it with a comment.
session.report(...)

# Returns dict of last saved checkpoint.
# Session returns dict of last saved checkpoint.
Say "Get dict of last saved checkpoint." here (same below)? "Session returns" is a little confusing I think, since technically session is a Python module here and it doesn't return anything :)
Yes, "Get x" makes more sense than "returns", since it's an explicit method call to session.get_xxx.
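For context, a rough sketch of how the two calls fit together inside the training loop, with the comment phrased as "Get ..." per the suggestion above; the metric names and the toy checkpoint payload are placeholders, not the PR's actual example:

from ray.air import session
from ray.air.checkpoint import Checkpoint

def train_loop_per_worker():
    # Get dict of last saved checkpoint, if any (e.g. when resuming training).
    checkpoint = session.get_checkpoint()
    if checkpoint:
        state = checkpoint.to_dict()  # restore model/optimizer state from it

    for epoch in range(2):
        loss = 0.0  # placeholder for the real forward/backward pass
        # Report intermediate results for callbacks or logging, and
        # attach checkpoint data.
        session.report(
            {"loss": loss, "epoch": epoch},
            checkpoint=Checkpoint.from_dict({"epoch": epoch}),
        )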
        self.layer2 = nn.Linear(layer_size, output_size)

    def forward(self, input):
        return self.layer2(self.relu(self.layer1(input)))
I would either keep the ReLU layer here or have only one linear layer -- composing two linear layers doesn't do anything and it would likely be confusing to users :)
Keeping ReLU does not make sense. Why add non-linearity to a linear data relationship? With ReLU, the model does not converge; it oscillates like a seesaw. Having two linear layers is not uncommon. We can put in a comment that you can also use one layer if the relationship between your data and the outcome (target) is linear.
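To make the trade-off concrete, a small sketch of the two alternatives being discussed (layer sizes here are arbitrary assumptions): without an activation in between, two stacked linear layers still compute a purely linear map, so a single layer is enough when the data-to-target relationship is linear.

import torch.nn as nn

# Option 1: a single linear layer, sufficient for a linear relationship.
single_layer_model = nn.Linear(1, 1)

# Option 2: two linear layers with no activation in between; the composition
# is still a linear map overall, just parameterized differently.
class TwoLayerLinear(nn.Module):
    def __init__(self, input_size=1, layer_size=16, output_size=1):
        super().__init__()
        self.layer1 = nn.Linear(input_size, layer_size)
        self.layer2 = nn.Linear(layer_size, output_size)

    def forward(self, x):
        return self.layer2(self.layer1(x))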
# Report and record metrics, checkpoint model at end of each
# epoch
session.report({"loss": loss.item(), "epoch": epoch},
This is confusing since "epoch" is both here and below. @amogkam can you recommend how to do this? Most users will follow the example, so we should make sure we do this well :)
One is reporting the loss per epoch as metrics; the other is there for the checkpoint per epoch. It's nice to have those metrics per epoch. If @amogkam feels strongly that we should not include "epoch" in the metrics to report, then I can remove that entry.
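A hedged sketch of what this reply describes: the same epoch counter feeds both the reported metrics and the per-epoch checkpoint. The model, the loop bound, and the zero loss below are placeholders, not the PR's actual example:

import torch
import torch.nn as nn
from ray.air import session
from ray.train.torch import TorchCheckpoint

def train_loop_per_worker():
    model = nn.Linear(1, 1)
    for epoch in range(3):
        loss = torch.tensor(0.0)  # placeholder for the real training step
        session.report(
            # "epoch" here is a metric, so the per-epoch loss shows up in
            # result.metrics and in callbacks/loggers.
            {"loss": loss.item(), "epoch": epoch},
            # The checkpoint separately records the model state at this epoch;
            # it is what session.get_checkpoint() returns on a restore.
            checkpoint=TorchCheckpoint.from_state_dict(model.state_dict()),
        )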
result = trainer.fit()

# Get the loss metric from TorchCheckpoint tuple data dictionary
best_checkpoint_loss = result.metrics['loss']
# print(f"best loss: {best_checkpoint_loss:.4f}")
Should you remove the "#" here?
Yeah, it's a bit redundant since the code is self-explanatory.
train_loop_per_worker: The training function to execute.
    This can either take in no arguments or a ``config`` dict.
train_loop_config: Configurations to pass into
train_loop_config: Configurations to pass into
The indentation should be kept here, right? Otherwise it won't render correctly :)
The "<parameter_name>:" entries under "Args:" should be indented on the same level. That is:

Args:
    arg_1: ...
    arg_2: ...
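A minimal sketch of the Google-style layout being described, assuming the docstring in question looks roughly like this (the function name and summary line are placeholders):

def torch_trainer_args_example(train_loop_per_worker, train_loop_config=None):
    """Placeholder summary line.

    Args:
        train_loop_per_worker: The training function to execute.
            This can either take in no arguments or a ``config`` dict.
        train_loop_config: Configurations to pass into
            ``train_loop_per_worker`` if it accepts an argument.
    """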
Thanks for doing this, this is great! There are a few small comments you should address before merging :)
Signed-off-by: Jules Damji <[email protected]>
…ppress output, all tests pass, incorporated most feedback Signed-off-by: Jules Damji <[email protected]>
Signed-off-by: Philipp Moritz <[email protected]>
Looks like some of the Hugging Face servers are down, which is independent of this PR; we can merge it after the tests have run.
Signed-off-by: Jules Damji [email protected]
… a working example (ray-project#30492)

Signed-off-by: Jules Damji [email protected]

- Rewrote the code snippet as it was not working
- Removed python-code block directives; instead use testcode and testoutput. This will test the code when it runs in CI
- Ignore the output since we get loads of output from the three workers
- Assert that the loss converges with the training data within the specified epochs
- Tested code end-to-end

Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: tmynn <[email protected]>
Signed-off-by: Jules Damji [email protected]
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.