[AIR] Added Ray Logging to MosaicTrainer #29620
Conversation
Because Ray's metrics dataframe will not include new keys that are reported after the very first report call, any logged information with keys not included in the first batch checkpoint would not be retrievable after training. In other words, if
Are users expected to know what these keys are upfront? Looking at the Mosaic code, it seems that these keys are automatically added by Mosaic algorithms and callbacks, so I don't think users are aware of what these keys are in order to provide them here.
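For context, a hypothetical illustration of what listing these keys would look like; the key names are taken from the tests in this PR, and their callback attribution is an assumption:

```python
# Hypothetical illustration: these keys are emitted automatically by Composer's
# monitoring callbacks (likely the LR and gradient monitors), so a user would
# have to know them up front in order to list them here.
log_keys = [
    "lr-DecoupledSGDW/group0",  # learning-rate key asserted in this PR's tests
    "grad_l2_norm/step",        # gradient-norm key asserted in this PR's tests
]
trainer_init_config = {"log_keys": log_keys}
```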
I think we should fix the underlying bug
assert "lr-DecoupledSGDW/group0" in metrics_columns | ||
assert "grad_l2_norm/step" in metrics_columns | ||
|
||
|
Can we make these newly added tests more robust?
- Does the number of rows in the dataframe match what we expect?
- We should add a dummy callback that reports to the logger, and then check that the values in the dataframe match what we expect.
- Are there any other edge cases you can think of?
Tests have been updated to check:
- the number of rows in the dataframe
- the value reported by a dummy callback (see the sketch below)
- whether null values exist for the reported composer monitoring callbacks
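A minimal sketch of such a dummy-callback test, assuming a `trainer_init_per_worker` fixture that forwards `callbacks` from the config to `composer.Trainer`; the names and fixture details are assumptions, not the exact test added in this PR:

```python
from composer.core import Callback, State
from composer.loggers import Logger
from ray.air.config import ScalingConfig
from ray.train.mosaic import MosaicTrainer


class DummyMetricCallback(Callback):
    """Logs a constant metric every epoch so the test can verify it round-trips."""

    def epoch_end(self, state: State, logger: Logger) -> None:
        logger.log_metrics({"dummy_metric": 1.0})


def test_dummy_callback_metric_is_reported(trainer_init_per_worker):
    # trainer_init_per_worker is an assumed fixture that builds a composer.Trainer
    # and passes "callbacks" from the config through to it.
    trainer = MosaicTrainer(
        trainer_init_per_worker=trainer_init_per_worker,
        trainer_init_config={
            "callbacks": [DummyMetricCallback()],
            "log_keys": ["dummy_metric"],
        },
        scaling_config=ScalingConfig(num_workers=2),
    )
    result = trainer.fit()
    df = result.metrics_dataframe

    # The dummy metric should appear in the dataframe with the value we logged.
    assert "dummy_metric" in df.columns
    assert (df["dummy_metric"].dropna() == 1.0).all()
```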
def epoch_checkpoint(self, state: State, logger: Logger) -> None:
    del logger  # unused
    session.report(self.data)
We shouldn't report at both the batch level and the epoch level. Each call to session.report should be one iteration, so if we log at both, we will be double counting.
For now, I would say let's just log only at every epoch. We can see in the future if we want to give users the ability to configure this.
I think we'll definitely have users who want to do it at either level - that was the case with HF, where we started with epochs only and had to add steps too.
Completely agree @Yard1. I’m thinking we can default to epoch for now and then add batch support in a follow-up.
This has been updated!
We are reporting every epoch now, but we also report after the fit call, in case training ends before an epoch checkpoint call could be made. This adds an extra report call in which an epoch checkpoint can be double counted, but we can also make it so that this last call is made only if there are extra batch runs after the last epoch run.
The change mentioned above has been applied (a sketch of the resulting reporting logic is below).
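A hedged sketch of that reporting logic, not the exact implementation in this PR; it assumes Composer's `LoggerDestination` callback event methods and Ray AIR's `session.report`:

```python
from composer.core import State
from composer.loggers import Logger, LoggerDestination
from ray.air import session


class RayLoggerSketch(LoggerDestination):
    """Buffers metrics from Composer and reports them to Ray once per epoch."""

    def __init__(self) -> None:
        self.data: dict = {}
        self._has_unreported_data = False

    def log_metrics(self, metrics: dict, step=None) -> None:
        # Composer algorithms and callbacks funnel their metrics through here.
        self.data.update(metrics)
        self._has_unreported_data = True

    def epoch_checkpoint(self, state: State, logger: Logger) -> None:
        del logger  # unused
        session.report(self.data)
        self._has_unreported_data = False

    def fit_end(self, state: State, logger: Logger) -> None:
        del logger  # unused
        # Report one last time only if batches logged new data after the last
        # epoch checkpoint, so the final epoch is not double counted.
        if self._has_unreported_data:
            session.report(self.data)
```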
Thanks @ilee300a! Left some comments on improving the UX and on the testing.
Can we update the test plan of the PR with a logged data sample and final charts?
)
test_dataset = torch.utils.data.Subset(
    datasets.CIFAR10(
        data_directory, train=False, download=True, transform=cifar10_transforms
    ),
    list(range(64)),
    list(range(2048)),
Why is the batch size for training a constant while this is an inline number?
Updated so that it is BATCH_SIZE * 10, just like the train dataset (see the sketch below).
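A minimal sketch of that sizing, assuming the surrounding test setup; the `BATCH_SIZE`, `data_directory`, and `cifar10_transforms` values here are placeholders, not the exact ones in the test:

```python
import torch
from torchvision import datasets, transforms

BATCH_SIZE = 64  # assumed value of the train-side constant
data_directory = "~/data"  # placeholder path
cifar10_transforms = transforms.ToTensor()  # simplified stand-in transform

# Size the test subset relative to BATCH_SIZE, mirroring the train subset,
# instead of an inline magic number.
test_dataset = torch.utils.data.Subset(
    datasets.CIFAR10(
        data_directory, train=False, download=True, transform=cifar10_transforms
    ),
    list(range(BATCH_SIZE * 10)),
)
```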
Thanks @ilee300a! lgtm overall, just left some minor comments
Thanks @ilee300a! Please ping again once tests are passing for merge.
Added RayLogger to MosaicTrainer to relay all reported information. `RayLogger` is a subclass of `LoggerDestination`, just like all other native composer loggers. The information to be logged is given via the `log_metrics` call and saved in the `RayLogger` object. The logger reports the logged information every batch checkpoint and epoch checkpoint. All other composer loggers besides `RayLogger` are removed from the trainer.

Note that because, at the moment, the result `metrics_dataframe` will only include the keys that are reported in the very first report call, the keys for metrics that are not reported every batch should be passed in via `log_keys` in the `trainer_init_config` so they appear in the final metrics dataframe.
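A hedged usage sketch of that behavior; `trainer_init_per_worker` and the scaling settings are placeholders, and the exact `MosaicTrainer` signature may differ slightly:

```python
from ray.air.config import ScalingConfig
from ray.train.mosaic import MosaicTrainer


def trainer_init_per_worker(config):
    # Placeholder: in real usage this builds and returns a composer.Trainer.
    ...


trainer = MosaicTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    trainer_init_config={
        # Keys logged by Composer algorithms/callbacks that do not appear in the
        # very first report call must be listed here to survive into the final
        # metrics dataframe.
        "log_keys": ["grad_l2_norm/step"],
    },
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()

# Only columns present in the first report call, plus any listed in log_keys,
# end up here.
print(result.metrics_dataframe.columns)
```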
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.