
[AIR/Train] Make Dataset ingest configurable #24066

Merged: 18 commits merged into ray-project:master on Apr 28, 2022

Conversation

@amogkam (Contributor) commented Apr 21, 2022

Refactors Dataset splitting to make it less hacky and addresses the TODO. It also makes Dataset ingest for Ray Train configurable in general. This is an internal-only change for now, but it sets the stage for the proposed ingest API.

Customizable ingest for GBDT Trainers is out of scope for this PR.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

```
else:
    # Ray Train will strip out the added string before exposing to users.
    updated_dataset_dict[key + "_NO-SHARD"] = value

def dataset_split_fn(dataset_dict, training_worker_handles):
```

Contributor:

Could we pull this out into a default splitting function in the util module?

Member:

If we're pulling this out into a default splitting function, could you add a docstring? Would allow readers to understand the function without having to reference _RayDatasetSpec.

Contributor (Author):

Separated into its own function, but left it in data_parallel_trainer for now as it is only being used in DataParallelTrainer. Let's revisit the location if we need it in more trainers in the future.
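The `_NO-SHARD` suffix trick visible in the context snippet above can be sketched in isolation. This is an illustrative stand-in, not the actual Ray Train implementation; the function name `mark_unsharded_datasets` and the `train_key` parameter are hypothetical.

```python
# Illustrative sketch (not the actual Ray Train code): tag every
# non-training dataset with a "_NO-SHARD" suffix so downstream code
# knows to hand it to each worker whole rather than splitting it.
# Per the comment in the diff, Ray Train strips the suffix again
# before exposing the dataset to users.
def mark_unsharded_datasets(dataset_dict, train_key="train"):
    updated_dataset_dict = {}
    for key, value in dataset_dict.items():
        if key == train_key:
            # The training dataset will be split across workers.
            updated_dataset_dict[key] = value
        else:
            # Ray Train will strip out the added string before exposing to users.
            updated_dataset_dict[key + "_NO-SHARD"] = value
    return updated_dataset_dict
```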

```
)
if not len(splits) == len(training_worker_handles):
    raise RuntimeError(
        "The list of Datasets returned by the "
```

Contributor:

How about moving this class into a separate file, such as ml/train/impl/dataset_spec.py?

(Also, this should go into ml/train for now, right? Given that ray/train is deprecated.)

Member:

+1 on moving dataset spec to its own module.

Contributor (Author):

Moved to its own module!

But kept it as part of ray/train. It's being used by the current Ray Train, and as discussed offline, the end state is to eventually move ray/ml/train to ray/train anyway.
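The RuntimeError check shown in the diff context earlier in this thread can be sketched as a standalone function. The completed error message is an assumption; the diff truncates the original string after its first line, and the function name is illustrative.

```python
# Sketch of the shard-count validation shown in the diff context above.
# A user-supplied split function must return exactly one Dataset shard
# per training worker; anything else is an error. The wording of the
# message past its first line is an assumption.
def validate_splits(splits, training_worker_handles):
    if not len(splits) == len(training_worker_handles):
        raise RuntimeError(
            "The list of Datasets returned by the dataset split function "
            f"has length {len(splits)}, but it must match the number of "
            f"training workers ({len(training_worker_handles)})."
        )
    return splits
```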

```
@@ -320,12 +325,14 @@ def run(

    train_func = construct_train_func(train_func, config)

    dataset_spec = _RayDatasetSpec(dataset_or_dict=dataset)
```

Contributor:

At some point we should move / copy the trainer files into ml/train right? In preparation for replacing the old train module.

Contributor (Author):

I thought we decided the end state should be to move ray/ml/train to ray/train, right?

But yes, agreed: we should definitely clean up the current ray/train in a future PR.

Contributor:

Definitely need the cleanup. I think the easiest way is to copy files over and decouple them, but open to other approaches.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Apr 21, 2022
@jovany-wang jovany-wang self-assigned this Apr 21, 2022
@bveeramani (Member) left a comment:

Some readability comments, but overall looks good.

```
else:
    # Ray Train will strip out the added string before exposing to users.
    updated_dataset_dict[key + "_NO-SHARD"] = value

def dataset_split_fn(dataset_dict, training_worker_handles):
```

Member:

Could we add a type annotation to training_worker_handles? The type wasn't obvious until I read _RayDatasetSpec.

Contributor (Author):

Added!

```
    locality_hints=training_worker_handles,
)
else:
    # Only shard the training dataset.
```

Member:

Could you add a comment explaining why we're only sharding the training dataset?

Contributor (Author):

Added!

```
    ]
] = None

def _default_split_fn(
```

Member:

I'm confused why we need both _RayDatasetSpec._default_split_fn and dataset_split_fn in training_loop. Isn't dataset_split_fn the default for training?

Contributor (Author):

dataset_split_fn is the implementation that DataParallelTrainer uses, but is not the default for RayDatasetSpec in general.

Contributor (Author):

More specifically, the default implementation for RayDatasetSpec is to split all datasets.

DataParallelTrainer is overriding this behavior to split just the train dataset, but not split the other datasets.

In the future, users should be able to override the behavior for DataParallelTrainer.
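The two behaviors contrasted above can be sketched with plain Python stand-ins. This is purely illustrative: lists stand in for ray.data.Dataset objects, round-robin slicing stands in for Dataset.split(n, locality_hints=...), and both function names are hypothetical.

```python
# Illustrative sketch of the two splitting behaviors discussed in this
# thread. Plain lists stand in for ray.data.Dataset objects, and
# round-robin slicing stands in for Dataset.split(n, locality_hints=...).

def split_all(dataset_dict, n_workers):
    # Default _RayDatasetSpec behavior: split every dataset across workers.
    return [
        {key: data[i::n_workers] for key, data in dataset_dict.items()}
        for i in range(n_workers)
    ]

def split_train_only(dataset_dict, n_workers):
    # DataParallelTrainer behavior: shard only the "train" dataset;
    # each worker receives every other dataset whole.
    return [
        {
            key: (data[i::n_workers] if key == "train" else data)
            for key, data in dataset_dict.items()
        }
        for i in range(n_workers)
    ]
```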

@amogkam amogkam removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Apr 25, 2022
```
@@ -348,3 +343,39 @@ def write_checkpoint(self, checkpoint: Dict):
    @property
    def latest_checkpoint_dir(self) -> Optional[Path]:
        raise NotImplementedError


def _default_dataset_split_fn(
```

Contributor:

Could we move this into the dataset spec file?

Contributor (Author):

This function is specific to DataParallelTrainer, not to DatasetSpec in general.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Apr 25, 2022
@amogkam amogkam requested a review from ericl April 25, 2022 23:30
@amogkam amogkam removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Apr 25, 2022
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Apr 26, 2022
@bveeramani (Member) left a comment:

LGTM

@amogkam amogkam added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Apr 28, 2022
@amogkam amogkam merged commit 629424f into ray-project:master Apr 28, 2022
@amogkam amogkam deleted the air-dataset-split-refactor branch April 28, 2022 04:41
krfricke added a commit that referenced this pull request Apr 29, 2022
After #24066, some release tests are running into:

```
ModuleNotFoundError: No module named 'ray.train.impl'
```

This PR simply adds an `__init__.py` file to resolve this.

We also add a 5-second delay for client runners in release tests to give clusters a bit of slack to come up (and avoid Ray Client connection errors).
6 participants