
[Datasets] Raise error message if user calls Dataset.__iter__ #30575

Merged
2 commits merged into ray-project:master on Dec 12, 2022

Conversation

bveeramani (Member) commented Nov 22, 2022

Signed-off-by: Balaji Veeramani <[email protected]>

Why are these changes needed?

New users might try `for item in dataset` and get confused when they receive the default error message. This PR adds a more descriptive error that points users towards `Dataset.take` or `Dataset.map_batches`.
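For context, a minimal sketch of the pattern this PR guards against and the documented alternatives; `ray.data.range` is used here only as a convenient example dataset:

    import ray

    ds = ray.data.range(10)

    # What new users tend to write; after this PR it raises a descriptive TypeError:
    # for item in ds:
    #     ...

    # Recommended ways to inspect or transform records instead:
    print(ds.take(5))                         # materialize a small sample of records
    ds = ds.map_batches(lambda batch: batch)  # apply a function over batches of records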

Related issue number

Closes #30399

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -4203,6 +4203,12 @@ def __len__(self) -> int:
            "This may be an expensive operation."
        )

    def __iter__(self):
        raise TypeError(
            "`Dataset` objects aren't iterable. If you want to inspect records, call "
Contributor commented:

Would it be better to mention the iteration APIs instead of `take`?

`Dataset` objects aren't iterable. If you want to iterate records, call `ds.iter_rows()` or `ds.iter_batches()`.
See more information at https://docs.ray.io/en/latest/data/consuming-datasets.html

bveeramani (Member, Author) replied:

Yeah, I think that'd make more sense. Will update the message.
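For reference, a rough sketch of what the updated `__iter__` could look like after adopting this suggestion; the merged commit's exact wording isn't shown in this conversation, so the message text below is illustrative:

    def __iter__(self):
        raise TypeError(
            "`Dataset` objects aren't iterable. To iterate records, call "
            "`ds.iter_rows()` or `ds.iter_batches()`. For more information, see "
            "https://docs.ray.io/en/latest/data/consuming-datasets.html."
        )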

Signed-off-by: Balaji Veeramani <[email protected]>
clarkzinzow merged commit df13a1d into ray-project:master on Dec 12, 2022
amogkam added a commit that referenced this pull request Dec 14, 2022
…#31079)

Signed-off-by: amogkam <[email protected]>

Closes #31068

xgboost_ray has two modes for data loading:

  1. A centralized mode, where the driver first loads all of the data and then partitions it for the remote training actors to load.
  2. A distributed mode, where the remote training actors load their data partitions directly.

When using Ray Datasets with xgboost_ray, we should always do distributed data loading (option 2). However, this is no longer the case after #30575 was merged.

#30575 adds an `__iter__` method to Ray Datasets, causing `isinstance(dataset, Iterable)` to return True (a minimal illustration follows this commit message).

This causes Ray Dataset inputs to enter this if statement: https://github.com/ray-project/xgboost_ray/blob/v0.1.12/xgboost_ray/matrix.py#L943-L949, leading xgboost-ray to conclude that Ray Datasets are not distributed and to fall back to option 1 for loading.

This centralized loading leads to excessive object spilling and ultimately crashes large-scale xgboost training.

In this PR, we force distributed data loading when using the AIR GBDTTrainers.

In a follow-up, we should clean up the distributed-detection logic directly in xgboost-ray, remove input formats that are no longer supported, and then do a new release.
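As a minimal illustration of the mechanism described in this commit message (a stand-in class, not actual Ray code): defining `__iter__` at all, even one that only raises, is enough for `isinstance(obj, Iterable)` to return True, which is the check xgboost_ray relied on.

    from collections.abc import Iterable

    class FakeDataset:
        """Stand-in mimicking ray.data.Dataset after #30575."""

        def __iter__(self):
            # Raises instead of yielding, mirroring the new behavior.
            raise TypeError("`Dataset` objects aren't iterable.")

    ds = FakeDataset()
    # Iterable's subclass hook only checks that __iter__ is defined,
    # not that it actually yields anything, so this prints True.
    print(isinstance(ds, Iterable))  # True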
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
Labels: None yet
Projects: None yet
Development: Successfully merging this pull request may close these issues: [Datasets] Improve iter(dataset) error message
3 participants