
[Datasets] Raise error message if user calls Dataset.__iter__ #30575

Merged
2 commits merged into ray-project:master on Dec 12, 2022

Conversation

bveeramani (Member) commented Nov 22, 2022

Signed-off-by: Balaji Veeramani <[email protected]>

Why are these changes needed?

New users might try `for item in dataset` and get confused when they receive the default error message. This PR adds a more descriptive error that points users towards `Dataset.take` or `Dataset.map_batches`.
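For context, a minimal sketch of the pattern this PR guards against and the documented alternatives; `ray.data.range` is used here only as a convenient example dataset:

    import ray

    ds = ray.data.range(10)

    # What new users tend to write; after this PR it raises a descriptive TypeError:
    # for item in ds:
    #     ...

    # Recommended ways to inspect or transform records instead:
    print(ds.take(5))                         # materialize a small sample of records
    ds = ds.map_batches(lambda batch: batch)  # apply a function over batches of records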

Related issue number

Closes #30399

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -4203,6 +4203,12 @@ def __len__(self) -> int:
            "This may be an expensive operation."
        )

    def __iter__(self):
        raise TypeError(
            "`Dataset` objects aren't iterable. If you want to inspect records, call "
Contributor commented:

Would it be better to mention the iteration APIs instead of `take`?

`Dataset` objects aren't iterable. If you want to iterate records, call `ds.iter_rows()` or `ds.iter_batches()`.
See more information at https://docs.ray.io/en/latest/data/consuming-datasets.html

bveeramani (Member, Author) replied:

Yeah, I think that'd make more sense. Will update the message.
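For reference, a rough sketch of what the updated `__iter__` could look like after adopting this suggestion; the merged commit's exact wording isn't shown in this conversation, so the message text below is illustrative:

    def __iter__(self):
        raise TypeError(
            "`Dataset` objects aren't iterable. To iterate records, call "
            "`ds.iter_rows()` or `ds.iter_batches()`. For more information, see "
            "https://docs.ray.io/en/latest/data/consuming-datasets.html."
        )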

Signed-off-by: Balaji Veeramani <[email protected]>
clarkzinzow merged commit df13a1d into ray-project:master on Dec 12, 2022
amogkam added a commit that referenced this pull request Dec 14, 2022
…#31079)

Signed-off-by: amogkam <[email protected]>

Closes #31068

xgboost_ray has two modes for data loading:

  1. A centralized mode, where the driver first loads all of the data and then partitions it for the remote training actors to load.
  2. A distributed mode, where the remote training actors load their data partitions directly.

When using Ray Datasets with xgboost_ray, we should always do distributed data loading (option 2). However, this is no longer the case after #30575 was merged.

#30575 adds an `__iter__` method to Ray Datasets, causing `isinstance(dataset, Iterable)` to return True (a minimal illustration follows this commit message).

This causes Ray Dataset inputs to enter this if statement: https://github.com/ray-project/xgboost_ray/blob/v0.1.12/xgboost_ray/matrix.py#L943-L949, leading xgboost-ray to conclude that Ray Datasets are not distributed and to fall back to option 1 for loading.

This centralized loading leads to excessive object spilling and ultimately crashes large-scale xgboost training.

In this PR, we force distributed data loading when using the AIR GBDTTrainers.

In a follow-up, we should clean up the distributed-detection logic directly in xgboost-ray, remove input formats that are no longer supported, and then do a new release.
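As a minimal illustration of the mechanism described in this commit message (a stand-in class, not actual Ray code): defining `__iter__` at all, even one that only raises, is enough for `isinstance(obj, Iterable)` to return True, which is the check xgboost_ray relied on.

    from collections.abc import Iterable

    class FakeDataset:
        """Stand-in mimicking ray.data.Dataset after #30575."""

        def __iter__(self):
            # Raises instead of yielding, mirroring the new behavior.
            raise TypeError("`Dataset` objects aren't iterable.")

    ds = FakeDataset()
    # Iterable's subclass hook only checks that __iter__ is defined,
    # not that it actually yields anything, so this prints True.
    print(isinstance(ds, Iterable))  # True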
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
Labels: None yet
Projects: None yet
Development: Successfully merging this pull request may close these issues: [Datasets] Improve iter(dataset) error message
3 participants