[Datasets] Raise error message if user calls Dataset.__iter__ #30575

Merged: clarkzinzow merged 2 commits into ray-project:master from bveeramani:iter-error-message on Dec 12, 2022
Conversation
bveeramani requested review from ericl, scv119, clarkzinzow, jjyao, jianoaix, and c21 as code owners on November 22, 2022 at 05:36
c21 reviewed on Nov 28, 2022
python/ray/data/dataset.py (Outdated)

```diff
@@ -4203,6 +4203,12 @@ def __len__(self) -> int:
             "This may be an expensive operation."
         )
 
+    def __iter__(self):
+        raise TypeError(
+            "`Dataset` objects aren't iterable. If you want to inspect records, call "
```
Would it be better to mention the iteration APIs instead of `take`?

Suggested message:

> `Dataset` objects aren't iterable. If you want to iterate records, call `ds.iter_rows()` or `ds.iter_batches()`. See more information at https://docs.ray.io/en/latest/data/consuming-datasets.html
bveeramani: Yeah, I think that'd make more sense. Will update the message.
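A minimal sketch of the resulting pattern, using the reviewer's suggested wording (illustrative only, not the exact merged diff):

```python
class Dataset:
    # Illustrative stub of ray.data.Dataset; only the relevant method is shown.

    def __iter__(self):
        # Defining __iter__ solely to raise replaces Python's generic
        # "'Dataset' object is not iterable" with an actionable message.
        raise TypeError(
            "`Dataset` objects aren't iterable. If you want to iterate records, "
            "call `ds.iter_rows()` or `ds.iter_batches()`. See more information "
            "at https://docs.ray.io/en/latest/data/consuming-datasets.html"
        )
```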
Signed-off-by: Balaji Veeramani <[email protected]>
clarkzinzow approved these changes on Dec 7, 2022
c21 approved these changes on Dec 7, 2022
amogkam added a commit that referenced this pull request on Dec 14, 2022:
…#31079) Signed-off-by: amogkam <[email protected]> Closes #31068

xgboost_ray has two modes for data loading:

1. A centralized mode, where the driver first loads in all the data and then partitions it for the remote training actors to load.
2. A distributed mode, where the remote training actors load in the data partitions directly.

When using Ray Datasets with xgboost_ray, we should always do distributed data loading (option 2). However, this is no longer the case after #30575 is merged. #30575 adds an `__iter__` method to Ray Datasets, causing `isinstance(dataset, Iterable)` to return True. This causes Ray Dataset inputs to enter this if statement: https://github.com/ray-project/xgboost_ray/blob/v0.1.12/xgboost_ray/matrix.py#L943-L949, leading xgboost-ray to think that Ray Datasets are not distributed and therefore to go with option 1 for loading. This centralized loading leads to excessive object spilling and ultimately crashes large-scale xgboost training.

In this PR, we force distributed data loading when using the AIR GBDTTrainers. In a follow-up, we should clean up the distributed detection logic directly in xgboost-ray, removing input formats that are no longer supported, and then do a new release.
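For context, the Python behavior that bit xgboost_ray here, as a self-contained illustration (the `Dataset` stub below is hypothetical, standing in for ray.data.Dataset after #30575):

```python
from collections.abc import Iterable

class Dataset:
    # Stand-in for ray.data.Dataset after #30575: __iter__ exists, but only raises.
    def __iter__(self):
        raise TypeError("`Dataset` objects aren't iterable.")

ds = Dataset()

# Iterable's subclass hook only checks that __iter__ is *defined*, not that it
# succeeds, so the isinstance check passes even though iteration always raises.
print(isinstance(ds, Iterable))  # True
```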
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request on Dec 19, 2022:
…project#30575) New users might try `for item in dataset` and get confused when they receive the default error message. This PR adds a more descriptive error that points users towards `Dataset.take` or `Dataset.map_batches`. Signed-off-by: Weichen Xu <[email protected]>
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request on Dec 19, 2022: the cherry-pick of ray-project#31079, with the same commit message as above. Signed-off-by: Weichen Xu <[email protected]>
AmeerHajAli pushed a commit that referenced this pull request on Jan 12, 2023, with the same #31079 commit message as above. Signed-off-by: amogkam <[email protected]>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request on Jan 25, 2023: the cherry-pick of ray-project#30575, with the same commit message as above. Signed-off-by: tmynn <[email protected]>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request on Jan 25, 2023: the cherry-pick of ray-project#31079, with the same commit message as above. Signed-off-by: tmynn <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Why are these changes needed?
New users might try `for item in dataset` and get confused when they receive the default error message. This PR adds a more descriptive error that points users towards `Dataset.take` or `Dataset.map_batches`.

Related issue number
Closes #30399
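For illustration, a sketch of the intended behavior change (assuming the Ray Data API of this release; `ray.data.range` is used only as a stand-in dataset):

```python
import ray

ds = ray.data.range(10)

try:
    # Before this PR: Python's default "'Dataset' object is not iterable".
    # After this PR: a descriptive TypeError pointing at the supported APIs.
    for item in ds:
        pass
except TypeError as e:
    print(e)

# Supported ways to consume records instead:
print(ds.take(5))            # materialize a handful of records for inspection
for row in ds.iter_rows():   # stream records one at a time
    pass
```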
Checks

- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.