[data] Add DataIterator.materialize #43210
Conversation
Signed-off-by: Justin Yu <[email protected]>
(
    feature_column_dtypes[key]
    if isinstance(feature_column_dtypes, dict)
    else feature_column_dtypes
),
is this just a lint change?
My auto-linting caught this for some reason.
LGTM, thanks @justinvyu! Can we also add an end-to-end test for XGBoostTrainer? It can wait for a follow-up PR if you plan to do it later.
block_iter, stats, owned_by_consumer = self._to_block_iterator()
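# NOTE: This triggers execution and materializes all blocks of the iterator.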
block_refs_and_metadata = list(block_iter)
Can we add a comment noting that this triggers execution and materializes all blocks of the iterator? I want to make it obvious for people reading this in the future.
Oh, forgot one more thing: please add the new API to the documentation - https://github.com/ray-project/ray/blob/master/doc/source/data/api/data_iterator.rst . Thanks.
@c21 Yes, I plan on adding the e2e test in the follow-up.
Why are these changes needed?
This PR introduces a DataIterator.materialize API that fully executes/consumes a data iterator and returns it as a MaterializedDataset for the user to continue processing.

The reason to add this API is to support model training in Ray Train that requires the full dataset up front. For example, xgboost needs to consider the full dataset to fit decision trees, and expects that full dataset to be materialized.

The get_dataset_shard API, which bridges Ray Data and Ray Train, calls streaming_split on the dataset, where the number of splits is the number of training workers. This works well for SGD training schemes (typical for Torch and TensorFlow users), since the usual training procedure estimates the gradient on a small batch of data at a time. Fitting decision trees, by contrast, requires searching for the best split over the entire dataset, so batch-by-batch data loading is not suitable.

With this change, the following workflow is now possible:
XGBoost training with a data iterator
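A minimal sketch of that workflow, using the new materialize API inside a Ray Train training function; the dataset schema, model parameters, and trainer setup here are illustrative, not taken from the PR:

import ray
import xgboost
from ray import train
from ray.train import ScalingConfig
from ray.train.data_parallel_trainer import DataParallelTrainer

def train_fn(config):
    # get_dataset_shard returns a DataIterator backed by streaming_split.
    it = train.get_dataset_shard("train")

    # New in this PR: fully execute the iterator and get the shard back
    # as a MaterializedDataset, since xgboost needs all rows up front.
    ds = it.materialize()

    # Convert the materialized shard to pandas and fit decision trees on it.
    df = ds.to_pandas()
    dtrain = xgboost.DMatrix(df[["x"]], label=df["y"])
    xgboost.train({"objective": "reg:squarederror"}, dtrain, num_boost_round=10)

trainer = DataParallelTrainer(
    train_fn,
    datasets={"train": ray.data.from_items([{"x": i, "y": 2 * i} for i in range(100)])},
    scaling_config=ScalingConfig(num_workers=2),
)
trainer.fit()

The key point is that each worker still receives its shard via streaming_split, but can now pin the whole shard in memory before handing it to xgboost.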
Note that there actually is support for xgboost training with data iterators, but it is experimental and possibly less performant: https://xgboost.readthedocs.io/en/latest/tutorials/external_memory.html#data-iterator
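For reference, that experimental path looks roughly like the sketch below, adapted from the linked tutorial; load_batch is a hypothetical helper that loads one partition of the data into memory:

import os
import xgboost

class BatchIterator(xgboost.DataIter):
    def __init__(self, file_paths):
        self._file_paths = file_paths
        self._it = 0
        # xgboost caches transformed batches on disk under this prefix.
        super().__init__(cache_prefix=os.path.join(".", "cache"))

    def next(self, input_data):
        # Called by xgboost to fetch the next batch; return 0 when exhausted.
        if self._it == len(self._file_paths):
            return 0
        X, y = load_batch(self._file_paths[self._it])  # hypothetical helper
        input_data(data=X, label=y)
        self._it += 1
        return 1

    def reset(self):
        # Rewind to the first batch so xgboost can make another pass.
        self._it = 0

it = BatchIterator(["part-0.parquet", "part-1.parquet"])
dtrain = xgboost.DMatrix(it)
xgboost.train({"tree_method": "hist"}, dtrain)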
Related PR
This PR is a prerequisite for #42767
Related issue number
Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've added any new APIs to the API reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.