[Data] Deprecate Dataset.num_blocks() for non-materialized Datasets #43178
Signed-off-by: Scott Lee <[email protected]>
…o 0214-numblocks
@@ -5,10 +5,10 @@ wandb==0.13.4

 # ML training frameworks
 xgboost==1.7.6
-git+https://github.com/ray-project/xgboost_ray.git
+git+https://github.com/ray-project/xgboost_ray@5a840af05d487171883dadbfdd37b138b607bed8#egg=xgboost_ray
Unrelated change?
@@ -2593,26 +2593,6 @@ def columns(self, fetch_if_missing: bool = True) -> Optional[List[str]]:
             return schema.names
         return None

-    def num_blocks(self) -> int:
This API is implicitly a stable public API, right? Should we soft deprecate it before removal?
I prefer to throw an error for non-materialized datasets
python/ray/data/dataset.py
Outdated
@@ -988,7 +988,7 @@ def repartition(
         Examples:
             >>> import ray
             >>> ds = ray.data.range(100)
-            >>> ds.repartition(10).num_blocks()
+            >>> ds.repartition(10)._plan.initial_num_blocks()
If we don't want to expose num_blocks to the user, I figure we don't want to expose Dataset._plan.initial_num_blocks either?
@@ -988,7 +988,7 @@ def repartition(
         Examples:
Are we also planning on removing num_blocks from Dataset.__repr__?
I think only MaterializedDataset should report num_blocks.
for regular Dataset, should we throw an exception or other alternative behavior?
Like throw an exception when you repr? I think we'd just exclude the information from the output
oh yeah, i would exclude it from the repr. just meant the general case where we are calling num_blocks()
yeah, exclude num_blocks in repr and throw an error when calling num_blocks(). this sounds most reasonable.
Just some nits and added some cross-referencing.
python/ray/data/dataset.py
Outdated
-        Note that during read and transform operations, the number of blocks
+        This is only implemented for :class:`~ray.data.MaterializedDataset`,
Suggested change:
-        This is only implemented for :class:`~ray.data.MaterializedDataset`,
+        This method is only implemented for :class:`~ray.data.MaterializedDataset`,
python/ray/data/dataset.py
Outdated
-        Note that during read and transform operations, the number of blocks
+        This is only implemented for :class:`~ray.data.MaterializedDataset`,
+        since the number of blocks may dynamically change during execution.
+        For instance, during read and transform operations, the number of blocks
Suggested change:
-        For instance, during read and transform operations, the number of blocks
+        For instance, during read and transform operations, Ray Data may dynamically adjust
python/ray/data/dataset.py
Outdated
-        Note that during read and transform operations, the number of blocks
+        This is only implemented for :class:`~ray.data.MaterializedDataset`,
+        since the number of blocks may dynamically change during execution.
+        For instance, during read and transform operations, the number of blocks
         may be dynamically adjusted to respect memory limits, increasing the
Suggested change:
-        may be dynamically adjusted to respect memory limits, increasing the
+        the number of blocks to respect memory limits, increasing the
python/ray/data/dataset.py
Outdated
10

Time complexity: O(1)

Returns:
    The number of blocks of this dataset.
Suggested change:
-            The number of blocks of this dataset.
+            The number of blocks of this :class:`Dataset`.
python/ray/data/dataset.py
Outdated
-        return self._plan.initial_num_blocks()
+        raise NotImplementedError(
+            "Number of blocks is only available for `MaterializedDataset`,"
+            "since the number of blocks may dynamically change during execution."
Suggested change:
-            "since the number of blocks may dynamically change during execution."
+            "because the number of blocks may dynamically change during execution."
python/ray/data/dataset.py
Outdated
Time complexity: O(1)

Returns:
    The number of blocks of this dataset.
Suggested change:
-            The number of blocks of this dataset.
+            The number of blocks of this :class:`Dataset`.
python/ray/data/dataset.py
Outdated
@@ -2576,24 +2576,22 @@ def columns(self, fetch_if_missing: bool = True) -> Optional[List[str]]:
         return None

     def num_blocks(self) -> int:
-        """Return the number of blocks of this dataset.
+        """Return the number of blocks of this Dataset.
Suggested change:
-        """Return the number of blocks of this Dataset.
+        """Return the number of blocks of this :class:`Dataset`.
python/ray/data/dataset.py
Outdated
@@ -5032,7 +5026,21 @@ class MaterializedDataset(Dataset, Generic[T]):
     tasks without re-executing the underlying computations for producing the stream.
     """

-    pass
+    def num_blocks(self) -> int:
+        """Return the number of blocks of this MaterializedDataset.
Suggested change:
-        """Return the number of blocks of this MaterializedDataset.
+        """Return the number of blocks of this :class:`MaterializedDataset`.
python/ray/train/gbdt_trainer.py
Outdated
 if dataset.size_bytes() > _WARN_REPARTITION_THRESHOLD:
     warnings.warn(
-        f"Dataset '{dataset_key}' has {dataset.num_blocks()} blocks, "
+        f"Dataset '{dataset_key}' has {dataset_num_blocks} blocks, "
         f"which is less than the `num_workers` "
         f"{self._ray_params.num_actors}. "
         f"This dataset will be automatically repartitioned to "
Suggested change:
-        f"This dataset will be automatically repartitioned to "
+        f"This dataset is automatically repartitioned to "
Why are these changes needed?
As part of the API simplification work in preparation for Ray Data GA, we are deprecating the Dataset.num_blocks() method. This method will only be available to MaterializedDatasets, and calling Dataset.num_blocks() on a non-materialized Dataset will result in a NotImplementedError.
Additional context behind the motivation for the change: We want to make Blocks a Ray Data internal concept, so users should typically not need to be concerned with them. Instead, the primary method of choice should be Dataset.count(), which returns the number of rows in the Dataset.
The number of blocks is still available from a method of the Dataset's internal ExecutionPlan object: ds._plan.initial_num_blocks().
Related issue number
Closes #42184
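
For readers skimming the thread, the behavior described under "Why are these changes needed?" can be sketched roughly as below. This is a toy illustration only, not Ray's implementation: the two classes are stand-ins that borrow the names Dataset and MaterializedDataset, and the rows_per_block parameter on materialize() is hypothetical (it exists here only so the example can build blocks deterministically). The error message follows the wording suggested in review.

```python
from typing import List


class Dataset:
    """Toy stand-in for a lazily executed dataset (not Ray's class)."""

    def __init__(self, rows: List[int]) -> None:
        self._rows = rows

    def count(self) -> int:
        # The row count does not depend on how rows are split into blocks.
        return len(self._rows)

    def num_blocks(self) -> int:
        # Before execution the block count is not fixed, so refuse to answer.
        raise NotImplementedError(
            "Number of blocks is only available for `MaterializedDataset`, "
            "because the number of blocks may dynamically change during execution."
        )

    def materialize(self, rows_per_block: int = 10) -> "MaterializedDataset":
        # Hypothetical parameter: split the rows into fixed-size blocks.
        blocks = [
            self._rows[i : i + rows_per_block]
            for i in range(0, len(self._rows), rows_per_block)
        ]
        return MaterializedDataset(self._rows, blocks)


class MaterializedDataset(Dataset):
    """Toy stand-in for a dataset whose blocks already exist."""

    def __init__(self, rows: List[int], blocks: List[List[int]]) -> None:
        super().__init__(rows)
        self._blocks = blocks

    def num_blocks(self) -> int:
        # Blocks are concrete once materialized, so the count is well defined.
        return len(self._blocks)


lazy = Dataset(list(range(100)))
try:
    lazy.num_blocks()
    lazy_error = None
except NotImplementedError as exc:
    lazy_error = str(exc)

materialized = lazy.materialize(rows_per_block=10)
```

In this sketch, count() works on both classes, while num_blocks() only answers after materialize(), which mirrors the split agreed on in the review comments.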