Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data] Deprecate Dataset.num_blocks() for non-materialized Datasets #43178

Merged
merged 19 commits into from
Feb 27, 2024

Conversation

scottjlee
Copy link
Contributor

@scottjlee scottjlee commented Feb 14, 2024

Why are these changes needed?

As part of the API simplification work in preparation for Ray Data GA, we are deprecating the Dataset.num_blocks() method. This method will only be available to MaterializedDatasets, and calling Dataset.num_blocks() on a non-materialized Dataset will result in a NotImplementedError.

Additional context behind the motivation for the change: We want to make Blocks a Ray Data internal concept, so users should typically not need to be concerned with them. Instead, the primary method of choice should be Dataset.count(), which returns the number of rows in the Dataset.

The number of blocks is still available from method of the Dataset's internal ExecutionPlan object: ds._plan.initial_num_blocks().

Related issue number

Closes #42184

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
@@ -5,10 +5,10 @@ wandb==0.13.4

# ML training frameworks
xgboost==1.7.6
git+https://github.com/ray-project/xgboost_ray.git
git+https://github.com/ray-project/xgboost_ray@5a840af05d487171883dadbfdd37b138b607bed8#egg=xgboost_ray
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated change?

@@ -2593,26 +2593,6 @@ def columns(self, fetch_if_missing: bool = True) -> Optional[List[str]]:
return schema.names
return None

def num_blocks(self) -> int:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API is implicitly a stable public API, right? Should we soft deprecate it before removal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

up to @raulchen @c21 regarding our deprecation policy vs if we want to fast track this removal before GA.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to throw an error for non-materialized datasets

@@ -988,7 +988,7 @@ def repartition(
Examples:
>>> import ray
>>> ds = ray.data.range(100)
>>> ds.repartition(10).num_blocks()
>>> ds.repartition(10)._plan.initial_num_blocks()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't want to expose num_blocks to user, I figure we don't want to expose Dataset._plan.initial_num_blocks either?

@@ -988,7 +988,7 @@ def repartition(
Examples:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we also planning on removing num_blocks from Dataset.__repr__?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think only MaterializedDataset should report num_blocks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for regular Dataset, should we throw an exception or other alternative behavior?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like throw an exception when you repr? I think we'd just exclude the information from the output

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yeah, i would exclude it from the repr. just meant the general case where we are calling num_blocks()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, exclude num_blocks in repr and throw an error when calling num_blocks(). this sounds most reasonable.

@scottjlee scottjlee requested a review from a team as a code owner February 24, 2024 05:45
Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
@scottjlee scottjlee changed the title [Data] Remove Dataset.num_blocks() [Data] Deprecate Dataset.num_blocks() for non-materialized Datasets Feb 24, 2024
Signed-off-by: Scott Lee <[email protected]>
Copy link
Contributor

@angelinalg angelinalg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some nits and added some cross-referencing.


Note that during read and transform operations, the number of blocks
This is only implemented for :class:`~ray.data.MaterializedDataset`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This is only implemented for :class:`~ray.data.MaterializedDataset`,
This method is only implemented for :class:`~ray.data.MaterializedDataset`,

Note that during read and transform operations, the number of blocks
This is only implemented for :class:`~ray.data.MaterializedDataset`,
since the number of blocks may dynamically change during execution.
For instance, during read and transform operations, the number of blocks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
For instance, during read and transform operations, the number of blocks
For instance, during read and transform operations, Ray Data may dynamically adjust

Note that during read and transform operations, the number of blocks
This is only implemented for :class:`~ray.data.MaterializedDataset`,
since the number of blocks may dynamically change during execution.
For instance, during read and transform operations, the number of blocks
may be dynamically adjusted to respect memory limits, increasing the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
may be dynamically adjusted to respect memory limits, increasing the
the number of blocks to respect memory limits, increasing the

10

Time complexity: O(1)

Returns:
The number of blocks of this dataset.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The number of blocks of this dataset.
The number of blocks of this :class:`Dataset`.

return self._plan.initial_num_blocks()
raise NotImplementedError(
"Number of blocks is only available for `MaterializedDataset`,"
"since the number of blocks may dynamically change during execution."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"since the number of blocks may dynamically change during execution."
"because the number of blocks may dynamically change during execution."

Time complexity: O(1)

Returns:
The number of blocks of this dataset.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The number of blocks of this dataset.
The number of blocks of this :class:`Dataset`.

@@ -2576,24 +2576,22 @@ def columns(self, fetch_if_missing: bool = True) -> Optional[List[str]]:
return None

def num_blocks(self) -> int:
"""Return the number of blocks of this dataset.
"""Return the number of blocks of this Dataset.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Return the number of blocks of this Dataset.
"""Return the number of blocks of this :class:`Dataset`.

@@ -5032,7 +5026,21 @@ class MaterializedDataset(Dataset, Generic[T]):
tasks without re-executing the underlying computations for producing the stream.
"""

pass
def num_blocks(self) -> int:
"""Return the number of blocks of this MaterializedDataset.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Return the number of blocks of this MaterializedDataset.
"""Return the number of blocks of this :class:`MaterializedDataset`.

if dataset.size_bytes() > _WARN_REPARTITION_THRESHOLD:
warnings.warn(
f"Dataset '{dataset_key}' has {dataset.num_blocks()} blocks, "
f"Dataset '{dataset_key}' has {dataset_num_blocks} blocks, "
f"which is less than the `num_workers` "
f"{self._ray_params.num_actors}. "
f"This dataset will be automatically repartitioned to "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
f"This dataset will be automatically repartitioned to "
f"This dataset is automatically repartitioned to "

@c21 c21 merged commit de08484 into ray-project:master Feb 27, 2024
8 of 9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[data] deprecate Dataset.num_blocks
5 participants