[Data] Deprecate `Dataset.num_blocks()` for non-materialized `Dataset`s #43178

scottjlee · 2024-02-14T20:41:32Z

Why are these changes needed?

As part of the API simplification work in preparation for Ray Data GA, we are deprecating the Dataset.num_blocks() method. This method will only be available to MaterializedDatasets, and calling Dataset.num_blocks() on a non-materialized Dataset will result in a NotImplementedError.

Motivation: we want to make blocks a Ray Data internal concept, so users should typically not need to be concerned with them. Instead, users should prefer Dataset.count(), which returns the number of rows in the Dataset.

The number of blocks is still available from a method on the Dataset's internal ExecutionPlan object: ds._plan.initial_num_blocks().
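
The behavior described above can be sketched with toy classes. This is a minimal illustration, not the Ray implementation: the `Dataset` and `MaterializedDataset` classes below are stand-ins that only mirror the method names from this PR, and `block_size` is a hypothetical parameter introduced for the example.

```python
from typing import Iterable, List


class Dataset:
    """Toy stand-in for a lazy dataset; NOT the real ray.data.Dataset."""

    def __init__(self, rows: Iterable[int], block_size: int) -> None:
        self._rows: List[int] = list(rows)
        self._block_size = block_size  # hypothetical knob for this sketch

    def count(self) -> int:
        # The row count is well defined whether or not data is materialized.
        return len(self._rows)

    def num_blocks(self) -> int:
        # The block layout is not fixed until execution, so refuse to answer.
        raise NotImplementedError(
            "Number of blocks is only available for `MaterializedDataset`, "
            "because the number of blocks may dynamically change during execution."
        )

    def materialize(self) -> "MaterializedDataset":
        # "Execute" the plan: split the rows into fixed-size blocks.
        size = self._block_size
        blocks = [self._rows[i : i + size] for i in range(0, len(self._rows), size)]
        return MaterializedDataset(self._rows, size, blocks)


class MaterializedDataset(Dataset):
    """Toy stand-in whose blocks are already computed."""

    def __init__(self, rows: Iterable[int], block_size: int, blocks: list) -> None:
        super().__init__(rows, block_size)
        self._blocks = blocks

    def num_blocks(self) -> int:
        # Blocks exist now, so the count is stable.
        return len(self._blocks)


ds = Dataset(range(100), block_size=10)
print(ds.count())                     # 100
print(ds.materialize().num_blocks())  # 10
```

Calling `ds.num_blocks()` directly on the lazy object raises `NotImplementedError` in this sketch, which is the behavior the description proposes for non-materialized datasets.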

Related issue number

Closes #42184

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Scott Lee <[email protected]>

bveeramani · 2024-02-16T21:33:52Z

python/requirements/ml/core-requirements.txt

@@ -5,10 +5,10 @@ wandb==0.13.4

 # ML training frameworks
 xgboost==1.7.6
-git+https://github.com/ray-project/xgboost_ray.git
+git+https://github.com/ray-project/xgboost_ray@5a840af05d487171883dadbfdd37b138b607bed8#egg=xgboost_ray


Unrelated change?

bveeramani · 2024-02-20T19:52:08Z

python/ray/data/dataset.py

@@ -2593,26 +2593,6 @@ def columns(self, fetch_if_missing: bool = True) -> Optional[List[str]]:
            return schema.names
        return None

-    def num_blocks(self) -> int:


This API is implicitly a stable public API, right? Should we soft deprecate it before removal?

scottjlee · 2024-02-20

Up to @raulchen @c21 regarding our deprecation policy vs. whether we want to fast-track this removal before GA.

raulchen · 2024-02-21

I prefer to throw an error for non-materialized datasets.
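
For reference, the two options weighed in this thread could look roughly like the following. This is a hedged sketch using only the standard library; the function names `num_blocks_soft` and `num_blocks_hard` and the `initial_num_blocks` argument are hypothetical and do not appear in the PR.

```python
import warnings


def num_blocks_soft(initial_num_blocks: int) -> int:
    """Soft deprecation: keep returning a value, but warn the caller."""
    warnings.warn(
        "num_blocks() is deprecated for non-materialized datasets.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return initial_num_blocks


def num_blocks_hard() -> int:
    """Hard error: refuse to answer, as preferred in this thread."""
    raise NotImplementedError(
        "Number of blocks is only available for `MaterializedDataset`."
    )
```

The soft variant keeps existing callers working for a release cycle, while the hard variant fails immediately and avoids returning a block count that may no longer be accurate after execution.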

bveeramani · 2024-02-20T19:54:34Z

python/ray/data/dataset.py

@@ -988,7 +988,7 @@ def repartition(
        Examples:
            >>> import ray
            >>> ds = ray.data.range(100)
-            >>> ds.repartition(10).num_blocks()
+            >>> ds.repartition(10)._plan.initial_num_blocks()


If we don't want to expose num_blocks to users, I figure we don't want to expose Dataset._plan.initial_num_blocks either?

bveeramani · 2024-02-20T19:58:53Z

python/ray/data/dataset.py

@@ -988,7 +988,7 @@ def repartition(
        Examples:


Are we also planning on removing num_blocks from Dataset.__repr__?

raulchen · 2024-02-21

I think only MaterializedDataset should report num_blocks.

scottjlee · 2024-02-21

For a regular Dataset, should we throw an exception, or use some other alternative behavior?

bveeramani · 2024-02-21

Like throw an exception when you repr? I think we'd just exclude the information from the output.

scottjlee · 2024-02-21

Oh yeah, I would exclude it from the repr. I just meant the general case where we are calling num_blocks().

raulchen · 2024-02-21

Yeah, exclude num_blocks in the repr and throw an error when calling num_blocks(). This sounds most reasonable.
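
The repr behavior agreed on here can be sketched as follows. The helper `dataset_repr` is hypothetical, and the exact output format is an assumption for illustration; the PR thread does not specify what the real repr looks like.

```python
from typing import Optional


def dataset_repr(num_rows: int, schema: str, num_blocks: Optional[int] = None) -> str:
    """Build a repr string, including num_blocks only when it is known.

    Passing num_blocks=None models a non-materialized dataset.
    """
    fields = []
    if num_blocks is not None:
        # Only a materialized dataset has a stable block count to report.
        fields.append(f"num_blocks={num_blocks}")
    fields.append(f"num_rows={num_rows}")
    fields.append(f"schema={schema}")
    name = "MaterializedDataset" if num_blocks is not None else "Dataset"
    return f"{name}({', '.join(fields)})"


print(dataset_repr(100, "{id: int64}"))
print(dataset_repr(100, "{id: int64}", num_blocks=10))
```

The first call omits the block count entirely rather than raising, which matches the conclusion that building a repr should never throw.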

Signed-off-by: Scott Lee <[email protected]>

angelinalg

Just some nits and added some cross-referencing.

angelinalg · 2024-02-26T23:28:40Z

python/ray/data/dataset.py


-        Note that during read and transform operations, the number of blocks
+        This is only implemented for :class:`~ray.data.MaterializedDataset`,


Suggested change

-        This is only implemented for :class:`~ray.data.MaterializedDataset`,
+        This method is only implemented for :class:`~ray.data.MaterializedDataset`,

angelinalg · 2024-02-26T23:29:12Z

python/ray/data/dataset.py

-        Note that during read and transform operations, the number of blocks
+        This is only implemented for :class:`~ray.data.MaterializedDataset`,
+        since the number of blocks may dynamically change during execution.
+        For instance, during read and transform operations, the number of blocks


Suggested change

-        For instance, during read and transform operations, the number of blocks
+        For instance, during read and transform operations, Ray Data may dynamically adjust

angelinalg · 2024-02-26T23:29:23Z

python/ray/data/dataset.py

-        Note that during read and transform operations, the number of blocks
+        This is only implemented for :class:`~ray.data.MaterializedDataset`,
+        since the number of blocks may dynamically change during execution.
+        For instance, during read and transform operations, the number of blocks
        may be dynamically adjusted to respect memory limits, increasing the


Suggested change

-        may be dynamically adjusted to respect memory limits, increasing the
+        the number of blocks to respect memory limits, increasing the

angelinalg · 2024-02-26T23:29:37Z

python/ray/data/dataset.py

-            10
-
-        Time complexity: O(1)
-
        Returns:
            The number of blocks of this dataset.


Suggested change

-            The number of blocks of this dataset.
+            The number of blocks of this :class:`Dataset`.

angelinalg · 2024-02-26T23:29:51Z

python/ray/data/dataset.py

-        return self._plan.initial_num_blocks()
+        raise NotImplementedError(
+            "Number of blocks is only available for `MaterializedDataset`,"
+            "since the number of blocks may dynamically change during execution."


Suggested change

-            "since the number of blocks may dynamically change during execution."
+            "because the number of blocks may dynamically change during execution."

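One detail worth checking in the hunk above: Python joins adjacent string literals with no separator, so a first literal ending in `,` followed directly by a second literal produces a message without a space after the comma. The snippet below is a small illustration using the two literals as they appear in the diff; the variable names are made up for the example.

```python
# Adjacent string literals are concatenated exactly as written.
as_written = (
    "Number of blocks is only available for `MaterializedDataset`,"
    "since the number of blocks may dynamically change during execution."
)

# Adding a trailing space to the first literal fixes the rendered message.
with_space = (
    "Number of blocks is only available for `MaterializedDataset`, "
    "since the number of blocks may dynamically change during execution."
)

print(as_written)
print(with_space)
```

The first message renders as "...`MaterializedDataset`,since the number of blocks...", so a trailing space in the first literal is probably what was intended.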
angelinalg · 2024-02-26T23:30:25Z

python/ray/data/dataset.py

+        Time complexity: O(1)
+
+        Returns:
+            The number of blocks of this dataset.


Suggested change

-            The number of blocks of this dataset.
+            The number of blocks of this :class:`Dataset`.

angelinalg · 2024-02-26T23:34:50Z

python/ray/data/dataset.py

@@ -2576,24 +2576,22 @@ def columns(self, fetch_if_missing: bool = True) -> Optional[List[str]]:
        return None

    def num_blocks(self) -> int:
-        """Return the number of blocks of this dataset.
+        """Return the number of blocks of this Dataset.


Suggested change

-        """Return the number of blocks of this Dataset.
+        """Return the number of blocks of this :class:`Dataset`.

angelinalg · 2024-02-26T23:35:38Z

python/ray/data/dataset.py

@@ -5032,7 +5026,21 @@ class MaterializedDataset(Dataset, Generic[T]):
    tasks without re-executing the underlying computations for producing the stream.
    """

-    pass
+    def num_blocks(self) -> int:
+        """Return the number of blocks of this MaterializedDataset.


Suggested change

-        """Return the number of blocks of this MaterializedDataset.
+        """Return the number of blocks of this :class:`MaterializedDataset`.

angelinalg · 2024-02-26T23:36:48Z

python/ray/train/gbdt_trainer.py

                if dataset.size_bytes() > _WARN_REPARTITION_THRESHOLD:
                    warnings.warn(
-                        f"Dataset '{dataset_key}' has {dataset.num_blocks()} blocks, "
+                        f"Dataset '{dataset_key}' has {dataset_num_blocks} blocks, "
                        f"which is less than the `num_workers` "
                        f"{self._ray_params.num_actors}. "
                        f"This dataset will be automatically repartitioned to "


Suggested change

-                        f"This dataset will be automatically repartitioned to "
+                        f"This dataset is automatically repartitioned to "

Signed-off-by: Scott Lee <[email protected]>

scottjlee added 3 commits February 14, 2024 12:35

remove Dataset.num_blocks

8ec0511

Signed-off-by: Scott Lee <[email protected]>

replace in tests

3c3eae9

Signed-off-by: Scott Lee <[email protected]>

update docs

ea4e4a6

Signed-off-by: Scott Lee <[email protected]>

scottjlee mentioned this pull request Feb 14, 2024

[Data] Remove Dataset.num_blocks() usages ray-project/xgboost_ray#307

Merged

scottjlee and others added 6 commits February 14, 2024 17:03

Merge branch 'master' into 0214-numblocks

ef559e0

Merge branch 'master' into 0214-numblocks

670e6cf

pin xgboost_ray to fix

122555c

Signed-off-by: Scott Lee <[email protected]>

Merge branch 'master' into 0214-numblocks

ba11ff6

Signed-off-by: Scott Lee <[email protected]>

Merge branch '0214-numblocks' of https://github.com/scottjlee/ray into 0214-numblocks

8810f64

pin xgboostray/lightgbmray

5c4d48a

Signed-off-by: Scott Lee <[email protected]>

scottjlee marked this pull request as ready for review February 16, 2024 20:14

scottjlee requested review from ericl, scv119, c21, amogkam, bveeramani, raulchen, stephanie-wang and omatthew98 as code owners February 16, 2024 20:14

scottjlee assigned raulchen and bveeramani Feb 16, 2024

bveeramani reviewed Feb 20, 2024

scottjlee added 2 commits February 23, 2024 21:00

Merge branch 'master' into 0214-numblocks

1809a9e

Signed-off-by: Scott Lee <[email protected]>

keep num_blocks() for MaterializedDataset

fc142b9

Signed-off-by: Scott Lee <[email protected]>

scottjlee requested a review from a team as a code owner February 24, 2024 05:45

scottjlee added 4 commits February 24, 2024 12:12

tests

72c30db

Signed-off-by: Scott Lee <[email protected]>

lint

1a23a73

Signed-off-by: Scott Lee <[email protected]>

update doctests

514d7b0

Signed-off-by: Scott Lee <[email protected]>

undo ml requirements changes

ab7384c

Signed-off-by: Scott Lee <[email protected]>

scottjlee changed the title ~~[Data] Remove Dataset.num_blocks()~~ [Data] Deprecate Dataset.num_blocks() for non-materialized Datasets Feb 24, 2024

scottjlee added 2 commits February 24, 2024 14:22

format

b71b612

Signed-off-by: Scott Lee <[email protected]>

avoid passing dataset object

4f2403b

Signed-off-by: Scott Lee <[email protected]>

raulchen approved these changes Feb 26, 2024

c21 assigned angelinalg Feb 26, 2024

angelinalg approved these changes Feb 26, 2024

scottjlee added 2 commits February 26, 2024 15:58

address docs comments

193c6a7

Signed-off-by: Scott Lee <[email protected]>

Merge branch 'master' into 0214-numblocks

053dc94

Signed-off-by: Scott Lee <[email protected]>

c21 merged commit de08484 into ray-project:master Feb 27, 2024
8 of 9 checks passed
