[Dataset] Add `FromXXX` operators #32959

scottjlee · 2023-03-02T00:14:15Z

Why are these changes needed?

In the Ray Data Read API (ray.data.read_api), the from_xxx methods do not have logical operators. This PR implements the corresponding logical operators for those from_xxx methods.

Related issue number

Closes #32604

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Scott Lee <[email protected]>

ericl · 2023-03-02T00:17:21Z

python/ray/data/_internal/logical/operators/from_items_operator.py

+from ray.data._internal.logical.operators.map_operator import AbstractMap
+
+
+class FromItems(AbstractMap):


It's not really a map right? Shouldn't this inherit from Read?

Yeah, you're correct, it's not a map here. I discussed this with @c21 earlier this week, and I think the distinction we wanted between read_xxx and from_xxx was that read_xxx reads from an external source, while from_xxx reads from in-memory data. Should we still inherit from Read? As an alternative, we could inherit from the general parent LogicalOperator class instead.

Yeah, if it's hard to represent as a Read, it seems ok to inherit from logical operator directly.

ericl

The fact that this is distinct from Read seems strange to me, especially since range() generates a Read operator. I don't have a strong objection however.
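To make the inheritance question in this thread concrete, here is a minimal sketch. It assumes toy stand-in classes: the names match the ones discussed above, but the constructors and attributes are hypothetical and are not Ray's actual definitions. It shows why a source-style operator such as FromItems does not fit under a map-style parent: a map always wraps one upstream operator, while a source has none.

```python
from typing import Any, List


class LogicalOperator:
    """Toy stand-in for a logical plan node (hypothetical, not Ray's class)."""

    def __init__(self, name: str, input_dependencies: List["LogicalOperator"]):
        self.name = name
        self.input_dependencies = input_dependencies


class AbstractMap(LogicalOperator):
    """Toy map-style operator: always transforms exactly one upstream operator."""

    def __init__(self, name: str, input_op: LogicalOperator):
        super().__init__(name, [input_op])


class FromItems(LogicalOperator):
    """Toy source operator: holds in-memory items and has no upstream input."""

    def __init__(self, items: List[Any], parallelism: int = -1):
        super().__init__("FromItems", [])
        self.items = items
        self.parallelism = parallelism


# A source sits at the root of the plan; a map is chained on top of it.
source = FromItems([1, 2, 3])
mapped = AbstractMap("MapBatches", source)
```

In this sketch, inheriting from the base class directly keeps FromItems free of the "exactly one input" assumption that the map-style parent bakes in.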


scottjlee · 2023-04-06T16:35:42Z

Tests look unrelated to me -- sorry for the long hiatus on this one, should be ready for review now @ericl

ericl · 2023-04-06T17:02:15Z

python/ray/data/_internal/logical/operators/from_items_operator.py

+        self._parallelism = parallelism
+
+
+class FromTF(FromItems):


Does this create Arrow blocks or simple blocks? (I assume arrow?) I am wondering if it makes sense to extend the Arrow logical operator if that's the case.

Yeah, this creates Arrow blocks I believe:

>>> import ray; import tensorflow_datasets as tfds
>>> tf_dataset, _ = tfds.load('cifar10', split=["train", "test"])
>>> dataset = ray.data.from_tf(tf_dataset)
>>> blocks = dataset._plan.execute()
>>> ray.get(blocks._blocks[0])
pyarrow.Table
id: binary
image: extension<arrow.py_extension_type<ArrowTensorType>>
label: int64

Since the existing from_tf implementation relies on from_items on a numpy iterator converted to a list, I was thinking this most closely reflects FromItems. Thoughts?

I think it's fine to refine FromTF to extend FromNumpyRefs, so it's being more specific.

ericl · 2023-04-06T17:02:43Z

python/ray/data/_internal/logical/operators/from_numpy_operator.py

+
+    def __init__(
+        self,
+        ndarrays: Union[List[ObjectRef["np.ndarray"]], List["np.ndarray"]],


Or perhaps from tf / torch should extend this class instead?

yeah I think from_tf can extend this class.

from_torch does not seem guaranteed to return NumPy ndarrays, so it looks okay to me for it to extend FromItems.
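The outcome of the two threads above can be sketched as a small class hierarchy. This is an illustration only: the classes are empty stand-ins, and the `PLAN_FNS` table and `pick_plan_fn` helper are hypothetical and do not describe how Ray's planner dispatches. The sketch assumes a planner that looks up a handler by walking an operator's base classes, which is one way a more specific parent (FromNumpyRefs for FromTF) could let a subclass reuse its parent's planning path.

```python
class LogicalOperator:
    """Toy base class (hypothetical stand-in)."""


class FromItems(LogicalOperator):
    """Source built from arbitrary in-memory Python items."""


class FromNumpyRefs(LogicalOperator):
    """Source built from NumPy ndarrays (or references to them)."""


class FromTF(FromNumpyRefs):
    """from_tf yields NumPy data, so it extends the NumPy source."""


class FromTorch(FromItems):
    """from_torch is not guaranteed to yield ndarrays, so it stays generic."""


# Hypothetical dispatch table: operator class -> name of a planning function.
PLAN_FNS = {
    FromItems: "plan_from_items",
    FromNumpyRefs: "plan_from_numpy_refs",
}


def pick_plan_fn(op: LogicalOperator) -> str:
    """Return the handler registered for the nearest base class of `op`."""
    for cls in type(op).__mro__:
        if cls in PLAN_FNS:
            return PLAN_FNS[cls]
    raise ValueError(f"No plan function registered for {type(op).__name__}")


tf_plan = pick_plan_fn(FromTF())
torch_plan = pick_plan_fn(FromTorch())
```

Under these assumptions, FromTF resolves to the NumPy handler and FromTorch to the generic items handler without either subclass registering anything itself.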

c21

Thanks @scottjlee, the major logic looks pretty solid; I have some comments.

python/ray/data/tests/test_raydp_dataset.py

python/ray/data/_internal/util.py

python/ray/data/_internal/logical/operators/from_pandas_operator.py

python/ray/data/_internal/planner/plan_from_numpy_op.py

python/ray/data/_internal/planner/plan_from_pandas_op.py

python/ray/data/tests/test_execution_optimizer.py


python/ray/data/read_api.py

c21 · 2023-04-07T21:01:07Z

python/ray/data/read_api.py

    )


 @PublicAPI
 def from_arrow(
-    tables: Union["pyarrow.Table", bytes, List[Union["pyarrow.Table", bytes]]]
+    tables: Union["pyarrow.Table", bytes, List[Union["pyarrow.Table", bytes]]],
+    logical_op: Optional[FromArrowRefs] = None,
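The diff above adds a `logical_op` parameter to a public method, and a later commit in this PR is titled "remove logical_op from public methods". One common way to keep an internal argument out of a public signature is to move it to a private helper. The sketch below assumes that pattern; `_from_arrow_internal`, the returned dict, and the empty `FromArrowRefs` class are hypothetical stand-ins and do not reflect the code that was actually merged.

```python
import inspect
from typing import Any, List, Optional


class FromArrowRefs:
    """Hypothetical stand-in for the logical operator; not Ray's class."""


def _from_arrow_internal(
    tables: List[Any], logical_op: Optional[FromArrowRefs] = None
) -> dict:
    """Private helper: internal callers may pass a pre-built operator."""
    op = logical_op if logical_op is not None else FromArrowRefs()
    return {"tables": tables, "logical_op": op}


def from_arrow(tables: Any) -> dict:
    """Public entry point: exposes only user-facing arguments."""
    if not isinstance(tables, list):
        tables = [tables]
    return _from_arrow_internal(tables)


# The public signature carries no internal plumbing.
public_params = list(inspect.signature(from_arrow).parameters)
result = from_arrow("table-placeholder")
```

The design choice here is that users cannot pass planner internals by accident, while internal call sites can still thread an operator through the helper.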


python/ray/data/read_api.py


c21

LGTM

c21 · 2023-04-08T03:11:10Z

cc @ericl for any comments before merging. thanks.

- #32959 added a good number of tests without changing any timeouts, and as a result, some of the tests will time out occasionally, making the Data CI tests flakey. Therefore, we should increase the timeout for Bazel targets which recently received additional test cases.
- In addition, one of the failing tests, `test_from_huggingface_e2e`, was found to have a failure which was not caught in the original PR. `test_stats.test_dataset__repr__` also is flakey sometimes, so I add a fix for these tests.
- I also added a blank file, `python/ray/data/tests/block_batching/__init__.py`, which is needed to resolve a pytest error (non-unique test filename) for an existing test.

Signed-off-by: Scott Lee <[email protected]>


scottjlee added 5 commits February 28, 2023 13:52

pull in tests

a75f2a7

Signed-off-by: Scott Lee <[email protected]>

Merge branch 'master' into from-operators

c3b9486

Signed-off-by: Scott Lee <[email protected]>

Merge branch 'master' into from-operators

49a29e5

Signed-off-by: Scott Lee <[email protected]>

from_items operator

f4b79f5

Signed-off-by: Scott Lee <[email protected]>

Merge branch 'master' into from-operators

a4899d8

Signed-off-by: Scott Lee <[email protected]>

scottjlee marked this pull request as ready for review March 2, 2023 00:15

scottjlee requested review from ericl, scv119, clarkzinzow, jjyao, jianoaix and c21 as code owners March 2, 2023 00:15

scottjlee assigned c21 Mar 2, 2023

ericl reviewed Mar 2, 2023


scottjlee added 3 commits March 1, 2023 16:38

inherit logicaloperator class instead

5ab0827

Signed-off-by: Scott Lee <[email protected]>

add from_pandas_ref operator

1adee26

Signed-off-by: Scott Lee <[email protected]>

add FromPandas operator

5c948f8

Signed-off-by: Scott Lee <[email protected]>

scottjlee changed the title ~~[Dataset] Add FromItems operator~~ [Dataset] Add FromItems, FromPandasRef, and FromPandas operators Mar 6, 2023

scottjlee added 2 commits March 6, 2023 12:57

clean up

8c43af7

Signed-off-by: Scott Lee <[email protected]>

Merge branch 'master' into from-operators

4f59fa6

Signed-off-by: Scott Lee <[email protected]>

c21 assigned clarkzinzow Mar 6, 2023

scottjlee added 4 commits March 6, 2023 14:16

add FromDask operator

43c9ff7

Signed-off-by: Scott Lee <[email protected]>

add FromModin operator

3b6a6a4

Signed-off-by: Scott Lee <[email protected]>

add untested FromMARS operator

9eb97a4

Signed-off-by: Scott Lee <[email protected]>

add numpy operators

9de3941

Signed-off-by: Scott Lee <[email protected]>

scottjlee changed the title ~~[Dataset] Add FromItems, FromPandasRef, and FromPandas operators~~ [Dataset] Add FromXXX operators Mar 7, 2023

Merge branch 'master' into from-operators

0235a84

Signed-off-by: Scott Lee <[email protected]>

scottjlee marked this pull request as draft March 7, 2023 22:35

add FromArrowRefs and FromArrow operators

4fdfddd

Signed-off-by: Scott Lee <[email protected]>

scottjlee added 3 commits April 5, 2023 14:04

Merge branch 'master' into from-operators

5b4a9fc

Signed-off-by: Scott Lee <[email protected]>

add FromMARS op to whitelist

90db6a3

Signed-off-by: Scott Lee <[email protected]>

Merge branch 'master' into from-operators

f23565d

Signed-off-by: Scott Lee <[email protected]>

scottjlee added the tests-ok label and removed the @author-action-required label Apr 6, 2023

scottjlee assigned ericl Apr 6, 2023

ericl reviewed Apr 6, 2023


c21 assigned jianoaix and unassigned clarkzinzow Apr 6, 2023

c21 reviewed Apr 6, 2023


scottjlee added 5 commits April 6, 2023 12:14

comments part 1

1797970

Signed-off-by: Scott Lee <[email protected]>

Merge branch 'master' into from-operators

2452c6d

Signed-off-by: Scott Lee <[email protected]>

test progress

4971016

Signed-off-by: Scott Lee <[email protected]>

Merge branch 'master' into from-operators

5631dd5

Signed-off-by: Scott Lee <[email protected]>

update tests with take_all() to trigger new backend execution

da6cdf7

Signed-off-by: Scott Lee <[email protected]>

c21 reviewed Apr 7, 2023


remove logical_op from public methods

468d4fe

Signed-off-by: Scott Lee <[email protected]>

c21 approved these changes Apr 8, 2023


ericl approved these changes Apr 8, 2023


ericl merged commit 82434e2 into ray-project:master Apr 8, 2023

scottjlee mentioned this pull request Apr 10, 2023

[Dataset] Fix breaking Data CI tests #34195

Merged


elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023

[Dataset] Add FromXXX operators (ray-project#32959)

c3d4358

Signed-off-by: elliottower <[email protected]>

ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023

[Dataset] Add FromXXX operators (ray-project#32959)

5f31e20

Signed-off-by: Jack He <[email protected]>
