[Dataset] Add FromXXX operators #32959
Conversation
from ray.data._internal.logical.operators.map_operator import AbstractMap


class FromItems(AbstractMap):
It's not really a map, right? Shouldn't this inherit from Read?
Yeah, you're correct, it's not a map here. Discussed with @c21 earlier this week, and I think the distinction we wanted between `read_xxx` and `from_xxx` was that `read` is done from an external source, while `from` reads from in-memory data. Should we still inherit from `Read`? As an alternative, we could instead inherit from the general parent `LogicalOperator` class.
Yeah, if it's hard to represent as a Read, it seems OK to inherit from `LogicalOperator` directly.
The fact that this is distinct from Read seems strange to me, especially since range() generates a Read operator. I don't have a strong objection, however.
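A minimal sketch of the direction settled on above: `FromItems` inheriting directly from `LogicalOperator` rather than from a map or read operator. The import path, constructor signature, and field names here are assumptions for illustration, not the PR's actual implementation.

```python
from typing import Any, List

# Assumed location of the LogicalOperator base class (internal Ray Data API).
from ray.data._internal.logical.interfaces import LogicalOperator


class FromItems(LogicalOperator):
    """Logical operator for `from_items`: an in-memory source, so it is
    neither a map nor a read and has no input dependencies."""

    def __init__(self, items: List[Any], parallelism: int = -1):
        # Assumed base-class signature: (name, input_dependencies).
        super().__init__("FromItems", [])
        self._items = items
        self._parallelism = parallelism
```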
Changed the title from "FromItems operator" to "FromItems, FromPandasRef, and FromPandas operators"
Changed the title from "FromItems, FromPandasRef, and FromPandas operators" to "FromXXX operators"
The failing tests look unrelated to me -- sorry for the long hiatus on this one; it should be ready for review now @ericl
        self._parallelism = parallelism


class FromTF(FromItems):
Does this create Arrow blocks or simple blocks? (I assume arrow?) I am wondering if it makes sense to extend the Arrow logical operator if that's the case.
Yeah, this creates Arrow blocks I believe:
>>> import ray; import tensorflow_datasets as tfds
>>> tf_dataset, _ = tfds.load('cifar10', split=["train", "test"])
>>> dataset = ray.data.from_tf(tf_dataset)
>>> blocks = dataset._plan.execute()
>>> ray.get(blocks._blocks[0])
pyarrow.Table
id: binary
image: extension<arrow.py_extension_type<ArrowTensorType>>
label: int64
Since the existing `from_tf` implementation relies on `from_items` over a NumPy iterator converted to a list, I was thinking this most closely reflects `FromItems`. Thoughts?
I think it's fine to refine `FromTF` to extend `FromNumpyRefs`, so it's more specific.
    def __init__(
        self,
        ndarrays: Union[List[ObjectRef["np.ndarray"]], List["np.ndarray"]],
Or perhaps `from_tf` / `from_torch` should extend this class instead?
Yeah, I think `from_tf` can extend this class. `from_torch` doesn't seem guaranteed to return NumPy ndarrays, so it looks okay to me for it to extend `FromItems`.
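A rough sketch of the split agreed on in this thread: `FromTF` becomes a more specific `FromNumpyRefs`, while `from_torch` keeps a `FromItems`-style operator. The module paths, constructor shapes, and the `op_name` parameter below are assumptions for illustration, not the actual Ray Data code.

```python
from typing import List, Union

import numpy as np

# Assumed internal locations of the base class and the ObjectRef type hint.
from ray.data._internal.logical.interfaces import LogicalOperator
from ray.types import ObjectRef


class FromNumpyRefs(LogicalOperator):
    """Logical operator for `from_numpy_refs`: holds in-memory ndarrays
    (or object refs to them), so it has no input dependencies."""

    def __init__(
        self,
        ndarrays: Union[List[ObjectRef[np.ndarray]], List[np.ndarray]],
        op_name: str = "FromNumpyRefs",
    ):
        super().__init__(op_name, [])
        self._ndarrays = ndarrays


class FromTF(FromNumpyRefs):
    """`from_tf` produces Arrow blocks built from NumPy data, so it can be
    expressed as the more specific FromNumpyRefs. The `from_tf` API function
    would convert the tf.data.Dataset into a list of ndarrays before
    constructing this operator; `from_torch`, which is not guaranteed to
    yield ndarrays, would keep a FromItems-style operator instead."""

    def __init__(self, ndarrays: List[np.ndarray]):
        super().__init__(ndarrays, op_name="FromTF")
```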
Thanks @scottjlee, the main logic looks pretty solid; I have some comments.
python/ray/data/_internal/logical/operators/from_pandas_operator.py (outdated; comment resolved)
python/ray/data/read_api.py (outdated)
)


@PublicAPI
def from_arrow(
-    tables: Union["pyarrow.Table", bytes, List[Union["pyarrow.Table", bytes]]]
+    tables: Union["pyarrow.Table", bytes, List[Union["pyarrow.Table", bytes]]],
+    logical_op: Optional[FromArrowRefs] = None,
ditto
LGTM. cc @ericl for any comments before merging, thanks.
- #32959 added a good number of tests without changing any timeouts, and as a result, some of the tests will occasionally time out, making the Data CI tests flaky. Therefore, we should increase the timeout for Bazel targets which recently received additional test cases.
- In addition, one of the failing tests, `test_from_huggingface_e2e`, was found to have a failure which was not caught in the original PR. `test_stats.test_dataset__repr__` is also sometimes flaky, so I added a fix for these tests.
- I also added a blank file, `python/ray/data/tests/block_batching/__init__.py`, which is needed to resolve a pytest error (non-unique test filename) for an existing test.

Signed-off-by: Scott Lee <[email protected]>
Why are these changes needed?
In the Ray Data Read API (`ray.data.read_api`), the `from_xxx` methods do not have logical operators. This PR implements the corresponding operators for these `from_xxx` methods.

Related issue number
Closes #32604

Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
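For context, a small usage sketch of the public `from_xxx` entry points that gain logical operators in this PR. This is ordinary Ray Data usage shown only to illustrate which APIs are covered; it assumes a local Ray installation with `pandas` and `pyarrow` available.

```python
import pandas as pd
import pyarrow as pa
import ray

# In-memory sources: the from_xxx APIs for which this PR adds logical
# operators (FromItems, FromPandas/FromPandasRef, FromArrowRefs, ...).
ds_items = ray.data.from_items([{"x": i} for i in range(8)])
ds_pandas = ray.data.from_pandas(pd.DataFrame({"a": [1, 2, 3]}))
ds_arrow = ray.data.from_arrow(pa.table({"b": [4, 5, 6]}))

print(ds_items.count(), ds_pandas.count(), ds_arrow.count())
```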