-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Dataset] [DataFrame 2/n] Add pandas block format implementation (partial) #20988
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly LGTM, some small comments and questions.
python/ray/data/impl/simple_block.py
Outdated
@@ -71,7 +71,7 @@ def random_shuffle(self, random_seed: Optional[int]) -> List[T]: | |||
|
|||
def to_pandas(self) -> "pandas.DataFrame": | |||
import pandas | |||
return pandas.DataFrame(self._items) | |||
return pandas.DataFrame(self._items).rename(columns=str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ericl What if we applied the ray.data.range_arrow()
and ray.data.range_tensor()
semantics here, where a single column under the column name "value"
is created? I.e.
return pandas.DataFrame(self._items).rename(columns=str) | |
return pandas.DataFrame({"value": self._items}) |
I think that this improves the consistency of how we do the list --> single-column table conversion, and then we don't have to worry about this issue.
@@ -329,7 +329,7 @@ def test_batch_tensors(ray_start_regular_shared): | |||
with pytest.raises(pa.lib.ArrowInvalid): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add tests that this is working properly with the flag on and off?
- Test that after
map_batches()
with a UDF returning a pandas DF the _dataset_format is "pandas". - Test that after
from_pandas()
the _dataset_format is "pandas".
Also test that with the flag off, the format is "arrow".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ericl I've updated the test code. Could you review it again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please address @clarkzinzow 's comments and adding a unit test with flag on-off behavior--- after that looks good to merge!
assert values == rows | ||
@pytest.mark.parametrize("enable_pandas_block", [False, True]) | ||
def test_from_pandas(ray_start_regular_shared, enable_pandas_block): | ||
ctx = ray.data.context.DatasetContext.get_current() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in this PR obviously, but we should really look at letting DatasetContext
be used as a context manager...
Test failures are unrelated. |
…ion (partial) (ray-project#20988)" This reverts commit 4a55d10.
Any update on getting this re-merged? Seems like a blocker for #21566 |
…lementation (partial) (ray-project#20988) (ray-project#21661)" This reverts commit fa5c167.
…lementation (partial) (ray-project#20988) (ray-project#21661)" This reverts commit fa5c167.
…lementation (partial) (#20988) (#21661)" (#21894) This PR adds pandas block format support by implementing `PandasRow`, `PandasBlockBuilder`, `PandasBlockAccessor`. Note that `sort_and_partition`, `combine`, `merge_sorted_blocks`, `aggregate_combined_blocks` in `PandasBlockAccessor` redirects to arrow block format implementation for now. They'll be implemented in a later PR.
Why are these changes needed?
This PR adds pandas block format support by implementing
PandasRow
,PandasBlockBuilder
,PandasBlockAccessor
.Note that
sort_and_partition
,combine
,merge_sorted_blocks
,aggregate_combined_blocks
inPandasBlockAccessor
redirects to arrow block format implementation for now. They'll be implemented in a later PR.Related issue number
#20719
Checks
scripts/format.sh
to lint the changes in this PR.