[Dataset] [DataFrame 2/n] Add pandas block format implementation (partial) #20988

kfstorm · 2021-12-09T13:21:24Z

Why are these changes needed?

This PR adds pandas block format support by implementing `PandasRow`, `PandasBlockBuilder`, and `PandasBlockAccessor`.

Note that `sort_and_partition`, `combine`, `merge_sorted_blocks`, and `aggregate_combined_blocks` in `PandasBlockAccessor` redirect to the Arrow block format implementation for now. They'll be implemented in a later PR.

Related issue number

#20719

Checks

- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

clarkzinzow

Mostly LGTM, some small comments and questions.

python/ray/data/impl/pandas_block.py

python/ray/data/read_api.py

clarkzinzow · 2021-12-31T23:07:44Z

python/ray/data/impl/simple_block.py

@@ -71,7 +71,7 @@ def random_shuffle(self, random_seed: Optional[int]) -> List[T]:

    def to_pandas(self) -> "pandas.DataFrame":
        import pandas
-        return pandas.DataFrame(self._items)
+        return pandas.DataFrame(self._items).rename(columns=str)


@ericl What if we applied the ray.data.range_arrow() and ray.data.range_tensor() semantics here, where a single column under the column name "value" is created? I.e.

Suggested change

-        return pandas.DataFrame(self._items).rename(columns=str)
+        return pandas.DataFrame({"value": self._items})

I think that this improves the consistency of how we do the list --> single-column table conversion, and then we don't have to worry about this issue.
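
To make the difference between the two conversions concrete, here is a minimal sketch using plain pandas on a list of scalars. `items` is a hypothetical stand-in for `self._items`, and nothing here depends on Ray.

```python
import pandas as pd

# Hypothetical stand-in for the block's self._items.
items = [1, 2, 3]

# Version in the diff: pandas assigns the integer label 0 to the single
# column, and rename(columns=str) turns that label into the string "0".
renamed = pd.DataFrame(items).rename(columns=str)

# Suggested alternative: one explicitly named "value" column.
named = pd.DataFrame({"value": items})

print(list(renamed.columns))
print(list(named.columns))
```

With `.rename(columns=str)` the column keeps pandas' positional label as a string, whereas the suggested form always produces a column named "value", regardless of what the items are.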

ericl · 2022-01-03T20:32:16Z

python/ray/data/tests/test_dataset.py

@@ -329,7 +329,7 @@ def test_batch_tensors(ray_start_regular_shared):
    with pytest.raises(pa.lib.ArrowInvalid):


Can we add tests that this is working properly with the flag on and off?

Test that after map_batches() with a UDF returning a pandas DF the _dataset_format is "pandas".

Test that after from_pandas() the _dataset_format is "pandas".

Also test that with the flag off, the format is "arrow".

kfstorm · 2022-01-12

@ericl I've updated the test code. Could you review it again?

ericl

Please address @clarkzinzow's comments and add a unit test covering the flag on/off behavior. After that, this looks good to merge!

clarkzinzow · 2022-01-12T21:14:19Z

python/ray/data/tests/test_dataset.py

-    assert values == rows
+@pytest.mark.parametrize("enable_pandas_block", [False, True])
+def test_from_pandas(ray_start_regular_shared, enable_pandas_block):
+    ctx = ray.data.context.DatasetContext.get_current()


Not in this PR obviously, but we should really look at letting DatasetContext be used as a context manager...
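
As a rough illustration of that idea, the sketch below wraps a settings object in a context manager that restores the previous values on exit. `FakeContext` and `override` are hypothetical names used only for illustration, and the `enable_pandas_block` field name is assumed from the test parameter above. None of this is Ray's actual API.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class FakeContext:
    """Hypothetical stand-in for DatasetContext; not Ray's class."""

    enable_pandas_block: bool = False  # assumed field name


@contextmanager
def override(ctx, **overrides):
    """Temporarily set attributes on ctx and restore the old values on exit."""
    saved = {name: getattr(ctx, name) for name in overrides}
    try:
        for name, value in overrides.items():
            setattr(ctx, name, value)
        yield ctx
    finally:
        # Restore even if the body raised, so tests cannot leak settings.
        for name, value in saved.items():
            setattr(ctx, name, value)


ctx = FakeContext()
with override(ctx, enable_pandas_block=True):
    inside = ctx.enable_pandas_block
after = ctx.enable_pandas_block
```

A parametrized test could then wrap its body in such a `with` block instead of saving and restoring the flag by hand.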

kfstorm · 2022-01-15T09:28:25Z

Test failures are unrelated.

ericl · 2022-01-26T02:07:53Z

Any update on getting this re-merged? Seems like a blocker for #21566

…lementation (partial) (#20988) (#21661)" (#21894) This PR adds pandas block format support by implementing `PandasRow`, `PandasBlockBuilder`, `PandasBlockAccessor`. Note that `sort_and_partition`, `combine`, `merge_sorted_blocks`, `aggregate_combined_blocks` in `PandasBlockAccessor` redirects to arrow block format implementation for now. They'll be implemented in a later PR.

…fies batch_format="native" (#21566) With the addition of #20988, the native format becomes ambiguous. This PR proposes to auto-promote arrow to pandas blocks when the user specifies "native" format, to avoid uncertainty.

kfstorm added 30 commits November 17, 2021 21:16

Partial commit

88f1288

partial commit

44a8574

partial commit

7150775

temp commit

519ece0

lint

b589b4c

Working example with 'from_pandas'

c3a1f2e

minor update

e9790af

Implement sample

bac9ebc

fix some tests

d89b4bd

remove _is_arrow_dataset

9201b52

Fallback to arrow block accessor

5a8edfc

add _enable_pandas_block

a42155e

fix

1292124

Add buildkite test

9122d67

simplify

b7e872b

lint

70b2b55

add column name type check

e988022

Address comments

c128490

Merge remote-tracking branch 'upstream/master' into ray_dataset_dataf…

94e9e9f

…rame

nit

4606ffc

address comments

81c6d0b

fix test_map_batches

c2addfa

Merge remote-tracking branch 'upstream/master' into ray_dataset_dataf…

759f6b5

…rame

fix tests

4f2d103

lint

4d2b9ec

Merge remote-tracking branch 'upstream/master' into ray_dataset_dataf…

96cdacf

…rame

fix after merge

8c70323

Revert pandas code changes

9b7b623

nit

44d17c0

Add pandas block format support

c252395

kfstorm added 3 commits December 23, 2021 19:59

address comments

a0176ca

Merge remote-tracking branch 'upstream/master' into ray_dataset_dataf…

d45da85

…rame_2

minor update

26f3a24

kfstorm mentioned this pull request Dec 30, 2021

[Datasets] [Pandas Block] Implement PandasBlockAccessor in pandas-native ways #21296

Closed

2 tasks

kfstorm removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 30, 2021

clarkzinzow reviewed Dec 31, 2021

ericl reviewed Jan 3, 2022

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 3, 2022

kfstorm added 2 commits January 12, 2022 17:09

address comments

2c4904c

lint

60b37cc

kfstorm removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 12, 2022

clarkzinzow reviewed Jan 12, 2022

ericl mentioned this pull request Jan 12, 2022

[data] Always convert arrow batches to pandas batches when user specifies batch_format="native" #21566

Merged

ericl approved these changes Jan 12, 2022

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 12, 2022

kfstorm added 3 commits January 13, 2022 14:23

Merge remote-tracking branch 'upstream/master' into ray_dataset_dataf…

6a41d74

…rame_2

fix datasets_train.py

05472b1

Merge remote-tracking branch 'upstream/master' into ray_dataset_dataf…

c005caa

…rame_2

kfstorm merged commit 4a55d10 into ray-project:master Jan 15, 2022

kfstorm deleted the ray_dataset_dataframe_2 branch January 15, 2022 09:28

jjyao added a commit to jjyao/ray that referenced this pull request Jan 18, 2022

Revert "[Dataset] [DataFrame 2/n] Add pandas block format implementat…

9cb08d6

…ion (partial) (ray-project#20988)" This reverts commit 4a55d10.

rkooo567 pushed a commit that referenced this pull request Jan 18, 2022

Revert "[Dataset] [DataFrame 2/n] Add pandas block format implementat…

fa5c167

…ion (partial) (#20988) (#21661) This reverts commit 4a55d10.

ericl added a commit to ericl/ray that referenced this pull request Jan 26, 2022

Revert "Revert "[Dataset] [DataFrame 2/n] Add pandas block format imp…

77e4326

…lementation (partial) (ray-project#20988) (ray-project#21661)" This reverts commit fa5c167.

kfstorm added a commit to alipay/ant-ray that referenced this pull request Jan 26, 2022

Revert "Revert "[Dataset] [DataFrame 2/n] Add pandas block format imp…

fa28cdd

…lementation (partial) (ray-project#20988) (ray-project#21661)" This reverts commit fa5c167.
