Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Dataset] [DataFrame 2/n] Add pandas block format implementation (partial) #20988

Merged
merged 43 commits into from
Jan 15, 2022

Conversation

kfstorm
Copy link
Member

@kfstorm kfstorm commented Dec 9, 2021

Why are these changes needed?

This PR adds pandas block format support by implementing PandasRow, PandasBlockBuilder, PandasBlockAccessor.

Note that sort_and_partition, combine, merge_sorted_blocks, aggregate_combined_blocks in PandasBlockAccessor redirects to arrow block format implementation for now. They'll be implemented in a later PR.

Related issue number

#20719

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@kfstorm kfstorm removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 30, 2021
Copy link
Contributor

@clarkzinzow clarkzinzow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM, some small comments and questions.

python/ray/data/impl/pandas_block.py Outdated Show resolved Hide resolved
python/ray/data/impl/pandas_block.py Outdated Show resolved Hide resolved
python/ray/data/impl/pandas_block.py Outdated Show resolved Hide resolved
python/ray/data/impl/pandas_block.py Outdated Show resolved Hide resolved
python/ray/data/impl/pandas_block.py Outdated Show resolved Hide resolved
python/ray/data/read_api.py Show resolved Hide resolved
@@ -71,7 +71,7 @@ def random_shuffle(self, random_seed: Optional[int]) -> List[T]:

def to_pandas(self) -> "pandas.DataFrame":
import pandas
return pandas.DataFrame(self._items)
return pandas.DataFrame(self._items).rename(columns=str)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericl What if we applied the ray.data.range_arrow() and ray.data.range_tensor() semantics here, where a single column under the column name "value" is created? I.e.

Suggested change
return pandas.DataFrame(self._items).rename(columns=str)
return pandas.DataFrame({"value": self._items})

I think that this improves the consistency of how we do the list --> single-column table conversion, and then we don't have to worry about this issue.

@@ -329,7 +329,7 @@ def test_batch_tensors(ray_start_regular_shared):
with pytest.raises(pa.lib.ArrowInvalid):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add tests that this is working properly with the flag on and off?

  • Test that after map_batches() with a UDF returning a pandas DF the _dataset_format is "pandas".
  • Test that after from_pandas() the _dataset_format is "pandas".

Also test that with the flag off, the format is "arrow".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericl I've updated the test code. Could you review it again?

Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address @clarkzinzow 's comments and adding a unit test with flag on-off behavior--- after that looks good to merge!

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 3, 2022
@kfstorm kfstorm removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 12, 2022
assert values == rows
@pytest.mark.parametrize("enable_pandas_block", [False, True])
def test_from_pandas(ray_start_regular_shared, enable_pandas_block):
ctx = ray.data.context.DatasetContext.get_current()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in this PR obviously, but we should really look at letting DatasetContext be used as a context manager...

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 12, 2022
@kfstorm
Copy link
Member Author

kfstorm commented Jan 15, 2022

Test failures are unrelated.

@kfstorm kfstorm merged commit 4a55d10 into ray-project:master Jan 15, 2022
@kfstorm kfstorm deleted the ray_dataset_dataframe_2 branch January 15, 2022 09:28
jjyao added a commit to jjyao/ray that referenced this pull request Jan 18, 2022
rkooo567 pushed a commit that referenced this pull request Jan 18, 2022
@ericl
Copy link
Contributor

ericl commented Jan 26, 2022

Any update on getting this re-merged? Seems like a blocker for #21566

ericl added a commit to ericl/ray that referenced this pull request Jan 26, 2022
kfstorm added a commit to alipay/ant-ray that referenced this pull request Jan 26, 2022
ericl pushed a commit that referenced this pull request Jan 31, 2022
…lementation (partial) (#20988) (#21661)" (#21894)

This PR adds pandas block format support by implementing `PandasRow`, `PandasBlockBuilder`, `PandasBlockAccessor`.

Note that `sort_and_partition`, `combine`, `merge_sorted_blocks`, `aggregate_combined_blocks` in `PandasBlockAccessor` redirects to arrow block format implementation for now. They'll be implemented in a later PR.
ericl added a commit that referenced this pull request Feb 2, 2022
…fies batch_format="native" (#21566)

With the addition of #20988, the native format becomes ambiguous. This PR proposes to auto-promote arrow to pandas blocks when the user specifies "native" format, to avoid uncertainty.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants