[Data] map_groups doesn't work with "numpy" batch format #30102

Closed
jiaodong opened this issue Nov 8, 2022 · 1 comment · Fixed by #30172
Labels: bug, data, P1


jiaodong commented Nov 8, 2022

What happened + What you expected to happen

It seems that a dataset with groupby followed by map_groups(..., batch_format="numpy") does not correctly produce Arrow blocks: the subsequent call receives a raw Python dictionary and fails because that is not a valid block type.

(_map_block_nosplit pid=11452) Traceback (most recent call last):
(_map_block_nosplit pid=11452)   File "python/ray/_raylet.pyx", line 830, in ray._raylet.execute_task
(_map_block_nosplit pid=11452)   File "python/ray/_raylet.pyx", line 834, in ray._raylet.execute_task
(_map_block_nosplit pid=11452)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/_internal/compute.py", line 484, in _map_block_nosplit
(_map_block_nosplit pid=11452)     for new_block in block_fn(blocks, *fn_args, **fn_kwargs):
(_map_block_nosplit pid=11452)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/dataset.py", line 582, in transform
(_map_block_nosplit pid=11452)     yield from process_next_batch(batch)
(_map_block_nosplit pid=11452)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/dataset.py", line 570, in process_next_batch
(_map_block_nosplit pid=11452)     batch = batch_fn(batch, *fn_args, **fn_kwargs)
(_map_block_nosplit pid=11452)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/grouped_dataset.py", line 330, in group_fn
(_map_block_nosplit pid=11452)     block_accessor = BlockAccessor.for_block(batch)
(_map_block_nosplit pid=11452)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/block.py", line 399, in for_block
(_map_block_nosplit pid=11452)     raise TypeError("Not a block type: {} ({})".format(block, type(block)))
(_map_block_nosplit pid=11452) TypeError: Not a block type: {'group': array([1, 1]), 'value': array([1, 2])} (<class 'dict'>)
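
The last frame suggests the failure mode: the accessor dispatches on the Python type of the block, and a dict of NumPy arrays is not a type it recognizes. The sketch below imitates that kind of check with a hypothetical `for_block_sketch` helper. The set of accepted types is an assumption made for illustration and is not Ray's actual list.

```python
import numpy as np

# Hypothetical stand-in for a type-dispatching accessor; NOT Ray's real code.
# Assumption: only these Python types count as "blocks" in this sketch.
ACCEPTED_BLOCK_TYPES = (list, bytes)


def for_block_sketch(block):
    """Return the block unchanged if its type is accepted, else raise TypeError."""
    if isinstance(block, ACCEPTED_BLOCK_TYPES):
        return block
    raise TypeError("Not a block type: {} ({})".format(block, type(block)))


# A "numpy"-format batch is a dict of column name -> ndarray, so it is rejected.
numpy_batch = {"group": np.array([1, 1]), "value": np.array([1, 2])}
try:
    for_block_sketch(numpy_batch)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

Under that assumption, any code path that hands the UDF's dict batch straight to the accessor without converting it back to a block first would hit this error.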

Versions / Dependencies

master

Reproduction script

import ray

# Four rows in two groups, keyed by "group".
ds = ray.data.from_items([
    {"group": 1, "value": 1},
    {"group": 1, "value": 2},
    {"group": 2, "value": 3},
    {"group": 2, "value": 4}
])

def udf(data):
    # With batch_format="numpy", data is expected to be a Dict[str, np.ndarray].
    print(data)
    return data

# Raises the TypeError shown in the traceback above.
ds.groupby("group").map_groups(udf, batch_format="numpy")

Issue Severity

Medium: It is a significant difficulty but I can work around it.

jiaodong added the bug, P1, data, and air labels Nov 8, 2022

jiaodong commented Nov 8, 2022

Changing it to batch_format="pyarrow" works for me. Here is the legacy docstring:



batch_format: Specify "default" to use the default block format
    (promotes Arrow to pandas), "pandas" to select
    ``pandas.DataFrame`` as the batch format,
    or "pyarrow" to select ``pyarrow.Table``.


As a result, however, the input batch passed to the UDF is a pyarrow.lib.Table rather than an ndarray / Dict[str, ndarray].

c21 self-assigned this Nov 8, 2022
clarkzinzow pushed a commit that referenced this issue Nov 14, 2022
…30172)

This fixes the issue found in #30102, where a user calls ds.groupby("key").map_groups(fn, batch_format="numpy"). To handle it, map_groups needs to convert correctly between block and batch.
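
The block/batch round trip described in the commit message can be illustrated in isolation. The following is a simplified sketch and not the code from #30172: it models a block as a plain list of row dicts, and the names `rows_to_numpy_batch`, `numpy_batch_to_rows`, and `map_groups_sketch` are hypothetical.

```python
import numpy as np


def rows_to_numpy_batch(rows):
    """Convert a list of row dicts (the "block") into a Dict[str, ndarray] batch."""
    columns = rows[0].keys()
    return {col: np.array([row[col] for row in rows]) for col in columns}


def numpy_batch_to_rows(batch):
    """Convert a Dict[str, ndarray] batch back into a list of row dicts."""
    columns = list(batch.keys())
    num_rows = len(batch[columns[0]])
    return [{col: batch[col][i].item() for col in columns} for i in range(num_rows)]


def map_groups_sketch(rows, key, fn):
    """Apply fn to each group as a numpy batch and return the combined rows."""
    sorted_rows = sorted(rows, key=lambda row: row[key])
    output = []
    start = 0
    for end in range(1, len(sorted_rows) + 1):
        # A group ends at the last row or where the key value changes.
        if end == len(sorted_rows) or sorted_rows[end][key] != sorted_rows[start][key]:
            batch = rows_to_numpy_batch(sorted_rows[start:end])  # block -> batch
            result = fn(batch)
            output.extend(numpy_batch_to_rows(result))  # batch -> block
            start = end
    return output


rows = [
    {"group": 1, "value": 1},
    {"group": 1, "value": 2},
    {"group": 2, "value": 3},
    {"group": 2, "value": 4},
]


def double_values(batch):
    # The UDF only ever sees Dict[str, ndarray], never the underlying block.
    return {"group": batch["group"], "value": batch["value"] * 2}


result = map_groups_sketch(rows, "group", double_values)
```

The point of the sketch is the symmetry: whatever format the UDF receives, its return value has to go through the inverse conversion before anything downstream treats it as a block.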
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this issue Dec 19, 2022
…ay-project#30172)
