
[Data] map_batches fails on multiple calls on data with nested lists #39559

Closed
keerthanvasist opened this issue Sep 11, 2023 · 8 comments · May be fixed by #39869
Labels: bug (Something that is supposed to be working; but isn't), data (Ray Data-related issues), P1 (Issue that should be fixed within a few weeks), ray 2.10

Comments

keerthanvasist commented Sep 11, 2023

What happened + What you expected to happen

I have data that has nested lists. A simple reproducible example is shown below. It is an extension of the example in the Ray documentation page for map_batches.

Let us consider this example.

import numpy as np
import ray
ds = ray.data.from_items([
    {"name": "Luna", "age": 4, "nicknames": ["Looney", "Loona"]},
    {"name": "Rory", "age": 14, "nicknames": ["Rorey"]},
    {"name": "Scout", "age": 9, "nicknames": ["Scoot"]},
])

ds.show()

{'name': 'Luna', 'age': 4, 'nicknames': ['Looney', 'Loona']}
{'name': 'Rory', 'age': 14, 'nicknames': ['Rorey']}
{'name': 'Scout', 'age': 9, 'nicknames': ['Scoot']}

When I apply this identity function to it:

def identity(batch):
    return batch

ds.map_batches(identity).show()
2023-09-11 13:53:54,565	INFO streaming_executor.py:92 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(identity)]
2023-09-11 13:53:54,565	INFO streaming_executor.py:93 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-09-11 13:53:54,565	INFO streaming_executor.py:95 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

{'name': 'Luna', 'age': 4, 'nicknames': array(['Looney', 'Loona'], dtype=object)}
{'name': 'Rory', 'age': 14, 'nicknames': 'Rorey'}
{'name': 'Scout', 'age': 9, 'nicknames': 'Scoot'}

Now nicknames has a different type in each row: a numpy array for Luna, but plain strings for Rory and Scout. Somehow no type check catches this. That is already problematic, but it gets worse.
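To confirm the type drift, a small check like the following can be used (just a sketch; take() returns plain Python dicts, 20 rows by default):

for row in ds.map_batches(identity).take():
    print(type(row["nicknames"]), row["nicknames"])

On my setup this prints a numpy.ndarray for Luna and plain str values for the other rows, matching the show() output above.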

When I run two successive map_batches calls on it, even without performing any mutations (just the identity function below), it throws an exception.

ray.exceptions.RayTaskError(ValueError): ray::MapBatches(identity)->MapBatches(identity)() (pid=75980, ip=127.0.0.1)
  File "/Users/kvasist/opt/miniconda3/envs/XXXX/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 415, in _map_task
    for b_out in fn(iter(blocks), ctx):
  File "/Users/kvasist/opt/miniconda3/envs/XXXX/lib/python3.10/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 76, in do_map
    yield from transform_fn(blocks, ctx, *fn_args, **fn_kwargs)
  File "/Users/kvasist/opt/miniconda3/envs/XXXX/lib/python3.10/site-packages/ray/data/_internal/planner/map_batches.py", line 118, in fn
    yield from process_next_batch(batch)
  File "/Users/kvasist/opt/miniconda3/envs/XXXX/lib/python3.10/site-packages/ray/data/_internal/planner/map_batches.py", line 106, in process_next_batch
    raise e from None
  File "/Users/kvasist/opt/miniconda3/envs/XXXX/lib/python3.10/site-packages/ray/data/_internal/planner/map_batches.py", line 87, in process_next_batch
    output_buffer.add_batch(b)
  File "/Users/kvasist/opt/miniconda3/envs/XXXX/lib/python3.10/site-packages/ray/data/_internal/output_buffer.py", line 50, in add_batch
    self._buffer.add_batch(batch)
  File "/Users/kvasist/opt/miniconda3/envs/XXXX/lib/python3.10/site-packages/ray/data/_internal/delegating_block_builder.py", line 38, in add_batch
    block = BlockAccessor.batch_to_block(batch)
  File "/Users/kvasist/opt/miniconda3/envs/XXXX/lib/python3.10/site-packages/ray/data/block.py", line 397, in batch_to_block
    return pd.DataFrame(dict(batch))
  File "/Users/kvasist/opt/miniconda3/envs/XXXX/lib/python3.10/site-packages/pandas/core/frame.py", line 736, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "/Users/kvasist/opt/miniconda3/envs/XXXX/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 503, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "/Users/kvasist/opt/miniconda3/envs/XXXX/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 114, in arrays_to_mgr
    index = _extract_index(arrays)
  File "/Users/kvasist/opt/miniconda3/envs/XXXX/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 664, in _extract_index
    raise ValueError("Per-column arrays must each be 1-dimensional")
ValueError: Per-column arrays must each be 1-dimensional

The expected behavior is that the dataset is unchanged across any number of identity function applications.
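For context, the ValueError at the end of the traceback is a generic pandas constraint rather than anything Ray-specific: pd.DataFrame rejects a dict value whose ndarray has more than one dimension. A minimal pandas-only sketch that raises the same error (the column shape here is a hypothetical illustration, not necessarily the exact batch Ray builds internally):

import numpy as np
import pandas as pd

# pandas refuses dict values with ndim > 1 when constructing a DataFrame.
nicknames = np.array([["Looney", "Loona"], ["Rorey", "Rorey"], ["Scoot", "Scoot"]])  # shape (3, 2)
pd.DataFrame({"name": ["Luna", "Rory", "Scout"], "nicknames": nicknames})
# ValueError: Per-column arrays must each be 1-dimensional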

Versions / Dependencies

Ray version: 2.6.3
Python: 3.10.12
Mac OS

Reproduction script

Simplest reproduction script:

import numpy as np
import ray
ds = ray.data.from_items([
    {"name": "Luna", "age": 4, "nicknames": ["Looney", "Loona"]},
    {"name": "Rory", "age": 14, "nicknames": ["Rorey"]},
    {"name": "Scout", "age": 9, "nicknames": ["Scoot"]},
])
def identity(batch):
    return batch
ds.map_batches(identity).map_batches(identity).show()

Issue Severity

High: It blocks me from completing my task.

keerthanvasist added the bug and triage labels on Sep 11, 2023
keerthanvasist (Author)

@raulchen Can you please take a look?

raulchen added the P1 and ray 2.8 labels and removed the triage label on Sep 13, 2023
raulchen (Contributor)

Confirmed it's a bug. We should fix it in the next release.

keerthanvasist (Author)

Thanks! Is there a timeline for the next release that we should be aware of?

michaelhly (Contributor) commented Sep 26, 2023

@keerthanvasist @raulchen would this be a valid output?

{'name': 'Scout', 'age': 9, 'nicknames': 'Scoot'}
{'name': 'Luna', 'age': 4, 'nicknames': 'Looney,Loona'}
{'name': 'Rory', 'age': 14, 'nicknames': 'Rorey'}
{'name': 'Scout', 'age': 9, 'nicknames': 'Scoot'}

keerthanvasist (Author)

I would say it's not. I will take a look at the CR to try to understand the engineering constraints, though.

michaelhly (Contributor)

@keerthanvasist forgive my noobness, but what if all the nicknames are typed as follows:

{'name': 'Luna', 'age': 4, 'nicknames': array(['Looney', 'Loona'], dtype='<U6')}
{'name': 'Rory', 'age': 14, 'nicknames': array(['Rorey'], dtype='<U5')}
{'name': 'Scout', 'age': 9, 'nicknames': array(['Scoot'], dtype='<U5')}
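For concreteness, a rough sketch of that representation built directly (purely illustrative; I have not verified which block format Ray would actually choose here): each row keeps its own numpy array, and pandas stores them as a 1-D object column without complaint.

import numpy as np
import pandas as pd

rows = [
    {"name": "Luna", "age": 4, "nicknames": np.array(["Looney", "Loona"])},
    {"name": "Rory", "age": 14, "nicknames": np.array(["Rorey"])},
    {"name": "Scout", "age": 9, "nicknames": np.array(["Scoot"])},
]
df = pd.DataFrame(rows)
print(df["nicknames"].dtype)               # object
print([v.dtype for v in df["nicknames"]])  # [dtype('<U6'), dtype('<U5'), dtype('<U5')]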

keerthanvasist (Author)

I am also new to Ray. I think this would be okay, but you would have to check with someone who has better context on the contracts for the different block types. Thanks for working on this!

anyscalesam added the serve and data labels and removed the serve label on Nov 1, 2023
anyscalesam added the ray 2.9 label and removed the ray 2.8 label on Nov 2, 2023
anyscalesam added the ray 2.10 label and removed the ray 2.9 label on Nov 13, 2023
bveeramani (Member)

Fixed by #45287
