
[Data] map_batches fails on multiple calls on data with nested lists #39559

Closed
keerthanvasist opened this issue Sep 11, 2023 · 8 comments · May be fixed by #39869
Labels: bug (Something that is supposed to be working; but isn't), data (Ray Data-related issues), P1 (Issue that should be fixed within a few weeks), ray 2.10

Comments

keerthanvasist commented Sep 11, 2023

What happened + What you expected to happen

I have data that has nested lists. A simple reproducible example is shown below. It is an extension of the example in the Ray documentation page for map_batches.

Let us consider this example.

import numpy as np
import ray
ds = ray.data.from_items([
    {"name": "Luna", "age": 4, "nicknames": ["Looney", "Loona"]},
    {"name": "Rory", "age": 14, "nicknames": ["Rorey"]},
    {"name": "Scout", "age": 9, "nicknames": ["Scoot"]},
])

ds.show()

{'name': 'Luna', 'age': 4, 'nicknames': ['Looney', 'Loona']}
{'name': 'Rory', 'age': 14, 'nicknames': ['Rorey']}
{'name': 'Scout', 'age': 9, 'nicknames': ['Scoot']}

When I apply this identity function to it:

def identity(batch):
    return batch

ds.map_batches(identity).show()
2023-09-11 13:53:54,565	INFO streaming_executor.py:92 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(identity)]
2023-09-11 13:53:54,565	INFO streaming_executor.py:93 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-09-11 13:53:54,565	INFO streaming_executor.py:95 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

{'name': 'Luna', 'age': 4, 'nicknames': array(['Looney', 'Loona'], dtype=object)}
{'name': 'Rory', 'age': 14, 'nicknames': 'Rorey'}
{'name': 'Scout', 'age': 9, 'nicknames': 'Scoot'}

Now nicknames has a different type in each row: a numpy array for Luna, but plain strings for Rory and Scout. Somehow no type check catches this. That is already problematic, but it gets worse.
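To confirm the type drift, a small check like the following can be used (just a sketch; take() returns plain Python dicts, 20 rows by default):

for row in ds.map_batches(identity).take():
    print(type(row["nicknames"]), row["nicknames"])

On my setup this prints a numpy.ndarray for Luna and plain str values for the other rows, matching the show() output above.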

When I run two successive map_batches calls on it, even without performing any mutations (just the identity function below), it throws an exception.

ray.exceptions.RayTaskError(ValueError): ray::MapBatches(identity)->MapBatches(identity)() (pid=75980, ip=127.0.0.1)
  File "/Users/kvasist/opt/miniconda3/envs/XXXX/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 415, in _map_task
    for b_out in fn(iter(blocks), ctx):
  File "/Users/kvasist/opt/miniconda3/envs/XXXX/lib/python3.10/site-packages/ray/data/_internal/planner/plan_udf_map_op.py", line 76, in do_map
    yield from transform_fn(blocks, ctx, *fn_args, **fn_kwargs)
  File "/Users/kvasist/opt/miniconda3/envs/XXXX/lib/python3.10/site-packages/ray/data/_internal/planner/map_batches.py", line 118, in fn
    yield from process_next_batch(batch)
  File "/Users/kvasist/opt/miniconda3/envs/XXXX/lib/python3.10/site-packages/ray/data/_internal/planner/map_batches.py", line 106, in process_next_batch
    raise e from None
  File "/Users/kvasist/opt/miniconda3/envs/XXXX/lib/python3.10/site-packages/ray/data/_internal/planner/map_batches.py", line 87, in process_next_batch
    output_buffer.add_batch(b)
  File "/Users/kvasist/opt/miniconda3/envs/XXXX/lib/python3.10/site-packages/ray/data/_internal/output_buffer.py", line 50, in add_batch
    self._buffer.add_batch(batch)
  File "/Users/kvasist/opt/miniconda3/envs/XXXX/lib/python3.10/site-packages/ray/data/_internal/delegating_block_builder.py", line 38, in add_batch
    block = BlockAccessor.batch_to_block(batch)
  File "/Users/kvasist/opt/miniconda3/envs/XXXX/lib/python3.10/site-packages/ray/data/block.py", line 397, in batch_to_block
    return pd.DataFrame(dict(batch))
  File "/Users/kvasist/opt/miniconda3/envs/XXXX/lib/python3.10/site-packages/pandas/core/frame.py", line 736, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "/Users/kvasist/opt/miniconda3/envs/XXXX/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 503, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "/Users/kvasist/opt/miniconda3/envs/XXXX/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 114, in arrays_to_mgr
    index = _extract_index(arrays)
  File "/Users/kvasist/opt/miniconda3/envs/XXXX/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 664, in _extract_index
    raise ValueError("Per-column arrays must each be 1-dimensional")
ValueError: Per-column arrays must each be 1-dimensional

The expected behavior is that the dataset is unchanged across any number of identity function applications.
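For context, the ValueError at the end of the traceback is a generic pandas constraint rather than anything Ray-specific: pd.DataFrame rejects a dict value whose ndarray has more than one dimension. A minimal pandas-only sketch that raises the same error (the column shape here is a hypothetical illustration, not necessarily the exact batch Ray builds internally):

import numpy as np
import pandas as pd

# pandas refuses dict values with ndim > 1 when constructing a DataFrame.
nicknames = np.array([["Looney", "Loona"], ["Rorey", "Rorey"], ["Scoot", "Scoot"]])  # shape (3, 2)
pd.DataFrame({"name": ["Luna", "Rory", "Scout"], "nicknames": nicknames})
# ValueError: Per-column arrays must each be 1-dimensional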

Versions / Dependencies

Ray version: 2.6.3
Python: 3.10.12
Mac OS

Reproduction script

Simplest reproduction script:

import numpy as np
import ray
ds = ray.data.from_items([
    {"name": "Luna", "age": 4, "nicknames": ["Looney", "Loona"]},
    {"name": "Rory", "age": 14, "nicknames": ["Rorey"]},
    {"name": "Scout", "age": 9, "nicknames": ["Scoot"]},
])
def identity(batch):
    return batch
ds.map_batches(identity).map_batches(identity).show()

Issue Severity

High: It blocks me from completing my task.

keerthanvasist added the bug and triage labels on Sep 11, 2023
keerthanvasist (Author)

@raulchen Can you please take a look?

raulchen added the P1 and ray 2.8 labels and removed the triage label on Sep 13, 2023
raulchen (Contributor)

Confirmed it's a bug. We should fix it in the next release.

keerthanvasist (Author)

Thanks! Is there a timeline for the next release that we should be aware of?

michaelhly (Contributor) commented Sep 26, 2023

@keerthanvasist @raulchen would this be a valid output?

{'name': 'Scout', 'age': 9, 'nicknames': 'Scoot'}
{'name': 'Luna', 'age': 4, 'nicknames': 'Looney,Loona'}
{'name': 'Rory', 'age': 14, 'nicknames': 'Rorey'}
{'name': 'Scout', 'age': 9, 'nicknames': 'Scoot'}

keerthanvasist (Author)

I would say it's not. I will take a look at the CR to try to understand the engineering constraints, though.

michaelhly (Contributor)

@keerthanvasist forgive my noobness, but what if all the nicknames are typed as follows:

{'name': 'Luna', 'age': 4, 'nicknames': array(['Looney', 'Loona'], dtype='<U6')}
{'name': 'Rory', 'age': 14, 'nicknames': array(['Rorey'], dtype='<U5')}
{'name': 'Scout', 'age': 9, 'nicknames': array(['Scoot'], dtype='<U5')}
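For concreteness, a rough sketch of that representation built directly (purely illustrative; I have not verified which block format Ray would actually choose here): each row keeps its own numpy array, and pandas stores them as a 1-D object column without complaint.

import numpy as np
import pandas as pd

rows = [
    {"name": "Luna", "age": 4, "nicknames": np.array(["Looney", "Loona"])},
    {"name": "Rory", "age": 14, "nicknames": np.array(["Rorey"])},
    {"name": "Scout", "age": 9, "nicknames": np.array(["Scoot"])},
]
df = pd.DataFrame(rows)
print(df["nicknames"].dtype)               # object
print([v.dtype for v in df["nicknames"]])  # [dtype('<U6'), dtype('<U5'), dtype('<U5')]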

keerthanvasist (Author)

I am also new to Ray. I think this would be okay, but you would have to check with someone who has better context on the contracts for the different block types. Thanks for working on this!

anyscalesam added the serve and data labels and removed the serve label on Nov 1, 2023
anyscalesam added the ray 2.9 label and removed the ray 2.8 label on Nov 2, 2023
anyscalesam added the ray 2.10 label and removed the ray 2.9 label on Nov 13, 2023
bveeramani (Member)

Fixed by #45287
