[Data] Ray Data doesn't cast some list outputs to ndarrays #35340

Closed
bveeramani opened this issue May 15, 2023 · 1 comment · Fixed by #35359
Labels
bug: Something that is supposed to be working; but isn't
P1: Issue that should be fixed within a few weeks

Comments

bveeramani commented May 15, 2023

What happened + What you expected to happen

I returned a list[list[dict[str, str]]] from my UDF. I expected Ray Data to implicitly convert my output to an ndarray, but I got an error instead.

If I explicitly cast my output to an array with create_ragged_ndarray, I don't get an error.

ray.exceptions.RayTaskError(ValueError): ray::MapBatches(HuggingFacePredictor)() (pid=67876, ip=127.0.0.1)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/execution/operators/actor_pool_map_operator.py", line 386, in submit
    yield from _map_task(fn, ctx, *blocks)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/execution/operators/map_operator.py", line 389, in _map_task
    for b_out in fn(iter(blocks), ctx):
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/execution/legacy_compat.py", line 311, in do_map
    yield from block_fn(blocks, ctx, *fn_args, **fn_kwargs)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/planner/map_batches.py", line 109, in fn
    yield from process_next_batch(batch)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/planner/map_batches.py", line 97, in process_next_batch
    raise e from None
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/planner/map_batches.py", line 78, in process_next_batch
    output_buffer.add_batch(b)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/output_buffer.py", line 50, in add_batch
    self._buffer.add_batch(batch)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/delegating_block_builder.py", line 51, in add_batch
    block = BlockAccessor.batch_to_block(batch)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/block.py", line 436, in batch_to_block
    return ArrowBlockAccessor.numpy_to_block(
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/arrow_block.py", line 184, in numpy_to_block
    col = ArrowTensorArray.from_numpy(col)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/air/util/tensor_extensions/arrow.py", line 312, in from_numpy
    return ArrowVariableShapedTensorArray.from_numpy(arr)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/air/util/tensor_extensions/arrow.py", line 721, in from_numpy
    raise ValueError(
ValueError: ArrowVariableShapedTensorArray only supports heterogeneous-shaped tensor collections, not arbitrarily nested ragged tensors. Got arrays: [('dtype=object', 'shape=(1,)'), ('dtype=object', 'shape=(1,)')]
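The per-element arrays listed at the end of the message can be reproduced with NumPy alone. This is only a minimal sketch, not Ray's actual conversion path: `model_out` is a hand-written stand-in for the pipeline output, and the `generated_text` key and the strings in it are assumptions made for illustration.

```python
import numpy as np

# Hypothetical stand-in for the UDF output: list[list[dict[str, str]]].
model_out = [
    [{"generated_text": "Complete this sentence"}],
    [{"generated_text": "for me please"}],
]

# Converting each row on its own yields one object-dtype array per row,
# because a dict has no numeric or string dtype NumPy can use.
rows = [np.asarray(row, dtype=object) for row in model_out]

# Summarize each row the same way the error message does.
summary = [(f"dtype={r.dtype}", f"shape={r.shape}") for r in rows]
print(summary)
```

Each row ends up as an object-dtype array, so the column looks like a collection of nested object arrays rather than a collection of numeric tensors with differing shapes.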

Versions / Dependencies

Ray: 21e9d38

Reproduction script

import ray
import numpy as np
from typing import Dict

ds = ray.data.from_numpy(np.asarray(["Complete this", "for me"]))

class HuggingFacePredictor:
    def __init__(self):
        from transformers import pipeline
        self.model = pipeline("text-generation", model="gpt2")

    def __call__(self, batch: Dict[str, np.ndarray]):
        model_out = self.model(list(batch["data"]), max_length=20, num_return_sequences=1)
        batch["output"] = model_out
        return batch

scale = ray.data.ActorPoolStrategy(size=2)
predictions = ds.map_batches(HuggingFacePredictor, compute=scale)
predictions.show(limit=1)

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@bveeramani bveeramani added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 15, 2023
@bveeramani bveeramani added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 15, 2023
@bveeramani bveeramani changed the title from "[Data] Ray Data doesn't cast" to "[Data] Ray Data doesn't cast some list outputs to ndarrays" May 15, 2023
@bveeramani
Member Author

This code works fine:

from typing import Dict

import numpy as np

import ray
from ray.data.extensions.tensor_extension import create_ragged_ndarray

ds = ray.data.from_numpy(np.asarray(["Complete this", "for me"]))

class HuggingFacePredictor:
    def __init__(self):
        from transformers import pipeline
        self.model = pipeline("text-generation", model="gpt2")

    def __call__(self, batch: Dict[str, np.ndarray]):
        model_out = self.model(list(batch["data"]), max_length=20, num_return_sequences=1)
        batch["output"] = create_ragged_ndarray(model_out)
        return batch

scale = ray.data.ActorPoolStrategy(size=2)
predictions = ds.map_batches(HuggingFacePredictor, compute=scale)
predictions.show(limit=1)
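A rough NumPy-only illustration of why the explicit cast helps. `to_object_column` is a hypothetical helper written for this sketch, not Ray's `create_ragged_ndarray`, and `model_out` is an assumed stand-in for the pipeline output. The idea is to put each item into one slot of a 1-D object array so that NumPy never tries to infer a nested shape.

```python
from typing import Any, Sequence

import numpy as np


def to_object_column(values: Sequence[Any]) -> np.ndarray:
    """Hypothetical helper: store each item as-is in a 1-D object array."""
    column = np.empty(len(values), dtype=object)
    for i, value in enumerate(values):
        # Integer indexing stores the Python object itself in the slot.
        column[i] = value
    return column


# Assumed stand-in for the pipeline output: list[list[dict[str, str]]].
model_out = [
    [{"generated_text": "Complete this sentence"}],
    [{"generated_text": "for me please"}],
]

column = to_object_column(model_out)             # one slot per input row
inferred = np.asarray(model_out, dtype=object)   # NumPy infers a 2-D shape

print(column.shape, inferred.shape)
```

With the explicit packing, the column has one opaque Python object per row, whereas letting NumPy infer the shape adds an extra dimension for the inner lists.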

2 participants