Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[data] fix nested ragged ndarray #44236

Merged
merged 7 commits into from
Mar 26, 2024

Conversation

raulchen
Copy link
Contributor

@raulchen raulchen commented Mar 22, 2024

Why are these changes needed?

Currently we support single level of ragged ndarray, i.e, shape of each row is different. But "nested ragged ndarray" isn't supported. I.e, when a row already contains a ragged ndarray, the following error will occur.

Repro:

import ray

def f(row):
    return {"result": [[], [1, 2]]}

ray.data.range(1).map(f).materialize()

Error:

ray.exceptions.RayTaskError(ArrowNotImplementedError): ray::Map(f)() (pid=39856, ip=127.0.0.1)
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/operators/map_operator.py", line 419, in _map_task
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/operators/map_transformer.py", line 393, in __call__
    add_fn(data)
  File "/Users/chenh/code/ray/python/ray/data/_internal/output_buffer.py", line 43, in add
    self._buffer.add(item)
  File "/Users/chenh/code/ray/python/ray/data/_internal/delegating_block_builder.py", line 24, in add
    check.build()
  File "/Users/chenh/code/ray/python/ray/data/_internal/table_block.py", line 128, in build
    tables = [self._table_from_pydict(columns)]
  File "/Users/chenh/code/ray/python/ray/data/_internal/arrow_block.py", line 143, in _table_from_pydict
    columns[col_name] = ArrowTensorArray.from_numpy(col, col_name)
  File "/Users/chenh/code/ray/python/ray/air/util/tensor_extensions/arrow.py", line 333, in from_numpy
    return ArrowVariableShapedTensorArray.from_numpy(arr)
  File "/Users/chenh/code/ray/python/ray/air/util/tensor_extensions/arrow.py", line 789, in from_numpy
    pa_dtype = pa.from_numpy_dtype(dtype)
  File "pyarrow/types.pxi", line 5140, in pyarrow.lib.from_numpy_dtype
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported numpy type 17

Related issue number

Fixes #44235, #41078, #44062

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Hao Chen <[email protected]>
return True
if np.isscalar(udf_return_col[0]):
return True
return is_scalar_list(udf_return_col[0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's meant by "scalar list" here? I expect it to mean a list[float] like [1, 2, 3] and not a nested arrays like [[1], [2]], but with this implementation nested lists would be considered scalar lists.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we only need to check udf_return_col[0]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bveeramani @c21 I've updated the PR with a new way to fix. I.e, do not convert each individual element, as long as the whole list is a nested list.

python/ray/data/_internal/numpy_support.py Show resolved Hide resolved
Signed-off-by: Hao Chen <[email protected]>
Signed-off-by: Hao Chen <[email protected]>
Signed-off-by: Hao Chen <[email protected]>
Signed-off-by: Hao Chen <[email protected]>
Signed-off-by: Hao Chen <[email protected]>
# scalar lists though, since those can be represented as pyarrow list type
# without needing to go through our tensor extension.
if all(
is_valid_udf_return(e) and not is_scalar_list(e) for e in udf_return_col
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think is_scalar_list is used anywhere other than here. Let's remove the function?

Signed-off-by: Hao Chen <[email protected]>
@raulchen raulchen merged commit 459edae into ray-project:master Mar 26, 2024
5 checks passed
@raulchen raulchen deleted the fix-nested-ragged-array branch March 26, 2024 01:23
stephanie-wang pushed a commit to stephanie-wang/ray that referenced this pull request Mar 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Data] Can't return ragged list-of-lists from my transformation
4 participants