[data] fix nested ragged ndarray #44236

raulchen · 2024-03-22T01:46:02Z

Why are these changes needed?

Currently we support a single level of ragged ndarray, i.e., the shape of each row can differ. But a "nested ragged ndarray" isn't supported: when a row itself already contains a ragged ndarray, the following error occurs.

Repro:

import ray

def f(row):
    return {"result": [[], [1, 2]]}

ray.data.range(1).map(f).materialize()

Error:

ray.exceptions.RayTaskError(ArrowNotImplementedError): ray::Map(f)() (pid=39856, ip=127.0.0.1)
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/operators/map_operator.py", line 419, in _map_task
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/Users/chenh/code/ray/python/ray/data/_internal/execution/operators/map_transformer.py", line 393, in __call__
    add_fn(data)
  File "/Users/chenh/code/ray/python/ray/data/_internal/output_buffer.py", line 43, in add
    self._buffer.add(item)
  File "/Users/chenh/code/ray/python/ray/data/_internal/delegating_block_builder.py", line 24, in add
    check.build()
  File "/Users/chenh/code/ray/python/ray/data/_internal/table_block.py", line 128, in build
    tables = [self._table_from_pydict(columns)]
  File "/Users/chenh/code/ray/python/ray/data/_internal/arrow_block.py", line 143, in _table_from_pydict
    columns[col_name] = ArrowTensorArray.from_numpy(col, col_name)
  File "/Users/chenh/code/ray/python/ray/air/util/tensor_extensions/arrow.py", line 333, in from_numpy
    return ArrowVariableShapedTensorArray.from_numpy(arr)
  File "/Users/chenh/code/ray/python/ray/air/util/tensor_extensions/arrow.py", line 789, in from_numpy
    pa_dtype = pa.from_numpy_dtype(dtype)
  File "pyarrow/types.pxi", line 5140, in pyarrow.lib.from_numpy_dtype
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported numpy type 17
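
To make the failing case concrete, here is a minimal sketch in plain Python (no Ray or NumPy calls) of what "ragged" means for the value returned by f in the repro. The helper names inner_lengths and is_ragged are hypothetical and not part of Ray; they only illustrate that the sub-lists [] and [1, 2] have different lengths, so the value cannot be laid out as a regular rectangular array.

```python
from typing import Any, List


def inner_lengths(value: List[Any]) -> List[int]:
    """Return the length of each list element of `value` (non-lists are skipped)."""
    return [len(e) for e in value if isinstance(e, list)]


def is_ragged(value: List[Any]) -> bool:
    """True if `value` holds sub-lists whose lengths differ."""
    return len(set(inner_lengths(value))) > 1


row_value = [[], [1, 2]]             # the value returned by f() in the repro
print(inner_lengths(row_value))      # [0, 2]
print(is_ragged(row_value))          # True
print(is_ragged([[1, 2], [3, 4]]))   # False
```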

Related issue number

Fixes #44235, #41078, #44062

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Hao Chen <[email protected]>

bveeramani · 2024-03-22T18:46:17Z

python/ray/data/_internal/numpy_support.py

+        return True
+    if np.isscalar(udf_return_col[0]):
+        return True
+    return is_scalar_list(udf_return_col[0])


What's meant by "scalar list" here? I expect it to mean a list[float] like [1, 2, 3] and not a nested array like [[1], [2]], but with this implementation nested lists would be considered scalar lists.

Why do we only need to check udf_return_col[0]?
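
The two questions above can be illustrated with a small sketch. first_element_scalar_check is a hypothetical reconstruction of the recursive check in the diff excerpt: the first branch of the function is elided there, so the empty-list base case below is an assumption, and a tuple of built-in types stands in for np.isscalar so that the sketch has no dependencies. is_flat_scalar_list is likewise hypothetical and shows the stricter meaning the reviewer describes.

```python
from typing import Any, List

# Stand-in for np.isscalar (assumption), limited to built-in scalar types.
SCALAR_TYPES = (bool, int, float, complex, str, bytes)


def first_element_scalar_check(col: List[Any]) -> bool:
    """Hypothetical reconstruction: only element 0 is inspected, recursively."""
    if len(col) == 0:
        # Assumed base case; the real first branch is not shown in the excerpt.
        return True
    if isinstance(col[0], SCALAR_TYPES):
        return True
    return first_element_scalar_check(col[0])


def is_flat_scalar_list(col: List[Any]) -> bool:
    """Stricter reading: every element must itself be a scalar."""
    return all(isinstance(e, SCALAR_TYPES) for e in col)


print(first_element_scalar_check([1, 2, 3]))    # True
print(first_element_scalar_check([[1], [2]]))   # True, although the list is nested
print(first_element_scalar_check([1, [2, 3]]))  # True, later elements are never checked
print(is_flat_scalar_list([1, 2, 3]))           # True
print(is_flat_scalar_list([[1], [2]]))          # False
print(is_flat_scalar_list([1, [2, 3]]))         # False
```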

@bveeramani @c21 I've updated the PR with a new fix: if the whole list is a nested list, we no longer convert each individual element.
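
One possible reading of this approach, as a hedged sketch rather than the code that was merged: check the column as a whole for nested lists first, and only run the per-element conversion when none is found. contains_nested_list, convert_column, and the convert_element callback are hypothetical names introduced for illustration; how the untouched nested list is handled afterwards is outside this sketch.

```python
from typing import Any, Callable, List


def contains_nested_list(col: List[Any]) -> bool:
    """True if any element of the column is itself a list."""
    return any(isinstance(e, list) for e in col)


def convert_column(col: List[Any], convert_element: Callable[[Any], Any]) -> List[Any]:
    """Skip per-element conversion when the column is a nested list."""
    if contains_nested_list(col):
        return col  # left as-is for later handling (assumption)
    return [convert_element(e) for e in col]


print(convert_column([1, 2, 3], float))      # [1.0, 2.0, 3.0]
print(convert_column([[], [1, 2]], tuple))   # [[], [1, 2]]
```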

python/ray/data/_internal/numpy_support.py

Signed-off-by: Hao Chen <[email protected]>

bveeramani · 2024-03-25T18:38:03Z

python/ray/data/_internal/numpy_support.py

-            # scalar lists though, since those can be represented as pyarrow list type
-            # without needing to go through our tensor extension.
-            if all(
-                is_valid_udf_return(e) and not is_scalar_list(e) for e in udf_return_col


Don't think is_scalar_list is used anywhere other than here. Let's remove the function?

Signed-off-by: Hao Chen <[email protected]>

--------- Signed-off-by: Hao Chen <[email protected]>

fix

bd6b9fd

Signed-off-by: Hao Chen <[email protected]>

raulchen requested review from ericl, scv119, c21, amogkam, scottjlee, bveeramani, stephanie-wang and omatthew98 as code owners March 22, 2024 01:46

scottjlee approved these changes Mar 22, 2024

bveeramani reviewed Mar 22, 2024

aslonnie mentioned this pull request Mar 22, 2024

[data] fix nested ragged ndarray #44248

Closed

raulchen added 5 commits March 22, 2024 15:49

test

0638bf1

Signed-off-by: Hao Chen <[email protected]>

fix

7552f28

Signed-off-by: Hao Chen <[email protected]>

update test

22b51c7

Signed-off-by: Hao Chen <[email protected]>

lint

fa54a40

Signed-off-by: Hao Chen <[email protected]>

refine

897ba92

Signed-off-by: Hao Chen <[email protected]>

bveeramani approved these changes Mar 25, 2024

remove dead code

1150859

Signed-off-by: Hao Chen <[email protected]>

raulchen merged commit 459edae into ray-project:master Mar 26, 2024
5 checks passed

raulchen deleted the fix-nested-ragged-array branch March 26, 2024 01:23

stephanie-wang pushed a commit to stephanie-wang/ray that referenced this pull request Mar 27, 2024

[data] fix nested ragged ndarray (ray-project#44236)

c59e02e

--------- Signed-off-by: Hao Chen <[email protected]>

raulchen commented Mar 22, 2024 • edited by scottjlee

bveeramani Mar 22, 2024

c21 Mar 22, 2024

raulchen Mar 24, 2024

bveeramani Mar 25, 2024
