[Datasets] [Bug] AIR dogfooding- Reading csv with ray.data fails, but works with pandas #23448

Closed
Tracked by #23449
amogkam opened this issue Mar 24, 2022 · 3 comments · Fixed by #24398
Assignees
Labels
bug Something that is supposed to be working; but isn't data Ray Data-related issues P2 Important issue, but not time-critical size:small
Milestone

Comments

@amogkam
Contributor

amogkam commented Mar 24, 2022

UCI_Credit_Card.csv

Reading the attached csv file with ray.data fails with the following error

/Users/kai/.pyenv/versions/3.7.7/bin/python /Users/kai/coding/ray/python/ray/ml/examples/seldon_example_cc.py
2022-03-23 16:18:14,529	INFO services.py:1462 -- View the Ray dashboard at http://127.0.0.1:8265
(remote_read pid=36776) 2022-03-23 16:18:18,972	INFO worker.py:449 -- Task failed with retryable exception: TaskID(32d950ec0ccf9d2affffffffffffffffffffffff01000000).
(remote_read pid=36776) Traceback (most recent call last):
(remote_read pid=36776)   File "python/ray/_raylet.pyx", line 663, in ray._raylet.execute_task
(remote_read pid=36776)   File "python/ray/_raylet.pyx", line 667, in ray._raylet.execute_task
(remote_read pid=36776)   File "/Users/kai/coding/ray/python/ray/data/read_api.py", line 250, in remote_read
(remote_read pid=36776)     block = task()
(remote_read pid=36776)   File "/Users/kai/coding/ray/python/ray/data/datasource/datasource.py", line 155, in __call__
(remote_read pid=36776)     for block in result:
(remote_read pid=36776)   File "/Users/kai/coding/ray/python/ray/data/datasource/file_based_datasource.py", line 316, in read_files
(remote_read pid=36776)     for data in read_stream(f, read_path, **reader_args):
(remote_read pid=36776)   File "/Users/kai/coding/ray/python/ray/data/datasource/csv_datasource.py", line 35, in _read_stream
(remote_read pid=36776)     batch = reader.read_next_batch()
(remote_read pid=36776)   File "pyarrow/ipc.pxi", line 543, in pyarrow.lib.RecordBatchReader.read_next_batch
(remote_read pid=36776)   File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
(remote_read pid=36776) pyarrow.lib.ArrowInvalid: In CSV column #12: CSV conversion error to int64: invalid value '1e+05'
(remote_read pid=36776) 2022-03-23 16:18:19,130	INFO worker.py:449 -- Task failed with retryable exception: TaskID(32d950ec0ccf9d2affffffffffffffffffffffff01000000).
(remote_read pid=36776) Traceback (most recent call last):
(remote_read pid=36776)   File "python/ray/_raylet.pyx", line 663, in ray._raylet.execute_task
(remote_read pid=36776)   File "python/ray/_raylet.pyx", line 667, in ray._raylet.execute_task
(remote_read pid=36776)   File "/Users/kai/coding/ray/python/ray/data/read_api.py", line 250, in remote_read
(remote_read pid=36776)     block = task()
(remote_read pid=36776)   File "/Users/kai/coding/ray/python/ray/data/datasource/datasource.py", line 155, in __call__
(remote_read pid=36776)     for block in result:
(remote_read pid=36776)   File "/Users/kai/coding/ray/python/ray/data/datasource/file_based_datasource.py", line 316, in read_files
(remote_read pid=36776)     for data in read_stream(f, read_path, **reader_args):
(remote_read pid=36776)   File "/Users/kai/coding/ray/python/ray/data/datasource/csv_datasource.py", line 35, in _read_stream
(remote_read pid=36776)     batch = reader.read_next_batch()
(remote_read pid=36776)   File "pyarrow/ipc.pxi", line 543, in pyarrow.lib.RecordBatchReader.read_next_batch
(remote_read pid=36776)   File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
(remote_read pid=36776) pyarrow.lib.ArrowInvalid: In CSV column #12: CSV conversion error to int64: invalid value '1e+05'
(pid=36765) 
(remote_read pid=36776) 2022-03-23 16:18:20,144	INFO worker.py:449 -- Task failed with retryable exception: TaskID(32d950ec0ccf9d2affffffffffffffffffffffff01000000).
(remote_read pid=36776) Traceback (most recent call last):
(remote_read pid=36776)   File "python/ray/_raylet.pyx", line 663, in ray._raylet.execute_task
(remote_read pid=36776)   File "python/ray/_raylet.pyx", line 667, in ray._raylet.execute_task
(remote_read pid=36776)   File "/Users/kai/coding/ray/python/ray/data/read_api.py", line 250, in remote_read
(remote_read pid=36776)     block = task()
(remote_read pid=36776)   File "/Users/kai/coding/ray/python/ray/data/datasource/datasource.py", line 155, in __call__
(remote_read pid=36776)     for block in result:
(remote_read pid=36776)   File "/Users/kai/coding/ray/python/ray/data/datasource/file_based_datasource.py", line 316, in read_files
(remote_read pid=36776)     for data in read_stream(f, read_path, **reader_args):
(remote_read pid=36776)   File "/Users/kai/coding/ray/python/ray/data/datasource/csv_datasource.py", line 35, in _read_stream
(remote_read pid=36776)     batch = reader.read_next_batch()
(remote_read pid=36776)   File "pyarrow/ipc.pxi", line 543, in pyarrow.lib.RecordBatchReader.read_next_batch
(remote_read pid=36776)   File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
(remote_read pid=36776) pyarrow.lib.ArrowInvalid: In CSV column #12: CSV conversion error to int64: invalid value '1e+05'
Traceback (most recent call last):
  File "/Users/kai/coding/ray/python/ray/ml/examples/seldon_example_cc.py", line 9, in <module>
    dataset = load_data()
  File "/Users/kai/coding/ray/python/ray/ml/examples/seldon_example_cc.py", line 5, in load_data
    return ray.data.read_csv(path)
  File "/Users/kai/coding/ray/python/ray/data/read_api.py", line 481, in read_csv
    **arrow_csv_args,
  File "/Users/kai/coding/ray/python/ray/data/read_api.py", line 286, in read_datasource
    block_list.ensure_schema_for_first_block()
  File "/Users/kai/coding/ray/python/ray/data/impl/block_list.py", line 197, in ensure_schema_for_first_block
    schema = ray.get(get_schema.remote(block))
  File "/Users/kai/coding/ray/python/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/Users/kai/coding/ray/python/ray/worker.py", line 1809, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ArrowInvalid): ray::_get_schema() (pid=36776, ip=127.0.0.1)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::remote_read() (pid=36776, ip=127.0.0.1)
  File "/Users/kai/coding/ray/python/ray/data/read_api.py", line 250, in remote_read
    block = task()
  File "/Users/kai/coding/ray/python/ray/data/datasource/datasource.py", line 155, in __call__
    for block in result:
  File "/Users/kai/coding/ray/python/ray/data/datasource/file_based_datasource.py", line 316, in read_files
    for data in read_stream(f, read_path, **reader_args):
  File "/Users/kai/coding/ray/python/ray/data/datasource/csv_datasource.py", line 35, in _read_stream
    batch = reader.read_next_batch()
  File "pyarrow/ipc.pxi", line 543, in pyarrow.lib.RecordBatchReader.read_next_batch
  File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: In CSV column #12: CSV conversion error to int64: invalid value '1e+05'
(remote_read pid=36776) 2022-03-23 16:18:21,126	INFO worker.py:449 -- Task failed with retryable exception: TaskID(32d950ec0ccf9d2affffffffffffffffffffffff01000000).
(remote_read pid=36776) Traceback (most recent call last):
(remote_read pid=36776)   File "python/ray/_raylet.pyx", line 663, in ray._raylet.execute_task
(remote_read pid=36776)   File "python/ray/_raylet.pyx", line 667, in ray._raylet.execute_task
(remote_read pid=36776)   File "/Users/kai/coding/ray/python/ray/data/read_api.py", line 250, in remote_read
(remote_read pid=36776)     block = task()
(remote_read pid=36776)   File "/Users/kai/coding/ray/python/ray/data/datasource/datasource.py", line 155, in __call__
(remote_read pid=36776)     for block in result:
(remote_read pid=36776)   File "/Users/kai/coding/ray/python/ray/data/datasource/file_based_datasource.py", line 316, in read_files
(remote_read pid=36776)     for data in read_stream(f, read_path, **reader_args):
(remote_read pid=36776)   File "/Users/kai/coding/ray/python/ray/data/datasource/csv_datasource.py", line 35, in _read_stream
(remote_read pid=36776)     batch = reader.read_next_batch()
(remote_read pid=36776)   File "pyarrow/ipc.pxi", line 543, in pyarrow.lib.RecordBatchReader.read_next_batch
(remote_read pid=36776)   File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
(remote_read pid=36776) pyarrow.lib.ArrowInvalid: In CSV column #12: CSV conversion error to int64: invalid value '1e+05'

Pandas is able to successfully read the csv file.

Workarounds exist, so this is not fully blocking (for example, reading with Dask on Ray and then converting to Ray Datasets), but the failure is persistent with this particular dataset and awkward to work around.

@amogkam amogkam added this to the Ray AIR milestone Mar 24, 2022
@amogkam amogkam added the P1 Issue that should be fixed within a few weeks label Mar 24, 2022
@clarkzinzow clarkzinzow changed the title [air] [data] Dogfooding- Reading csv with ray.data fails, but works with pandas [Datasets] [Bug] AIR dogfooding- Reading csv with ray.data fails, but works with pandas Apr 18, 2022
@clarkzinzow clarkzinzow added bug Something that is supposed to be working; but isn't data Ray Data-related issues size:small labels Apr 18, 2022
@clarkzinzow clarkzinzow self-assigned this Apr 18, 2022
@jianoaix
Contributor

Same as #23133

@jianoaix
Contributor

@amogkam do you want to try the mitigations summarized at: #23133 (comment)

These args are not exposed at the Dataset level yet, but we may add them if needed.

@clarkzinzow
Contributor

Downgrading this to a P2 since (1) this is an Arrow-level issue with type inference + merging on batched reading, and (2) valid workarounds are given by @jianoaix in this comment.

@clarkzinzow clarkzinzow added P2 Important issue, but not time-critical and removed P1 Issue that should be fixed within a few weeks labels Apr 25, 2022