[Datasets] read_csv not filter out files by default #29032
Conversation
@@ -577,9 +577,7 @@ def read_csv(
    ray_remote_args: Dict[str, Any] = None,
    arrow_open_stream_args: Optional[Dict[str, Any]] = None,
    meta_provider: BaseFileMetadataProvider = DefaultFileMetadataProvider(),
    partition_filter: Optional[
        PathPartitionFilter
    ] = CSVDatasource.file_extension_filter(),
Is the same logic applicable for other data types (e.g. json, parquet)?
We don't have default filtering for Parquet, so Parquet is good. We do have a default filter for JSON that filters out files without the .json extension. I just tried Arrow and Spark on my laptop, and neither filters out files when reading JSON. We can change the behavior for JSON in another followup PR as well. To me, the CSV fix is more urgent since there are multiple user reports (though it could also be that our read_json is just not as popular).
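For reference, Arrow's JSON reader indeed reads whatever path it is given; a minimal sketch (the file name is hypothetical and assumed to hold newline-delimited JSON):

import pyarrow.json as paj

# Arrow reads the path it is given and never filters by extension:
# a newline-delimited JSON file named "events.log" parses fine.
table = paj.read_json("events.log")
print(table.num_rows)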
Oh interesting, read_parquet doesn't have default filtering while read_parquet_bulk does (which seems a bit unintuitive to me, since read_parquet_bulk doesn't support directories).
In that case I think it's reasonable to go forward with this for now and follow up with a more holistic/consistent solution.
One question - is the error message when reading a non-CSV file directly actionable to the user, especially for users who previously relied on this default behavior? E.g. when the file is a TSV, or if the file is some random file that should be excluded.
> Oh interesting, read_parquet doesn't have default filtering while read_parquet_bulk does (which seems a bit unintuitive to me, since read_parquet_bulk doesn't support directories).
Yeah, this looks unintuitive to me too. I don't think we should have different behavior between these two Parquet APIs.
> One question - is the error message when reading a non-CSV file directly actionable to the user, especially for users who previously relied on this default behavior? E.g. when the file is a TSV, or if the file is some random file that should be excluded.
Yeah agreed, we should have an actionable error message. That's exactly what I am doing now; I plan to have another PR to tackle the error message when reading non-CSV files by mistake, and it should also apply when reading a malformed .csv file.
Added handling to provide a more detailed error message, and added an example error message for reading a non-CSV file to the PR description.
except StopIteration:
    return
except Exception as e:
    raise type(e)(
I am not sure how fragile this is; please suggest a better way if you see one.
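As one concrete illustration of the fragility (my own sketch, not from the PR; the file path is hypothetical): raise type(e)(...) assumes every exception class accepts a single string argument, which some built-in exceptions do not:

# raise type(e)(msg) assumes type(e) accepts one string argument.
# UnicodeDecodeError's constructor takes five arguments, so the
# re-raise itself fails with a TypeError instead of the intended error.
try:
    b"\xff".decode("utf-8")
except Exception as e:
    try:
        raise type(e)(f"{e}. Failed to read CSV file: /tmp/example")
    except TypeError as creation_error:
        print(creation_error)  # function takes exactly 5 arguments (1 given)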
# Single CSV file without extension.
ds = ray.data.read_csv(path3)
assert ds.to_pandas().equals(df)
Can you add a case with two CSV files, one with a .csv extension and the other without, where we successfully read both into the dataset?
@jianoaix - sure, added.
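For readers following along, the added case is presumably of roughly this shape (a hedged sketch; the exact test in the PR may differ, and tmp_path is pytest's temp-directory fixture):

import os

import pandas as pd
import ray

def test_read_csv_mixed_extensions(tmp_path):
    df = pd.DataFrame({"one": [1, 2], "two": [3, 4]})
    path_with_ext = os.path.join(tmp_path, "data.csv")
    path_without_ext = os.path.join(tmp_path, "data")
    df.to_csv(path_with_ext, index=False)
    df.to_csv(path_without_ext, index=False)

    # With no default extension filter, both files are read.
    ds = ray.data.read_csv([path_with_ext, path_without_ext])
    assert ds.count() == 2 * len(df)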
except Exception as e:
    raise type(e)(
        f"{e}. Failed to read CSV file: {path}. "
        "Please check the file has correct format, or filter out file with "
So at this point, are we sure that only an incorrect format will fail open_csv?
We don't know for sure. Arrow just returns a generic pyarrow.lib.ArrowInvalid: CSV parse error, so it can also be because the CSV file is malformed.
Changed to catch pyarrow.lib.ArrowInvalid only, and printed an error message that tries to be reasonable in all cases:
pyarrow.lib.ArrowInvalid: CSV parse error: Row #2: Expected 2 columns, got 3: $�����$&������5�����one����������&R&���������(������,������� ...
Failed to read CSV file: ...
Please check the CSV file has correct format, or filter out non-CSV file with 'partition_filter' field.
See read_csv() documentation for more details.
LGTM overall, small comment on the raised error message.
raise pa.lib.ArrowInvalid(
    f"{e}. Failed to read CSV file: {path}. "
    "Please check the CSV file has correct format, or filter out non-CSV "
    "file with 'partition_filter' field. See read_csv() documentation for "
    "more details."
)
Dumping the original exception at the beginning of this exception's message puts a potentially wordy message before our higher-level one, and it discards the original traceback, replacing it with ours. It might be better to just chain the exceptions here, since that's exactly the use case for exception chaining.
Also, I think raising a ValueError rather than Arrow's ArrowInvalid makes more sense here.
Suggested change:

-raise pa.lib.ArrowInvalid(
-    f"{e}. Failed to read CSV file: {path}. "
-    "Please check the CSV file has correct format, or filter out non-CSV "
-    "file with 'partition_filter' field. See read_csv() documentation for "
-    "more details."
-)
+raise ValueError(
+    f"Failed to read CSV file: {path}. "
+    "Please check the CSV file has correct format, or filter out non-CSV "
+    "file with 'partition_filter' field. See read_csv() documentation for "
+    "more details."
+) from e
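To see why chaining matters here (a generic Python sketch, not code from the PR): raise ... from e keeps the original exception attached as __cause__, so the user gets both errors and both tracebacks:

def read_file(path):
    try:
        raise OSError("underlying parse failure")  # stand-in for ArrowInvalid
    except Exception as e:
        raise ValueError(f"Failed to read CSV file: {path}.") from e

try:
    read_file("/tmp/data")
except ValueError as err:
    # The original error survives as __cause__; an uncaught chained
    # exception prints both tracebacks, joined by "The above exception
    # was the direct cause of the following exception".
    print(repr(err.__cause__))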
@clarkzinzow - didn't know our best practice here. This makes sense to me. Updated.
Why are these changes needed?
Currently read_csv filters out files without the .csv extension when reading. This behavior is surprising to users and was reported as a bad user experience in 3+ user reports (#26605). We should change to NOT filter files by default.
Verified that Arrow (https://arrow.apache.org/docs/python/csv.html) and Spark (https://spark.apache.org/docs/latest/sql-data-sources-csv.html) do not filter out CSV files by default. I don't see a strong reason why we would want to do it differently in Ray.
Added documentation in case users want to use partition_filter to filter out files, and gave an example of filtering to files with the .csv extension. Also improved the error message when reading a CSV file fails; see the example error message earlier in this conversation.
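For users who want the old behavior back, the documented opt-in looks roughly like this (a sketch assuming the FileExtensionFilter helper in ray.data.datasource; the S3 path is hypothetical):

import ray
from ray.data.datasource import FileExtensionFilter

# Opt back into extension filtering: only read paths ending in ".csv".
ds = ray.data.read_csv(
    "s3://my-bucket/csv-dir/",
    partition_filter=FileExtensionFilter("csv"),
)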
Related issue number
Closes #26605
Checks
I've signed off every commit (git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.