
[Datasets] read_csv not filter out files by default #29032

Merged
merged 5 commits into from
Oct 7, 2022

Conversation

Contributor

@c21 c21 commented Oct 4, 2022

Signed-off-by: Cheng Su [email protected]

Why are these changes needed?

Currently, read_csv filters out files without a .csv extension when reading. This behavior surprises users and has been reported as a bad user experience in three or more user reports (#26605). We should change the default to NOT filter files.

Verified that Arrow (https://arrow.apache.org/docs/python/csv.html) and Spark (https://spark.apache.org/docs/latest/sql-data-sources-csv.html) do not filter out non-CSV files by default. I don't see a strong reason for Ray to behave differently.

Added documentation in case users want to use partition_filter to filter out files, and gave an example that keeps only files with a .csv extension.
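For illustration, a minimal pure-Python sketch of the kind of extension-based filtering that partition_filter enables. This is not Ray's actual PathPartitionFilter implementation; the function name and shape here are hypothetical:

```python
from pathlib import Path
from typing import List


def filter_by_extension(paths: List[str], extension: str) -> List[str]:
    """Keep only paths whose suffix matches the given extension,
    mimicking what an extension-based partition filter does."""
    return [p for p in paths if Path(p).suffix == f".{extension}"]


paths = ["data/part-0.csv", "data/part-1.csv", "data/_SUCCESS", "data/notes.txt"]
print(filter_by_extension(paths, "csv"))  # → ['data/part-0.csv', 'data/part-1.csv']
```

With this PR, no such filter runs by default; users who relied on the old behavior opt back in by passing an explicit partition_filter.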

Also improved the error message raised when reading a CSV file fails:

>>> 2022-10-03 23:20:02,578	ERROR worker.py:400 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::_execute_read_task() (pid=47372, ip=127.0.0.1)
  File "pyarrow/_csv.pyx", line 942, in pyarrow._csv.open_csv
  File "pyarrow/_csv.pyx", line 834, in pyarrow._csv.CSVStreamingReader._open
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: CSV parse error: Row #2: Expected 4 columns, got 5: {"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}

During handling of the above exception, another exception occurred:

ray::_execute_read_task() (pid=47372, ip=127.0.0.1)
  File "/Users/chengsu/ray/python/ray/data/_internal/lazy_block_list.py", line 580, in _execute_read_task
    block = task()
  File "/Users/chengsu/ray/python/ray/data/datasource/datasource.py", line 202, in __call__
    for block in result:
  File "/Users/chengsu/ray/python/ray/data/datasource/file_based_datasource.py", line 475, in read_files
    for data in read_stream(f, read_path, **reader_args):
  File "/Users/chengsu/ray/python/ray/data/datasource/csv_datasource.py", line 58, in _read_stream
    raise type(e)(f"{e}. Failed to read CSV file: {path}. "
pyarrow.lib.ArrowInvalid: CSV parse error: Row #2: Expected 4 columns, got 5: {"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}. Failed to read CSV file: /Users/chengsu/try/json/test. Please check the file has correct format, or filter out file with 'partition_filter' field. See read_csv() documentation for more details.

Related issue number

Closes #26605

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -577,9 +577,7 @@ def read_csv(
ray_remote_args: Dict[str, Any] = None,
arrow_open_stream_args: Optional[Dict[str, Any]] = None,
meta_provider: BaseFileMetadataProvider = DefaultFileMetadataProvider(),
partition_filter: Optional[
PathPartitionFilter
] = CSVDatasource.file_extension_filter(),
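In effect, the diff above makes extension filtering opt-in: the partition_filter default changes from an extension-based filter to None. A simplified, hypothetical sketch of the behavioral change (this is not the real read_csv signature):

```python
from typing import Callable, List, Optional


def read_csv(paths: List[str],
             partition_filter: Optional[Callable[[str], bool]] = None) -> List[str]:
    """Stand-in for the real reader: with the new default of None,
    no files are dropped unless the caller supplies a filter."""
    if partition_filter is not None:
        paths = [p for p in paths if partition_filter(p)]
    return paths  # the real function would read these files into a Dataset


print(read_csv(["part-0.csv", "part-1"]))                              # → ['part-0.csv', 'part-1']
print(read_csv(["part-0.csv", "part-1"], lambda p: p.endswith(".csv")))  # → ['part-0.csv']
```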
Contributor
Is the same logic applicable for other data types (e.g. json, parquet)?

Contributor Author

@c21 c21 Oct 4, 2022

> Is the same logic applicable for other data types (e.g. json, parquet)?

We don't have default filtering for Parquet, so Parquet is fine.
We have a default filter for JSON that drops files without a .json extension. I just tried Arrow and Spark on my laptop, and they don't filter out files when reading JSON, so we can change the JSON behavior in a follow-up PR. To me, the CSV fix is more urgent since there are multiple user reports (though it may also be that our read_json is less widely used).

Contributor

Oh interesting, read_parquet doesn't have default filtering while read_parquet_bulk does (which seems a bit unintuitive to me, since read_parquet_bulk doesn't support directories).

In that case I think it's reasonable to go forward with this for now and follow-up with a more holistic/consistent solution.

One question - is the error message when reading a non-CSV file directly actionable to the user, especially for users who previously relied on this default behavior? E.g. when the file is a TSV, or if the file is some random file that should be excluded.

Contributor Author

> Oh interesting, read_parquet doesn't have default filtering while read_parquet_bulk does (which seems a bit unintuitive to me, since read_parquet_bulk doesn't support directories).

Yeah, this looks unintuitive to me too. I don't think we should have different behavior between these two Parquet APIs.

> One question - is the error message when reading a non-CSV file directly actionable to the user, especially for users who previously relied on this default behavior? E.g. when the file is a TSV, or if the file is some random file that should be excluded.

Yeah, agreed, we should have an actionable error message. That's exactly what I am doing now; I plan to have another PR to tackle the error message when reading non-CSV files by mistake, which should also apply when reading a malformed .csv file.

Contributor Author

Added handling to provide a more detailed error message, and added an example of the error message for reading a non-CSV file to the PR description.

except StopIteration:
return
except Exception as e:
raise type(e)(
Contributor Author

I am not sure how fragile this is; please suggest a better way if you know one.

# Single CSV file without extension.
ds = ray.data.read_csv(path3)
assert ds.to_pandas().equals(df)

Contributor

Can you add a case with two CSV files, one with the .csv extension and one without, verifying that we can successfully read both into a dataset?

Contributor Author

@jianoaix - sure, added.
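The requested case can be sketched without Ray using the standard library: the file extension has no bearing on the contents, so both files parse identically (file names here are illustrative).

```python
import csv
import tempfile
from pathlib import Path


def roundtrip_two_files():
    """Write the same rows to 'data.csv' and 'data_no_ext', then read both back."""
    rows = [["one", "two"], ["1", "2"], ["3", "4"]]
    results = {}
    with tempfile.TemporaryDirectory() as d:
        for name in ("data.csv", "data_no_ext"):  # one with extension, one without
            with open(Path(d) / name, "w", newline="") as f:
                csv.writer(f).writerows(rows)
        for name in ("data.csv", "data_no_ext"):
            with open(Path(d) / name, newline="") as f:
                results[name] = list(csv.reader(f))
    return rows, results


rows, results = roundtrip_two_files()
# Extension-based default filtering would have silently dropped the second file.
assert results["data.csv"] == rows and results["data_no_ext"] == rows
```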

except Exception as e:
raise type(e)(
f"{e}. Failed to read CSV file: {path}. "
"Please check the file has correct format, or filter out file with "
Contributor

So at this point, are we sure that only an incorrect format will fail open_csv?

Contributor Author

We don't know for sure. Arrow just returns a generic pyarrow.lib.ArrowInvalid: CSV parse error, so the failure can also be because the CSV file itself is malformed.

Changed to catch pyarrow.lib.ArrowInvalid only, and print an error message that tries to be reasonable in all cases:

pyarrow.lib.ArrowInvalid: CSV parse error: Row #2: Expected 2 columns, got 3: $�����$&������5�����one����������&R&���������(������,������� ...
Failed to read CSV file: ...
Please check the CSV file has correct format, or filter out non-CSV file with 'partition_filter' field.
See read_csv() documentation for more details.

@c21 c21 added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Oct 6, 2022
Contributor

@clarkzinzow clarkzinzow left a comment

LGTM overall; small comment on the raised error message.

Comment on lines 58 to 63
raise pa.lib.ArrowInvalid(
f"{e}. Failed to read CSV file: {path}. "
"Please check the CSV file has correct format, or filter out non-CSV "
"file with 'partition_filter' field. See read_csv() documentation for "
"more details."
)
Contributor

@clarkzinzow clarkzinzow Oct 6, 2022

Dumping the original exception at the beginning of this exception's message puts a potentially wordy message before our higher-level one, and it discards the original traceback, replacing it with ours. Might be better to just chain the exceptions here, since that's exactly the use case for exception chaining.

Also I think that raising a ValueError rather than Arrow's ArrowInvalid error makes more sense here.

Suggested change
raise pa.lib.ArrowInvalid(
f"{e}. Failed to read CSV file: {path}. "
"Please check the CSV file has correct format, or filter out non-CSV "
"file with 'partition_filter' field. See read_csv() documentation for "
"more details."
)
raise ValueError(
f"Failed to read CSV file: {path}. "
"Please check the CSV file has correct format, or filter out non-CSV "
"file with 'partition_filter' field. See read_csv() documentation for "
"more details."
) from e
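As a general illustration of the chaining pattern suggested above (simulated errors, not the actual Ray datasource code):

```python
def read_csv_like(path: str):
    """Simulate the datasource's read path failing on a malformed file."""
    try:
        raise RuntimeError("CSV parse error: Expected 4 columns, got 5")  # simulated Arrow error
    except Exception as e:
        # Chain rather than embed: the original error stays on __cause__,
        # and Python prints both tracebacks for the user.
        raise ValueError(f"Failed to read CSV file: {path}. "
                         "See read_csv() documentation for more details.") from e


try:
    read_csv_like("/tmp/test")
except ValueError as err:
    print(err)            # the high-level message, stated first
    print(err.__cause__)  # CSV parse error: Expected 4 columns, got 5
```

`raise ... from e` sets `__cause__` on the new exception, so the unhandled-error output shows the original parse error followed by "The above exception was the direct cause of the following exception:" and the higher-level message.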

Contributor Author

@clarkzinzow - didn't know our best practice here. This makes sense to me. Updated.

Signed-off-by: Cheng Su <[email protected]>
@clarkzinzow clarkzinzow merged commit 92df1c1 into ray-project:master Oct 7, 2022
@c21 c21 deleted the csv branch October 7, 2022 18:13
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022

Signed-off-by: Weichen Xu <[email protected]>
Successfully merging this pull request may close these issues.

[Datasets] CSV reading of unspecified extension is not robust
4 participants