[Datasets] read_csv not filter out files by default #29032
Conversation
@@ -577,9 +577,7 @@ def read_csv(
    ray_remote_args: Dict[str, Any] = None,
    arrow_open_stream_args: Optional[Dict[str, Any]] = None,
    meta_provider: BaseFileMetadataProvider = DefaultFileMetadataProvider(),
    partition_filter: Optional[
        PathPartitionFilter
    ] = CSVDatasource.file_extension_filter(),
Is the same logic applicable for other data types (e.g. json, parquet)?
We don't have default filtering for Parquet, so Parquet is good. We do have a default filter for JSON that filters out files without the .json extension. I just tried Arrow and Spark on my laptop, and neither filters out files when reading JSON. We can change the behavior for JSON in another followup PR as well. To me, the CSV fix is more urgent since there are multiple user reports (though it could also be that our read_json is just not as popular).
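For reference, Arrow's JSON reader indeed reads whatever path it is given; a minimal sketch (the file name is hypothetical and assumed to hold newline-delimited JSON):

import pyarrow.json as paj

# Arrow reads the path it is given and never filters by extension:
# a newline-delimited JSON file named "events.log" parses fine.
table = paj.read_json("events.log")
print(table.num_rows)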
Oh interesting, read_parquet doesn't have default filtering while read_parquet_bulk does (which seems a bit unintuitive to me, since read_parquet_bulk doesn't support directories).
In that case I think it's reasonable to go forward with this for now and follow up with a more holistic/consistent solution.
One question - is the error message when reading a non-CSV file directly actionable to the user, especially for users who previously relied on this default behavior? E.g. when the file is a TSV, or if the file is some random file that should be excluded.
> Oh interesting, read_parquet doesn't have default filtering while read_parquet_bulk does (which seems a bit unintuitive to me, since read_parquet_bulk doesn't support directories).
Yeah, this looks unintuitive to me too. I don't think we should have different behavior between these two Parquet APIs.
> One question - is the error message when reading a non-CSV file directly actionable to the user, especially for users who previously relied on this default behavior? E.g. when the file is a TSV, or if the file is some random file that should be excluded.
Yeah agreed, we should have an actionable error message. That's exactly what I am doing now; I plan to have another PR to tackle the error message when reading non-CSV files by mistake, and it should also apply when reading a malformed .csv file.
Added handling to provide a more detailed error message, and added an example error message for reading a non-CSV file to the PR description.
except StopIteration:
    return
except Exception as e:
    raise type(e)(
I am not sure how fragile this is; please suggest a better way if you see one.
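As one concrete illustration of the fragility (my own sketch, not from the PR; the file path is hypothetical): raise type(e)(...) assumes every exception class accepts a single string argument, which some built-in exceptions do not:

# raise type(e)(msg) assumes type(e) accepts one string argument.
# UnicodeDecodeError's constructor takes five arguments, so the
# re-raise itself fails with a TypeError instead of the intended error.
try:
    b"\xff".decode("utf-8")
except Exception as e:
    try:
        raise type(e)(f"{e}. Failed to read CSV file: /tmp/example")
    except TypeError as creation_error:
        print(creation_error)  # function takes exactly 5 arguments (1 given)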
# Single CSV file without extension.
ds = ray.data.read_csv(path3)
assert ds.to_pandas().equals(df)
Can you add a case with two CSV files, one with a .csv extension and the other without, where we successfully read both into the dataset?
@jianoaix - sure, added.
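For readers following along, the added case is presumably of roughly this shape (a hedged sketch; the exact test in the PR may differ, and tmp_path is pytest's temp-directory fixture):

import os

import pandas as pd
import ray

def test_read_csv_mixed_extensions(tmp_path):
    df = pd.DataFrame({"one": [1, 2], "two": [3, 4]})
    path_with_ext = os.path.join(tmp_path, "data.csv")
    path_without_ext = os.path.join(tmp_path, "data")
    df.to_csv(path_with_ext, index=False)
    df.to_csv(path_without_ext, index=False)

    # With no default extension filter, both files are read.
    ds = ray.data.read_csv([path_with_ext, path_without_ext])
    assert ds.count() == 2 * len(df)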
except Exception as e:
    raise type(e)(
        f"{e}. Failed to read CSV file: {path}. "
        "Please check the file has correct format, or filter out file with "
So at this point, are we sure that only an incorrect format will fail open_csv?
We don't know for sure. Arrow just returns a generic pyarrow.lib.ArrowInvalid: CSV parse error, so it can also be because the CSV file is malformed.
Changed to catch pyarrow.lib.ArrowInvalid only, and printed an error message that tries to be reasonable in all cases:
pyarrow.lib.ArrowInvalid: CSV parse error: Row #2: Expected 2 columns, got 3: $�����$&������5�����one����������&R&���������(������,������� ...
Failed to read CSV file: ...
Please check the CSV file has correct format, or filter out non-CSV file with 'partition_filter' field.
See read_csv() documentation for more details.
LGTM overall, small comment on the raised error message.
raise pa.lib.ArrowInvalid(
    f"{e}. Failed to read CSV file: {path}. "
    "Please check the CSV file has correct format, or filter out non-CSV "
    "file with 'partition_filter' field. See read_csv() documentation for "
    "more details."
)
Dumping the original exception at the beginning of this exception's message puts a potentially wordy message before our higher-level one, and it discards the original traceback, replacing it with ours. It might be better to just chain the exceptions here, since that's exactly the use case for exception chaining.
Also, I think raising a ValueError rather than Arrow's ArrowInvalid makes more sense here.
Suggested change:

-raise pa.lib.ArrowInvalid(
-    f"{e}. Failed to read CSV file: {path}. "
-    "Please check the CSV file has correct format, or filter out non-CSV "
-    "file with 'partition_filter' field. See read_csv() documentation for "
-    "more details."
-)
+raise ValueError(
+    f"Failed to read CSV file: {path}. "
+    "Please check the CSV file has correct format, or filter out non-CSV "
+    "file with 'partition_filter' field. See read_csv() documentation for "
+    "more details."
+) from e
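To see why chaining matters here (a generic Python sketch, not code from the PR): raise ... from e keeps the original exception attached as __cause__, so the user gets both errors and both tracebacks:

def read_file(path):
    try:
        raise OSError("underlying parse failure")  # stand-in for ArrowInvalid
    except Exception as e:
        raise ValueError(f"Failed to read CSV file: {path}.") from e

try:
    read_file("/tmp/data")
except ValueError as err:
    # The original error survives as __cause__; an uncaught chained
    # exception prints both tracebacks, joined by "The above exception
    # was the direct cause of the following exception".
    print(repr(err.__cause__))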
@clarkzinzow - didn't know our best practice here. This makes sense to me. Updated.
Why are these changes needed?
Currently read_csv filters out files without the .csv extension when reading. This behavior is surprising to users and was reported as a bad user experience in 3+ user reports (#26605). We should change to NOT filter files by default.
Verified that Arrow (https://arrow.apache.org/docs/python/csv.html) and Spark (https://spark.apache.org/docs/latest/sql-data-sources-csv.html) do not filter out CSV files by default. I don't see a strong reason why we would want to do it differently in Ray.
Added documentation in case users want to use partition_filter to filter out files, and gave an example of filtering to files with the .csv extension. Also improved the error message when reading a CSV file fails; see the example error message earlier in this conversation.
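For users who want the old behavior back, the documented opt-in looks roughly like this (a sketch assuming the FileExtensionFilter helper in ray.data.datasource; the S3 path is hypothetical):

import ray
from ray.data.datasource import FileExtensionFilter

# Opt back into extension filtering: only read paths ending in ".csv".
ds = ray.data.read_csv(
    "s3://my-bucket/csv-dir/",
    partition_filter=FileExtensionFilter("csv"),
)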
Related issue number
Closes #26605
Checks
I've signed off every commit (git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.