[Datasets] read_csv not filter out files by default #29032
Merged
5 commits
```diff
@@ -23,7 +23,10 @@
     PathPartitionEncoder,
     PathPartitionFilter,
 )
-from ray.data.datasource.file_based_datasource import _unwrap_protocol
+from ray.data.datasource.file_based_datasource import (
+    FileExtensionFilter,
+    _unwrap_protocol,
+)
 
 
 def df_to_csv(dataframe, path, **kwargs):
```
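The import change pulls `FileExtensionFilter` into the test module, which keeps only paths whose extension matches an allowed set. As a rough stdlib-only sketch of that idea (a toy stand-in for illustration, not Ray's actual implementation; `has_extension` is a hypothetical helper name):

```python
import posixpath


def has_extension(path: str, extensions: tuple) -> bool:
    # Keep only paths whose suffix (dot stripped, case-insensitive)
    # matches one of the allowed extensions.
    suffix = posixpath.splitext(path)[1].lstrip(".").lower()
    return suffix in extensions


# Mirrors the fixtures used in the tests below: one .csv file, one
# extension-less CSV file, one Parquet file.
paths = ["data/test2.csv", "data/test3", "data/test1.parquet"]
kept = [p for p in paths if has_extension(p, ("csv",))]
print(kept)  # only "data/test2.csv" survives the filter
```

Note the filter drops `data/test3` even though its contents may be perfectly valid CSV, which is exactly the surprising default behavior this PR removes.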
```diff
@@ -196,7 +199,12 @@ def test_csv_read(ray_start_regular_shared, fs, data_path, endpoint_url):
         storage_options=storage_options,
     )
 
-    ds = ray.data.read_csv(path, filesystem=fs, partitioning=None)
+    ds = ray.data.read_csv(
+        path,
+        filesystem=fs,
+        partition_filter=FileExtensionFilter("csv"),
+        partitioning=None,
+    )
     assert ds.num_blocks() == 2
     df = pd.concat([df1, df2], ignore_index=True)
     dsdf = ds.to_pandas()
```
```diff
@@ -642,7 +650,7 @@ def test_csv_read_with_column_type_specified(shutdown_only, tmp_path):
 
     # Incorrect to parse scientific notation in int64 as PyArrow represents
     # it as double.
-    with pytest.raises(pa.lib.ArrowInvalid):
+    with pytest.raises(ValueError):
         ray.data.read_csv(
             file_path,
             convert_options=csv.ConvertOptions(
```
```diff
@@ -661,15 +669,53 @@ def test_csv_read_with_column_type_specified(shutdown_only, tmp_path):
     assert ds.to_pandas().equals(expected_df)
 
 
-def test_csv_read_filter_no_file(shutdown_only, tmp_path):
+def test_csv_read_filter_non_csv_file(shutdown_only, tmp_path):
     df = pd.DataFrame({"one": [1, 2, 3], "two": ["a", "b", "c"]})
 
+    # CSV file with .csv extension.
+    path1 = os.path.join(tmp_path, "test2.csv")
+    df.to_csv(path1, index=False)
+
+    # CSV file without .csv extension.
+    path2 = os.path.join(tmp_path, "test3")
+    df.to_csv(path2, index=False)
+
+    # Directory of CSV files.
+    ds = ray.data.read_csv(tmp_path)
+    assert ds.to_pandas().equals(pd.concat([df, df], ignore_index=True))
+
+    # Non-CSV file in Parquet format.
     table = pa.Table.from_pandas(df)
-    path = os.path.join(str(tmp_path), "test.parquet")
-    pq.write_table(table, path)
+    path3 = os.path.join(tmp_path, "test1.parquet")
+    pq.write_table(table, path3)
 
+    # Single non-CSV file.
+    error_message = "Failed to read CSV file"
+    with pytest.raises(ValueError, match=error_message):
+        ray.data.read_csv(path3)
+
+    # Single non-CSV file with filter.
     error_message = "No input files found to read"
     with pytest.raises(ValueError, match=error_message):
-        ray.data.read_csv(path)
+        ray.data.read_csv(path3, partition_filter=FileExtensionFilter("csv"))
+
+    # Single CSV file without extension.
+    ds = ray.data.read_csv(path2)
+    assert ds.to_pandas().equals(df)
+
+    # Single CSV file without extension with filter.
+    error_message = "No input files found to read"
+    with pytest.raises(ValueError, match=error_message):
+        ray.data.read_csv(path2, partition_filter=FileExtensionFilter("csv"))
+
+    # Directory of CSV and non-CSV files.
+    error_message = "Failed to read CSV file"
+    with pytest.raises(ValueError, match=error_message):
+        ray.data.read_csv(tmp_path)
+
+    # Directory of CSV and non-CSV files with filter.
+    ds = ray.data.read_csv(tmp_path, partition_filter=FileExtensionFilter("csv"))
+    assert ds.to_pandas().equals(df)
 
 
 @pytest.mark.skipif(
```

Inline review comment on this test:

jianoaix: Can you add a case with two CSV files, one with a `.csv` extension and one without, where we can successfully read both into a dataset?

Author: @jianoaix - sure, added.
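The renamed test leans on the fact that a file's extension says nothing about whether its contents are valid CSV, which is why extension-based filtering is no longer the default. A quick stdlib-only illustration of that point (no Ray involved; file names mirror the `test2.csv`/`test3` fixtures above):

```python
import csv
import os
import tempfile

rows = [["one", "two"], ["1", "a"], ["2", "b"], ["3", "c"]]

with tempfile.TemporaryDirectory() as tmp:
    # Write the same payload twice: once with a .csv extension, once without.
    for name in ("test2.csv", "test3"):
        with open(os.path.join(tmp, name), "w", newline="") as f:
            csv.writer(f).writerows(rows)

    parsed = {}
    for name in ("test2.csv", "test3"):
        with open(os.path.join(tmp, name), newline="") as f:
            parsed[name] = list(csv.reader(f))

# Both files parse identically; the extension never mattered to the parser,
# so filtering out "test3" by default silently dropped valid data.
assert parsed["test2.csv"] == parsed["test3"] == rows
```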
Reviewer: Is the same logic applicable to other data types (e.g. JSON, Parquet)?
Author: We don't have default filtering for Parquet, so Parquet is fine. We do have a default filter for JSON that drops files without a `.json` extension. I just tried Arrow and Spark on my laptop, and neither filters out files when reading JSON, so we can change the JSON behavior in another follow-up PR as well. To me the CSV fix is more urgent, since there are multiple user reports (though it may also just be that our `read_json` is less popular).
Reviewer: Oh interesting, `read_parquet` doesn't have default filtering while `read_parquet_bulk` does, which seems a bit unintuitive to me, since `read_parquet_bulk` doesn't even support directories. In that case I think it's reasonable to go forward with this for now and follow up with a more holistic, consistent solution. One question: is the error message when reading a non-CSV file directly actionable for the user, especially for users who previously relied on the default filtering? E.g. when the file is a TSV, or some unrelated file that should be excluded.
Author: Yeah, this looks unintuitive to me too; I don't think these two Parquet APIs should behave differently. And agreed, we should have an actionable error message. That's exactly what I'm working on now: I plan another PR to improve the error message when a non-CSV file is read by mistake, and it should also apply when reading a malformed `.csv` file.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added handling to provide more detailed error message. Added an example error message when reading non-CSV file in PR description.
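The follow-up the author describes, wrapping a low-level parse failure with the offending file name and a concrete remediation, might look roughly like this (an illustrative sketch only, not Ray's actual code; `parse_csv` and `read_csv_file` are hypothetical names, and Ray really delegates CSV parsing to PyArrow):

```python
import csv
import io


def parse_csv(raw: bytes):
    # Minimal parser stand-in: decode then parse. Binary data such as a
    # Parquet file typically fails the decode step.
    text = raw.decode("utf-8")
    return list(csv.reader(io.StringIO(text)))


def read_csv_file(path: str, raw: bytes):
    try:
        return parse_csv(raw)
    except Exception as exc:
        # Re-raise with the file name and a suggested fix, so the error is
        # actionable for users who relied on the old default filtering.
        raise ValueError(
            f"Failed to read CSV file: {path}. Is this a valid CSV file? "
            "To skip non-CSV files in a directory, pass "
            "partition_filter=FileExtensionFilter('csv') to read_csv()."
        ) from exc


# Bytes that are not valid UTF-8 stand in for a binary Parquet file.
try:
    read_csv_file("test1.parquet", b"\x80\x81binary")
except ValueError as err:
    message = str(err)
print(message)
```

The key design point is that the wrapped error names the file and the `partition_filter` escape hatch, rather than surfacing a bare Arrow exception.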