Add support for reading CSV files with comments #10467

bbannier · 2024-05-12T10:44:23Z

This PR adds support for parsing CSV files containing comment lines.

bbannier · 2024-05-12T12:23:28Z

This is currently a sketch of a possible implementation for #10262. The approach pushes interpretation of comment lines into arrow-csv, with apache/arrow-rs#5759 adding support for that; the task here is then to plumb a datafusion comment config setting through to arrow-csv.
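To make the layering concrete, here is a minimal Python sketch of the idea, not the actual arrow-csv or datafusion code. `CsvOptions`, `read_records`, and `scan` are hypothetical names invented for illustration. The point is only that the higher layer forwards a comment setting unchanged while the low-level reader is the one place that decides what counts as a comment line.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class CsvOptions:
    """Hypothetical user-facing options; not DataFusion's actual config type."""
    delimiter: str = ","
    has_header: bool = True
    comment: Optional[str] = None  # lines starting with this character are skipped


def read_records(text: str, delimiter: str, comment: Optional[str]) -> Iterator[List[str]]:
    """Low-level reader: the only layer that knows what a comment line is.

    Simplification: filtering is done per physical line, so a quoted field
    that spans several lines is not handled correctly here.
    """
    lines = (
        line
        for line in io.StringIO(text)
        if comment is None or not line.startswith(comment)
    )
    yield from csv.reader(lines, delimiter=delimiter)


def scan(text: str, options: CsvOptions):
    """Higher layer: forwards the option and does no comment handling itself."""
    records = read_records(text, options.delimiter, options.comment)
    header = next(records) if options.has_header else None
    return header, list(records)


data = "column1,column2\n#second line is a comment\n2,3\n"
header, rows = scan(data, CsvOptions(comment="#"))
print(header, rows)
```

With `comment=None` (the default in this sketch) the `#` line is returned as an ordinary one-field record, which is the behavior the setting is meant to change.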

If this is a viable solution it would require a bump of at least the arrow-csv dependency in datafusion to a version containing support for comments. To at least explore that, I prefixed the actual implementation patch here with two patches performing that bump. It appears that a bump to the master version of arrow-csv (or anything else from the collection of crates in https://github.com/apache/arrow-rs) requires changes to datafusion; I attempted to perform that bump, but there are still some remaining issues:

Error: ResourcesExhausted("Failed to allocate additional 2208 bytes for GroupedHashAggregateStream[0] with 348 bytes already allocated - maximum available is 1600")

---- aggregates::tests::aggregate_source_with_yielding_with_spill stdout ----
Error: ResourcesExhausted("Failed to allocate additional 2208 bytes for GroupedHashAggregateStream[0] with 348 bytes already allocated - maximum available is 1600")

---- aggregates::tests::run_first_last_multi_partitions stdout ----
Error: ResourcesExhausted("Failed to allocate additional 3704 bytes for GroupedHashAggregateStream[0] with 437 bytes already allocated - maximum available is 3200")

@alamb, would you be open to shepherding this PR and apache/arrow-rs#5759, or alternatively could you help identify someone who could?

alamb · 2024-05-14T12:32:33Z

If this is a viable solution it would require a bump of at least the arrow-csv dependency in datafusion to a version containing support for comments.

Yes. FWIW, DataFusion typically upgrades to the latest arrow-rs (including arrow-csv) dependency, so while extra time would be needed, no extra work would be needed.

bbannier · 2024-06-09T20:10:20Z

This is now rebased on main which recently bumped arrow.

alamb

Thank you very much for this contribution @bbannier -- this code looks great. The only thing I think this PR now needs is some test coverage so we don't break it in the future.

Here is my suggestion for testing:

Update csv_files.slt (see this file for info on running SQL logic tests).

Note that I think you can programmatically create a CSV file with a command like

> copy (values ('column1,column2'), ('#second line is a comment'), ('2,3')) TO '/tmp/my.csv' OPTIONS ('format.delimiter' '|');
+-------+
| count |
+-------+
| 3     |
+-------+
1 row(s) fetched.
Elapsed 0.004 seconds.

That results in

$ cat /tmp/my.csv
column1,column2
#second line is a comment
2,3
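The thread does not say why the command sets 'format.delimiter' to '|'. One plausible reading, offered here as an assumption, is that each value is written as a single field, and a field containing a comma only stays unquoted if the comma is not the delimiter. The sketch below uses Python's standard `csv` module to illustrate that general quoting rule; it is not DataFusion's writer and its behavior may differ in detail. `write_single_column` is a hypothetical helper.

```python
import csv
import io


def write_single_column(values, delimiter):
    """Write one value per row and return the resulting text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    for value in values:
        writer.writerow([value])
    return buffer.getvalue()


values = ["column1,column2", "#second line is a comment", "2,3"]

# With '|' as the delimiter, commas are ordinary characters and stay unquoted.
pipe_output = write_single_column(values, "|")

# With ',' as the delimiter, fields containing a comma have to be quoted.
comma_output = write_single_column(values, ",")

print(pipe_output)
print(comma_output)
```

In this sketch only the '|' variant yields text that reads back as a two-column CSV file with a comment on its second line.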

This patch adds support for parsing CSV files containing comment lines. Closes apache#10262.

alamb

Thank you @bbannier 🚀

alamb · 2024-06-10T15:31:28Z

datafusion/sqllogictest/test_files/csv_files.slt

+         'format.delimiter' ',');
+
+query TT
+SELECT * from stored_table_with_comments;


👍 Love it

github-actions bot added the core Core DataFusion crate label May 12, 2024

bbannier force-pushed the t/comment branch 3 times, most recently from b1092d8 to 62b8364 Compare May 12, 2024 11:58

bbannier force-pushed the t/comment branch 3 times, most recently from fb58860 to 1df527d Compare May 13, 2024 18:55

bbannier force-pushed the t/comment branch 2 times, most recently from d4faa11 to f27f2dc Compare June 9, 2024 19:33

bbannier marked this pull request as ready for review June 9, 2024 20:10

alamb reviewed Jun 9, 2024

Add support for reading CSV files with comments

ac2287b

This patch adds support for parsing CSV files containing comment lines. Closes apache#10262.

bbannier force-pushed the t/comment branch from f27f2dc to ac2287b Compare June 10, 2024 08:23

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jun 10, 2024

bbannier requested a review from alamb June 10, 2024 09:33

alamb approved these changes Jun 10, 2024

alamb merged commit 5912025 into apache:main Jun 10, 2024
23 checks passed

bbannier deleted the t/comment branch June 10, 2024 15:38

findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024

Add support for reading CSV files with comments (apache#10467)

3be259e

This patch adds support for parsing CSV files containing comment lines. Closes apache#10262.
