Add parquet-concat #4274

Merged
merged 1 commit into from
May 24, 2023

Conversation

tustvold
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

Adds a CLI tool for efficiently concatenating parquet files. This could definitely be made more sophisticated, but it serves as a demo of how to use the new API added in #4269 while also providing some utility to users.
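
To illustrate the general idea behind this kind of concatenation, here is a minimal sketch that does not use the parquet crate at all. It assumes the efficiency comes from appending already-encoded column chunks row group by row group, without decoding them, and that all inputs share one schema; both points are assumptions about the approach rather than a description of the code in this PR. `ToyFile`, `ToyRowGroup`, `concat`, and `total_rows` are hypothetical names invented for the sketch.

```rust
/// Hypothetical stand-in for one row group: a row count plus one opaque,
/// already-encoded byte blob per column.
#[derive(Debug, Clone, PartialEq)]
struct ToyRowGroup {
    num_rows: u64,
    column_chunks: Vec<Vec<u8>>,
}

/// Hypothetical stand-in for a Parquet file: column names plus row groups.
#[derive(Debug, Clone, PartialEq)]
struct ToyFile {
    schema: Vec<String>,
    row_groups: Vec<ToyRowGroup>,
}

/// Concatenate files by appending their row groups unchanged.
/// Assumption: every input must have the same schema as the first one.
fn concat(inputs: &[ToyFile]) -> Result<ToyFile, String> {
    let first = inputs
        .first()
        .ok_or_else(|| "no input files".to_string())?;
    let mut out = ToyFile {
        schema: first.schema.clone(),
        row_groups: Vec::new(),
    };
    for (i, file) in inputs.iter().enumerate() {
        if file.schema != out.schema {
            return Err(format!("input {i} has a different schema"));
        }
        // The encoded chunks are copied as raw bytes and never decoded.
        out.row_groups.extend(file.row_groups.iter().cloned());
    }
    Ok(out)
}

/// Total number of rows across all row groups of a file.
fn total_rows(file: &ToyFile) -> u64 {
    file.row_groups.iter().map(|rg| rg.num_rows).sum()
}

fn main() {
    let schema = vec!["value".to_string()];
    let a = ToyFile {
        schema: schema.clone(),
        row_groups: vec![ToyRowGroup { num_rows: 3, column_chunks: vec![vec![1, 2, 3]] }],
    };
    let b = ToyFile {
        schema,
        row_groups: vec![ToyRowGroup { num_rows: 2, column_chunks: vec![vec![4, 5]] }],
    };
    let combined = concat(&[a, b]).expect("schemas match");
    // Row groups are kept as-is, so the output has one per input row group.
    assert_eq!(combined.row_groups.len(), 2);
    assert_eq!(total_rows(&combined), 5);
}
```

One consequence of this design is that row groups are not merged: many small inputs would yield an output with many small row groups.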

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label May 24, 2023
Contributor

@alamb alamb left a comment


I recommend adding a doc comment on append_column (added in #4269) that points to this binary as an example of how to use it.

I tested this out with some local parquet data:

parquet-concat combined.parquet 1.parquet 2.parquet
(arrow_dev) alamb@MacBook-Pro-8:~/Downloads$ du -s -h 1.parquet 2.parquet combined.parquet
 69M	1.parquet
 40K	2.parquet
 69M	combined.parquet
❯ select count(*) from '1.parquet';
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 9332578         |
+-----------------+
1 row in set. Query took 0.007 seconds.
❯ select count(*) from '2.parquet';
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 637             |
+-----------------+
❯ select count(*) from 'combined.parquet';
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 9333215         |
+-----------------+
❯ select avg(value) from (select value from '1.parquet' UNION ALL select value from '2.parquet');
+----------------------+
| AVG(1.parquet.value) |
+----------------------+
| 82.11578603943015    |
+----------------------+
1 row in set. Query took 0.035 seconds.
❯ select avg(value) from 'combined.parquet';
+-----------------------------+
| AVG(combined.parquet.value) |
+-----------------------------+
| 82.11578603943015           |
+-----------------------------+
1 row in set. Query took 0.015 seconds.

Works great for me 🚀
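
The checks above boil down to two invariants: the combined file should hold the sum of the input row counts, and an aggregate over the combined file should match the same aggregate over the union of the inputs. A small standalone sketch of that arithmetic follows; `expected_rows` and `union_mean` are hypothetical helpers, and the numbers passed to `union_mean` are toy values, not data from this PR.

```rust
/// Sum of per-file row counts; concatenation should preserve this total.
fn expected_rows(per_file: &[u64]) -> u64 {
    per_file.iter().sum()
}

/// Mean over the union of several inputs, given (row_count, sum) per input.
/// Returns None when there are no rows at all.
fn union_mean(parts: &[(u64, f64)]) -> Option<f64> {
    let rows: u64 = parts.iter().map(|(n, _)| n).sum();
    if rows == 0 {
        return None;
    }
    let total: f64 = parts.iter().map(|(_, s)| s).sum();
    Some(total / rows as f64)
}

fn main() {
    // Row counts reported above for 1.parquet and 2.parquet.
    let total = expected_rows(&[9_332_578, 637]);
    // This matches the count reported for combined.parquet.
    assert_eq!(total, 9_333_215);
    println!("expected rows in combined.parquet: {total}");

    // Toy values (not from the PR): 2 rows summing to 10.0, 3 rows summing to 20.0.
    let mean = union_mean(&[(2, 10.0), (3, 20.0)]).unwrap();
    assert!((mean - 6.0).abs() < 1e-12);
}
```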

@tustvold
Contributor Author

I recommend adding a doc comment on append_column (added in #4269) that points to this binary as an example of how to use it.

I'm not actually sure how to do this. I will link ArrowWriter following #3871, as I think that will be more discoverable.
