Write schema for empty parquet files #2373
Comments
Hi @Hanspagh! Are you referring specifically to the case where the DataFrame being written is empty? We could likely corner-case this to work, but it might get a little messy because we do distributed writes. Each partition writes its own Parquet file(s), and so in the event that you have multiple partitions you could end up with several empty files.
Yes, that is what I am referring to. In our data transformation repo we automatically smoke test all our transformation jobs with random sample data based on the input schemas for the job, and then we can validate that the output of the job matches an output schema. This works very well in Spark because, no matter the filters etc., Spark will always produce a Parquet file with a schema based on the query plan. We are experimenting with using Daft as a replacement for Spark and would be very sad to lose this possibility for automated smoke tests. I see the problem in ending up with several empty Parquet files because of partitions; for our use case it would not really matter, since this is only used for testing anyway, but maybe we need to make this behavior optional?
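A hypothetical sketch of this kind of smoke test in PySpark; the transformation, schemas, and paths below are invented for illustration:

```python
# Hypothetical smoke test: feed a transformation an input that matches the
# input schema, write the (possibly empty) result, and check the output schema.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = SparkSession.builder.getOrCreate()

input_schema = StructType([
    StructField("user_id", LongType()),
    StructField("country", StringType()),
])
expected_output_schema = StructType([StructField("country", StringType())])

def transform(df):
    # Stand-in for a real transformation job; the filter may drop every row.
    return df.where(df.user_id > 0).select("country")

# Empty sample input that matches the input schema (random data works the same way).
sample_input = spark.createDataFrame([], input_schema)
transform(sample_input).write.mode("overwrite").parquet("/tmp/smoke_test_output")

# Because Spark writes the schema even when there are no rows, this check still works.
assert spark.read.parquet("/tmp/smoke_test_output").schema == expected_output_schema
```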
That makes sense. A couple of follow-up questions for you here:
There is a separate but perhaps related thread on getting Daft to write to a single Parquet file (instead of outputting multiple files): #2359. Perhaps in "single-file" mode it could be expected to output an empty Parquet file, but in "many-file" mode it would output an empty directory.
I just did a bit of experimenting. As you said, Spark will always write a Parquet directory (a folder with one or more parts in it), and I don't think there is a way to get Spark to write a single file. I played around with Spark locally, and it seems that no matter the number of executors, I only get a single part when the output is empty. I think the official docs say that Spark will write at least one file per partition. Even if I force my query to use multiple partitions with repartition, the final number of partitions for my empty DataFrame will be 0, hence I get a single part in my Parquet directory. If I instead change the partitions for a DataFrame with data, I get one part for each partition. That being said, for our use case it does not really matter, since this is purely for testing purposes, but if you want to align with Spark you might want to adhere to the above :). I hope this helps, please reach out if you need more information :).
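A minimal sketch of that experiment with local PySpark; the DataFrame and paths are illustrative, and the expected part-file counts are the observations reported above (they may vary across Spark versions):

```python
# Sketch of the repartition experiment described above.
import glob
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100)  # a single LongType column named "id"

# Empty result: even after repartition(4), only a single part file shows up.
df.where("id < 0").repartition(4).write.mode("overwrite").parquet("/tmp/empty_out")
print(len(glob.glob("/tmp/empty_out/part-*")))       # observed: 1

# Non-empty result: one part file per partition.
df.repartition(4).write.mode("overwrite").parquet("/tmp/non_empty_out")
print(len(glob.glob("/tmp/non_empty_out/part-*")))   # observed: 4
```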
By the way we didn’t manage to work on this just yet, but a couple of updates that might help here:
1. We merged a Schema.to_pyarrow() method, which should help with testing quite a bit (see the sketch after this list)!
2. We’re working on a new version of our execution engine, which will help us produce just 1 file instead of many files for our local runner. This will make it much easier to support this empty-file use case!
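As a rough illustration of how Schema.to_pyarrow() could be used for schema-based testing: the method name is taken from the comment above, and the DataFrame and columns are invented for illustration.

```python
# Rough sketch of using Schema.to_pyarrow() for schema checks; treat this as a
# sketch rather than a guaranteed API.
import daft

df = daft.from_pydict({"id": [1, 2], "name": ["a", "b"]})
df = df.where(daft.col("id") < 0)  # a filter that leaves zero rows

# The schema is still known from the query plan even though the result is empty,
# so tests can compare it against an expected pyarrow schema.
pa_schema = df.schema().to_pyarrow()  # method name per the comment above
assert pa_schema.names == ["id", "name"]
print(pa_schema)
```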
Hey @Hanspagh, I actually just synced with some of our team... This may not be as difficult to support as we initially thought :) Look out for an update soon!
Thank you for the update, we are looking forward to seeing what you come up with :)
Is your feature request related to a problem? Please describe.
I am trying out Daft as an alternative to Spark. In our current use of Spark, we rely on the feature that even if a DataFrame is empty, Spark will still create an empty Parquet file with the schema of the DataFrame. I would like Daft to do the same.
Describe the solution you'd like
I found the following issue in pyarrow which seems to indicate it is a "bug" in Arrow: either we wait for the upstream fix, or we work around it by using pyarrow.parquet.write_table when writing Parquet. Let me know what you think, I might also be able to help drive this change.
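For reference, a minimal sketch of the suggested workaround: pyarrow.parquet.write_table can write a zero-row table whose schema survives a round trip (the schema and file name here are illustrative).

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("id", pa.int64()), ("name", pa.string())])
empty_table = schema.empty_table()  # zero rows, but the schema is intact

pq.write_table(empty_table, "empty.parquet")

# Reading the file back recovers the original schema even though it has no rows.
assert pq.read_table("empty.parquet").schema.equals(schema)
```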