Write schema for empty parquet files #2373

Closed
Hanspagh opened this issue Jun 13, 2024 · 8 comments · Fixed by #2540

@Hanspagh
Is your feature request related to a problem? Please describe.
I am trying out Daft as an alternative to Spark. In our current use of Spark we rely on the fact that even if a dataframe is empty, Spark will still create an empty Parquet file with the schema of the dataframe. I would like Daft to do the same.

Describe the solution you'd like
I found an issue in pyarrow which seems to indicate this is a bug in Arrow. Either we wait for the upstream fix, or we work around it by using pyarrow.parquet.write_table when writing Parquet.

Let me know what you think. I might also be able to help drive this change.
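
For reference, the workaround mentioned above can be sketched as follows. This is a minimal illustration (the schema is a made-up placeholder, not from the thread): pyarrow.parquet.write_table happily writes a zero-row table and keeps the schema in the file footer, which is exactly the behavior being requested.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical schema standing in for a real job's output schema.
schema = pa.schema([("id", pa.int64()), ("name", pa.string())])

# A zero-row table still carries its schema.
empty_table = pa.Table.from_pylist([], schema=schema)

# write_table preserves the schema in the file footer even with no rows,
# so readers can recover column names and types from the empty file.
pq.write_table(empty_table, "empty.parquet")

print(pq.read_schema("empty.parquet"))  # id: int64, name: string
```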

@jaychia
Contributor

jaychia commented Jun 17, 2024

Hi @Hanspagh !

Are you referring specifically to the df.write_parquet(...) API, and that you'd want an empty df to write an empty Parquet file?

We could likely corner-case this to work, but it might get a little messy because we do distributed writes. Each partition writes its own Parquet file(s), so in the event that you have N partitions and all of them are empty, we'd end up with N empty Parquet files.
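
For concreteness, here is a minimal sketch of the requested behavior, assuming Daft's Python API at the time of this discussion (daft.from_pydict, expression filters, and df.write_parquet); the column names are illustrative:

```python
import daft

df = daft.from_pydict({"id": [1, 2, 3], "name": ["a", "b", "c"]})
empty = df.where(df["id"] > 100)  # the filter removes every row

# Requested behavior: even though `empty` has zero rows, this should
# still produce at least one Parquet file carrying the schema
# {id: int64, name: string}, rather than writing nothing at all.
empty.write_parquet("out/")
```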

@Hanspagh
Author

Yes, I am referring to df.write_parquet(...). Maybe I can explain a bit more about our use case and why we 'need' empty Parquet files.

In our data transformation repo we automatically smoke-test all our transformation jobs with random sample data generated from each job's input schemas, and then validate that the output of the job matches an output schema. This works very well in Spark because, no matter the filters etc., Spark will always produce a Parquet file with a schema based on the query plan. (An illustrative version of such a check follows below.)

We are experimenting with using Daft as a replacement for Spark and would be very sad to lose this possibility for automated smoke tests.

I see the problem of ending up with several empty Parquet files because of partitions. For our use case it would not really matter, since this is only used for testing anyway, but maybe we need to make this behavior optional?
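
To make the smoke-test use case concrete, here is an illustrative check (a sketch, not Hanspagh's actual code) that reads only the Parquet footer. It works even for zero-row files, provided a schema-bearing file was written at all, which is precisely what breaks when the writer emits nothing for an empty result:

```python
import pyarrow as pa
import pyarrow.parquet as pq

def assert_output_schema(path: str, expected: pa.Schema) -> None:
    # Reads only the file metadata, so zero-row files pass too --
    # as long as a schema-bearing file exists at `path`.
    actual = pq.read_schema(path)
    assert actual.equals(expected), f"schema mismatch:\n{actual}\nvs\n{expected}"
```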

@jaychia
Contributor

jaychia commented Jun 20, 2024

That makes sense. A couple of follow-up questions for you here:

  1. Does Spark have the behavior of writing many empty Parquet files, or does it somehow just write one empty file?
  2. We've tested Spark behavior before and when writing Parquet files it always writes multiple files (at least one file per executor). Is this consistent with what you're observing as well?

There is a separate but perhaps related thread on getting Daft to write to a single Parquet file (instead of outputting multiple files): #2359

Perhaps in "single-file" mode it could be expected to output an empty Parquet file, but in the "many-file" mode it would output an empty directory.

@Hanspagh
Author

I just did a bit of experimenting. As you said, Spark will always write a Parquet directory (a folder with one or more part files in it), and I don't think there is a way to get Spark to write a single file.

I played around with Spark locally, and it seems that no matter the number of executors, I only get a single part file when the output is empty. I think the official docs say that Spark will write at least one file per partition.

Even if I force my query to use multiple partitions with repartition, the final number of partitions for my empty dataframe will be 0, hence I get a single part file in my Parquet directory. If I instead change the partitions for a dataframe with data, I get one part file for each partition. (A sketch reproducing this is below.)

That being said, for our use case it does not really matter, since this is purely for testing purposes, but if you want to align with Spark you might want to adhere to the above :).

I hope this helps; please reach out if you need more information :).
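
A minimal PySpark sketch of the experiment described above (paths and row counts are made up; the comments restate the observed behavior rather than guaranteed Spark semantics):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()

# Force multiple partitions, then filter out every row.
df = spark.range(100).repartition(8)
empty = df.filter("id < 0")

# Per the observation above, the empty result still yields a directory
# with a _SUCCESS marker and a single schema-only part file.
empty.write.mode("overwrite").parquet("/tmp/empty_out")

# A non-empty frame with the same repartitioning writes one part file
# per partition.
df.write.mode("overwrite").parquet("/tmp/full_out")
```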

@jaychia
Contributor

jaychia commented Jul 3, 2024 via email

@jaychia
Contributor

jaychia commented Jul 18, 2024

Hey @Hanspagh I actually just synced with some of our team... This maybe isn't as difficult to support as we initially thought :)

Look out for an update soon!

@Hanspagh
Author

Thank you for the update, we are looking forward to seeing what you come up with :)

@jaychia
Contributor

jaychia commented Jul 24, 2024

@Hanspagh, we just merged #2540, which should do this by default. It will be out in the next release; let us know if you get to try it!
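
A quick way to verify the new behavior once the release is out (a hypothetical check, not from the thread, assuming the fix writes a schema-only file by default):

```python
import daft

df = daft.from_pydict({"x": [1.0, 2.0], "y": ["a", "b"]})
empty = df.where(df["x"] > 100.0)
empty.write_parquet("out/")

# With #2540, out/ should contain a Parquet file whose footer carries
# the schema {x: float64, y: string} despite having zero rows.
```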

@jaychia jaychia closed this as completed Jul 24, 2024