Write schema for empty parquet files #2373

Closed
Hanspagh opened this issue Jun 13, 2024 · 8 comments · Fixed by #2540

@Hanspagh
Is your feature request related to a problem? Please describe.
I am trying out Daft as an alternative to Spark. In our current use of Spark we rely on the fact that even if a dataframe is empty, Spark will still create an empty Parquet file with the schema of the dataframe. I would like Daft to do the same.

Describe the solution you'd like
I found an issue in pyarrow which seems to indicate this is a bug in Arrow. Either we wait for the upstream fix, or we work around it by using pyarrow.parquet.write_table when writing Parquet.

Let me know what you think. I might also be able to help drive this change.
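
For reference, the workaround mentioned above can be sketched as follows. This is a minimal illustration (the schema is a made-up placeholder, not from the thread): pyarrow.parquet.write_table happily writes a zero-row table and keeps the schema in the file footer, which is exactly the behavior being requested.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical schema standing in for a real job's output schema.
schema = pa.schema([("id", pa.int64()), ("name", pa.string())])

# A zero-row table still carries its schema.
empty_table = pa.Table.from_pylist([], schema=schema)

# write_table preserves the schema in the file footer even with no rows,
# so readers can recover column names and types from the empty file.
pq.write_table(empty_table, "empty.parquet")

print(pq.read_schema("empty.parquet"))  # id: int64, name: string
```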

@jaychia
Contributor

jaychia commented Jun 17, 2024

Hi @Hanspagh !

Are you referring specifically to the df.write_parquet(...) API, and that you'd want an empty df to write an empty Parquet file?

We could likely corner-case this to work, but it might get a little messy because we do distributed writes. Each partition writes its own Parquet file(s), so in the event that you have N partitions and all of them are empty, we'd end up with N empty Parquet files.
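
For concreteness, here is a minimal sketch of the requested behavior, assuming Daft's Python API at the time of this discussion (daft.from_pydict, expression filters, and df.write_parquet); the column names are illustrative:

```python
import daft

df = daft.from_pydict({"id": [1, 2, 3], "name": ["a", "b", "c"]})
empty = df.where(df["id"] > 100)  # the filter removes every row

# Requested behavior: even though `empty` has zero rows, this should
# still produce at least one Parquet file carrying the schema
# {id: int64, name: string}, rather than writing nothing at all.
empty.write_parquet("out/")
```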

@Hanspagh
Author

Yes, I am referring to df.write_parquet(...). Maybe I can explain a bit more about our use case and why we 'need' empty Parquet files.

In our data transformation repo we automatically smoke-test all our transformation jobs with random sample data generated from each job's input schemas, and then validate that the output of the job matches an output schema. This works very well in Spark because, no matter the filters etc., Spark will always produce a Parquet file with a schema based on the query plan. (An illustrative version of such a check follows below.)

We are experimenting with using Daft as a replacement for Spark and would be very sad to lose this possibility for automated smoke tests.

I see the problem of ending up with several empty Parquet files because of partitions. For our use case it would not really matter, since this is only used for testing anyway, but maybe we need to make this behavior optional?
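
To make the smoke-test use case concrete, here is an illustrative check (a sketch, not Hanspagh's actual code) that reads only the Parquet footer. It works even for zero-row files, provided a schema-bearing file was written at all, which is precisely what breaks when the writer emits nothing for an empty result:

```python
import pyarrow as pa
import pyarrow.parquet as pq

def assert_output_schema(path: str, expected: pa.Schema) -> None:
    # Reads only the file metadata, so zero-row files pass too --
    # as long as a schema-bearing file exists at `path`.
    actual = pq.read_schema(path)
    assert actual.equals(expected), f"schema mismatch:\n{actual}\nvs\n{expected}"
```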

@jaychia
Contributor

jaychia commented Jun 20, 2024

That makes sense. A couple of follow-up questions for you here:

  1. Does Spark have the behavior of writing many empty Parquet files, or does it somehow just write one empty file?
  2. We've tested Spark behavior before and when writing Parquet files it always writes multiple files (at least one file per executor). Is this consistent with what you're observing as well?

There is a separate but perhaps related thread on getting Daft to write to a single Parquet file (instead of outputting multiple files): #2359

Perhaps in "single-file" mode it could be expected to output an empty Parquet file, but in the "many-file" mode it would output an empty directory.

@Hanspagh
Author

I just did a bit of experimenting. As you said, Spark will always write a Parquet directory (a folder with one or more part files in it), and I don't think there is a way to get Spark to write a single file.

I played around with Spark locally, and it seems that no matter the number of executors, I only get a single part file when the output is empty. I think the official docs say that Spark will write at least one file per partition.

Even if I force my query to use multiple partitions with repartition, the final number of partitions for my empty dataframe will be 0, hence I get a single part file in my Parquet directory. If I instead change the partitions for a dataframe with data, I get one part file for each partition. (A sketch reproducing this is below.)

That being said, for our use case it does not really matter, since this is purely for testing purposes, but if you want to align with Spark you might want to adhere to the above :).

I hope this helps; please reach out if you need more information :).
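
A minimal PySpark sketch of the experiment described above (paths and row counts are made up; the comments restate the observed behavior rather than guaranteed Spark semantics):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").getOrCreate()

# Force multiple partitions, then filter out every row.
df = spark.range(100).repartition(8)
empty = df.filter("id < 0")

# Per the observation above, the empty result still yields a directory
# with a _SUCCESS marker and a single schema-only part file.
empty.write.mode("overwrite").parquet("/tmp/empty_out")

# A non-empty frame with the same repartitioning writes one part file
# per partition.
df.write.mode("overwrite").parquet("/tmp/full_out")
```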

@jaychia
Contributor

jaychia commented Jul 3, 2024 via email

@jaychia
Contributor

jaychia commented Jul 18, 2024

Hey @Hanspagh I actually just synced with some of our team... This maybe isn't as difficult to support as we initially thought :)

Look out for an update soon!

@Hanspagh
Author

Thank you for the update, we are looking forward to seeing what you come up with :)

@jaychia
Contributor

jaychia commented Jul 24, 2024

@Hanspagh, we just merged #2540, which should do this by default. It will be out in the next release; let us know if you get to try it!
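
A quick way to verify the new behavior once the release is out (a hypothetical check, not from the thread, assuming the fix writes a schema-only file by default):

```python
import daft

df = daft.from_pydict({"x": [1.0, 2.0], "y": ["a", "b"]})
empty = df.where(df["x"] > 100.0)
empty.write_parquet("out/")

# With #2540, out/ should contain a Parquet file whose footer carries
# the schema {x: float64, y: string} despite having zero rows.
```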

@jaychia jaychia closed this as completed Jul 24, 2024