
Document parquet writer memory limiting (#5450) #5457

Merged
merged 3 commits into apache:master on Mar 8, 2024

Conversation

@tustvold (Contributor) commented Mar 2, 2024

Which issue does this PR close?

Closes #5450

Rationale for this change

What changes are included in this PR?

This adds an example showing how to limit the growth of the buffers within the arrow writers. I debated exposing a setting to control this, but the way it would only ever be a best effort limit felt a little off. Making it explicit in the client implementation at least makes it very obvious how it behaves.
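For illustration, a minimal sketch of the pattern being documented, assuming the synchronous `ArrowWriter`'s `in_progress_size` and `flush` methods; the helper name and byte threshold are made up for the example, and the limit is only best effort because it is checked between batches:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow_array::RecordBatch;
use arrow_schema::Schema;
use parquet::arrow::ArrowWriter;
use parquet::errors::Result;

/// Write `batches`, flushing the in-progress row group whenever its estimated
/// size exceeds `limit` bytes. Best effort only: the check runs between
/// batches, so a single large batch can still overshoot the limit.
fn write_with_memory_limit(
    file: File,
    schema: Arc<Schema>,
    batches: &[RecordBatch],
    limit: usize,
) -> Result<()> {
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    for batch in batches {
        writer.write(batch)?;
        if writer.in_progress_size() > limit {
            // Closes the current row group early, trading more (smaller) row
            // groups and a larger file for bounded buffering.
            writer.flush()?;
        }
    }
    writer.close()?;
    Ok(())
}
```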

Are there any user-facing changes?

@github-actions bot added the parquet label (Changes to the parquet crate) on Mar 2, 2024
@DDtKey (Contributor) left a comment


I still think it would be nice to have an additional config (or method) to enforce a flush when the buffer size is reached, so this logic can be encapsulated away from user code 🤔

Making it explicit in the client implementation at least makes it very obvious how it behaves.

To be honest, I'm not sure it's clearer. It's documented now, yes. But we still have the buffer-size config, and in fact it's not clear to the user what it does, and we can't enforce it.

Just as an example, take a look at the current documentation of the buffer size:

/// buffer_size determines the number of bytes to buffer before flushing
/// to the underlying [AsyncWrite]

But it can be done/discussed separately ofc

Comment on lines +73 to +79
/// ## Memory Limiting
///
/// The nature of parquet forces buffering of an entire row group before it can be flushed
/// to the underlying writer. This buffering may exceed the configured buffer size
/// of [`AsyncArrowWriter`]. Memory usage can be limited by prematurely flushing the row group,
/// although this will have implications for file size and query performance. See [ArrowWriter]
/// for more information.
Contributor

Great to have this documented! Thanks!

Should we refer to this in instantiation methods? (try_new(_with_options))

Contributor

I agree it would help -- perhaps something like the following in try_new / try_new_with_options:

/// Please see the documentation on [`Self`] for details on memory usage.

@tustvold (Contributor, Author) commented Mar 5, 2024

I still think would be nice to have an additional config(or method) to "enforce flush on buffer size"

No objections from me. FWIW, my conclusion from the investigation under #5458 is that we should probably encourage people to move away from AsyncWrite in general.

@alamb (Contributor) left a comment


Thank you @tustvold -- this is an improvement in my mind. I filed #5484 to discuss additional API work


///
/// ## Memory Limiting
///
/// The nature of parquet forces buffering of an entire row group before it can be flushed
Contributor

Would it be worth suggesting to users that if they want to minimize memory overages when writing such data, they can pass in smaller RecordBatches (e.g. split up via RecordBatch::slice), which gives the parquet writer more chances to check / flush?
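For what it's worth, a hedged sketch of that suggestion: splitting a large batch into fixed-size chunks with `RecordBatch::slice` before writing, so the writer gets more opportunities to check its in-progress size (the helper name and chunk size are illustrative):

```rust
use arrow_array::RecordBatch;

/// Split `batch` into slices of at most `max_rows` rows. Slicing is zero-copy,
/// so this does not duplicate the underlying buffers; it only gives the writer
/// more frequent chances to check / flush between writes.
fn split_batch(batch: &RecordBatch, max_rows: usize) -> Vec<RecordBatch> {
    assert!(max_rows > 0);
    (0..batch.num_rows())
        .step_by(max_rows)
        .map(|offset| batch.slice(offset, max_rows.min(batch.num_rows() - offset)))
        .collect()
}
```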

@alamb (Contributor) commented Mar 7, 2024

But it can be done/discussed separately ofc

I filed #5484 to track / discuss this idea. Thank you @DDtKey

@tustvold merged commit 79634c0 into apache:master on Mar 8, 2024
16 checks passed
@mapleFU (Member) commented Mar 9, 2024

Hi, thanks for the documentation here; I think I can also copy this over to the C++ Parquet lib. By the way, do you think it's necessary to also note that dictionary encoding might be memory consuming? (Since a huge dictionary might be allocated.)

@tustvold (Contributor, Author) commented Mar 9, 2024

By the way, do you think it's necessary to also note that dictionary encoding might be memory consuming?

We have a maximum dictionary page size that I think should bound this; I believe the C++ parquet lib does something similar.
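For reference, that limit is configurable through the writer properties; a minimal sketch, assuming the `set_dictionary_page_size_limit` builder method, with the 1 MiB value purely illustrative:

```rust
use parquet::file::properties::WriterProperties;

fn dictionary_capped_props() -> WriterProperties {
    WriterProperties::builder()
        // Cap the per-column dictionary page size; once a column's dictionary
        // grows past this limit the writer falls back to non-dictionary
        // encoding for the remainder of that column chunk.
        .set_dictionary_page_size_limit(1024 * 1024) // 1 MiB (illustrative)
        .build()
}
```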

@mapleFU (Member) commented Mar 9, 2024

Generally yes, but in practice, when writing to a wide table with several hundred columns, it will still be a huge overhead... Velox disables dictionary encoding when memory usage is high. I think we could also note that?
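If the per-column dictionaries across a very wide table are the concern, dictionary encoding can also be switched off entirely via the writer properties; a small sketch, assuming the `set_dictionary_enabled` builder method:

```rust
use parquet::file::properties::WriterProperties;

fn no_dictionary_props() -> WriterProperties {
    WriterProperties::builder()
        // Disable dictionary encoding so no per-column dictionaries are
        // accumulated while a row group is being built.
        .set_dictionary_enabled(false)
        .build()
}
```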

Labels: parquet (Changes to the parquet crate)
Successfully merging this pull request may close these issues.

AsyncArrowWriter doesn't limit underlying ArrowWriter to respect buffer-size