Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide access to inner Write for parquet writers #5471

Merged

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Mar 4, 2024

Which issue does this PR close?

Closes #5253

Rationale for this change

Whilst working on #5458 I wanted to explore how we could do IO better when writing parquet data to object storage. By allowing mutable access to the underlying writer, we make it easier to integrate with systems that need to flush data asynchronously.

This additionally came up in #5450.

#4122 added similar methods to the IPC readers/writers.

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 4, 2024

/// Returns a mutable reference to the underlying writer.
///
/// It is inadvisable to directly write to the underlying writer.
Copy link
Contributor Author

@tustvold tustvold Mar 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is copied from the docs on std::io::BufWriter, which is effectively what this is.

One could argue that exposing this is dangerous as they could potentially mess with the writer underneath us, but I think this is fine because the standard library does this, and there isn't anything to stop them using a writer with interior mutability anyway (as AsyncArrowWriter in fact did)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can add some additional commentary about why it is inadvisable and maybe when it is ok (clearly there is some usecase otherwise we wouldn't have it) for example

Suggested change
/// It is inadvisable to directly write to the underlying writer.
/// WARNING: directly writing to the underlying writer
/// prior to closing this ArrowWriter can result in corrupted
/// parquet files.
///
/// This API can be used to flush the underlying writer

I think the actual usecase in this PR is more complicated (the async writer writes directly into the buffer) so I probably don't fully understand what is going on / the use case. Perhaps all the more reason to improve the documentation 🤔

@tustvold
Copy link
Contributor Author

tustvold commented Mar 4, 2024

FYI @ShiKaiWi

@tustvold tustvold force-pushed the provide-access-to-inner-parquet-writers branch from a70270c to 96de92b Compare March 4, 2024 23:29
@tustvold tustvold force-pushed the provide-access-to-inner-parquet-writers branch from 96de92b to 9195e9d Compare March 4, 2024 23:29
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thank you @tustvold

I had some comment suggestions but nothing that would prevent merging

I think it is worth considering if we really need to provide inner_mut(), though perhaps some more documentation would make it clearer what was going on / when it was ok to do so


/// Returns a mutable reference to the underlying writer.
///
/// It is inadvisable to directly write to the underlying writer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can add some additional commentary about why it is inadvisable and maybe when it is ok (clearly there is some usecase otherwise we wouldn't have it) for example

Suggested change
/// It is inadvisable to directly write to the underlying writer.
/// WARNING: directly writing to the underlying writer
/// prior to closing this ArrowWriter can result in corrupted
/// parquet files.
///
/// This API can be used to flush the underlying writer

I think the actual usecase in this PR is more complicated (the async writer writes directly into the buffer) so I probably don't fully understand what is going on / the use case. Perhaps all the more reason to improve the documentation 🤔

/// Flushes any outstanding data and returns the underlying writer.
pub fn into_inner(mut self) -> Result<W> {
self.flush()?;
self.writer.into_inner()
}

/// Close and finalize the underlying Parquet writer
pub fn close(mut self) -> Result<crate::format::FileMetaData> {
///
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
///
/// returning the completed [`crate::formatFileMetaData`] for the written file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is sufficiently obvious

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My rationale for this suggestion was to make it easier for someone to quickly determine "what is the difference between the very similarly sounding finish() and close() methods

self.writer.finish()
}

/// Close and finalize the underlying Parquet writer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Close and finalize the underlying Parquet writer
/// Close and finalize the underlying Parquet writer, consuming self
///
/// Returns the metadata for the newly created Parquet file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again I think this is sufficiently obvious

parquet/src/arrow/arrow_writer/mod.rs Show resolved Hide resolved
/// Trigger forced flushing once buffer size reaches this value
buffer_size: usize,
}

impl<W: AsyncWrite + Unpin + Send> AsyncArrowWriter<W> {
/// Try to create a new Async Arrow Writer.
///
/// `buffer_size` determines the number of bytes to buffer before flushing
/// `buffer_size` determines the minimum number of bytes to buffer before flushing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the comment below says "self::write" will flush the intermediate buffer it it is at least half full, but this comment seems to imply that buffer_size determines when to flush. This seems inconsistent to me, but maybe I don't undersand

The same comments applies to try_new_with_options below


/// Returns a mutable reference to the underlying writer.
///
/// It is inadvisable to directly write to the underlying writer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, I think it would be good to explain why it is inadvisable and document when one would use this API

let err = file_writer.next_row_group().err().unwrap().to_string();
assert_eq!(err, "Parquet error: SerializedFileWriter already finished");

drop(file_writer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the drop necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file_writer has a borrow of the vec

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

parquet: add method to get both the inner writer and the file metadata when closing SerializedFileWriter
2 participants