Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

util/parquet: add crdb metadata to parquet files #103131

Merged
merged 2 commits into from
May 15, 2023

Conversation

jayshrivastava
Copy link
Contributor

@jayshrivastava jayshrivastava commented May 11, 2023

util/parquet: add option to write kv metadata to files

This change adds an option to the writer which allows the caller
to write arbitrary kv metadata to parquet files. This is useful
for testing purposes.

Informs: #99028
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-15071
Release note: None


util/parquet: remove dependency on writer to read parquet files

Previously, the test utils used to read parquet files
would require the writer as an argument. The main reason
the writer was required is that the writer contained
crdb-specific type information which could be used to
decode raw data until crdb datums.

With this change, the writer is updated to write this
crdb-specific type information to the parquet file in
its metadata. The reader is updated to the read type
information from the file metadata. There is a new
test utility function ReadFile(parquetFile string)
which can be used to read all datums from a parquet
file without providing any additional type information.
The function also returns metadata since it is possible
for users of the Writer to write arbitrary metadata
and such users may need this metadata in testing.

Informs: #99028
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-15071
Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@jayshrivastava jayshrivastava changed the title Parquet meta refactor util/parquet: add crdb metadata to parquet files May 11, 2023
This change adds an option to the writer which allows the caller
to write arbitrary kv metadata to parquet files. This is useful
for testing purposes.

Informs: cockroachdb#99028
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-15071
Release note: None
@jayshrivastava jayshrivastava force-pushed the parquet-meta-refactor branch 3 times, most recently from f0b170a to 3ba03db Compare May 11, 2023 19:53
Copy link
Contributor Author

@jayshrivastava jayshrivastava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For more context.Context, see https://github.com/jayshrivastava/cockroach/tree/parquet-cdc, which integrates the new parquet writer with changefeeds using ReadFile and custom metadata.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@jayshrivastava jayshrivastava marked this pull request as ready for review May 11, 2023 20:50

// deserializeIntArray deserializes an integer sting in the format "23 2 32 43
// 32" to an array of ints.
func deserializeIntArray(s string) ([]uint32, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these test only functions? If so.. maybe move to testutils?

@@ -127,10 +191,15 @@ type Writer struct {
// TODO(#99028): maxRowGroupSize should be a configuration Option, along with
// compression schemes, allocator, batch size, page size etc
func NewWriter(sch *SchemaDefinition, sink io.Writer, opts ...Option) (*Writer, error) {
schMeta, err := MakeSchemaMetadata(sch)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we always add this metdata? Or should it be only added in tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'll make it a test only thing (probably by adding a WithCRDBReaderMeta option). It can take an enum for any additional meta (ex. primary key cols which are needed in changefeed tests).

Previously, the test utils used to read parquet files
would require the writer as an argument. The main reason
the writer was required is that the writer contained
crdb-specific type information which could be used to
decode raw data until crdb datums.

With this change, the writer is updated to write this
crdb-specific type information to the parquet file in
its metadata. The reader is updated to the read type
information from the file metadata. There is a new
test utility function `ReadFile(parquetFile string)`
which can be used to read all datums from a parquet
file without providing any additional type information.
The function also returns metadata since it is possible
for users of the `Writer` to write arbitrary metadata
and such users may need this metadata in testing.

Informs: cockroachdb#99028
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-15071
Release note: None
}

// ReadFileAndVerifyDatums asserts that a parquet file's metadata matches the
// metadata from the writer and its data matches writtenDatums.
func ReadFileAndVerifyDatums(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this function to testutils?

@jayshrivastava
Copy link
Contributor Author

bors r=miretskiy

@jayshrivastava jayshrivastava mentioned this pull request May 15, 2023
13 tasks
@craig
Copy link
Contributor

craig bot commented May 15, 2023

Build failed:

@jayshrivastava
Copy link
Contributor Author

bors retry

@craig
Copy link
Contributor

craig bot commented May 15, 2023

Build succeeded:

@craig craig bot merged commit 1ceb218 into cockroachdb:master May 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants