-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
util/parquet: add crdb metadata to parquet files #103131
util/parquet: add crdb metadata to parquet files #103131
Conversation
238de49
to
351524e
Compare
This change adds an option to the writer which allows the caller to write arbitrary kv metadata to parquet files. This is useful for testing purposes. Informs: cockroachdb#99028 Epic: https://cockroachlabs.atlassian.net/browse/CRDB-15071 Release note: None
f0b170a
to
3ba03db
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For more context.Context, see https://github.com/jayshrivastava/cockroach/tree/parquet-cdc, which integrates the new parquet writer with changefeeds using ReadFile
and custom metadata.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/util/parquet/writer.go
Outdated
|
||
// deserializeIntArray deserializes an integer sting in the format "23 2 32 43 | ||
// 32" to an array of ints. | ||
func deserializeIntArray(s string) ([]uint32, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are these test only functions? If so.. maybe move to testutils?
pkg/util/parquet/writer.go
Outdated
@@ -127,10 +191,15 @@ type Writer struct { | |||
// TODO(#99028): maxRowGroupSize should be a configuration Option, along with | |||
// compression schemes, allocator, batch size, page size etc | |||
func NewWriter(sch *SchemaDefinition, sink io.Writer, opts ...Option) (*Writer, error) { | |||
schMeta, err := MakeSchemaMetadata(sch) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we always add this metdata? Or should it be only added in tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I'll make it a test only thing (probably by adding a WithCRDBReaderMeta
option). It can take an enum for any additional meta (ex. primary key cols which are needed in changefeed tests).
Previously, the test utils used to read parquet files would require the writer as an argument. The main reason the writer was required is that the writer contained crdb-specific type information which could be used to decode raw data until crdb datums. With this change, the writer is updated to write this crdb-specific type information to the parquet file in its metadata. The reader is updated to the read type information from the file metadata. There is a new test utility function `ReadFile(parquetFile string)` which can be used to read all datums from a parquet file without providing any additional type information. The function also returns metadata since it is possible for users of the `Writer` to write arbitrary metadata and such users may need this metadata in testing. Informs: cockroachdb#99028 Epic: https://cockroachlabs.atlassian.net/browse/CRDB-15071 Release note: None
3ba03db
to
7a36a77
Compare
} | ||
|
||
// ReadFileAndVerifyDatums asserts that a parquet file's metadata matches the | ||
// metadata from the writer and its data matches writtenDatums. | ||
func ReadFileAndVerifyDatums( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move this function to testutils?
bors r=miretskiy |
Build failed: |
bors retry |
Build succeeded: |
util/parquet: add option to write kv metadata to files
This change adds an option to the writer which allows the caller
to write arbitrary kv metadata to parquet files. This is useful
for testing purposes.
Informs: #99028
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-15071
Release note: None
util/parquet: remove dependency on writer to read parquet files
Previously, the test utils used to read parquet files
would require the writer as an argument. The main reason
the writer was required is that the writer contained
crdb-specific type information which could be used to
decode raw data until crdb datums.
With this change, the writer is updated to write this
crdb-specific type information to the parquet file in
its metadata. The reader is updated to the read type
information from the file metadata. There is a new
test utility function
ReadFile(parquetFile string)
which can be used to read all datums from a parquet
file without providing any additional type information.
The function also returns metadata since it is possible
for users of the
Writer
to write arbitrary metadataand such users may need this metadata in testing.
Informs: #99028
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-15071
Release note: None