
util/parquet: support tuples #103589

Merged
merged 1 commit into cockroachdb:master from parquet-tuple on May 22, 2023

Conversation

@jayshrivastava (Contributor) commented May 18, 2023

This change adds support for writing tuples. Implementation details below.

The standard way to write a tuple in parquet is to use a group:

```
message schema {                 -- toplevel schema
   optional group a {
       optional T1 element;       -- physical column for the first field
       ...
       optional Tn element;       -- physical column for the nth field
   }
}
```

Parquet is a strict columnar format, so it does not write such a group
as one column with all the fields adjacent to each other. Instead, it
writes each field in the tuple as its own physical column. This 1:N
mapping from CRDB datum to physical column in parquet violates the
assumption, used throughout this library, that the mapping is 1:1.
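
For illustration, here is a minimal, self-contained Go sketch of the bookkeeping this implies (the `datumColumn` field names mirror this PR's `schema.go`; everything else is illustrative, not the library's actual code):

```go
package main

import "fmt"

// datumColumn records where a datum column's physical columns begin and
// how many there are (names mirror pkg/util/parquet/schema.go in this PR;
// the rest of this sketch is illustrative only).
type datumColumn struct {
	name                 string
	numPhysicalCols      int
	physicalColsStartIdx int
}

func main() {
	// Schema: a INT, t TUPLE(INT, STRING), b BOOL. The tuple fans out into
	// two physical columns; the scalars map 1:1.
	cols := []datumColumn{
		{name: "a", numPhysicalCols: 1},
		{name: "t", numPhysicalCols: 2},
		{name: "b", numPhysicalCols: 1},
	}
	idx := 0
	for i := range cols {
		cols[i].physicalColsStartIdx = idx
		idx += cols[i].numPhysicalCols
	}
	for _, c := range cols {
		fmt.Printf("%s -> physical columns [%d, %d)\n",
			c.name, c.physicalColsStartIdx, c.physicalColsStartIdx+c.numPhysicalCols)
	}
	// Output:
	// a -> physical columns [0, 1)
	// t -> physical columns [1, 3)
	// b -> physical columns [3, 4)
}
```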

This change updates the library to remove that assumption. First, there
is now a clear distinction between a "datum column" and a "physical
column". The Writer is updated to write to multiple physical columns
for a given datum, and the reader is updated to "squash" physical
columns into single tuple datums where needed. Finally, randomized
testing and benchmarking are extended to cover tuples.
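
As a rough sketch of the reader-side "squash" (`squashTuple` is a hypothetical helper; `tree.NewDTuple` is CRDB's real tuple constructor):

```go
import (
	"github.com/cockroachdb/cockroach/pkg/sql/sem/tree"
	"github.com/cockroachdb/cockroach/pkg/sql/types"
)

// squashTuple rebuilds a single tuple datum from the datums read out of
// the tuple's physical columns.
func squashTuple(typ *types.T, fields []tree.Datum) tree.Datum {
	return tree.NewDTuple(typ, fields...)
}
```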

Informs: #99028
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-15071
Release note: None

@cockroach-teamcity (Member) commented:

This change is Reviewable

@jayshrivastava changed the title from "Parquet tuple" to "util/parquet: support tuples" on May 18, 2023
@jayshrivastava force-pushed the parquet-tuple branch 6 times, most recently from e96a939 to 71f2e7c on May 19, 2023 18:28
@jayshrivastava marked this pull request as ready for review on May 19, 2023 18:28
@jayshrivastava requested a review from a team as a code owner on May 19, 2023 18:28
@jayshrivastava mentioned this pull request on May 19, 2023
@miretskiy (Contributor) left a comment


Reviewed 7 of 8 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava)


pkg/util/parquet/schema.go line 54 at r1 (raw file):

	// Typically, there is a 1:1 mapping between columns and physical columns in
	// a file. However, if there is at least one tuple in the schema, then the
	// mapping is not 1:1 since each field in a tuple in which case each field

reads a bit strange... maybe drop "in which case each field"?


pkg/util/parquet/schema.go line 62 at r1 (raw file):

	// start index of datumColumn k is 3.
	numPhysicalCols      uint32
	physicalColsStartIdx uint32

just use integers? no need to cast to uint?
In general, use uints to represent bits; indexes -- use regular int.


pkg/util/parquet/schema.go line 115 at r1 (raw file):

// makeColumn constructs a datumColumn. It does not populate
// datumColumn.physicalColsStartIdx.

thank you for mentioning the physicalColsStartIdx bit.


pkg/util/parquet/schema.go line 129 at r1 (raw file):

			defaultTypeLength, defaultSchemaFieldID)
		if err != nil {
			return datumColumn{}, err

thanks for making this change too.


pkg/util/parquet/schema.go line 397 at r1 (raw file):

		scalarColWriter, ok := elementCol.colWriter.(scalarWriter)
		if !ok {
			return datumColumn{}, errors.AssertionFailedf("expected scalar datumColumn writer")

nit: I think error message should keep "column" instead of datumColumn.


pkg/util/parquet/schema.go line 416 at r1 (raw file):

		if len(contents) == 0 {
			return result, pgerror.Newf(pgcode.FeatureNotSupported,
				"parquet writer does not support tuples with zero fields")

nit: does not support empty tuple types?


pkg/util/parquet/schema.go line 424 at r1 (raw file):

			if innerTyp.Family() == types.ArrayFamily {
				return result, pgerror.Newf(pgcode.FeatureNotSupported,
					"parquet writer does not support arrays in tuples")

is it parquet that doesn't support nested arrays/tuples or is it... us?


pkg/util/parquet/schema.go line 432 at r1 (raw file):

			var label string
			if labels == nil {
				label = fmt.Sprintf("%d", i)

maybe use "f#" (field #) or "field#"?
Do you think perhaps you should use the name of the top-level tuple and suffix it with "f#"?
(my_tuple_col_f1, my_tuple_col_f2; or, if labels are set, my_tuple_col_first_name, my_tuple_col_last_name)
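
A sketch of that naming scheme (hypothetical helper, not code from this PR):

```go
import "fmt"

// physicalColNames prefixes each tuple field's name with the tuple
// column's own name, falling back to "f#" when there are no labels.
func physicalColNames(tupleName string, labels []string, n int) []string {
	names := make([]string, n)
	for i := 0; i < n; i++ {
		if labels == nil {
			names[i] = fmt.Sprintf("%s_f%d", tupleName, i+1)
		} else {
			names[i] = fmt.Sprintf("%s_%s", tupleName, labels[i])
		}
	}
	return names
}

// physicalColNames("my_tuple_col", nil, 2)
//   -> [my_tuple_col_f1 my_tuple_col_f2]
// physicalColNames("my_tuple_col", []string{"first_name", "last_name"}, 2)
//   -> [my_tuple_col_first_name my_tuple_col_last_name]
```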


pkg/util/parquet/testutils.go line 166 at r1 (raw file):

type ReadDatumsMetadata struct {
	// KVMeta is the kv metadata read from the file.
	KVMeta map[string]string

can we not use "KV" -- has overloaded meaning. Anything other than that. FileMetadata; FieldMeta, etc.


pkg/util/parquet/testutils.go line 553 at r1 (raw file):

//
// Behavior is undefined if the intervals are not sorted, not disjoint,
// not ascending, or out of bounds.

neat; I think behavior is defined: boom! :)


pkg/util/parquet/write_functions.go line 144 at r1 (raw file):

// A colWriter is responsible for encoding a datum and writing it to
// a file.ColumnChunkWriter. In some cases, the datum will be divided
// and written separately to different file.ColumnChunkWriters.

just say that the array is usually a single-element array for scalar types, and tuple types have more...
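
For example, the tuple case might fan out roughly like this (sketch only; `writeScalar` is an assumed stand-in for the scalar encoding path, not a real function in this PR):

```go
// writeTuple writes each tuple field to its own physical column writer:
// fields[i] goes to writers[i]. Scalar datums get a one-element writers
// slice instead.
func writeTuple(fields []tree.Datum, writers []file.ColumnChunkWriter) error {
	for i, f := range fields {
		if err := writeScalar(f, writers[i]); err != nil {
			return err
		}
	}
	return nil
}
```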


pkg/util/parquet/write_functions.go line 770 at r1 (raw file):

}

// batchWriter is an interface representing parquet datumColumn chunk writers such as

I hate GoLand rename refactorings... I think you should keep "column" in the comment because that's what you meant.


pkg/util/parquet/writer.go line 141 at r1 (raw file):

	currentRowGroupWriter file.BufferedRowGroupWriter
	// Caches the file.ColumnChunkWriters for each datumColumn in the
	// schema definition.

neat... but requires an expanded comment :)
Add examples/explanations for why an array is not okay and you need a map[int] -> []writer.


pkg/util/parquet/writer.go line 189 at r1 (raw file):

	}
	colChunkWriters := make([]file.ColumnChunkWriter, 0, w.sch.cols[colIdx].numPhysicalCols)
	for i := w.sch.cols[colIdx].physicalColsStartIdx; i < w.sch.cols[colIdx].physicalColsStartIdx+w.sch.cols[colIdx].numPhysicalCols; i += 1 {

nit: perhaps do

startIdx := w.sch....
numFields := w.....numPhysicalCols

and use those in this loop to shorten/improve readability.
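
Spelled out with the names from the quoted code above (loop body elided, as in the original):

```go
col := w.sch.cols[colIdx]
startIdx := col.physicalColsStartIdx
numFields := col.numPhysicalCols
colChunkWriters := make([]file.ColumnChunkWriter, 0, numFields)
for i := startIdx; i < startIdx+numFields; i++ {
	// ... append the chunk writer for physical column i ...
}
```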


pkg/util/parquet/writer.go line 237 at r1 (raw file):

		w.currentRowGroupSize = 0
		// Evict all entries from the cache.
		w.columnChunkWriterCache = make(map[int][]file.ColumnChunkWriter, len(w.sch.cols))

we have to recreate the map every time we reach the row group length?
Why? (And that makes me somewhat sad.)


pkg/util/parquet/writer_test.go line 478 at r1 (raw file):

					types.MakeTuple([]*types.T{types.Int, types.Int}),
				},
				columnNames: []string{"a", "b", "c", "d", "e", "f", "g", "h"},

let's add the test that doesn't have tuple labels.

@jayshrivastava (Contributor, Author) commented:
bors r=miretskiy

@craig (bot) commented May 22, 2023

Build succeeded:

@craig (bot) merged commit 7ed8248 into cockroachdb:master on May 22, 2023