cdc: add apache arrow parquet library and writer #99288
Conversation
e6caf6e to af6cc60
af6cc60 to 062bca4
a10630a to cffa798
Lots of comments, but most are nits. I think this is an excellent start.
Reviewed 4 of 5 files at r1, 3 of 5 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava and @samiskin)
-- commits
line 26 at r2:
is the version mention important?
pkg/ccl/changefeedccl/parquet.go
line 31 at r2 (raw file):
// MaxParquetRowGroupSize is the maximal number of rows which can be written out to
// a single row group in a parquet file.
// TODO(jayant): customers may want to adjust the row group size based on rows instead of bytes
Meh... it's a bit of math... I don't know if you need this TODO.
pkg/ccl/changefeedccl/parquet.go
line 34 at r2 (raw file):
var MaxParquetRowGroupSize = settings.RegisterIntSetting(
	settings.TenantWritable,
	"changefeed.format.parquet.max_row_group_size",
Do we want this to be a cluster setting? Customers already specify the format=parquet option... we can just have
a format-specific option rows_per_group or some such.
pkg/ccl/changefeedccl/parquet.go
line 75 at r2 (raw file):
// files. It should not be required for reading files since all the necessary
// metadata to interpret columns will be written out to the file and read by a
// reader.
❤️ this comment.
pkg/ccl/changefeedccl/parquet.go
line 94 at r2 (raw file):
func newParquetSchema(row cdcevent.Row) (*parquetSchemaDefinition, error) {
	fields := make([]schema.Node, 0)
	cols := make([]parquetColumn, 0)
you can probably allocate both of these as
cols := make([]parquetColumn, 0, len(row.ResultColumns()))
pkg/ccl/changefeedccl/parquet.go
line 125 at r2 (raw file):
	})
	groupNode, err := schema.NewGroupNode("schema", parquet.Repetitions.Required,
		fields, defaultParquetSchemaFieldID)
you probably don't need to build up fields as you're building up columns, do you?
You could just iterate columns and extract node.
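A sketch of that simplification (assuming parquetColumn keeps its node field, as in the code quoted below):

	fields := make([]schema.Node, 0, len(cols))
	for _, col := range cols {
		fields = append(fields, col.node)
	}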
pkg/ccl/changefeedccl/parquet.go
line 161 at r2 (raw file):
}
opts := []parquet.WriterProperty{parquet.WithCreatedBy("cockroachdb"),
	parquet.WithMaxRowGroupLength(1)}
I know we have discussed the replacement of the existing parquet encoding library for export some time in the future.
And, I certainly do not expect this replacement to happen as part of this PR. I do think though, that we can make
this future replacement much easier by introducing a standalone parquet encoder/writer package that does not
have any dependency on CDC code.
There are multiple benefits to this approach:
- As mentioned above, easy to swap out export implementation
- Better defined API -- parquet library deals with taking cockroach specific types (tree.Datum) and producing parquet files.
- Better testing -- it's just a lot easier to test encoding/decoding by throwing a bunch of tree.Datums at it and seeing what happens.
The final benefit is that I don't think it's such a tall order -- the code as it exists right now is already pretty clean and nice.
With that in mind, here are my concrete suggestions:
- Create a parquet directory where the parquet writer and related stuff will live (under changefeedccl for now). This entire package will eventually be moved under the sql or util package.
- Just like the underlying arrow library supports options (parquet.WithMaxRowGroupLength), your NewParquetWriter should also take options:

	type config struct {
		maxRowGroupLength int
	}

	type Option interface {
		apply(*config)
	}

  Note: by moving this stuff under the parquet package, you can clean up naming a lot -- by removing parquet prefixes on many variables.
- You will need to define a small Schema struct which just has a method Add(name, *types.T) to add columns of a specific type (a rough sketch follows below). The newParquetSchema will of course use the above schema object to convert cdcevent.Row -> Schema. NewParquetWriter will take the Schema when constructing the writer.
- Similar change to the AddData method -- we don't want to take cdcevent.Row -- either take tree.Datums or even EncDatumRow, or define an iterator-like interface.
This file -- i.e. changefeed specific wrappers for parquet will still remain because we do want to AddRow, and add event type stuff, etc. But that's a higher level package
that will be used by changefeeds.
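For illustration, a rough sketch of the suggested Schema type (all names here are assumptions, not from the PR; the benchmark later in the thread uses sch.columnNames/sch.columnTypes, suggesting a similar shape):

	// Schema collects column names and types ahead of writer construction.
	// A sketch only; field names are guesses.
	type Schema struct {
		columnNames []string
		columnTypes []*types.T
	}

	// Add registers a column of the given type.
	func (s *Schema) Add(name string, typ *types.T) {
		s.columnNames = append(s.columnNames, name)
		s.columnTypes = append(s.columnTypes, typ)
	}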
pkg/ccl/changefeedccl/parquet.go
line 171 at r2 (raw file):
}

// nonNilDefLevel represents a def level of 1, meaning that the value is non-nil.
would be nice to add a bit more context as to what those def levels are.
Perhaps link https://github.com/apache/parquet-format/blob/master/README.md#nested-encoding
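For example, a short comment along these lines (my wording, not from the PR) could cover it:

	// Parquet definition levels encode how much of an optional or nested value
	// path is present. Our schemas are flat with optional columns, so a def
	// level of 1 means the value is set and 0 means it is NULL; see the
	// parquet-format nested-encoding doc linked above.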
pkg/ccl/changefeedccl/parquet.go
line 178 at r2 (raw file):
// writeColBatch writes a value to the provided column chunk writer.
func writeColBatch(colWriter file.ColumnChunkWriter, value interface{}) (int64, error) {
what's the int result? number of items written?
pkg/ccl/changefeedccl/parquet.go
line 191 at r2 (raw file):
	return w.WriteBatch([]parquet.FixedLenByteArray{value.(parquet.FixedLenByteArray)}, nonNilDefLevel, nil)
default:
	panic("unimplemented")
let's return assertion failed error instead of panic.
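e.g., a sketch of the default branch (errors here is github.com/cockroachdb/errors; the message wording is mine):

	default:
		// Return an assertion failure instead of crashing the process.
		return 0, errors.AssertionFailedf("unsupported column chunk writer type %T", w)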
pkg/ccl/changefeedccl/parquet.go
line 208 at r2 (raw file):
	return w.WriteBatch([]parquet.FixedLenByteArray{}, nilDefLevel, nil)
default:
	panic("unimplemented")
s/panic/error/
pkg/ccl/changefeedccl/parquet.go
line 238 at r2 (raw file):
// AddData writes the updatedRow. There is no guarantee that the row will
// immediately be flushed to the output sink.
func (w *ParquetWriter) AddData(updatedRow cdcevent.Row, prevRow cdcevent.Row) error {
would it be better to pass in event type instead of the full prevRow?
pkg/ccl/changefeedccl/parquet.go
line 270 at r2 (raw file):
func getEventTypeDatum(updatedRow cdcevent.Row, prevRow cdcevent.Row) tree.Datum {
	eventTypeDatum := tree.NewDString(parquetEventInsert)
you probably want to define insert/update dstrings as variables here to avoid creating new dstrings for those.
Also, see setupContextForRow in expr_eval, where the cdc expression code computes the op type.
It's okay to repeat the code, but maybe there is a way not to... More importantly, see that function regarding
differentiating between update and insert -- which requires the withDiff option.
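A sketch of the suggested package-level datums (parquetEventUpdate is an assumed counterpart to the parquetEventInsert constant shown above):

	var (
		parquetEventInsertDatum tree.Datum = tree.NewDString(parquetEventInsert)
		// parquetEventUpdate is assumed to exist alongside parquetEventInsert.
		parquetEventUpdateDatum tree.Datum = tree.NewDString(parquetEventUpdate)
	)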
pkg/ccl/changefeedccl/parquet.go
line 291 at r2 (raw file):
// By default, a column's repetitions are set to parquet.Repetitions.Optional,
// which means that the column is nullable.
func newParquetColumn(column cdcevent.ResultColumn) (parquetColumn, error) {
nit: perhaps makeParquetColumn is a better name (usually new returns *)
pkg/ccl/changefeedccl/parquet.go
line 310 at r2 (raw file):
case types.BoolFamily:
	result.node = schema.NewBooleanNode(colName, defaultRepetitions, defaultParquetSchemaFieldID)
	result.node.LogicalType()
is this a no-op call?
pkg/ccl/changefeedccl/parquet.go
line 394 at r2 (raw file):
var parquetBoolEncoder parquetEncodeFn = func(d tree.Datum) (interface{}, error) {
	return bool(*d.(*tree.DBool)), nil
let's be a lot more verbose. Let's make sure we return an error if d is not DBool.
When we AddData, we are adding tree.Datums -- those are interfaces, so it's quite possible to make a mistake
(that's why, say, EncDatumRow has EnsureDecoded to guarantee that an encoded datum can be decoded as the target type).
Same for all of the below decoders.
(Another option: if this is only intended to be used in tests... you could move these to a test function... and then
keep those type assertions. But I do think having encoder/decoder support is good.)
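A sketch of the defensive version (using errors.AssertionFailedf from cockroachdb/errors):

	var parquetBoolEncoder parquetEncodeFn = func(d tree.Datum) (interface{}, error) {
		db, ok := d.(*tree.DBool)
		if !ok {
			// Surface a real error rather than panicking on a bad type assertion.
			return nil, errors.AssertionFailedf("expected *tree.DBool, got %T", d)
		}
		return bool(*db), nil
	}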
pkg/ccl/changefeedccl/parquet_test.go
line 122 at r2 (raw file):
		},
	},
} {
nice, high level tests. I think if the underlying parquet writer only concerned itself with datum -> parquet conversion, then
a lower level test, plus a lower level benchmark, could be written using randgen.RandDatum to generate random datums for testing.
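Such a lower-level test might start roughly like this (a sketch; randgen.RandDatum and randutil.NewTestRand are existing crdb test helpers):

	rng, _ := randutil.NewTestRand()
	// Generate a random datum of a supported type, write it out,
	// then read the file back and compare.
	d := randgen.RandDatum(rng, types.String, false /* nullOk */)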
1d47bf5 to 9831f4d
98c6e10 to f015156
Reviewed 17 of 22 files at r4, 1 of 1 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava and @samiskin)
pkg/util/parquet/writer.go
line 23 at r5 (raw file):
// Config stores configurable options for the Writer.
type Config struct {
config does not need to be exported.
External callers should use WithXXX to construct config.
pkg/util/parquet/writer.go
line 35 at r5 (raw file):
}

type option interface {
since option will be used by external callers, please export it.
apply does not need to be exported though.
pkg/util/parquet/writer.go
line 56 at r5 (raw file):
type WithVersion string

func (v WithVersion) apply(c *Config) error {
it's an interesting way to define options... I guess it's fine, though it's quite common to have something like:

	type funcOpt func(c *Config) error

	func (f funcOpt) apply(c *Config) error {
		return f(c)
	}

	func WithVersion(v string) option {
		return funcOpt(func(c *Config) error {
			....
		})
	}
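Callers would then construct a writer roughly like this (a hypothetical call matching the sketch above and the NewWriter signature quoted later in the thread):

	w, err := NewWriter(sch, sink, WithVersion("v2.6"))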
pkg/util/parquet/writer.go
line 156 at r5 (raw file):
	w.currentRowGroupWriter = w.writer.AppendBufferedRowGroup()
	w.currentRowGroupSize = 0
}
I was thinking about how to better integrate this AddData method with the existing iterator in cdcevent.
One way is to... change the cdcevent iterator somehow...
Here is another idea: instead of AddData(), we do something like:
	type Row struct {
		w *Writer
	}

	func (r Row) SetCol(idx int, d tree.Datum) error {
		// encode d into column idx of the current row...
		return nil
	}

	func (r Row) Close() error {
		r.w.currentRowGroupSize += 1
		if err := r.w.currentRowGroupWriter.Close(); err != nil {
			return err
		}
		r.w.currentRowGroupWriter = r.w.writer.AppendBufferedRowGroup()
		r.w.currentRowGroupSize = 0
		return nil
	}

	func (w *Writer) AddRow() Row {
		if w.currentRowGroupWriter == nil {
			w.currentRowGroupWriter = w.writer.AppendBufferedRowGroup()
		}
		return Row{w: w}
	}
pkg/util/parquet/writer_bench_test.go
line 16 at r4 (raw file):
Previously, jayshrivastava (Jayant) wrote…
Here's the benchmark output.
BenchmarkParquetWriter-10 3896122 3112 ns/op 2958 B/op 48 allocs/op
Per our discussion, you have a plan on how to reduce allocs.
Use unsafe string to get bytes from datum; and use idea similar to
tree.DatumAlloc to remove allocations for single element batch arrays.
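A minimal sketch of that tree.DatumAlloc-style idea (type and field names assumed):

	// batchAlloc reuses fixed one-element arrays across writeBatch calls, so
	// writing a single datum does not allocate a fresh slice per value.
	type batchAlloc struct {
		boolBatch      [1]bool
		int64Batch     [1]int64
		byteArrayBatch [1]parquet.ByteArray
	}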
pkg/util/parquet/writer_bench_test.go
line 35 at r5 (raw file):
for i := 0; i < numCols; i++ {
	sch.columnTypes[i] = types.String
	sch.columnNames[i] = fmt.Sprintf("col%d", i)
Do we want to test different types? I suspect we do, and I also suspect this isn't done yet because we don't support all types.
Perhaps allocate a few types you already support. Or... leave a todo.
716083d to 4ab442b
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @samiskin)
pkg/util/parquet/writer.go
line 23 at r5 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
config does not need to be exported.
External callers should use WithXXX to construct config.
Done.
pkg/util/parquet/writer.go
line 35 at r5 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
since option will be used by external callers, please export it.
apply does not need to be exported though.
Done.
pkg/util/parquet/writer.go
line 56 at r5 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
it's an interesting way to define options... I guess it's fine, though it's quite common to have something like:
type funcOpt func(c *Config) error func (f funcOpt) apply(c *Config) error { return f(c) } func WithVersion(v string) option { return func(c *Config) error { .... } }
Done.
pkg/util/parquet/writer.go
line 156 at r5 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I was thinking about how to better integrate this AddData method with the existing iterator in cdcevent.
One way is to .. change cdcevent iterator somehow...Here is another idea: instead of AddData(), we do something like:
type Row struct { w *Writer } func (r Row) SetCol(idx int, tree.Datum) error {} func (r Row) Close() error { w.currentRowGroupSize += 1 if err := w.currentRowGroupWriter.Close(); err != nil { return err } w.currentRowGroupWriter = w.writer.AppendBufferedRowGroup() w.currentRowGroupSize = 0 return nil } func (w *Writer) AddRow() (Row, func()) { if w.currentRowGroupWriter == nil { w.currentRowGroupWriter = w.writer.AppendBufferedRowGroup() } return Row{w: w} }
Done. I added many assertions and tests that break them. I ended up using a []uint64 bitmap because resetting the util.FastIntMap
would require reallocating it... I think. Using a []uint64 instead of a uint64 lets us support more than 64 cols.
pkg/util/parquet/writer_bench_test.go
line 16 at r4 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Per our discussion, you have a plan on how to reduce allocs.
Use unsafe string to get bytes from datum; and use idea similar to
tree.DatumAlloc to remove allocations for single element batch arrays.
Done.
pkg/util/parquet/writer_bench_test.go
line 35 at r5 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Do we want to test different types? I suspect we do, and I also suspect this isn't done yet because we don't support all types.
Perhaps allocate few types you already support. Or... leave a todo.
Added a TODO.
bd19035 to b824800
Reviewed 1 of 9 files at r7.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava and @samiskin)
pkg/ccl/changefeedccl/parquet.go
line 54 at r7 (raw file):
func AddData(
	writer *parquet.Writer, updatedRow cdcevent.Row, prevRow cdcevent.Row, returnDatums bool,
) ([]tree.Datum, error) {
I only see returnDatums parameter being used in the tests.
I'd rather not. It's okay if in the test you have a helper that iterates the row again to extract datums.
pkg/util/json/parser.go
line 335 at r7 (raw file):
// (i.e. jsonString([]byte)).
// See https://groups.google.com/g/golang-nuts/c/Zsfk-VMd_fU/m/O1ru4fO-BgAJ
func UnsafeGetBytes(s string) ([]byte, error) {
Let's not export this. It's fine to copy this utility method into your library.
While doing this, we should add a todo to replace it w/ less unsafe version once we switch to go 1.20
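For reference, the less unsafe go 1.20 version could look like this (a sketch using unsafe.StringData/unsafe.Slice, which are available starting in go 1.20):

	func unsafeGetBytes(s string) []byte {
		if len(s) == 0 {
			return nil
		}
		// Aliases the string's backing array; the caller must not mutate the result.
		return unsafe.Slice(unsafe.StringData(s), len(s))
	}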
pkg/util/parquet/writer.go
line 97 at r7 (raw file):
// compression schemes, allocator, batch size, page size etc
func NewWriter(sch *SchemaDefinition, sink io.Writer, opts ...Option) (*Writer, error) {
	cfg := newConfig()
I would inline it; and I don't think you need to make a "new config". Just a regular, stack-allocated value is fine
(you can pass opt.apply(&cfg) below).
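i.e., roughly:

	cfg := config{} // stack-allocated; defaults can be set here
	for _, opt := range opts {
		if err := opt.apply(&cfg); err != nil {
			return nil, err
		}
	}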
pkg/util/parquet/writer.go
line 154 at r7 (raw file):
func (r *RowWriter) writeIdx(idx int) {
	r.colIdxMap[idx/64] = r.colIdxMap[idx/64] | (1 << (idx % 64))
r.colIdxMap[idx>>6] |= 1 << (idx & 63)
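and the matching read side, for symmetry (a sketch; idxIsWritten is the helper referenced below):

	func (r *RowWriter) idxIsWritten(idx int) bool {
		return r.colIdxMap[idx>>6]&(1<<(idx&63)) != 0
	}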
pkg/util/parquet/writer.go
line 170 at r7 (raw file):
}
if r.idxIsWritten(idx) {
	return errors.AssertionFailedf("previously wrote datum to row at idx %d", idx)
I don't know if this is worth it... I mean, if the only reason we keep this bitmap is to tell
caller not to override the same column multiple times -- I'd say let the caller go for it.
Why not?
If you don't set anything, or you only set some columns, you wind up writing tree.DNull...
It's one thing to check bounds, and/or types. But I feel like this is a bit too much. And, there could be an argument
that if the caller knows that it only has 1 column, why force it to set more columns?
pkg/util/parquet/writer.go
line 200 at r7 (raw file):
// returned RowWriter must be used to write all datums in the row before
// the next call to AddData.
func (w *Writer) AddData() (*RowWriter, error) {
rename to AddRow() perhaps?
pkg/util/parquet/write_functions.go
line 34 at r7 (raw file):
//
// This means any array below will not be in use outside the writeBatch
// function below.
very nice.
b824800 to e9d3b84
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @samiskin)
pkg/ccl/changefeedccl/parquet.go
line 54 at r7 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I only see returnDatums parameter being used in the tests.
I'd rather not. It's okay if in the test you have a helper that iterates the row again to extract datums.
Done.
pkg/ccl/changefeedccl/parquet.go
line 74 at r8 (raw file):
// populateDatums writes the appropriate datums into the datumAlloc slice.
func populateDatums(updatedRow cdcevent.Row, prevRow cdcevent.Row, datumAlloc []tree.Datum) error {
This is its own function so we can use this production code in tests rather than copy the code in tests.
pkg/ccl/changefeedccl/parquet.go
line 75 at r8 (raw file):
// populateDatums writes the appropriate datums into the datumAlloc slice.
func populateDatums(updatedRow cdcevent.Row, prevRow cdcevent.Row, datumAlloc []tree.Datum) error {
	datums := datumAlloc[:0]
I prefer this only because I dislike using an idx when you have an iterator, like so:

	idx := 0
	if err := updatedRow.ForEachColumn().Datum(func(d tree.Datum, _ cdcevent.ResultColumn) error {
		datums[idx] = d
		idx += 1
		return nil
	})...
	datums[idx] = eventType
pkg/util/parquet/writer.go
line 97 at r7 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I would inline it; and I don't think you need to make a "new config" Just regular, stack allocated value is fine
(you can pass opt.apply(&cfg) below)
Done.
pkg/util/parquet/writer.go
line 154 at r7 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
r.colIdxMap[idx>>6] |= 1 << (idx & 63)
Done (and deleted)
pkg/util/parquet/writer.go
line 170 at r7 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I don't know if this is worth it... I mean, if the only reason we keep this bitmap is to tell
caller not to override the same column multiple times -- I'd say let the caller go for it.
Why not?
If you don't set anything, or you only set some columns, you wind up writing tree.DNull...
It's one thing to check bounds, and/or types. But I feel like this is a bit too much. And, there could be an argument
that if the caller knows that it only has 1 column, why force it to set more columns?
Per our discussion, we will now pass an []tree.Datum
pkg/util/parquet/writer.go
line 200 at r7 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
rename to AddRow() perhaps?
Done.
Code quote:
AddData
pkg/util/json/parser.go
line 335 at r7 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Let's not export this. It's fine to copy this utility method into your library.
While doing this, we should add a todo to replace it w/ less unsafe version once we switch to go 1.20
Done.
Reviewed 2 of 16 files at r3, 1 of 22 files at r4, 1 of 9 files at r7, 7 of 8 files at r8.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava and @samiskin)
pkg/ccl/changefeedccl/parquet.go
line 21 at r8 (raw file):
type parquetWriter struct {
	inner *parquet.Writer
you could just embed *parquet.Writer if you want (so that you don't need to reference inner)
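i.e., a sketch (datumAlloc matches the field used below):

	type parquetWriter struct {
		*parquet.Writer // promoted methods: w.AddRow(...) instead of w.inner.AddRow(...)
		datumAlloc []tree.Datum
	}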
pkg/ccl/changefeedccl/parquet.go
line 65 at r8 (raw file):
}
return w.inner.AddRow(w.datumAlloc)
perfect.
pkg/sql/sem/tree/datum.go
line 1043 at r8 (raw file):
// AsDDecimal attempts to retrieve a DDecimal from an Expr, returning a DDecimal and
// a flag signifying whether the assertion was successful.
func AsDDecimal(e Expr) (*DDecimal, bool) {
surprised it wasn't there before.
pkg/util/parquet/write_functions.go
line 123 at r8 (raw file):
	return []byte{}, nil
}

const maxStrLen = 1 << 30 // Really, can't see us supporting input JSONs that big.
s/JSONs/string/
e9d3b84 to 1c002e8
This commit installs the apache arrow parquet library for Go at version 11.
The release can be found here: https://github.com/apache/arrow/releases/tag/go%2Fv11.0.0
This library is licensed under the Apache License 2.0.
Informs: cockroachdb#99028
Epic: None
Release note: None
1c002e8 to 77771c1
Thanks for the reviews and for teaching me your ways 🙏
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @samiskin)
bors r=miretskiy
bors r-
Canceled.
This change implements a `Writer` struct in the new `util/parquet` package. This `Writer` writes datums to the `io.Writer` sink using a configurable parquet version (defaults to v2.6).
The package implements several features internally required to write in the parquet format:
- schema creation
- row group / column page management
- encoding/decoding of CRDB datums to parquet datums
Currently, the writer only supports types found in the TPCC workload, namely INT, DECIMAL, STRING, UUID, TIMESTAMP and BOOL.
This change also adds a benchmark and tests which verify the correctness of the writer, and test utils for reading datums from parquet files.
Informs: cockroachdb#99028
Epic: None
Release note: None
This change adds the file `parquet.go` which contains helper functions to help create parquet writers and export data via `cdcevent.Row` structs.
This change also adds tests to ensure rows are written to parquet files correctly.
Epic: None
Release note: None
77771c1 to d6acc93
bors r=miretskiy
Build succeeded:
cdc: add apache arrow parquet library
This commit installs the apache arrow parquet library for Go
at version 11. The release can be found here:
https://github.com/apache/arrow/releases/tag/go%2Fv11.0.0
This library is licensed under the Apache License 2.0.
Informs: #99028
Epic: None
Release note: None
util/parquet: create parquet writer library
This change implements a Writer struct in the new util/parquet package. This Writer writes datums to the io.Writer sink using a configurable parquet version (defaults to v2.6).
The package implements several features internally required to write in the parquet format:
- schema creation
- row group / column page management
- encoding/decoding of CRDB datums to parquet datums
Currently, the writer only supports types found in the TPCC workload, namely INT, DECIMAL, STRING, UUID, TIMESTAMP and BOOL.
This change also adds a benchmark and tests which verify the correctness of the
writer and test utils for reading datums from parquet files.
Informs: #99028
Epic: None
Release note: None
changefeedccl: add parquet writer
This change adds the file parquet.go which contains helper functions to help create parquet writers and export data via cdcevent.Row structs.
This change also adds tests to ensure rows are written to parquet files correctly.
Epic: None
Release note: None