Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changefeedccl: add full support for the parquet format #104528

Merged
merged 7 commits into from
Jun 15, 2023

Conversation

jayshrivastava
Copy link
Contributor

@jayshrivastava jayshrivastava commented Jun 7, 2023

changefeedccl: support key_in_value with parquet format

Previously, the option key_in_value was disallowed with
format=parquet. This change allows these settings to be
used together. Note that key_in_value is enabled by
default with cloudstorage sinks and format=parquet is
only allowed with cloudstorage sinks, so key_in_value is
enabled for parquet by default.

Informs: #103129
Informs: #99028
Epic: CRDB-27372
Release note: None


changefeedccl: add test coverage for parquet event types

When using format=parquet, an additional column is produced to
indicate the type of operation corresponding to the row: create,
update, or delete. This change adds coverage for this in unit
testing.

Additionally, the test modified in this change is made more simple
by reducing the number of rows and different types because this
complexity is unnecessary as all types are tested within the
util/parquet package already.

Informs: #99028
Epic: CRDB-27372
Release note: None
Epic: None


util/parquet: support tuple labels in util/parquet testutils

Previously, the test utilities in util/parquet would not reconstruct
tuples read from files with their labels. This change updates the
package to do so. This is required for testing in users of this
package such as CDC.

Informs: #99028
Epic: CRDB-27372
Release note: None


changefeedccl: support diff option with parquet format

This change adds support for the diff changefeed
options when using format=parquet. Enabling diff also adds
support for CDC Transformations with parquet.

Informs: #103129
Informs: #99028
Epic: CRDB-27372
Release note: None


changefeedccl: support end_time option with parquet format

This change adds support for the end_time changefeed
options when using format=parquet. No significant code
changes were needed to enable this feature.

Closes: #103129
Closes: #99028
Epic: CRDB-27372
Release note (enterprise change): Changefeeds now officially
support the parquet format at specificiation version 2.6.
It is only usable with the cloudstorage sink.

The syntax to use parquet is like the following:
CREATE CHANGEFEED FOR foo INTO ... WITH format=parquet

It supports all standard changefeed options and features
including CDC transformations, except it does not support the
topic_in_value option.


changefeedccl: use parquet with 50% probability in nemeses test

Informs: #99028
Epic: CRDB-27372
Release note: None


do not merge: force parquet cloud storage tests

This change forces all tests, including tests for diff and end_time
to run with the cloudstorage sink and format=parquet where possible.

Informs: #103129
Informs: #99028
Epic: CRDB-27372
Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@jayshrivastava jayshrivastava mentioned this pull request Jun 7, 2023
13 tasks
@jayshrivastava jayshrivastava force-pushed the parquet-diff branch 5 times, most recently from 10f2156 to 7e56e82 Compare June 7, 2023 21:47
Previously, the option `key_in_value` was disallowed with
`format=parquet`. This change allows these settings to be
used together. Note that `key_in_value` is enabled by
default with `cloudstorage` sinks and `format=parquet` is
only allowed with cloudstorage sinks, so `key_in_value` is
enabled for parquet by default.

Informs: cockroachdb#103129
Informs: cockroachdb#99028
Epic: CRDB-27372
Release note: None
When using `format=parquet`, an additional column is produced to
indicate the type of operation corresponding to the row: create,
update, or delete. This change adds coverage for this in unit
testing.

Additionally, the test modified in this change is made more simple
by reducing the number of rows and different types because this
complexity is unnecessary as all types are tested within the
util/parquet package already.

Informs: cockroachdb#99028
Epic: CRDB-27372
Release note: None
Epic: None
Previously, the test utilities in `util/parquet` would not reconstruct
tuples read from files with their labels. This change updates the
package to do so. This is required for testing in users of this
package such as CDC.

Informs: cockroachdb#99028
Epic: CRDB-27372
Release note: None
@jayshrivastava jayshrivastava force-pushed the parquet-diff branch 2 times, most recently from cc9914b to 883706a Compare June 8, 2023 15:10
@jayshrivastava jayshrivastava changed the title Parquet diff changefeedccl: add full support for the parquet format Jun 8, 2023
@jayshrivastava jayshrivastava marked this pull request as ready for review June 8, 2023 19:46
@jayshrivastava jayshrivastava requested a review from a team as a code owner June 8, 2023 19:46
@jayshrivastava jayshrivastava requested review from miretskiy and removed request for a team June 8, 2023 19:46
Copy link
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r2, 2 of 2 files at r3, 2 of 5 files at r4, 1 of 1 files at r5, 4 of 4 files at r7, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava)


pkg/ccl/changefeedccl/parquet.go line 74 at r7 (raw file):

const parquetOptUpdatedTimestampColName = metaSentinel + changefeedbase.OptUpdatedTimestamps
const parquetOptMVCCTimestampColName = metaSentinel + changefeedbase.OptMVCCTimestamps
const parquetOptDiffColName = metaSentinel + changefeedbase.OptDiff

should it be "prev" instead of diff? Or "_before" -- which is pretty standard name too?


pkg/ccl/changefeedccl/parquet.go line 174 at r7 (raw file):

				return nil
			})
			valueBuilder, err := json.NewFixedKeysObjectBuilder(keys)

you definitely want to construct this value builder once.


pkg/ccl/changefeedccl/parquet.go line 180 at r7 (raw file):

			if err := prevRow.ForEachColumn().Datum(func(d tree.Datum, col cdcevent.ResultColumn) error {
				j, err := tree.AsJSON(d, sessiondatapb.DataConversionConfig{}, time.UTC)

Why do we need to convert each datum to JSON? Couldn't we just valueBuilder.Set(col, d)?

This change adds support for the `diff` changefeed
options when using `format=parquet`. Enabling `diff` also adds
support for CDC Transformations with parquet.

Informs: cockroachdb#103129
Informs: cockroachdb#99028
Epic: CRDB-27372
Release note: None
This change adds support for the `end_time` changefeed
options when using `format=parquet`. No significant code
changes were needed to enable this feature.

Closes: cockroachdb#103129
Closes: cockroachdb#99028
Epic: CRDB-27372
Release note (enterprise change): Changefeeds now officially
support the parquet format at specificiation version 2.6.
It is only usable with the cloudstorage sink.

The syntax to use parquet is like the following:
`CREATE CHANGEFEED FOR foo INTO `...` WITH format=parquet`

It supports all standard changefeed options and features
including CDC transformations, except it does not support the
`topic_in_value` option.
This change forces all tests, including tests for `diff` and `end_time`
to run with the `cloudstorage` sink and `format=parquet` where possible.

Informs: cockroachdb#103129
Informs: cockroachdb#99028
Epic: CRDB-27372
Release note: None
Copy link
Contributor Author

@jayshrivastava jayshrivastava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)


pkg/ccl/changefeedccl/parquet.go line 74 at r7 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

should it be "prev" instead of diff? Or "_before" -- which is pretty standard name too?

Renamed to "before".


pkg/ccl/changefeedccl/parquet.go line 174 at r7 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

you definitely want to construct this value builder once.

Done.


pkg/ccl/changefeedccl/parquet.go line 180 at r7 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Why do we need to convert each datum to JSON? Couldn't we just valueBuilder.Set(col, d)?

Looks like tree.Datum does not implement json.JSON. We do a similar thing in the JSON encoder

j, err := tree.AsJSON(d, sessiondatapb.DataConversionConfig{}, time.UTC)
if err != nil {
return err
}
return e.valueBuilder.Set(col.Name, j)

Copy link
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 6 files at r8, 1 of 1 files at r9, 4 of 4 files at r11.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava)

@jayshrivastava
Copy link
Contributor Author

bors r=miretskiy

@craig
Copy link
Contributor

craig bot commented Jun 15, 2023

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

changefeedccl: remove limitations for parquet format cdc: add new parquet library
3 participants