-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
changefeedccl: add full support for the parquet format #104528
Conversation
10f2156
to
7e56e82
Compare
Previously, the option `key_in_value` was disallowed with `format=parquet`. This change allows these settings to be used together. Note that `key_in_value` is enabled by default with `cloudstorage` sinks and `format=parquet` is only allowed with cloudstorage sinks, so `key_in_value` is enabled for parquet by default. Informs: cockroachdb#103129 Informs: cockroachdb#99028 Epic: CRDB-27372 Release note: None
When using `format=parquet`, an additional column is produced to indicate the type of operation corresponding to the row: create, update, or delete. This change adds coverage for this in unit testing. Additionally, the test modified in this change is made more simple by reducing the number of rows and different types because this complexity is unnecessary as all types are tested within the util/parquet package already. Informs: cockroachdb#99028 Epic: CRDB-27372 Release note: None Epic: None
Previously, the test utilities in `util/parquet` would not reconstruct tuples read from files with their labels. This change updates the package to do so. This is required for testing in users of this package such as CDC. Informs: cockroachdb#99028 Epic: CRDB-27372 Release note: None
cc9914b
to
883706a
Compare
883706a
to
fab7b69
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r2, 2 of 2 files at r3, 2 of 5 files at r4, 1 of 1 files at r5, 4 of 4 files at r7, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava)
pkg/ccl/changefeedccl/parquet.go
line 74 at r7 (raw file):
const parquetOptUpdatedTimestampColName = metaSentinel + changefeedbase.OptUpdatedTimestamps const parquetOptMVCCTimestampColName = metaSentinel + changefeedbase.OptMVCCTimestamps const parquetOptDiffColName = metaSentinel + changefeedbase.OptDiff
should it be "prev" instead of diff? Or "_before" -- which is pretty standard name too?
pkg/ccl/changefeedccl/parquet.go
line 174 at r7 (raw file):
return nil }) valueBuilder, err := json.NewFixedKeysObjectBuilder(keys)
you definitely want to construct this value builder once.
pkg/ccl/changefeedccl/parquet.go
line 180 at r7 (raw file):
if err := prevRow.ForEachColumn().Datum(func(d tree.Datum, col cdcevent.ResultColumn) error { j, err := tree.AsJSON(d, sessiondatapb.DataConversionConfig{}, time.UTC)
Why do we need to convert each datum to JSON? Couldn't we just valueBuilder.Set(col, d)?
fab7b69
to
37a91ee
Compare
This change adds support for the `diff` changefeed options when using `format=parquet`. Enabling `diff` also adds support for CDC Transformations with parquet. Informs: cockroachdb#103129 Informs: cockroachdb#99028 Epic: CRDB-27372 Release note: None
This change adds support for the `end_time` changefeed options when using `format=parquet`. No significant code changes were needed to enable this feature. Closes: cockroachdb#103129 Closes: cockroachdb#99028 Epic: CRDB-27372 Release note (enterprise change): Changefeeds now officially support the parquet format at specificiation version 2.6. It is only usable with the cloudstorage sink. The syntax to use parquet is like the following: `CREATE CHANGEFEED FOR foo INTO `...` WITH format=parquet` It supports all standard changefeed options and features including CDC transformations, except it does not support the `topic_in_value` option.
Informs: cockroachdb#99028 Epic: CRDB-27372 Release note: None
This change forces all tests, including tests for `diff` and `end_time` to run with the `cloudstorage` sink and `format=parquet` where possible. Informs: cockroachdb#103129 Informs: cockroachdb#99028 Epic: CRDB-27372 Release note: None
37a91ee
to
d5d1c88
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)
pkg/ccl/changefeedccl/parquet.go
line 74 at r7 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
should it be "prev" instead of diff? Or "_before" -- which is pretty standard name too?
Renamed to "before".
pkg/ccl/changefeedccl/parquet.go
line 174 at r7 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
you definitely want to construct this value builder once.
Done.
pkg/ccl/changefeedccl/parquet.go
line 180 at r7 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Why do we need to convert each datum to JSON? Couldn't we just valueBuilder.Set(col, d)?
Looks like tree.Datum
does not implement json.JSON
. We do a similar thing in the JSON encoder
cockroach/pkg/ccl/changefeedccl/encoder_json.go
Lines 212 to 216 in 07101ad
j, err := tree.AsJSON(d, sessiondatapb.DataConversionConfig{}, time.UTC) | |
if err != nil { | |
return err | |
} | |
return e.valueBuilder.Set(col.Name, j) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 6 files at r8, 1 of 1 files at r9, 4 of 4 files at r11.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava)
bors r=miretskiy |
Build succeeded: |
changefeedccl: support key_in_value with parquet format
Previously, the option
key_in_value
was disallowed withformat=parquet
. This change allows these settings to beused together. Note that
key_in_value
is enabled bydefault with
cloudstorage
sinks andformat=parquet
isonly allowed with cloudstorage sinks, so
key_in_value
isenabled for parquet by default.
Informs: #103129
Informs: #99028
Epic: CRDB-27372
Release note: None
changefeedccl: add test coverage for parquet event types
When using
format=parquet
, an additional column is produced toindicate the type of operation corresponding to the row: create,
update, or delete. This change adds coverage for this in unit
testing.
Additionally, the test modified in this change is made more simple
by reducing the number of rows and different types because this
complexity is unnecessary as all types are tested within the
util/parquet package already.
Informs: #99028
Epic: CRDB-27372
Release note: None
Epic: None
util/parquet: support tuple labels in util/parquet testutils
Previously, the test utilities in
util/parquet
would not reconstructtuples read from files with their labels. This change updates the
package to do so. This is required for testing in users of this
package such as CDC.
Informs: #99028
Epic: CRDB-27372
Release note: None
changefeedccl: support diff option with parquet format
This change adds support for the
diff
changefeedoptions when using
format=parquet
. Enablingdiff
also addssupport for CDC Transformations with parquet.
Informs: #103129
Informs: #99028
Epic: CRDB-27372
Release note: None
changefeedccl: support end_time option with parquet format
This change adds support for the
end_time
changefeedoptions when using
format=parquet
. No significant codechanges were needed to enable this feature.
Closes: #103129
Closes: #99028
Epic: CRDB-27372
Release note (enterprise change): Changefeeds now officially
support the parquet format at specificiation version 2.6.
It is only usable with the cloudstorage sink.
The syntax to use parquet is like the following:
CREATE CHANGEFEED FOR foo INTO
...WITH format=parquet
It supports all standard changefeed options and features
including CDC transformations, except it does not support the
topic_in_value
option.changefeedccl: use parquet with 50% probability in nemeses test
Informs: #99028
Epic: CRDB-27372
Release note: None
do not merge: force parquet cloud storage tests
This change forces all tests, including tests for
diff
andend_time
to run with the
cloudstorage
sink andformat=parquet
where possible.Informs: #103129
Informs: #99028
Epic: CRDB-27372
Release note: None