changefeedccl: remove limitations for parquet format #103129

Closed
3 of 7 tasks
jayshrivastava opened this issue May 11, 2023 · 1 comment · Fixed by #104528
Labels: A-cdc (Change Data Capture), C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception), T-cdc

Comments

jayshrivastava commented May 11, 2023

#99028 tracks all the steps required to add the Apache Arrow parquet library and remove the old one. This issue tracks the changefeed options that are not supported by parquet (these are the same as the options not supported by initial scan).

Jira issue: CRDB-27845
Epic CRDB-27372

@jayshrivastava jayshrivastava added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-cdc Change Data Capture T-cdc labels May 11, 2023
blathers-crl bot commented May 11, 2023

cc @cockroachdb/cdc

jayshrivastava changed the title from "changefeedccl: remove limitations for parquet" to "changefeedccl: remove limitations for parquet format" May 11, 2023
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jun 5, 2023
Previously, `format=parquet` and `resolved` could not be used
together when running changefeeds. This change adds support for
this.

The release note is being left intentionally blank for a future
commit.

Informs: cockroachdb#103129
Release note: None
craig bot pushed a commit that referenced this issue Jun 5, 2023
101790: ui: update changefeed metrics page r=samiskin a=samiskin

Resolves #98085
Resolves #98088
Resolves #99409
Resolves #100640
Resolves #97931
Resolves #98086

This PR does the following changes:
- Added a `scale` parameter to `Metrics` so that I could support a duration metric that's being emitted in seconds rather than nanoseconds. Would like frontend feedback on whether this is ok.
- Added support for minutes and hours in Duration graphs
- There is now a "Changefeed Status" graph to show counts of Running/Paused/Failed
- There is now a "Commit Latency" graph to show P50, P90, and P99 commit latencies
- Sink Byte Traffic is now Emitted Bytes
- Sink Timings has been removed because I don't believe either of the metrics exist anymore
- Max Changefeed Latency is now Max Checkpoint Latency
- There is now a Backfill Pending Ranges graph
- There is now a Protected Timestamp Max Age graph
- There is now a Schema Registry Registrations graph

<img width="824" alt="Screenshot 2023-04-18 at 5 02 33 PM" src="https://user-images.githubusercontent.com/6236424/232904464-d52000d9-7e4f-4fd2-a1ee-7df6eaf41c4a.png">
<img width="660" alt="Screenshot 2023-04-19 at 12 48 19 PM" src="https://user-images.githubusercontent.com/6236424/233144858-3bd27004-b907-4c31-90d2-aeec6695f6aa.png">
<img width="655" alt="Screenshot 2023-04-18 at 4 19 24 PM" src="https://user-images.githubusercontent.com/6236424/232895664-97103785-13f8-4224-90e6-5706a8f4dd37.png">
<img width="660" alt="Screenshot 2023-04-18 at 4 19 42 PM" src="https://user-images.githubusercontent.com/6236424/232895722-bc60bf04-c08a-48f0-ac93-1b48d3a4303c.png">

Release note (ui change): The metrics page for changefeeds has been updated with new graphs to track backfill progress, protected timestamps age, and number of schema registry registrations.

104239: cloud: limit object reads for pebble for S3 and GCS r=RaduBerinde a=RaduBerinde

#### cloud: consolidate ReadFile APIs

This change consolidates the `ReadFile` and `ReadFileAt` APIs in
`cloud.ExternalStorage`. We use a `ReadOptions` struct to optionally
specify the offset or that we don't care about the size. This will
allow us to add more options without large code changes.

Epic: none
Release note: None
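
The options-struct pattern described above can be sketched as follows. This is a minimal illustration, not the actual `cloud.ExternalStorage` code: the field names `Offset` and `NoFileSize` and the in-memory `readFile` helper are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// ReadOptions is a hypothetical options struct. The field names are
// illustrative and are not taken from the cockroach source.
type ReadOptions struct {
	Offset     int64 // byte offset to start reading from
	NoFileSize bool  // the caller does not need the total object size
}

// readFile is a stand-in for a storage read over an in-memory blob.
// It returns the bytes from opts.Offset onward and, unless the caller
// opted out, the total size of the object.
func readFile(blob []byte, opts ReadOptions) (data []byte, size int64, err error) {
	if opts.Offset < 0 || opts.Offset > int64(len(blob)) {
		return nil, 0, errors.New("offset out of range")
	}
	if !opts.NoFileSize {
		size = int64(len(blob))
	}
	return blob[opts.Offset:], size, nil
}

func main() {
	blob := []byte("hello parquet")
	// One entry point covers both the "read everything" and the
	// "read from an offset" cases.
	data, size, err := readFile(blob, ReadOptions{Offset: 6})
	fmt.Println(string(data), size, err)
}
```

With a single options struct, a later option (such as a read limit hint) becomes one more field rather than another method on the interface.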

#### cloud: add read LimitHint, implement for S3 and GCS

Epic: none
Release note: None

#### storage: set LimitHint for pebble object reads

Epic: none
Release note: None


104283: changefeedccl: support the resolved option with format=parquet r=miretskiy a=jayshrivastava

Previously, `format=parquet` and `resolved` could not be used
together when running changefeeds. This change adds support for
this.

The release note is being left intentionally blank for a future
commit.

Informs: #103129
Release note: None

104286: keyvisualizer: return error if delete query fails r=zachlite a=zachlite

This commit returns the error produced from `DeleteSamplesBeforeTime`, if any.  Before this change, an error would cause a panic, which is disruptive and unnecessary. 

The caller of this function returns errors produced to the job system, which will back off, and try again later. For more details, see [Resume](https://github.com/cockroachdb/cockroach/blob/afcd974a8ca96f9f89a3ccb2e2b75bd70830fbf6/pkg/keyvisualizer/keyvisjob/job.go#L38).

resolves #103968
Epic: none
Release note (bug fix): The keyvisualizer job no longer panics
if an error is encountered while cleaning up stale samples. Instead,
if the job encounters an error, the job will try again later.
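
The change from panicking to returning the error can be sketched roughly as below. The function names and the `fail` switch are hypothetical stand-ins, not the real keyvisualizer code; only the shape (wrap and return the error so the caller can back off and retry) follows the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// deleteSamplesBefore is a hypothetical stand-in for the delete query.
// The fail flag simulates a query error.
func deleteSamplesBefore(fail bool) error {
	if fail {
		return errors.New("delete query failed")
	}
	return nil
}

// cleanupStaleSamples returns the error to its caller instead of
// panicking, so a job system can back off and try again later.
func cleanupStaleSamples(fail bool) error {
	if err := deleteSamplesBefore(fail); err != nil {
		return fmt.Errorf("cleaning up stale samples: %w", err)
	}
	return nil
}

func main() {
	fmt.Println(cleanupStaleSamples(false))
	fmt.Println(cleanupStaleSamples(true))
}
```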

104288: roachtest: harden tpchvec/perf r=yuzefovich a=yuzefovich

This commit hardens the `tpchvec/perf` roachtest so that it is less likely to fail due to a performance flake. The test asserts that if a query's runtime in the ON config is 1.5x slower than in the OFF config, some bundles are collected and the test fails. However, we have often seen cases where those bundles don't explain the slowness (which is likely to be intermittent). To prevent these false positives, the test now reruns the query that was marked as too slow one more time and fails only if it is significantly slower again in the ON config than in the OFF config.

Fixes: #101526.

Release note: None
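
The retry rule described above can be sketched as follows. This is an illustration under assumptions, not the roachtest code: the `measure` callback and `replay` helper are hypothetical, and the sketch reuses the 1.5x ratio for the second run because the commit message does not say what "significantly slower" means there.

```go
package main

import "fmt"

// slowdownThreshold mirrors the 1.5x ratio from the commit message.
const slowdownThreshold = 1.5

// tooSlow reports whether the ON runtime is at least 1.5x the OFF runtime.
func tooSlow(on, off float64) bool {
	return on >= slowdownThreshold*off
}

// shouldFail measures once and, only if that run looks too slow,
// measures a second time. The test fails only if both runs are too slow.
// Assumption: the same threshold applies to the second run.
func shouldFail(measure func() (on, off float64)) bool {
	on, off := measure()
	if !tooSlow(on, off) {
		return false
	}
	on, off = measure()
	return tooSlow(on, off)
}

// replay returns a hypothetical measure function that yields canned
// (on, off) runtimes in order.
func replay(runs [][2]float64) func() (float64, float64) {
	i := 0
	return func() (float64, float64) {
		r := runs[i]
		i++
		return r[0], r[1]
	}
}

func main() {
	// First run is slow, second run is normal: treated as a flake.
	fmt.Println(shouldFail(replay([][2]float64{{2.0, 1.0}, {1.1, 1.0}})))
	// Both runs are slow: treated as a real regression.
	fmt.Println(shouldFail(replay([][2]float64{{2.0, 1.0}, {1.9, 1.0}})))
}
```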

Co-authored-by: Shiranka Miskin <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: Jayant Shrivastava <[email protected]>
Co-authored-by: Zach Lite <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jun 8, 2023
Previously, the option `key_in_value` was disallowed with
`format=parquet`. This change allows these settings to be
used together. Note that `key_in_value` is enabled by
default with `cloudstorage` sinks and `format=parquet` is
only allowed with cloudstorage sinks, so `key_in_value` is
enabled for parquet by default.

Informs: cockroachdb#103129
Informs: cockroachdb#99028
Epic: CRDB-27372
Release note: None
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jun 8, 2023
This change adds support for the `diff` changefeed
option when using `format=parquet`. Enabling `diff` also adds
support for CDC Transformations with parquet.

Informs: cockroachdb#103129
Informs: cockroachdb#99028
Epic: CRDB-27372
Release note: None
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jun 8, 2023
This change adds support for the `end_time` changefeed
option when using `format=parquet`. No significant code
changes were needed to enable this feature.

Closes: cockroachdb#103129
Closes: cockroachdb#99028
Epic: CRDB-27372
Release note (enterprise change): Changefeeds now officially
support the parquet format at specification version 2.6.
It is only usable with the cloudstorage sink.

The syntax to use parquet is like the following:
`CREATE CHANGEFEED FOR foo INTO '...' WITH format=parquet`

It supports all standard changefeed options and features
including CDC transformations, except it does not support the
`topic_in_value` option.
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jun 8, 2023
This change forces all tests, including tests for `diff` and `end_time`,
to run with the `cloudstorage` sink and `format=parquet` where possible.

Informs: cockroachdb#103129
Informs: cockroachdb#99028
Epic: CRDB-27372
Release note: None
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jun 13, 2023
This change adds support for the `diff` changefeed
option when using `format=parquet`. Enabling `diff` also adds
support for CDC Transformations with parquet.

Informs: cockroachdb#103129
Informs: cockroachdb#99028
Epic: CRDB-27372
Release note: None
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jun 13, 2023
This change forces all tests, including tests for `diff` and `end_time`,
to run with the `cloudstorage` sink and `format=parquet` where possible.

Informs: cockroachdb#103129
Informs: cockroachdb#99028
Epic: CRDB-27372
Release note: None
craig bot pushed a commit that referenced this issue Jun 15, 2023
104528: changefeedccl: add full support for the parquet format r=miretskiy a=jayshrivastava

### changefeedccl: support key_in_value with parquet format

Previously, the option `key_in_value` was disallowed with
`format=parquet`. This change allows these settings to be
used together. Note that `key_in_value` is enabled by
default with `cloudstorage` sinks and `format=parquet` is
only allowed with cloudstorage sinks, so `key_in_value` is
enabled for parquet by default.

Informs: #103129
Informs: #99028
Epic: [CRDB-27372](https://cockroachlabs.atlassian.net/browse/CRDB-27372)
Release note: None

---

### changefeedccl: add test coverage for parquet event types

When using `format=parquet`, an additional column is produced to
indicate the type of operation corresponding to the row: create,
update, or delete. This change adds coverage for this in unit
testing.

Additionally, the test modified in this change is simplified
by reducing the number of rows and distinct types, because this
complexity is unnecessary: all types are already tested within the
util/parquet package.

Informs: #99028
Epic: [CRDB-27372](https://cockroachlabs.atlassian.net/browse/CRDB-27372)
Release note: None
Epic: None

---

### util/parquet: support tuple labels in util/parquet testutils

Previously, the test utilities in `util/parquet` would not reconstruct
tuples read from files with their labels. This change updates the
package to do so. This is required for testing in users of this
package such as CDC.

Informs: #99028
Epic: [CRDB-27372](https://cockroachlabs.atlassian.net/browse/CRDB-27372)
Release note: None

---

### changefeedccl: support diff option with parquet format

This change adds support for the `diff` changefeed
option when using `format=parquet`. Enabling `diff` also adds
support for CDC Transformations with parquet.

Informs: #103129
Informs: #99028
Epic: [CRDB-27372](https://cockroachlabs.atlassian.net/browse/CRDB-27372)
Release note: None

---

### changefeedccl: support end_time option with parquet format

This change adds support for the `end_time` changefeed
option when using `format=parquet`. No significant code
changes were needed to enable this feature.

Closes: #103129
Closes: #99028
Epic: [CRDB-27372](https://cockroachlabs.atlassian.net/browse/CRDB-27372)
Release note (enterprise change): Changefeeds now officially
support the parquet format at specification version 2.6.
It is only usable with the cloudstorage sink.

The syntax to use parquet is like the following:
`CREATE CHANGEFEED FOR foo INTO '...' WITH format=parquet`

It supports all standard changefeed options and features
including CDC transformations, except it does not support the
`topic_in_value` option.
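
The two restrictions stated in the release note (cloudstorage sink only, no `topic_in_value`) can be sketched as a validation check. This is a minimal illustration, not the changefeedccl implementation: the `changefeedOpts` struct, its fields, and the error strings are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// changefeedOpts is a hypothetical, simplified view of the options on a
// CREATE CHANGEFEED statement. It is not the real changefeedccl type.
type changefeedOpts struct {
	Format       string // e.g. "json", "parquet"
	Sink         string // e.g. "cloudstorage", "kafka"
	TopicInValue bool
}

// validateParquet encodes the two restrictions from the release note:
// parquet requires the cloudstorage sink and rejects topic_in_value.
func validateParquet(o changefeedOpts) error {
	if o.Format != "parquet" {
		return nil // nothing parquet-specific to check
	}
	if o.Sink != "cloudstorage" {
		return errors.New("format=parquet is only usable with the cloudstorage sink")
	}
	if o.TopicInValue {
		return errors.New("format=parquet does not support topic_in_value")
	}
	return nil
}

func main() {
	fmt.Println(validateParquet(changefeedOpts{Format: "parquet", Sink: "cloudstorage"}))
	fmt.Println(validateParquet(changefeedOpts{Format: "parquet", Sink: "kafka"}))
	fmt.Println(validateParquet(changefeedOpts{Format: "parquet", Sink: "cloudstorage", TopicInValue: true}))
}
```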

---

### changefeedccl: use parquet with 50% probability in nemeses test

Informs: #99028
Epic: [CRDB-27372](https://cockroachlabs.atlassian.net/browse/CRDB-27372)
Release note: None

---

### do not merge: force parquet cloud storage tests

This change forces all tests, including tests for `diff` and `end_time`,
to run with the `cloudstorage` sink and `format=parquet` where possible.

Informs: #103129
Informs: #99028
Epic: [CRDB-27372](https://cockroachlabs.atlassian.net/browse/CRDB-27372)
Release note: None

Co-authored-by: Jayant Shrivastava <[email protected]>
craig bot closed this as completed in f763d8e Jun 15, 2023
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jun 21, 2023
Previously, `format=parquet` and `resolved` could not be used
together when running changefeeds. This change adds support for
this.

The release note is being left intentionally blank for a future
commit.

Informs: cockroachdb#103129
Release note: None
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jun 21, 2023
Previously, the option `key_in_value` was disallowed with
`format=parquet`. This change allows these settings to be
used together. Note that `key_in_value` is enabled by
default with `cloudstorage` sinks and `format=parquet` is
only allowed with cloudstorage sinks, so `key_in_value` is
enabled for parquet by default.

Informs: cockroachdb#103129
Informs: cockroachdb#99028
Epic: CRDB-27372
Release note: None
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jun 21, 2023
This change adds support for the `diff` changefeed
option when using `format=parquet`. Enabling `diff` also adds
support for CDC Transformations with parquet.

Informs: cockroachdb#103129
Informs: cockroachdb#99028
Epic: CRDB-27372
Release note: None
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jun 21, 2023
This change adds support for the `end_time` changefeed
option when using `format=parquet`. No significant code
changes were needed to enable this feature.

Closes: cockroachdb#103129
Closes: cockroachdb#99028
Epic: CRDB-27372
Release note (enterprise change): Changefeeds now officially
support the parquet format at specification version 2.6.
It is only usable with the cloudstorage sink.

The syntax to use parquet is like the following:
`CREATE CHANGEFEED FOR foo INTO '...' WITH format=parquet`

It supports all standard changefeed options and features
including CDC transformations, except it does not support the
`topic_in_value` option.