Run descriptor validation checks periodically to find any corrupted descriptors and provide logs/alerts or auto-repair #104266

Closed
dikshant opened this issue Jun 2, 2023 · 6 comments

Labels: A-cluster-upgrades; C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are the exception); T-sql-foundations (SQL Foundations Team, formerly SQL Schema + SQL Sessions)

Comments

@dikshant

dikshant commented Jun 2, 2023

We should automatically run debug doctor, or something equivalent, periodically in CRDB itself. If this uncovers any descriptor corruption, we should notify users with proper logging or alerts. For known cases of corruption, we should also investigate whether it is feasible to apply a fix automatically.

Jira issue: CRDB-28437

Epic CRDB-28665

@dikshant dikshant added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) A-cluster-upgrades labels Jun 2, 2023
@dikshant dikshant changed the title run debug doctor automatically before upgrades to catch descriptor corruptions Run debug doctor periodically to find any corrupted descriptors and provide logs/alerts Jun 2, 2023
@rafiss
Collaborator

rafiss commented Jun 6, 2023

We may need to rescope this as a higher-level epic, since making the checks work against older binaries needs some extra thought.

@exalate-issue-sync exalate-issue-sync bot changed the title Run debug doctor periodically to find any corrupted descriptors and provide logs/alerts Run descriptor validation checks periodically to find any corrupted descriptors and provide logs/alerts Jun 9, 2023
@ecwall
Contributor

ecwall commented Jun 13, 2023

I think we can automate this in patches by keeping track of what fixes we ran in a new table. It would be similar to upgrades, but the order would not matter.
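A minimal sketch of what such a tracking table might look like, purely as an illustration: the table name, column names, and the shape of the schema below are assumptions made here, not anything that exists in CockroachDB today.

```sql
-- Hypothetical sketch only: a table recording which known-corruption fixes
-- have already been applied, analogous in spirit to how completed upgrades
-- are tracked, but with no ordering requirement between fixes.
CREATE TABLE IF NOT EXISTS descriptor_repairs_applied (
    repair_name   STRING NOT NULL,                    -- identifier of the known fix
    descriptor_id INT8 NOT NULL,                      -- descriptor the fix touched
    applied_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (repair_name, descriptor_id)
);
```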

@rafiss rafiss changed the title Run descriptor validation checks periodically to find any corrupted descriptors and provide logs/alerts Run descriptor validation checks periodically to find any corrupted descriptors and provide logs/alerts or auto-repair Jun 15, 2023
@exalate-issue-sync exalate-issue-sync bot assigned chrisseto and unassigned fqazi Jul 20, 2023
@rafiss
Collaborator

rafiss commented Jul 27, 2023

Another approach here is to set up a background task that runs once a day or so and checks the crdb_internal.invalid_objects table. Then we can have it update a metric that counts the number of rows in the table; if the metric goes above 0, it should show up in our SLI reviews. It can also log the contents of invalid_objects.
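For reference, a rough sketch of the check such a background task might run; the column names of crdb_internal.invalid_objects can vary between versions, so treat the exact SELECT list as an assumption.

```sql
-- List descriptors that fail validation; a healthy cluster returns zero rows.
SELECT id, database_name, schema_name, obj_name, error
  FROM crdb_internal.invalid_objects;

-- The value the proposed metric would track.
SELECT count(*) AS invalid_objects
  FROM crdb_internal.invalid_objects;
```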

chrisseto added a commit to chrisseto/cockroach that referenced this issue Aug 10, 2023
Short of continuously polling `crdb_internal.invalid_objects`, there was
not a convenient way to monitor a cluster for descriptor corruption.

Having such an indicator would allow customers to perform preflight
checks ahead of upgrades to avoid being stuck in a mixed version state.
It would also allow CRL to more easily monitor cloud clusters for
corruptions in the wild.

This commit updates the schematelemetry job to additionally update the
`sql.schema.invalid_objects` gauge and emit logs for any encountered
corruptions.

Informs: cockroachdb#104266
Epic: CRDB-28665
Release note: None
chrisseto added a commit to chrisseto/cockroach that referenced this issue Aug 14, 2023 (same commit message as above)
chrisseto added a commit to chrisseto/cockroach that referenced this issue Aug 22, 2023 (same commit message as above)
craig bot pushed a commit that referenced this issue Aug 24, 2023
108398: sql: disable pausable portals for all statements with mutations r=rharding6373 a=rharding6373

Previously we examined the AST to determine whether a statement could be executed in a pausable portal or not. However, this was insufficient to identify volatile UDFs that could also contain mutations.

This PR revokes a portal's pausability if the statement's plan contains a mutation. That is, if the opt builder determines that any operator is a mutation operator (see `IsMutationOp`).

Although there is some overlap between this PR and the existing `IsAllowedToPause`, this PR leaves the latter in place, since it restricts some statements that are not considered mutation operators, e.g., import operators.

Epic: None
Fixes: #107130

Release note: None

108559: schematelemetry: emit metrics and logs about invalid objects r=fqazi,andyyang890 a=chrisseto

Short of continuously polling `crdb_internal.invalid_objects`, there was not a convenient way to monitor a cluster for descriptor corruption.

Having such an indicator would allow customers to perform preflight checks ahead of upgrades to avoid being stuck in a mixed version state. It would also allow CRL to more easily monitor cloud clusters for corruptions in the wild.

This commit updates the schematelemetry job to additionally update the `sql.schema.invalid_objects` gauge and emit logs for any encountered corruptions.

Informs: #104266
Epic: CRDB-28665
Release note: None

109047: ui: use cluster setting from redux on schema insights r=maryliag a=maryliag

Part Of #108373

Use the value of the cluster setting `sql.index_recommendation.drop_unused_duration` from redux, instead of adding it as part of the select. With this change, users with VIEWACTIVITY or VIEWACTIVITYREDACTED can now see index recommendations on the console without needing the view cluster settings permission.

This commit changes the Schema Insights API.

https://www.loom.com/share/6b9ef154c9c44157a45e1e66b1fbc890

Release note (ui change): Users without `VIEWCLUSTERSETTINGS` permission but with `VIEWACTIVITY` or `VIEWACTIVITYREDACTED` can now see index recommendations.

109251: roachtest: update import-cancellation owner to queries r=dt a=dt

Release note: none.
Epic: none.

109442: build: fix PR coverage when a package is added/removed r=RaduBerinde a=RaduBerinde

The PR coverage script for running tests fails if a package does not exist (because it was added or removed in the PR). This change improves the script to ignore package paths for which a `BUILD.bazel` does not exist.

Epic: none
Release note: None

Co-authored-by: rharding6373 <[email protected]>
Co-authored-by: Chris Seto <[email protected]>
Co-authored-by: maryliag <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
@rafiss
Collaborator

rafiss commented Aug 30, 2023

I think we can close this ticket now that #108559 is resolved. @dikshant do you agree?

We can use #107568 to decide if we want to display this metric as a default chart in the DB Console.

@chrisseto
Contributor

Mind reader. I was just about to ask the same question. I'd vote for closing this; we can open a new one if we want to alert end users or add auto-repair.

Do we think there will be any situations where we'd prompt the user about an opt-in repair that we wouldn't feel confident in applying automatically?

rafiss pushed a commit to rafiss/cockroach that referenced this issue Aug 30, 2023 (same commit message as the commit below)
rafiss pushed a commit that referenced this issue Aug 30, 2023
Short of continuously polling `crdb_internal.invalid_objects`, there was
not a convenient way to monitor a cluster for descriptor corruption.

Having such an indicator would allow customers to perform preflight
checks ahead of upgrades to avoid being stuck in a mixed version state.
It would also allow CRL to more easily monitor cloud clusters for
corruptions in the wild.

This commit updates the schematelemetry job to additionally update the
`sql.schema.invalid_objects` gauge and emit logs for any encountered
corruptions.

Informs: #104266
Epic: CRDB-28665

Release note (ops change): Added a new sql.schema.invalid_objects gauge
metric. This gauge is periodically updated based on the schedule set by
the sql.schema.telemetry.recurrence cluster setting. When it is updated,
it counts the number of schema objects (tables, types, schemas, databases,
and functions) that are in an invalid state according to CockroachDB’s
internal validation checks. This metric is expected to be zero in a healthy
cluster, and if it is not, it indicates that there is a problem that must
be repaired.
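As a practical note on the release note above, this is roughly how an operator might inspect the refresh schedule and read the new gauge from SQL; the '@hourly' value is only illustrative, and reading metrics through crdb_internal.node_metrics is an assumption about what is available rather than part of this change.

```sql
-- Inspect and (optionally) tighten how often the schema telemetry job runs,
-- which is also how often the sql.schema.invalid_objects gauge is refreshed.
SHOW CLUSTER SETTING sql.schema.telemetry.recurrence;
SET CLUSTER SETTING sql.schema.telemetry.recurrence = '@hourly';

-- Read the gauge via SQL, assuming crdb_internal.node_metrics is exposed.
SELECT name, value
  FROM crdb_internal.node_metrics
 WHERE name = 'sql.schema.invalid_objects';
```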
@dikshant
Author

dikshant commented Sep 1, 2023

Thanks for the work on this. This is good to close!

@dikshant dikshant closed this as completed Sep 1, 2023