Run descriptor validation checks periodically to find any corrupted descriptors and provide logs/alerts or auto-repair #104266

Closed
dikshant opened this issue Jun 2, 2023 · 6 comments

Labels: A-cluster-upgrades; C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are the exception); T-sql-foundations (SQL Foundations Team, formerly SQL Schema + SQL Sessions)

Comments

@dikshant

dikshant commented Jun 2, 2023

We should automatically run debug doctor, or something equivalent, periodically in CRDB itself. If this uncovers any descriptor corruption, we should notify users with proper logging or alerts. For known cases of corruption, we should also investigate whether it is feasible to apply a fix automatically.

Jira issue: CRDB-28437

Epic CRDB-28665

@dikshant dikshant added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) A-cluster-upgrades labels Jun 2, 2023
@dikshant dikshant changed the title run debug doctor automatically before upgrades to catch descriptor corruptions Run debug doctor periodically to find any corrupted descriptors and provide logs/alerts Jun 2, 2023
@rafiss
Collaborator

rafiss commented Jun 6, 2023

We may need to rescope this as a higher-level epic, since making the checks work against older binaries needs some extra thought.

@exalate-issue-sync exalate-issue-sync bot changed the title Run debug doctor periodically to find any corrupted descriptors and provide logs/alerts Run descriptor validation checks periodically to find any corrupted descriptors and provide logs/alerts Jun 9, 2023
@ecwall
Contributor

ecwall commented Jun 13, 2023

I think we can automate this in patches by keeping track of what fixes we ran in a new table. It would be similar to upgrades, but the order would not matter.
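A minimal sketch of what such a tracking table might look like, purely as an illustration: the table name, column names, and the shape of the schema below are assumptions made here, not anything that exists in CockroachDB today.

```sql
-- Hypothetical sketch only: a table recording which known-corruption fixes
-- have already been applied, analogous in spirit to how completed upgrades
-- are tracked, but with no ordering requirement between fixes.
CREATE TABLE IF NOT EXISTS descriptor_repairs_applied (
    repair_name   STRING NOT NULL,                    -- identifier of the known fix
    descriptor_id INT8 NOT NULL,                      -- descriptor the fix touched
    applied_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (repair_name, descriptor_id)
);
```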

@rafiss rafiss changed the title Run descriptor validation checks periodically to find any corrupted descriptors and provide logs/alerts Run descriptor validation checks periodically to find any corrupted descriptors and provide logs/alerts or auto-repair Jun 15, 2023
@exalate-issue-sync exalate-issue-sync bot assigned chrisseto and unassigned fqazi Jul 20, 2023
@rafiss
Collaborator

rafiss commented Jul 27, 2023

Another approach here is to set up a background task that runs once a day or so and checks the crdb_internal.invalid_objects table. Then we can have it update a metric that counts the number of rows in the table; if the metric goes above 0, it should show up in our SLI reviews. It can also log the contents of invalid_objects.
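For reference, a rough sketch of the check such a background task might run; the column names of crdb_internal.invalid_objects can vary between versions, so treat the exact SELECT list as an assumption.

```sql
-- List descriptors that fail validation; a healthy cluster returns zero rows.
SELECT id, database_name, schema_name, obj_name, error
  FROM crdb_internal.invalid_objects;

-- The value the proposed metric would track.
SELECT count(*) AS invalid_objects
  FROM crdb_internal.invalid_objects;
```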

chrisseto added a commit to chrisseto/cockroach that referenced this issue Aug 10, 2023
Short of continuously polling `crdb_internal.invalid_objects`, there was
not a convenient way to monitor a cluster for descriptor corruption.

Having such an indicator would allow customers to perform preflight
checks ahead of upgrades to avoid being stuck in a mixed version state.
It would also allow CRL to more easily monitor cloud clusters for
corruptions in the wild.

This commit updates the schematelemetry job to additionally update the
`sql.schema.invalid_objects` gauge and emit logs for any encountered
corruptions.

Informs: cockroachdb#104266
Epic: CRDB-28665
Release note: None
chrisseto added a commit to chrisseto/cockroach that referenced this issue Aug 14, 2023 (same commit message as above)
chrisseto added a commit to chrisseto/cockroach that referenced this issue Aug 22, 2023 (same commit message as above)
craig bot pushed a commit that referenced this issue Aug 24, 2023
108398: sql: disable pausable portals for all statements with mutations r=rharding6373 a=rharding6373

Previously we examined the AST to determine whether a statement could be executed in a pausable portal or not. However, this was insufficient to identify volatile UDFs that could also contain mutations.

This PR revokes a portal's pausability if the statement's plan contains a mutation. That is, if the opt builder determines that any operator is a mutation operator (see `IsMutationOp`).

Although there is some overlap between this PR and the existing `IsAllowedToPause`, this PR leaves the latter in place, since it restricts some statements that are not considered mutation operators, e.g., import operators.

Epic: None
Fixes: #107130

Release note: None

108559: schematelemetry: emit metrics and logs about invalid objects r=fqazi,andyyang890 a=chrisseto

Short of continuously polling `crdb_internal.invalid_objects`, there was not a convenient way to monitor a cluster for descriptor corruption.

Having such an indicator would allow customers to perform preflight checks ahead of upgrades to avoid being stuck in a mixed version state. It would also allow CRL to more easily monitor cloud clusters for corruptions in the wild.

This commit updates the schematelemetry job to additionally update the `sql.schema.invalid_objects` gauge and emit logs for any encountered corruptions.

Informs: #104266
Epic: CRDB-28665
Release note: None

109047: ui: use cluster setting from redux on schema insights r=maryliag a=maryliag

Part Of #108373

Use the value of the cluster setting `sql.index_recommendation.drop_unused_duration` from redux, instead of adding it as part of the select. With this change, users with VIEWACTIVITY or VIEWACTIVITYREDACTED can now see index recommendations on the console without needing the view cluster settings permission.

This commit changes the Schema Insights API.

https://www.loom.com/share/6b9ef154c9c44157a45e1e66b1fbc890

Release note (ui change): Users without `VIEWCLUSTERSETTINGS` permission but with `VIEWACTIVITY` or `VIEWACTIVITYREDACTED` can now see index recommendations.

109251: roachtest: update import-cancellation owner to queries r=dt a=dt

Release note: none.
Epic: none.

109442: build: fix PR coverage when a package is added/removed r=RaduBerinde a=RaduBerinde

The PR coverage script for running tests fails if a package does not exist (because it was added or removed in the PR). This change improves the script to ignore package paths for which a `BUILD.bazel` does not exist.

Epic: none
Release note: None

Co-authored-by: rharding6373 <[email protected]>
Co-authored-by: Chris Seto <[email protected]>
Co-authored-by: maryliag <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
@rafiss
Collaborator

rafiss commented Aug 30, 2023

I think we can close this ticket now that #108559 is resolved. @dikshant do you agree?

We can use #107568 to decide if we want to display this metric as a default chart in the DB Console.

@chrisseto
Contributor

Mind reader. I was just about to ask the same question. I'd vote for closing this; we can open a new one if we want to alert end users or add auto-repair.

Do we think there will be any situations where we'd prompt the user about an opt-in repair that we wouldn't feel confident in applying automatically?

rafiss pushed a commit to rafiss/cockroach that referenced this issue Aug 30, 2023 (same commit message as the commit below)
rafiss pushed a commit that referenced this issue Aug 30, 2023
Short of continuously polling `crdb_internal.invalid_objects`, there was
not a convenient way to monitor a cluster for descriptor corruption.

Having such an indicator would allow customers to perform preflight
checks ahead of upgrades to avoid being stuck in a mixed version state.
It would also allow CRL to more easily monitor cloud clusters for
corruptions in the wild.

This commit updates the schematelemetry job to additionally update the
`sql.schema.invalid_objects` gauge and emit logs for any encountered
corruptions.

Informs: #104266
Epic: CRDB-28665

Release note (ops change): Added a new sql.schema.invalid_objects gauge
metric. This gauge is periodically updated based on the schedule set by
the sql.schema.telemetry.recurrence cluster setting. When it is updated,
it counts the number of schema objects (tables, types, schemas, databases,
and functions) that are in an invalid state according to CockroachDB’s
internal validation checks. This metric is expected to be zero in a healthy
cluster, and if it is not, it indicates that there is a problem that must
be repaired.
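As a practical note on the release note above, this is roughly how an operator might inspect the refresh schedule and read the new gauge from SQL; the '@hourly' value is only illustrative, and reading metrics through crdb_internal.node_metrics is an assumption about what is available rather than part of this change.

```sql
-- Inspect and (optionally) tighten how often the schema telemetry job runs,
-- which is also how often the sql.schema.invalid_objects gauge is refreshed.
SHOW CLUSTER SETTING sql.schema.telemetry.recurrence;
SET CLUSTER SETTING sql.schema.telemetry.recurrence = '@hourly';

-- Read the gauge via SQL, assuming crdb_internal.node_metrics is exposed.
SELECT name, value
  FROM crdb_internal.node_metrics
 WHERE name = 'sql.schema.invalid_objects';
```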
@dikshant
Author

dikshant commented Sep 1, 2023

Thanks for the work on this. This is good to close!

@dikshant dikshant closed this as completed Sep 1, 2023