Run descriptor validation checks periodically to find any corrupted descriptors and provide logs/alerts or auto-repair #104266
We may need to rescope this as a higher-level epic, since making the checks work against older binaries needs some extra thought.
I think we can automate this in patches by keeping track of what fixes we ran in a new table. It would be similar to upgrades, but the order would not matter.
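For illustration, a minimal sketch of that bookkeeping, assuming a hypothetical `system_repair_log` table; the table name, columns, and `runOnce` helper are invented here, not an actual CockroachDB schema or API:

```go
package repairs

import (
	"context"
	"database/sql"
)

// Hypothetical bookkeeping schema: one row per descriptor fix that has
// been applied, keyed by fix name so ordering does not matter.
const createFixLog = `
CREATE TABLE IF NOT EXISTS system_repair_log (
    fix_name   STRING PRIMARY KEY,
    applied_at TIMESTAMPTZ NOT NULL DEFAULT now()
)`

// runOnce applies fn only if fixName has not been recorded yet,
// mirroring how upgrades are tracked but without any ordering.
func runOnce(ctx context.Context, db *sql.DB, fixName string, fn func(context.Context) error) error {
	var applied bool
	err := db.QueryRowContext(ctx,
		`SELECT EXISTS (SELECT 1 FROM system_repair_log WHERE fix_name = $1)`,
		fixName).Scan(&applied)
	if err != nil || applied {
		return err
	}
	if err := fn(ctx); err != nil {
		return err
	}
	_, err = db.ExecContext(ctx,
		`INSERT INTO system_repair_log (fix_name) VALUES ($1)`, fixName)
	return err
}
```

Unlike the upgrades table, nothing here depends on sequence: each fix is keyed by name and checked independently, which matches the "order would not matter" point above.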
Another approach here is to set up a background task that runs once a day or so, and checks the `crdb_internal.invalid_objects` table.
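As a rough sketch, such a check can also be run today as an external poller; the connection string, driver, and 24-hour interval below are placeholder assumptions:

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // assumption: any Postgres-wire driver works against CRDB
)

func main() {
	// Placeholder connection string; adjust host, port, and TLS settings.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	ticker := time.NewTicker(24 * time.Hour) // "once a day or so"
	defer ticker.Stop()
	for range ticker.C {
		var n int
		err := db.QueryRow(
			`SELECT count(*) FROM crdb_internal.invalid_objects`).Scan(&n)
		if err != nil {
			log.Printf("invalid_objects check failed: %v", err)
			continue
		}
		if n > 0 {
			log.Printf("descriptor corruption detected: %d invalid objects", n)
		}
	}
}
```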
Short of continuously polling `crdb_internal.invalid_objects`, there was not a convenient way to monitor a cluster for descriptor corruption. Having such an indicator would allow customers to perform preflight checks ahead of upgrades to avoid being stuck in a mixed-version state. It would also allow CRL to more easily monitor cloud clusters for corruptions in the wild.

This commit updates the schematelemetry job to additionally update the `sql.schema.invalid_objects` gauge and emit logs for any encountered corruptions.

Informs: cockroachdb#104266
Epic: CRDB-28665
Release note: None
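The shape of that change, sketched with the standard Prometheus client rather than CockroachDB's internal metric package: the function name, wiring, and underscore-form metric name are illustrative assumptions; only the `sql.schema.invalid_objects` gauge itself comes from the commit.

```go
package schematelemetry

import (
	"context"
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

// Illustrative stand-in for the sql.schema.invalid_objects gauge.
var invalidObjects = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "sql_schema_invalid_objects",
	Help: "Number of schema objects that fail internal validation.",
})

func init() {
	prometheus.MustRegister(invalidObjects)
}

// reportInvalidObjects would be called from the periodic schematelemetry
// job: set the gauge to the current corruption count and log each one.
func reportInvalidObjects(ctx context.Context, corruptions []string) {
	invalidObjects.Set(float64(len(corruptions)))
	for _, c := range corruptions {
		log.Printf("invalid descriptor: %s", c)
	}
}
```

Because it is a gauge rather than a counter, the metric drops back to zero on the next job run after a corruption is repaired.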
108398: sql: disable pausable portals for all statements with mutations r=rharding6373 a=rharding6373

Previously we examined the AST to determine whether a statement could be executed in a pausable portal or not. However, this was insufficient to identify volatile UDFs that could also contain mutations. This PR revokes a portal's pausability if the statement's plan contains a mutation, that is, if the opt builder determines that any operator is a mutation operator (see `IsMutationOp`); a sketch of this plan walk appears after this merge message. Although there is some overlap between this PR and the existing `IsAllowedToPause`, this PR leaves the latter in place, since it restricts some statements that are not considered mutation operators, e.g., import operators.

Epic: None
Fixes: #107130
Release note: None

108559: schematelemetry: emit metrics and logs about invalid objects r=fqazi,andyyang890 a=chrisseto

Short of continuously polling `crdb_internal.invalid_objects`, there was not a convenient way to monitor a cluster for descriptor corruption. Having such an indicator would allow customers to perform preflight checks ahead of upgrades to avoid being stuck in a mixed-version state. It would also allow CRL to more easily monitor cloud clusters for corruptions in the wild. This commit updates the schematelemetry job to additionally update the `sql.schema.invalid_objects` gauge and emit logs for any encountered corruptions.

Informs: #104266
Epic: CRDB-28665
Release note: None

109047: ui: use cluster setting from redux on schema insights r=maryliag a=maryliag

Part of #108373. Use the value of the cluster setting `sql.index_recommendation.drop_unused_duration` from redux, instead of adding it as part of the select. With this change, users with VIEWACTIVITY or VIEWACTIVITYREDACTED can now see index recommendations on the console without needing the view cluster settings permission. This commit changes the Schema Insights API.

https://www.loom.com/share/6b9ef154c9c44157a45e1e66b1fbc890

Release note (ui change): Users without `VIEWCLUSTERSETTINGS` permission but with `VIEWACTIVITY` or `VIEWACTIVITYREDACTED` can now see index recommendations.

109251: roachtest: update import-cancellation owner to queries r=dt a=dt

Release note: none.
Epic: none.

109442: build: fix PR coverage when a package is added/removed r=RaduBerinde a=RaduBerinde

The PR coverage script for running tests fails if a package does not exist (because it was added or removed in the PR). This change improves the script to ignore package paths for which a `BUILD.bazel` does not exist.

Epic: none
Release note: None

Co-authored-by: rharding6373 <[email protected]>
Co-authored-by: Chris Seto <[email protected]>
Co-authored-by: maryliag <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
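For the 108398 change above, the core idea reduces to a recursive walk over the built plan that reports whether any operator is a mutation. This is a hypothetical rendering: the `Expr` and `Operator` types are stand-ins, not the actual `opt` package API, and the predicate merely plays the role of `IsMutationOp`.

```go
package sketch

// Operator and Expr are hypothetical stand-ins for the optimizer's
// expression tree; the real definitions live in the opt package.
type Operator int

type Expr interface {
	Op() Operator
	ChildCount() int
	Child(i int) Expr
}

// containsMutation walks the plan and reports whether any operator
// satisfies the mutation predicate. If it does, the portal's
// pausability would be revoked.
func containsMutation(e Expr, isMutationOp func(Operator) bool) bool {
	if isMutationOp(e.Op()) {
		return true
	}
	for i := 0; i < e.ChildCount(); i++ {
		if containsMutation(e.Child(i), isMutationOp) {
			return true
		}
	}
	return false
}
```

Checking the built plan rather than the AST is what catches the volatile-UDF case: the mutation only becomes visible once the UDF body has been expanded into plan operators.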
Mind reader. I was just about to ask the same question. I'd vote for closing this and we can open a new one if we want to alert end users or add in auto-repair. Do we think there will be any situations where we'd prompt the user about an opt-in repair that we wouldn't feel confident in applying automatically?
Short of continuously polling `crdb_internal.invalid_objects`, there was not a convenient way to monitor a cluster for descriptor corruption. Having such an indicator would allow customers to perform preflight checks ahead of upgrades to avoid being stuck in a mixed-version state. It would also allow CRL to more easily monitor cloud clusters for corruptions in the wild.

This commit updates the schematelemetry job to additionally update the `sql.schema.invalid_objects` gauge and emit logs for any encountered corruptions.

Informs: #104266
Epic: CRDB-28665
Release note (ops change): Added a new `sql.schema.invalid_objects` gauge metric. This gauge is periodically updated based on the schedule set by the `sql.schema.telemetry.recurrence` cluster setting. When it is updated, it counts the number of schema objects (tables, types, schemas, databases, and functions) that are in an invalid state according to CockroachDB's internal validation checks. This metric is expected to be zero in a healthy cluster; if it is not, it indicates a problem that must be repaired.
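To confirm the gauge end to end, an operator might scrape a node's metrics endpoint; `/_status/vars` is CockroachDB's standard Prometheus endpoint, while the underscore-form metric name `sql_schema_invalid_objects` is an assumption about how the dotted gauge name is exported.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Scrape the node's Prometheus endpoint; adjust host and port.
	resp, err := http.Get("http://localhost:8080/_status/vars")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		// Per the release note, a healthy cluster should report 0 here.
		if strings.HasPrefix(line, "sql_schema_invalid_objects") {
			fmt.Println(line)
		}
	}
}
```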
Thanks for the work on this. This is good to close!
We should automatically run `debug doctor` or something equivalent periodically in CRDB itself. Should this uncover any descriptor corruption, we should notify users with proper logging or alerts. For known cases of corruption, we should also investigate whether it is feasible to automatically apply a fix.
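One possible shape for that periodic check with an auto-repair hook for known corruption classes; everything here (`Checker`, `Validate`, `Repair`) is hypothetical plumbing, not existing CRDB code:

```go
package validation

import (
	"context"
	"log"
	"time"
)

// Corruption describes one invalid descriptor found by a validation pass.
type Corruption struct {
	DescriptorID int64
	Problem      string
}

// Checker bundles hypothetical hooks: Validate runs the doctor-style
// checks; Repair fixes one known class of corruption and reports
// whether it was applicable.
type Checker struct {
	Validate func(context.Context) ([]Corruption, error)
	Repair   func(context.Context, Corruption) (fixed bool, err error)
}

// Run periodically validates descriptors, logs every corruption found,
// and attempts an automatic fix for the known cases.
func (c Checker) Run(ctx context.Context, every time.Duration) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
		}
		found, err := c.Validate(ctx)
		if err != nil {
			log.Printf("descriptor validation failed: %v", err)
			continue
		}
		for _, corr := range found {
			log.Printf("corrupt descriptor %d: %s", corr.DescriptorID, corr.Problem)
			if fixed, err := c.Repair(ctx, corr); err != nil {
				log.Printf("auto-repair of %d failed: %v", corr.DescriptorID, err)
			} else if fixed {
				log.Printf("auto-repaired descriptor %d", corr.DescriptorID)
			}
		}
	}
}
```

Keeping `Repair` as a separate hook matches the discussion above: corruptions with a known, confidently applicable fix can be repaired automatically, while everything else is surfaced through logs or alerts only.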
Jira issue: CRDB-28437
Epic: CRDB-28665