Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

log: add consistent tagging of failed assertions that are alertworthy #102041

Closed
dhartunian opened this issue Apr 21, 2023 · 6 comments
Closed

log: add consistent tagging of failed assertions that are alertworthy #102041

dhartunian opened this issue Apr 21, 2023 · 6 comments
Assignees
Labels
A-observability-inf C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) db-cy-23 O-postmortem Originated from a Postmortem action item.

Comments

@dhartunian
Copy link
Collaborator

dhartunian commented Apr 21, 2023

Sometimes in production we have situations that don't rise to the level of panic (programmer error) or Fatalf (node should crash) but still want to alert the operator that something serious is going on that requires immediate attention. This cannot be simply codified into an ERROR severity log since not all ERROR level lines rise to the level of requiring operator involvement.

This ticket tracks consistent messaging for these types of errors so that they can be alerted upon easily. This can be done either through a log tag, or through some pre-set formatting structure in the error message.

Jira issue: CRDB-27233

@dhartunian dhartunian added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-postmortem Originated from a Postmortem action item. T-observability-inf labels Apr 21, 2023
@dhartunian dhartunian self-assigned this Apr 21, 2023
@joshimhoff
Copy link
Collaborator

I think the specific concept missing is "assertion has failed that indicates a code bug somewhere in CRDB period, but no need to crash the node given details of the assertion failure". So node liveness failing does not indicate this, even though it is bad. OTOH, see the discussion in this internal postmortem (https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/2985951249/2023-04-14+Postmortem+on+replicated+corruption+in+CC+clusters+following+use+of+crdb+internal.probe+ranges), culminating in:

[josh] is the warning in this case “there’s an intent in the liveness range!? Clearly something is fubar”. Missing concept in alert for a “very unexpected scenario”.

Intent on liveness range may not require restarting CRDB, but it does definitely indicate a code bug.

@joshimhoff
Copy link
Collaborator

CC @nvanbenschoten as this relates to an idea from Nathan / KV re: building a better assertion framework for roachprod-based tests.

@nvanbenschoten
Copy link
Member

There's a thread about better, easier runtime assertion in #94986. As that thread discusses, the three considerations are 1) ergonomics, 2) cost in production, and 3) effect in production.

I'm bullish that if we get the framework right, we'll see quick adoption of assertions throughout the codebase. This is what we saw with the adoption of the testify/require library. The library was easy enough to use that developers began reaching for it in place of older approaches. In fact, it was so low enough that developers began writing better tests with more assertions.

@srosenberg
Copy link
Member

srosenberg commented May 21, 2023

I think the specific concept missing is "assertion has failed that indicates a code bug somewhere in CRDB period, but no need to crash the node given details of the assertion failure".

So, are we merely hinting at a new log-level? I can't say I've come across other systems that sandwich a new log-level between WARN, ERROR, and FATAL. If I understand the ask, we want to easily surface some assertion failures without crashing a node?
It's worth noting that non-fatal assertions can be rather counter-productive [1], which is one of the reasons you don't see them in the wild.

[1] https://blog.mozilla.org/nnethercote/2011/06/07/what-is-the-point-of-non-fatal-assertions/

@srosenberg
Copy link
Member

This is what we saw with the adoption of the testify/require library. The library was easy enough to use that developers began reaching for it in place of older approaches. In fact, it was so low enough that developers began writing better tests with more assertions.

That's great to hear! However, we should also be wary of the error-prone API which comes with the testify/require library, e.g. [1], [2].

[1] #101028
[2] #100796

@dhartunian
Copy link
Collaborator Author

There is a bunch of working going on in #94986 to support better runtime assertions. Closing this as duplicate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-observability-inf C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) db-cy-23 O-postmortem Originated from a Postmortem action item.
Projects
None yet
Development

No branches or pull requests

4 participants