Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

testing: runtime assertions #94986

Open
tbg opened this issue Jan 10, 2023 · 18 comments · Fixed by #106508
Open

testing: runtime assertions #94986

tbg opened this issue Jan 10, 2023 · 18 comments · Fixed by #106508
Labels
C-investigation Further steps needed to qualify. C-label will change. O-postmortem Originated from a Postmortem action item. O-sre For issues SRE opened or otherwise cares about tracking. P-3 Issues/test failures with no fix SLA T-testeng TestEng Team

Comments

@tbg
Copy link
Member

tbg commented Jan 10, 2023

Describe the problem

We don't have great guidelines/infra for runtime assertions. Some of us have, in previous jobs, worked on products that relied heavily on runtime assertions (that would be compiled out in the production builds) and have found this very useful.

Currently, there are a few issues:

  1. there isn't a great way to even write assertions succinctly. A require-style API would be great, but instead you have to do something like
if buildutils.CrdbTestBuild {
  if fooCondition != barCondition {
    log.Fatalf(ctx, "%s", errors.AssertionFailedf("Halp! fooCondition was %v!", barCondition)
  }
}

That's pretty noisy and tends to pollute the code, and also it's just long so people don't type it when they should.

  1. they're not enabled during nightly roachtests. This is because we only have this one class of assertions, and lots of very slow stuff might be gated behind it. We don't have a good separation between "cheap" and "expensive" assertions, and we don't have a clear distinction between "performance" and "correctness" roachtests at the moment.

#94979 has some links to ideas.

Jira issue: CRDB-23260

@tbg tbg added C-investigation Further steps needed to qualify. C-label will change. T-testeng TestEng Team labels Jan 10, 2023
@blathers-crl
Copy link

blathers-crl bot commented Jan 10, 2023

cc @cockroachdb/test-eng

@erikgrinaker
Copy link
Contributor

There's been some prior internal discussion on this here: https://cockroachlabs.slack.com/archives/C023S0V4YEB/p1658242296301019.

@srosenberg
Copy link
Member

For 1), we do need to determine a require-style API that just works. Do you think we could just rely on Fatalf everywhere? I might be paranoid, but I'd like to be certain that an assert will do the right thing and terminate the process. Go's runtime.fatal seems ideal for that. E.g.,

cat assert.go
package main

import (
	_ "unsafe"
)

//go:linkname unrecoverableFatal runtime.fatal
func unrecoverableFatal(s string)

func assert(cond bool, msg string) {
	if !cond {
		unrecoverableFatal(msg)
	}
}

func main() {
	assert(true == false, "just maybe")
}
GOTRACEBACK=crash go run assert.go 2>&1|head -5
fatal error: just maybe

goroutine 1 [running]:
runtime.fatal({0x102fb52f5?, 0x140000021a0?})
	/usr/local/go/src/runtime/panic.go:1066 +0x40 fp=0x1400006ef50 sp=0x1400006ef20 pc=0x102f8bf10

@srosenberg
Copy link
Member

srosenberg commented Jan 13, 2023

For 2), we already have the instrumented binary as one of the roachtest artifacts [1]. (Its path is written into testImpl.cockroachShort.) We just need a clean way to opt in (or out), ideally via TestSpec.

There is also [2] which is related to 2).

[1]

--cockroach-short="${PWD}/bin/cockroach-short-ea" \

[2] #86678

@tbg
Copy link
Member Author

tbg commented Jan 24, 2023

Do you think we could just rely on Fatalf everywhere?

I think so, since that also does redaction, sentry reporting, etc, which we may want if we deploy these builds to long-lived clusters and perhaps even at customer testing sites (who knows).

Fatalf already attempts to make very sure it terminates the process even if disks stall, etc. I think there are some holes there but they can be fixed.

@tbg tbg mentioned this issue Jan 24, 2023
tbg added a commit to tbg/cockroach that referenced this issue Jan 24, 2023
Prototype only.

Touches cockroachdb#94986.
Could close it if someone polished this up.

Epic: none
Release note: None
@tbg
Copy link
Member Author

tbg commented Jan 24, 2023

Here's something that I think I for one would be happy using: #95727

(Not planning to push this over the finish line but maybe someone will? And if nothing else it's food for thought).

renatolabs added a commit to renatolabs/cockroach that referenced this issue Mar 15, 2023
This changes the semantics of `t.Cockroach()` to use a binary with
runtime assertions enabled by default. Performance tests (i.e., tests
that export to roachperf) will be able to continue using the standard
binary, without runtime assertions, as usual.

This commit also opts-out other tests that cannot easily be run with
runtime assertions enabled, most often due to timeouts.

Resolves cockroachdb#86678.
Informs cockroachdb#94986.

Epic: none

Release note: None
renatolabs added a commit to renatolabs/cockroach that referenced this issue Mar 17, 2023
This changes the semantics of `t.Cockroach()` to use a binary with
runtime assertions enabled by default. Performance tests (i.e., tests
that export to roachperf) will be able to continue using the standard
binary, without runtime assertions, as usual.

This commit also opts-out other tests that cannot easily be run with
runtime assertions enabled, most often due to timeouts.

Resolves cockroachdb#86678.
Informs cockroachdb#94986.

Epic: none

Release note: None
renatolabs added a commit to renatolabs/cockroach that referenced this issue Mar 22, 2023
This changes the semantics of `t.Cockroach()` to use a binary with
runtime assertions enabled by default. Performance tests (i.e., tests
that export to roachperf) will be able to continue using the standard
binary, without runtime assertions, as usual.

This commit also opts-out other tests that cannot easily be run with
runtime assertions enabled, most often due to timeouts.

Resolves cockroachdb#86678.
Informs cockroachdb#94986.

Epic: none

Release note: None
@arulajmani
Copy link
Collaborator

I'm probably missing something, but why do we not like logcrash.ReportOrPanic for doing runtime assertions?

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Apr 21, 2023

I think it mostly does what we want. I'm not sure if it's enabled in roachtests/roachprod and for local dev builds, we should make sure it is.

My main gripe with it is that I want to add assertions in performance-critical code paths, where the assertion check itself is too expensive to enable in production. For example in the MVCC iterator hot path. So we need a mechanism that will entirely compile out the assertions in production builds without any performance impact. We'll likely want three different variants:

  1. Report in production, crash in dev.
  2. Don't check in production, crash in dev.
  3. Always crash.

Also, we should have some syntactic sugar for it that's terser and easier to use. Maybe something like:

crash.If(1 != 1, "uh oh")
crash.Assert(1 == 1, "uh oh")
crash.AssertExpensive(computeMVCCStats(db) == ms, "recomputed stats differ from in-memory stats")
crash.AssertAlways(1 == 1, "uh oh")
crash.Now("uh oh")

@srosenberg
Copy link
Member

I wonder if we could piggyback off of log.V and the whole vmodule thing? It might be nice to dynamically and selectively turn on some assertions. I realize this might be at odds with the (compiler) optimizer, e.g., large assertion may disable inlining or other optimizations... but, is it that different from code paths which rely on conditional, verbose logging?

@srosenberg
Copy link
Member

srosenberg commented Apr 26, 2023

On second thought, maybe we're overthinking it? Has everyone seen [1]? Rather than cook up fancy interfaces, why don't we go back to basics? E.g., does one really need to pass an error message into assert? Why not put a comment right above it; as long as the line number is captured in the stack trace, you have the rest of the context in the source code; e.g., https://github.com/sqlite/sqlite/blob/41ce47c4f4bcae3882fdccec18a6100a85f4bba5/src/btmutex.c#L71-L97

[1] https://www.sqlite.org/assert.html

@erikgrinaker
Copy link
Contributor

Rather than cook up fancy interfaces, why don't we go back to basics? E.g., does one really need to pass an error message into assert?

Yes, we need a message. For assertions enabled in production builds, cluster operators need a human-readable message if they hit it, and we also want them in Sentry reports. They'll frequently also contain useful context via fmt args, e.g. the specific values that violate the assertion.

I wonder if we could piggyback off of log.V and the whole vmodule thing? It might be nice to dynamically and selectively turn on some assertions.

We often turn on vmodule for debugging production incidents. I wouldn't want that to also arm a bunch of fatal assertions.

I don't think any of this is particularly hard, someone just needs to set aside a day or two and get it done. I was hoping to do something over breather week, but got bogged down in GA blockers and test flake.

@srosenberg
Copy link
Member

srosenberg commented Apr 26, 2023

Yes, we need a message. For assertions enabled in production builds...

Is there a consensus that we'd want to enable assertions in production builds? I am open to both approaches, but note that having them in production builds goes against the conventional wisdom.

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Apr 26, 2023

I think there are some assertions where we know the node is broken and we want to kill it to prevent further damage to the node/app/cluster. Stuff like SST checksum mismatches or consistency checker failures that would serve incorrect data to clients. Unclear if we'd consider these assertions as such though, but probably if we want to send them via Sentry (which I think we do).

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Apr 26, 2023

Also, many assertions would send a Sentry report in production builds, but not fatal the node. We'd want messages in those Sentry reports.

DarrylWong pushed a commit to DarrylWong/fork that referenced this issue Oct 17, 2023
This changes the semantics of `t.Cockroach()` to use a binary with
runtime assertions enabled by default. Performance tests (indicated by
the benchmark TestSpec) will continue using the standard binary, without
runtime assertions, as usual.

This commit also opts-out other tests that cannot easily be run with
runtime assertions enabled, most often due to timeouts.

Resolves cockroachdb#86678.
Informs cockroachdb#94986.

Epic: none

Release note: None
DarrylWong pushed a commit to DarrylWong/fork that referenced this issue Oct 26, 2023
This changes the semantics of `t.Cockroach()` to use a binary with
runtime assertions and metamorphic constants enabled by default.
Performance tests (indicated by the benchmark TestSpec) will continue
using the standard binary, without runtime assertions, as usual.

This commit also opts-out other tests that cannot easily be run with
runtime assertions or metamorphic constants enabled, most often due to
timeouts and metamorphic constant incompatibility.

Resolves cockroachdb#86678.
Informs cockroachdb#94986.

Epic: none

Release note: None
DarrylWong pushed a commit to DarrylWong/fork that referenced this issue Oct 30, 2023
This changes the semantics of `t.Cockroach()` to use a binary with
runtime assertions and metamorphic constants enabled by default.
Performance tests (indicated by the benchmark TestSpec) will continue
using the standard binary, without runtime assertions, as usual.

This commit also opts-out other tests that cannot easily be run with
runtime assertions or metamorphic constants enabled, most often due to
timeouts and metamorphic constant incompatibility.

Resolves cockroachdb#86678.
Informs cockroachdb#94986.

Epic: none

Release note: None
DarrylWong pushed a commit to DarrylWong/fork that referenced this issue Oct 31, 2023
This changes the semantics of `t.Cockroach()` to use a binary with
runtime assertions and metamorphic constants enabled by default.
Performance tests (indicated by the benchmark TestSpec) will continue
using the standard binary, without runtime assertions, as usual.

This commit also opts-out other tests that cannot easily be run with
runtime assertions or metamorphic constants enabled, most often due to
timeouts and metamorphic constant incompatibility.

Resolves cockroachdb#86678.
Informs cockroachdb#94986.

Epic: none

Release note: None
DarrylWong pushed a commit to DarrylWong/fork that referenced this issue Oct 31, 2023
This changes the semantics of `t.Cockroach()` to use a binary with
runtime assertions and metamorphic constants enabled by default.
Performance tests (indicated by the benchmark TestSpec) will continue
using the standard binary, without runtime assertions, as usual.

This commit also opts-out other tests that cannot easily be run with
runtime assertions or metamorphic constants enabled, most often due to
timeouts and metamorphic constant incompatibility.

Resolves cockroachdb#86678.
Informs cockroachdb#94986.

Epic: none

Release note: None
DarrylWong pushed a commit to DarrylWong/fork that referenced this issue Oct 31, 2023
This changes the semantics of `t.Cockroach()` to use a binary with
runtime assertions and metamorphic constants enabled by default.
Performance tests (indicated by the benchmark TestSpec) will continue
using the standard binary, without runtime assertions, as usual.

This commit also opts-out other tests that cannot easily be run with
runtime assertions or metamorphic constants enabled, most often due to
timeouts and metamorphic constant incompatibility.

Resolves cockroachdb#86678.
Informs cockroachdb#94986.

Epic: none

Release note: None
DarrylWong pushed a commit to DarrylWong/fork that referenced this issue Oct 31, 2023
This changes the semantics of `t.Cockroach()` to use a binary with
runtime assertions and metamorphic constants enabled by default.
Performance tests (indicated by the benchmark TestSpec) will continue
using the standard binary, without runtime assertions, as usual.

This commit also opts-out other tests that cannot easily be run with
runtime assertions or metamorphic constants enabled, most often due to
timeouts and metamorphic constant incompatibility.

Resolves cockroachdb#86678.
Informs cockroachdb#94986.

Epic: none

Release note: None
DarrylWong pushed a commit to DarrylWong/fork that referenced this issue Nov 1, 2023
This changes the semantics of `t.Cockroach()` to use a binary with
runtime assertions and metamorphic constants enabled by default.
Performance tests (indicated by the benchmark TestSpec) will continue
using the standard binary, without runtime assertions, as usual.

This commit also opts-out other tests that cannot easily be run with
runtime assertions or metamorphic constants enabled, most often due to
timeouts and metamorphic constant incompatibility.

Resolves cockroachdb#86678.
Informs cockroachdb#94986.

Epic: none

Release note: None
DarrylWong pushed a commit to DarrylWong/fork that referenced this issue Nov 1, 2023
This changes the semantics of `t.Cockroach()` to use a binary with
runtime assertions and metamorphic constants enabled by default.
Performance tests (indicated by the benchmark TestSpec) will continue
using the standard binary, without runtime assertions, as usual.

This commit also opts-out other tests that cannot easily be run with
runtime assertions or metamorphic constants enabled, most often due to
timeouts and metamorphic constant incompatibility.

Resolves cockroachdb#86678.
Informs cockroachdb#94986.

Epic: none

Release note: None
craig bot pushed a commit that referenced this issue Nov 2, 2023
111949: roachtest: randomly run with runtime assertions by default r=renatolabs,mikeCRL,rickystewart,herkolategan a=DarrylWong

This changes the semantics of `t.Cockroach()` to use a binary with runtime assertions enabled by default. Performance tests (indicated by the benchmark TestSpec) will be able to continue using the standard binary, without runtime assertions or metamorphic constants, as usual.

This commit also opts-out other tests that cannot easily be run with runtime assertions or metamorphic constants enabled, most often due to timeouts or metamorphic constant incompatibility.

Resolves #86678.
Informs #94986.

Epic: none

Release note: None

113693: go.mod: bump Pebble to 844f0582c2eb r=RaduBerinde a=itsbilal

```
844f0582 tool: fix tool to tolerate T in the log context
58cbdb46 internal/dsl: new package
481da041 db: add KeyStatistics.LatestKindsCount
86ec1c41 db: move size check for remote files to table stats job
424019fb db: guard against concurrent batch writes during Commit
5f131c12 db: default to zeroing fields in Batch.Reset
```

Release note: None.

Epic: none

Co-authored-by: Renato Costa <[email protected]>
Co-authored-by: DarrylWong <[email protected]>
Co-authored-by: Bilal Akhtar <[email protected]>
DarrylWong pushed a commit to DarrylWong/fork that referenced this issue Nov 3, 2023
This changes the semantics of `t.Cockroach()` to use a binary with
runtime assertions and metamorphic constants enabled by default.
Performance tests (indicated by the benchmark TestSpec) will continue
using the standard binary, without runtime assertions, as usual.

This commit also opts-out other tests that cannot easily be run with
runtime assertions or metamorphic constants enabled, most often due to
timeouts and metamorphic constant incompatibility.

Resolves cockroachdb#86678.
Informs cockroachdb#94986.

Epic: none

Release note: None
DarrylWong pushed a commit to DarrylWong/fork that referenced this issue Nov 6, 2023
This changes the semantics of `t.Cockroach()` to use a binary with
runtime assertions and metamorphic constants enabled by default.
Performance tests (indicated by the benchmark TestSpec) will continue
using the standard binary, without runtime assertions, as usual.

This commit also opts-out other tests that cannot easily be run with
runtime assertions or metamorphic constants enabled, most often due to
timeouts and metamorphic constant incompatibility.

Resolves cockroachdb#86678.
Informs cockroachdb#94986.

Epic: none

Release note: None
@srosenberg srosenberg added the P-3 Issues/test failures with no fix SLA label Nov 20, 2023
@erikgrinaker erikgrinaker removed their assignment Dec 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-investigation Further steps needed to qualify. C-label will change. O-postmortem Originated from a Postmortem action item. O-sre For issues SRE opened or otherwise cares about tracking. P-3 Issues/test failures with no fix SLA T-testeng TestEng Team
Projects
None yet
6 participants