storage: check error on Engine close #87232
There are a few unit tests that leak iterators that I haven't tracked down yet, so they're marked as exempt from this behavior for now. I'm suspicious that there might be a legitimate iterator leak somewhere in here. I'm going to create a GA blocker issue for identifying and resolving these remaining leaks.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
I'm very +1 on the testing aspects of this. However, I am a bit worried that this can cause spurious node crashes when we close engines outside of node shutdown, such as here:
cockroach/pkg/kv/kvserver/store_snapshot.go
Lines 1315 to 1324 in 2dc2da8
// Create an engine to use as a buffer for the empty snapshot.
eng, err := storage.Open(
	context.Background(),
	storage.InMemory(),
	storage.CacheSize(1<<20 /* 1 MiB */),
	storage.MaxSize(512<<20 /* 512 MiB */))
if err != nil {
	return err
}
defer eng.Close()
Are these errors always severe enough that it warrants a node crash?
Alternatively, we can consider gating this on CrdbTestBuild, and logging an error otherwise.
// TODO(jackson): Track down the iterator leak and
// remove the storage.LeaksIteratorsTODO option.
engine := storage.NewDefaultInMemForTesting(storage.LeaksIteratorsTODO)
We should definitely track this down before 22.2, in case it's a new leak.
This also reminds me that we'll likely want to use
Good point (also log.Fatal buys us sentry reporting)
I'll gate it behind crdb-test for now, but I don't think we can swallow these errors in production either. Ideally, we'd propagate these errors to the call sites, but the Reader interface makes that a little difficult at the moment. Perhaps we can disentangle Close from Reader, so that Engine can still implement Reader but with a Close method that has an error return value.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @jbowens)
Tracked down the remaining leaks—all were in test-code only
> I'll gate it behind crdb-test for now, but I don't think we can swallow these errors in production either.
I generally agree, but I'd like to build some confidence that we won't introduce spurious node crashes over relatively benign issues here. Happy to enable this by default (and consider an error return path) after some baking time.
> Tracked down the remaining leaks; all were in test-code only
Nice! I guess we can remove the additional LeaksIteratorsTODO stuff then?
> I generally agree, but I'd like to build some confidence that we won't introduce spurious node crashes over relatively benign issues here. Happy to enable this by default (and consider an error return path) after some baking time.
On second thought, maybe this would be a good point in the release cycle to weed this out by enabling it by default. We still have a couple of months left until the release. You call.
> I guess we can remove the additional LeaksIteratorsTODO stuff then?
I think I accidentally force pushed away my changes from another machine. Fixed.
Thanks for tracking down the leaks!
pkg/storage/pebble.go
if buildutil.CrdbTestBuild {
	p.logger.Fatalf("error during engine close: %s", err)
} else {
	p.logger.Infof("error during engine close: %s", err)
}
I'd promote this to an Errorf.
The pebble Logger interface doesn't have an Errorf, so I updated this to stash the logging context on the engine and use log.Errorf.
When the crdb_test build flag is provided, fatal the process if an engine
Close returns an error. This ensures unit tests and the like observe
errors, especially related to leaked iterators.
Close cockroachdb#71481.
Release justification: low-risk bug fixes and non-production code changes
Release note: None
TFTR!
bors r=erikgrinaker
Build succeeded
Panic if an error is encountered while closing the Engine. This ensures
unit tests and the like observe errors, especially related to leaked
iterators.
Close #71481.
Close #87507.
Release justification: low-risk bug fixes and non-production code changes
Release note: None