Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: check error on Engine close #87232

Merged
merged 1 commit into from
Sep 14, 2022
Merged

Conversation

jbowens
Copy link
Collaborator

@jbowens jbowens commented Aug 31, 2022

Panic if an error is encountered while closing the Engine. This ensures
unit tests and the like observe errors, especially related to leaked
iterators.

Close #71481.
Close #87507.

Release justification: low-risk bug fixes and non-production code changes
Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@jbowens jbowens force-pushed the clean-db-close branch 5 times, most recently from 3eeb747 to 630f2fa Compare September 1, 2022 14:03
@jbowens jbowens marked this pull request as ready for review September 1, 2022 20:53
@jbowens jbowens requested review from a team as code owners September 1, 2022 20:53
@jbowens jbowens changed the title storage: check error on engine close storage: check error on Engine close Sep 1, 2022
Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few unit tests that leak iterators that I haven't tracked down yet, so they're marked as exempt from this behavior for now. I'm suspicious that there might be a legitimate iterator leak somewhere in here. I'm going to create a GA blocker issue for identifying and resolving these remaining leaks.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very +1 on the testing aspects of this. However, I am a bit worried that this can cause spurious node crashes when we close engines outside of node shutdown, such as here:

// Create an engine to use as a buffer for the empty snapshot.
eng, err := storage.Open(
context.Background(),
storage.InMemory(),
storage.CacheSize(1<<20 /* 1 MiB */),
storage.MaxSize(512<<20 /* 512 MiB */))
if err != nil {
return err
}
defer eng.Close()

Are these errors always severe enough that it warrants a node crash?

Alternatively, we can consider gating this on CrdbTestBuild, and logging an error otherwise.

Comment on lines 903 to 905
// TODO(jackson): Track down the iterator leak and
// remove the storage.LeaksIteratorsTODO option.
engine := storage.NewDefaultInMemForTesting(storage.LeaksIteratorsTODO)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should definitely track this down before 22.2, in case it's a new leak.

@erikgrinaker
Copy link
Contributor

This also reminds me that we'll likely want to use log.Fatalf rather than panic for most crashes like these, since a panic may not cause the node to terminate (it may e.g. have been handled by a recover somewhere in the call stack), and instead limp along in some indeterminate state.

@knz
Copy link
Contributor

knz commented Sep 6, 2022

Good point (also log.Fatal buys us sentry reporting)

Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll gate it behind crdb-test for now, but I don't think we can swallow these errors in production either. Ideally, we'd propagate these errors to the call sites, but the Reader interface makes that a little difficult at the moment. Perhaps we can disentangle Close from Reader, so that Engine can still implement Reader but with a Close method that has an error return value.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @jbowens)

Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked down the remaining leaks—all were in test-code only

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll gate it behind crdb-test for now, but I don't think we can swallow these errors in production either.

I generally agree, but I'd like to build some confidence that we won't introduce spurious node crashes over relatively benign issues here. Happy to enable this by default (and consider an error return path) after some baking time.

Tracked down the remaining leaks—all were in test-code only

Nice! I guess we can remove the additional LeaksIteratorsTODO stuff then?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally agree, but I'd like to build some confidence that we won't introduce spurious node crashes over relatively benign issues here. Happy to enable this by default (and consider an error return path) after some baking time.

On second thought, maybe this would be a good point in the release cycle to weed this out by enabling it by default. We still have a couple of months left until the release. You call.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)

Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we can remove the additional LeaksIteratorsTODO stuff then?

I think I accidentally force pushed away my changes from another machine. Fixed.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tracking down the leaks!

if buildutil.CrdbTestBuild {
p.logger.Fatalf("error during engine close: %s", err)
} else {
p.logger.Infof("error during engine close: %s", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd promote this to an Errorf.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the pebble Logger interface doesn't have an Errorf—updated to stash the logging context on the engine and use log.Errorf

When the crdb_test build flag is provided, fatal the process if an engine Close
returns an error. This ensures unit tests and the like observe errors,
especially related to leaked iterators.

Close cockroachdb#71481.

Release justification: low-risk bug fixes and non-production code changes
Release note: None
@jbowens
Copy link
Collaborator Author

jbowens commented Sep 13, 2022

TFTR!

@jbowens
Copy link
Collaborator Author

jbowens commented Sep 14, 2022

bors r=erikgrinaker

@craig
Copy link
Contributor

craig bot commented Sep 14, 2022

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants