Fix shutdown panics by separating completer context #401

Merged: bgentry merged 2 commits into master from bg-fix-completer-send-on-closed-channel-panic on Jun 26, 2024

Conversation

@bgentry (Contributor) commented Jun 25, 2024

Back in #258 / 702d5b2, the batch completer was added to improve
throughput. As part of that refactor, it was turned into a startstop
service that took a context on start. We took care to ensure that
the context provided to the completer was _not_ the `fetchCtx`
(cancelled on `Stop()`) but instead was the raw user-provided `ctx`,
specifically to make sure the completer could finish its work even after
fetches were stopped.

This worked well if the whole shutdown process was done with `Stop` /
`StopAndCancel`, but it did not work if the user-provided context was
itself cancelled outside of River. In that scenario, the completer would
immediately begin shutting down upon cancellation, even without waiting
for producers to finish sending it any final jobs that needed to be
recorded. This went unnoticed until #379 / 0e57338 turned this scenario
into a panic instead of a silent misbehavior, which is what was
encountered in #400.

To fix this situation, we need to use Go 1.21's new
`context.WithoutCancel` API to fork the user-provided context so that we
maintain whatever else is stored in there (e.g. so anything used by slog
is still available) but we never cancel the completer's context. The
completer will manage its own shutdown when its `Stop()` is called as
part of all of the other client services being stopped in parallel.

Fixes #400.

This failing test case exposes the issue in #400 100% of the time, which
is caused by the `stopProducers()` call not actually waiting until the
producers are fully shut down before proceeding with the remaining
shutdown.
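
As a hedged sketch (illustrative names, not River's implementation), the shape of the problem described above is that a `stopProducers`-style helper must block until every producer goroutine has fully exited before later shutdown steps run; closing a shared channel while a producer is still sending is exactly the "send on closed channel" panic.

```go
package main

import (
	"fmt"
	"sync"
)

type producer struct {
	jobs chan<- string
	stop chan struct{}
	done chan struct{}
}

func (p *producer) run() {
	defer close(p.done)
	for i := 0; ; i++ {
		// Honor a pending stop before attempting another send.
		select {
		case <-p.stop:
			return
		default:
		}
		select {
		case <-p.stop:
			return
		case p.jobs <- fmt.Sprintf("job-%d", i):
		}
	}
}

// stopProducers signals every producer to stop and waits until each one
// has confirmed, via its done channel, that its goroutine has exited.
func stopProducers(producers []*producer) {
	var wg sync.WaitGroup
	for _, p := range producers {
		wg.Add(1)
		go func(p *producer) {
			defer wg.Done()
			close(p.stop)
			<-p.done
		}(p)
	}
	wg.Wait()
}

func main() {
	jobs := make(chan string, 16)
	p := &producer{jobs: jobs, stop: make(chan struct{}), done: make(chan struct{})}
	go p.run()

	stopProducers([]*producer{p})

	// Safe only because stopProducers waited for the producer to exit;
	// closing earlier could race with an in-flight send and panic.
	close(jobs)
	fmt.Println("producers stopped; jobs channel closed without panic")
}
```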
@bgentry force-pushed the bg-fix-completer-send-on-closed-channel-panic branch from 389aaf9 to 93532e8 on June 26, 2024 02:05
@bgentry changed the title from "add failing test case for completer shutdown panic" to "fix completer shutdown panic" on Jun 26, 2024
@bgentry force-pushed the bg-fix-completer-send-on-closed-channel-panic branch from 93532e8 to ca03eaa on June 26, 2024 02:08
@bgentry changed the title from "fix completer shutdown panic" to "Fix shutdowns panic by separating completer context" on Jun 26, 2024
@bgentry force-pushed the bg-fix-completer-send-on-closed-channel-panic branch from ca03eaa to 9f4d1f1 on June 26, 2024 02:10
@bgentry marked this pull request as ready for review on June 26, 2024 02:10
@bgentry changed the title from "Fix shutdowns panic by separating completer context" to "Fix shutdown panics by separating completer context" on Jun 26, 2024
@bgentry requested a review from brandur on June 26, 2024 02:14
@brandur (Contributor) left a comment

lgtm!

@bgentry merged commit 6c6e868 into master on Jun 26, 2024
10 checks passed
@bgentry deleted the bg-fix-completer-send-on-closed-channel-panic branch on June 26, 2024 02:28
brandur added a commit that referenced this pull request Jun 26, 2024
Prepare release for version 0.8.0 which includes some additions and
fixes like #402 and #401.
@brandur mentioned this pull request Jun 26, 2024
@elee1766 (Contributor) commented

could this have been causing stuck jobs on graceful shutdown?

@bgentry (Contributor, Author) commented Jun 26, 2024

@elee1766 I believe so; we should add that to the changelog. The core issue is that the completer (which marks completed jobs in the db) wasn't properly waiting for jobs to finish executing, due to a refactor. So I think it could have resulted in some jobs not being marked as finished when they had in fact executed and returned.

Closes issue: panic: send on closed channel (#400)