Fix shutdown panics by separating completer context #401
Conversation
This failing test case exposes the issue in #400 100% of the time. It's caused by the `stopProducers()` call not actually waiting until the producers are fully shut down before proceeding with the rest of shutdown.
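A minimal sketch of the ordering problem, using a hypothetical `producer` type and channel names rather than River's real internals: the fix requires `stopProducers()` to block on each producer's exit before any later shutdown step runs.

```go
package example

// producer is a hypothetical stand-in for River's producer service.
type producer struct {
	stop    chan struct{} // closed to request shutdown
	stopped chan struct{} // closed by the producer goroutine once fully exited
}

func (p *producer) run() {
	defer close(p.stopped) // only signal "stopped" after all cleanup is done
	<-p.stop
	// ... drain in-flight jobs and send final results to the completer ...
}

func (p *producer) Stop() { close(p.stop) }

// stopProducers signals every producer, then blocks until each one has
// fully shut down, so later shutdown steps can't race in-flight jobs.
func stopProducers(producers []*producer) {
	for _, p := range producers {
		p.Stop()
	}
	for _, p := range producers {
		<-p.stopped
	}
}
```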
Back in #258 / 702d5b2, the batch completer was added to improve throughput. As part of that refactor, it was turned into a startstop service that takes a context on start. We took care to ensure that the context provided to the completer was _not_ the `fetchCtx` (cancelled on `Stop()`) but instead the raw user-provided `ctx`, specifically so the completer could finish its work even after fetches were stopped.

This worked well as long as the whole shutdown process was driven by `Stop` / `StopAndCancel`, but not if the user-provided context was itself cancelled outside of River. In that scenario, the completer would begin shutting down immediately upon cancellation, without waiting for the producers to send it any final jobs that still needed to be recorded. This went unnoticed until #379 / 0e57338 turned the scenario into a panic instead of a silent misbehavior, which is what was encountered in #400.

To fix this, we need Go 1.21's new `context.WithoutCancel` API to fork the user-provided context: we keep whatever else is stored in it (i.e. anything used by slog is still available), but we never cancel the completer's context. The completer manages its own shutdown when its `Stop()` is called as part of all the other client services being stopped in parallel.

Fixes #400.
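A runnable sketch of that forking behavior, with a hypothetical context key standing in for whatever values slog might read; it is not River's actual wiring. `context.WithoutCancel` preserves the parent's stored values while dropping its cancellation entirely.

```go
package main

import (
	"context"
	"fmt"
)

type ctxKey struct{} // hypothetical key standing in for values slog might read

func main() {
	parent := context.WithValue(context.Background(), ctxKey{}, "logger-attrs")
	userCtx, cancel := context.WithCancel(parent)

	// What the client now does for the completer: fork the user context so
	// its cancellation never propagates, but its values remain reachable.
	completerCtx := context.WithoutCancel(userCtx)

	cancel() // the user-provided context is cancelled outside of River

	fmt.Println(userCtx.Err())                // context.Canceled
	fmt.Println(completerCtx.Err())           // <nil>: the completer keeps working
	fmt.Println(completerCtx.Value(ctxKey{})) // logger-attrs: values survive the fork
}
```

The completer is then only ever torn down by its own `Stop()`, alongside the rest of the client's services.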
lgtm!
could this have been causing stuck jobs on graceful shutdown?
@elee1766 I believe so; we should add that to the changelog. The core issue is that the completer (which marks completed jobs in the DB) wasn't properly waiting for jobs to finish executing due to a refactor. So I think it could have resulted in some jobs not being marked as finished when they had in fact executed and returned.