[server] Increased Unsubscribe Wait #1213
Open
+98
−27
Summary
Problem
The problem is that StoreIngestionTask does not always wait for all inflight messages to be processed before transitioning the leader-follower state in the PartitionConsumptionState. waitAfterUnsubscribe() waits up to 10 seconds for the consumer's next poll() (which would indicate that the inflight messages from the last poll() were processed). If that wait times out, the state transition can proceed while messages are still inflight, which can lead to state mismatches, for example during leader-to-follower and follower-to-leader transitions. The 10s timeout has been hit 150K times in the past month.
Mitigation
Several possible solutions were discussed, but all of them could be complicated. As an immediate action, we can increase the timeout value so that the consumer more often unsubscribes safely instead of timing out. Increasing the timeout will also allow us to add a metric that gauges how long the actual slowdown on the drainer is.
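To illustrate the idea, here is a minimal sketch of waiting for the next poll() with a configurable timeout and reporting how long the wait took. It is not the code in this PR: the class WaitAfterUnsubscribeSketch, the PollTracker helper, and the methods onPoll(), waitForNextPoll(), and effectiveTimeoutMs() are hypothetical names chosen for the example. The 10s shutdown cap mirrors the behavior described for KafkaConsumerService#unsubscribeAll() in this description.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch; names and structure are assumptions, not the actual implementation. */
final class WaitAfterUnsubscribeSketch {
  // Cap applied on shutdown so that unsubscribing does not block for long (10s per the PR text).
  static final long SHUTDOWN_TIMEOUT_MS = 10_000L;

  /** Returns the configured timeout, capped at 10s when the server is shutting down. */
  static long effectiveTimeoutMs(long configuredMs, boolean shuttingDown) {
    return shuttingDown ? Math.min(configuredMs, SHUTDOWN_TIMEOUT_MS) : configuredMs;
  }

  /** Counts polls so that a waiter can detect when the next poll() has started. */
  static class PollTracker {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition polled = lock.newCondition();
    private long pollCount = 0;

    /** Called by the (hypothetical) consumer loop at the start of every poll(). */
    void onPoll() {
      lock.lock();
      try {
        pollCount++;
        polled.signalAll();
      } finally {
        lock.unlock();
      }
    }

    /**
     * Blocks until a poll newer than the one seen at call time starts, or the timeout expires.
     * Returns the elapsed wait in milliseconds (a value a latency metric could record),
     * or -1 if the wait timed out.
     */
    long waitForNextPoll(long timeoutMs) throws InterruptedException {
      long startNs = System.nanoTime();
      long remainingNs = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
      lock.lock();
      try {
        long seen = pollCount;
        while (pollCount == seen) {
          if (remainingNs <= 0) {
            return -1; // timed out: inflight messages may not have been processed yet
          }
          remainingNs = polled.awaitNanos(remainingNs);
        }
      } finally {
        lock.unlock();
      }
      return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNs);
    }
  }
}
```

In this sketch a caller would pass effectiveTimeoutMs(configuredMs, shuttingDown) to waitForNextPoll() and record any non-negative return value as the observed wait latency.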
Changes
Added the server config server.wait.after.unsubscribe.timeout.ms to turn the timeout wait in waitAfterUnsubscribe() into a configurable setting, and also increased the timeout from 10s to 300s (5m).
When KafkaConsumerService#unsubscribeAll() is called, the timeout will remain 10s in order to not block shutdown. If the server config is lower than 10s, then that value will be used instead.
Added the metric wait_after_unsubscribe_latency to track and gather data about how long consumers need to wait after unsubscribing until the next poll() request is called.
How was this PR tested?
GHCI
Does this PR introduce any user-facing changes?