[server] Increased Unsubscribe Wait #1213

Open
wants to merge 7 commits into
base: main

KaiSernLim
Contributor

@KaiSernLim KaiSernLim commented Oct 3, 2024

Summary

Problem

StoreIngestionTask does not always wait for all in-flight messages to be processed before transitioning the leader-follower state in the PartitionConsumptionState.

waitAfterUnsubscribe() waits up to 10 seconds for the consumer's next poll(), which would indicate that the in-flight messages from the previous poll() have been processed. When that wait times out, the state transition can proceed while messages are still in flight, which can lead to state mismatches during leader-to-follower and follower-to-leader transitions. The 10s timeout has been hit 150K times in the past month.
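To illustrate the wait described above, here is a minimal sketch of a bounded wait for the next poll(). It is not the code in this PR: the class PollWaiterSketch and the methods onPoll() and waitForNextPoll() are hypothetical names, and it assumes the consumer thread signals each poll through a shared counter.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch of a bounded "wait for the next poll()"; not the actual Venice code. */
final class PollWaiterSketch {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition polled = lock.newCondition();
  private long pollCount = 0;

  /** Assumed to be called by the consumer thread each time it enters poll(). */
  void onPoll() {
    lock.lock();
    try {
      pollCount++;
      polled.signalAll();
    } finally {
      lock.unlock();
    }
  }

  /**
   * Blocks until a poll newer than the one seen at entry is observed.
   * Returns true if such a poll happened, false on timeout or interruption.
   */
  boolean waitForNextPoll(long timeoutMs) {
    lock.lock();
    try {
      long startCount = pollCount;
      long remainingNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
      // Loop to guard against spurious wakeups.
      while (pollCount == startCount) {
        if (remainingNanos <= 0) {
          return false; // timed out: in-flight messages may not have drained yet
        }
        remainingNanos = polled.awaitNanos(remainingNanos);
      }
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
      return false;
    } finally {
      lock.unlock();
    }
  }
}
```

A false return corresponds to the timeout case counted above: the caller cannot tell whether the previous batch was fully processed.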

Mitigation

Several possible solutions were discussed, but all of them could be complicated. As an immediate action, we can increase the timeout so that the consumer more often unsubscribes safely instead of timing out. A longer timeout also allows us to add a metric that gauges how long the actual slowdown on the drainer is.

Changes

  1. Added the server config server.wait.after.unsubscribe.timeout.ms to make the timeout in waitAfterUnsubscribe() configurable, and increased the timeout:
    1. Increased the default value of this timeout from 10s to 300s (5 minutes).
    2. During shutdown / termination scenarios, when KafkaConsumerService#unsubscribeAll() is called, the timeout remains 10s so that shutdown is not blocked. If the server config is set lower than 10s, the configured value is used instead.
  2. Added the metric wait_after_unsubscribe_latency to track how long consumers wait after unsubscribing until the next poll() is invoked.
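
The timeout rules and the latency measurement listed above can be sketched as follows. This is an illustration under assumptions, not the code in this PR: the class UnsubscribeTimeoutSketch, its constants, and the methods effectiveTimeoutMs() and timeWaitMs() are hypothetical, and the latency recorder is a plain callback standing in for whatever reports wait_after_unsubscribe_latency.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;

/** Hypothetical sketch of the timeout selection and latency measurement; not the actual Venice code. */
final class UnsubscribeTimeoutSketch {
  /** New default from the PR description: 300s (5 minutes). */
  static final long DEFAULT_WAIT_AFTER_UNSUBSCRIBE_TIMEOUT_MS = 300_000L;
  /** During shutdown the wait stays capped at the previous 10s. */
  static final long SHUTDOWN_TIMEOUT_CAP_MS = 10_000L;

  /**
   * Normal operation uses the configured value as-is.
   * Shutdown uses the smaller of the configured value and the 10s cap.
   */
  static long effectiveTimeoutMs(long configuredMs, boolean shuttingDown) {
    return shuttingDown ? Math.min(configuredMs, SHUTDOWN_TIMEOUT_CAP_MS) : configuredMs;
  }

  /** Runs the wait, reports its duration in milliseconds to the recorder, and returns it. */
  static long timeWaitMs(Runnable waitAction, LongConsumer latencyRecorder) {
    long startNanos = System.nanoTime();
    waitAction.run();
    long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
    latencyRecorder.accept(elapsedMs);
    return elapsedMs;
  }
}
```

With these rules, a configured value of 300000 ms gives a 300s wait during normal operation and a 10s wait during shutdown, while a configured value of 5000 ms gives 5s in both cases.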

How was this PR tested?

GitHub CI (GHCI)

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.

@KaiSernLim KaiSernLim changed the title Increased Unsubscribe Wait WIP: Increased Unsubscribe Wait Oct 3, 2024
@KaiSernLim KaiSernLim self-assigned this Oct 4, 2024
@KaiSernLim KaiSernLim changed the title WIP: Increased Unsubscribe Wait [server] Increased Unsubscribe Wait Oct 7, 2024
@KaiSernLim KaiSernLim marked this pull request as ready for review October 7, 2024 20:40