-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: allow more time for consumer group cleanup #8189
Conversation
|| (e instanceof GroupNotEmptyException && retryCount.getAndIncrement() < 5), | ||
() -> Duration.of(3, ChronoUnit.SECONDS) | ||
|| (e instanceof GroupNotEmptyException), | ||
(retry) -> Duration.of(3L * retry, ChronoUnit.SECONDS), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First change: add linear backoff so that we will wait 3, 6, 9, and so forth seconds between each retry instead of just waiting 3 seconds every time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for my own curiosity, how did we come up with using 3 seconds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a comment there that says that it's the default heartbeat interval. I'm not sure if the person who made this choice was aware that we're actually waiting for the heartbeat interval + the session timeout, though.
Come to think of it, I bet that this thing has just recently started timing out as another casualty of the KIP to change the default session interval to 30 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah i only saw the first half of the comment so I missed the explanation 😅 And that makes sense, good catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if the person who made this choice was aware that we're actually waiting for the heartbeat interval + the session timeout, though.
hehe, guilty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha! Smoked you out!
Anyway, I think it was fine until we bumped that default session interval up so high.
() -> Duration.of(3, ChronoUnit.SECONDS) | ||
|| (e instanceof GroupNotEmptyException), | ||
(retry) -> Duration.of(3L * retry, ChronoUnit.SECONDS), | ||
10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second change: allow more retries. It was previously hard-coded in ExecutorUtil to 5, now it will be 10. So we would previously only wait up to 12 seconds before throwing an exception, and now we will wait for 135 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
|| (e instanceof GroupNotEmptyException && retryCount.getAndIncrement() < 5), | ||
() -> Duration.of(3, ChronoUnit.SECONDS) | ||
|| (e instanceof GroupNotEmptyException), | ||
(retry) -> Duration.of(3L * retry, ChronoUnit.SECONDS), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for my own curiosity, how did we come up with using 3 seconds?
Thanks, @nateab ! |
During local testing, I've noticed that consumer group cleanup for my transient queries
is frequently timing out and leaving defunct groups lying around. This patch increases the amount
of time to wait for the group to become empty.
Reviewer checklist