Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: allow more time for consumer group cleanup #8189

Merged
merged 1 commit into from
Sep 28, 2021

Conversation

vvcephei
Copy link
Member

During local testing, I've noticed that consumer group cleanup for my transient queries
is frequently timing out and leaving defunct groups lying around. This patch increases the amount
of time to wait for the group to become empty.

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@vvcephei vvcephei requested a review from a team as a code owner September 28, 2021 19:58
@vvcephei vvcephei self-assigned this Sep 28, 2021
@vvcephei vvcephei added the query-engine Issues owned by the ksqlDB Query Engine team label Sep 28, 2021
|| (e instanceof GroupNotEmptyException && retryCount.getAndIncrement() < 5),
() -> Duration.of(3, ChronoUnit.SECONDS)
|| (e instanceof GroupNotEmptyException),
(retry) -> Duration.of(3L * retry, ChronoUnit.SECONDS),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First change: add linear backoff so that we will wait 3, 6, 9, and so forth seconds between each retry instead of just waiting 3 seconds every time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my own curiosity, how did we come up with using 3 seconds?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a comment there that says that it's the default heartbeat interval. I'm not sure if the person who made this choice was aware that we're actually waiting for the heartbeat interval + the session timeout, though.

Come to think of it, I bet that this thing has just recently started timing out as another casualty of the KIP to change the default session interval to 30 seconds.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah i only saw the first half of the comment so I missed the explanation 😅 And that makes sense, good catch!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if the person who made this choice was aware that we're actually waiting for the heartbeat interval + the session timeout, though.

hehe, guilty

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha! Smoked you out!

Anyway, I think it was fine until we bumped that default session interval up so high.

() -> Duration.of(3, ChronoUnit.SECONDS)
|| (e instanceof GroupNotEmptyException),
(retry) -> Duration.of(3L * retry, ChronoUnit.SECONDS),
10
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second change: allow more retries. It was previously hard-coded in ExecutorUtil to 5, now it will be 10. So we would previously only wait up to 12 seconds before throwing an exception, and now we will wait for 135 seconds.

Copy link
Member

@nateab nateab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

|| (e instanceof GroupNotEmptyException && retryCount.getAndIncrement() < 5),
() -> Duration.of(3, ChronoUnit.SECONDS)
|| (e instanceof GroupNotEmptyException),
(retry) -> Duration.of(3L * retry, ChronoUnit.SECONDS),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my own curiosity, how did we come up with using 3 seconds?

@vvcephei
Copy link
Member Author

Thanks, @nateab !

@vvcephei vvcephei merged commit d2ff1ac into master Sep 28, 2021
@vvcephei vvcephei deleted the fix-consumer-group-cleanup branch September 28, 2021 23:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
query-engine Issues owned by the ksqlDB Query Engine team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants