
*.Retry.BackoffFunc #1160

Merged: 3 commits into IBM:master on Feb 15, 2019

Conversation

@thomaslee (Contributor)

This is just a proof of concept to convey the idea; I intend to flesh things out based on feedback if this seems like something that is likely to land.

The Retry.Backoff config settings as they currently stand can force users to make some hard choices. Take Config.Producer.Retry.Backoff for example: using the default configuration, a 300ms+ hiccup in the cluster could lead to data loss without careful error handling in user code. Say you want to survive up to a 10s hiccup without data loss -- you're left with a few bad options:

  1. You increase Producer.Retry.Backoff to 4-5s+ and live with the occasional large latency spike (this covers things like partition leadership changes too); or
  2. You increase Producer.Retry.Max to 100, keep Producer.Retry.Backoff in the low hundreds of milliseconds, and live with the risk of increased duplicates; or
  3. You fiddle endlessly with these numbers trying to find the right balance between "do not lose data" and "do not be slow".

This change offers a better middle ground: compute the backoff using an (optional) function. This opens the door for more sophisticated strategies (e.g. exponential backoff) without breaking existing code.
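
To make it concrete, here is a rough sketch of how this might look from user code. The field name, signature (`func(retries, maxRetries int) time.Duration`), and durations are illustrative, not final; the import path shown is the current one, the repo lived at github.com/Shopify/sarama at the time of this PR:

```go
package main

import (
	"time"

	"github.com/IBM/sarama" // was github.com/Shopify/sarama at the time of this PR
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Producer.Retry.Max = 10

	// Exponential backoff: 100ms, 200ms, 400ms, ... capped at 10s.
	cfg.Producer.Retry.BackoffFunc = func(retries, maxRetries int) time.Duration {
		backoff := time.Duration(100<<uint(retries)) * time.Millisecond
		if backoff > 10*time.Second {
			backoff = 10 * time.Second
		}
		return backoff
	}

	_ = cfg // pass cfg to sarama.NewAsyncProducer as usual
}
```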

Seems like if we do this for producers, we might as well do it for Metadata, Consumer, etc. too.

I see exponential backoff was discussed over in #782, but I don't think the suggested patch ever materialized, so I thought I'd have a shot at it.

@eapache (Contributor) commented Aug 29, 2018

@thomaslee (Contributor, Author)

Yeah, the actual retry logic already exists, so it seems less risky to just give users a little flexibility in how the retry backoff itself is calculated. Maybe a retrier would make more sense once you're keen to make more sweeping changes to the innards of AsyncProducer et al.?

FWIW, the specific case we're running into is that on some very low-volume topics we run afoul of connections.max.idle.ms. We get an EOF, then the retry logic kicks in and puts us to sleep for several seconds (we went with "option 1" above), so the first send after a long hiatus is super, duper slow. Other ephemeral stuff like partition leadership changes is just as likely to bite us, though. I'd be happy with anything that gives us a little more flexibility to work around the smaller, expected hiccups.

@thomaslee (Contributor, Author)

Cleaned up the change a bit & added a test. I'll hook up similar changes for Metadata.Retry.Backoff & Consumer.Retry.Backoff tomorrow if I don't hear any complaints between now and then.

@thomaslee (Contributor, Author)

Added Config.{Consumer,Metadata}.Retry.BackoffFunc too, plus tests for all three. The only thing that feels a bit iffy here is that the BackoffFunc for consumers has no upper bound on retries. I think that's okay, but it's sort of a shame they don't all have the same signature.
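
For reference, the rough shape of the three hooks as described above (illustrative only, not the actual sarama source):

```go
package sketch

import "time"

// Approximate signatures of the three proposed hooks; note the consumer
// variant has no maxRetries argument to bound against.
type retryHooks struct {
	ProducerBackoffFunc func(retries, maxRetries int) time.Duration
	MetadataBackoffFunc func(retries, maxRetries int) time.Duration
	ConsumerBackoffFunc func(retries int) time.Duration
}
```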

Holler out if anything else looks suspicious.

@thomaslee (Contributor, Author)

Just giving this a bump -- any thoughts?

@varun06 (Contributor) commented Feb 13, 2019

This is something I was researching today; let's bring it back and merge if things look okay.

@sam-obeid left a comment

Looks good overall, and I see how this is useful! As long as we keep it optional as mentioned by @eapache, this is good to merge!

@varun06 merged commit 6a7bac8 into IBM:master on Feb 15, 2019