
Retry HTTP for status >= 500 (exponential backoff) #3164

Merged
3 commits merged on Jun 2, 2021

Conversation

attilapiros
Contributor

@attilapiros attilapiros commented May 24, 2021

Description

Fix #3087
This PR introduces configurable retry with exponential backoff for HTTP operations.
According to the API conventions, when the HTTP status code is one of the following:

  • 500 StatusInternalServerError
  • 503 StatusServiceUnavailable
  • 504 StatusServerTimeout

the suggested client recovery behavior is "retry with exponential backoff".

Although in the case of 504 the suggestion is "Increase the value of the timeout param and retry with exponential backoff", this PR focuses only on the retries; increasing the timeout can be implemented in a follow-up PR.

For the intention behind this change, please check #3087.
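
To make the intended behavior concrete, here is a minimal, self-contained sketch of retrying with exponential backoff on 5xx responses (plain java.net.http, not the actual OperationSupport code; the class name, constants and default values are illustrative only):

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RetrySketch {

  // Hypothetical defaults; the real knobs added by this PR are the
  // kubernetes.request.retry.backoffLimit / backoffInterval properties.
  static final int BACKOFF_LIMIT = 5;           // maximum number of retries
  static final long BACKOFF_INTERVAL_MS = 1000; // base interval in milliseconds

  static HttpResponse<String> executeWithRetry(HttpClient client, HttpRequest request)
      throws IOException, InterruptedException {
    int retries = 0;
    while (true) {
      HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
      // Retry only on 5xx responses, as suggested by the Kubernetes API conventions.
      if (response.statusCode() < 500 || retries >= BACKOFF_LIMIT) {
        return response;
      }
      // Exponential backoff: base interval * 2^retries.
      Thread.sleep(BACKOFF_INTERVAL_MS * (1L << retries));
      retries++;
    }
  }

  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/api")).build();
    System.out.println(executeWithRetry(client, request).statusCode());
  }
}
```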

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Chore (non-breaking change which doesn't affect codebase;
    test, version modification, documentation, etc.)

Checklist

  • Code contributed by me aligns with current project license: Apache 2.0
  • I added a CHANGELOG entry regarding this change
  • I have implemented unit tests to cover my changes
  • I have added/updated the javadocs and other documentation accordingly
  • No new bugs, code smells, etc. in SonarCloud report
  • I tested my code in Kubernetes
  • I tested my code in OpenShift

@centos-ci

Can one of the admins verify this patch?

@shawkins
Contributor

I think the code looks good. A general question would be around the existing backoff strategy in the watch code:

Should there be consistency between the properties / strategy? Watch only uses interval and limit. There's also a hardcoded value of 5 limiting the exponential scaling to 32x.
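
For reference, the capped scaling mentioned above amounts to something like this (a sketch with hypothetical names, not the actual watch manager code): the retry index is capped at 5, so the interval never grows beyond 32x the base.

```java
public class WatchBackoffSketch {
  // Hardcoded cap of 5 limits the exponential scaling to 2^5 = 32x the base interval.
  static final int MAX_EXPONENT = 5;

  static long nextReconnectInterval(long baseIntervalMs, int retryIndex) {
    int exponent = Math.min(retryIndex, MAX_EXPONENT);
    return baseIntervalMs * (1L << exponent); // baseInterval * 2^exponent
  }

  public static void main(String[] args) {
    for (int i = 0; i <= 7; i++) {
      System.out.println("retry " + i + " -> " + nextReconnectInterval(1000, i) + " ms");
    }
  }
}
```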

@attilapiros
Contributor Author

@shawkins
Thanks for the review!
I will update the PR with the suggested changes next week (or even sooner depending on my other tasks).

@attilapiros
Contributor Author

I have renamed the properties to follow the existing naming used by the watchers:

  • kubernetes.request.retry.backoffLimit
  • kubernetes.request.retry.backoffInterval

And I extracted the interval calculation into ExponentialBackoffIntervalCalculator, which is used by both OperationSupport and AbstractWatchManager.

This new class is not tested separately, as the functionality is already covered by the existing watcher unit test and its logic is lightweight:
https://github.com/fabric8io/kubernetes-client/blob/3c95a6f7da2271392a76ae806d475018d33b66d3/kubernetes-client/src/test/java/io/fabric8/kubernetes/client/dsl/internal/AbstractWatchManagerTest.java#L83
But we can revisit this decision if you'd like.

The force-push was done because of the rebase on top of master.
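
As a usage note (a sketch with arbitrary values, assuming the renamed settings are resolved through the standard Config property lookup like the other kubernetes.* properties), they could be set as system properties before creating the client:

```java
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class RetryConfigSketch {
  public static void main(String[] args) {
    // Arbitrary example values: up to 5 retries, starting from a 1000 ms base interval.
    System.setProperty("kubernetes.request.retry.backoffLimit", "5");
    System.setProperty("kubernetes.request.retry.backoffInterval", "1000");

    // The Config created for the client reads the kubernetes.* system properties.
    try (KubernetesClient client = new DefaultKubernetesClient()) {
      client.pods().inNamespace("default").list();
    }
  }
}
```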

@attilapiros
Contributor Author

I think the integration test failure is unrelated, probably a flaky test:

Error:  Tests run: 4, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.729 s <<< FAILURE! - in io.fabric8.kubernetes.PersistentVolumeClaimIT
Error:  io.fabric8.kubernetes.PersistentVolumeClaimIT.update  Time elapsed: 0.182 s  <<< ERROR!
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PATCH at:
 https://10.1.1.35:8443/api/v1/namespaces/itest-a2b75c94/persistentvolumeclaims/persistentvolumeclaims-update. 
Message: the server rejected our request due to an error in our request. 
Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[], group=null, kind=null, name=null,
 retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=the server rejected our request due 
 to an error in our request, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null,
  additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).
	at io.fabric8.kubernetes.PersistentVolumeClaimIT.update(PersistentVolumeClaimIT.java:66)

@attilapiros
Contributor Author

attilapiros commented May 29, 2021

I looked around and found this:
#3136 (comment)

So this error was a known problem before, which #3148 tried to resolve.

My changes are on top of the current HEAD, so I have these commits:
attilapiros@d1ab437
attilapiros@390de03

cc @shawkins @manusa could you please take a look at this?

@attilapiros
Contributor Author

Big thanks for the rerun to anybody who helped me! :)

@shawkins
Contributor

> could you please take a look at this?

Yes, that is the same issue; an edit is just a different form of patch(item). That test should be updated to remove the possibility of a 422.

@attilapiros attilapiros requested a review from oscerd June 1, 2021 04:02
Contributor

@shawkins shawkins left a comment


LGTM

@shawkins shawkins mentioned this pull request Jun 2, 2021
Member

@manusa manusa left a comment


LGTM, thx!

@sonarcloud

sonarcloud bot commented Jun 2, 2021

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 4 (rating A)

Coverage: 96.9%
Duplication: 0.0%

@manusa manusa merged commit d19c752 into fabric8io:master Jun 2, 2021
@geoand
Contributor

geoand commented Jul 1, 2021

Cool feature!

I have a question: do these new properties come into play when a java.net.ConnectException: Connection refused (Connection refused) is thrown by the client?
Looking at the code, it doesn't look like it. Would it not make sense to handle this case as well?

@rohanKanojia
Member

@attilapiros @shawkins @manusa : Could you please share your thoughts on this?

@shawkins
Contributor

shawkins commented Jul 1, 2021

> Looking at the code, it doesn't look like it

Yes, it will just throw the IOException.

> Would it not make sense to handle this case as well?

Yes, that's an assumption made in other places, like with informers https://github.com/kubernetes/client-go/blob/master/tools/cache/reflector.go#L416

FWIW the code in fabric8 is a little different for watches/informers wrt connection refused - if the initial watch/list (which would be subject to the common retry logic) succeeds, then any later IOExceptions will be assumed to be recoverable - indefinitely in the case of informers.

@geoand
Contributor

geoand commented Jul 1, 2021

Thanks for the information.

Should I open a new issue for this feature request?

@shawkins
Contributor

shawkins commented Jul 1, 2021

> Should I open a new issue for this feature request?

Yes please.

@attilapiros
Contributor Author

Sorry, I am away from my computer, but I can check it next week at the latest.

@geoand
Contributor

geoand commented Jul 1, 2021

#3291 is the new issue
