
Set a NopCloser request body with retry middleware #1016

Merged
merged 1 commit into traefik:master from the issue-1008 branch on Feb 2, 2017

Conversation

@bamarni (Contributor) commented Jan 4, 2017

fixes #1008

@SantoDE (Collaborator) commented Jan 12, 2017

Hey @bamarni ,

first of all, thanks for your contribution 👍

I think it's fine for the moment. If we run into an issue where such errors are sent by upstream, we can still iterate on it.

Could you please rebase your branch with the current master?

LGTM 👼

/cc @containous/traefik

@bamarni (Contributor, Author) commented Jan 12, 2017

Hi @SantoDE , you're welcome 😸

Maybe a cleaner solution would be to pass an explicit error handler to the oxy forwarder which, instead of encoding the dial error in the HTTP response, would use the request context for example. Then we would only retry for this specific error. What do you think?
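For illustration, a rough sketch of that idea, assuming Go's request context (Go 1.7+) and oxy's forward.ErrorHandler option with a utils.ErrorHandlerFunc adapter; the type and function names below are made up for this sketch and are not the PR's actual code:

package middlewares

import (
	"context"
	"net/http"
)

// retryErrorKey and retryError are illustrative names, not part of oxy or traefik.
type retryErrorKey struct{}

type retryError struct{ err error }

// withRetryError seeds the request context with a holder that the forwarder's
// error handler can fill in; the retry middleware inspects it after each attempt.
func withRetryError(r *http.Request) (*http.Request, *retryError) {
	holder := &retryError{}
	ctx := context.WithValue(r.Context(), retryErrorKey{}, holder)
	return r.WithContext(ctx), holder
}

// errorHandler would be registered on the oxy forwarder, e.g. via
// forward.ErrorHandler(utils.ErrorHandlerFunc(errorHandler)). Instead of only
// translating the dial error into a status code, it records it for the caller.
func errorHandler(w http.ResponseWriter, req *http.Request, err error) {
	if holder, ok := req.Context().Value(retryErrorKey{}).(*retryError); ok {
		holder.err = err
	}
	w.WriteHeader(http.StatusBadGateway)
}

The retry loop would then call withRetryError before each attempt and retry only while holder.err is non-nil and attempts remain.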

@SantoDE (Collaborator) commented Jan 12, 2017

Hi @bamarni ,

that would probably be even better! The oxy forwarder already supports passing it an error handler so why not? :-)

@bamarni (Contributor, Author) commented Jan 12, 2017

Agreed, I'll check this out soon.

@bamarni bamarni changed the title use a seekable body in retry middleware update retry middleware logic Jan 14, 2017
@bamarni (Contributor, Author) commented Jan 14, 2017

@SantoDE: I've updated the PR and split it into 2 commits, please check their commit messages. The first one fixes #1008, the second changes the retry logic to only catch network errors.

@bamarni bamarni force-pushed the issue-1008 branch 3 times, most recently from 5e53ade to 03c025c Compare January 14, 2017 12:46
@SantoDE (Collaborator) commented Jan 16, 2017

Hey @bamarni ,

sadly, you have to rebase again :(

As for your commits, I'm okay with splitting into 2, as long as the fixing commit references what it fixes in its message :-)

ping @containous/traefik

@bamarni (Contributor, Author) commented Jan 17, 2017

sure, just rebased

@SantoDE (Collaborator) commented Jan 17, 2017

Thanks. Now it looks really LGTM 👼

/ping @containous/traefik

@trecloux (Contributor) left a comment:

LGTM 👍

	attempts := 1
	for {
		recorder := NewRecorder()
		recorder.responseWriter = rw
		retry.next.ServeHTTP(recorder, r)
-		if !isNetworkError(recorder.Code) || attempts >= retry.attempts {
+		if recorder.Error == nil || attempts >= retry.attempts {
@emilevauge (Member) commented Jan 19, 2017

@bamarni are you sure you only have network errors in there? Maybe I'm wrong but I don't see the filter anymore (even in the ErrorHandler).

@bamarni (Contributor, Author) replied:

Thanks for double checking. The previous filter was looking at the response status because that's how the default oxy error handler reports errors; my error handler instead passes the error as a new field in the recorder struct. If this field is not nil, the error comes from the transport's RoundTrip call, meaning it is a network error.

However, what I'm not sure about is whether the next attempt would be forwarded to a different backend rather than the same one; I guess this depends on the load balancer configuration? With sticky sessions, for instance, retrying wouldn't really make sense.
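For reference, a minimal sketch of the recorder-based variant described above; the field and handler names are illustrative, not the exact PR code:

package middlewares

import "net/http"

// Recorder is a trimmed-down sketch: the real recorder also buffers the
// response so it can be flushed or discarded between attempts.
type Recorder struct {
	responseWriter http.ResponseWriter
	Code           int
	Error          error // filled in by the custom error handler below
}

// errorHandler is what would be passed to the oxy forwarder: any error it
// receives comes from the transport's RoundTrip, i.e. a network error.
func (rec *Recorder) errorHandler(w http.ResponseWriter, req *http.Request, err error) {
	rec.Error = err
	rec.Code = http.StatusBadGateway
	w.WriteHeader(http.StatusBadGateway)
}

With this, the loop condition shown in the diff above (recorder.Error == nil || attempts >= retry.attempts) stops retrying as soon as an attempt completes without a transport error.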

@emilevauge (Member) replied.

@bamarni (Contributor, Author) replied:

Would you consider these as network errors too? At least from my point of view, if the forwarder doesn't manage to send the request or get a proper response from upstream for whatever reason, I'd see that as a network error, as opposed to a valid HTTP response from upstream with a 5XX status code, which currently would also be retried.

Shall I stick to the first commit and open another issue to discuss the exact behaviour we want for the retry feature?

@emilevauge (Member) replied:

> Would you consider these as network errors too?

Not all these errors are network errors. For example https://github.com/containous/oxy/blob/master/forward/fwd.go#L278.

Therefore, I think it would be better to revert to the old test if !isNetworkError(recorder.Code) || attempts >= retry.attempts { :)
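For context, the status-based test referred to here counts the codes the default oxy error handler emits on transport failures as network errors; roughly (a sketch, not necessarily the exact function in the codebase):

package middlewares

import "net/http"

// isNetworkError reports whether the recorded status was set by the
// forwarder's error handler (502 on dial/transport errors, 504 on timeouts)
// rather than returned by the upstream itself.
func isNetworkError(status int) bool {
	return status == http.StatusBadGateway || status == http.StatusGatewayTimeout
}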

@bamarni (Contributor, Author) replied:

I've reverted it; however, please note that this doesn't change the behaviour for the errors you're mentioning, they'd still lead to retries.

The errors I was tackling with this change are instead explicit HTTP errors from upstream. Comparing with nginx's proxy_next_upstream semantics, the current behaviour is:

error timeout http_502 http_504 non_idempotent

And with my changes it would be :

error timeout non_idempotent

The motivation behind this is that, imo, http_5XX + non_idempotent is a bit aggressive for a default. If upstream sends an HTTP 504 for a POST request, it might still have altered a resource but failed to serve the response in a timely manner, and retrying might not be the best option; at least that's hard to assume. Maybe even non_idempotent shouldn't be on by default? Or the retry behaviour should be configurable?

@Russell-IO (Contributor) left a comment:

LGTM 👍

@bamarni (Contributor, Author) commented Feb 2, 2017

It seems like there are random failures in the build 😿 For example, mine failed on TestAccessLog, which contains this code: https://github.com/containous/traefik/blob/009057cb87db4a0991909c9c813c0819a3d63e18/integration/access_log_test.go#L37

There are multiple places in the test suite with this kind of arbitrary wait period, which makes them prone to random failures. How about introducing generic helpers, for example utils.WaitTcp(addr, timeout), instead of hardcoding a time period?
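A possible shape for such a helper, using the utils.WaitTcp(addr, timeout) signature proposed above (the function does not exist yet; this is only a sketch):

package utils

import (
	"fmt"
	"net"
	"time"
)

// WaitTcp polls addr until a TCP connection succeeds or the timeout elapses,
// so integration tests can wait for a listener instead of sleeping a fixed time.
func WaitTcp(addr string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		conn, err := net.DialTimeout("tcp", addr, 500*time.Millisecond)
		if err == nil {
			conn.Close()
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("%s not reachable after %s: %v", addr, timeout, err)
		}
		time.Sleep(100 * time.Millisecond)
	}
}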

@emilevauge (Member) commented:

@bamarni
Yeah we know... We will refactor tests in the next release.

@bamarni (Contributor, Author) commented Feb 2, 2017

@emilevauge: cool, let me know if you need extra help. As for this PR, I've rebased it.

As the http client always closes the request body,
this makes sure the request can be retried if needed.

Fixes traefik#1008
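The gist of that change, sketched against the retry middleware's ServeHTTP (type and field names are simplified here; see the merged commit for the exact code):

package middlewares

import (
	"io/ioutil"
	"net/http"
)

// Retry is a trimmed-down stand-in for the middleware type.
type Retry struct {
	attempts int
	next     http.Handler
}

func (retry *Retry) ServeHTTP(rw http.ResponseWriter, r *http.Request) {
	if retry.attempts > 1 {
		// The forwarder's HTTP transport closes r.Body after an attempt.
		// Wrapping it in a NopCloser keeps it open across attempts so a
		// retried request can still send it; the underlying body is closed
		// exactly once when this handler returns.
		body := r.Body
		defer body.Close()
		r.Body = ioutil.NopCloser(body)
	}
	// ... the retry loop from the review snippet above wraps this call ...
	retry.next.ServeHTTP(rw, r)
}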
@bamarni bamarni mentioned this pull request Feb 2, 2017
@bamarni bamarni changed the title update retry middleware logic Set a NopCloser request body with retry middleware Feb 2, 2017
@emilevauge emilevauge merged commit d0e2349 into traefik:master Feb 2, 2017
@ldez ldez added this to the 1.2 milestone Oct 1, 2017
Successfully merging this pull request may close these issues.

Traefik failing on POST request
6 participants