Avoid unconditional retries in replicator's http client #1177
In some cases the higher-level code in `couch_replicator_api_wrap` needs to handle retries explicitly and cannot cope with retries happening in the lower-level HTTP client. In such cases it sets `retries = 0`. For example:
https://github.com/apache/couchdb/blob/master/src/couch_replicator/src/couch_replicator_api_wrap.erl#L271-L275
The HTTP client should then avoid unconditional retries and instead consult the `retries` value. If `retries = 0`, it shouldn't retry and should instead bubble the exception up to the caller.
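A minimal sketch of the intended behavior might look like the following. The module, function, and record names here are illustrative stand-ins, not the actual `couch_replicator_httpc` code:

```erlang
-module(retry_sketch).
-export([maybe_retry/2]).

%% Minimal stand-in for the replicator's httpdb record (illustrative).
-record(httpdb, {retries = 5, wait = 250}).

%% The caller set retries = 0: don't retry here. Throw so the error
%% bubbles up to the higher-level code (couch_replicator_api_wrap),
%% which handles retries itself.
maybe_retry(Error, #httpdb{retries = 0}) ->
    throw({retry_limit_exceeded, Error});
%% Otherwise, back off and retry with one fewer attempt remaining.
maybe_retry(_Error, #httpdb{retries = R, wait = W} = HttpDb) ->
    timer:sleep(W),
    send_req(HttpDb#httpdb{retries = R - 1, wait = W * 2}).

%% Placeholder for the actual request function.
send_req(HttpDb) ->
    {ok, HttpDb}.
```

The key point is the first clause: with `retries = 0`, the client raises instead of silently retrying, so the caller sees the failure.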
This bug was discovered when attachments were replicated to a target cluster whose resources were constrained. Since attachment `PUT` requests were made from the context of an `open_revs` `GET` request, the `PUT` requests timed out and would be retried. However, because the retry didn't bubble up to the `open_revs` code, the second `PUT` request would die with a `noproc` error, since the old parser had exited by then. See issue #745 for more.
## Testing recommendations
See issue #745 comments on how to set up testing. The code was tested locally with a Vagrant VM running Debian 8 and Erlang 17.5. Hardware resources were 1 CPU, throttled to about 30%, with disk throughput also throttled to about 10 Mb/s, and `stress` running in the background as `stress --timeout 900m --cpu 1 --io 4`.
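A setup along those lines could be approximated as sketched below. The `stress` invocation is from the notes above; using `cpulimit` against the Erlang VM process (`beam.smp`) to approximate the ~30% CPU throttle is an assumption, not part of the original setup:

```shell
# Background CPU and I/O load, as in the testing notes above.
stress --timeout 900m --cpu 1 --io 4 &

# Assumed: throttle the Erlang VM (beam.smp) to roughly 30% CPU.
# Requires cpulimit to be installed.
cpulimit --limit 30 --exe beam.smp &
```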