Avoid unconditional retries in replicator's http client #1177
In some cases the higher-level code in `couch_replicator_api_wrap` needs to handle retries explicitly and cannot cope with retries happening in the lower-level HTTP client. In such cases it sets `retries = 0`. For example:
https://github.com/apache/couchdb/blob/master/src/couch_replicator/src/couch_replicator_api_wrap.erl#L271-L275
The HTTP client should then avoid unconditional retries and instead consult the `retries` value. If `retries = 0`, it shouldn't retry and should instead bubble the exception up to the caller.
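A minimal sketch of the intended behavior might look like the following. The module, function, and record names here are illustrative stand-ins, not the actual `couch_replicator_httpc` code:

```erlang
-module(retry_sketch).
-export([maybe_retry/2]).

%% Minimal stand-in for the replicator's httpdb record (illustrative).
-record(httpdb, {retries = 5, wait = 250}).

%% The caller set retries = 0: don't retry here. Throw so the error
%% bubbles up to the higher-level code (couch_replicator_api_wrap),
%% which handles retries itself.
maybe_retry(Error, #httpdb{retries = 0}) ->
    throw({retry_limit_exceeded, Error});
%% Otherwise, back off and retry with one fewer attempt remaining.
maybe_retry(_Error, #httpdb{retries = R, wait = W} = HttpDb) ->
    timer:sleep(W),
    send_req(HttpDb#httpdb{retries = R - 1, wait = W * 2}).

%% Placeholder for the actual request function.
send_req(HttpDb) ->
    {ok, HttpDb}.
```

The key point is the first clause: with `retries = 0`, the client raises instead of silently retrying, so the caller sees the failure.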
This bug was discovered when attachments were replicated to a target cluster whose resources were constrained. Since attachment `PUT` requests were made from the context of an `open_revs` `GET` request, the `PUT` requests timed out and would be retried. However, because the retry didn't bubble up to the `open_revs` code, the second `PUT` request would die with a `noproc` error, since the old parser had exited by then. See issue #745 for more.
## Testing recommendations
See issue #745 comments on how to set up testing. The code was tested locally with a Vagrant VM running Debian 8 and Erlang 17.5. Hardware resources were 1 CPU, throttled to about 30%, with disk throughput also throttled to about 10 Mb/s, and `stress` running in the background as `stress --timeout 900m --cpu 1 --io 4`.
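A setup along those lines could be approximated as sketched below. The `stress` invocation is from the notes above; using `cpulimit` against the Erlang VM process (`beam.smp`) to approximate the ~30% CPU throttle is an assumption, not part of the original setup:

```shell
# Background CPU and I/O load, as in the testing notes above.
stress --timeout 900m --cpu 1 --io 4 &

# Assumed: throttle the Erlang VM (beam.smp) to roughly 30% CPU.
# Requires cpulimit to be installed.
cpulimit --limit 30 --exe beam.smp &
```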