Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for etcd client oneshot cluster member cycling #16307

Merged

Conversation

rrati
Copy link

@rrati rrati commented Sep 12, 2017

Backport of etcd-io/etcd#8519

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Sep 12, 2017
@rrati
Copy link
Author

rrati commented Sep 12, 2017

/assign @deads2k

@rrati rrati changed the title UPSTREAM: coreos/etcd: 8519: Fix for etcd client oneshot cluster member cycling Fix for etcd client oneshot cluster member cycling Sep 12, 2017
@deads2k
Copy link
Contributor

deads2k commented Sep 12, 2017

@rrati the pick was clean?

@rrati
Copy link
Author

rrati commented Sep 12, 2017

@deads2k More or less. The changes to the _test file aren't included because this version of etcd didn't have the tests. The changes to the client code applied cleanly.

@deads2k
Copy link
Contributor

deads2k commented Sep 12, 2017

lgtm

Do you have an issue or bug to tie to this?

@rrati
Copy link
Author

rrati commented Sep 12, 2017

A BZ, but no github issue

@liggitt
Copy link
Contributor

liggitt commented Sep 12, 2017

remind me how we're managing picks for non-kube repos? also, an issue to track bumping to a release that includes this would be good (and an issue for kube, which is impacted the same way)

@liggitt
Copy link
Contributor

liggitt commented Sep 14, 2017

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 14, 2017
@openshift-merge-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, rrati

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 14, 2017
@openshift-merge-robot
Copy link
Contributor

Automatic merge from submit-queue

@openshift-merge-robot openshift-merge-robot merged commit b3b8ade into openshift:release-3.6 Sep 15, 2017
@sttts
Copy link
Contributor

sttts commented Mar 12, 2018

Do we have a backport of this into origin master? The upstream PR only merged into etcd 3.3+.

@liggitt
Copy link
Contributor

liggitt commented Mar 12, 2018

it was backported into etcd 3.2.8 in etcd-io/etcd@15e9510 and picked up in 3.7+ in 6032f62#diff-1c6011bace39f9ad159c10fa9a674a3a

@mfojtik
Copy link
Contributor

mfojtik commented Mar 12, 2018

@liggitt how about 3.2.16 we have in 3.9 ?

@liggitt
Copy link
Contributor

liggitt commented Mar 12, 2018

yes, >= 3.2.8

verified this fix is in master:

} else if resp.StatusCode/100 == 5 {
switch resp.StatusCode {
case http.StatusInternalServerError, http.StatusServiceUnavailable:
// TODO: make sure this is a no leader response
cerr.Errors = append(cerr.Errors, fmt.Errorf("client: etcd member %s has no leader", eps[k].String()))
default:
cerr.Errors = append(cerr.Errors, fmt.Errorf("client: etcd member %s returns server error [%s]", eps[k].String(), http.StatusText(resp.StatusCode)))
}
err = cerr.Errors[0]
}
if err != nil {
if !isOneShot {
continue
}
c.Lock()
c.pinned = (k + 1) % leps
c.Unlock()
return nil, nil, err
}
if k != pinned {
c.Lock()
c.pinned = k
c.Unlock()
}

3.9:

} else if resp.StatusCode/100 == 5 {
switch resp.StatusCode {
case http.StatusInternalServerError, http.StatusServiceUnavailable:
// TODO: make sure this is a no leader response
cerr.Errors = append(cerr.Errors, fmt.Errorf("client: etcd member %s has no leader", eps[k].String()))
default:
cerr.Errors = append(cerr.Errors, fmt.Errorf("client: etcd member %s returns server error [%s]", eps[k].String(), http.StatusText(resp.StatusCode)))
}
err = cerr.Errors[0]
}
if err != nil {
if !isOneShot {
continue
}
c.Lock()
c.pinned = (k + 1) % leps
c.Unlock()
return nil, nil, err
}
if k != pinned {
c.Lock()
c.pinned = k
c.Unlock()
}

3.8:

} else if resp.StatusCode/100 == 5 {
switch resp.StatusCode {
case http.StatusInternalServerError, http.StatusServiceUnavailable:
// TODO: make sure this is a no leader response
cerr.Errors = append(cerr.Errors, fmt.Errorf("client: etcd member %s has no leader", eps[k].String()))
default:
cerr.Errors = append(cerr.Errors, fmt.Errorf("client: etcd member %s returns server error [%s]", eps[k].String(), http.StatusText(resp.StatusCode)))
}
err = cerr.Errors[0]
}
if err != nil {
if !isOneShot {
continue
}
c.Lock()
c.pinned = (k + 1) % leps
c.Unlock()
return nil, nil, err
}
if k != pinned {
c.Lock()
c.pinned = k
c.Unlock()
}

3.7:

} else if resp.StatusCode/100 == 5 {
switch resp.StatusCode {
case http.StatusInternalServerError, http.StatusServiceUnavailable:
// TODO: make sure this is a no leader response
cerr.Errors = append(cerr.Errors, fmt.Errorf("client: etcd member %s has no leader", eps[k].String()))
default:
cerr.Errors = append(cerr.Errors, fmt.Errorf("client: etcd member %s returns server error [%s]", eps[k].String(), http.StatusText(resp.StatusCode)))
}
err = cerr.Errors[0]
}
if err != nil {
if !isOneShot {
continue
}
c.Lock()
c.pinned = (k + 1) % leps
c.Unlock()
return nil, nil, err
}
if k != pinned {
c.Lock()
c.pinned = k
c.Unlock()
}

@sttts
Copy link
Contributor

sttts commented Mar 12, 2018

@liggitt without any deeper knowledge of the client code: all your references point to the etcd2 client, not the clientv3 package.

@liggitt
Copy link
Contributor

liggitt commented Mar 12, 2018

hmm, true. I wonder if there was a similar issue with the etcd v3 client

@sttts
Copy link
Contributor

sttts commented Mar 12, 2018

But even in v3, there is a 10 second context passed to the dial func. How can that block for 15 minutes? (@mfojtik noticed that)

@sttts
Copy link
Contributor

sttts commented Mar 12, 2018

With persistent grpc connections we certainly do not talk about a Dial call here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants