Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clientv3: deprecate grpc.ErrClientConnTimeout errors #8505

Merged
merged 3 commits into from
Sep 5, 2017
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
163 changes: 163 additions & 0 deletions Documentation/upgrades/upgrade_3_3.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,163 @@
## Upgrade etcd from 3.2 to 3.3

In the general case, upgrading from etcd 3.2 to 3.3 can be a zero-downtime, rolling upgrade:
- one by one, stop the etcd v3.2 processes and replace them with etcd v3.3 processes
- after running all v3.3 processes, new features in v3.3 are available to the cluster

Before [starting an upgrade](#upgrade-procedure), read through the rest of this guide to prepare.

### Client upgrade checklists

3.3 introduces breaking changes (TODO: update this before 3.3 release).

Previously, `grpc.ErrClientConnTimeout` error is returned on client dial time-outs. 3.3 instead returns `context.DeadlineExceeded` (see [#8504](https://github.com/coreos/etcd/issues/8504)).

Before

```go
// expect dial time-out on ipv4 blackhole
_, err := clientv3.New(clientv3.Config{
Endpoints: []string{"http://254.0.0.1:12345"},
DialTimeout: 2 * time.Second
})
if err == grpc.ErrClientConnTimeout {
// handle errors
}
```

After

```go
_, err := clientv3.New(clientv3.Config{
Endpoints: []string{"http://254.0.0.1:12345"},
DialTimeout: 2 * time.Second
})
if err == context.DeadlineExceeded {
// handle errors
}
```

### Server upgrade checklists

#### Upgrade requirements

To upgrade an existing etcd deployment to 3.3, the running cluster must be 3.2 or greater. If it's before 3.2, please [upgrade to 3.2](upgrade_3_2.md) before upgrading to 3.3.

Also, to ensure a smooth rolling upgrade, the running cluster must be healthy. Check the health of the cluster by using the `etcdctl endpoint health` command before proceeding.

#### Preparation

Before upgrading etcd, always test the services relying on etcd in a staging environment before deploying the upgrade to the production environment.

Before beginning, [backup the etcd data](../op-guide/maintenance.md#snapshot-backup). Should something go wrong with the upgrade, it is possible to use this backup to [downgrade](#downgrade) back to existing etcd version. Please note that the `snapshot` command only backs up the v3 data. For v2 data, see [backing up v2 datastore](../v2/admin_guide.md#backing-up-the-datastore).

#### Mixed versions

While upgrading, an etcd cluster supports mixed versions of etcd members, and operates with the protocol of the lowest common version. The cluster is only considered upgraded once all of its members are upgraded to version 3.3. Internally, etcd members negotiate with each other to determine the overall cluster version, which controls the reported version and the supported features.

#### Limitations

Note: If the cluster only has v3 data and no v2 data, it is not subject to this limitation.

If the cluster is serving a v2 data set larger than 50MB, each newly upgraded member may take up to two minutes to catch up with the existing cluster. Check the size of a recent snapshot to estimate the total data size. In other words, it is safest to wait for 2 minutes between upgrading each member.

For a much larger total data size, 100MB or more , this one-time process might take even more time. Administrators of very large etcd clusters of this magnitude can feel free to contact the [etcd team][etcd-contact] before upgrading, and we'll be happy to provide advice on the procedure.

#### Downgrade

If all members have been upgraded to v3.3, the cluster will be upgraded to v3.3, and downgrade from this completed state is **not possible**. If any single member is still v3.2, however, the cluster and its operations remains "v3.2", and it is possible from this mixed cluster state to return to using a v3.2 etcd binary on all members.

Please [backup the data directory](../op-guide/maintenance.md#snapshot-backup) of all etcd members to make downgrading the cluster possible even after it has been completely upgraded.

### Upgrade procedure

This example shows how to upgrade a 3-member v3.2 ectd cluster running on a local machine.

#### 1. Check upgrade requirements

Is the cluster healthy and running v3.2.x?

```
$ ETCDCTL_API=3 etcdctl endpoint health --endpoints=localhost:2379,localhost:22379,localhost:32379
localhost:2379 is healthy: successfully committed proposal: took = 6.600684ms
localhost:22379 is healthy: successfully committed proposal: took = 8.540064ms
localhost:32379 is healthy: successfully committed proposal: took = 8.763432ms

$ curl http://localhost:2379/version
{"etcdserver":"3.2.7","etcdcluster":"3.2.0"}
```

#### 2. Stop the existing etcd process

When each etcd process is stopped, expected errors will be logged by other cluster members. This is normal since a cluster member connection has been (temporarily) broken:

```
14:13:31.491746 I | raft: c89feb932daef420 [term 3] received MsgTimeoutNow from 6d4f535bae3ab960 and starts an election to get leadership.
14:13:31.491769 I | raft: c89feb932daef420 became candidate at term 4
14:13:31.491788 I | raft: c89feb932daef420 received MsgVoteResp from c89feb932daef420 at term 4
14:13:31.491797 I | raft: c89feb932daef420 [logterm: 3, index: 9] sent MsgVote request to 6d4f535bae3ab960 at term 4
14:13:31.491805 I | raft: c89feb932daef420 [logterm: 3, index: 9] sent MsgVote request to 9eda174c7df8a033 at term 4
14:13:31.491815 I | raft: raft.node: c89feb932daef420 lost leader 6d4f535bae3ab960 at term 4
14:13:31.524084 I | raft: c89feb932daef420 received MsgVoteResp from 6d4f535bae3ab960 at term 4
14:13:31.524108 I | raft: c89feb932daef420 [quorum:2] has received 2 MsgVoteResp votes and 0 vote rejections
14:13:31.524123 I | raft: c89feb932daef420 became leader at term 4
14:13:31.524136 I | raft: raft.node: c89feb932daef420 elected leader c89feb932daef420 at term 4
14:13:31.592650 W | rafthttp: lost the TCP streaming connection with peer 6d4f535bae3ab960 (stream MsgApp v2 reader)
14:13:31.592825 W | rafthttp: lost the TCP streaming connection with peer 6d4f535bae3ab960 (stream Message reader)
14:13:31.693275 E | rafthttp: failed to dial 6d4f535bae3ab960 on stream Message (dial tcp [::1]:2380: getsockopt: connection refused)
14:13:31.693289 I | rafthttp: peer 6d4f535bae3ab960 became inactive
14:13:31.936678 W | rafthttp: lost the TCP streaming connection with peer 6d4f535bae3ab960 (stream Message writer)
```

It's a good idea at this point to [backup the etcd data](../op-guide/maintenance.md#snapshot-backup) to provide a downgrade path should any problems occur:

```
$ etcdctl snapshot save backup.db
```

#### 3. Drop-in etcd v3.3 binary and start the new etcd process

The new v3.3 etcd will publish its information to the cluster:

```
14:14:25.363225 I | etcdserver: published {Name:s1 ClientURLs:[http://localhost:2379]} to cluster a9ededbffcb1b1f1
```

Verify that each member, and then the entire cluster, becomes healthy with the new v3.3 etcd binary:

```
$ ETCDCTL_API=3 /etcdctl endpoint health --endpoints=localhost:2379,localhost:22379,localhost:32379
localhost:22379 is healthy: successfully committed proposal: took = 5.540129ms
localhost:32379 is healthy: successfully committed proposal: took = 7.321771ms
localhost:2379 is healthy: successfully committed proposal: took = 10.629901ms
```

Upgraded members will log warnings like the following until the entire cluster is upgraded. This is expected and will cease after all etcd cluster members are upgraded to v3.3:

```
14:15:17.071804 W | etcdserver: member c89feb932daef420 has a higher version 3.3.0
14:15:21.073110 W | etcdserver: the local etcd version 3.2.7 is not up-to-date
14:15:21.073142 W | etcdserver: member 6d4f535bae3ab960 has a higher version 3.3.0
14:15:21.073157 W | etcdserver: the local etcd version 3.2.7 is not up-to-date
14:15:21.073164 W | etcdserver: member c89feb932daef420 has a higher version 3.3.0
```

#### 4. Repeat step 2 to step 3 for all other members

#### 5. Finish

When all members are upgraded, the cluster will report upgrading to 3.3 successfully:

```
14:15:54.536901 N | etcdserver/membership: updated the cluster version from 3.2 to 3.3
14:15:54.537035 I | etcdserver/api: enabled capabilities for version 3.3
```

```
$ ETCDCTL_API=3 /etcdctl endpoint health --endpoints=localhost:2379,localhost:22379,localhost:32379
localhost:2379 is healthy: successfully committed proposal: took = 2.312897ms
localhost:22379 is healthy: successfully committed proposal: took = 2.553476ms
localhost:32379 is healthy: successfully committed proposal: took = 2.517902ms
```

[etcd-contact]: https://groups.google.com/forum/#!forum/etcd-dev
4 changes: 2 additions & 2 deletions clientv3/client.go
Original file line number Diff line number Diff line change
Expand Up @@ -320,7 +320,7 @@ func (c *Client) dial(endpoint string, dopts ...grpc.DialOption) (*grpc.ClientCo
if err != nil {
if toErr(ctx, err) != rpctypes.ErrAuthNotEnabled {
if err == ctx.Err() && ctx.Err() != c.ctx.Err() {
err = grpc.ErrClientConnTimeout
err = context.DeadlineExceeded
}
return nil, err
}
Expand Down Expand Up @@ -397,7 +397,7 @@ func newClient(cfg *Config) (*Client, error) {
case <-waitc:
}
if !hasConn {
err := grpc.ErrClientConnTimeout
err := context.DeadlineExceeded
select {
case err = <-client.dialerrc:
default:
Expand Down
5 changes: 2 additions & 3 deletions clientv3/client_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,6 @@ import (
"github.com/coreos/etcd/etcdserver/api/v3rpc/rpctypes"
"github.com/coreos/etcd/pkg/testutil"
"golang.org/x/net/context"
"google.golang.org/grpc"
)

func TestDialCancel(t *testing.T) {
Expand Down Expand Up @@ -117,8 +116,8 @@ func TestDialTimeout(t *testing.T) {
case <-time.After(5 * time.Second):
t.Errorf("#%d: failed to timeout dial on time", i)
case err := <-donec:
if err != grpc.ErrClientConnTimeout {
t.Errorf("#%d: unexpected error %v, want %v", i, err, grpc.ErrClientConnTimeout)
if err != context.DeadlineExceeded {
t.Errorf("#%d: unexpected error %v, want %v", i, err, context.DeadlineExceeded)
}
}
}
Expand Down
9 changes: 4 additions & 5 deletions clientv3/integration/dial_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,6 @@ import (
"github.com/coreos/etcd/pkg/transport"

"golang.org/x/net/context"
"google.golang.org/grpc"
)

var (
Expand Down Expand Up @@ -62,8 +61,8 @@ func TestDialTLSExpired(t *testing.T) {
DialTimeout: 3 * time.Second,
TLS: tls,
})
if err != grpc.ErrClientConnTimeout {
t.Fatalf("expected %v, got %v", grpc.ErrClientConnTimeout, err)
if err != context.DeadlineExceeded {
t.Fatalf("expected %v, got %v", context.DeadlineExceeded, err)
}
}

Expand All @@ -78,8 +77,8 @@ func TestDialTLSNoConfig(t *testing.T) {
Endpoints: []string{clus.Members[0].GRPCAddr()},
DialTimeout: time.Second,
})
if err != grpc.ErrClientConnTimeout {
t.Fatalf("expected %v, got %v", grpc.ErrClientConnTimeout, err)
if err != context.DeadlineExceeded {
t.Fatalf("expected %v, got %v", context.DeadlineExceeded, err)
}
}

Expand Down
8 changes: 4 additions & 4 deletions integration/v3_grpc_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -1537,7 +1537,7 @@ func TestTLSGRPCRejectInsecureClient(t *testing.T) {
// nil out TLS field so client will use an insecure connection
clus.Members[0].ClientTLSInfo = nil
client, err := NewClientV3(clus.Members[0])
if err != nil && err != grpc.ErrClientConnTimeout {
if err != nil && err != context.DeadlineExceeded {
t.Fatalf("unexpected error (%v)", err)
} else if client == nil {
// Ideally, no client would be returned. However, grpc will
Expand Down Expand Up @@ -1573,7 +1573,7 @@ func TestTLSGRPCRejectSecureClient(t *testing.T) {
client, err := NewClientV3(clus.Members[0])
if client != nil || err == nil {
t.Fatalf("expected no client")
} else if err != grpc.ErrClientConnTimeout {
} else if err != context.DeadlineExceeded {
t.Fatalf("unexpected error (%v)", err)
}
}
Expand Down Expand Up @@ -1730,8 +1730,8 @@ func testTLSReload(t *testing.T, cloneFunc func() transport.TLSInfo, replaceFunc
// 5. expect dial time-out when loading expired certs
select {
case gerr := <-errc:
if gerr != grpc.ErrClientConnTimeout {
t.Fatalf("expected %v, got %v", grpc.ErrClientConnTimeout, gerr)
if gerr != context.DeadlineExceeded {
t.Fatalf("expected %v, got %v", context.DeadlineExceeded, gerr)
}
case <-time.After(5 * time.Second):
t.Fatal("failed to receive dial timeout error")
Expand Down