
how to handle watch retry #8914

Closed
cnljf opened this issue Nov 24, 2017 · 35 comments

Comments

@cnljf

cnljf commented Nov 24, 2017

Hi:
My environment is etcd server 3.2.9 and Go client 3.2.9, using username/password auth.
I have run into a problem. When I restart an etcd server, the watch channel is closed, and when I retry the watch without creating a new client, the watch never succeeds. Here is the code:

```go
func ProcessWatch(wc clientv3.WatchChan) {
	for resp := range wc {
		if resp.Canceled {
			log.Log(resp.Err())
		}
		// ... process events ...
	}

	// retry with a new watch
	newWc := Watch(...)
	go ProcessWatch(newWc)
}
```

1. When wc is closed, resp.Canceled is not true, so resp.Err() is never logged.
2. The retried watch fails, and then it keeps retrying (and failing) forever.

Here is the tcpdump result (screenshot omitted):

The long-lived connection is still alive, but there is a PermissionDenied error.
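
For reference, here is a minimal sketch of the retry loop described above, rewritten against the clientv3 API; the watched prefix, the one-second backoff, and the client setup below are illustrative assumptions, not details from the original report:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// processWatch keeps re-opening the watch whenever the channel closes.
func processWatch(ctx context.Context, cli *clientv3.Client, prefix string) {
	for {
		wc := cli.Watch(ctx, prefix, clientv3.WithPrefix())
		for resp := range wc {
			// Canceled can stay false when the channel is simply closed,
			// so check Err() on every response as well.
			if err := resp.Err(); err != nil {
				log.Println("watch error:", err)
			}
			for _, ev := range resp.Events {
				log.Printf("%s %q -> %q", ev.Type, ev.Kv.Key, ev.Kv.Value)
			}
		}
		// The channel closed: back off briefly, then open a new watch.
		select {
		case <-ctx.Done():
			return
		case <-time.After(time.Second):
		}
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 2 * time.Second,
		Username:    "guest",
		Password:    "****",
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	processWatch(context.Background(), cli, "/myprefix/")
}
```

As reported above, checking resp.Err() inside the loop matters because Canceled is not reliably set when the channel closes.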

@xiang90
Contributor

xiang90 commented Nov 26, 2017

/cc @mitake

can you take a look?

@mitake
Contributor

mitake commented Nov 27, 2017

@xiang90 @cnljf ok, I'll take a look later

@mitake
Contributor

mitake commented Nov 27, 2017

@cnljf could you share more detailed information about your deployments? How many nodes are in your cluster? What token type are you using (simple or JWT)?

@cnljf
Author

cnljf commented Nov 27, 2017

It's a three-node cluster.
etcd Version: 3.2.9, Git SHA: f1d7dd8, Go Version: go1.8.4, Go OS/Arch: linux/amd64
The token comes from adding the user with user add (screenshot omitted).

@mitake
Contributor

mitake commented Nov 28, 2017

@cnljf could you show every parameter you use to create your client? Do the parameters include every endpoint?

Also, does this problem happen if you disable auth?

@cnljf
Author

cnljf commented Nov 28, 2017

clientv3.New(clientv3.Config{ Endpoints: []string{"*.*.*.*:2379"}, DialTimeout: 2 * time.Second, Username: "guest", Password: "****", })
It only uses one endpoint when creating the client.

I have two clusters:
One is 3.1.5; its watches don't need auth and it works normally.
The other is 3.2.9; it has this problem.

This problem occurs in my beta and production environments. But when I simulated it on my desktop, it only occurred once.

Here are the dependencies I use; hope this helps:
```toml
[[projects]]
name = "github.com/coreos/etcd"
packages = ["auth/authpb","clientv3","etcdserver/api/v3rpc/rpctypes","etcdserver/etcdserverpb","mvcc/mvccpb"]
revision = "f1d7dd87da3e8feab4aaf675b8e29c6a5ed5f58b"
version = "v3.2.9"

[[projects]]
branch = "master"
name = "github.com/golang/protobuf"
packages = ["proto","ptypes/any"]
revision = "130e6b02ab059e7b717a096f397c5b60111cae74"

[[projects]]
branch = "master"
name = "golang.org/x/net"
packages = ["context","http2","http2/hpack","idna","internal/timeseries","lex/httplex","trace"]
revision = "1087133bc4af3073e18add999345c6ae75918503"

[[projects]]
branch = "master"
name = "golang.org/x/text"
packages = ["collate","collate/build","internal/colltab","internal/gen","internal/tag","internal/triegen","internal/ucd","language","secure/bidirule","transform","unicode/bidi","unicode/cldr","unicode/norm","unicode/rangetable"]
revision = "c01e4764d870b77f8abe5096ee19ad20d80e8075"

[[projects]]
branch = "master"
name = "google.golang.org/genproto"
packages = ["googleapis/rpc/status"]
revision = "f676e0f3ac6395ff1a529ae59a6670878a8371a6"

[[projects]]
name = "google.golang.org/grpc"
packages = [".","codes","connectivity","credentials","grpclb/grpc_lb_v1","grpclog","internal","keepalive","metadata","naming","peer","stats","status","tap","transport"]
revision = "b3ddf786825de56a4178401b7e174ee332173b66"
version = "v1.5.2"
```

@mitake
Contributor

mitake commented Nov 28, 2017

Could you try passing all endpoints to clientv3.Config? If the client has multiple endpoints, failover should work.

BTW, do you mean that you don't see this problem even when you use a single endpoint for client creation in 3.1.5?
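
For illustration, a hedged sketch of what a multi-endpoint client configuration might look like; the addresses are placeholders, not values from this thread:

```go
// Sketch only: list every member's client URL so the clientv3 balancer can
// fail over to another node when one of them is restarted.
cli, err := clientv3.New(clientv3.Config{
	Endpoints:   []string{"192.168.1.1:2379", "192.168.1.2:2379", "192.168.1.3:2379"},
	DialTimeout: 2 * time.Second,
	Username:    "guest",
	Password:    "****",
})
if err != nil {
	log.Fatal(err)
}
defer cli.Close()
```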

@cnljf
Author

cnljf commented Nov 28, 2017

When I create the client, I call clientv3.Sync(). Is that equivalent to creating the client with all endpoints?

@mitake
Contributor

mitake commented Nov 28, 2017

Yes, clientv3.Sync() can be used for that purpose. It is strange...

> But when I simulated it on my desktop, it only occurred once.

Do you mean the problem is non-deterministic and hard to reproduce?
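
As a sketch of the Sync() approach discussed here (cli is the client from the earlier snippet; the 30-second interval and 2-second timeout are arbitrary assumptions):

```go
// Sketch only: periodically refresh the client's endpoint list from the
// cluster membership, so failover keeps working as members change.
go func() {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		if err := cli.Sync(ctx); err != nil {
			log.Println("endpoint sync failed:", err)
		}
		cancel()
	}
}()
```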

@cnljf
Author

cnljf commented Nov 28, 2017

Yes, it's hard to reproduce.
But I have kept the abnormal state in the production environment.

In addition, if you need some information, I can try to reproduce it in the beta environment.

@mitake
Contributor

mitake commented Nov 28, 2017

I see. If you get any additional information related to it, could you share it? I'm still trying to reproduce the problem.

@cnljf
Author

cnljf commented Nov 28, 2017

Can you indicate what information you may need? Then I will have a specific target to fetch.

@mitake
Contributor

mitake commented Nov 28, 2017

I would like logs of all nodes if possible. It would also be nice if you could provide more detailed information about your client's behaviour.

@mitake mitake self-assigned this Nov 28, 2017
@cnljf
Author

cnljf commented Nov 30, 2017

I just shut down one node in the beta env, and the client can't watch normally.
Here is the etcd server log:
```
2017-11-30 14:01:17.681825 I | auth: deleting token qilITaAgtFjfSAyg.4681813 for user guest
2017-11-30 14:01:21.681762 I | auth: deleting token TRDFiakiiOCffKkL.4681818 for user guest
2017-11-30 14:01:23.177530 W | auth: invalid auth token: EZXKZQdeBKLEwYiT.4612998
2017-11-30 14:01:23.177990 W | auth: invalid auth token: LrnsMWmoUMsfecms.4613000

2017-11-30 14:01:23.204399 W | rafthttp: lost the TCP streaming connection with peer 4c3613973f03af18 (stream Message reader)
2017-11-30 14:01:23.204431 W | rafthttp: lost the TCP streaming connection with peer 4c3613973f03af18 (stream MsgApp v2 reader)

2017-11-30 14:01:23.377820 E | rafthttp: failed to dial 4c3613973f03af18 on stream Message (dial tcp 192.168.1.3:2380: getsockopt: connection refused)
2017-11-30 14:01:23.378372 I | rafthttp: peer 4c3613973f03af18 became inactive

2017-11-30 14:04:17.832067 W | rafthttp: health check for peer 4c3613973f03af18 could not connect: dial tcp 192.168.1.3:2380: getsockopt: connection refused
```

The log “auth: invalid auth token” repeats frequently, because the client keeps retrying the watch.

@mitake
Contributor

mitake commented Nov 30, 2017

Thanks for reporting the details. It seems the reauth mechanism for generating a new token isn't working well in the case of watch failure. I'll dig into this problem.

BTW, if you use CN-based auth or a JWT token, you can avoid this problem as an easy workaround.

@cnljf
Author

cnljf commented Nov 30, 2017

Thank you. I will try your advice.

@cnljf
Author

cnljf commented Nov 30, 2017

Is there a JWT token doc or demo?

@mitake
Contributor

mitake commented Nov 30, 2017

You can find it here: https://github.com/coreos/etcd/blob/master/Documentation/op-guide/configuration.md#auth-flags
If you have additional questions, please let me know.

@cnljf
Author

cnljf commented Dec 11, 2017

Hi, I tried to use a JWT token, but it doesn't work. Here is the command line:
/usr/local/etcd-3.2.9/etcd --config-file=/etc/etcd/2391.yml --auth-token=jwt,pub-key=/etc/etcd/app.rsa.pub,priv-key=/etc/etcd/app.rsa,sign-method=RS512
And it logs this:
2017-12-11 17:36:37.259649 W | auth: simple token is not cryptographically signed ... 2017-12-11 17:38:57.115057 D | auth: authorized root, token is geZYpRUIDAKiCiDH.62
Am I missing something?

@mitake
Contributor

mitake commented Dec 13, 2017

hmm, it's strange. Could you share your 2391.yml?

@cnljf
Author

cnljf commented Dec 13, 2017

I renamed 2391.yml to 2391.txt, because GitHub attachments don't support .yml files.
2391.txt

@mitake
Contributor

mitake commented Dec 14, 2017

hmm, does every member share the same configuration and command line options?

@cnljf
Author

cnljf commented Dec 14, 2017

Yes.
I will try again and contact you later.

@cnljf
Author

cnljf commented Dec 18, 2017

Hi, I have tried again, without success.
Here is the whole log in debug mode. Some sensitive information has been changed, but it is otherwise exact.
2389.log

@mitake
Contributor

mitake commented Dec 20, 2017

At least the node which produced the log is using a simple token, which is strange. Could you provide the log files of every node?

@cnljf
Author

cnljf commented Dec 21, 2017

Here are the 2391 and 2392 logs:
2391.log
2393.log

@mitake
Contributor

mitake commented Dec 21, 2017

@cnljf I tried to reproduce your problem but couldn't. Your cluster is using a simple token instead of JWT. Could you check the --auth-token option again? Also, could you share the commit id of your etcd repository?

@cnljf
Author

cnljf commented Dec 21, 2017

etcd Version: 3.2.9
Git SHA: f1d7dd8
Can you give me the config, etcd version, and start command line you used when trying to reproduce it?

@mitake
Contributor

mitake commented Dec 22, 2017

I tested both v3.2.9 and the master branch (c8dc19b). Here is my Procfile:

etcd1: bin/etcd --name infra1 --listen-client-urls http://127.0.0.1:2379 --advertise-client-urls http://127.0.0.1:2379 --listen-peer-urls http://127.0.0.1:12380 --initial-advertise-peer-urls http://127.0.0.1:12380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof --auth-token=jwt,pub-key=integration/fixtures/server.crt,priv-key=integration/fixtures/server.key.insecure,sign-method=RS256
etcd2: bin/etcd --name infra2 --listen-client-urls http://127.0.0.1:22379 --advertise-client-urls http://127.0.0.1:22379 --listen-peer-urls http://127.0.0.1:22380 --initial-advertise-peer-urls http://127.0.0.1:22380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof --auth-token=jwt,pub-key=integration/fixtures/server.crt,priv-key=integration/fixtures/server.key.insecure,sign-method=RS256
etcd3: bin/etcd --name infra3 --listen-client-urls http://127.0.0.1:32379 --advertise-client-urls http://127.0.0.1:32379 --listen-peer-urls http://127.0.0.1:32380 --initial-advertise-peer-urls http://127.0.0.1:32380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof --auth-token=jwt,pub-key=integration/fixtures/server.crt,priv-key=integration/fixtures/server.key.insecure,sign-method=RS256

@cnljf
Author

cnljf commented Dec 22, 2017

How do you generate the pub-key and priv-key? Using ssh-keygen?

@mitake
Contributor

mitake commented Dec 22, 2017

I used the openssl command. You can find already-generated files for testing purposes in the repository (the paths can be found in my Procfile).
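
For reference, a minimal sketch of generating an RSA key pair with openssl for the paths used in the earlier command line (the 2048-bit key size is an assumption):

```sh
# Generate an RSA private key and extract the matching public key.
# File names follow the earlier --auth-token example; adjust paths as needed.
openssl genrsa -out /etc/etcd/app.rsa 2048
openssl rsa -in /etc/etcd/app.rsa -pubout -out /etc/etcd/app.rsa.pub
```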

@cnljf
Author

cnljf commented Dec 25, 2017

Hi, I can now use JWT successfully. Here are the reasons it was failing:

  1. The private-key and public-key format was wrong; I used openssl to regenerate the keys.
  2. If --config-file is used, other command line arguments are ignored, so I moved the auth-token argument into the config file (see the sketch below).
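
A hedged sketch of how that config-file entry might look, assuming the YAML config file accepts the same auth-token key as the command-line flag (the key paths are taken from the earlier command line):

```yaml
# Sketch only: auth-token moved into the etcd config file, because command-line
# flags are ignored when --config-file is used.
auth-token: "jwt,pub-key=/etc/etcd/app.rsa.pub,priv-key=/etc/etcd/app.rsa,sign-method=RS512"
```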

@cnljf
Author

cnljf commented Dec 26, 2017

Watch retry works normally after switching to JWT.
Thanks!

@mitake
Contributor

mitake commented Dec 26, 2017

Thanks for trying, and good to know JWT is working well for your purpose. I'll work on the simple token issue.

@stale

stale bot commented Apr 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
