Replies: 12 comments
-
@whans Can you post the stacktrace of juicefs?
The port 6060 could be different; you can check which port the juicefs process is actually listening on.
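For reference, a minimal sketch of grabbing the goroutine dump, assuming juicefs exposes the standard Go net/http/pprof handlers on that port:
# dump all goroutine stacks of the juicefs process (adjust the port if needed)
curl -s 'http://localhost:6060/debug/pprof/goroutine?debug=2' > juicefs-goroutines.txt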
-
Is there a newly elected leader in the PD cluster? The goroutines are blocked in the TiKV client (waiting for retry).
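A quick way to check, assuming pd-ctl is available and <pd_addr> is one of the surviving PD nodes:
# show the PD member that is currently elected leader
pd-ctl -u http://<pd_addr>:2379 member leader show
# or ask the PD HTTP API directly; the members response includes the current leader
curl -s http://<pd_addr>:2379/pd/api/v1/members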
-
Hello @whans, could you please help us by collecting all the logs from when the system stops responding? Thanks.
-
Please also include the connection string used for TiKV. It's something like 'tikv://<pd_addr>[,<pd_addr>...]/'.
-
juicefs mount --max-uploads=48 --buffer-size=6000 -d tikv://10.188.19.30:2379,10.188.19.31:2379,10.188.19.32:2379,10.188.19.33:2379,10.188.19.34:2379,10.188.19.35:2379,10.188.19.36:2379/test /mnt/juicefstest
juicefs freezes when two pd-servers (10.188.19.35:2379, 10.188.19.36:2379) shut down abnormally. Maybe it's a tikv client issue.
-
@davies PD has a new leader, but the TiKV Go client doesn't switch to it.
-
Hello @whans, could you please send more logs, starting from the time the system got stuck until 5 minutes later? That will help us diagnose this issue, thanks.
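For example, roughly like this, assuming a background (-d) mount started as root writes to the default log path (adjust the path and the timestamp pattern to your incident window):
# cut the window from when it got stuck until a few minutes later out of the mount log
grep -E '2021/09/02 1(5:5[7-9]|6:0[0-3])' /var/log/juicefs.log > juicefs-stuck-window.log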
-
Thanks @whans. From the log, can I assume you are running TiKV and PD on the same servers, and that you somehow shut down the whole servers instead of killing only the pd-server processes? The log indicates that the client couldn't connect to the leader of some regions, which can happen when you have 3 replicas and shut down 2 TiKV instances consecutively. In that case, the regions that used to be hosted on those two servers don't have enough time to bring up new replicas to repair the loss. Regions with only 1 replica left can neither elect a new leader nor make any progress, since a majority of their members no longer exists. Could you please confirm my assumption, so we can give suggestions for your situation?
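One way to verify is to look for regions that have lost peers, assuming pd-ctl can still reach a surviving PD:
# regions missing replicas, and regions whose replicas sit on stores reported as down
pd-ctl -u http://<pd_addr>:2379 region check miss-peer
pd-ctl -u http://<pd_addr>:2379 region check down-peer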
-
@sunxiaoguang I have 10 TiKV and PD servers, and killed 3 of them.
-
@whans Then some of the regions are not available; we prefer to wait longer rather than fail fast.
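To see which stores PD currently considers unavailable, a sketch again assuming pd-ctl against a surviving PD:
# list all TiKV stores with their state (Up / Disconnected / Down / Offline / Tombstone)
pd-ctl -u http://<pd_addr>:2379 store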
-
7 pd-server nodes; after killing 2 pd-server nodes, juicefs is freezing.
[2021/09/02 15:57:58.152 +08:00] [WARN] [client_batch.go:497] ["init create streaming fail"] [target=10.188.19.35:20160] [forwardedHost=] [error="context deadline exceeded"]
[2021/09/02 15:57:59.021 +08:00] [ERROR] [client.go:599] ["[pd] getTS error"] [dc-location=global] [error="[PD:client:ErrClientGetTSO]rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"] [stack="github.com/tikv/pd/client.(*client).handleDispatcher\n\t/root/hanson/go/pkg/mod/github.com/tikv/[email protected]/client/client.go:599"]
[2021/09/02 15:57:59.022 +08:00] [ERROR] [pd.go:234] ["updateTS error"] [txnScope=global] [error="rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"] [errorVerbose="rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster\ngithub.com/tikv/pd/client.(*client).processTSORequests\n\t/root/hanson/go/pkg/mod/github.com/tikv/[email protected]/client/client.go:717\ngithub.com/tikv/pd/client.(*client).handleDispatcher\n\t/root/hanson/go/pkg/mod/github.com/tikv/[email protected]/client/client.go:587\nruntime.goexit\n\t/snap/go/7954/src/runtime/asm_amd64.s:1371\ngithub.com/tikv/pd/client.(*tsoRequest).Wait\n\t/root/hanson/go/pkg/mod/github.com/tikv/[email protected]/client/client.go:913\ngithub.com/tikv/pd/client.(*client).GetTS\n\t/root/hanson/go/pkg/mod/github.com/tikv/[email protected]/client/client.go:933\ngithub.com/tikv/client-go/v2/util.InterceptedPDClient.GetTS\n\t/root/hanson/go/pkg/mod/github.com/tikv/client-go/[email protected]/util/pd_interceptor.go:79\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).getTimestamp\n\t/root/hanson/go/pkg/mod/github.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:141\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS.func1\n\t/root/hanson/go/pkg/mod/github.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:232\nsync.(*Map).Range\n\t/snap/go/7954/src/sync/map.go:345\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS\n\t/root/hanson/go/pkg/mod/github.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:230\nruntime.goexit\n\t/snap/go/7954/src/runtime/asm_amd64.s:1371"] [stack="github.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS.func1\n\t/root/hanson/go/pkg/mod/github.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:234\nsync.(*Map).Range\n\t/snap/go/7954/src/sync/map.go:345\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS\n\t/root/hanson/go/pkg/mod/github.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:230"]
[2021/09/02 15:57:59.317 +08:00] [WARN] [client_batch.go:497] ["init create streaming fail"] [target=10.188.19.36:20160] [forwardedHost=] [error="context deadline exceeded"]
[2021/09/02 15:58:00.608 +08:00] [WARN] [prewrite.go:198] ["slow prewrite request"] [startTS=427443780657872897] [region="{ region id: 4669, ver: 35, confVer: 1007 }"] [attempts=280]
[2021/09/02 15:58:04.317 +08:00] [WARN] [client_batch.go:497] ["init create streaming fail"] [target=10.188.19.36:20160] [forwardedHost=] [error="context deadline exceeded"]
[2021/09/02 15:58:09.318 +08:00] [WARN] [client_batch.go:497] ["init create streaming fail"] [target=10.188.19.36:20160] [forwardedHost=] [error="context deadline exceeded"]
[2021/09/02 15:58:14.319 +08:00] [WARN] [client_batch.go:497] ["init create streaming fail"] [target=10.188.19.36:20160] [forwardedHost=] [error="context deadline exceeded"]
[2021/09/02 15:58:15.831 +08:00] [WARN] [client_batch.go:497] ["init create streaming fail"] [target=10.188.19.35:20160] [forwardedHost=] [error="context deadline exceeded"]
[2021/09/02 15:58:20.832 +08:00] [WARN] [client_batch.go:497] ["init create streaming fail"] [target=10.188.19.35:20160] [forwardedHost=] [error="context deadline exceeded"]
[2021/09/02 15:58:25.834 +08:00] [WARN] [client_batch.go:497] ["init create streaming fail"] [target=10.188.19.36:20160] [forwardedHost=] [error="context deadline exceeded"]
[2021/09/02 15:58:30.834 +08:00] [WARN] [client_batch.go:497] ["init create streaming fail"] [target=10.188.19.36:20160] [forwardedHost=] [error="context deadline exceeded"]