Replies: 12 comments
-
@whans Can you post the stacktrace of juicefs?
The port 6060 could be different; you can check which port the juicefs process is actually listening on.
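For reference, a minimal sketch of grabbing the goroutine dump, assuming juicefs exposes the standard Go net/http/pprof handlers on that port:
# dump all goroutine stacks of the juicefs process (adjust the port if needed)
curl -s 'http://localhost:6060/debug/pprof/goroutine?debug=2' > juicefs-goroutines.txt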
-
Is there a newly elected leader in the PD cluster? The goroutines are blocked in the TiKV client (waiting for retry).
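A quick way to check, assuming pd-ctl is available and <pd_addr> is one of the surviving PD nodes:
# show the PD member that is currently elected leader
pd-ctl -u http://<pd_addr>:2379 member leader show
# or ask the PD HTTP API directly; the members response includes the current leader
curl -s http://<pd_addr>:2379/pd/api/v1/members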
-
Hello @whans, could you please help us by collecting all the logs from when the system stops responding? Thanks.
-
Please also include the connection string used for TiKV. It's something like 'tikv://<pd_addr>[,<pd_addr>...]/'.
-
juicefs mount --max-uploads=48 --buffer-size=6000 -d tikv://10.188.19.30:2379,10.188.19.31:2379,10.188.19.32:2379,10.188.19.33:2379,10.188.19.34:2379,10.188.19.35:2379,10.188.19.36:2379/test /mnt/juicefstest
juicefs freezes when two pd-servers (10.188.19.35:2379, 10.188.19.36:2379) shut down abnormally. Maybe it's a tikv client issue.
-
@davies PD has a new leader, but the TiKV Go client doesn't switch to it.
-
Hello @whans, could you please send more logs, starting from the time the system got stuck until 5 minutes later? That will help us diagnose this issue, thanks.
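For example, roughly like this, assuming a background (-d) mount started as root writes to the default log path (adjust the path and the timestamp pattern to your incident window):
# cut the window from when it got stuck until a few minutes later out of the mount log
grep -E '2021/09/02 1(5:5[7-9]|6:0[0-3])' /var/log/juicefs.log > juicefs-stuck-window.log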
-
Thanks @whans. From the log, can I assume you are running TiKV and PD on the same servers, and that you somehow shut down the whole servers instead of killing only the pd-server processes? The log indicates that the client couldn't connect to the leader of some regions, which can happen when you have 3 replicas and shut down 2 TiKV instances consecutively. In that case, the regions that used to be hosted on those two servers don't have enough time to bring up new replicas to repair the loss. Regions with only 1 replica left can neither elect a new leader nor make any progress, since a majority of their members no longer exists. Could you please confirm my assumption, so we can give suggestions for your situation?
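One way to verify is to look for regions that have lost peers, assuming pd-ctl can still reach a surviving PD:
# regions missing replicas, and regions whose replicas sit on stores reported as down
pd-ctl -u http://<pd_addr>:2379 region check miss-peer
pd-ctl -u http://<pd_addr>:2379 region check down-peer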
-
@sunxiaoguang I have 10 TiKV and PD servers, and killed 3 of them.
-
@whans Then some of the regions are not available; we prefer to wait longer rather than fail fast.
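To see which stores PD currently considers unavailable, a sketch again assuming pd-ctl against a surviving PD:
# list all TiKV stores with their state (Up / Disconnected / Down / Offline / Tombstone)
pd-ctl -u http://<pd_addr>:2379 store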
-
7 pd-server nodes; after killing 2 pd-server nodes, juicefs is freezing.
[2021/09/02 15:57:58.152 +08:00] [WARN] [client_batch.go:497] ["init create streaming fail"] [target=10.188.19.35:20160] [forwardedHost=] [error="context deadline exceeded"]
[2021/09/02 15:57:59.021 +08:00] [ERROR] [client.go:599] ["[pd] getTS error"] [dc-location=global] [error="[PD:client:ErrClientGetTSO]rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"] [stack="github.com/tikv/pd/client.(*client).handleDispatcher\n\t/root/hanson/go/pkg/mod/github.com/tikv/[email protected]/client/client.go:599"]
[2021/09/02 15:57:59.022 +08:00] [ERROR] [pd.go:234] ["updateTS error"] [txnScope=global] [error="rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"] [errorVerbose="rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster\ngithub.com/tikv/pd/client.(*client).processTSORequests\n\t/root/hanson/go/pkg/mod/github.com/tikv/[email protected]/client/client.go:717\ngithub.com/tikv/pd/client.(*client).handleDispatcher\n\t/root/hanson/go/pkg/mod/github.com/tikv/[email protected]/client/client.go:587\nruntime.goexit\n\t/snap/go/7954/src/runtime/asm_amd64.s:1371\ngithub.com/tikv/pd/client.(*tsoRequest).Wait\n\t/root/hanson/go/pkg/mod/github.com/tikv/[email protected]/client/client.go:913\ngithub.com/tikv/pd/client.(*client).GetTS\n\t/root/hanson/go/pkg/mod/github.com/tikv/[email protected]/client/client.go:933\ngithub.com/tikv/client-go/v2/util.InterceptedPDClient.GetTS\n\t/root/hanson/go/pkg/mod/github.com/tikv/client-go/[email protected]/util/pd_interceptor.go:79\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).getTimestamp\n\t/root/hanson/go/pkg/mod/github.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:141\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS.func1\n\t/root/hanson/go/pkg/mod/github.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:232\nsync.(*Map).Range\n\t/snap/go/7954/src/sync/map.go:345\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS\n\t/root/hanson/go/pkg/mod/github.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:230\nruntime.goexit\n\t/snap/go/7954/src/runtime/asm_amd64.s:1371"] [stack="github.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS.func1\n\t/root/hanson/go/pkg/mod/github.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:234\nsync.(*Map).Range\n\t/snap/go/7954/src/sync/map.go:345\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS\n\t/root/hanson/go/pkg/mod/github.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:230"]
[2021/09/02 15:57:59.317 +08:00] [WARN] [client_batch.go:497] ["init create streaming fail"] [target=10.188.19.36:20160] [forwardedHost=] [error="context deadline exceeded"]
[2021/09/02 15:58:00.608 +08:00] [WARN] [prewrite.go:198] ["slow prewrite request"] [startTS=427443780657872897] [region="{ region id: 4669, ver: 35, confVer: 1007 }"] [attempts=280]
[2021/09/02 15:58:04.317 +08:00] [WARN] [client_batch.go:497] ["init create streaming fail"] [target=10.188.19.36:20160] [forwardedHost=] [error="context deadline exceeded"]
[2021/09/02 15:58:09.318 +08:00] [WARN] [client_batch.go:497] ["init create streaming fail"] [target=10.188.19.36:20160] [forwardedHost=] [error="context deadline exceeded"]
[2021/09/02 15:58:14.319 +08:00] [WARN] [client_batch.go:497] ["init create streaming fail"] [target=10.188.19.36:20160] [forwardedHost=] [error="context deadline exceeded"]
[2021/09/02 15:58:15.831 +08:00] [WARN] [client_batch.go:497] ["init create streaming fail"] [target=10.188.19.35:20160] [forwardedHost=] [error="context deadline exceeded"]
[2021/09/02 15:58:20.832 +08:00] [WARN] [client_batch.go:497] ["init create streaming fail"] [target=10.188.19.35:20160] [forwardedHost=] [error="context deadline exceeded"]
[2021/09/02 15:58:25.834 +08:00] [WARN] [client_batch.go:497] ["init create streaming fail"] [target=10.188.19.36:20160] [forwardedHost=] [error="context deadline exceeded"]
[2021/09/02 15:58:30.834 +08:00] [WARN] [client_batch.go:497] ["init create streaming fail"] [target=10.188.19.36:20160] [forwardedHost=] [error="context deadline exceeded"]