tikv panic:peer id increased after region is merged, message peer id 2228, local peer id 1140, region id: 649 #12232
/type bug
The panic is likely caused by tikv/components/raftstore/src/store/peer.rs, lines 989 to 994 in ac32d5c,
which writes the pending_merge_state into the region state, but the merge could be rolled back later.
From the logs:
other peers
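To make the failure mode concrete, here is a simplified, hypothetical model of the bug in Rust. The types and function names (`MergeState`, `RegionLocalState`, `persist_pending_merge`, `check_msg`) are stand-ins, not TiKV's real API; the sketch only captures the idea that a persisted but stale merge_state makes a larger incoming peer id look impossible.

```rust
// Simplified, hypothetical model of the bug; these are stand-in types,
// not TiKV's real raftstore structures.
#[derive(Clone, Debug, PartialEq)]
struct MergeState {
    target_peer_id: u64,
}

#[derive(Default)]
struct RegionLocalState {
    // Persisted to disk together with the region's epoch and peers.
    merge_state: Option<MergeState>,
}

// The problematic pattern: the pending merge target is written into the
// persisted state before the merge outcome is known.
fn persist_pending_merge(state: &mut RegionLocalState, target_peer_id: u64) {
    state.merge_state = Some(MergeState { target_peer_id });
}

// If the merge is rolled back, the persisted merge_state must be cleared;
// otherwise a later message from a newer peer trips the assertion below.
fn rollback_merge(state: &mut RegionLocalState) {
    state.merge_state = None;
}

// Mirrors the check in store.rs that produced the panic message: with a
// stale merge_state, a larger incoming peer id looks impossible.
fn check_msg(
    state: &RegionLocalState,
    msg_peer_id: u64,
    local_peer_id: u64,
) -> Result<(), String> {
    if state.merge_state.is_some() && msg_peer_id > local_peer_id {
        return Err(format!(
            "peer id increased after region is merged, \
             message peer id {}, local peer id {}",
            msg_peer_id, local_peer_id
        ));
    }
    Ok(())
}
```

With the peer ids from the panic (message peer id 2228, local peer id 1140), the check fails while the stale merge_state is present and passes once the rollback clears it.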
/cc @BusyJay
Maybe we should just log a warning and ignore the assert.
Setting the [...]
How? Shouldn't the epoch and ID be checked before the other peer destroys itself?
I think this panic is caused by: [...] So maybe we should add a check here to see whether the target peer id in the message is the same as the one in the local state. What do you think, @NingLin-P @BusyJay?
Other peers may have the same epoch (or a staler one) but still be on the region's current peer list.
I don't see how this check can fix the problem.
The peer will destroy itself by:
For 1, 2 and 5, it's clear that the merge must be aborted. For 3 and 6, if merge_state is trustworthy, then the merge must have succeeded. For 4, the peer is destroyed only when it has collected enough information from other peers and it's merged by the snapshot peer. So the peer can tell, when being destroyed, whether it will be merged in the end, and hence can decide whether to keep merge_state.
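The case analysis above can be sketched as a small decision function. The numbered destroy paths (1 through 6) are collapsed in this thread, so the sketch models only the stated conclusion per case; the function name and shape are hypothetical, not TiKV code.

```rust
// Hypothetical sketch of the case analysis: should merge_state be kept
// when the peer is destroyed via destroy path `destroy_case` (1..=6)?
fn keep_merge_state(destroy_case: u8, known_merged_by_target: bool) -> bool {
    match destroy_case {
        // Cases 1, 2 and 5: the merge must have been aborted.
        1 | 2 | 5 => false,
        // Cases 3 and 6: if merge_state is trustworthy, the merge succeeded.
        3 | 6 => true,
        // Case 4: decided from information collected from other peers
        // (whether the peer is merged by the snapshot peer in the end).
        4 => known_merged_by_target,
        _ => panic!("only cases 1..=6 are discussed in the thread"),
    }
}
```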
So even if the merge is aborted, some peers of the source region can still be destroyed accidentally by a gc message? I don't think that is expected. I have also constructed a test case showing it is possible in the current code base: https://gist.github.com/NingLin-P/966e92dae16afa5bd61ec4109c7ea3c5
Yes.
Indeed. So merge_state needs to be cleared if the merge is known to be aborted.
I think [...] Because the [...] And even if the merge indeed succeeded, the [...]
Yes. What I described is that it always knows whether it's merged or not when being destroyed.
Not true. If a target peer is destroyed before commit merging, keep_data can be false while the source region is merged.
I'm OK with it, as in most cases the leftover peer should be able to contact one of the quorum.
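The fix direction the linked PRs describe ("only persist merge target if the merge is known to be succeeded", #12251) can be sketched as follows. The types and names here are hypothetical simplifications, not TiKV's actual raftstore API: the point is simply that nothing is persisted until the outcome is known, so an aborted merge leaves no stale state on disk.

```rust
// Hypothetical, minimal sketch of the fix direction: persist the merge
// target only once the merge is known to have succeeded.
#[derive(Default)]
struct PersistedState {
    merge_target: Option<u64>, // peer id in the target region, if merged
}

enum MergeOutcome {
    Succeeded { target_peer_id: u64 },
    Aborted,
}

fn on_merge_finalized(state: &mut PersistedState, outcome: MergeOutcome) {
    match outcome {
        // Only a known-successful merge writes the target to disk.
        MergeOutcome::Succeeded { target_peer_id } => {
            state.merge_target = Some(target_peer_id);
        }
        // An aborted merge clears any transient state, so later peer-id
        // checks cannot be confused by a stale merge target.
        MergeOutcome::Aborted => {
            state.merge_target = None;
        }
    }
}
```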
…eeded (#12251) close #12232 raftstore: only persist merge target if the merge is known to be succeeded Signed-off-by: linning <[email protected]> Co-authored-by: Ti Chi Robot <[email protected]>
…eeded (#12251) (#12271) close #12232, ref #12251 raftstore: only persist merge target if the merge is known to be succeeded Signed-off-by: ti-srebot <[email protected]> Co-authored-by: NingLin-P <[email protected]> Co-authored-by: you06 <[email protected]>
…eeded (#12251) (#12297) close #12232, ref #12251 raftstore: only persist merge target if the merge is known to be succeeded Signed-off-by: ti-srebot <[email protected]> Signed-off-by: linning <[email protected]> Co-authored-by: NingLin-P <[email protected]> Co-authored-by: linning <[email protected]> Co-authored-by: Ti Chi Robot <[email protected]>
…eeded (#12251) (#12295) close #12232, ref #12251 raftstore: only persist merge target if the merge is known to be succeeded Signed-off-by: ti-srebot <[email protected]> Signed-off-by: linning <[email protected]> Co-authored-by: NingLin-P <[email protected]> Co-authored-by: linning <[email protected]> Co-authored-by: Ti Chi Robot <[email protected]>
…eeded (#12251) (#12296) close #12232, ref #12251 raftstore: only persist merge target if the merge is known to be succeeded Signed-off-by: ti-srebot <[email protected]> Signed-off-by: linning <[email protected]> Co-authored-by: NingLin-P <[email protected]> Co-authored-by: linning <[email protected]> Co-authored-by: glorv <[email protected]> Co-authored-by: Ti Chi Robot <[email protected]>
Bug Report
What version of TiKV are you using?
[2022/03/18 20:12:33.638 +00:00] [INFO] [lib.rs:80] ["Welcome to TiKV"]
[2022/03/18 20:12:33.638 +00:00] [INFO] [lib.rs:85] ["Release Version: 6.0.0-alpha"]
[2022/03/18 20:12:33.638 +00:00] [INFO] [lib.rs:85] ["Edition: Community"]
[2022/03/18 20:12:33.638 +00:00] [INFO] [lib.rs:85] ["Git Commit Hash: d9cc441"]
[2022/03/18 20:12:33.638 +00:00] [INFO] [lib.rs:85] ["Git Commit Branch: heads/refs/tags/v6.0.0-nightly"]
[2022/03/18 20:12:33.638 +00:00] [INFO] [lib.rs:85] ["UTC Build Time: Unknown (env var does not exist when building)"]
[2022/03/18 20:12:33.639 +00:00] [INFO] [lib.rs:85] ["Rust Version: rustc 1.60.0-nightly (1e12aef3f 2022-02-13)"]
What operating system and CPU are you using?
Any; not relevant to this bug.
Steps to reproduce
Run the Jepsen case: jepsen-misc
http://172.16.4.180:31714/artifacts/testground/plan-exec-721189/plan-exec-721189-2743424832/main-logs
What did you expect?
The case runs successfully.
What happened?
TiKV panicked.
[2022/03/18 20:15:01.240 +00:00] [INFO] [apply.rs:1882] ["remove peer successfully"] [region="id: 2001 start_key: 6D44423A33000000FF00FB000000000000FF00685449443A3137FF0000FD0000000000FA end_key: 6D44423A33000000FF00FB000000000000FF00685449443A3231FF0000FD0000000000FA region_epoch { conf_ver: 287 version: 156 } peers { id: 2002 store_id: 5 } peers { id: 2004 store_id: 7 } peers { id: 2231 store_id: 1 role: Learner } peers { id: 2233 store_id: 6 }"] [peer="id: 2231 store_id: 1 role: Learner"] [peer_id=2233] [region_id=2001]
[2022/03/18 20:15:01.240 +00:00] [INFO] [raft.rs:2646] ["switched to configuration"] [config="Configuration { voters: Configuration { incoming: Configuration { voters: {2002, 2004, 2233} }, outgoing: Configuration { voters: {} } }, learners: {}, learners_next: {}, auto_leave: false }"] [raft_id=2233] [region_id=2001]
[2022/03/18 20:15:01.244 +00:00] [FATAL] [lib.rs:468] ["peer id increased after region is merged, message peer id 2228, local peer id 1140, region id: 649 start_key: 6D44423A33000000FF00FB000000000000FF00685449443A3137FF0000FD0000000000FA end_key: 6D44423A33000000FF00FB000000000000FF00685461626C653AFF3135FF0000000000FF000000F700000000FB region_epoch { conf_ver: 276 version: 153 } peers { id: 1140 store_id: 5 } peers { id: 1145 store_id: 6 } peers { id: 1155 store_id: 7 }"] [backtrace=" 0: tikv_util::set_panic_hook::{{closure}}\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/tikv_util/src/lib.rs:467:18\n 1: std::panicking::rust_panic_with_hook\n at /rustc/1e12aef3fab243407f9d71ba9956cb2a1bf105d5/library/std/src/panicking.rs:702:17\n 2: std::panicking::begin_panic_handler::{{closure}}\n at /rustc/1e12aef3fab243407f9d71ba9956cb2a1bf105d5/library/std/src/panicking.rs:588:13\n 3: std::sys_common::backtrace::__rust_end_short_backtrace\n at /rustc/1e12aef3fab243407f9d71ba9956cb2a1bf105d5/library/std/src/sys_common/backtrace.rs:138:18\n 4: rust_begin_unwind\n at /rustc/1e12aef3fab243407f9d71ba9956cb2a1bf105d5/library/std/src/panicking.rs:584:5\n 5: core::panicking::panic_fmt\n at /rustc/1e12aef3fab243407f9d71ba9956cb2a1bf105d5/library/core/src/panicking.rs:143:14\n 6: raftstore::store::fsm::store::StoreFsmDelegate<EK,ER,T>::check_msg\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/raftstore/src/store/fsm/store.rs:1671:21\n raftstore::store::fsm::store::StoreFsmDelegate<EK,ER,T>::on_raft_message\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/raftstore/src/store/fsm/store.rs:1804:32\n raftstore::store::fsm::store::StoreFsmDelegate<EK,ER,T>::handle_msgs\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/raftstore/src/store/fsm/store.rs:589:37\n 7: <raftstore::store::fsm::store::RaftPoller<EK,ER,T> as 
batch_system::batch::PollHandler<raftstore::store::fsm::peer::PeerFsm<EK,ER>,raftstore::store::fsm::store::StoreFsm>>::handle_control\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/raftstore/src/store/fsm/store.rs:743:9\n batch_system::batch::Poller<N,C,Handler>::poll\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/batch-system/src/batch.rs:407:27\n 8: raftstore::store::worker::refresh_config::PoolController<N,C,H>::increase_by::{{closure}}\n at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/raftstore/src/store/worker/refresh_config.rs:71:21\n std::sys_common::backtrace::rust_begin_short_backtrace\n at /rustc/1e12aef3fab243407f9d71ba9956cb2a1bf105d5/library/std/src/sys_common/backtrace.rs:122:18\n 9: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}\n at /rustc/1e12aef3fab243407f9d71ba9956cb2a1bf105d5/library/std/src/thread/mod.rs:498:17\n <core::panic::unwind_safe::AssertUnwindSafe as core::ops::function::FnOnce<()>>::call_once\n at /rustc/1e12aef3fab243407f9d71ba9956cb2a1bf105d5/library/core/src/panic/unwind_safe.rs:271:9\n std::panicking::try::do_call\n at /rustc/1e12aef3fab243407f9d71ba9956cb2a1bf105d5/library/std/src/panicking.rs:492:40\n std::panicking::try\n at /rustc/1e12aef3fab243407f9d71ba9956cb2a1bf105d5/library/std/src/panicking.rs:456:19\n std::panic::catch_unwind\n at /rustc/1e12aef3fab243407f9d71ba9956cb2a1bf105d5/library/std/src/panic.rs:137:14\n std::thread::Builder::spawn_unchecked::{{closure}}\n at /rustc/1e12aef3fab243407f9d71ba9956cb2a1bf105d5/library/std/src/thread/mod.rs:497:30\n core::ops::function::FnOnce::call_once{{vtable.shim}}\n at /rustc/1e12aef3fab243407f9d71ba9956cb2a1bf105d5/library/core/src/ops/function.rs:227:5\n 10: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once\n at /rustc/1e12aef3fab243407f9d71ba9956cb2a1bf105d5/library/alloc/src/boxed.rs:1854:9\n <alloc::boxed::Box<F,A> as 
core::ops::function::FnOnce>::call_once\n at /rustc/1e12aef3fab243407f9d71ba9956cb2a1bf105d5/library/alloc/src/boxed.rs:1854:9\n std::sys::unix::thread::Thread::new::thread_start\n at /rustc/1e12aef3fab243407f9d71ba9956cb2a1bf105d5/library/std/src/sys/unix/thread.rs:108:17\n 11: start_thread\n 12: clone\n"] [location=/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tikv/components/raftstore/src/store/fsm/store.rs:1671] [thread_name=raftstore-6-1]