crash in Backoffer::backoffWithMaxSleep #8685

Closed
JaySon-Huang opened this issue Jan 14, 2024 · 3 comments · Fixed by #8693

Comments

@JaySon-Huang

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

2. What did you expect to see? (Required)

3. What did you see instead (Required)

[2024/01/14 15:20:46.038 +00:00] [ERROR] [BaseDaemon.cpp:377] [########################################] [source=BaseDaemon] [thread_id=1805827]
[2024/01/14 15:20:46.038 +00:00] [ERROR] [BaseDaemon.cpp:378] ["(from thread 1805825) Received signal Segmentation fault(11)."] [source=BaseDaemon] [thread_id=1805827]
[2024/01/14 15:20:46.038 +00:00] [ERROR] [BaseDaemon.cpp:408] ["Address: 0x8"] [source=BaseDaemon] [thread_id=1805827]
[2024/01/14 15:20:46.038 +00:00] [ERROR] [BaseDaemon.cpp:423] ["Address not mapped to object."] [source=BaseDaemon] [thread_id=1805827]
[2024/01/14 15:20:46.039 +00:00] [WARN] [<unknown>] ["region {52753429,10769,750} find error: EpochNotMatch current epoch of region 52753429 is conf_ver: 10769 version: 753, but you sent conf_ver: 10769 version: 750"] [source=pingcap.tikv] [thread_id=1805828]
[2024/01/14 15:20:46.039 +00:00] [WARN] [<unknown>] ["region {22790595,9749,603} find error: EpochNotMatch current epoch of region 22790595 is conf_ver: 9749 version: 605, but you sent conf_ver: 9749 version: 603"] [source=pingcap.tikv] [thread_id=1805829]
[2024/01/14 15:20:46.039 +00:00] [WARN] [<unknown>] ["region {22583491,12923,588} find error: EpochNotMatch current epoch of region 22583491 is conf_ver: 12923 version: 590, but you sent conf_ver: 12923 version: 588"] [source=pingcap.tikv] [thread_id=1805824]
[2024/01/14 15:20:46.039 +00:00] [WARN] [<unknown>] ["region {63134322,10811,1093} find error: EpochNotMatch current epoch of region 63134322 is conf_ver: 10811 version: 1095, but you sent conf_ver: 10811 version: 1093"] [source=pingcap.tikv] [thread_id=1805830]
[2024/01/14 15:20:46.039 +00:00] [WARN] [<unknown>] ["region {61920978,10811,921} find error: EpochNotMatch current epoch of region 61920978 is conf_ver: 10811 version: 923, but you sent conf_ver: 10811 version: 921"] [source=pingcap.tikv] [thread_id=1805831]
[2024/01/14 15:20:46.040 +00:00] [WARN] [<unknown>] ["region {65153272,10847,1443} find error: peer is not leader for region 65153272, leader may Some(id: 65153273 store_id: 11)"] [source=pingcap.tikv] [thread_id=1805832]
[2024/01/14 15:20:46.040 +00:00] [WARN] [<unknown>] ["region {65153272,10847,1443} find error: EpochNotMatch current epoch of region 65153272 is conf_ver: 10847 version: 1445, but you sent conf_ver: 10847 version: 1443"] [source=pingcap.tikv] [thread_id=1805832]
[2024/01/14 15:20:46.045 +00:00] [ERROR] [BaseDaemon.cpp:570] ["
       0x56a5f40    faultSignalHandler(int, siginfo_t*, void*) [tiflash+90857280]
                    libs/libdaemon/src/BaseDaemon.cpp:221
  0xffff9d1ed83c    <unknown symbol> [+2108]
       0x6505150    pingcap::kv::Backoffer::backoffWithMaxSleep(pingcap::kv::BackoffType, int, pingcap::Exception const&) [tiflash+105926992]
                    contrib/client-c/src/kv/Backoff.cc:61
       0x6519e00    pingcap::kv::LockResolver::checkSecondaries(pingcap::kv::Backoffer&, unsigned long, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >&, pingcap::kv::RegionVerID, std::__1::shared_ptr<pingcap::kv::AsyncResolveData>) [tiflash+106012160]
                    contrib/client-c/src/kv/LockResolver.cc:467
       0x651cb0c    void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, pingcap::kv::LockResolver::checkAllSecondaries(pingcap::kv::Backoffer&, std::__1::shared_ptr<pingcap::kv::Lock>, pingcap::kv::TxnStatus&)::$_1> >(void*) [tiflash+106023692]
                    /usr/local/bin/../include/c++/v1/thread:291
  0xffff99af0d38    start_thread [libpthread.so.0+32056]"] [source=BaseDaemon] [thread_id=1805827]

4. What is your TiFlash version? (Required)

v6.5.4

@JaySon-Huang JaySon-Huang added type/bug The issue is confirmed as a bug. component/compute labels Jan 14, 2024
@windtalker

It looks like the root cause is in https://github.com/tikv/client-c/blob/master/src/coprocessor/Client.cc#L641-L654. The expected behavior is to use one backoff per region. However, in https://github.com/tikv/client-c/blob/master/src/coprocessor/Client.cc#L539, it uses the backoff of the current region to resolve the lock. Inside resolve lock, the locks are grouped by region id (https://github.com/tikv/client-c/blob/master/src/kv/LockResolver.cc#L320), which can split the locks across multiple regions (due to region split?). resolveLockAsync then uses multiple threads to resolve those locks in parallel, all sharing the same backoff. The backoff itself is not thread safe, so there is a potential data race.
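One way to avoid sharing a backoff across threads is to give each worker its own copy. The following is only a minimal sketch under stated assumptions, not the actual client-c fix: `SketchBackoffer` and `resolveGroupsInParallel` are hypothetical names, and the use of an unsynchronized `std::map` for the per-type state is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <thread>
#include <vector>

// Hypothetical stand-in for a backoff state object. This is NOT the
// client-c Backoffer; it only mimics the idea of unsynchronized state.
struct SketchBackoffer
{
    std::map<int, int> attempts; // backoff kind -> attempt count, no locking
    int total_sleep_ms = 0;
    int max_sleep_ms;

    explicit SketchBackoffer(int max_sleep_ms_)
        : max_sleep_ms(max_sleep_ms_)
    {}

    // Records one retry. Returns false once the sleep budget is exceeded.
    // Calling this on one object from several threads would be a data race.
    bool backoff(int kind, int sleep_ms)
    {
        assert(sleep_ms >= 0);
        attempts[kind] += 1;
        total_sleep_ms += sleep_ms;
        return total_sleep_ms <= max_sleep_ms;
    }
};

// Simulates resolving several region groups in parallel. Each thread copies
// the parent backoffer, so no mutable backoff state is shared between threads.
// retries_per_group[i] is the number of retries simulated for group i.
std::vector<int> resolveGroupsInParallel(
    const SketchBackoffer & parent,
    const std::vector<int> & retries_per_group)
{
    std::vector<int> attempts_seen(retries_per_group.size(), 0);
    std::vector<std::thread> workers;
    workers.reserve(retries_per_group.size());

    for (std::size_t i = 0; i < retries_per_group.size(); ++i)
    {
        workers.emplace_back([&parent, &retries_per_group, &attempts_seen, i] {
            SketchBackoffer local = parent; // private copy for this thread
            for (int r = 0; r < retries_per_group[i]; ++r)
                local.backoff(/*kind=*/0, /*sleep_ms=*/10);
            // Each thread writes only to its own slot.
            attempts_seen[i] = local.attempts[0];
        });
    }
    for (auto & w : workers)
        w.join();
    return attempts_seen;
}
```

In this sketch the parent is only read while the workers run, so it stays untouched. The trade-off is that the sleep budget is tracked per thread rather than globally. Guarding a single shared object with a mutex would be the alternative if one budget must cover all groups.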

@JaySon-Huang

Reopening because the client-c fix has not been picked back into the tiflash repo yet.
