Potential deadlock when switching masters frequently #2512

Closed
1 of 2 tasks
git-hulk opened this issue Aug 30, 2024 · 1 comment · Fixed by #2516
Labels
bug

Comments

@git-hulk
Member

Search before asking

  • I had searched in the issues and found no similar issues.

Version

unstable

Minimal reproduce step

None

What did you expect to see?

It should not cause a deadlock in any situation.

What did you see instead?

The worker threads get stuck after switching masters frequently:

(gdb) thread 8

[Switching to thread 8 (Thread 0x7fbeb4bf8640 (LWP 3166409))]

#0  __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x7ffc2c7fe668) at ./nptl/futex-internal.c:103

103     in ./nptl/futex-internal.c

(gdb) bt

#0  __futex_abstimed_wait_common (cancel=false, private=<optimized out>, abstime=0x0, clockid=0, expected=3, futex_word=0x7ffc2c7fe668) at ./nptl/futex-internal.c:103

#1  __GI___futex_abstimed_wait64 (futex_word=futex_word@entry=0x7ffc2c7fe668, expected=expected@entry=3, clockid=clockid@entry=0, abstime=abstime@entry=0x0,

    private=<optimized out>) at ./nptl/futex-internal.c:128

#2  0x00007fbebeb227fb in __pthread_rwlock_rdlock_full64 (abstime=0x0, clockid=0, rwlock=0x7ffc2c7fe660) at ./nptl/pthread_rwlock_common.c:460

#3  ___pthread_rwlock_rdlock (rwlock=0x7ffc2c7fe660) at ./nptl/pthread_rwlock_rdlock.c:26

#4  0x0000561e231e2028 in std::__glibcxx_rwlock_rdlock (__rwlock=<optimized out>) at /usr/include/c++/11/shared_mutex:78

#5  std::__shared_mutex_pthread::lock_shared (this=<optimized out>) at /usr/include/c++/11/shared_mutex:229

#6  std::shared_mutex::lock_shared (this=<optimized out>) at /usr/include/c++/11/shared_mutex:426

#7  std::shared_lock<std::shared_mutex>::shared_lock (__m=..., this=<optimized out>) at /usr/include/c++/11/shared_mutex:727

#8  Server::WorkConcurrencyGuard (this=<optimized out>) at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/src/server/server.cc:637

#9  redis::Connection::ExecuteCommands (this=<optimized out>, to_process_cmds=<optimized out>)

    at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/src/server/redis_connection.cc:355

#10 0x0000561e230dc2ac in redis::Connection::OnRead (bev=<optimized out>, this=0x7fbeb23cb580)

    at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/src/server/redis_connection.cc:92

#11 EvbufCallbackBase<redis::Connection, true, true, true>::readCB (bev=<optimized out>, ctx=0x7fbeb23cb580)

    at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/src/common/event_util.h:70

#12 0x0000561e2383be12 in bufferevent_run_deferred_callbacks_unlocked (cb=<optimized out>, arg=0x7fbeb23cb300)

    at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/build/_deps/libevent-src/bufferevent.c:208

#13 0x0000561e2384271d in event_process_active_single_queue (base=base@entry=0x7fbebe64c900, activeq=0x7fbebe651060, max_to_process=max_to_process@entry=2147483647,

    endtime=endtime@entry=0x0) at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/build/_deps/libevent-src/event.c:1720

#14 0x0000561e23843117 in event_process_active (base=0x7fbebe64c900) at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/build/_deps/libevent-src/event.c:1783

#15 event_base_loop (base=0x7fbebe64c900, flags=0) at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/build/_deps/libevent-src/event.c:2006

#16 0x0000561e231f5f3e in Worker::Run (tid=..., this=0x7fbebe6ab900) at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/src/server/worker.cc:293

#17 operator() (__closure=<optimized out>, __closure=<optimized out>) at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/src/server/worker.cc:546

#18 operator() (__closure=<optimized out>) at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/src/common/thread_util.h:38

#19 std::__invoke_impl<void, util::CreateThread<WorkerThread::Start()::<lambda()> >(char const*, WorkerThread::Start()::<lambda()>)::<lambda()> > (__f=...)

    at /usr/include/c++/11/bits/invoke.h:61

#20 std::__invoke<util::CreateThread<WorkerThread::Start()::<lambda()> >(char const*, WorkerThread::Start()::<lambda()>)::<lambda()> > (__fn=...)

    at /usr/include/c++/11/bits/invoke.h:96

#21 std::thread::_Invoker<std::tuple<util::CreateThread<WorkerThread::Start()::<lambda()> >(char const*, WorkerThread::Start()::<lambda()>)::<lambda()> > >::_M_invoke<0>

    (this=<optimized out>) at /usr/include/c++/11/bits/std_thread.h:259

#22 std::thread::_Invoker<std::tuple<util::CreateThread<WorkerThread::Start()::<lambda()> >(char const*, WorkerThread::Start()::<lambda()>)::<lambda()> > >::operator() (

    this=<optimized out>) at /usr/include/c++/11/bits/std_thread.h:266

#23 std::thread::_State_impl<std::thread::_Invoker<std::tuple<util::CreateThread<WorkerThread::Start()::<lambda()> >(char const*, WorkerThread::Start()::<lambda()>)::<lambda()> > > >::_M_run(void) (this=<optimized out>) at /usr/include/c++/11/bits/std_thread.h:211

#24 0x0000561e23a3ec34 in execute_native_thread_routine ()

#25 0x00007fbebeb1cac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442

#26 0x00007fbebebae850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

(gdb) thread 10

[Switching to thread 10 (Thread 0x7fbeb6ffe640 (LWP 3166411))]

#0  __futex_abstimed_wait_common64 (private=128, cancel=true, abstime=0x0, op=265, expected=3180509, futex_word=0x7fbe574ff910) at ./nptl/futex-internal.c:57

57      in ./nptl/futex-internal.c

(gdb) bt

#0  __futex_abstimed_wait_common64 (private=128, cancel=true, abstime=0x0, op=265, expected=3180509, futex_word=0x7fbe574ff910) at ./nptl/futex-internal.c:57

#1  __futex_abstimed_wait_common (cancel=true, private=128, abstime=0x0, clockid=0, expected=3180509, futex_word=0x7fbe574ff910) at ./nptl/futex-internal.c:87

#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fbe574ff910, expected=3180509, clockid=clockid@entry=0, abstime=abstime@entry=0x0,

    private=private@entry=128) at ./nptl/futex-internal.c:139

#3  0x00007fbebeb1e624 in __pthread_clockjoin_ex (threadid=140455485371968, thread_return=0x0, clockid=0, abstime=0x0, block=<optimized out>)

    at ./nptl/pthread_join_common.c:105

#4  0x0000561e23a3eca7 in std::thread::join() ()

#5  0x0000561e2328c52e in util::ThreadOperationImpl<&std::thread::join>(std::thread&, char const*) [clone .constprop.0] (t=..., op=0x561e23a94184 "join")

    at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/src/common/thread_util.cc:39

#6  0x0000561e230c6bd4 in util::ThreadJoin (t=...) at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/src/common/thread_util.cc:47

#7  ReplicationThread::Stop (this=<optimized out>) at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/src/cluster/replication.cc:367

#8  0x0000561e231e62bb in Server::AddMaster (this=0x7ffc2c7fe150, host=..., port=6379, force_reconnect=<optimized out>)

    at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/src/server/server.cc:261

#9  0x0000561e230bfa42 in Cluster::SetMasterSlaveRepl (this=0x7fbeba875600) at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/src/cluster/cluster.cc:245

#10 0x0000561e230c0697 in Cluster::SetClusterNodes (this=0x7fbeba875600, nodes_str=..., version=<optimized out>, force=<optimized out>)

    at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/src/cluster/cluster.cc:205

#11 0x0000561e230ee13d in redis::CommandClusterX::Execute (this=0x7fbe3001e000, srv=0x7ffc2c7fe150, conn=<optimized out>, output=0x7fbeb6ff5450)

    at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/src/commands/cmd_cluster.cc:224

#12 0x0000561e231e236f in redis::Connection::ExecuteCommands (this=<optimized out>, to_process_cmds=<optimized out>)

    at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/src/server/redis_connection.cc:426

#13 0x0000561e230dc2ac in redis::Connection::OnRead (bev=<optimized out>, this=0x7fbe16db5680)

   at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/src/server/redis_connection.cc:92

#14 EvbufCallbackBase<redis::Connection, true, true, true>::readCB (bev=<optimized out>, ctx=0x7fbe16db5680)

    at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/src/common/event_util.h:70

#15 0x0000561e2383be12 in bufferevent_run_deferred_callbacks_unlocked (cb=<optimized out>, arg=0x7fbe16db5400)

    at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/build/_deps/libevent-src/bufferevent.c:208

#16 0x0000561e2384271d in event_process_active_single_queue (base=base@entry=0x7fbebe64cf00, activeq=0x7fbebe6512c0, max_to_process=max_to_process@entry=2147483647,

    endtime=endtime@entry=0x0) at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/build/_deps/libevent-src/event.c:1720

#17 0x0000561e23843117 in event_process_active (base=0x7fbebe64cf00) at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/build/_deps/libevent-src/event.c:1783

#18 event_base_loop (base=0x7fbebe64cf00, flags=0) at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/build/_deps/libevent-src/event.c:2006

#19 0x0000561e231f5f3e in Worker::Run (tid=..., this=0x7fbebe6abb00) at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/src/server/worker.cc:293

#20 operator() (__closure=<optimized out>, __closure=<optimized out>) at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/src/server/worker.cc:546

#21 operator() (__closure=<optimized out>) at /root/kvrocks-zip-build-cd/kvrocks-zip-build/kvrocks/src/common/thread_util.h:38

#22 std::__invoke_impl<void, util::CreateThread<WorkerThread::Start()::<lambda()> >(char const*, WorkerThread::Start()::<lambda()>)::<lambda()> > (__f=...)

    at /usr/include/c++/11/bits/invoke.h:61

#23 std::__invoke<util::CreateThread<WorkerThread::Start()::<lambda()> >(char const*, WorkerThread::Start()::<lambda()>)::<lambda()> > (__fn=...)

    at /usr/include/c++/11/bits/invoke.h:96

#24 std::thread::_Invoker<std::tuple<util::CreateThread<WorkerThread::Start()::<lambda()> >(char const*, WorkerThread::Start()::<lambda()>)::<lambda()> > >::_M_invoke<0>

    (this=<optimized out>) at /usr/include/c++/11/bits/std_thread.h:259

#25 std::thread::_Invoker<std::tuple<util::CreateThread<WorkerThread::Start()::<lambda()> >(char const*, WorkerThread::Start()::<lambda()>)::<lambda()> > >::operator() (

    this=<optimized out>) at /usr/include/c++/11/bits/std_thread.h:266

#26 std::thread::_State_impl<std::thread::_Invoker<std::tuple<util::CreateThread<WorkerThread::Start()::<lambda()> >(char const*, WorkerThread::Start()::<lambda()>)::<lambda()> > > >::_M_run(void) (this=<optimized out>) at /usr/include/c++/11/bits/std_thread.h:211

#27 0x0000561e23a3ec34 in execute_native_thread_routine ()

#28 0x00007fbebeb1cac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442

#29 0x00007fbebebae850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

(gdb)

Anything Else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
git-hulk added the bug label on Aug 30, 2024

git-hulk commented Aug 30, 2024

After analyzing the gdb stacks, we found that one of the worker threads is stuck waiting for ReplicationThread::Stop, and ReplicationThread::Stop is blocked because of the WorkConcurrencyGuard held by that same worker.

In other words, the ReplicationThread is waiting for the WorkExclusivityGuard, but the lock is held by the worker thread that is itself waiting for the replication thread to stop, so the two end up waiting for each other.
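
To make the cycle concrete, here is a minimal standalone C++ sketch of the pattern described above. It is not kvrocks code: `works` stands in for the lock behind `WorkConcurrencyGuard`/`WorkExclusivityGuard`, and the worker and replication roles are collapsed into two threads. Built with `-std=c++17 -pthread`, the program is expected to hang, which is the point.

```cpp
#include <chrono>
#include <mutex>
#include <shared_mutex>
#include <thread>

// Stand-in for the server-wide lock behind WorkConcurrencyGuard/WorkExclusivityGuard.
std::shared_mutex works;

int main() {
  // Worker: command execution holds the concurrency (shared) guard.
  std::shared_lock<std::shared_mutex> concurrency(works);

  // Previously started replication thread: asks for the exclusivity (unique)
  // guard and blocks, because the worker still holds the shared lock.
  std::thread repl([] { std::unique_lock<std::shared_mutex> exclusivity(works); });
  std::this_thread::sleep_for(std::chrono::milliseconds(100));

  // AddMaster() -> ReplicationThread::Stop(): the worker joins the replication
  // thread while still holding the shared lock, so neither side can proceed.
  repl.join();  // hangs forever: this mirrors the reported deadlock
  return 0;
}
```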

AntiTopQuark pushed a commit to AntiTopQuark/kvrocks that referenced this issue Sep 2, 2024
…apache#2516)

This closes apache#2512.

Currently, the replication thread waits for the workers' exclusive guard before closing the DB, but it only stops the workers from running new commands after acquiring that exclusive guard. This might cause a deadlock if the master is switched again at the same time.

The following steps will show how it may happen:

- T0: client A sent `slaveof MASTER_IP0 MASTER_PORT0`; the replication thread was started and began waiting for the exclusive guard.

- T1: client B sent `slaveof MASTER_IP1 MASTER_PORT1`, and `AddMaster` tries to stop the previous replication thread, which is still waiting for the exclusive guard. But the exclusive guard is blocked by the current thread itself, so the two end up waiting for each other.

The workaround is also straightforward: stop workers from running new commands by setting `is_loading_` to true before acquiring the lock in the replication thread.
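
For illustration only, here is a rough sketch of the ordering that workaround describes; it is not the actual patch in #2516. Apart from the `is_loading_` flag and the shared/exclusive guard idea, the names below (the class, the method names, the `works_` member) are placeholders, and it assumes workers consult `is_loading_` before taking the concurrency guard.

```cpp
#include <atomic>
#include <mutex>
#include <shared_mutex>

// Hypothetical, simplified server showing only the lock/flag ordering of interest.
class MiniServer {
 public:
  // Worker side (assumed behavior): refuse new commands while loading instead
  // of queueing up on the shared lock.
  bool TryExecuteCommand() {
    if (is_loading_.load()) return false;  // e.g. reply with a LOADING error
    std::shared_lock<std::shared_mutex> concurrency(works_);
    // ... run the command under the concurrency guard ...
    return true;
  }

  // Replication side: flip is_loading_ *before* waiting for the exclusive
  // guard, so workers stop taking the shared lock the replication thread is
  // about to wait on.
  void PrepareFullSync() {
    is_loading_.store(true);
    std::unique_lock<std::shared_mutex> exclusivity(works_);
    // ... close and reopen the DB ...
    is_loading_.store(false);
  }

 private:
  std::atomic<bool> is_loading_{false};
  std::shared_mutex works_;
};
```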