Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible deadlock with condition variables in Net::IMAP #14

Closed
stanhu opened this issue Mar 5, 2021 · 8 comments · Fixed by #15
Closed

Possible deadlock with condition variables in Net::IMAP #14

stanhu opened this issue Mar 5, 2021 · 8 comments · Fixed by #15

Comments

@stanhu
Copy link

stanhu commented Mar 5, 2021

We're using Net::IMAP via the MailRoom gem, and quite frequently we are seeing issues with the process not terminating even though we attempt to run Thread#join with a 60-second timeout.

A GDB backtrace shows that this is stuck waiting for a conditional variable:

Thread 2 (Thread 0x7f2bf81ff700 (LWP 11)):
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x7f2bfa3df810) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7f2bfa3df840, cond=0x7f2bfa3df7e8) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=cond@entry=0x7f2bfa3df7e8, mutex=mutex@entry=0x7f2bfa3df840) at pthread_cond_wait.c:655
#3  0x00007f2bffe343fd in rb_native_cond_wait (mutex=0x7f2bfa3df840, cond=0x7f2bfa3df7e8) at thread_pthread.c:1197
#4  native_cond_sleep (th=th@entry=0x7f2bfa3df780, rel=rel@entry=0x0) at thread_pthread.c:1197
#5  0x00007f2bffe346f4 in native_sleep (th=th@entry=0x7f2bfa3df780, rel=0x0) at thread_pthread.c:2113
#6  0x00007f2bffe376de in sleep_forever (fl=1, th=0x7f2bfa3df780) at thread.c:1252
#7  rb_thread_sleep_deadly_allow_spurious_wakeup () at thread.c:1343
#8  rb_mutex_sleep_forever (time=time@entry=8) at thread_sync.c:451
#9  0x00007f2bffcce134 in rb_ensure (b_proc=b_proc@entry=0x7f2bffe37660 <rb_mutex_sleep_forever>, data1=data1@entry=8, e_proc=e_proc@entry=0x7f2bffe36160 <mutex_lock_uninterruptible>,
    data2=data2@entry=139826840321280) at eval.c:1129
#10 0x00007f2bffe361bd in rb_mutex_sleep (self=139826840321280, timeout=<optimized out>) at thread_sync.c:477
#11 0x00007f2bffe796e1 in vm_call0_cfunc_with_frame (argv=0x7f2bf80fc858, cd=<optimized out>, calling=0x7f2bf80fc700, ec=0x7f2bfa209bd0) at vm_eval.c:91
#12 vm_call0_cfunc (argv=0x7f2bf80fc858, cd=<optimized out>, calling=0x7f2bf80fc700, ec=0x7f2bfa209bd0) at vm_eval.c:105
#13 vm_call0_body (ec=0x7f2bfa209bd0, calling=0x7f2bf80fc700, cd=<optimized out>, argv=0x7f2bf80fc858) at vm_eval.c:140
#14 0x00007f2bffe7bfd8 in rb_funcallv_with_cc (cd=0x7f2bfffa75c0 <rb_funcallv_data>, recv=139826840321280, mid=24193, argc=1, argv=<optimized out>) at vm_core.h:1805
#15 0x00007f2bffcce134 in rb_ensure (b_proc=b_proc@entry=0x7f2bffe2f360 <do_sleep>, data1=data1@entry=139826822105168, e_proc=e_proc@entry=0x7f2bffe2df60 <delete_from_waitq>, data2=data2@entry=139826822105184)
    at eval.c:1129
#16 0x00007f2bffe2f472 in rb_condvar_wait (argc=<optimized out>, argv=0x7f2bf80fcad0, self=139826858411400) at thread_sync.c:1407
#17 0x00007f2bffe796e1 in vm_call0_cfunc_with_frame (argv=0x7f2bf80fcad0, cd=<optimized out>, calling=0x7f2bf80fc980, ec=0x7f2bfa209bd0) at vm_eval.c:91
#18 vm_call0_cfunc (argv=0x7f2bf80fcad0, cd=<optimized out>, calling=0x7f2bf80fc980, ec=0x7f2bfa209bd0) at vm_eval.c:105
#19 vm_call0_body (ec=0x7f2bfa209bd0, calling=0x7f2bf80fc980, cd=<optimized out>, argv=0x7f2bf80fcad0) at vm_eval.c:140
#20 0x00007f2bffe79d5f in rb_vm_call0 (ec=ec@entry=0x7f2bfa209bd0, recv=recv@entry=139826858411400, id=id@entry=24353, argc=<optimized out>, argv=<optimized out>, me=me@entry=0x7f2bfef8b9d0, kw_splat=0)
    at vm_eval.c:52
#21 0x00007f2bffe79ff8 in rb_vm_call_kw (ec=0x7f2bfa209bd0, recv=139826858411400, id=24353, argc=<optimized out>, argc@entry=2, argv=<optimized out>, argv@entry=0x7f2bf80fcab0, me=0x7f2bfef8b9d0,
    kw_splat=<optimized out>) at vm_eval.c:268
#22 0x00007f2bffe7a453 in rb_call0 (ec=<optimized out>, recv=<optimized out>, mid=<optimized out>, argc=argc@entry=2, argv=argv@entry=0x7f2bf80fcab0, call_scope=call_scope@entry=CALL_FCALL,
    self=<optimized out>) at vm_eval.c:392
#23 0x00007f2bffe7ac7e in rb_call (scope=CALL_FCALL, argv=0x7f2bf80fcab0, argc=2, mid=<optimized out>, recv=<optimized out>) at vm_eval.c:718
#24 rb_funcallv (recv=<optimized out>, mid=<optimized out>, argc=argc@entry=2, argv=argv@entry=0x7f2bf80fcad0) at vm_eval.c:958
#25 0x00007f2bff2f62e1 in monitor_wait_for_cond_body (v=v@entry=139826822106112) at monitor.c:152
#26 0x00007f2bffcce134 in rb_ensure (b_proc=b_proc@entry=0x7f2bff2f62a0 <monitor_wait_for_cond_body>, data1=data1@entry=139826822106112, e_proc=e_proc@entry=0x7f2bff2f64a0 <monitor_enter_for_cond>,
    data2=data2@entry=139826822106112) at eval.c:1129
#27 0x00007f2bff2f641e in monitor_wait_for_cond (monitor=139826840321320, cond=<optimized out>, timeout=8) at monitor.c:180
#28 0x00007f2bffe66c59 in vm_call_cfunc_with_frame (empty_kw_splat=<optimized out>, cd=0x7f2bfe9c8b70, calling=<optimized out>, reg_cfp=0x7f2bf81fd268, ec=0x7f2bfa209bd0) at vm_insnhelper.c:2514
#29 vm_call_cfunc (ec=0x7f2bfa209bd0, reg_cfp=0x7f2bf81fd268, calling=<optimized out>, cd=0x7f2bfe9c8b70) at vm_insnhelper.c:2539
#30 0x00007f2bffe72092 in vm_sendish (block_handler=<optimized out>, method_explorer=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at vm_insnhelper.c:4023
#31 vm_exec_core (ec=0x7f2bfa3df810, initial=128) at insns.def:801
#32 0x00007f2bffe78365 in rb_vm_exec (ec=0x7f2bfa209bd0, mjit_enable_p=1) at vm.c:1929
#33 0x00007f2bffe82130 in invoke_block (captured=0x7f2bf81fd3d0, opt_pc=<optimized out>, type=<optimized out>, cref=<optimized out>, self=139826840321400, iseq=0x7f2bfef91600, ec=0x7f2bfa209bd0) at vm.c:1044
#34 invoke_iseq_block_from_c (me=0x0, is_lambda=0, cref=<optimized out>, passed_block_handler=<optimized out>, kw_splat=<optimized out>, argv=<optimized out>, argc=0, self=139826840321400,
    captured=0x7f2bf81fd3d0, ec=0x7f2bfa209bd0) at vm.c:1116
#35 invoke_block_from_c_bh (argc=<optimized out>, argv=<optimized out>, kw_splat=<optimized out>, passed_block_handler=<optimized out>, cref=<optimized out>, is_lambda=<optimized out>,
    force_blockarg=<optimized out>, block_handler=<optimized out>, ec=<optimized out>) at vm.c:1134
#36 vm_yield (argc=<optimized out>, argv=<optimized out>, kw_splat=<optimized out>, ec=<optimized out>) at vm.c:1179
#37 rb_yield_0 (argv=<optimized out>, argc=<optimized out>) at vm_eval.c:1227
#38 rb_yield_values (n=<optimized out>) at vm_eval.c:1252
#39 0x00007f2bffe82eef in vm_yield (kw_splat=0, argv=<optimized out>, argc=<optimized out>, ec=<optimized out>) at vm.c:1165
#40 rb_yield_0 (argv=<optimized out>, argc=<optimized out>) at vm_eval.c:1227
#41 rb_yield_values2 (argc=<optimized out>, argv=<optimized out>) at vm_eval.c:1273
#42 0x00007f2bffcce134 in rb_ensure (b_proc=0x7f2bff2f6380 <monitor_sync_body>, data1=139826840321320, e_proc=<optimized out>, data2=<optimized out>) at eval.c:1129
#43 0x00007f2bffe66c59 in vm_call_cfunc_with_frame (empty_kw_splat=<optimized out>, cd=0x7f2bfa4f19e0, calling=<optimized out>, reg_cfp=0x7f2bf81fd380, ec=0x7f2bfa209bd0) at vm_insnhelper.c:2514
#44 vm_call_cfunc (ec=0x7f2bfa209bd0, reg_cfp=0x7f2bf81fd380, calling=<optimized out>, cd=0x7f2bfa4f19e0) at vm_insnhelper.c:2539
#45 0x00007f2bffe72150 in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at vm_insnhelper.c:4023
#46 vm_exec_core (ec=0x7f2bfa3df810, initial=128) at insns.def:782
#47 0x00007f2bffe78365 in rb_vm_exec (ec=0x7f2bfa209bd0, mjit_enable_p=1) at vm.c:1929
#48 0x00007f2bffe792bc in invoke_iseq_block_from_c (me=0x0, is_lambda=<optimized out>, cref=0x0, passed_block_handler=0, kw_splat=<optimized out>, argv=<optimized out>, argc=0, self=139826857723800,
    captured=<optimized out>, ec=0x7f2bfa209bd0) at vm.c:1116
#49 invoke_block_from_c_proc (me=0x0, is_lambda=<optimized out>, passed_block_handler=0, kw_splat=<optimized out>, argv=<optimized out>, argc=0, self=139826857723800, proc=0x7f2bfa209bd0, ec=0x7f2bfa209bd0)
    at vm.c:1216
#50 vm_invoke_proc (passed_block_handler=0, kw_splat=<optimized out>, argv=<optimized out>, argc=0, self=139826857723800, proc=0x7f2bfa209bd0, ec=0x7f2bfa209bd0) at vm.c:1238
#51 rb_vm_invoke_proc (ec=0x7f2bfa209bd0, proc=proc@entry=0x7f2bf98a4510, argc=0, argv=<optimized out>, kw_splat=<optimized out>, passed_block_handler=passed_block_handler@entry=0) at vm.c:1259
#52 0x00007f2bffe3646d in thread_do_start (th=0x7f2bfa3df780) at thread.c:697
#53 0x00007f2bffe384ef in thread_start_func_2 (th=0x7f2bfa3df780, stack_start=<optimized out>) at thread.c:745
#54 0x00007f2bffe38a34 in thread_start_func_1 (th_ptr=<optimized out>) at thread_pthread.c:969
#55 0x00007f2bff9c5fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#56 0x00007f2bff4574cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f2bff33ff80 (LWP 7)):
#0  0x00007f2bff44c916 in __GI_ppoll (fds=fds@entry=0x7fff52ea2f78, nfds=nfds@entry=1, timeout=<optimized out>, timeout@entry=0x7fff52ea2f80, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:39
#1  0x00007f2bffe33880 in rb_sigwait_sleep (th=th@entry=0x7f2bfee2d000, sigwait_fd=sigwait_fd@entry=3, rel=rel@entry=0x7fff52ea3088) at hrtime.h:148
#2  0x00007f2bffe34589 in native_sleep (th=0x7f2bfee2d000, rel=0x7fff52ea3088) at thread_pthread.c:2099
#3  0x00007f2bffe37b64 in rb_thread_terminate_all () at thread.c:587
#4  0x00007f2bffccfbaa in rb_ec_cleanup (ec=ec@entry=0x7f2bfee09050, ex=<optimized out>) at eval.c:218
#5  0x00007f2bffccfed3 in ruby_run_node (n=0x7f2bfa503a98) at eval.c:336
#6  0x000055dabf0fe10b in main (argc=<optimized out>, argv=<optimized out>) at ./main.c:50

Net::IMAP uses several conditional variables:

  1. idle : https://github.com/ruby/ruby/blob/48f324e92f9b36edc267f9871e35039cbd1c2eb9/lib/net/imap.rb#L965
  2. get_tagged_response: https://github.com/ruby/ruby/blob/48f324e92f9b36edc267f9871e35039cbd1c2eb9/lib/net/imap.rb#L1215
  3. send_literal: https://github.com/ruby/ruby/blob/48f324e92f9b36edc267f9871e35039cbd1c2eb9/lib/net/imap.rb#L1368

We're using a 60-second idle timeout for idle. However, the last two do NOT have a timeout, so it's possible we're getting stuck in one of those cases.

I noticed that send_literal only checks the state of @exception after the wait returns. Do we need to do this?

diff --git a/lib/net/imap.rb b/lib/net/imap.rb
index 720acbc86d..2e1fd6e7b9 100644
--- a/lib/net/imap.rb
+++ b/lib/net/imap.rb
@@ -1365,6 +1365,7 @@ def send_literal(str, tag = nil)
         @continued_command_tag = tag
         @continuation_request_exception = nil
         begin
+          raise @exception if @exception
           @continuation_request_arrival.wait
           e = @continuation_request_exception || @exception
           raise e if e

I also wonder if we need a timeout for these other condition variables.

@engwan
Copy link

engwan commented Mar 5, 2021

This is a sample exception from our logs that led to the process not terminating:

/usr/lib/ruby/2.6.0/openssl/buffering.rb:57:in `sysread': Connection reset by peer (Errno::ECONNRESET)
	from /usr/lib/ruby/2.6.0/openssl/buffering.rb:57:in `fill_rbuff'
	from /usr/lib/ruby/2.6.0/openssl/buffering.rb:207:in `gets'
	from /usr/lib/ruby/2.6.0/net/imap.rb:1230:in `get_response'
	from /usr/lib/ruby/2.6.0/net/imap.rb:1147:in `receive_responses'
	from /usr/lib/ruby/2.6.0/net/imap.rb:1120:in `block in initialize'

Looking at the code, this would have been rescued in https://github.com/ruby/ruby/blob/c1af7b1e1d408f9796a5f46c9ed36bc5adea4aa2/lib/net/imap.rb#L1148

And both conditional variables would have been woken by https://github.com/ruby/ruby/blob/c1af7b1e1d408f9796a5f46c9ed36bc5adea4aa2/lib/net/imap.rb#L1203-L1204

@engwan
Copy link

engwan commented Mar 5, 2021

I noticed that send_literal only checks the state of @exception after the wait returns. Do we need to do this?

I see what you mean here. I guess it's possible where the receiver thread rescues the Exception, sets @exception, calls #broadcast, and then #wait gets called after.

@stanhu
Copy link
Author

stanhu commented Mar 5, 2021

As far as I can tell, send_literal doesn't seem to be called in our case, so I wonder if it's one the two other condition variables that's getting stuck.

I can't seem to reproduce this locally. The only commonality I've found is that this has been reported when running MailRoom in Google Kubernetes Engine (GKE).

stanhu added a commit to stanhu/mail_room that referenced this issue Mar 8, 2021
When MailRoom is run in Kubernetes, we have found occasions where
MailRoom appears to have attempted to stop running, but `Net::IMAP` is
stuck waiting for threads (ruby/net-imap#14).

This commit adds an HTTP liveness checker to enable detection of a
terminated MailRoom pod.
@shugo
Copy link
Member

shugo commented Mar 11, 2021

@stanhu Could you show Ruby-level backtrace using rb_ps defined in .gdbinit bundled with Ruby?

gdbdump-ruby may be useful.

@stanhu
Copy link
Author

stanhu commented Mar 15, 2021

@shugo A customer sent us this:

root@gitlab-mailroom-86f4d6b5c7-srcpw:/# curl -L https://raw.githubusercontent.com/ruby/ruby/ruby_2_7/.gdbinit -o .gdbinit
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 36060  100 36060    0     0   111k      0 --:--:-- --:--:-- --:--:--  111k
root@gitlab-mailroom-86f4d6b5c7-srcpw:/# gdbdump -x .gdbinit 9
$1 = (rb_vm_t *) 0x7f5919227000
* #<Thread:0x7f5919392e10 rb_thread_t:0x7f591922d000 native_thread:0x7f5919782fc0>
* #<Thread:0x7f59147918e0 rb_thread_t:0x7f59147e4780 native_thread:0x7f59125ff700>
13:0x7f591a2792a0 <mutex_sleep at ./include/ruby/intern.h:343>:in `sleep'
12:0x7f591a272380 <rb_condvar_wait at thread_sync.c:1398>:in `wait'
11:0x7f5919637390 <monitor_wait_for_cond at monitor.c:172>:in `wait_for_cond'
10:/usr/lib/ruby/2.7.0/monitor.rb:108:in `wait'
9:/usr/lib/ruby/2.7.0/net/imap.rb:1215:in `get_tagged_response'
8:/usr/lib/ruby/2.7.0/net/imap.rb:974:in `ensure in block in idle'
7:/usr/lib/ruby/2.7.0/net/imap.rb:974:in `block in idle'
6:0x7f5919637790 <monitor_synchronize at monitor.c:199>:in `synchronize'
5:/usr/lib/ruby/2.7.0/monitor.rb:202:in `mon_synchronize'
4:/usr/lib/ruby/2.7.0/net/imap.rb:958:in `idle'
3:/usr/lib/ruby/gems/2.7.0/gems/gitlab-mail_room-0.0.8/lib/mail_room/connection.rb:118:in `idle'
2:/usr/lib/ruby/gems/2.7.0/gems/gitlab-mail_room-0.0.8/lib/mail_room/connection.rb:49:in `wait'
1:/usr/lib/ruby/gems/2.7.0/gems/gitlab-mail_room-0.0.8/lib/mail_room/mailbox_watcher.rb:35:in `block in run'


Or with debug of the debug 😊
root@gitlab-mailroom-86f4d6b5c7-srcpw:/# gdbdump -d -x .gdbinit 9
C: sudo gdb -silent -nw -x .gdbinit /usr/bin/ruby 9
O: Reading symbols from /usr/bin/ruby...done.
O: Attaching to program: /usr/bin/ruby, process 9
O: [New LWP 24]
O: [Thread debugging using libthread_db enabled]
O: Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
O: 0x00007f591988f916 in __GI_ppoll (fds=fds@entry=0x7ffcb7789f08,
O:     nfds=nfds@entry=1, timeout=<optimized out>, timeout@entry=0x7ffcb7789f10,
O:     sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:39
O: To enable execution of this file add
O:      add-auto-load-safe-path /.gdbinit
O: line to your configuration file "/root/.gdbinit".
O: To completely disable this security protection add
O:      set auto-load safe-path /
O: line to your configuration file "/root/.gdbinit".
O: For more information about this security protection see the
O: "Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
O:      info "(gdb)Auto-loading safe path"
E: 39   ../sysdeps/unix/sysv/linux/ppoll.c: No such file or directory.
E: warning: File "/.gdbinit" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
C: set pagination off
C: rb_ps
O: $1 = (rb_vm_t *) 0x7f5919227000
O: * #<Thread:0x7f5919392e10 rb_thread_t:0x7f591922d000 native_thread:0x7f5919782fc0>
O: * #<Thread:0x7f59147918e0 rb_thread_t:0x7f59147e4780 native_thread:0x7f59125ff700>
O: 13:0x7f591a2792a0 <mutex_sleep at ./include/ruby/intern.h:343>:in `sleep'
O: 12:0x7f591a272380 <rb_condvar_wait at thread_sync.c:1398>:in `wait'
O: 11:0x7f5919637390 <monitor_wait_for_cond at monitor.c:172>:in `wait_for_cond'
O: 10:/usr/lib/ruby/2.7.0/monitor.rb:108:in `wait'
O: 9:/usr/lib/ruby/2.7.0/net/imap.rb:1215:in `get_tagged_response'
O: 8:/usr/lib/ruby/2.7.0/net/imap.rb:974:in `ensure in block in idle'
O: 7:/usr/lib/ruby/2.7.0/net/imap.rb:974:in `block in idle'
O: 6:0x7f5919637790 <monitor_synchronize at monitor.c:199>:in `synchronize'
O: 5:/usr/lib/ruby/2.7.0/monitor.rb:202:in `mon_synchronize'
O: 4:/usr/lib/ruby/2.7.0/net/imap.rb:958:in `idle'
O: 3:/usr/lib/ruby/gems/2.7.0/gems/gitlab-mail_room-0.0.8/lib/mail_room/connection.rb:118:in `idle'
O: 2:/usr/lib/ruby/gems/2.7.0/gems/gitlab-mail_room-0.0.8/lib/mail_room/connection.rb:49:in `wait'
O: 1:/usr/lib/ruby/gems/2.7.0/gems/gitlab-mail_room-0.0.8/lib/mail_room/mailbox_watcher.rb:35:in `block in run'
$1 = (rb_vm_t *) 0x7f5919227000
* #<Thread:0x7f5919392e10 rb_thread_t:0x7f591922d000 native_thread:0x7f5919782fc0>
* #<Thread:0x7f59147918e0 rb_thread_t:0x7f59147e4780 native_thread:0x7f59125ff700>
13:0x7f591a2792a0 <mutex_sleep at ./include/ruby/intern.h:343>:in `sleep'
12:0x7f591a272380 <rb_condvar_wait at thread_sync.c:1398>:in `wait'
11:0x7f5919637390 <monitor_wait_for_cond at monitor.c:172>:in `wait_for_cond'
10:/usr/lib/ruby/2.7.0/monitor.rb:108:in `wait'
9:/usr/lib/ruby/2.7.0/net/imap.rb:1215:in `get_tagged_response'
8:/usr/lib/ruby/2.7.0/net/imap.rb:974:in `ensure in block in idle'
7:/usr/lib/ruby/2.7.0/net/imap.rb:974:in `block in idle'
6:0x7f5919637790 <monitor_synchronize at monitor.c:199>:in `synchronize'
5:/usr/lib/ruby/2.7.0/monitor.rb:202:in `mon_synchronize'
4:/usr/lib/ruby/2.7.0/net/imap.rb:958:in `idle'
3:/usr/lib/ruby/gems/2.7.0/gems/gitlab-mail_room-0.0.8/lib/mail_room/connection.rb:118:in `idle'
2:/usr/lib/ruby/gems/2.7.0/gems/gitlab-mail_room-0.0.8/lib/mail_room/connection.rb:49:in `wait'
1:/usr/lib/ruby/gems/2.7.0/gems/gitlab-mail_room-0.0.8/lib/mail_room/mailbox_watcher.rb:35:in `block in run'
C: detach
O: Detaching from program: /usr/bin/ruby, process 9
C: quit

This does point to the wait in get_tagged_response:

net-imap/lib/net/imap.rb

Lines 1297 to 1301 in c04bf8f

def get_tagged_response(tag, cmd)
until @tagged_responses.key?(tag)
raise @exception if @exception
@tagged_response_arrival.wait
end

Although is this output only showing one thread?

@stanhu
Copy link
Author

stanhu commented Mar 15, 2021

It appears the deadlock is in the ensure block after the idle:

net-imap/lib/net/imap.rb

Lines 1055 to 1060 in c04bf8f

ensure
unless @receiver_thread_terminating
remove_response_handler(response_handler)
put_string("DONE#{CRLF}")
response = get_tagged_response(tag, "IDLE")
end

@stanhu
Copy link
Author

stanhu commented Mar 15, 2021

This does suggest that this synchronize block never ran:

net-imap/lib/net/imap.rb

Lines 1287 to 1294 in c04bf8f

synchronize do
@receiver_thread_terminating = true
@tagged_response_arrival.broadcast
@continuation_request_arrival.broadcast
if @idle_done_cond
@idle_done_cond.signal
end
end

If it did, @receiver_thread_terminating would have been set to true, thus preventing that ensure block from running.

It's possible the synchronize block was trying to wait for get_tagged_response to finish, though. Maybe the get_tagged_response in idle needs to have a timeout as well?

stanhu added a commit to stanhu/mail_room that referenced this issue Mar 15, 2021
When MailRoom is run in Kubernetes, we have found occasions where
MailRoom appears to have attempted to stop running, but `Net::IMAP` is
stuck waiting for threads (ruby/net-imap#14).

This commit adds an HTTP liveness checker to enable detection of a
terminated MailRoom pod.
shugo added a commit that referenced this issue Mar 16, 2021
@shugo
Copy link
Member

shugo commented Mar 16, 2021

@stanhu Thanks for your information.

Could you try #15?
It adds a new option idle_response_timeout, whose default value is 5 seconds.

@shugo shugo closed this as completed in #15 Mar 16, 2021
stanhu added a commit to stanhu/mail_room that referenced this issue Dec 15, 2021
When MailRoom is run in Kubernetes, we have found occasions where
MailRoom appears to have attempted to stop running, but `Net::IMAP` is
stuck waiting for threads (ruby/net-imap#14).

This commit adds an HTTP liveness checker to enable detection of a
terminated MailRoom pod.
stanhu added a commit to stanhu/mail_room that referenced this issue May 12, 2023
When MailRoom is run in Kubernetes, we have found occasions where
MailRoom appears to have attempted to stop running, but `Net::IMAP` is
stuck waiting for threads (ruby/net-imap#14).

This commit adds an HTTP liveness checker to enable detection of a
terminated MailRoom pod.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.

3 participants