Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

epoll_wait() causes aborts when socket is closed concurrently #424

Closed
tgrabiec opened this issue Jul 29, 2014 · 3 comments
Closed

epoll_wait() causes aborts when socket is closed concurrently #424

tgrabiec opened this issue Jul 29, 2014 · 3 comments
Labels

Comments

@tgrabiec
Copy link
Member

Looks like socket_file::poll() can be called after socket_file::close()

0xffff800003abd050 >>init           1 1406635248.320246220 socket_file_close    0xffffa00003738480
0xffff8000035d7050 /libhttpserver.  0 1406635248.320254564 socket_file_poll     0xffffa00003738480

Causing an abort like this:

gdb$ bt
#0  0x00000000003d9232 in cli_hlt () at /home/tgrabiec/src/osv/arch/x64/processor.hh:242
#1  halt_no_interrupts () at /home/tgrabiec/src/osv/arch/x64/arch.hh:48
#2  osv::halt () at /home/tgrabiec/src/osv/core/power.cc:34
#3  0x0000000000223fe0 in abort (fmt=fmt@entry=0x5d5880 "Assertion failed: %s (%s: %s: %d)\n") at /home/tgrabiec/src/osv/runtime.cc:139
#4  0x0000000000224089 in __assert_fail (expr=<optimized out>, file=<optimized out>, line=<optimized out>, func=<optimized out>) at /home/tgrabiec/src/osv/runtime.cc:145
#5  0x00000000003a2807 in lockfree::mutex::unlock (this=0xffff80000373ac00) at /home/tgrabiec/src/osv/core/lfmutex.cc:220
#6  0x000000000023d228 in socket_file::poll (this=0xffffa00003738480, events=0x8000001b) at /home/tgrabiec/src/osv/bsd/sys/kern/sys_socket.cc:245
#7  0x00000000003d5405 in poll_scan (_pfd=...) at /home/tgrabiec/src/osv/core/poll.cc:125
#8  0x00000000003d59b4 in do_poll (pfd=..., _timeout=...) at /home/tgrabiec/src/osv/core/poll.cc:282
#9  0x00000000003d734d in wait (timeout_ms=<optimized out>, maxevents=0x80, events=0x2000002ff270, this=0xffffa00002f50800) at /home/tgrabiec/src/osv/core/epoll.cc:190
#10 epoll_wait (epfd=<optimized out>, events=0x2000002ff270, maxevents=0x80, timeout_ms=<optimized out>) at /home/tgrabiec/src/osv/core/epoll.cc:321

Manual for epoll suggests that it should handle concurrent close:

     Q6  Will  closing  a  file  descriptor  cause it to be removed from all
           epoll sets automatically?

       A6  Yes, but be aware of the following point.  A file descriptor  is  a
           reference  to  an  open file description (see open(2)).  Whenever a
           descriptor is duplicated via dup(2), dup2(2), fcntl(2) F_DUPFD,  or
           fork(2),  a  new  file  descriptor  referring to the same open file
           description is created.  An  open  file  description  continues  to
           exist  until all file descriptors referring to it have been closed.
@raphaelsc
Copy link
Member

@avikivity, @tgrabiec

Possibly a related bug (while running /tests/tst-bsd-tcp1.so at 'make check'):

page fault outside application, addr: 0x0000000000000000
[registers]
RIP: 0x00000000003a7337 lockfree::mutex::lock()+39
RFL: 0x0000000000010206 CS: 0x0000000000000008 SS: 0x0000000000000010
RAX: 0x0000000000000001 RBX: 0xfffffffffffff8d0 RCX: 0x0000200000000005 RDX: 0x000000000023c4e0
RSI: 0x0000000000000000 RDI: 0x0000000000000000 RBP: 0x00002000002fed40 R8: 0xffff8000035b95c0
R9: 0x0000000000df7ef0 R10: 0xffffa000035b95c0 R11: 0x00002000002fedb0 R12: 0x0000000000000000
R13: 0x0000000000df7ef0 R14: 0xffffa000035b95c0 R15: 0x00002000002fedb0 RSP: 0x00002000002fecf0
Aborted

[backtrace]
0x0000000000327d4f <???+3308879>
0x0000000000328ec3 <mmu::vm_fault(unsigned long, exception_frame*)+147>
0x000000000037a689 <page_fault+105>
0x0000000000379576 <???+3642742>
0x000000000023c4fd <socket_file::epoll_del(epoll_ptr)+29>
0x00000000003de90f <epoll_file_closed(epoll_ptr)+207>
0x00000000003feabd file::stop_polls()+93
0x00000000003feb0c <fdrop+60>
0x00000000003fec72 <fdclose(int)+66>
0x00000000003f9657 <close+23>
0x0000100000c10258 <boost::asio::detail::socket_ops::close(int, unsigned char&, bool, boost::system::error_code&)+56>
0x0000100000c10e5f <???+12652127>
0x0000100000c0dd35 <???+12639541>
0x0000100000c0e110 <???+12640528>
0x0000100000c0e616 <boost::unit_test::ut_detail::callback0_impl_t<boost::unit_test::ut_detail::unused, void (*)()>::invoke()+6>
0x0000100001069940 <???+17209664>
0xffffa000033f68df <???+54487263>
0xffffa0000311d7d7 <???+51501015>
0x000010000106994f <???+17209679>
0x90c307894810c082 <???+1209057410>

(gdb) bt
#0  0x00000000003e0792 in cli_hlt () at /data/osv/arch/x64/processor.hh:242
#1  halt_no_interrupts () at /data/osv/arch/x64/arch.hh:48
#2  osv::halt () at /data/osv/core/power.cc:34
#3  0x0000000000223240 in abort (fmt=fmt@entry=0x5e6fe5 "Aborted\n") at /data/osv/runtime.cc:139
#4  0x00000000002232d0 in abort () at /data/osv/runtime.cc:109
#5  0x0000000000327d50 in mmu::vm_sigsegv (addr=<optimized out>, ef=0xffff80000328f078) at /data/osv/core/mmu.cc:1191
#6  0x0000000000328ec4 in mmu::vm_fault (addr=<optimized out>, addr@entry=0, ef=ef@entry=0xffff80000328f078) at /data/osv/core/mmu.cc:1213
#7  0x000000000037a68a in page_fault (ef=0xffff80000328f078) at /data/osv/arch/x64/mmu.cc:38
#8  <signal handler called>
#9  fetch_add (__m=std::memory_order_acquire, __i=1, this=0x0) at /data/osv/external/x64/gcc.bin/usr/include/c++/4.8.2/bits/atomic_base.h:614
#10 lockfree::mutex::lock (this=0x0) at /data/osv/core/lfmutex.cc:29
#11 0x000000000023c4fe in socket_file::epoll_del (this=0xffffa000033c5f80, ep=...) at /data/osv/bsd/sys/kern/sys_socket.cc:294
#12 0x00000000003de910 in del (key=..., this=0xffff8000033e7040) at /data/osv/core/epoll.cc:135
#13 epoll_file_closed (ptr=...) at /data/osv/core/epoll.cc:353
#14 0x00000000003feabe in file::stop_polls (this=this@entry=0xffffa000033c5f80) at /data/osv/fs/vfs/kern_descrip.cc:216
#15 0x00000000003feb0d in fdrop (fp=0xffffa000033c5f80) at /data/osv/fs/vfs/kern_descrip.cc:200
#16 0x00000000003fec73 in fdclose (fd=<optimized out>) at /data/osv/fs/vfs/kern_descrip.cc:94
#17 0x00000000003f9658 in close (fd=<optimized out>) at /data/osv/fs/vfs/main.cc:211
#18 0x0000100000c10259 in boost::asio::detail::socket_ops::close (s=5, state=@0x2000002ff034: 16 '\020', destruction=<optimized out>, ec=...)
    at /data/osv/external/x64/misc.bin/usr/include/boost/asio/detail/impl/socket_ops.ipp:299
#19 0x0000100000c10e60 in boost::asio::detail::reactive_socket_service_base::destroy (this=<optimized out>, impl=...)
    at /data/osv/external/x64/misc.bin/usr/include/boost/asio/detail/impl/reactive_socket_service_base.ipp:90
#20 0x0000100000c0dd36 in destroy (impl=..., this=<optimized out>)
    at /data/osv/external/x64/misc.bin/usr/include/boost/asio/socket_acceptor_service.hpp:119
#21 ~basic_io_object (this=0x2000002ff030, __in_chrg=<optimized out>)
    at /data/osv/external/x64/misc.bin/usr/include/boost/asio/basic_io_object.hpp:196
#22 ~basic_socket_acceptor (this=0x2000002ff030, __in_chrg=<optimized out>)
    at /data/osv/external/x64/misc.bin/usr/include/boost/asio/basic_socket_acceptor.hpp:54
#23 test_shutdown_wr::test_method (this=this@entry=0x2000002ff0df) at /data/osv/tests/tst-bsd-tcp1.cc:228
#24 0x0000100000c0e111 in test_shutdown_wr_invoker () at /data/osv/tests/tst-bsd-tcp1.cc:224
#25 0x0000100000c0e617 in invoke<void (*)()> (this=<optimized out>, f=<optimized out>)
    at /data/osv/external/x64/misc.bin/usr/include/boost/test/utils/callback.hpp:56
#26 boost::unit_test::ut_detail::callback0_impl_t<boost::unit_test::ut_detail::unused, void (*)()>::invoke (this=<optimized out>)
    at /data/osv/external/x64/misc.bin/usr/include/boost/test/utils/callback.hpp:89
---Type <return> to continue, or q <return> to quit---
#27 0x0000100001069941 in boost::unit_test::ut_detail::callback0_impl_t<int, boost::unit_test::(anonymous namespace)::zero_return_wrapper_t<boost::unit_test::callback0<boost::unit_test::ut_detail::unused> > >::invoke() ()
#28 0x000010000104c796 in boost::execution_monitor::catch_signals(boost::unit_test::callback0<int> const&) ()
#29 0x000010000104cfc3 in boost::execution_monitor::execute(boost::unit_test::callback0<int> const&) ()
#30 0x0000100001069a42 in boost::unit_test::unit_test_monitor_t::execute_and_translate(boost::unit_test::test_case const&) ()
#31 0x0000100001053d64 in boost::unit_test::framework_impl::visit(boost::unit_test::test_case const&) ()
#32 0x00001000010818a3 in boost::unit_test::traverse_test_tree(boost::unit_test::test_suite const&, boost::unit_test::test_tree_visitor&) ()
#33 0x000010000104f28a in boost::unit_test::framework::run(unsigned long, bool) ()
#34 0x0000100001067d84 in boost::unit_test::unit_test_main(bool (*)(), int, char**) ()
#35 0x00000000003ecacb in osv::application::run_main (this=this@entry=0xffffa00002c2b818, path="/tests/tst-bsd-tcp1.so", argc=argc@entry=1, 
    argv=argv@entry=0xffffa000020fb870) at /data/osv/core/app.cc:193
#36 0x00000000003ed2f8 in osv::application::run_main (this=this@entry=0xffffa00002c2b818) at /data/osv/core/app.cc:208
#37 0x000000000020b9e2 in osv::application::main (this=0xffffa00002c2b818) at /data/osv/core/app.cc:155
#38 0x00000000003ee119 in operator() (__closure=0x0, app=<optimized out>) at /data/osv/core/app.cc:115
#39 osv::application::__lambda3::_FUN (app=<optimized out>) at /data/osv/core/app.cc:117
#40 0x000000000041ac96 in operator() (__closure=0xffffa0000311f800) at /data/osv/libc/pthread.cc:79
#41 std::_Function_handler<void(), pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::__lambda6>::_M_invoke(const std::_Any_data &) (__functor=...) at /data/osv/external/x64/gcc.bin/usr/include/c++/4.8.2/functional:2071
#42 0x00000000003cafcb in main (this=0xffff80000328a050) at /data/osv/core/sched.cc:952
#43 sched::thread_main_c (t=0xffff80000328a050) at /data/osv/arch/x64/arch-switch.hh:137
#44 0x000000000037a4f6 in thread_main () at /data/osv/arch/x64/entry.S:113

@penberg penberg closed this as completed in 17d9a2d Aug 8, 2014
tgrabiec added a commit to tgrabiec/osv that referenced this issue Aug 8, 2014
After close() is called, the file should not be used any more. We need
to stop pollers before doing that, otherwise epoll_wait() can attempt
calling poll() on a file which was closed. Doing so will result in
aborts like the one below, because the socket, which is attempted to
be locked has been already freed.

[backtrace]
0x00000000002232a8 <__assert_fail+24>
0x00000000003a7e27 <lockfree::mutex::unlock()+487>
0x00000000003dfe0d <epoll_file::wait(epoll_event*, int, int)+1165>
0x00000000003de240 <epoll_wait+112>
0x0000100004409ad2 <Java_sun_nio_ch_EPollArrayWrapper_epollWait+210>

Fixes cloudius-systems#424.
@tgrabiec
Copy link
Member Author

tgrabiec commented Aug 8, 2014

This solves only one part of the problem. The issue still remains because epoll_wait() does not synchronize with file removal and it's still possible that it will attempt to call methods on a closed object if file gets closed after dropping epoll_file::f_lock.

I think that bringing back Avi's epoll: fix up lock ordering issues, which got reverted, will solve this, because it holds epoll_file::f_lock around accesses to slave files.

0x0000000000473190 <page_fault+273>
0x0000000000472046 <???+4661318>
0x00000000004b5889 <std::atomic<sched::thread*>::load(std::memory_order) const+31>
0x00000000004b36a2 <lockfree::mutex::unlock()+50>
0x000000000024cbcd <socket_file::epoll_add()+379>
0x000000000053c1ac <epoll_file::wait(epoll_event*, int, int)+1024>
0x0000000000539e21 <epoll_wait+197>
0x0000100004c09ad2 <Java_sun_nio_ch_EPollArrayWrapper_epollWait+210>

@avikivity
Copy link
Member

Yes, I had a version of this patch in my tree, which never got
committed, which may be why I did not see these problems.

On 08/08/2014 08:56 PM, Tomasz Grabiec wrote:

This solves only one part of the problem. The issue still remains
because |epoll_wait()| does not synchronize with file removal and it's
still possible that it will attempt to call methods on a closed object
if file gets closed after dropping |epoll_file::f_lock|.

I think that bringing back Avi's |epoll: fix up lock ordering issues|,
which got reverted, will solve this, because it holds
|epoll_file::f_lock| around accesses to slave files.

|0x0000000000473190 <page_fault+273>
0x0000000000472046 <???+4661318>
0x00000000004b5889 std::atomic<sched::thread*::load(std::memory_order) const+31>
0x00000000004b36a2 lockfree::mutex::unlock()+50
0x000000000024cbcd <socket_file::epoll_add()+379>
0x000000000053c1ac <epoll_file::wait(epoll_event*, int, int)+1024>
0x0000000000539e21 <epoll_wait+197>
0x0000100004c09ad2 <Java_sun_nio_ch_EPollArrayWrapper_epollWait+210>

|
|


Reply to this email directly or view it on GitHub
#424 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants