Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tomcat NIO connector fails #293

Closed
asias opened this issue May 6, 2014 · 17 comments
Closed

tomcat NIO connector fails #293

asias opened this issue May 6, 2014 · 17 comments

Comments

@asias
Copy link
Contributor

asias commented May 6, 2014

Change tomcat to use NIO connector instead of BIO by editing:

$ vim osv/apps/tomcat/upstream/apache-tomcat-7.0.42/conf/server.xml

$ make image=tomcat
$ sudo make scripts/run.py -nv
May 06, 2014 12:14:36 AM org.apache.catalina.startup.Catalina start
INFO: Server startup in 1753 ms
pure virtual method called
terminate called without an active exception
Aborted

[backtrace]
0x417694 <__gnu_cxx::__verbose_terminate_handler()+340>
0x3bb0c2 <do_poll(std::vector<poll_file, std::allocator<poll_file> >&, int)+290>
0x3bc867 <epoll_wait+391>
0x100003209a82 <Java_sun_nio_ch_EPollArrayWrapper_epollWait+210>
0x200000590b70 <???+5835632>
0x200000598b1f <???+5868319>
0x4083f8e96f <???+-2080839313>

$ sh perform-one-test.sh
$ cat perform-one-test.sh

!/usr/bin/env bash

GUEST_IP=192.168.122.89

if [ -z "$GUEST_IP" ]; then
echo "Please set GUEST_IP"
exit 1
fi

TIMESTAMP=$(date +%Y%m%d_%H%M%S)

echo "=== Warmup ==="
wrk -t4 -c256 -d30s http://${GUEST_IP}:8081/servlet/json

echo "=== Main test ==="
wrk_out_file=wrk.${TIMESTAMP}
echo "Output goes to $wrk_out_file"
wrk --latency -t4 -c128 -d1m http://${GUEST_IP}:8081/servlet/json > ${wrk_out_file}

echo "=== Results ==="
./wrk-parse.py ${wrk_out_file}

@asias
Copy link
Contributor Author

asias commented May 6, 2014

More backtrace:

#0 0x00000000003be012 in cli_hlt () at /home/asias/src/cloudius-systems/osv/arch/x64/processor.hh:241
#1 halt_no_interrupts () at /home/asias/src/cloudius-systems/osv/arch/x64/arch.hh:46
#2 osv::halt () at /home/asias/src/cloudius-systems/osv/core/power.cc:34
#3 0x000000000021d9b0 in abort (fmt=fmt@entry=0x5ab4c1 "Aborted\n") at /home/asias/src/cloudius-systems/osv/runtime.cc:141
#4 0x000000000021da30 in abort () at /home/asias/src/cloudius-systems/osv/runtime.cc:119
#5 0x0000000000417695 in __gnu_cxx::__verbose_terminate_handler() ()
#6 0x00000000004166b6 in __cxxabiv1::__terminate(void (*)()) ()
#7 0x00000000004166e3 in std::terminate() ()
#8 0x00000000004188ef in __cxa_pure_virtual ()
#9 0x00000000003bab45 in poll_scan (_pfd=std::vector of length 128, capacity 128 = {...}) at /home/asias/src/cloudius-systems/osv/core/poll.cc:126
#10 0x00000000003bb0c3 in do_poll (pfd=std::vector of length 0, capacity 0, _timeout=_timeout@entry=1000) at /home/asias/src/cloudius-systems/osv/core/poll.cc:290
#11 0x00000000003bc868 in wait (timeout_ms=1000, maxevents=8192, events=0xffffc0001d9e8000, this=0xffffc00036545e00)
at /home/asias/src/cloudius-systems/osv/core/epoll.cc:123
#12 epoll_wait (epfd=, events=0xffffc0001d9e8000, maxevents=8192, timeout_ms=1000) at /home/asias/src/cloudius-systems/osv/core/epoll.cc:236
#13 0x0000100003209ab7 in Java_sun_nio_ch_EPollArrayWrapper_epollWait ()
#14 0x00002000005c90b1 in ?? ()
#15 0x00000000fbada040 in ?? ()
#16 0x0000000000000007 in ?? ()
#17 0x0000000200000092 in ?? ()
#18 0x00000000fbada040 in ?? ()
#19 0x00000000536840dd in ?? ()
#20 0x00000000000b534d in ?? ()
#21 0x0000000000000021 in ?? ()
#22 0x0000200000594360 in ?? ()
#23 0x00000000fbada040 in ?? ()
#24 0x00002000005cd520 in ?? ()
#25 0x00000000000003e8 in ?? ()
#26 0xfbada030fbada020 in ?? ()
#27 0x00000000fbada020 in ?? ()
#28 0x00000000fbada030 in ?? ()
#29 0x00000000fb83a050 in ?? ()
#30 0x00000000fbada040 in ?? ()
#31 0x00000000eb98b630 in ?? ()
#32 0x00000000fb6b20a8 in ?? ()

@nyh
Copy link
Contributor

nyh commented May 11, 2014

On Tue, May 6, 2014 at 5:01 AM, Asias He [email protected] wrote:

More backtrace:

#0 0x00000000003be012 in cli_hlt () at
/home/asias/src/cloudius-systems/osv/arch/x64/processor.hh:241
#1 #1 halt_no_interrupts ()
at /home/asias/src/cloudius-systems/osv/arch/x64/arch.hh:46
#2 #2 osv::halt () at
/home/asias/src/cloudius-systems/osv/core/power.cc:34
#3 #3 0x000000000021d9b0 in
abort (fmt=fmt@entry=0x5ab4c1 "Aborted\n") at
/home/asias/src/cloudius-systems/osv/runtime.cc:141
#4 #4 0x000000000021da30
in abort () at /home/asias/src/cloudius-systems/osv/runtime.cc:119
#5 #5 0x0000000000417695
in __gnu_cxx::__verbose_terminate_handler() ()
#6 #6 0x00000000004166b6
in __cxxabiv1::__terminate(void (*)()) ()
#7 #7 0x00000000004166e3
in std::terminate() ()
#8 #8 0x00000000004188ef
in __cxa_pure_virtual ()
#9 #9 0x00000000003bab45
in poll_scan (_pfd=std::vector of length 128, capacity 128 = {...}) at
/home/asias/src/cloudius-systems/osv/core/poll.cc:126

poll_scan calls fp->poll(). This poll() is a pure virtual virtual function,
but all subclasses of "file" must implement it, so it should no longer be a
pure virtual in subclasses.

The only thing I can think of that can cause this (but this is just a wild
guess...) is that we have a race between closeing a file (the destruction
of fp) and an epoll_wait() (calling fp->poll()), so when fp->poll() is
called the subclass was already destructed and we're left with the
superclass, with its pure virtual function.

It's been a while since I looked at this code, but looking now at
file::~file, it seems when we destroy a file object, we call
epoll_file_closed() to remove this file from existing epoll lists, but we
do this a bit too late - when we get to file::~file() we already have a
corrupt object (stripped of its subclass content), so a concurrent
epoll_wait() will cause fp->poll() to be called on a corrupt object.

By the way, it also seems that the epoll_file_closed() function, if called
concurrently with epoll_wait(), can cause problems because those are not
protected by a mutex (unless I'm missing something).

Asias, do you think that this workload really does epoll_wait()
concurrently with a close()?

@asias
Copy link
Contributor Author

asias commented May 12, 2014

On Sun, May 11, 2014 at 7:35 PM, nyh [email protected] wrote:

On Tue, May 6, 2014 at 5:01 AM, Asias He [email protected] wrote:

More backtrace:

#0 0x00000000003be012 in cli_hlt () at
/home/asias/src/cloudius-systems/osv/arch/x64/processor.hh:241
#1 #1 halt_no_interrupts
()
at /home/asias/src/cloudius-systems/osv/arch/x64/arch.hh:46
#2 #2 osv::halt () at
/home/asias/src/cloudius-systems/osv/core/power.cc:34
#3 #3 0x000000000021d9b0
in
abort (fmt=fmt@entry=0x5ab4c1 "Aborted\n") at
/home/asias/src/cloudius-systems/osv/runtime.cc:141
#4 #4 0x000000000021da30
in abort () at /home/asias/src/cloudius-systems/osv/runtime.cc:119
#5 #5 0x0000000000417695
in __gnu_cxx::__verbose_terminate_handler() ()
#6 #6 0x00000000004166b6
in __cxxabiv1::__terminate(void (*)()) ()
#7 #7 0x00000000004166e3
in std::terminate() ()
#8 #8 0x00000000004188ef
in __cxa_pure_virtual ()
#9 #9 0x00000000003bab45
in poll_scan (_pfd=std::vector of length 128, capacity 128 = {...}) at
/home/asias/src/cloudius-systems/osv/core/poll.cc:126

poll_scan calls fp->poll(). This poll() is a pure virtual virtual function,
but all subclasses of "file" must implement it, so it should no longer be a
pure virtual in subclasses.

The only thing I can think of that can cause this (but this is just a wild
guess...) is that we have a race between closeing a file (the destruction
of fp) and an epoll_wait() (calling fp->poll()), so when fp->poll() is
called the subclass was already destructed and we're left with the
superclass, with its pure virtual function.

It's been a while since I looked at this code, but looking now at
file::~file, it seems when we destroy a file object, we call
epoll_file_closed() to remove this file from existing epoll lists, but we
do this a bit too late - when we get to file::~file() we already have a
corrupt object (stripped of its subclass content), so a concurrent
epoll_wait() will cause fp->poll() to be called on a corrupt object.

By the way, it also seems that the epoll_file_closed() function, if called
concurrently with epoll_wait(), can cause problems because those are not
protected by a mutex (unless I'm missing something).

Asias, do you think that this workload really does epoll_wait()
concurrently with a close()?

I am not sure. Is there any way I can verify it?

Asias

@avikivity
Copy link
Member

On 05/12/2014 08:45 AM, Asias He wrote:

On Sun, May 11, 2014 at 7:35 PM, nyh [email protected] wrote:

On Tue, May 6, 2014 at 5:01 AM, Asias He [email protected]
wrote:

More backtrace:

#0 0x00000000003be012 in cli_hlt () at
/home/asias/src/cloudius-systems/osv/arch/x64/processor.hh:241
#1 #1 halt_no_interrupts
()
at /home/asias/src/cloudius-systems/osv/arch/x64/arch.hh:46
#2 #2 osv::halt () at
/home/asias/src/cloudius-systems/osv/core/power.cc:34
#3 #3 0x000000000021d9b0
in
abort (fmt=fmt@entry=0x5ab4c1 "Aborted\n") at
/home/asias/src/cloudius-systems/osv/runtime.cc:141
#4 #4
0x000000000021da30
in abort () at /home/asias/src/cloudius-systems/osv/runtime.cc:119
#5 #5
0x0000000000417695
in __gnu_cxx::__verbose_terminate_handler() ()
#6 #6
0x00000000004166b6
in __cxxabiv1::__terminate(void (*)()) ()
#7 #7
0x00000000004166e3
in std::terminate() ()
#8 #8
0x00000000004188ef
in __cxa_pure_virtual ()
#9 #9
0x00000000003bab45
in poll_scan (_pfd=std::vector of length 128, capacity 128 = {...}) at
/home/asias/src/cloudius-systems/osv/core/poll.cc:126

poll_scan calls fp->poll(). This poll() is a pure virtual virtual
function,
but all subclasses of "file" must implement it, so it should no
longer be a
pure virtual in subclasses.

The only thing I can think of that can cause this (but this is just
a wild
guess...) is that we have a race between closeing a file (the
destruction
of fp) and an epoll_wait() (calling fp->poll()), so when fp->poll() is
called the subclass was already destructed and we're left with the
superclass, with its pure virtual function.

It's been a while since I looked at this code, but looking now at
file::~file, it seems when we destroy a file object, we call
epoll_file_closed() to remove this file from existing epoll lists,
but we
do this a bit too late - when we get to file::~file() we already have a
corrupt object (stripped of its subclass content), so a concurrent
epoll_wait() will cause fp->poll() to be called on a corrupt object.

By the way, it also seems that the epoll_file_closed() function, if
called
concurrently with epoll_wait(), can cause problems because those are not
protected by a mutex (unless I'm missing something).

Asias, do you think that this workload really does epoll_wait()
concurrently with a close()?

I am not sure. Is there any way I can verify it?

You can examine the type of the object:

(gdb) set print object on
(gdb) p fp (or p this)

If it says 'file', then Nadav is correct. How we actually get there can
be due to a bug in the workload (calling epoll concurrently with close
is a bug, since a concurrent open can result in a new fd with the same
number). Can be verified with an strace.

@asias
Copy link
Contributor Author

asias commented Jun 30, 2014

(gdb) f 20
#20 0x000000000051319b in poll_scan (_pfd=std::vector of length 68, capacity 68 = {...}) at /home/asias/src/cloudius-systems/osv/core/poll.cc:126
126 entry->revents = fp->poll(entry->events);
(gdb) p fp
$3 = (file *) 0xffffa0001a269480

@asias
Copy link
Contributor Author

asias commented Jul 3, 2014

Please reopen it since the commit does not really fix the issue.

On Thu, Jul 3, 2014 at 2:35 PM, Avi Kivity [email protected] wrote:

Closed #293 #293 via
a26ec3e
a26ec3e
.


Reply to this email directly or view it on GitHub
#293 (comment).

Asias

@avikivity avikivity reopened this Jul 3, 2014
@avikivity
Copy link
Member

Reopened, while the patch is needed, it is not sufficient.

@slivne slivne added this to the release 0.12 milestone Jul 31, 2014
@tgrabiec
Copy link
Member

tgrabiec commented Aug 4, 2014

Could it be related to #424 ?

@avikivity
Copy link
Member

It may. Hard to see exactly how.

What I am sure about is all the reference counting in that area is
completely rotten.

On 08/04/2014 09:16 PM, Tomasz Grabiec wrote:

Could it be related to #424
#424 ?


Reply to this email directly or view it on GitHub
#293 (comment).

@asias
Copy link
Contributor Author

asias commented Aug 11, 2014

'''
With commit 28c67d8
'''

'''
Aug 11, 2014 7:51:21 AM org.apache.coyote.AbstractProtocol start
INFO: Starting ProtocolHandler ["http-nio-8081"]
Aug 11, 2014 7:51:21 AM org.apache.catalina.startup.Catalina start
INFO: Server startup in 2952 ms
Assertion failed: !(so->so_state & 0x0001) (/home/asias/src/cloudius-systems/osv/bsd/sys/kern/uipc_socket.cc: soclose: 604)

[backtrace]
0x0000000000223328 <__assert_fail+24>
0x0000000000233568 <soclose+664>
0x00000000003fe83c <fdrop+60>
0x00000000003fea3f <fdset(int, file*)+111>
0x00000000003fbcb9 <dup3+73>
0x0000100004807c84 <Java_sun_nio_ch_FileDispatcherImpl_preClose0+36>
0x00002000006134b5 <???+6370485>
0x0000200000606057 <???+6316119>
0x0000200000955c5f <???+9788511>
0x00000000e2000565 <???+-503315099>
0x00002000007ce71f <???+8185631>
'''

'''
(gdb) bt
#0 0x00000000003e03b2 in cli_hlt () at /home/asias/src/cloudius-systems/osv/arch/x64/processor.hh:242
#1 halt_no_interrupts () at /home/asias/src/cloudius-systems/osv/arch/x64/arch.hh:48
#2 osv::halt () at /home/asias/src/cloudius-systems/osv/core/power.cc:34
#3 0x00000000002232ef in abort (fmt=fmt@entry=0x5e66a0 "Assertion failed: %s (%s: %s: %d)\n") at /home/asias/src/cloudius-systems/osv/runtime.cc:143
#4 0x0000000000223329 in __assert_fail (expr=, file=, line=, func=)
at /home/asias/src/cloudius-systems/osv/runtime.cc:149
#5 0x0000000000233569 in soclose (so=0xffffa0001752da00) at /home/asias/src/cloudius-systems/osv/bsd/sys/kern/uipc_socket.cc:604
#6 0x00000000003fe83d in fdrop (fp=0xffffa001026e3e80) at /home/asias/src/cloudius-systems/osv/fs/vfs/kern_descrip.cc:199
#7 0x00000000003fea40 in fdset (fd=fd@entry=106, fp=) at /home/asias/src/cloudius-systems/osv/fs/vfs/kern_descrip.cc:118
#8 0x00000000003fbcba in dup3 (oldfd=, newfd=106, flags=) at /home/asias/src/cloudius-systems/osv/fs/vfs/main.cc:1297
#9 0x0000100004807c85 in Java_sun_nio_ch_IOUtil_fdVal ()
#10 0x00002000006134b6 in ?? ()
#11 0x0000000000000000 in ?? ()

(gdb) i th
[New Thread 2]
Id Target Id Frame
2 Thread 2 (CPU#1 [halted ]) 0x000000000037b2ea in cli_hlt () at /home/asias/src/cloudius-systems/osv/arch/x64/processor.hh:242

  • 1 Thread 1 (CPU#0 [halted ]) 0x00000000003e03b2 in cli_hlt () at /home/asias/src/cloudius-systems/osv/arch/x64/processor.hh:242
    (gdb) t 2
    [Switching to thread 2 (Thread 2)]
    #0 0x000000000037b2ea in cli_hlt () at /home/asias/src/cloudius-systems/osv/arch/x64/processor.hh:242
    242 asm volatile ("cli; hlt" : : : "memory");
    (gdb) bt
    #0 0x000000000037b2ea in cli_hlt () at /home/asias/src/cloudius-systems/osv/arch/x64/processor.hh:242
    Hch/buildsystem fixes #1 nmi (ef=0xffff800001388068) at /home/asias/src/cloudius-systems/osv/arch/x64/exceptions.cc:255
    remove duplicate typedefs for u* types #2
    add a README file to document how to build and run OSv #3 sched::cpu::do_idle (this=this@entry=0xffff80000115c040) at /home/asias/src/cloudius-systems/osv/core/sched.cc:388
    Implement DHCP #4 0x00000000003cab60 in sched::cpu::idle (this=0xffff80000115c040) at /home/asias/src/cloudius-systems/osv/core/sched.cc:401
    throughput decrease when using more vcpus #5 0x00000000003cab8f in operator() (__closure=) at /home/asias/src/cloudius-systems/osv/core/sched.cc:186
    Allow SSH connectivity to the CLI #6 std::_Function_handler<void(), sched::cpu::init_idle_thread()::__lambda7>::_M_invoke(const std::_Any_data &) (__functor=...)
    at /home/asias/src/cloudius-systems/osv/external/x64/gcc.bin/usr/include/c++/4.8.2/functional:2071
    Java crash #7 0x00000000003cadeb in main (this=0xffff800001383040) at /home/asias/src/cloudius-systems/osv/core/sched.cc:931
    Make Java InetAddress.getHostName() work #8 sched::thread_main_c (t=0xffff800001383040) at /home/asias/src/cloudius-systems/osv/arch/x64/arch-switch.hh:137
    One loose end in FPU saving #9 0x000000000037a536 in thread_main () at /home/asias/src/cloudius-systems/osv/arch/x64/entry.S:113
    '''

@nyh
Copy link
Contributor

nyh commented Aug 11, 2014

On Mon, Aug 11, 2014 at 10:54 AM, Asias He [email protected] wrote:

'''
With commit 28c67d8
28c67d8

I don't see how this commit, which only changed the implementation of
abort(), could probably cause a bug. Maybe it just made it more visible and
before the crash was silent?

@asias
Copy link
Contributor Author

asias commented Aug 11, 2014

On Mon, Aug 11, 2014 at 4:11 PM, nyh [email protected] wrote:

On Mon, Aug 11, 2014 at 10:54 AM, Asias He [email protected]
wrote:

'''
With commit 28c67d8
<
28c67d8

I don't see how this commit, which only changed the implementation of
abort(), could probably cause a bug. Maybe it just made it more visible and
before the crash was silent?

I am not testing this particular commit. I am testing if we have fixed the
NIO issue in osv/master because of other fixes.

Asias

@nyh
Copy link
Contributor

nyh commented Aug 11, 2014

On Mon, Aug 11, 2014 at 11:16 AM, Asias He [email protected] wrote:

I am not testing this particular commit. I am testing if we have fixed the
NIO issue in osv/master because of other fixes.

Oh, sorry. I thought you did a bisection and came up with this comment :-)

Nadav Har'El
[email protected]

@asias
Copy link
Contributor Author

asias commented Aug 19, 2014

Sometimes I saw assert on A, sometimes on B.

int
soclose(struct socket *so)
{
    int error = 0;
    uipc_d("soclose() so=%" PRIx64, (uint64_t)so);
    KASSERT(!(so->so_state & SS_NOFDREF), ("soclose: SS_NOFDREF on enter"));  <<<---- A
    ...
    ....
    flush_net_channel(so);
    KASSERT((so->so_state & SS_NOFDREF) == 0, ("soclose: NOFDREF")); <<<--- B
    so->so_state |= SS_NOFDREF;
    so->fp = NULL;
    sorele(so);
    CURVNET_RESTORE();
    return (error);
}

@asias
Copy link
Contributor Author

asias commented Aug 19, 2014

I added the following to verify that we reached Line X twice.

int fdrop(struct file *fp)
{
    if (__sync_fetch_and_sub(&fp->f_count, 1) != 1)
        return 0;

    /* We are about to free this file structure, but we still do things with it
     * so set the refcount to INT_MIN, fhold/fdrop may get called again
     * and we don't want to reach this point more than once.
     * INT_MIN is also safe against fget() seeing this file.
     */

    if (fp->ref++ > 0)
        abort();

    fp->f_count = INT_MIN;         <<<<----------- X 
    fp->stop_polls();
    fp->close();
    rcu_dispose(fp);
    return 1;
}

(gdb) bt
#0  0x00000000003df4c2 in cli_hlt () at /home/asias/src/cloudius-systems/osv/arch/x64/processor.hh:242
#1  halt_no_interrupts () at /home/asias/src/cloudius-systems/osv/arch/x64/arch.hh:48
#2  osv::halt () at /home/asias/src/cloudius-systems/osv/core/power.cc:34
#3  0x000000000022361f in abort (fmt=fmt@entry=0x5e6b75 "Aborted\n") at /home/asias/src/cloudius-systems/osv/runtime.cc:143
#4  0x0000000000223640 in abort () at /home/asias/src/cloudius-systems/osv/runtime.cc:110
#5  0x00000000003fd75d in fdrop (fp=0xffffa000db643200) at /home/asias/src/cloudius-systems/osv/fs/vfs/kern_descrip.cc:198
#6  0x00000000003fd920 in fdset (fd=fd@entry=119, fp=<optimized out>) at /home/asias/src/cloudius-systems/osv/fs/vfs/kern_descrip.cc:118
#7  0x00000000003fab8a in dup3 (oldfd=<optimized out>, newfd=119, flags=<optimized out>) at /home/asias/src/cloudius-systems/osv/fs/vfs/main.cc:1297
#8  0x0000100004007c85 in ?? ()
#9  0x0000000722604e68 in ?? ()
#10 0x000020001b22d760 in ?? ()
#11 0x000020001b22d6f0 in ?? ()
#12 0x00002000009aa77e in ?? ()
#13 0x00000007e0d098e0 in ?? ()
#14 0x0000100001a2d2d4 in ?? ()
#15 0x00000007100021a8 in ?? ()
#16 0x0000000000000000 in ?? ()

(gdb) f 5
#5  0x00000000003fd75d in fdrop (fp=0xffffa000db643200) at /home/asias/src/cloudius-systems/osv/fs/vfs/kern_descrip.cc:198
198             abort();
(gdb) l
193          * and we don't want to reach this point more than once.
194          * INT_MIN is also safe against fget() seeing this file.
195          */
196
197         if (fp->ref++ > 0)
198             abort();
199
200         fp->f_count = INT_MIN;
201         fp->stop_polls();
202         fp->close();

(gdb) p fp->ref
$5 = {
  <std::__atomic_base<int>> = {
    _M_i = 2
  }, <No data fields>}

@asias
Copy link
Contributor Author

asias commented Aug 19, 2014

I saved backtrace to fp->bt using backtrace_safe in the first time we hit line X.

We are doing dup3 -> fdset -> fdrop in the first time.
We are doing so_wake_poll -> poll_wake -> fdrop in the second time.



(gdb) p fp->bt
$1 = {0x3fd970 <fdset(int, file*)+112>, 0x3fabba <dup3(int, int, int)+74>, 0x100004007c85, 0x2000006134b6, 0x200000606058, 0x200000606058, 0x200000606058,
  0x200000606058, 0x200000606058, 0x2000007a903c, 0xe38412a9e21e5625, 0x0 <repeats 117 times>}
(gdb) p fp->bt_len
$2 = 11
(gdb) p fp->ref
$3 = {
  <std::__atomic_base<int>> = {
    _M_i = 2
  }, <No data fields>}
(gdb) bt
#0  0x00000000003df4f2 in cli_hlt () at /home/asias/src/cloudius-systems/osv/arch/x64/processor.hh:242
#1  halt_no_interrupts () at /home/asias/src/cloudius-systems/osv/arch/x64/arch.hh:48
#2  osv::halt () at /home/asias/src/cloudius-systems/osv/core/power.cc:34
#3  0x000000000022361f in abort (fmt=fmt@entry=0x5e6c15 "Aborted\n") at /home/asias/src/cloudius-systems/osv/runtime.cc:143
#4  0x0000000000223640 in abort () at /home/asias/src/cloudius-systems/osv/runtime.cc:110
#5  0x00000000003fd7a4 in fdrop (fp=0xffff9001016aa000) at /home/asias/src/cloudius-systems/osv/fs/vfs/kern_descrip.cc:213
#6  0x00000000003db450 in poll_wake (fp=0xffff9001016aa000, events=260) at /home/asias/src/cloudius-systems/osv/core/poll.cc:187
#7  0x0000000000231009 in so_wake_poll (so=so@entry=0xffffa000d6a08200, sb=sb@entry=0xffffa000d6a08320)
    at /home/asias/src/cloudius-systems/osv/bsd/sys/kern/uipc_sockbuf.cc:231
#8  0x0000000000231094 in sowakeup (so=0xffffa000d6a08200, sb=0xffffa000d6a08320) at /home/asias/src/cloudius-systems/osv/bsd/sys/kern/uipc_sockbuf.cc:253
#9  0x000000000026fe61 in sowwakeup_locked (so=0xffffa000d6a08200) at /home/asias/src/cloudius-systems/osv/bsd/sys/sys/socketvar.h:270
#10 tcp_do_segment (m=m@entry=0xffffa000d6f0a500, th=th@entry=0xffff8000d6e6902e, so=0xffffa000d6a08200, tp=tp@entry=0xffffa0001d320800,
    drop_hdrlen=drop_hdrlen@entry=52, tlen=tlen@entry=0, iptos=iptos@entry=0 '\000', ti_locked=ti_locked@entry=2, want_close=@0xffff800002e28e00: false)
    at /home/asias/src/cloudius-systems/osv/bsd/sys/netinet/tcp_input.cc:2229
#11 0x0000000000271cad in tcp_input (m=0xffffa000d6f0a500, off0=20) at /home/asias/src/cloudius-systems/osv/bsd/sys/netinet/tcp_input.cc:955
#12 0x00000000002637aa in ip_input (m=<optimized out>) at /home/asias/src/cloudius-systems/osv/bsd/sys/netinet/ip_input.cc:774
#13 0x0000000000250942 in netisr_dispatch_src (proto=1, source=<optimized out>, m=0xffffa000d6f0a500)
    at /home/asias/src/cloudius-systems/osv/bsd/sys/net/netisr.cc:769
#14 0x000000000024891f in ether_input_internal (m=0xffffa000d6f0a500, ifp=0xffff900002d25000)
    at /home/asias/src/cloudius-systems/osv/bsd/sys/net/if_ethersubr.cc:741
#15 ether_nh_input (m=0xffffa000d6f0a500) at /home/asias/src/cloudius-systems/osv/bsd/sys/net/if_ethersubr.cc:754
#16 0x0000000000250942 in netisr_dispatch_src (proto=9, source=<optimized out>, m=0xffffa000d6f0a500)
    at /home/asias/src/cloudius-systems/osv/bsd/sys/net/netisr.cc:769
#17 0x0000000000357c3b in virtio::net::receiver (this=0xffff800002e03040) at /home/asias/src/cloudius-systems/osv/drivers/virtio-net.cc:502
#18 0x0000000000357f2f in operator() (__closure=<optimized out>) at /home/asias/src/cloudius-systems/osv/drivers/virtio-net.cc:246
#19 std::_Function_handler<void(), virtio::net::net(pci::device&)::__lambda12>::_M_invoke(const std::_Any_data &) (__functor=...)
    at /home/asias/src/cloudius-systems/osv/external/x64/gcc.bin/usr/include/c++/4.8.2/functional:2071
#20 0x00000000003c98bb in main (this=0xffff800002e032d0) at /home/asias/src/cloudius-systems/osv/core/sched.cc:935
#21 sched::thread_main_c (t=0xffff800002e032d0) at /home/asias/src/cloudius-systems/osv/arch/x64/arch-switch.hh:137
#22 0x000000000037a8f6 in thread_main () at /home/asias/src/cloudius-systems/osv/arch/x64/entry.S:113

@avikivity
Copy link
Member

Aha. So while the first fdrop() is calling file::close() (which is
socket::close), the socket wakes up and kills the file.

We need to move the call to file::close() from fdrop() to fdset() and
fdclose() (before the fdrop()s there). This will eliminate the race, as
there can be only one call to fdclose() or fdset().

On 08/19/2014 11:03 AM, Asias He wrote:

I safed backtrace to fp->bt using backtrace_safe in the first time we
hit line X.

We are doing dup3 -> fdset -> fdrop in the first time.
We are doing so_wake_poll -> poll_wake -> fdrop in the second time.

|
|

(gdb) p fp->bt
$1 = {0x3fd970 <fdset(int, file*)+112>, 0x3fabba <dup3(int, int, int)+74>, 0x100004007c85, 0x2000006134b6, 0x200000606058, 0x200000606058, 0x200000606058,
0x200000606058, 0x200000606058, 0x2000007a903c, 0xe38412a9e21e5625, 0x0 <repeats 117 times>}
(gdb) p fp->bt_len
$2 = 11
(gdb) p fp->ref
$3 = {
std::__atomic_base = {
_M_i = 2
}, }
(gdb) bt
#0 0x00000000003df4f2 in cli_hlt () at /home/asias/src/cloudius-systems/osv/arch/x64/processor.hh:242
#1 halt_no_interrupts () at /home/asias/src/cloudius-systems/osv/arch/x64/arch.hh:48
#2 osv::halt () at /home/asias/src/cloudius-systems/osv/core/power.cc:34
#3 0x000000000022361f in abort (fmt=fmt@entry=0x5e6c15 "Aborted\n") at /home/asias/src/cloudius-systems/osv/runtime.cc:143
#4 0x0000000000223640 in abort () at /home/asias/src/cloudius-systems/osv/runtime.cc:110
#5 0x00000000003fd7a4 in fdrop (fp=0xffff9001016aa000) at /home/asias/src/cloudius-systems/osv/fs/vfs/kern_descrip.cc:213
#6 0x00000000003db450 in poll_wake (fp=0xffff9001016aa000, events=260) at /home/asias/src/cloudius-systems/osv/core/poll.cc:187
#7 0x0000000000231009 in so_wake_poll (so=so@entry=0xffffa000d6a08200, sb=sb@entry=0xffffa000d6a08320)
at /home/asias/src/cloudius-systems/osv/bsd/sys/kern/uipc_sockbuf.cc:231
#8 0x0000000000231094 in sowakeup (so=0xffffa000d6a08200, sb=0xffffa000d6a08320) at /home/asias/src/cloudius-systems/osv/bsd/sys/kern/uipc_sockbuf.cc:253
#9 0x000000000026fe61 in sowwakeup_locked (so=0xffffa000d6a08200) at /home/asias/src/cloudius-systems/osv/bsd/sys/sys/socketvar.h:270
#10 tcp_do_segment (m=m@entry=0xffffa000d6f0a500, th=th@entry=0xffff8000d6e6902e, so=0xffffa000d6a08200, tp=tp@entry=0xffffa0001d320800,
drop_hdrlen=drop_hdrlen@entry=52, tlen=tlen@entry=0, iptos=iptos@entry=0 '\000', ti_locked=ti_locked@entry=2, want_close=@0xffff800002e28e00: false)
at /home/asias/src/cloudius-systems/osv/bsd/sys/netinet/tcp_input.cc:2229
#11 0x0000000000271cad in tcp_input (m=0xffffa000d6f0a500, off0=20) at /home/asias/src/cloudius-systems/osv/bsd/sys/netinet/tcp_input.cc:955
#12 0x00000000002637aa in ip_input (m=) at /home/asias/src/cloudius-systems/osv/bsd/sys/netinet/ip_input.cc:774
#13 0x0000000000250942 in netisr_dispatch_src (proto=1, source=, m=0xffffa000d6f0a500)
at /home/asias/src/cloudius-systems/osv/bsd/sys/net/netisr.cc:769
#14 0x000000000024891f in ether_input_internal (m=0xffffa000d6f0a500, ifp=0xffff900002d25000)
at /home/asias/src/cloudius-systems/osv/bsd/sys/net/if_ethersubr.cc:741
#15 ether_nh_input (m=0xffffa000d6f0a500) at /home/asias/src/cloudius-systems/osv/bsd/sys/net/if_ethersubr.cc:754
#16 0x0000000000250942 in netisr_dispatch_src (proto=9, source=, m=0xffffa000d6f0a500)
at /home/asias/src/cloudius-systems/osv/bsd/sys/net/netisr.cc:769
#17 0x0000000000357c3b in virtio::net::receiver (this=0xffff800002e03040) at /home/asias/src/cloudius-systems/osv/drivers/virtio-net.cc:502
#18 0x0000000000357f2f in operator() (__closure=) at /home/asias/src/cloudius-systems/osv/drivers/virtio-net.cc:246
#19 std::_Function_handler<void(), virtio::net::net(pci::device&)::__lambda12>::_M_invoke(const std::_Any_data &) (__functor=...)
at /home/asias/src/cloudius-systems/osv/external/x64/gcc.bin/usr/include/c++/4.8.2/functional:2071
#20 0x00000000003c98bb in main (this=0xffff800002e032d0) at /home/asias/src/cloudius-systems/osv/core/sched.cc:935
#21 sched::thread_main_c (t=0xffff800002e032d0) at /home/asias/src/cloudius-systems/osv/arch/x64/arch-switch.hh:137
#22 0x000000000037a8f6 in thread_main () at /home/asias/src/cloudius-systems/osv/arch/x64/entry.S:113
|


Reply to this email directly or view it on GitHub
#293 (comment).

asias added a commit to asias/osv that referenced this issue Aug 25, 2014
There is a race window between f_count reaches 0 and we set f_count to
INT_MIN. This race would result in multiple call to fp->close() and
rcu_dispose(fp) if we call them in fdrop().

To fix, we move fp->stop_polls and fp->close() from fdrop() into in
fdset() and fdclose(). This will eliminate the multipe call to
fp->close(), as there can be only one call to fdclose() or fdset().

Also, we set f_count to INT_MIN using CAS, this way, we make sure
rcu_dispose(fp) won't be called twice.

Fixes cloudius-systems#293.

Signed-off-by: Asias He <[email protected]>
asias added a commit to asias/osv that referenced this issue Aug 25, 2014
There is a race window between f_count reaches 0 and we set f_count to
INT_MIN. This race would result in multiple call to fp->close() and
rcu_dispose(fp) if we call them in fdrop().

To fix, we move fp->stop_polls and fp->close() from fdrop() into in
fdset() and fdclose(). This will eliminate the multipe call to
fp->close(), as there can be only one call to fdclose() or fdset().

Also, we set f_count to INT_MIN using CAS, this way, we make sure
rcu_dispose(fp) won't be called twice.

Fixes cloudius-systems#293.

Signed-off-by: Asias He <[email protected]>
asias added a commit to asias/osv that referenced this issue Aug 25, 2014
There is a race window between f_count reaches 0 and we set f_count to
INT_MIN. This race would result in multiple call to fp->close() and
rcu_dispose(fp) if we call them in fdrop().

To fix, we move fp->stop_polls and fp->close() from fdrop() into in
fdset() and fdclose(). This will eliminate the multipe call to
fp->close(), as there can be only one call to fdclose() or fdset().

Also, we set f_count to INT_MIN using CAS, this way, we make sure
rcu_dispose(fp) won't be called twice.

Fixes cloudius-systems#293.

Signed-off-by: Asias He <[email protected]>
asias added a commit to asias/osv that referenced this issue Aug 25, 2014
There is a race window between f_count reaches 0 and we set f_count to
INT_MIN. This race would result in multiple call to fp->close() and
rcu_dispose(fp) if we call them in fdrop().

To fix, we set f_count to INT_MIN using CAS, this way, we make sure
fp->close() and rcu_dispose(fp) won't be called twice.

Fixes cloudius-systems#293.

Signed-off-by: Asias He <[email protected]>
asias added a commit to asias/osv that referenced this issue Aug 26, 2014
There is a race window between f_count reaches 0 and we set f_count to
INT_MIN. This race would result in multiple call to fp->close() and
rcu_dispose(fp) if we call them in fdrop().

To fix, we set f_count to INT_MIN using CAS, this way, we make sure
fp->close() and rcu_dispose(fp) won't be called twice.

Fixes cloudius-systems#293.

Signed-off-by: Asias He <[email protected]>
asias added a commit to asias/osv that referenced this issue Aug 27, 2014
There is a race window between f_count reaches 0 and we set f_count to
INT_MIN. This race would result in multiple call to fp->close() and
rcu_dispose(fp) if we call them in fdrop().

To fix, we set f_count to INT_MIN using CAS, this way, we make sure
fp->close() and rcu_dispose(fp) won't be called twice.

Fixes cloudius-systems#293.

Signed-off-by: Asias He <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants