Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tcp_do_segment abort (NFS related?) #726

Closed
justinc1 opened this issue Feb 12, 2016 · 23 comments
Closed

tcp_do_segment abort (NFS related?) #726

justinc1 opened this issue Feb 12, 2016 · 23 comments

Comments

@justinc1
Copy link
Contributor

Crash:

sudo ./scripts/run.py -d -V -n -v -e '--ip=eth0,192.168.122.202,255.255.255.0 --defaultgw=192.168.122.1 --nameserver=192.168.122.1 /tools/mount-nfs.so nfs://192.168.122.1/ggg/?uid=0 /fff; tst-execve.so 30 /tst-file-read.so 30' --novnc -i build/debug/usr.img -c8 1>log-723-b/issue723-$ii 2>&1;

PP 7 tst-file-read.c ii=29/30
PP 14 tst-file-read.c ii=29/30
PP 9 fn=/fff/openfoam/etc/cellModels
PP 7 fn=/fff/openfoam/etc/cellModels
PP 14 fn=/fff/openfoam/etc/cellModels
PP 3 tst-file-read.c main EXIT ret=0
Assertion failed: tp->get_state() > 1 (bsd/sys/netinet/tcp_input.cc: tcp_do_segment: 1076)

[backtrace]
0x000000000022d8ea <abort(char const*, ...)+249>
0x000000000022d94b <__assert_fail+58>
PP 23 tst-file-read.c main EXIT ret=0
PP 16 tst-file-read.c ii=29/30
PP 16 fn=/fff/openfoam/etc/cellModels
0x00000000002a7206 <???+2781702>
0x00000000002abc3e <???+2800702>
0x00000000002abd19 <???+2800921>
0x00000000002ac6b7 <???+2803383>
0x00000000006313f8 <std::function<void (mbuf*)>::operator()(mbuf*) const+72>
0x0000000000630299 <net_channel::process_queue()+57>
0x0000000000244bb1 <???+2378673>
0x0000000000244ed2 <soclose+797>
0x0000000000254c77 <socket_file::close()+47>
0x0000000000675770 <fdrop+147>
0x000000000067529d <fdclose(int)+163>
0x000000000064cfae <close+35>
0x000000000070afbe <rpc_destroy_context+318>
0xffff8000091d903f <???+152932415>

Running on fedora 23 host, code is master@c93ebf9d140f + :

__cxa_atexit/finalize: be thread-safe
Add test for osv_execve.
Add osv_execve, osv_waittid

gdb:

(gdb) where
#0  processor::cli_hlt () at arch/x64/processor.hh:248
#1  0x0000000000209c4e in arch::halt_no_interrupts () at arch/x64/arch.hh:48
#2  0x000000000049d36b in osv::halt () at arch/x64/power.cc:24
#3  0x000000000022d911 in abort (fmt=0xa36f50 "Assertion failed: %s (%s: %s: %d)\n") at runtime.cc:132
#4  0x000000000022d94c in __assert_fail (expr=0xa4997e "tp->get_state() > 1", 
    file=0xa49644 "bsd/sys/netinet/tcp_input.cc", line=1076, 
    func=0xa49da0 <tcp_do_segment(mbuf*, tcphdr*, socket*, tcpcb*, int, int, unsigned char, int, bool&)::__func__> "tcp_do_segment") at runtime.cc:139
#5  0x00000000002a7207 in tcp_do_segment (m=0xffffa0000cfe6f00, th=0xffff80000c71e02e, so=0xffffa00009b4b600, 
    tp=0xffffa00002e8ec00, drop_hdrlen=66, tlen=732, iptos=0 '\000', ti_locked=2, want_close=@0x200000dffa4e: false)
    at bsd/sys/netinet/tcp_input.cc:1075
#6  0x00000000002abc3f in tcp_net_channel_packet (tp=0xffffa00002e8ec00, m=0xffffa0000cfe6f00)
    at bsd/sys/netinet/tcp_input.cc:3210
#7  0x00000000002abd1a in <lambda(mbuf*)>::operator()(mbuf *) const (__closure=0xffff9000043d5000, 
    m=0xffffa0000cfe6f00) at bsd/sys/netinet/tcp_input.cc:3229
#8  0x00000000002ac6b8 in std::_Function_handler<void(mbuf*), tcp_setup_net_channel(tcpcb*, ifnet*)::<lambda(mbuf*)> >::_M_invoke(const std::_Any_data &, <unknown type in /home/justin_cinkelj/devel/mikelangelo/osv/build/debug.x64/loader.elf, CU 0x77dfa7, DIE 0x7ad6c5>) (__functor=..., 
    __args#0=<unknown type in /home/justin_cinkelj/devel/mikelangelo/osv/build/debug.x64/loader.elf, CU 0x77dfa7, DIE 0x7ad6c5>) at /usr/include/c++/5.3.1/functional:1871
#9  0x00000000006313f9 in std::function<void (mbuf*)>::operator()(mbuf*) const (this=0xffff9000043d5000, 
    __args#0=0xffffa0000cfe6f00) at /usr/include/c++/5.3.1/functional:2271
#10 0x000000000063029a in net_channel::process_queue (this=0xffff9000043d5000) at core/net_channel.cc:37
#11 0x0000000000244bb2 in flush_net_channel (so=0xffffa00009b4b600) at bsd/sys/kern/uipc_socket.cc:587
#12 0x0000000000244ed3 in soclose (so=0xffffa00009b4b600) at bsd/sys/kern/uipc_socket.cc:657
#13 0x0000000000254c78 in socket_file::close (this=0xffffa00009802700) at bsd/sys/kern/sys_socket.cc:383
#14 0x0000000000675771 in fdrop (fp=0xffffa00009802700) at fs/vfs/kern_descrip.cc:211
#15 0x000000000067529e in fdclose (fd=10) at fs/vfs/kern_descrip.cc:94
#16 0x000000000064cfaf in close (fd=10) at fs/vfs/main.cc:235
#17 0x000000000070afbf in rpc_destroy_context (rpc=0xffff800009046040) at init.c:284
#18 0x0000000000701c13 in nfs_destroy_context (nfs=0xffffa00008fdf480) at libnfs.c:445
#19 0x00000000006f6fa1 in std::unique_ptr<nfs_context, void (*)(nfs_context*)>::~unique_ptr (
    this=0xffffa00008fc6128, __in_chrg=<optimized out>) at /usr/include/c++/5.3.1/bits/unique_ptr.h:236
---Type <return> to continue, or q <return> to quit---
#20 0x00000000006f7c1a in mount_context::~mount_context (this=0xffffa00008fc6100, __in_chrg=<optimized out>)
    at fs/nfs/nfs.hh:32
#21 0x00000000006f7c50 in std::default_delete<mount_context>::operator() (this=0xffffa00006e94be8, 
    __ptr=0xffffa00008fc6100) at /usr/include/c++/5.3.1/bits/unique_ptr.h:76
#22 0x00000000006f75c3 in std::unique_ptr<mount_context, std::default_delete<mount_context> >::~unique_ptr (
    this=0xffffa00006e94be8, __in_chrg=<optimized out>) at /usr/include/c++/5.3.1/bits/unique_ptr.h:236
#23 0x00000000006f90e2 in std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<mount_context, std::default_delete<mount_context> > >::~pair (this=0xffffa00006e94bc8, 
    __in_chrg=<optimized out>) at /usr/include/c++/5.3.1/bits/stl_pair.h:96
#24 0x00000000006f910e in __gnu_cxx::new_allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<mount_context, std::default_delete<mount_context> > > >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<mount_context, std::default_delete<mount_context> > > > (this=0x200000dffe17, __p=0xffffa00006e94bc8)
    at /usr/include/c++/5.3.1/ext/new_allocator.h:124
#25 0x00000000006f8f99 in std::allocator_traits<std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<mount_context, std::default_delete<mount_context> > > > >::_S_destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<mount_context, std::default_delete<mount_context> > > > (__a=..., __p=0xffffa00006e94bc8)
    at /usr/include/c++/5.3.1/bits/alloc_traits.h:285
#26 0x00000000006f8bce in std::allocator_traits<std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<mount_context, std::default_delete<mount_context> > > > >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<mount_context, std::default_delete<mount_context> > > > (__a=..., __p=0xffffa00006e94bc8)
    at /usr/include/c++/5.3.1/bits/alloc_traits.h:414
#27 0x00000000006f83df in std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<mount_context, std::default_delete<mount_context> > >, true> > >::_M_deallocate_node (this=0xffff900007563678, __n=0xffffa00006e94bc0)
    at /usr/include/c++/5.3.1/bits/hashtable_policy.h:1973
#28 0x00000000006f7d74 in std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<mount_context, std::default_delete<mount_context> > >, true> > >::_M_deallocate_nodes (this=0xffff900007563678, __n=0x0)
    at /usr/include/c++/5.3.1/bits/hashtable_policy.h:1986
#29 0x00000000006f7934 in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<cha---Type <return> to continue, or q <return> to quit---
r> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<mount_context, std::default_delete<mount_context> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<mount_context, std::default_delete<mount_context> > > >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear (this=0xffff900007563678)
    at /usr/include/c++/5.3.1/bits/hashtable.h:1914
#30 0x00000000006f7348 in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<mount_context, std::default_delete<mount_context> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<mount_context, std::default_delete<mount_context> > > >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable (this=0xffff900007563678, __in_chrg=<optimized out>)
    at /usr/include/c++/5.3.1/bits/hashtable.h:1225
#31 0x00000000006f9302 in std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::unique_ptr<mount_context, std::default_delete<mount_context> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::unique_ptr<mount_context, std::default_delete<mount_context> > > > >::~unordered_map (this=0xffff900007563678, __in_chrg=<optimized out>) at /usr/include/c++/5.3.1/bits/unordered_map.h:98
#32 0x00000000006b6384 in <lambda()>::operator()(void) const (__closure=0xffff900002155010)
    at libc/cxa_thread_atexit.cc:69
#33 0x00000000006b6505 in std::_Function_handler<void(), register_call_destructor()::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/5.3.1/functional:1871
#34 0x000000000044f924 in std::function<void ()>::operator()() const (this=0xffff900002155010)
    at /usr/include/c++/5.3.1/functional:2271
#35 0x00000000005c149d in sched::run_exit_notifiers () at core/sched.cc:864
#36 0x00000000005c2051 in sched::thread::complete (this=0xffff800006e21050) at core/sched.cc:1081
#37 0x00000000005be772 in sched::thread_main_c (t=0xffff800006e21050) at arch/x64/arch-switch.hh:165
#38 0x000000000048ceb6 in thread_main () at arch/x64/entry.S:113
(gdb) 

30 test clients are started, and abort happens when they start to exit. NFS ~mount_context is part of trace.

So far, it happened once in 80 runs.

@benoit-canet
Copy link
Contributor

Hi Justin,

I just saw this. I will examine it tomorow.

BR

Benoît

@benoit-canet
Copy link
Contributor

mount_context is destroyed because the thread exit. In turn each thread mount_context is supposed to destroy it's NFS context that will close the socket toward the server. Maybe this lead to a double close situation leading to the assert. I will check for this in the code.

@benoit-canet
Copy link
Contributor

My last comment does not make sense fdrop() which is in the backtrace path is suposed to protect us against double close.

@nyh Do you agree with this reasoning ?

@nyh
Copy link
Contributor

nyh commented Feb 16, 2016

On first glance, it seems to me that there's nothing that the NFS code could do to cause this bug - even calling close() twice in parallel should have worked (protected by a mutex). So it appears it is more likely to be a bug in our TCP/IP implementation. We should probably investigate (I don't remember...) what this "tp->get_state() > 1" assertion is about. It seems to me (but it's been ages since I looked at this code) that we are calling code which assumes the socket is not closed, on a closed socket.

Would be nice if we could somehow reproduce this bug without NFS or OpenMPI - perhaps by some test program which opens many TCP connections in parallel and then closes all of them in parallel (after, optionally, writing/reading/? on each connection). I don't really know if it would reproduce this bug, but might be worth trying if reproducing Justin's original scenario is too difficult.

@benoit-canet
Copy link
Contributor

tcp_state is the state of the TCP connection (LISTEN, SYN etc) .

Ok I will write a test

@benoit-canet
Copy link
Contributor

Hi Justin,

I made a quick test launching multiple netperf instances toward the same netperf server.

It crash which could mean you may have spotted a general issue in OSv TCP implementation.

#728

Thanks

@nyh
Copy link
Contributor

nyh commented Feb 16, 2016

Yes, although I'm not convinced it's the same bug - it appears it might be a completely different bug.

What is really surprising me is that we already did numerous workloads on OSv with many threads doing networking, and we didn't run into these bugs before. I wonder what is new in these workloads that the bugs starting crawling out of the woodwork.

@benoit-canet
Copy link
Contributor

They may have in common of openning multiple socket to exactly the same
server.
NFS does this and netperf does this.

If there is somewhere in the code a map using the tuple (ip,port,proto) it
could be the cause.

On Tue, Feb 16, 2016 at 5:05 PM, nyh [email protected] wrote:

Yes, although I'm not convinced it's the same bug - it appears it might be
a completely different bug.

What is really surprising me is that we already did numerous workloads on
OSv with many threads doing networking, and we didn't run into these bugs
before. I wonder what is new in these workloads that the bugs starting
crawling out of the woodwork.


Reply to this email directly or view it on GitHub
#726 (comment)
.

@nyh
Copy link
Contributor

nyh commented Feb 16, 2016

On Tue, Feb 16, 2016 at 6:07 PM, Benoît Canet [email protected]
wrote:

They may have in common of openning multiple socket to exactly the same
server.
NFS does this and netperf does this.

If you can reproduce this with an even simpler single test program which
runs multiple threads (with std::thread or pthread_create) and creates
sockets to the same server, it will help be sure that none of the
complexities of NFS or netperf are related.

Can you reproduce the original bug with "-c1"? If not, the bug may depend
on things running at the same time, so to reproduce it with a simpler test
running less stuff, you might need to pin the threads to different CPUs
(you can do it with out sched::thread), or run for enough time that the
threads move on their own (after about 0.1 seconds).

I wonder if we need to write or read something from this socket to see this
crash, or just the connection / disconnection stages are enough to cause
this bug.

In any case, if you can easily reproduce one of these bugs, we can start
debugging them even without simplifying the test....

If there is somewhere in the code a map using the tuple (ip,port,proto) it

could be the cause.

Usually you have a 5-tuple (with also the source IP and port), but who
knows...

@benoit-canet
Copy link
Contributor

About #728 and this I made the following tests which works ok.

OSv Client:

#include <cstdlib>
#include <cstring>
#include <iostream>
#include <boost/asio.hpp>

using boost::asio::ip::tcp;

int main(int argc, char* argv[])
{
    char request[4096];
    try {
        if (argc != 3) {
            std::cout << "Usage: tst-boost-network-client <host> <port>\n";
            return 1;
        }

        boost::asio::io_service io_service;

        tcp::resolver resolver(io_service);
        tcp::resolver::query query(tcp::v4(), argv[1], argv[2]);
        tcp::resolver::iterator iterator = resolver.resolve(query);

        tcp::socket s(io_service);
        s.connect(*iterator);

        for (int i = 0; i < 10000; i++) {
            boost::asio::write(s, boost::asio::buffer(request, sizeof(request)));
        }
        std::cout << "Exit" << std::endl;
    } catch (std::exception& e) {
        std::cout << "Exception: " << e.what() << "\n";
    }

    return 0;
}

Host server:

use std::io::Read;
use std::net::TcpListener;
use std::thread;

fn main() {
    let listener = TcpListener::bind("192.168.77.5:9999").unwrap();
    println!("listening started, ready to accept");
    for stream in listener.incoming() {
        thread::spawn(|| {
            println!("accepted");
            let mut stream = stream.unwrap();
            let mut buf = [0; 512];
            while stream.read(&mut buf).unwrap() != 0 {
            }
            println!("exited");
        });
    }
}

So we have no more hypothesis.

@nyh
Copy link
Contributor

nyh commented Feb 16, 2016

@benoit-canet, this issue appears to have found a crash while **close()**ing a socket, presumably in some sort of parallel scenario (closing many sockets in parallel, or closing in parallel with data arriving, or whatever). The other issue, #728, saw the crash in select(). Since in your new test you neither close()ed many sockets in parallel, nor used select(), I'm not surprised it didn't reproduce these two issues.

@benoit-canet
Copy link
Contributor

No luck doing that either.
Maybe it's related to 5875942 that look like a can of worms.

@benoit-canet
Copy link
Contributor

@justinc1: Ah I just saw it was the same read test so I just have to use it.

@benoit-canet
Copy link
Contributor

@justinc1 I just realized that execve was also involved could you share your tst-execve with us by attaching the source to this issue ?

@benoit-canet
Copy link
Contributor

I just did:

./scripts/run.py  -v -e '/tools/mount-nfs.so nfs://192.168.77.5/fff/?uid=0 /fff; /tests/tst-file-read.so 30  1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000  ' --novnc  -v c 8

And OSv exited in a perfectly clean way.

So it could be an interaction with threads namespaces, the execve patch or the execve test tool.

@justinc1 A good debugging technique is to try to isolate the problem by bisecting it.

Could you try on your machine that simply launching test-file-read.c in parallel with the basic osv threads does not lead to this crash.

If it's confirmed on your side I start auditing all the namespace and execve thread that was combined with your test while thinking about the interraction with the NFS client.

Best regards

Benoît

@justinc1
Copy link
Contributor Author

I left running test (using execve) for 800 runs, and got 2 occurrences of #726. So its rare.
@benoit-canet, your test above (-e " /tests/tst-file-read.so 30 1000 & /tests/tst-file-read.so 30 1000 & ...") would do nfs_destroy_context only once, at VM termination?

@justinc1
Copy link
Contributor Author

Repeating without execve and elf namespaces.

CMD1="--ip=eth0,192.168.122.202,255.255.255.0 --defaultgw=192.168.122.1 --nameserver=192.168.122.1 /tools/mount-nfs.so nfs://192.168.122.1/ggg/?uid=0 /fff "
CMD2=""; for jj in `seq 1 3`; do CMD2="$CMD2 /tst-file-read.so $jj 30&"; done
for ii in `seq 1 10`; do echo ii=$ii; sudo ./scripts/run.py -d -V -n -v -e "$CMD1 ; $CMD2" --novnc -i build/debug/usr.img -c8 1>log-u-726a/issue726-$ii 2>&1; sleep 1; done

after 790 tries:
PP 2 fn=/fff/openfoam/etc/controlDict
PP 1 fn=/fff/openfoam/etc/controlDict
PP 2 tst-file-read.c main EXIT ret=0
PP 1 tst-file-read.c main EXIT ret=0
Assertion failed: tp->get_state() > 1 (bsd/sys/netinet/tcp_input.cc: tcp_do_segment: 1076)
[backtrace]
0x000000000022bc69 <abort(char const*, ...)+249>
0x000000000022bcca <__assert_fail+58>
0x00000000002a6350 <???+2777936>
0x00000000002aad53 <???+2796883>
0x00000000002aae1f <???+2797087>
0x00000000002ab7a0 <???+2799520>
0x000000000061398c <std::function<void (mbuf*)>::operator()(mbuf*) const+72>
0x0000000000612866 <net_channel::process_queue()+32>
0x00000000002434cd <???+2372813>
0x00000000002437e2 <soclose+786>
0x0000000000253495 <socket_file::close()+47>
0x0000000000656a5e <fdrop+145>
0x0000000000656572 <fdclose(int)+163>
0x000000000062f453 <close+35>
0x00000000006ee1ee <rpc_destroy_context+318>
0xffff800005c1403f <???+96550975>

gdb:
ERROR: Unable to locate object file for: /tst-file-read.so 0x100005800000
#0  processor::cli_hlt () at arch/x64/processor.hh:248
#1  0x0000000000209693 in arch::halt_no_interrupts () at arch/x64/arch.hh:48
#2  0x000000000049b01b in osv::halt () at arch/x64/power.cc:24
#3  0x000000000022bc90 in abort (fmt=0x99eb20 "Assertion failed: %s (%s: %s: %d)\n") at runtime.cc:132
#4  0x000000000022bccb in __assert_fail (expr=0x9b11b6 "tp->get_state() > 1", 
    file=0x9b0e7c "bsd/sys/netinet/tcp_input.cc", line=1076, 
    func=0x9b1849 <tcp_do_segment(mbuf*, tcphdr*, socket*, tcpcb*, int, int, unsigned char, int, bool&)::__func__> "tcp_do_segment") at runtime.cc:139
#5  0x00000000002a6351 in tcp_do_segment (m=0xffffa000042f6600, th=0xffff80000583b02e, so=0xffffa00004129200, 
    tp=0xffffa00004205c00, drop_hdrlen=66, tlen=732, iptos=0 '\000', ti_locked=2, want_close=@0x2000002ffa8e: false)
    at bsd/sys/netinet/tcp_input.cc:1075
#6  0x00000000002aad54 in tcp_net_channel_packet (tp=0xffffa00004205c00, m=0xffffa000042f6600)
    at bsd/sys/netinet/tcp_input.cc:3210
#7  0x00000000002aae20 in __lambda6::operator() (__closure=0xffffa00004177410, m=0xffffa000042f6600)
    at bsd/sys/netinet/tcp_input.cc:3229
#8  0x00000000002ab7a1 in std::_Function_handler<void(mbuf*), tcp_setup_net_channel(tcpcb*, ifnet*)::__lambda6>::_M_invoke(const std::_Any_data &, mbuf *) (__functor=..., __args#0=0xffffa000042f6600)
    at /usr/include/c++/4.8/functional:2071
#9  0x000000000061398d in std::function<void (mbuf*)>::operator()(mbuf*) const (this=0xffff90000468e000, 
    __args#0=0xffffa000042f6600) at /usr/include/c++/4.8/functional:2471
#10 0x0000000000612867 in net_channel::process_queue (this=0xffff90000468e000) at core/net_channel.cc:37
#11 0x00000000002434ce in flush_net_channel (so=0xffffa00004129200) at bsd/sys/kern/uipc_socket.cc:587
---Type <return> to continue, or q <return> to quit---
#12 0x00000000002437e3 in soclose (so=0xffffa00004129200) at bsd/sys/kern/uipc_socket.cc:657
#13 0x0000000000253496 in socket_file::close (this=0xffffa0000413d780) at bsd/sys/kern/sys_socket.cc:383
#14 0x0000000000656a5f in fdrop (fp=0xffffa0000413d780) at fs/vfs/kern_descrip.cc:211
#15 0x0000000000656573 in fdclose (fd=6) at fs/vfs/kern_descrip.cc:94
#16 0x000000000062f454 in close (fd=6) at fs/vfs/main.cc:235
#17 0x00000000006ee1ef in rpc_destroy_context (rpc=0xffff800005b0d040) at init.c:284
#18 0x00000000006e4793 in nfs_destroy_context (nfs=0xffffa000042e0980) at libnfs.c:445
#19 0x00000000006d99b7 in std::unique_ptr<nfs_context, void (*)(nfs_context*)>::~unique_ptr (
    this=0xffffa0000436c8d0, __in_chrg=<optimized out>) at /usr/include/c++/4.8/bits/unique_ptr.h:184
#20 0x00000000006da640 in mount_context::~mount_context (this=0xffffa0000436c8c0, __in_chrg=<optimized out>)
    at fs/nfs/nfs.hh:32
#21 0x00000000006da674 in std::default_delete<mount_context>::operator() (this=0xffffa000041311d0, 
    __ptr=0xffffa0000436c8c0) at /usr/include/c++/4.8/bits/unique_ptr.h:67
#22 0x00000000006d9fdd in std::unique_ptr<mount_context, std::default_delete<mount_context> >::~unique_ptr (
    this=0xffffa000041311d0, __in_chrg=<optimized out>) at /usr/include/c++/4.8/bits/unique_ptr.h:184
#23 0x00000000006db7ac in std::pair<std::string const, std::unique_ptr<mount_context, std::default_delete<mount_context> > >::~pair (this=0xffffa000041311c8, __in_chrg=<optimized out>) at /usr/include/c++/4.8/bits/stl_pair.h:96
#24 0x00000000006dba28 in std::__detail::_Hash_node<std::pair<std::string const, std::unique_ptr<mount_context, std::default_delete<mount_context> > >, true>::~_Hash_node (this=0xffffa000041311c0, __in_chrg=<optimized out>)
    at /usr/include/c++/4.8/bits/hashtable_policy.h:170
#25 0x00000000006dba46 in __gnu_cxx::new_allocator<std::__detail::_Hash_node<std::pair<std::string const, std::unique_ptr<mount_context, std::default_delete<mount_context> > >, true> >::destroy<std::__detail::_Hash_node<std::pair<std:---Type <return> to continue, or q <return> to quit---
:string const, std::unique_ptr<mount_context, std::default_delete<mount_context> > >, true> > (
    this=0xffff9000046a8670, __p=0xffffa000041311c0) at /usr/include/c++/4.8/ext/new_allocator.h:124
#26 0x00000000006db56d in std::_Hashtable<std::string, std::pair<std::string const, std::unique_ptr<mount_context, std::default_delete<mount_context> > >, std::allocator<std::pair<std::string const, std::unique_ptr<mount_context, std::default_delete<mount_context> > > >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_deallocate_node (this=0xffff9000046a8660, __n=0xffffa000041311c0)
    at /usr/include/c++/4.8/bits/hashtable.h:746
#27 0x00000000006dafd3 in std::_Hashtable<std::string, std::pair<std::string const, std::unique_ptr<mount_context, std::default_delete<mount_context> > >, std::allocator<std::pair<std::string const, std::unique_ptr<mount_context, std::default_delete<mount_context> > > >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_deallocate_nodes (this=0xffff9000046a8660, __n=0x0)
    at /usr/include/c++/4.8/bits/hashtable.h:763
#28 0x00000000006da8b6 in std::_Hashtable<std::string, std::pair<std::string const, std::unique_ptr<mount_context, std::default_delete<mount_context> > >, std::allocator<std::pair<std::string const, std::unique_ptr<mount_context, std::default_delete<mount_context> > > >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear (this=0xffff9000046a8660)
    at /usr/include/c++/4.8/bits/hashtable.h:1641
#29 0x00000000006da3a0 in std::_Hashtable<std::string, std::pair<std::string const, std::unique_ptr<mount_context, std::default_delete<mount_context> > >, std::allocator<std::pair<std::string const, std::unique_ptr<mount_context, std:---Type <return> to continue, or q <return> to quit---
:default_delete<mount_context> > > >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable (this=0xffff9000046a8660, __in_chrg=<optimized out>)
    at /usr/include/c++/4.8/bits/hashtable.h:958
#30 0x00000000006dbc5a in std::unordered_map<std::string, std::unique_ptr<mount_context, std::default_delete<mount_context> >, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::unique_ptr<mount_context, std::default_delete<mount_context> > > > >::~unordered_map (this=0xffff9000046a8660, 
    __in_chrg=<optimized out>) at /usr/include/c++/4.8/bits/unordered_map.h:97
#31 0x0000000000697ecc in __lambda1::operator() (__closure=0xffff900002007000) at libc/cxa_thread_atexit.cc:69
#32 0x000000000069804c in std::_Function_handler<void(), register_call_destructor()::__lambda1>::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/4.8/functional:2071
#33 0x000000000044c3c0 in std::function<void ()>::operator()() const (this=0xffff900002008010)
    at /usr/include/c++/4.8/functional:2471
#34 0x00000000005bbde0 in sched::run_exit_notifiers () at core/sched.cc:864
#35 0x00000000005bc9b1 in sched::thread::complete (this=0xffff800005af5050) at core/sched.cc:1081
#36 0x00000000005b9051 in sched::thread_main_c (t=0xffff800005af5050) at arch/x64/arch-switch.hh:165
#37 0x000000000048b626 in thread_main () at arch/x64/entry.S:113

Next try should be trying to get tcp_do_segment failure without using NFS, I guess. And maybe even with less that 1000 trials.

@nyh
Copy link
Contributor

nyh commented Feb 18, 2016

Thanks, as I said above I am guessing this is an actual bug in our TCP/IP code, and not caused by NFS, although if you try it without NFS you need to try some other type of network traffic which is somehow similar to the NFS workload (have a bunch of TCP connections in multiple threads, and then try to close all of them). This could be related to an SMP race (closing many connections in parallel, etc.) but also might not be related to SMP at all but rather some sort of TCP protocol race (e.g., a retransmitted packet arrives after we closed the socket, or something). I'll need to do some more serious debugging to try to fix this issue.

@justinc1
Copy link
Contributor Author

I was trying to repeat bug without using NFS. I opened multiple connection to TCP server, and TCP server was then pushing data to client, and than client terminates. But issue didn't show up.

@nyh
Copy link
Contributor

nyh commented Sep 19, 2016

@DerangedMonkeyNinja submitted a patch to the maiiling list, titled "tcp_input: net channel state fix", which may fix this bug.

He also made an interesting observation: The KASSERT is compiled out in release mode. So the fact that @justinc1 used debug mode in the example above was actually important for reproducing this bug. At least in some of the runs by @benoit-canet above he used release mode, which might be why he didn't see this bug.

@justinc1
Copy link
Contributor Author

justinc1 commented Nov 4, 2016

I repeated the test with patch applied (e.g. current master), and the issue is still there. The backtrace is nearly identical, I'm adding it just FYI:

PP 4 tst-file-read.c main EXIT ret=0
Assertion failed: tp->get_state() > 1 (bsd/sys/netinet/tcp_input.cc: tcp_do_segment: 1076)

[backtrace]
0x000000000022bc39 <abort(char const*, ...)+270>
0x000000000022bc9a <__assert_fail+58>
0x00000000002a56f9 <???+2774777>
0x00000000002aa0fc <???+2793724>
0x00000000002aa1c7 <???+2793927>
0x00000000002aab48 <???+2796360>
0x0000000000611152 <std::function<void (mbuf*)>::operator()(mbuf*) const+72>
0x000000000061002c <net_channel::process_queue()+32>
0x0000000000243686 <???+2373254>
0x000000000024399b <soclose+786>
0x000000000025375f <socket_file::close()+47>
0x000000000065b2d2 <fdrop+145>
0x000000000065ade7 <fdclose(int)+163>
0x0000000000633cf2 <close+35>
0x00000000006f2e1e <rpc_destroy_context+318>
0xffff800005e8e03f <???+99147839>

GDB:

(gdb) bt
#0  processor::cli_hlt () at arch/x64/processor.hh:248
#1  0x0000000000209613 in arch::halt_no_interrupts () at arch/x64/arch.hh:48
#2  0x000000000049c17a in osv::halt () at arch/x64/power.cc:24
#3  0x000000000022bc60 in abort (fmt=0x9a4760 "Assertion failed: %s (%s: %s: %d)\n") at runtime.cc:130
#4  0x000000000022bc9b in __assert_fail (expr=0x9b6e36 "tp->get_state() > 1", file=0x9b6afc "bsd/sys/netinet/tcp_input.cc", line=1076, func=0x9b74c9 <tcp_do_segment(mbuf*, tcphdr*, socket*, tcpcb*, int, int, unsigned char, int, bool&)::__func__> "tcp_do_segment") at runtime.cc:137
#5  0x00000000002a56fa in tcp_do_segment (m=0xffffa000023f5c00, th=0xffff8000062f502e, so=0xffffa00002525400, tp=0xffffa0000250a800, drop_hdrlen=66, tlen=732, iptos=0 '\000', ti_locked=2, want_close=@0x2000007ff52e: false) at bsd/sys/netinet/tcp_input.cc:1075
#6  0x00000000002aa0fd in tcp_net_channel_packet (tp=0xffffa0000250a800, m=0xffffa000023f5c00) at bsd/sys/netinet/tcp_input.cc:3210
#7  0x00000000002aa1c8 in __lambda6::operator() (__closure=0xffffa00003b2b578, m=0xffffa000023f5c00) at bsd/sys/netinet/tcp_input.cc:3229
#8  0x00000000002aab49 in std::_Function_handler<void(mbuf*), tcp_setup_net_channel(tcpcb*, ifnet*)::__lambda6>::_M_invoke(const std::_Any_data &, mbuf *) (__functor=..., __args#0=0xffffa000023f5c00) at /usr/include/c++/4.8.2/functional:2071
#9  0x0000000000611153 in std::function<void (mbuf*)>::operator()(mbuf*) const (this=0xffff900002495000, __args#0=0xffffa000023f5c00) at /usr/include/c++/4.8.2/functional:2471
#10 0x000000000061002d in net_channel::process_queue (this=0xffff900002495000) at core/net_channel.cc:37
#11 0x0000000000243687 in flush_net_channel (so=0xffffa00002525400) at bsd/sys/kern/uipc_socket.cc:587
#12 0x000000000024399c in soclose (so=0xffffa00002525400) at bsd/sys/kern/uipc_socket.cc:657
#13 0x0000000000253760 in socket_file::close (this=0xffffa00002462300) at bsd/sys/kern/sys_socket.cc:383
#14 0x000000000065b2d3 in fdrop (fp=0xffffa00002462300) at fs/vfs/kern_descrip.cc:211
#15 0x000000000065ade8 in fdclose (fd=11) at fs/vfs/kern_descrip.cc:94
#16 0x0000000000633cf3 in close (fd=11) at fs/vfs/main.cc:235
#17 0x00000000006f2e1f in rpc_destroy_context (rpc=0xffff800005d87040) at init.c:284
#18 0x00000000006e9513 in nfs_destroy_context (nfs=0xffffa000024de700) at libnfs.c:445
#19 0x00000000006de7f9 in std::unique_ptr<nfs_context, void (*)(nfs_context*)>::~unique_ptr (this=0xffffa0000251d690, __in_chrg=<optimized out>) at /usr/include/c++/4.8.2/bits/unique_ptr.h:184
#20 0x00000000006df482 in mount_context::~mount_context (this=0xffffa0000251d680, __in_chrg=<optimized out>) at fs/nfs/nfs.hh:32
#21 0x00000000006df4b6 in std::default_delete<mount_context>::operator() (this=0xffffa00002584850, __ptr=0xffffa0000251d680) at /usr/include/c++/4.8.2/bits/unique_ptr.h:67
#22 0x00000000006dee1f in std::unique_ptr<mount_context, std::default_delete<mount_context> >::~unique_ptr (this=0xffffa00002584850, __in_chrg=<optimized out>) at /usr/include/c++/4.8.2/bits/unique_ptr.h:184
#23 0x00000000006e05ee in std::pair<std::string const, std::unique_ptr<mount_context, std::default_delete<mount_context> > >::~pair (this=0xffffa00002584848, __in_chrg=<optimized out>) at /usr/include/c++/4.8.2/bits/stl_pair.h:96
#24 0x00000000006e086a in std::__detail::_Hash_node<std::pair<std::string const, std::unique_ptr<mount_context, std::default_delete<mount_context> > >, true>::~_Hash_node (this=0xffffa00002584840, __in_chrg=<optimized out>) at /usr/include/c++/4.8.2/bits/hashtable_policy.h:170
#25 0x00000000006e0888 in __gnu_cxx::new_allocator<std::__detail::_Hash_node<std::pair<std::string const, std::unique_ptr<mount_context, std::default_delete<mount_context> > >, true> >::destroy<std::__detail::_Hash_node<std::pair<std::string const, std::unique_ptr<mount_context, std::default_delete<mount_context> > >, true> > (this=0xffff9000055aa688, __p=0xffffa00002584840) at /usr/include/c++/4.8.2/ext/new_allocator.h:124
#26 0x00000000006e03af in std::_Hashtable<std::string, std::pair<std::string const, std::unique_ptr<mount_context, std::default_delete<mount_context> > >, std::allocator<std::pair<std::string const, std::unique_ptr<mount_context, std::default_delete<mount_context> > > >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_deallocate_node (this=0xffff9000055aa678, __n=0xffffa00002584840) at /usr/include/c++/4.8.2/bits/hashtable.h:746
#27 0x00000000006dfe15 in std::_Hashtable<std::string, std::pair<std::string const, std::unique_ptr<mount_context, std::default_delete<mount_context> > >, std::allocator<std::pair<std::string const, std::unique_ptr<mount_context, std::default_delete<mount_context> > > >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_deallocate_nodes (this=0xffff9000055aa678, __n=0x0) at /usr/include/c++/4.8.2/bits/hashtable.h:763
#28 0x00000000006df6f8 in std::_Hashtable<std::string, std::pair<std::string const, std::unique_ptr<mount_context, std::default_delete<mount_context> > >, std::allocator<std::pair<std::string const, std::unique_ptr<mount_context, std::default_delete<mount_context> > > >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear (this=0xffff9000055aa678) at /usr/include/c++/4.8.2/bits/hashtable.h:1641
#29 0x00000000006df1e2 in std::_Hashtable<std::string, std::pair<std::string const, std::unique_ptr<mount_context, std::default_delete<mount_context> > >, std::allocator<std::pair<std::string const, std::unique_ptr<mount_context, std::default_delete<mount_context> > > >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable (this=0xffff9000055aa678, __in_chrg=<optimized out>) at /usr/include/c++/4.8.2/bits/hashtable.h:958
#30 0x00000000006e0a9c in std::unordered_map<std::string, std::unique_ptr<mount_context, std::default_delete<mount_context> >, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::unique_ptr<mount_context, std::default_delete<mount_context> > > > >::~unordered_map (this=0xffff9000055aa678, __in_chrg=<optimized out>) at /usr/include/c++/4.8.2/bits/unordered_map.h:97
#31 0x00000000006fa489 in (anonymous namespace)::run(void*) ()
#32 0x0000000000687258 in pthread_private::run_tsd_dtors () at libc/pthread.cc:68
#33 0x0000000000687297 in pthread_private::__lambda5::operator() (__closure=0xffff900001e89000) at libc/pthread.cc:78
#34 0x0000000000689ad3 in std::_Function_handler<void(), pthread_private::pthread_register_tsd_dtor_notifier()::__lambda5>::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/4.8.2/functional:2071
#35 0x000000000044aa42 in std::function<void ()>::operator()() const (this=0xffff900001e8a010) at /usr/include/c++/4.8.2/functional:2471
#36 0x00000000005be570 in sched::run_exit_notifiers () at core/sched.cc:984
#37 0x00000000005bf147 in sched::thread::complete (this=0xffff800005d81040) at core/sched.cc:1205
#38 0x00000000005bf285 in sched::thread::exit () at core/sched.cc:1239
#39 0x0000000000688fde in pthread_exit (retval=0x0) at libc/pthread.cc:906
#40 0x0000100005c00f7a in ?? ()
#41 0x0000000000000000 in ?? ()

@justinc1
Copy link
Contributor Author

justinc1 commented Nov 4, 2016

With test below I was able to get crash 6 times in 31 tries. Appending the file as I tend to forgot what I did last time to trigger the problem. It still uses NFS, but multiple wokers are started with pthreads, not in elf namespaces.

gcc -o tst-file-read-pt.so tst-file-read-pt.c -shared -fPIC -ggdb -lpthread

sudo ./scripts/run.py -nvd -e '/tools/mount-nfs.so nfs://192.168.122.1/ggg/?uid=0 /fff ; /tst-file-read-pt.so 10 30'

tst-file-read-pt.c.txt

@justinc1
Copy link
Contributor Author

This seems as duplicate of #454

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants