closed socket, pending packets and rtx timer results in assertion failure in tcp_do_segment #454

Open
tgrabiec opened this issue Aug 13, 2014 · 5 comments

Comments

@tgrabiec
Member

I got this after running workloadf on my laptop.

Assertion failed: tp->get_state() > 1 (/home/tgrabiec/src/osv/bsd/sys/netinet/tcp_input.cc: tcp_do_segment: 1076)
#0  0x00000000003e0692 in cli_hlt () at /home/tgrabiec/src/osv/arch/x64/processor.hh:242
#1  halt_no_interrupts () at /home/tgrabiec/src/osv/arch/x64/arch.hh:48
#2  osv::halt () at /home/tgrabiec/src/osv/core/power.cc:34
#3  0x00000000002232a5 in abort (fmt=fmt@entry=0x5e7140 "Assertion failed: %s (%s: %s: %d)\n") at /home/tgrabiec/src/osv/runtime.cc:143
#4  0x00000000002232e9 in __assert_fail (expr=<optimized out>, file=<optimized out>, line=<optimized out>, func=<optimized out>) at /home/tgrabiec/src/osv/runtime.cc:149
#5  0x0000000000270470 in tcp_do_segment (m=m@entry=0xffffa0017f20e500, th=th@entry=0xffff800140fef02e, so=so@entry=0xffffa00109e84a00, tp=tp@entry=0xffffa00130062800, drop_hdrlen=0x42, tlen=tlen@entry=0x3d, iptos=iptos@entry=0x0, ti_locked=0x2, ti_locked@entry=0x1, want_close=@0xffff80010059deb0: 0x0) at /home/tgrabiec/src/osv/bsd/sys/netinet/tcp_input.cc:1075
#6  0x0000000000271f4f in tcp_net_channel_packet (m=0xffffa0017f20e500, tp=0xffffa00130062800) at /home/tgrabiec/src/osv/bsd/sys/netinet/tcp_input.cc:3210
#7  operator() (m=0xffffa0017f20e500, __closure=<optimized out>) at /home/tgrabiec/src/osv/bsd/sys/netinet/tcp_input.cc:3229
#8  std::_Function_handler<void(mbuf*), tcp_setup_net_channel(tcpcb*, ifnet*)::__lambda7>::_M_invoke(const std::_Any_data &, mbuf *) (__functor=..., __args#0=0xffffa0017f20e500) at /home/tgrabiec/src/osv/external/x64/gcc.bin/usr/include/c++/4.8.2/functional:2071
#9  0x00000000003e5c5f in operator() (__args#0=<optimized out>, this=0xffff90012ff0b000) at /home/tgrabiec/src/osv/external/x64/gcc.bin/usr/include/c++/4.8.2/functional:2464
#10 net_channel::process_queue (this=0xffff90012ff0b000) at /home/tgrabiec/src/osv/core/net_channel.cc:37
#11 0x000000000027ba00 in tcp_timer_rexmt (timer=..., tp=0xffffa00130062800) at /home/tgrabiec/src/osv/bsd/sys/netinet/tcp_timer.cc:478
#12 0x00000000003ea582 in async::timer_task::fire (this=this@entry=0xffffa00103132a10, task=...) at /home/tgrabiec/src/osv/core/async.cc:360
#13 0x00000000003eb36b in fire (task=..., this=0xffff800100588040) at /home/tgrabiec/src/osv/core/async.cc:227
#14 async::async_worker::run (this=0xffff800100588040) at /home/tgrabiec/src/osv/core/async.cc:175
#15 0x00000000003caa0b in main (this=0xffff800100588740) at /home/tgrabiec/src/osv/core/sched.cc:935
#16 sched::thread_main_c (t=0xffff800100588740) at /home/tgrabiec/src/osv/arch/x64/arch-switch.hh:137
#17 0x000000000037a616 in thread_main () at /home/tgrabiec/src/osv/arch/x64/entry.S:113

The inp is dropped at this point:

gdb$ p inp->inp_flags
$10 = 0x4000000

The problem is that the retransmission timer fires after the socket has been closed (which is OK) while there are still unprocessed packets in the net channel. Those packets take the fast path into tcp_do_segment. Looking at tcp_input(), when the socket is in the closed state it sends an RST and drops the segment before it can reach tcp_do_segment. I think we should replicate this in our net channel fast path too. I will try to come up with a patch for that.
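
To make the proposed direction concrete, here is a minimal, self-contained sketch of such a guard. It is not OSv code: `connection`, `packet`, `do_segment`, `send_reset`, `fast_path_packet` and `process_queue` are hypothetical stand-ins invented for illustration, and the reset is only counted rather than sent. The only values taken from this report are the flag `0x4000000` seen in `inp_flags` and the `state > 1` condition from the failed assertion.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Hypothetical, simplified stand-ins; these are NOT the OSv/BSD definitions.
enum class tcp_state { closed = 0, listen = 1, established = 4 };

// Flag value observed in the gdb output above (inp_flags == 0x4000000).
constexpr std::uint32_t inp_dropped_flag = 0x4000000;

struct packet {
    int id;
};

struct connection {
    tcp_state state = tcp_state::established;
    std::uint32_t inp_flags = 0;
    int segments_processed = 0;
    int resets_sent = 0;
};

// Stand-in for tcp_do_segment(): mirrors the assertion from the report,
// which requires a state beyond "listen".
void do_segment(connection& c, const packet&) {
    assert(static_cast<int>(c.state) > 1);
    ++c.segments_processed;
}

// Stand-in for answering with an RST; here it only counts the event.
void send_reset(connection& c, const packet&) {
    ++c.resets_sent;
}

// Fast-path entry with the guard: a closed or dropped connection never
// reaches do_segment(), so the assertion cannot fire for it.
void fast_path_packet(connection& c, const packet& p) {
    if (c.state == tcp_state::closed || (c.inp_flags & inp_dropped_flag)) {
        send_reset(c, p);
        return;
    }
    do_segment(c, p);
}

// Stand-in for draining the net channel queue, e.g. from a timer callback.
void process_queue(connection& c, std::deque<packet>& q) {
    while (!q.empty()) {
        packet p = q.front();
        q.pop_front();
        fast_path_packet(c, p);
    }
}
```

In this model, packets queued before the close are still drained when the timer runs, but they are answered with a reset instead of being handed to segment processing. Whether the real fix should send an RST or silently drop, and which locks must be held while checking the state, are open questions this sketch does not address.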

@copumpkin

@slivne @tgrabiec Did this get resolved in 0.14? I think an elusive bug we've been encountering is caused by this or something that closely resembles it 😦

@nyh
Contributor

nyh commented Sep 24, 2015

Unfortunately, if it had been resolved by a specific commit, this commit would have been mentioned here, and the issue would have been automatically closed. So I'm afraid that unless we were lucky and some other fix fixed it, this bug is still open :-(

@copumpkin

@nyh thanks for the prompt response!

What does fixing it look like? Are all you folks busy with the ScyllaDB release or is this sort of thing still a priority? I have minimal knowledge to help technically here, but perhaps I can help in other ways? It's unclear from the post if @tgrabiec has a simple repro of the issue, but if he doesn't, perhaps I can get one.

@tgrabiec
Member Author

@copumpkin It's been a while, but from what I remember work on this was preempted by something else. I don't have a reproducer around, unfortunately.

@ohpauleez

Is there any insight into whether this is better addressed in the RCU-related changes (introduced in #383, also referenced in #378), or is the advice above sound, i.e. simply replicating the RST behavior in the net channel fast path?
