Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

golang-pie-httpserver crashes on control-C #1026

Closed
nyh opened this issue Feb 24, 2019 · 7 comments
Closed

golang-pie-httpserver crashes on control-C #1026

nyh opened this issue Feb 24, 2019 · 7 comments

Comments

@nyh
Copy link
Contributor

nyh commented Feb 24, 2019

This is a spinoff from issue #352.

If we run

$ scripts/build image=golang-pie-httpserver
$ scripts/run.py

This seems to work correctly despite issue #352. At least no obvious problem.
But there is a crash if we click on control-C to kill the server:

page fault outside application, addr: 0x0000000000000000
[registers]
RIP: 0x000000000039a462 <syscall_entry+18>
RFL: 0x0000000000010286  CS:  0x0000000000000008  SS:  0x0000000000000010
RAX: 0x000000000000000e  RBX: 0x000000c00002e000  RCX: 0x00001000001f6eda  RDX: 0x000000c00002e080
RSI: 0x0000000000000000  RDI: 0x0000000000000002  RBP: 0xffff8000032eee70  R8:  0xffff800000a24501
R9:  0xffff800000a24530  R10: 0x0000000000000008  R11: 0x0000000000000286  R12: 0xffff800001d43040
R13: 0xffff800000a24530  R14: 0x0000000000000008  R15: 0x0000000000000286  RSP: 0x0000000000000000
Aborted

[backtrace]
0x000000000033fbc8 <???+3406792>
0x00000000003410aa <mmu::vm_fault(unsigned long, exception_frame*)+458>
0x000000000039a625 <page_fault+117>
0x00000000003994a6 <???+3773606>
0x00001000001dde94 <???+1957524>
0x00001000001ce031 <???+1892401>
0x00001000001ddc56 <???+1956950>
0x00001000001dd545 <???+1955141>
0x00001000001f6fc2 <???+2060226>
0x00000000003f3116 <thread_main_c+38>
0x000000000039a422 <???+3777570>

Unfortunately gdb gives us this:

#0  0x00000000003a0572 in processor::cli_hlt () at arch/x64/processor.hh:247
#1  arch::halt_no_interrupts () at arch/x64/arch.hh:48
#2  osv::halt () at arch/x64/power.cc:24
#3  0x000000000023ebae in abort (fmt=fmt@entry=0x622b33 "Aborted\n")
    at runtime.cc:132
#4  0x000000000020281e in abort () at runtime.cc:98
#5  0x000000000033fbc9 in mmu::vm_sigsegv (ef=0xffff8000032de068, 
    addr=<optimized out>) at core/mmu.cc:1316
#6  mmu::vm_sigsegv (addr=<optimized out>, ef=0xffff8000032de068)
    at core/mmu.cc:1310
#7  0x00000000003410ab in mmu::vm_fault (addr=addr@entry=0, 
    ef=ef@entry=0xffff8000032de068) at core/mmu.cc:1338
#8  0x000000000039a626 in page_fault (ef=0xffff8000032de068)
    at arch/x64/mmu.cc:38
#9  <signal handler called>
../../gdb/dwarf2-frame.c:1061: internal-error: Unknown CFA rule.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n)

I suspect the second problem comes from bad Dwarf CFA instructions we added for the SYSCALL instruction and didn't debug thoroughly well enough.

As for the crash itself, my guess is that maybe a SYSCALL inside a signal handler doesn't work. Did we ever test that? The crashing instruction inside syscall_entry is:

  39a459:       64 48 8b 24 25 10 00    mov    %fs:0x10,%rsp
  39a460:       00 00 
  39a462:       48 83 3c 24 00          cmpq   $0x0,(%rsp)

The syscall_entry code seems to have that tiny/large syscall stack trickery, and maybe it somehow breaks inside a signal handler? Maybe the PIE has something to do with this problem?

We can try a signal handler in Go which does not involve a shutdown, to at least get shutdown issues out of the equation.

@wkozaczuk
Copy link
Collaborator

@nyh What do you mean by "signal handler in Go which does not involve a shutdown"?

@nyh
Copy link
Contributor Author

nyh commented Feb 27, 2019

I was assuming (but it's worth verifying too...) that the Go application captures that SIGINT, and then this case involves two things: a signal handler (written in Go?) and a shutdown from that signal handler. I don't know which of the two things causes this crash. It could be a crash caused by incorrect handling of system calls (e.g., exit()) inside a signal handler. It could be a crash caused by exit() in parallel with other threads running other system calls. Or it can be other things... In short, this needs debugging....

@wkozaczuk
Copy link
Collaborator

Here is is the crux of the problem: it looks Golang sets up a signal handler that gets run on OSv system thread (not an app thread) which obviously has no syscall stack setup (see RSP with 0 value as a result of it). That is why we get page fault accessing a value on non-existent stack.

It looks like this patch fixes it:

diff --git a/arch/x64/entry.S b/arch/x64/entry.S
index 0c7164e0..5a64c7dd 100644
--- a/arch/x64/entry.S
+++ b/arch/x64/entry.S
@@ -174,6 +174,10 @@ syscall_entry:
     # on syscall stack pointed by address in TCB (%fs:16) - double memory dereference.
     # Therefore we are forced to save caller stack address in a field in TCB.
     movq %rsp, %fs:24 # syscall_caller_stack_pointer
+
+    cmpq $0, %fs:16
+    jz large_stack_has_been_setup
+
     #
     # Switch stack to "tiny" syscall stack that should be large
     # enough to setup "large" syscall stack (only when first SYSCALL on this thread)

Now when I do ctrl-C I get this:

OSv v0.53.0-44-g06ce8e00
eth0: 192.168.122.15
Booted up in 138.36 ms
WARNING: /httpserver.so is a PIE using TLS. This is currently unsupported (see issue #352). Link with '-shared' instead of '-pie'.
Go version: go1.12.6
syscall(): unimplemented system call 39
syscall(): unimplemented system call 234

The 39 syscall (getpid) is easy to fix. But what about 234 (sys_tgkill) - how shall we handle it?

@nyh
Copy link
Contributor Author

nyh commented Jul 1, 2019

Good catch. Your patch seems reasonable (if there is no "tiny syscall stack", don't switch stacks at all, hoping this thread has a big enough stack already). There's perhaps another option we can consider, which is to make the thread running the signal handler (see libc/signal.cc, sched::thread::make() call in kill()) be tagged an application thread, and get the normal syscall stack handling.

About tgkill(), I don't know what this is trying to do (send a signal to a specific thread? a thread group? what is this signal supposed to do?) so I don't know what is the minimal support we can add to make sense...

@nyh
Copy link
Contributor Author

nyh commented Jul 3, 2019

@wkozaczuk If I understand correctly, the crash originally reported here was fixed by commit 33bea60, right? If so, what are the symptoms now?

@wkozaczuk
Copy link
Collaborator

wkozaczuk commented Jul 3, 2019 via email

@wkozaczuk
Copy link
Collaborator

Closing this issue as the commit 33bea60 technically fixed this original issue. The related issue #1047 has been opened instead, to track new different problem exposed by fixing this original issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants