Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIGHUP leads to SIGSEGV #188

Closed
redbaron opened this issue Jun 22, 2016 · 21 comments
Closed

SIGHUP leads to SIGSEGV #188

redbaron opened this issue Jun 22, 2016 · 21 comments

Comments

@redbaron
Copy link

Idling carbon-c-relay with no incoming traffic crashes with SIGSEGVwhen SIGHUP is sent.

Using config:

configuration:
    relay hostname = 4b8605089887
    listen port = 2004
    workers = 4
    send batch size = 2500
    server queue size = 25000
    server max stalls = 4
    statistics submission interval = 60s
    listen backlog = 512
    server connection IO timeout = 600ms
    daemon = true
    routes configuration = /etc/carbon-c-relay.conf

parsed configuration follows:
cluster covalent
    fnv1a_ch replication 1
        4b8605089887:2003
    ;

match *
    send to covalent
    stop
    ;

I noticed that whole handling logic is done right inside in handler. realistically it can never be done right, ther would always be unexpected race conditions, non-reetrant functions called etc, simply because signal handler can be called in the middle of any function, even deep down in glibc.

What signal handlers should do is set global flag and then exit. main loop in a program should check for flag and invoke actual handling logic.More details on recommended practices of signal handling can be found here:
https://www.gnu.org/software/libc/manual/html_node/Nonreentrancy.html#Nonreentrancy

@grobian
Copy link
Owner

grobian commented Jun 22, 2016

That's a good suggestion.

Do you have a backtrace perhaps? What version was this observed with?

@redbaron
Copy link
Author

I am compiling from tag v2.1

@redbaron
Copy link
Author

It is reproducible 100% of the time on my machine, I'll get stacktraces

@redbaron
Copy link
Author

Stacktrace:

(gdb) bt
#0  0x00007f21d124d213 in pthread_join () from /lib64/libpthread.so.0
#1  0x0000000000414c16 in server_free (s=0x97c870) at server.c:643
#2  0x0000000000411112 in router_free (rtr=0x97c740) at router.c:2584
#3  0x00000000004027cd in hup_handler (sig=1) at relay.c:181
#4  <signal handler called>
#5  0x00007f21d0f5dbdd in nanosleep () from /lib64/libc.so.6
#6  0x00007f21d0f5da50 in sleep () from /lib64/libc.so.6
#7  0x00000000004040a5 in main (argc=9, argv=0x7ffe75400c98) at relay.c:750

@grobian
Copy link
Owner

grobian commented Jun 22, 2016

logic error introduced after retaining servers upon config reloading, please test to confirm it fixes your issue.

@redbaron
Copy link
Author

redbaron commented Jun 22, 2016

Thanks for quick fix, I'll check it. Nevertheless, signal handling should be reduced to setting a flag, having anything else there is just asking for problems

EDIT: or use things like signalfd

@redbaron
Copy link
Author

redbaron commented Jun 22, 2016

It fixed a given issue, where single SIGHUP was killing relay 100% of the time. Sending signal continue to be unsafe. Here is small python script:

#!/usr/bin/pyth
import os, sys, signal

pid = int(sys.argv[1])
count = 1
while True:
  print count
  os.kill(pid, signal.SIGHUP)
  count = count+1

Call it:

/tmp/smash.py $(pgrep carbon-c-relay)

Tail of the output:

1580
1581
1582
1583
Traceback (most recent call last):
  File "/tmp/smash.py", line 9, in <module>
    os.kill(pid, signal.SIGHUP)
OSError: [Errno 3] No such process
[root@26ba1b5bd4fb /]# cat /tmp/smash.py

Stacktrace:

Program terminated with signal 11, Segmentation fault.
#0  0x00007fd9c0d1a913 in fflush () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.192.el6.x86_64
(gdb) bt
#0  0x00007fd9c0d1a913 in fflush () from /lib64/libc.so.6
#1  0x0000000000402528 in relaylog ()
#2  0x00000000004039c8 in hup_handler ()
#3  <signal handler called>
#4  0x00007fd9c0d60bdd in nanosleep () from /lib64/libc.so.6
#5  0x00007fd9c0d95d94 in usleep () from /lib64/libc.so.6
#6  0x000000000040644c in dispatch_runner ()
#7  0x00007fd9c104faa1 in start_thread () from /lib64/libpthread.so.0
#8  0x00007fd9c0d9caad in clone () from /lib64/libc.so.6

@Civil
Copy link

Civil commented Jun 22, 2016

@redbaron can you check if the same stuff happens if my commit (in my fork) is applied?

@grobian
Copy link
Owner

grobian commented Jun 22, 2016

Isn't this just a race that's because of SIGHUP spamming? e.g. would unsetting the handler while we reload do the right thing?

@redbaron
Copy link
Author

redbaron commented Jun 22, 2016

@Civil , I'll check, looks sane, few notes straight away:

EDIT:

  • I'd reset flag to 0 as the very first step of hup handler too

@redbaron
Copy link
Author

Isn't this just a race that's because of SIGHUP spamming? e.g. would unsetting the handler while we reload do the right thing?

In stack trace there is only one hup_handler so it is not because of spamming

@Civil
Copy link

Civil commented Jun 22, 2016

@redbaron fixed for the reload handler, will make the same stuff for other handlers tomorrow (if Fabian won't do that by himself).

@Civil
Copy link

Civil commented Jun 22, 2016

I'm also not really sure that exit_handler should be that complex. Just set "exit" to, for example, signal name and perform exit, ignoring other signals.

@grobian he's actually right. I remember getting a lot of errors some time ago also related to sighup handling. It might also be related and anyway that's sort of a best practice.

grobian added a commit that referenced this issue Jun 22, 2016
Hook the hup_handler on a trigger after receiving SIGHUP so we don't
have to unset any signals, and can deal with receiving multiple SIGHUPs
without causing race conditions in the code.  Continuation of issue #188.
@grobian
Copy link
Owner

grobian commented Jun 22, 2016

I already had a commit in progress, it sorts the somewhat artificial HUP-bombing

@grobian grobian closed this as completed Jun 22, 2016
@redbaron
Copy link
Author

@grobian look at my last stacktrace, problem appeared in fflush() which was called from relaylog. It is NOT because handler was called while previous handler was still executing.

Looking at commit 5a453a7 , exit_handler continues to call logerr() which is relaylog(). Until signal handlers do nothing more than setting few flags and NOT call any libc() function there will be always bugs like that lurking around.

@grobian
Copy link
Owner

grobian commented Jun 22, 2016

huphandler re-opens streams, might be a race in there still, because the pointer gets unset

@grobian
Copy link
Owner

grobian commented Jun 22, 2016

added short sleep to mitigate this race.

@Civil
Copy link

Civil commented Jun 22, 2016

@grobian is there any particular reason why you still want to write to stdout while in signal handler? @redbaron is right about possible error's that'll be cause by possible calls to other functions from the handler. So the bug will never be properly fixed, until you'll make the handlers only to set appropriate flags.

P.S. And also why not to use sig_atomic_t? It's in C89 standard so it's portable enough.

@grobian
Copy link
Owner

grobian commented Jun 23, 2016

NAME
     signal — simplified software signal facilities

LIBRARY
     Standard C Library (libc, -lc)

SYNOPSIS
     #include <signal.h>

     void (*signal(int sig, void (*func)(int)))(int);

and

     The following functions are either reentrant or not interruptible by sig‐
     nals and are async-signal safe.  Therefore applications may invoke them,
     without restriction, from signal-catching functions:

     Base Interfaces:

     _exit(), access(), alarm(), cfgetispeed(), cfgetospeed(), cfsetispeed(),
     cfsetospeed(), chdir(), chmod(), chown(), close(), creat(), dup(),
     dup2(), execle(), execve(), fcntl(), fork(), fpathconf(), fstat(),
     fsync(), getegid(), geteuid(), getgid(), getgroups(), getpgrp(),
     getpid(), getppid(), getuid(), kill(), link(), lseek(), mkdir(),
     mkfifo(), open(), pathconf(), pause(), pipe(), raise(), read(), rename(),
     rmdir(), setgid(), setpgid(), setsid(), setuid(), sigaction(),
     sigaddset(), sigdelset(), sigemptyset(), sigfillset(), sigismember(),
     signal(), sigpending(), sigprocmask(), sigsuspend(), sleep(), stat(),
     sysconf(), tcdrain(), tcflow(), tcflush(), tcgetattr(), tcgetpgrp(),
     tcsendbreak(), tcsetattr(), tcsetpgrp(), time(), times(), umask(),
     uname(), unlink(), utime(), wait(), waitpid(), write().

I thought I used write, but it appears I use the stream-based variants, so indeed an AS race can occur.

@Civil
Copy link

Civil commented Jun 23, 2016

@grobian my point is that it seems to be there is no lose in functionality if you'll make 2 lines signal handlers functions and move all logout/logerr to appropriate functions that'll be called in the code. That'll close this topic once and for all and you'll always will be sure that signal handling doesn't have any bugs.

@grobian
Copy link
Owner

grobian commented Jun 23, 2016

yes, as my previous comment said, that's the only way ;)

grobian added a commit that referenced this issue Jun 23, 2016
In order to be AS-Safe, just set a pending flag on receiving a signal
which the main loop handles.  This should avoid any race condition that
can take place by deferring into MT-Safe land.  This should fix the
issue #188 at last.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants