SIGHUP leads to SIGSEGV #188

redbaron · 2016-06-22T11:22:33Z

Idling carbon-c-relay with no incoming traffic crashes with SIGSEGVwhen SIGHUP is sent.

Using config:

configuration:
    relay hostname = 4b8605089887
    listen port = 2004
    workers = 4
    send batch size = 2500
    server queue size = 25000
    server max stalls = 4
    statistics submission interval = 60s
    listen backlog = 512
    server connection IO timeout = 600ms
    daemon = true
    routes configuration = /etc/carbon-c-relay.conf

parsed configuration follows:
cluster covalent
    fnv1a_ch replication 1
        4b8605089887:2003
    ;

match *
    send to covalent
    stop
    ;

I noticed that whole handling logic is done right inside in handler. realistically it can never be done right, ther would always be unexpected race conditions, non-reetrant functions called etc, simply because signal handler can be called in the middle of any function, even deep down in glibc.

What signal handlers should do is set global flag and then exit. main loop in a program should check for flag and invoke actual handling logic.More details on recommended practices of signal handling can be found here:
https://www.gnu.org/software/libc/manual/html_node/Nonreentrancy.html#Nonreentrancy

The text was updated successfully, but these errors were encountered:

grobian · 2016-06-22T12:26:04Z

That's a good suggestion.

Do you have a backtrace perhaps? What version was this observed with?

redbaron · 2016-06-22T12:37:37Z

I am compiling from tag v2.1

redbaron · 2016-06-22T12:48:37Z

It is reproducible 100% of the time on my machine, I'll get stacktraces

redbaron · 2016-06-22T12:56:25Z

Stacktrace:

(gdb) bt
#0  0x00007f21d124d213 in pthread_join () from /lib64/libpthread.so.0
#1  0x0000000000414c16 in server_free (s=0x97c870) at server.c:643
#2  0x0000000000411112 in router_free (rtr=0x97c740) at router.c:2584
#3  0x00000000004027cd in hup_handler (sig=1) at relay.c:181
#4  <signal handler called>
#5  0x00007f21d0f5dbdd in nanosleep () from /lib64/libc.so.6
#6  0x00007f21d0f5da50 in sleep () from /lib64/libc.so.6
#7  0x00000000004040a5 in main (argc=9, argv=0x7ffe75400c98) at relay.c:750

#188

grobian · 2016-06-22T13:09:59Z

logic error introduced after retaining servers upon config reloading, please test to confirm it fixes your issue.

redbaron · 2016-06-22T13:31:24Z

Thanks for quick fix, I'll check it. Nevertheless, signal handling should be reduced to setting a flag, having anything else there is just asking for problems

EDIT: or use things like signalfd

redbaron · 2016-06-22T13:52:57Z

It fixed a given issue, where single SIGHUP was killing relay 100% of the time. Sending signal continue to be unsafe. Here is small python script:

#!/usr/bin/pyth
import os, sys, signal

pid = int(sys.argv[1])
count = 1
while True:
  print count
  os.kill(pid, signal.SIGHUP)
  count = count+1

Call it:

/tmp/smash.py $(pgrep carbon-c-relay)

Tail of the output:

1580
1581
1582
1583
Traceback (most recent call last):
  File "/tmp/smash.py", line 9, in <module>
    os.kill(pid, signal.SIGHUP)
OSError: [Errno 3] No such process
[root@26ba1b5bd4fb /]# cat /tmp/smash.py

Stacktrace:

Program terminated with signal 11, Segmentation fault.
#0  0x00007fd9c0d1a913 in fflush () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.192.el6.x86_64
(gdb) bt
#0  0x00007fd9c0d1a913 in fflush () from /lib64/libc.so.6
#1  0x0000000000402528 in relaylog ()
#2  0x00000000004039c8 in hup_handler ()
#3  <signal handler called>
#4  0x00007fd9c0d60bdd in nanosleep () from /lib64/libc.so.6
#5  0x00007fd9c0d95d94 in usleep () from /lib64/libc.so.6
#6  0x000000000040644c in dispatch_runner ()
#7  0x00007fd9c104faa1 in start_thread () from /lib64/libpthread.so.0
#8  0x00007fd9c0d9caad in clone () from /lib64/libc.so.6

Civil · 2016-06-22T15:13:23Z

@redbaron can you check if the same stuff happens if my commit (in my fork) is applied?

grobian · 2016-06-22T15:35:43Z

Isn't this just a race that's because of SIGHUP spamming? e.g. would unsetting the handler while we reload do the right thing?

redbaron · 2016-06-22T15:36:35Z

@Civil , I'll check, looks sane, few notes straight away:

flag should be sig_atomic_t (see https://www.gnu.org/software/libc/manual/html_node/Atomic-Types.html)
other handlers should be flag-only too

EDIT:

I'd reset flag to 0 as the very first step of hup handler too

redbaron · 2016-06-22T15:38:54Z

Isn't this just a race that's because of SIGHUP spamming? e.g. would unsetting the handler while we reload do the right thing?

In stack trace there is only one hup_handler so it is not because of spamming

Civil · 2016-06-22T15:44:12Z

@redbaron fixed for the reload handler, will make the same stuff for other handlers tomorrow (if Fabian won't do that by himself).

Civil · 2016-06-22T15:45:38Z

I'm also not really sure that exit_handler should be that complex. Just set "exit" to, for example, signal name and perform exit, ignoring other signals.

@grobian he's actually right. I remember getting a lot of errors some time ago also related to sighup handling. It might also be related and anyway that's sort of a best practice.

Hook the hup_handler on a trigger after receiving SIGHUP so we don't have to unset any signals, and can deal with receiving multiple SIGHUPs without causing race conditions in the code. Continuation of issue #188.

grobian · 2016-06-22T15:50:07Z

I already had a commit in progress, it sorts the somewhat artificial HUP-bombing

redbaron · 2016-06-22T16:09:30Z

@grobian look at my last stacktrace, problem appeared in fflush() which was called from relaylog. It is NOT because handler was called while previous handler was still executing.

Looking at commit 5a453a7 , exit_handler continues to call logerr() which is relaylog(). Until signal handlers do nothing more than setting few flags and NOT call any libc() function there will be always bugs like that lurking around.

grobian · 2016-06-22T16:14:25Z

huphandler re-opens streams, might be a race in there still, because the pointer gets unset

grobian · 2016-06-22T17:54:09Z

added short sleep to mitigate this race.

Civil · 2016-06-22T19:08:54Z

@grobian is there any particular reason why you still want to write to stdout while in signal handler? @redbaron is right about possible error's that'll be cause by possible calls to other functions from the handler. So the bug will never be properly fixed, until you'll make the handlers only to set appropriate flags.

P.S. And also why not to use sig_atomic_t? It's in C89 standard so it's portable enough.

grobian · 2016-06-23T07:17:14Z

NAME
     signal — simplified software signal facilities

LIBRARY
     Standard C Library (libc, -lc)

SYNOPSIS
     #include <signal.h>

     void (*signal(int sig, void (*func)(int)))(int);

and

     The following functions are either reentrant or not interruptible by sig‐
     nals and are async-signal safe.  Therefore applications may invoke them,
     without restriction, from signal-catching functions:

     Base Interfaces:

     _exit(), access(), alarm(), cfgetispeed(), cfgetospeed(), cfsetispeed(),
     cfsetospeed(), chdir(), chmod(), chown(), close(), creat(), dup(),
     dup2(), execle(), execve(), fcntl(), fork(), fpathconf(), fstat(),
     fsync(), getegid(), geteuid(), getgid(), getgroups(), getpgrp(),
     getpid(), getppid(), getuid(), kill(), link(), lseek(), mkdir(),
     mkfifo(), open(), pathconf(), pause(), pipe(), raise(), read(), rename(),
     rmdir(), setgid(), setpgid(), setsid(), setuid(), sigaction(),
     sigaddset(), sigdelset(), sigemptyset(), sigfillset(), sigismember(),
     signal(), sigpending(), sigprocmask(), sigsuspend(), sleep(), stat(),
     sysconf(), tcdrain(), tcflow(), tcflush(), tcgetattr(), tcgetpgrp(),
     tcsendbreak(), tcsetattr(), tcsetpgrp(), time(), times(), umask(),
     uname(), unlink(), utime(), wait(), waitpid(), write().

I thought I used write, but it appears I use the stream-based variants, so indeed an AS race can occur.

Civil · 2016-06-23T07:55:11Z

@grobian my point is that it seems to be there is no lose in functionality if you'll make 2 lines signal handlers functions and move all logout/logerr to appropriate functions that'll be called in the code. That'll close this topic once and for all and you'll always will be sure that signal handling doesn't have any bugs.

grobian · 2016-06-23T08:06:16Z

yes, as my previous comment said, that's the only way ;)

In order to be AS-Safe, just set a pending flag on receiving a signal which the main loop handles. This should avoid any race condition that can take place by deferring into MT-Safe land. This should fix the issue #188 at last.

grobian added a commit that referenced this issue Jun 22, 2016

server_free: don't attempt pthread_join if no thread was started, issue

8ede7c8

#188

Civil mentioned this issue Jun 22, 2016

Split HUP handler and actual reload process. #189

Closed

grobian closed this as completed Jun 22, 2016

Civil mentioned this issue Oct 3, 2016

relay is sending data randomly while kill -HUP happens #204

Closed

grobian mentioned this issue Oct 31, 2016

segfault during (or right after) sighup, carbon-c-relay 2.2 on rhel6 #224

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SIGHUP leads to SIGSEGV #188

SIGHUP leads to SIGSEGV #188

redbaron commented Jun 22, 2016

grobian commented Jun 22, 2016

redbaron commented Jun 22, 2016

redbaron commented Jun 22, 2016

redbaron commented Jun 22, 2016

grobian commented Jun 22, 2016

redbaron commented Jun 22, 2016 •

edited

Loading

redbaron commented Jun 22, 2016 •

edited

Loading

Civil commented Jun 22, 2016

grobian commented Jun 22, 2016

redbaron commented Jun 22, 2016 •

edited

Loading

redbaron commented Jun 22, 2016

Civil commented Jun 22, 2016

Civil commented Jun 22, 2016 •

edited

Loading

grobian commented Jun 22, 2016

redbaron commented Jun 22, 2016

grobian commented Jun 22, 2016

grobian commented Jun 22, 2016

Civil commented Jun 22, 2016

grobian commented Jun 23, 2016

Civil commented Jun 23, 2016

grobian commented Jun 23, 2016

SIGHUP leads to SIGSEGV #188

SIGHUP leads to SIGSEGV #188

Comments

redbaron commented Jun 22, 2016

grobian commented Jun 22, 2016

redbaron commented Jun 22, 2016

redbaron commented Jun 22, 2016

redbaron commented Jun 22, 2016

grobian commented Jun 22, 2016

redbaron commented Jun 22, 2016 • edited Loading

redbaron commented Jun 22, 2016 • edited Loading

Civil commented Jun 22, 2016

grobian commented Jun 22, 2016

redbaron commented Jun 22, 2016 • edited Loading

redbaron commented Jun 22, 2016

Civil commented Jun 22, 2016

Civil commented Jun 22, 2016 • edited Loading

grobian commented Jun 22, 2016

redbaron commented Jun 22, 2016

grobian commented Jun 22, 2016

grobian commented Jun 22, 2016

Civil commented Jun 22, 2016

grobian commented Jun 23, 2016

Civil commented Jun 23, 2016

grobian commented Jun 23, 2016

redbaron commented Jun 22, 2016 •

edited

Loading

redbaron commented Jun 22, 2016 •

edited

Loading

redbaron commented Jun 22, 2016 •

edited

Loading

Civil commented Jun 22, 2016 •

edited

Loading