Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace signal() call with sigaction() for better portability #127

Merged
merged 2 commits into from
Jun 18, 2014

Conversation

blaenk
Copy link
Contributor

@blaenk blaenk commented Apr 8, 2013

I came across issue #51 in which Solaris users encounter crashes upon receipt of SIGUSR1 and I have a feeling I know what's happening. I don't think the problem is Solaris' implementation of pthread_kill as that is defined by the POSIX standards and Solaris is apparently fully POSIX-complaint, more than can be said of Linux, and yet those of us on Linux don't experience this problem.

Rather, signal() is known to have portability problems (the POSIX standard allows it presumably due to historical reasons). Some systems such as old Unix systems and System V employ a behavior in which the signal disposition is reset to SIG_DFL (default) upon invocation of a signal handler by the delivery of a signal.

The only portable use of signal() is to set a signal's disposition to
SIG_DFL or SIG_IGN. The semantics when using signal() to establish a
signal handler vary across systems (and POSIX.1 explicitly permits this
variation); do not use it for this purpose.

From: http://man7.org/linux/man-pages/man2/signal.2.html

Not to mention:

The effects of signal() in a multithreaded process are unspecified.

I guess so far we have gotten lucky.

Solaris is a system which employs the SIG_DFL reset behavior even in version 11.1.

See http://docs.oracle.com/cd/E26502_01/html/E29034/signal-3c.html

If signal() is used, disp is the address of a signal handler, and sig is not SIGILL,
SIGTRAP, or SIGPWR, the system first sets the signal's disposition to SIG_DFL before
executing the signal handler.

On Solaris the default signal disposition for SIGUSR1 is to exit the application.

See http://docs.oracle.com/cd/E26502_01/html/E29033/signal.h-3head.html

If I'm correct in assuming this is the problem Solaris users are encountering, the solution to this (and likely any other portability issues that may arise) is to use sigaction() which has explicitly defined behavior as per the POSIX standards.

As you can see, it's a pretty drop-in solution. In fact, on some systems such as Linux, glibc defines signal() as a wrapper around sigaction() to use BSD semantics (i.e. not the behavior detailed above), which is likely why few others have experienced these problems.

See http://man7.org/linux/man-pages/man2/signal.2.html

I really hope this helps you guys out with Solaris (and maybe other systems which employ this behavior). I don't have Solaris myself so I can't test it, but it's a drop-in replacement of signal() so it should have no effect on other systems and hopefully fix the problem on Solaris. All I could do on my Linux machine was add a torrent and force a re-hash, which I believe is what fires SIGUSR1 (or so I inferred from the talk on the issue). The re-hash went through just fine.

signal() is known to have portability problems. Some systems such as
old Unix systems and System V employ a behavior in which the signal
disposition is reset to SIG_DFL (default) upon invocation of a signal
handler by the delivery of a signal.

See issue rakshasa#51 in which Solaris users experience signal-based issues.

Solaris is a system which employs the SIG_DFL reset behavior even in
version 11.1.

See http://docs.oracle.com/cd/E26502_01/html/E29034/signal-3c.html

> If signal() is used, disp is the address of a signal handler, and sig is not SIGILL,
> SIGTRAP, or SIGPWR, the system first sets the signal's disposition to SIG_DFL before
> executing the signal handler.

On Solaris the default signal disposition for SIGUSR1 is to exit the application.

See http://docs.oracle.com/cd/E26502_01/html/E29033/signal.h-3head.html

If I'm correct in assuming this is the problem Solaris users are encountering,
the solution to this (and likely any other portability issues that may arise)
is to use sigaction() which has explicitly defined behavior as per the POSIX standards.

As you can see, it's a pretty drop-in solution. In fact, on some systems such as Linux,
glibc defines signal() as a wrapper around sigaction() to use BSD semantics (i.e. not the
behavior detailed above), which is likely why few others have experienced these problems.

See http://man7.org/linux/man-pages/man2/signal.2.html
… signals (as was already done by signal()) as well as error checking on sigaction()
@blaenk
Copy link
Contributor Author

blaenk commented Apr 8, 2013

I noticed there seems to be some effort into using something other than signals, at least for the thing SIGUSR1 is currently used for. I believe my changes are still useful, however, for any other signal handlers which may be established, considering it's a drop-in replacement of signal() and signal() is well known to have portability problems for establishing signal handlers (the other use of it in rtorrent, for setting SIG_DFL and SIG_IGN, is apparently fine as it is well defined in the POSIX standards).

In fact, we would probably experience these issues on Linux were it not for the _BSD_SOURCE feature test macro being implicitly defined in glibc:

By default, in glibc 2 and later, the signal() wrapper function does not
invoke the kernel system call. Instead, it calls sigaction(2) using
flags that supply BSD semantics. This default behavior is provided as
long as the _BSD_SOURCE feature test macro is defined. By default,
_BSD_SOURCE is defined; it is also implicitly defined if one defines
_GNU_SOURCE, and can of course be explicitly defined.

On glibc 2 and later, if the _BSD_SOURCE feature test macro is not
defined, then signal() provides System V semantics. (The default
implicit definition of _BSD_SOURCE is not provided if one invokes gcc(1)
in one of its standard modes (-std=xxx or -ansi) or defines various other
feature test macros such as _POSIX_SOURCE, _XOPEN_SOURCE, or
_SVID_SOURCE; see feature_test_macros(7).)

Hope that helps.

@lotheac
Copy link
Contributor

lotheac commented Jul 19, 2013

This PR seems to fix #51 for me on Illumos (OmniOS).

@lotheac
Copy link
Contributor

lotheac commented Jul 19, 2013

I just noticed that SIGWINCH will still cause rtorrent to exit, though. Maybe the signal handler for that has the same issue.

@blaenk
Copy link
Contributor Author

blaenk commented Jul 19, 2013

Eh, accidentally deleted the comment when I meant to edit it.

That's weird, I just checked and the handler for SIGWINCH is established using the same function that was patched. Perhaps the bug is some place else? Could you perhaps provide a backtrace or other thing we could analyze?

@lotheac
Copy link
Contributor

lotheac commented Jul 19, 2013

On Fri, Jul 19 2013 04:29:44 -0700, Jorge Israel Peña wrote:

That's weird, I just checked and the handler for SIGWINCH is
established using the same function that was patched. Perhaps the bug
is some place else? Could you perhaps provide a backtrace or other
thing we could analyze?

Don't know how exactly to get a backtrace in this situation (I don't have much
experience debugging threads), but I can see that the signal handler does get
called:

(gdb) break manager.cc:61
Breakpoint 1 at 0x912300: file manager.cc, line 61.
(gdb) c
Continuing.

Breakpoint 1, display::Manager::force_redraw (this=0x9c8170) at manager.cc:62
62 m_forceRedraw = true;
(gdb) where
#0 display::Manager::force_redraw (this=0x9c8170) at manager.cc:62
#1 0x0000000000815cbc in sigc::bound_mem_functor0<void, display::Manager>::operator()
(this=0x9d6228) at /usr/include/sigc++-2.0/sigc++/functors/mem_fun.h:1787
#2 0x0000000000814f54 in sigc::adaptor_functor<sigc::bound_mem_functor0<void, display::Manager> >::operator() (this=0x9d6220)
at /usr/include/sigc++-2.0/sigc++/adaptors/adaptor_trait.h:251
#3 0x000000000081391f in sigc::internal::slot_call0<sigc::bound_mem_functor0<void, display::Manager>, void>::call_it (rep=0x9d61f0)
at /usr/include/sigc++-2.0/sigc++/functors/slot.h:103
#4 0x000000000086e6e1 in sigc::slot0::operator()() const ()
#5 0x000000000086f3e8 in SignalHandler::caught (signum=20) at signal_handler.cc:111
#6 0xfffffd7fff231196 in __sighndlr () from /lib/64/libc.so.1
#7 0xfffffd7fff223feb in call_user_handler () from /lib/64/libc.so.1
#8 0xfffffd7fff22435e in sigacthandler () from /lib/64/libc.so.1
#9 0xffffffffffffffff in ?? ()
#10 0x0000000000000014 in ?? ()
#11 0x0000000000000000 in ?? ()
(gdb) c
Continuing.
[Inferior 1 (process 7337 ) exited normally]

It prints "rtorrent: Listener port received an error event." and exits with
status 255. Any advice on how to debug further?

In any case SIGWINCH is ignored by default and the process exits on its
own here so maybe the problem is unrelated after all.

Lauri Tirkkonen | +358 50 5341376 | lotheac @ IRCnet

@blaenk
Copy link
Contributor Author

blaenk commented Jul 20, 2013

I think you can stop the program with CTRL-C and then run info threads to get information on the various threads and what they're doing. You can then change threads with thread #, e.g. thread 3, and then do the regular bt to perform the backtrace.

Supposedly you can pause execution of the program on receipt of SIGWINCH with:

handle SIGWINCH stop print pass

You should also ignore SIGUSR1 and SIGPIPE since I think rtorrent uses those:

handle SIGUSR1 nostop noprint
handle SIGPIPE nostop noprint

You should be able to run the same commands found in #51 to get the same backtrace output that user got.

It's weird that it exits that way though. I tried searching the code base and I couldn't find mention of that error message, are you running the latest version with this patch applied? Because it seems to me like the latest version uses the error string "SCGI listener port received an error event."

It's the same error that is mentioned in issue #51 , so perhaps my patch didn't fix it after all? Or did you say it fixed the SIGUSR1 crash? I also found this old issue. It seems like switching to select-based polling fixed it. Worth a try:

export RTORRENT_POLL=select
rtorrent

However, I don't think rtorrent uses Solaris' OS-specific I/O multiplexing mechanism (/dev/poll or event ports I think) to begin with, so it should be that you're already using select.

@blaenk
Copy link
Contributor Author

blaenk commented Jul 20, 2013

Nevermind, I realized that the error message originates from libtorrent. I think I may be on to what is causing the problem.

@blaenk
Copy link
Contributor Author

blaenk commented Jul 20, 2013

I think it has to do with libtorrent treating the exceptfds FD set in select() as being outright errors. From what I've read, they're not necessarily errors. The man page for Solaris for select() says:

If a socket has a pending error, it is considered to have an exceptional condition pending. Otherwise, what constitutes an exceptional condition is file type-specific. For a file descriptor for use with a socket, it is protocol-specific except as noted below. For other file types, if the operation is meaningless for a particular file type, select() or pselect() indicates that the descriptor is ready for read or write operations and indicates that the descriptor has no exceptional condition pending.
...
A socket is considered to have an exceptional condition pending if a receive operation with O_NONBLOCK clear for the open file description and with the MSG_OOB flag set would return out-of-band data without blocking. (It is protocol-specific whether the MSG_OOB flag would be used to read out-of-band data.) A socket will also be considered to have an exceptional condition pending if an out-of-band data mark is present in the receive queue.

So it seems that it either could be a socket error or out of bound data. I was wondering if it could be something else other than that on Solaris but couldn't find more information. What we can do is rule out that it is a socket error. If this is the case, then it's something else and it's probably safe to ignore the exception in the program.

To do this we have to make the "Listener port received an error event." message a little bit more descriptive.

Clone the latest libtorrent and go to src/net/listen.cc and change the Listen::event_error() function to this, then recompile rtorrent so that it uses this modified libtorrent:

void
Listen::event_error() {
  int socket = get_fd().get_fd();
  int error = 0;
  socklen_t errorLen = sizeof(error);

  std::string errorMsg = "Listener port received an error event.";

  if (getsockopt(socket, SOL_SOCKET, SO_ERROR, &error, &errorLen) == -1)
    throw internal_error(errorMsg.c_str());
  else {
    errorMsg += std::string(" ") + strerror(error);
    throw internal_error(errorMsg.c_str());
  }
}

I haven't tried it (since I don't encounter this issue) but it compiles. What it should do is fetch the error associated with the socket, if there is one, and display it a long with the original error message.

I honestly have no clue about Solaris nor am I very familiar with the rtorrent or libtorrent codebase, I'm just trying to help. My suspicion is that Solaris is a little liberal with what it considers to be an exception, which, to reiterate, everywhere I've read including man pages explicitly emphasize that an exception is not necessarily an error. For example on Linux, it's usually either:

  • state change on pseudoterminal slave connected to master that is in packet mode
  • out-of-band data received on stream socket

A file descriptor present in the exception FD set is what triggers this C++ exception to be thrown which is what causes rtorrent to exit. So if we can establish that it really isn't being caused due to an error then perhaps we can do away with the current behavior of treating anything present in the exception FD set as a dramatic error necessitating absolute program termination.

@lotheac
Copy link
Contributor

lotheac commented Jul 20, 2013

On Fri, Jul 19 2013 18:11:10 -0700, Jorge Israel Peña wrote:

It's the same error that is mentioned in issue #51 , so perhaps my
patch didn't fix it after all? Or did you say it fixed the SIGUSR1
crash?

Yeah, that's what the patch fixed. I don't think I actually got the
error message in question here at that point though (an exit caused by
unhandled SIGUSR1 wouldn't really print that so I don't know why the
original reporter saw it). I'll follow your suggestions later when I have
time, thanks for the advice.

Lauri Tirkkonen | +358 50 5341376 | lotheac @ IRCnet

@lotheac
Copy link
Contributor

lotheac commented Jul 20, 2013

On Fri, Jul 19 2013 19:35:10 -0700, Jorge Israel Peña wrote:

To do this we have to make the "Listener port received an error
event." message a little bit more descriptive.

Clone the latest libtorrent and go to src/net/listen.cc and change
the Listen::event_error() function to this, then recompile rtorrent
so that it uses this modified libtorrent:

autogen was failing to generate a working configure script for me at the
moment so I just applied the patch on top of the libtorrent release I
was using (0.13.2). Here's what it says:

rtorrent: Listener port received an error event. Error 0

So it would appear that there is indeed no error condition and exiting
isn't the right thing to do :)

Lauri Tirkkonen | +358 50 5341376 | lotheac @ IRCnet

@blaenk
Copy link
Contributor Author

blaenk commented Jul 20, 2013

Alright then. If you want, what you can try is making it so that it only throws a C++ exception (and thus exits the program) if there is indeed a socket error:

void
Listen::event_error() {
  int socket = get_fd().get_fd();
  int error = 0;
  socklen_t errorLen = sizeof(error);

  if (getsockopt(socket, SOL_SOCKET, SO_ERROR, &error, &errorLen) != -1 && error != 0) {
    std::string errorMsg = std::string("Listener port received an error event: ") + strerror(error);
    throw internal_error(errorMsg.c_str());
  }
}

That should check to see if there's a socket error and if so throw the exception with the socket error message. I'm not sure what the implications could be of ignoring the exception, but to reiterate, everywhere I've read has emphasized that a file descriptor being present in the exceptions set doesn't necessarily indicate an error, but an "exceptional condition," such as out-of-band data being present. I can only imagine Solaris' criteria for what can show up there is a bit wider and so you're witnessing this in what is otherwise a normal use case on other platforms.

@blaenk
Copy link
Contributor Author

blaenk commented Jul 20, 2013

If this helps and seems to cause no problems in prolonged use, I'll submit it as a separate PR.

@blaenk
Copy link
Contributor Author

blaenk commented Jul 20, 2013

Just to clarify, there are two separate kinds of exceptions we're talking about in this issue/fix.

  • C++ exception: the throw internal_error part is the C++ exception being thrown by rtorrent whenever it detects that a file descriptor is present in the exception file descriptor set
  • exception file descriptor set: the select() function monitors/returns three different sets of file descriptors: those that are ready to be read, those that are ready to be written to, and those that have reached an "exceptional condition"

The problem that we're trying to solve is that, when rtorrent detects that a file descriptor is present in the exception FD set as returned by select(), it assumes that to be an outright, stop-everything-you're-doing error, and so it throws the C++ exception which subsequently terminates rtorrent. The problem is that from what I've read, a file descriptor being present in the exceptions file descriptor set doesn't necessarily mean an error, but rather it means that the file descriptor has reached an "exceptional condition" such as out-of-band data being present on a stream socket.

The fix simply uses the socket API to check if there actually is an error present with the socket. If so then indeed throw the exception with a detailed message about the error. Otherwise ignore the file descriptor exception.

@lotheac
Copy link
Contributor

lotheac commented Jul 21, 2013

On Sat, Jul 20 2013 15:44:40 -0700, Jorge Israel Peña wrote:

If this helps and seems to cause no problems in prolonged use, I'll
submit it as a separate PR.

I applied it to my build of libtorrent now, will report back later
(sometime next week perhaps) whether it caused issues or not.

Lauri Tirkkonen | +358 50 5341376 | lotheac @ IRCnet

@lotheac
Copy link
Contributor

lotheac commented Jul 25, 2013

On Sun, Jul 21 2013 16:45:18 +0300, Lauri Tirkkonen wrote:

I applied it to my build of libtorrent now, will report back later
(sometime next week perhaps) whether it caused issues or not.

Seems to be working fine, it's probably safe for you to make that other
PR.

Lauri Tirkkonen | +358 50 5341376 | lotheac @ IRCnet

@blaenk
Copy link
Contributor Author

blaenk commented Jul 25, 2013

Alright thanks for the feedback Lauri, glad it helped you out.

On Thursday, July 25, 2013, Lauri Tirkkonen wrote:

On Sun, Jul 21 2013 16:45:18 +0300, Lauri Tirkkonen wrote:

I applied it to my build of libtorrent now, will report back later
(sometime next week perhaps) whether it caused issues or not.

Seems to be working fine, it's probably safe for you to make that other
PR.

Lauri Tirkkonen | +358 50 5341376 | lotheac @ IRCnet


Reply to this email directly or view it on GitHubhttps://github.com//pull/127#issuecomment-21539471
.

  • Jorge Israel Peña

@rakshasa rakshasa merged commit 49d0028 into rakshasa:master Jun 18, 2014
lotheac added a commit to niksula/omnios-build-scripts that referenced this pull request Oct 7, 2014
lotheac added a commit to niksula/omnios-build that referenced this pull request Oct 7, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants