replace signal() call with sigaction() for better portability #127
signal() is known to have portability problems. Some systems, such as old Unix systems and System V, employ a behavior in which the signal disposition is reset to SIG_DFL (default) upon invocation of a signal handler by the delivery of a signal. See issue rakshasa#51, in which Solaris users experience signal-based issues.

Solaris is a system which employs the SIG_DFL reset behavior even in version 11.1. See http://docs.oracle.com/cd/E26502_01/html/E29034/signal-3c.html

> If signal() is used, disp is the address of a signal handler, and sig is not SIGILL,
> SIGTRAP, or SIGPWR, the system first sets the signal's disposition to SIG_DFL before
> executing the signal handler.

On Solaris the default signal disposition for SIGUSR1 is to exit the application. See http://docs.oracle.com/cd/E26502_01/html/E29033/signal.h-3head.html

If I'm correct in assuming this is the problem Solaris users are encountering, the solution to this (and likely any other portability issues that may arise) is to use sigaction(), which has explicitly defined behavior as per the POSIX standards.

As you can see, it's a pretty drop-in solution. In fact, on some systems such as Linux, glibc defines signal() as a wrapper around sigaction() to use BSD semantics (i.e. not the behavior detailed above), which is likely why few others have experienced these problems. See http://man7.org/linux/man-pages/man2/signal.2.html
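For readers who want to see the shape of the change, here is a minimal sketch of installing a handler with sigaction() and checking its return value. It is not the actual rtorrent patch: the names install_handler, on_sigusr1 and usr1_count are hypothetical, and SA_RESTART is an assumption chosen to mirror the BSD semantics that glibc's signal() provides.

```cpp
#include <signal.h>

#include <cassert>
#include <cerrno>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical counter so the effect of the handler can be observed.
static volatile sig_atomic_t g_usr1_count = 0;

// Hypothetical handler: only touches a sig_atomic_t, which is async-signal-safe.
extern "C" void on_sigusr1(int) {
  g_usr1_count = g_usr1_count + 1;
}

// Hypothetical helper: install `handler` for `signum` using sigaction().
// SA_RESETHAND is deliberately not set, so the disposition is NOT reset
// to SIG_DFL when the signal is delivered.
void install_handler(int signum, void (*handler)(int)) {
  struct sigaction sa;
  std::memset(&sa, 0, sizeof(sa));
  sa.sa_handler = handler;
  sigemptyset(&sa.sa_mask);   // block no additional signals in the handler
  sa.sa_flags = SA_RESTART;   // assumption: restart interrupted system calls

  if (sigaction(signum, &sa, nullptr) == -1)
    throw std::runtime_error(std::string("sigaction failed: ") +
                             std::strerror(errno));
}

// Hypothetical accessor for the number of SIGUSR1 deliveries seen so far.
int usr1_count() { return g_usr1_count; }
```

Because the handler stays installed across deliveries, calling install_handler(SIGUSR1, on_sigusr1) once and then raising SIGUSR1 twice should run the handler both times instead of falling back to the default action on the second delivery.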
… signals (as was already done by signal()) as well as error checking on sigaction()
I noticed there seems to be some effort into using something other than signals, at least for the thing In fact, we would probably experience these issues on Linux were it not for the
Hope that helps.
This PR seems to fix #51 for me on Illumos (OmniOS).
I just noticed that SIGWINCH will still cause rtorrent to exit, though. Maybe the signal handler for that has the same issue.
Eh, accidentally deleted the comment when I meant to edit it. That's weird, I just checked and the handler for |
On Fri, Jul 19 2013 04:29:44 -0700, Jorge Israel Peña wrote:
Don't know how exactly to get a backtrace in this situation (I don't have much
(gdb) break manager.cc:61
Breakpoint 1, display::Manager::force_redraw (this=0x9c8170) at manager.cc:62
It prints "rtorrent: Listener port received an error event." and exits with
In any case SIGWINCH is ignored by default and the process exits on its
Lauri Tirkkonen | +358 50 5341376 | lotheac @ IRCnet
I think you can stop the program with Supposedly you can pause execution of the program on receipt of
You should also ignore
You should be able to run the same commands found in #51 to get the same backtrace output that user got. It's weird that it exits that way though. I tried searching the code base and I couldn't find mention of that error message, are you running the latest version with this patch applied? Because it seems to me like the latest version uses the error string "SCGI listener port received an error event." It's the same error that is mentioned in issue #51 , so perhaps my patch didn't fix it after all? Or did you say it fixed the
However, I don't think rtorrent uses Solaris' OS-specific I/O multiplexing mechanism ( |
Nevermind, I realized that the error message originates from libtorrent. I think I may be on to what is causing the problem. |
I think it has to do with libtorrent treating the
So it seems that it could be either a socket error or out-of-band data. I was wondering whether it could be something other than that on Solaris, but couldn't find more information. What we can do is rule out that it is a socket error. If it isn't one, then it's something else and it's probably safe to ignore the exception in the program. To do this we have to make the "Listener port received an error event." message a little more descriptive. Clone the latest libtorrent and go to

    void
    Listen::event_error() {
      int socket = get_fd().get_fd();
      int error = 0;
      socklen_t errorLen = sizeof(error);
      std::string errorMsg = "Listener port received an error event.";

      // Ask the socket for its pending error (SO_ERROR), if any.
      if (getsockopt(socket, SOL_SOCKET, SO_ERROR, &error, &errorLen) == -1)
        throw internal_error(errorMsg.c_str());
      else {
        // Append the description of the socket error to the original message.
        errorMsg += std::string(" ") + strerror(error);
        throw internal_error(errorMsg.c_str());
      }
    }

I haven't tried it (since I don't encounter this issue) but it compiles. What it should do is fetch the error associated with the socket, if there is one, and display it along with the original error message. I honestly have no clue about Solaris, nor am I very familiar with the rtorrent or libtorrent codebase; I'm just trying to help. My suspicion is that Solaris is a little liberal with what it considers to be an exception. To reiterate, everything I've read, including man pages, explicitly emphasizes that an exception is not necessarily an error. For example on Linux, it's usually either:
A file descriptor present in the exception FD set is what triggers this C++ exception to be thrown which is what causes rtorrent to exit. So if we can establish that it really isn't being caused due to an error then perhaps we can do away with the current behavior of treating anything present in the exception FD set as a dramatic error necessitating absolute program termination. |
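To make the distinction concrete, here is a small sketch showing a file descriptor that lands in the exception set of select() while SO_ERROR still reports no error. It assumes Linux-like behavior where TCP out-of-band data is reported as an exceptional condition, and it uses a loopback connection purely for illustration. ExceptProbe and probe_oob_exception are hypothetical names, not part of rtorrent or libtorrent.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cassert>

// Hypothetical result type: what select() and SO_ERROR reported.
struct ExceptProbe {
  bool in_except_set;  // was the fd in the exception set?
  int so_error;        // pending socket error, or -1 if the probe failed
};

// Hypothetical helper: connect to ourselves over loopback, send one byte
// of out-of-band data, then inspect the receiving socket.
ExceptProbe probe_oob_exception() {
  ExceptProbe result = {false, -1};
  int listener = socket(AF_INET, SOCK_STREAM, 0);
  int client = -1;
  int server = -1;

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = 0;  // let the kernel pick a free port
  socklen_t addr_len = sizeof(addr);
  sockaddr* sa = reinterpret_cast<sockaddr*>(&addr);

  if (listener != -1 &&
      bind(listener, sa, sizeof(addr)) == 0 &&
      listen(listener, 1) == 0 &&
      getsockname(listener, sa, &addr_len) == 0 &&
      (client = socket(AF_INET, SOCK_STREAM, 0)) != -1 &&
      connect(client, sa, sizeof(addr)) == 0 &&
      (server = accept(listener, nullptr, nullptr)) != -1 &&
      send(client, "!", 1, MSG_OOB) == 1) {
    // Wait (up to 2 seconds) for an exceptional condition on `server`.
    fd_set except_set;
    FD_ZERO(&except_set);
    FD_SET(server, &except_set);
    timeval timeout = {2, 0};
    if (select(server + 1, nullptr, nullptr, &except_set, &timeout) > 0)
      result.in_except_set = FD_ISSET(server, &except_set);

    // Now ask the socket whether it actually has a pending error.
    socklen_t len = sizeof(result.so_error);
    if (getsockopt(server, SOL_SOCKET, SO_ERROR, &result.so_error, &len) == -1)
      result.so_error = -1;
  }

  if (server != -1) close(server);
  if (client != -1) close(client);
  if (listener != -1) close(listener);
  return result;
}
```

On a system that behaves as assumed, the probe should report the descriptor as present in the exception set with so_error equal to 0, which is the "exceptional condition without an error" case discussed above.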
On Fri, Jul 19 2013 18:11:10 -0700, Jorge Israel Peña wrote:
Yeah, that's what the patch fixed. I don't think I actually got the
On Fri, Jul 19 2013 19:35:10 -0700, Jorge Israel Peña wrote:
autogen was failing to generate a working configure script for me at the
rtorrent: Listener port received an error event. Error 0
So it would appear that there is indeed no error condition and exiting
Alright then. If you want, you can try making it throw a C++ exception (and thus exit the program) only if there is indeed a socket error:

    void
    Listen::event_error() {
      int socket = get_fd().get_fd();
      int error = 0;
      socklen_t errorLen = sizeof(error);

      // Only treat the event as fatal if the socket reports a pending error.
      if (getsockopt(socket, SOL_SOCKET, SO_ERROR, &error, &errorLen) != -1 && error != 0) {
        std::string errorMsg = std::string("Listener port received an error event: ") + strerror(error);
        throw internal_error(errorMsg.c_str());
      }
    }

That should check whether there's a socket error and, if so, throw the exception with the socket error message. I'm not sure what the implications of ignoring the exception could be, but to reiterate, everything I've read has emphasized that a file descriptor being present in the exceptions set doesn't necessarily indicate an error, but rather an "exceptional condition," such as out-of-band data being present. I can only imagine Solaris' criteria for what can show up there are a bit wider, so you're witnessing this in what is otherwise a normal use case on other platforms.
If this helps and seems to cause no problems in prolonged use, I'll submit it as a separate PR. |
Just to clarify, there are two separate kinds of exceptions we're talking about in this issue/fix.
The problem that we're trying to solve is that, when rtorrent detects that a file descriptor is present in the exception FD set as returned by

The fix simply uses the socket API to check whether there actually is an error present on the socket. If there is, it throws the exception with a detailed message about the error. Otherwise it ignores the file descriptor exception.
On Sat, Jul 20 2013 15:44:40 -0700, Jorge Israel Peña wrote:
I applied it to my build of libtorrent now, will report back later
On Sun, Jul 21 2013 16:45:18 +0300, Lauri Tirkkonen wrote:
Seems to be working fine, it's probably safe for you to make that other
Alright, thanks for the feedback Lauri, glad it helped you out.
I came across issue #51, in which Solaris users encounter crashes upon receipt of SIGUSR1, and I have a feeling I know what's happening. I don't think the problem is Solaris' implementation of pthread_kill, as that is defined by the POSIX standards and Solaris is apparently fully POSIX-compliant, more than can be said of Linux, and yet those of us on Linux don't experience this problem.

Rather, signal() is known to have portability problems (the POSIX standard allows it presumably due to historical reasons). Some systems such as old Unix systems and System V employ a behavior in which the signal disposition is reset to SIG_DFL (default) upon invocation of a signal handler by the delivery of a signal.

From: http://man7.org/linux/man-pages/man2/signal.2.html

Not to mention:

I guess so far we have gotten lucky.

Solaris is a system which employs the SIG_DFL reset behavior even in version 11.1. See http://docs.oracle.com/cd/E26502_01/html/E29034/signal-3c.html

On Solaris the default signal disposition for SIGUSR1 is to exit the application. See http://docs.oracle.com/cd/E26502_01/html/E29033/signal.h-3head.html
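The reset behavior can be reproduced on systems that do not have it by default. The sketch below emulates it with sigaction() and the SA_RESETHAND flag, which the Linux signal(2) man page describes as part of the System V semantics. noop_handler and resets_after_delivery are hypothetical names used only for this illustration.

```cpp
#include <signal.h>

#include <cassert>
#include <cstring>

// Hypothetical handler that does nothing; we only care about the disposition.
extern "C" void noop_handler(int) {}

// Hypothetical helper: install noop_handler for `signum` with the given
// sa_flags, deliver the signal once, and report whether the disposition
// has fallen back to SIG_DFL afterwards.
bool resets_after_delivery(int signum, int flags) {
  struct sigaction sa;
  std::memset(&sa, 0, sizeof(sa));
  sa.sa_handler = noop_handler;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = flags;
  if (sigaction(signum, &sa, nullptr) == -1)
    return false;

  raise(signum);  // the handler runs once

  // Query the current disposition without changing it.
  struct sigaction current;
  if (sigaction(signum, nullptr, &current) == -1)
    return false;
  return current.sa_handler == SIG_DFL;
}
```

With SA_RESETHAND the disposition should be back at SIG_DFL after a single delivery, so a second SIGUSR1 would take the default action (terminating the process). With flags set to 0 the handler should remain installed.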
If I'm correct in assuming this is the problem Solaris users are encountering, the solution to this (and likely any other portability issues that may arise) is to use sigaction(), which has explicitly defined behavior as per the POSIX standards.

As you can see, it's a pretty drop-in solution. In fact, on some systems such as Linux, glibc defines signal() as a wrapper around sigaction() to use BSD semantics (i.e. not the behavior detailed above), which is likely why few others have experienced these problems. See http://man7.org/linux/man-pages/man2/signal.2.html

I really hope this helps you guys out with Solaris (and maybe other systems which employ this behavior). I don't have Solaris myself so I can't test it, but it's a drop-in replacement of signal() so it should have no effect on other systems and hopefully fix the problem on Solaris. All I could do on my Linux machine was add a torrent and force a re-hash, which I believe is what fires SIGUSR1 (or so I inferred from the talk on the issue). The re-hash went through just fine.