[OPEN-ILS-DEV] C nits: osrf_system.c
Bill Erickson
billserickson at gmail.com
Fri May 18 09:17:19 EDT 2007
On 5/12/07, Scott McKellar <mck9 at swbell.net> wrote:
>
> in ILS/OpenSRF/src/libstack/osrf_system.c, we install a signal handler
> __osrfSystemSignalHandler() to catch SIGCHLD signals from the death of
> child processes.
>
> The handler calls waitpid() in a small loop to release any defunct
> child processes from zombie status, and issues a warning message for
> each one.
>
> What bothers me is that the signal handler does not reinstall itself
> before returning. Under standard ANSI signal handling: when the
> signal is received, it restores the default handling for that signal
> (which in the case of SIGCHLD is to ignore the signal). The usual
> practice is for the signal handler to do whatever it does and then
> call signal() to reinstall itself.
>
> However we don't do that. We return to a sleep loop, and only after
> 10000 seconds do we reinstall the signal handler.
My understanding of how that works is something like this:
*top of main loop*
- register signal handler
- sleep
- SIGCHLD interrupts sleep
- signal handler reaps all dead children with wait() loop
*back to top of main loop*
- register signal handler
- sleep
-...
Suppose that one child process dies, and then a second child dies a
> minute later. We catch the first SIGCHLD signal and issue a message
> for the first child. However by the time second child dies we don't
> have a signal handler any more, so we miss that signal.
>
> Eventually, if a third child process dies after 10000 more seconds,
> we will catch it because we will have reinstalled the signal handler
> at the top of the sleep loop. At that point, the waitpid() loop
> will notice the deaths of both the second and the third children.
> However the second child will have been a zombie in the meanwhile
> for almost three hours.
>
> If all the child processes die within a few minutes, we'll only catch
> the first one. In general we'll notice all the child deaths only if
> the last one dies more than 10000 seconds after the one before.
>
> The simplest fix is to add the following line to the end of the
> signal handler:
>
> signal(SIGCHLD, __osrfSystemSignalHandler);
>
> In addition we can remove the other call to signal() from the inside
> of the sleep loop:
>
> daemonize();
> signal(SIGCHLD, __osrfSystemSignalHandler);
> while(1) {
> sleep(10000);
> }
>
> Actually I don't think we need this signal handler at all, nor any
> call to sleep(). We can just call wait() in a loop (warning:
> uncompiled and untested):
>
> daemonize();
> while(1) {
> int status;
> pid_t pid;
>
> pid = wait( &status );
> osrfLogWarning( OSRF_LOG_MARK, "We lost child %d", pid);
> }
>
> This approach avoids the problem of calling non-reentrant functions
> from within a signal handler. It also plugs another hole: the
> possibility that a child process will die after the fork() but before
> the signal().
>
> Some version of these changes will enable us to respond to the death
> of child processes promptly and reliably, instead of haphazardly.
>
> I could submit a patch, but I'm not in a position to test it very
> readily. In any case it has been a while since I've written this
> kind of code, and I'm more than a little rusty.
Yep, I agree this is the superior approach. We should be able to work the
code in (from your next email, actually) without a patch.
Thanks, Scott!
--
Bill Erickson
PINES Systems Developer
Georgia Public Library Service
billserickson at gmail.com
http://open-ils.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://list.georgialibraries.org/pipermail/open-ils-dev/attachments/20070518/b31c305f/attachment.html
More information about the Open-ils-dev
mailing list