[OPEN-ILS-DEV] C nits: osrf_system.c

Bill Erickson billserickson at gmail.com
Fri May 18 09:17:19 EDT 2007


On 5/12/07, Scott McKellar <mck9 at swbell.net> wrote:
>
> in ILS/OpenSRF/src/libstack/osrf_system.c, we install a signal handler
> __osrfSystemSignalHandler() to catch SIGCHLD signals from the death of
> child processes.
>
> The handler calls waitpid() in a small loop to release any defunct
> child processes from zombie status, and issues a warning message for
> each one.
>
> What bothers me is that the signal handler does not reinstall itself
> before returning.  Under standard ANSI signal handling: when the
> signal is received, it restores the default handling for that signal
> (which in the case of SIGCHLD is to ignore the signal).  The usual
> practice is for the signal handler to do whatever it does and then
> call signal() to reinstall itself.
>
> However we don't do that.  We return to a sleep loop, and only after
> 10000 seconds do we reinstall the signal handler.



My understanding of how that works is something like this:

*top of main loop*
- register signal handler
- sleep
- SIGCHLD interrupts sleep
- signal handler reaps all dead children with  wait() loop
*back to top of main loop*
- register signal handler
- sleep
-...

Suppose that one child process dies, and then a second child dies a
> minute later.  We catch the first SIGCHLD signal and issue a message
> for the first child.  However by the time second child dies we don't
> have a signal handler any more, so we miss that signal.
>
> Eventually, if a third child process dies after 10000 more seconds,
> we will catch it because we will have reinstalled the signal handler
> at the top of the sleep loop.  At that point, the waitpid() loop
> will notice the deaths of both the second and the third children.
> However the second child will have been a zombie in the meanwhile
> for almost three hours.
>
> If all the child processes die within a few minutes, we'll only catch
> the first one.  In general we'll notice all the child deaths only if
> the last one dies more than 10000 seconds after the one before.
>
> The simplest fix is to add the following line to the end of the
> signal handler:
>
>     signal(SIGCHLD, __osrfSystemSignalHandler);
>
> In addition we can remove the other call to signal() from the inside
> of the sleep loop:
>
>         daemonize();
>         signal(SIGCHLD, __osrfSystemSignalHandler);
>         while(1) {
>                 sleep(10000);
>         }
>
> Actually I don't think we need this signal handler at all, nor any
> call to sleep().  We can just call wait() in a loop (warning:
> uncompiled and untested):
>
>         daemonize();
>         while(1) {
>             int status;
>             pid_t pid;
>
>             pid = wait( &status );
>             osrfLogWarning( OSRF_LOG_MARK, "We lost child %d", pid);
>         }
>
> This approach avoids the problem of calling non-reentrant functions
> from within a signal handler.  It also plugs another hole: the
> possibility that a child process will die after the fork() but before
> the signal().
>
> Some version of these changes will enable us to respond to the death
> of child processes promptly and reliably, instead of haphazardly.
>
> I could submit a patch, but I'm not in a position to test it very
> readily.  In any case it has been a while since I've written this
> kind of code, and I'm more than a little rusty.


Yep, I agree this is the superior approach.  We should be able to work the
code in (from your next email, actually) without a patch.

Thanks, Scott!



-- 
Bill Erickson
PINES Systems Developer
Georgia Public Library Service
billserickson at gmail.com
http://open-ils.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://list.georgialibraries.org/pipermail/open-ils-dev/attachments/20070518/b31c305f/attachment.html


More information about the Open-ils-dev mailing list