[OPEN-ILS-DEV] C nits: osrf_system.c

Scott McKellar mck9 at swbell.net
Sat May 12 21:49:42 EDT 2007


in ILS/OpenSRF/src/libstack/osrf_system.c, we install a signal handler
__osrfSystemSignalHandler() to catch SIGCHLD signals from the death of
child processes.

The handler calls waitpid() in a small loop to release any defunct
child processes from zombie status, and issues a warning message for
each one.

What bothers me is that the signal handler does not reinstall itself
before returning.  Under standard ANSI signal handling: when the 
signal is received, it restores the default handling for that signal 
(which in the case of SIGCHLD is to ignore the signal).  The usual 
practice is for the signal handler to do whatever it does and then 
call signal() to reinstall itself.

However we don't do that.  We return to a sleep loop, and only after
10000 seconds do we reinstall the signal handler.

Suppose that one child process dies, and then a second child dies a
minute later.  We catch the first SIGCHLD signal and issue a message
for the first child.  However by the time second child dies we don't
have a signal handler any more, so we miss that signal.

Eventually, if a third child process dies after 10000 more seconds,
we will catch it because we will have reinstalled the signal handler
at the top of the sleep loop.  At that point, the waitpid() loop
will notice the deaths of both the second and the third children.
However the second child will have been a zombie in the meanwhile
for almost three hours.

If all the child processes die within a few minutes, we'll only catch
the first one.  In general we'll notice all the child deaths only if
the last one dies more than 10000 seconds after the one before.

The simplest fix is to add the following line to the end of the
signal handler:

    signal(SIGCHLD, __osrfSystemSignalHandler);

In addition we can remove the other call to signal() from the inside
of the sleep loop:

	daemonize();
	signal(SIGCHLD, __osrfSystemSignalHandler);
	while(1) {
		sleep(10000);
	}

Actually I don't think we need this signal handler at all, nor any
call to sleep().  We can just call wait() in a loop (warning:
uncompiled and untested):

	daemonize();
	while(1) {
	    int status;
	    pid_t pid;

	    pid = wait( &status );
	    osrfLogWarning( OSRF_LOG_MARK, "We lost child %d", pid);
	}

This approach avoids the problem of calling non-reentrant functions
from within a signal handler.  It also plugs another hole: the 
possibility that a child process will die after the fork() but before
the signal().

Some version of these changes will enable us to respond to the death
of child processes promptly and reliably, instead of haphazardly.

I could submit a patch, but I'm not in a position to test it very
readily.  In any case it has been a while since I've written this
kind of code, and I'm more than a little rusty.

Scott McKellar
http://home.swbell.net/mck9/aargh/



More information about the Open-ils-dev mailing list