[OPEN-ILS-GENERAL] [OPEN-ILS-DEV] Apache leaking sockets/FD

Josh Stompro stomproj at exchange.larl.org
Thu Jul 23 22:02:09 EDT 2015


Adding a close seems to have fixed the problem for me.  To try it out I edited /usr/local/share/perl/5.20.2/OpenILS/WWW/EGCatLoader/Record.pm and changed line 577 to
   576          # To avoid a lot of hanging connections.
   577          if ($content->{request}) {
   578              $content->{request}->shutdown(2);
   579              $content->{request}->close();
   580          }

Now when I load a bib detail record the number of orphaned sock connections doesn’t keep climbing.  I’ll test some more and open a bug if it continues to look good.

Josh

From: Open-ils-general [mailto:open-ils-general-bounces at list.georgialibraries.org] On Behalf Of Josh Stompro
Sent: Thursday, July 23, 2015 2:27 PM
To: Evergreen Discussion Group
Subject: Re: [OPEN-ILS-GENERAL] [OPEN-ILS-DEV] Apache leaking sockets/FD

I just took a look at a test system running Debian Wheezy with EG 2.8.2 and OpenSRF 2.4.1 and I see the same issue: each load of a record detail page leaks 5 file descriptors, which show up when running “lsof | grep "\<sock\>" | wc -l” before and after the request.

So the steps to test it are:


1. Run “lsof | grep "\<sock\>" | wc -l” to see how many orphan FDs there currently are.
2. Load a record detail page to trigger the added_content connections back to the local host.  http://egcatalog/eg/opac/record/10
3. Run “lsof | grep "\<sock\>" | wc -l” again to see if the number increased.
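
The before/after count in these steps could also be scripted. This is only a rough sketch, written in Python rather than anything Evergreen ships. It assumes a Linux /proc filesystem and permission to read the target process's fd directory, and count_socket_fds is a made-up helper name. Unlike the lsof | grep pipeline, it counts every socket descriptor held by one process, not only the ones lsof lists as "sock".

```python
import os


def count_socket_fds(pid="self"):
    """Count the descriptors of one process whose /proc link target is a socket.

    Assumes Linux: /proc/<pid>/fd holds one symlink per open descriptor,
    and socket descriptors link to a target of the form "socket:[inode]".
    """
    fd_dir = f"/proc/{pid}/fd"
    count = 0
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed while we were scanning; skip it.
            continue
        if target.startswith("socket:"):
            count += 1
    return count
```

Calling count_socket_fds(5684) for an Apache child before and after loading a record detail page would give the same kind of comparison as steps 1 and 3, restricted to that one process.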

I don’t have a non-OpenVZ system to test on right now.  I would love to hear if anyone else sees this.
Josh

From: Open-ils-general [mailto:open-ils-general-bounces at list.georgialibraries.org] On Behalf Of Josh Stompro
Sent: Thursday, July 23, 2015 2:06 PM
To: Evergreen Discussion Group
Subject: Re: [OPEN-ILS-GENERAL] [OPEN-ILS-DEV] Apache leaking sockets/FD

I found this page that seems to say that a close is always needed after a shutdown of a socket to free the FD.
http://www.perlmonks.org/?node=108244
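
To illustrate the shutdown-versus-close point, here is a minimal sketch in Python rather than the Perl that Evergreen uses. It assumes a Unix-like system, and the names fd_is_open and shutdown_then_close_demo are made up for the example. It shuts down one end of a connected socket pair, checks whether the descriptor number is still valid, then closes it and checks again.

```python
import os
import socket


def fd_is_open(fd):
    """Return True if the descriptor number still refers to an open file."""
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False


def shutdown_then_close_demo():
    # A connected socket pair stands in for the client/server connection.
    left, right = socket.socketpair()
    fd = left.fileno()

    # Equivalent of shutdown(2) in the Perl code: stop both directions.
    left.shutdown(socket.SHUT_RDWR)
    open_after_shutdown = fd_is_open(fd)

    # close() is the call that actually releases the descriptor.
    left.close()
    open_after_close = fd_is_open(fd)

    right.close()
    return open_after_shutdown, open_after_close
```

On Linux this should return (True, False): the descriptor survives the shutdown and is only released by the close.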

I’ll look at my other test systems to see whether they have the same issue; I may just not have noticed it there because of the low number of requests.
Josh

From: Open-ils-general [mailto:open-ils-general-bounces at list.georgialibraries.org] On Behalf Of Josh Stompro
Sent: Thursday, July 23, 2015 1:31 PM
To: Evergreen Discussion Group
Subject: Re: [OPEN-ILS-GENERAL] [OPEN-ILS-DEV] Apache leaking sockets/FD

This is what strace shows me.

[pid 14793] socket(PF_INET, SOCK_STREAM, IPPROTO_TCP <unfinished ...>
[pid 14793] <... socket resumed> )      = 83
[pid 14793] ioctl(83, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS <unfinished ...>
[pid 14793] <... ioctl resumed> , 0x7fffb83df850) = -1 EINVAL (Invalid argument)
[pid 14793] lseek(83, 0, SEEK_CUR <unfinished ...>
[pid 14793] <... lseek resumed> )       = -1 ESPIPE (Illegal seek)
[pid 14793] ioctl(83, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS <unfinished ...>
[pid 14793] <... ioctl resumed> , 0x7fffb83df850) = -1 EINVAL (Invalid argument)
[pid 14793] lseek(83, 0, SEEK_CUR <unfinished ...>
[pid 14793] <... lseek resumed> )       = -1 ESPIPE (Illegal seek)
[pid 14793] fcntl(83, F_SETFD, FD_CLOEXEC <unfinished ...>
[pid 14793] <... fcntl resumed> )       = 0
[pid 14793] connect(83, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("192.168.46.32")}, 16 <unfinished ...>
[pid 14793] <... connect resumed> )     = 0
[pid 14793] write(83, "HEAD /opac/extras/ac/summary/html/r/1001 HTTP/1.1\r\nConnection: close\r\nHost: virt-egapp2.larl.org\r\n\r\n", 100 <unfinished ...>
[pid 14793] <... write resumed> )       = 100
[pid 14793] read(83,  <unfinished ...>
[pid 14793] <... read resumed> "HTTP/1.1 404 Not Found\r\nDate: Thu, 23 Jul 2015 03:16:49 GMT\r\nServer: Apache/2.4.10 (Debian)\r\nConnection: close\r\nContent-Type: text/html; charset=iso-8859-1\r\n\r\n", 1024) = 159
[pid 14793] shutdown(83, SHUT_RDWR <unfinished ...>
[pid 14793] <... shutdown resumed> )    = 0

After this point FD 83 never shows up again in the strace log, but it does show up in the lsof -p <pid> display as shown before.  I’m wondering if that is because no close for FD 83 is ever called.  I’ve read that sometimes the shutdown() implementation includes the close, and sometimes it does not.  I’m trying to figure out whether the shutdown at the end of EGCatLoader/Record.pm includes a close, or whether the implementation changed with the versions of the perl libs that Jessie has.

I’m also wondering if I’m way off base or not.

Josh

From: Open-ils-general [mailto:open-ils-general-bounces at list.georgialibraries.org] On Behalf Of Josh Stompro
Sent: Thursday, July 23, 2015 10:03 AM
To: Evergreen Development Discussion List; Evergreen Discussion Group
Subject: Re: [OPEN-ILS-GENERAL] [OPEN-ILS-DEV] Apache leaking sockets/FD

Hello Mike,

lsof -n -P -p <pid> doesn’t give any new info about those connections.

ot at virt-egapp2:/openils/var/templates# lsof -n -P -p 5684
COMMAND    PID    USER   FD   TYPE             DEVICE SIZE/OFF     NODE NAME
/usr/sbin 5684 opensrf  cwd    DIR               0,45     4096 34071974 /
/usr/sbin 5684 opensrf  rtd    DIR               0,45     4096 34071974 /
/usr/sbin 5684 opensrf  txt    REG               0,45   654136 34947316 /usr/sbin/apache2
/usr/sbin 5684 opensrf  mem    REG              253,2          35374581 /lib/x86_64-linux-gnu/libnss_dns-2.19.so (path dev=0,45)
/usr/sbin 5684 opensrf  mem    REG              253,2          39379859 /usr/lib/x86_64-linux-gnu/perl/5.20.2/auto/Hash/Util/Util.so (path dev=0,45)
/usr/sbin 5684 opensrf  mem    REG               0,50          66964636 (deleted)/dev/zero (stat: No such file or directory)
<SNIP>
/usr/sbin 5684 opensrf   35u  sock                0,6      0t0 67405034 can't identify protocol
/usr/sbin 5684 opensrf   36u  sock                0,6      0t0 67405037 can't identify protocol
/usr/sbin 5684 opensrf   37u  sock                0,6      0t0 67405040 can't identify protocol
/usr/sbin 5684 opensrf   38u  sock                0,6      0t0 67405043 can't identify protocol
/usr/sbin 5684 opensrf   39u  sock                0,6      0t0 67405046 can't identify protocol
/usr/sbin 5684 opensrf   40u  sock                0,6      0t0 67689829 can't identify protocol
/usr/sbin 5684 opensrf   41u  sock                0,6      0t0 67689832 can't identify protocol
/usr/sbin 5684 opensrf   42u  sock                0,6      0t0 67689835 can't identify protocol
/usr/sbin 5684 opensrf   43u  sock                0,6      0t0 67689838 can't identify protocol
/usr/sbin 5684 opensrf   44u  sock                0,6      0t0 67689841 can't identify protocol

From using strace, it looks like the problem connections come from Apache trying to load the various added content types: the connections get shut down, but the FD for the socket never gets closed.   https://github.com/evergreen-library-system/Evergreen/blob/6bb8ea5599d39d41d623d1891b3c509c4e439178/Open-ILS/src/perlmods/lib/OpenILS/WWW/EGCatLoader/Record.pm#L577

I’ll post more info when I get a chance, time to take the kids to the park before we all go stir crazy ;-)
Josh

From: Open-ils-dev [mailto:open-ils-dev-bounces at list.georgialibraries.org] On Behalf Of Mike Rylander
Sent: Thursday, July 23, 2015 8:02 AM
To: Evergreen Discussion Group
Cc: open-ils-dev at list.georgialibraries.org
Subject: Re: [OPEN-ILS-DEV] [OPEN-ILS-GENERAL] Apache leaking sockets/FD

Josh,

When you see this happen again, please try `lsof -n -P -p <pid>` (note the -n and -P) instead.  That will give the IP addrs and port numbers without attempting to convert host or service names and should help you identify the offending connections.

Regards,


--
Mike Rylander
 | President
 | Equinox Software, Inc. / The Open Source Experts
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at esilibrary.com
 | web:  http://www.esilibrary.com


On Wed, Jul 22, 2015 at 9:14 PM, Josh Stompro <stomproj at exchange.larl.org> wrote:
Greetings, I’ve been trying to figure out why my two front-end Evergreen application servers keep hitting resource limits related to TCP sockets (the numtcpsock OpenVZ beancounter).

I’m running EG 2.8.2, OpenSRF 2.4.1, and Debian Jessie in an OpenVZ container on Proxmox VE 3.4.

Nothing looks out of the ordinary when I look at the output of ‘ss -s’ or ‘netstat -a’, but the numtcpsock counter keeps going up, until I have 5000+ reported open TCP socket connections.

I think I’ve narrowed it down to Apache, since restarting Apache brings the numtcpsock numbers back in line with what is reported by ‘ss -s’.

If I take a look at all the open fd’s of an apache process, I see a bunch of the following.  So I think some socket connections are being opened but not closed properly.

(lsof -p <pid>)

/usr/sbin 11821 opensrf  171u  sock                0,6      0t0 61135031 can't identify protocol
/usr/sbin 11821 opensrf  172u  sock                0,6      0t0 61135034 can't identify protocol
/usr/sbin 11821 opensrf  173u  sock                0,6      0t0 61135037 can't identify protocol
/usr/sbin 11821 opensrf  174u  sock                0,6      0t0 61321969 can't identify protocol
/usr/sbin 11821 opensrf  175u  sock                0,6      0t0 61321972 can't identify protocol
/usr/sbin 11821 opensrf  176u  sock                0,6      0t0 61321975 can't identify protocol
/usr/sbin 11821 opensrf  177u  sock                0,6      0t0 61321978 can't identify protocol
/usr/sbin 11821 opensrf  178u  sock                0,6      0t0 61321981 can't identify protocol
/usr/sbin 11821 opensrf  179u  sock                0,6      0t0 61458539 can't identify protocol
/usr/sbin 11821 opensrf  180u  sock                0,6      0t0 61458542 can't identify protocol
/usr/sbin 11821 opensrf  181u  sock                0,6      0t0 61458545 can't identify protocol
/usr/sbin 11821 opensrf  182u  sock                0,6      0t0 61458548 can't identify protocol
/usr/sbin 11821 opensrf  183u  sock                0,6      0t0 61458551 can't identify protocol
/usr/sbin 11821 opensrf  184u  sock                0,6      0t0 62085495 can't identify protocol
/usr/sbin 11821 opensrf  185u  sock                0,6      0t0 62085498 can't identify protocol
/usr/sbin 11821 opensrf  186u  sock                0,6      0t0 62085501 can't identify protocol
/usr/sbin 11821 opensrf  187u  sock                0,6      0t0 62085504 can't identify protocol
/usr/sbin 11821 opensrf  188u  sock                0,6      0t0 62085507 can't identify protocol
/usr/sbin 11821 opensrf  189u  sock                0,6      0t0 63801157 can't identify protocol
/usr/sbin 11821 opensrf  190u  sock                0,6      0t0 63801160 can't identify protocol
/usr/sbin 11821 opensrf  191u  sock                0,6      0t0 63801163 can't identify protocol
/usr/sbin 11821 opensrf  192u  sock                0,6      0t0 63801166 can't identify protocol
/usr/sbin 11821 opensrf  193u  sock                0,6      0t0 63801169 can't identify protocol
/usr/sbin 11821 opensrf  194u  sock                0,6      0t0 63961716 can't identify protocol
/usr/sbin 11821 opensrf  195u  sock                0,6      0t0 63961719 can't identify protocol
/usr/sbin 11821 opensrf  196u  sock                0,6      0t0 63961722 can't identify protocol
/usr/sbin 11821 opensrf  197u  sock                0,6      0t0 63961725 can't identify protocol
/usr/sbin 11821 opensrf  198u  sock                0,6      0t0 63961728 can't identify protocol
/usr/sbin 11821 opensrf  199u  sock                0,6      0t0 64808966 can't identify protocol
/usr/sbin 11821 opensrf  200u  sock                0,6      0t0 64808971 can't identify protocol
/usr/sbin 11821 opensrf  201u  sock                0,6      0t0 64808974 can't identify protocol
/usr/sbin 11821 opensrf  202u  sock                0,6      0t0 64808977 can't identify protocol
/usr/sbin 11821 opensrf  203u  sock                0,6      0t0 64808980 can't identify protocol

I’m not sure how to track down the problem.  I’ll try using strace to see what connections are being created, but I’m not quite sure what to look for.

If anyone has run into this before, please let me know.
Josh
