[OPEN-ILS-GENERAL] [OPEN-ILS-DEV] Apache leaking sockets/FD
Josh Stompro
stomproj at exchange.larl.org
Thu Jul 23 22:02:09 EDT 2015
Adding a close seems to have fixed the problem for me. To try it out I edited /usr/local/share/perl/5.20.2/OpenILS/WWW/EGCatLoader/Record.pm and changed line 577 to
576     # To avoid a lot of hanging connections.
577     if ($content->{request}) {
578         $content->{request}->shutdown(2);
579         $content->{request}->close();
580     }
Now when I load a bib detail record the number of orphaned sock connections doesn’t keep climbing. I’ll test some more and open a bug if it continues to look good.
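For anyone who wants to see the shutdown-versus-close behaviour in isolation, here is a minimal sketch. It is written in Python rather than the Perl of Record.pm, and it uses a local socket pair in place of the real added-content request, so it only illustrates the POSIX behaviour and is not the Evergreen code. The helper name fd_is_open is made up for this example.

```python
import os
import socket


def fd_is_open(fd):
    """Return True if this descriptor number is still allocated in the process."""
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False


# A connected pair of local sockets stands in for the added-content request.
client, server = socket.socketpair()
fd = client.fileno()

# Equivalent of shutdown(2) in the Perl code: stops traffic in both directions.
client.shutdown(socket.SHUT_RDWR)
open_after_shutdown = fd_is_open(fd)

# close() is the call that actually hands the descriptor back to the kernel.
client.close()
open_after_close = fd_is_open(fd)

server.close()
print(open_after_shutdown, open_after_close)
```

On a POSIX system the descriptor should still be allocated after the shutdown and gone only after the close, which matches what the extra close() call in the patch above is meant to achieve.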
Josh
From: Open-ils-general [mailto:open-ils-general-bounces at list.georgialibraries.org] On Behalf Of Josh Stompro
Sent: Thursday, July 23, 2015 2:27 PM
To: Evergreen Discussion Group
Subject: Re: [OPEN-ILS-GENERAL] [OPEN-ILS-DEV] Apache leaking sockets/FD
I just took a look at a test system running Debian Wheezy with EG 2.8.2 and OpenSRF 2.4.1 and see the same issue: each load of a record detail page leaks 5 file descriptors, which show up when running ‘lsof | grep "\<sock\>" | wc -l’ before and after the request.
So the steps to test it are:
1. Run ‘lsof | grep "\<sock\>" | wc -l’ to see how many orphaned FDs there currently are.
2. Load a record detail page to trigger the added_content connections back to the local host, e.g. http://egcatalog/eg/opac/record/10
3. Run ‘lsof | grep "\<sock\>" | wc -l’ again to see if the number increased.
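The before/after count in the steps above can also be scripted. The following is a rough sketch in Python (not the Perl used by Evergreen) that assumes a Linux /proc filesystem; the function name count_socket_fds is made up for this example. Note that it counts every socket descriptor a process holds, not only the orphaned ones that lsof lists with TYPE "sock", so only the difference between two runs is meaningful.

```python
import os
import socket


def count_socket_fds(pid="self"):
    """Count descriptors of a process that point at sockets (Linux /proc only)."""
    fd_dir = "/proc/{}/fd".format(pid)
    count = 0
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor went away between listdir and readlink
        if target.startswith("socket:"):
            count += 1
    return count


if __name__ == "__main__" and os.path.isdir("/proc/self/fd"):
    # Small self-check on the current process: one new socket, one more FD.
    before = count_socket_fds()
    probe = socket.socket()
    print("sockets before:", before, "after opening one:", count_socket_fds())
    probe.close()
```

To watch an Apache child, pass its PID (for example count_socket_fds(5684)) before and after loading the record page. Reading another process's /proc entry needs root or the same user as that process.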
I don’t have a non-OpenVZ system to test on right now. I would love to hear if anyone else sees this.
Josh
From: Open-ils-general [mailto:open-ils-general-bounces at list.georgialibraries.org] On Behalf Of Josh Stompro
Sent: Thursday, July 23, 2015 2:06 PM
To: Evergreen Discussion Group
Subject: Re: [OPEN-ILS-GENERAL] [OPEN-ILS-DEV] Apache leaking sockets/FD
I found this page that seems to say that a close is always needed after a shutdown of a socket to free the FD.
http://www.perlmonks.org/?node=108244
I’ll look at my other test systems to see whether they have the same issue; I may simply not have noticed it there because of the low number of requests.
Josh
From: Open-ils-general [mailto:open-ils-general-bounces at list.georgialibraries.org] On Behalf Of Josh Stompro
Sent: Thursday, July 23, 2015 1:31 PM
To: Evergreen Discussion Group
Subject: Re: [OPEN-ILS-GENERAL] [OPEN-ILS-DEV] Apache leaking sockets/FD
This is what strace shows me.
[pid 14793] socket(PF_INET, SOCK_STREAM, IPPROTO_TCP <unfinished ...>
[pid 14793] <... socket resumed> ) = 83
[pid 14793] ioctl(83, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS <unfinished ...>
[pid 14793] <... ioctl resumed> , 0x7fffb83df850) = -1 EINVAL (Invalid argument)
[pid 14793] lseek(83, 0, SEEK_CUR <unfinished ...>
[pid 14793] <... lseek resumed> ) = -1 ESPIPE (Illegal seek)
[pid 14793] ioctl(83, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS <unfinished ...>
[pid 14793] <... ioctl resumed> , 0x7fffb83df850) = -1 EINVAL (Invalid argument)
[pid 14793] lseek(83, 0, SEEK_CUR <unfinished ...>
[pid 14793] <... lseek resumed> ) = -1 ESPIPE (Illegal seek)
[pid 14793] fcntl(83, F_SETFD, FD_CLOEXEC <unfinished ...>
[pid 14793] <... fcntl resumed> ) = 0
[pid 14793] connect(83, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("192.168.46.32")}, 16 <unfinished ...>
[pid 14793] <... connect resumed> ) = 0
[pid 14793] write(83, "HEAD /opac/extras/ac/summary/html/r/1001 HTTP/1.1\r\nConnection: close\r\nHost: virt-egapp2.larl.org\r\n\r\n", 100 <unfinished ...>
[pid 14793] <... write resumed> ) = 100
[pid 14793] read(83, <unfinished ...>
[pid 14793] <... read resumed> "HTTP/1.1 404 Not Found\r\nDate: Thu, 23 Jul 2015 03:16:49 GMT\r\nServer: Apache/2.4.10 (Debian)\r\nConnection: close\r\nContent-Type: text/html; charset=iso-8859-1\r\n\r\n", 1024) = 159
[pid 14793] shutdown(83, SHUT_RDWR <unfinished ...>
[pid 14793] <... shutdown resumed> ) = 0
After this point FD 83 never shows up again in the strace log, but it does show up in the ‘lsof -p <pid>’ output as shown before. I’m wondering if that is because close() is never called for FD 83. I’ve read that sometimes the shutdown() implementation includes the close, and sometimes it does not. I’m trying to figure out whether the shutdown at the end of EGCatLoader/Record.pm includes a close, or whether the implementation changed with the versions of the Perl libs that Jessie has.
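Checking a long strace log by eye for a missing close gets tedious, so here is a rough Python sketch of how it could be automated. It is an assumption-laden helper, not part of Evergreen: the name unclosed_sockets and the regular expressions are made up for this example, it only understands the "[pid N]" line format shown above, it ignores descriptors returned by accept() or dup(), and it treats each strace pid label as having its own descriptor table, which is not true for threads of one process.

```python
import re

PID = re.compile(r"^\[pid\s+(\d+)\]")
# Matches both "socket(...) = 83" and "<... socket resumed> ) = 83".
SOCKET_RESULT = re.compile(r"\bsocket(?:\(.*\)| resumed>.*?)\s*=\s*(\d+)")
# Matches "close(83) = 0" and "close(83 <unfinished ...>".
CLOSE_CALL = re.compile(r"\bclose\((\d+)")


def unclosed_sockets(lines):
    """Return sorted (pid, fd) pairs that socket() returned and close() never saw."""
    open_fds = set()
    for line in lines:
        pid_match = PID.match(line)
        pid = int(pid_match.group(1)) if pid_match else 0  # 0: no [pid N] prefix
        created = SOCKET_RESULT.search(line)
        if created:
            open_fds.add((pid, int(created.group(1))))
            continue
        closed = CLOSE_CALL.search(line)
        if closed:
            open_fds.discard((pid, int(closed.group(1))))
    return sorted(open_fds)


# Four lines copied from the trace above: the socket is created and shut down,
# but no close() follows.
SAMPLE = [
    "[pid 14793] socket(PF_INET, SOCK_STREAM, IPPROTO_TCP <unfinished ...>",
    "[pid 14793] <... socket resumed> ) = 83",
    "[pid 14793] shutdown(83, SHUT_RDWR <unfinished ...>",
    "[pid 14793] <... shutdown resumed> ) = 0",
]

print(unclosed_sockets(SAMPLE))
```

For a real log, the lines could be read from a file written with strace's -o option and passed to unclosed_sockets in the same way.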
I’m also wondering if I’m way off base or not.
Josh
From: Open-ils-general [mailto:open-ils-general-bounces at list.georgialibraries.org] On Behalf Of Josh Stompro
Sent: Thursday, July 23, 2015 10:03 AM
To: Evergreen Development Discussion List; Evergreen Discussion Group
Subject: Re: [OPEN-ILS-GENERAL] [OPEN-ILS-DEV] Apache leaking sockets/FD
Hello Mike,
‘lsof -n -P -p <pid>’ doesn’t give any new info about those connections.
ot at virt-egapp2:/openils/var/templates# lsof -n -P -p 5684
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
/usr/sbin 5684 opensrf cwd DIR 0,45 4096 34071974 /
/usr/sbin 5684 opensrf rtd DIR 0,45 4096 34071974 /
/usr/sbin 5684 opensrf txt REG 0,45 654136 34947316 /usr/sbin/apache2
/usr/sbin 5684 opensrf mem REG 253,2 35374581 /lib/x86_64-linux-gnu/libnss_dns-2.19.so (path dev=0,45)
/usr/sbin 5684 opensrf mem REG 253,2 39379859 /usr/lib/x86_64-linux-gnu/perl/5.20.2/auto/Hash/Util/Util.so (path dev=0,45)
/usr/sbin 5684 opensrf mem REG 0,50 66964636 (deleted)/dev/zero (stat: No such file or directory)
<SNIP>
/usr/sbin 5684 opensrf 35u sock 0,6 0t0 67405034 can't identify protocol
/usr/sbin 5684 opensrf 36u sock 0,6 0t0 67405037 can't identify protocol
/usr/sbin 5684 opensrf 37u sock 0,6 0t0 67405040 can't identify protocol
/usr/sbin 5684 opensrf 38u sock 0,6 0t0 67405043 can't identify protocol
/usr/sbin 5684 opensrf 39u sock 0,6 0t0 67405046 can't identify protocol
/usr/sbin 5684 opensrf 40u sock 0,6 0t0 67689829 can't identify protocol
/usr/sbin 5684 opensrf 41u sock 0,6 0t0 67689832 can't identify protocol
/usr/sbin 5684 opensrf 42u sock 0,6 0t0 67689835 can't identify protocol
/usr/sbin 5684 opensrf 43u sock 0,6 0t0 67689838 can't identify protocol
/usr/sbin 5684 opensrf 44u sock 0,6 0t0 67689841 can't identify protocol
From using strace, it looks like the problem connections come from Apache trying to load the various added content types: the connections get shut down, but the FD for the socket never gets closed. https://github.com/evergreen-library-system/Evergreen/blob/6bb8ea5599d39d41d623d1891b3c509c4e439178/Open-ILS/src/perlmods/lib/OpenILS/WWW/EGCatLoader/Record.pm#L577
I’ll post more info when I get a chance, time to take the kids to the park before we all go stir crazy ;-)
Josh
From: Open-ils-dev [mailto:open-ils-dev-bounces at list.georgialibraries.org] On Behalf Of Mike Rylander
Sent: Thursday, July 23, 2015 8:02 AM
To: Evergreen Discussion Group
Cc: open-ils-dev at list.georgialibraries.org
Subject: Re: [OPEN-ILS-DEV] [OPEN-ILS-GENERAL] Apache leaking sockets/FD
Josh,
When you see this happen again, please try `lsof -n -P -p <pid>` (note the -n and -P) instead. That will give the IP addrs and port numbers without attempting to convert host or service names and should help you identify the offending connections.
Regards,
--
Mike Rylander
| President
| Equinox Software, Inc. / The Open Source Experts
| phone: 1-877-OPEN-ILS (673-6457)
| email: miker at esilibrary.com
| web: http://www.esilibrary.com
On Wed, Jul 22, 2015 at 9:14 PM, Josh Stompro <stomproj at exchange.larl.org> wrote:
Greetings, I’ve been trying to figure out why my two front end Evergreen application servers keep hitting some resource limits having to do with tcp sockets (numtcpsock openvz beancounters).
I’m running EG 2.8.2, OpenSRF 2.4.1, Debian Jessie in an OpenVZ container on Proxmox VE 3.4.
Nothing looks out of the ordinary when I look at the output of ‘ss -s’ or ‘netstat -a’, but the numtcpsock counter keeps going up until I have 5000+ reported open TCP socket connections.
I think I’ve narrowed it down to Apache, since restarting Apache resets the numtcpsock numbers back in line with what is reported by ‘ss -s’.
If I take a look at all the open fd’s of an apache process, I see a bunch of the following. So I think some socket connections are being opened but not closed properly.
(lsof -p <pid>)
/usr/sbin 11821 opensrf 171u sock 0,6 0t0 61135031 can't identify protocol
/usr/sbin 11821 opensrf 172u sock 0,6 0t0 61135034 can't identify protocol
/usr/sbin 11821 opensrf 173u sock 0,6 0t0 61135037 can't identify protocol
/usr/sbin 11821 opensrf 174u sock 0,6 0t0 61321969 can't identify protocol
/usr/sbin 11821 opensrf 175u sock 0,6 0t0 61321972 can't identify protocol
/usr/sbin 11821 opensrf 176u sock 0,6 0t0 61321975 can't identify protocol
/usr/sbin 11821 opensrf 177u sock 0,6 0t0 61321978 can't identify protocol
/usr/sbin 11821 opensrf 178u sock 0,6 0t0 61321981 can't identify protocol
/usr/sbin 11821 opensrf 179u sock 0,6 0t0 61458539 can't identify protocol
/usr/sbin 11821 opensrf 180u sock 0,6 0t0 61458542 can't identify protocol
/usr/sbin 11821 opensrf 181u sock 0,6 0t0 61458545 can't identify protocol
/usr/sbin 11821 opensrf 182u sock 0,6 0t0 61458548 can't identify protocol
/usr/sbin 11821 opensrf 183u sock 0,6 0t0 61458551 can't identify protocol
/usr/sbin 11821 opensrf 184u sock 0,6 0t0 62085495 can't identify protocol
/usr/sbin 11821 opensrf 185u sock 0,6 0t0 62085498 can't identify protocol
/usr/sbin 11821 opensrf 186u sock 0,6 0t0 62085501 can't identify protocol
/usr/sbin 11821 opensrf 187u sock 0,6 0t0 62085504 can't identify protocol
/usr/sbin 11821 opensrf 188u sock 0,6 0t0 62085507 can't identify protocol
/usr/sbin 11821 opensrf 189u sock 0,6 0t0 63801157 can't identify protocol
/usr/sbin 11821 opensrf 190u sock 0,6 0t0 63801160 can't identify protocol
/usr/sbin 11821 opensrf 191u sock 0,6 0t0 63801163 can't identify protocol
/usr/sbin 11821 opensrf 192u sock 0,6 0t0 63801166 can't identify protocol
/usr/sbin 11821 opensrf 193u sock 0,6 0t0 63801169 can't identify protocol
/usr/sbin 11821 opensrf 194u sock 0,6 0t0 63961716 can't identify protocol
/usr/sbin 11821 opensrf 195u sock 0,6 0t0 63961719 can't identify protocol
/usr/sbin 11821 opensrf 196u sock 0,6 0t0 63961722 can't identify protocol
/usr/sbin 11821 opensrf 197u sock 0,6 0t0 63961725 can't identify protocol
/usr/sbin 11821 opensrf 198u sock 0,6 0t0 63961728 can't identify protocol
/usr/sbin 11821 opensrf 199u sock 0,6 0t0 64808966 can't identify protocol
/usr/sbin 11821 opensrf 200u sock 0,6 0t0 64808971 can't identify protocol
/usr/sbin 11821 opensrf 201u sock 0,6 0t0 64808974 can't identify protocol
/usr/sbin 11821 opensrf 202u sock 0,6 0t0 64808977 can't identify protocol
/usr/sbin 11821 opensrf 203u sock 0,6 0t0 64808980 can't identify protocol
I’m not sure how to track down the problem. I’ll try using strace to see what connections are being created, but I’m not quite sure what to look for.
If anyone has run into this before, please let me know.
Josh