[OPEN-ILS-GENERAL] Evergreen 2.9 and PostgreSQL 9.4 - A Cautionary Tale

Jeff Davis jeff.davis at bc.libraries.coop
Wed Feb 10 16:23:35 EST 2016


Thanks to both of you for sharing all of this information.  I'm glad
to hear things are working well again!

Jeff

On 16-02-10 01:03 PM, Chris Sharp wrote:
> Thanks, Mike - I (and hopefully future generations of Evergreen admins) appreciate the clarifications!
> 
> Chris
> 
> ----- Original Message -----
>> From: "Mike Rylander" <mrylander at gmail.com>
>> To: "Evergreen Discussion Group" <open-ils-general at list.georgialibraries.org>
>> Sent: Wednesday, February 10, 2016 3:56:01 PM
>> Subject: Re: [OPEN-ILS-GENERAL] Evergreen 2.9 and PostgreSQL 9.4 - A Cautionary Tale
>>
>> Chris,
>>
>> I was happy to be able to perform that emergency diagnosis and
>> immediately implement several of the changes you listed, and glad you
>> were able to move quickly to implement my recommendations requiring
>> restarts.  I know it was a stressful time for your team, and it's
>> frustrating when the things you've tried in the past don't help in new
>> situations.
>>
>> I want to provide some context regarding the specific changes I made,
>> and had you make, just so folks can make the best use of the
>> information out there.  Though I'm sure you know this, the details of
>> tuning any given Postgres instance are extremely dependent on the
>> particular data set, even when you're using the same application
>> (Evergreen in this case).
>>
>> With regard to the join collapse limit, there are many other data sets
>> that do not show the same sensitivity to the setting as PINES.  Most
>> are well-served by the bug fix on
>> https://bugs.launchpad.net/evergreen/+bug/1527731 .  There is a
>> not-insignificant cost to increasing this value, so there is a
>> trade-off to be weighed.  In PINES' case, the increase in large-query
>> planning time was far outweighed by the cost of the very-slow queries
>> being seen, of course.  You'll notice that a much smaller increase
>> described in the bug was all that was needed in Dan Wells' case,
>> though even that was not necessary with the patch.  An overly-high
>> setting here can be detrimental to performance in general -- IOW,
>> setting the join collapse limit as high as I've had you set it for
>> PINES is definitely not a blanket optimization opportunity.
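>>
>> If you want to gauge that planning-time trade-off on your own data before
>> committing to anything, one low-risk approach is to flip the setting for a
>> single session and compare plans.  The query below is just a stand-in;
>> substitute a representative slow query from your own logs:
>>
>>      SET join_collapse_limit = 12;   -- session-local, nothing is persisted
>>      EXPLAIN (ANALYZE, BUFFERS)
>>          SELECT * FROM actor.org_unit_descendants(1);  -- stand-in query
>>      RESET join_collapse_limit;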
>>
>> As for the ROWS adjustments, they can have a significant effect on plan
>> creation (positive or negative) and are directly dependent on the data
>> set.  They were also critical in combination with the join collapse limit
>> change, as otherwise row estimates in the PINES data set were still
>> pushing plans in a sub-optimal direction.  The specific setting of 300
>> is only applicable to PINES.  It feeds into the calculus that Postgres
>> uses when constructing query plans, and must be tuned to the data set
>> if it needs to be tuned at all.  In many cases it does not need to be
>> adjusted, and a high setting here can cause bad plans.  While it does
>> look like the value is essentially the number of org units in the
>> instance, identifying the best value is not simply a matter of setting
>> ROWS to approximately the number of org units you have -- there are
>> situations (and configurations) where that isn't the correct thing to
>> do, and it requires instance-specific evaluation to determine what's
>> best.
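>>
>> For reference, you can see what the planner currently assumes for those
>> functions, and how that compares to the actual org unit count, with a
>> couple of catalog queries (stock Evergreen schema assumed):
>>
>>      -- current planner row estimates for the function(s)
>>      SELECT oid::regprocedure, prorows
>>        FROM pg_proc
>>       WHERE proname = 'org_unit_descendants';
>>
>>      -- how many org units this instance actually has
>>      SELECT count(*) FROM actor.org_unit;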
>>
>> Indexes are also extremely data set dependent, of course, and their
>> cost comes from regular index maintenance. Obviously there's
>> no need for multiple, identical indexes, so removing those is always a
>> good recommendation if they are, indeed, duplicates.  Adding an index
>> is (relative to other tuning measures) pretty cheap if you have the
>> drive space, but action.circulation is a pretty hot table (it changes
>> a lot, and quickly), so unless there is a problem with timeouts to a
>> site's collections vendor, I wouldn't recommend adding that specific
>> index.
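>>
>> A quick way to spot potential duplicates is simply to compare the index
>> definitions on the table in question side by side, for example:
>>
>>      SELECT indexname, indexdef
>>        FROM pg_indexes
>>       WHERE schemaname = 'action'
>>         AND tablename = 'hold_request'
>>       ORDER BY indexdef;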
>>
>> For shared buffers, I can't think of a situation where I would
>> recommend more than 32G, and even for PINES 16G is probably enough.
>> However, since 32G is working well for you I wouldn't change it at
>> this point.  A large shared buffers setting causes a lot of waste in
>> terms of RAM (pages are loaded twice, once in shared buffers and once
>> in the OS page cache; wasted bookkeeping memory for the page map) and
>> CPU time (locking stalls, CHECKPOINT timeout I/O storms, buffer
>> eviction clock sweep work, load on the background writer, just to name
>> a few). This is definitely one of those "every instance is different"
>> things.
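>>
>> For what it's worth, on 9.4 the change itself can be made with ALTER
>> SYSTEM instead of editing postgresql.conf by hand; either way,
>> shared_buffers only takes effect after a full restart (the 32G value
>> here is the PINES setting, not a general recommendation):
>>
>>      SHOW shared_buffers;                       -- current value
>>      ALTER SYSTEM SET shared_buffers = '32GB';  -- writes postgresql.auto.conf
>>      -- a reload is not enough; restart the cluster for this to apply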
>>
>> Having said all that, disabling transparent huge pages is pretty much
>> critical -- it's as important as disabling zone reclaim on any NUMA
>> system.  This was a non-issue on 9.2, became important in some
>> situations on 9.3, and is a huge issue for 9.4.  The reasons are
>> complicated, but the cause is Postgres' change from System V to POSIX
>> shared memory.  Basically, just do it. :)
>>
>> Thanks for sharing, Chris, and again, I was glad to help when needed.
>>
>> --
>> Mike Rylander
>>  | President
>>  | Equinox Software, Inc. / The Open Source Experts
>>  | phone:  1-877-OPEN-ILS (673-6457)
>>  | email:  miker at esilibrary.com
>>  | web:  http://www.esilibrary.com
>>
>>
>>
>> On Wed, Feb 10, 2016 at 2:41 PM, Chris Sharp
>> <csharp at georgialibraries.org> wrote:
>>> Hello all,
>>>
>>> As watchers of the Evergreen IRC channel know, the PINES upgrade from
>>> Evergreen 2.7.2/PostgreSQL 9.3 to Evergreen 2.9.1/PostgreSQL 9.4 was a bit
>>> bumpy.  Our system slowed to a crawl on go-live day, and we were quick to
>>> determine that the culprit was the database.  After consulting with Mike
>>> Rylander of Equinox Software, we implemented several tweaks to our system
>>> that I'll share here in hopes that they may help others in the future:
>>>
>>> The slowness appears to have been due to two problem areas:
>>>
>>> Bad Query Plans:
>>>
>>> - we upped the join_collapse_limit from the default value of 8 to 12 (see
>>> the sketch after this list) - Dan Wells discusses this possibility in his
>>> bug report about slow copy load in the OPAC
>>> (https://bugs.launchpad.net/evergreen/+bug/1527731).
>>> - we altered the actor.org_unit_descendants functions, adding a realistic
>>> estimate of the number of rows to be expected:
>>>      ALTER FUNCTION actor.org_unit_descendants(int) ROWS 300;
>>>      ALTER FUNCTION actor.org_unit_descendants(int,int) ROWS 300;
>>> - we dropped a duplicate index (ahr_current_copy_open_idx) on the
>>> action.hold_request table that had apparently been added to address an
>>> issue in the past (also sketched after this list)
>>> - we added an index to action.circulation to address an issue with slow
>>> collections queries:
>>>      CREATE INDEX CONCURRENTLY circ_circ_lib_open_idx ON action.circulation
>>>      (usr,circ_lib) WHERE xact_finish IS NULL;
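>>>
>>> As a sketch, the two changes above that aren't shown as commands can be
>>> applied along these lines (the value 12 and the index name are ours; the
>>> schema qualification on the index is an assumption):
>>>
>>>      ALTER SYSTEM SET join_collapse_limit = 12;  -- run as a superuser
>>>      SELECT pg_reload_conf();  -- new value picked up without a restart
>>>
>>>      -- drop the duplicate index without blocking concurrent writes
>>>      DROP INDEX CONCURRENTLY action.ahr_current_copy_open_idx;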
>>>
>>> PostgreSQL Memory Usage:
>>>
>>> - we reduced our shared_buffers size to 32GB from 128GB - we had calculated
>>> the higher value from the PostgreSQL documentation
>>> (http://www.postgresql.org/docs/9.4/static/runtime-config-resource.html#GUC-SHARED-BUFFERS)
>>> recommendation of 1/4 of installed RAM (512GB in our case).  We
>>> immediately saw a reduction of resource consumption on the server.
>>> - on the underlying OS, we turned off transparent huge pages (THP - see
>>> http://unix.stackexchange.com/questions/99154/disable-transparent-hugepages
>>> for details on how one might do this).  Further research revealed that
>>> turning off THP is generally recommended by Oracle and PostgreSQL.
>>>
>>> We also followed Mike's recommendation to run the hold targeter cron job
>>> nightly rather than every 15 minutes, but that probably didn't have much
>>> to do with the root cause (aside from adding queries to the pile-up when
>>> slowness struck).
>>>
>>> Since we made the above changes, our database servers have sped up considerably
>>> and are performing very happily.  I just wanted to share our experience
>>> with the Evergreen Community in hopes that others can benefit.
>>>
>>> --
>>> Chris Sharp
>>> PINES System Administrator
>>> Georgia Public Library Service
>>> 1800 Century Place, Suite 150
>>> Atlanta, Georgia 30345
>>> (404) 235-7147
>>> csharp at georgialibraries.org
>>> http://pines.georgialibraries.org/
>>
> 


-- 
Jeff Davis
Lead Evergreen Specialist
BC Libraries Cooperative

