[OPEN-ILS-GENERAL] Evergreen 2.9 and PostgreSQL 9.4 - A Cautionary Tale

Chris Sharp csharp at georgialibraries.org
Wed Feb 10 16:03:31 EST 2016


Thanks, Mike - I (and hopefully future generations of Evergreen admins) appreciate the clarifications!

Chris

----- Original Message -----
> From: "Mike Rylander" <mrylander at gmail.com>
> To: "Evergreen Discussion Group" <open-ils-general at list.georgialibraries.org>
> Sent: Wednesday, February 10, 2016 3:56:01 PM
> Subject: Re: [OPEN-ILS-GENERAL] Evergreen 2.9 and PostgreSQL 9.4 - A Cautionary Tale
> 
> Chris,
> 
> I was happy to be able to perform that emergency diagnosis and
> immediately implement several of the changes you listed, and glad you
> were able to move quickly to implement my recommendations requiring
> restarts.  I know it was a stressful time for your team, and it's
> frustrating when the things you've tried in the past don't help in new
> situations.
> 
> I want to provide some context regarding the specific changes I made,
> and had you make, just so folks can make the best use of the
> information out there.  Though I'm sure you know this, the details of
> tuning any given Postgres instance are extremely dependent on the
> particular data set, even when you're using the same application
> (Evergreen in this case).
> 
> With regard to the join collapse limit, there are many other data sets
> that do not show the same sensitivity to the setting as PINES.  Most
> are well-served by the bug fix on
> https://bugs.launchpad.net/evergreen/+bug/1527731 .  There is a
> not-insignificant cost to increasing this value, so there is a
> trade-off to be weighed.  In PINES' case, the increase in large-query
> planning time was far outweighed by the cost of the very-slow queries
> being seen, of course.  You'll notice that a much smaller increase
> described in the bug was all that was needed in Dan Wells' case,
> though even that was not necessary with the patch.  An overly-high
> setting here can be detrimental to performance in general -- IOW,
> setting the join collapse limit as high as I've had you set it for
> PINES is definitely not a blanket optimization opportunity.
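> For reference, here is one way such a change can be applied
> cluster-wide on 9.4.  The value 12 is the one from the PINES report
> below; this is a sketch, not a blanket recommendation:

```sql
-- Raise join_collapse_limit cluster-wide (the default is 8).
-- ALTER SYSTEM is available from PostgreSQL 9.4 onward; the same
-- setting can also be placed directly in postgresql.conf.
ALTER SYSTEM SET join_collapse_limit = 12;
SELECT pg_reload_conf();  -- no restart required for this setting
```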
> 
> The ROWS adjustments can have a significant effect on plan
> creation (positive or negative) and is directly dependent on the data
> set.  It was also critical in combination with the join collapse limit
> change, as otherwise row estimates in the PINES data set were still
> pushing plans in a sub-optimal direction.  The specific setting of 300
> is only applicable to PINES.  It feeds into the calculus that Postgres
> uses when constructing query plans, and must be tuned to the data set
> if it needs to be tuned at all.  In many cases it does not need to be
> adjusted, and a high setting here can cause bad plans.  While it does
> look like the value is essentially the number of org units in the
> instance, identifying the best value is not simply a matter of setting
> ROWS to approximately the number of org units you have -- there are
> situations (and configurations) where that isn't the correct thing to
> do, and it requires instance-specific evaluation to determine what's
> best.
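> For anyone evaluating this on their own instance, a quick way to see
> the current planner estimate and the number it is often compared
> against (a sketch; adjust to your schema):

```sql
-- The planner row estimate currently attached to the function(s):
SELECT proname, pronargs, prorows
  FROM pg_proc
 WHERE proname = 'org_unit_descendants';

-- The actual org unit count for this instance:
SELECT COUNT(*) FROM actor.org_unit;
```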
> 
> Indexes are also extremely data set dependent, of course, and their
> cost comes from regular index maintenance.  Obviously there's
> no need for multiple, identical indexes, so removing those is always a
> good recommendation if they are, indeed, duplicates.  Adding an index
> is (relative to other tuning measures) pretty cheap if you have the
> drive space, but action.circulation is a pretty hot table (it changes
> a lot, and quickly), so unless there is a problem with timeouts to a
> site's collections vendor, I wouldn't recommend adding that specific
> index.
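> If you want to hunt for duplicate indexes on your own instance, a
> rough catalog query like this can surface candidates.  It compares
> indexed-column vectors only, so review each match by hand (opclasses,
> partial-index predicates) before dropping anything:

```sql
-- Group indexes by table and indexed-column vector; more than one
-- index in a group is a *potential* duplicate, not a certain one.
SELECT indrelid::regclass AS table_name,
       array_agg(indexrelid::regclass) AS indexes
  FROM pg_index
 GROUP BY indrelid, indkey
HAVING COUNT(*) > 1;
```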
> 
> For shared buffers, I can't think of a situation where I would
> recommend more than 32G, and even for PINES 16G is probably enough.
> However, since 32G is working well for you I wouldn't change it at
> this point.  A large shared buffers setting causes a lot of waste in
> terms of RAM (pages are loaded twice, once in shared buffers and once
> in the OS page cache; wasted bookkeeping memory for the page map) and
> CPU time (locking stalls, CHECKPOINT timeout I/O storms, buffer
> eviction clock sweep work, load on the background writer, just to name
> a few). This is definitely one of those "every instance is different"
> things.
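> Illustrative only, since the right value is instance-specific, but a
> change like the one Chris describes below can be made like so:

```sql
-- Reduce shared_buffers; unlike join_collapse_limit, this setting
-- only takes effect after a PostgreSQL restart.
ALTER SYSTEM SET shared_buffers = '32GB';
```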
> 
> Having said all that, disabling transparent huge pages is pretty much
> critical -- it's as important as disabling zone reclaim on any NUMA
> system.  This was a non-issue on 9.2, became important in some
> situations on 9.3, and is a huge issue for 9.4.  The reasons are
> complicated, but the cause is Postgres' change from System V to POSIX
> shared memory.  Basically, just do it. :)
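> For the record, one common way to do that on Linux, effective until
> the next reboot (Chris's message below links to options for making it
> persistent):

```shell
# Disable transparent huge pages immediately.  This does not survive
# a reboot; use your distribution's init or boot-parameter mechanism
# to make it permanent.
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```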
> 
> Thanks for sharing, Chris, and again, I was glad to help when needed.
> 
> --
> Mike Rylander
>  | President
>  | Equinox Software, Inc. / The Open Source Experts
>  | phone:  1-877-OPEN-ILS (673-6457)
>  | email:  miker at esilibrary.com
>  | web:  http://www.esilibrary.com
> 
> 
> 
> On Wed, Feb 10, 2016 at 2:41 PM, Chris Sharp
> <csharp at georgialibraries.org> wrote:
> > Hello all,
> >
> > As watchers of the Evergreen IRC channel know, the PINES upgrade from
> > Evergreen 2.7.2/PostgreSQL 9.3 to Evergreen 2.9.1/PostgreSQL 9.4 was a bit
> > bumpy.  Our system slowed to a crawl on go-live day, and we were quick to
> > determine that the culprit was the database.  After consulting with Mike
> > Rylander of Equinox Software, we implemented several tweaks to our system
> > that I'll share here in hopes that they may help others in the future:
> >
> > The slowness appears to have been due to two problem areas:
> >
> > Bad Query Plans:
> >
> > - we upped the join_collapse_limit from the default value of 8 to 12 - Dan
> > Wells discusses this possibility in his bug report about slow copy load in
> > the OPAC (https://bugs.launchpad.net/evergreen/+bug/1527731).
> > - we altered the actor.org_unit_descendants functions, adding a realistic estimate of
> > the number of rows to be expected:
> >      ALTER FUNCTION actor.org_unit_descendants(int) ROWS 300;
> >      ALTER FUNCTION actor.org_unit_descendants(int,int) ROWS 300;
> > - we dropped a duplicate index on the action.hold_request table that was
> > apparently added to address an issue in the past
> > (ahr_current_copy_open_idx)
> > - we added an index to action.circulation to address an issue with slow
> > collections queries:
> >      CREATE INDEX CONCURRENTLY circ_circ_lib_open_idx ON action.circulation
> >      (usr,circ_lib) WHERE xact_finish IS NULL;
> >
> > PostgreSQL Memory Usage:
> >
> > - we reduced our shared_buffers size to 32GB from 128GB - we had calculated
> > the higher value from the PostgreSQL documentation
> > (http://www.postgresql.org/docs/9.4/static/runtime-config-resource.html#GUC-SHARED-BUFFERS)
> > recommendation of 1/4 of installed RAM (512GB in our case).  We
> > immediately saw a reduction of resource consumption on the server.
> > - on the underlying OS, we turned off transparent huge pages (THP - see
> > http://unix.stackexchange.com/questions/99154/disable-transparent-hugepages
> > for details on how one might do this).  Further research revealed that
> > turning off THP is generally recommended by Oracle and PostgreSQL.
> >
> > We also followed Mike's recommendation to run the hold targeter cron job
> > nightly rather than every 15 minutes, but that probably didn't have much
> > to do with the root cause (aside from adding queries to the pile-up when
> > slowness struck).
> >
> > Since we made the above changes, our database servers have sped up
> > considerably and are performing very happily.  I just wanted to share
> > our experience with the Evergreen Community in hopes that others can
> > benefit.
> >
> > --
> > Chris Sharp
> > PINES System Administrator
> > Georgia Public Library Service
> > 1800 Century Place, Suite 150
> > Atlanta, Georgia 30345
> > (404) 235-7147
> > csharp at georgialibraries.org
> > http://pines.georgialibraries.org/
> 

-- 
Chris Sharp
PINES System Administrator
Georgia Public Library Service
1800 Century Place, Suite 150
Atlanta, Georgia 30345
(404) 235-7147
csharp at georgialibraries.org
http://pines.georgialibraries.org/

