[OPEN-ILS-DEV] Another possible optimization for database access/query

Tue Oct 8 12:54:25 EDT 2013

First, thanks again for this.

Some background:  this function is replacing the query generated by
the code that lives at
Open-ILS/src/perlmods/lib/OpenILS/Application/SuperCat.pm:643 (in
master).  The function uses a variadic param to simplify things, but
we actually need to be able to filter on three different types of
values: circ_lib (covered in the patch supplied), copy status and copy
location.

I'm creating a variant of the function depesz supplied that leaves
circ_lib as the primary filter axis, because it is the most likely to
be supplied.  All three filters will be accepted as array arguments,
as only the last input argument is allowed to be a variadic, and using
a special-case test for an empty circ_lib array to open a slightly
different cursor.

This special case (in practical terms, asking for installation-wide
item-ordered bibs, or "new bibs everywhere") will benefit from an
index specifically on create_date, and the other cases benefit from
the existing index on circ_lib.  I'm not sure if we should add a new
index across both or not ... testing will tell if circ_lib+create_date
makes thing better or not.

When I have a branch, I'll create an LP bug and respond here.

--miker

On Mon, Oct 7, 2013 at 11:52 AM, hubert depesz lubaczewski
<depesz at depesz.com> wrote:
> Hi,
> Couple of days ago, I started to work on #2 most time consuming query
> (from pg logs of our client):
>
> SELECT "acn".record AS "record", max ("acp".create_date) AS "create_date"
> FROM asset.call_number AS "acn"
> INNER JOIN asset.COPY AS "acp" ON ("acp".call_number = "acn".id)
> WHERE ("acn".record > 0)
>   AND ("acp".deleted = 'f'
>        AND "acp".circ_lib IN ('279',
>                               '280',
>                               '280',
>                               '281',
>                               '281',
>                               '282',
>                               '282',
>                               '283',
>                               '283',
>                               '284',
>                               '284',
>                               '285',
>                               '285',
>                               '286',
>                               '286',
>                               '287',
>                               '287',
>                               '288',
>                               '288',
>                               '289',
>                               '289'))
> GROUP BY 1
> ORDER BY max ("acp".create_date) DESC LIMIT 20
> ;
>
> This query has explain analyze http://explain.depesz.com/s/RAMc
> - runtime of 13.14 seconds.
>
>
> First level of optimization was adding 2 indexes:
>
> create index copy_circ_lib on asset.copy (circ_lib);
> create index unit_circ_lib on serial.unit (circ_lib);
>
> This made the query faster - 8.719s -- http://explain.depesz.com/s/UOy
>
> Afterwards, I added another two indexes, in hopes that it will get used:
>
> create index copy_circ_lib_create_date on asset.copy (circ_lib, create_date);
> create index unit_circ_lib_create_date on serial.unit (circ_lib, create_date);
>
> but it wasn't used.
>
> Since I didn't see a way to optimize the query further by changing query
> or adding indexes, decided to write a function instead.
>
> First iteration of the function:
>
> CREATE OR REPLACE FUNCTION asset.some_clever_name( IN p_limit INT4, VARIADIC p_circ_libs INT4[] )
>     RETURNS TABLE ( record INT8, create_date timestamptz ) as $$
> DECLARE
>     v_results public.hstore := '';
>     v_record  int8;
>     v_temprec record;
>     v_found   INT4 := 0;
> BEGIN
>     FOR v_temprec IN
>         SELECT
>             c.call_number,
>             max(c.create_date) as create_date
>         FROM
>             asset.copy c
>         WHERE c.circ_lib = any( p_circ_libs )
>             AND NOT c.deleted
>         GROUP BY c.call_number
>         ORDER BY max(c.create_date) DESC
>     LOOP
>         SELECT cn.record INTO v_record FROM asset.call_number cn WHERE cn.id = v_temprec.call_number;
>         CONTINUE WHEN NOT FOUND;
>         CONTINUE WHEN v_record <= 0;
>         CONTINUE WHEN v_results ? v_record::TEXT;
>         v_found := v_found + 1;
>         v_results  := v_results || hstore( v_record::TEXT, v_temprec.create_date::TEXT );
>         EXIT WHEN v_found = p_limit;
>     END LOOP;
>     RETURN QUERY SELECT KEY::INT8, value::timestamptz FROM each(v_results) ORDER BY value::timestamptz DESC;
> END;
> $$ language plpgsql STRICT;
>
> It can be called like this:
> SELECT * FROM asset.some_clever_name( 20, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289 );
>
> (i'm not good with naming stuff)
> result - 1.389s -- http://explain.depesz.com/s/IIu
> much better, but it's using the less optimal indexes (first set of two
> indexes I created).
>
> So I wrote this - second iteration of functional approach:
>
> CREATE OR REPLACE FUNCTION asset.some_clever_name( IN p_limit INT4, VARIADIC p_circ_libs INT4[] )
>     RETURNS TABLE ( record INT8, create_date timestamptz ) as $$
> DECLARE
>     v_results  public.hstore := '';
>     v_seen     public.hstore := '';
>     v_records  public.hstore := '';
>     v_oldest   timestamptz   := NULL;
>     v_c_oldest timestamptz   := NULL;
>     v_found    INT4 := 0;
>     v_circ_lib INT4;
>     v_record   int8;
>     v_temprec  record;
>     v_iter     INT4;
>     v_cursor   REFCURSOR;
> BEGIN
>     FOREACH v_circ_lib IN ARRAY p_circ_libs LOOP
>         v_iter := 0;
>         v_seen := '';
>         v_c_oldest := NULL;
>         open v_cursor NO SCROLL FOR
>             SELECT c.call_number, c.create_date
>             FROM asset.copy c
>             WHERE c.circ_lib = v_circ_lib AND NOT c.deleted
>             ORDER BY c.create_date DESC;
>
>         LOOP
>             FETCH v_cursor INTO v_temprec;
>             EXIT WHEN NOT FOUND;
>
>             v_iter := v_iter + 1;
>
>             -- If we already have better data than current row (newer records in
>             EXIT WHEN v_oldest IS NOT NULL AND v_oldest >= v_temprec.create_date;
>
>             -- Ignore if we've seen given call number in current query (for current circ_lib)
>             CONTINUE WHEN v_seen ? v_temprec.call_number::TEXT;
>             v_seen := v_seen || public.hstore( v_temprec.call_number::TEXT, '1' );
>
>             -- If we don't have yet record for given call_number, we need to get it
>             IF v_records ? v_temprec.call_number::TEXT THEN
>                 v_record := v_records -> v_temprec.call_number::TEXT;
>             ELSE
>                 SELECT cn.record INTO v_record FROM asset.call_number cn WHERE cn.id = v_temprec.call_number;
>                 CONTINUE WHEN NOT FOUND;
>                 CONTINUE WHEN v_record <= 0;
>                 v_records := v_records || hstore( v_temprec.call_number::TEXT, v_record::TEXT );
>             END IF;
>
>             -- If results already contain "better" date for given record, next row
>             IF v_results ? v_record::TEXT THEN
>                 CONTINUE WHEN ( v_results -> v_record::TEXT )::timestamptz > v_temprec.create_date;
>             END IF;
>
>             v_found := v_found + 1;
>             v_results  := v_results || hstore( v_record::TEXT, v_temprec.create_date::TEXT );
>
>             IF v_c_oldest IS NULL OR v_c_oldest > v_temprec.create_date THEN
>                 v_c_oldest := v_temprec.create_date;
>             END IF;
>
>             EXIT WHEN v_found = p_limit;
>         END LOOP;
>
>         CLOSE v_cursor;
>
>         -- Update oldest information based on oldest row added in current loop
>         IF v_oldest IS NULL OR v_oldest < v_c_oldest THEN
>             v_oldest := v_c_oldest;
>         END IF;
>
>     END LOOP;
>     RETURN QUERY SELECT KEY::INT8, value::timestamptz FROM each(v_results) ORDER BY value::timestamptz DESC LIMIT p_limit;
> RETURN;
> END;
> $$ language plpgsql STRICT;
>
> It can be used in exactly the same way. And it returns, of course, the
> same results.
>
> But the time - well - it's 2.687 *ms* http://explain.depesz.com/s/NBwP
>
> This means that the 2nd function, for these database, with these
> parameters is almost 5000 times faster.
>
> The drawback is that it's not really nice function, as it uses (for
> performance) certain side-effects of plpgsql constructs (cursors).
>
> Contacted the client, and, as previously, it was authorized to get
> released to you guys, in hopes that you can make a decision whether the
> optimization is worth being included in core.
>
> Best regards,
>
> depesz
>
> --
> The best thing about modern society is how easy it is to avoid contact with it.
>                                                              http://depesz.com/

-- 
Mike Rylander
 | Director of Research and Development
 | Equinox Software, Inc. / Your Library's Guide to Open Source
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at esilibrary.com
 | web:  http://www.esilibrary.com