[OPEN-ILS-DEV] Another possible optimization for database access/query

Mike Rylander mrylander at gmail.com
Tue Oct 8 13:58:40 EDT 2013


As promised, https://bugs.launchpad.net/evergreen/+bug/1236979 ...
note that it's quick-and-dirty, and untested as yet, and I'm about to
be unavailable for the remainder of the day.

TIA for any eyes that can be provided.

--miker


On Tue, Oct 8, 2013 at 12:54 PM, Mike Rylander <mrylander at gmail.com> wrote:
> First, thanks again for this.
>
> Some background:  this function is replacing the query generated by
> the code that lives at
> Open-ILS/src/perlmods/lib/OpenILS/Application/SuperCat.pm:643 (in
> master).  The function uses a variadic param to simplify things, but
> we actually need to be able to filter on three different types of
> values: circ_lib (covered in the patch supplied), copy status and copy
> location.
>
> I'm creating a variant of the function depesz supplied that leaves
> circ_lib as the primary filter axis, because it is the most likely to
> be supplied.  All three filters will be accepted as array arguments,
> as only the last input argument is allowed to be a variadic, and using
> a special-case test for an empty circ_lib array to open a slightly
> different cursor.
>
> This special case (in practical terms, asking for installation-wide
> item-ordered bibs, or "new bibs everywhere") will benefit from an
> index specifically on create_date, and the other cases benefit from
> the existing index on circ_lib.  I'm not sure if we should add a new
> index across both or not ... testing will tell if circ_lib+create_date
> makes thing better or not.
>
> When I have a branch, I'll create an LP bug and respond here.
>
> --miker
>
>
> On Mon, Oct 7, 2013 at 11:52 AM, hubert depesz lubaczewski
> <depesz at depesz.com> wrote:
>> Hi,
>> Couple of days ago, I started to work on #2 most time consuming query
>> (from pg logs of our client):
>>
>> SELECT "acn".record AS "record", max ("acp".create_date) AS "create_date"
>> FROM asset.call_number AS "acn"
>> INNER JOIN asset.COPY AS "acp" ON ("acp".call_number = "acn".id)
>> WHERE ("acn".record > 0)
>>   AND ("acp".deleted = 'f'
>>        AND "acp".circ_lib IN ('279',
>>                               '280',
>>                               '280',
>>                               '281',
>>                               '281',
>>                               '282',
>>                               '282',
>>                               '283',
>>                               '283',
>>                               '284',
>>                               '284',
>>                               '285',
>>                               '285',
>>                               '286',
>>                               '286',
>>                               '287',
>>                               '287',
>>                               '288',
>>                               '288',
>>                               '289',
>>                               '289'))
>> GROUP BY 1
>> ORDER BY max ("acp".create_date) DESC LIMIT 20
>> ;
>>
>> This query has explain analyze http://explain.depesz.com/s/RAMc
>> - runtime of 13.14 seconds.
>>
>>
>> First level of optimization was adding 2 indexes:
>>
>> create index copy_circ_lib on asset.copy (circ_lib);
>> create index unit_circ_lib on serial.unit (circ_lib);
>>
>> This made the query faster - 8.719s -- http://explain.depesz.com/s/UOy
>>
>> Afterwards, I added another two indexes, in hopes that it will get used:
>>
>> create index copy_circ_lib_create_date on asset.copy (circ_lib, create_date);
>> create index unit_circ_lib_create_date on serial.unit (circ_lib, create_date);
>>
>> but it wasn't used.
>>
>> Since I didn't see a way to optimize the query further by changing query
>> or adding indexes, decided to write a function instead.
>>
>> First iteration of the function:
>>
>> CREATE OR REPLACE FUNCTION asset.some_clever_name( IN p_limit INT4, VARIADIC p_circ_libs INT4[] )
>>     RETURNS TABLE ( record INT8, create_date timestamptz ) as $$
>> DECLARE
>>     v_results public.hstore := '';
>>     v_record  int8;
>>     v_temprec record;
>>     v_found   INT4 := 0;
>> BEGIN
>>     FOR v_temprec IN
>>         SELECT
>>             c.call_number,
>>             max(c.create_date) as create_date
>>         FROM
>>             asset.copy c
>>         WHERE c.circ_lib = any( p_circ_libs )
>>             AND NOT c.deleted
>>         GROUP BY c.call_number
>>         ORDER BY max(c.create_date) DESC
>>     LOOP
>>         SELECT cn.record INTO v_record FROM asset.call_number cn WHERE cn.id = v_temprec.call_number;
>>         CONTINUE WHEN NOT FOUND;
>>         CONTINUE WHEN v_record <= 0;
>>         CONTINUE WHEN v_results ? v_record::TEXT;
>>         v_found := v_found + 1;
>>         v_results  := v_results || hstore( v_record::TEXT, v_temprec.create_date::TEXT );
>>         EXIT WHEN v_found = p_limit;
>>     END LOOP;
>>     RETURN QUERY SELECT KEY::INT8, value::timestamptz FROM each(v_results) ORDER BY value::timestamptz DESC;
>> END;
>> $$ language plpgsql STRICT;
>>
>> It can be called like this:
>> SELECT * FROM asset.some_clever_name( 20, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289 );
>>
>> (i'm not good with naming stuff)
>> result - 1.389s -- http://explain.depesz.com/s/IIu
>> much better, but it's using the less optimal indexes (first set of two
>> indexes I created).
>>
>> So I wrote this - second iteration of functional approach:
>>
>> CREATE OR REPLACE FUNCTION asset.some_clever_name( IN p_limit INT4, VARIADIC p_circ_libs INT4[] )
>>     RETURNS TABLE ( record INT8, create_date timestamptz ) as $$
>> DECLARE
>>     v_results  public.hstore := '';
>>     v_seen     public.hstore := '';
>>     v_records  public.hstore := '';
>>     v_oldest   timestamptz   := NULL;
>>     v_c_oldest timestamptz   := NULL;
>>     v_found    INT4 := 0;
>>     v_circ_lib INT4;
>>     v_record   int8;
>>     v_temprec  record;
>>     v_iter     INT4;
>>     v_cursor   REFCURSOR;
>> BEGIN
>>     FOREACH v_circ_lib IN ARRAY p_circ_libs LOOP
>>         v_iter := 0;
>>         v_seen := '';
>>         v_c_oldest := NULL;
>>         open v_cursor NO SCROLL FOR
>>             SELECT c.call_number, c.create_date
>>             FROM asset.copy c
>>             WHERE c.circ_lib = v_circ_lib AND NOT c.deleted
>>             ORDER BY c.create_date DESC;
>>
>>         LOOP
>>             FETCH v_cursor INTO v_temprec;
>>             EXIT WHEN NOT FOUND;
>>
>>             v_iter := v_iter + 1;
>>
>>             -- If we already have better data than current row (newer records in
>>             EXIT WHEN v_oldest IS NOT NULL AND v_oldest >= v_temprec.create_date;
>>
>>             -- Ignore if we've seen given call number in current query (for current circ_lib)
>>             CONTINUE WHEN v_seen ? v_temprec.call_number::TEXT;
>>             v_seen := v_seen || public.hstore( v_temprec.call_number::TEXT, '1' );
>>
>>             -- If we don't have yet record for given call_number, we need to get it
>>             IF v_records ? v_temprec.call_number::TEXT THEN
>>                 v_record := v_records -> v_temprec.call_number::TEXT;
>>             ELSE
>>                 SELECT cn.record INTO v_record FROM asset.call_number cn WHERE cn.id = v_temprec.call_number;
>>                 CONTINUE WHEN NOT FOUND;
>>                 CONTINUE WHEN v_record <= 0;
>>                 v_records := v_records || hstore( v_temprec.call_number::TEXT, v_record::TEXT );
>>             END IF;
>>
>>             -- If results already contain "better" date for given record, next row
>>             IF v_results ? v_record::TEXT THEN
>>                 CONTINUE WHEN ( v_results -> v_record::TEXT )::timestamptz > v_temprec.create_date;
>>             END IF;
>>
>>             v_found := v_found + 1;
>>             v_results  := v_results || hstore( v_record::TEXT, v_temprec.create_date::TEXT );
>>
>>             IF v_c_oldest IS NULL OR v_c_oldest > v_temprec.create_date THEN
>>                 v_c_oldest := v_temprec.create_date;
>>             END IF;
>>
>>             EXIT WHEN v_found = p_limit;
>>         END LOOP;
>>
>>         CLOSE v_cursor;
>>
>>         -- Update oldest information based on oldest row added in current loop
>>         IF v_oldest IS NULL OR v_oldest < v_c_oldest THEN
>>             v_oldest := v_c_oldest;
>>         END IF;
>>
>>     END LOOP;
>>     RETURN QUERY SELECT KEY::INT8, value::timestamptz FROM each(v_results) ORDER BY value::timestamptz DESC LIMIT p_limit;
>> RETURN;
>> END;
>> $$ language plpgsql STRICT;
>>
>> It can be used in exactly the same way. And it returns, of course, the
>> same results.
>>
>> But the time - well - it's 2.687 *ms* http://explain.depesz.com/s/NBwP
>>
>> This means that the 2nd function, for these database, with these
>> parameters is almost 5000 times faster.
>>
>> The drawback is that it's not really nice function, as it uses (for
>> performance) certain side-effects of plpgsql constructs (cursors).
>>
>> Contacted the client, and, as previously, it was authorized to get
>> released to you guys, in hopes that you can make a decision whether the
>> optimization is worth being included in core.
>>
>> Best regards,
>>
>> depesz
>>
>> --
>> The best thing about modern society is how easy it is to avoid contact with it.
>>                                                              http://depesz.com/
>
>
>
> --
> Mike Rylander
>  | Director of Research and Development
>  | Equinox Software, Inc. / Your Library's Guide to Open Source
>  | phone:  1-877-OPEN-ILS (673-6457)
>  | email:  miker at esilibrary.com
>  | web:  http://www.esilibrary.com



-- 
Mike Rylander
 | Director of Research and Development
 | Equinox Software, Inc. / Your Library's Guide to Open Source
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at esilibrary.com
 | web:  http://www.esilibrary.com


More information about the Open-ils-dev mailing list