[OPEN-ILS-DEV] Another possible optimization for database access/query

Mon Oct 7 11:52:30 EDT 2013

Hi,
Couple of days ago, I started to work on #2 most time consuming query
(from pg logs of our client):

SELECT "acn".record AS "record", max ("acp".create_date) AS "create_date"
FROM asset.call_number AS "acn"
INNER JOIN asset.COPY AS "acp" ON ("acp".call_number = "acn".id)
WHERE ("acn".record > 0)
  AND ("acp".deleted = 'f'
       AND "acp".circ_lib IN ('279',
                              '280',
                              '280',
                              '281',
                              '281',
                              '282',
                              '282',
                              '283',
                              '283',
                              '284',
                              '284',
                              '285',
                              '285',
                              '286',
                              '286',
                              '287',
                              '287',
                              '288',
                              '288',
                              '289',
                              '289'))
GROUP BY 1
ORDER BY max ("acp".create_date) DESC LIMIT 20
;

This query has explain analyze http://explain.depesz.com/s/RAMc
- runtime of 13.14 seconds.

First level of optimization was adding 2 indexes:

create index copy_circ_lib on asset.copy (circ_lib);
create index unit_circ_lib on serial.unit (circ_lib);

This made the query faster - 8.719s -- http://explain.depesz.com/s/UOy

Afterwards, I added another two indexes, in hopes that it will get used:

create index copy_circ_lib_create_date on asset.copy (circ_lib, create_date);
create index unit_circ_lib_create_date on serial.unit (circ_lib, create_date);

but it wasn't used.

Since I didn't see a way to optimize the query further by changing query
or adding indexes, decided to write a function instead.

First iteration of the function:

CREATE OR REPLACE FUNCTION asset.some_clever_name( IN p_limit INT4, VARIADIC p_circ_libs INT4[] )
    RETURNS TABLE ( record INT8, create_date timestamptz ) as $$
DECLARE
    v_results public.hstore := '';
    v_record  int8;
    v_temprec record;
    v_found   INT4 := 0;
BEGIN
    FOR v_temprec IN
        SELECT
            c.call_number,
            max(c.create_date) as create_date
        FROM
            asset.copy c
        WHERE c.circ_lib = any( p_circ_libs )
            AND NOT c.deleted
        GROUP BY c.call_number
        ORDER BY max(c.create_date) DESC
    LOOP
        SELECT cn.record INTO v_record FROM asset.call_number cn WHERE cn.id = v_temprec.call_number;
        CONTINUE WHEN NOT FOUND;
        CONTINUE WHEN v_record <= 0;
        CONTINUE WHEN v_results ? v_record::TEXT;
        v_found := v_found + 1;
        v_results  := v_results || hstore( v_record::TEXT, v_temprec.create_date::TEXT );
        EXIT WHEN v_found = p_limit;
    END LOOP;
    RETURN QUERY SELECT KEY::INT8, value::timestamptz FROM each(v_results) ORDER BY value::timestamptz DESC;
END;
$$ language plpgsql STRICT;

It can be called like this:
SELECT * FROM asset.some_clever_name( 20, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289 );

(i'm not good with naming stuff)
result - 1.389s -- http://explain.depesz.com/s/IIu
much better, but it's using the less optimal indexes (first set of two
indexes I created).

So I wrote this - second iteration of functional approach:

CREATE OR REPLACE FUNCTION asset.some_clever_name( IN p_limit INT4, VARIADIC p_circ_libs INT4[] )
    RETURNS TABLE ( record INT8, create_date timestamptz ) as $$
DECLARE
    v_results  public.hstore := '';
    v_seen     public.hstore := '';
    v_records  public.hstore := '';
    v_oldest   timestamptz   := NULL;
    v_c_oldest timestamptz   := NULL;
    v_found    INT4 := 0;
    v_circ_lib INT4;
    v_record   int8;
    v_temprec  record;
    v_iter     INT4;
    v_cursor   REFCURSOR;
BEGIN
    FOREACH v_circ_lib IN ARRAY p_circ_libs LOOP
        v_iter := 0;
        v_seen := '';
        v_c_oldest := NULL;
        open v_cursor NO SCROLL FOR 
            SELECT c.call_number, c.create_date
            FROM asset.copy c
            WHERE c.circ_lib = v_circ_lib AND NOT c.deleted
            ORDER BY c.create_date DESC;

        LOOP
            FETCH v_cursor INTO v_temprec;
            EXIT WHEN NOT FOUND;

            v_iter := v_iter + 1;

            -- If we already have better data than current row (newer records in
            EXIT WHEN v_oldest IS NOT NULL AND v_oldest >= v_temprec.create_date;

            -- Ignore if we've seen given call number in current query (for current circ_lib)
            CONTINUE WHEN v_seen ? v_temprec.call_number::TEXT;
            v_seen := v_seen || public.hstore( v_temprec.call_number::TEXT, '1' );

            -- If we don't have yet record for given call_number, we need to get it
            IF v_records ? v_temprec.call_number::TEXT THEN
                v_record := v_records -> v_temprec.call_number::TEXT;
            ELSE
                SELECT cn.record INTO v_record FROM asset.call_number cn WHERE cn.id = v_temprec.call_number;
                CONTINUE WHEN NOT FOUND;
                CONTINUE WHEN v_record <= 0;
                v_records := v_records || hstore( v_temprec.call_number::TEXT, v_record::TEXT );
            END IF;

            -- If results already contain "better" date for given record, next row
            IF v_results ? v_record::TEXT THEN
                CONTINUE WHEN ( v_results -> v_record::TEXT )::timestamptz > v_temprec.create_date;
            END IF;

            v_found := v_found + 1;
            v_results  := v_results || hstore( v_record::TEXT, v_temprec.create_date::TEXT );

            IF v_c_oldest IS NULL OR v_c_oldest > v_temprec.create_date THEN
                v_c_oldest := v_temprec.create_date;
            END IF;

            EXIT WHEN v_found = p_limit;
        END LOOP;

        CLOSE v_cursor;

        -- Update oldest information based on oldest row added in current loop
        IF v_oldest IS NULL OR v_oldest < v_c_oldest THEN
            v_oldest := v_c_oldest;
        END IF;

    END LOOP;
    RETURN QUERY SELECT KEY::INT8, value::timestamptz FROM each(v_results) ORDER BY value::timestamptz DESC LIMIT p_limit;
RETURN;
END;
$$ language plpgsql STRICT;

It can be used in exactly the same way. And it returns, of course, the
same results.

But the time - well - it's 2.687 *ms* http://explain.depesz.com/s/NBwP

This means that the 2nd function, for these database, with these
parameters is almost 5000 times faster.

The drawback is that it's not really nice function, as it uses (for
performance) certain side-effects of plpgsql constructs (cursors).

Contacted the client, and, as previously, it was authorized to get
released to you guys, in hopes that you can make a decision whether the
optimization is worth being included in core.

Best regards,

depesz

-- 
The best thing about modern society is how easy it is to avoid contact with it.
                                                             http://depesz.com/