[OPEN-ILS-DEV] Simple record materialized view - possible improvements

Mike Rylander mrylander at gmail.com
Thu Aug 7 23:20:26 EDT 2008


On Thu, Aug 7, 2008 at 7:09 PM, Brandon W. Uhlman
<brandon.uhlman at bclibrary.ca> wrote:
> Hi,
[snip]
>
> My proposal is to make the materialization process a little lazier. Or to
> quote Montgomery Scott, chief engineer of the Enterprise, to lock the
> rematerialization subroutine directly into the transporter's pattern buffer.
>
> First, let's add a staleness bit/boolean to metabib.full_rec. The staleness
> bit gets set to true if the row is updated or inserted when changes to the
> simple record view are not being materialized -- that is, because
> disable_materialized_simple_record_trigger() has been called. On reenabling
> the trigger, instead of truncating the entire table and rebuilding from
> scratch, only rebuild those entries which have the staleness bit set.

This concerns me because it will (at least) double the number of writes
that have to occur to metabib.full_rec: once on the initial insert/update,
and once to clear the flag.  It's already the biggest table in there.
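
To make the cost concrete, here is a rough sketch of what the proposal
implies at the SQL level; the 'stale' column name and the clearing
statement are just illustrative, not a worked-out patch:

    -- Illustrative only: 'stale' stands in for the proposed staleness bit.
    ALTER TABLE metabib.full_rec
        ADD COLUMN stale BOOL NOT NULL DEFAULT FALSE;

    -- With the trigger disabled, the flag rides along on the normal
    -- insert/update (the first write to each row) ...

    -- ... and re-enabling would rebuild the mat-view from the flagged
    -- rows and then clear them, which is the second write to every one
    -- of those rows:
    UPDATE metabib.full_rec SET stale = FALSE WHERE stale;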

For those situations where the number of records is small enough that
disabling the trigger is not warranted, this will cause a slow-bleed
bloat of the table or, when autovacuum is in use, a very sparse and
unordered population of the table.  (As a data point, I recently
imported about 60k records in a couple of minutes with the trigger in
place; yes, I know, it would have been 30s or so, plus refreshing of
the mat-view, with the trigger disabled, but that was acceptable in my
mind.)  Either way, access ends up slower, because more disk pages
have to be read for any record-ordered queries (such as, it turns out,
population of the mat-view, among other things).

At the other end of the spectrum, you have the ginormous insert case,
which is already going to take a while and presents the same bloating
issue (arguably worse, because it's invalidating a huge block in the
middle of the table, preventing tail truncation at vacuum time) and is
not going to significantly speed up the regeneration.  In fact, it
will end up slowing it down, because of the updates back to
metabib.full_rec which would have to unset the flag (and cause the
"big hole in the middle" bloat).

>
> If we're feeling particularly irreverent, we could call the column in the
> database that says whether something is waiting to be rematerialized
> 'in_pattern_buffer'.
>
> Ok, so I just re-enabled the trigger on a dataset of about 920,000 records,
> and it only took 15 minutes. I imagined it would take longer. Nonetheless, I
> am philosophically opposed to doing work I don't think I have to, so I'm
> putting this idea out to gather moss.
>

Heh ... well, I understand what you mean, but with Dan's recent
addition to automatically use the de/enable stored procs you don't
have to do anything extra.  I'd estimate that direct_ingest alone took
about 6 CPU-hours on that dataset, so IMO the simpler (and lower
maintenance) setup we have now is the better option.  It's a tradeoff
between the rare extra time spent waiting for the mat-view to rebuild
and incurring a double-write penalty for /every/ insert and update to
the metabib.full_rec table, including during normal cataloging.

FWIW, and as another datapoint, I recently did a test import of 1.3M
records with a hand-modified parallel_pg_loader output file including
the dis/enable calls inside the transaction (along with some wrapping
DROP INDEX and CREATE INDEX calls) and it took about 30 minutes, IIRC.
 How often would one import such a dataset?  Well, for me it's pretty
often, actually (because, well, that's a lot of what ESI does ;) ),
but for a single production installation, relatively rarely.
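
For the curious, the shape of that hand-modified file was roughly the
following; the index name and definition are placeholders, and the
enable call is assumed to be the counterpart of the disable proc
mentioned earlier:

    BEGIN;

    SELECT disable_materialized_simple_record_trigger();
    DROP INDEX metabib.full_rec_placeholder_idx;         -- placeholder name

    -- ... the parallel_pg_loader-generated COPY statements go here ...

    CREATE INDEX full_rec_placeholder_idx
        ON metabib.full_rec (record);                    -- placeholder definition
    SELECT enable_materialized_simple_record_trigger();  -- assumed counterpart
                                                         -- of the disable proc
    COMMIT;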

> Thoughts?
>
> ~B
>
> (*) I guess it would if you wrapped the whole thing in one big transaction,
> but the whole reason I'm thinking of this is that I disabled the trigger
> outside a transaction, because I had multiple imports to do. I'm not quite
> sure how transactions work in PostgreSQL. Do I cost myself any performance
> by having a single "gigantically enormous" transaction, instead of a bunch
> of "just pretty big" ones?
>

As long as there are fewer than a billion command IDs inside the
transaction (not impossible if you were importing, say, 50-100M
records at once), it's less overall overhead, and less chance of
running into a problem with data half-loaded that would require
dropping everything.  The worst case would be a ROLLBACK followed by a
VACUUM FULL.
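
For the footnote scenario, wrapping the whole run in a single
transaction would look something like this; the batch file names are
placeholders for your loader output, and the enable call is again the
assumed counterpart of the disable proc:

    BEGIN;
    SELECT disable_materialized_simple_record_trigger();
    \i first_batch.sql    -- placeholder: one import's loader output
    \i second_batch.sql   -- placeholder: the next import's loader output
    SELECT enable_materialized_simple_record_trigger();  -- assumed counterpart
    COMMIT;

That way a failure partway through rolls the whole thing back cleanly,
instead of leaving half-loaded data to clean up by hand.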

-- 
Mike Rylander
 | VP, Research and Design
 | Equinox Software, Inc. / The Evergreen Experts
 | phone: 1-877-OPEN-ILS (673-6457)
 | email: miker at esilibrary.com
 | web: http://www.esilibrary.com

