[OPEN-ILS-DEV] Deindexing deleted bibs; superpage settings
Mike Rylander
mrylander at gmail.com
Tue Mar 16 16:52:30 EDT 2010
On Tue, Mar 16, 2010 at 2:32 PM, Brandon W. Uhlman
<brandon at branflakes.net> wrote:
> Hi, all.
>
> The Evergreen system I work on is seeing some odd search behaviour which
> we've determined is the result of recent work merging our bibliographic
> records -- full sets imported from each of our member libraries to date --
> first against a set of 'authoritative' records, and then against each other.
>
> The result is that there are a large number of deleted bib records. This
> means a majority of our indexed terms are attached to deleted records. For
> our subject headings, for example:
>
> evergreen=# SELECT count(msfe.id), bre.deleted FROM
> metabib.subject_field_entry msfe JOIN biblio.record_entry bre ON
> (msfe.source = bre.id) GROUP BY deleted;
> count | deleted
> ---------+---------
> 2576458 | t
> 1517063 | f
> (2 rows)
>
> This results in superpages that are very sparsely populated with active
> nodes, especially for popular search terms that are limited by org unit or
> shelving location, et cetera. From the end user's perspective, this sometimes
> manifests as inaccurate result counts, or counts which vary wildly depending
> on the sort order, and sometimes as zero results when results do in fact
> exist, because they would have been on a later superpage than the search
> limits allowed.
>
> Increasing the superpage size from the default size of 10 pages of 1000
> records each to 50 pages of 1000 records each helped somewhat, but we still
> see odd results sometimes.
>
> So I have several questions that come from this experience:
> - is there a best practice for determining how big your superpages should
> be, and how many you should have?
500-1000 should be big enough, if ...
> - is cleaning up indexing for these deleted bib records as easy as 'DELETE
> FROM metabib.*_field_entry WHERE source IN (SELECT id FROM
> biblio.record_entry WHERE deleted = true)'
... you remove the metabib.metarecord_source_map entries for deleted
bibs, as this will effectively hide them from all searches (yes, even
non-metarecord searches).
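
A minimal sketch of that cleanup, assuming the stock schema in which
metabib.metarecord_source_map.source holds the biblio.record_entry id and
biblio.record_entry.deleted is a boolean. It is untested against any
particular Evergreen release, so try it on a copy of the database first and
check the counts before committing:

```sql
BEGIN;

-- How many map rows currently point at deleted bibs?
SELECT count(*)
  FROM metabib.metarecord_source_map mmsm
  JOIN biblio.record_entry bre ON (mmsm.source = bre.id)
 WHERE bre.deleted;

-- Remove the map rows for deleted bibs.
DELETE FROM metabib.metarecord_source_map
 WHERE source IN (SELECT id FROM biblio.record_entry WHERE deleted);

-- Same count again; no rows should match at this point.
SELECT count(*)
  FROM metabib.metarecord_source_map mmsm
  JOIN biblio.record_entry bre ON (mmsm.source = bre.id)
 WHERE bre.deleted;

COMMIT;  -- or ROLLBACK if the numbers look wrong
```

A bib that is later undeleted would presumably need its map entry recreated,
for example by re-ingesting it.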
> - what do folks think of deleting indexing information for bibs at the time
> of 'deletion', and making re-ingesting part of the undeletion process --
> since deleted records by design can't be found by indexed-field searches,
> anyway, IIRC. If there's consensus that this is a good idea, I can submit a
> patch to that effect.
>
Actually, I've been pondering whether it would be best to have the
delete-protecting RULE simply delete the metabib.metarecord_source_map
entry for the bre. That, to me, seems the most straightforward thing
to do, eh? (And, incidentally, the in-db ingest trigger /does/ do
this in trunk.)
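
A sketch of what such a rule might look like, not a tested patch. It assumes
the existing delete-protecting rule on biblio.record_entry is named
protect_bib_rec_delete and currently does nothing more than set the deleted
flag; both are assumptions, so check the rule definition in your own schema
before replacing it:

```sql
-- Assumed rule name and existing behaviour; verify with \d biblio.record_entry
CREATE OR REPLACE RULE protect_bib_rec_delete AS
    ON DELETE TO biblio.record_entry
    DO INSTEAD (
        -- Drop the map entry first, so the bib disappears from searches.
        DELETE FROM metabib.metarecord_source_map
         WHERE source = OLD.id;
        -- Then flag the bib as deleted instead of removing the row.
        UPDATE biblio.record_entry
           SET deleted = TRUE
         WHERE id = OLD.id;
    );
```

The map DELETE is listed before the UPDATE on purpose: commands in a
multi-command rule run in order, and a later command can see the changes made
by an earlier one, so doing the UPDATE first could stop the DELETE from
matching when the original statement filtered on the deleted flag.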
> I realize this is a bit of an edge case, but I'm hoping folks still have
> some comments.
>
It's less edgy than you might think, now that there are sites (like
yours) that have been around for several years and have been through
multiple deduplication merges. The iron is hot!
--
Mike Rylander
| VP, Research and Design
| Equinox Software, Inc. / The Evergreen Experts
| phone: 1-877-OPEN-ILS (673-6457)
| email: miker at esilibrary.com
| web: http://www.esilibrary.com