[OPEN-ILS-DEV] Indexing/Search functionality (was: Planning for Evergreen development, post 2.0)

Mike Rylander mrylander at gmail.com
Wed Feb 9 15:22:19 EST 2011


So, this is a big change, and thus I'm posting the patch here first
both to solicit feedback on the direction and as a poke to those who
may be interested but might have forgotten about this over the last
several weeks.

This is phase 1, wherein I have made the db changes required for SVF
and brought QueryParser.pm in line with those changes.  This requires
Postgres 9.0+ and the hstore contrib module (along with everything
else Evergreen needs) in order to work.  I'm continuing to move
forward, but again, feedback is welcome!
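
For those who haven't played with hstore yet, the core idea is a single
indexable key/value column per record.  A minimal sketch follows -- table,
column, and attribute names here are illustrative only, not the schema from
the attached patch:

```sql
-- Illustrative only -- not the schema from the attached patch.
-- Assumes the hstore contrib module is loaded (contrib/hstore.sql on Pg 9.0).

CREATE TABLE record_attr_sketch (
    record  BIGINT PRIMARY KEY,
    attrs   HSTORE NOT NULL DEFAULT ''::HSTORE
);

-- One row holds every single-value attribute for a record:
INSERT INTO record_attr_sketch (record, attrs)
  VALUES (1, hstore(ARRAY['item_type','a', 'item_lang','eng', 'audience','e']));

-- The containment operator (@>) tests any number of attributes in one
-- GIN-indexable predicate:
CREATE INDEX record_attr_sketch_gin ON record_attr_sketch USING GIN (attrs);

SELECT record
  FROM record_attr_sketch
 WHERE attrs @> hstore(ARRAY['item_lang','eng', 'audience','e']);
```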

--miker

On Fri, Jan 21, 2011 at 12:20 PM, Mike Rylander <mrylander at gmail.com> wrote:
> Currently in Evergreen there are four different indexed bibliographic
> data storage mechanisms, each of which targets a different set of data
> and query use cases, and which carry their own caveats regarding
> application outside the designed use cases:
>
> 1) Base Search (metabib.keyword_field_entry and friends)
>  * Use: general full-text indexing; query result relevance ranking
>  * Caveat: inefficient for low-cardinality values (many records
> containing indexable data for an index definition, few unique values
> for that index definition)
> 2) Full Record (metabib.real_full_rec)
>  * Use: sorting; reporting; base data for control field indexing;
> direct MARC field search
>  * Caveat: inefficient for general searching as the format is too close to MARC
> 3) Facets (metabib.facet_entry)
>  * Use: exact match search; post-search result refining; browse
>  * Caveat: expensive as a filter on low-cardinality facets (many
> records, few unique values)
> 4) Control Fields (metabib.rec_descriptor)
>  * Use: storage for standard-based, single value record attributes
> (fixed fields, physical characteristics, etc); sorting (date1, etc);
> filtering (type, form, audience, VR format, etc); reporting; record
> analysis; low-cardinality search for known (controlled) values; search
> weighting (language)
>  * Caveat: extension beyond standard is entirely out of scope;
> extension within standard is often prohibitively expensive, requiring
> schema, trigger, code and configuration changes
>
> Reading carefully, one will notice that what is lacking is a way to
> define and index general, user-defined single value fields.  That is,
> something akin to (4) which works well for low-cardinality attributes,
> but is extensible in a manner similar to (1) using a definition table.
>  I call this concept, including the indexing, search, filtering and
> maintenance mechanisms, Single Value Fields or SVF.
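
To make the parallel with (1) concrete, here is what a definition table
along those lines might look like.  The names are hypothetical, invented
for illustration -- the actual configuration tables are in the attached
design document:

```sql
-- Hypothetical -- names are illustrative only; see the attached design
-- document for the actual configuration tables.
CREATE TABLE svf_attr_definition_sketch (
    name         TEXT PRIMARY KEY,             -- e.g. 'item_lang'
    label        TEXT NOT NULL,                -- human-friendly, I18N-able
    xpath        TEXT,                         -- extract from arbitrary MARCXML ...
    fixed_field  TEXT,                         -- ... or from a standard fixed field
    filter       BOOL NOT NULL DEFAULT TRUE,   -- usable as a search filter
    sorter       BOOL NOT NULL DEFAULT FALSE   -- usable as a sort axis
);
```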
>
> Benefits provided by such a mechanism would be wide ranging.
>  * Some access points that might currently be implemented as facets
> because of the exact-match property would be both faster and more
> flexible as an SVF
>  * Other data used for sorting could be moved to an SVF to make such
> sorting more memory and time efficient
>  * Arbitrary user-defined fields from within a record could be
> indexed and automatically exposed for use in the OPAC and staff client
>
> As a secondary effect, (4) becomes a strict subset of SVF and can be
> reimplemented as such.  This too has some very attractive benefits.
>
> On the input side, the direct benefits of folding (4) and SVF together
> are a unification of configuration APIs and interfaces, a reduction in
> the complexity (and therefore maintenance cost) of the ingest code
> that extracts values from bib records, and the elimination of the cost
> of extending (4) beyond what exists already to any standard MARC
> control or fixed field.
>
> On the output side, the benefit from this will be to reduce the cost
> of searches involving both (current-style) Control Fields (4) and
> SVF-optimizable components by folding them into one mechanism.  This
> reduces cost by eliminating one or more SQL JOINs, as well as taking
> advantage of a unified SVF index on all attributes for a record.  Thus,
> instead of one JOIN for metabib.rec_descriptor and a separate JOIN for
> each SVF-type facet used in a query, all would be replaced by a single
> JOIN to the SVF data table.  This table will be approximately the same
> size as the metabib.rec_descriptor table (similar I/O profile and
> cachedness) while providing expanded functionality.
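
As a sketch of the query-side difference (the identifiers below are
illustrative, the SVF data table name is hypothetical, and the facet field
number is made up):

```sql
-- Today: one JOIN per mechanism touched by the query.
SELECT mkfe.source
  FROM metabib.keyword_field_entry mkfe
  JOIN metabib.rec_descriptor mrd ON (mrd.record = mkfe.source
                                      AND mrd.item_lang = 'eng')
  JOIN metabib.facet_entry mfe ON (mfe.source = mkfe.source
                                   AND mfe.field = 42  -- made-up Material Type facet
                                   AND mfe.value = 'book')
 WHERE mkfe.index_vector @@ to_tsquery('piano');

-- With SVF: one JOIN, one indexable hstore containment test.
SELECT mkfe.source
  FROM metabib.keyword_field_entry mkfe
  JOIN svf_data_sketch svf ON (svf.record = mkfe.source)  -- hypothetical table
 WHERE mkfe.index_vector @@ to_tsquery('piano')
   AND svf.attrs @> hstore(ARRAY['item_lang','eng', 'mattype','book']);
```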
>
> SVF will also include a display translation mechanism for all fields
> indexed.  This means coded values can be displayed using
> human-friendly strings in any language, just as I18N-enabled MARC
> coded fields do today when stored in metabib.full_rec.  This is
> something that facets cannot do natively, and will not be able to do
> effectively -- because facet values are uncontrolled, I18N is outside
> the design scope.
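
A translation-table sketch of what I mean (hypothetical names again):

```sql
-- Hypothetical coded-value map; names are illustrative only.
CREATE TABLE coded_value_map_sketch (
    id     SERIAL PRIMARY KEY,
    attr   TEXT NOT NULL,   -- SVF attribute name, e.g. 'audience'
    code   TEXT NOT NULL,   -- value as stored, e.g. 'e'
    value  TEXT NOT NULL,   -- display string, translatable via I18N
    UNIQUE (attr, code)
);

INSERT INTO coded_value_map_sketch (attr, code, value)
  VALUES ('audience', 'e', 'Adult');
```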
>
> Put another way, the cost of using (4) today is essentially already
> paid by the use of metabib.rec_descriptor -- this, or an analog such
> as SVF must exist.  The cost of SVF replacing (4) would be comparable,
> and in some cases lower (faster).  On the other hand, the cost of
> using Facets (3) to simulate SVF-optimizable use cases for access
> points outside the Facet design constraints can be extremely high,
> depending on the cardinality of the facet.  A good example of this is
> the use of a local Material Type value.  Imagine a facet with 20 or so
> unique values, but where one of these, such as "book", is used in more
> than 3/4 of the bibliographic dataset.  In practice, when Facets (3)
> are used this way it has been identified as the #2 cause of slow
> searches.
>
> [NOTE: the #1 cause has been identified as the cost of "relevance
> bumps".  Research is underway to evaluate the efficacy of replacing
> the ranking function (using rank_cd() instead of rank()) to address
> this.  In previous versions of Postgres, rank_cd() came at a
> significant cost, but this may not be the case today.]
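
In core Postgres 8.3+ these are spelled ts_rank() and ts_rank_cd(); the
cover-density variant also accounts for term proximity.  A rough comparison
query, assuming metabib's index_vector column convention:

```sql
SELECT source,
       ts_rank(index_vector, query)    AS plain_rank,
       ts_rank_cd(index_vector, query) AS cover_density_rank
  FROM metabib.keyword_field_entry,
       to_tsquery('english', 'piano & sonata') AS query
 WHERE index_vector @@ query
 ORDER BY cover_density_rank DESC;
```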
>
> Now, the implementation of SVF and folding in of (4) is not without
> costs.  The largest of these is the need to push the required Postgres
> version to 9.0.  Pg 9.0 is considered stable and is very well
> supported by the Pg community and third party Pg support companies.
> The upgrade for existing production Evergreen sites is not terribly
> complicated, but neither is it trivial.
>
> The reason for this Postgres upgrade requirement is that the
> underlying datatypes have become more featureful and mature in 9.0,
> and the techniques I plan to use simply aren't possible in 8.4.
>
> Attached you will find my current design document which covers much of
> what is discussed above along with a basic implementation plan and
> example use-cases for each component.  There are details not included,
> such as appropriate table constraints on configuration tables, but the
> meat is there and I would welcome feedback and input!
>
> --
> Mike Rylander
>  | VP, Research and Design
>  | Equinox Software, Inc. / The Evergreen Experts
>  | phone:  1-877-OPEN-ILS (673-6457)
>  | email:  miker at esilibrary.com
>  | web:  http://www.esilibrary.com
>



-- 
Mike Rylander
 | VP, Research and Design
 | Equinox Software, Inc. / The Evergreen Experts
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at esilibrary.com
 | web:  http://www.esilibrary.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: svf-phase-1.patch
Type: text/x-patch
Size: 57181 bytes
Desc: not available
Url : http://libmail.georgialibraries.org/pipermail/open-ils-dev/attachments/20110209/c9d7e0ca/attachment-0001.bin 

