[OPEN-ILS-DEV] Indexing/Search functionality (was: Planning for Evergreen development, post 2.0)

Fri Jan 21 12:20:39 EST 2011

Currently in Evergreen there are four different indexed bibliographic
data storage mechanisms, each of which targets a different set of data
and query use cases, and each of which carries its own caveats regarding
application outside the designed use cases:

1) Base Search (metabib.keyword_field_entry and friends)
  * Use: general full-text indexing; query result relevance ranking
  * Caveat: inefficient for low-cardinality values (many records
containing indexable data for an index definition, few unique values
for that index definition)
2) Full Record (metabib.real_full_rec)
  * Use: sorting; reporting; base data for control field indexing;
direct MARC field search
  * Caveat: inefficient for general searching as the format is too close to MARC
3) Facets (metabib.facet_entry)
  * Use: exact match search; post-search result refining; browse
  * Caveat: expensive as a filter on low-cardinality facets (many
records, few unique values)
4) Control Fields (metabib.rec_descriptor)
  * Use: storage for standard-based, single value record attributes
(fixed fields, physical characteristics, etc); sorting (date1, etc);
filtering (type, form, audience, VR format, etc); reporting; record
analysis; low-cardinality search for known (controlled) values; search
weighting (language)
  * Caveat: extension beyond standard is entirely out of scope;
extension within standard is often prohibitively expensive, requiring
schema, trigger, code and configuration changes

Reading carefully, one will notice that what is lacking is a way to
define and index general, user-defined single value fields.  That is,
something akin to (4) which works well for low-cardinality attributes,
but is extensible in a manner similar to (1) using a definition table.
I call this concept, including the indexing, search, filtering and
maintenance mechanisms, Single Value Fields or SVF.
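
To make the idea concrete, here is a minimal sketch in Python of
definition-driven extraction of one value per attribute per record.
It is an illustration only and is not taken from the design document:
the SVF_DEFINITIONS table, the attribute names, the use of 998$d for a
local material type, and the dict-based record layout are all
assumptions.

```python
# Hypothetical sketch: a definition table drives extraction of at most one
# value per attribute per record. Names and record layout are assumptions.

# "fixed" entries slice a control field by position; "sub" entries take the
# first occurrence of a tag/subfield pair.
SVF_DEFINITIONS = {
    "item_lang": {"kind": "fixed", "tag": "008", "start": 35, "length": 3},
    "local_mat_type": {"kind": "sub", "tag": "998", "code": "d"},
}


def extract_svf(record, definitions=SVF_DEFINITIONS):
    """Return {attribute_name: value}, one value at most per attribute."""
    attrs = {}
    for name, d in definitions.items():
        if d["kind"] == "fixed":
            data = record.get("control", {}).get(d["tag"], "")
            value = data[d["start"]:d["start"] + d["length"]]
        else:
            # Keep only the first match: the attribute is single-valued.
            matches = [v for (tag, code, v) in record.get("subfields", [])
                       if tag == d["tag"] and code == d["code"]]
            value = matches[0] if matches else ""
        if value.strip():
            attrs[name] = value
    return attrs


# A 40-character 008 built piecewise so the language code lands at 35-37.
field_008 = "110121s2011    xxu" + " " * 17 + "eng d"
record = {
    "control": {"008": field_008},
    "subfields": [("998", "d", "book"), ("998", "d", "ignored")],
}
svf = extract_svf(record)
```

Adding a new attribute in this sketch means adding one entry to the
definition table, with no change to the extraction code, which is the
kind of extensibility described above.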

Benefits provided by such a mechanism would be wide ranging.
  * Some access points that might currently be implemented as facets
because of the exact-match property would be both faster and more
flexible as an SVF
  * Other data used for sorting could be moved to an SVF to make such
sorting more memory and time efficient
  * Arbitrary user-defined fields from within a record could be
indexed and automatically exposed for use in the OPAC and staff client

As a secondary effect, (4) becomes a strict subset of SVF and can be
reimplemented as such.  This too has some very attractive benefits.

On the input side, the direct benefits of folding (4) and SVF together
are a unification of configuration APIs and interfaces, a reduction in
the complexity (and therefore maintenance cost) of the ingest code
that extracts values from bib records, and the elimination of the cost
to extend (4) beyond what exists already to any standard MARC control
or fixed field.

On the output side, the benefit from this will be to reduce the cost
of searches involving both (current-style) Control Fields (4) and
SVF-optimizable components by folding them into one mechanism.  This
reduces cost by eliminating one or more SQL JOINs, as well as by taking
advantage of a unified SVF index on all attributes for a record.  Thus,
instead of one JOIN for metabib.rec_descriptor and a separate JOIN for
each SVF-type facet used in a query, all would be replaced by a single
JOIN to the SVF data table.  This table will be approximately the same
size as the metabib.rec_descriptor table (similar I/O profile and
cachedness) while providing expanded functionality.
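
The difference in query shape can be sketched with SQLite from Python.
This is a toy model under stated assumptions, not the planned schema:
the tables are heavily simplified stand-ins, and search_hits and
svf_attrs are hypothetical names invented for the example.

```python
import sqlite3

# Toy stand-ins for the real tables; column sets are heavily simplified.
# "search_hits" and "svf_attrs" are hypothetical names for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE search_hits (record INTEGER PRIMARY KEY);
    CREATE TABLE rec_descriptor (record INTEGER PRIMARY KEY, item_lang TEXT);
    CREATE TABLE facet_entry (source INTEGER, field TEXT, value TEXT);
    CREATE TABLE svf_attrs (
        record INTEGER PRIMARY KEY,
        item_lang TEXT, mat_type TEXT, collection TEXT
    );
""")

rows = [
    (1, "eng", "book", "adult"),
    (2, "fre", "book", "adult"),
    (3, "eng", "dvd", "adult"),
    (4, "eng", "book", "juvenile"),
]
for rec, lang, mat, coll in rows:
    conn.execute("INSERT INTO search_hits VALUES (?)", (rec,))
    conn.execute("INSERT INTO rec_descriptor VALUES (?, ?)", (rec, lang))
    conn.executemany(
        "INSERT INTO facet_entry VALUES (?, ?, ?)",
        [(rec, "mat_type", mat), (rec, "collection", coll)],
    )
    conn.execute("INSERT INTO svf_attrs VALUES (?, ?, ?, ?)",
                 (rec, lang, mat, coll))

# Current style: one JOIN for control fields plus one JOIN per facet filter.
old_sql = """
    SELECT h.record
      FROM search_hits h
      JOIN rec_descriptor rd ON rd.record = h.record
      JOIN facet_entry f1 ON f1.source = h.record
           AND f1.field = 'mat_type' AND f1.value = 'book'
      JOIN facet_entry f2 ON f2.source = h.record
           AND f2.field = 'collection' AND f2.value = 'adult'
     WHERE rd.item_lang = 'eng'
     ORDER BY h.record
"""

# SVF style: every single-value filter is answered by one JOIN.
svf_sql = """
    SELECT h.record
      FROM search_hits h
      JOIN svf_attrs a ON a.record = h.record
     WHERE a.item_lang = 'eng'
       AND a.mat_type = 'book'
       AND a.collection = 'adult'
     ORDER BY h.record
"""

old_style = [r[0] for r in conn.execute(old_sql)]
svf_style = [r[0] for r in conn.execute(svf_sql)]
```

Both queries select the same records; the second does so with one JOIN
instead of three.  The sketch shows only the shape of the queries and
says nothing about actual Postgres performance.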

SVF will also include a display translation mechanism for all fields
indexed.  This means coded values can be displayed using
human-friendly strings in any language, just as I18N-enabled MARC
coded fields do today when stored in metabib.full_rec.  This is
something that facets cannot do natively, and will not be able to do
effectively -- because facet values are uncontrolled, I18N is outside
the design scope.
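
A display translation lookup of this kind might look like the
following Python sketch.  The LABELS table, its keys, and the fallback
order (requested locale, then a default locale, then the raw code) are
assumptions made for illustration, not details from the design.

```python
# Hypothetical label table keyed by (attribute, coded value, locale).
LABELS = {
    ("item_lang", "eng", "en-US"): "English",
    ("item_lang", "eng", "fr-CA"): "Anglais",
    ("item_lang", "fre", "en-US"): "French",
}


def display_value(attr, code, locale, default_locale="en-US"):
    """Return a human-friendly label for a coded value.

    Tries the requested locale, then the default locale, and finally
    falls back to the raw code so that something is always displayed.
    """
    for loc in (locale, default_locale):
        label = LABELS.get((attr, code, loc))
        if label is not None:
            return label
    return code
```

This works only because the set of codes is controlled: each code can
be given a label in advance, which is not possible for free-text facet
values.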

Put another way, the cost of using (4) today is essentially already
paid by the use of metabib.rec_descriptor -- this, or an analog such
as SVF, must exist.  The cost of SVF replacing (4) would be comparable,
and in some cases lower (faster).  On the other hand, the cost of
using Facets (3) to simulate SVF-optimizable use cases for access
points outside the Facet design constraints can be extremely high,
depending on the cardinality of the facet.  A good example of this is
the use of a local Material Type value.  Imagine a facet with 20 or so
unique values, but where one of these, such as "book", is used in more
than 3/4 of the bibliographic dataset.  In practice, using Facets (3)
this way has been identified as the #2 cause of slow searches.
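
The skew in the Material Type example can be put into numbers with a
short Python sketch.  The dataset is made up (1,000 records, 20 values,
"book" on 760 of them) and is chosen only to match the proportions
described above.

```python
from collections import Counter


def filter_selectivity(values):
    """Map each distinct value to the fraction of records carrying it."""
    counts = Counter(values)
    total = len(values)
    return {value: n / total for value, n in counts.items()}


# Made-up dataset: "book" on 760 of 1,000 records, and the remaining 240
# records spread round-robin over 19 other (hypothetical) material types.
others = [f"type_{i:02d}" for i in range(19)]
values = ["book"] * 760 + [others[i % 19] for i in range(240)]

selectivity = filter_selectivity(values)
book_share = selectivity["book"]
largest_other = max(s for v, s in selectivity.items() if v != "book")
```

With these numbers a filter on "book" matches about three quarters of
all records, while any other value matches a little over 1%.  A filter
that keeps most of the dataset does very little narrowing, which is why
it is a poor fit for a mechanism designed around exact match on many
distinct values.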

[NOTE: the #1 cause has been identified as the cost of "relevance
bumps".  Research is underway to evaluate the efficacy of replacing
the ranking function (using rank_cd() instead of rank()) to address
this.  In previous versions of Postgres, rank_cd() came at a
significant cost, but this may not be the case today.]

Now, the implementation of SVF and folding in of (4) is not without
costs.  The largest of these is the need to push the required Postgres
version to 9.0.  Pg 9.0 is considered stable and is very well
supported by the Pg community and third party Pg support companies.
The upgrade for existing production Evergreen sites is not terribly
complicated, but it is non-trivial.

The reason for this Postgres upgrade requirement is that the
underlying datatypes have become more featureful and mature in 9.0,
and the techniques I plan to use simply aren't possible in 8.4.

Attached you will find my current design document which covers much of
what is discussed above along with a basic implementation plan and
example use-cases for each component.  There are details not included,
such as appropriate table constraints on configuration tables, but the
meat is there and I would welcome feedback and input!

-- 
Mike Rylander
 | VP, Research and Design
 | Equinox Software, Inc. / The Evergreen Experts
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at esilibrary.com
 | web:  http://www.esilibrary.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Single-valueFieldsinIndexedhstore.pdf
Type: application/pdf
Size: 105192 bytes
Desc: not available
Url : http://libmail.georgialibraries.org/pipermail/open-ils-dev/attachments/20110121/dca41665/attachment-0001.pdf