[OPEN-ILS-DEV] Indexing/Search functionality (was: Planning for Evergreen development, post 2.0)

Duimovich, George George.Duimovich at NRCan-RNCan.gc.ca
Wed Feb 9 16:38:03 EST 2011


Hello Mike,

Just related follow-up question. I noted that the install notes for 2.0 state "PostgreSQL 8.4 is the minimum supported version."

Is PostgreSQL 9.x officially supported in EG 2.0, or is the 9.x version for dev/testing situations only?

Thanks,
George Duimovich
NRCan Library

-----Original Message-----
From: open-ils-dev-bounces at list.georgialibraries.org [mailto:open-ils-dev-bounces at list.georgialibraries.org] On Behalf Of Mike Rylander
Sent: February 9, 2011 15:22
To: Evergreen Development Discussion List
Subject: Re: [OPEN-ILS-DEV] Indexing/Search functionality (was: Planning for Evergreen development, post 2.0)

So, this is a big change, and thus I'm posting the patch here first both to solicit feedback on the direction and as a poke to those who may be interested but might have forgotten about this over the last several weeks.

This is phase 1, wherein I have made the db changes required for SVF and brought QueryParser.pm in line with those changes.  This requires Postgres 9.0+ and the hstore contrib module (along with everything else Evergreen needs) in order to work.  I'm continuing to move forward, but again, feedback is welcome!
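For anyone who hasn't used it, hstore is a contrib key/value type: a single column can carry an arbitrary set of attributes and supports indexable containment queries, which is the shape SVF needs.  A minimal sketch follows -- table and column names are invented for illustration, not the schema from the patch:

```sql
-- Illustrative only; not the actual Evergreen schema.
-- (On Pg 9.0, load hstore by running the contrib hstore.sql script;
--  CREATE EXTENSION only arrives in 9.1.)

CREATE TABLE example_rec_attrs (
    record  BIGINT PRIMARY KEY,
    attrs   HSTORE NOT NULL DEFAULT ''
);

-- One row carries every single-valued attribute for a record.
INSERT INTO example_rec_attrs (record, attrs)
     VALUES (1, 'item_type=>a, audience=>j, item_lang=>eng');

-- A GIN index makes containment (@>) probes fast.
CREATE INDEX example_rec_attrs_gin ON example_rec_attrs USING GIN (attrs);

-- Find juvenile English-language records with one indexed lookup.
SELECT record
  FROM example_rec_attrs
 WHERE attrs @> 'audience=>j, item_lang=>eng';
```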

--miker

On Fri, Jan 21, 2011 at 12:20 PM, Mike Rylander <mrylander at gmail.com> wrote:
> Currently in Evergreen there are four different indexed bibliographic 
> data storage mechanisms, each of which targets a different set of data 
> and query use cases, and which carry their own caveats regarding 
> application outside the designed use cases:
>
> 1) Base Search (metabib.keyword_field_entry and friends)
>  * Use: general full-text indexing; query result relevance ranking
>  * Caveat: inefficient for low-cardinality values (many records 
> containing indexable data for an index definition, few unique values 
> for that index definition)
> 2) Full Record (metabib.real_full_rec)
>  * Use: sorting; reporting; base data for control field indexing; 
> direct MARC field search
>  * Caveat: inefficient for general searching as the format is too 
> close to MARC
> 3) Facets (metabib.facet_entry)
>  * Use: exact match search; post-search result refining; browse
>  * Caveat: expensive as a filter on low-cardinality facets (many 
> records, few unique values)
> 4) Control Fields (metabib.rec_descriptor)
>  * Use: storage for standard-based, single value record attributes 
> (fixed fields, physical characteristics, etc); sorting (date1, etc); 
> filtering (type, form, audience, VR format, etc); reporting; record 
> analysis; low-cardinality search for known (controlled) values; search 
> weighting (language)
>  * Caveat: extension beyond standard is entirely out of scope; 
> extension within standard is often prohibitively expensive, requiring 
> schema, trigger, code and configuration changes
>
> Reading carefully, one will notice that what is lacking is a way to 
> define and index general, user-defined single value fields.  That is, 
> something akin to (4) which works well for low-cardinality attributes, 
> but is extensible in a manner similar to (1) using a definition table.
>  I call this concept, including the indexing, search, filtering and 
> maintenance mechanisms, Single Value Fields or SVF.
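To sketch the shape of that definition-table approach (names here are illustrative only; the real proposal is in the attached design document): a definition table describes each single-value field and where to extract it from, and a compact per-record store holds the extracted values.

```sql
-- Illustrative sketch only, not the proposed schema.
-- Each row defines one single-value field and how to extract it.
CREATE TABLE example_svf_definition (
    name    TEXT PRIMARY KEY,            -- e.g. 'item_type', 'local_mat_type'
    xpath   TEXT NOT NULL,               -- where in the MARCXML to find the value
    filter  BOOL NOT NULL DEFAULT TRUE,  -- usable as a search filter
    sorter  BOOL NOT NULL DEFAULT FALSE  -- usable as a sort axis
);

-- All extracted values for a record live in one compact row.
CREATE TABLE example_svf_entry (
    record  BIGINT PRIMARY KEY,
    attrs   HSTORE NOT NULL DEFAULT ''
);
```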
>
> Benefits provided by such a mechanism would be wide ranging.
>  * Some access points that might currently be implemented as facets 
> because of the exact-match property would be both faster and more 
> flexible as an SVF
>  * Other data used for sorting could be moved to an SVF to make such 
> sorting more memory and time efficient
>  * Arbitrary user-defined fields from within a record could be indexed 
> and automatically exposed for use in the OPAC and staff client
>
> As a secondary effect, (4) becomes a strict subset of SVF and can be 
> reimplemented as such.  This too has some very attractive benefits.
>
> On the input side, the direct benefits of folding (4) and SVF together 
> are a unification of configuration APIs and interfaces, a reduction in 
> the complexity (and therefore maintenance cost) of the ingest code 
> that extracts values from bib records, and the elimination of the cost 
> to extend (4) beyond what exists already to any standard MARC control 
> or fixed field.
>
> On the output side, the benefit from this will be to reduce the cost 
> of searches involving both (current-style) Control Fields (4) and 
> SVF-optimizable components by folding them into one mechanism.  This 
> reduces cost by eliminating one or more SQL JOINs, as well as taking 
> advantage of a unified SVF index on all attributes for a record.  Thus, 
> instead of one JOIN for metabib.rec_descriptor and a separate JOIN for 
> each SVF-type facet used in a query, all would be replaced by a single 
> JOIN to the SVF data table.  This table will be approximately the same 
> size as the metabib.rec_descriptor table (similar I/O profile and
> cachedness) while providing expanded functionality.
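Concretely, the JOIN reduction looks like this (illustrative column names and values; the real tables are as named in the list above):

```sql
-- Today (illustrative): one JOIN per attribute source.
SELECT m.source
  FROM metabib.keyword_field_entry m
  JOIN metabib.rec_descriptor rd
       ON (rd.record = m.source AND rd.item_type = 'a')
  JOIN metabib.facet_entry f
       ON (f.source = m.source AND f.value = 'book')
 WHERE m.index_vector @@ to_tsquery('twain');

-- With SVF (illustrative): one JOIN, one probe on the combined attributes.
SELECT m.source
  FROM metabib.keyword_field_entry m
  JOIN example_svf_entry svf ON (svf.record = m.source)
 WHERE m.index_vector @@ to_tsquery('twain')
   AND svf.attrs @> 'item_type=>a, local_mat_type=>book';
```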
>
> SVF will also include a display translation mechanism for all fields 
> indexed.  This means coded values can be displayed using 
> human-friendly strings in any language, just as I18N-enabled MARC 
> coded fields do today when stored in metabib.full_rec.  This is 
> something that facets cannot do natively, and will not be able to do 
> effectively -- because facet values are uncontrolled, I18N is outside 
> the design scope.
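The translation layer itself can be a simple lookup keyed on field, code, and locale (again, invented names for illustration):

```sql
-- Illustrative only: map stored codes to per-locale display labels.
CREATE TABLE example_coded_value_map (
    field   TEXT NOT NULL,  -- e.g. 'item_type'
    code    TEXT NOT NULL,  -- e.g. 'a'
    locale  TEXT NOT NULL,  -- e.g. 'en-US', 'fr-CA'
    label   TEXT NOT NULL,  -- e.g. 'Book', 'Livre'
    PRIMARY KEY (field, code, locale)
);
```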
>
> Put another way, the cost of using (4) today is essentially already 
> paid by the use of metabib.rec_descriptor -- this, or an analog such 
> as SVF, must exist.  The cost of SVF replacing (4) would be comparable, 
> and in some cases lower (faster).  On the other hand, the cost of 
> using Facets (3) to simulate SVF-optimizable use cases for access 
> points outside the Facet design constraints can be extremely high, 
> depending on the cardinality of the facet.  A good example of this is 
> the use of a local Material Type value.  Imagine a facet with 20 or so 
> unique values, but where one of these, such as "book", is used in more 
> than 3/4 of the bibliographic dataset.  In practice, when Facets (3) 
> are used this way it has been identified as the #2 cause of slow 
> searches.
>
> [NOTE: the #1 cause has been identified as the cost of "relevance 
> bumps".  Research is underway to evaluate the efficacy of replacing 
> the ranking function (using ts_rank_cd() instead of ts_rank()) to 
> address this.  In previous versions of Postgres, ts_rank_cd() came at 
> a significant cost, but this may not be the case today.]
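For reference, both ranking functions ship with core Postgres full-text search: ts_rank() scores on term frequency alone, while ts_rank_cd() ("cover density") also rewards term proximity.  Comparing them on an existing index is straightforward (column names assumed for illustration):

```sql
-- Compare the two built-in ranking functions side by side.
SELECT source,
       ts_rank(index_vector, query)    AS frequency_rank,
       ts_rank_cd(index_vector, query) AS cover_density_rank
  FROM metabib.keyword_field_entry,
       to_tsquery('tom & sawyer') AS query
 WHERE index_vector @@ query
 ORDER BY cover_density_rank DESC;
```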
>
> Now, the implementation of SVF and folding in of (4) is not without 
> costs.  The largest of these is the need to push the required Postgres 
> version to 9.0.  Pg 9.0 is considered stable and is very well 
> supported by the Pg community and third-party Pg support companies.
> The upgrade for existing production Evergreen sites is not terribly 
> complicated, but it is non-trivial.
>
> The reason for this Postgres upgrade requirement is that the 
> underlying datatypes have become more featureful and mature in 9.0, 
> and the techniques I plan to use simply aren't possible in 8.4.
>
> Attached you will find my current design document which covers much of 
> what is discussed above along with a basic implementation plan and 
> example use-cases for each component.  There are details not included, 
> such as appropriate table constraints on configuration tables, but the 
> meat is there and I would welcome feedback and input!
>
> --
> Mike Rylander
>  | VP, Research and Design
>  | Equinox Software, Inc. / The Evergreen Experts
>  | phone:  1-877-OPEN-ILS (673-6457)
>  | email:  miker at esilibrary.com
>  | web:  http://www.esilibrary.com
>



--
Mike Rylander
 | VP, Research and Design
 | Equinox Software, Inc. / The Evergreen Experts
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at esilibrary.com
 | web:  http://www.esilibrary.com

