[OPEN-ILS-DEV] Indexing/Search functionality (was: Planning for Evergreen development, post 2.0)

Wed Feb 9 16:50:08 EST 2011

In my testing, PG 9.0 works fine with Evergreen 2.0.  I don't know of any
production sites using that config, though.

--miker
On Feb 9, 2011 4:38 PM, "Duimovich, George" <George.Duimovich at nrcan-rncan.gc.ca> wrote:
> Hello Mike,
>
> Just a related follow-up question. I noted that the install notes for 2.0
> state "PostgreSQL 8.4 is the minimum supported version."
>
> Is PostgreSQL 9.x officially supported in EG 2.0, or is the 9.x version
> for dev/testing situations only?
>
> Thanks,
> George Duimovich
> NRCan Library
>
> -----Original Message-----
> From: open-ils-dev-bounces at list.georgialibraries.org
> [mailto:open-ils-dev-bounces at list.georgialibraries.org] On Behalf Of Mike Rylander
> Sent: February 9, 2011 15:22
> To: Evergreen Development Discussion List
> Subject: Re: [OPEN-ILS-DEV] Indexing/Search functionality (was: Planning
> for Evergreen development, post 2.0)
>
> So, this is a big change, and thus I'm posting the patch here first both
> to solicit feedback on the direction and as a poke to those who may be
> interested but might have forgotten about this over the last several weeks.
>
> This is phase 1, wherein I have made the db changes required for SVF and
> brought QueryParser.pm in line with those changes. This requires Postgres
> 9.0+ and the hstore contrib module (along with everything else Evergreen
> needs) in order to work. I'm continuing to move forward, but again, feedback
> is welcome!
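
A minimal sketch of the requirement stated above (Postgres 9.0+ plus hstore), written as a pre-flight check. It assumes the caller has already read the server's `server_version_num` setting and determined whether the hstore type is installed; the function name and its parameters are hypothetical and are not part of the patch.

```python
# Hypothetical pre-flight check; not taken from the patch or from Evergreen.
MIN_VERSION_NUM = 90000  # server_version_num for PostgreSQL 9.0.0


def version_num(major: int, minor: int, patch: int) -> int:
    """Encode a pre-10 PostgreSQL version the way server_version_num does."""
    return major * 10000 + minor * 100 + patch


def meets_svf_requirements(server_version_num: int, has_hstore: bool) -> bool:
    """True when the server is 9.0 or newer and hstore is available."""
    return server_version_num >= MIN_VERSION_NUM and has_hstore


if __name__ == "__main__":
    print(meets_svf_requirements(version_num(9, 0, 3), True))
    print(meets_svf_requirements(version_num(8, 4, 7), True))
```

The integer comparison avoids parsing version strings; how the two inputs are obtained from a live server is left out on purpose.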
>
> --miker
>
> On Fri, Jan 21, 2011 at 12:20 PM, Mike Rylander <mrylander at gmail.com> wrote:
>> Currently in Evergreen there are four different indexed bibliographic
>> data storage mechanisms, each of which targets a different set of data
>> and query use cases, and which carry their own caveats regarding
>> application outside the designed use cases:
>>
>> 1) Base Search (metabib.keyword_field_entry and friends)
>>  * Use: general full-text indexing; query result relevance ranking
>>  * Caveat: inefficient for low-cardinality values (many records
>> containing indexable data for an index definition, few unique values
>> for that index definition)
>> 2) Full Record (metabib.real_full_rec)
>>  * Use: sorting; reporting; base data for control field indexing;
>> direct MARC field search
>>  * Caveat: inefficient for general searching as the format is too
>> close to MARC
>> 3) Facets (metabib.facet_entry)
>>  * Use: exact match search; post-search result refining; browse
>>  * Caveat: expensive as a filter on low-cardinality facets (many
>> records, few unique values)
>> 4) Control Fields (metabib.rec_descriptor)
>>  * Use: storage for standard-based, single value record attributes
>> (fixed fields, physical characteristics, etc); sorting (date1, etc);
>> filtering (type, form, audience, VR format, etc); reporting; record
>> analysis; low-cardinality search for known (controlled) values; search
>> weighting (language)
>>  * Caveat: extension beyond standard is entirely out of scope;
>> extension within standard is often prohibitively expensive, requiring
>> schema, trigger, code and configuration changes
>>
>> Reading carefully, one will notice that what is lacking is a way to
>> define and index general, user-defined single value fields.  That is,
>> something akin to (4) which works well for low-cardinality attributes,
>> but is extensible in a manner similar to (1) using a definition table.
>>  I call this concept, including the indexing, search, filtering and
>> maintenance mechanisms, Single Value Fields or SVF.
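
The SVF idea described above can be sketched roughly as a definition table plus one attribute map per record. This is only an in-memory illustration under assumed names: `AttrDefinition`, `RecordAttrs`, the attribute names and the `source` strings are all hypothetical, and the actual design stores this data in Postgres.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class AttrDefinition:
    """One row of a hypothetical SVF definition table."""
    name: str    # attribute name used in searches and filters
    source: str  # where the value is extracted from (illustrative only)
    label: str   # human-readable name of the attribute


@dataclass
class RecordAttrs:
    """All single-value attributes for one bibliographic record."""
    record_id: int
    values: Dict[str, str] = field(default_factory=dict)

    def store(self, definition: AttrDefinition, value: str) -> None:
        # Single value per attribute: a later value replaces the earlier one.
        self.values[definition.name] = value


# Adding a new attribute means adding a definition, not changing the schema.
definitions = [
    AttrDefinition("item_lang", "008/35-37", "Language"),
    AttrDefinition("mat_type", "hypothetical local field", "Material Type"),
]

rec = RecordAttrs(record_id=42)
rec.store(definitions[0], "eng")
rec.store(definitions[1], "book")
rec.store(definitions[1], "dvd")  # overwrites "book"
```

After these calls the record holds exactly one value for each defined attribute, which is the property that separates this from multi-valued facets.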
>>
>> Benefits provided by such a mechanism would be wide ranging.
>>  * Some access points that might currently be implemented as facets
>> because of the exact-match property would be both faster and more
>> flexible as an SVF
>>  * Other data used for sorting could be moved to an SVF to make such
>> sorting more memory and time efficient
>>  * Arbitrary user-defined fields from within a record could be indexed
>> and automatically exposed for use in the OPAC and staff client
>>
>> As a secondary effect, (4) becomes a strict subset of SVF and can be
>> reimplemented as such.  This too has some very attractive benefits.
>>
>> On the input side, the direct benefits of folding (4) and SVF together
>> are a unification of configuration APIs and interfaces, a reduction in
>> the complexity (and therefore maintenance cost) of the ingest code
>> that extracts values from bib records, and the elimination of the cost
>> to extend (4) beyond what exists already to any standard MARC control
>> or fixed field.
>>
>> On the output side, the benefit from this will be to reduce the cost
>> of searches involving both (current-style) Control Fields (4) and
>> SVF-optimizable components by folding them into one mechanism.  This
>> reduces cost by eliminating one or more SQL JOINs, as well as taking
>> advantage of a unified SVF index on all attributes for a record.  Thus,
>> instead of one JOIN for metabib.rec_descriptor and a separate JOIN for
>> each SVF-type facet used in a query, all would be replaced by a single
>> JOIN to the SVF data table.  This table will be approximately the same
>> size as the metabib.rec_descriptor table (similar I/O profile and
>> cachedness) while providing expanded functionality.
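
To illustrate the single-JOIN point above: if every attribute for a record lives in one key/value map, a filter on several attributes becomes one containment test per record instead of one lookup per attribute source. The sketch below imitates that with plain dictionaries, in the spirit of hstore's `@>` containment operator; the table contents and attribute names are made up.

```python
def contains(attrs, wanted):
    """True when every key/value pair in `wanted` is present in `attrs`."""
    return all(attrs.get(key) == value for key, value in wanted.items())


# Hypothetical per-record attribute maps (record id -> attributes).
svf_table = {
    1: {"item_type": "a", "item_lang": "eng", "mat_type": "book"},
    2: {"item_type": "g", "item_lang": "fre", "mat_type": "dvd"},
    3: {"item_type": "a", "item_lang": "fre", "mat_type": "book"},
}


def filter_records(table, **wanted):
    """Return the ids of records whose attributes contain all wanted pairs."""
    return sorted(rid for rid, attrs in table.items() if contains(attrs, wanted))


if __name__ == "__main__":
    # A control-field style filter and a facet style filter in one pass.
    print(filter_records(svf_table, item_type="a", mat_type="book"))
```

Here a control-field style attribute (`item_type`) and a facet style attribute (`mat_type`) are checked against the same structure, which is the unification the paragraph describes.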
>>
>> SVF will also include a display translation mechanism for all fields
>> indexed.  This means coded values can be displayed using
>> human-friendly strings in any language, just as I18N-enabled MARC
>> coded fields do today when stored in metabib.full_rec.  This is
>> something that facets cannot do natively, and will not be able to do
>> effectively -- because facet values are uncontrolled, I18N is outside
>> the design scope.
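
A rough sketch of what display translation for coded values could look like. It assumes a lookup keyed by attribute and coded value with one label per locale; the `LABELS` table, the locale codes and the fallback order are assumptions for illustration, not the mechanism from the design document.

```python
# (attribute, coded value) -> {locale: label}; all entries are made-up examples.
LABELS = {
    ("item_lang", "eng"): {"en-US": "English", "fr-CA": "Anglais"},
    ("item_lang", "fre"): {"en-US": "French", "fr-CA": "Français"},
    ("mat_type", "book"): {"en-US": "Book"},
}


def display_value(attr, code, locale, default_locale="en-US"):
    """Return a label for a coded value, falling back to the default locale
    and finally to the raw code when no translation is known."""
    labels = LABELS.get((attr, code), {})
    return labels.get(locale) or labels.get(default_locale) or code


if __name__ == "__main__":
    print(display_value("item_lang", "fre", "fr-CA"))
    print(display_value("mat_type", "book", "fr-CA"))
```

This only works because the set of coded values is controlled; uncontrolled facet values would have no stable key to translate from.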
>>
>> Put another way, the cost of using (4) today is essentially already
>> paid by the use of metabib.rec_descriptor -- this, or an analog such
>> as SVF must exist.  The cost of SVF replacing (4) would be comparable,
>> and in some cases lower (faster).  On the other hand, the cost of
>> using Facets (3) to simulate SVF-optimizable use cases for access
>> points outside the Facet design constraints can be extremely high,
>> depending on the cardinality of the facet.  A good example of this is
>> the use of a local Material Type value.  Imagine a facet with 20 or so
>> unique values, but where one of these, such as "book", is used in more
>> than 3/4 of the bibliographic dataset.  In practice, when Facets (3)
>> are used this way, it has been identified as the #2 cause of slow
>> searches.
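
A back-of-the-envelope version of the Material Type example. The collection size of one million records is an assumption, the 3/4 share is taken as a lower bound, and the remaining records are assumed to be spread evenly over the other 19 values.

```python
def rows_matched(total_records, share):
    """Rows a filter must visit for a value held by `share` of the records."""
    return round(total_records * share)


TOTAL_RECORDS = 1_000_000  # assumed collection size
BOOK_SHARE = 0.75          # "book" is on at least 3/4 of the records
UNIQUE_VALUES = 20         # roughly 20 distinct Material Type values

book_rows = rows_matched(TOTAL_RECORDS, BOOK_SHARE)
# Assumption: the remaining quarter is split evenly across the other values.
rare_rows = rows_matched(TOTAL_RECORDS, (1 - BOOK_SHARE) / (UNIQUE_VALUES - 1))

if __name__ == "__main__":
    print(book_rows, rare_rows)
```

Under these assumptions, filtering on "book" touches 750,000 facet rows and filtering on one of the other values touches about 13,000, so the common value barely narrows the result set while still costing a large scan.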
>>
>> [NOTE: the #1 cause has been identified as the cost of "relevance
>> bumps".  Research is underway to evaluate the efficacy of replacing
>> the ranking function (using rank_cd() instead of rank()) to address
>> this.  In previous versions of Postgres, rank_cd() came at a
>> significant cost, but this may not be the case today.]
>>
>> Now, the implementation of SVF and folding in of (4) is not without
>> costs.  The largest of these is the need to push the required Postgres
>> version to 9.0.  Pg 9.0 is considered stable and is very well
>> supported by the Pg community and third party Pg support companies.
>> The upgrade for existing production Evergreen sites is not terribly
>> complicated, but it is non-trivial.
>>
>> The reason for this Postgres upgrade requirement is that the
>> underlying datatypes have become more featureful and mature in 9.0,
>> and the techniques I plan to use simply aren't possible in 8.4.
>>
>> Attached you will find my current design document which covers much of
>> what is discussed above along with a basic implementation plan and
>> example use-cases for each component.  There are details not included,
>> such as appropriate table constraints on configuration tables, but the
>> meat is there and I would welcome feedback and input!
>>
>> --
>> Mike Rylander
>>  | VP, Research and Design
>>  | Equinox Software, Inc. / The Evergreen Experts
>>  | phone:  1-877-OPEN-ILS (673-6457)
>>  | email:  miker at esilibrary.com
>>  | web:  http://www.esilibrary.com
>>
>