[OPEN-ILS-GENERAL] Improving relevance ranking in Evergreen

Mike Rylander mrylander at gmail.com
Wed Mar 7 14:35:24 EST 2012


On Wed, Mar 7, 2012 at 1:29 PM, James Fournie
<jfournie at sitka.bclibraries.ca> wrote:
> Thanks guys for your feedback, just a few more thoughts I had...
>
> On 2012-03-06, at 11:28 PM, Dan Scott wrote:
>>>>> Indeed they do, however rewriting them in C to be super-fast would
>>>>> improve this situation.  It's primarily a matter of available time and
>>>>> effort.  It's also, however, pretty specialized work as you're dealing
>>>>> with Postgres at a very intimate level.
>>
>> Hmm. For sure, C beats Perl for performance and would undoubtedly offer an
>> improvement,
>
> For me, I am not terribly excited about the prospect of another bit of C floating around, particularly if it's in the Postgres bits.  I personally am not familiar with C except that I can compile it and run it and maybe vaguely get an idea of what it does, but I can't really debug it very well or do anything terribly useful with it.  My concern is that adding more "specialized work" to the system makes things harder to maintain and less accessible to newcomers with the typical Linux+Perl+SQL or JS knowledge.   We could write open-ils.circ in C as well, but it's just not a good idea, so I think in general we should dismiss "rewrite it in C" as a solution to anything.
>

I can't say I agree with that -- the rewrite of open-ils.auth in C is
an unmitigated win, and open-ils.cstore (and pcrud, and other
derivatives) replacing (most of) open-ils.storage is as well, IMO.
It's all about using the right tool for the job, and once an API is
deemed very stable, performance optimization is a valid next step.
And C is, generally speaking, a better speed-oriented tool than Perl.

That said, I agree that open-ils.circ is not the next thing in line
for the translation treatment.  :)

As for it integrating with postgres, IMO
http://www.postgresql.org/docs/9.1/interactive/extend-extensions.html
shows how to do this "right" in modern times, and http://pgxn.org/ is
starting to become the CPAN of postgres extensions.  Ideally, all of
our stored procs would be rebundled as extensions.
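
To make that concrete: an extension is just a control file plus a
versioned install script.  The sketch below uses an invented extension
name and a trivial placeholder function -- none of it is the actual
Evergreen schema:

```sql
-- evergreen_search.control  (extension metadata; name is hypothetical)
--   comment = 'Evergreen search stored procedures'
--   default_version = '0.1'
--   relocatable = false

-- evergreen_search--0.1.sql  (the install script for version 0.1)
-- Objects created here are tracked as members of the extension.
CREATE FUNCTION search.example_normalize(txt TEXT) RETURNS TEXT AS $$
    SELECT lower(trim($1));
$$ LANGUAGE SQL IMMUTABLE STRICT;
```

Once packaged, `CREATE EXTENSION evergreen_search;` installs it in one
step, and `ALTER EXTENSION evergreen_search UPDATE;` applies upgrade
scripts -- which is exactly the versioning story our stored procs lack
today.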

>> ....
>> That said, some quick testing suggests that it doesn't make a difference
>> to the plan, at least for the inner query that's being sent to
>> search.query_parser_fts().
>
> Yeah, I tried it but it seemed like it did nothing.   I suspect that it is actually being treated as IMMUTABLE
>

That's unfortunate... :(
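
For anyone who wants to poke at this: Postgres records a function's
declared volatility in pg_proc, and it can be changed in place rather
than recreating the function.  The signature below is an assumption --
check \df for the real one:

```sql
-- Inspect declared volatility: 'i' = immutable, 's' = stable,
-- 'v' = volatile.  The planner will only pre-evaluate a call when it
-- can treat the function as IMMUTABLE.
SELECT proname, provolatile FROM pg_proc WHERE proname = 'naco_normalize';

-- Volatility can be altered without dropping the function
-- (argument list shown is illustrative):
ALTER FUNCTION public.naco_normalize(TEXT) STABLE;
```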

>>
>>>  * Alternatively, I mentioned off-hand the option of direct indexing.
>>> The idea is to use an expression index defined by each row in
>>> config.metabib_field and have a background process keep the indexes in
>>> sync with configuration as things in that table (and the related
>>> normalizer configuration, etc) changes.  I fear that's the path to
>>> madness, but it would be the most space efficient way to handle
>>> things.
>>
>> I don't think that's the path to madness; it appeals to me, at least.
>> (Okay, it's probably insane then.)
>
> I am going to be totally frank and say that I think it honestly might be the path to madness.  Why do we need more moving parts like a background process?  I'm a little confused, how would a background process keep the indexes in sync when the configuration changes?  Can you flesh this out a bit more?  I don't feel I fully understand how this would work.
>

You wouldn't want to lock up the database with a DROP INDEX / CREATE
INDEX pair each time a row on that table changed, so we'd need a
process by which changes are registered and batched together.
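
Roughly, the background process would do something like the following
for each changed config.metabib_field row.  Table, index, and function
names here are invented for illustration; note that index expressions
require the functions they call to be declared IMMUTABLE, which ties
back to the volatility question above:

```sql
-- Build the replacement first; CONCURRENTLY avoids holding an
-- exclusive lock on the table while the index builds.
CREATE INDEX CONCURRENTLY metabib_field_7_idx_new
    ON metabib.real_full_rec
    USING GIN (to_tsvector('english', naco_normalize(value)));

-- Then retire the stale index.  Plain DROP INDEX takes a brief
-- exclusive lock; DROP INDEX CONCURRENTLY needs Postgres 9.2+.
DROP INDEX metabib_field_7_idx;
```

The "registered and batched" part is the real work: something has to
notice the config change, queue it, and run pairs like the above off
to the side instead of inside the configuration transaction.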

> To go back to my original idea of normalizing the indexes -- snip back to Mike's first response -- I'm just wondering:
>
>> We need the pre-normalized form for some things
>>* We could find those things, other than search, for which we use
>> m.X_entry.value and move them elsewhere.  The tradeoff would be that
>> any change in config.metabib_field or normalizer configuration would
>> have to cause a rewrite of that column.
>
> Realistically, how often would one change normalizers or metabib_fields?   I don't think it's done lightly so it seems like a simple solution with a reasonable tradeoff -- you'd rarely want to change your normalizers so you're only running naco_normalizer on the table once in a blue moon vs running it on much of the table every time a search happens.   In Solr and I'm sure other search engines you have to reindex if you change these kinds of things.
>
> The question is where is the m.X_entry.value used?  It doesn't seem like anywhere when I skim the code but everything's so dynamically generated that it's hard to tell.  Any thoughts where it might be used?   I just have a hard time thinking of a use case for that field.
>

Hrm... now that facets live on their own table, and may even end up
being folded into the browse_entry infrastructure (just a thought
right now, needs analysis), I'm not thinking of anything off the top
of my head, except for the current incarnation of the display_field
branch.  If that ends up having its own table (reasonable, I think)
then ... it may be safe to fully normalize (or, at least, store a
fully-normalized version of) the value column.  There will be
intra-class, per field normalizers that won't match across the board
(vaguely recalling Dan Wells's and my discussion on this long ago), but
analysis may show it's no worse than today.

There's hope yet!  But, there's also work (mainly research and
testing) to do before any real code gets written.
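
For the record, the "once in a blue moon" renormalization James
describes would amount to a single pass like this -- table, column,
and field-id values are illustrative, not a tested migration:

```sql
-- After changing the normalizer config for one metabib field,
-- rewrite only that field's stored values, once, in bulk.
UPDATE metabib.title_field_entry
   SET value = naco_normalize(value)
 WHERE field = 7;  -- the field whose normalizer changed
```

That's the tradeoff in a nutshell: one bulk UPDATE per config change
instead of re-running the normalizer against much of the table on
every search.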

-- 
Mike Rylander
 | Director of Research and Development
 | Equinox Software, Inc. / Your Library's Guide to Open Source
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at esilibrary.com
 | web:  http://www.esilibrary.com

