[OPEN-ILS-GENERAL] GSOC[2012] Some quesion regarding C-bsed stored procedures

Mike Rylander mrylander at gmail.com
Tue Mar 27 10:59:23 EDT 2012


On Sun, Mar 25, 2012 at 11:01 PM, dsy <dsy88 at sina.com> wrote:
> Hi everyone,
>
>          I’m a graduate student in Peking University, China. My research
> direction is Information Retrieve and Extraction and I have about 8 years
> programming experience in C. I intend to apply for this year’s GSOC on
> evergreen. The last two ideas interests me and I think it suits me most.
>
>          To be honest, I have not used evergreen before I see GSOC pages. So
> I’m not quite sure about the idea. I’m looking for your help to clarify some
> details and understand the whole project.
>
>          The first idea about translate certain serach-related SQL query
> components to C-based stored procedures. Does this means some of the
> components in full-text searching are working independently which leads to a
> query will be separated into many SQL query to the database?
>

Not exactly.  In order to adjust a baseline cover-density based
ranking provided by the Postgres full text search engine, we check for
several word order criteria.  Specifically, we generate three CASE
statements checking to see if the user's query word order matches the
order of words in the indexed term, to see if the first word in the
query is the first word in the indexed term, and finally to see if the
entire query matches the entire indexed term as a phrase.  Different
indexed terms make use of different combinations of these, and they
are implemented using ILIKE conditions.  Each has to re-parse the
indexed term and apply certain normalizations, so the cost is high
simply from repeated evaluation, in addition to the fact that the
Postgres executor has more work to do for each row.

If a single stored procedure were created that could implement any
combination of the possible checks, by taking a flag variable to
decide which to apply, and perform all the work in a single go, this
would speed execution of searches.

> The second idea about convert PL/Perl-based PostgreSQL stored procedures to
> PL/SQL or PL/C. As I see the problem statements, I think this idea relates
> to the first one. Is this implies we have implemented some PL/Perl based
> stored procedures to solve the first one and now we need to use some faster
> language to improve its performance?
>

That is essentially it, yes.  In particular, normalization functions
such as naco_normalize[1] and search_normalize[2] are very heavily
used and would benefit both indexing and search if reimplemented in a
faster language, but there are many candidates -- simply search the
source for 'language plperlu' (case-insensitively) to see more.

> As far as I see, the work I need to do is writing some C based string
> generator to generate SQL query to improve the query performance of the
> system. Is my understanding correct? I don’t quite sure about the bottleneck
> faced in the project, is there any reference for me to quickly understand
> the bottleneck and locate in the code? Or is there any test data shows where
> the bottleneck is and how bad it is?
>

I don't have the timings handy currently, and of course they depend on
both the data set and the server, but the relevance adjustment CASE
statements I mention above account for an overwhelming amount of the
query response time for small result sets (on the order of 80% of the
query time), and can double the time of queries on larger result sets.

> I have subscribed to the mail list and downloaded the virtual image to build
> a dev environment quickly. And I’m reading the doc in evergreen pages and
> downloading the mail archives. Is there any place I need to see?
>

Supplying a small fix or improvement is a required part of the
application, as it helps familiarize you with the codebase.  There is
a list of bugs tagged with 'bitesize' on the Evergreen launchpad
site[3] for some ideas.  Also, if you've managed to get the Evergreen
virtual image running, you can turn up query logging in Postgres to
see where query pain points are.

[1] http://git.evergreen-ils.org/?p=Evergreen.git;a=blob;f=Open-ILS/src/sql/Pg/002.functions.config.sql;h=3ea61ae289152f4f3e60f99a41aa676becd7dc9e;hb=HEAD#l638
[2] http://git.evergreen-ils.org/?p=Evergreen.git;a=blob;f=Open-ILS/src/sql/Pg/002.functions.config.sql;h=3ea61ae289152f4f3e60f99a41aa676becd7dc9e;hb=HEAD#l702
[3] https://bugs.launchpad.net/evergreen/+bugs?field.searchtext=&orderby=-importance&search=Search&field.status%3Alist=NEW&field.status%3Alist=CONFIRMED&field.status%3Alist=TRIAGED&field.status%3Alist=INPROGRESS&field.status%3Alist=FIXCOMMITTED&field.status%3Alist=INCOMPLETE_WITH_RESPONSE&field.status%3Alist=INCOMPLETE_WITHOUT_RESPONSE&assignee_option=any&field.assignee=&field.bug_reporter=&field.bug_commenter=&field.subscriber=&field.structural_subscriber=&field.tag=bitesize&field.tags_combinator=ANY&field.has_cve.used=&field.omit_dupes.used=&field.omit_dupes=on&field.affects_me.used=&field.has_patch.used=&field.has_branches.used=&field.has_branches=on&field.has_no_branches.used=&field.has_no_branches=on&field.has_blueprints.used=&field.has_blueprints=on&field.has_no_blueprints.used=&field.has_no_blueprints=on

-- 
Mike Rylander
 | Director of Research and Development
 | Equinox Software, Inc. / Your Library's Guide to Open Source
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at esilibrary.com
 | web:  http://www.esilibrary.com


More information about the Open-ils-general mailing list