[OPEN-ILS-DEV] Re: Napkin drawings for removing OpenILS::Application::Storage::FTS
Mike Rylander
mrylander at gmail.com
Tue Mar 9 14:52:50 EST 2010
So, here's some more updatification. No BNF yet, but everything
mentioned below still works as described. Also, you can now add more
than one field to a specific index definition search, thusly:
title|proper|uniform|translated: harry potter
and the query built will search all of those index definitions.
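
For illustration only, here is a minimal sketch of how a leading class|field|field: specifier could be pulled apart. It is not taken from QueryParser.pm: the split_specifier name and the returned hash layout are assumptions, and it only handles a specifier at the very start of the string.

```perl
use strict;
use warnings;

# Hypothetical helper (not from QueryParser.pm): split a leading
# "class|field|field:" specifier off a search string.
sub split_specifier {
    my ($input) = @_;

    # A specifier is a class name, zero or more |field parts, then a colon.
    if ($input =~ /^\s*(\w+)((?:\|\w+)*):\s*(.*)$/s) {
        my ($class, $field_part, $terms) = ($1, $2, $3);
        # Splitting "|a|b" yields a leading empty string, so drop empties.
        my @fields = grep { length } split /\|/, $field_part;
        return { class => $class, fields => \@fields, terms => $terms };
    }

    # No specifier: leave the class undefined so the caller can apply a default.
    return { class => undef, fields => [], terms => $input };
}

my $spec = split_specifier('title|proper|uniform|translated: harry potter');
print join(', ', $spec->{class}, @{ $spec->{fields} }), " => $spec->{terms}\n";
```

A caller would then search every index definition named in the fields list, or the whole class when that list is empty.
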
You'll find attached 3 files:
* QueryParser.pm -- the core parser and query tree building code
* EG_QueryParser.pm -- an Evergreen-specific subclass which adds
tsearch2 logic, SQL generation, and code which populates the parser in
the right way for Evergreen by filling in classes, fields, modifiers
and functions that we need
* fts-replacement.pl -- a script to test the above modules
My test command line is currently:
$ time PERL5LIB=$PERL5LIB:/home/miker/svn/OpenSRF/src/perl/lib/ \
    ./fts-replacement.pl --query='ti:jk harry kw:rowling sort(title)#descending'
and you can adjust the PERL5LIB and query to taste (or drop the
--query option to see the redonkulous default test query). The only
outside dependency is OpenSRF::Utils::JSON, hence the PERL5LIB
mangling. If you have OpenSRF installed, there's no need for that bit.
Thoughts or comments welcome.
--miker
On Thu, Feb 11, 2010 at 2:32 PM, Mike Rylander <mrylander at gmail.com> wrote:
> What's the definition of insanity, again? ;)
>
> Here's round 2 of a possible new query parser. In this version we
> support configurable search classes, fields, modifiers and filters.
> I'll try to construct a BNF for it, but by way of example the gist is:
>
> Classed search:
> title: harry potter
> keyword: water
>
> Specific field (index definition) search:
> author|personal: rowling
> subject|name: potter
>
> Search modifiers
> #metabib
> #available
> #descending
>
> Argumented filters:
> statuses(0,7,12)
> estimation_strategy(exclusion)
> item_form(d)
> during(2009,2010)
>
> Phrases (non-stemming matches):
> "multi-word phrase (exact) matches"
> +one +word +exact +matches
>
> Grouping and Boolean operators (chocolate and peanut butter):
> (foo && bar) || baz
>
> Everything but the grouping, boolean and phrase operators is
> configurable, so adding classes and fields, modifiers and filters is
> up to (available to?) the developer. Some examples:
>
> __PACKAGE__->add_search_class( 'keyword' );
> __PACKAGE__->add_search_class( 'title' );
> __PACKAGE__->add_search_class( 'author' );
> __PACKAGE__->add_search_class( 'subject' );
> __PACKAGE__->add_search_class( 'series' );
>
> __PACKAGE__->add_search_field( author => 'corporate' );
>
> __PACKAGE__->add_search_filter( 'audience' );
> __PACKAGE__->add_search_filter( 'vr_format' );
> __PACKAGE__->add_search_filter( 'format' );
> __PACKAGE__->add_search_filter( 'item_type' );
> __PACKAGE__->add_search_filter( 'item_form' );
> __PACKAGE__->add_search_filter( 'lit_form' );
> __PACKAGE__->add_search_filter( 'location' );
>
> __PACKAGE__->add_search_modifier( 'available' );
> __PACKAGE__->add_search_modifier( 'descending' );
> __PACKAGE__->add_search_modifier( 'ascending' );
> __PACKAGE__->add_search_modifier( 'metarecord' );
>
> This also supports both class-wide and class+field aliasing via
> regexp, for use in mapping CQL relations to evergreen search classes
> and fields:
>
> __PACKAGE__->add_search_class_alias( author => 'name' );
> __PACKAGE__->add_search_class_alias( author => 'dc.contributor' );
>
> __PACKAGE__->add_search_class_alias( subject => 'bib.subject(?:Title|Place|Occupation)' );
> __PACKAGE__->add_search_field_alias( subject => name => 'bib.subjectName' );
> # keyword|standard_number isn't a stock evergreen index def, jfyi
> __PACKAGE__->add_search_field_alias( keyword => standard_number => 'dc.identifier' );
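
As a rough sketch of what regexp-based class aliasing amounts to, the lookup below maps an incoming name to a search class. The @class_aliases table and the resolve_class function are hypothetical and are not how QueryParser.pm stores or applies aliases; anchoring the pattern at both ends is also an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical alias table: [ search class => regexp source ] pairs.
my @class_aliases = (
    [ author  => 'name' ],
    [ author  => 'dc.contributor' ],
    [ subject => 'bib.subject(?:Title|Place|Occupation)' ],
);

# Return the search class an incoming name maps to.
sub resolve_class {
    my ($name) = @_;
    for my $pair (@class_aliases) {
        my ($class, $pattern) = @$pair;
        # Anchored (an assumption here) so 'name' does not match 'surname'.
        return $class if $name =~ /^(?:$pattern)$/;
    }
    return $name;    # not an alias; assume it is already a class name
}

print resolve_class($_), "\n" for qw(bib.subjectPlace dc.contributor title);
```
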
>
> Anyway, as before, feedback appreciated.
>
> --miker
>
> On Wed, Feb 3, 2010 at 4:55 PM, Mike Rylander <mrylander at gmail.com> wrote:
>> On Thu, Jan 21, 2010 at 3:54 PM, Mike Rylander <mrylander at gmail.com> wrote:
>>> Search syntax thought experiment
>>> --------------------------------------------------
>>>
>>> Multi-stage search-to-query compilation
>>>
>>> Given a default search class of "keyword" and a search string of:
>>> ( foo "bar" -baz || gar || ti|proper:qux) && au:junk
>>>
>>>
>>> Stage 1: Boolean decomposition
>>>
>>> { bool : 'and',
>>> query : [
>>> { author : 'junk' },
>>> { bool : 'or',
>>> query : [
>>> { keyword : 'foo "bar" -baz || gar' },
>>> { 'title|proper' : 'qux' }
>>> ]
>>> }
>>> ]
>>> }
>>>
>>> Lacking grouping parens, we bind pairs of || or && separated atoms
>>> (assuming &&), working from left to right. If adjacent atoms belong
>>> to the same class[|field] specifier, they are folded into a single
>>> leaf. Atoms are whitespace separated components not spelled '(', ')',
>>> '||' or '&&'.
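
To make the atom definition concrete, here is a small tokenizer sketch. It is an illustration under assumptions, not the attached parser: the tokenize name and the 'op'/'atom' labels are made up, parens are split off even when they touch a word (as in "qux)"), and a quoted phrase is only kept whole when it starts its own token.

```perl
use strict;
use warnings;

# Split a query into grouping/boolean operators and atoms.
sub tokenize {
    my ($query) = @_;
    my @tokens;
    # Try parens and || / && first, then a quoted phrase, then a bare word.
    while ($query =~ /\G\s*(\(|\)|\|\||&&|"[^"]*"|[^\s()]+)/gc) {
        my $text = $1;
        my $type = $text =~ /^(?:\(|\)|\|\||&&)$/ ? 'op' : 'atom';
        push @tokens, [ $type, $text ];
    }
    return @tokens;
}

my @tokens = tokenize('( foo "bar" -baz || gar || ti|proper:qux) && au:junk');
print join(' ', map { "$_->[0]=$_->[1]" } @tokens), "\n";
```

Folding adjacent atoms that share a class[|field] specifier into one leaf would be a second pass over this token list.
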
>>>
>>>
>>>
>>> Stage 2: Search decomposition
>>>
>>> { bool : 'and',
>>> query : [
>>> { ftsquery : ['junk'],
>>> phrases : [],
>>> classname : 'author',
>>> fields : [7, 8, 9, 10]
>>> },
>>> { bool : 'or',
>>> query : [
>>> { ftsquery : [[['foo', '&', 'bar'], '&!', 'baz'], '|', 'gar' ],
>>> phrases : ['bar'],
>>> classname : 'keyword',
>>> fields : [15]
>>> },
>>> { ftsquery : ['qux'],
>>> phrases : [],
>>> classname : 'title',
>>> fields : [6]
>>> }
>>> ]
>>> }
>>> ]
>>> }
>>>
>>> Then, we walk that tree (probably a plperlu stored proc) looking for
>>> leaf hashes (those that don't contain a key of "query") and build up
>>> SELECT (ranking), FROM (sourcing) and WHERE (tsquery and phrase
>>> matching) clauses as we see them. We apply the union of the index
>>> normalizers defined for the referenced fields to the non-joiner atoms
>>> in the "ftsquery" structure.
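
The leaf-finding part of that walk can be sketched in plain Perl as below. This is only an illustration: the collect_leaves name is hypothetical, the tree is the Stage 2 example written as a Perl structure, and no SQL clauses or normalizers are applied here.

```perl
use strict;
use warnings;

# The Stage 2 example tree as a Perl data structure.
my $tree = {
    bool  => 'and',
    query => [
        { ftsquery => ['junk'], phrases => [],
          classname => 'author', fields => [ 7, 8, 9, 10 ] },
        { bool  => 'or',
          query => [
              { ftsquery => [ [ [ 'foo', '&', 'bar' ], '&!', 'baz' ], '|', 'gar' ],
                phrases => ['bar'], classname => 'keyword', fields => [15] },
              { ftsquery => ['qux'], phrases => [],
                classname => 'title', fields => [6] },
          ],
        },
    ],
};

# A leaf is any hash without a "query" key; recurse into everything else.
sub collect_leaves {
    my ($node) = @_;
    return ($node) unless exists $node->{query};
    return map { collect_leaves($_) } @{ $node->{query} };
}

for my $leaf (collect_leaves($tree)) {
    print "$leaf->{classname}: ", join(', ', @{ $leaf->{fields} }), "\n";
}
```

Each leaf returned this way is where the SELECT, FROM and WHERE fragments would be generated.
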
>>>
>>> Eh? I can see my way from here to there, but what am I totally overlooking?
>>>
>>
>> Since I didn't hear any screaming, I took a little time today to put
>> together a rough parser (attached). I have not thrown horrible,
>> terrible, mean, angry data at it, but for good input and some bits of
>> unexpected input (multiple boolean operators in a row, unbalanced
>> parens) it works as I expect it to.
>>
>> The output format is not exactly the same as described above, but
>> should be recognizable.
>>
>> One of the implications of moving to this is that the "compiled query"
>> returned from the main search call will necessarily change. Also note
>> that reconstructing the exact query that the user supplied will be
>> very difficult, but we can cache that data easily enough for later
>> use.
>>
>> The command line I've been testing with is:
>>
>> ./fts-replacement.pl 'title: (foo bar) || (-baz || (subject:"1900-1910
>> junk" se:stuff)) && && && au:malarky || au:gonzo && +goo' 1
>>
>> First param is the query, second is a debug flag. || == OR, and && ==
>> AND, obviously.
>>
>> Any feedback would be appreciated.
>>
>> --
>> Mike Rylander
>> | VP, Research and Design
>> | Equinox Software, Inc. / The Evergreen Experts
>> | phone: 1-877-OPEN-ILS (673-6457)
>> | email: miker at esilibrary.com
>> | web: http://www.esilibrary.com
>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: QueryParser.pm
Type: application/x-perl
Size: 20838 bytes
Desc: not available
Url : http://libmail.georgialibraries.org/pipermail/open-ils-dev/attachments/20100309/09680a06/attachment-0002.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: EG_QueryParser.pm
Type: application/x-perl
Size: 18690 bytes
Desc: not available
Url : http://libmail.georgialibraries.org/pipermail/open-ils-dev/attachments/20100309/09680a06/attachment-0003.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fts-replacement.pl
Type: application/octet-stream
Size: 2186 bytes
Desc: not available
Url : http://libmail.georgialibraries.org/pipermail/open-ils-dev/attachments/20100309/09680a06/attachment-0001.obj