[OPEN-ILS-DEV] Re: Napkin drawings for removing OpenILS::Application::Storage::FTS
Mike Rylander
mrylander at gmail.com
Wed Mar 31 11:48:32 EDT 2010
OK ... so, here's an attempt at a grammar for the new
QueryParser-based search stuff. The actual tokens used for much of
the syntax are configurable. I'll try to point that out as best I can.
[NOTE: viewing with a fixed-width font would be best]
------------------------------------------------------------------------------------------
regexp                := valid PCRE
word                  := valid UTF-8 non-whitespace characters
whitespace            := string matching PCRE /\s+/s
boolean_word          := 'yes' | 'no' | 'true' | 'false' | '1' | '0'
modifier_marker       := '#'    ### configurable, default
phrase_boundary       := '"'
require-er            := '+'
negator               := '-'
search_separator      := ':' | '='
class_field_separator := '|'
boolean_and           := '&&'   ### configurable, EG default
boolean_or            := '||'   ### configurable, EG default
subquery_start        := '('    ### configurable, default
subquery_end          := ')'    ### configurable, default
word_list             := word { ',' word }
negated_word          := negator word
required_word         := require-er word
phrase                := phrase_boundary word { whitespace word } phrase_boundary
term                  := word | negated_word | required_word | phrase
                         { whitespace term }
boolean_operator      := boolean_and | boolean_or
registered_class      := 'keyword' | 'title' | 'author' | 'subject' | 'series'
                         ### configurable, default for EG
class_alias           := regexp
                         ### 'kw', 'ti', 'au', 'su', 'se' and many more;
                         ### configurable, loaded from IDL class cmsa
                         ### where field is null
search_class          := registered_class | class_alias
registered_field      := word   ### configurable, loaded from IDL class cmf
field_alias           := regexp
                         ### configurable, loaded from IDL class cmsa
                         ### where field is not null
search_field          := registered_field
classed_search        := search_class search_separator term
search_field_list     := registered_field
                         { class_field_separator search_field_list }
fielded_search        := search_class { search_field_list }
                         search_separator term
field_alias_search    := field_alias search_separator term
facet_list            := registered_facet { class_field_separator facet_list }
facet_search          := search_class { facet_list } '[' term ']'
search                := term | classed_search | fielded_search |
                         field_alias_search | facet_search
registered_modifier   := 'available' | 'staff' | 'descending'
                         ### and many more, defined in QueryParser
                         ### implementation driver class
search_modifier       := modifier_marker registered_modifier |
                         registered_modifier '(' boolean_word ')'
registered_filter     := 'site' | 'sort' | 'item_type'
                         ### and many more, defined in QueryParser
                         ### implementation driver class
search_filter         := registered_filter '(' word_list ')' |
                         registered_filter ':' word_list
boolean_term          := term boolean_operator term
subquery              := subquery_start query subquery_end
query                 := boolean_term | search | search_modifier |
                         search_filter | subquery { [boolean_operator] query }
------------------------------------------------------------------------------------------
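To make the search_modifier and search_filter productions a bit more concrete, here is a rough sketch in Python (not the Perl of QueryParser.pm). It assumes the default '#' marker, uses only the example names from the grammar as the registered modifiers and filters, and looks at one whitespace-separated token at a time. The parse_token helper is made up for illustration and is not part of QueryParser.

```python
import re

# Assumed registrations: just the examples named in the grammar.
MODIFIERS = {"available", "staff", "descending"}
FILTERS = {"site", "sort", "item_type"}
TRUE_WORDS = {"yes", "true", "1"}
BOOLEAN_WORDS = TRUE_WORDS | {"no", "false", "0"}


def parse_token(token, modifier_marker="#"):
    """Classify one token as ('modifier' | 'filter' | 'word', name, value)."""
    # search_modifier, first form: modifier_marker registered_modifier
    if token.startswith(modifier_marker):
        name = token[len(modifier_marker):]
        if name in MODIFIERS:
            return ("modifier", name, True)

    # name '(' ... ')' covers the second modifier form and the first filter form
    match = re.fullmatch(r"(\w+)\(([^()]*)\)", token)
    if match:
        name, args = match.group(1), match.group(2)
        if name in MODIFIERS and args in BOOLEAN_WORDS:
            return ("modifier", name, args in TRUE_WORDS)
        if name in FILTERS:
            return ("filter", name, args.split(","))

    # search_filter, second form: registered_filter ':' word_list
    match = re.fullmatch(r"(\w+):(.+)", token)
    if match and match.group(1) in FILTERS:
        return ("filter", match.group(1), match.group(2).split(","))

    # Anything else is left for the term / search productions.
    return ("word", token, None)


for token in ["#available", "staff(no)", "site(BR1,BR2)", "sort:title", "potter"]:
    print(token, "->", parse_token(token))
```

A real parser would have to scan the string rather than split on whitespace, since a filter and a modifier can sit next to each other with no space between them, as in sort(title)#descending.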
Coming soon -- lots of example queries and a list of configured values
for the implementation class used in EG. I'll do all that on the
wiki, and add the grammar, too.
--miker
On Tue, Mar 9, 2010 at 2:52 PM, Mike Rylander <mrylander at gmail.com> wrote:
> So, here's some more updatification. No BNF yet, but everything
> mentioned below still works as described. Also, you can now add more
> than one field to a specific index definition search, thusly:
>
> title|proper|uniform|translated: harry potter
>
> and the query built will search all of those index definitions.
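As a small illustration of the multi-field form, the sketch below splits such a search into its class, field list and terms. It is Python rather than Perl, assumes the default '|' and ':' / '=' separators, and does not check that the class or fields are actually registered. split_fielded_search is a hypothetical helper, not a function in QueryParser.pm.

```python
def split_fielded_search(search, class_field_separator="|", search_separators=":="):
    """Split 'class|field|field: terms' into (search_class, [fields], terms)."""
    # Locate the first search separator (':' or '=' by default).
    positions = [i for i, ch in enumerate(search) if ch in search_separators]
    if not positions:
        # No class given; the caller would fall back to the default class.
        return (None, [], search.strip())

    spec = search[:positions[0]].strip()
    terms = search[positions[0] + 1:].strip()

    # The first component is the search class, the rest are index definitions.
    search_class, *fields = spec.split(class_field_separator)
    return (search_class, fields, terms)


print(split_fielded_search("title|proper|uniform|translated: harry potter"))
print(split_fielded_search("title: harry potter"))
```

With an empty field list the search is a plain classed search, which would cover every index definition in that class.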
>
> You'll find attached 3 files:
>
> * QueryParser.pm -- the core parser and query tree building code
> * EG_QueryParser.pm -- an Evergreen-specific subclass which adds
> tsearch2 logic, SQL generation, and code which populates the parser in
> the right way for Evergreen by filling in classes, fields, modifiers
> and functions that we need
> * fts-replacement.pl -- a script to test the above modules
>
> My test command line is currently:
>
> $ time PERL5LIB=$PERL5LIB:/home/miker/svn/OpenSRF/src/perl/lib/ \
>     ./fts-replacement.pl --query='ti:jk harry kw:rowling sort(title)#descending'
>
> and you can adjust the PERL5LIB and query to taste (or remove it to
> see the redonkulous default test query). We just need
> OpenSRF::Utils::JSON ... hence the PERL5LIB mangling. If you have
> OpenSRF installed, no need for that bit.
>
> Thoughts or comments welcome.
>
> --miker
>
> On Thu, Feb 11, 2010 at 2:32 PM, Mike Rylander <mrylander at gmail.com> wrote:
>> What's the definition of insanity, again? ;)
>>
>> Here's round 2 of a possible new query parser. In this version we
>> support configurable search classes, fields, modifiers and filters.
>> I'll try to construct a BNF for it, but by way of example the gist is:
>>
>> Classed search:
>> title: harry potter
>> keyword: water
>>
>> Specific field (index definition) search:
>> author|personal: rowling
>> subject|name: potter
>>
>> Search modifiers
>> #metabib
>> #available
>> #descending
>>
>> Argumented filters:
>> statuses(0,7,12)
>> estimation_strategy(exclusion)
>> item_form(d)
>> during(2009,2010)
>>
>> Phrases (non-stemming matches):
>> "multi-word phrase (exact) matches"
>> +one +word +exact +matches
>>
>> Grouping and Boolean operators (chocolate and peanut butter):
>> (foo && bar) || baz
>>
>> Everything but the grouping, boolean and phrase operators is
>> configurable, so adding classes and fields, modifiers and filters is
>> up to (available to?) the developer. Some examples:
>>
>> __PACKAGE__->add_search_class( 'keyword' );
>> __PACKAGE__->add_search_class( 'title' );
>> __PACKAGE__->add_search_class( 'author' );
>> __PACKAGE__->add_search_class( 'subject' );
>> __PACKAGE__->add_search_class( 'series' );
>>
>> __PACKAGE__->add_search_field( author => 'corporate' );
>>
>> __PACKAGE__->add_search_filter( 'audience' );
>> __PACKAGE__->add_search_filter( 'vr_format' );
>> __PACKAGE__->add_search_filter( 'format' );
>> __PACKAGE__->add_search_filter( 'item_type' );
>> __PACKAGE__->add_search_filter( 'item_form' );
>> __PACKAGE__->add_search_filter( 'lit_form' );
>> __PACKAGE__->add_search_filter( 'location' );
>>
>> __PACKAGE__->add_search_modifier( 'available' );
>> __PACKAGE__->add_search_modifier( 'descending' );
>> __PACKAGE__->add_search_modifier( 'ascending' );
>> __PACKAGE__->add_search_modifier( 'metarecord' );
>>
>> This also supports both class-wide and class+field aliasing via
>> regexp, for use in mapping CQL relations to Evergreen search classes
>> and fields:
>>
>> __PACKAGE__->add_search_class_alias( author => 'name' );
>> __PACKAGE__->add_search_class_alias( author => 'dc.contributor' );
>>
>> __PACKAGE__->add_search_class_alias(
>>     subject => 'bib.subject(?:Title|Place|Occupation)' );
>> __PACKAGE__->add_search_field_alias( subject => name => 'bib.subjectName' );
>> # keyword|standard_number isn't a stock Evergreen index def, jfyi
>> __PACKAGE__->add_search_field_alias(
>>     keyword => standard_number => 'dc.identifier' );
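To show how regexp aliases like the ones above might resolve, here is a minimal Python sketch. Two things in it are assumptions rather than anything stated in this thread: that each alias has to match the whole name (re.fullmatch), and that field aliases are checked before class aliases. resolve_alias and the two tables are illustrative names only.

```python
import re

# Alias tables mirroring the add_search_class_alias / add_search_field_alias
# calls quoted above: (class, pattern) and (class, field, pattern).
CLASS_ALIASES = [
    ("author", r"name"),
    ("author", r"dc.contributor"),
    ("subject", r"bib.subject(?:Title|Place|Occupation)"),
]
FIELD_ALIASES = [
    ("subject", "name", r"bib.subjectName"),
    ("keyword", "standard_number", r"dc.identifier"),
]


def resolve_alias(name):
    """Map an alias to (search_class, field); field is None for class-wide aliases."""
    # Assumption: the more specific class+field aliases win over class aliases.
    for search_class, field, pattern in FIELD_ALIASES:
        if re.fullmatch(pattern, name):
            return (search_class, field)
    for search_class, pattern in CLASS_ALIASES:
        if re.fullmatch(pattern, name):
            return (search_class, None)
    return None


for alias in ["dc.contributor", "bib.subjectPlace", "bib.subjectName", "bogus"]:
    print(alias, "->", resolve_alias(alias))
```

Because the aliases are regexps, an unescaped '.' matches any character, so a stricter configuration would escape the dots.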
>>
>> Anyway, as before, feedback appreciated.
>>
>> --miker
>>
>> On Wed, Feb 3, 2010 at 4:55 PM, Mike Rylander <mrylander at gmail.com> wrote:
>>> On Thu, Jan 21, 2010 at 3:54 PM, Mike Rylander <mrylander at gmail.com> wrote:
>>>> Search syntax thought experiment
>>>> --------------------------------------------------
>>>>
>>>> Multi-stage search-to-query compilation
>>>>
>>>> Given a default search class of "keyword" and a search string of:
>>>> ( foo "bar" -baz || gar || ti|proper:qux) && au:junk
>>>>
>>>>
>>>> Stage 1: Boolean decomposition
>>>>
>>>> { bool  : 'and',
>>>>   query : [
>>>>     { author : 'junk' },
>>>>     { bool  : 'or',
>>>>       query : [
>>>>         { keyword : 'foo "bar" -baz || gar' },
>>>>         { 'title|proper' : 'qux' }
>>>>       ]
>>>>     }
>>>>   ]
>>>> }
>>>>
>>>> Lacking grouping parens, we bind pairs of || or && separated atoms
>>>> (assuming &&), working from left to right. If adjacent atoms belong
>>>> to the same class[|field] specifier, they are folded into a single
>>>> leaf. Atoms are whitespace separated components not spelled '(', ')',
>>>> '||' or '&&'.
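The left-to-right binding rule can be sketched roughly as follows, in Python. This covers only the paren-free case with an implied && between bare atoms. It does not handle grouping parens and does not fold adjacent atoms of the same class into one leaf. The function name and the 'atom' key are made up for the example.

```python
def bind_left_to_right(search):
    """Bind whitespace-separated atoms pairwise, left to right."""
    operators = {"&&": "and", "||": "or"}
    tree = None
    pending = "and"  # operator joining the next atom; && is assumed by default

    for token in search.split():
        if token in operators:
            pending = operators[token]
            continue
        leaf = {"atom": token}
        if tree is None:
            tree = leaf
        else:
            # Everything bound so far becomes the left side of the new pair.
            tree = {"bool": pending, "query": [tree, leaf]}
        pending = "and"  # reset to the implied operator

    return tree


print(bind_left_to_right("foo bar || baz"))
```

In this sketch there is no precedence between && and ||, so foo bar || baz groups as (foo && bar) || baz simply because of the order in which the atoms arrive.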
>>>>
>>>>
>>>>
>>>> Stage 2: Search decomposition
>>>>
>>>> { bool  : 'and',
>>>>   query : [
>>>>     { ftsquery  : ['junk'],
>>>>       phrases   : [],
>>>>       classname : 'author',
>>>>       fields    : [7, 8, 9, 10]
>>>>     },
>>>>     { bool  : 'or',
>>>>       query : [
>>>>         { ftsquery  : [[['foo', '&', 'bar'], '&!', 'baz'], '|', 'gar' ],
>>>>           phrases   : ['bar'],
>>>>           classname : 'keyword',
>>>>           fields    : [15]
>>>>         },
>>>>         { ftsquery  : ['qux'],
>>>>           phrases   : [],
>>>>           classname : 'title',
>>>>           fields    : [6]
>>>>         }
>>>>       ]
>>>>     }
>>>>   ]
>>>> }
>>>>
>>>> Then, we walk that tree (probably a plperlu stored proc) looking for
>>>> leaf hashes (those that don't contain a key of "query") and build up
>>>> SELECT (ranking), FROM (sourcing) and WHERE (tsquery and phrase
>>>> matching) clauses as we see them. We apply the union of the index
>>>> normalizers defined for the referenced fields to the non-joiner atoms
>>>> in the "ftsquery" structure.
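The tree walk itself is simple to sketch. The Python below only illustrates the "leaf hashes are those without a 'query' key" rule using the Stage 2 structure from above. It is not the plperlu stored proc mentioned here, and it stops short of building any SQL clauses. collect_leaves is a hypothetical name.

```python
def collect_leaves(node):
    """Return every leaf hash (a dict with no 'query' key), depth first."""
    if "query" not in node:
        return [node]
    leaves = []
    for child in node["query"]:
        leaves.extend(collect_leaves(child))
    return leaves


# The Stage 2 structure from the example above.
tree = {
    "bool": "and",
    "query": [
        {"ftsquery": ["junk"], "phrases": [],
         "classname": "author", "fields": [7, 8, 9, 10]},
        {"bool": "or",
         "query": [
             {"ftsquery": [[["foo", "&", "bar"], "&!", "baz"], "|", "gar"],
              "phrases": ["bar"], "classname": "keyword", "fields": [15]},
             {"ftsquery": ["qux"], "phrases": [],
              "classname": "title", "fields": [6]},
         ]},
    ],
}

for leaf in collect_leaves(tree):
    print(leaf["classname"], leaf["fields"], leaf["phrases"])
```

Each leaf visited this way carries the class, the field ids and the phrases, which is what the SELECT, FROM and WHERE pieces would be built from.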
>>>>
>>>> Eh? I can see my way from here to there, but what am I totally overlooking?
>>>>
>>>
>>> Since I didn't hear any screaming, I took a little time today to put
>>> together a rough parser (attached). I have not thrown horrible,
>>> terrible, mean, angry data at it, but for good input and some bits of
>>> unexpected input (multiple boolean operators in a row, unbalanced
>>> parens) it works as I expect it to.
>>>
>>> The output format is not exactly the same as described above, but
>>> should be recognizable.
>>>
>>> One of the implications of moving to this is that the "compiled query"
>>> returned from the main search call will necessarily change. Also note
>>> that reconstructing the exact query that the user supplied will be
>>> very difficult, but we can cache that data easily enough for later
>>> use.
>>>
>>> The command line I've been testing with is:
>>>
>>> ./fts-replacement.pl 'title: (foo bar) || (-baz || (subject:"1900-1910 junk" se:stuff)) && && && au:malarky || au:gonzo && +goo' 1
>>>
>>> First param is the query, second is a debug flag. || == OR, and && ==
>>> AND, obviously.
>>>
>>> Any feedback would be appreciated.
>>>
>>> --
>>> Mike Rylander
>>> | VP, Research and Design
>>> | Equinox Software, Inc. / The Evergreen Experts
>>> | phone: 1-877-OPEN-ILS (673-6457)
>>> | email: miker at esilibrary.com
>>> | web: http://www.esilibrary.com
>>>
--
Mike Rylander
| VP, Research and Design
| Equinox Software, Inc. / The Evergreen Experts
| phone: 1-877-OPEN-ILS (673-6457)
| email: miker at esilibrary.com
| web: http://www.esilibrary.com