[OPEN-ILS-DEV] Re: Napkin drawings for removing OpenILS::Application::Storage::FTS

Mike Rylander mrylander at gmail.com
Wed Mar 31 11:48:32 EDT 2010


OK ... so, here's an attempt at a grammar for the new
QueryParser-based search stuff.  The actual tokens used for much of
the syntax are configurable.  I'll try to point that out as best I can.

[NOTE: viewing with a fixed-width font would be best]

------------------------------------------------------------------------------------------

regexp                := valid PCRE
word                  := valid UTF-8 non-whitespace characters
whitespace            := string matching PCRE /\s+/s
boolean_word          := 'yes' | 'no' | 'true' | 'false' | '1' | '0'
modifier_marker       := '#'   ### configurable, default
phrase_boundary       := '"'
require-er            := '+'
negator               := '-'
search_separator      := ':' | '='
class_field_separator := '|'
boolean_and           := '&&'  ### configurable, EG default
boolean_or            := '||'  ### configurable, EG default
subquery_start        := '('   ### configurable, default
subquery_end          := ')'   ### configurable, default

word_list             := word { ',' word }
negated_word          := negator word
required_word         := require-er word
phrase                := phrase_boundary word { whitespace word } phrase_boundary
term                  := word | negated_word | required_word | phrase
                         { whitespace term }

boolean_operator      := boolean_and | boolean_or

registered_class      := 'keyword' | 'title' | 'author' | 'subject' | 'series'
                         ### configurable, default for EG
class_alias           := regexp
                         ### 'kw', 'ti', 'au', 'su', 'se' and many more;
                         ### configurable, loaded from IDL class cmsa
                         ### where field is null
search_class          := registered_class | class_alias
registered_field      := word  ### configurable, loaded from IDL class cmf
field_alias           := regexp
                         ### configurable, loaded from IDL class cmsa
                         ### where field is not null
search_field          := registered_field
classed_search        := search_class search_separator term
search_field_list     := registered_field
                         { class_field_separator search_field_list }
fielded_search        := search_class { search_field_list }
                         search_separator term
field_alias_search    := field_alias search_separator term
facet_list            := registered_facet { class_field_separator facet_list }
facet_search          := search_class { facet_list } '[' term ']'
search                := term | classed_search | fielded_search |
                         field_alias_search | facet_search

registered_modifier   := 'available' | 'staff' | 'descending'
                         ### and many more, defined in the QueryParser
                         ### implementation driver class
search_modifier       := modifier_marker registered_modifier |
                         registered_modifier '(' boolean_word ')'
registered_filter     := 'site' | 'sort' | 'item_type'
                         ### and many more, defined in the QueryParser
                         ### implementation driver class
search_filter         := registered_filter '(' word_list ')' |
                         registered_filter ':' word_list

boolean_term          := term boolean_operator term
subquery              := subquery_start query subquery_end
query                 := boolean_term | search | search_modifier |
                         search_filter | subquery { [boolean_operator] query }


------------------------------------------------------------------------------------------
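
To make the token classes in the grammar above a little more concrete,
here is a minimal sketch of a scanner for them.  It is written in Python
purely for illustration and is not the code in QueryParser.pm.  The
function name scan(), the token tuples, and the use of a plain lookup
table for class aliases (the real aliases are regexps) are all
assumptions of this sketch; the class, alias, modifier and filter names
are the defaults listed in the grammar.

```python
import re

# Defaults quoted in the grammar; the real values are configurable.
CLASSES = {"keyword", "title", "author", "subject", "series"}
ALIASES = {"kw": "keyword", "ti": "title", "au": "author",
           "su": "subject", "se": "series"}  # simplification: real aliases are regexps
MODIFIERS = {"available", "staff", "descending"}
FILTERS = {"site", "sort", "item_type"}

FILTER_RE = re.compile(r"(\w+)\(([^()]*)\)")     # registered_filter '(' word_list ')'
MODIFIER_RE = re.compile(r"#(\w+)")              # modifier_marker registered_modifier
CLASS_RE = re.compile(r"(\w+)((?:\|\w+)*)[:=]")  # search_class, optional |fields, separator
TERM_RE = re.compile(r'"[^"]*"|[^\s()]+')        # phrase, or a run of non-space characters


def scan(query):
    """Hypothetical helper: split a query string into a flat list of token tuples."""
    tokens, pos = [], 0
    while pos < len(query):
        if query[pos].isspace():
            pos += 1
            continue
        m = FILTER_RE.match(query, pos)
        if m and m.group(1) in FILTERS:
            args = [a.strip() for a in m.group(2).split(",")]
            tokens.append(("filter", m.group(1), args))
            pos = m.end()
            continue
        m = MODIFIER_RE.match(query, pos)
        if m and m.group(1) in MODIFIERS:
            tokens.append(("modifier", m.group(1)))
            pos = m.end()
            continue
        m = CLASS_RE.match(query, pos)
        if m:
            name = ALIASES.get(m.group(1), m.group(1))
            if name in CLASSES:
                fields = [f for f in m.group(2).split("|") if f]
                tokens.append(("class", name, fields))
                pos = m.end()
                continue
        if query[pos] in "()":
            tokens.append(("group", query[pos]))  # subquery_start / subquery_end
            pos += 1
            continue
        m = TERM_RE.match(query, pos)  # always matches at a non-space, non-paren char
        kind = "bool" if m.group(0) in ("&&", "||") else "term"
        tokens.append((kind, m.group(0)))
        pos = m.end()
    return tokens


for token in scan("ti:jk harry kw:rowling sort(title)#descending"):
    print(token)
```

The sketch only produces a flat token stream; grouping tokens into
subqueries and folding terms under their class is left out.  Names that
are not registered (for example a filter missing from FILTERS) simply
fall through and are treated as plain terms.
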

Coming soon -- lots of example queries and a list of configured values
for the implementation class used in EG.  I'll do all that on the
wiki, and add the grammar, too.

--miker

On Tue, Mar 9, 2010 at 2:52 PM, Mike Rylander <mrylander at gmail.com> wrote:
> So, here's some more updatification.  No BNF yet, but everything
> mentioned below still works as described.  Also, you can now add more
> than one field to a specific index definition search, thusly:
>
> title|proper|uniform|translated: harry potter
>
> and the query built will search all of those index definitions.
>
> You'll find attached 3 files:
>
>  * QueryParser.pm -- the core parser and query tree building code
>  * EG_QueryParser.pm -- an Evergreen-specific subclass which adds
> tsearch2 logic, SQL generation, and code which populates the parser in
> the right way for Evergreen by filling in classes, fields, modifiers
> and functions that we need
>  * fts-replacement.pl -- a script to test the above modules
>
> My test command line is currently:
>
> $ time PERL5LIB=$PERL5LIB:/home/miker/svn/OpenSRF/src/perl/lib/ \
>     ./fts-replacement.pl --query='ti:jk harry kw:rowling sort(title)#descending'
>
> and you can adjust the PERL5LIB and query to taste (or remove it to
> see the redonkulous default test query).  We just need
> OpenSRF::Utils::JSON ... hence the PERL5LIB mangling.  If you have
> OpenSRF installed, no need for that bit.
>
> Thoughts or comments welcome.
>
> --miker
>
> On Thu, Feb 11, 2010 at 2:32 PM, Mike Rylander <mrylander at gmail.com> wrote:
>> What's the definition of insanity, again?  ;)
>>
>> Here's round 2 of a possible new query parser.  In this version we
>> support configurable search classes, fields, modifiers and filters.
>> I'll try to construct a BNF for it, but by way of example the gist is:
>>
>> Classed search:
>>   title: harry potter
>>   keyword: water
>>
>> Specific field (index definition) search:
>>   author|personal: rowling
>>   subject|name: potter
>>
>> Search modifiers:
>>   #metabib
>>   #available
>>   #descending
>>
>> Argumented filters:
>>   statuses(0,7,12)
>>   estimation_strategy(exclusion)
>>   item_form(d)
>>   during(2009,2010)
>>
>> Phrases (non-stemming matches):
>>   "multi-word phrase (exact) matches"
>>   +one +word +exact +matches
>>
>> Grouping and Boolean operators (chocolate and peanut butter):
>>   (foo && bar) || baz
>>
>> Everything but the grouping, boolean and phrase operators is
>> configurable, so adding classes and fields, modifiers and filters is
>> up to (available to?) the developer.  Some examples:
>>
>>    __PACKAGE__->add_search_class( 'keyword' );
>>    __PACKAGE__->add_search_class( 'title' );
>>    __PACKAGE__->add_search_class( 'author' );
>>    __PACKAGE__->add_search_class( 'subject' );
>>    __PACKAGE__->add_search_class( 'series' );
>>
>>    __PACKAGE__->add_search_field( author => 'corporate' );
>>
>>    __PACKAGE__->add_search_filter( 'audience' );
>>    __PACKAGE__->add_search_filter( 'vr_format' );
>>    __PACKAGE__->add_search_filter( 'format' );
>>    __PACKAGE__->add_search_filter( 'item_type' );
>>    __PACKAGE__->add_search_filter( 'item_form' );
>>    __PACKAGE__->add_search_filter( 'lit_form' );
>>    __PACKAGE__->add_search_filter( 'location' );
>>
>>    __PACKAGE__->add_search_modifier( 'available' );
>>    __PACKAGE__->add_search_modifier( 'descending' );
>>    __PACKAGE__->add_search_modifier( 'ascending' );
>>    __PACKAGE__->add_search_modifier( 'metarecord' );
>>
>> This also supports both class-wide and class+field aliasing via
>> regexp, for use in mapping CQL relations to Evergreen search classes
>> and fields:
>>
>>    __PACKAGE__->add_search_class_alias( author => 'name' );
>>    __PACKAGE__->add_search_class_alias( author => 'dc.contributor' );
>>
>>    __PACKAGE__->add_search_class_alias( subject => 'bib.subject(?:Title|Place|Occupation)' );
>>    __PACKAGE__->add_search_field_alias( subject => name => 'bib.subjectName' );
>>    # keyword|standard_number isn't a stock Evergreen index def, jfyi
>>    __PACKAGE__->add_search_field_alias( keyword => standard_number => 'dc.identifier' );
>>
>> Anyway, as before, feedback appreciated.
>>
>> --miker
>>
>> On Wed, Feb 3, 2010 at 4:55 PM, Mike Rylander <mrylander at gmail.com> wrote:
>>> On Thu, Jan 21, 2010 at 3:54 PM, Mike Rylander <mrylander at gmail.com> wrote:
>>>> Search syntax thought experiment
>>>> --------------------------------------------------
>>>>
>>>> Multi-stage search-to-query compilation
>>>>
>>>> Given a default search class of "keyword" and a search string of:
>>>>  ( foo "bar" -baz || gar || ti|proper:qux) && au:junk
>>>>
>>>>
>>>> Stage 1: Boolean decomposition
>>>>
>>>> { bool : 'and',
>>>>  query : [
>>>>    { author : 'junk' },
>>>>    { bool : 'or',
>>>>      query : [
>>>>        { keyword : 'foo "bar" -baz || gar' },
>>>>        { 'title|proper' : 'qux' }
>>>>      ]
>>>>    }
>>>>  ]
>>>> }
>>>>
>>>> Lacking grouping parens, we bind pairs of atoms separated by || or &&
>>>> (assuming && where no operator is given), working from left to right.
>>>> If adjacent atoms belong to the same class[|field] specifier, they are
>>>> folded into a single leaf.  Atoms are whitespace-separated components
>>>> not spelled '(', ')', '||' or '&&'.
>>>>
>>>>
>>>>
>>>> Stage 2: Search decomposition
>>>>
>>>> { bool : 'and',
>>>>  query : [
>>>>    { ftsquery : ['junk'],
>>>>      phrases : [],
>>>>      classname : 'author',
>>>>      fields : [7, 8, 9, 10]
>>>>    },
>>>>    { bool : 'or',
>>>>      query : [
>>>>        { ftsquery : [[['foo', '&', 'bar'], '&!', 'baz'], '|', 'gar' ],
>>>>          phrases : ['bar'],
>>>>          classname : 'keyword',
>>>>          fields : [15]
>>>>        },
>>>>        { ftsquery : ['qux'],
>>>>          phrases : [],
>>>>          classname : 'title',
>>>>          fields : [6]
>>>>        }
>>>>      ]
>>>>    }
>>>>  ]
>>>> }
>>>>
>>>> Then, we walk that tree (probably a plperlu stored proc) looking for
>>>> leaf hashes (those that don't contain a key of "query") and build up
>>>> SELECT (ranking), FROM (sourcing) and WHERE (tsquery and phrase
>>>> matching) clauses as we see them.  We apply the union of the index
>>>> normalizers defined for the referenced fields to the non-joiner atoms
>>>> in the "ftsquery" structure.
>>>>
>>>> Eh?  I can see my way from here to there, but what am I totally overlooking?
>>>>
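
The leaf-finding step of the tree walk described in the quoted message
can be sketched in a few lines.  This is a Python illustration only (the
message suggests a plperlu stored proc); the function name leaves() is
an assumption of this sketch, and the tree is the Stage 2 example
copied from the quoted message.  Building the SELECT, FROM and WHERE
clauses and applying the index normalizers are not attempted here.

```python
def leaves(node):
    """Yield every leaf hash, i.e. every node without a "query" key."""
    if "query" not in node:
        yield node
        return
    for child in node["query"]:
        yield from leaves(child)


# The Stage 2 structure from the quoted message.
tree = {
    "bool": "and",
    "query": [
        {"ftsquery": ["junk"], "phrases": [],
         "classname": "author", "fields": [7, 8, 9, 10]},
        {"bool": "or",
         "query": [
             {"ftsquery": [[["foo", "&", "bar"], "&!", "baz"], "|", "gar"],
              "phrases": ["bar"], "classname": "keyword", "fields": [15]},
             {"ftsquery": ["qux"], "phrases": [],
              "classname": "title", "fields": [6]},
         ]},
    ],
}

print([leaf["classname"] for leaf in leaves(tree)])
# ['author', 'keyword', 'title']
```
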
>>>
>>> Since I didn't hear any screaming, I took a little time today to put
>>> together a rough parser (attached).  I have not thrown horrible,
>>> terrible, mean, angry data at it, but for good input and some bits of
>>> unexpected input (multiple boolean operators in a row, unbalanced
>>> parens) it works as I expect it to.
>>>
>>> The output format is not exactly the same as described above, but
>>> should be recognizable.
>>>
>>> One of the implications of moving to this is that the "compiled query"
>>> returned from the main search call will necessarily change.  Also note
>>> that reconstructing the exact query that the user supplied will be
>>> very difficult, but we can cache that data easily enough for later
>>> use.
>>>
>>> The command line I've been testing with is:
>>>
>>> ./fts-replacement.pl 'title: (foo bar) || (-baz || (subject:"1900-1910 junk" se:stuff)) && && && au:malarky || au:gonzo && +goo' 1
>>>
>>> First param is the query, second is a debug flag.  || == OR, and && ==
>>> AND, obviously.
>>>
>>> Any feedback would be appreciated.
>>>
>>> --
>>> Mike Rylander
>>>  | VP, Research and Design
>>>  | Equinox Software, Inc. / The Evergreen Experts
>>>  | phone:  1-877-OPEN-ILS (673-6457)
>>>  | email:  miker at esilibrary.com
>>>  | web:  http://www.esilibrary.com
>>>



--
Mike Rylander
| VP, Research and Design
| Equinox Software, Inc. / The Evergreen Experts
| phone:  1-877-OPEN-ILS (673-6457)
| email:  miker at esilibrary.com
| web:  http://www.esilibrary.com

