[OPEN-ILS-GENERAL] Synonym Dictionary - Numbers, &

Josh Stompro stomproj at exchange.larl.org
Fri May 26 11:18:43 EDT 2017


Mike, thanks for the tips. I hadn't thought about stacking dictionaries; that works great.



I'm trying out using 4 dictionaries in two config groups now.



Group 1 - 1 - Normal text synonyms (Roman numeral to text)

Group 1 - 2 - INT to text

Group 2 - 3 - Roman Numeral Int to text

Group 2 - 4 - Roman Numeral text to Int
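
A minimal sketch of how one of these stacks might be defined in PostgreSQL. The dictionary name synonym_int, its synonym_int.syn file, and the choice of simple as the fallback are assumptions for illustration, not the actual LARL setup; only the synonym_larl configuration name comes from this thread.

```sql
-- Hypothetical dictionary built on PostgreSQL's stock synonym template.
-- It reads $SHAREDIR/tsearch_data/synonym_int.syn, which holds one
-- "token synonym" pair per line, e.g.:
--   5 five
--   6 six
CREATE TEXT SEARCH DICTIONARY synonym_int (
    TEMPLATE = synonym,
    SYNONYMS = synonym_int
);

-- Stack it in front of simple for unsigned-integer tokens. Dictionaries are
-- consulted left to right and the first one that recognizes the token wins,
-- so simple only sees tokens the synonym file does not list.
ALTER TEXT SEARCH CONFIGURATION synonym_larl
    ALTER MAPPING FOR uint
    WITH synonym_int, simple;
```

Since a synonym dictionary that matches a token ends the chain, one configuration yields one replacement per token. That is presumably why a second config group is needed to get both "V" and "Five" for "5".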



So if the cataloger enters

- "Scary Movie 5" it is indexed with "V" and "Five"

- "Scary Movie V" it is indexed with "5" and "Five"

- "Scary Movie Five" it is indexed with "5" and "V"
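
One way to spot-check this without a full reindex might be to run the same title through each configuration and concatenate the results. The configuration names group1_cfg and group2_cfg below are placeholders for whatever the two config groups are actually called.

```sql
-- Placeholder configuration names; substitute the real ones.
SELECT title,
       to_tsvector('group1_cfg', title)
    || to_tsvector('group2_cfg', title) AS combined_vector
FROM (VALUES ('Scary Movie 5'),
             ('Scary Movie V'),
             ('Scary Movie Five')) AS t(title);
```

This only approximates what Evergreen does at ingest time, since the normalizers run before tsearch ever sees the value.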



This could really cut down on the need to add variant title (246) tags.



Now I need to see what I can do about hyphenated numbers.  We have about 500 titles like "A history of America in thirty-six postage stamps".  The above setup adds "30", "6", "xxx", and "vi" to the index, since thirty-six is treated as two separate words.  I don't know yet whether that is going to be a problem in real-life usage.  And while it is possible to have a synonym dictionary entry mapping 36 -> thirty-six, I don't think it would help, because I believe the search subsystem would never send "thirty-six" as a single token; it breaks it up into two words.
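
To see how the parser handles a hyphenated number in isolation, something like the following could be run against the configuration. The PostgreSQL documentation describes the default parser as reporting a hyphenated word both as the whole word and as each component, so the output should show which token types (and therefore which dictionaries) get a chance at "thirty-six".

```sql
-- Show each token the parser emits, the dictionaries mapped to its type,
-- and the lexemes that come out.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('synonym_larl', 'thirty-six');
```

This does not account for anything Evergreen's normalizers or query parser do to the hyphen before tsearch is involved, so it is only part of the picture.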


For your suggestion about handling & and other special characters, there is one thing that I don't understand.  Would the normalizer translate & into ☃, and then the synonym dictionary would map ☃ to & along with ☃ to ‘and’?  Or would this method not use the synonym dictionaries at all?
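
A quick way to explore the question might be to emulate the substitution by hand and see how the parser classifies the placeholder. Here replace() merely stands in for a change to search_normalize(), and the sample title is made up. Whether ☃ comes through as a word token that a synonym dictionary could act on, or as blank the way & does, is exactly what the output would show.

```sql
-- replace() is a stand-in for the normalizer change; the title is invented.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('synonym_larl', replace('Pride & Prejudice', '&', ' ☃ '));
```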



Josh Stompro - LARL IT Director





-----Original Message-----
From: Open-ils-general [mailto:open-ils-general-bounces at list.georgialibraries.org] On Behalf Of Mike Rylander
Sent: Thursday, May 25, 2017 11:19 AM
To: Evergreen Discussion Group
Subject: Re: [OPEN-ILS-GENERAL] Synonym Dictionary - Numbers, &



Josh,



To cover numbers, it looks like you just need to add dictionaries (I probably wouldn't use just one for everything) for uint, etc.  Note, you can stack dictionaries.



As for & (along with |, !, and maybe parens), those are special characters used by tsearch itself, so it may be best to simply map them in search_normalize() to some well-known token that's very unlikely to be used in the real world.  Perhaps some unicode codepoint, like ☃ and friends.



HTH,

--

Mike Rylander

| President

| Equinox Open Library Initiative

| phone:  1-877-OPEN-ILS (673-6457)

| email:  miker at equinoxinitiative.org

| web:  http://equinoxinitiative.org





On Thu, May 25, 2017 at 11:05 AM, Josh Stompro <stomproj at exchange.larl.org> wrote:

> Hello, I’ve followed the steps in the following wiki pages to enable a

> synonym dictionary but I’m not getting the results I expect.

>

>

>

> https://wiki.evergreen-ils.org/doku.php?id=scratchpad:brush_up_search#synonym_dictionary

>

>

>

> Spelled-out numbers do get translated to digits (six -> 6), but digits

> don’t get translated (6 -> six).

>

>

>

> When I test the synonym dictionary with something like the following

> it looks like it works:

>

> select ts_lexize('synonym_larl', '6');
>
>  ts_lexize
> -----------
>  {six}
> (1 row)

>

>

>

> But when I look at the metabib.title_field_entry for a record that

> has been reindexed I see the following.

>

> select * from metabib.title_field_entry where source=102449 limit 100;
>
>    id    | source | field |                          value                           | index_vector
> ---------+--------+-------+----------------------------------------------------------+--------------
>  2402931 | 102449 |     6 | Little house on the prairie Season 6 [disc 2] test seven | '2':9A,13C,20C '6':7A,12C,18C '7':14C 'disc':8A,19C 'hous':13C 'house':2A 'littl':12C 'little':1A 'on':3A,14C 'prairi':16C 'prairie':5A 'season':6A,17C 'seven':11A,22C 'test':10A,21C 'the':4A,15C

>

>

>

> Seven gets added as ‘seven’ and ‘7’, but the ‘2’ and ‘6’ do not get their spelled-out forms.

>

>

>

> So I’m wondering if the search configuration needs to cover numeric

> tokens to make that work?

>

>

>

> select * from ts_debug('synonym_larl', '6');
>
>  alias |   description    | token | dictionaries | dictionary | lexemes
> -------+------------------+-------+--------------+------------+---------
>  uint  | Unsigned integer | 6     | {simple}     | simple     | {6}

>

>

>

> \dF+ synonym_larl;
>
> Text search configuration "public.synonym_larl"
> Parser: "pg_catalog.default"
>       Token      | Dictionaries
> -----------------+--------------
>  asciihword      | synonym_larl
>  asciiword       | synonym_larl
>  email           | simple
>  file            | simple
>  float           | simple
>  host            | simple
>  hword           | simple
>  hword_asciipart | synonym_larl
>  hword_numpart   | simple
>  hword_part      | simple
>  int             | simple
>  numhword        | simple
>  numword         | simple
>  sfloat          | simple
>  uint            | simple
>  url             | simple
>  url_path        | simple
>  version         | simple
>  word            | simple

>

>

>

> Maybe the uint token needs to be set to synonym_larl also? But I’m

> wondering if this has bad side effects?

>

>

>

> Also, another mapping we would like to make is ‘&’ -> ‘and’ , ‘and’ -> ‘&’.

> But it doesn’t look like tsearch knows how to categorize ‘&’ as a token.

>

>

>

> select * from ts_debug('synonym_larl', '&');
>
>  alias |  description  | token | dictionaries | dictionary | lexemes
> -------+---------------+-------+--------------+------------+---------
>  blank | Space symbols | &     | {}           |            |

>

>

>

> Works fine going the other way and the ‘&’ ends up in the index.

>

>

>

> select * from ts_debug('synonym_larl', 'and');
>
>    alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
> -----------+-----------------+-------+----------------+--------------+---------
>  asciiword | Word, all ASCII | and   | {synonym_larl} | synonym_larl | {&}

>

>

>

> Thanks

>

> Josh

>

>

>

>

>

> Lake Agassiz Regional Library - Moorhead MN larl.org

>

> Josh Stompro     | Office 218.233.3757 EXT-139

>

> LARL IT Director | Cell 218.790.2110

>

>