[OPEN-ILS-GENERAL] Series index, only first entry getting indexed

Thu Mar 2 09:58:38 EST 2017

Josh,

btrim would be the natural normalizer to use, so let's test the timing to
see if it's faster...

evergreen=# explain analyze select count(btrim(value)) from metabib.real_full_rec ;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=6129.66..6129.67 rows=1 width=11) (actual time=542.861..542.862 rows=1 loops=1)
   ->  Seq Scan on real_full_rec  (cost=0.00..4989.77 rows=227977 width=11) (actual time=0.010..221.404 rows=228822 loops=1)
 Planning time: 0.080 ms
 Execution time: 542.899 ms
(4 rows)

evergreen=# explain analyze select count(regexp_replace(value,' *$','','')) from metabib.real_full_rec ;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=6129.66..6129.67 rows=1 width=11) (actual time=818.893..818.894 rows=1 loops=1)
   ->  Seq Scan on real_full_rec  (cost=0.00..4989.77 rows=227977 width=11) (actual time=0.010..230.265 rows=228822 loops=1)
 Planning time: 0.079 ms
 Execution time: 818.931 ms
(4 rows)

btrim is about 50% faster (543 ms versus 819 ms over roughly 229,000
rows)! I didn't expect that, actually.  So I'd recommend using btrim
instead.  Your future self will thank you on your next full reingest.
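
For anyone who wants to compare the two approaches outside the database, here is a minimal Python sketch. It assumes that Python's str.strip(" ") and re.sub behave like PostgreSQL's btrim(value) and regexp_replace(value, ' *$', '', '') for plain space characters; the helper names and sample strings are made up for illustration. One difference worth keeping in mind: btrim also removes leading spaces, which the trailing-space regex does not.

```python
import re


def btrim(value: str) -> str:
    # Stand-in for PostgreSQL btrim(value): strips spaces from both ends.
    return value.strip(" ")


def regexp_trim_trailing(value: str) -> str:
    # Stand-in for regexp_replace(value, ' *$', '', ''): with no 'g' flag,
    # only the first match (the trailing run of spaces) is replaced.
    return re.sub(r" *$", "", value, count=1)


# Hypothetical sample values, not taken from a real catalog.
samples = ["Felicity classic ", "American Girl  ", " leading and trailing "]
for s in samples:
    print(repr(btrim(s)), repr(regexp_trim_trailing(s)))
```

For values that only carry trailing spaces the two give the same result, so swapping the regex for btrim should be safe as long as leading spaces are not meaningful in the field.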

HTH,

--
Mike Rylander
 | President
 | Equinox Open Library Initiative
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at equinoxinitiative.org
 | web:  http://equinoxinitiative.org

On Thu, Mar 2, 2017 at 9:29 AM, Josh Stompro <stomproj at exchange.larl.org>
wrote:

> Jason, this alone seems to leave trailing spaces in the facet entry table,
> because the space before the semicolon is kept. That space is required so
> the series index doesn't concatenate the last word of one 490 with the
> first word of the next 490.
>
>
>
> I tried adding a second normalizer that just strips trailing spaces and
> that seems to take care of it.
>
> insert into config.metabib_field_index_norm_map (field,norm,params,pos)
> values (1,18,'[" *$","",""]',-1);
>
> -- Change the first normalizer's position to -2.
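
The effect of stacking the two normalizers can be sketched in Python. This is only an illustration under stated assumptions: it assumes normalizers for a field run in ascending pos order (which is what moving the first one to -2 implies), it uses Python's re.sub as a stand-in for PostgreSQL's regexp_replace, and the sample 490 value with its volume number is made up.

```python
import re

# (pos, pattern, replacement, flags) tuples mirroring the params arrays.
# The pos values and patterns come from the thread; the tuple layout and
# the helper below are hypothetical.
norm_map = [
    (-1, r" *$", "", ""),          # strip trailing spaces
    (-2, r"; *[0-9]*", "", "g"),   # strip "; <number>" series numbering
]


def apply_normalizers(value, norms):
    # Assumption: the normalizer with the lowest pos runs first.
    for _pos, pattern, repl, flags in sorted(norms, key=lambda n: n[0]):
        count = 0 if "g" in flags else 1  # no 'g' flag: first match only
        value = re.sub(pattern, repl, value, count=count)
    return value


print(repr(apply_normalizers("Felicity classic ; 2", norm_map)))
```

Running only the numbering normalizer leaves the space that preceded the semicolon; the second step removes it.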
>
>
>
> There is also the btrim normalizer; I don’t know if that would be
> better/faster than using another regexp_replace.
>
>
>
> Josh Stompro - LARL IT Director
>
>
>
> *From:* Open-ils-general
> [mailto:open-ils-general-bounces at list.georgialibraries.org]
> *On Behalf Of* Boyer, Jason A
> *Sent:* Wednesday, March 01, 2017 10:22 AM
> *To:* Evergreen Discussion Group
> *Subject:* Re: [OPEN-ILS-GENERAL] Series index, only first entry getting
> indexed
>
>
>
> Thanks for figuring this out, Josh. I was able to modify our normalizer
> like so to continue removing the $v:
>
> BEGIN;
>
> UPDATE config.index_normalizer SET param_count = 3 WHERE id IN (SELECT id
> FROM config.index_normalizer WHERE func = 'regexp_replace');
>
> UPDATE config.metabib_field_index_norm_map SET params = '["; *[0-9]*","","g"]'
> WHERE field = 1 AND norm IN (SELECT id FROM config.index_normalizer
> WHERE func = 'regexp_replace');
>
> COMMIT;
>
>
>
> If you have more than 1 normalizer that uses regexp_replace or are using
> it on more than one field you won't want to use this as-is, but if you only
> have the 1 and are currently only using it on your series titles it's good
> to go.
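
To see why the third parameter matters, here is a small Python sketch of the pattern with and without the global flag. It assumes Python's re.sub matches PostgreSQL's regexp_replace for this pattern, and the joined sample string (two 490s with made-up volume numbers, separated by one space) is hypothetical.

```python
import re

PATTERN = r"; *[0-9]*"  # the pattern from the params above

# Hypothetical joined series value; the volume numbers are invented.
joined = "American Girl Beforever Felicity ; 1 Felicity classic ; 2"

# Two-parameter form (no flags): only the first match is replaced.
first_only = re.sub(PATTERN, "", joined, count=1)

# Three-parameter form with 'g': every match is replaced.
global_replace = re.sub(PATTERN, "", joined)

print(repr(first_only))
print(repr(global_replace))
```

In both cases the words of the second title survive, and the space that preceded each semicolon is left behind, which keeps adjacent titles from running together but can leave a trailing space.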
>
>
>
> Jason
>
>
>
> --
>
> Jason Boyer
>
> MIS Supervisor
>
> Indiana State Library
>
> http://library.in.gov/
>
>
>
> *From:* Open-ils-general
> [mailto:open-ils-general-bounces at list.georgialibraries.org]
> *On Behalf Of* Josh Stompro
> *Sent:* Wednesday, March 01, 2017 10:41 AM
> *To:* Evergreen Discussion Group
> <open-ils-general at list.georgialibraries.org>
> *Subject:* Re: [OPEN-ILS-GENERAL] Series index, only first entry getting
> indexed
>
>
>
>
> Removing the regex replace normalizer did take care of it, sorry I didn’t
> try that before posting.  I think my regex will have to be more selective,
> only getting rid of the number and the ‘;’ so it doesn’t clear out too much
> data.
>
>
>
> Josh Stompro - LARL IT Director
>
>
>
> *From:* Open-ils-general
> [mailto:open-ils-general-bounces at list.georgialibraries.org]
> *On Behalf Of* Josh Stompro
> *Sent:* Wednesday, March 01, 2017 9:19 AM
> *To:* open-ils-general at list.georgialibraries.org
> *Subject:* [OPEN-ILS-GENERAL] Series index, only first entry getting
> indexed
>
>
>
> Hello, we have noticed that only the first 490 gets indexed for our series
> search index, but all 490s get added to the series facet entry.
>
>
>
> For example, here is a title with two 490’s in mods32 format.
>
> https://egcatalog.larl.org/opac/extras/unapi?id=tag::U2@bre/237592&format=mods32
>
>
>
> The second 490 of “Felicity classic” isn’t searchable.
>
>
>
> When I look at the metabib.combined_series_field_entry I see the
> following for this record.
>
>  record | metabib_field | index_vector
> --------+---------------+------------------------------------------------------------
>  237592 |               | 'american' 'beforev' 'beforever' 'felic' 'felicity' 'girl'
>  237592 |             1 | 'american' 'beforev' 'beforever' 'felic' 'felicity' 'girl'
>
>
>
> metabib.series_field_entry
>
> id           | 430451
> source       | 237592
> field        | 1
> value        | American Girl Beforever Felicity
> index_vector | 'american':1A,5C 'beforev':7C 'beforever':3A 'felic':8C 'felicity':4A 'girl':2A,6C
>
>
>
> metabib.facet_entry
>
>  value                            | count | bibid
> ----------------------------------+-------+--------
>  American Girl Beforever Felicity |     1 | 237592
>  Felicity classic                 |     1 | 237592
>
>
>
>
>
> The one thing that I have done is to add a search normalizer to get rid of
> the series numbering from the facet entry.  Unfortunately I don’t remember
> if this issue came up before I added the normalizer.  Maybe when used on
> the index version the regex replace is actually acting on all the 490 info
> concatenated together, so by getting rid of everything after the first ‘ ;’
> I’m clearing the second 490 entry data?  But it does work correctly on the
> facet data?
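
The concatenation hypothesis above is easy to check in isolation. The following Python sketch assumes that the facet path normalizes each 490 on its own while the index path normalizes the values after they have been joined with a single space; that split is the guess being tested here, not confirmed Evergreen behavior. Python's re.sub stands in for PostgreSQL's regexp_replace, and the volume numbers in the sample values are invented.

```python
import re

GREEDY = r" *;.*"  # the pattern from the regexp_replace normalizer params

# Hypothetical 490 values; the volume numbers are made up for illustration.
series = ["American Girl Beforever Felicity ; 1", "Felicity classic ; 2"]

# Facet-style (assumed): each value is normalized on its own.
per_value = [re.sub(GREEDY, "", s, count=1) for s in series]

# Index-style (assumed): values are joined with a space, then normalized once.
joined = re.sub(GREEDY, "", " ".join(series), count=1)

print(per_value)
print(repr(joined))
```

Under those assumptions the per-value results keep both titles, while the joined result loses everything after the first semicolon, which would match a second 490 that shows up as a facet but cannot be searched.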
>
>
>
> There is a note on
> https://wiki.evergreen-ils.org/doku.php?id=documentation:indexing#field_normalization_settings
>
> “*Note:* Only normalizations with a negative *pos* value are applied to
> the facet version of indexed terms!”  But that must not mean that the
> normalizer only acts on the facet when there is a negative pos value?
>
>
>
> This is going to be wide, but here is our normalizer setup and our series
> metabib field info.
>
>
>
> Normalizer map:
>
>  id | field | norm | params       | pos
> ----+-------+------+--------------+-----
>  51 |    32 |    2 |              |   0
>   1 |     1 |    2 |              |   0
>  62 |     1 |   13 | ["[",""]     |  -1
>  61 |     1 |   13 | ["]",""]     |  -1
>  52 |    32 |   17 |              |   0
>   2 |     1 |   17 |              |   0
>  64 |     1 |   18 | [" *;.*",""] |  -1
>
> Metabib fields (one record per field):
>
> id                | 32
> field_class       | series
> name              | browse
> label             | Series Title (Browse)
> xpath             | //mods32:mods/mods32:relatedItem[@type="series"]/mods32:titleInfo[@type="nfi"]
> weight            | 1
> format            | mods32
> search_field      | false
> facet_field       | false
> browse_field      | true
> browse_xpath      | *[local-name() != "nonSort"]
> browse_sort_xpath |
> facet_xpath       |
> authority_xpath   | //@xlink:href
> joiner            |
> restrict          | false
>
> id                | 1
> field_class       | series
> name              | seriestitle
> label             | Series Title
> xpath             | //mods32:mods/mods32:relatedItem[@type="series"]/mods32:titleInfo[not(@type="nfi")]
> weight            | 1
> format            | mods32
> search_field      | true
> facet_field       | true
> browse_field      | false
> browse_xpath      |
> browse_sort_xpath |
> facet_xpath       |
> authority_xpath   | //@xlink:href
> joiner            |
> restrict          | false
>
> Index normalizers:
>
>  id | name                          | func             | param_count
> ----+-------------------------------+------------------+-------------
>   2 | Normalize date range          | split_date_range |           0
>  13 | Replace                       | replace          |           2
>  17 | Search Normalize              | search_normalize |           0
>  18 | Replace by regular expression | regexp_replace   |           2
>
> Normalizer descriptions:
>
>   2: Split date ranges in the form of "XXXX-YYYY" into "XXXX YYYY" for
>      proper index.
>  13: Replace all occurences of first parameter in the string with the
>      second parameter.
>  17: Apply search normalization rules to the extracted text. A less
>      extreme version of NACO normalization.
>  18: (no description)
>
>
>
> Thanks for any ideas you might have.
>
> Josh
>
>
>
> Lake Agassiz Regional Library - Moorhead MN larl.org
>
> Josh Stompro     | Office 218.233.3757 EXT-139
>
> LARL IT Director | Cell 218.790.2110
>
>
>