[OPEN-ILS-GENERAL] Series index, only first entry getting indexed

Thu Mar 2 14:36:28 EST 2017

Good catch, Josh and Mike. I still haven't reingested all of our records with 490s, and I'd rather not do it twice...

And looking at our normalizer map, we've got some things duplicated here. That's certainly not going to be fast... :/ (Things like running replace() twice; the second run is just burning cycles.)

Might be something worth investigating in case it happened during an upgrade?
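
One way to look for duplicated entries, as a rough sketch: group the normalizer map by field, normalizer, and parameters, and report any combination that appears more than once. The table and column names are the ones used elsewhere in this thread; the "copies" alias is only illustrative, and treating rows that differ only in pos as duplicates is an assumption about what "duplicated" means here.

```sql
-- Sketch: list normalizer map entries that occur more than once for a field.
-- Assumption: two rows count as duplicates when field, norm, and params match,
-- even if their pos values differ.
SELECT field, norm, params, count(*) AS copies
  FROM config.metabib_field_index_norm_map
 GROUP BY field, norm, params
HAVING count(*) > 1
 ORDER BY field, norm;
```

Any row returned is a normalizer that runs more than once on the same field with the same parameters.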

Jason

--
Jason Boyer
MIS Supervisor
Indiana State Library
http://library.in.gov/

From: Open-ils-general [mailto:open-ils-general-bounces at list.georgialibraries.org] On Behalf Of Mike Rylander
Sent: Thursday, March 02, 2017 9:59 AM
To: Evergreen Discussion Group <open-ils-general at list.georgialibraries.org>
Subject: Re: [OPEN-ILS-GENERAL] Series index, only first entry getting indexed

Josh,

btrim would be the natural normalizer to use, so let's test the timing to see if it's faster...

evergreen=# explain analyze select count(btrim(value)) from metabib.real_full_rec ;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=6129.66..6129.67 rows=1 width=11) (actual time=542.861..542.862 rows=1 loops=1)
   ->  Seq Scan on real_full_rec  (cost=0.00..4989.77 rows=227977 width=11) (actual time=0.010..221.404 rows=228822 loops=1)
 Planning time: 0.080 ms
 Execution time: 542.899 ms
(4 rows)

evergreen=# explain analyze select count(regexp_replace(value,' *$','','')) from metabib.real_full_rec ;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=6129.66..6129.67 rows=1 width=11) (actual time=818.893..818.894 rows=1 loops=1)
   ->  Seq Scan on real_full_rec  (cost=0.00..4989.77 rows=227977 width=11) (actual time=0.010..230.265 rows=228822 loops=1)
 Planning time: 0.079 ms
 Execution time: 818.931 ms
(4 rows)

btrim is almost 50% faster (about 543 ms versus 819 ms over the same 228,822 rows)! I didn't expect that, actually. So I'd recommend using btrim instead. Your future self will thank you on your next full reingest.
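
A minimal sketch of mapping btrim onto the series field instead of a trailing-space regexp_replace. It assumes a normalizer with func = 'btrim' is registered in config.index_normalizer and takes no parameters, and it reuses field 1 and pos -1 from Josh's example; check the ids and positions in your own database before running anything like it.

```sql
BEGIN;
-- Assumption: exactly one normalizer row has func = 'btrim' and it needs no params.
-- Field 1 and pos -1 are taken from the example in this thread.
INSERT INTO config.metabib_field_index_norm_map (field, norm, pos)
SELECT 1, id, -1
  FROM config.index_normalizer
 WHERE func = 'btrim';
COMMIT;
```

Note that btrim removes leading as well as trailing spaces, which is slightly more than the ' *$' pattern does.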

HTH,

--
Mike Rylander
 | President
 | Equinox Open Library Initiative
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker at equinoxinitiative.org
 | web:  http://equinoxinitiative.org

On Thu, Mar 2, 2017 at 9:29 AM, Josh Stompro <stomproj at exchange.larl.org> wrote:
Jason, this alone seems to leave trailing spaces in the facet entry table, because the space before the semicolon is left in place. That space is required so the series index does not concatenate the last word of one 490 with the first word of the next 490.

I tried adding a second normalizer that just strips trailing spaces and that seems to take care of it.
insert into config.metabib_field_index_norm_map (field,norm,params,pos) values (1,18,'[" *$","",""]',-1);
-- Change the first normalizer's position to -2.
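
To see why the second pass is needed, the two patterns can be tried by hand on a sample value. The series string below is hypothetical (the thread does not show the raw 490 text); the patterns are the two normalizer regexes discussed in this thread.

```sql
-- Hypothetical 490 value; the first pattern strips the numbering,
-- the second strips the trailing space that the first one leaves behind.
SELECT regexp_replace('American Girl Beforever Felicity ; 1',
                      '; *[0-9]*', '', 'g') AS after_first,
       regexp_replace(
         regexp_replace('American Girl Beforever Felicity ; 1',
                        '; *[0-9]*', '', 'g'),
         ' *$', '', '') AS after_both;
```

Here after_first still ends in a space ('American Girl Beforever Felicity '), while after_both is 'American Girl Beforever Felicity'.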

There is also the btrim normalizer; I don’t know if that would be better or faster than using another regexp_replace.

Josh Stompro - LARL IT Director

From: Open-ils-general [mailto:open-ils-general-bounces at list.georgialibraries.org] On Behalf Of Boyer, Jason A
Sent: Wednesday, March 01, 2017 10:22 AM
To: Evergreen Discussion Group

Subject: Re: [OPEN-ILS-GENERAL] Series index, only first entry getting indexed

Thanks for figuring this out, Josh. I was able to modify our normalizer like so to continue removing the $v:
BEGIN;
-- Let regexp_replace accept a third parameter (the flags argument).
UPDATE config.index_normalizer SET param_count = 3 WHERE func = 'regexp_replace';
-- Remove each semicolon plus any following spaces and digits, everywhere it occurs ("g").
UPDATE config.metabib_field_index_norm_map SET params = '["; *[0-9]*","","g"]' WHERE field = 1 AND norm IN (SELECT id FROM config.index_normalizer WHERE func = 'regexp_replace');
COMMIT;

If you have more than one normalizer that uses regexp_replace, or you use it on more than one field, you won't want to run this as-is. If you only have the one and currently use it only on your series titles, it's good to go.
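
Before running an update like this, a quick look at which map entries point at regexp_replace shows whether the caveat applies. This is only a sketch using the tables and columns named in this thread; the aliases m and n are illustrative.

```sql
-- Sketch: list every normalizer map entry that uses regexp_replace.
SELECT m.id, m.field, m.params, m.pos
  FROM config.metabib_field_index_norm_map AS m
  JOIN config.index_normalizer AS n ON n.id = m.norm
 WHERE n.func = 'regexp_replace'
 ORDER BY m.field, m.pos;
```

If this returns more than one row, or rows for fields other than 1, the updates would need a narrower WHERE clause (for example, matching on the map entry's id).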

Jason


From: Open-ils-general [mailto:open-ils-general-bounces at list.georgialibraries.org] On Behalf Of Josh Stompro
Sent: Wednesday, March 01, 2017 10:41 AM
To: Evergreen Discussion Group <open-ils-general at list.georgialibraries.org>
Subject: Re: [OPEN-ILS-GENERAL] Series index, only first entry getting indexed

Removing the regex replace normalizer did take care of it; sorry I didn’t try that before posting.  I think my regex will have to be more selective, removing only the number and the ‘;’ so it doesn’t clear out too much data.

Josh Stompro - LARL IT Director

From: Open-ils-general [mailto:open-ils-general-bounces at list.georgialibraries.org] On Behalf Of Josh Stompro
Sent: Wednesday, March 01, 2017 9:19 AM
To: open-ils-general at list.georgialibraries.org
Subject: [OPEN-ILS-GENERAL] Series index, only first entry getting indexed

Hello, we have noticed that only the first 490 gets indexed for our series search index.  But all 490s get added to the series facet entry.

For example, here is a title with two 490s in mods32 format.
https://egcatalog.larl.org/opac/extras/unapi?id=tag::U2@bre/237592&format=mods32

The second 490 of “Felicity classic” isn’t searchable.

When I look at the metabib.combined_series_field_entry I see the following for this record.
 record | metabib_field |                        index_vector
--------+---------------+------------------------------------------------------------
 237592 |               | 'american' 'beforev' 'beforever' 'felic' 'felicity' 'girl'
 237592 |             1 | 'american' 'beforev' 'beforever' 'felic' 'felicity' 'girl'

metabib.series_field_entry
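   id   | source | field |              value               |                                    index_vector
--------+--------+-------+----------------------------------+------------------------------------------------------------------------------------
 430451 | 237592 |     1 | American Girl Beforever Felicity | 'american':1A,5C 'beforev':7C 'beforever':3A 'felic':8C 'felicity':4A 'girl':2A,6C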

metabib.facet_entry
              value               | count | bibid
----------------------------------+-------+--------
 American Girl Beforever Felicity |     1 | 237592
 Felicity classic                 |     1 | 237592

The one thing that I have done is add a search normalizer that removes the series numbering from the facet entry.  Unfortunately, I don’t remember if this issue came up before I added the normalizer.  Maybe, when used on the index version, the regex replace is actually acting on all of the 490 info concatenated together, so by getting rid of everything after the first ‘ ;’ I’m clearing the second 490 entry’s data?  But it does work correctly on the facet data?
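
That guess is easy to check by hand. As a sketch, assume the two 490 values reach the normalizer joined by a space; the joined string below, including the numbering, is hypothetical, while the pattern is the one from our normalizer map.

```sql
-- Hypothetical concatenation of two 490 values, run through the ' *;.*' pattern.
SELECT regexp_replace(
         'American Girl Beforever Felicity ; 1 Felicity classic ; 2',
         ' *;.*', '') AS normalized;
```

The result is 'American Girl Beforever Felicity': everything from the first ' ;' onward is removed, including the whole second series title.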

There is a note on  https://wiki.evergreen-ils.org/doku.php?id=documentation:indexing#field_normalization_settings
“Note: Only normalizations with a negative pos value are applied to the facet version of indexed terms!”  But that must not mean that the normalizer only acts on the facet when there is a negative pos value?

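This is going to be long, but here is our normalizer setup and our series metabib field info. Each record lists the normalizer map columns first, then the metabib field columns, then the normalizer columns.

-[ RECORD 1 ]-----+------------------------------------------------------------
id                | 51
field             | 32
norm              | 2
params            |
pos               | 0
id                | 32
field_class       | series
name              | browse
label             | Series Title (Browse)
xpath             | //mods32:mods/mods32:relatedItem[@type="series"]/mods32:titleInfo[@type="nfi"]
weight            | 1
format            | mods32
search_field      | false
facet_field       | false
browse_field      | true
browse_xpath      |
browse_sort_xpath | *[local-name() != "nonSort"]
facet_xpath       |
authority_xpath   | //@xlink:href
joiner            |
restrict          | false
id                | 2
name              | Normalize date range
description       | Split date ranges in the form of "XXXX-YYYY" into "XXXX YYYY" for proper index.
func              | split_date_range
param_count       | 0
-[ RECORD 2 ]-----+------------------------------------------------------------
id                | 1
field             | 1
norm              | 2
params            |
pos               | 0
id                | 1
field_class       | series
name              | seriestitle
label             | Series Title
xpath             | //mods32:mods/mods32:relatedItem[@type="series"]/mods32:titleInfo[not(@type="nfi")]
weight            | 1
format            | mods32
search_field      | true
facet_field       | true
browse_field      | false
browse_xpath      |
browse_sort_xpath |
facet_xpath       |
authority_xpath   | //@xlink:href
joiner            |
restrict          | false
id                | 2
name              | Normalize date range
description       | Split date ranges in the form of "XXXX-YYYY" into "XXXX YYYY" for proper index.
func              | split_date_range
param_count       | 0
-[ RECORD 3 ]-----+------------------------------------------------------------
id                | 62
field             | 1
norm              | 13
params            | ["[",""]
pos               | -1
id                | 1
field_class       | series
name              | seriestitle
label             | Series Title
xpath             | //mods32:mods/mods32:relatedItem[@type="series"]/mods32:titleInfo[not(@type="nfi")]
weight            | 1
format            | mods32
search_field      | true
facet_field       | true
browse_field      | false
browse_xpath      |
browse_sort_xpath |
facet_xpath       |
authority_xpath   | //@xlink:href
joiner            |
restrict          | false
id                | 13
name              | Replace
description       | Replace all occurences of first parameter in the string with the second parameter.
func              | replace
param_count       | 2
-[ RECORD 4 ]-----+------------------------------------------------------------
id                | 61
field             | 1
norm              | 13
params            | ["]",""]
pos               | -1
id                | 1
field_class       | series
name              | seriestitle
label             | Series Title
xpath             | //mods32:mods/mods32:relatedItem[@type="series"]/mods32:titleInfo[not(@type="nfi")]
weight            | 1
format            | mods32
search_field      | true
facet_field       | true
browse_field      | false
browse_xpath      |
browse_sort_xpath |
facet_xpath       |
authority_xpath   | //@xlink:href
joiner            |
restrict          | false
id                | 13
name              | Replace
description       | Replace all occurences of first parameter in the string with the second parameter.
func              | replace
param_count       | 2
-[ RECORD 5 ]-----+------------------------------------------------------------
id                | 52
field             | 32
norm              | 17
params            |
pos               | 0
id                | 32
field_class       | series
name              | browse
label             | Series Title (Browse)
xpath             | //mods32:mods/mods32:relatedItem[@type="series"]/mods32:titleInfo[@type="nfi"]
weight            | 1
format            | mods32
search_field      | false
facet_field       | false
browse_field      | true
browse_xpath      |
browse_sort_xpath | *[local-name() != "nonSort"]
facet_xpath       |
authority_xpath   | //@xlink:href
joiner            |
restrict          | false
id                | 17
name              | Search Normalize
description       | Apply search normalization rules to the extracted text. A less extreme version of NACO normalization.
func              | search_normalize
param_count       | 0
-[ RECORD 6 ]-----+------------------------------------------------------------
id                | 2
field             | 1
norm              | 17
params            |
pos               | 0
id                | 1
field_class       | series
name              | seriestitle
label             | Series Title
xpath             | //mods32:mods/mods32:relatedItem[@type="series"]/mods32:titleInfo[not(@type="nfi")]
weight            | 1
format            | mods32
search_field      | true
facet_field       | true
browse_field      | false
browse_xpath      |
browse_sort_xpath |
facet_xpath       |
authority_xpath   | //@xlink:href
joiner            |
restrict          | false
id                | 17
name              | Search Normalize
description       | Apply search normalization rules to the extracted text. A less extreme version of NACO normalization.
func              | search_normalize
param_count       | 0
-[ RECORD 7 ]-----+------------------------------------------------------------
id                | 64
field             | 1
norm              | 18
params            | [" *;.*",""]
pos               | -1
id                | 1
field_class       | series
name              | seriestitle
label             | Series Title
xpath             | //mods32:mods/mods32:relatedItem[@type="series"]/mods32:titleInfo[not(@type="nfi")]
weight            | 1
format            | mods32
search_field      | true
facet_field       | true
browse_field      | false
browse_xpath      |
browse_sort_xpath |
facet_xpath       |
authority_xpath   | //@xlink:href
joiner            |
restrict          | false
id                | 18
name              | Replace by regular expression
description       |
func              | regexp_replace
param_count       | 2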

Thanks for any ideas you might have.
Josh

Lake Agassiz Regional Library - Moorhead MN larl.org
Josh Stompro     | Office 218.233.3757 EXT-139
LARL IT Director | Cell 218.790.2110
