[Evergreen-dev] Optimize SQL query

Jason Stephenson jason at sigio.com
Fri Sep 1 22:07:13 EDT 2023



On 9/1/23 11:23, Tiffany Little wrote:
> Thanks Jason, that really helps!

You're welcome, Tiffany!

> 
> I don't necessarily need the mrd info, but I've pivoted in a completely 
> different direction now that you've solved one of my issues. Now I'm 
> trying to pull in the 245 and the 300 info, since that can help me 
> isolate in my list things that had a part erroneously applied. For 
> example, seeing the 300$a can tell me that if the record says something 
> like "245 pages," it might just be a regular monograph that doesn't 
> need parts, but someone added them for whatever reason. And the 245$a 
> and $n can tell me if it's a manga record for just one volume, in which 
> case we probably want to move those items to a series record.
> 
> So all that to say, I'm now joining in metabib.wide_display_entry and it 
> is taking *forever*. Is that pretty normal, or is there a better 
> way to get that info? I've attached my current query.

I tried your original query, but stopped it after 3.5 hours with no 
output. I started from the assumption that the join with 
metabib.wide_display_entry was the main cause of the slowness, but that 
turned out to be wrong. It looks like the left join on 
asset.copy_part_map is the real cause of the slowdown.
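
For anyone following along, the slow anti-join pattern has roughly this 
shape (a sketch of the pattern with the rest of the query elided, not 
Tiffany's exact SQL):

  from asset.copy acp
  left join asset.copy_part_map acpm on acpm.target_copy = acp.id
  ...
  where acpm.id is null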

I tried a few variations on your query. The first replaced 
metabib.wide_display_entry with xpath() functions to get the titles and 
physical descriptions directly from the MARC, but kept the left join on 
asset.copy_part_map. I stopped this version after 30 minutes with no 
output.
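
In case the xpath() approach is new to anyone, here is a minimal sketch 
of the idea against a single record (the bib id is just a placeholder; 
the full queries are attached below):

  select xpath('/r:record/r:datafield[@tag="245"]/r:subfield[@code="a"]/text()',
               marc::xml,
               array[array['r', 'http://www.loc.gov/MARC21/slim']])::text[]
  from biblio.record_entry
  where id = 12345; -- placeholder bib id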

The next attempt dropped the left join and replaced it with "not exists 
(select 1 from asset.copy_part_map where copy_part_map.target_copy = 
acp.id)." This one still used the xpath expressions. It finished in a 
"reasonable" amount of time: 13 minutes on an optimized Pg 10 database 
and about 9.5 minutes on a Pg 15 database with default settings running 
on the same hardware with the same data.

The final attempt put the join with metabib.wide_display_entry back and 
it finished fastest: 3 minutes 45 seconds on Pg 10 and 1 minute 44 
seconds on Pg 15.

However, all but one of the "physical_description" fields came out 
"null" with that final version, and the one that wasn't "null" had 
incorrect information. I don't know if that is a general Evergreen 
problem or just an issue with my data. (I might need to do a reingest 
or something like that.)

I have attached these two versions of the query so you can try them out. 
I did not run an EXPLAIN to see why the performance differs between the 
two, partly because the faster one was missing data in my case and 
partly because it is getting late. :)
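
If anyone else wants to poke at this, running EXPLAIN over just the 
not-exists piece is a quick place to start, e.g.:

  explain (analyze, buffers)
  select count(*) from asset.copy acp
  where not exists (select 1 from asset.copy_part_map acpm
                    where acpm.target_copy = acp.id);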

I guess the takeaway from this little experiment is: don't use left 
joins to find things that are not there; use 'not exists' instead. 
Also, upgrade your database to Pg 15. ;)

HtH,
Jason Stephenson

> 
> (I'm hoping this discussion is helpful for someone else too, since I 
> know thinking about parts can be an angsty endeavor.)
> 
> Thanks!
> Tiffany
> 
> 
> Tiffany Little, PINES Bibliographic Projects Manager
> 
> ------------------------------------------------------------------------
> 
> Georgia Public Library Service
> 
> 2872 Woodcock Blvd, Suite 250 | Atlanta, GA 30341
> 
> (404) 235-7161 | tlittle at georgialibraries.org
> 
> 
> Join our email list <http://georgialibraries.org/> for stories of Georgia 
> libraries making an impact in our communities.
> 
> 
> 
> On Thu, Aug 31, 2023 at 12:58 PM Jason Stephenson via Evergreen-dev 
> <evergreen-dev at list.evergreen-ils.org> wrote:
> 
>     Tiffany,
> 
>     Do you need the information from metabib.rec_descriptor? If not,
>     dropping the view from your query speeds things up for me.
> 
>     I think you're getting duplicates because of the join on
>     biblio.monograph_part. I suspect you're getting one output row for
>     every part. You can eliminate the duplicates by adding 'distinct' after
>     'select'.
> 
>     I made these changes in the attached file, and it runs reasonably
>     quickly on my test database with production data.
> 
>     HtH,
>     Jason
>     _______________________________________________
>     Evergreen-dev mailing list
>     Evergreen-dev at list.evergreen-ils.org
>     http://list.evergreen-ils.org/cgi-bin/mailman/listinfo/evergreen-dev
> 
-------------- next part --------------
-- Version 1: titles and physical descriptions pulled directly from the MARC.
select distinct
    acp.barcode,
    bre.tcn_value,
    acn.label as call_number,
    -- 245 $a and $n (title and part number) from the MARC XML
    xpath('/r:record/r:datafield[@tag="245"]/r:subfield[@code="a" or @code="n"]/text()',
          bre.marc::xml,
          array[array['r', 'http://www.loc.gov/MARC21/slim']])::text[] as title,
    -- 300 $a (extent) from the MARC XML
    xpath('/r:record/r:datafield[@tag="300"]/r:subfield[@code="a"]/text()',
          bre.marc::xml,
          array[array['r', 'http://www.loc.gov/MARC21/slim']])::text[] as physical_description,
    aou.shortname as circ_lib,
    acl.name as shelving_location,
    acp.create_date,
    acp.circulate,
    acp.holdable,
    acp.ref,
    acp.opac_visible,
    ccs.name as status
from asset.copy acp
join asset.call_number acn on acn.id = acp.call_number
join asset.copy_location acl on acl.id = acp.location
join biblio.record_entry bre on bre.id = acn.record
join biblio.monograph_part bmp on bmp.record = bre.id
join actor.org_unit aou on aou.id = acp.circ_lib
join config.copy_status ccs on ccs.id = acp.status
where acp.deleted = 'f'
  and bmp.id > 0
  and bmp.deleted = 'f'
  and acp.holdable = 't'
  and ccs.holdable = 't'
  and acl.holdable = 't'
  -- holdable copies on records with parts where no part is mapped to the copy
  and not exists (select 1 from asset.copy_part_map
                  where copy_part_map.target_copy = acp.id);
-------------- next part --------------
-- Version 2: titles and physical descriptions from metabib.wide_display_entry.
select distinct
    acp.barcode,
    bre.tcn_value,
    acn.label as call_number,
    mwde.title,
    mwde.physical_description,
    aou.shortname as circ_lib,
    acl.name as shelving_location,
    acp.create_date,
    acp.circulate,
    acp.holdable,
    acp.ref,
    acp.opac_visible,
    ccs.name as status
from asset.copy acp
join asset.call_number acn on acn.id = acp.call_number
join asset.copy_location acl on acl.id = acp.location
join biblio.record_entry bre on bre.id = acn.record
join biblio.monograph_part bmp on bmp.record = bre.id
join actor.org_unit aou on aou.id = acp.circ_lib
join config.copy_status ccs on ccs.id = acp.status
join metabib.wide_display_entry mwde on mwde.source = bre.id
where acp.deleted = 'f'
  and bmp.id > 0
  and bmp.deleted = 'f'
  and acp.holdable = 't'
  and ccs.holdable = 't'
  and acl.holdable = 't'
  -- holdable copies on records with parts where no part is mapped to the copy
  and not exists (select 1 from asset.copy_part_map
                  where copy_part_map.target_copy = acp.id);

