[Evergreen-dev] marc_export is dead, long live eg-marc-export (was Re: marc_export - apache crashes)

Jason Stephenson jason at sigio.com
Mon Oct 16 10:58:08 EDT 2023


Brian,

I was planning to write this email today anyway, and your latest email 
presents an opportunity.

TL;DR: Don't use marc_export. Switch to this: 
https://github.com/kcls/evergreen-universe-rs/blob/main/evergreen/src/bin/marc-export.rs

Warning! Wall of text incoming.

On 10/16/23 08:54, Brian Holda via Evergreen-dev wrote:
> Just want to say thank you to people for helping us out with the marc 
> export (especially Josh Stompro was really helpful in getting our 
> commands right and Jason Stephenson, I believe, updated the export to 
> make it work better and pointed me to some good resources).

You are welcome. I'm glad that our advice has been useful.

> 
> We ran an export over the weekend that worked successfully without 
> crashing the server. 1.4 million records, we did it in batches of 100k 
> per file and it was roughly 40 min. or so per 100k records. Some of our 
> bibs have A LOT of items attached, so I didn't think this was crazy in 
> timing.

I have been looking at this, and I believe the problem with MARC export 
performance is Perl. I have also been working on a large export. (A 
vendor wants all of our records that have items, with holdings 
information in the 852 fields.) Running an export of all 1.74 million* 
of our records with items takes just over 5 days with marc_export from 
the Evergreen support-scripts, both in my test environment and on our 
production server.

While waiting on this to finish, I have had ample opportunity to examine 
the queries that marc_export uses and to run them through EXPLAIN 
ANALYZE in the database. The slowest part of any query is an index scan 
on asset.copy's cp_cn_idx, which takes 46 milliseconds on a 
record with a lot of copies. Most of the other things that were "slow" 
still ran on the order of 10ms, and these were joins or other index 
scans. There doesn't seem to be much (if any) room to improve the 
database performance.
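
For anyone who wants to poke at the same thing, here is a minimal 
sketch of the kind of check I ran, assuming the Rust postgres crate. 
The query is a simplified, hypothetical join of asset.copy to 
asset.call_number following the stock Evergreen schema; it is not the 
exact SQL that marc_export builds, and the connection string and 
record ID are placeholders:

    use postgres::{Client, NoTls};

    fn main() -> Result<(), postgres::Error> {
        // Placeholder connection string; adjust for your environment.
        let mut client =
            Client::connect("host=localhost user=evergreen dbname=evergreen", NoTls)?;

        // Hypothetical copy-retrieval query for one bib record; the real
        // queries marc_export builds are more involved.
        let plan = client.query(
            "EXPLAIN (ANALYZE, BUFFERS)
             SELECT cp.*
             FROM asset.copy cp
             JOIN asset.call_number cn ON cn.id = cp.call_number
             WHERE cn.record = 12345 AND NOT cp.deleted",
            &[],
        )?;

        // EXPLAIN returns one text column per line of the plan.
        for row in plan {
            let line: String = row.get(0);
            println!("{line}");
        }

        Ok(())
    }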

I also monitored the resource use of a single process doing a full 
extraction. (Those who frequent the IRC channel will have already seen 
these numbers.) The process used 9.5GB of RAM and kept 1 processor 
pegged at 98% for most of its runtime. The latter number suggests that 
much of the time is spent in the Perl code manipulating the MARC data, 
and/or the massive rowset.

So, I also tried an export doing batches of 5,000 records at a time. I 
started this about 12:30 PM EDT on October 12. As of this writing, it 
is still running. At first, it looked like it would be faster than 
the single process, but given that it has about 400,000 records left to 
extract, I don't think it will be much faster in the end.

Doing batches of 5,000 records does reduce the memory use. These 
processes use only about 225MB of RAM. However, they still keep 1 CPU 
running at 98% to 99% capacity, which indicates that most of the time is 
spent manipulating data, not waiting on the database to return results.

One could take advantage of multiple CPUs in the machine and run more 
than 1 batch in parallel. However, my production utility server also 
does other tasks, and doing more than 2 or 3 batches at a time could 
possibly interfere with the timing of those jobs. Doing just a couple of 
batches at a time also would not save enough time on the export to be 
worth it. 
(That last point is debatable. It's mostly a matter of opinion on what's 
an acceptable amount of time to wait for a record export.)
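
If someone does want to try the parallel route, the scheduling part is 
simple enough: carve the bib ID list into batches and cap how many run 
at once. A minimal sketch of that idea in Rust follows; the ID list, 
batch size, and process_batch() stub are all hypothetical, standing in 
for whatever actually does the export for one batch:

    use std::thread;

    // Stand-in for whatever exports one batch of records (for example, a
    // marc_export run or a direct database extraction for these IDs).
    fn process_batch(ids: &[i64]) {
        println!("exporting {} records starting at {}", ids.len(), ids[0]);
    }

    fn main() {
        // Hypothetical list of bib record IDs to export.
        let ids: Vec<i64> = (1..=20_000).collect();
        let batch_size = 5_000;
        // Keep this low so other jobs on the utility server are not starved.
        let max_parallel = 2;

        let batches: Vec<&[i64]> = ids.chunks(batch_size).collect();

        // Run at most max_parallel batches at a time, waiting for each
        // group to finish before starting the next.
        for group in batches.chunks(max_parallel) {
            thread::scope(|s| {
                for &batch in group {
                    s.spawn(move || process_batch(batch));
                }
            });
        }
    }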

So, what can we do? The way I see it we have two basic options:

1. Work on improvements to the Perl MARC code and/or DBI (Perl's 
generic database interface).

2. Reimplement marc_export in a different programming language that is 
more efficient.

My choice has already been made by Bill Erickson, who has implemented 
some basic Evergreen and OpenSRF connector code in Rust: 
https://github.com/kcls/evergreen-universe-rs/tree/main

This collection includes a MARC export program that is called 
eg-marc-export when installed. It does not yet have all of the features 
of the Perl marc_export, but I plan to make pull requests for a couple 
of these missing features this week or next. (The link at the top of 
this email is to the source code for the MARC exporter.)

I ran a test of the Rust MARC export over the weekend on a test VM using 
production CW MARS data. It exported 2,251,864 records* with 7,775,264 
items in 18.3 hours. This is still a bit long, but a lot better than the 
performance that I was getting with Perl.

Not only is it faster, but it is also more efficient with resources. I spot 
checked the CPU and memory usage a few times during its run. It never 
exceeded 32% CPU usage and only used about 26MB of RAM. The lower 
numbers imply that the process spends more time streaming data from the 
database than it does manipulating the data. There may be room for 
improvement here, but the savings in execution time (from 5 days to less 
than 1 day in my case) are well worth it at this point.
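
For what it's worth, I think the low memory number comes down to 
streaming: rows are consumed as they arrive from the server instead of 
being buffered into one giant result set. Here is a minimal sketch of 
that pattern, assuming the postgres and fallible-iterator crates. This 
is my own illustration, not code from eg-marc-export, and the query and 
connection string are placeholders:

    use fallible_iterator::FallibleIterator;
    use postgres::{Client, NoTls};

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Placeholder connection string; adjust for your environment.
        let mut client =
            Client::connect("host=localhost user=evergreen dbname=evergreen", NoTls)?;

        // query_raw() hands rows back as they stream in from the server,
        // so only one row needs to be held in memory at a time.
        let mut rows = client.query_raw(
            "SELECT cn.record, cp.barcode
             FROM asset.copy cp
             JOIN asset.call_number cn ON cn.id = cp.call_number
             WHERE NOT cp.deleted
             ORDER BY cn.record",
            std::iter::empty::<i64>(), // no query parameters
        )?;

        while let Some(row) = rows.next()? {
            let record: i64 = row.get(0);
            let barcode: String = row.get(1);
            // A real exporter would fold this copy into the MARC record
            // for `record` (building an 852, for example) and write it out.
            let _ = (record, barcode);
        }

        Ok(())
    }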

As a parting shot, I'd like to see us move to Rust after the 
implementation of OpenSRF with Redis is done. I suspect that we would 
get much more bang for our buck by replacing the Perl code first, 
rather than the C code, but that is a conversation for another time.

* The discrepancy in numbers is because the Evergreen marc_export was 
limited to OPAC-visible copies, while the Rust export was not so 
limited. It may have included deleted copies. (I'll have to check that.)

Cheers,
Jason Stephenson


> 
> Thanks again - really appreciate the support of this community!
> Brian
> 
> Brian Holda
> Library Technology Manager
> Hekman Library
> Calvin University
> (616) 526-8673
> 
> <https://library.calvin.edu/>

