[Evergreen-dev] marc_export is dead, long live eg-marc-export (was Re: marc_export - apache crashes)
Jason Stephenson
jason at sigio.com
Mon Oct 16 10:58:08 EDT 2023
Brian,
I was planning to write this email today anyway, and your latest email
presents an opportunity.
TL;DR: Don't use marc_export. Switch to this:
https://github.com/kcls/evergreen-universe-rs/blob/main/evergreen/src/bin/marc-export.rs
Warning! Wall of text incoming.
On 10/16/23 08:54, Brian Holda via Evergreen-dev wrote:
> Just want to say thank you to people for helping us out with the marc
> export (especially Josh Stompro was really helpful in getting our
> commands right and Jason Stephenson, I believe, updated the export to
> make it work better and pointed me to some good resources).
You are welcome. I'm glad that our advice has been useful.
>
> We ran an export over the weekend that worked successfully without
> crashing the server. 1.4 million records, we did it in batches of 100k
> per file and it was roughly 40 min. or so per 100k records. Some of our
> bibs have A LOT of items attached, so I didn't think this was crazy in
> timing.
I have been looking at this, and I believe the problem with MARC export
performance is Perl. I have also been working on a large export. (A
vendor wants all of our records that have items, with holdings
information in the 852s.) Running an export of all 1.74 million* of our
records with items takes just over 5 days, both in my test environment
and on our production server, using marc_export from the Evergreen
support-scripts.
While waiting on this to finish, I have had ample opportunity to examine
the queries that marc_export uses and to run them through EXPLAIN
ANALYZE in the database. The slowest part of any query is an index
sequence scan on asset.copy's cp_cn_idx that takes 46 milliseconds on a
record with a lot of copies. Most of the other things that were "slow"
still ran on the order of 10ms, and these were joins or other index
scans. There doesn't seem to be much (if any) room to improve the
database performance.
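If anyone wants to script that kind of check, a sketch along the lines
below will print the plan. It uses the stock Rust postgres crate; the
connection string is a placeholder, and the join is only an
illustration of the asset.copy / asset.call_number lookup, not the
exact SQL that marc_export generates.

    use postgres::{Client, NoTls};

    fn main() -> Result<(), postgres::Error> {
        // Placeholder connection string; point it at your Evergreen DB.
        let mut client = Client::connect(
            "host=localhost user=evergreen dbname=evergreen", NoTls)?;

        // Illustrative join only -- not the SQL marc_export builds.
        let plan = client.query(
            "EXPLAIN ANALYZE
             SELECT cp.id, cn.label
             FROM asset.call_number cn
             JOIN asset.copy cp ON cp.call_number = cn.id
             WHERE cn.record = 1234567 AND NOT cp.deleted",
            &[],
        )?;

        // EXPLAIN output comes back as one text column per row.
        for row in plan {
            let line: String = row.get(0);
            println!("{}", line);
        }
        Ok(())
    }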
I also monitored the resource use of a single process doing a full
extraction. (Those who frequent the IRC channel will have already seen
these numbers.) The process used 9.5GB of RAM and kept 1 processor
pegged at 98% for most of its runtime. The latter number suggests that
much of the time is spent in the Perl code manipulating the MARC data
and/or the massive rowset.
So, I also tried an export doing batches of 5,000 records at a time. I
started this about 12:30 PM EDT on October 12. As of writing this email
it is still running. At first, it looked like it would be faster than
the single process, but given that it has about 400,000 records left to
extract, I don't think it will be much faster in the end.
Doing batches of 5,000 records does reduce the memory use. These
processes use only about 225MB of RAM. However, they still keep 1 CPU
running at 98% to 99% capacity, which indicates that most of the time is
spent manipulating data, not waiting on the database to return results.
One could take advantage of multiple CPUs in the machine and run more
than 1 batch in parallel. However, my production utility server also
does other tasks, and doing more than 2 or 3 batches at a time could
possibly interfere with the timing of those jobs. Running just a couple
of batches at a time also would not save enough time on the export to
be worthwhile.
(That last point is debatable. It's mostly a matter of opinion on what's
an acceptable amount of time to wait for a record export.)
So, what can we do? The way I see it we have two basic options:
1. Work on improvements to the MARC Perl code and/or DBI (the generic
database connection code for Perl).
2. Reimplement marc_export in a different programming language that is
more efficient.
My choice has already been made by Bill Erickson, who has implemented
some basic Evergreen and OpenSRF connector code in Rust:
https://github.com/kcls/evergreen-universe-rs/tree/main
This collection includes a MARC export program that is called
eg-marc-export when installed. It does not yet have all of the features
of the Perl marc_export, but I plan to make pull requests for a couple
of these missing features this week or next. (The link at the top of
this email is to the source code for the MARC exporter.)
I ran a test of the Rust MARC export over the weekend on a test VM using
production CW MARS data. It exported 2,251,864 records* with 7,775,264
items in 18.3 hours. This is still a bit long, but a lot better than the
performance that I was getting with Perl.
Not only is it faster, but it is also more efficient with resources. I
spot-checked the CPU and memory usage a few times during its run. It never
exceeded 32% CPU usage and only used about 26MB of RAM. The lower
numbers imply that the process spends more time streaming data from the
database than it does manipulating the data. There may be room for
improvement here, but the savings in execution time (from 5 days to less
than 1 day in my case) are well worth it at this point.
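To illustrate what I mean by streaming, here is a rough sketch of the
general pattern using the Rust postgres crate and a server-side cursor.
This is not the eg-marc-export source; the query, the batch size, and
the write_record helper are placeholders for the real MARC handling.

    use postgres::{Client, NoTls};
    use std::io::{self, Write};

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Placeholder connection string.
        let mut client = Client::connect(
            "host=localhost user=evergreen dbname=evergreen", NoTls)?;
        let stdout = io::stdout();
        let mut out = stdout.lock();

        // A server-side cursor keeps only one small batch in memory at
        // a time instead of slurping the whole rowset.
        let mut tx = client.transaction()?;
        tx.execute(
            "DECLARE bibs CURSOR FOR
             SELECT id, marc FROM biblio.record_entry WHERE NOT deleted",
            &[],
        )?;

        loop {
            let rows = tx.query("FETCH 1000 FROM bibs", &[])?;
            if rows.is_empty() {
                break;
            }
            for row in &rows {
                let marcxml: String = row.get("marc");
                // Placeholder for the real MARCXML-to-MARC step.
                write_record(&mut out, &marcxml)?;
            }
        }

        tx.commit()?;
        Ok(())
    }

    // Stand-in for real MARC conversion; just writes the raw XML.
    fn write_record<W: Write>(out: &mut W, marcxml: &str) -> io::Result<()> {
        out.write_all(marcxml.as_bytes())
    }

Each batch is written out as soon as it arrives, which is why the
memory footprint stays flat no matter how many records are exported.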
As a parting shot, I'd like to see us move to Rust after the
implementation of OpenSRF with Redis is done. I suspect that we would
get much more bang for our buck by replacing the Perl code first rather
than the C code, but that is a conversation for another time.
* The discrepancy in numbers is because the Evergreen marc_export was
limited to only OPAC-visible copies and the Rust export was not so
limited. It may have included deleted copies. (I'll have to check that.)
Cheers,
Jason Stephenson
>
> Thanks again - really appreciate the support of this community!
> Brian
>
> Brian Holda
> Library Technology Manager
> Hekman Library
> Calvin University
> (616) 526-8673
>
> <https://library.calvin.edu/>