[OPEN-ILS-DEV] import data from voyager

Mike Rylander mrylander at gmail.com
Fri May 25 12:14:06 EDT 2007


On 5/25/07, Jason Zou <qzou at lakeheadu.ca> wrote:
> Dan Scott wrote:
> > On 24/05/07, Jason Zou <qzou at lakeheadu.ca> wrote:
> >> Don Hamilton wrote:
> >> > Congratulations, Jason... You beat me by a long shot.... But now
> >> > that you have, (and now that Rene has our system starting with NO
> >> > logged errors!) can you give us some details of what exactly you did?
> >> > Once I build on what you've done and loaded some records, I'll
> >> > volunteer for a wiki account and add a 'voyager data load' section
> >> > once we're done.
> >> >
> >> > Also, could you define "very slow"? As I mentioned in a query a week
> >> > or so ago, I'd like to iterate through loading some 5 million bibs,
> >> > and if 10,000 are very slow, I'm afraid to think what 5m would be. I
> >> > have some experience with bulk loading into fresh databases, and will
> >> > take a stab at that if the 'regular' load is too slow.
> >> >
> >> > don
> >> >
> >> > >>> qzou at lakeheadu.ca 5/24/2007 10:44 AM >>>
> >> > Hi everyone,
> >> >
> >> > Using the scripts in Open-ILS/src/extra/import/, I imported about
> >> > 11,000 MARC records. Although the process was very slow, it seems to
> >> > be working, and records have been added to the following tables:
> >> >           biblio.record_entry
> >> >           metabib.rec_descriptor
> >> >           metabib.full_rec
> >> >           metabib.title_field_entry
> >> >           metabib.author_field_entry
> >> >           metabib.subject_field_entry
> >> >           metabib.keyword_field_entry
> >> >           metabib.series_field_entry
> >> >
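> >> > For reference, the pipeline I ran is roughly the following (flag
> >> > names are from memory, so check each script's --help for the exact
> >> > options):
> >> >
> >> >   # MARC -> Evergreen bib record objects
> >> >   perl marc2bre.pl records.mrc > records.bre
> >> >
> >> >   # generate the ingest records for the metabib.* tables
> >> >   perl direct_ingest.pl records.bre > records.ingest
> >> >
> >> >   # turn the ingest stream into SQL, then load it
> >> >   perl pg_loader.pl -or bre -or mrd -or mfr -or mtfe -or mafe \
> >> >       -or msfe -or mkfe -or msefe < records.ingest > load.sql
> >> >   psql -U evergreen -d evergreen -f load.sql
> >> >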
> >> > But when I tried to use the OPAC to find some records, I always got
> >> > nothing. I am wondering whether there is something that I missed.
> >> >
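> >> > Is there a quick way to sanity-check the load from psql?  I had
> >> > something like this in mind (the database and user names here are
> >> > placeholders for whatever your install uses):
> >> >
> >> >   psql -U evergreen -d evergreen \
> >> >        -c 'SELECT count(*) FROM biblio.record_entry WHERE NOT deleted;'
> >> >   psql -U evergreen -d evergreen \
> >> >        -c 'SELECT count(*) FROM metabib.title_field_entry;'
> >> >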
> >> > Any suggestions are highly appreciated.
> >> >
> >> > Jason
> >> >
> >> > Lakehead University
> >> >
> >> Hi Don,
> >>
> >> Thanks. I just wanted to see how fast the import would be. I exported
> >> 12,343 records from our Voyager database; it took about 1.5 hours to
> >> load them into Evergreen.
> >> Compared with the conversion step, surprisingly, it took only a minute
> >> or less to store the records in the Pg database.
> >>
> >> By the way, my server is on a P4 2.4 GHz machine with 1.5G RAM
> >> running FC5.
> >>
> >> Jason
> >>
> >> Lakehead University
> >>
> >>
> >
> > Strange. Here are the `time` results of running an import of the 14,449
> > Gutenberg records (including all of the steps described in my previous
> > email in this thread) on the Gentoo VMWare image with 512MB of RAM:
> >
> > real    51m7.623s
> > user    39m24.824s
> > sys     2m58.639s
> >
> > I'm not sure how my virtual machine with 1/3 of your physical
> > machine's RAM could possibly outperform your physical machine.
> > Something seems weird there. But I concur that the bulk of the time is
> > spent in the marc2bre.pl / direct_ingest.pl / pg_loader.pl processes.
> >
> Hi Dan,
>
> More RAM does not guarantee faster loading. I probably have a slower
> hard disk than yours. It seems that those scripts do not use much
> memory (only about 50M).

We're CPU bound in direct_ingest.pl.  In addition to XML parsing,
XPath evaluation, and DOM manipulation, there is some use of
server-side JavaScript (for fingerprinting records), which tends to be
slower than either Perl or C.
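
If you are stuck on that step and have a multi-processor box, one
workaround is to split the .bre file and run several ingests at once.
A rough sketch (it assumes marc2bre.pl's one-record-per-line output,
and of course it won't help on a single-CPU P4):

  split -l 2000 records.bre chunk.
  for f in chunk.*; do
      perl direct_ingest.pl "$f" > "$f.ingest" &   # one process per chunk
  done
  wait
  cat chunk.*.ingest > records.ingest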

>
> Although it may not be a fair comparison, MarcEdit converts my test
> MARC file to MARCXML in 3 seconds; marc2bre.pl takes about 10 minutes.
>

There is a tool called marcdumper in our CVS, originally from Index
Data and slightly modified to allow stripping of tags from the XML
output via XPath.  It does a very good job of MARC21-to-XML conversion,
including the MARC8 to UTF-8 mapping.  (As an aside, however, it
doesn't like the Project Gutenberg records for some reason.)

In any case, I use it on occasion because it uses iconv for the
encoding conversion and is therefore MUCH faster than the
MARC::Charset Perl module (currently).  It has sped up very large
(2M+ record) conversions by as much as 8-10x for me, but it falls
over less gracefully than the current implementation of MARC::Charset,
so beware on strange datasets.
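
For comparison, the hot path in the perl tools boils down to a
MARC::Charset call per record.  Roughly, as a sketch (the real scripts
work on parsed MARC fields rather than a raw byte stream):

  perl -MMARC::Charset=marc8_to_utf8 -e '
      local $/;                       # slurp all input
      print marc8_to_utf8(<STDIN>);   # MARC8 bytes in, UTF-8 out
  ' < marc8.bin > utf8.txt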

--miker

> Jason
>
> Lakehead University
>
>


-- 
Mike Rylander

