[OPEN-ILS-DEV] import data from voyager

Mike Rylander mrylander at gmail.com
Thu May 24 22:27:15 EDT 2007


On 5/24/07, Dan Scott <denials at gmail.com> wrote:
> On 24/05/07, Jason Zou <qzou at lakeheadu.ca> wrote:
> > Don Hamilton wrote:
> > > Congratulations, Jason... You beat me by a long shot. But now
> > > that you have (and now that Rene has our system starting with NO
> > > logged errors!), can you give us some details of what exactly you did?
> > > Once I've built on what you've done and loaded some records, I'll
> > > volunteer for a wiki account and add a 'voyager data load' section
> > > once we're done.
> > >
> > > Also, could you define "very slow"? As I mentioned in a query a week
> > > or so ago, I'd like to iterate through loading some 5 million bibs,
> > > and if 10,000 is very slow, I'm afraid to think what 5 million would
> > > be. I have some experience with bulk loading into fresh databases,
> > > and will take a stab at that if the 'regular' load is too slow.
> > >
> > > don
> > >
> > > >>> qzou at lakeheadu.ca 5/24/2007 10:44 AM >>>
> > > Hi everyone,
> > >
> > > Using the scripts in Open-ILS/src/extra/import/, I imported about
> > > 11,000 MARC records. Although the process was very slow, it seems to
> > > be working, and records have been added to the following tables:
> > >           biblio.record_entry
> > >           metabib.rec_descriptor
> > >           metabib.full_rec
> > >           metabib.title_field_entry
> > >           metabib.author_field_entry
> > >           metabib.subject_field_entry
> > >           metabib.keyword_field_entry
> > >           metabib.series_field_entry
> > >
> > > But when I tried to use the OPAC to find some records, I always got
> > > nothing. I am wondering whether there is something that I missed.
> > >
> > > Any suggestions are highly appreciated.
> > >
> > > Jason
> > >
> > > Lakehead University
> > >
> > Hi Don,
> >
> > Thanks. I just wanted to see how fast the import would be. I exported
> > 12,343 records from our Voyager database. It took about 1.5 hours to
> > load them into Evergreen.
> > Compared with the conversion process, storing the converted records
> > into the Pg database surprisingly took only a minute or less.
> >
> > By the way, my server is a P4 2.4 GHz machine with 1.5 GB of RAM running FC5.
> >
> > Jason
> >
> > Lakehead University
> >
> >
>
> Strange. Here are the `time` results of running an import of the 14,449
> Gutenberg records (including all of the steps described in my previous
> email in this thread) on the Gentoo VMWare image with 512MB of RAM:
>
> real    51m7.623s
> user    39m24.824s
> sys     2m58.639s
>
> I'm not sure how my virtual machine with 1/3 of your physical
> machine's RAM could possibly outperform your physical machine.
> Something seems weird there. But I concur that the bulk of the time is
> spent in the marc2bre.pl / direct_ingest.pl / pg_loader.pl processes.

And I can further specify that the vast bulk of both times is spent
inside direct_ingest.pl.  That is currently the choke point.  I had a
multi-process version at one point, but the IPC required to keep the
processes in sync made it a maintenance nightmare ...
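
If you want to verify that on your own data, timing each stage
separately with intermediate files will show it.  A rough sketch only
-- the file names here are placeholders, and the exact options each
script takes vary, so check their usage output:

  # convert MARC to biblio.record_entry objects
  perl marc2bre.pl records.mrc > records.bre

  # this is the stage that dominates the wall-clock time
  time perl direct_ingest.pl < records.bre > records.ingest

  # generating the SQL is comparatively quick
  time perl pg_loader.pl < records.ingest > load.sql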

One point to note, though, is that on a multi-core box (which a
production server would certainly be, these days) you can effectively
parallelize all three steps by piping the output of marc2bre.pl into
direct_ingest.pl, and that in turn into pg_loader.pl.  The total time
should then be no more than the current direct_ingest.pl time, plus
the time pg_loader.pl needs at the end to sort the data and write out
the SQL file.
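
Something along these lines, for example.  Again a sketch only: the
file names are placeholders, database connection options are omitted,
and the scripts' flags vary by version:

  cd Open-ILS/src/extra/import

  # each stage is a separate process, so on a multi-core box the
  # stages run concurrently while records stream through the pipes
  perl marc2bre.pl records.mrc \
    | perl direct_ingest.pl \
    | perl pg_loader.pl > load.sql

  # then hand the generated SQL to Postgres
  psql -f load.sql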

I also have plans (which I'll be pushing out here) to move the
majority of the ingest functions into the database, via stored procs.
It's really where the logic belongs, and it should speed up data
loading considerably.

--miker

>
> --
> Dan Scott
> Laurentian University
>


-- 
Mike Rylander

