[OPEN-ILS-DEV] Fun with Marc load process...

Don Hamilton dhamilton at wlu.ca
Fri Mar 13 13:02:51 EDT 2009


Hi all.

After a year's hiatus, I am back loading data into Evergreen. It goes much more smoothly than last time. Thanks for the improvements!

I do have a few comments.

1) The instructions at http://open-ils.org/dokuwiki/doku.php?id=evergreen-admin:importing:bibrecords, in the introductory section, say "If you take this approach, due to a current limitation of MARC::File::XML you have to do a horrible thing and ensure that there are no namespace prefixes in front of the element names. marc2bre.pl cannot parse the following example"... In fact, yaz-marcdump does not seem to insert namespace prefixes, so if you use it for the conversion, that issue does not come up.

2) Point one notwithstanding, marc2bre stops dead on MARCXML records with 'bad' characters. I ended up biting the 'time' bullet and running marc2bre on 'real' MARC records, because that path doesn't choke on the odd hex character. That said, the marc2bre run dropped from an average of 305 records per second with XML input to 175 per second with UTF-8 MARC, so it wasn't a terrible speed hit.

3) After using the real MARC to get a bre file, I had issues with pg_ingest choking on what were probably the same bad characters in the bre file... but it, at least, just noisily dropped the bad records. If I were to take:

sub rawJSON2perl {
    my $class = shift;
    my $json = shift;
    return undef unless defined $json and $json !~ /^\s*$/o;
    return $parser->decode($json);
}

from JSON.pm and replace it with

sub rawJSON2perl {
    my $class = shift;
    my $json = shift;
    return " " unless defined $json and $json !~ /^\s*$/o;
    return $parser->decode($json);
}

would that keep the records and just 'blank' the otherwise bad characters? (There seem to be more than a few hex 06 and hex 01 bytes, and some others, in my data... what can I say?)
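
As far as I can tell, the change above only alters what comes back for empty or undefined input; a record with a control character in it would still go through $parser->decode() untouched. So here is a rough sketch of the other way to do it: blank the raw control characters before decoding. This is only a sketch under assumptions. I am assuming the offending bytes are raw C0 control characters (like my hex 06 and hex 01) sitting inside the JSON text, blank_control_chars and rawJSON2perl_lenient are names I made up, and JSON::PP is standing in for whatever $parser really is in JSON.pm.

```perl
use strict;
use warnings;
use JSON::PP;    # core module; a stand-in for the real $parser in JSON.pm

my $parser = JSON::PP->new;

# Hypothetical helper: replace every raw C0 control character
# (hex 00 through hex 1F) with a single space.
sub blank_control_chars {
    my $json = shift;
    $json =~ s/[\x00-\x1F]/ /g;
    return $json;
}

# Hypothetical variant of rawJSON2perl: same empty-input check as the
# original, but the text is cleaned before it reaches the decoder.
sub rawJSON2perl_lenient {
    my $json = shift;
    return undef unless defined $json and $json !~ /^\s*$/;
    return $parser->decode( blank_control_chars($json) );
}

# Made-up sample record with a hex 06 and a hex 01 inside a string value.
my $bad = qq({"title":"Fun\x06 with\x01 MARC"});
my $rec = rawJSON2perl_lenient($bad);
print $rec->{title}, "\n";
```

This would not touch control characters that are already written as \u0006-style escapes in the JSON, and it also turns tabs and newlines into spaces, which may or may not be what one wants.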


Other than those points, I am presently flogging my poor old 1 GB Ubuntu system by running pg_loader on a 6-million-line ingest file... and it may be slow, but it sure isn't fast, either!

don (waiting for loader to finish).

ps. If you could remind me what my id/password for the wiki might be, I would add these notes to the loader page.

pps. As I re-read this, it occurs to me I may have had too much green tea today.... but I'm hitting send, anyway

