[OPEN-ILS-DEV] PATCH: osrf_json_object.c (tweaks)

Scott McKellar mck9 at swbell.net
Sun Dec 9 13:57:40 EST 2007


I've never used a SAX-style parser, but as I understand it, the idea
is to parse the data stream incrementally, and respond to syntactic
features as you encounter them.  The alternative DOM-style approach
is to load the whole thing into memory at once in some kind of data
structure, such as a jsonObject.

The SAX approach is useful, if not essential, for large data files,
where there is no need to hold the whole thing in memory at once,
and you really don't want to.  Generally the file structure is
repetitive, with the same kind of sub-object appearing repeatedly
with variations in the content, or a small number of different types
of sub-objects.

For example, you might want to read a list of books, and extract
just the ones published by Houghton-Mifflin.  By the time you get
to book number 6734, there's no need to retain any details about the
first 6733.

DOM is more appropriate for smallish amounts of data like 
configuration files.  Typically the structure is not very repetitive,
though there may be islands of repetition here and there internally.

What we have in OSRF is a bit of a hybrid.  I would call it a 
customizable DOM-style parser.  The client code can change the way the
parser responds to various events by installing different function
pointers.  We don't really use that capability yet, but we could
use it to build different flavors of jsonObjects according to need,
as you have suggested.  That would reduce the churn of creating and 
destroying jsonObjects.

However it's still closer to a DOM approach, because once we call
jsonParseString() (or any of its siblings), the parser parses the
whole thing from beginning to end.

Consider the Houghton-Mifflin extractor that I described above.  The
most natural way to implement it within the current framework would
be to parse the whole file into a single massive jsonObject, and then 
traverse the jsonObject looking for Houghton-Mifflin.  The memory
requirements would be proportional to the size of the file.

If we installed sufficiently clever function pointers, we could
probably build and destroy book objects as we go, but it wouldn't
be simple.

Now imagine that the requirement is a bit more complicated (but still
not really exotic).  You have two lists of books, sorted by some 
common key, and you need to identify the books that are present in 
both lists.

With the present framework I don't see any way to avoid loading at
least one of the files, if not both, into a massive jsonObject
before you can even start comparing them.

I haven't spent much time looking at the ILS-specific code, and I 
don't have a good sense of what jsonObjects are actually used for,
other than configuration files.  Both of my examples would no doubt
be implemented in the database with SQL.  They're just thought 
experiments, devised to illustrate a point.

I doubt that you use JSON much for data files.  Since the parser
expects to see the input as a null-terminated string, you would have
to load the file into memory before calling the parser, and that
would use a lot of memory in itself.  Or you could use memory-mapped
IO, but I don't see any calls to mmap().

What I'm leading up to is the following suggestion.

Some kinds of needs might be well met by something like a
jsonParseNextSubobject function, and maybe a jsonParseFirstSubobject.
Given a pointer to the middle of a buffer somewhere, it would parse
until it reached the end of the next subobject and then return
a pointer to a jsonObject representing it.

In the example of the Houghton-Mifflin extractor, it would extract
one book's worth of jsonObject at a time, including whatever
lower-level objects were contained therein.  Once you were done with
a given book you could free it and grab the next one.

However I have no idea whether such a thing would be useful in the
context of Evergreen -- nor can I claim to know much about what I'm
talking about.

Scott McKellar
http://home.swbell.net/mck9/ct/


--- Mike Rylander <mrylander at gmail.com> wrote:

> On Dec 8, 2007 1:52 PM, Scott McKellar <mck9 at swbell.net> wrote:
> > These patches tidy up a few things and introduce a modest
> > performance boost.  Summary:
> 
> Applied.
> 
> As an aside, I think Bill's design inside the parser was to model a
> SAX parser, which would allow for construction of differently shaped
> objects given different handlers.  In particular, we currently do the
> standard JSON->jsonObject dance and then go back and build a classed,
> semantically equivalent (though structurally different) object in
> parallel.  One day, if it's warranted (and I think it is), we may
> have a set of handlers that build the semantically correct objects
> the first time around.
> 
> Bill, can you confirm that thinking?
> 
> -- 
> Mike Rylander
>  | VP, Research and Design
>  | Equinox Software, Inc. / The Evergreen Experts
>  | phone:  1-877-OPEN-ILS (673-6457)
>  | email:  miker at esilibrary.com
>  | web:  http://www.esilibrary.com
> 
