[OPEN-ILS-DEV] Ruminations on osrfHash

Sat Mar 22 15:26:39 EDT 2008

I've been looking at osrfHash lately -- something I had been putting
off, because whenever I looked too closely at it, I got confused.
I think I grasp it now.

An osrfHash has two levels of osrfList.  The outer level is a hash
table.  Each slot in the hash table, if not empty, contains an 
inner osrfList, in which each entry is an osrfHashNode that
represents one of the keys hashing to this slot.  The osrfHashNode 
also contains a pointer to the data associated with the key.

So far, this is pretty straightforward stuff.  What had confused me
was the osrfStringArray, which contains its own copies of all the 
keys.

Now I see the reason for it.  The double layer of osrfLists supports
random access by key, as is typical for a hash table.  The
osrfStringArray supports traversal of the keys in the same sequence
in which they were added.  This ability is particularly important 
in the context of a jsonObject, where the sequence of entries in a
series of name/value pairs may be significant.

Ordinary hash tables don't support the latter requirement, unless
you add something extra.  The osrfStringArray is something extra.

However this capability comes at a cost.  We have to allocate and
free an osrfStringArray.  We also allocate and free an extra copy 
of every key.

Traversal by an osrfHashIterator is clunky as well.  We don't store
a pointer to an entry in the hash table, we store an index into
the osrfStringArray.  To get the corresponding data we have to fetch
the key, hash it, and then search the corresponding slot in the hash 
table.  

I suspect that the idea behind this is that we want the iterator 
to do something sensible even if the item has been deleted (though 
it's not clear to me whether this situation occurs in practice).

-------------

I have come up with what I think is a more lightweight design by
eliminating the osrfStringArray.

To each osrfHashNode, add two pointers to osrfHashNode, a forward
and a backward pointer.  Then each osrfHashNode can reside not only
in a hash table but also in a doubly linked list.  New entries are
added to the end of the list.  

An osrfHashIterator would store a pointer the current osrfHashNode.
For a sequential traversal we traverse the linked list instead of
the osrfStringArray, without ever having to calculate a hash.

With one version of this design, the iterator becomes invalid if the
node to which it points is deleted.  That's not necessarily a bad
thing,  That's how iterators normally behave in C++, and no one 
seems to mind.  You just have to avoid dereferencing an iterator 
that has become invalid.

Another approach is to leave the osrfHashNode physically in place 
when you delete an entry.  Delete it logically by freeing the key 
and the data, and setting their pointers to NULL.  Subsequent 
searches can skip over a logically deleted entry.  But because the 
linkage pointers are intact, an iterator pointing to a dead node
can still find the next (or previous) entry in the list.

The nodes adjacent to the dead node in the linked list can be
pointed to each other, bypassing the next node.  Subsequent 
traversals of the linked list will therefore skip over dead nodes 
automatically.

----------------

Since we use osrfLists all over the place, replacing it with a new 
design will not be simple.

The first step is to provide enough functions that the application
code can use an osrfHash or an osrfHashIterator without knowing
anything about the internal structure.  For example, instead of
referencing osfrHashIterator.current directly, you could call
osrfHashIteratorCurrent().

The second step is to make sure that these functions are used.
Nothing outside of osrf_hash.c should access the internal members
directly.

(The extra layer of function call will add a bit of extra overhead
at first.  We will recoup that overhead later.)

The third step is to move the declarations of osrfHash and
osrfHashIterator out of osrf_hash.h and into osrf_hash.c.  The
header would just declare them as incomplete types.  The application
code could still declare pointers to them, but the internals would
be invisible to it.

The final step would be to replace osrf_hash.c with a new version.
The rest of the code would continue to work as before, but more
efficiently.  If the new version doesn't work for some reason, 
revert to the old version until the new version is fixed.

This line of attack is low-risk, because it proceeds in incremental
steps, and every step is reversible.

-------------

Any interest?

Scott McKellar
http://home.swbell.net/mck9/ct/