[OPEN-ILS-DEV] caching and High Availability

Mike Rylander mrylander at gmail.com
Fri Jun 1 14:24:22 EDT 2007


On 6/1/07, dkyle <dkyle at grpl.org> wrote:
> a short while ago, in a not too distant thread....
>
> miker wrote this:
> Once in a while a bug would pop up that would take a single machine
> down.  That's no problem at all for the application proper; OpenSRF
> just ignores the misbehaving machine and distributes the load to the
> rest (yay opensrf).  However, it was a problem for memcache.  1/20th
> of the session keys (and the search cache, and some other stuff) would
> go away
>
> I got off on a tangent looking at memory cache redundancy, but it
> suddenly occurred to me to ask:
>
> Shouldn't the application take into account cache failure and gracefully
> adjust?

Ideally, yes, but global cache coherency is the issue (and not one
easily solved without overhead that negates memcache's advantages),
not cache failure (as in total outage).

>
> Why are critical things like session keys not written to disk during

I wouldn't call session keys critical, but I understand your point.

> CachePut calls? and that disk store accessed if a Get call to memcache
> fails?
>

The Get /should/ fail to return data if the session is timed out, and
that's exactly what we want -- to not even see a timed-out session.

> Isn't that the typical way memcache is used?
>

For data that is already stored somewhere permanent, yes. Not for data
that is considered transient, which is all we store in memcache.
(Again, sessions are not critical in Evergreen.  It will simply ask
you to reauthenticate.)
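To illustrate the pattern (a minimal sketch with hypothetical names, not Evergreen's actual code): sessions live only in the cache, and a miss simply forces reauthentication -- there is no disk fallback to consult.

```python
import time

class SessionCache:
    """Transient session store; a lost entry is an inconvenience, not an error."""

    def __init__(self, ttl=3600):
        self.ttl = ttl
        self._store = {}  # stands in for a memcache client

    def put(self, key, user):
        self._store[key] = (user, time.time() + self.ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None  # miss: evicted, expired, or the server holding it died
        user, expires = entry
        if time.time() > expires:
            del self._store[key]
            return None  # timed-out session: a miss here is exactly what we want
        return user

def handle_request(cache, session_key):
    user = cache.get(session_key)
    if user is None:
        # No disk store to fall back on: the worst case is re-typing
        # a password, never data loss.
        return "401 reauthenticate"
    return "200 ok for %s" % user
```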

> Although memcache distributes objects across machines/instances, any
> particular object is cached on only one machine, so there is always a
> possible single point of failure despite setting up a HA Cluster.
>
>
> ... or am I missing something?

You're not missing the details of the implementation, but I think we
disagree on the importance of the data being stored in memcache.  With
the current setup as I described it, the realistic worst-case scenario
is the complete failure of one of the memcache servers (due to, say, a
power supply failure), in which case 1/4 of the users will be asked to
reauthenticate (that doesn't mean log out and back in, just type in
the user name and password again).  This seems to us to be a good
balance between performance and recoverability.

The previous worst-case was much more insidious and dangerous.
Imagine a cluster of 20 machines, all sharing up to 512M of memcache
data.  Now, these machines are also doing other things, one of which
is running Apache.

For some reason one of the Apache processes goes nuts, eating all
available CPU and memory.  This starves the memcache process, which
can't answer queries.  Then, just as quickly as it started, Apache
stops being dumb and settles down.

Because of the wide distribution of machines, and thus keys, only a
few requests actually come to that machine.  This means that only a
few of the other 19 machines think that this one server is dead as far
as memcache is concerned.  These few will start using an alternate
hash to pick another host for keys that would have gone to the
partially dead server, but everyone else just trucks along with the
old server now answering.

So, now you have (at least) 2 servers that respond to requests for a
particular set of keys, and no authoritative source for said keys.
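A sketch of that failure mode (hypothetical names; classic modulo-hash server selection with a rehash-on-dead-server fallback, as older memcache clients did): a client that observed the stall marks the server dead and rehashes, while a client that never noticed keeps using the original server, so two servers end up answering for the same key.

```python
import hashlib

SERVERS = ["mc0", "mc1", "mc2", "mc3"]

def pick_server(key, dead=frozenset()):
    # Primary choice: hash the key, take it modulo the server count.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    server = SERVERS[h % len(SERVERS)]
    attempt = 0
    while server in dead:
        # Alternate hash: salt the key and retry until a live server is found.
        attempt += 1
        salted = "%d%s" % (attempt, key)
        h = int(hashlib.md5(salted.encode()).hexdigest(), 16)
        server = SERVERS[h % len(SERVERS)]
    return server

key = "session:12345"
primary = pick_server(key)                    # used by clients that saw no outage
fallback = pick_server(key, dead={primary})   # used by clients that marked it dead
# primary != fallback: two servers now hold values for the same key,
# and neither is authoritative.
```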

As you can see, that makes storing even transient data dangerous, and
the problem can't be solved in the face of (inevitable, external, and
unpredictable) partial outages or (in the case of memcache) even
temporary slowdowns.

LiveJournal learned this lesson too, and they have also moved to
dedicated small(ish) clusters of memcache servers ... so there's
precedent for the current configuration and for very similar reasons.

Anyway, I think it all comes down to what one considers critical.
What balance do we strike?

Is 100% service uptime with minimal end user visibility an acceptable
balance?  Based on previous experience of multiple-day-long outages, I
would consider having to reauthenticate (and never losing data or
having to reopen the app!) a pretty big step forward.

As a datapoint, and for whatever it's worth, we haven't seen a
memcache failure since we made the move to a dedicated small cluster,
and we see on the order of 1.5M session-oriented transactions each day
in PINES.

Does that help answer your questions?

-- 
Mike Rylander

