[OPEN-ILS-DEV] caching and High Availability

Mike Rylander mrylander at gmail.com
Fri Jun 1 15:51:09 EDT 2007


On 6/1/07, dkyle <dkyle at grpl.org> wrote:
> Mike,
>
> I was seeing errors preventing login, not just being asked to
> re-authenticate - thus the word critical, and this message.  So I'm not
> sure we disagree on anything, but apparently I still don't have my
> cluster working "normally" to begin with, and am perhaps drawing faulty
> conclusions.

Sure, but that's because you hadn't told all the servers about all the
memcache instances to begin with.  Now, this may very well be a
deficiency in the documentation, but when configured correctly and
following (as yet admittedly underdocumented) best practices, you
wouldn't have that situation.  The memcache service is required to be
running and all servers must know about all instances in use, but the
data stored therein is not critical.
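
For anyone following along at home, here's a rough sketch of that
requirement (plain Python with the python-memcached client; the
hostnames and the key are made up, and this is not OpenSRF code).  The
key-to-server mapping is computed on the client side, so every
application server has to be built against the identical server list:

    import memcache

    # Every app server must use this exact same list; the client hashes
    # each key against it to decide which memcached instance holds it.
    MEMCACHE_SERVERS = [
        "cache1.example.org:11211",
        "cache2.example.org:11211",
        "cache3.example.org:11211",
        "cache4.example.org:11211",
    ]

    mc = memcache.Client(MEMCACHE_SERVERS)

    # A session written by one app server ...
    mc.set("oils_auth_abc123", {"usr": 42}, time=3600)

    # ... is visible to any other app server, but only because both
    # clients hash the key against the same list.  A client missing one
    # entry would hash some keys to a different host and see spurious
    # misses -- which looks like users being unable to log in.
    print(mc.get("oils_auth_abc123"))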

>
> I was not quibbling with the dedicated cache servers idea, but thanks
> for the detailed description of how things could go wrong, that was
> helpful to my understanding of memcache, and, yeah, I also understand
> fine points like a Get should fail if a session is timed out .  I guess
> I should have said 'fails when it should not'.
>

I understand.  I wanted to make sure that the failure mode (which was
actually simulated as a permanent version of that worst possible case
directly in your configuration -- though by no fault of your own) was
understood by everyone on-list ... and responding with a full
explanation seemed like a good way to do that.
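
To make that worst case concrete, here's a toy illustration (plain
Python, invented server names and key, and a deliberately simplified
stand-in for the real client-side hashing -- not how OpenSRF or
memcached actually implement it) of what happens when two clients
disagree about which cache nodes are alive:

    import zlib

    def pick_server(key, live_servers):
        # Toy client-side hashing: map the key onto whichever servers
        # *this* client currently believes are alive.
        return live_servers[zlib.crc32(key.encode()) % len(live_servers)]

    key = "oils_auth_abc123"

    # Client A saw mc2 time out during the Apache memory spike and
    # dropped it from its view of the pool.
    client_a_view = ["mc1", "mc3", "mc4"]

    # Client B never hit mc2 during the spike, so it still sees all four.
    client_b_view = ["mc1", "mc2", "mc3", "mc4"]

    print("client A writes", key, "to", pick_server(key, client_a_view))
    print("client B reads ", key, "from", pick_server(key, client_b_view))

    # Whenever those two answers differ, two servers are answering for
    # the same key and there is no authoritative copy -- the situation
    # described in the quoted message below.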

> I see you're going on vacation - enjoy the beach!
>

Thanks!  I'll try not to think about this stuff for a bit. :)

--miker

> Doug.
>
>
>
> Mike Rylander wrote:
> > On 6/1/07, dkyle <dkyle at grpl.org> wrote:
> >> a short while ago, in a not too distant thread....
> >>
> >> miker wrote this:
> >> Once in a while a bug would pop up that would take a single machine
> >> down.  That's no problem at all for the application proper; OpenSRF
> >> just ignores the misbehaving machine and distributes the load to the
> >> rest (yay opensrf).  However, it was a problem for memcache.  1/20th
> >> of the session keys (and the search cache, and some other stuff) would
> >> go away
> >>
> >> I got off on a tangent looking at memory cache redundancy, but it
> >> suddenly occurred to me to ask:
> >>
> >> Shouldn't the application take into account cache failure and gracefully
> >> adjust?
> >
> > Ideally, yes, but global cache coherency is the issue (and not one
> > easily solved without overhead that negates memcache's advantages),
> > not cache failure (as in total outage).
> >
> >>
> >> Why are critical things like session keys not written to disk during
> >
> > I wouldn't call session keys critical, but I understand your point.
> >
> >> CachePut calls? and that disk store accessed if a Get call to memcache
> >> fails?
> >>
> >
> > The Get /should/ fail to return data if the session is timed out, and
> > that's exactly what we want -- to not even see a timed-out session.
> >
> >> Isn't that the typical way memcache is used?
> >>
> >
> > For data that is already stored somewhere permanent, yes. Not for data
> > that is considered transient, which is all we store in memcache.
> > (Again, sessions are not critical in Evergreen.  It will simply ask
> > you to reauthenticate.)
> >
> >> Although memcache distributes objects across machines/instances, any
> >> particular object is cached on only one machine, so there is always a
> >> possible single point of failure despite setting up a HA Cluster.
> >>
> >>
> >> ... or am I missing something?
> >
> > You're not missing the details of the implementation, but I think we
> > disagree on the importance of the data being stored in memcache.  With
> > the current setup as I described it, the current realistic worst-case
> > scenario is the complete failure of one of the memcache servers (due
> > to, say, power supply failure), in which case 1/4 of the users will be
> > asked to reauthenticate (that doesn't mean log out and back in, just
> > type in the user name and password).  This seems to us to be a good
> > balance between performance and recoverability.
> >
> > The previous worst-case was much more insidious and dangerous.
> > Imagine a cluster of 20 machines, all sharing up to 512M of memcache
> > data.  Now, these machines are also doing other things, one of which
> > is running Apache.
> >
> > For some reason one of the apache processes goes nuts, eating all cpu
> > and available memory.  This causes the memcache process to be starved
> > and it can't answer queries.  Then, just as quickly as it started,
> > Apache stops being dumb and settles down.
> >
> > Because of the wide distribution of machines, and thus keys, only a
> > few requests actually come to that machine.  This means that only a
> > few of the other 19 machines think that this one server is dead as far
> > as memcache is concerned.  These few will start using an alternate
> > hash to pick another host for keys that would have gone to the
> > partially dead server, but everyone else just trucks along with the
> > old server now answering.
> >
> > So, now you have (at least) 2 servers that respond to requests for a
> > particular set of keys, and no authoritative source for said keys.
> >
> > As you can see, that makes using even transient data dangerous, and
> > can't be solved in the face of (inevitable, external, and
> > unpredictable) partial outages or (in the case of memcache) even
> > temporary slowdowns.
> >
> > LiveJournal learned this lesson too, and they have also moved to
> > dedicated small(ish) clusters of memcache servers ... so there's
> > precedent for the current configuration and for very similar reasons.
> >
> > Anyway, I think it all comes down to what one considers critical.
> > What balance do we strike?
> >
> > Is 100% service uptime with minimal end user visibility an acceptable
> > balance?  Based on previous experience of multiple-day-long outages, I
> > would consider having to reauthenticate (and never losing data or
> > having to reopen the app!) a pretty big step forward.
> >
> > As a datapoint, and for whatever it's worth, we haven't seen a
> > memcache failure since we made the move to a dedicated small cluster,
> > and we see on the order of 1.5M session-oriented transactions each day
> > in PINES.
> >
> > Does that help answer your questions?
> >
>
>
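
One last footnote on the quoted points about TTLs and "typical"
memcache usage, since the distinction matters: with transient-only use
(what Evergreen does with sessions) the cache is the only store, and
the TTL doing its job simply looks like a miss; with the cache-aside
style you were describing, the database stays authoritative and a miss
or a dead cache node only costs speed.  A hedged sketch (Python,
python-memcached, invented names; load_from_database is a hypothetical
helper):

    import memcache

    mc = memcache.Client(["cache1.example.org:11211"])  # hypothetical host

    # Transient use: the cache *is* the store.  Once the TTL passes,
    # get() returns None, which is exactly what we want -- we never
    # want to see a timed-out session.
    mc.set("session:abc123", {"usr": 42}, time=1800)
    session = mc.get("session:abc123")  # None after 30 minutes

    # Cache-aside: the database remains authoritative, so a cache miss
    # only means a slower lookup, never lost data.
    def fetch_bib_record(record_id):
        key = "bib:%d" % record_id
        record = mc.get(key)
        if record is None:
            record = load_from_database(record_id)  # hypothetical DB call
            mc.set(key, record, time=600)
        return record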


-- 
Mike Rylander

