A brief conceptual framework for high availability planning
Posted on March 23, 2014
Summary: I've put together a brief conceptual framework for high availability planning. That is, what to do when you get the request: ‘We want no downtime.’
In Greek mythology, Panacea was a goddess of universal remedy. Panacea was said to have a potion with which she healed the sick. This brought about the concept of the panacea in medicine, a substance meant to cure all diseases. The term is also used figuratively for something intended to completely solve a large, multi-faceted problem. Unfortunately, when it comes to business computing applications and someone says "We want no downtime", there is simply no panacea.
About the only thing you can be sure of when doing high availability planning is that there are a lot of tools to consider, a lot of decisions to make, and a lot of work to do. This is why a good conceptual framework is important: it makes sure the right things are considered and the appropriate decisions are made.
In this post I've attempted to outline the framework I use. The aim is to help - in any given situation and over the life of a business application and the business itself - figure out which approaches make sense and what path to take to get there. We'll get to that framework in a moment, but let's make sure we're clear on one other important thing first.
There's being proactive and then there's luck ...and then there's being smart.
The "no downtime" request is perhaps somewhat akin to a patient telling a doctor "I want to be healthy" (which, I suppose, is typically driven by the desire to minimize downtime of a different sort). You can't literally be healthy any more than you can literally design for zero downtime. You can only control the inputs, manage which knobs you turn to minimize the likelihood of being unhealthy (or having an outage), and plan so that you are prepared for (or have options for, or at least are willing to accept) the inevitable things you can't prevent with 100% certainty. And you have to make some decisions along the way as to how much you're willing to invest: time, energy, money, distraction.
Just as it's possible to eat cheeseburgers your entire life and still live to see your 90s, it is possible to have no downtime with your web application without even making any investments in eliminating single points of failure. Single physical server hosting your web/app and database? No data backups? Experienced no downtime or data loss? Congratulations. Sometimes you just luck out. At the same time, that doesn't make it a good strategy.
Minimize downtime by managing it
Having a business goal of managing downtime is perfectly reasonable, but as with most things involving technology, the requirement must be broken down and analyzed in a practical way before any action can be taken on it. The following conceptual framework is the closest thing I know of to a universal way to break down "We want no downtime" into something meaningful and useful, so that engineering and investment decisions can be made around it.
How to think about "managing downtime"
There are numerous facets to managing downtime - preventing it, minimizing its negative impact, handling it gracefully when it does occur, and having options for handling those really bad situations no one anticipated too. So let's get to breaking these facets down with specificity:
1. Minimize downtime
...for all reasonable events
2. Speed up recovery time
...for all events that are unreasonable to protect against
3. Handle outages as gracefully as possible
...don't leave users hanging (blind) even when the app becomes unavailable (e.g. continue to provide reduced functionality if possible or, when not possible, then provide a friendly outage message)
...provide options for response (see next item)
4. Have enough depth in the architecture so that there are multiple options when the unforeseen occurs
...have data stored in multiple repositories that are as independent as possible
...have various data rollback points
...understand the architecture/platform and individual elements well enough that these options can be used if need be
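Item 3 above (handle outages gracefully) can be illustrated with a minimal sketch. It assumes a feature is exposed as a plain callable; `respond`, `live_search`, and `cached_search` are hypothetical names invented for this example, not part of any real framework. The idea is simply to try the full feature first, fall back to reduced functionality if one exists, and otherwise show a friendly outage message rather than leaving the user blind.

```python
def respond(full_feature, reduced_feature=None,
            outage_message="Sorry, this service is temporarily unavailable."):
    """Try the full feature, then a reduced one, then a friendly message.

    All names here are hypothetical; a real application would also log
    the failure and catch narrower exception types.
    """
    for handler in (full_feature, reduced_feature):
        if handler is None:
            continue  # no reduced-functionality option was provided
        try:
            return handler()
        except Exception:
            continue  # this level is unavailable; try the next one
    return outage_message


def live_search():
    # Stand-in for a backend that is currently down.
    raise ConnectionError("search backend unreachable")


def cached_search():
    # Stand-in for reduced functionality that still works.
    return "results from cache (may be stale)"
```

With these stand-ins, `respond(live_search, cached_search)` returns the cached results, and `respond(live_search)` returns the outage message. Catching every exception is a simplification made for brevity in this sketch.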
I threw out two seemingly straightforward terms above - reasonable and unreasonable - that can have very different definitions to different stakeholders (and at different points in time over the life of an application and organization). Defining these is paramount to getting this stuff right. I'd even go as far as to state that defining these well is the crux of getting high availability investments aligned with the business requirements.
What are "reasonable" events?
The definition of reasonable events:
- What we can anticipate; or
- What we can afford
What are "unreasonable" events?
The definition of unreasonable events:
- What we can't anticipate; or
- What we can't afford to prevent
The "or" between each of the above bullet points is important. We can't always afford all the things we need or know we want. Thinking about "what-ifs" in the above context provides a conceptual framework which technologists and business sponsors can use to make informed decisions about how to proceed.
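One possible reading of the two definitions, sketched in code: an event is treated as unreasonable if we can't anticipate it or can't afford to prevent it, which leaves as reasonable the events we can both anticipate and afford. This is an interpretation, not a rule stated in the post. The `Event` fields, the cost units, and the single per-event `budget` threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    name: str
    anticipated: bool        # can we foresee this event at all?
    prevention_cost: float   # hypothetical cost units to protect against it


def plan(events, budget):
    """Split events into two lists of names.

    prevent: "reasonable" events (goal 1, minimize downtime)
    recover: "unreasonable" events (goal 2, speed up recovery)

    Assumption: an event is affordable when its prevention cost does not
    exceed the per-event budget. Real budgeting would be cumulative.
    """
    prevent, recover = [], []
    for event in events:
        affordable = event.prevention_cost <= budget
        if event.anticipated and affordable:
            prevent.append(event.name)
        else:
            recover.append(event.name)
    return prevent, recover


events = [
    Event("single disk failure", anticipated=True, prevention_cost=2),
    Event("region-wide outage", anticipated=True, prevention_cost=50),
    Event("never-seen-before bug", anticipated=False, prevention_cost=0),
]
prevent, recover = plan(events, budget=10)
```

In this example only the disk failure ends up in `prevent`. The region-wide outage is anticipated but too expensive, and the unknown bug cannot be anticipated, so both land in `recover`. Raising the budget over time moves events from one list to the other, which matches the point about improving as resources increase.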
Once the above are defined, the particular situations / events that apply to a given business application can be discussed with clarity and decisions can be made about them.
The decisions made in the above categories drive the architecture and overall investment.
In other words
All of the above put another way:
- We want to be able to sleep better at night
- We want to prevent what we can
- We want to manage what we can't prevent as best as possible
- We want to have options when the shit really hits the fan
- We want to invest wisely
- We want to be able to improve as tools mature, lessons are learned, business requirements change, and our resources increase
As the saying goes, it's not a question of if, only when.
That doesn't mean we have to blindly spend money on every conceivable scenario. Nor does it mean we even can spend money on every possible scenario (i.e. unlimited resources is not a panacea, sorry). We can, however, get better as our business maturity demands it and as our resources permit it.
While every business and every situation is different, the analytical framework within which to make these decisions is simple enough. Every conceivable scenario can be incorporated into the framework above. Combined with a strong understanding of the capabilities of the infrastructure and people, and the resources available for investment, the development of an architecture to support your business application's high availability requirements is completely doable.
Or you can just wing it on a single server and pray to your favorite Greek goddess.
alas, does not exist as far as I know
Paraphrased from Wikipedia: http://en.wikipedia.org/wiki/Panacea
regardless of your spiritual beliefs and, incidentally, regardless of whether you outsource this problem or handle it all in-house
cloud offerings have increased the tools available and decreased the barriers to their use, but each service provider’s elements still cannot simply be adopted blindly if one hopes to achieve their organization’s particular business goals
I think; I’m not perfect, but apparently I don’t mind making bold claims. Ha!
to be clear: there’s nothing wrong with starting with a single server. Everybody has to start somewhere. Do make sure you have reliable data backups though.