Sunday, June 1, 2008

Failure as a Service - Cloud Redundancy

This weekend we had one of those "I should have known better" moments.
For the last few years we've hosted our primary and secondary DNS
servers at The Planet. Around 5pm on Saturday our data center
literally blew up. Even though most of our application servers are
hosted at Amazon EC2 or in house, this one relatively minor point of
failure managed to take down our entire IT infrastructure. We had
assumed that the chances of both DNS servers going offline at the
same time were slim and, up until today, that assumption had held.

This disaster is especially difficult for me since I spend my days
pitching the merits of geographically redundant cloud computing, which
I call "failure as a service". The concept goes like this: if you assume
you may lose any of your servers at any point in time, you'll design a
more fault-tolerant environment. For us that means making sure our
application components are always replicated on more than one machine,
preferably geographically dispersed. This way we can lose groups of
VMs, physical machines, data centers, or whole geographic regions
without taking down the overall cloud. In a lot of ways this approach
is similar to the architecture of a P2P network or even a modern
botnet, both of which rely heavily on decentralized command and control.
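
To make the "assume you may lose anything" idea concrete, here is a minimal Python sketch that asks, for each location, which components would disappear entirely if that location went dark. The inventory, host names, and location labels are hypothetical placeholders for illustration, not our actual configuration.

```python
def components_lost_with(inventory, failed_location):
    """Return the components whose every replica sits in failed_location."""
    lost = []
    for component, replicas in inventory.items():
        surviving = [host for host, location in replicas
                     if location != failed_location]
        if not surviving:
            lost.append(component)
    return sorted(lost)


def single_points_of_failure(inventory):
    """Map each location to the components that would vanish along with it."""
    locations = {location
                 for replicas in inventory.values()
                 for _, location in replicas}
    report = {}
    for location in sorted(locations):
        lost = components_lost_with(inventory, location)
        if lost:
            report[location] = lost
    return report


# Hypothetical inventory: component -> list of (host, location) replicas.
inventory = {
    "web": [("web-1", "ec2"), ("web-2", "in-house")],
    "database": [("db-1", "ec2"), ("db-2", "in-house")],
    "dns": [("ns1", "datacenter-a"), ("ns2", "datacenter-a")],
}

report = single_points_of_failure(inventory)
```

With this made-up inventory the report contains a single entry: losing "datacenter-a" takes out "dns", while every other component survives the loss of any one location.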

As an early user of Amazon EC2 we quickly learned about failure. We
would routinely lose EC2 instances, and it became almost second nature
to design for this type of transient operating environment. To make
matters worse, for a long time EC2 had no persistent storage available:
if you lost an instance, the data was also lost. So we created our
own Amazon S3 based disaster recovery system, which we called ElasticDrive.

ElasticDrive allows us to mount Amazon S3 as a logical block device,
which looks and acts like a local storage system. This enables us to
always have a "worst case scenario" remote backup for exactly this
type of event, and luckily for us we lost no data because of it. What we
did lose was time: our time on a Sunday afternoon, fixing something
that shouldn't have even been an issue.

Our application servers, databases and content had been designed to be
distributed, but our key point of failure was in our use of a single
data center to host both of our name servers. When the entire data
center went offline, so did our DNS servers and so did our 200+
domains. If we had made one small but critical change (adding a
redundant remote name server), our entire IT infrastructure would have
continued to work uninterrupted. Instead, when I awoke Sunday morning,
to my surprise everything from email to our web sites to even our
network monitoring system had stopped working.
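
A check along these lines might have flagged the problem ahead of time. The Python sketch below groups name server addresses by network prefix and reports whether they all fall into the same one. Sharing a /24 is only a rough proxy for sharing a data center, and the host names and addresses are hypothetical values taken from the reserved documentation ranges.

```python
import ipaddress


def group_by_network(nameservers, prefix_len=24):
    """Group name server host names by the network their address falls in."""
    groups = {}
    for name, ip in nameservers.items():
        # strict=False lets us pass a host address and get its enclosing network.
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        groups.setdefault(str(network), []).append(name)
    return groups


def looks_redundant(nameservers, prefix_len=24):
    """True when the name servers are spread over more than one network."""
    return len(group_by_network(nameservers, prefix_len)) > 1


# Hypothetical setup: both name servers on the same network.
before = {
    "ns1.example.com": "192.0.2.10",
    "ns2.example.com": "192.0.2.11",
}

# The same setup with one remote name server added elsewhere.
after = dict(before)
after["ns3.example.net"] = "198.51.100.5"
```

Here `looks_redundant(before)` is false because both servers sit in 192.0.2.0/24, while `looks_redundant(after)` is true once the remote server is added. A real audit would also need to consider upstream providers and physical location, which an address prefix cannot reveal.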

I should also note that Amazon has recently worked to overcome some of
the early limitations of EC2 with the inclusion of persistent storage
options as well as something they call Amazon EC2 Availability Zones.
They describe availability zones as: "The ability to place instances
in multiple locations. Amazon EC2 locations are composed of regions
and availability zones. Regions are geographically dispersed and will
be in separate geographic areas or countries. Currently, Amazon EC2
exposes only a single region. Availability zones are distinct
locations that are engineered to be insulated from failures in other
availability zones and provide inexpensive, low latency network
connectivity to other availability zones in the same region. Regions
consist of one or more availability zones. By launching instances in
separate availability zones, you can protect your applications from
failure of a single location."
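
As a rough illustration of what launching instances in separate availability zones buys you, the following Python sketch spreads instances round-robin over a list of zones and then lists what survives when one zone fails. It does not call any Amazon API; the instance and zone names are hypothetical.

```python
from itertools import cycle


def spread_across_zones(instance_names, zones):
    """Assign instances to zones in round-robin order."""
    if not zones:
        raise ValueError("at least one zone is required")
    placement = {}
    for name, zone in zip(instance_names, cycle(zones)):
        placement[name] = zone
    return placement


def survivors(placement, failed_zone):
    """Return the instances that are not in the failed zone."""
    return sorted(name for name, zone in placement.items()
                  if zone != failed_zone)


# Hypothetical instance and zone names.
placement = spread_across_zones(
    ["app-1", "app-2", "app-3", "app-4"],
    ["zone-a", "zone-b"],
)
remaining = survivors(placement, "zone-a")
```

With two zones, losing "zone-a" leaves "app-2" and "app-4" running. With a single zone, the same failure would leave nothing.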

Well Amazon, if you were looking for a "Use Case", look no further,
'cause I'm your guy.

I've learned a valuable, if painful, lesson. No matter how much
planning you do, nothing beats a geographically redundant
configuration.

(Original Post:
http://elasticvapor.com/2008/06/failure-as-service-cloud-redundancy.html)
----
If anyone is interested in learning more about the issues at The
Planet (9,000 servers offline):
http://tech.slashdot.org/article.pl?sid=08/06/01/1715247

Or EC2 Availability Zones

http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1347

--

Reuven Cohen
Founder & Chief Technologist, Enomaly Inc.
www.enomaly.com :: 416 848 6036 x 1
skype: ruv.net // aol: ruv6

blog > www.elasticvapor.com
-
Get Linked in> http://linkedin.com/pub/0/b72/7b4
