Lessons Learned from Amazon’s Cloud Outage
More than five days after its outage began, Amazon Web Services has finally restored virtually all of its services, though it is still mopping up a small number of customer accounts with “stuck” data in its Elastic Block Store (EBS) service. “EBS is now operating normally for all APIs and recovered EBS volumes,” Amazon reports on its status dashboard. “The vast majority of affected volumes have now been recovered. We’re in the process of contacting a limited number of customers who have EBS volumes that have not yet recovered and will continue to work hard on restoring these remaining volumes.” The company promises a detailed incident report will follow.
What are the lessons and implications of the outage? Discussion continued over the weekend. Here’s a look at some notable links with analysis and commentary:
- How SmugMug survived the Amazonpocalypse – SmugMug’s Don MacAskill: “Despite using a lot of Amazon services, SmugMug didn’t go down because we spread across availability zones and designed for failure to begin with, among other things.” SmugMug also didn’t use Elastic Block Store. “We’ve never felt comfortable with the unpredictable performance and sketchy durability that EBS provides, so we’ve never taken the plunge,” MacAskill writes.
- Amazon’s Cloud Outage Catches Most Clients Off Guard – The recent Amazon cloud outage at its Northern Virginia data center from 5 am Thursday, April 21, 2011 to roughly 5 am Friday, April 22 has shaken the confidence of some executives in public cloud computing. Most notably, Foursquare, HootSuite, Reddit, and Quora publicly suffered visible performance issues. The industry’s past reassurances on uptime performance and massive redundancy capabilities, combined with massive corporate adoption, had everyone believing that public clouds were bulletproof.
- Seven lessons to learn from Amazon’s outage – What are the lessons to learn? Phil Wainewright at ZDNet urges close scrutiny of SLAs. “Since it has been the EBS and RDS services rather than EC2 itself that has failed (and all the failures have been restricted to Availability Zones within a single Region), the SLA has not been breached, legally speaking.”
- Amazon’s Trouble Raises Cloud Computing Doubts – The New York Times examines the potential impact on cloud adoption. “Industry analysts said the troubles would prompt many companies to reconsider relying on remote computers beyond their control.”
- The AWS Outage: The Cloud’s Shining Moment – At O’Reilly, George Reese makes the claim that the cloud is better than ever, but its users are not: “In short, if your systems failed in the Amazon cloud this week, it wasn’t Amazon’s fault. You either deemed an outage of this nature an acceptable risk or you failed to design for Amazon’s cloud computing model.”
- Magical Block Store: When Abstractions Fail Us – Joyent takes a closer look at EBS: “I certainly don’t claim to know how EBS works, but of course people go to bars and have beers and talk. It’s commonly believed that EBS is built on DRBD with a dose of S3-derived replication logic. … Maybe (Amazon) did what a dozen billion dollar companies before them tried to do and never pulled off. Or maybe EBS is indeed bandaids and chicken wire. I have no idea. Which is a problem, as a user of EBS.”
- AWS Developer Forums: Life of our patients is at stake – Not all apps are appropriate for cloud computing. Case in point: an Amazon user who was apparently using EC2 to run a service monitoring cardiac patients.
- Bye, Bye, My Clustered AMIs – Christofer Hoff memorializes the outage with updated lyrics for Don McLean’s classic “American Pie.”