AWS EBS Outage at US-EAST-1

I am the CTO of a startup using AWS, and this morning we suffered an EBS outage in the US-EAST-1 region.

About 2 weeks ago we moved one of our services to AWS following an outage on a dedicated server. At the time we had no hot backup and were down for many hours, even though our (well-known) provider had told us by email just a few weeks earlier to expect a 1 hour recovery time.

For our new setup I provisioned 2 servers with MySQL Master-Slave replication, one on the East coast, the other one on the West coast. We switched over before the setup was entirely finalized for monetary reasons (that's the life of a startup)...

Early this morning our master database, hosted on EBS in US-EAST-1, stopped responding. I could log in to the servers, but the database on EBS was no longer processing requests. There was no way to recover the service, so I decided to switch over to the slave server in US-WEST-1.
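The switch-over decision can be sketched in a few lines of Python. This is only an illustration, not the procedure we actually ran: the `Endpoint` type, the hostnames and the `is_healthy` probe are all hypothetical. The one detail taken from the incident is that the probe has to run a real query against the database, because the servers still accepted logins while the database itself was stuck.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Endpoint:
    name: str  # hypothetical label, e.g. "master (US-EAST-1)"
    host: str  # hypothetical hostname


def choose_endpoint(
    primary: Endpoint,
    replica: Endpoint,
    is_healthy: Callable[[Endpoint], bool],
    attempts: int = 3,
) -> Optional[Endpoint]:
    """Return the endpoint that should serve traffic, or None if both fail.

    The primary is kept as soon as one probe succeeds, so a single
    dropped check does not trigger a failover.
    """
    for _ in range(attempts):
        if is_healthy(primary):
            return primary
    # The primary failed every probe: use the replica if it answers.
    if is_healthy(replica):
        return replica
    return None


# Assumed setup: is_healthy would run a trivial query against the
# database rather than just opening a connection.
east = Endpoint("master (US-EAST-1)", "db-east.example.internal")
west = Endpoint("slave (US-WEST-1)", "db-west.example.internal")
```

Calling `choose_endpoint(east, west, probe)` with a probe that fails for the master and succeeds for the slave returns the slave. Promoting the slave and repointing the application are separate steps that this sketch does not cover.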

It was much harder than anticipated because our setup and disaster recovery plan were not fully in place, but none of our customers have reported their service down. We were supposed to run our first drill over the weekend; instead we got a live opportunity to test the new setup, and it was very stressful given the nature of this service.

We had proper documentation, which was crucial for a fast and full recovery.

The AWS management console worked flawlessly during the incident, although we could not take snapshots on the East coast because the EBS volumes were not reachable.

Overall I am very happy with the end result, as it rewarded a lot of hard work. Cloud or not, AWS is neither better nor worse than other solutions from a reliability standpoint. But at least AWS makes it fairly easy to set up servers across different regions, whereas our other provider required that we have both our servers in the same rack!

The ability to start new instances almost instantly also proved useful in this outage. So overall I can say that Cloud Computing helped us from an availability standpoint.

Startups have to build for failures: hardware, network, power, data centers, but also human error (drop table replicated on slave server, oops) and vandalism (delete snapshots and everything else, #@!t).