On Software Development
The chronicle of an IT tragedy
Preamble: I work at Spreaker, an innovative Italy-based startup focused on live streaming and podcasting. Spreaker runs largely on Amazon's cloud computing infrastructure, where we use several services, including EC2, S3 and EBS.
Now, imagine an almost perfect summer Sunday evening. Chilling out with friends, drinking a cold beer, everything seems so good.
Wait, something starts vibrating… I’ve got mail! It must be an advertisement email from some website… no, it’s Pingdom. And when Pingdom calls, it’s to tell you something you don’t want to know.
This is how the events quickly unfolded:
- Notification from Pingdom: “Frontend is DOWN”
- Notification from Pingdom: “Api are DOWN”
- Notification from ServerDensity: “No data received”
WTF?
- The phone rings: “There are problems in EC2. I’m trying to understand what happened. I’ll call you back.”
- The phone rings again: “EC2 is fucked up. See you at the office, I’m already in the car”
The fact: a lightning strike (in the cloud? :D) hit the power transformers of an EC2 Availability Zone, “eu-west-1b”, sparking an explosion and a fire. Part of the backup power plant was also affected by the strike, and parts of the AZ lost power.
This is an extract from the Amazon EC2 documentation:
Amazon EC2 provides the ability to place instances in multiple locations. Amazon EC2 locations are composed of Regions and Availability Zones. Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. By launching instances in separate Availability Zones, you can protect your applications from failure of a single location. Regions consist of one or more Availability Zones, are geographically dispersed, and will be in separate geographic areas or countries.
Unfortunately, our instances were all in the same AZ, “eu-west-1b”, so we started the “disaster recovery” procedure. It took about two hours to get the core service back online, and another two hours to restore all the features. We restored the database from EBS snapshots and moved all instances and services out of the impaired AZ, and slowly the situation returned to normal.
What we have learned from this disaster:
- Be paranoid about all of your backups and service redundancy
- Implement a multi-AZ AND multi-Region architecture
Concretely, this is what we have done in the days after the tragedy:
- Cross-AZ and Cross-Region database warm standby replication
- Cross-Region AMI storage
- Cross-Region backups on S3
- Cross-AZ load balancing with Elastic Load Balancer
- Streaming servers on multiple AZs, balanced by our API
- Skynet (our autoscaling software) has been modified to autoscale instances across multiple AZs, based on the available zones reported by the Amazon API
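To make the last point more concrete, here is a minimal sketch of how an autoscaler could spread new instances across the zones reported as available. It is not Skynet's actual code: the function names `available_zones` and `pick_zones` and the least-loaded-first policy are assumptions made for illustration. The zone dictionaries mimic the shape returned by the EC2 DescribeAvailabilityZones API (`ZoneName`, `State`); the sketch takes them as plain data, so it runs without any AWS credentials.

```python
from typing import Dict, Iterable, List


def available_zones(zones: Iterable[Dict[str, str]]) -> List[str]:
    """Return the names of the zones whose state is 'available', sorted."""
    return sorted(z["ZoneName"] for z in zones if z.get("State") == "available")


def pick_zones(zones: Iterable[Dict[str, str]],
               running: Dict[str, int],
               count: int) -> List[str]:
    """Choose a zone for each of `count` new instances.

    Hypothetical policy: each new instance goes to the available zone
    that currently runs the fewest instances (ties broken by zone name).
    Zones that are not 'available' are never chosen.
    """
    names = available_zones(zones)
    if not names:
        raise RuntimeError("no available zone reported")
    # Current instance count per available zone (0 if we run nothing there).
    load = {name: running.get(name, 0) for name in names}
    placement = []
    for _ in range(count):
        target = min(names, key=lambda n: (load[n], n))
        placement.append(target)
        load[target] += 1
    return placement


# Example data: in a real setup this list would come from the Amazon API.
zones = [
    {"ZoneName": "eu-west-1a", "State": "available"},
    {"ZoneName": "eu-west-1b", "State": "impaired"},
    {"ZoneName": "eu-west-1c", "State": "available"},
]
running = {"eu-west-1a": 2, "eu-west-1b": 5}

print(pick_zones(zones, running, 4))
# prints ['eu-west-1c', 'eu-west-1c', 'eu-west-1a', 'eu-west-1c']
```

The impaired zone never receives new instances, and the healthy zones converge towards an even split, so losing any single AZ takes out only a share of the fleet.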
It has been a stressful but challenging week, hasn’t it?