On Software Development
The chronicle of an IT tragedy
Preamble: I work at Spreaker, an innovative live streaming and podcasting startup based in Italy. Spreaker is deployed extensively on the Amazon cloud computing infrastructure, and we use several of its services, including EC2, S3, and EBS.
Now, imagine an almost perfect summer Sunday evening. Chilling out with friends, drinking a cold beer, everything seems so good.
Wait, something starts vibrating… I’ve got mail! It must be an advertisement from some website… no, it’s Pingdom. And when Pingdom calls, it’s to tell you something you don’t want to know.
This is how the events quickly unfolded:
- Notification from Pingdom: “Frontend is DOWN”
- Notification from Pingdom: “Api are DOWN”
- Notification from ServerDensity: “No data received”
WTF?
- The phone rings: “There are problems in EC2. I’m trying to understand what happened. I’ll call you back.”
- The phone rings again: “EC2 is fucked up. See you at the office, I’m already in the car”
The fact: a lightning strike (in the cloud? :D) hit the power transformers of an EC2 Availability Zone, “eu-west-1b”, sparking an explosion and a fire. Part of the backup power plant was also affected by the strike, and parts of the AZ lost power.
This is an extract from the Amazon EC2 documentation:
Amazon EC2 provides the ability to place instances in multiple locations. Amazon EC2 locations are composed of Regions and Availability Zones. Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. By launching instances in separate Availability Zones, you can protect your applications from failure of a single location. Regions consist of one or more Availability Zones, are geographically dispersed, and will be in separate geographic areas or countries.
Unfortunately, our instances were all in the same AZ, “eu-west-1b”, so we started the “disaster recovery” procedure. It took about two hours to get the core service back online, and another two hours to restore all the features. We restored the database from EBS snapshots and moved all instances and services out of the impaired AZ, and slowly the situation returned to normal.
What we learned from this disaster:
- Be paranoid about all of your backups and service redundancy
- Implement a multi-AZ AND multi-Region architecture
In concrete terms, this is what we have done in the days since the tragedy:
- Cross-AZ and Cross-Region database warm standby replication
- Cross-Region AMI storage
- Cross-Region backups on S3
- Cross-AZ load balancing with Elastic Load Balancing
- Streaming servers in multiple AZs, balanced by our API
- Skynet (our autoscaling software) has been modified to autoscale instances across multiple AZs, based on the available zones reported by the Amazon API
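The idea behind the last point can be sketched in a few lines. This is only an illustration and not Skynet's actual code: the function name `spread_across_zones` and the zone list are made up for the example, and in a real setup the list of zones would come from the provider's API rather than being hard-coded.

```python
from collections import Counter
from itertools import cycle


def spread_across_zones(instance_count, available_zones):
    """Assign instances to zones round-robin so no single AZ holds them all."""
    if not available_zones:
        raise ValueError("no availability zones reported")
    zones = cycle(sorted(available_zones))
    return Counter(next(zones) for _ in range(instance_count))


# Hypothetical zone list, hard-coded here only for the example.
placement = spread_across_zones(5, ["eu-west-1a", "eu-west-1b", "eu-west-1c"])
print(dict(placement))
# {'eu-west-1a': 2, 'eu-west-1b': 2, 'eu-west-1c': 1}
```

If a zone drops out of the reported list, the next call simply spreads new instances over the zones that remain.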
It has been a stressful but challenging week, hasn’t it?
An iOS Developer Takes on Android
Recently, we released the Android version of Meridian, our platform for building location-based apps.
We didn’t use one of these “Cross Platform!” tools like Titanium. We wrote it, from scratch, in Java, like you do in Android.
We decided it was important to keep the native stuff native, and to respect each platform’s conventions as much as possible. Some conventions are easy to follow, like putting our tabs on the top. Other conventions go deep into the Android Way, like handling Intents, closing old Activities, implementing Search Providers, and being strict about references to help the garbage collector.
Now, our platform leverages HTML5 (buzzword, sorry) in many places for branding and content display, so we got a fair amount of UI for free. But there was much platform code written in Objective-C that needed translation into Java, such as map navigation, directions, and location switching.
So, we rolled up our sleeves, downloaded the Android SDK, and got to work.
This is not a joke….. WTF!
A new era for PHP
Over the last few months a very vibrant community has grown around PHP. I think most of the credit for this goes to two amazing projects:
- GitHub gave all of us a great collaboration tool
- Symfony2 is bringing the whole PHP environment to the next level. A level never seen before.
These are some examples of what’s happening in the PHP world:
When Eclipse doesn’t start….
Sometimes Eclipse can be a real pain in the ass for a developer.
Eclipse uses a sort of “index” for advanced features like code assist, and if something goes wrong while that index is being built, the IDE can freeze during startup.
The only trick I’ve found to get out of this kind of deadlock is to remove a hidden file in the workspace.
rm <workspace>/.metadata/.plugins/org.eclipse.core.resources/.snap
This file contains the index I mentioned above; removing it forces Eclipse to rebuild the index from scratch, and the situation returns to normal.
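If deleting the file outright feels too drastic, a slightly more cautious variant is to rename it so it can be put back later. The sketch below is just one way to do that; the helper name `set_aside_snapshot` and the `.snap.bak` backup name are my own choices, and the workspace path is whatever your workspace happens to be.

```python
from pathlib import Path


def set_aside_snapshot(workspace):
    """Rename the .snap file so Eclipse rebuilds it, keeping a copy to restore."""
    snap = (Path(workspace) / ".metadata" / ".plugins"
            / "org.eclipse.core.resources" / ".snap")
    if not snap.exists():
        return None  # nothing to set aside
    backup = snap.with_name(".snap.bak")  # hypothetical backup name
    snap.replace(backup)  # overwrites an older backup, if any
    return backup
```

Call it with the workspace directory while Eclipse is closed, then start the IDE again.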
Cheers ;)
