Keeping Cool in the Face of Disaster

In Washington, during the peak of Hurricane Sandy, my power went out. The wind outside was howling, the UPS my MacBook was plugged into began to scream, and my Comcast Internet was hosed. We were in the middle of manually failing over our application servers to our New Jersey datacenter, and now I had to carry on over my T-Mobile-tethered cellphone.

Disaster recovery (DR) is serious business. When a catastrophe strikes, you're never completely prepared. At Bloomberg, as at most large companies I imagine, we take care to design our systems to be resilient. For us, this means highly available, uninterrupted service for all of our clients. To sustain a complete datacenter outage, we must be fully confident in our automated failover solutions. A disaster recovery plan is no good if it hasn't been tested, and a disaster is the wrong time to run that test.

Wikipedia says there are seven tiers of disaster recovery, but that only gives a thousand-foot view of how to plan the technical portion of business continuity. From experience, there are a few practical actions you and your team can take to keep your cool during a disaster.

Reboot your machines on a regular basis.

Years ago, any system administrator worth their salt laughed at those who counted their uptime in years. If you never reboot your machine, how can you be sure it will actually come back online in a usable state? A few thousand dollars' worth of hardware is no good if you are not sure how to QC it before putting it in front of customer traffic. Take the time to make a checklist of the services that you want to come online automatically after booting.
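
That checklist can even be a short script you run after every reboot. Here is a minimal sketch, assuming a systemd host; the service names are placeholders for whatever your own stack runs:

```
#!/usr/bin/env bash
# post-boot-check.sh -- verify critical services came back after a reboot.
# The service names below are placeholders; substitute your own stack.
set -u

services=(sshd nginx postgresql haproxy)
failed=0

for svc in "${services[@]}"; do
    if systemctl is-active --quiet "$svc"; then
        echo "OK      $svc"
    else
        echo "FAILED  $svc"
        failed=1
    fi
done

exit "$failed"
```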

Work towards making your scalable systems ephemeral.

Scaling your application out horizontally does not automatically mean that the individual nodes can spin up and down without major problems. It often means designing all of your services with high availability in mind, and that is much more than just sticking HAProxy in front of all your application endpoints. You need to be sure that your applications and services will fail gracefully, or better yet, swing over to their standby.
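
As an illustration, HAProxy lets you mark a standby explicitly. A minimal sketch of a backend with health checks and a backup server follows; the backend name, addresses, and health endpoint are all made up:

```
backend app_servers
    # Probe an application health endpoint rather than just the TCP port.
    option httpchk GET /healthz
    server app01 10.0.0.11:8080 check
    server app02 10.0.0.12:8080 check
    # Only receives traffic once every primary server is marked down.
    server app-standby 10.0.1.11:8080 check backup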

Don’t put all your eggs in one basket.

When deploying physical hardware in a datacenter, you wouldn't put it all on the same power circuit. Unfortunately, you often no longer have physical access to the hardware; all provisioning is performed over the wire. How much sense would it make for all of your company's virtual machines to be tenants on the same physical hardware? Not much at all! If possible, ask your service provider whether their compute services are rack aware.
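
If your provider exposes placement controls, use them. As a sketch, on an OpenStack-based cloud you could request anti-affinity so that two VMs never land on the same hypervisor; the group name, flavor, image, and server name here are hypothetical:

```
# Create a server group whose members must be scheduled on different hosts.
openstack server group create --policy anti-affinity web-group

# Boot each VM into that group so the scheduler keeps them apart.
openstack server create --flavor m1.medium --image ubuntu-22.04 \
    --hint group=<server-group-uuid> web01
```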

Package complex steps into shell scripts.

If you cannot automatically orchestrate the failover of a machine, take the time to write a clear and concise shell script to perform the work. It is much easier for documentation to point to a single bash script that captures the more complex procedures. In the middle of the night, your operations staff may not have any context on the problem they're trying to solve; rather than calling an all-hands-on-deck, they can run through a list of procedures.
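
For example, a hypothetical database failover script might look something like the sketch below. The hostnames, service names, and promotion command are placeholders; adapt them to whatever your environment actually needs:

```
#!/usr/bin/env bash
# failover-db.sh -- promote the standby database and repoint the app tier.
# Everything below is illustrative; substitute your own hosts and commands.
set -euo pipefail

PRIMARY="db01.nyc.example.com"
STANDBY="db01.nj.example.com"

echo "Stopping the application writers..."
ssh app01.example.com 'sudo systemctl stop myapp-writer'

echo "Promoting ${STANDBY} to primary..."
ssh "${STANDBY}" 'sudo -u postgres pg_ctl promote -D /var/lib/postgresql/data'

echo "Repointing the application at ${STANDBY} and restarting..."
ssh app01.example.com "sudo sed -i 's/${PRIMARY}/${STANDBY}/' /etc/myapp/db.conf \
    && sudo systemctl start myapp-writer"

echo "Failover complete. Verify with your post-boot checklist."
```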

Clearly document manual procedures. The analog way.

This brings me to the final item on the checklist. There are plenty of companies out there offering digital solutions for organizing your team and its work. Did you know that the popular Heroku Platform as a Service is built on top of Amazon Web Services? If the fancy SaaS application that holds all of your documentation goes offline, you're now unable to bring your systems back online manually. Take the time to physically document your procedures and distribute copies to all of your team members, especially your operations staff.
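
One low-tech way to do this is to keep the runbook as a plain Markdown file under version control and periodically render a printable copy to hand out. A sketch, assuming pandoc and a LaTeX engine are installed, and that runbook.md is your (hypothetical) procedures file:

```
# Render the runbook to PDF, then print a copy for each on-call engineer.
pandoc runbook.md -o runbook.pdf
lpr runbook.pdf
```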

These are but a few of the action points you must cover while planning for business continuity. As I mentioned above, this is serious business. Begin by hiring a proper operations staff, and remember that they should not be the same people as your developers; writing code and managing infrastructure are two completely separate beasts. Your engineers, in turn, should design their systems to sustain failure at any level of the stack.

I hope these steps help you while planning for failure. If you find this useful, give me a shout and let me know!