We tend to build solutions where we assume the worst case scenario is merely performative: it will never actually happen. We slack on those solutions because the worst case is incredibly unlikely. This is the story of a worst case scenario.
There are folks who can corroborate this story; if you read this and can add to it, please reach out, as the disaster was about a decade ago and some of the details are fuzzy[1].
I lived close to the datacenter and routinely got calls from junior engineers who were flummoxed by this or that. I didn’t mind an occasional trip to help sort out some miscellany.
I was standing at the gas pump filling up while a couple buddies grabbed a case of beer when the call came in. We were headed to the big city to party for the night. The call was from a principal engineer. Odd.
“The datacenter is on fire, can you come in?”
Sure, no problem.
“Hey, there’s a problem at work, do you mind if we swing by on the way?”
My friends, good people who understand that shit happens, were unfazed. We’d get to the city a little later than planned, no big deal. We arrived and I locked them in my office to have a few beers and make bad jokes. We never made it to the city that night.
No really, the datacenter was literally on fire.
The DC was fairly modern in both age and approach. An auto-transfer switch had failed during a grid power fluctuation and sent a surge (memory says 1400 amps) pummeling through a circuit not rated for that kind of power (800-900 amps, maybe?). With massive power comes massive heat, and a small fire broke out. The safety systems triggered and saved the building and most of its contents. No one was injured. The fault was isolated to an electrical room. Had the failover happened a couple microseconds faster, the event never would have occurred.
There were cost-saving measures in play. Rather than bringing multiple legs of power to each rack, it was decided we could get by on a single leg since it was unlikely to ever cause an issue. That decision bit us in the ass. Hard.
We lost a leg of power; it was completely fried and we couldn’t risk sending power to those racks. We also lost a set of UPSes that valiantly attempted to buffer the surge. Two-thirds of the 300+ rack facility was humming along as if nothing had happened, but we were down about 100 racks. The system architecture didn’t permit us to instantly transition workloads to other gear or datacenters; we had to get those racks back online.
I’ll take a moment to lament that I’ve long since lost the photos of the impromptu problem solving that happened that night. I giggle about the solutions occasionally. One of my favorites was a PDU velcro’d to a crash cart (like this, thanks to mainlinecomputer.com from whom I stole the image) that let us move power wherever we needed it[2]. There were lots of wacky solutions.
Photos would illustrate well my contention that good enough is fine in the moment, but you have to fix flaws properly in the long term. Instead, here’s a picture I snapped the other day of a donut truck that uses the vehicle’s HVAC to keep the donuts warm:
I was specifically told I couldn’t have a copy of the video footage because they knew I would write this up one day. I waited nearly a decade, Ryan[3]. 😂
Nearly everything hardware-related was handled in house. I can’t imagine how long getting back online would have taken had we needed to call a vendor for replacement parts and gear. Sometimes being cheap is an asset.
The surge fried gear all over the facility. When the UPSes failed, the power supplies went to war and many sacrificed themselves for the cause. An aggressive deployment schedule left us with gear on hand to swap, swap, swap[4]. We were lucky.
Even though the architecture depended on local storage (local to the datacenter, and frequently to the machine itself), our failure domains were well defined and aided in the semi-prompt restoration of service as hardware was swapped and brought back online. Easily 100k customers were directly impacted, and many more indirectly.
The most pressing issue was that one of two “support” rows (approximately 480 U) was partially offline. In the modern era we talk about pets versus cattle; the support rows were home to a bunch of legacy pets. We had partial monitoring and worked quickly to restore full visibility. You can’t fix what you don’t know is broken.
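For a sense of what restoring visibility looks like at that hour, here’s a minimal sketch of the kind of throwaway reachability sweep you end up writing while proper monitoring is still limping. Everything in it is an assumption rather than our actual tooling: the hosts.txt list, the SSH port probe, and the thread count are all hypothetical.

```python
#!/usr/bin/env python3
"""Quick-and-dirty reachability sweep: which hosts answer on SSH at all?

A hypothetical stand-in for the ad hoc checks run while monitoring was
only partially restored; hosts.txt and the port choice are assumptions.
"""
import socket
from concurrent.futures import ThreadPoolExecutor


def is_up(host: str, port: int = 22, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def main() -> None:
    with open("hosts.txt") as f:
        hosts = [line.strip() for line in f if line.strip()]
    # Sweep in parallel; results come back in the same order as the input.
    with ThreadPoolExecutor(max_workers=64) as pool:
        for host, up in zip(hosts, pool.map(is_up, hosts)):
            print(f"{'UP  ' if up else 'DOWN'} {host}")


if __name__ == "__main__":
    main()
```

Crude, but a list of UP/DOWN is enough to start triaging which pets need attention first.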
We needed to let our customers know that there’s a problem and we’re working on it: a) it’s the right thing to do, and b) support was getting slammed. The emergency voice of our phone system stepped away, attempted to suppress the “oh god the world is burning” in his voice, and recorded a notice for inbound calls. Messages acknowledging the issue were posted on the status site and support portals.
We still needed to email everyone. Emailing customers ended up being a much bigger task. The customer database and outbound mail proxies, both of which were in a degraded state, were necessary for mass mail. It didn’t help that outbound corporate mail was intentionally restricted to a small subset of the larger outbound mail pool. Getting messages to customer inboxes was going to take time, time we didn’t have. Wake up the senior devs. Sorry, folks, we rarely wake you in the middle of the night, but we need you.
Waking the devs not only put more eyes on the problem, it also let us segment work that would normally be handled by a single team during a crisis. It meant the systems engineering team could focus on temporarily increasing outbound mail capacity (using all the dirty tricks of spammers) while someone else figured out a message and coaxed the database back to health. The messages went out.
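To make the shape of that work a little more concrete, here’s a rough sketch of fanning a notification out across several outbound relays in small batches. It’s illustrative only: the relay hostnames, sender address, subject, batch size, and recipients.txt file are all hypothetical, not what we actually ran that night.

```python
#!/usr/bin/env python3
"""Fan a notification email out across several outbound relays.

Illustrative only: relay hostnames, sender address, subject, and the
batching numbers are hypothetical, not the real setup from that night.
"""
import itertools
import smtplib
from email.message import EmailMessage

RELAYS = ["relay1.example.net", "relay2.example.net", "relay3.example.net"]
SENDER = "status@example.net"
BATCH_SIZE = 50  # recipients per SMTP transaction, kept deliberately small


def build_message() -> EmailMessage:
    """Build the outage notice; recipients ride on the SMTP envelope only."""
    msg = EmailMessage()
    msg["From"] = SENDER
    msg["To"] = SENDER
    msg["Subject"] = "Service disruption: we are aware and working on it"
    msg.set_content("We are experiencing a power-related outage ...")
    return msg


def send_batches(recipients: list[str]) -> None:
    relay_cycle = itertools.cycle(RELAYS)  # round-robin across relays
    msg = build_message()
    for i in range(0, len(recipients), BATCH_SIZE):
        batch = recipients[i:i + BATCH_SIZE]
        relay = next(relay_cycle)
        try:
            with smtplib.SMTP(relay, timeout=30) as smtp:
                smtp.send_message(msg, from_addr=SENDER, to_addrs=batch)
        except (smtplib.SMTPException, OSError) as exc:
            # A degraded relay shouldn't sink the whole run; note it, move on.
            print(f"batch starting at {i} via {relay} failed: {exc}")


if __name__ == "__main__":
    with open("recipients.txt") as f:
        send_batches([line.strip() for line in f if line.strip()])
```

Round-robining across relays and keeping batches small is the polite version of the spammer tricks: it spreads load over whatever outbound capacity is still healthy and keeps any single sick relay from sinking the whole run.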
Hours went by as we gradually brought impacted systems back online. Power tails were dragged from all over the facility through the plenum and dropped into unpowered rows. Chassis after chassis was swapped. One-off scripts were written to deal with issues never considered[5].
The machine limped back to life.
24 hours in, consumption of burritos had likely reached 100; pizzas 50; and cases of energy drinks 10. We were dragging, but weren’t done.
Stay tuned for part two where we’ll examine the decisions that caused the failure and extended recovery effort.
1. In the course of writing, I’ve reached out to a number of folks who were there. While I swapped a few chassis, my primary focus then and in this piece was getting the whole system back online. Hopefully those who raced between staging and the datacenter, carts piled high with hardware to swap, will chime in, but they’re all senior engineers with big projects and families now that limit their time for reminiscing. Fingers crossed.
2. We, unfortunately, later lost to mental illness the engineer who came up with that solution. He was a brilliant mind who had a penchant for collecting and gifting. I dragged the electric motors he gave me from office to office for years until a cross-country move finally made it impractical. Paul inherited them and, I suspect, still has them. No matter how inconvenient finding a spot to display them was, they were a better reminder than any hunk of granite could ever be. You are missed, Aaron.
3. No Ryans were harmed in the writing of this piece. One particular Ryan was repeatedly harmed over the course of years wrangling a dowt[6] of engineers. Thanks for dealing with us, Ryan.
4. Jeff was hoarding legacy power supplies; none of us knew, and it was a blessing to find out that night.
5. I wrote a spectacularly terrible Perl script that I wish I had a copy of, just to marvel at it in horror.
6. There may be disagreement over what to call a group of engineers; I find a group of feral cats apt imagery. I’m also somewhat partial to “a fixture.”