Three cloud ops lessons you should learn before your next outage
by Jeremy Wagner-Kaiser
It’s a quiet Thursday night in Portland. It’s about 9:45 PM. Your humble author is preparing to meet a friend for drinks.
These rather promising plans are rudely interrupted when the on-call engineer informs me that our alerting systems are doing their best Christmas tree impression, and they show no sign of stopping. Hilarity ensues.
When the dust settled and the systems were purring along again, it was time to look back and draw a few lessons.
Things breaking doesn’t sound all that bad on its own. It’s how and when they break that gets rough.
The recent AWS outage is a perfect example. The outage itself didn’t reboot any of our instances, but that wasn’t the problem. The problem was the cascading failures.
The AWS outage meant that Heroku failed, which meant that our app fell over. It also meant that many of the services we depend on fell over, including some responsible for logging, monitoring, and exception handling. Those were important, but they weren’t critical to continued operation.
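As a rough illustration of that distinction, a sketch along these lines keeps a flaky error-reporting service from taking the app down with it (the reporter client and its notify call are hypothetical, not our actual code):

    import logging

    log = logging.getLogger(__name__)

    def report_exception(reporter, exc):
        """Send an exception to an external reporting service, but never
        let a failure in that service break the request that hit it."""
        try:
            reporter.notify(exc)  # hypothetical third-party client call
        except Exception:
            # The reporting service is important, not critical: if it is
            # down, note it locally and keep serving traffic.
            log.warning("exception reporter unavailable; continuing",
                        exc_info=True)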
No, the real problem was that our Redis service died. We used Redis to connect our tightly secured backend boxes to our frontend. When AWS and Heroku came back, our Redis service didn’t. That caused an interesting array of internal errors, and we learned quite a bit from it.
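A minimal sketch of the kind of guard we ended up wanting around that connection, assuming the standard Python redis client (the host name, retry counts, and delays here are illustrative, not our production setup):

    import time
    import redis

    def get_redis(host="redis.internal", retries=3, delay=2):
        """Return a Redis client only after it answers a PING.
        If the service never comes back, fail loudly up front instead of
        letting every downstream request discover the problem on its own."""
        client = redis.Redis(host=host, port=6379, socket_connect_timeout=2)
        for attempt in range(1, retries + 1):
            try:
                client.ping()  # cheap health check
                return client
            except (redis.exceptions.ConnectionError,
                    redis.exceptions.TimeoutError):
                time.sleep(delay * attempt)  # back off and try again
        raise RuntimeError("Redis still unreachable after %d attempts" % retries)

Checking the dependency explicitly at startup, rather than trusting that it came back when the platform did, is exactly the gap that bit us here.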
Chief among them, though: we always approach our systems as if something can break. Because it can, and it will.