Chaos Engineering resources

Overview

Articles

Our blog focuses on the basics of Chaos Engineering and shares our experiences in running effective failure tests.

Sign up for our mailing list to receive blog posts as they are published.

Talks

The Evolution of Chaos - Chaos Engineering is the practice of intentionally injecting failure into a system to proactively identify and fix problems before they cause outages. It's an emerging discipline, but its roots are decades old. So why is it now becoming the go-to approach for building resilient systems? And why does the current state of distributed architectures call for chaos as the best response to system failure?

Breaking Things on Purpose - Failure Testing prepares us, both socially and technically, for how our systems will behave in the face of failure. By proactively testing, we can find and fix problems before they become crises. Practice makes perfect, yet a real calamity is not a good time for training. Knowing how our systems fail is paramount to building a resilient service.

Monkeys in Lab Coats - In this talk, we share our experience of a fruitful industry/academic collaboration. We describe how a “big idea” -- lineage-driven fault injection -- evolved from a theoretical model into an automated failure testing system that leverages Netflix’s state-of-the-art fault injection and tracing infrastructures.

Podcasts

Software Engineering Daily - Servers in a data center fail. Bugs in an application make it into production. Human operators make mistakes. Failure is unavoidable. Jeff and Kolton discuss failure testing, startup life, and culture at Amazon and Netflix.

InfoQ - QCon chair Wesley Reisz talks to Kolton Andrus, founder of Gremlin Inc. Kolton was a Chaos Engineer at Netflix, where he focused on the resilience of the Edge services and designed and built FIT, Netflix’s failure injection service. Before that, he improved the performance and reliability of the Amazon Retail website.

The Cloudcast - Brian talks with Kolton Andrus about his background at Amazon and Netflix, the discipline of Chaos Engineering, the challenges of breaking things in production, and Gremlin Inc.’s approach to building better applications and systems.

Published Papers

Automating Failure Testing Research at Internet Scale - ucsc.edu - In this paper, we describe how we adapted and implemented a research prototype called lineage-driven fault injection (LDFI) to automate failure testing at Netflix.

"Required" Reading

On Designing and Deploying Internet-Scale Services - James Hamilton - Proceedings of the 21st Large Installation System Administration Conference (LISA '07)

Chaos Engineering - Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones & Ali Basiri - Netflix

Site Reliability Engineering - Betsy Beyer, Chris Jones, Jennifer Petoff & Niall Murphy - Google

Antifragile - Nassim Nicholas Taleb

Drift into Failure - Sidney Dekker