At the NetHope Conference, one of the better plenary sessions was by Joe Baguley of VMware. One thing he mentioned in his talk that resonated with me was something Netflix had developed called the Chaos Monkey.
The Chaos Monkey is a programme Netflix run on their systems that randomly shuts down processes and services. The idea is that the world is a chaotic place, and at some point one of your processes or services will shut down. The Chaos Monkey simulates this, forcing everyone to design systems that can handle this or that part failing.
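To make the idea concrete, here is a minimal sketch in Python. It is not Netflix's actual tool, just a toy simulation of the principle: the `Service` class, the `chaos_monkey` and `resilient_call` functions, and the `mail-1` style names are all hypothetical, invented for illustration. The "monkey" switches off one running service at random, and the caller survives because it is written to try the next replica.

```python
import random


class Service:
    """A stand-in for a real process or service (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        # A dead service refuses every request.
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def chaos_monkey(services, rng):
    """Shut down one randomly chosen running service.

    Returns the victim, or None if nothing is left running.
    """
    running = [s for s in services if s.alive]
    if not running:
        return None
    victim = rng.choice(running)
    victim.alive = False
    return victim


def resilient_call(services, request):
    """Try each replica in turn, so a single failure does not fail the request."""
    for service in services:
        try:
            return service.handle(request)
        except ConnectionError:
            continue  # this replica is down, try the next one
    raise RuntimeError("all replicas are down")


# Three replicas of an imaginary mail service.
replicas = [Service("mail-1"), Service("mail-2"), Service("mail-3")]
rng = random.Random(42)  # seeded only so repeated runs behave the same

victim = chaos_monkey(replicas, rng)
reply = resilient_call(replicas, "send")
print(f"Monkey killed {victim.name}; request still served: {reply}")
```

The point of the sketch is the shape of `resilient_call`: because it assumes any replica may be gone, the monkey's visit goes unnoticed by the user. A caller that talked to a single fixed service would fail as soon as that service was picked.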
This seems to be an important concept to grasp, particularly when building on platforms that market themselves as extremely resilient.
At Christian Aid, I don’t think we need to build our own chaos monkeys. In our international environment, we are frequently interrupted by chaotic events, from giant signs falling on VSAT dishes (Abuja, 2008) to seemingly random VPN outages caused by ISP config errors (Port-au-Prince, Dhaka, Delhi, La Paz, all too often recently). Whilst these are a proper pain in the derrière, we must learn from them, and use that learning to build not only more resilient infrastructure but also organisational processes that can handle everything from earthquakes to a SAN failure taking out our email system.
The Chaos Monkey teaches us to expect the unexpected.
- Coding Horror discussion of the Chaos Monkey
- The Monkey was Apple’s invention back in the early 80s, and may have inspired Netflix engineers