7 Steps to Execute Chaos Engineering
Amala Amura
We’ve all heard about the major WhatsApp outages of the recent past, during which the app was unavailable to the public for about an hour. Yet from a technical standpoint, the service was back in under an hour. What enabled the engineers at WhatsApp to restore service so quickly?
An outage like this is an extremely stressful production failure for the team behind it. Major corporations such as Netflix, Facebook, and Google prepare for such failures with a technique called Chaos Engineering.
The purpose of chaos engineering is to learn how our system behaves during catastrophic failures in production and how resilient it is, which then gives us an opportunity to fix the issues and optimize.
The practice of chaos engineering involves experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. You can use chaos engineering to compare what you expected to happen with what actually happened. To understand how to build more resilient systems, we literally need to “break things on purpose”.
When we were little, we used to pick up wooden sticks off the ground and bend them until they split in half. What interests us most, though, is the point at which the stick breaks. That breaking point is the true measure of how much stress and pressure the stick can bear. A chaos experiment follows the same pattern:
Start with a hypothesis – Estimate the strength of a wooden stick
Measure baseline behavior – Note the typical strength of the stick
Inject a fault or failure – Bend or hammer the stick
Monitor the resulting behavior – Observe the point at which the stick breaks
The goal of chaos engineering is to observe, track, react to, and improve the reliability of our systems under challenging conditions.
7 Steps to Execute Chaos Engineering
Step 1. Get Approval from your Leadership
The first step is getting your leadership’s approval to run the experiments. Chaos experiments can be carried out in any valid environment, and they are often run in real-world settings, but they should start out slowly. For your first experiments, it is recommended to use a non-production (QA, Staging) environment rather than production.
Explain the value that chaos experiments bring, such as identifying failures and bottlenecks, validating resilience, and validating scaling.
Step 2. Understand the System Architecture
Before running your chaos experiments, thoroughly understand your system’s architecture. Discuss the application architecture in a working session with your developers, architects, and SREs, and learn about the upstream/downstream components, dependencies, timeframe, deployment schedule, and other factors. This will help you understand where exactly your system could fail.
Step 3. Write a Hypothesis
Start by writing a list of hypotheses about what might go wrong. Example: If a site has numerous nodes and one of them goes down, the load balancer must rapidly reroute traffic to the remaining healthy nodes. Other scenarios to consider include failing hard drives, broken network connections, and potential production interruptions. The main point is that there are no right or wrong answers when listing hypotheses; it is an iterative process. Proving a hypothesis TRUE or FALSE is NOT our goal. Each hypothesis gives us a chance to learn more about our system.
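The load balancer example above can be sketched as a small simulation. This is only an illustration under stated assumptions: the `Node` and `RoundRobinBalancer` classes and the node names are hypothetical stand-ins, not a real load balancer API. The sketch takes one node down and then checks whether every request is still served.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical stand-in for a server behind the load balancer.
    name: str
    healthy: bool = True


class RoundRobinBalancer:
    """Toy load balancer that skips nodes marked unhealthy."""

    def __init__(self, nodes):
        self.nodes = nodes
        self._next = 0

    def route(self):
        # Try each node at most once per request, continuing after the last pick.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            if node.healthy:
                return node.name
        return None  # no healthy node left, so the request fails


def check_hypothesis(balancer, requests=100):
    """Hypothesis: every request is served even with one node down."""
    served = [balancer.route() for _ in range(requests)]
    return {
        "requests": requests,
        "failures": served.count(None),
        "nodes_used": sorted(set(served) - {None}),
    }


nodes = [Node("node-a"), Node("node-b"), Node("node-c")]
balancer = RoundRobinBalancer(nodes)
nodes[1].healthy = False  # inject the fault: node-b goes down
result = check_hypothesis(balancer)
print(result)
```

Whatever the result shows, the value lies in comparing it with what you expected, which is the point of writing the hypothesis down first.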
Step 4. Minimize the Blast Radius
Always start small. Reducing the blast radius lets you run chaos experiments with less impact on users. Example: Delete the build deployment flow in Jenkins and validate the resiliency. Before you delete a deployment flow, make sure GitOps is active so that the GitOps flow recreates the build deployment automatically. Another example is to shut down only one zone rather than the full region, or to take down only 50% of the cluster’s active nodes. You can progressively extend the blast radius once your chaos practice has matured and your team is comfortable.
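One way to keep the blast radius explicit is to compute the list of targets before touching anything. The sketch below assumes a hypothetical inventory that maps zone names to node names; the `pick_targets` helper and all names in it are illustrative, not part of any chaos tool. It selects a fraction of the nodes in a single zone and never reaches into other zones.

```python
import math
import random


def pick_targets(nodes_by_zone, zone, fraction, seed=None):
    """Return the nodes to disrupt: a fraction of one zone, never the whole region."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    candidates = nodes_by_zone[zone]
    if not candidates:
        return []
    # Round down, but always disrupt at least one node so the experiment runs.
    count = max(1, math.floor(len(candidates) * fraction))
    rng = random.Random(seed)  # a seed makes the selection repeatable
    return sorted(rng.sample(candidates, count))


# Hypothetical inventory: two zones in one region.
cluster = {
    "zone-1": ["n1", "n2", "n3", "n4"],
    "zone-2": ["n5", "n6", "n7", "n8"],
}
targets = pick_targets(cluster, zone="zone-1", fraction=0.5, seed=7)
print(targets)
```

Raising `fraction` step by step, or adding a second zone later, is then a deliberate and reviewable change rather than an accident.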
Step 5. Plan for a Game Day
Always think ahead and have a Plan B handy. Set up a single communication channel in Teams (or your company’s communication platform) for periodic updates, and notify all relevant stakeholders at least one week in advance. It is advisable to assemble your own Avengers team of developers, testers, DevOps engineers, SREs, and others to support you when you launch your first experiment.
Step 6. Run your First Experiment
Running the first chaos experiment is like riding a thrilling roller coaster. Make sure you can stop the experiment and roll back the infrastructure with the help of your Avengers squad in case things go wrong. To conduct an experiment, you intentionally break your system so that some components of your infrastructure become unavailable. Examples include shutting down running processes, deleting database tables, blocking access to internal or external services, and terminating cluster machines.
Although these experiments are challenging, you will be astonished by how much you can learn from chaos, no matter what you choose to try. Watch your observability dashboard throughout the experiment to keep track of important metrics such as response time, disk usage, passed and failed transactions, and health checks. Nobody is flawless. It’s okay if your initial experiment doesn’t go as planned. Post an update as soon as possible to notify all parties involved.
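The advice to watch your metrics and be able to stop and roll back can be sketched as a small experiment runner. This is a minimal illustration, not a real chaos tool: `run_experiment`, the `inject` and `rollback` callbacks, the simulated response times, and the 500 ms threshold are all assumptions made for the example.

```python
def run_experiment(inject, rollback, read_metric, threshold, checks):
    """Inject a fault, watch one metric, and always roll back afterwards."""
    observations = []
    aborted = False
    inject()
    try:
        for _ in range(checks):
            value = read_metric()
            observations.append(value)
            if value > threshold:
                aborted = True  # the metric breached the limit: stop early
                break
            # A real run would wait between checks; omitted to keep the sketch fast.
    finally:
        rollback()  # runs on normal completion, on abort, and on errors
    return {"aborted": aborted, "observations": observations}


# Simulated system: a flag for the fault and canned response times in ms.
state = {"fault": False}
samples = iter([120, 180, 650, 700])


def inject():
    state["fault"] = True


def rollback():
    state["fault"] = False


outcome = run_experiment(inject, rollback, lambda: next(samples),
                         threshold=500, checks=4)
print(outcome, state)
```

Putting the rollback in a `finally` block means the system is restored even if reading the metric itself fails, which matters most when things are not going as planned.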
Step 7. Analyze & Brainstorm the Experiment Results
Once the experiment is complete, record all your observations in a spreadsheet, analyze them, and define your hypothesis verdict. Again, there is only learning and no PASS or FAIL. Schedule a meeting with the respective stakeholders, including your Avengers team, to discuss your verdict. This will help the team to understand the verdicts and fix the issues that you discovered. You can repeat the experiments after addressing the problems.
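If you want the spreadsheet to be generated rather than typed by hand, the observations can be written as CSV. The sketch below is one possible layout only: the column names and the sample row are hypothetical, and the verdict wording avoids PASS or FAIL in keeping with the text above.

```python
import csv
import io

# Assumed column layout for the observation sheet.
FIELDS = ["hypothesis", "expected", "observed", "verdict", "follow_up"]


def verdict(expected, observed):
    """Summarize whether the hypothesis held; either outcome is a learning."""
    return "held" if expected == observed else "did not hold"


def write_report(rows):
    """Serialize observations as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


row = {
    "hypothesis": "Load balancer reroutes traffic when one node is down",
    "expected": "0 failed requests",
    "observed": "12 failed requests",
    "follow_up": "Review health check interval with the team",
}
row["verdict"] = verdict(row["expected"], row["observed"])
report = write_report([row])
print(report)
```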
If you discover that the system is resilient, you might want to consider enlarging the blast radius and repeating the experiments.
Chaos engineering deliberately exposes a system to disastrous circumstances. Although it may seem like a challenging undertaking, and it does call for a lot of imagination, the extra work is unquestionably worthwhile. You inject failures into your system so that some components of your infrastructure become unavailable. Later, you can simulate conditions such as high latency caused by slow networks, which can upset the steady state.
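High latency can be mimicked in a test environment by delaying calls to a dependency. The decorator below is a minimal sketch under assumptions: `with_latency` and `fetch_profile` are hypothetical names, and a fixed sleep is a crude stand-in for a slow network.

```python
import time
from functools import wraps


def with_latency(delay_seconds):
    """Decorator that delays each call to mimic a slow network hop."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            time.sleep(delay_seconds)  # the injected fault: added latency
            return func(*args, **kwargs)
        return wrapper
    return decorator


@with_latency(0.05)
def fetch_profile(user_id):
    # Hypothetical downstream call; returns canned data for the sketch.
    return {"id": user_id, "name": "example"}


start = time.perf_counter()
profile = fetch_profile(42)
elapsed = time.perf_counter() - start
print(profile, round(elapsed, 3))
```

Wrapping the dependency instead of changing the caller lets you observe how timeouts and retries in the calling code cope with the delay.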
Enterprises building distributed systems should practice chaos engineering as part of their resilience strategy. Running chaos tests continuously is one of several things you can do to improve the resiliency of your applications and infrastructure.