Breaking or Learning to Fix?
‘Chaos Engineering’ is applied to distributed computing systems to ensure they can withstand unexpected disruptions, provided organizations have the maturity to conduct these tests.
The word ‘Chaos’ in ‘Chaos Engineering’ is a misnomer. Contrary to popular belief, the practice requires careful planning, control, and a structured approach to realize the benefits of this testing.
In other words, ‘Chaos Engineering’ is not about haphazard testing to ‘break the system’, but rather it is a structured approach to unraveling the behavior of the system under various experimental conditions. Therefore, you do not ‘break’, but rather you learn to fix the shortcomings.
Structure within ‘Chaos’
Chaos means “complete disorder or confusion”; in the current context, it could be interpreted as random and unpredictable behavior. When running Chaos Tests, however, we form a hypothesis and then plan and run the experiments, so the tests themselves are not truly random, though the outcome can still be unexpected, as the hypothesis can also fail.
Further, not every system requires ‘Chaos Engineering’, and not every organization can practice it, as it requires organizational maturity and resources.
Chaos Engineering Services
The Culture
‘Chaos’ stays structured and contained only if an organization has mature processes and capabilities across:
- Culture of experimentation and learning—psychological safety
- Risk management and governance
- Monitoring and observability
- Incident response and resolution practices and capabilities
Set the baseline: ‘One that can’t be measured can’t be improved’ (Six Sigma)
We establish a baseline of the “current” state and articulate “How the system must operate under normal conditions.” That is, we define what “normal” is.
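As an illustration of what a baseline could look like in practice, here is a minimal sketch that summarizes “normal” as median latency, 95th-percentile latency, and error rate. The `Baseline` type, the `compute_baseline` function, and the choice of these three metrics are assumptions made for the example; the right metrics depend on the system under test.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class Baseline:
    """Hypothetical summary of 'normal' behavior for one service."""
    p50_ms: float
    p95_ms: float
    error_rate: float


def compute_baseline(latencies_ms, error_count, request_count):
    """Summarize steady-state samples collected before any experiment."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two latency samples")
    if request_count <= 0:
        raise ValueError("request_count must be positive")
    # quantiles(..., n=100) returns 99 cut points; index 49 is the 50th
    # percentile and index 94 is the 95th percentile.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return Baseline(
        p50_ms=cuts[49],
        p95_ms=cuts[94],
        error_rate=error_count / request_count,
    )
```

A baseline captured this way gives later experiments a concrete reference point to compare against.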
Form a hypothesis
Think of a test, i.e., define the scope of a test, which must be specific and not too broad or generic. For example, it could be “What will happen if a large traffic spike occurs?” or “What if the IaC (Infrastructure as Code) provisioning fails (at a specific level)?”.
Conduct the test
The experiment could run in pre-production or production, depending on organizational maturity, and it is governed automatically throughout its life through various measures and metrics.
Evaluate the results
Evaluate the metrics during and after the experiment, decide how the hypothesis has fared, and determine the weak points to be strengthened.
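To make the evaluation step concrete, the sketch below expresses a hypothesis as explicit limits and checks observed metrics against them. The `Hypothesis` type, the `evaluate` function, and the two limits (95th-percentile latency and error rate) are assumptions chosen for illustration, not a prescribed format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical, measurable statement of expected behavior."""
    description: str
    max_p95_ms: float
    max_error_rate: float


def evaluate(hypothesis, observed_p95_ms, observed_error_rate):
    """Return a list of violations; an empty list means the hypothesis held."""
    violations = []
    if observed_p95_ms > hypothesis.max_p95_ms:
        violations.append(
            f"p95 latency {observed_p95_ms} ms exceeds {hypothesis.max_p95_ms} ms"
        )
    if observed_error_rate > hypothesis.max_error_rate:
        violations.append(
            f"error rate {observed_error_rate} exceeds {hypothesis.max_error_rate}"
        )
    return violations
```

Each violation points at a weak point to strengthen; a hypothesis that fails is still a useful result.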
Understand the system and practices
Understand and baseline the current capabilities, constraints, dependencies, performance, business impact, etc. Does it make sense to conduct ‘Chaos’ tests?
Establish the organization’s maturity
Assess tools, processes, techniques, automation, culture… to determine the impact of the “blast” and whether it can be contained, managed, and resolved effectively.
Understand and set objectives
What does the organization want to achieve from these ‘Chaos’ experiments? What needs to be discovered or improved upon? Will it be ‘increased availability’, MTTR, MTTD, a few bugs, or reduced supervision…?
Select the test environment
Will it be pre-production or production based on the organizational maturity assessment? Will it be ‘Canary’ or ‘Generic’?…
Articulate the tests and “blast” scope
What kind of tests will be conducted: latency injection, fault injection, load generation…?
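Latency and fault injection can be sketched as a thin wrapper around a call. The example below is a minimal, hypothetical illustration: the `chaos` decorator, the `InjectedFault` exception, and all parameters are names invented here, and real experiments would normally rely on a dedicated tool rather than hand-rolled code. The `failure_rate` bounds how many calls are affected, and `enabled` acts as a kill switch.

```python
import random
import time
from functools import wraps


class InjectedFault(RuntimeError):
    """Raised in place of the real call when a fault is injected."""


def chaos(latency_s=0.0, failure_rate=0.0, enabled=lambda: True,
          rng=None, sleep=time.sleep):
    """Wrap a function with optional added latency and random failures.

    failure_rate is a probability in [0, 1]; enabled() is checked on every
    call so the experiment can be switched off at any time.
    """
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if enabled():
                if latency_s > 0:
                    sleep(latency_s)  # latency injection
                if rng.random() < failure_rate:
                    raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

Passing a seeded `random.Random` as `rng` makes a run repeatable, which helps when comparing experiments.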
Establish measures and metrics
Tracking the progress of the experiment and its impact on the system, corrective actions and their impact, and data collection for future experiments and the baseline…
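Two of the metrics commonly tracked, MTTD and MTTR, can be computed from incident timestamps. The sketch below assumes MTTD is measured from fault start to detection and MTTR from detection to resolution; definitions vary between teams, so treat both the `Incident` type and these formulas as assumptions to adapt.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    """Hypothetical record of one incident observed during an experiment."""
    started: datetime
    detected: datetime
    resolved: datetime


def mttd(incidents):
    """Mean time to detect: average of (detected - started)."""
    seconds = [(i.detected - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mttr(incidents):
    """Mean time to resolve: average of (resolved - detected)."""
    seconds = [(i.resolved - i.detected).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))
```

Both functions raise `statistics.StatisticsError` on an empty list, which is a reasonable signal that no incidents were recorded.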
Incident response plan
How to contain an incident that occurs during the experiment: containment, corrective actions, rollback…
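One way to build containment and rollback into the experiment itself is an automatic abort condition with a guaranteed rollback. The sketch below is hypothetical: `run_experiment` and its callback parameters (`inject`, `rollback`, `read_error_rate`) are invented for illustration, and a real setup would also add waiting between checks and alerting.

```python
def run_experiment(inject, rollback, read_error_rate,
                   abort_threshold, max_checks):
    """Inject a fault, watch one metric, and always roll back.

    Returns "aborted" if the error rate crosses abort_threshold during any
    check, otherwise "completed". rollback() should be safe to call even if
    inject() only partially succeeded.
    """
    try:
        inject()
        for _ in range(max_checks):
            if read_error_rate() > abort_threshold:
                return "aborted"  # containment: stop as soon as the limit is hit
        return "completed"
    finally:
        rollback()  # runs on abort, on completion, and on unexpected errors
```

Because the rollback sits in a `finally` clause, it also runs when the experiment code itself raises an exception.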
Evaluate results
Generate insights from the collected data to plug the weaknesses.
Think of a system in terms of well-coordinated but demarcated verticals or functions
Essentially, a component interaction map to design and plan the experiments better.
These verticals or functions are further detailed during planning, e.g., infrastructure will have CPU, Memory, Storage, Load Balancers, etc.
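A component interaction map can be kept as a simple dependency graph and queried during planning. In the sketch below, the `DEPENDS_ON` map is purely hypothetical (the edges are invented for illustration), and `blast_radius` is an assumed helper that lists every component that directly or indirectly depends on the one being disrupted.

```python
from collections import deque

# Hypothetical map: each component lists the components it depends on.
DEPENDS_ON = {
    "traffic": ["application", "network"],
    "application": ["database", "data_streaming", "network"],
    "database": ["storage", "network"],
    "data_streaming": ["storage", "network"],
    "storage": ["infrastructure"],
    "network": ["infrastructure"],
    "infrastructure": [],
}


def blast_radius(depends_on, target):
    """Return all components that transitively depend on target."""
    # Invert the map: for each component, who depends on it?
    dependents = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    # Breadth-first walk over the inverted edges, starting at the target.
    affected = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for component in dependents.get(current, ()):
            if component != target and component not in affected:
                affected.add(component)
                queue.append(component)
    return affected
```

For the map above, disrupting `"storage"` would reach the database, data streaming, application, and traffic components, while disrupting `"traffic"` would reach nothing else.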
Consulting services
- Need for Chaos Engineering
- Organizational maturity assessment
- Objectives, measures, and metrics formulation
- Tools, fitment, and training
Chaos Engineering as a service
- Infrastructure team
- Network team
- Traffic team
- Data Streaming team
- Storage team
- Database team
- Application team