Chaos Engineering


Breaking or Learning to Fix?

‘Chaos Engineering’ is applied to distributed computing systems to ensure they can withstand unexpected disruptions, provided organizations have the maturity to conduct these tests.

The word ‘Chaos’ in ‘Chaos Engineering’ is a misnomer. Contrary to popular belief, it requires a lot of planning, control, and a structured approach to realize the benefits of this testing.

In other words, ‘Chaos Engineering’ is not about haphazard testing to ‘break the system’, but rather it is a structured approach to unraveling the behavior of the system under various experimental conditions. Therefore, you do not ‘break’, but rather you learn to fix the shortcomings.

Structure within ‘Chaos’

Chaos means “complete disorder or confusion”; in the current context, it could be interpreted as random and unpredictable behavior. When running Chaos tests, however, we form a hypothesis and then plan and run the experiments, so the tests themselves are not truly random, though the outcome may be unexpected because the hypothesis can fail.

Further, not every system requires ‘Chaos Engineering’, and not every organization can practice it, as it requires organizational maturity and resources.

Chaos Engineering Services

The Culture

‘Chaos’ stays structured and contained only if an organization has mature processes and capabilities across:

  1. Culture of experimentation and learning, with psychological safety
  2. Risk management and governance
  3. Monitoring and observability
  4. Incident response and resolution practices and capabilities
Set the baseline: ‘What can’t be measured can’t be improved’ (Six Sigma)

We establish a baseline of the “current” state and articulate “How the system must operate under normal conditions.” That is, we are defining “What is ‘normal’?”
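
As a minimal sketch of what “defining normal” can look like, the snippet below summarizes latency samples and error counts collected under normal load. The `Baseline` and `compute_baseline` names, and the choice of p50, p95, and error rate as the baseline metrics, are illustrative assumptions rather than a prescribed set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Baseline:
    p50_ms: float
    p95_ms: float
    error_rate: float


def compute_baseline(latencies_ms, error_count, request_count):
    """Summarize steady-state behavior from samples taken under normal load."""
    if not latencies_ms or request_count <= 0:
        raise ValueError("need at least one latency sample and one request")
    ordered = sorted(latencies_ms)

    def percentile(p):
        # Nearest-rank percentile using integer ceiling division.
        rank = max(1, (p * len(ordered) + 99) // 100)
        return ordered[rank - 1]

    return Baseline(
        p50_ms=percentile(50),
        p95_ms=percentile(95),
        error_rate=error_count / request_count,
    )
```

In practice these numbers would come from the monitoring stack over a representative time window; the function only shows the shape of the summary.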

Form a hypothesis

Think of a test, i.e., define the scope of a test, which must be specific and not too broad or generic. For example, it could be “What will happen if a large traffic spike occurs?” or “What if the IaC provisioning fails (at a specific level)?”
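
A hypothesis becomes easier to test once it names one metric and one limit. The sketch below is one hypothetical way to write that down; the `Hypothesis` class, the metric name `p95_ms`, and the 500 ms limit are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    description: str
    metric: str
    max_allowed: float

    def holds(self, observed: dict) -> bool:
        """True when the observed metric stays within the stated limit."""
        return observed[self.metric] <= self.max_allowed


# Illustrative values only: a specific, narrowly scoped statement.
spike = Hypothesis(
    description="During a large traffic spike, p95 latency stays under 500 ms",
    metric="p95_ms",
    max_allowed=500.0,
)
```

A broad statement such as “the system is resilient” cannot be expressed this way, which is a useful check that the scope is specific enough.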

Conduct the test

The experiment can run in pre-production or production, depending on organizational maturity, and is governed automatically throughout its life by various measures and metrics.
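
One way to picture automatic governance is a guardrail that polls a metric while the fault is active and stops the experiment when a threshold is breached. The sketch below assumes hypothetical `inject`, `rollback`, and `read_error_rate` callables supplied by the caller; a real runner would also wait between polls.

```python
def run_experiment(inject, rollback, read_error_rate, abort_threshold, checks):
    """Inject a fault, poll a metric, and stop early if the guardrail is breached.

    Returns (aborted, samples), where samples are the readings taken.
    """
    samples = []
    aborted = False
    try:
        inject()
        for _ in range(checks):
            rate = read_error_rate()
            samples.append(rate)
            if rate > abort_threshold:
                aborted = True  # guardrail breached: stop polling
                break
    finally:
        # Always restore the system, whether the run finished,
        # aborted, or raised an exception.
        rollback()
    return aborted, samples
```

Keeping the rollback in a `finally` block means the containment step does not depend on the experiment ending cleanly.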

Evaluate the results

Evaluate the metrics during and after the experiment, decide how the hypothesis has fared, and determine the weak points to be strengthened.
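
As a rough sketch of this evaluation step, the function below compares metrics observed during the experiment against baseline values and reports those that degraded beyond a tolerance. The function name, the 10% default tolerance, and the assumption that every metric is “higher is worse” are illustrative choices.

```python
def find_regressions(baseline, observed, tolerance=0.10):
    """Return metrics whose observed value exceeds baseline by more than `tolerance`.

    Assumes every metric is "higher is worse" (latency, error rate, ...).
    The result maps metric name -> (baseline value, observed value).
    """
    regressions = {}
    for name, base in baseline.items():
        value = observed.get(name)
        if value is None:
            continue  # metric was not collected during the experiment
        if value > base * (1 + tolerance):
            regressions[name] = (base, value)
    return regressions
```

The returned metrics point at the weak spots worth strengthening first.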

Understand the system and practices

Understand and baseline the current capabilities, constraints, dependencies, performance, business impact, etc. Does it make sense to conduct ‘Chaos’ tests?

Establish the organization’s maturity

Assess tools, processes, techniques, automation, culture… to determine the impact of the “blast” and whether it can be contained, managed, and resolved effectively.

Understand and set objectives

What does the organization want to achieve from these ‘Chaos’ experiments? What needs to be discovered or improved upon? Will it be ‘increased availability’, MTTR, MTTD, a few bugs, or reduced supervision…?
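
If MTTD and MTTR are among the objectives, it helps to agree on how they are computed. The sketch below assumes MTTD is measured from failure start to detection and MTTR from failure start to resolution; definitions vary between organizations, and the `Incident` record is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    started: datetime
    detected: datetime
    resolved: datetime


def _mean(deltas):
    if not deltas:
        raise ValueError("need at least one incident")
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    """Mean time to detect: failure start -> detection (assumed definition)."""
    return _mean([i.detected - i.started for i in incidents])


def mttr(incidents):
    """Mean time to resolve: failure start -> resolution (assumed definition)."""
    return _mean([i.resolved - i.started for i in incidents])
```

Computing the same figures before and after a round of experiments gives a direct measure of whether the objective was met.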

Select the test

Will it be pre-production or production based on the organizational maturity assessment? Will it be ‘Canary’ or ‘Generic’?…

Articulate the tests and “blast” scope

What kind of tests will be conducted: latency injection, fault injection, load generation…?
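
To make latency and fault injection concrete, here is a minimal in-process sketch: a decorator that delays a call and fails it with a given probability. The `chaos` decorator and `InjectedFault` exception are hypothetical names; dedicated tooling usually injects faults at the network or infrastructure level instead.

```python
import functools
import random
import time


class InjectedFault(RuntimeError):
    """Raised when the wrapper deliberately fails a call."""


def chaos(latency_s=0.0, failure_rate=0.0, rng=None):
    """Wrap a function with added latency and probabilistic failures."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                time.sleep(latency_s)  # latency injection
            if rng.random() < failure_rate:  # fault injection
                raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator
```

Passing a seeded `random.Random` as `rng` makes a run repeatable, which helps when comparing results across experiments.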

Establish measures and metrics

Tracking the progress of the experiment and its impact on the system, corrective actions and their impact, and data collection for future experiments and baselines…

Incident response plan

How to contain an incident that occurs during the experiment: containment, corrective actions, rollback…

Evaluate results

Generate insights from the collected data to plug the weaknesses.

Think of a system in terms of well-coordinated but demarcated verticals or functions


Essentially, a component interaction map to design and plan the experiments better.

These verticals or functions are further detailed during planning, e.g., infrastructure will have CPU, Memory, Storage, Load Balancers, etc.
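
A component interaction map can be sketched as a small dependency graph, which also gives a first estimate of the “blast” scope of a failure. The component names and edges below are hypothetical examples, not a description of any particular system.

```python
from collections import deque

# Hypothetical map: each key lists the components that depend on it.
DEPENDENTS = {
    "storage": ["database"],
    "database": ["application"],
    "network": ["load_balancer"],
    "load_balancer": ["application"],
    "application": [],
}


def blast_radius(dependents, failed):
    """Breadth-first walk over everything that directly or
    transitively depends on the failed component."""
    seen = {failed}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for nxt in dependents.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(failed)
    return seen
```

With this map, a storage failure would be expected to reach the database and the application, so an experiment targeting storage should monitor both.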

Consulting services
  1. Need for Chaos Engineering
  2. Organizational maturity assessment
  3. Objectives, measures, and metrics formulation
  4. Tools, fitment, and training
Chaos Engineering as a service
  1. Infrastructure team
  2. Network team
  3. Traffic team
  4. Data Streaming team
  5. Storage team
  6. Database team
  7. Application team

As no two businesses and their maturity levels are the same, we tailor our approach to your unique needs. Our expertise in technology and domains provides an edge over generalists. Trust our established practices to deliver complete and robust ‘Chaos’ evaluations. Partner with us on your journey to maturity.