Chaos Engineering — Attack, Observe, Improve
Walkthrough and demo of the Gremlin Chaos Engineering platform
The science of performing intentional experimentation on a system by injecting precise and measured amounts of stress to observe how the system responds for the purpose of understanding and improving the system’s resilience.
Intro.
I’ll be walking through some of the features provided by Gremlin — if you’re not familiar with Chaos Engineering, the Gremlin team have provided a very detailed glossary of all the important terminology.
These are a few example questions that a chaos experiment can answer:
- If we put the system under heavy load, does it auto-scale as expected?
- If we shut down a host or service, does the system respond gracefully?
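Each of these questions can be framed as a small experiment: check that the system is healthy, inject a fault, check again, then undo the fault. The following is a minimal Python sketch of that loop against a toy in-memory "system". It is not Gremlin's API; `run_experiment`, `ExperimentResult` and the helper functions are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    hypothesis: str
    steady_before: bool   # was the system healthy before the fault?
    steady_during: bool   # did it stay healthy while the fault was active?

    @property
    def passed(self) -> bool:
        return self.steady_before and self.steady_during


def run_experiment(
    hypothesis: str,
    is_healthy: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Check steady state, inject a fault, check again, then always roll back."""
    if not is_healthy():
        # Never attack a system that is already unhealthy.
        return ExperimentResult(hypothesis, False, False)
    inject()
    try:
        steady_during = is_healthy()
    finally:
        rollback()
    return ExperimentResult(hypothesis, True, steady_during)


# Toy stand-in for a service with two replicas (an assumption for the demo).
system = {"replicas": 2}


def service_is_up() -> bool:
    return system["replicas"] >= 1


def shut_down_one_host() -> None:
    system["replicas"] -= 1


def restore_hosts() -> None:
    system["replicas"] = 2


result = run_experiment(
    "The service stays up when one host is shut down",
    service_is_up,
    shut_down_one_host,
    restore_hosts,
)
print(result.hypothesis, "->", "passed" if result.passed else "failed")
```

In a real experiment the health check would query your monitoring, and the inject and rollback steps would be handled by the chaos tooling rather than by local functions.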
Prerequisites
- Create a Gremlin account — their free tier is great and provides everything needed to try things out
- Create an AWS account (if you’re new to this, I’d advise AWS: it’s really easy to set up, and we’ll be using their free offering)
There is also support for Docker, Kubernetes, Ubuntu, Windows etc., if you prefer to work with a different platform — full guides can be found in the Gremlin docs.
Getting started
- Install Gremlin on your environment with one of the quick start guides
I followed this guide and installed Gremlin in my AWS environment.
Let’s attack!
First we have to choose whether we are attacking services or infrastructure — we will be selecting infrastructure, and choosing our host from AWS.
Gremlin provides quite a few different categories of attack:
Resource
- CPU / Disk / IO / Memory
State
- Process Killer / Shutdown / Time travel
Network
- Blackhole / DNS / Latency / Packet loss
We will be carrying out a resource attack against the host’s CPU.
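To get a feel for what a CPU attack does to a host, here is a rough Python sketch that keeps a single core busy for a fixed duration. It only illustrates the idea of a bounded, measured amount of stress; it is not how Gremlin implements its attack, and `burn_cpu` is a hypothetical helper name.

```python
import time


def burn_cpu(seconds: float) -> int:
    """Spin in a tight loop for `seconds`, keeping one core busy.

    Returns the number of loop iterations, which varies by machine.
    """
    deadline = time.perf_counter() + seconds
    iterations = 0
    while time.perf_counter() < deadline:
        iterations += 1
    return iterations


# Short demo run; a real attack would typically last much longer.
spun = burn_cpu(0.2)
print(f"completed {spun} iterations in roughly 0.2 seconds")
```

The fixed deadline is the important part: the stress always stops on its own, which is what keeps the experiment controlled.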
Demo
On the left side we have the Gremlin interface — we will be targeting the system, selecting the attack details and monitoring the attack.
On the right side we have the AWS instance being attacked — we are monitoring the active processes, where you’ll be able to see the Gremlin command maxing out the CPU.
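If you would rather sample the load from a script than watch a process list, something like the following could work. It is a minimal sketch using only Python's standard library: `load_per_core` and `looks_saturated` are hypothetical names, the 0.9 threshold is an arbitrary assumption, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os


def load_per_core() -> float:
    """Return the 1-minute load average divided by the number of CPU cores."""
    one_minute, _five, _fifteen = os.getloadavg()  # Unix-like systems only
    cores = os.cpu_count() or 1
    return one_minute / cores


def looks_saturated(load: float, threshold: float = 0.9) -> bool:
    """Treat the host as saturated once load per core reaches the threshold."""
    return load >= threshold


current = load_per_core()
print(f"load per core: {current:.2f}, saturated: {looks_saturated(current)}")
```

The 1-minute load average is smoothed over time, so a very short attack will barely move it; an attack needs to run for a minute or so before it shows up clearly here.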
What’s next
This snippet is taken from the Gremlin website, and concisely summarises the approach we can take going forward:
Creating a scenario
Gremlin provides a lot of scenario templates for you to experiment with — you’ll just need to provide the target and run the scenario.
This is an example of a shutdown attack that I was able to run:
And I was prompted to provide details on the outcome and status of the run.
Dashboard Overview
Gremlin makes it really easy to execute scenarios and individual attacks. As you can see below, I ran two successful scenarios and also carried out numerous resource attacks on the instance.
Conclusion
This was my first time using the Gremlin platform — overall it was a great user experience, the documentation was really helpful, and I was able to get working examples up and running very quickly.
There is a lot offered by the platform, and I was only able to test out some of the basic features — there is a lot of room for customisation, scheduling and integration with other platforms. Definitely worth checking out!