Chaos Engineering — Attack, Observe, Improve

Walkthrough and demo of the Gremlin Chaos Engineering platform

Gary Parker
FAUN — Developer Community 🐾

--

Chaos Engineering is the science of performing intentional experimentation on a system by injecting precise and measured amounts of stress and observing how the system responds, with the aim of understanding and improving the system’s resilience.

Intro

I’ll be walking through some of the features provided by Gremlin — if you’re not familiar with Chaos Engineering, the Gremlin team have provided a very detailed glossary of all the important terminology.

These are a few example scenarios:

  • If we put the system under heavy load, does it auto-scale as expected?
  • If we shut down a host or service, does the system respond gracefully?
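To make the idea of an experiment concrete, here is a minimal sketch of the second scenario in Python. It is not Gremlin code: `Experiment`, `run` and `FakeCluster` are hypothetical names I made up for illustration, and the "cluster" is just an in-memory counter. The point is the shape of the experiment: check the system is healthy, inject the failure, then check again.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    hypothesis: str
    inject: Callable[[], None]        # applies the failure (hypothetical hook)
    steady_state: Callable[[], bool]  # True while the system looks healthy


def run(experiment: Experiment) -> bool:
    """Return True if the system is still healthy after the failure is injected."""
    if not experiment.steady_state():
        # Never attack a system that is already struggling.
        raise RuntimeError("system is unhealthy before the attack; aborting")
    experiment.inject()
    return experiment.steady_state()


class FakeCluster:
    """Toy stand-in for real infrastructure: healthy while at least 2 hosts are up."""

    def __init__(self, hosts: int = 3) -> None:
        self.hosts = hosts

    def shutdown_host(self) -> None:
        self.hosts -= 1

    def healthy(self) -> bool:
        return self.hosts >= 2


cluster = FakeCluster(hosts=3)
experiment = Experiment(
    hypothesis="Losing one host does not take the service down",
    inject=cluster.shutdown_host,
    steady_state=cluster.healthy,
)
print(run(experiment))  # first host lost: 2 hosts remain, hypothesis holds
```

In a real setup the `inject` step would be the attack launched from the chaos tool, and `steady_state` would query your monitoring rather than a counter.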

Prerequisites

  • Create a Gremlin account — their free tier is great and provides everything needed to try things out
  • Create an AWS account (if you’re new to this, I’d advise AWS, as it’s really easy to set up and we’ll be using their free offering)

There is also support for Docker, Kubernetes, Ubuntu, Windows etc., if you prefer to work with a different platform — full guides can be found in the Gremlin docs.

Getting started

I followed this guide and installed Gremlin in my AWS environment.

Let’s attack!

First we have to choose whether we are attacking services or infrastructure — we will be selecting infrastructure, and choosing our host from AWS.

Gremlin provides quite a few different categories of attack:

Resource

  • CPU / Disk / IO / Memory

State

  • Process Killer / Shutdown / Time travel

Network

  • Blackhole / DNS / Latency / Packet loss

We will be carrying out a resource attack against the host’s CPU.
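If you’re wondering what a CPU attack boils down to, the rough idea is a process that keeps one or more cores busy for a fixed amount of time. The sketch below is my own illustration of that idea, not how Gremlin implements it; `burn_cpu` and `burn_cores` are hypothetical names.

```python
import multiprocessing
import time


def burn_cpu(seconds: float) -> int:
    """Spin in a tight loop for roughly `seconds`; return the number of iterations."""
    deadline = time.perf_counter() + seconds
    iterations = 0
    while time.perf_counter() < deadline:
        iterations += 1
    return iterations


def burn_cores(seconds: float, cores: int) -> None:
    """Start one busy-loop process per requested core and wait for them to finish."""
    workers = [
        multiprocessing.Process(target=burn_cpu, args=(seconds,))
        for _ in range(cores)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
```

Calling `burn_cores(10, 2)` from a script (inside an `if __name__ == "__main__":` guard) should show two Python processes close to 100% CPU for about ten seconds. Only try this on a machine you are happy to slow down.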

Demo

On the left side we have the Gremlin interface — we will be targeting the system, selecting the attack details and monitoring the attack.

On the right side we have the AWS instance being attacked — we are monitoring the active processes, where you’ll be able to see the Gremlin command maxing out the CPU.
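In the demo I simply watched the process list, but the same observation can be scripted. The sketch below assumes a Linux host, where the first line of `/proc/stat` holds cumulative CPU tick counters; it compares two samples of that line and returns the fraction of time the CPU was busy in between. The function names are my own.

```python
from typing import Tuple


def parse_cpu_line(line: str) -> Tuple[int, int]:
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Columns: user nice system idle iowait irq softirq steal.
    # Guest columns are skipped because they are already counted in user/nice.
    ticks = [int(value) for value in fields[1:9]]
    idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)  # idle + iowait
    return idle, sum(ticks)


def busy_fraction(before: str, after: str) -> float:
    """Fraction of CPU time spent busy between two samples (0.0 to 1.0)."""
    idle_before, total_before = parse_cpu_line(before)
    idle_after, total_after = parse_cpu_line(after)
    total_delta = total_after - total_before
    if total_delta <= 0:
        return 0.0
    return 1.0 - (idle_after - idle_before) / total_delta


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Read the aggregate CPU line (Linux only)."""
    with open(path) as stat_file:
        return stat_file.readline()
```

Take one sample with `read_cpu_line()` before the attack starts and another while it is running; during a full CPU attack `busy_fraction` should come out close to 1.0.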

What’s next

This snippet is taken from the Gremlin website, and concisely summarises the approach we can take going forward:

Creating a scenario

Gremlin provides a lot of scenario templates for you to experiment with — you’ll just need to provide the target and run the scenario.

This is an example of a shutdown attack that I was able to run:

And I was prompted to provide details on the outcome and status of the run.
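If you want to keep your own log of runs alongside what the platform stores, a small record type is enough. This is only a sketch: the `Outcome` values and the `RunRecord` fields are my assumptions, not the options Gremlin presents.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Dict, List


class Outcome(Enum):
    # Hypothetical outcome labels, chosen for illustration only.
    PASSED = "passed"
    FAILED = "failed"
    INCOMPLETE = "incomplete"


@dataclass
class RunRecord:
    scenario: str
    outcome: Outcome
    notes: str = ""
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def summarise(records: List[RunRecord]) -> Dict[str, int]:
    """Count how many runs ended in each outcome."""
    counts = Counter(record.outcome for record in records)
    return {outcome.value: counts.get(outcome, 0) for outcome in Outcome}


log = [
    RunRecord("shutdown", Outcome.PASSED, notes="host came back as expected"),
    RunRecord("cpu", Outcome.PASSED),
]
print(summarise(log))
```

Writing down what you expected next to what actually happened is the part that turns an attack into an experiment.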

Dashboard Overview

Gremlin makes it really easy to execute scenarios and individual attacks — as you can see below, I ran 2 successful scenarios, and also carried out numerous resource attacks on the instance.

Conclusion

This was my first time using the Gremlin platform — overall it was a great user experience, the documentation was really helpful, and I was able to get working examples up and running very quickly.

There is a lot offered by the platform, and I was only able to test out some of the basic features — there is a lot of room for customisation, scheduling and integration with other platforms. Definitely worth checking out!

--

Senior QA Architect, responsible for QA Architecture, tooling, frameworks and processes. Specializing in front-end web and mobile technologies.