What a big topic! My goal here is just to introduce the Kubernetes concepts needed to support a specific testing activity. In particular, the activity we’re trying to get to is a fully automatable, cloud-agnostic chaos testing framework.
The tool we’ll use for chaos testing is called Litmus (https://github.com/litmuschaos/litmus), and it runs inside this Kubernetes thing. We’ll talk more about that in a minute.
First of all, why do we want to do chaos testing at all?
Chaos testing helps you find out how your application responds to infrastructure events. That isn’t a big deal for a monolithic app, where all the state probably lives on one server. But if your app is a suite of microservices spread across a bunch of servers, and those microservices need the network to talk to each other, then you’ll want to make sure the app comes back after a network interruption, right?
Or maybe you want to make sure your app keeps working through other kinds of infrastructure events, too.
Litmus can help make sure of that. The way Litmus helps is by injecting chaos: it triggers infrastructure events deliberately and reports back on how Kubernetes handles them. The net effect is that you have event logs for an infrastructure event that you can correlate with application logs to find out how the app behaved during that particular event. Say, for example, my app reads from a MinIO object-store Deployment, and the chaos I inject is to kill the pod associated with that MinIO Deployment; I can see when Litmus killed the pod, when the pod went away, and what my app did when it couldn’t get to the object store, all in my Kubernetes logs. I don’t always get accurate information about the type and onset of real-world infrastructure events, but with Litmus, I can see the whole picture.
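To make that concrete, both sides of that story are a kubectl call away once an experiment has run. The namespace and deployment names here are just placeholders for whatever your app actually uses:

# Infrastructure side: the cluster's event timeline, sorted by time
kubectl get events -n backup --sort-by=.lastTimestamp

# Application side: what my (hypothetical) app logged while the object store was gone
kubectl logs deployment/my-backup-app -n backup --since=10m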
Sounds cool, right? There are lots of ways to chaos test your app, but if you’re already running your app in Kubernetes (and maybe you devs have an app you’re trying to do that with right now?), then Litmus is your tool. But let’s come back to Kubernetes: what is it, and how does it work?
There’s no way a blog post can sum up Kubernetes. It is, fundamentally, a container orchestration suite that you can use to find a server to run your containers on. To help it decide where to put the containers, it includes a consistent, declarative language for describing what the app components need. Kubernetes provides a nice abstraction of the compute infrastructure it runs on, and developers can use the same language to describe their app’s infrastructure requirements. We’ll walk through some components and architecture in a bit, but first, a little theory.
Kubernetes is an orchestrator for Cloud Native applications that are designed as a suite of microservices, all of which run in containers. Kubernetes reads the description of the app that the developer writes. It also collects info about the current state of the infrastructure. It uses all of this info to make changes to the infrastructure, and when the infrastructure looks the same as the developer’s specification, we’re done. Well, we’re done until something changes.
What’s going to change? Either the application or the infrastructure. The app might change because people or test programs are connecting to it, so the application might be allocating more memory, asking databases for info, processing that info, and logging about those connections. The infrastructure might change because people or infrastructure management programs are adding, changing, or removing nodes from the cluster, so maybe Kubernetes is spinning up app components, connecting those components to load balancers, and then decommissioning other app components on the nodes that are changing. Kubernetes keeps track of all this change and continually makes sure the current state looks like the configured state. The software components that make these changes are called controllers. Controllers watch for changes in Kubernetes resources and change the infrastructure to get the current and configured states to match. A far better description of how all of this works can be found here: https://kubernetes.io/docs/concepts/architecture/controller/. For this document, we’ll call this pattern of watching for changes and modifying the infrastructure in response the Reconciliation Loop.
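Here’s a minimal sketch of that Reconciliation Loop in action; the nginx Deployment below is just a throwaway example, not part of the Litmus setup:

# reconcile-demo.yaml -- the declared (configured) state: two replicas of nginx
apiVersion: apps/v1
kind: Deployment
metadata:
  name: reconcile-demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: reconcile-demo
  template:
    metadata:
      labels:
        app: reconcile-demo
    spec:
      containers:
      - name: web
        image: nginx:1.17

# Apply the declared state, then delete the pods out from under it and watch
# the Deployment controller notice the mismatch and recreate them.
kubectl apply -f reconcile-demo.yaml
kubectl delete pod -l app=reconcile-demo --wait=false
kubectl get pods -l app=reconcile-demo --watch

That’s the whole idea: I never tell Kubernetes to start a replacement pod; I only declare that two replicas should exist, and the controller does the rest.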
Kubernetes is pretty complicated. Many components are working in concert to accomplish all of the above. In the next section, we’re going to race through some of the jargon associated with Kubernetes. Hang on!
Kubernetes was initially designed for hosting stateless applications. For stateless apps, the app gets everything it needs for its output in its input. For stateful apps, the app remembers stuff and then uses that in conjunction with inputs to generate its output. The app is said to “keep state” and use state during its processing. So what to do with the application state?
Put it in the cloud, of course! The cloud providers all have services designed for you to keep state for your 12-factor app. The app runs in Kubernetes, it keeps its state in proprietary solutions like RDS and S3, and then you use an ELB to talk to the front end. Done. Or, you can keep the state in Kubernetes itself using tools like OpenEBS to manage your storage; this way, your databases work like any other app you deploy.
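For example, keeping the object store’s data in-cluster might look like a PersistentVolumeClaim backed by an OpenEBS storage class. The claim name here is hypothetical, and the storage class assumes the OpenEBS local hostpath provisioner is installed; use whatever class your cluster actually offers:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: minio-data
  namespace: backup
spec:
  storageClassName: openebs-hostpath  # assumes OpenEBS LocalPV hostpath is installed
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

The MinIO Deployment then mounts that claim like any other volume, so the object store is just another workload Kubernetes schedules, and, later, one that Litmus can target.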
One advantage of putting it in Kubernetes is that you then get to find out what happens when the app’s connection to the storage is interrupted. Since you control all of the infrastructure through Kubernetes itself, you can use tools like Litmus, which work within Kubernetes, to simulate an outage. Outages happen. What’s important is how your application responds to the outage. Does it come back up on its own? After how long? With Litmus, you know exactly when you’ve injected the chaos, and you can observe the results in a controlled environment. If you just put all of your stateful components in cloud services, you can’t necessarily simulate outages the same way, and you’ll have to rely on your cloud provider’s info to correlate storage events with application behaviors.
This is one compelling reason for keeping it all inside Kubernetes (storage, stateful apps, and stateless apps, along with networking components): it’s easy to correlate event logs. Kubernetes also provides a nice, consistent user interface for developers (as opposed to internal docs for internal infra management tools, which are generally harder for new engineers to find and digest). Since Litmus is implemented as a Controller (we talked about controllers, right?) with its own Custom Resource Definitions (CRDs) and takes advantage of all of this built-in observability and usability, it’s pretty easy to fit chaos testing into your existing k8s deployment workflows.
So,
kubectl apply -f https://litmuschaos.github.io/pages/litmus-operator-v1.1.0.yaml
should get you started with Litmus, and you can see it running with:
kubectl get pods -n litmus
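The operator manifest also registers Litmus’s Custom Resource Definitions (ChaosEngine, ChaosExperiment, and ChaosResult), which you can sanity-check with:

kubectl get crds | grep litmuschaos.io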
https://docs.openebs.io/docs/next/chaostest.html has a full walkthrough, but I’ll summarize here:
For the ServiceAccount (SA), Role, and RoleBinding, I’ll include a working sample from our documentation. RBAC in Kubernetes is a whole other topic.
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-sa
  namespace: backup
  labels:
    name: chaos-sa
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: Role
metadata:
  name: chaos-sa
  namespace: backup
  labels:
    name: chaos-sa
rules:
- apiGroups: ["","litmuschaos.io","batch","apps"]
  resources: ["pods","jobs","daemonsets","pods/exec","chaosengines","chaosexperiments","chaosresults"]
  verbs: ["create","list","get","patch","update","delete"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  name: chaos-sa
  namespace: backup
  labels:
    name: chaos-sa
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: chaos-sa
subjects:
- kind: ServiceAccount
  name: chaos-sa
  namespace: backup
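Assuming the three manifests above are saved together as chaos-rbac.yaml (the filename is just for illustration), they go in like any other Kubernetes resource:

kubectl apply -f chaos-rbac.yaml
kubectl get serviceaccount,role,rolebinding -n backup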
Applying this authorizes Litmus to manipulate the app’s resources. We can tell Litmus what to do to the app with a ChaosEngine that runs a chaos experiment, like this one:
---
# chaosengine.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: minio-chaos
  namespace: backup
spec:
  annotationCheck: 'false'
  components:
    runner:
      image: 'litmuschaos/chaos-executor:1.0.0'
      type: 'go'
  jobCleanUpPolicy: 'delete'
  monitoring: false
  appinfo:
    appns: 'backup'
    applabel: 'app=minio'
    appkind: 'deployment'
  chaosServiceAccount: chaos-sa
  experiments:
    - name: pod-delete
      spec:
        components:
          - name: TARGET_CONTAINER
            value: minio-stretched
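Assuming that’s saved as chaosengine.yaml (and that the pod-delete ChaosExperiment from the walkthrough above is already installed in the namespace), kicking it off is one more apply, and watching the pods shows the chaos unfold:

kubectl apply -f chaosengine.yaml
kubectl get pods -n backup --watch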
When we apply this, sure enough, after a couple of minutes, the pod associated with the deployment called “minio-stretched” goes away, and Kubernetes dutifully creates another one to replace it.
Now, I can fire up Kibana or what have you, and look at how the app responds during the object store outage. I know exactly when it went down and came back up. There are loads of tools to help find out what happened to my application. Once I fix any bugs related to that particular type of event, I’ve got a ready-made regression test for the change. When I integrate my component with everyone else’s, I can find out how my application’s failure and recovery impact other components. And when it’s all running in my prod cluster, my SREs can use the same test to validate their alerting infrastructure and SLA requirements.
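Litmus also records its own verdict in a ChaosResult resource, which is handy to read alongside the app logs; by convention the result is named after the engine and the experiment, so here it should look something like:

kubectl describe chaosresult minio-chaos-pod-delete -n backup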
Thanks all for taking this journey with me. If this sounds like something you want to check out, head over to the Litmus Chaos Hub and then meet us in #Litmus.
Cheers,
Brian