75 closed PRs from 20+ new contributors, more than 100 new stargazers, and some incredible new adopters: this is just part of the story of the LitmusChaos 0.8 release, aided in no small part by the annual Hacktoberfest, which has been at the forefront of our calendars over the last month or so. From college students to experienced SREs, and from documentation updates to new chaos utils, the contributions were incredible, and we are sincerely thankful! While community engagement was a primary focus, we also managed to introduce new experiments to the ChaosHub, provide more context around experiments via enhanced ChaosExperiment specs, provide greater execution control via the ChaosEngine, and sow the seeds for some critical enhancements. Now, let’s talk some more about what has gone into Litmus 0.8!
Experiment Job Cleanup Policy
With 0.8, users can choose to automatically clean up chaos experiment jobs upon completion via a ChaosEngine tunable (especially useful if dedicated logging infrastructure such as EFK or Loki is present), thereby reducing the burden on etcd that results from repeated executions.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: chaos
  namespace: default
spec:
  monitoring: true
  jobCleanUpPolicy: delete
  appinfo:
    appkind: deployment
    applabel: app=hello
    appns: default
  chaosServiceAccount: nginx
  experiments:
    - name: pod-delete
      spec:
        components:
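For comparison, a cluster without dedicated log aggregation may prefer to keep completed jobs around so that their pod logs remain inspectable. The fragment below is a sketch of that variant. It assumes `retain` is the accepted counterpart to `delete` for this tunable, which is worth confirming against the ChaosEngine CRD shipped with your Litmus version.

```yaml
# Sketch only: assumes jobCleanUpPolicy accepts "retain" as well as "delete".
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: chaos
  namespace: default
spec:
  monitoring: true
  # Keep the completed experiment job (and its pod logs) instead of deleting it.
  jobCleanUpPolicy: retain
  appinfo:
    appkind: deployment
    applabel: app=hello
    appns: default
  chaosServiceAccount: nginx
  experiments:
    - name: pod-delete
```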
While the previous release featured the ability to override experiment defaults via the ChaosEngine, we identified a need to control the version of the executor that runs the experiment itself. This was implemented to better support downstream projects (that may use private image registries) and additional open source/third-party executors and exporters, which can read the experiment details and inject/monitor chaos in their own way. Provided below is one such example, where the chaos executor, monitor, and other images are maintained in different image repositories:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: chaos
  namespace: default
spec:
  monitoring: false
  appinfo:
    appkind: deployment
    applabel: app=hello
    appns: default
  chaosServiceAccount: nginx
  components:
    monitor:
      image: gcr.io/<project-id>/chaos-exporter:ci
    runner:
      image: gcr.io/<project-id>/ansible-runner:ci
  experiments:
    - name: container-kill
      spec:
        components:
          - name: TARGET_CONTAINER
            value: hello
          - name: LIB
            value: pumba
          - name: LIB_IMAGE
            value: gcr.io/<project-id>/pumba:0.4.8
Chaos experiments can be classified by the applications they target, or they may be categorized as generic infrastructure chaos experiments. In the Kubernetes world, the term “infrastructure” is a bit porous. Traditionally, the (VM/cloud) node instances, the node/cluster network, and storage (disks, disk connectors, controllers) are considered infrastructure components. With Kubernetes, however, application clusters are made up of “nodes” that are actually pods, the network includes the pod network, and storage includes PVs, all residing within a more traditional “virtual/physical infrastructure.” By extension, some infra chaos experiments (those that inject faults into traditional/platform-level components) impact more than one application (sometimes many, depending upon deployment practices), while others (Kubernetes-resource/app-level) have a much lower blast radius and impact only a given “application cluster.”
In the Litmus lingua franca, the former continue to be called “infra chaos experiments,” with platform-specific chaoslibs (GKE, AWS, DOKS, Packet) that often require vendor-specific out-of-band/API access. The latter are referred to as “app chaos experiments” and can typically run regardless of the underlying Kubernetes provider.
With 0.8, Litmus supports the following infra chaos experiments:
The update also includes new OpenEBS persistent storage chaos experiments, allowing users to inject failure on storage controller/target pods and storage pool/data replica pods:
Application-Specific (Kafka) Chaos Experiments
With a vast and rapidly growing ecosystem, Kafka, in its myriad distributions, is a crucial piece of many microservice tech stacks today. In line with LitmusChaos’s objective to provide application chaos experiments (with specific pre/post checks, hypothesis validation, and external liveness clients), we chose Kafka as one of the first few supported applications. With tremendous help and feedback from the community, the following experiments have been added, in which a liveness pod creates test topics with the desired properties and sets up the pub/sub infra. The resiliency of the Kafka cluster is monitored via message continuity during fault injection on the broker replicas. The brokers are derived dynamically for the test topics, though the experiments also allow them to be specified explicitly.
One of the prerequisites for the execution of Litmus chaos experiments is the “chaosServiceAccount,” which is propagated to the chaos executor as well as to individual experiments. The permissions necessary to execute the experiments may not be readily apparent and may vary across experiment categories. In this release, the permissions required for an experiment are described in the ChaosExperiment CR spec itself, allowing users to grant the required permissions to their service account before creating the ChaosEngine. Below is a portion of an OpenEBS chaos experiment, with the necessary service account permissions specified:
apiVersion: litmuschaos.io/v1alpha1
description:
kind: ChaosExperiment
metadata:
  labels:
    litmuschaos.io/name: openebs
  name: openebs-target-container-failure
  version: 0.1.1
spec:
  definition:
    permissions:
      apiGroups:
        - ""
        - "extensions"
        - "apps"
        - "batch"
        - "litmuschaos.io"
        - "openebs.io"
        - "storage.k8s.io"
      resources:
        - "statefulsets"
        - "deployments"
        - "jobs"
        - "pods"
        - "pods/exec"
        - "chaosengines"
        - "chaosexperiments"
        - "chaosresults"
        - "persistentvolumeclaims"
        - "storageclasses"
        - "persistentvolumes"
      verbs:
        - "*"
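One way to act on such a spec is to translate its permissions block into standard Kubernetes RBAC objects before creating the ChaosEngine. The manifest below is a minimal sketch, not an official Litmus manifest: the service account name `nginx` and namespace `default` are borrowed from the ChaosEngine examples earlier in this post, and the name `openebs-chaos` is hypothetical. A ClusterRole is used here (rather than a namespaced Role) because storageclasses and persistentvolumes are cluster-scoped resources.

```yaml
# Illustrative only: object names are assumptions, not taken from the Litmus docs.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nginx
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: openebs-chaos   # hypothetical name
rules:
  # Mirrors the apiGroups/resources/verbs listed in the ChaosExperiment spec.
  - apiGroups: ["", "extensions", "apps", "batch", "litmuschaos.io", "openebs.io", "storage.k8s.io"]
    resources:
      - statefulsets
      - deployments
      - jobs
      - pods
      - pods/exec
      - chaosengines
      - chaosexperiments
      - chaosresults
      - persistentvolumeclaims
      - storageclasses
      - persistentvolumes
    verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: openebs-chaos   # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: openebs-chaos
subjects:
  # Binds the role to the account referenced by chaosServiceAccount.
  - kind: ServiceAccount
    name: nginx
    namespace: default
```

Because the experiment asks for `*` verbs, this grants broad access cluster-wide. Whether to narrow the verbs or split the rule per API group is a judgment call for the cluster operator.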
With the addition of new contributors and users to the LitmusChaos project, we decided to up our community discipline by being more transparent in project management, while also creating additional aids to get users started in creating chaos experiments.
I want to give a tremendous shoutout to our users/contributors @bhaumikshah and @jayadeepkm for their continued support and feedback. With increased community initiative, we hope to strengthen the LitmusChaos framework even more in upcoming releases. Here is a quick sneak peek into the 0.9 release, with some of the high-level backlog items: