75 closed PRs from 20+ new contributors, more than 100 new stargazers, and some incredible new adopters. This is just part of the story for the LitmusChaos 0.8 release, aided in no small part by the annual Hacktoberfest, which has been at the forefront of our calendars over the last month or so. From college students to experienced SREs, and from documentation updates to new chaos utils, the contributions were incredible, and we are sincerely thankful! While community engagement was a primary focus, we also managed to introduce new experiments to the ChaosHub, provide increased context around experiments via enhanced ChaosExperiment specs, provide greater execution control via the ChaosEngine, and sow the seeds for some critical enhancements. Now, let’s talk some more about what has gone into Litmus 0.8!
Experiment Job Cleanup Policy
With 0.8, users can now choose to automatically clean up chaos experiment jobs upon completion via a ChaosEngine tunable (especially useful if dedicated logging infrastructure such as EFK or Loki is present), thereby reducing the burden that repeated executions place on etcd.
- name: pod-delete
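The effect of such a cleanup tunable can be pictured with a small Python sketch. It is not the operator's implementation: the policy values `delete` and `retain`, the `ExperimentJob` record, and the job names are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExperimentJob:
    """Hypothetical record describing one chaos experiment job."""
    name: str
    completed: bool


def jobs_to_remove(jobs, cleanup_policy):
    """Return the names of jobs that a cleanup pass would delete.

    cleanup_policy is an assumed tunable: "delete" removes completed jobs,
    any other value (for example "retain") keeps them for inspection.
    """
    if cleanup_policy != "delete":
        return []
    # Only finished jobs are candidates; running experiments are left alone.
    return [job.name for job in jobs if job.completed]


jobs = [
    ExperimentJob("pod-delete-a1", completed=True),
    ExperimentJob("pod-delete-b2", completed=False),
]
print(jobs_to_remove(jobs, "delete"))
print(jobs_to_remove(jobs, "retain"))
```

Retaining jobs stays useful when their logs are the only record of a run, which is why the cleanup option pairs well with an external logging stack.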
Increased Execution Control via ChaosEngine
While the previous release featured the ability to override experiment defaults via the ChaosEngine, we identified a need to also control the versions of the executor that runs the experiment itself. This was implemented to better support downstream projects (which may use private image registries) as well as additional open source/third-party executors and exporters, which can read the experiment details and inject/monitor chaos in their own way. Provided below is one such example, where the chaos executor, monitor, and other images are maintained in different image repositories:
- name: container-kill
- name: TARGET_CONTAINER
- name: LIB
- name: LIB_IMAGE
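The override behavior described above, where values set on the ChaosEngine take precedence over the experiment's defaults, can be sketched in Python as a simple merge. The variable names `TARGET_CONTAINER`, `LIB`, and `LIB_IMAGE` come from the example; every value, the registry hosts, and the `resolve_env` helper are hypothetical placeholders.

```python
def resolve_env(experiment_defaults, engine_overrides):
    """Merge env settings so that engine-level values win over defaults."""
    resolved = dict(experiment_defaults)  # copy, so the defaults stay untouched
    resolved.update(engine_overrides)
    return resolved


# Placeholder values for illustration only.
defaults = {
    "TARGET_CONTAINER": "",
    "LIB": "default-lib",
    "LIB_IMAGE": "public.example/chaos-lib:latest",
}
overrides = {
    "TARGET_CONTAINER": "nginx",
    "LIB_IMAGE": "registry.internal.example/chaos-lib:0.8",
}

env = resolve_env(defaults, overrides)
for key in sorted(env):
    print(f"{key}={env[key]}")
```

Anything the engine does not mention (here, `LIB`) falls back to the experiment default, so a downstream project only has to list the images it actually hosts elsewhere.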
Infra Chaos Experiments
Chaos experiments can be classified based on the applications they target, or they may be categorized as generic infrastructure chaos experiments. In the Kubernetes world, the term “infrastructure” is a bit porous. Traditionally, the (VM/cloud) node instances, the node/cluster network, and storage (disks/disk connectors/controllers) are considered infrastructure components. However, with Kubernetes, application clusters are made up of “nodes” that are actually pods, the network also includes the pod network, and storage includes PVs, all residing within a more traditional “virtual/physical infrastructure.” By extension, some infra chaos experiments (those that inject faults into traditional/platform-level components) impact more than one application (sometimes many, depending upon deployment practices), while others (Kubernetes-resource/app-level) have a much lower blast radius and impact only a given “application cluster.”
In the Litmus lingua franca, the former continue to be called “infra chaos experiments,” with platform-specific chaoslib (GKE, AWS, DOKS, Packet) that often requires vendor-specific out-of-band/API access. The latter are referred to as “app chaos experiments” and can typically run regardless of the underlying Kubernetes provider.
With 0.8, Litmus supports the following infra chaos experiments:
- Storage Disk Loss (GKE, AWS)
- (Ephemeral) Storage Exhaustion/Fill (GKE)
- CPU Resource Hog (GKE)
The update also includes new OpenEBS persistent storage chaos experiments, allowing users to inject failure on storage controller/target pods and storage pool/data replica pods:
- OpenEBS target container kill
- OpenEBS pool container kill
- OpenEBS pool network delay
- OpenEBS pool network loss
Application-Specific (Kafka) Chaos Experiments
With a vast and rapidly growing ecosystem, Kafka, in its myriad distributions, is a crucial piece of many microservice tech stacks today. In line with LitmusChaos’s objective to provide application chaos experiments (with specific pre/post checks, hypothesis validation, and external liveness clients), we chose Kafka as one of the first few supported applications. With tremendous help and feedback from the community, the following experiments have been added. In each of them, a liveness pod creates test topics with the desired properties and sets up the pub/sub infra. The resiliency of the Kafka cluster is monitored via message continuity during fault injection on the broker replicas. The brokers are derived dynamically for the test topics, though the experiments also allow them to be specified explicitly.
- Kafka Broker Pod Failure
- Kafka Broker Disk Failure (GKE)
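To make “message continuity” concrete, here is a minimal Python sketch of the kind of check a liveness client could perform. It assumes the producer stamps each message with a consecutive sequence number starting at 0. That scheme and the helper names are illustrative assumptions, not a description of the actual Litmus liveness pod.

```python
def find_gaps(received, expected_count):
    """Return the sequence numbers in range(expected_count) never received."""
    seen = set(received)
    return [seq for seq in range(expected_count) if seq not in seen]


def is_continuous(received, expected_count):
    """True when every expected message arrived (duplicates are tolerated)."""
    return not find_gaps(received, expected_count)


# Sequence numbers observed by a consumer while a broker was being failed.
observed = [0, 1, 3, 4]
print(find_gaps(observed, 5))
print(is_continuous(observed, 5))
```

Duplicates are deliberately ignored in this sketch because redelivery after a broker failover is usually acceptable, whereas a missing sequence number indicates lost data.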
Chaos Experiment Permissions
One of the prerequisites for the execution of Litmus chaos experiments is the “chaosServiceAccount,” which is propagated to the chaos executor as well as to the individual experiments. The permissions necessary to execute the experiments may not be readily apparent and may vary across experiment categories. In this release, the permissions required for an experiment are described in the ChaosExperiment CR spec itself, thereby allowing users to grant the required permissions to their serviceAccount before creating the ChaosEngine. Below is a portion of an OpenEBS chaos experiment, with necessary service account permissions specified:
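Because the required permissions now travel with the experiment spec, granting them could in principle be scripted. The following is a minimal Python sketch, assuming the permissions are expressed in the same shape as Kubernetes RBAC rules (`apiGroups`, `resources`, `verbs`). The sample permission list, role name, and namespace are made up for illustration and are not taken from an actual OpenEBS experiment.

```python
def build_role(name, namespace, permissions):
    """Build a Kubernetes RBAC Role manifest (as a dict) from a permission list."""
    rules = []
    for perm in permissions:
        rules.append({
            "apiGroups": list(perm["apiGroups"]),
            "resources": list(perm["resources"]),
            "verbs": list(perm["verbs"]),
        })
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": rules,
    }


# Hypothetical permissions, as they might be copied out of an experiment spec.
permissions = [
    {
        "apiGroups": ["", "batch"],
        "resources": ["pods", "jobs"],
        "verbs": ["get", "list", "create", "delete"],
    },
]

role = build_role("chaos-experiment-role", "default", permissions)
print(role["kind"], role["metadata"]["name"])
```

The resulting dict could be serialized to YAML and bound to the chaosServiceAccount with a RoleBinding. Experiments that act on cluster-scoped resources would need a ClusterRole instead.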
Community Best Practices
With the addition of new contributors and users to the LitmusChaos project, we decided to up our community discipline by being more transparent in project management, while also creating additional aids to get users started in creating chaos experiments.
- Project Management: Starting with this release, a GitHub issue (with relevant information) must exist on the litmuschaos/litmus meta repo before any major fix/enhancement/feature is picked up. A host of new labels have been created to help with categorization, while the GitHub Projects board is used for prioritization. Items are typically chosen for a release based on community/user input, and they are tracked in the regular sync-up meetings.
- Weekly Sync Up Meeting: The community sync-up meeting occurs every Tuesday at 17:30 IST. The agenda is determined in the lead-up to the call (primarily consisting of a demo, status updates, issue discussions, and group review of PRs on request), with the meeting notes captured in a dedicated Google Doc. Status updates on project items are tracked via a Google Sheet.
- Developer Guide and Experiment Scaffold Scripts: Based on contributors’ feedback, we created a developer guide to ease the journey of new chaos experiment submissions. A Python script now scaffolds all the necessary artifacts (Ansible business logic playbook, experiment CR, experiment ChartServiceVersion, K8s job, etc.) to get a chaos experiment listed on the ChaosHub, based on a simple metadata YAML filled out by the user. This provides a base into which experiment-specific libs/utils can be filled in before the PR is created.
- Experiment Maturity Guidelines: This is a set of first-cut guidelines that help determine the maturity of a chaos experiment. Experiments are classified into maturity levels (alpha, beta, GA) based on recovery procedures, supported platforms, available documentation, etc.
- Process Docs: As part of the bookkeeping improvements, the litmus repo now includes a Release Guidelines doc, with the Architecture/Design docs in the works as well.
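The scaffolding idea mentioned in the list above, generating a skeleton of experiment artifacts from a small metadata file, can be sketched as follows. This is not the project's actual script: the metadata keys, directory layout, and file names are hypothetical, and a plain dict stands in for the parsed metadata YAML so that the example has no external dependencies.

```python
def scaffold(metadata):
    """Return {relative_path: file_contents} for a new experiment skeleton.

    The metadata keys ("name", "category", "description") and the generated
    file names are assumptions chosen only to illustrate the idea.
    """
    name = metadata["name"]
    base = f"{metadata['category']}/{name}"
    header = f"# {name}: {metadata['description']}\n"
    return {
        f"{base}/{name}_playbook.yml": header + "# TODO: chaos business logic\n",
        f"{base}/{name}_experiment_cr.yml": header + "kind: ChaosExperiment\n",
        f"{base}/{name}_job.yml": header + "kind: Job\n",
    }


# Stand-in for the metadata YAML a contributor would fill out.
metadata = {
    "name": "sample-pod-kill",
    "category": "generic",
    "description": "Kills a sample application pod",
}

files = scaffold(metadata)
for path in sorted(files):
    print(path)
```

Returning the contents instead of writing them to disk keeps the sketch easy to test. A real script would write each entry to disk and leave the experiment-specific logic for the contributor to fill in.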
I want to give a tremendous shoutout to our users/contributors @bhaumikshah and @jayadeepkm for their continued support and feedback. With increased community initiative, we hope to strengthen the LitmusChaos framework even more in upcoming releases. Here is a quick sneak peek into the 0.9 release, with some of the high-level backlog items:
- Introduce an improved (Go-based) chaos executor: As the chaos-executor’s functionality grows (in line with the increasing complexity of the experiments and the variance in their spec), some refactoring is needed in order to provide better performance, increase scalability, and gain finer control over experiment execution. The inherent capabilities of the Go client for Kubernetes can be handy in achieving this. A pre-alpha version is currently in the works, which we hope to bring to alpha in 0.9.
- Additional Kafka Experiments: As a community, we will continue the efforts to provide a well-rounded chaos suite for Kafka.
- Additional Infra (node) Experiments: Node (instance) failures for the GKE platform.
- Improved user documentation with how-to guides for experiment categories.