LitmusChaos 1.2: Makes Chaos Eventful

The sure-shot sign of a growing community around an OSS project: the repository sees a steady increase in GitHub issues for enhancements and feature requests (and, of course, bugs, not to mention PRs to fix them!), organizations build demos around it, contributors and users show up at community sync-up calls to share how useful the project is to them and how they would like to see it shape up, and senior technology enthusiasts and developers from different organizations gladly agree to sign up as maintainers.

Litmus 1.2 Release

Litmus has started to see this happen, and we couldn’t be more excited and motivated. As a consequence, we decided that it makes sense to donate Litmus to the CNCF as a sandbox project to help with neutral governance, increased development and traction, and a community-driven roadmap, especially since we believe it is one of the first cloud native chaos engineering projects out there. While we hope it gets accepted, we as a team remain focused on churning out monthly releases to incorporate the capabilities the community has decided on!

In 1.2, the focus was on improving chaos observability in the form of events, supporting newer platforms, adding and enhancing experiments in the different suites, and, not least, improving the governance info to augment the amazing community build-up. Read on to learn more!

Increased Support for Chaos Events

While the previous release briefly introduced Kubernetes events for the creation/removal of chaos resources, 1.2 adds richer events on the ChaosEngine, generated at each stage of the experiment lifecycle, thereby giving users a complete picture of what happens during execution. Since Litmus is itself composed of different sub-components working together to perform chaos, there are various event sources (chaos-operator, chaos-runner, chaos-experiment) that generate these events. Provided below is an example of chaos events for the new node memory hog experiment.

Chaos events generated during the node-memory-hog experiment

On a lighter note, here is a tweet that illustrates this capability (i.e., events pouring in from different sources) in an alternate world! https://twitter.com/ispeakc0de/status/1237825651169579008?s=21

Experiment Failure Reason in ChaosResult

One of the constant endeavors of the Litmus team has been to improve the result/reporting infra for the chaos experiments, with multiple issues being worked on toward this at the moment. One of the first steps in that direction is to mark out the point of failure (task) in the experiment. In this release, we have introduced an additional field .status.experimentstatus.failStep in the ChaosResult status to indicate exactly that. This field is patched with the name of the Ansible task that failed. This can aid in debugging the execution, especially when used with the jobCleanupPolicy set to retain. Though this enhancement in 1.2.0 is specific to experiments executed as Ansible playbooks, the techniques for achieving this in Python/Go will be made clear via the respective experiment templates in subsequent releases.

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosResult
metadata:
  name: engine-pod-delete
  labels:
    chaosUID: 83c439de-6082-11ea-89a5-0cc47ab587b8
spec:
  engine: engine
  experiment: pod-delete
status:
  experimentstatus: 
    phase: Completed
    verdict: Fail
    failStep: Checking whether application pods are in running state
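
As an illustration, a minimal ChaosEngine sketch that retains the experiment pods after execution (so the failed step can be inspected alongside the pod logs) might look like the following; the application details and service account are placeholders, not prescribed values:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine
  namespace: default
spec:
  # keep the experiment/runner pods around after execution for debugging
  jobCleanupPolicy: 'retain'
  appinfo:
    appns: 'default'            # placeholder application namespace
    applabel: 'app=nginx'       # placeholder application label
    appkind: 'deployment'
  chaosServiceAccount: pod-delete-sa   # placeholder service account
  experiments:
    - name: pod-delete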

Integration with Chaostoolkit

Litmus was built with extensibility in mind and is already capable of using other popular chaos tools such as Pumba & Powerfulseal. In this release, a new category of charts, submitted by the amazing folks at Intuit, has been introduced based on integration with another popular chaos tool: chaostoolkit. The experiments are Python scripts with their own steady-state checks, which in turn invoke the chaostoolkit libraries to inject the chaos. Needless to say, these scripts are containerized, and Litmus orchestrates these experiments via the standard ChaosEngine/ChaosExperiment interfaces to achieve the desired benefits of orchestration as well as reuse of existing chaostoolkit-based scripts.
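
As a rough sketch of how such an experiment is wired in via those standard interfaces (the experiment name, application labels, and service account below are purely illustrative placeholders, not names taken from the chart):

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: chaostoolkit-engine
  namespace: default
spec:
  appinfo:
    appns: 'default'               # placeholder application namespace
    applabel: 'app=my-service'     # placeholder application label
    appkind: 'deployment'
  chaosServiceAccount: chaostoolkit-sa    # illustrative service account
  experiments:
    # hypothetical name of a chaostoolkit-based experiment from the new chart
    - name: chaostoolkit-pod-delete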



Support for Chaos on Amazon EKS platform

One of the criteria mentioned in the experiment maturity guidelines we formed some time back is support for chaos on multiple Kubernetes platforms, which includes EKS. In this release, we have certified all experiments in the Generic chaos suite on EKS. Experiments belonging to other suites are being tested as of this writing, and upcoming experiments will be as well!



Support for RAMP time in an experiment

One of the common features of any “testing” or “experimentation” toolset is the ability to provide “warm-up” and “soak” time before and after the test/experiment itself, thereby enabling a better determination of steady state and avoiding transient states. This is especially true when dealing with network, storage, or other infra components. In 1.2.0, we have introduced an option to specify a RAMP_TIME in the experiment via an ENV variable (typically provided as an overriding value in the ChaosEngine, with the default set to no wait / zero seconds) for precisely this purpose.

experiments:
    - name: node-memory-hog
      spec:
        components:
          env:
            # set chaos duration (in sec) as desired
            - name: TOTAL_CHAOS_DURATION
              value: '120'

            # time to wait before injecting chaos and before post-chaos steady state checks
            - name: RAMP_TIME
              value: '60'

            ## specify the size as a percentage of total available memory
            ## default value is 90%
            - name: MEMORY_PERCENTAGE
              value: '90'

            # supported platforms: GKE and EKS
            # GKE is the default platform
            - name: PLATFORM
              value: 'GKE'

New Chaos Experiments On the Block

As with every release, we have some new experiments added to the chaos suites this time too.

  • OpenEBS Pool Disk Loss (OpenEBS): This experiment is supported on GKE/AWS and reuses the disk loss chaoslib. On a different note, we even recommend that OpenEBS users try out the LitmusChaos experiments at the various stages of their day-0/day-1 journey to verify the resiliency of their OpenEBS deployments.
  • Node Memory Hog (Generic): Created by popular request, this experiment accepts a memory consumption percentage (of the node memory) and hogs that much, leading to a resource crunch on the node, which may cause forced evictions of application pods. This experiment uses stress-ng (an improvement over the standard stress library in Linux), which allows us to widely expand the scope of resource chaos. Trust us, we are on it!

Governance - What is that?

While Litmus has been a project following standard OSS principles, some crucial artifacts around governance were missing from the GitHub repositories. Governance, in layman’s terms, pertains to clearly calling out details such as who the maintainers are, what the process is for becoming one, what the near-term/long-term roadmap is and how it is decided, what licenses are involved, what code/build dependencies exist, who the initial adopters/users are, how releases are carried out, etc. With this release, these details have been incorporated into the Litmus repo to aid the growing community with essential information for making the right decisions around contributions/adoption.



Other Notable Improvements

Some of the other improvements that made it into 1.2 include the following:

  • The jobCleanupPolicy (ChaosEngine) is now extended to the chaos-runner pod, with the runner & monitor pods automatically removed (along with the experiment pod) if the policy is set to delete.
  • The chaos-runner properties override capability in the ChaosEngine (under .spec.components.runner) is now extended to the imagePullPolicy & command/args attributes as well, to aid debug/development (a sketch combining these two settings follows this list).
  • The experiments now run with a reduced resource footprint (in terms of pods) in cases where node chaos is performed, by replacing the use of daemonsets with jobs.
  • An NFS liveness client has been introduced to aid with NFS storage provisioner chaos.
  • A cleaner documentation website with no stale (pre-1.0) data!
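
For illustration, here is a minimal ChaosEngine sketch combining the extended cleanup policy with the new runner overrides; the image tag, application details, and service account shown are placeholders rather than recommended values:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine
  namespace: default
spec:
  # runner & monitor pods are now cleaned up along with the experiment pod
  jobCleanupPolicy: 'delete'
  appinfo:
    appns: 'default'            # placeholder application namespace
    applabel: 'app=nginx'       # placeholder application label
    appkind: 'deployment'
  chaosServiceAccount: pod-delete-sa   # placeholder service account
  components:
    runner:
      image: 'litmuschaos/chaos-runner:ci'   # placeholder image/tag, e.g. a debug build
      imagePullPolicy: 'Always'
      # command/args overrides can also be specified here for debug/development
  experiments:
    - name: pod-delete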

For a full list of enhancements/fixes, please refer to the release notes.

Conclusion

As with 1.1, one of the highlights of this release was that most of the items picked came purely from community requests. And there are many more to be taken up! As with any project picking up momentum, there is a considerable amount of prioritization involved, which we aim to accomplish during the monthly community sync-up calls. There are daily public standups conducted at 16:00 IST, in which the community can participate to catch up on how things are progressing and also to seek info/help on any contributions one may be working on. And don’t hesitate to reach out to us on our increasingly chatty Slack channel in the Kubernetes workspace! Some of the items coming up soon (backlogs generated from previous sync-ups) include:

  • Support for AdminMode of chaos, with all chaos resources centralized in a single namespace & application chaos controlled via an opt-in annotation model
  • Improved Chaos Abort functionality for cases where chaos processes are started on target containers
  • OpenEBS NFS chaos experiments
  • Improved experiment logging via better context/task descriptions
  • Improvements to chaos metrics visualization

Last but not least, a hearty welcome to our new maintainers Jayesh Kumar (@k8s-dev) & Sumit Nagal (@sumitnagal), and a big shoutout to our awesome contributors & users: David Gildeh (@dgildeh), Laura Henning (@laumiH) & Christopher Pitstick (@cpistick-argo).

Evan Powell
Evan Powell has been founding CEO of four successful enterprise infrastructure software companies, Clarus Systems (RVBD), Nexenta Systems, StackStorm (BRCD), and now MayaData. As founding CEO of StackStorm, Evan helped to create the event-driven automation category. He helped to build a vibrant open source community leveraging and improving upon approaches used by the largest operators such as Facebook. As founding CEO of Nexenta Systems, he helped to re-imagine the storage industry as one in which the value of software and freedom from vendor lock-in would predominate. Under his leadership, Nexenta became the leader of the OpenStorage movement and the software-defined storage sector, with thousands of customers and hundreds of millions of dollars of annual partner sales. To relax away from work, Evan runs Spartan races and tries to keep up with his family’s adventures.
Abhishek Raj
Abhishek is a Customer Success Engineer at MayaData. He currently works with Kubernetes and Docker.
Oum Kale
Oum Kale is an SDE intern at MayaData. He loves to write programs and play chess. He is fascinated by the extensive impact computers can have on solving real-world problems.