Scheduling Chaos: An introduction to the LitmusChaos Scheduler

Written by Sanjay Nathani | Aug 6, 2020 4:16:46 PM

Introduction

Hey all! I am Sanjay Nathani, one of the Contributors to the LitmusChaos Project & a Software Engineer at MayaData. By now, I assume that you are already familiar with the concept of cloud-native chaos engineering and how the litmuschaos project enables you to achieve it here.

As members of the larger chaos engineering community, one of the observations we made while examining the use-cases of different adopters was that chaos needs to be made available as a background service. While random injections via manual execution of the experiments in pre-prod/production (read gamedays) and CI-driven execution on dev environments is still the norm in many cases, there are a lot of organizations adopting a continuous-chaos strategy as part of a shift-left paradigm, in which staging clusters (or equivalent environments that mimic prod characteristics and traffic) are subject to service and infrastructure faults repeatedly in a periodic or random fashion. In most of these cases, the goal is to observe the resilience of the microservices at various times/operational states. It is common knowledge that the load on the microservices in a cluster varies throughout its existence - there are peak traffic periods - which may last for a few hours in a day or few days a month. It is necessary to compare how the KPIs (key performance indicators) fare at different periods upon failures.

Based on this, we decided to create the chaos-scheduler to inject chaos repeatedly, while providing a flexible schema for developers and SREs by which they can automate chaos runs while being able to define minimum intervals between two instances of chaos or specify the total number of chaos instances across a time range, etc.,

What is Chaos Scheduler?

The Chaos Scheduler is a Kubernetes controller (built using the Operator-SDK framework) that reconciles a custom resource called ChaosSchedule, which is essentially a higher-level abstraction that embeds within itself the (now-familiar) ChaosEngine template along with a schedule specification. While still an alpha component today, the Chaos Scheduler sees adoption already and is poised towards becoming an optional component in the Litmus deployment bundle (helm chart).

In this blog, let’s take a closer look at the scheduling options provided by the chaos scheduler and how you can give it a spin in your cluster.

Dissecting the ChaosSchedule Custom Resource

The ChaosSchedule is the core schema that defines the chaos workflow for a given Application Under Test (AUT) or Node Under Test (NUT). It describes the following:

Execution Schedule for the experiments
Template Spec of ChaosEngine detailing the chaos action

As mentioned earlier, one of the goals with the Chaos Scheduler was to provide a flexible and rich set of configuration options, for, there is already the standard Kubernetes Cron Job if the requirement is only about repeating the chaos action. As of today, there are three ways in which we can schedule the chaos to be injected:

Now: This will trigger the chaos as soon as the ChaosSchedule CR is created and is similar to the on-demand execution model available today.

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx
spec:
  schedule:
    now: true
  engineTemplateSpec:
    appinfo:
      appns: 'default'
      applabel: 'app=nginx'
      appkind: 'deployment'
    # It can be true/false
    annotationCheck: 'true'
    #ex. values: ns1:name=percona,ns2:run=nginx
    auxiliaryAppInfo: ''
    chaosServiceAccount: pod-delete-sa
    monitoring: false
    # It can be delete/retain
    jobCleanUpPolicy: 'delete'
    experiments:
      - name: pod-delete
        spec:
          components:
            env:
              # set chaos duration (in sec) as desired
              - name: TOTAL_CHAOS_DURATION
                value: '30'

              # set chaos interval (in sec) as desired
              - name: CHAOS_INTERVAL
                value: '10'

              # pod failures without '--force' & default terminationGracePeriodSeconds
              - name: FORCE
                value: 'false'

Once: This will schedule the chaos at a specific time denoted by executionTime.

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx
spec:
  schedule:
    once:
      executionTime: "2020-05-12T05:47:00Z"   #should be modified according to current UTC Time
  engineTemplateSpec:
    appinfo:
      appns: 'default'
      applabel: 'app=nginx'
      appkind: 'deployment'
    # It can be true/false
    annotationCheck: 'true'
    #ex. values: ns1:name=percona,ns2:run=nginx
    auxiliaryAppInfo: ''
    chaosServiceAccount: pod-delete-sa
    monitoring: false
    # It can be delete/retain
    jobCleanUpPolicy: 'delete'
    experiments:
      - name: pod-delete
        spec:
          components:
            env:
              # set chaos duration (in sec) as desired
              - name: TOTAL_CHAOS_DURATION
                value: '30'

              # set chaos interval (in sec) as desired
              - name: CHAOS_INTERVAL
                value: '10'

              # pod failures without '--force' & default terminationGracePeriodSeconds
              - name: FORCE
                value: 'false'

Repeat: This type of schedule will ensure the repeated execution of chaos over a time range. We define the startTime & endTime with a minChaosInterval specified to ensure a mandatory cool-off period to observe adherence to MTTR (Mean-Time-To-Recover). This option also allows whitelisting/blacklisting days of a week for chaos. Here is a sample of how to inject the chaos in this way.

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx
spec:
  schedule:
    repeat:
      startTime: "2020-05-12T05:47:00Z"   #should be modified according to current UTC Time
      endTime: "2020-05-12T05:52:00Z"   #should be modified according to current UTC Time
      minChaosInterval: "2m"   #format should be like "10m" or "2h" accordingly for minutes and   hours
      instanceCount: "2"
      includedDays: "mon,tue,wed"
  engineTemplateSpec:
    appinfo:
      appns: 'default'
      applabel: 'app=nginx'
      appkind: 'deployment'
    # It can be true/false
    annotationCheck: 'true'
    #ex. values: ns1:name=percona,ns2:run=nginx
    auxiliaryAppInfo: ''
    chaosServiceAccount: pod-delete-sa
    monitoring: false
    # It can be delete/retain
    jobCleanUpPolicy: 'delete'
    experiments:
      - name: pod-delete
        spec:
          components:
            env:
              # set chaos duration (in sec) as desired
              - name: TOTAL_CHAOS_DURATION
                value: '30'

              # set chaos interval (in sec) as desired
              - name: CHAOS_INTERVAL
                value: '10'

              # pod failures without '--force' & default terminationGracePeriodSeconds
              - name: FORCE
                value: 'false'

Needless to say, the ChaosSchedule is referenced as the owner of the secondary resources (chaosengine) with Kubernetes DeletePropagation policies ensuring their removal too upon deletion of the ChaosSchedule CR.

In the subsequent section, let us view the steps involved in setting up a demo environment to try out the Chaos Scheduler.

Getting Started

In this section, let us view the steps involved in setting up a demo environment to try out the Chaos Scheduler.

Install LitmusChaos Operator, RBAC, and CRDs

kubectl apply -f https://litmuschaos.github.io/pages/litmus-operator-latest.yaml

namespace/litmus created
serviceaccount/litmus created
clusterrole.rbac.authorization.k8s.io/litmus created
clusterrolebinding.rbac.authorization.k8s.io/litmus created
deployment.apps/chaos-operator-ce created
customresourcedefinition.apiextensions.k8s.io/chaosengines.litmuschaos.io created
customresourcedefinition.apiextensions.k8s.io/chaosexperiments.litmuschaos.io created
customresourcedefinition.apiextensions.k8s.io/chaosresults.litmuschaos.io created

Install Chaos Scheduler, and it's CRDs

kubectl apply -f https://raw.githubusercontent.com/litmuschaos/chaos-scheduler/master/deploy/crds/chaosschedule_crd.yaml

customresourcedefinition.apiextensions.k8s.io/chaosschedules.litmuschaos.io created

kubectl apply -f https://raw.githubusercontent.com/litmuschaos/chaos-scheduler/master/deploy/chaos-scheduler.yaml

deployment.apps/chaos-scheduler created

Create the pod delete Chaos Experiment in the default namespace

NOTE: In this example, I intend to inject chaos on a single replica Nginx deployment running in the default namespace. Modify according to your environment.

kubectl apply -f https://raw.githubusercontent.com/litmuschaos/chaos-charts/1.4.0/charts/generic/pod-delete/experiment.yaml

chaosexperiment.litmuschaos.io/pod-delete created

Setup the RBAC for executing the pod-delete chaos

kubectl apply -f https://raw.githubusercontent.com/litmuschaos/chaos-charts/1.4.0/charts/generic/pod-delete/rbac.yaml

serviceaccount/pod-delete-sa created
role.rbac.authorization.k8s.io/pod-delete-sa created
rolebinding.rbac.authorization.k8s.io/pod-delete-sa created

Annotate your application to enable chaos

kubectl annotate deploy/nginx-deployment litmuschaos.io/chaos="true"

deployment.extensions/nginx-deployment annotated

Before proceeding, let's see whether all the things are up and running successfully or not.

Chaos Scheduler and Chaos Operator should be running perfectly.

 kubectl get po -n litmus

   chaos-operator-ce-5cd5894879-k7wgz   1/1     Running   0          
   10m
   chaos-scheduler-84fcccb5bd-mjpnj     1/1     Running   0          
   10m

Ensure the Service Accounts for scheduler and operator are created

kubectl get sa -n litmus

   default     1         10m
   litmus      1         10m
   scheduler   1         10m

Ensure the service account for the intended experiment is created successfully

 kubectl get sa

   default         1         10m
   pod-delete-sa   1         10m

Now we can safely move further

Create a ChaosSchedule yaml with the application and experiment information along with the scheduling logic

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
name: schedule-nginx
namespace: litmus
spec:
    schedule:
        repeat:
startTime: "2020-05-12T05:47:00Z"   #should be modified according to current UTC Time
 endTime: "2020-05-12T05:52:00Z"   #should be modified according to current UTC Time
 minChaosInterval: "2m"   #format should be like "10m" or "2h" accordingly for minutes   and   hours
 instanceCount: "2"
 includedDays: "mon,tue,wed"
engineTemplateSpec:
  appinfo:
    appns: 'default'
    applabel: 'app=nginx'
    appkind: 'deployment'
  # It can be true/false
  annotationCheck: 'true'
  #ex. values: ns1:name=percona,ns2:run=nginx
  auxiliaryAppInfo: ''
  chaosServiceAccount: pod-delete-sa
  monitoring: false
  # It can be delete/retain
  jobCleanUpPolicy: 'delete'
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # set chaos duration (in sec) as desired
            - name: TOTAL_CHAOS_DURATION
              value: '30'

            # set chaos interval (in sec) as desired
            - name: CHAOS_INTERVAL
              value: '10'

            # pod failures without '--force' & default terminationGracePeriodSeconds
            - name: FORCE
              value: 'false'

Create a ChaosSchedule custom resource

kubectl apply -f chaos-schedule.yaml

Watch the injection of chaos at any point of time

watch kubectl get pod

Describe the ChaosSchedule for the details of chaos injection.

kubectl describe chaosschedule schedule-nginx

Name:         schedule-nginx
Namespace:    default
Labels:       <none>
Annotations:  API Version:  litmuschaos.io/v1alpha1
Kind:         ChaosSchedule
Metadata:
  Creation Timestamp:  2020-05-14T08:44:32Z
  Generation:          3
  Resource Version:    899464
  Self Link:           /apis/litmuschaos.io/v1alpha1/namespaces/default/chaosschedules/  schedule-nginx
  UID:                 347fb7e6-2c9d-428e-9ce1-42bdcfdab37d
Spec:
  Chaos Service Account:  
  Engine Template Spec:
    Appinfo:
      Appkind:              deployment
      Applabel:             app=nginx
      Appns:                default
    Chaos Service Account:  litmus
    Components:
      Runner:
    Experiments:
      Name:  pod-delete
      Spec:
        Components:
        Rank:             0
    Job Clean Up Policy:  retain
  Schedule:
    Repeat:
      End Time:            2020-05-12T05:52:00Z
      Included Days:       Mon,Tue,Wed
      Instance Count:      2
      Min Chaos Interval:  2m
      Start Time:          2020-05-12T05:47:00Z
  Schedule State:        active
Status:
  Active:
    API Version:       litmuschaos.io/v1alpha1
    Kind:              ChaosEngine
    Name:              schedule-nginx
    Namespace:         default
    Resource Version:  899463
    UID:               14f49857-8879-4129-a5b9-a3a592149725
  Last Schedule Time:  2020-05-14T08:44:32Z
  Schedule:
    Start Time:              2020-05-14T08:44:32Z
    Status:                  running
    Total Instances:         1
Events:
  Type    Reason            Age   From             Message
  ----    ------            ----  ----             -------
  Normal  SuccessfulCreate  39s   chaos-scheduler  Created engine schedule-nginx

Halting ChaosSchedule

At any point in time, we can halt a chaosschedule, which simply means stopping the further execution of chaos. Here is the way to halt the chaosschedule.
The power of halting a schedule comes into action when we do not want to disturb the production cluster or an application at some point in time because of some important activity(migration) going on. We can halt the schedule without putting in the efforts of deleting and recreating the schedule.

Change the spec.ScheduleState to halt

spec:
    scheduleState: halt

Conclusion

With the Chaos Scheduler, the user is not burdened with trying to re-apply chaosengine manifests or remember to do chaos at different times by himself/herself and instead only has to compare execution results! As you read this, the Chaos Scheduler is being improved to support randomized execution within a time range. So, more power coming your way!! Do try out the steps and let us know what you feel about the scheduler and what use-cases it must support!

Are you an SRE or a Kubernetes enthusiast? Does Chaos Engineering excite you? Join Our Community #litmus channel in Kubernetes Slack
Contribute to LitmusChaos and share your feedback on Github
If you like LitmusChaos, become one of the many stargazers here

View full post