Introduction
Hey all! I am Sanjay Nathani, one of the contributors to the LitmusChaos project and a software engineer at MayaData. By now, I assume that you are already familiar with the concept of cloud-native chaos engineering and how the LitmusChaos project enables you to achieve it.
As members of the larger chaos engineering community, one of the observations we made while examining the use-cases of different adopters was that chaos needs to be made available as a background service. While random injections via manual execution of the experiments in pre-prod/production (read: gamedays) and CI-driven execution on dev environments are still the norm in many cases, a lot of organizations are adopting a continuous-chaos strategy as part of a shift-left paradigm, in which staging clusters (or equivalent environments that mimic prod characteristics and traffic) are subjected to service and infrastructure faults repeatedly, in a periodic or random fashion. In most of these cases, the goal is to observe the resilience of the microservices at various times and operational states. It is common knowledge that the load on the microservices in a cluster varies throughout its existence: there are peak traffic periods, which may last for a few hours in a day or a few days in a month. It is necessary to compare how the KPIs (key performance indicators) fare at different periods when failures occur.
Based on this, we decided to create the chaos-scheduler to inject chaos repeatedly, with a flexible schema that lets developers and SREs automate chaos runs while defining a minimum interval between two instances of chaos, the total number of chaos instances across a time range, and so on.
The Chaos Scheduler is a Kubernetes controller (built using the Operator-SDK framework) that reconciles a custom resource called ChaosSchedule, which is essentially a higher-level abstraction that embeds the (now-familiar) ChaosEngine template along with a schedule specification. While still an alpha component today, the Chaos Scheduler is already seeing adoption and is poised to become an optional component in the Litmus deployment bundle (helm chart).
In this blog, let’s take a closer look at the scheduling options provided by the chaos scheduler and how you can give it a spin in your cluster.
The ChaosSchedule is the core schema that defines the chaos workflow for a given Application Under Test (AUT) or Node Under Test (NUT). It describes the schedule on which chaos runs and, through the embedded ChaosEngine template, the experiments to execute.
As mentioned earlier, one of the goals of the Chaos Scheduler was to provide a flexible and rich set of configuration options; if the requirement were only to repeat the chaos action, the standard Kubernetes CronJob would already do. As of today, there are three ways in which we can schedule the chaos to be injected:
Immediately, by setting now to true:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx
spec:
  schedule:
    now: true
  engineTemplateSpec:
    appinfo:
      appns: 'default'
      applabel: 'app=nginx'
      appkind: 'deployment'
    # It can be true/false
    annotationCheck: 'true'
    # ex. values: ns1:name=percona,ns2:run=nginx
    auxiliaryAppInfo: ''
    chaosServiceAccount: pod-delete-sa
    monitoring: false
    # It can be delete/retain
    jobCleanUpPolicy: 'delete'
    experiments:
      - name: pod-delete
        spec:
          components:
            env:
              # set chaos duration (in sec) as desired
              - name: TOTAL_CHAOS_DURATION
                value: '30'
              # set chaos interval (in sec) as desired
              - name: CHAOS_INTERVAL
                value: '10'
              # pod failures without '--force' & default terminationGracePeriodSeconds
              - name: FORCE
                value: 'false'
Once, at a specific time given by executionTime:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx
spec:
  schedule:
    once:
      executionTime: "2020-05-12T05:47:00Z" # should be modified according to current UTC time
  engineTemplateSpec:
    appinfo:
      appns: 'default'
      applabel: 'app=nginx'
      appkind: 'deployment'
    # It can be true/false
    annotationCheck: 'true'
    # ex. values: ns1:name=percona,ns2:run=nginx
    auxiliaryAppInfo: ''
    chaosServiceAccount: pod-delete-sa
    monitoring: false
    # It can be delete/retain
    jobCleanUpPolicy: 'delete'
    experiments:
      - name: pod-delete
        spec:
          components:
            env:
              # set chaos duration (in sec) as desired
              - name: TOTAL_CHAOS_DURATION
                value: '30'
              # set chaos interval (in sec) as desired
              - name: CHAOS_INTERVAL
                value: '10'
              # pod failures without '--force' & default terminationGracePeriodSeconds
              - name: FORCE
                value: 'false'
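Because executionTime has to be a UTC timestamp relative to the current time, it can be convenient to generate it instead of typing it by hand. Here is a minimal Python sketch, assuming the timestamp format used in the sample above; the helper name execution_time_in is hypothetical and not part of Litmus.

```python
from datetime import datetime, timedelta, timezone


def execution_time_in(minutes, now=None):
    """Return a UTC timestamp `minutes` from now, formatted like the sample manifest."""
    # `now` is injectable so the function can be checked against a fixed clock.
    now = now or datetime.now(timezone.utc)
    target = now.astimezone(timezone.utc) + timedelta(minutes=minutes)
    # Drop sub-second precision to match the "YYYY-MM-DDTHH:MM:SSZ" form.
    return target.replace(microsecond=0).strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    # Paste the printed value into spec.schedule.once.executionTime.
    print(execution_time_in(5))
```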
Repeatedly, between a startTime and an endTime, with a minChaosInterval specified to ensure a mandatory cool-off period to observe adherence to MTTR (Mean-Time-To-Recover). This option also allows whitelisting/blacklisting days of the week for chaos. Here is a sample of how to inject chaos in this way:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx
spec:
  schedule:
    repeat:
      startTime: "2020-05-12T05:47:00Z" # should be modified according to current UTC time
      endTime: "2020-05-12T05:52:00Z" # should be modified according to current UTC time
      minChaosInterval: "2m" # format should be like "10m" or "2h" for minutes and hours respectively
      instanceCount: "2"
      includedDays: "mon,tue,wed"
  engineTemplateSpec:
    appinfo:
      appns: 'default'
      applabel: 'app=nginx'
      appkind: 'deployment'
    # It can be true/false
    annotationCheck: 'true'
    # ex. values: ns1:name=percona,ns2:run=nginx
    auxiliaryAppInfo: ''
    chaosServiceAccount: pod-delete-sa
    monitoring: false
    # It can be delete/retain
    jobCleanUpPolicy: 'delete'
    experiments:
      - name: pod-delete
        spec:
          components:
            env:
              # set chaos duration (in sec) as desired
              - name: TOTAL_CHAOS_DURATION
                value: '30'
              # set chaos interval (in sec) as desired
              - name: CHAOS_INTERVAL
                value: '10'
              # pod failures without '--force' & default terminationGracePeriodSeconds
              - name: FORCE
                value: 'false'
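To get a feel for how the repeat fields interact, here is a small Python sketch that lists candidate chaos start times for a repeat block. It is only one plausible reading of the fields, not the scheduler's actual algorithm: it assumes runs start at startTime, are spaced exactly minChaosInterval apart, are skipped on days not listed in includedDays, and are capped at instanceCount. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Index matches datetime.weekday(): Monday is 0.
DAY_NAMES = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def parse_interval(text):
    """Parse "10m" or "2h" into a timedelta (minutes and hours only)."""
    value, unit = int(text[:-1]), text[-1]
    if unit == "m":
        return timedelta(minutes=value)
    if unit == "h":
        return timedelta(hours=value)
    raise ValueError(f"unsupported interval: {text!r}")


def parse_time(text):
    """Parse a timestamp such as "2020-05-12T05:47:00Z" as UTC."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def candidate_slots(repeat):
    """List candidate start times for a `repeat` block (illustrative assumptions only)."""
    start = parse_time(repeat["startTime"])
    end = parse_time(repeat["endTime"])
    step = parse_interval(repeat["minChaosInterval"])
    if step <= timedelta(0):
        raise ValueError("minChaosInterval must be positive")
    days = {d.strip().lower() for d in repeat["includedDays"].split(",")}
    limit = int(repeat["instanceCount"])  # assumption: treated as an upper cap

    slots = []
    current = start
    while current < end and len(slots) < limit:
        if DAY_NAMES[current.weekday()] in days:
            slots.append(current)
        current += step
    return slots


if __name__ == "__main__":
    sample = {
        "startTime": "2020-05-12T05:47:00Z",
        "endTime": "2020-05-12T05:52:00Z",
        "minChaosInterval": "2m",
        "instanceCount": "2",
        "includedDays": "mon,tue,wed",
    }
    for slot in candidate_slots(sample):
        print(slot.isoformat())
```

Under these assumptions, the sample values give two slots, 05:47 and 05:49 UTC on 2020-05-12 (a Tuesday); raising instanceCount to 10 would add only one more at 05:51, because the window closes at 05:52.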
Needless to say, the ChaosSchedule is set as the owner of the secondary resources it creates (the ChaosEngine), so Kubernetes deletion propagation removes them as well when the ChaosSchedule CR is deleted.
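To make the ownership relation concrete, here is a minimal Python sketch of what an ownerReferences entry on a ChaosEngine could look like, using plain dictionaries. The field names (apiVersion, kind, name, uid, controller, blockOwnerDeletion) are the standard Kubernetes owner reference fields; which flags the scheduler actually sets is an assumption, and the helper names are hypothetical.

```python
def owner_reference(schedule):
    """Build an ownerReferences entry that points at a ChaosSchedule object."""
    return {
        "apiVersion": schedule["apiVersion"],
        "kind": schedule["kind"],
        "name": schedule["metadata"]["name"],
        "uid": schedule["metadata"]["uid"],
        # Assumption: the schedule is marked as the managing controller.
        "controller": True,
        # Assumption: with foreground deletion, the owner is kept until
        # this dependent has been removed.
        "blockOwnerDeletion": True,
    }


def attach_owner(engine, schedule):
    """Add the schedule as an owner of the engine and return the engine."""
    metadata = engine.setdefault("metadata", {})
    metadata.setdefault("ownerReferences", []).append(owner_reference(schedule))
    return engine


if __name__ == "__main__":
    schedule = {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosSchedule",
        # The uid here is a placeholder; the API server assigns the real one.
        "metadata": {"name": "schedule-nginx", "uid": "placeholder-uid"},
    }
    engine = {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": "schedule-nginx"},
    }
    print(attach_owner(engine, schedule)["metadata"]["ownerReferences"])
```

With such a reference in place, the Kubernetes garbage collector treats the ChaosEngine as a dependent of the ChaosSchedule and cleans it up when the owner goes away.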
In this section, let us view the steps involved in setting up a demo environment to try out the Chaos Scheduler.
Install the Litmus chaos operator and its CRDs:
kubectl apply -f https://litmuschaos.github.io/pages/litmus-operator-latest.yaml
namespace/litmus created
serviceaccount/litmus created
clusterrole.rbac.authorization.k8s.io/litmus created
clusterrolebinding.rbac.authorization.k8s.io/litmus created
deployment.apps/chaos-operator-ce created
customresourcedefinition.apiextensions.k8s.io/chaosengines.litmuschaos.io created
customresourcedefinition.apiextensions.k8s.io/chaosexperiments.litmuschaos.io created
customresourcedefinition.apiextensions.k8s.io/chaosresults.litmuschaos.io created
Install the ChaosSchedule CRD:
kubectl apply -f https://raw.githubusercontent.com/litmuschaos/chaos-scheduler/master/deploy/crds/chaosschedule_crd.yaml
customresourcedefinition.apiextensions.k8s.io/chaosschedules.litmuschaos.io created
Deploy the chaos scheduler:
kubectl apply -f https://raw.githubusercontent.com/litmuschaos/chaos-scheduler/master/deploy/chaos-scheduler.yaml
deployment.apps/chaos-scheduler created
NOTE: In this example, I intend to inject chaos on a single replica Nginx deployment running in the default namespace. Modify according to your environment.
Install the pod-delete experiment:
kubectl apply -f https://raw.githubusercontent.com/litmuschaos/chaos-charts/1.4.0/charts/generic/pod-delete/experiment.yaml
chaosexperiment.litmuschaos.io/pod-delete created
Create the service account and RBAC for the experiment:
kubectl apply -f https://raw.githubusercontent.com/litmuschaos/chaos-charts/1.4.0/charts/generic/pod-delete/rbac.yaml
serviceaccount/pod-delete-sa created
role.rbac.authorization.k8s.io/pod-delete-sa created
rolebinding.rbac.authorization.k8s.io/pod-delete-sa created
Annotate the Nginx deployment to allow chaos on it:
kubectl annotate deploy/nginx-deployment litmuschaos.io/chaos="true"
deployment.extensions/nginx-deployment annotated
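Verify that the operator and scheduler pods are running:
kubectl get po -n litmus
chaos-operator-ce-5cd5894879-k7wgz 1/1 Running 0 10m
chaos-scheduler-84fcccb5bd-mjpnj 1/1 Running 0 10m
Check the service accounts in the litmus namespace:
kubectl get sa -n litmus
default 1 10m
litmus 1 10m
scheduler 1 10m
Check the service accounts in the default namespace:
kubectl get sa
default 1 10m
pod-delete-sa 1 10m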
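If you would rather script this check than eyeball it, here is a small Python sketch that scans pasted `kubectl get po` output and reports pods that are not Running or not fully ready. It assumes one pod per line with the usual name, ready, status column order; the function name not_ready is hypothetical.

```python
# Pod listing as shown in this post, one pod per line.
SAMPLE = """
chaos-operator-ce-5cd5894879-k7wgz 1/1 Running 0 10m
chaos-scheduler-84fcccb5bd-mjpnj 1/1 Running 0 10m
"""


def not_ready(pod_table):
    """Return the names of pods that are not Running or not fully ready."""
    bad = []
    for line in pod_table.strip().splitlines():
        fields = line.split()
        # Skip blank lines, wrapped fragments, and an optional header row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        have, _, want = ready.partition("/")
        if status != "Running" or have != want:
            bad.append(name)
    return bad


if __name__ == "__main__":
    problems = not_ready(SAMPLE)
    print("all pods ready" if not problems else f"not ready: {problems}")
```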
Now we can safely move on and create the ChaosSchedule. Save the following manifest as chaos-schedule.yaml:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-nginx
  namespace: litmus
spec:
  schedule:
    repeat:
      startTime: "2020-05-12T05:47:00Z" # should be modified according to current UTC time
      endTime: "2020-05-12T05:52:00Z" # should be modified according to current UTC time
      minChaosInterval: "2m" # format should be like "10m" or "2h" for minutes and hours respectively
      instanceCount: "2"
      includedDays: "mon,tue,wed"
  engineTemplateSpec:
    appinfo:
      appns: 'default'
      applabel: 'app=nginx'
      appkind: 'deployment'
    # It can be true/false
    annotationCheck: 'true'
    # ex. values: ns1:name=percona,ns2:run=nginx
    auxiliaryAppInfo: ''
    chaosServiceAccount: pod-delete-sa
    monitoring: false
    # It can be delete/retain
    jobCleanUpPolicy: 'delete'
    experiments:
      - name: pod-delete
        spec:
          components:
            env:
              # set chaos duration (in sec) as desired
              - name: TOTAL_CHAOS_DURATION
                value: '30'
              # set chaos interval (in sec) as desired
              - name: CHAOS_INTERVAL
                value: '10'
              # pod failures without '--force' & default terminationGracePeriodSeconds
              - name: FORCE
                value: 'false'
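Apply the schedule:
kubectl apply -f chaos-schedule.yaml
Watch the pods as chaos is injected:
watch kubectl get pod
Describe the ChaosSchedule to check its status:
kubectl describe chaosschedule schedule-nginx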
Name:         schedule-nginx
Namespace:    default
Labels:       <none>
Annotations:
API Version:  litmuschaos.io/v1alpha1
Kind:         ChaosSchedule
Metadata:
  Creation Timestamp:  2020-05-14T08:44:32Z
  Generation:          3
  Resource Version:    899464
  Self Link:           /apis/litmuschaos.io/v1alpha1/namespaces/default/chaosschedules/schedule-nginx
  UID:                 347fb7e6-2c9d-428e-9ce1-42bdcfdab37d
Spec:
  Chaos Service Account:
  Engine Template Spec:
    Appinfo:
      Appkind:   deployment
      Applabel:  app=nginx
      Appns:     default
    Chaos Service Account:  litmus
    Components:
      Runner:
    Experiments:
      Name:  pod-delete
      Spec:
        Components:
        Rank:  0
    Job Clean Up Policy:  retain
  Schedule:
    Repeat:
      End Time:            2020-05-12T05:52:00Z
      Included Days:       Mon,Tue,Wed
      Instance Count:      2
      Min Chaos Interval:  2m
      Start Time:          2020-05-12T05:47:00Z
  Schedule State:  active
Status:
  Active:
    API Version:       litmuschaos.io/v1alpha1
    Kind:              ChaosEngine
    Name:              schedule-nginx
    Namespace:         default
    Resource Version:  899463
    UID:               14f49857-8879-4129-a5b9-a3a592149725
  Last Schedule Time:  2020-05-14T08:44:32Z
  Schedule:
    Start Time:       2020-05-14T08:44:32Z
    Status:           running
    Total Instances:  1
Events:
  Type    Reason            Age  From             Message
  ----    ------            ---- ----             -------
  Normal  SuccessfulCreate  39s  chaos-scheduler  Created engine schedule-nginx
At any point in time, we can halt a ChaosSchedule, which simply stops any further execution of chaos. Halting is useful when we do not want to disturb the production cluster or an application for a while because some important activity (a migration, for example) is in progress. We can halt the schedule without the effort of deleting and recreating it.
To halt the schedule, change spec.scheduleState to halt:
spec:
  scheduleState: halt
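The same change can be expressed as a JSON merge patch, which is handy when halting and resuming from a script. Below is a minimal Python sketch that builds the patch body and the matching kubectl patch command line without running it. It accepts only the two state values that appear in this post (active and halt); the helper names and the default namespace are assumptions.

```python
import json

# Only the states that appear in this post; others may exist.
KNOWN_STATES = ("active", "halt")


def schedule_state_patch(state):
    """Return a JSON merge patch body that sets spec.scheduleState."""
    if state not in KNOWN_STATES:
        raise ValueError(f"unexpected scheduleState: {state!r}")
    return json.dumps({"spec": {"scheduleState": state}})


def patch_command(name, state, namespace="default"):
    """Return the kubectl argument list for the patch (it is not executed here)."""
    return [
        "kubectl", "patch", "chaosschedule", name,
        "-n", namespace,
        "--type", "merge",
        "-p", schedule_state_patch(state),
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed before running it by hand.
    print(" ".join(patch_command("schedule-nginx", "halt")))
```

Passing "active" instead of "halt" produces the patch that resumes the schedule, assuming the scheduler treats that value as the running state, as the describe output above suggests.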
With the Chaos Scheduler, users are no longer burdened with re-applying ChaosEngine manifests or remembering to run chaos at different times themselves; they only have to compare execution results! As you read this, the Chaos Scheduler is being improved to support randomized execution within a time range. So, more power is coming your way! Do try out the steps and let us know what you think of the scheduler and which use-cases it should support!
Are you an SRE or a Kubernetes enthusiast? Does Chaos Engineering excite you? Join our community in the #litmus channel on the Kubernetes Slack.
Contribute to LitmusChaos and share your feedback on GitHub.
If you like LitmusChaos, become one of its many stargazers on GitHub.