In this article, we will talk about a new and exciting topic: predefined workflows with Kubera Chaos.
Kubera Chaos is based upon the popular LitmusChaos project. The open-source foundation of Litmus covers both LitmusChaos and Kubera Chaos, which ship with community-contributed and verified chaos tests out of the box. Users can simply select and run chaos charts within particular namespaces and use them across their environment.
In this blog, we’ll discuss:
- Introduction: Sock-Shop and its relation to the Kubera Chaos Workflow
- Introduction to the Sock-Shop-Resiliency Workflow
- About Git-App-Deployer
- Install Experiment
- Load Testing and Chaos Engine
- Implementation with Kubera Chaos
- Conclusion, References, and Links.
About Sock-Shop and its relation to Kubera Chaos
Sock-Shop simulates the user-facing part of an e-commerce website that sells socks. It is intended to aid the demonstration and testing of microservice and cloud-native technologies. Sock-Shop is maintained by Weaveworks and Container Solutions.
Sock-Shop microservices are designed to have minimal expectations, using DNS to find other services. This means that it is possible to insert load-balancers and service routers, as required or desired.
Sock-Shop can be used to illustrate microservices architectures, demonstrate platforms at talks and meetups, or as a training and educational tool.
In Kubera Chaos, we have been working and testing our experiments on Sock-Shop applications.
Chaos Experiments contain the actual chaos details. These experiments are installed on your cluster as Kubernetes CRs.
We have two scenarios: weak and resilient. Based on these, we will run our experiment workloads and observe the effects.
This workflow executes the same chaos experiment against two versions of the Sock-Shop deployment: weak and resilient. The weak deployment is expected to result in a failed workflow, while the resilient one succeeds, essentially highlighting the need for deployment best practices.
Sock-Shop Resiliency Workflow
Our first workflow is the Sock-Shop resiliency check.
This workflow installs and executes chaos on the demo application Sock-Shop, which simulates an e-commerce website selling socks. It injects a transient fault on an upstream microservice pod (socks catalog) while continuously checking the availability of the website.
Recommendation: Check whether the application is resilient to pod failure once the workflow is completed.
Chaos workflow CRD link:
Chaos workflow Cron CRD Link:
Provide the application info in spec.appinfo. Override the experiment tunables, if desired, in experiments.spec.components.env.
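As an illustration, here is a hedged ChaosEngine fragment showing where these fields live. The namespace, label selector, and tunable values below are examples, not the workflow's actual settings:

```yaml
spec:
  appinfo:
    appns: "sock-shop"          # namespace of the target application (example)
    applabel: "name=catalogue"  # label selector for the target pods (example)
    appkind: "deployment"
  experiments:
  - name: pod-delete
    spec:
      components:
        env:
        # Override default experiment tunables here
        - name: TOTAL_CHAOS_DURATION
          value: "30"
        - name: CHAOS_INTERVAL
          value: "10"
```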
Git-App-Deployer is used for the installation of the Sock-Shop application. At first, the user is asked to provide the namespace, filePath, and timeout.
A namespace provides an additional qualification to a unique resource name. This is helpful when multiple teams use the same cluster and there is a potential for name collisions. It acts as a virtual wall between workloads within a single cluster.
For Sock-Shop, the user has to pass the namespace and the deployment type.
For the weak Sock-Shop resiliency check, just pass -typeName=weak.
In the weak scenario, it creates a single replica and Deployments for all components.
For the resilient Sock-Shop resiliency check, just pass -typeName=resilient.
In the resilient scenario, it creates two replicas of each pod, with StatefulSets for the databases and Deployments for the others.
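To illustrate the difference between the two variants, here is a sketch of how the resilient variant might declare one of the stateless services. The image tag and labels are illustrative; the actual manifests live in the Git-App-Deployer repository:

```yaml
# Illustrative resilient variant: two replicas survive the loss of one pod.
# The weak variant would be identical except for replicas: 1.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: catalogue
  namespace: sock-shop
spec:
  replicas: 2
  selector:
    matchLabels:
      name: catalogue
  template:
    metadata:
      labels:
        name: catalogue
    spec:
      containers:
      - name: catalogue
        image: weaveworksdemos/catalogue:0.3.5  # example image tag
```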
Timeout is used for terminating the application installation; by default it is 300s.
You may change the default value, e.g. -timeout=400.
A kubeconfig file is a file used to configure access to Kubernetes when used in conjunction with the kubectl command line tool (or other clients).
It creates a namespace and then installs the required application based on the given -namespace and -filepath.
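In other words, the deployer first applies the equivalent of the following Namespace manifest (using the name given via -namespace), and then the application manifests from the given file path:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: sock-shop
```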
If the namespace already exists, it logs a message and starts installing Sock-Shop:
[Status]: Namespace already exists!
Sock-Shop installation will deploy all 14 manifests of Sock-Shop microservices.
Now let’s see how the git app deployer works in the workflow:
At first, the installation of Git-App-Deployer(application installation) is performed.
```yaml
- name: install-application
  container:
    image: litmuschaos/litmus-app-deployer:latest
    args: ["-namespace=sock-shop", "-typeName=weak", "-timeout=400"]
```
Note: for the resilient check, set the type flag to resilient (-typeName=resilient).
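For the resilient check, the workflow step is the same, with only the type flag changed:

```yaml
- name: install-application
  container:
    image: litmuschaos/litmus-app-deployer:latest
    args: ["-namespace=sock-shop", "-typeName=resilient", "-timeout=400"]
```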
In the weak scenario, there will be a single replica of each pod. You may check using this command:
kubectl get po -n sock-shop
In the terminal, the output for the weak scenario will be:
```
NAME                            READY   STATUS    RESTARTS   AGE
carts-754c96bf74-hbwlv          1/1     Running   0          4m8s
carts-db-69f74c65bd-xkm8s       1/1     Running   0          4m9s
catalogue-67f86d6587-8gghr      1/1     Running   0          4m8s
catalogue-db-66bc8df878-km7m9   1/1     Running   0          4m8s
front-end-5d96ff485b-wkn75      1/1     Running   0          4m8s
orders-67dbccdfdf-bz4sp         1/1     Running   0          4m8s
orders-db-79c689d8c7-r52qc      1/1     Running   0          4m8s
payment-6787495757-p7l6l        1/1     Running   0          4m8s
queue-master-67478795b7-z8wxz   1/1     Running   0          4m8s
rabbitmq-c8fcd79c9-gzdsc        1/1     Running   0          4m7s
shipping-d49997689-zm72c        1/1     Running   0          4m7s
user-7498444df6-l8wqm           1/1     Running   0          4m6s
user-db-64b9d4f4d-2z6zj         1/1     Running   0          4m7s
user-load-dc4586796-5vwvz       1/1     Running   0          4m7s
```
In the resilient scenario, two replicas will be shown for each pod. (If you are running with the resilient option, the output would be as below.)
```
NAME                            READY   STATUS    RESTARTS   AGE
carts-754c96bf74-hbwlv          1/1     Running   0          19m
carts-754c96bf74-w7bzt          1/1     Running   0          3m48s
carts-db-0                      1/1     Running   0          3m48s
carts-db-1                      1/1     Running   0          3m44s
carts-db-69f74c65bd-xkm8s       1/1     Running   0          19m
catalogue-67f86d6587-k4ncs      1/1     Running   0          3m48s
catalogue-67f86d6587-r4mpl      1/1     Running   0          12m
catalogue-db-0                  1/1     Running   0          3m48s
catalogue-db-1                  1/1     Running   0          3m29s
catalogue-db-66bc8df878-km7m9   1/1     Running   0          19m
front-end-5d96ff485b-lw7vp      1/1     Running   0          3m47s
front-end-5d96ff485b-wkn75      1/1     Running   1          19m
orders-67dbccdfdf-bz4sp         1/1     Running   0          19m
orders-67dbccdfdf-rxll5         0/1     Running   0          3m47s
orders-db-0                     1/1     Running   0          3m47s
orders-db-1                     1/1     Running   0          3m33s
orders-db-79c689d8c7-r52qc      1/1     Running   0          19m
payment-6787495757-7njtm        1/1     Running   0          3m47s
payment-6787495757-p7l6l        1/1     Running   0          19m
queue-master-67478795b7-76n2n   1/1     Running   0          3m47s
queue-master-67478795b7-z8wxz   1/1     Running   0          19m
rabbitmq-c8fcd79c9-2sjxr        1/1     Running   0          3m47s
rabbitmq-c8fcd79c9-gzdsc        1/1     Running   0          19m
shipping-d49997689-l9gls        1/1     Running   0          3m47s
shipping-d49997689-zm72c        1/1     Running   0          19m
user-7498444df6-6t8cq           1/1     Running   0          3m47s
user-7498444df6-l8wqm           1/1     Running   0          19m
user-db-0                       1/1     Running   0          3m47s
user-db-1                       1/1     Running   0          3m1s
user-db-64b9d4f4d-2z6zj         1/1     Running   0          19m
user-load-dc4586796-5vwvz       1/1     Running   0          19m
user-load-dc4586796-wsh59       1/1     Running   0          3m47s
```
The load test packages a Locust test script in a container that simulates user traffic to Sock-Shop. Run it against the front-end service; the address and port of the front end will differ depending on which platform you have deployed to (see the notes for each deployment).
It runs in parallel with the ChaosEngine, generating load against the front-end service.
In the manifest, it is written as:
```yaml
- name: install-application
  container:
    image: litmuschaos/litmus-app-deployer:latest
    args: ["-namespace=loadtest"]
```
The load test has two replicas, as shown below:
```
oumkale@mayadata:~$ kubectl get po -n loadtest
NAME                         READY   STATUS    RESTARTS   AGE
load-test-5d489d8c9d-mxc5g   1/1     Running   0          88s
load-test-5d489d8c9d-qnrbs   1/1     Running   0          88s
```
Chaos experiments contain the actual chaos details. These experiments are installed on your cluster as Kubernetes CRs. The Chaos Experiments are grouped as Chaos Charts and are published on the Chaos Hub.
The generic chaos experiments such as pod-delete, container-kill, pod-network-latency are available under Generic Chaos Chart. This is the first chart you are recommended to install.
Verify if the chaos experiments are installed.
kubectl get chaosexperiments -n <namespace>
In our Sock-Shop Resiliency check, we have generic chaos such as pod-delete.
The installation of the experiment relies on a few main concepts.
The ChaosEngine is the main user-facing chaos custom resource with a namespace scope and is designed to hold information around how the chaos experiments are executed. It connects an application instance with one or more chaos experiments while allowing the users to specify run level details (override experiment defaults, provide new environment variables and volumes, options to delete or retain experiment pods, etc.,). This CR is also updated/patched with the status of the chaos experiments, making it the single source of truth with respect to the chaos.
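A minimal ChaosEngine for this workflow could look like the following. The engine name matches the ChaosResult shown later in this post, while the label selector and service account are assumptions for illustration:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: catalogue-pod-delete-chaos
  namespace: sock-shop
spec:
  engineState: "active"
  appinfo:
    appns: "sock-shop"
    applabel: "name=catalogue"       # assumed label selector
    appkind: "deployment"
  chaosServiceAccount: litmus-admin  # assumed service account
  jobCleanUpPolicy: "retain"
  experiments:
  - name: pod-delete
```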
ChaosExperiment CR is the heart of LitmusChaos and contains the low-level execution information. They serve as off-the-shelf templates that one needs to "pull" (install on the cluster) before including them as part of chaos run against any target applications (the binding being defined in the ChaosEngine). The experiments are installed on the cluster as Kubernetes custom resources and are designed to hold granular details of the experiment such as image, library, necessary permissions, chaos parameters (set to their default values). Most of the ChaosExperiment parameters are essentially tunables that can be overridden from the ChaosEngine resource.
ChaosResult resource holds the results of a ChaosExperiment with a namespace scope. It is created or updated at runtime by the experiment itself. It holds important information like the ChaosEngine reference, Experiment State, Verdict of the experiment (on completion), salient application/result attributes. It is also a source for metrics collection. It is updated/patched with the status of the experiment run. It is not removed as part of the default cleanup procedures to allow for extended reference.
The HTTP probe allows developers to specify a URL which the experiment uses to gauge health/service availability (or other custom conditions) as part of the entry/exit criteria. The received status code is mapped against an expected status. It can be defined at the .spec.experiments[].spec.probe path inside the ChaosEngine.
```yaml
probe:
- name: "check-frontend-access-url"
  type: "httpProbe"
  httpProbe/inputs:
    url: "http://front-end.sock-shop.svc.cluster.local"
    expectedResponseCode: "200"
  mode: "Continuous"
  runProperties:
    probeTimeout: 2
    interval: 1
    retry: 1
    probePollingInterval: 1
```
We use the probe against the front-end URL “http://front-end.sock-shop.svc.cluster.local” with a probe polling interval of 1 second.
In the weak scenario, only one replica of the pod is present. After chaos injection it goes down, the front end becomes inaccessible, and the workflow therefore fails the front-end access check.
In the resilient scenario, two replicas are present. After chaos injection one goes down, but the other remains up to serve traffic, so the front-end access check passes.
The probe retries once, at a one-second interval; if a request exceeds the two-second probe timeout without the front end being accessible, the probe fails.
We have been working on different predefined workflows and Kubernetes workflows, which will be available in upcoming releases.
For Chaos Workflows with Argo and LitmusChaos, please refer to the link below, where we have described this implementation:
LitmusChaos + Argo = Chaos Workflow
To learn more about the Argo workflow and its operation, please refer to this blog post.
See below for the workflow in action.
In a weak scenario after chaos injection, it will fail. In a resilient scenario, after chaos injection, it will pass.
Select your workflow
```
oumkale@mayadata:~$ kubectl describe chaosresult catalogue-pod-delete-chaos-pod-delete -n kubera
Name:         catalogue-pod-delete-chaos-pod-delete
Namespace:    kubera
Labels:       chaosUID=0cca0456-9466-4033-95f5-f1b9da31b6a5
              controller-uid=8992e456-bb15-49e9-9444-9f7c7ed75a64
              job-name=pod-delete-ml6og8
              name=catalogue-pod-delete-chaos-pod-delete
Annotations:  <none>
API Version:  litmuschaos.io/v1alpha1
Kind:         ChaosResult
Metadata:
  Creation Timestamp:  2020-11-20T19:10:24Z
  Generation:          6
  Resource Version:    15562
  Self Link:           /apis/litmuschaos.io/v1alpha1/namespaces/kubera/chaosresults/catalogue-pod-delete-chaos-pod-delete
  UID:                 a6be983e-0b75-41eb-b1df-c0cd0ccb718f
Spec:
  Engine:      catalogue-pod-delete-chaos
  Experiment:  pod-delete
Status:
  Experimentstatus:
    Fail Step:                 N/A
    Phase:                     Completed
    Probe Success Percentage:  100
    Verdict:                   Pass
  Probe Status:
    Name:  check-frontend-access-url
    Status:
      Continuous:  Passed 👍
    Type:          httpProbe
Events:
  Type     Reason   Age    From                     Message
  ----     ------   ----   ----                     -------
  Normal   Awaited  23m    pod-delete-3i8inn-f4xll  experiment: pod-delete, Result: Awaited
  Warning  Fail     20m    pod-delete-3i8inn-f4xll  experiment: pod-delete, Result: Fail
  Normal   Awaited  10m    pod-delete-bs0230-rqxp6  experiment: pod-delete, Result: Awaited
  Normal   Pass     6m56s  pod-delete-bs0230-rqxp6  experiment: pod-delete, Result: Pass
```
We have briefly discussed the Sock-Shop resiliency predefined workflow, covering the Sock-Shop application and its relationship with Kubera Chaos, the Sock-Shop-Resiliency Workflow, Git-App-Deployer and its operation, installation of experiments, load testing, and the ChaosEngine implementation with Kubera Chaos. In Kubera Chaos, we continuously work on and test our experiments against Sock-Shop applications.
Among the different predefined and Kubernetes workflows implemented, our first predefined workflow is the Sock-Shop-Resiliency-Check, which has been elaborated here along with its scenarios, effects, and causes.
References and Links
Are you an SRE, developer, or a Kubernetes enthusiast?
Does Chaos Engineering excite you?
Join our community on Slack for detailed discussions and regular updates on chaos engineering for Kubernetes. Check out the LitmusChaos GitHub repo and share your feedback. Submit a pull request if you identify any necessary changes.