Predefined Workflow with Kubera Chaos

In this article, we will be talking about a new and exciting topic: Predefined Workflow with Kubera Chaos.

Kubera Chaos is built on the popular open source LitmusChaos project. Both LitmusChaos and Kubera Chaos include community and verified chaos tests out of the box. Users can simply select chaos charts, apply them within particular namespaces, and use them across their environment.

In this blog, we’ll discuss:

  1. Introduction: Sock-Shop and its relation with Kubera Chaos Workflow
  2. Introduction: Sock-Shop-Resiliency Workflow
    - About Git-App-Deployer
    - Install Experiment
    - Load Testing and Chaos Engine
  3. Implementation with Kubera Chaos
  4. Conclusion, References, and Links.

About Sock-Shop and its relation with Kubera Chaos

Sock-Shop simulates the user-facing part of an e-commerce website that sells socks. It is intended to aid the demonstration and testing of microservice and cloud-native technologies. Sock-Shop is maintained by Weaveworks and Container Solutions.

Sock-Shop microservices are designed to have minimal expectations, using DNS to find other services. This means that it is possible to insert load-balancers and service routers, as required or desired.
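As a small illustration of DNS-based discovery, the sketch below builds the in-cluster name under which one Sock-Shop service would look up another. It assumes the standard Kubernetes service naming pattern, the default cluster domain cluster.local, and the sock-shop namespace used later in this post. The helper name service_dns_name is hypothetical.

```python
def service_dns_name(service: str, namespace: str,
                     cluster_domain: str = "cluster.local") -> str:
    """Build the fully qualified in-cluster DNS name of a Kubernetes Service.

    Assumes the standard <service>.<namespace>.svc.<cluster-domain> pattern.
    """
    return f"{service}.{namespace}.svc.{cluster_domain}"


if __name__ == "__main__":
    # The front-end service in the sock-shop namespace.
    print(service_dns_name("front-end", "sock-shop"))
```

Because callers depend only on this name, a load balancer or service router can be placed behind it without changing the calling service.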

Sock-Shop can be used to illustrate microservices architectures, demonstrate platforms at talks and meetups, or as a training and educational tool.

In Kubera Chaos, we have been working and testing our experiments on Sock-Shop applications.

Chaos Experiments contain the actual chaos details. These experiments are installed on your cluster as Kubernetes CRs.

We have two scenarios: weak and resilient. We run our experiment workloads against each one and observe the effect of the injected fault.

This workflow allows the execution of the same chaos experiment against two versions of the Sock-Shop deployment: weak and resilient. The weak deployment is expected to result in a failed workflow while the resilient one succeeds, essentially highlighting the need for deployment best practices.

Sock-Shop Resiliency Workflow

Our first workflow is the Sock-Shop resiliency check.

This workflow installs and executes chaos on the demo application Sock-Shop, which simulates an e-commerce website selling socks. It injects a transient fault on an upstream microservice pod (socks catalog) while continuously checking the availability of the website. 


Title: 'sock-shop-resiliency-check'

Recommendation: Check whether the application is resilient to pod failure once the workflow is completed.

Chaos workflow CRD Link:

https://raw.githubusercontent.com/mayadata-io/kubera-chaos-charts/master/workflows/sock-shop-demo/workflow.yaml

Chaos workflow Cron CRD Link:

https://raw.githubusercontent.com/mayadata-io/kubera-chaos-charts/master/workflows/sock-shop-demo/workflow_cron.yaml

Repo Link:

https://github.com/mayadata-io/kubera-chaos-charts     

Experiment Info:
Provide the application info in spec.appinfo. Override the experiment tunables, if desired, in experiments.spec.components.env.

 

About Git-App-Deployer

Git-App-Deployer is used to install the Sock-Shop application. The user first supplies the namespace, filePath, and timeout.

Namespace:
A namespace provides an additional qualification to a unique resource name. This is helpful when multiple teams are using the same cluster and there is a potential for name collision. It acts as a virtual wall between workloads that share a cluster.

For Namespace
For Sock-Shop, the user has to pass:

-namespace=sock-shop

For Load-Test

-namespace=loadtest

For FilePath
For the weak Sock-Shop resiliency check, pass the following (the workflow manifest names this flag -typeName):

-typeName=weak

In the weak scenario, it creates a single replica of each service, all as Deployments.

For the resilient Sock-Shop resiliency check, pass:

-typeName=resilient

In the resilient scenario, it creates two replicas of each pod, with StatefulSets for the databases and Deployments for the other services.

For Timeout

Timeout limits how long the deployer waits before it terminates the application installation. The default is 300s.

You may change the default value, for example:

-timeout=400

A kubeconfig file is a file used to configure access to Kubernetes when used in conjunction with the kubectl command line tool (or other clients).

The deployer creates the namespace and then installs the required application based on the given namespace and filePath values.

If the namespace already exists, it logs the following message and starts installing Sock-Shop:

[Status]: Namespace already exists!

Sock-Shop installation will deploy all 14 manifests of Sock-Shop microservices.

Now let’s see how Git-App-Deployer works in the workflow:

First, Git-App-Deployer performs the application installation.

- name: install-application
  container:
    image: litmuschaos/litmus-app-deployer:latest
    args: ["-namespace=sock-shop", "-typeName=weak", "-timeout=400"]

Note: For the resilient scenario, set the typeName flag to resilient (-typeName=resilient).
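The deployer flags described above can be assembled programmatically when generating workflow manifests. The following is a minimal sketch that only builds the argument list. The helper build_deployer_args is hypothetical and not part of Git-App-Deployer, and it assumes the three flags shown in this post (-namespace, -typeName, -timeout) with the 300s default timeout.

```python
from __future__ import annotations

VALID_TYPES = ("weak", "resilient")


def build_deployer_args(namespace: str, type_name: str | None = None,
                        timeout: int = 300) -> list[str]:
    """Build the container args list for the install-application step.

    type_name is optional because the load-test step in this post passes
    only a namespace.
    """
    args = [f"-namespace={namespace}"]
    if type_name is not None:
        if type_name not in VALID_TYPES:
            raise ValueError(f"type_name must be one of {VALID_TYPES}")
        args.append(f"-typeName={type_name}")
    args.append(f"-timeout={timeout}")
    return args


if __name__ == "__main__":
    print(build_deployer_args("sock-shop", "weak", 400))
    print(build_deployer_args("sock-shop", "resilient"))
```

Validating the scenario name up front catches a mistyped value before the workflow is submitted.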

In the weak scenario, each service runs a single replica. You can check this with the following command:

kubectl get po -n sock-shop

For the weak scenario, the output looks like this:

NAME                            READY   STATUS    RESTARTS   AGE
carts-754c96bf74-hbwlv          1/1     Running   0          4m8s
carts-db-69f74c65bd-xkm8s       1/1     Running   0          4m9s
catalogue-67f86d6587-8gghr      1/1     Running   0          4m8s
catalogue-db-66bc8df878-km7m9   1/1     Running   0          4m8s
front-end-5d96ff485b-wkn75      1/1     Running   0          4m8s
orders-67dbccdfdf-bz4sp         1/1     Running   0          4m8s
orders-db-79c689d8c7-r52qc      1/1     Running   0          4m8s
payment-6787495757-p7l6l        1/1     Running   0          4m8s
queue-master-67478795b7-z8wxz   1/1     Running   0          4m8s
rabbitmq-c8fcd79c9-gzdsc        1/1     Running   0          4m7s
shipping-d49997689-zm72c        1/1     Running   0          4m7s
user-7498444df6-l8wqm           1/1     Running   0          4m6s
user-db-64b9d4f4d-2z6zj         1/1     Running   0          4m7s
user-load-dc4586796-5vwvz       1/1     Running   0          4m7s

If you run with the resilient option, the output shows the additional replicas:

NAME                            READY   STATUS    RESTARTS   AGE
carts-754c96bf74-hbwlv          1/1     Running   0          19m
carts-754c96bf74-w7bzt          1/1     Running   0          3m48s
carts-db-0                      1/1     Running   0          3m48s
carts-db-1                      1/1     Running   0          3m44s
carts-db-69f74c65bd-xkm8s       1/1     Running   0          19m
catalogue-67f86d6587-k4ncs      1/1     Running   0          3m48s
catalogue-67f86d6587-r4mpl      1/1     Running   0          12m
catalogue-db-0                  1/1     Running   0          3m48s
catalogue-db-1                  1/1     Running   0          3m29s
catalogue-db-66bc8df878-km7m9   1/1     Running   0          19m
front-end-5d96ff485b-lw7vp      1/1     Running   0          3m47s
front-end-5d96ff485b-wkn75      1/1     Running   1          19m
orders-67dbccdfdf-bz4sp         1/1     Running   0          19m
orders-67dbccdfdf-rxll5         0/1     Running   0          3m47s
orders-db-0                     1/1     Running   0          3m47s
orders-db-1                     1/1     Running   0          3m33s
orders-db-79c689d8c7-r52qc      1/1     Running   0          19m
payment-6787495757-7njtm        1/1     Running   0          3m47s
payment-6787495757-p7l6l        1/1     Running   0          19m
queue-master-67478795b7-76n2n   1/1     Running   0          3m47s
queue-master-67478795b7-z8wxz   1/1     Running   0          19m
rabbitmq-c8fcd79c9-2sjxr        1/1     Running   0          3m47s
rabbitmq-c8fcd79c9-gzdsc        1/1     Running   0          19m
shipping-d49997689-l9gls        1/1     Running   0          3m47s
shipping-d49997689-zm72c        1/1     Running   0          19m
user-7498444df6-6t8cq           1/1     Running   0          3m47s
user-7498444df6-l8wqm           1/1     Running   0          19m
user-db-0                       1/1     Running   0          3m47s
user-db-1                       1/1     Running   0          3m1s
user-db-64b9d4f4d-2z6zj         1/1     Running   0          19m
user-load-dc4586796-5vwvz       1/1     Running   0          19m
user-load-dc4586796-wsh59       1/1     Running   0          3m47s
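To compare the two listings without counting rows by hand, the sketch below tallies ready pods per service from the text of a kubectl get po listing. It is a heuristic, not a Kubernetes API call. It assumes Deployment pods are named service-hash-hash and StatefulSet pods are named service-ordinal, as in the output above, and the function names are hypothetical.

```python
from collections import Counter


def service_name(pod_name: str) -> str:
    """Guess the owning service from a pod name (heuristic)."""
    parts = pod_name.split("-")
    if parts[-1].isdigit():
        # StatefulSet pod, e.g. catalogue-db-0 -> catalogue-db
        return "-".join(parts[:-1])
    # Deployment pod, e.g. catalogue-67f86d6587-k4ncs -> catalogue
    return "-".join(parts[:-2])


def ready_replicas(listing: str) -> Counter:
    """Count pods per service that are Running with all containers ready."""
    counts = Counter()
    for line in listing.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        ready_now, wanted = ready.split("/")
        if status == "Running" and ready_now == wanted:
            counts[service_name(name)] += 1
    return counts


SAMPLE = """\
NAME                            READY   STATUS    RESTARTS   AGE
catalogue-67f86d6587-k4ncs      1/1     Running   0          3m48s
catalogue-67f86d6587-r4mpl      1/1     Running   0          12m
catalogue-db-0                  1/1     Running   0          3m48s
orders-67dbccdfdf-rxll5         0/1     Running   0          3m47s
"""

if __name__ == "__main__":
    for service, count in sorted(ready_replicas(SAMPLE).items()):
        print(service, count)
```

A pod that is Running but not yet ready, such as the 0/1 orders pod in the sample, is not counted, since it cannot serve traffic yet.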


Load-Test:

The load test packages a test script in a container for Locust that simulates user traffic to Sock-Shop. Please run it against the front-end service. The address and port of the frontend will be different and depend on which platform you've deployed to. See the notes for each deployment.

It runs in parallel with the chaos engine and generates load against the front-end service.

In the manifest, it is written as:

- name: install-application
  container:
    image: litmuschaos/litmus-app-deployer:latest
    args: ["-namespace=loadtest"]

The load test runs 2 replicas, as shown below:

oumkale@mayadata:~$ kubectl get po -n loadtest
NAME                         READY   STATUS    RESTARTS   AGE
load-test-5d489d8c9d-mxc5g   1/1     Running   0          88s
load-test-5d489d8c9d-qnrbs   1/1     Running   0          88s


Install Experiment:

Chaos experiments contain the actual chaos details. These experiments are installed on your cluster as Kubernetes CRs. The Chaos Experiments are grouped as Chaos Charts and are published on the Chaos Hub.

Generic chaos experiments such as pod-delete, container-kill, and pod-network-latency are available under the Generic Chaos Chart. This is the first chart you are recommended to install.

Verify that the chaos experiments are installed:

kubectl get chaosexperiments -n <namespace>

In our Sock-Shop resiliency check, we use the generic pod-delete experiment.

Installing and running the experiment involves a few main concepts.


ChaosEngine

The ChaosEngine is the main user-facing chaos custom resource with a namespace scope and is designed to hold information around how the chaos experiments are executed. It connects an application instance with one or more chaos experiments while allowing users to specify run-level details (override experiment defaults, provide new environment variables and volumes, options to delete or retain experiment pods, etc.). This CR is also updated/patched with the status of the chaos experiments, making it the single source of truth with respect to the chaos.


ChaosExperiment

ChaosExperiment CR is the heart of LitmusChaos and contains the low-level execution information. They serve as off-the-shelf templates that one needs to "pull" (install on the cluster) before including them as part of chaos run against any target applications (the binding being defined in the ChaosEngine). The experiments are installed on the cluster as Kubernetes custom resources and are designed to hold granular details of the experiment such as image, library, necessary permissions, chaos parameters (set to their default values). Most of the ChaosExperiment parameters are essentially tunables that can be overridden from the ChaosEngine resource.
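The override relationship between the two resources can be pictured as a merge of environment variable lists, where entries from the ChaosEngine win over the ChaosExperiment defaults. The sketch below models only that idea and is not the Litmus operator's code. The function name effective_env is hypothetical, and the values shown are illustrative rather than taken from the actual pod-delete chart.

```python
def effective_env(experiment_env, engine_env):
    """Merge Kubernetes-style env lists; engine entries override defaults.

    Both arguments are lists of {"name": ..., "value": ...} dicts.
    """
    merged = {item["name"]: item["value"] for item in experiment_env}
    merged.update({item["name"]: item["value"] for item in engine_env})
    return [{"name": name, "value": value} for name, value in merged.items()]


# Illustrative defaults, as they might appear in a ChaosExperiment.
EXPERIMENT_DEFAULTS = [
    {"name": "TOTAL_CHAOS_DURATION", "value": "15"},
    {"name": "CHAOS_INTERVAL", "value": "5"},
    {"name": "FORCE", "value": "false"},
]

# Illustrative override, as it might appear under
# experiments.spec.components.env in a ChaosEngine.
ENGINE_OVERRIDES = [
    {"name": "TOTAL_CHAOS_DURATION", "value": "30"},
]

if __name__ == "__main__":
    for item in effective_env(EXPERIMENT_DEFAULTS, ENGINE_OVERRIDES):
        print(item["name"], "=", item["value"])
```

Only the variables named in the engine change. Everything else keeps the default shipped with the experiment.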


ChaosResult

ChaosResult resource holds the results of a ChaosExperiment with a namespace scope. It is created or updated at runtime by the experiment itself. It holds important information like the ChaosEngine reference, Experiment State, Verdict of the experiment (on completion), salient application/result attributes. It is also a source for metrics collection. It is updated/patched with the status of the experiment run. It is not removed as part of the default cleanup procedures to allow for extended reference.


LitmusProbe HTTP

httpProbe

The HTTP probe allows developers to specify a URL which the experiment uses to gauge health/service availability (or other custom conditions) as part of the entry/exit criteria. The received status code is compared against an expected status. It is defined at the .spec.experiments[].spec.probe path inside the ChaosEngine.

probe:
- name: "check-frontend-access-url"
  type: "httpProbe"
  httpProbe/inputs:
    url: "http://front-end.sock-shop.svc.cluster.local"
    expectedResponseCode: "200"
  mode: "Continuous"
  runProperties:
    probeTimeout: 2
    interval: 1
    retry: 1
    probePollingInterval: 1

We run the probe against the front-end URL “http://front-end.sock-shop.svc.cluster.local” with a probe polling interval of 1 second.

In the weak scenario, only one replica of the pod is present. After chaos injection it goes down, the front end becomes inaccessible, and the probe eventually fails.

In the resilient scenario, two replicas of the pod are present. After chaos injection one goes down, but the other is still up, so the front end stays accessible and the probe passes.

With the run properties above, the probe retries once at a one-second interval. If a check gets no response within the two-second probe timeout, it fails.
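The difference between the two scenarios can be seen in a toy simulation. The sketch below is not how Litmus implements the probe. It assumes one pod is unavailable during a chaos window, that the service answers 200 while at least one pod is up and 503 otherwise, and that a continuous probe passes only if every poll returns the expected code. The tick counts and function names are hypothetical.

```python
def poll_front_end(replicas, ticks=6, chaos_start=2, chaos_ticks=2):
    """Return the status code seen at each poll for a service with `replicas` pods.

    One pod is unavailable during the chaos window. The service answers 200
    while at least one pod is still up, otherwise 503.
    """
    codes = []
    for tick in range(ticks):
        in_chaos = chaos_start <= tick < chaos_start + chaos_ticks
        live_pods = replicas - 1 if in_chaos else replicas
        codes.append(200 if live_pods > 0 else 503)
    return codes


def continuous_probe_passes(codes, expected=200):
    """A continuous probe passes only if every polled code is the expected one."""
    return all(code == expected for code in codes)


if __name__ == "__main__":
    for label, replicas in (("weak", 1), ("resilient", 2)):
        codes = poll_front_end(replicas)
        verdict = "Pass" if continuous_probe_passes(codes) else "Fail"
        print(label, codes, verdict)
```

With one replica, the polls during the chaos window return an error and the probe fails. With two replicas, every poll succeeds.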

We are working on more predefined workflows and Kubernetes workflows, which will be available in upcoming releases.

For our earlier implementation of Chaos Workflows with Argo and LitmusChaos, please refer to the post "LitmusChaos + Argo = Chaos Workflow".

To learn more about the Argo workflow and its operation, please refer to this blog post.

See below for the workflow in action. 

In a weak scenario after chaos injection, it will fail. In a resilient scenario, after chaos injection, it will pass.


oumkale@mayadata:~$ kubectl describe chaosresult catalogue-pod-delete-chaos-pod-delete -n kubera
Name:         catalogue-pod-delete-chaos-pod-delete
Namespace:    kubera
Labels:       chaosUID=0cca0456-9466-4033-95f5-f1b9da31b6a5
              controller-uid=8992e456-bb15-49e9-9444-9f7c7ed75a64
              job-name=pod-delete-ml6og8
              name=catalogue-pod-delete-chaos-pod-delete
Annotations:  <none>
API Version:  litmuschaos.io/v1alpha1
Kind:         ChaosResult
Metadata:
  Creation Timestamp:  2020-11-20T19:10:24Z
  Generation:          6
  Resource Version:    15562
  Self Link:           /apis/litmuschaos.io/v1alpha1/namespaces/kubera/chaosresults/catalogue-pod-delete-chaos-pod-delete
  UID:                 a6be983e-0b75-41eb-b1df-c0cd0ccb718f
Spec:
  Engine:      catalogue-pod-delete-chaos
  Experiment:  pod-delete
Status:
  Experimentstatus:
    Fail Step:                 N/A
    Phase:                     Completed
    Probe Success Percentage:  100
    Verdict:                   Pass
  Probe Status:
    Name:  check-frontend-access-url
    Status:
      Continuous:  Passed 👍 
    Type:          httpProbe
Events:
  Type     Reason   Age    From                     Message
  ----     ------   ----   ----                     -------
  Normal   Awaited  23m    pod-delete-3i8inn-f4xll  experiment: pod-delete, Result: Awaited
  Warning  Fail     20m    pod-delete-3i8inn-f4xll  experiment: pod-delete, Result: Fail
  Normal   Awaited  10m    pod-delete-bs0230-rqxp6  experiment: pod-delete, Result: Awaited
  Normal   Pass     6m56s  pod-delete-bs0230-rqxp6  experiment: pod-delete, Result: Pass
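When the check needs to be automated, the key fields can be pulled out of the describe output shown above. The sketch below is a plain text parser over that output format and does not call the Kubernetes API. The function name parse_chaos_result is hypothetical, and the sample is a shortened copy of the Status section.

```python
import re

FIELDS = ("Fail Step", "Phase", "Probe Success Percentage", "Verdict")


def parse_chaos_result(describe_output: str) -> dict:
    """Extract selected 'Key:  value' fields from kubectl describe text."""
    result = {}
    for key in FIELDS:
        pattern = rf"^[ \t]*{re.escape(key)}:[ \t]*(.+?)[ \t]*$"
        match = re.search(pattern, describe_output, re.MULTILINE)
        if match:
            result[key] = match.group(1)
    return result


SAMPLE = """\
Status:
  Experimentstatus:
    Fail Step:                 N/A
    Phase:                     Completed
    Probe Success Percentage:  100
    Verdict:                   Pass
"""

if __name__ == "__main__":
    status = parse_chaos_result(SAMPLE)
    print(status["Verdict"], status["Probe Success Percentage"])
```

A script can then treat any Verdict other than Pass as a failed resiliency check.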


Conclusion

We have briefly discussed the predefined Sock-Shop resiliency workflow, covering the Sock-Shop application and its relationship with Kubera Chaos, the Sock-Shop resiliency workflow itself, Git-App-Deployer and its operation, the installation of experiments, load testing, and the chaos engine implementation with Kubera Chaos. In Kubera Chaos, we are continuously working on and testing our experiments against Sock-Shop applications.

Among the different predefined workflows and Kubernetes workflows being implemented, our first predefined workflow is Sock-Shop-Resiliency-Check, which we have walked through along with the effects seen in each scenario and their causes.

References and Links

Are you an SRE, developer, or a Kubernetes enthusiast? 
Does Chaos Engineering excite you?
Join our community on Slack for detailed discussions and regular updates on Chaos Engineering for Kubernetes. Check out the LitmusChaos GitHub repo and share your feedback. Submit a pull request if you identify any necessary changes.

LitmusChaos
Litmus ChaosHub
Kubera Chaos Charts
Test Tools
Connect with us on Slack
Litmus Chaos YouTube Channel

Abhishek Raj
Abhishek is a Customer Success Engineer at MayaData. He is currently working with Kubernetes and Docker.
Paul Burt
Prior to working with MayaData, Paul worked with NetApp and Red Hat in senior positions. He’s upvoting your /r/kubernetes threads. Paul has a knack for demystifying infrastructure and making gnarly, complex topics approachable. He enjoys home brewing beer, reading independent comics, and yelling at his computer when it doesn’t do what he wants.