Chaos engineering

One of the best ways to evaluate the resiliency of a system is to emulate failures from within. In this section, we will use some typical patterns commonly associated with "Chaos Engineering" 👹.

Pirate chaos

Error Injection

Description

Istio permits simulating communication errors between two micro-services with a simple modification of a VirtualService.

+---+     +------------+           +-----------------+       +------------+
  #       |            +----------->                 +------->            |
 ~+~ +--->+    Front   |           |    Middleware   |       |  Database  |
 / \      |            +----XXX    |                 <-------+            |
+---+     +------------+    400    +-----------------+       +------------+

Execution

Before proceeding any further, make sure you are back to the base behaviour. To do so, you may run the following command:

Λ\: $ kubectl delete namespace workshop && \
      kubectl apply --filename 03_application-bootstrap/application-base.yml

To configure error injection, apply the following YAML:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  namespace: workshop
  name: middleware
spec:
  hosts:
    - middleware
  http:
    - route:
        - destination:
            host: middleware
            subset: version-1
      fault:
        abort:
          percent: 50 (1)
          httpStatus: 400 (2)
1 The percentage of requests you want to end in an error
2 The HTTP status code you want the injected errors to return
Λ\: $ kubectl apply --filename 06_chaos-engineering/01_error-injection/01_inject-errors.yml
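
To confirm the fault is in place, you can read the VirtualService back (a quick sanity check using plain kubectl, assuming your context still points at the workshop cluster):

Λ\: $ kubectl get virtualservice middleware --namespace workshop --output yaml

The spec.http[0].fault.abort block should match the YAML you just applied.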

A curl loop (or Siege run) should show you how the rule manifests itself in the system:

Λ\: $ while true; do curl -qs ${CLUSTER_INGRESS_IP}; echo; done;
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-01-03T00:16:53.938Z"}
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-01-03T00:16:54.471Z"}
{"timestamp":"2020-01-03T00:16:54.718+0000","path":"/","status":500,"error":"Internal Server Error","message":"400 Bad Request from GET http://middleware:8080","requestId":"695e97ad"}
{"timestamp":"2020-01-03T00:16:54.956+0000","path":"/","status":500,"error":"Internal Server Error","message":"400 Bad Request from GET http://middleware:8080","requestId":"695e97ad"}
{"timestamp":"2020-01-03T00:16:55.206+0000","path":"/","status":500,"error":"Internal Server Error","message":"400 Bad Request from GET http://middleware:8080","requestId":"695e97ad"}
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-01-03T00:16:55.651Z"}
{"timestamp":"2020-01-03T00:16:55.897+0000","path":"/","status":500,"error":"Internal Server Error","message":"400 Bad Request from GET http://middleware:8080","requestId":"695e97ad"}
{"timestamp":"2020-01-03T00:16:56.136+0000","path":"/","status":500,"error":"Internal Server Error","message":"400 Bad Request from GET http://middleware:8080","requestId":"695e97ad"}
{"timestamp":"2020-01-03T00:16:56.375+0000","path":"/","status":500,"error":"Internal Server Error","message":"400 Bad Request from GET http://middleware:8080","requestId":"695e97ad"}
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-01-03T00:16:57.089Z"}
{"timestamp":"2020-01-03T00:16:57.325+0000","path":"/","status":500,"error":"Internal Server Error","message":"400 Bad Request from GET http://middleware:8080","requestId":"695e97ad"}

Delay

Description

As in the previous step, we can inject latency into the system without any modification to our app. Istio does all the work of slowing down requests between two pods.

+---+     +------------+            +-----------------+       +------------+
  #       |            +------------>                 +------->            |
 ~+~ +--->+    Front   |            |    Middleware   |       |  Database  |
 / \      |            |-----  ----->                 <-------+            |
+---+     +------------+    |  |    +-----------------+       +------------+
                           +====+
                           |(::)|
                      Wait | )( | for 4 seconds
                           |(..)|
                           +====+

Execution

Before proceeding any further, make sure you are back to the base behaviour. To do so, you may run the following command:

Λ\: $ kubectl delete namespace workshop && \
      kubectl apply --filename 03_application-bootstrap/application-base.yml

Since the beginning of this workshop, the Middleware component has had a secret parameter, MIDDLEWARE_MAX_LATENCY, which defines the maximum time in milliseconds (the actual delay is chosen randomly between 0 and MIDDLEWARE_MAX_LATENCY) taken to respond to a request.

Let’s make it super fast ⚡ by setting this parameter to 0 with the next YAML:

apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: workshop
  labels:
    app: middleware
    version: v1
  name: middleware-v1
spec:
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  selector:
    matchLabels:
      app: middleware
      version: v1
  template:
    metadata:
      labels:
        app: middleware
        version: v1
    spec:
      containers:
        - image: stacklabs/istio-on-gke-middleware
          imagePullPolicy: Always
          env:
            - name: MIDDLEWARE_DATABASE_URI
              value: http://database:8080
            - name: SPRING_CLOUD_GCP_LOGGING_PROJECT_ID
              value: "<YOUR_GCP_PROJECT_ID>" (1)
            - name: MIDDLEWARE_MAX_LATENCY
              value: "0" (2)
            - name: MIDDLEWARE_VERSION
              value: "v1-very-fast"
          livenessProbe:
            httpGet:
              path: /actuator/health
              port: 8181
            initialDelaySeconds: 20
          name: middleware
          resources:
            requests:
              memory: "512Mi"
              cpu: 1
            limits:
              memory: "512Mi"
              cpu: 1
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
1 Don’t forget to fill in your Google Project ID
2 The super secret parameter to change the max latency
Λ\: $ kubectl apply --filename 06_chaos-engineering/02_delay/01_make-middleware-very-fast.yml
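
Before measuring anything, it is worth waiting for the new revision to be fully rolled out (a standard kubectl check; the deployment name comes from the manifest above):

Λ\: $ kubectl rollout status deployment/middleware-v1 --namespace workshop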

Now, let’s leverage Istio to reintroduce latency, this time with a VirtualService instead of an application parameter, by applying the following YAML:

Before applying the following YAML, you should start a Siege run to see the response time for each request:

Λ\: $ docker run --rm -it yokogawa/siege ${CLUSTER_INGRESS_IP} -t 30S -v
New configuration template added to /root/.siege
Run siege -C to view the current settings in that file
** SIEGE 4.0.4
** Preparing 25 concurrent users for battle.
The server is now under siege...
HTTP/1.1 200     0.49 secs:     101 bytes ==> GET  /
HTTP/1.1 200     0.49 secs:     101 bytes ==> GET  /
HTTP/1.1 200     0.49 secs:     101 bytes ==> GET  /
HTTP/1.1 200     0.52 secs:     101 bytes ==> GET  /
HTTP/1.1 200     0.60 secs:     101 bytes ==> GET  /
HTTP/1.1 200     0.60 secs:     101 bytes ==> GET  /
...
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  namespace: workshop
  name: middleware
spec:
  hosts:
    - middleware
  http:
    - route:
        - destination:
            host: middleware
            subset: version-1
      fault:
        delay:
          fixedDelay: 4.000s (1)
          percent: 50 (2)
1 The duration of the delay
2 The percentage of requests to delay
Λ\: $ kubectl apply --filename 06_chaos-engineering/02_delay/02_add-delays.yml
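
For a quick spot check before reading the Siege output, you can time a single request with curl (plain curl timing, not part of the workshop files; since only half of the requests are delayed, you may need a few attempts to hit a slow one):

Λ\: $ curl --output /dev/null --silent --write-out 'status: %{http_code}, total: %{time_total}s\n' ${CLUSTER_INGRESS_IP}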

You should see a noticeable increase in the response time for some requests:

Λ\: $ docker run --rm -it yokogawa/siege ${CLUSTER_INGRESS_IP} -t 30S -v
New configuration template added to /root/.siege
Run siege -C to view the current settings in that file
** SIEGE 4.0.4
** Preparing 25 concurrent users for battle.
The server is now under siege...
HTTP/1.1 200     0.34 secs:     101 bytes ==> GET  /
HTTP/1.1 200     0.49 secs:     101 bytes ==> GET  /
HTTP/1.1 200     0.35 secs:     101 bytes ==> GET  /
HTTP/1.1 200     0.45 secs:     101 bytes ==> GET  /
HTTP/1.1 200     0.49 secs:     101 bytes ==> GET  /
HTTP/1.1 200     0.42 secs:     100 bytes ==> GET  /
HTTP/1.1 200     0.48 secs:     101 bytes ==> GET  /
HTTP/1.1 200     0.37 secs:     101 bytes ==> GET  /
HTTP/1.1 200     0.39 secs:     101 bytes ==> GET  /
HTTP/1.1 200     0.47 secs:     101 bytes ==> GET  /
HTTP/1.1 200     0.52 secs:     101 bytes ==> GET  /
HTTP/1.1 200     0.49 secs:     101 bytes ==> GET  /
HTTP/1.1 200     0.50 secs:     101 bytes ==> GET  /
HTTP/1.1 200     4.34 secs:     101 bytes ==> GET  / (1)
HTTP/1.1 200     4.38 secs:     101 bytes ==> GET  /
HTTP/1.1 200     4.38 secs:     101 bytes ==> GET  /
HTTP/1.1 200     4.38 secs:     101 bytes ==> GET  /
HTTP/1.1 200     4.38 secs:     101 bytes ==> GET  /
HTTP/1.1 200     4.38 secs:     101 bytes ==> GET  /
HTTP/1.1 200     4.38 secs:     101 bytes ==> GET  /
HTTP/1.1 200     4.42 secs:     101 bytes ==> GET  /
HTTP/1.1 200     4.49 secs:     101 bytes ==> GET  /
...
1 First request to be delayed by Istio

To go a bit further, you can try to:

  • Use Stackdriver to visualise the total round-trip time of a request in the service mesh

Timeout

Description

What happens if a microservice is not responding? What if all requests sent to it end in a timeout? Istio lets you test your services' behaviour in those situations, to ensure they are resilient to other components being slow or down.

+---+     +------------+           +-----------------+       +------------+
  #       |            +----------->                 +------->            |
 ~+~ +--->+    Front   |           |    Middleware   |       |  Database  |
 / \      |            +----~~~    |                 <-------+            |
+---+     +------------+ Time Out  +-----------------+       +------------+

Execution

Every language and every HTTP client has its own way of configuring default timeouts. With Istio, you can define a default timeout for all calls towards a specific route.

Let’s start by setting a very high latency for the Middleware component, by applying the next YAML file.

apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: workshop
  labels:
    app: middleware
    version: v1
  name: middleware-v1
spec:
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  selector:
    matchLabels:
      app: middleware
      version: v1
  template:
    metadata:
      labels:
        app: middleware
        version: v1
    spec:
      containers:
        - image: stacklabs/istio-on-gke-middleware
          imagePullPolicy: Always
          env:
            - name: MIDDLEWARE_DATABASE_URI
              value: http://database:8080
            - name: SPRING_CLOUD_GCP_LOGGING_PROJECT_ID
              value: "<YOUR_GCP_PROJECT_ID>" (1)
            - name: MIDDLEWARE_MAX_LATENCY
              value: "3000" (2)
            - name: MIDDLEWARE_VERSION
              value: "latency-3s"
          livenessProbe:
            httpGet:
              path: /actuator/health
              port: 8181
            initialDelaySeconds: 20
          name: middleware
          resources:
            requests:
              memory: "512Mi"
              cpu: 1
            limits:
              memory: "512Mi"
              cpu: 1
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
1 Your Google Project ID
2 Set the max latency to 3 seconds (3000 ms)
Λ\: $ kubectl apply --filename 06_chaos-engineering/03_timeout/01_make-middleware-very-slow.yml
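
With no timeout rule in place yet, you can time a few requests to see the effect of the 3-second max latency (same curl timing trick as before, not part of the workshop files; stop it with Ctrl+C):

Λ\: $ while true; do curl --output /dev/null --silent --write-out 'status: %{http_code}, total: %{time_total}s\n' ${CLUSTER_INGRESS_IP}; done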

Then, let’s define a timeout rule for this micro-service using the next VirtualService.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  namespace: workshop
  name: middleware
spec:
  hosts:
    - middleware
  http:
    - route:
        - destination:
            host: middleware
            subset: version-1
      timeout: 500ms (1)
1 Set the timeout to 500ms
Λ\: $ kubectl apply --filename 06_chaos-engineering/03_timeout/02_add-timeouts.yml

Now let’s take a look at the app responses:

Λ\: $ while true; do curl ${CLUSTER_INGRESS_IP}; echo; done;
{"from":"front (v1) => middleware (latency-3s) => database (v1)","date":"2020-01-03T22:08:00.351Z"}
{"from":"front (v1) => middleware (latency-3s) => database (v1)","date":"2020-01-03T22:08:00.875Z"}
{"from":"front (v1) => middleware (latency-3s) => database (v1)","date":"2020-01-03T22:08:01.329Z"}
{"timestamp":"2020-01-03T22:08:02.046+0000","path":"/","status":500,"error":"Internal Server Error","message":"504 Gateway Timeout from GET http://middleware:8080","requestId":"a5c454c0"} (1)
{"from":"front (v1) => middleware (latency-3s) => database (v1)","date":"2020-01-03T22:08:02.465Z"}
{"from":"front (v1) => middleware (latency-3s) => database (v1)","date":"2020-01-03T22:08:02.856Z"}
{"from":"front (v1) => middleware (latency-3s) => database (v1)","date":"2020-01-03T22:08:03.433Z"}
{"from":"front (v1) => middleware (latency-3s) => database (v1)","date":"2020-01-03T22:08:04.118Z"}
{"timestamp":"2020-01-03T22:08:04.835+0000","path":"/","status":500,"error":"Internal Server Error","message":"504 Gateway Timeout from GET http://middleware:8080","requestId":"1755cee9"}
{"from":"front (v1) => middleware (latency-3s) => database (v1)","date":"2020-01-03T22:08:05.44Z"}
{"from":"front (v1) => middleware (latency-3s) => database (v1)","date":"2020-01-03T22:08:05.906Z"}
1 A timeout occurred here
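
Once you are done experimenting, you can return to the base behaviour exactly as at the start of each step:

Λ\: $ kubectl delete namespace workshop && \
      kubectl apply --filename 03_application-bootstrap/application-base.yml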