Resiliency

In this section, you will use service mesh features to improve the resiliency of your system.

Retries

Description

Instead of managing resiliency inside your application, you can let your infrastructure take care of it. Istio can trigger a retry mechanism on HTTP calls when errors occur between two microservices.

                                                 First request
                                              +----------------+
                                              |                |
+---+     +------------+      +---------------+-+     503    +-v----------+
  #       |            |      |                 +<--XXXXXXX--+            |
 ~+~ +--->+    Front   +----->+    Middleware   |            |  Database  |
 / \      |            |      |                 +----------->+            |
+---+     +------------+      +---------------^-+    Retry   +-+----------+
                                              |                |
                                              +----------------+
                                                    OK 200

Execution

Before proceeding any further, make sure you are back to the base behaviour. To do so, run the following command:

Λ\: $ kubectl delete namespace workshop && \
      kubectl apply --filename 03_application-bootstrap/application-base.yml

The Database application has a secret parameter to define an error rate for its main endpoint. You can activate it by setting an environment variable named DATABASE_ERROR_RATE to a value between 0 and 100.
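
As a quicker experiment, you could also set this variable on an already running Deployment with kubectl set env (a sketch; the Deployment name database-v1 matches the manifest used in this step):

Λ\: $ kubectl --namespace workshop set env deployment/database-v1 DATABASE_ERROR_RATE=50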

In this step, you will deploy the Database component configured with an error rate of 50%, using the following YAML.

apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: workshop
  labels:
    app: database
    version: v1
  name: database-v1
spec:
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  selector:
    matchLabels:
      app: database
      version: v1
  template:
    metadata:
      labels:
        app: database
        version: v1
    spec:
      containers:
        - image: stacklabs/istio-on-gke-database
          imagePullPolicy: Always
          env:
            - name: SPRING_CLOUD_GCP_LOGGING_PROJECT_ID
              value: "<YOUR_GCP_PROJECT_ID>"
            - name: DATABASE_ERROR_RATE
              value: "50" (1)
            - name: DATABASE_VERSION
              value: "errors-50%"
          livenessProbe:
            httpGet:
              path: /actuator/health
              port: 8181
            initialDelaySeconds: 20
          name: database
          resources:
            requests:
              memory: "512Mi"
              cpu: 1
            limits:
              memory: "512Mi"
              cpu: 1
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
1 The error rate value

Let’s apply it by running the next command:

Λ\: $ kubectl apply --filename 05_resiliency/01_retries/01_add-errors-to-database.yml

If you execute multiple sequential calls, you should notice random errors happening from time to time:

Λ\: $ while true; do curl -qs ${CLUSTER_INGRESS_IP}; echo; done;
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:25:55.527Z"}
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:25:56.065Z"}
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:25:56.756Z"}
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:25:57.248Z"}
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:25:57.996Z"}
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:25:58.878Z"}
{"timestamp":"2020-01-02T23:25:59.590+0000","path":"/","status":500,"error":"Internal Server Error","message":"500 Internal Server Error from GET http://middleware:8080","requestId":"c525845a"} (1)
{"timestamp":"2020-01-02T23:26:00.084+0000","path":"/","status":500,"error":"Internal Server Error","message":"500 Internal Server Error from GET http://middleware:8080","requestId":"c525845a"}
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:26:00.882Z"}
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:26:01.384Z"}
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:26:01.896Z"}
{"timestamp":"2020-01-02T23:26:02.333+0000","path":"/","status":500,"error":"Internal Server Error","message":"500 Internal Server Error from GET http://middleware:8080","requestId":"c525845a"}
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:26:03.019Z"}
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:26:03.795Z"}
1 First error occurs
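
To get a rough measure of the error rate, you can also count the HTTP status codes over a fixed number of calls (a sketch relying only on standard curl options and coreutils):

Λ\: $ for i in $(seq 1 50); do curl -qs -o /dev/null -w '%{http_code}\n' ${CLUSTER_INGRESS_IP}; done | sort | uniq -c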

This is a use case where the Middleware could implement a retry policy to tolerate errors coming from the component it calls. For that purpose, Istio offers a built-in feature to retry calls on error. You can configure it with the YAML below:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  namespace: workshop
  name: database
spec:
  hosts:
    - database
  http:
    - route:
        - destination:
            host: database
            subset: version-1
      retries:
        attempts: 5        # retry a failed call up to five times
        perTryTimeout: 1s  # timeout applied to each individual attempt

This VirtualService can be applied as follows:

Λ\: $ kubectl apply --filename 05_resiliency/01_retries/02_add-retries.yml

Let’s run the previous curl command once again:

Λ\: $ while true; do curl -qs ${CLUSTER_INGRESS_IP}; echo; done;
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:27:37.459Z"}
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:27:38.427Z"}
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:27:38.923Z"}
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:27:39.31Z"}
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:27:39.925Z"}
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:27:40.802Z"}
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:27:41.542Z"}
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:27:41.998Z"}
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:27:42.484Z"}
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:27:43.219Z"}
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:27:43.754Z"}
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:27:44.558Z"}
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:27:45.031Z"}
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:27:45.46Z"}
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:27:46.086Z"}

As if by magic, you should now see noticeably fewer error responses.

Pirate doing magic

This may not eliminate every error: with a 50% error rate and 5 retries, a request only fails when the initial call and all five retries fail, which in theory still happens for roughly 0.5^6 ≈ 1.6% of requests, so over a long enough loop you will still see the occasional error.
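
The kinds of failure that trigger a retry can also be controlled explicitly through the retryOn field of the VirtualService. Below is a sketch (not part of the workshop files) that retries on any 5xx response, gateway errors and connection failures; the rest of the resource mirrors the one above:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  namespace: workshop
  name: database
spec:
  hosts:
    - database
  http:
    - route:
        - destination:
            host: database
            subset: version-1
      retries:
        attempts: 5
        perTryTimeout: 1s
        retryOn: 5xx,gateway-error,connect-failure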

Traffic Limiting

Description

Istio provides the ability to limit traffic between two services. This is especially useful when load-related errors in one service could otherwise spread to others, creating a destructive chain reaction.

                              +-----------------+
                              |                 +------------------+
                          +--->  Middleware #1  |                  |
+---+     +------------+  |   |                 |            +-----v------+
  #       |            +--+   +-----------------+            |            |
 ~+~ +--->+    Front   |                                     |  Database  |
 / \      |            +--+   +-----------------+            |            |
+---+     +------------+  |   |                 |            +-----^------+
                          +--->  Middleware #2  |                  |
                              |                 +----XXXXXXXXXX----+
                              +-----------------+

                                                 Two requests at the same
                                                 time

Execution

Before proceeding any further, make sure you are back to the base behaviour. To do so, run the following command:

Λ\: $ kubectl delete namespace workshop && \
      kubectl apply --filename 03_application-bootstrap/application-base.yml

This feature is configured through the trafficPolicy field of a DestinationRule, which prevents too many concurrent requests from reaching the service at the same time.

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  namespace: workshop
  name: database
spec:
  host: database
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1   # at most one request may queue while waiting for a connection
        maxRequestsPerConnection: 1  # at most one request per connection (no keep-alive reuse)
  subsets:
    - name: version-1
      labels:
        version: v1

Let’s see what this template does by applying it as follows:

Λ\: $ kubectl apply --filename 05_resiliency/02_traffic-limiting/01_traffic-limiting.yml

To see it in action, you’ll have to trigger a lot of traffic, because Envoy keeps track of every request and only starts rejecting them once the configured limits are exceeded.

Λ\: $ docker run --rm -it yokogawa/siege ${CLUSTER_INGRESS_IP} -t 30S -c 100 -v
...
HTTP/1.1 500     0.47 secs:     193 bytes ==> GET  /
HTTP/1.1 500     0.48 secs:     193 bytes ==> GET  /
HTTP/1.1 200     0.53 secs:     102 bytes ==> GET  /
HTTP/1.1 500     0.53 secs:     193 bytes ==> GET  /
HTTP/1.1 500     0.52 secs:     193 bytes ==> GET  /
HTTP/1.1 200     0.62 secs:     102 bytes ==> GET  /
HTTP/1.1 200     0.84 secs:     102 bytes ==> GET  /
HTTP/1.1 200     0.84 secs:     102 bytes ==> GET  /
HTTP/1.1 200     0.54 secs:     102 bytes ==> GET  /

Lifting the server siege...
Transactions:		         188 hits
Availability:		       41.32 %
Elapsed time:		        3.86 secs
Data transferred:	        0.07 MB
Response time:		        1.88 secs
Transaction rate:	       48.70 trans/sec
Throughput:		        0.02 MB/sec
Concurrency:		       91.66
Successful transactions:         188
Failed transactions:	         267
Longest transaction:	        1.91
Shortest transaction:	        0.38

During this operation, you may also run a curl loop to see the actual responses:

Λ\: $ while true; do curl -qs ${CLUSTER_INGRESS_IP}; echo; done;
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:30:51.419Z"}
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:30:52.427Z"}
{"from":"front (v1) => middleware (v1) => database (errors-50%)","date":"2020-01-02T23:30:53.456Z"}
{"timestamp":"2020-01-02T23:30:54.102+0000","path":"/","status":500,"error":"Internal Server Error","message":"500 Internal Server Error from GET http://middleware:8080","requestId":"855c16f4"}
{"timestamp":"2020-01-02T23:30:54.601+0000","path":"/","status":500,"error":"Internal Server Error","message":"500 Internal Server Error from GET http://middleware:8080","requestId":"23fa27e0"}

To go a bit further, you can try to:

  • Follow the execution logs of the Envoy proxy to see what happens during traffic exclusion

  • Follow the error rates from the Stackdriver console to see the number of requests handled by each pod

Pool Ejection

The concept of pool ejection is to stop sending calls to a component instance when it starts returning too many errors (HTTP 502 and above). The goal is to let a pod cool down for a few moments before it comes back into the pool.

Before proceeding any further, make sure you are back to the base behaviour. To do so, run the following command:

Λ\: $ kubectl delete namespace workshop && \
      kubectl apply --filename 03_application-bootstrap/application-base.yml

Let’s deploy yet another version of the Middleware component with a high error rate:

apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: workshop
  labels:
    app: middleware
    version: v1
  name: middleware-with-errors
spec:
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  selector:
    matchLabels:
      app: middleware
      version: v1
  template:
    metadata:
      labels:
        app: middleware
        version: v1
    spec:
      containers:
        - image: stacklabs/istio-on-gke-middleware
          imagePullPolicy: Always
          env:
            - name: MIDDLEWARE_DATABASE_URI
              value: http://database:8080
            - name: SPRING_CLOUD_GCP_LOGGING_PROJECT_ID
              value: "<YOUR_GCP_PROJECT_ID>"
            - name: MIDDLEWARE_ERROR_RATE
              value: "50"
            - name: MIDDLEWARE_VERSION
              value: "errors-50%"
          livenessProbe:
            httpGet:
              path: /actuator/health
              port: 8181
            initialDelaySeconds: 20
          name: middleware
          resources:
            requests:
              memory: "512Mi"
              cpu: 1
            limits:
              memory: "512Mi"
              cpu: 1
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  namespace: workshop
  name: middleware
spec:
  hosts:
    - middleware
  http:
    - route:
        - destination:
            host: middleware
            subset: version-1
      retries:
        attempts: 0
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  namespace: workshop
  name: middleware
spec:
  host: middleware
  subsets:
    - name: version-1
      labels:
        version: v1

Apply it with the following command:

Λ\: $ kubectl apply --filename 05_resiliency/03_pool-ejection/01_add-numerous-errors-to-database.yml
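
Note that the faulty Deployment reuses the app: middleware and version: v1 labels, so its pods join the same version-1 subset as the healthy ones and receive a share of the traffic. You can check this with a label query:

Λ\: $ kubectl --namespace workshop get pods --selector app=middleware --show-labels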

Now let’s curl the cluster with multiple sequential calls:

Λ\: $ while true; do curl -qs ${CLUSTER_INGRESS_IP}; echo; done;
"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-05-10T13:20:09.868Z"}
{"from":"front (v1) => middleware (errors-50%) => database (v1)","date":"2020-05-10T13:20:10.403Z"}
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-05-10T13:20:10.984Z"}
{"from":"front (v1) => middleware (errors-50%) => database (v1)","date":"2020-05-10T13:20:11.513Z"}
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-05-10T13:20:11.728Z"}
{"from":"front (v1) => middleware (errors-50%) => database (v1)","date":"2020-05-10T13:20:11.921Z"}
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-05-10T13:20:12.519Z"}
{"timestamp":"2020-05-10T13:20:12.644+0000","path":"/","status":500,"error":"Internal Server Error","message":"503 Service Unavailable from GET http://middleware:8080","requestId":"25240976"}
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-05-10T13:20:12.952Z"}
{"timestamp":"2020-05-10T13:20:13.081+0000","path":"/","status":500,"error":"Internal Server Error","message":"503 Service Unavailable from GET http://middleware:8080","requestId":"b04cc67c"}
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-05-10T13:20:13.673Z"}
{"from":"front (v1) => middleware (errors-50%) => database (v1)","date":"2020-05-10T13:20:14.104Z"}
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-05-10T13:20:14.705Z"}
{"timestamp":"2020-05-10T13:20:14.824+0000","path":"/","status":500,"error":"Internal Server Error","message":"503 Service Unavailable from GET http://middleware:8080","requestId":"b04cc67c"}
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-05-10T13:20:14.805Z"}

You can see a lot of errors coming from the same pod. To enable the Pool Ejection, you need to apply the next YAML:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  namespace: workshop
  name: middleware
spec:
  host: middleware
  subsets:
    - name: version-1
      labels:
        version: v1
      trafficPolicy:
        connectionPool:
          http: {}
          tcp: {}
        outlierDetection:
          baseEjectionTime: 10.000s   # minimum time an ejected pod stays out of the pool
          consecutiveErrors: 1        # a single consecutive error is enough to eject a pod
          interval: 1.000s            # how often the pods are analysed
          maxEjectionPercent: 100     # every pod of the subset may be ejected if needed

To apply the template, you may run the following command:

Λ\: $ kubectl apply --filename 05_resiliency/03_pool-ejection/02_add-pool-ejection.yml

By following the logs, you should see that after each ejection the faulty pod stops receiving requests for roughly 10 seconds.
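
A possible way to follow the logs of every Middleware pod at once is given below (a sketch: --prefix needs a fairly recent kubectl, and the label and container names come from the manifests above):

Λ\: $ kubectl --namespace workshop logs --selector app=middleware --container middleware --follow --prefix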

If you look at the application’s responses, you can spot the request where the faulty pod answered the Front service with a 503, right before being ejected:

Λ\: $ while true; do curl -qs ${CLUSTER_INGRESS_IP}; echo; done;
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-05-10T13:12:19.649Z"}
{"from":"front (v1) => middleware (errors-50%) => database (v1)","date":"2020-05-10T13:12:19.977Z"}
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-05-10T13:12:20.204Z"}
{"from":"front (v1) => middleware (errors-50%) => database (v1)","date":"2020-05-10T13:12:20.628Z"} (1)
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-05-10T13:12:21.237Z"}
{"timestamp":"2020-05-10T13:12:21.359+0000","path":"/","status":500,"error":"Internal Server Error","message":"503 Service Unavailable from GET http://middleware:8080","requestId":"bfb448ec"} (2)
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-05-10T13:12:21.672Z"}
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-05-10T13:12:22.282Z"}
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-05-10T13:12:22.584Z"}
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-05-10T13:12:23.195Z"}
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-05-10T13:12:23.618Z"}
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-05-10T13:12:24.109Z"}
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-05-10T13:12:24.503Z"}
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-05-10T13:12:24.999Z"}
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-05-10T13:12:25.397Z"}
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-05-10T13:12:25.775Z"}
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-05-10T13:12:26.059Z"}
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-05-10T13:12:26.287Z"}
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-05-10T13:12:26.762Z"} (3)
1 The pod with a 50% error rate keeps answering successfully
2 The pod answers with an error response; it will be ejected from the pool for the configured period of time
3 Subsequent requests are handled only by the remaining pods

To go a bit further, you can try to:

  • Use the previously created dashboard to see the requests count for each pod inside the cluster

The actual ejection time is the product of the baseEjectionTime and the number of times the host has already been ejected: with baseEjectionTime set to 10s, a pod ejected for the third time in a row stays out of the pool for roughly 30 seconds.