Chaos engineering
One of the best ways to evaluate the resiliency of a system is to inject failures from within. In this section, we will apply some typical patterns commonly associated with "Chaos Engineering" 👹.
Error Injection
Description
Istio lets you simulate communication errors between two microservices with a simple modification of a VirtualService.
+---+     +------------+           +-----------------+       +------------+
  #       |            +----------->                 +------->            |
 ~+~ +--->+   Front    |           |   Middleware    |       |  Database  |
 / \      |            +----XXX    |                 <-------+            |
+---+     +------------+    400    +-----------------+       +------------+
Execution
Before proceeding any further, make sure you are back to the base behaviour.
To configure error injection, apply the following YAML:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  namespace: workshop
  name: middleware
spec:
  hosts:
    - middleware
  http:
    - route:
        - destination:
            host: middleware
            subset: version-1
      fault:
        abort:
          percent: 50 # (1)
          httpStatus: 400 # (2)
1 | The percentage of requests that should end in an error |
2 | The HTTP status the injected error should carry |
Λ\: $ kubectl apply --filename 06_chaos-engineering/01_error-injection/01_inject-errors.yml
A curl loop (or a Siege run) should show you how the rule manifests itself in the system:
Λ\: $ while true; do curl -qs ${CLUSTER_INGRESS_IP}; echo; done;
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-01-03T00:16:53.938Z"}
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-01-03T00:16:54.471Z"}
{"timestamp":"2020-01-03T00:16:54.718+0000","path":"/","status":500,"error":"Internal Server Error","message":"400 Bad Request from GET http://middleware:8080","requestId":"695e97ad"}
{"timestamp":"2020-01-03T00:16:54.956+0000","path":"/","status":500,"error":"Internal Server Error","message":"400 Bad Request from GET http://middleware:8080","requestId":"695e97ad"}
{"timestamp":"2020-01-03T00:16:55.206+0000","path":"/","status":500,"error":"Internal Server Error","message":"400 Bad Request from GET http://middleware:8080","requestId":"695e97ad"}
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-01-03T00:16:55.651Z"}
{"timestamp":"2020-01-03T00:16:55.897+0000","path":"/","status":500,"error":"Internal Server Error","message":"400 Bad Request from GET http://middleware:8080","requestId":"695e97ad"}
{"timestamp":"2020-01-03T00:16:56.136+0000","path":"/","status":500,"error":"Internal Server Error","message":"400 Bad Request from GET http://middleware:8080","requestId":"695e97ad"}
{"timestamp":"2020-01-03T00:16:56.375+0000","path":"/","status":500,"error":"Internal Server Error","message":"400 Bad Request from GET http://middleware:8080","requestId":"695e97ad"}
{"from":"front (v1) => middleware (v1) => database (v1)","date":"2020-01-03T00:16:57.089Z"}
{"timestamp":"2020-01-03T00:16:57.325+0000","path":"/","status":500,"error":"Internal Server Error","message":"400 Bad Request from GET http://middleware:8080","requestId":"695e97ad"}
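To put a number on the behaviour above, the responses can be tallied rather than eyeballed. The following is a minimal sketch, not part of the workshop: `summarize_responses` is a hypothetical helper name, and it assumes the two response shapes shown above (a `"from"` field on success, `"status":500` on failure).

```shell
# summarize_responses: hypothetical helper, not part of the workshop.
# Reads one JSON response per line on stdin and prints how many were
# successful, how many were errors, and the resulting error rate.
# Assumption: successes contain a "from" field, failures "status":500.
summarize_responses() {
  awk '
    /"status":500/ { errors++ }
    /"from":/      { ok++ }
    END {
      total = ok + errors
      pct = (total > 0) ? int(100 * errors / total) : 0
      printf "ok=%d errors=%d error_rate=%d%%\n", ok, errors, pct
    }'
}
```

For example, `for i in $(seq 100); do curl -qs ${CLUSTER_INGRESS_IP}; echo; done | summarize_responses` should report an error rate in the neighbourhood of the configured 50%, although a small sample will fluctuate.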
Delay
Description
As in the previous step, we can inject latency into our system without modifying the application. Istio does all the work of slowing down requests between two pods.
+---+     +------------+            +-----------------+       +------------+
  #       |            +------------>                 +------->            |
 ~+~ +--->+   Front    |            |   Middleware    |       |  Database  |
 / \      |            |-----  ----->                 <-------+            |
+---+     +------------+    |  |    +-----------------+       +------------+
                           +====+
                           |(::)|  Wait
                           | )( |  for 4 seconds
                           |(..)|
                           +====+
Execution
Before proceeding any further, make sure you are back to the base behaviour.
Since the beginning of this workshop, the Middleware component has had a secret parameter, MIDDLEWARE_MAX_LATENCY, which sets the maximum time it takes to respond to a request (the actual delay is chosen randomly between 0 and MIDDLEWARE_MAX_LATENCY).
Let’s make it super fast ⚡ by setting this parameter to 0 with the following YAML:
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: workshop
  labels:
    app: middleware
    version: v1
  name: middleware-v1
spec:
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  selector:
    matchLabels:
      app: middleware
      version: v1
  template:
    metadata:
      labels:
        app: middleware
        version: v1
    spec:
      containers:
        - image: stacklabs/istio-on-gke-middleware
          imagePullPolicy: Always
          env:
            - name: MIDDLEWARE_DATABASE_URI
              value: http://database:8080
            - name: SPRING_CLOUD_GCP_LOGGING_PROJECT_ID
              value: "<YOUR_GCP_PROJECT_ID>" # (1)
            - name: MIDDLEWARE_MAX_LATENCY
              value: "0" # (2)
            - name: MIDDLEWARE_VERSION
              value: "v1-very-fast"
          livenessProbe:
            httpGet:
              path: /actuator/health
              port: 8181
            initialDelaySeconds: 20
          name: middleware
          resources:
            requests:
              memory: "512Mi"
              cpu: 1
            limits:
              memory: "512Mi"
              cpu: 1
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
1 | Don’t forget to fill in your Google Project ID |
2 | The super secret parameter that sets the max latency |
Λ\: $ kubectl apply --filename 06_chaos-engineering/02_delay/01_make-middleware-very-fast.yml
Now, let’s leverage Istio to reintroduce latency, this time with a VirtualService, by applying the following YAML:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  namespace: workshop
  name: middleware
spec:
  hosts:
    - middleware
  http:
    - route:
        - destination:
            host: middleware
            subset: version-1
      fault:
        delay:
          fixedDelay: 4.000s # (1)
          percent: 50 # (2)
1 | The duration of the delay |
2 | The percentage of requests to delay |
Λ\: $ kubectl apply --filename 06_chaos-engineering/02_delay/02_add-delays.yml
You should see a noticeable increase in the response time for some requests:
Λ\: $ docker run --rm -it yokogawa/siege ${CLUSTER_INGRESS_IP} -t 30S -v
New configuration template added to /root/.siege
Run siege -C to view the current settings in that file
** SIEGE 4.0.4
** Preparing 25 concurrent users for battle.
The server is now under siege...
HTTP/1.1 200 0.34 secs: 101 bytes ==> GET /
HTTP/1.1 200 0.49 secs: 101 bytes ==> GET /
HTTP/1.1 200 0.35 secs: 101 bytes ==> GET /
HTTP/1.1 200 0.45 secs: 101 bytes ==> GET /
HTTP/1.1 200 0.49 secs: 101 bytes ==> GET /
HTTP/1.1 200 0.42 secs: 100 bytes ==> GET /
HTTP/1.1 200 0.48 secs: 101 bytes ==> GET /
HTTP/1.1 200 0.37 secs: 101 bytes ==> GET /
HTTP/1.1 200 0.39 secs: 101 bytes ==> GET /
HTTP/1.1 200 0.47 secs: 101 bytes ==> GET /
HTTP/1.1 200 0.52 secs: 101 bytes ==> GET /
HTTP/1.1 200 0.49 secs: 101 bytes ==> GET /
HTTP/1.1 200 0.50 secs: 101 bytes ==> GET /
HTTP/1.1 200 4.34 secs: 101 bytes ==> GET / (1)
HTTP/1.1 200 4.38 secs: 101 bytes ==> GET /
HTTP/1.1 200 4.38 secs: 101 bytes ==> GET /
HTTP/1.1 200 4.38 secs: 101 bytes ==> GET /
HTTP/1.1 200 4.38 secs: 101 bytes ==> GET /
HTTP/1.1 200 4.38 secs: 101 bytes ==> GET /
HTTP/1.1 200 4.38 secs: 101 bytes ==> GET /
HTTP/1.1 200 4.42 secs: 101 bytes ==> GET /
HTTP/1.1 200 4.49 secs: 101 bytes ==> GET /
...
1 | First request to be delayed by Istio |
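If Siege is not at hand, curl can report the total time of each request through its `--write-out` option (`%{time_total}`), and the delayed share can then be counted. The following is a rough sketch under stated assumptions: `count_delayed` is a hypothetical helper name, and the 4-second default threshold simply mirrors the `fixedDelay` configured above.

```shell
# count_delayed: hypothetical helper, not part of the workshop.
# Reads one duration in seconds per line on stdin and counts how many
# are at or above the threshold given as first argument (default: 4).
count_delayed() {
  awk -v t="${1:-4}" '
    $1 + 0 >= t { slow++ }
    { total++ }
    END { printf "delayed=%d total=%d\n", slow, total }'
}
```

A possible invocation is `for i in $(seq 20); do curl -qs -o /dev/null -w '%{time_total}\n' ${CLUSTER_INGRESS_IP}; done | count_delayed 4`. With the rule above, roughly half of the requests would be expected to land in the delayed bucket.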
To go a bit further, you can try to:
- Use Stackdriver to visualize the total round-trip time of a request in the service mesh
Timeout
Description
What happens if a microservice stops responding? What if every request sent to it ends in a timeout? Istio lets you test how your services behave in those situations, to ensure they are resilient to other components not responding or being down.
+---+     +------------+           +-----------------+       +------------+
  #       |            +----------->                 +------->            |
 ~+~ +--->+   Front    |           |   Middleware    |       |  Database  |
 / \      |            +----~~~    |                 <-------+            |
+---+     +------------+ Time Out  +-----------------+       +------------+
Execution
Every language and every HTTP client has its own way to configure default timeouts… with Istio you can define a default timeout for all calls towards a specific route.
Let’s start by setting a very high latency for the Middleware component, by applying the following YAML file:
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: workshop
  labels:
    app: middleware
    version: v1
  name: middleware-v1
spec:
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  selector:
    matchLabels:
      app: middleware
      version: v1
  template:
    metadata:
      labels:
        app: middleware
        version: v1
    spec:
      containers:
        - image: stacklabs/istio-on-gke-middleware
          imagePullPolicy: Always
          env:
            - name: MIDDLEWARE_DATABASE_URI
              value: http://database:8080
            - name: SPRING_CLOUD_GCP_LOGGING_PROJECT_ID
              value: "<YOUR_GCP_PROJECT_ID>" # (1)
            - name: MIDDLEWARE_MAX_LATENCY
              value: "3000" # (2)
            - name: MIDDLEWARE_VERSION
              value: "latency-3s"
          livenessProbe:
            httpGet:
              path: /actuator/health
              port: 8181
            initialDelaySeconds: 20
          name: middleware
          resources:
            requests:
              memory: "512Mi"
              cpu: 1
            limits:
              memory: "512Mi"
              cpu: 1
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
1 | Your Google Project ID |
2 | Set the max latency to 3 seconds |
Λ\: $ kubectl apply --filename 06_chaos-engineering/03_timeout/01_make-middleware-very-slow.yml
Then, let’s define a timeout rule for this microservice using the following VirtualService:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  namespace: workshop
  name: middleware
spec:
  hosts:
    - middleware
  http:
    - route:
        - destination:
            host: middleware
            subset: version-1
      timeout: 500ms # (1)
1 | Set the timeout to 500ms |
Λ\: $ kubectl apply --filename 06_chaos-engineering/03_timeout/02_add-timeouts.yml
Now let’s take a look at the app responses:
Λ\: $ while true; do curl ${CLUSTER_INGRESS_IP}; echo; done;
{"from":"front (v1) => middleware (latency-3s) => database (v1)","date":"2020-01-03T22:08:00.351Z"}
{"from":"front (v1) => middleware (latency-3s) => database (v1)","date":"2020-01-03T22:08:00.875Z"}
{"from":"front (v1) => middleware (latency-3s) => database (v1)","date":"2020-01-03T22:08:01.329Z"}
{"timestamp":"2020-01-03T22:08:02.046+0000","path":"/","status":500,"error":"Internal Server Error","message":"504 Gateway Timeout from GET http://middleware:8080","requestId":"a5c454c0"} (1)
{"from":"front (v1) => middleware (latency-3s) => database (v1)","date":"2020-01-03T22:08:02.465Z"}
{"from":"front (v1) => middleware (latency-3s) => database (v1)","date":"2020-01-03T22:08:02.856Z"}
{"from":"front (v1) => middleware (latency-3s) => database (v1)","date":"2020-01-03T22:08:03.433Z"}
{"from":"front (v1) => middleware (latency-3s) => database (v1)","date":"2020-01-03T22:08:04.118Z"}
{"timestamp":"2020-01-03T22:08:04.835+0000","path":"/","status":500,"error":"Internal Server Error","message":"504 Gateway Timeout from GET http://middleware:8080","requestId":"1755cee9"}
{"from":"front (v1) => middleware (latency-3s) => database (v1)","date":"2020-01-03T22:08:05.44Z"}
{"from":"front (v1) => middleware (latency-3s) => database (v1)","date":"2020-01-03T22:08:05.906Z"}
1 | A timeout occurred here |
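Note that Front always answers with its own status 500 and only mentions the upstream cause in the `message` field. To tell failure causes apart over a longer run, the upstream status can be extracted and tallied. This is only a sketch: `upstream_status_tally` is a hypothetical helper name, and it assumes the error format shown above, where `message` starts with a three-digit HTTP status.

```shell
# upstream_status_tally: hypothetical helper, not part of the workshop.
# Reads one JSON response per line on stdin, extracts the three-digit
# upstream status that starts the "message" field of error responses,
# and prints one "status=count" line per status, sorted by status.
# Successful responses (no "message" field) are ignored.
upstream_status_tally() {
  sed -n 's/.*"message":"\([0-9][0-9][0-9]\) .*/\1/p' \
    | sort \
    | uniq -c \
    | awk '{ printf "%s=%d\n", $2, $1 }'
}
```

Piping a bounded loop into it, for example `for i in $(seq 50); do curl -qs ${CLUSTER_INGRESS_IP}; echo; done | upstream_status_tally`, would print a line such as `504=<count>` while the timeout rule is active.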