Slide 1

Slide 1 text

Istio and the Service Mesh Architecture DevOps BKK 2018

Slide 2

Slide 2 text

About me ● Manatsawin Hanmongkolchai ● Junior Architect at Wongnai

Slide 3

Slide 3 text

How I sold Istio to my team

Slide 4

Slide 4 text

How Wongnai monitors microservices

Slide 5

Slide 5 text

Microservice monitoring ● In-service metrics, e.g. controller time

Slide 6

Slide 6 text

Microservice monitoring ● AWS X-Ray SDK

Slide 7

Slide 7 text

Microservice monitoring ● Sentry

Slide 8

Slide 8 text

Microservice monitoring ● ELB Error Rate

Slide 9

Slide 9 text

Microservice monitoring These must be integrated into your service AWS X-Ray

Slide 10

Slide 10 text

Microservice monitoring The problem in the microservice world ● Services can be written in many languages. Not all tools support every language

Slide 11

Slide 11 text

Microservice monitoring The problem in the microservice world ● People in a rush skip implementing proper monitoring

Slide 12

Slide 12 text

Meet Istio

Slide 13

Slide 13 text

Service mesh Istio handles inter-service connections Sidecar

Slide 14

Slide 14 text

How does the Istio sidecar work? Istio uses an admission controller to inject 2 containers into your pod

Slide 15

Slide 15 text

How does the Istio sidecar work? 1. An init container sets up the transparent proxy iptables rules (as root) 2. Envoy runs alongside your app as the transparent proxy

Slide 16

Slide 16 text

What Istio can do for you Monitoring ● Network calls ● Tracing

Slide 17

Slide 17 text

Network monitoring Istio provides insight into your network at layer 7

Slide 18

Slide 18 text

Total requests 4xx 5xx

Slide 19

Slide 19 text

Request count of service Response time

Slide 20

Slide 20 text

Service network monitoring Measured client side Request count Success rate Resp. time Speed (for TCP) Measured server side

Slide 21

Slide 21 text

Who calls me?

Slide 22

Slide 22 text

Distributed Tracing ● All incoming/outgoing HTTP calls are traced to Jaeger ● Your app needs to propagate the OpenTracing headers from each incoming call to its outgoing calls so that calls are linked correctly
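The header propagation described on this slide could be sketched as follows. This is a minimal illustration, not the speaker's code: the header list follows the B3/Zipkin-style names given in Istio's tracing documentation, and `propagate_trace_headers` is a hypothetical helper name.

```python
# Tracing headers that Istio's documentation asks applications to forward.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "x-ot-span-context",
)


def propagate_trace_headers(incoming_headers):
    """Return the tracing headers to copy onto an outgoing request.

    `incoming_headers` is any mapping of header name to value.
    Header names are matched case-insensitively; other headers are dropped.
    """
    lowered = {name.lower(): value for name, value in incoming_headers.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}
```

In a real service the returned dictionary would be merged into the headers of every outgoing HTTP call made while handling the incoming request.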

Slide 23

Slide 23 text

Distributed Tracing ● The easiest way is to just integrate Zipkin OpenTracing into your app

Slide 24

Slide 24 text

Distributed Tracing

Slide 25

Slide 25 text

Distributed Tracing

Slide 26

Slide 26 text

What Istio can do for you ● Traffic Management ○ Routing ■ Traffic Shifting ■ Mirror ○ Fault Injection ○ Circuit Breaker

Slide 27

Slide 27 text

Routing ● A Kubernetes service operates at layer 4 Cluster IP Backend Backend Backend Req Req Req Req Req Req

Slide 28

Slide 28 text

Routing ● Istio operates at layer 7 and can do per-call load balancing Envoy Req Req Req Req Req Req Backend Backend Backend
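The difference between the two routing slides can be illustrated with a toy simulation. It is a simplified sketch: the backend names are hypothetical, and round-robin selection stands in for whatever balancing policy the real cluster IP or Envoy uses.

```python
from collections import Counter
from itertools import cycle

BACKENDS = ["backend-a", "backend-b", "backend-c"]  # hypothetical pod names


def per_connection(requests_per_connection):
    """Layer 4: a backend is picked once per connection, so every
    request sent over that connection lands on the same backend."""
    chooser = cycle(BACKENDS)
    hits = Counter()
    for request_count in requests_per_connection:
        backend = next(chooser)
        hits[backend] += request_count
    return hits


def per_request(requests_per_connection):
    """Layer 7: a backend is picked again for every single request,
    even when many requests share one connection."""
    chooser = cycle(BACKENDS)
    hits = Counter()
    for request_count in requests_per_connection:
        for _ in range(request_count):
            hits[next(chooser)] += 1
    return hits
```

With one keep-alive connection carrying six requests, `per_connection([6])` sends all six to a single backend, while `per_request([6])` spreads them evenly across the three.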

Slide 29

Slide 29 text

Split traffic ● Split traffic between service versions (e.g. 1% to the new version)
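A weighted split like the 1% example is typically expressed as an Istio VirtualService. The sketch below builds one as a plain Python dictionary; `traffic_split` and the subset names are hypothetical, and it assumes a DestinationRule defining those subsets already exists.

```python
def traffic_split(host, old_subset, new_subset, new_percent):
    """Build a VirtualService manifest that sends `new_percent` percent
    of HTTP traffic for `host` to `new_subset` and the rest to `old_subset`."""
    if not 0 <= new_percent <= 100:
        raise ValueError("new_percent must be between 0 and 100")
    return {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "VirtualService",
        "metadata": {"name": host},
        "spec": {
            "hosts": [host],
            "http": [
                {
                    "route": [
                        {
                            "destination": {"host": host, "subset": old_subset},
                            "weight": 100 - new_percent,
                        },
                        {
                            "destination": {"host": host, "subset": new_subset},
                            "weight": new_percent,
                        },
                    ]
                }
            ],
        },
    }
```

The result can be serialized with `json.dumps` and applied with `kubectl apply -f -`, since Kubernetes accepts JSON manifests as well as YAML.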

Slide 30

Slide 30 text

Mirror traffic ● Test in production by cloning traffic Envoy Live version Test version Req

Slide 31

Slide 31 text

Fault Injection ● Intentionally making a service worse ● Why? Let’s hear a story

Slide 32

Slide 32 text

Fault Injection Site Reliability Engineering: How Google Runs Production Systems landing.google.com/sre/book/

Slide 33

Slide 33 text

#WongnaiIsHiring ● Wongnai is looking for our first Site Reliability Engineer ● careers.wongnai.com

Slide 34

Slide 34 text

Chubby

Slide 35

Slide 35 text

Fault Injection Over time, we found that the failures of the global instance of Chubby consistently generated service outages.

Slide 36

Slide 36 text

Fault Injection As it turns out, true global Chubby outages are so infrequent that service owners began to add dependencies to Chubby assuming that it would never go down.

Slide 37

Slide 37 text

Fault Injection The solution to this Chubby scenario is interesting: SRE makes sure that global Chubby meets, but does not significantly exceed, its service level objective.

Slide 38

Slide 38 text

Fault Injection In any given quarter, if a true failure has not dropped availability below the target, a controlled outage will be synthesized by intentionally taking down the system.

Slide 39

Slide 39 text

Fault Injection ● Slow down services ○ Delay 80% of requests for 5 seconds ● Make errors ○ Return 500 error code for 80% of requests
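Both faults on this slide map onto the `fault` section of a VirtualService HTTP route. The sketch below builds such a manifest as a Python dictionary. `fault_injection` is a hypothetical helper, and it assumes the `percentage.value` form of the API; older Istio releases used an integer `percent` field instead.

```python
def fault_injection(host, delay_percent=None, delay_seconds=None,
                    abort_percent=None, abort_status=None):
    """Build a VirtualService manifest that delays and/or aborts a
    percentage of HTTP requests to `host`."""
    fault = {}
    if delay_percent is not None:
        fault["delay"] = {
            "percentage": {"value": float(delay_percent)},
            "fixedDelay": f"{delay_seconds}s",
        }
    if abort_percent is not None:
        fault["abort"] = {
            "percentage": {"value": float(abort_percent)},
            "httpStatus": abort_status,
        }
    return {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "VirtualService",
        "metadata": {"name": host},
        "spec": {
            "hosts": [host],
            "http": [
                {
                    "fault": fault,
                    "route": [{"destination": {"host": host}}],
                }
            ],
        },
    }
```

For the slide's examples, `fault_injection("cooking", delay_percent=80, delay_seconds=5)` describes the slow-down case and `fault_injection("cooking", abort_percent=80, abort_status=500)` the error case, with `cooking` standing in for any service name.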

Slide 40

Slide 40 text

Circuit Breaker Remove a backend from service if it returns too many errors in a row Frontend Backend Work Queue 503 Timeout F5
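The "too many errors in a row" rule can be sketched as a small tracker. This is a simplified illustration, not Envoy's implementation: the class name and threshold are hypothetical, and it omits the timed re-admission that a real proxy performs after ejecting a backend.

```python
class ConsecutiveErrorEjector:
    """Remove a backend from rotation after N consecutive 5xx responses."""

    def __init__(self, backends, max_consecutive_errors=5):
        self.max_consecutive_errors = max_consecutive_errors
        self.error_streak = {backend: 0 for backend in backends}
        self.ejected = set()

    def record(self, backend, status_code):
        """Update the backend's error streak with one response."""
        if status_code >= 500:
            self.error_streak[backend] += 1
            if self.error_streak[backend] >= self.max_consecutive_errors:
                self.ejected.add(backend)
        else:
            # Any successful response resets the streak.
            self.error_streak[backend] = 0

    def healthy_backends(self):
        """Backends that are still allowed to receive traffic."""
        return [b for b in self.error_streak if b not in self.ejected]
```

Resetting the streak on success is what distinguishes "errors in a row" from a plain error count: a backend that fails occasionally stays in rotation.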

Slide 41

Slide 41 text

Summary Istio provides visibility and configurability for your network. This is traditionally done by adding a library, but in a microservice world you need a cross-language solution

Slide 42

Slide 42 text

The catch Here’s what we found while moving to Istio ● While Istio requires zero code changes, your service must already be a well-behaved cloud application

Slide 43

Slide 43 text

The catch ● Do not connect directly to pod IPs (e.g. no service discovery; just use the cluster IP and avoid headless services)

Slide 44

Slide 44 text

The catch ● Do not mix port types in the cluster (e.g. don’t run an HTTP server on port 6379 while another pod runs a TCP service on the same port)

Slide 45

Slide 45 text

The catch ● Set the Host header to the destination. Don’t connect to gateway while setting the Host header to cooking. ○ This case is really hard to debug...

Slide 46

Slide 46 text

The catch ● External services (i.e. outside Kubernetes) that are in the captured IP range must have a ServiceEntry defined ○ ServiceEntry is cluster-wide
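A ServiceEntry for such an external service might look like the sketch below, again built as a Python dictionary. `external_service_entry` and the example host and port are hypothetical; the field names follow the Istio networking API.

```python
def external_service_entry(name, host, port, protocol):
    """Build a ServiceEntry manifest that registers an external host
    (one running outside Kubernetes) with the mesh."""
    return {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "ServiceEntry",
        "metadata": {"name": name},
        "spec": {
            "hosts": [host],
            "location": "MESH_EXTERNAL",  # the service lives outside the mesh
            "resolution": "DNS",          # resolve the host name via DNS
            "ports": [
                {
                    "number": port,
                    "name": f"{protocol.lower()}-{port}",
                    "protocol": protocol,
                }
            ],
        },
    }
```

For example, `external_service_entry("legacy-db", "db.internal.example.com", 5432, "TCP")` would describe a database outside the cluster, where both the name and host are placeholders.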

Slide 47

Slide 47 text

Slides on speakerdeck.com/whs