Slide 1

Deploying and Running Microservices
@AhmetAlpBalkan, Software Engineer at Microsoft

Slide 2

@AhmetAlpBalkan
Go programming language Contributor/Maintainer
Open Source/Linux at Microsoft

Slide 3

• ported the Docker client to Windows (docker.exe)
• docker-machine for Azure
• docker-registry for Azure
• Docker C# client library (docker.dotnet)
• Microsoft’s first official Docker image (asp.net)
• Azure SDK for Go

Slide 4

This presentation
๏ is a thought exercise, not solutions
๏ is based on publications by Google, Microsoft, Netflix etc.
๏ is based on experience talking to customers
๏ covers common practices adopted in the industry

Slide 5

Survey time

Slide 6

Microservices Architecture

Slide 7

Microservices Architecture (diagram: a single monolithic process)

Slide 8

Microservices Architecture (diagram: monolithic process vs. microservices)

Slide 9

Microservices Architecture (diagram: monolithic process vs. microservices)

Slide 10

Microservices Architecture (diagram: monolithic process vs. microservices, each exposing an API)

Slide 11

Microservices cluster from 30,000 ft (diagram: machines)

Slide 12

Microservices cluster from 30,000 ft (diagram: machines)

Slide 13

How do you come up with a microservice?

Slide 14

Unix Philosophy

Slide 15

Unix Philosophy (Doug McIlroy, 1973)
๏ Write programs that do one thing and do it well,
๏ Write programs to work together,
๏ Write programs to handle text streams*, because that is a universal interface

Slide 16

Building blocks
๏ Independent microservices
๏ Orchestrator to deploy/run services
๏ Load balancing solution
๏ Networking to inter-connect services
๏ API (e.g. HTTP REST, protobuf, gRPC); see the sketch below
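
A rough sketch of a single microservice in Go, tying the building blocks together: one small HTTP API plus a health endpoint for the load balancer/orchestrator to probe. The port, paths and payload here are illustrative, not from the talk.

  // minimal single-purpose microservice
  package main

  import (
      "encoding/json"
      "log"
      "net/http"
  )

  func main() {
      mux := http.NewServeMux()

      // liveness endpoint for the load balancer / orchestrator (path is an assumption)
      mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
          w.WriteHeader(http.StatusOK)
      })

      // the one thing this service does
      mux.HandleFunc("/greet", func(w http.ResponseWriter, r *http.Request) {
          json.NewEncoder(w).Encode(map[string]string{"message": "hello"})
      })

      log.Fatal(http.ListenAndServe(":8080", mux))
  }

Every service in the cluster would look roughly like this; the orchestrator and load balancer decide where and how many copies of it run.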

Slide 17

Microservices axioms
๏ services can scale out independently
๏ services get added/removed all the time
๏ services can discover each other
๏ services talk to each other via RPC/API
๏ machines go down/become unreachable
๏ services crash/become unresponsive
๏ you will see all sorts of weirdness

Slide 18

Microservices are good for you
๏ It totally makes sense
  ๏ loose coupling
  ๏ separation of concerns
๏ It is independently scalable
๏ Easy to write new services
  ๏ in different languages
๏ Bugs & failures are more contained

Slide 19

So what’s the problem? Microservices are harder to manage:
๏ 1 monolith vs. 20 microservices
๏ more complicated deployments
๏ too many failure points
๏ too many moving parts
๏ a lot more units to monitor/alert upon

Slide 20

Concerns
๏ how do you deploy your services?
๏ how do you release and roll out new versions for a service?
๏ how do you reschedule/move services between machines?
๏ where do you store application state?
๏ how do you move data between machines?

Slide 21

…more concerns
๏ how do you handle failed services?
๏ how do you discover new/removed instances for services?
๏ what is the target uptime/SLO (service level objective) for each service?
๏ how do you monitor service health?
๏ how and when do you alert humans?

Slide 22

…more concerns
๏ how do you do logging?
๏ how easy is it to deploy services?
๏ how do you respond to incidents?
  ๏ how (fast) do you identify the problem?
  ๏ how (fast) do you mitigate/repair?
  ๏ how do you debug or troubleshoot?

Slide 23

Microservices is no silver bullet. It requires a well-thought-out architecture & tooling.

Slide 24

An application split into microservices is often an order of magnitude more complex to deploy, run and monitor.

Slide 25

So you want to use microservices…

Slide 26

Journey to Microservices
๏ Pick a mental model and practices
๏ Pick your tooling
๏ Releasing
๏ Monitoring
๏ Livesite

Slide 27

Microservices is a mentality shift. Maintain an open mind.

Slide 28

Philosopher King
In Plato’s ideal city-state: ”philosophers [must] become kings…or those now called kings [must]…genuinely and adequately philosophize”
You should have philosopher kings in your company.

Slide 29

Have one way of doing things.

Slide 30

Have one way of doing things.
๏ Bad is miles better than diverse
๏ A practice is easy to change
๏ Prevents useless discussions
๏ Have a single practice for everything: retry policy, secrets management, deployment tool, build system, test framework, OS/distro, RPC protocol, log format, monitoring software, …

Slide 31

Pick your technology stack.

Slide 32

Pick your technology stack.
๏ multiple programming languages are okay
๏ best language is what the team speaks
๏ (elegant & maintainable) > fast
๏ you don’t need fast
๏ you don’t need scalable
๏ hardware is cheaper than developers

Slide 33

Pick your tooling

Slide 34

Invest in tools that give you automation and confidence
๏ Orchestrator
๏ Monitoring
๏ Alerting
๏ Service Discovery
๏ Load Balancing

Slide 35

Orchestrators
๏ are the operating systems for your datacenter:
๏ you have a pool of machines (cluster)
๏ a set of services/tasks you want to run (and keep running, or periodically)
๏ you need an orchestrator/scheduler

Slide 36

Orchestrators handle the heavy lifting of service lifecycles:
๏ deploy a service
๏ upgrade a service
๏ rolling upgrades without downtime
๏ rollbacks
๏ reschedule services if machines fail
๏ restart services if they crash

Slide 37

Release Engineering

Slide 38

Deployment Concerns
๏ how many steps to deploy ServiceA?
  ๏ can an intern deploy easily too?
๏ how confidently can you deploy?
  ๏ do you have enough tests?
๏ can you deploy without downtime?
๏ how long does it take to deploy all your company’s microservices?

Slide 39

Deployment Concerns
๏ can you roll back a new deployment?
๏ how do you update configuration?
๏ can you redeploy all of your stack as it was on a particular date?
๏ are your builds signed?
๏ does your code work the same on all your machines (hardware/OS etc.)?

Slide 40

Testing
๏ Tests are meant to give you confidence
  ๏ green = deploy
  ๏ red = do not deploy
๏ If you hesitate to deploy a green build, you are not testing enough.

Slide 41

Build Reproducibility
๏ any code that goes to PROD is built from a commit in the source tree, not from a dev machine
๏ have a homogeneous execution environment on your machines
  ๏ same versions of packages, kernel, distro
๏ use Docker for reproducibility

Slide 42

Build Reproducibility
๏ be able to see the version of each running microservice in your cluster from your logs (see the sketch below)
๏ run your microservices on read-only filesystems to prevent contention
๏ have homogeneous configuration for all instances of a service (etcd/consul/zk…)
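
One common way to get that version visibility, sketched in Go (the variable name and build command are illustrative, not prescribed by the talk): stamp the commit hash into the binary at build time and log it at startup.

  package main

  import "log"

  // version is overridden at build time, e.g.:
  //   go build -ldflags "-X main.version=$(git rev-parse --short HEAD)"
  var version = "dev" // default for local builds

  func main() {
      log.Printf("starting service, version=%s", version)
      // ... rest of the service ...
  }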

Slide 43

Deployment Pro-tips
๏ automate everything
๏ if it hurts, do it more often (and automate it)
๏ invest in tools that give you confidence
๏ conduct deployment drills: you will discover previously unknown bugs, unscripted deployment steps and pain points

Slide 44

Monitoring

Slide 45

Monitoring philosophy
๏ If you can’t measure it, you don’t know it works
๏ A correctly working program is a very special case. Failure is the default.
๏ Have massive visibility into your systems.
๏ An intern should be able to query anything about your system very easily.
๏ Monitoring is cheap. Being blind to an outage is expensive.

Slide 46

Why do we monitor?
๏ something is broken: alert humans
๏ analyze long-term trends
๏ compare if v2.1 is faster than v2.0
๏ build dashboards and observe anomalies

Slide 47

What do we monitor?
๏ health
๏ latency
๏ error rate
๏ request count
๏ resource utilization (CPU/memory)

Slide 48

Black-box monitoring
๏ monitor a system from the user’s perspective:
  ๏ GET /home → 200 OK
  ๏ creating a user works
  ๏ a particular result appears in search results

Slide 49

Black-box monitoring
๏ not particularly helpful if you have complex systems where user requests are load balanced
๏ tells you the light bulb is turned on, but not how hot it is

Slide 50

White-box monitoring
๏ ask a system for its internal details, e.g.:

  GET /stats
  mem.free  312
  cpu.avg   0.15
  http.500  1
  http.404  12
  http.200  5698
  threads   24
  uptime    3m14s
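
A white-box endpoint like the one above can be sketched with Go's standard expvar package, which serves registered counters as JSON on /debug/vars. The counter names are illustrative.

  package main

  import (
      "expvar" // importing expvar registers /debug/vars on the default mux
      "net/http"
  )

  var (
      http200 = expvar.NewInt("http.200")
      http500 = expvar.NewInt("http.500")
  )

  func handler(w http.ResponseWriter, r *http.Request) {
      // ... handle the request, then record the outcome ...
      http200.Add(1)
      w.Write([]byte("ok"))
  }

  func main() {
      http.HandleFunc("/", handler)
      // GET /debug/vars now returns the counters (plus memstats) as JSON.
      http.ListenAndServe(":8080", nil)
  }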

Slide 51

White-box monitoring
๏ in reality you have a lot more metrics, with a lot more dimensions such as “version” and “instanceid”:

  http_requests{code=200, handler=new_user, method=get, version=2.0, id=3aebf531} 5310
  http_requests{code=500, handler=new_user, method=get, version=2.0, id=3aebf531} 4

Slide 52

White-box monitoring
๏ you can extend services with internal metrics and export other counters kept in memory
๏ check out Prometheus (a minimal sketch follows)
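
A minimal sketch of exporting such labeled counters with the Prometheus Go client; metric and label names are illustrative.

  package main

  import (
      "net/http"

      "github.com/prometheus/client_golang/prometheus"
      "github.com/prometheus/client_golang/prometheus/promhttp"
  )

  // a counter with dimensions, like the http_requests example above
  var httpRequests = prometheus.NewCounterVec(
      prometheus.CounterOpts{
          Name: "http_requests_total",
          Help: "HTTP requests by status code, handler and version.",
      },
      []string{"code", "handler", "method", "version"},
  )

  func main() {
      prometheus.MustRegister(httpRequests)
      httpRequests.WithLabelValues("200", "new_user", "get", "2.0").Inc()

      // Prometheus scrapes this endpoint periodically.
      http.Handle("/metrics", promhttp.Handler())
      http.ListenAndServe(":8080", nil)
  }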

Slide 53

Aggregation
☹ average does not mean anything
☹ median does not mean anything either
☹ 95%: nope, still not there
☹ 99%: nope
99.9%: maybe… have visibility for 99.95%

Slide 54

Aggregation
๏ Use a time-series database. They sample, aggregate and store results from counters.
  ๏ OpenTSDB, Graphite
๏ example query: find the total error count for a specific region over the last 5 minutes:

  http_request{code=500, service=search, region=westus}[5m]

Slide 55

Logging

Slide 56

Motivation for Logging
๏ logs are for debugging
๏ if you SSH into PROD machines, you are totally doing it wrong
๏ SSH is un-auditable (you cannot track what an engineer is doing on a machine)
๏ humans contaminate servers and break homogeneity

Slide 57

Logs
๏ stick with 12factor.net principles: log messages to stdout (sketch below)
๏ use structured logging
๏ use open source tools for log collection, storage and querying
๏ store logs forever if you can, for further analysis or auditing; otherwise use logrotate
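
A minimal sketch of structured JSON logging to stdout with Go's log/slog (available since Go 1.21); field names are illustrative.

  package main

  import (
      "log/slog"
      "os"
  )

  func main() {
      // one JSON object per log line, written to stdout for the log collector
      logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
      logger.Info("request handled",
          "service", "search",
          "status", 200,
          "latency_ms", 12,
      )
  }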

Slide 58

Log contexts
๏ Generate random GUIDs for correlation.
๏ Pass the correlation ID around to retrieve logs about a request from all services.
๏ You can put correlation IDs in headers (see the sketch below).
๏ Add user parameters to contexts and measure latency for each parameter to identify outliers.
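
One possible shape of this in Go, as a sketch (the header name, context key and helpers are assumptions): middleware that reads or generates a correlation ID, keeps it in the request context, and echoes it so downstream calls and log lines can carry it.

  package main

  import (
      "context"
      "crypto/rand"
      "encoding/hex"
      "log"
      "net/http"
  )

  type ctxKey struct{}

  func newID() string {
      b := make([]byte, 16)
      rand.Read(b)
      return hex.EncodeToString(b)
  }

  func withCorrelationID(next http.Handler) http.Handler {
      return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
          id := r.Header.Get("X-Correlation-ID")
          if id == "" {
              id = newID() // first hop: mint a new ID
          }
          ctx := context.WithValue(r.Context(), ctxKey{}, id)
          w.Header().Set("X-Correlation-ID", id)
          next.ServeHTTP(w, r.WithContext(ctx))
      })
  }

  func main() {
      h := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
          id, _ := r.Context().Value(ctxKey{}).(string)
          log.Printf("correlation_id=%s handling request", id)
      })
      http.ListenAndServe(":8080", withCorrelationID(h))
  }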

Slide 59

Live Site Engineering

Slide 60

At the end of the day, only the user experience matters.

Slide 61

Live Site Philosophy
๏ At high scale, everything _will_ go wrong: compiler bugs, garbage collection bugs, kernel freezes
๏ In an outage, mitigation is the first priority: minimize MTTR (mean time to repair)

Slide 62

Live Site Philosophy
๏ Do not just restart and call it a day… you are an educated person
๏ If it happened once, it will happen again
๏ Learn from every single incident

Slide 63

Things will go wrong
๏ Your newly developed small feature will bring down the entire service.
๏ Have knobs/flags to disable features in production through configuration (sketch below).
๏ When a bad deployment happens (in a rolling upgrade fashion), have ways to flip traffic to the old deployment.
๏ Search: blue/green deployments
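
A minimal sketch of such a kill switch in Go. Here the flag comes from an environment variable; in practice it would come from your configuration system (etcd/consul, etc.). All names are illustrative.

  package main

  import (
      "fmt"
      "os"
  )

  // featureEnabled lets operators turn a feature off in production
  // (e.g. FEATURE_NEW_RANKING=off) without redeploying.
  func featureEnabled(name string) bool {
      return os.Getenv("FEATURE_"+name) != "off"
  }

  func main() {
      if featureEnabled("NEW_RANKING") {
          fmt.Println("serving results with the new ranking code")
      } else {
          fmt.Println("falling back to the old ranking code")
      }
  }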

Slide 64

Things will go wrong
๏ Carefully plan failure modes
๏ A dumb retry policy between service RPCs will prevent the system from healing.
๏ Tell clients when and how to retry (see the retry sketch below)
๏ See: circuit breaker pattern
๏ Can you contain the failure?
๏ Is returning older/cached data O.K.?
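
As one illustration of a less dumb retry policy (limits and timings are arbitrary), a bounded retry loop with exponential backoff and jitter in Go:

  package main

  import (
      "errors"
      "fmt"
      "math/rand"
      "time"
  )

  // callWithRetry gives a struggling dependency room to recover
  // instead of hammering it in a tight loop.
  func callWithRetry(call func() error) error {
      backoff := 100 * time.Millisecond
      for attempt := 0; attempt < 5; attempt++ {
          if err := call(); err == nil {
              return nil
          }
          // jitter keeps many clients from retrying in lockstep
          time.Sleep(backoff + time.Duration(rand.Int63n(int64(backoff))))
          backoff *= 2
      }
      return errors.New("giving up after 5 attempts")
  }

  func main() {
      err := callWithRetry(func() error {
          return errors.New("downstream unavailable") // simulated failing RPC
      })
      fmt.Println(err)
  }

A circuit breaker takes this one step further: after the error rate crosses a threshold, it fails fast for a while instead of calling the dependency at all.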

Slide 65

Service Level Objectives
๏ You cannot be 100% up: it is impossible.
๏ Find your SLO targets for each service:
  ๏ uptime
  ๏ latency
๏ Identify and analyze your dependencies.
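
To put uptime targets in perspective (an illustration, not from the slides): a 99.9% monthly uptime SLO leaves roughly 43 minutes of allowed downtime in a 30-day month (0.1% of 43,200 minutes), while 99.95% leaves about 21.6 minutes.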

Slide 66

Wrap up

Slide 67

๏ Microservices is not for everybody.
๏ You should have standardized practices.
๏ Automate everything.
๏ Developer time is expensive; use it carefully.
๏ If you are not monitoring, you don’t know.

Slide 68

Further Reading

Slide 69

Further Reading

Slide 70

Thank you.
github.com/ahmetalpbalkan
twitter.com/ahmetalpbalkan