
Practices for Deploying and Running Microservices


Presented at phpkonf 2015 in Istanbul, in Turkish.

Ahmet Alp Balkan

May 22, 2016



  1. • ported docker client to windows (docker.exe)
     • docker-machine for Azure
     • docker-registry for Azure
     • Docker C# client library (docker.dotnet)
     • Microsoft's first official docker image (asp.net)
     Go
     • Azure SDK for Go
  2. This presentation
     ๏ is a thought exercise, not a set of solutions
     ๏ is based on publications by Google, Microsoft, Netflix, etc.
     ๏ is based on experience talking to customers
     ๏ covers common practices adopted in the industry
  3. Unix Philosophy (Doug McIlroy, 1973)
     ๏ Write programs that do one thing and do it well.
     ๏ Write programs to work together.
     ๏ Write programs to handle text streams, because that is a universal interface.
  4. Building blocks
     ๏ independent microservices
     ๏ an orchestrator to deploy/run services
     ๏ a load balancing solution
     ๏ networking to inter-connect services
     ๏ an API (e.g. HTTP REST, protobuf, gRPC)
  5. Microservices axioms
     ๏ services can scale out independently
     ๏ services get added/removed all the time
     ๏ services can discover each other
     ๏ services talk to each other via RPC/API
     ๏ machines go down/become unreachable
     ๏ services crash/become unresponsive
     ๏ you will see all sorts of weirdness
  6. Microservices are good for you
     ๏ The model makes sense: loose coupling, separation of concerns
     ๏ Each service is independently scalable
     ๏ New services are easy to write, even in different languages
     ๏ Bugs & failures are more contained
  7. So what's the problem? Microservices are harder to manage:
     1 monolith vs. 20 microservices means more complicated deployments,
     too many failure points, too many moving parts, and a lot more units
     to monitor and alert upon.
  8. Concerns
     how do you deploy your services?
     how do you release and roll out new versions for a service?
     how do you reschedule/move services between machines?
     where do you store application state?
     how do you move data between machines?
  9. …more concerns
     how do you handle failed services?
     how do you discover new/removed instances of services?
     what is the target uptime/SLO (service level objective) for each service?
     how do you monitor service health?
     how and when do you alert humans?
  10. …more concerns
      how do you do logging?
      how easy is it to deploy services?
      how do you respond to incidents?
      ๏ how (fast) do you identify the problem?
      ๏ how (fast) do you mitigate/repair?
      ๏ how do you debug or troubleshoot?
  11. An application split into microservices is often an order of magnitude
      more complex to deploy, run and monitor.
  12. Journey to Microservices
      ๏ Pick a mental model and practices
      ๏ Pick your tooling
      ๏ Releasing
      ๏ Monitoring
      ๏ Livesite
  13. Philosopher King
      In Plato's ideal city-state: "philosophers [must] become kings… or
      those now called kings [must]… genuinely and adequately philosophize."
      You should have philosopher kings in your company.
  14. Have one way of doing things.
      ๏ a bad practice everyone shares is miles better than diverse ones
      ๏ a single practice is easy to change later
      ๏ it prevents useless discussions
      ๏ have a single practice for everything: retry policy, secrets management,
        deployment tool, build system, test framework, OS/distro, RPC protocol,
        log format, monitoring software, …
  15. Pick your technology stack.
      ๏ multiple programming languages are okay
      ๏ the best language is the one the team speaks
      ๏ (elegant & maintainable) > fast
      ๏ you don't need fast
      ๏ you don't need scalable
      ๏ hardware is cheaper than developers
  16. Invest in tools that give you automation and confidence
      ๏ Orchestrator
      ๏ Monitoring
      ๏ Alerting
      ๏ Service Discovery
      ๏ Load Balancing
  17. Orchestrators
      ๏ are the operating systems for your datacenter:
      ๏ you have a pool of machines (a cluster)
      ๏ and a set of services/tasks you want to run (and keep running,
        or run periodically)
      ๏ so you need an orchestrator/scheduler:
  18. Orchestrators …handle the heavy lifting of service lifecycles:
      ๏ deploy a service
      ๏ upgrade a service
      ๏ rolling upgrades without downtime
      ๏ rollbacks
      ๏ reschedule services if machines fail
      ๏ restart services if they crash
  19. Deployment Concerns
      how many steps does it take to deploy ServiceA?
      ๏ can an intern deploy easily, too?
      how confidently can you deploy?
      ๏ do you have enough tests?
      can you deploy without downtime?
      how long does it take to deploy all your company's microservices?
  20. Deployment Concerns
      can you roll back a new deployment?
      how do you update configuration?
      can you redeploy all your stack as it was on a particular date?
      are your builds signed?
      does your code work the same on all your machines (hardware/OS etc.)?
  21. Testing
      ๏ Tests are meant to give you confidence
      ๏ green = deploy
      ๏ red = do not deploy
      ๏ If you hesitate to deploy a green build, you are not testing enough.
  22. Build Reproducibility
      ๏ any code that goes to PROD is a commit from the source tree,
        not from a dev machine
      ๏ have a homogeneous execution environment on your machines
      ๏ same versions of: pkgs, kernel, distro
      ๏ use Docker for reproducibility
  23. Build Reproducibility
      ๏ have the ability to see the version of each running microservice
        in your cluster in your logs
      ๏ run your microservices on read-only filesystems to prevent contamination
      ๏ have homogeneous configuration for all instances of a service
        (etcd/consul/zk…)
  24. Deployment Pro-tips
      ๏ automate everything
      ๏ if it hurts, do it more! (and automate it)
      ๏ invest in tools that give you confidence
      ๏ conduct deployment drills: you will discover previously unknown bugs,
        unscripted deployment steps and pain points
  25. Monitoring philosophy
      ๏ If you can't measure it, you don't know it works.
      ๏ A correctly working program is a very special case. Failure is the default.
      ๏ Have massive visibility into your systems.
      ๏ An intern should be able to query anything about your system very easily.
      ๏ Monitoring is cheap. Being blind to an outage is expensive.
  26. Why do we monitor?
      ๏ to alert humans when something is broken
      ๏ to analyze long-term trends
      ๏ to compare whether v2.1 is faster than v2.0
      ๏ to build dashboards and observe anomalies
  27. What do we monitor?
      ๏ health
      ๏ latency
      ๏ error rate
      ๏ request count
      ๏ resource utilization (cpu/memory)
  28. Black-box monitoring
      ๏ monitor a system from the user's perspective:
      ๏ GET /home → 200 OK
      ๏ creating a user works
      ๏ a particular result appears in search results
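A black-box probe can be as small as the sketch below. The endpoint and the `fetch` callable are hypothetical; injecting the HTTP client keeps the probe logic independent of any particular library.

```python
def blackbox_check(fetch, url, want_status=200):
    """Black-box probe: judge the service purely by what a user sees.

    `fetch` is any callable that performs the request and returns an
    HTTP status code; a network error counts as a failed probe.
    """
    try:
        status = fetch(url)
    except Exception:
        return False
    return status == want_status

# A real prober would pass an HTTP client call as `fetch` and feed an
# alerting system (or page a human) when the check fails.
```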
  29. Black-box monitoring
      ๏ particularly unhelpful for complex systems where user requests are
        load balanced
      ๏ tells you the light bulb is turned on, but not how hot it is
  30. White-box monitoring
      ๏ ask a system for its internal details:
      GET /stats
      mem.free 312
      cpu.avg 0.15
      http.500 1
      http.404 12
      http.200 5698
      threads 24
      uptime 3m14s
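The counters behind a /stats page can come from a tiny in-process registry, as in this sketch (names mirror the slide; a real service would serialize the snapshot in an HTTP handler or use a metrics library):

```python
from collections import Counter

class Metrics:
    """Minimal in-process white-box counters; a GET /stats handler
    would serialize the result of snapshot()."""

    def __init__(self):
        self._counters = Counter()

    def inc(self, name, n=1):
        self._counters[name] += n

    def snapshot(self):
        # return a copy so callers cannot mutate the live counters
        return dict(self._counters)

metrics = Metrics()
metrics.inc("http.200", 5698)
metrics.inc("http.404", 12)
metrics.inc("http.500")
```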
  31. White-box monitoring
      ๏ in reality you have a lot more metrics, with a lot more dimensions
        such as "version" and "instanceid":
      http_requests{code=200, handler=new_user, method=get, version=2.0, id=3aebf531} 5310
      http_requests{code=500, handler=new_user, method=get, version=2.0, id=3aebf531} 4
  32. White-box monitoring
      ๏ you can extend services with internal metrics and export other
        counters from in-memory state
      ๏ check out Prometheus
  33. Aggregation
      ☹ average does not mean anything
      ☹ median does not mean anything either
      ☹ 95th percentile? nope, still not there
      ☹ 99th percentile? nope
      99.9th percentile? maybe… have visibility into the 99.95th percentile.
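A small, made-up latency distribution shows why: a pathological tail hides behind every aggregate until you look far enough out. The sketch uses the common nearest-rank percentile convention.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample >= p% of the samples."""
    ranked = sorted(samples)
    k = max(1, math.ceil(p / 100 * len(ranked)))
    return ranked[k - 1]

# 995 fast requests and 5 catastrophically slow ones (hypothetical, in ms):
latencies = [10] * 995 + [4000] * 5
avg  = sum(latencies) / len(latencies)   # ~30 ms: looks fine
p50  = percentile(latencies, 50)         # 10 ms: looks fine
p99  = percentile(latencies, 99)         # 10 ms: still looks fine
p999 = percentile(latencies, 99.9)       # 4000 ms: there it is
```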
  34. Aggregation
      ๏ Use a time-series database. They sample, aggregate and store results
        from counters.
      ๏ OpenTSDB, Graphite
      ๏ query: find the total error count for a specific region in the last
        5 minutes:
      http_request{code=500, service=search, region=westus}[5m]
  35. Motivation for Logging
      ๏ logs are for debugging
      ๏ if you SSH into PROD machines, you are totally doing it wrong
      ๏ SSH is unauditable (you cannot track what an engineer is doing
        on a machine)
      ๏ humans contaminate servers and break homogeneity
  36. Logs
      ๏ stick with 12factor.net principles: log messages to stdout
      ๏ use structured logging
      ๏ use open source tools for log collection, storage and querying
      ๏ store logs forever if you can, for further analysis or auditing;
        otherwise logrotate
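Structured, stdout-only logging can look like this sketch (the field names are arbitrary choices for illustration; real systems standardize them):

```python
import json
import sys
import time

def log(level, msg, **fields):
    """Emit one JSON log line to stdout, 12factor-style: the process
    never manages log files; the platform collects its stdout."""
    record = {"ts": time.time(), "level": level, "msg": msg, **fields}
    print(json.dumps(record, sort_keys=True), file=sys.stdout)
    return record  # returned to make the function easy to test

log("info", "user created", service="signup", user_id=42, version="2.0")
```

Because every line is a self-describing JSON object, collection tools can index and query on any field instead of regex-matching free text.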
  37. Log contexts
      ๏ Generate random GUIDs for correlation.
      ๏ Pass the correlation ID around to retrieve logs about a request
        from all services.
      ๏ You can put correlation IDs in headers.
      ๏ Add user parameters to contexts, measure latency for each parameter,
        and identify outliers.
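Propagating a correlation ID might look like this sketch; the header name is an assumption, the point is to pick one and use it everywhere:

```python
import uuid

CORRELATION_HEADER = "X-Correlation-Id"  # hypothetical header name

def with_correlation_id(headers):
    """Reuse the caller's correlation ID if the request carries one,
    otherwise mint a fresh GUID. Every service forwards the same ID,
    so one request's logs can be joined across all services."""
    out = dict(headers)
    out.setdefault(CORRELATION_HEADER, str(uuid.uuid4()))
    return out

# The first hop mints an ID; downstream hops keep it unchanged.
first = with_correlation_id({"Accept": "application/json"})
downstream = with_correlation_id(first)
```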
  38. Live Site Philosophy
      ๏ At high scale everything _will_ go wrong: compiler bugs,
        garbage collection bugs, kernel freezes.
      ๏ In an outage, mitigation is the first priority.
      ๏ Optimize for MTTR (mean time to repair).
  39. Live Site Philosophy
      ๏ Do not just restart and call it a day …you are an educated person.
      ๏ If it happened once, it will happen again.
      ๏ Learn from every single incident.
  40. Things will go wrong
      ๏ Your newly developed small feature will bring down the entire service.
      ๏ Have knobs/flags to disable features in production through configuration.
      ๏ When a bad deployment happens (in a rolling-upgrade fashion), have ways
        to flip traffic back to the old deployment.
      ๏ Search: blue/green deployments
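A feature knob can be as simple as this sketch; real systems back it with a config store (etcd/consul/zk) so an operator can flip it without a redeploy. The flag name and handlers are made up:

```python
class Flags:
    """Mutable feature flags a service consults at request time."""

    def __init__(self, defaults=None):
        self._flags = dict(defaults or {})

    def enabled(self, name):
        # unknown flags default to off: a safe failure mode
        return self._flags.get(name, False)

    def set(self, name, value):
        # flipped by an operator through configuration, not a deploy
        self._flags[name] = bool(value)

flags = Flags({"new_search_ranking": True})

def search(query):
    if flags.enabled("new_search_ranking"):
        return f"ranked results for {query!r}"   # the risky new path
    return f"plain results for {query!r}"        # the proven old path
```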
  41. Things will go wrong
      ๏ Carefully plan failure modes.
      ๏ A dumb retry policy between service RPCs will prevent the system
        from healing.
      ๏ Tell clients when and how to retry.
      ๏ See: circuit breaker pattern
      ๏ Can you contain the failure?
      ๏ Is returning older/cached data O.K.?
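One non-dumb retry policy is exponential backoff with jitter, sketched below; the base and cap values are illustrative only:

```python
import random

def backoff_delays(attempts, base=0.1, cap=10.0, rng=random.random):
    """Exponential backoff with "full jitter": each retry waits a random
    amount up to an exponentially growing (but capped) ceiling, so a
    crowd of failing clients doesn't hammer a recovering service in
    lockstep."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng() * ceiling)  # uniform in [0, ceiling)
    return delays
```

A circuit breaker goes one step further: after enough consecutive failures it stops calling the dependency entirely for a cooling-off period, which is what lets the downstream system heal.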
  42. Service Level Objectives
      ๏ You cannot be 100% up: it is impossible.
      ๏ Find your SLO targets for each service:
      ๏ uptime
      ๏ latency
      ๏ Identify and analyze your dependencies.
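A quick way to make an uptime SLO concrete is to turn it into a downtime budget; the 30-day window below is an arbitrary choice:

```python
def downtime_budget_minutes(slo, window_days=30):
    """Minutes of allowed downtime per window at a given availability
    SLO. At 100% the budget is zero, which is why 100% is impossible."""
    return (1 - slo) * window_days * 24 * 60

# "three nines" over a 30-day month leaves about 43 minutes of downtime:
budget = downtime_budget_minutes(0.999)
```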
  43. ๏ Microservices are not for everybody.
      ๏ You should have standardized practices.
      ๏ Automate everything.
      ๏ Developer time is expensive; use it carefully.
      ๏ If you are not monitoring, you don't know.