The service mesh: Distributed resilience for a cloud-native world

Modern application architecture is becoming cloud native: containerized, “microserviced,” and orchestrated with systems like Kubernetes, Mesos, and Docker Swarm. This environment is resilient to many hardware and software failures, but complex, high-traffic applications require more than this to be truly resilient. Internal, service-to-service communication becomes a critical component of application behavior, so resilient applications require resilient interservice communication.

Oliver Gould explains why companies like PayPal, Ticketmaster, and Monzo are adopting the service mesh model, a user-space infrastructure layer designed to manage service-to-service communication in a cloud-native environment. The mesh handles partial failures and unexpected load while reducing tail latencies and degrading gracefully in the presence of component failure.

Oliver traces the roots of service mesh models to microservice “sidecars” like Netflix’s Prana and Airbnb’s SmartStack. He also offers an overview of linkerd, a lightweight, Apache 2-licensed service mesh implementation used in production today at banks, AI startups, government labs, and more. He details linkerd’s modern, multilayered approach to handling failure (and its pernicious cousin, latency), including latency-aware load balancing, failure accrual, deadline propagation, retry budgets, and nacking. Oliver also describes linkerd’s unified model for request naming, which extends its model for failure handling across service cluster and data center boundaries, allowing for a variety of traffic-shifting strategies such as ad hoc staging clusters, blue-green deploys, and cross-data center failover.

Oliver Gould

June 22, 2017

Transcript

  1. The Service Mesh. Oliver Gould (@olix0r), CTO, Buoyant

  2. None
  3. resilience: The property of a material that enables it to resume its original shape after being bent, stretched, or compressed.
  4. operational stress: variable load, hardware failure, bugs, thE uNExpeCteD. resilient strategies: dynamic orchestration, load balancing, timeouts & retries, circuit breaking
 load balancing
 timeouts & retries
 circuit breaking

  5. 2000: dedicated hardware with configuration management. 2017: dynamically scheduled hybrid cloud
  6. containers orchestrators microservices

  7. service A, service B, service C: runtime communication

  8. service A, service B, service C: Twitter circa 2013

  9. cloud native abstractions: Virtual machines → Containers; Data centers → Orchestrated envs; Hardware redundancy → Design for failure; Servers → Services; IP addresses, DNS → Service discovery; Server monitoring → Service monitoring; Monolithic applications → Microservices; TCP/IP → gRPC, REST
  10. service A, service B, service C: we need something more?
  11. the service mesh: an infrastructure layer for managing service-to-service communication
  12. LAMP (diagram): Apache, PHP, MySQL
  13. Fat clients (diagram): Nginx, svc instances with libraries, DB
  14. The service mesh (diagram): ingress, svc instances, service mesh, DB
  15. The Linkerd service mesh (diagram): Node 1, Node 2, and Node 3 each run Service A, Service B, Service C, and a linkerd instance. Legend: application HTTP, proxied HTTP, monitoring & control
  16. visibility security flexibility reliability

  17. If you’re building a cloud native application, you need a service mesh. CENSORED
  18. linkerd

  19. Linkers and Loaders, John R. Levine, Academic Press

  20. Layers (diagram): [1] physical, [2] link, [3] network, [4] transport, [5] session, [6] presentation, [7] application. Labels: datacenter; aws, azure, digitalocean, gce, …; canal, weave, …; kubernetes, mesos, swarm, …; linkerd-tcp; linkerd; json, protobuf, thrift, …; service; business languages, libraries
  21. a historical perspective: tcp/ip. 1975: Internet Protocol Suite. Layer 3: /etc/hosts. Layer 4: /etc/services
  22. a historical perspective: dns 1984: Domain Name Service /etc/hosts-as-a-service

  23. the new world of service discovery! (diagram): hosts running app: a, app: b, and app: c; service: a
  24. what would a cloud native linker do?

  25. logical naming: applications refer to logical names (/svc/users); requests are bound to concrete names (/#/io.l5d.zk/prod/users, /#/io.l5d.k8s/staging/http/users); delegations express routing (/svc => 2 * /#/io.l5d.zk/prod & 8 * /#/io.l5d.k8s/prod/http)
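
The weighted delegation on slide 25 can be read as a traffic split: of every 10 requests for a `/svc` name, roughly 2 are bound under the ZooKeeper-backed prefix and 8 under the Kubernetes-backed one. The Python sketch below illustrates that reading only. It is not linkerd's implementation, and the `DELEGATION` table and `bind` function are hypothetical names introduced for illustration.

```python
import random

# Hypothetical weighted delegation table mirroring the slide:
#   /svc => 2 * /#/io.l5d.zk/prod & 8 * /#/io.l5d.k8s/prod/http
DELEGATION = {
    "/svc": [(2, "/#/io.l5d.zk/prod"), (8, "/#/io.l5d.k8s/prod/http")],
}

def bind(logical_name, rng=random):
    """Bind a logical name to one concrete name, chosen in proportion to weight."""
    for prefix, branches in DELEGATION.items():
        if logical_name == prefix or logical_name.startswith(prefix + "/"):
            residual = logical_name[len(prefix):]  # e.g. "/users"
            weights = [weight for weight, _ in branches]
            targets = [target for _, target in branches]
            chosen = rng.choices(targets, weights=weights, k=1)[0]
            return chosen + residual
    raise LookupError(f"no delegation matches {logical_name}")
```

Calling `bind("/svc/users")` repeatedly returns `/#/io.l5d.zk/prod/users` about 20% of the time and `/#/io.l5d.k8s/prod/http/users` otherwise, which is one way to shift traffic gradually between two backends.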
  26. centralized control

  27. per-request: adhoc staging. GET / HTTP/1.1; Host: mysite.com; l5d-dtab: /s/B => /s/B2
  28. per-request routing: debug proxy. GET / HTTP/1.1; Host: mysite.com; l5d-dtab: /s/E => /s/P/s/E
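
Slides 27 and 28 show a request carrying its own routing rule in an `l5d-dtab` header. A minimal Python sketch of that override idea follows, assuming that per-request entries are appended after a base table and that later entries take precedence. `BASE_DTAB`, `parse_dtab`, `rewrite`, and `route` are hypothetical names, the base rule `/s/B => /prod/B` is invented for illustration, and the sketch applies a single prefix rewrite rather than resolving a name all the way to concrete addresses.

```python
def parse_dtab(text):
    """Parse 'prefix => target; prefix => target' into (prefix, target) pairs."""
    entries = []
    for entry in text.split(";"):
        if entry.strip():
            prefix, target = entry.split("=>")
            entries.append((prefix.strip(), target.strip()))
    return entries

def rewrite(name, dtab):
    """Rewrite name with the last matching entry, so later entries win."""
    for prefix, target in reversed(dtab):
        if name == prefix or name.startswith(prefix + "/"):
            return target + name[len(prefix):]
    return name  # no entry matched; leave the name unchanged

# Hypothetical base routing table (not from the slides).
BASE_DTAB = parse_dtab("/s/B => /prod/B")

def route(name, headers):
    """Route a name, letting an l5d-dtab header override the base table."""
    override = parse_dtab(headers.get("l5d-dtab", ""))
    return rewrite(name, BASE_DTAB + override)
```

With no header, `route("/s/B", {})` follows the base table. With `{"l5d-dtab": "/s/B => /s/B2"}`, only that one request is sent to the staging name `/s/B2`, and `/s/E => /s/P/s/E` similarly places a debug proxy in front of `/s/E`.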
  29. observability: counters (e.g. client/users/failures), histograms (e.g. client/users/latency/p99), tracing

  30. timeouts & retries (diagram): timelines, users, web, db. timeout=400ms retries=3; timeout=400ms retries=2; timeout=200ms retries=3

  31. timeouts & retries (diagram): timelines, users, web, db. timeout=400ms retries=3; timeout=400ms retries=2; timeout=200ms retries=3. 800ms! 600ms!
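
The 800ms and 600ms callouts on slide 31 are consistent with multiplying each hop's timeout by its retry count: 400ms × 2 and 200ms × 3. The Python sketch below works through that arithmetic. It assumes the three settings belong to consecutive hops in the order listed and that `retries=N` means N total attempts; neither assumption is stated on the slides, and `HOPS` and `overruns` are hypothetical names.

```python
# (timeout_ms, attempts) per hop, in the order listed on the slide (assumed).
HOPS = [(400, 3), (400, 2), (200, 3)]

def worst_case_ms(timeout_ms, attempts):
    """Longest time a hop can take if every attempt runs to its timeout."""
    return timeout_ms * attempts

def overruns(hops):
    """Return (hop_index, worst_case_ms, caller_timeout_ms) for each hop
    whose worst case outlasts the timeout of the hop that called it."""
    found = []
    for i in range(1, len(hops)):
        caller_timeout, _ = hops[i - 1]
        worst = worst_case_ms(*hops[i])
        if worst > caller_timeout:
            found.append((i, worst, caller_timeout))
    return found

print(overruns(HOPS))  # [(1, 800, 400), (2, 600, 400)]
```

Under these assumptions both downstream hops can keep working long after their 400ms caller has given up, which is the problem that per-hop timeouts and retries create when they are configured independently.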
  32. deadlines (diagram): timelines, users, web, db. timeout=400ms; 77ms elapsed, deadline=323ms; 113ms elapsed, deadline=210ms
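
The numbers on slide 32 follow from subtracting the time already spent from the budget handed down by the caller: 400 − 77 = 323 and 323 − 113 = 210. A minimal Python sketch of that propagation step follows. The `propagate` function is a hypothetical helper, not a linkerd API.

```python
def propagate(deadline_ms, elapsed_ms):
    """Return the deadline to pass downstream after elapsed_ms has been spent."""
    remaining = deadline_ms - elapsed_ms
    if remaining <= 0:
        # The caller has already given up, so downstream work would be wasted.
        raise TimeoutError("deadline exceeded; do not call downstream")
    return remaining

first_hop = propagate(400, 77)           # 323
second_hop = propagate(first_hop, 113)   # 210
```

Because every hop shares one shrinking budget, no service keeps working on a request after the original 400ms timeout has passed.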
  33. retries typical: retries=3

  34. retries typical: retries=3 worst-case: 300% more load!!!

  35. budgets. typical: retries=3, worst-case: 300% more load!!! better: retryBudget=20%, worst-case: 20% more load
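
The contrast on slide 35 can be checked with a small simulation: a fixed per-request retry count multiplies load when everything fails, while a budget caps retries at a fraction of original requests. The Python sketch below is a simplified illustration. The `RetryBudget` class is hypothetical and keeps a lifetime count with no time window, which a production budget would likely need.

```python
class RetryBudget:
    """Allow retries only while they stay within percent of original requests."""

    def __init__(self, percent=20):
        self.percent = percent
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_retry(self):
        """Return True and spend budget if one more retry is allowed."""
        # Integer arithmetic avoids floating-point rounding at the boundary.
        if (self.retries + 1) * 100 <= self.percent * self.requests:
            self.retries += 1
            return True
        return False

def load_with_budget(n_requests, percent=20):
    """Total calls sent downstream when every request fails and wants a retry."""
    budget = RetryBudget(percent)
    for _ in range(n_requests):
        budget.record_request()
        budget.try_retry()
    return budget.requests + budget.retries

def load_with_fixed_retries(n_requests, retries=3):
    """Total calls sent downstream when every request fails and uses all retries."""
    return n_requests * (1 + retries)
```

For 100 failing requests, `load_with_fixed_retries(100)` sends 400 calls (300% more load), while `load_with_budget(100)` sends 120 (20% more load).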
  36. load shedding via cancellation (diagram): timelines, users, web, db; timeout!

  37. load shedding via cancellation (diagram): timelines, users, web, db; timeout!
  38. backpressure (diagram): timelines, users, web, db; 1000 requests, 100 requests, 1000 requests

  39. backpressure (diagram): timelines, users, web, db; 1000 failed, 1000 failed

  40. backpressure (diagram): timelines, users, web, db; 100 ok, 100 ok, 100 ok + 900 failed/redirected/etc
  41. request-level load balancing. lb algorithms: round-robin, fewest connections, queue depth, exponentially-weighted moving average (ewma), aperture
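
Of the algorithms listed on slide 41, the exponentially-weighted moving average is the one most directly tied to latency: each endpoint keeps a running latency estimate that favors recent observations, and the balancer prefers the endpoint with the lowest estimate. The Python sketch below shows a plain EWMA balancer under that description. It is not linkerd's balancer, and `Endpoint`, `pick`, and the smoothing factor `alpha=0.3` are assumptions chosen for illustration.

```python
class Endpoint:
    """Tracks an exponentially-weighted moving average of observed latency."""

    def __init__(self, name, alpha=0.3):
        self.name = name
        self.alpha = alpha      # weight given to the newest observation
        self.ewma_ms = 0.0      # untried endpoints start at 0 and get tried first
        self.samples = 0

    def observe(self, latency_ms):
        if self.samples == 0:
            self.ewma_ms = float(latency_ms)
        else:
            self.ewma_ms = (self.alpha * latency_ms
                            + (1 - self.alpha) * self.ewma_ms)
        self.samples += 1

def pick(endpoints):
    """Choose the endpoint with the lowest latency estimate."""
    return min(endpoints, key=lambda e: e.ewma_ms)
```

If endpoint `b` suddenly answers in 200ms instead of 10ms, its estimate rises and `pick` shifts traffic to a faster peer; as `b` recovers, newer observations outweigh the slow one and it becomes eligible again.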
  42. demo!?

  43. https://github.com/linkerd linkerd.io buoyant.io Buoyant is hiring! info.buoyant.io/careers