
The service mesh: Distributed resilience for a cloud-native world

Modern application architecture is becoming cloud native: containerized, “microserviced,” and orchestrated with systems like Kubernetes, Mesos, and Docker Swarm. While this environment is resilient to many hardware and software failures, complex, high-traffic applications need more than this to be truly resilient. As internal, service-to-service communication becomes a critical component of application behavior, resilient applications require resilient interservice communication.

Oliver Gould explains why companies like PayPal, Ticketmaster, and Monzo are adopting the service mesh model: a user-space infrastructure layer designed to manage service-to-service communication in a cloud-native environment, handling partial failures and unexpected load, reducing tail latencies, and degrading gracefully in the presence of component failure.

Oliver traces the roots of the service mesh model to microservice “sidecars” like Netflix’s Prana and Airbnb’s SmartStack. He also offers an overview of linkerd, a lightweight, Apache 2-licensed service mesh implementation used in production today at banks, AI startups, government labs, and more. He details linkerd’s modern, multilayered approach to handling failure (and its pernicious cousin, latency), including latency-aware load balancing, failure accrual, deadline propagation, retry budgets, and nacking. Finally, Oliver describes linkerd’s unified model for request naming, which extends its failure-handling model across service-cluster and data-center boundaries, enabling a variety of traffic-shifting strategies such as ad hoc staging clusters, blue-green deploys, and cross-data-center failover.

Oliver Gould

June 22, 2017

Transcript

  1. resilience: the property of a material that enables it to
     resume its original shape after being bent, stretched, or compressed.
  2. operational stress: variable load, hardware failure, bugs, the unexpected.
     resilient strategies: dynamic orchestration, load balancing, timeouts & retries, circuit breaking.

  3. cloud native abstractions:
     Virtual machines → Containers
     Data centers → Orchestrated environments
     Hardware redundancy → Design for failure
     Servers → Services
     IP addresses, DNS → Service discovery
     Server monitoring → Service monitoring
     Monolithic applications → Microservices
     TCP/IP → gRPC, REST
  4. (Diagram: the “fat client” model: Nginx at the edge, databases behind, and every service (svc) embedding communication logic as libraries.)
  5. (Diagram: the service mesh model: ingress, databases, and services (svc) communicating through a dedicated service-mesh layer.)
  6. (Diagram: the Linkerd service mesh: on each node (Node 1, Node 2, Node 3), Services A, B, and C send application HTTP to a local linkerd instance, which forwards proxied HTTP and exposes monitoring & control.)
  7. (Diagram: the OSI layers mapped onto the cloud-native stack:
     [7] application: business languages, libraries
     [6] presentation: json, protobuf, thrift, …
     [5] session: the “service” layer, where linkerd operates
     [4] transport: linkerd-tcp
     [3] network: canal, weave, …
     [2] link, [1] physical: the datacenter (aws, azure, digitalocean, gce, …), orchestrated by kubernetes, mesos, swarm, …)
  8. (Diagram: hosts running mixed containers (app: a, app: b, app: c), with “service: a” spanning instances across hosts. The new world of service discovery!)
  9. logical naming:
     • applications refer to logical names, e.g. /svc/users
     • requests are bound to concrete names, e.g. /#/io.l5d.zk/prod/users or /#/io.l5d.k8s/staging/http/users
     • delegations express routing, e.g. /svc => 2 * /#/io.l5d.zk/prod & 8 * /#/io.l5d.k8s/prod/http
 10. timeouts & retries: request timelines through users → web → db, with per-hop configuration:
     users → web: timeout=400ms, retries=3
     web → db: timeout=400ms, retries=2; or timeout=200ms, retries=3
 11. timeouts & retries (continued): the same timelines, annotated with worst-case latency: 800ms! 600ms! Naively configured per-hop timeouts and retries multiply, so a downstream hop can blow through its caller’s budget.
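The “800ms! 600ms!” callouts follow from simple multiplication: if every attempt runs to its full timeout before failing, worst-case latency is timeout × attempts. A sketch of that arithmetic (treating the slide’s retry count as the total number of attempts, which matches its numbers):

```python
def worst_case_ms(timeout_ms, attempts):
    """Worst case: every attempt runs to its full timeout before failing."""
    return timeout_ms * attempts

# web -> db with timeout=400ms and 2 attempts: exceeds the caller's 400ms budget.
assert worst_case_ms(400, 2) == 800
# Tightening to timeout=200ms with 3 attempts still yields 600ms.
assert worst_case_ms(200, 3) == 600
```

This is why mechanisms like deadline propagation and retry budgets exist: fixed per-hop timeout/retry settings compose multiplicatively rather than respecting the end-to-end deadline.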
 12. request-level load balancing algorithms:
     • round-robin
     • fewest connections
     • queue depth
     • exponentially-weighted moving average (ewma)
     • aperture
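The EWMA strategy scores each endpoint by a moving average of its observed latency, so slow instances are organically shunned. A minimal sketch combining an EWMA latency estimate with power-of-two-choices selection (linkerd’s actual balancer, inherited from Finagle, uses a more sophisticated “peak EWMA”; names and parameters here are illustrative):

```python
import random

class Endpoint:
    """Tracks an exponentially-weighted moving average of observed latency."""
    def __init__(self, name, alpha=0.3):
        self.name = name
        self.alpha = alpha      # weight given to the newest sample
        self.ewma_ms = 0.0

    def observe(self, latency_ms):
        # Standard EWMA update: new = alpha * sample + (1 - alpha) * old
        self.ewma_ms = self.alpha * latency_ms + (1 - self.alpha) * self.ewma_ms

def pick(endpoints, rng=random):
    """Power-of-two-choices: sample two endpoints, keep the lower-latency one."""
    a, b = rng.sample(endpoints, 2)
    return a if a.ewma_ms <= b.ewma_ms else b

fast, slow = Endpoint("fast"), Endpoint("slow")
for _ in range(20):
    fast.observe(10)    # consistently ~10ms
    slow.observe(200)   # consistently ~200ms

# With only two endpoints, pick() always compares them and prefers "fast".
print(pick([fast, slow]).name)  # prints "fast"
```

Because scores are per-request and latency-derived, the balancer adapts automatically to GC pauses, overload, or partial failure on individual instances, rather than waiting for a health check to fail.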