
Finagle, linkerd, and Apache Mesos: Twitter-style microservices at scale

Finagle (Twitter's Apache-licensed RPC stack) and Apache Mesos are two core technologies Twitter uses to scale its multi-service architecture to high traffic volumes. In this talk, we describe how Twitter used Finagle and Mesos together to address the challenges of scaling its application. We introduce linkerd, an Apache-licensed, proxy form of Finagle that extends Finagle's operational model to non-JVM and polyglot multi-service applications. Finally, we show how linkerd can be used to "wrap" applications running on Apache Mesos, providing higher-level, service-based semantics around scalability, reliability, and fault tolerance for multi-service or microservice applications, even in the presence of high traffic loads and unreliable hardware.


Oliver Gould

May 12, 2016

Transcript

  1. Finagle, linkerd, and Mesos
     Twitter-style microservices at scale
     oliver gould, cto, buoyant
     ApacheCon North America, May 2016
  2. oliver gould
     • cto @ buoyant: open-source microservice infrastructure
     • previously, tech lead @ twitter: observability, traffic
     • core contributor: finagle
     • creator: linkerd
     • loves: dogs
     @olix0r · ver@buoyant.io
  3. overview
     • 2010: A Failwhale Odyssey
     • Automating the Datacenter
     • Microservices: A Silver Bullet
     • Finagle: The Once and Future Layer 5
     • Introducing linkerd
     • Demo
     • Q&A
  4. 2010 A FAILWHALE ODYSSEY

  5. Twitter, 2010
     10⁷ users · 10⁷ tweets/day
     10² engineers · 10¹ services · 10¹ deploys/week
     10² hosts · 0 datacenters
     10¹ user-facing outages/week
     https://blog.twitter.com/2010/measuring-tweets
  6. [image slide]
  7. [image slide]
  8. The Monorail, 2010
     10³s of RPS · 10²s of RPS/host · 10¹s of RPS/process
     [diagram: hardware lb → the monorail → mysql, memcache, kestrel]
  9. Problems with the Monorail
     • Ruby performance
     • MySQL scaling
     • Memcache operability
     • Deploys
  10. Events https://blog.twitter.com/2013/new-tweets-per-second-record-and-how

  11. Asymmetry Photo by Troy Holden

  12. Provisioning

  13. automating the datacenter

  14. mesos.apache.org
      UC Berkeley, 2010 · Twitter, 2011 · Apache, 2012
      Abstracts compute resources
      Promise: don't worry about the hosts
  15. aurora.apache.org
      Twitter, 2011 · Apache, 2013
      Schedules processes on Mesos
      Promise: no more puppet, monit, etc.
  16. [diagram: Aurora (or Marathon, or …) schedules services onto Mesos hosts:
      timelines x800, users x300, notifications x1000]
  17. microservices A SILVER BULLET

  18. scaling teams, growing software

  19. flexibility

  20. performance correctness monitoring debugging efficiency security resilience

  21. not a silver bullet. (sorry.)

  22. "Resilience is an imperative: our software runs on the truly dismal
      computers we call datacenters. Besides being heinously complex… they
      are unreliable and prone to operator error."
      Marius Eriksen (@marius), RPC Redux
  23. resilience in microservices
      software you didn't write
      hardware you can't touch
      network you can't configure
      break in new and surprising ways
      and your customers shouldn't notice
  24. resilient microservices means resilient communication

  25. datacenter
      [7] application: business logic (languages, libraries)
      [6] presentation: json, protobuf, thrift, …
      [5] session: rpc (http/2, mux, …)
      [4] transport
      [3] network: canal, weave, …
      [2] link
      [1] physical: aws, azure, digitalocean, gce, …
      (aurora, marathon, … and mesos abstract layers 1-4)
  26. layer 5 dispatches requests onto layer 4 connections

  27. finagle THE ONCE AND FUTURE LAYER 5

  28. github.com/twitter/finagle
      RPC library (JVM)
      asynchronous, built on Netty
      scala, functional, strongly typed
      first commit: Oct 2010
  29. used by…

  30. programming finagle

      val users = Thrift.newIface[UserSvc]("/s/users")
      val timelines = Thrift.newIface[TimelineSvc]("/s/timeline")

      Http.serve(":8080", Service.mk[Request, Response] { req =>
        for {
          user <- users.get(userReq(req))
          timeline <- timelines.get(user)
        } yield renderHTML(user, timeline)
      })
  31. operating finagle
      transport security, service discovery, circuit breaking, backpressure,
      deadlines, retries, tracing, metrics, keep-alive, multiplexing,
      load balancing, per-request routing, service-level objectives
      [client stack diagram: Observe, Session timeout, Retries, Request
      draining, Load balancer, Monitor, Trace, Failure accrual, Request
      timeout, Pool, Fail fast, Expiration, Dispatcher]
  32. "'It's slow' is the hardest problem you'll ever debug."
      Jeff Hodges (@jmhodges), Notes on Distributed Systems for Young Bloods
  33. the more components you deploy, the more problems you have

  34. the more components you deploy, the more problems you have

  35. the more components you deploy, the more problems you have

  36. load balancing at layer 5
      lb algorithms:
      • round-robin
      • fewest connections
      • queue depth
      • exponentially-weighted moving average (ewma)
      • aperture
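      Finagle exposes these balancers per client; a minimal sketch, assuming
      the deck's "/s/users" naming:

      import com.twitter.finagle.Http
      import com.twitter.finagle.loadbalancer.Balancers

      // Swap the default balancer for power-of-two-choices over peak EWMA
      // latency; aperture and the others are constructed the same way.
      val users = Http.client
        .withLoadBalancer(Balancers.p2cPeakEwma())
        .newService("/s/users")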
  37. timeouts & retries
      [diagram: web, timelines, users, db, with per-hop settings:
      timeout=400ms retries=3 · timeout=400ms retries=2 · timeout=200ms retries=3]
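      Per-hop timeouts are plain client configuration in Finagle; a minimal
      sketch, assuming the deck's "/s/timeline" naming:

      import com.twitter.finagle.Http
      import com.twitter.util.Duration

      // A 400ms timeout on this hop; note that fixed per-hop retry counts
      // multiply worst-case load on everything downstream.
      val timelines = Http.client
        .withRequestTimeout(Duration.fromMilliseconds(400))
        .newService("/s/timeline")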
  38. deadlines
      [diagram: web, timelines, users, db]
      a timeout=400ms at web becomes deadline=223ms downstream after 177ms has
      elapsed (400 - 177), and deadline=10ms after a further 213ms (223 - 213)
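      Finagle propagates these deadlines in its request context; a minimal
      sketch, assuming Finagle's Deadline context API (the printlns are
      illustrative):

      import com.twitter.finagle.context.Deadline

      // Inside a Finagle server, the deadline propagated by upstream callers
      // (if any) is available from the broadcast request context.
      Deadline.current match {
        case Some(d) => println(s"time remaining: ${d.remaining}")
        case None    => println("no deadline propagated")
      }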
  39. retry budget
      typical: retries=3 → worst case: 300% more load!!!
      better: retryBudget=20% → worst case: 20% more load
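      Finagle ships this as RetryBudget; a minimal sketch of a budgeted client
      (the "/s/users" path reuses this deck's naming):

      import com.twitter.finagle.Http
      import com.twitter.finagle.service.RetryBudget
      import com.twitter.util.Duration

      // Allow retrying ~20% of requests over a sliding window, with a small
      // floor so low-traffic clients can still retry; worst case: 20% extra load.
      val budget = RetryBudget(
        ttl = Duration.fromSeconds(10),
        minRetriesPerSec = 5,
        percentCanRetry = 0.2
      )

      val users = Http.client
        .withRetryBudget(budget)
        .newService("/s/users")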
  40. tracing

  41. tracing

  42. tracing

  43. layer 5 routing

  44. layer 5 routing
      applications refer to logical names: /s/users
      requests are bound to concrete names: /io.l5d.zk/prod/users
      delegations express routing: /s => /io.l5d.zk/prod/http
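      Finagle models such delegations as Dtabs; a minimal sketch parsing the
      rule above:

      import com.twitter.finagle.Dtab

      // A delegation table: names under the logical /s prefix are rewritten
      // to the more concrete /io.l5d.zk/prod/http prefix before binding.
      val delegation: Dtab = Dtab.read("/s => /io.l5d.zk/prod/http")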
  45. per-request routing: staging

      GET / HTTP/1.1
      Host: mysite.com
      Dtab-local: /s/B => /s/B2
  46. per-request routing: debug proxy

      GET / HTTP/1.1
      Host: mysite.com
      Dtab-local: /s/E => /s/P/s/E
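      On the Finagle side, a Dtab-local header lands in the request-scoped
      Dtab.local, which follows the request across RPC hops; a minimal sketch
      of the staging override applied programmatically:

      import com.twitter.finagle.Dtab

      // Dtab.unwind restores the previous local dtab when the block exits;
      // RPCs issued inside see (and propagate) the override.
      Dtab.unwind {
        Dtab.local ++= Dtab.read("/s/B => /s/B2")
        // requests made here resolve /s/B via /s/B2
      }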
  47. so all i have to do is rewrite my app in scala?
  48. linkerd

  49. github.com/buoyantio/linkerd
      microservice rpc proxy; layer-5 router, aka l5d
      built on finagle & netty
      pluggable: http, thrift, …
      service discovery: etcd, consul, kubernetes, marathon, zookeeper, …
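      A minimal linkerd config sketch, assuming the file-based io.l5d.fs namer
      and port 4140 from linkerd's getting-started examples (not from this
      deck):

      namers:
      - kind: io.l5d.fs
        rootDir: disco        # a directory of files listing host:port addresses

      routers:
      - protocol: http
        dtab: /svc => /#/io.l5d.fs;
        servers:
        - ip: 0.0.0.0
          port: 4140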
  50. magic resiliency sprinkles
      transport security, service discovery, circuit breaking, backpressure,
      deadlines, retries, tracing, metrics, keep-alive, multiplexing,
      load balancing, per-request routing, service-level objectives
      [diagram: Service A instance, Service B instance, and Service C instance,
      each with a linkerd alongside]
  51. namerd
      released in March 2016
      centralized routing policy
      delegates logical names to service discovery
      pluggable: etcd, kubernetes, zookeeper, …
  52. namerd

  53. demo: gob’s microservice

  54. [diagram: web, word, and gen services, each with an l5d instance]

  55. [diagram: web, word, gen, and gen-v2, each with an l5d instance]

  56. [diagram: web, word, gen, and gen-v2 with their l5d instances, plus namerd]

  57. [diagram: a DC/OS cluster: a master running marathon and zookeeper,
      public nodes behind ELBs, and worker nodes]
  58. [diagram: the same cluster with a linkerd on each node and namerd added]
  59. [diagram: the same cluster running web (x1), gen (x3), word (x3), and
      word-growthhack (x3) behind their linkerds]
  60. github.com/buoyantio/linkerd-examples

  61. linkerd roadmap
      • Netty 4.1
      • HTTP/2 + gRPC (linkerd#174)
      • TLS client certs
      • Richer routing policies
      • Announcers
      • More configurable everything
  62. more at linkerd.io
      slack: slack.linkerd.io
      email: ver@buoyant.io
      twitter: @olix0r, @linkerd
      thanks!