Finagle, linkerd, and Mesos Magic Operability Sprinkles for Microservices

oliver gould, cto, buoyant
MesosCon North America, June 2, 2016

oliver gould
• founding cto @ buoyant: open-source microservice infrastructure
• previously, tech lead @ twitter: observability, traffic
• core contributor: finagle
• creator: linkerd
• likes: dogs
• dislikes: being woken up by a pager
@olix0r  [email protected]

overview
• 2010: A Failwhale Odyssey
• Automating the Datacenter
• Microservices: A Silver Bullet
• Finagle: The Once and Future Layer 5
• Introducing linkerd
• Demo
• Q&A

2010 A FAILWHALE ODYSSEY

Twitter, 2010
10^7 users
10^7 tweets/day
10^2 engineers
10^1 services
10^1 deploys/week
10^2 hosts
0 datacenters
10^1 user-facing outages/week
https://blog.twitter.com/2010/measuring-tweets

Events https://blog.twitter.com/2013/new-tweets-per-second-record-and-how

Asymmetry Photo by Troy Holden

Provisioning

automating the datacenter

mesos.apache.org
UC Berkeley, 2010
Twitter, 2011
Apache, 2012
Abstracts compute resources
Promise: don’t worry about the hosts

aurora.apache.org
Twitter, 2011
Apache, 2013
Schedules processes on Mesos
Promise: no more puppet, monit, etc

[diagram: Aurora (or Marathon, or …) on top of Mesos on top of many hosts, running the services timelines, users, and notifications; instance counts shown: x800, x300, x1000]

microservices A SILVER BULLET

scaling teams growing software

flexibility

performance, correctness, monitoring, debugging, efficiency, security, operability, resilience

there are no magic sprinkles. (sorry.)

“Resilience is an imperative: our software runs on the truly dismal computers we call datacenters. Besides being heinously complex… they are unreliable and prone to operator error.”
Marius Eriksen @marius, RPC Redux

resilience in microservices
software you didn’t write
hardware you can’t touch
network you can’t configure
break in new and surprising ways
and your customers shouldn’t notice

resilient microservices means resilient communication

[diagram: the network stack in microservice terms]
datacenter: [1] physical, [2] link, [3] network, [4] transport (aws, azure, digitalocean, gce, …; canal, weave, …; mesos; aurora, marathon, …)
rpc: [5] session (http/2, mux, …), [6] presentation (json, protobuf, thrift, …)
business: [7] application (languages, libraries)

layer 5 dispatches requests onto layer 4 connections

finagle THE ONCE AND FUTURE LAYER 5

github.com/twitter/finagle
RPC library (JVM)
asynchronous
built on Netty
scala
functional
strongly typed
first commit: Oct 2010

used by…

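programming finagle

    import com.twitter.finagle.{Http, Service, Thrift}
    import com.twitter.finagle.http.{Request, Response}

    // UserSvc, TimelineSvc, userReq, and renderHTML are the application's
    // own definitions; the slide does not show them.
    val users = Thrift.newIface[UserSvc]("/s/users")
    val timelines = Thrift.newIface[TimelineSvc]("/s/timeline")

    // Serve HTTP on :8080; each request fetches the user, then the timeline.
    Http.serve(":8080", Service.mk[Request, Response] { req =>
      for {
        user <- users.get(userReq(req))
        timeline <- timelines.get(user)
      } yield renderHTML(user, timeline)
    })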

operating finagle
transport security, service discovery, circuit breaking, backpressure, deadlines, retries, tracing, metrics, keep-alive, multiplexing, load balancing, per-request routing, service-level objectives
[diagram: client module stack] Observe, Session timeout, Retries, Request draining, Load balancer, Monitor, Observe, Trace, Failure accrual, Request timeout, Pool, Fail fast, Expiration, Dispatcher

layer 5 naming
applications refer to logical names: /s/users
requests are bound to concrete names: /#/io.l5d.zk/prod/users
delegations express routing: /s => /#/io.l5d.zk/prod/http
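
The naming slide can be illustrated with a small, library-free sketch of prefix delegation. This is not Finagle's or linkerd's implementation: `DtabSketch`, `Dentry`, and `bind` are hypothetical names, and the delegation used in the usage note (`/s => /#/io.l5d.zk/prod`) is chosen for illustration rather than copied from the slide. It assumes that later entries take precedence and that a name counts as concrete once it starts with `/#/`.

```scala
object DtabSketch {
  // One delegation: names under `prefix` are rewritten to live under `dst`.
  final case class Dentry(prefix: String, dst: String)

  // True when `name` equals `prefix` or extends it on a path-segment boundary.
  private def matches(name: String, prefix: String): Boolean =
    name == prefix || name.startsWith(prefix + "/")

  // Rewrite `name` with the last matching entry until it is concrete
  // (assumed here to mean "starts with /#/"). `maxDepth` guards against
  // delegation loops. Returns None when no entry applies.
  def bind(dtab: Seq[Dentry], name: String, maxDepth: Int = 10): Option[String] =
    if (name.startsWith("/#/")) Some(name)
    else if (maxDepth <= 0) None
    else dtab.reverse.find(d => matches(name, d.prefix)).flatMap { d =>
      bind(dtab, d.dst + name.drop(d.prefix.length), maxDepth - 1)
    }
}
```

With `Seq(DtabSketch.Dentry("/s", "/#/io.l5d.zk/prod"))`, binding the logical name `/s/users` yields `Some("/#/io.l5d.zk/prod/users")`, while a name with no matching prefix yields `None`.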

per-request routing: staging

    GET / HTTP/1.1
    Host: mysite.com
    Dtab-local: /s/B => /s/B2

per-request routing: debug proxy

    GET / HTTP/1.1
    Host: mysite.com
    Dtab-local: /s/E => /s/P/s/E
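
A rough sketch of how a per-request override like the two above might be applied: parse the `Dtab-local` header value into delegations and rewrite the logical name for that one request only. `DtabLocalSketch`, `parse`, and `rewrite` are hypothetical helpers, not linkerd's API, and the `;` separator between entries is an assumption about the header syntax.

```scala
object DtabLocalSketch {
  // Parse "/s/B => /s/B2;/s/C => /s/C2" into (prefix, destination) pairs.
  // Malformed entries are silently skipped in this sketch.
  def parse(header: String): Seq[(String, String)] =
    header.split(";").toSeq
      .map(_.split("=>").map(_.trim))
      .collect { case Array(prefix, dst) => prefix -> dst }

  // Rewrite `name` once using the last matching override; names that match
  // no override are returned unchanged and follow the normal routing policy.
  def rewrite(overrides: Seq[(String, String)], name: String): String =
    overrides.reverse.collectFirst {
      case (prefix, dst) if name == prefix || name.startsWith(prefix + "/") =>
        dst + name.drop(prefix.length)
    }.getOrElse(name)
}
```

For the staging request, `rewrite(parse("/s/B => /s/B2"), "/s/B")` gives `/s/B2`, so only requests carrying the header reach the staged service. For the debug proxy, `/s/E` becomes `/s/P/s/E`.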

tracing

“It’s slow” is the hardest problem you’ll ever debug.
Jeff Hodges @jmhodges, Notes on Distributed Systems for Young Bloods

the more components you deploy, the more problems you have

load balancing at layer 5
lb algorithms:
• round-robin
• fewest connections
• queue depth
• exponentially-weighted moving average (ewma)
• aperture
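
To make the ewma entry concrete, here is a minimal sketch of a latency-aware choice between two endpoints. It is a simplification under stated assumptions, not Finagle's balancer: `EwmaSketch`, `Endpoint`, and `pick` are hypothetical names, the smoothing factor `alpha` is fixed rather than time-decayed, and the cost function (average latency times outstanding requests plus one) is just one plausible choice.

```scala
object EwmaSketch {
  // Tracks a moving average of observed latencies for one endpoint.
  final class Endpoint(val name: String, alpha: Double = 0.3) {
    private var avgMs: Double = 0.0
    private var seen: Boolean = false
    var pending: Int = 0 // requests currently in flight

    // Fold a new latency sample into the average; the first sample seeds it.
    def observe(latencyMs: Double): Unit = {
      avgMs = if (seen) alpha * latencyMs + (1 - alpha) * avgMs else latencyMs
      seen = true
    }

    def average: Double = avgMs

    // Cost grows with both recent latency and outstanding work.
    def cost: Double = avgMs * (pending + 1)
  }

  // Given two candidate endpoints, prefer the one with the lower cost.
  def pick(a: Endpoint, b: Endpoint): Endpoint = if (a.cost <= b.cost) a else b
}
```

Because the balancer sits at layer 5, it can see per-request latency, which a connection-level balancer cannot. An endpoint that turns slow sees its cost rise and receives less traffic without being removed outright.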

timeouts & retries
[diagram: services web, timelines, users, db; per-hop settings]
timeout=400ms retries=3
timeout=400ms retries=2 (800ms!)
timeout=200ms retries=3 (600ms!)
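
The arithmetic behind the 800ms and 600ms callouts can be sketched as follows. This assumes a call chain of web -> timelines -> users -> db and reads `retries=N` as N attempts in total, which is the reading that reproduces the slide's numbers; `RetryMath`, `Hop`, and `overruns` are hypothetical names.

```scala
object RetryMath {
  // One hop in the call chain with its own timeout and attempt count.
  final case class Hop(name: String, timeoutMs: Int, attempts: Int) {
    // Longest time this hop may spend if every attempt times out.
    def worstCaseMs: Int = timeoutMs * attempts
  }

  // Assumed topology; the per-hop settings are the ones on the slide.
  val hops: Seq[Hop] = Seq(
    Hop("web -> timelines", 400, 3),
    Hop("timelines -> users", 400, 2),
    Hop("users -> db", 200, 3)
  )

  // Hops whose worst case exceeds the timeout of the hop calling them:
  // (hop name, worst case in ms, caller's timeout in ms).
  def overruns(chain: Seq[Hop]): Seq[(String, Int, Int)] =
    chain.zip(chain.drop(1)).collect {
      case (caller, callee) if callee.worstCaseMs > caller.timeoutMs =>
        (callee.name, callee.worstCaseMs, caller.timeoutMs)
    }
}
```

Both downstream hops can keep working (800ms and 600ms) long after their caller gave up at 400ms, so independently configured timeouts and retries waste work once the caller has already abandoned the request.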

deadlines
[diagram: services web, timelines, users, db]
timeout=400ms
77ms elapsed: deadline=323ms
113ms elapsed: deadline=210ms
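
The deadline numbers on this slide follow from subtracting the time already spent at each hop from the budget handed downstream. A minimal sketch, with `DeadlineSketch`, `remaining`, and `propagate` as hypothetical names:

```scala
object DeadlineSketch {
  // Budget left after `elapsedMs` has been spent; never negative.
  def remaining(deadlineMs: Long, elapsedMs: Long): Long =
    math.max(0L, deadlineMs - elapsedMs)

  // Deadline passed to each successive hop, given the time spent at each one.
  def propagate(initialMs: Long, elapsedPerHop: Seq[Long]): Seq[Long] =
    elapsedPerHop.scanLeft(initialMs)(remaining).tail
}
```

Starting from the 400ms timeout, `propagate(400L, Seq(77L, 113L))` gives `Seq(323L, 210L)`, matching the slide. A hop that receives a deadline of 0 can fail immediately instead of doing work nobody is waiting for.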

retries vs. budgets
typical: retries=3, worst-case: 300% more load!!!
better: retryBudget=20%, worst-case: 20% more load
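
A retry budget caps retries as a fraction of overall traffic rather than per request. The sketch below is a deliberately simplified counter, not Finagle's `RetryBudget` (which, among other things, works over a time window); `RetryBudgetSketch`, `Budget`, `recordRequest`, and `tryRetry` are hypothetical names.

```scala
object RetryBudgetSketch {
  // Permits retries only while they stay within `percent` of requests seen.
  final class Budget(percent: Int) {
    private var requests = 0L
    private var retries = 0L

    // Call once for every original (non-retry) request.
    def recordRequest(): Unit = requests += 1

    // Returns true and spends budget if one more retry keeps
    // retries <= percent% of requests; otherwise returns false.
    def tryRetry(): Boolean =
      if ((retries + 1) * 100 <= requests * percent) { retries += 1; true }
      else false

    def retriesSpent: Long = retries
  }
}
```

With `new Budget(20)` and 100 recorded requests, at most 20 retries are granted no matter how many are attempted. During an outage the extra load on the failing service is bounded at 20%, instead of being multiplied by the per-request retry count.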

so all i have to do is rewrite my app
in scala?

linkerd

github.com/buoyantio/linkerd
microservice rpc proxy
layer-5 router
aka l5d
built on finagle & netty
pluggable
http, thrift, …
etcd, consul, kubernetes, marathon, zookeeper, …
…

magic resiliency sprinkles
transport security, service discovery, circuit breaking, backpressure, deadlines, retries, tracing, metrics, keep-alive, multiplexing, load balancing, per-request routing, service-level objectives
[diagram: Service A, Service B, and Service C instances, each paired with a linkerd]

namerd
released in March
centralized routing policy
delegates logical names to service discovery
pluggable: etcd, kubernetes, zookeeper, …

namerd

demo: gob’s microservice

[diagram, three builds: web, word, and gen services, each with its own l5d; then a gen-v2 service with its own l5d is added; then namerd is added]

[diagram, three builds: dc/os cluster with a master (marathon, zookeeper), nodes, a public node, and ELBs; then linkerd instances, ELBs, and namerd; then the deployed apps: web (x1), gen (x3), word (x3), word-growthhack (x3), gen-growthhack (x3)]

github.com/buoyantio/linkerd-examples

linkerd roadmap
• Netty 4.1
• HTTP/2 + gRPC (linkerd#174)
• TLS client certs, SPIFFE
• Deadlines
• Announcers
• All configurable everything

more at linkerd.io
slack: slack.linkerd.io
email: [email protected]
twitter:
• @olix0r
• @linkerd
thanks!
