
Freeing the Whale: How to Fail at Scale

Twitter was once known for its ever-present error page, the “Fail Whale.” Thousands of staff-years later, this iconic image has all but faded from memory. The transformation was possible only because Twitter treated failure as something not just to be expected, but to be embraced.

In this talk, we discuss the technical insights that enabled Twitter to fail, safely and often. We will show how Finagle, the high-scale RPC library used at Twitter, Pinterest, SoundCloud, and other companies, provides a uniform model for handling failure at the communications layer. We’ll describe Finagle’s multi-layer mechanism for handling failure (and its pernicious cousin, latency), including latency-aware load balancing, failure accrual, deadline propagation, retry budgets, and negative acknowledgement. Finally, we’ll describe Finagle’s unified model for naming, inspired by the concepts of symbolic naming and dynamic linking in operating systems, which allows it to extend failure handling across service cluster and datacenter boundaries. We will end with a roadmap for improvements upon this model and mechanisms for applying it to non-Finagle applications.

From QConSF 2016.


Oliver Gould

November 10, 2016

Transcript

  1. Freeing the Whale: How to Fail at Scale. oliver gould, cto, buoyant. QConSF, November 9, 2016
  2. 2010 A FAILWHALE ODYSSEY

  3. None
  4. None
  5. Twitter, 2010: 10⁷ users, 10⁷ tweets/day, 10² engineers, 10¹ ops eng, 10¹ services, 10¹ deploys/week, 10² hosts, 0 datacenters, 10¹ user-facing outages/week. https://blog.twitter.com/2010/measuring-tweets
  6. objective: reliability, flexibility

  7. objective: reliability, flexibility. solution: platform (SOA + devops, i.e. “microservices”)

  8. “Resilience is an imperative: our software runs on the truly dismal computers we call datacenters. Besides being heinously complex… they are unreliable and prone to operator error.” Marius Eriksen (@marius), RPC Redux
  9. software you didn’t write, hardware you can’t touch, network you can’t trace: they break in new and surprising ways, and your customers shouldn’t notice
  10. freeing the whale photo: Johanan Ottensooser

  11. mesos.apache.org: UC Berkeley, 2010; Twitter, 2011; Apache, 2012. Abstracts compute resources. Promise: don’t worry about the hosts
  12. aurora.apache.org: Twitter, 2011; Apache, 2013. Schedules processes on Mesos. Promise: no more puppet, monit, etc.
  13. [diagram] Aurora (or Marathon, or …) schedules timelines x800, users x300, and notifications x1000 across Mesos hosts
  14. [diagram] the same topology with one Mesos host gone
  15. service discovery (timelines, users, zookeeper): create ephemeral /svc/users/node_012345 {“host”: “host-abc”, “port”: 4321}
  16. service discovery (timelines, users, zookeeper): watch /svc/users/*

  17. service discovery (timelines, users, zookeeper): GetUser(olix0r)

  18. service discovery (timelines, users, zookeeper): GetUser(olix0r) … uh oh.

  19. service discovery (timelines, users, zookeeper): GetUser(olix0r) still works; the client caches results

  20. service discovery (timelines, users, zookeeper): GetUser(olix0r) … zookeeper serves empty results?!

  21. service discovery (timelines, users, zookeeper): GetUser(olix0r) … service discovery is advisory
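
The registration step in slide 15 can be sketched with Apache Curator (an assumption; the slides don't say which ZooKeeper client was used). The ZooKeeper address is illustrative; the path and payload mirror the slide.

    import org.apache.curator.framework.CuratorFrameworkFactory
    import org.apache.curator.retry.ExponentialBackoffRetry
    import org.apache.zookeeper.CreateMode

    object RegisterUsersInstance {
      def main(args: Array[String]): Unit = {
        // Illustrative connect string; Curator's stock exponential-backoff retry policy.
        val zk = CuratorFrameworkFactory.newClient("zk.example.com:2181", new ExponentialBackoffRetry(1000, 3))
        zk.start()

        // An ephemeral node disappears when this process loses its ZooKeeper session,
        // which is why clients cache results and treat the registry as advisory.
        zk.create()
          .creatingParentsIfNeeded()
          .withMode(CreateMode.EPHEMERAL)
          .forPath(
            "/svc/users/node_012345",
            """{"host": "host-abc", "port": 4321}""".getBytes("UTF-8")
          )
      }
    }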

  22. github.com/twitter/finagle: RPC library (JVM), asynchronous, built on Netty; Scala, functional, strongly typed; first commit: Oct 2010
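
For readers who haven't used Finagle, a minimal client/server pair looks roughly like this (address and response body are illustrative); the mechanisms in the following slides are all layered onto this Service abstraction.

    import com.twitter.finagle.{Http, Service}
    import com.twitter.finagle.http.{Request, Response}
    import com.twitter.util.{Await, Future}

    object HelloFinagle {
      def main(args: Array[String]): Unit = {
        // A Service is just an asynchronous function: Request => Future[Response].
        val service = Service.mk[Request, Response] { _ =>
          val rsp = Response()
          rsp.contentString = "hello"
          Future.value(rsp)
        }
        val server = Http.serve(":8080", service)

        // A client has the same type, so filters (timeouts, retries, tracing, …) compose on both sides.
        val client: Service[Request, Response] = Http.newService("localhost:8080")
        println(Await.result(client(Request("/"))).contentString)

        Await.ready(server)
      }
    }
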
  23. the datacenter, as a stack: [1] physical, [2] link, [3] network, [4] transport (kubernetes, mesos, swarm, …; canal, weave, …; aws, azure, digitalocean, gce, …), [5] session (rpc: http/2, mux, …), [6] presentation (json, protobuf, thrift, …), [7] application (business languages, libraries)
  24. “‘It’s slow’ is the hardest problem you’ll ever debug.” Jeff Hodges (@jmhodges), Notes on Distributed Systems for Young Bloods
  25. observability: counters (e.g. client/users/failures), histograms (e.g. client/users/latency/p99), tracing
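
The counters and histograms on this slide come from Finagle's StatsReceiver API; a hand-rolled sketch of the same kind of instrumentation (the metric names and scoping are illustrative) might look like this.

    import com.twitter.finagle.stats.{DefaultStatsReceiver, StatsReceiver}
    import com.twitter.util.Stopwatch

    class UsersClientMetrics(stats: StatsReceiver = DefaultStatsReceiver.scope("client").scope("users")) {
      private val failures = stats.counter("failures")   // exported under client/users/failures
      private val latencyMs = stats.stat("latency_ms")   // a histogram: p50, p90, p99, …

      // Wrap a call, always recording latency and counting a failure on exception.
      def recordCall[T](run: => T): T = {
        val elapsed = Stopwatch.start()
        try run
        catch { case e: Throwable => failures.incr(); throw e }
        finally latencyMs.add(elapsed().inMilliseconds.toFloat)
      }
    }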

  26. tracing

  27. timeouts & retries: the web/timelines/users/db chain, each hop configured independently: timeout=400ms retries=3; timeout=400ms retries=2; timeout=200ms retries=3
  28. timeouts & retries: the same chain; retries multiply the worst case: 800ms! 600ms!
  29. deadlines: the caller sets timeout=400ms; after 77ms elapsed the propagated deadline is 323ms; after another 113ms elapsed it is 210ms
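
A sketch of the pattern in slides 27-29 with Finagle: the edge sets one request timeout, and downstream code consults the deadline Finagle propagates in its broadcast context instead of picking its own unrelated timeout. The destination names are illustrative.

    import com.twitter.conversions.time._
    import com.twitter.finagle.Http
    import com.twitter.finagle.context.Deadline
    import com.twitter.util.{Duration, Time}

    object Deadlines {
      // web -> timelines: the edge sets the overall budget once.
      val timelines = Http.client
        .withRequestTimeout(400.millis)
        .newService("timelines.example.com:8080")

      // Inside a downstream service: how much of the caller's budget is left?
      // (77ms in, roughly 323ms; another 113ms later, roughly 210ms, as on slide 29.)
      def remainingBudget(): Option[Duration] =
        Deadline.current.map(d => d.deadline - Time.now)
    }
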
  30. retries typical: retries=3

  31. retries typical: retries=3 worst-case: 300% more load!!!

  32. budgets: typical: retries=3, worst-case 300% more load!!! better: retryBudget=20%, worst-case 20% more load
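
A sketch of a retry budget using Finagle's RetryBudget: the 20% matches the slide, while the ttl, the minimum retry floor, and the destination name are illustrative.

    import com.twitter.conversions.time._
    import com.twitter.finagle.Http
    import com.twitter.finagle.service.RetryBudget

    object BudgetedRetries {
      // Retries may add at most 20% load on top of live requests, with a small
      // floor of 10 retries/sec so low-traffic clients can still retry at all.
      val budget = RetryBudget(ttl = 10.seconds, minRetriesPerSec = 10, percentCanRetry = 0.2)

      val users = Http.client
        .withRetryBudget(budget)
        .newService("users.example.com:8080")
    }
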
  33. [diagram] load shedding via cancellation: the web/timelines/users/db chain; a timeout fires upstream
  34. [diagram] load shedding via cancellation: the timed-out request is cancelled down the chain
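
The cancellation in slides 33-34 falls out of Twitter Futures being interruptible: when the caller gives up, raising an interrupt on the pending Future lets Finagle cancel the in-flight RPC downstream instead of letting it run to completion. A sketch (the destination name is illustrative):

    import com.twitter.conversions.time._
    import com.twitter.finagle.Http
    import com.twitter.finagle.http.{Request, Response}
    import com.twitter.util.{Future, JavaTimer, Timer}

    object Cancellation {
      implicit val timer: Timer = new JavaTimer()
      val timelines = Http.client.newService("timelines.example.com:8080")

      def call(req: Request): Future[Response] =
        // raiseWithin both fails this Future after 400ms and raises an interrupt on
        // the pending request, so the downstream work is cancelled, not abandoned.
        timelines(req).raiseWithin(400.millis)
    }
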
  35. [diagram] backpressure: 1000 requests arrive, but a downstream service can only handle 100 requests
  36. [diagram] backpressure: without it, the overload cascades and 1000 requests fail at each hop
  37. [diagram] backpressure: with it, 100 requests succeed at each hop and the other 900 are failed/redirected/etc at the edge
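
One concrete way to get the behavior in slide 37 is a server-side concurrency limit, so excess requests are rejected quickly at the edge instead of queueing until everything times out. A sketch using Finagle's admission control (the limits and port are illustrative):

    import com.twitter.finagle.{Http, Service}
    import com.twitter.finagle.http.{Request, Response}
    import com.twitter.util.Future

    object Backpressure {
      val service = Service.mk[Request, Response](_ => Future.value(Response()))

      val server = Http.server
        .withAdmissionControl.concurrencyLimit(
          maxConcurrentRequests = 100, // serve at most 100 requests in flight
          maxWaiters = 0               // reject the rest immediately rather than queue them
        )
        .serve(":8080", service)
    }
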
  38. request-level load balancing: • round-robin • fewest connections • queue depth • exponentially-weighted moving average (ewma) • aperture
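
Choosing one of the balancers on this slide for a Finagle client is a one-liner; peak EWMA is the latency-aware option the talk highlights (the destination name is illustrative):

    import com.twitter.finagle.Http
    import com.twitter.finagle.loadbalancer.Balancers

    object Balanced {
      // Power-of-two-choices over a peak EWMA of observed latency;
      // Balancers.aperture() would pick the aperture strategy instead.
      val users = Http.client
        .withLoadBalancer(Balancers.p2cPeakEwma())
        .newService("users.example.com:8080")
    }
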
  39. None
  40. So just rewrite everything in Finagle!?

  41. linkerd

  42. github.com/buoyantio/linkerd: service mesh proxy, built on finagle & netty, suuuuper pluggable: http, thrift, …; etcd, consul, kubernetes, marathon, zookeeper, …
  43. Linkers and Loaders, John R. Levine, Academic Press

  44. linker for the datacenter

  45. logical naming: applications refer to logical names; requests are bound to concrete names; delegations express routing. e.g. /s/users binds to /#/io.l5d.zk/prod/users (or /#/io.l5d.zk/staging/users) via the delegation /s => /#/io.l5d.zk/prod
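
The delegation on this slide is a Finagle Dtab; a sketch of how it reads in code (the /#/io.l5d.zk prefix is linkerd's ZooKeeper namer, as on the slide):

    import com.twitter.finagle.Dtab

    object Naming {
      // Applications address the logical name /s/users; this delegation binds
      // /s/... to concrete names served from the prod ZooKeeper namespace.
      val delegation: Dtab = Dtab.read("/s => /#/io.l5d.zk/prod")

      // A narrower rule layered on top routes just the users service to staging.
      val withStagingUsers: Dtab = delegation ++ Dtab.read("/s/users => /#/io.l5d.zk/staging/users")
    }
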
  46. per-request routing: staging
     GET / HTTP/1.1
     Host: mysite.com
     l5d-dtab: /s/B => /s/B2
  47. per-request routing: debug proxy
     GET / HTTP/1.1
     Host: mysite.com
     l5d-dtab: /s/E => /s/P/s/E
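
The same per-request overrides can be sent from code; a sketch with a Finagle HTTP client pointed at a local linkerd (the linkerd port and Host value are illustrative):

    import com.twitter.finagle.Http
    import com.twitter.finagle.http.Request
    import com.twitter.util.Await

    object PerRequestRouting {
      def main(args: Array[String]): Unit = {
        // Talk to the local linkerd, which routes on Host plus any l5d-dtab override.
        val viaLinkerd = Http.newService("localhost:4140")

        val req = Request("/")
        req.headerMap.set("Host", "mysite.com")
        req.headerMap.set("l5d-dtab", "/s/B => /s/B2") // route service B's traffic to B2 for this request only

        println(Await.result(viaLinkerd(req)).statusCode)
      }
    }
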
  48. linkerd service mesh: transport security, service discovery, circuit breaking, backpressure, deadlines, retries, tracing, metrics, keep-alive, multiplexing, load balancing, per-request routing, service-level objectives. [diagram] Service A, B, and C instances, each paired with a linkerd
  49. demo: gob’s microservice

  50. [diagram] demo topology: web, word, and gen services, each with an l5d (linkerd) instance

  51. [diagram] a gen-v2 service (with its own l5d) is added alongside gen

  52. [diagram] the same topology plus namerd, which manages routing for the l5d instances

  53. github.com/buoyantio/linkerd-examples

  54. linkerd roadmap: • Battle test HTTP/2 • TLS client certs • Deadlines • Dark Traffic • All configurable everything
  55. thanks! more at linkerd.io; slack: slack.linkerd.io; email: ver@buoyant.io; twitter: @olix0r, @linkerd