
When Failure is Not an Option: Processing Real Money at Monzo with Kubernetes and Linkerd


In this talk, we describe how Monzo processes financial transactions involving real money and real people in a way that's safe, secure, and resilient. We show how combining Kubernetes with Linkerd creates a highly adaptive system, where Kubernetes provides a baseline level of protection against hardware and software failures and Linkerd layers on request-level resilience, including latency-aware load-balancing, intelligent retries, and service-level measures of success rates and latency. We show how the resulting system is resilient to a wide variety of failures and protects the financial transactions that flow through it, yet still allows for a rapid pace of feature development and iteration.


Oliver Gould

March 29, 2017

Transcript

  1. When Failure is Not an Option: Processing Real Money at Monzo with Kubernetes and Linkerd. oliver beattie, head of eng, monzo. oliver gould, creator, linkerd; cto, buoyant inc. Kubecon EU, March 29, 2017
  2. None
  3. None
  4. None
  5. Improving everyone’s relationship with their money

  6. Fast-paced tech startup but also a Regulated retail bank

  7. Extensible Efficient Resilient Secure

  8. None
  9. Reduced infrastructure spend 60% by moving to Kubernetes

  10. Host A Host B Service A Service B

  11. Load balancing, Tracing, Circuit breakers, Retries, Canarying, Load shedding, Error tracking, Metrics, Service discovery, Logging, Timeouts, Expirations, Security policies, Back-offs, Retry budgets, Dynamic routing
  12. Host A Host B Service A Service B

  13. linkerd

  14. [Diagram: network layer stack. Layers: [1] physical, [2] link, [3] network, [4] transport, [5] session, [6] presentation, [7] application. Labels: datacenter; aws, azure, digitalocean, gce, …; canal, weave, …; kubernetes, mesos, swarm, …; rpc; json, protobuf, thrift, …; business languages, libraries; linkerd]
  15. [Diagram: network layer stack. Layers: [1] physical, [2] link, [3] network, [4] transport, [5] session, [6] presentation, [7] application. Labels: datacenter; aws, azure, digitalocean, gce, …; canal, weave, …; kubernetes, mesos, swarm, …; rpc; json, protobuf, thrift, …; business languages, libraries; linkerd; linkerd-tcp]
  16. Linkers and Loaders, John R. Levine, Academic Press

  17. a historical perspective: tcp/ip. 1975: Internet Protocol Suite. Layer 3: /etc/hosts. Layer 4: /etc/services
  18. REMINDER: Entire companies were formed around selling TCP/IP stacks before Windows 95.
  19. a historical perspective: dns 1984: Domain Name Service /etc/hosts-as-a-service

  20. the new world of service discovery! [Diagram: two hosts. host: app: b, app: a, app: c. host: app: a, app: b, app: a. service: a]
  21. what would a cloud native linker do?

  22. logical naming: applications refer to logical names; requests are bound to concrete names; delegations express routing. /svc/users /#/io.l5d.zk/prod/users /#/io.l5d.k8s/staging/http/users /svc => /#/io.l5d.k8s/prod/http
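
The delegation on this slide can be read as a prefix rewrite: a logical name such as /svc/users is rewritten until it becomes a concrete name. The following is a minimal sketch of that idea, not linkerd's implementation. The `delegate` function and the list-of-pairs dtab representation are hypothetical, and real dtabs support more syntax than plain prefix rules.

```python
def delegate(path, dtab):
    """Rewrite a logical path with prefix rules until no rule applies.

    dtab is a list of (prefix, destination) pairs. Later entries take
    precedence over earlier ones (an assumption of this sketch that
    follows the documented dtab convention).
    """
    def matches(candidate, prefix):
        # Match whole path segments only, so /svc does not match /svcx.
        return candidate == prefix or candidate.startswith(prefix + "/")

    for _ in range(16):  # guard against rule cycles
        for prefix, dest in reversed(dtab):
            if matches(path, prefix):
                path = dest + path[len(prefix):]
                break
        else:
            return path  # no rule matched, so the name is fully bound
    raise RuntimeError("too many rewrites")


# The single rule shown on the slide.
dtab = [("/svc", "/#/io.l5d.k8s/prod/http")]
bound = delegate("/svc/users", dtab)
print(bound)
```

With the rule from the slide, /svc/users is bound to a name under /#/io.l5d.k8s/prod/http, so the application never has to know which cluster or namespace serves it.
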
  23. per-request routing: staging. GET / HTTP/1.1; Host: mysite.com; l5d-dtab: /s/B => /s/B2
  24. per-request routing: debug proxy. GET / HTTP/1.1; Host: mysite.com; l5d-dtab: /s/E => /s/P/s/E
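
The two per-request routing slides show a routing rule carried in the l5d-dtab request header. As a rough sketch of how such an override could combine with a base routing table, the code below appends the header's rules to the base rules and lets the later (per-request) rules win. The `parse_dtab` and `route` helpers and the base rule /s/B => /prod/B are hypothetical, and only a single rewrite step is applied.

```python
def parse_dtab(text):
    """Parse 'prefix => dest; prefix => dest' into (prefix, dest) pairs."""
    rules = []
    for entry in text.split(";"):
        if entry.strip():
            prefix, dest = entry.split("=>")
            rules.append((prefix.strip(), dest.strip()))
    return rules


def route(path, base_dtab, headers):
    """Apply one rewrite step; per-request rules override the base rules."""
    override = parse_dtab(headers.get("l5d-dtab", ""))
    for prefix, dest in reversed(base_dtab + override):
        if path == prefix or path.startswith(prefix + "/"):
            return dest + path[len(prefix):]
    return path


base = parse_dtab("/s/B => /prod/B")  # hypothetical base rule

# Staging: this one request is sent to B2 instead of B.
staging = {"Host": "mysite.com", "l5d-dtab": "/s/B => /s/B2"}
print(route("/s/B", base, staging))

# Debug proxy: calls to E are routed through P for this request only.
debug = {"Host": "mysite.com", "l5d-dtab": "/s/E => /s/P/s/E"}
print(route("/s/E", base, debug))
```

Requests without the header keep the base routing, which is what makes this safe to use against a shared environment.
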
  25. observability: counters (e.g. client/users/failures), histograms (e.g. client/users/latency/p99), tracing

  26. linkerd-viz

  27. tracing

  28. timeouts & retries [diagram: timelines, users, web, db] timeout=400ms retries=3; timeout=400ms retries=2; timeout=200ms retries=3
  29. timeouts & retries [diagram: timelines, users, web, db] timeout=400ms retries=3; timeout=400ms retries=2 (800ms!); timeout=200ms retries=3 (600ms!)
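
The 800ms and 600ms figures appear to be the worst-case time a hop can spend when every attempt runs to its full timeout. The sketch below reproduces that arithmetic under the assumption that retries=N means up to N attempts in total; the `worst_case_ms` helper is hypothetical.

```python
def worst_case_ms(timeout_ms, attempts):
    """Worst case for one hop: every attempt waits for the full timeout."""
    return timeout_ms * attempts


# (timeout_ms, attempts) for the two hops annotated on the slide.
for timeout_ms, attempts in [(400, 2), (200, 3)]:
    total = worst_case_ms(timeout_ms, attempts)
    print(f"timeout={timeout_ms}ms retries={attempts}: up to {total}ms")
```

Both totals exceed the 400ms timeout configured further up the chain, so a hop can still be retrying after its caller has already given up.
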
  30. deadlines [diagram: timelines, users, web, db] timeout=400ms; 77ms elapsed, deadline=323ms; 113ms elapsed, deadline=210ms
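
The numbers on the deadlines slide follow from subtracting the time already spent from the remaining budget at each hop. A minimal sketch of that propagation, with a hypothetical `remaining_deadline_ms` helper:

```python
def remaining_deadline_ms(deadline_ms, elapsed_ms):
    """Return the budget to pass to the next hop, or fail if none is left."""
    remaining = deadline_ms - elapsed_ms
    if remaining <= 0:
        raise TimeoutError("deadline exceeded before calling the next hop")
    return remaining


budget = 400                                   # initial timeout
after_first = remaining_deadline_ms(budget, 77)        # 77ms elapsed
after_second = remaining_deadline_ms(after_first, 113)  # 113ms elapsed
print(after_first, after_second)
```

Because each hop receives only what is left of the original 400ms, no hop keeps working on a request whose caller has already timed out.
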
  31. retries typical: retries=3

  32. retries typical: retries=3 worst-case: 300% more load!!!

  33. budgets typical: retries=3, worst-case: 300% more load!!! better: retryBudget=20%, worst-case: 20% more load
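
To make the contrast between a fixed retry count and a retry budget concrete, here is a simplified sketch. It is not linkerd's implementation: the `RetryBudget` class is hypothetical and counts over the whole run, whereas a production budget would typically use a time window. It simulates an outage in which every request fails and every caller wants 3 retries.

```python
class RetryBudget:
    """Allow retries only while they stay within a percentage of requests."""

    def __init__(self, percent=20):
        self.percent = percent
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_retry(self):
        # Integer arithmetic keeps the comparison exact.
        if (self.retries + 1) * 100 <= self.percent * self.requests:
            self.retries += 1
            return True
        return False


requests, max_retries = 100, 3

# Fixed retries: every failed request is retried 3 times.
naive_extra = requests * max_retries

# Budgeted retries: most retries are refused once the budget is spent.
budget = RetryBudget(percent=20)
for _ in range(requests):
    budget.record_request()
    for _ in range(max_retries):
        if not budget.try_retry():
            break

print(naive_extra, budget.retries)
```

In this simulation the fixed policy adds 300 extra requests per 100 originals, while the budget caps the extra load at 20.
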
  34. load shedding via cancellation [diagram: timelines, users, web, db] timeout!
  35. load shedding via cancellation [diagram: timelines, users, web, db] timeout!
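
The idea on these two slides is that when a caller times out, the work it triggered downstream is cancelled instead of running to completion. The sketch below illustrates this with Python's asyncio, where `asyncio.wait_for` cancels the awaited call on timeout. The service names come from the slide, but the call order (users calls web, web calls db) and the timings are assumptions.

```python
import asyncio

cancelled = []  # records which services had their work cancelled


async def db():
    try:
        await asyncio.sleep(1.0)  # a slow query
        return "rows"
    except asyncio.CancelledError:
        cancelled.append("db")
        raise


async def web():
    try:
        return await db()
    except asyncio.CancelledError:
        cancelled.append("web")
        raise


async def users():
    try:
        # On timeout, wait_for cancels web(), which in turn cancels db().
        return await asyncio.wait_for(web(), timeout=0.05)
    except asyncio.TimeoutError:
        return "timeout!"


result = asyncio.run(users())
print(result, cancelled)
```

The cancellation reaches the innermost call first, so the slow query is abandoned rather than finishing work nobody is waiting for.
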
  36. backpressure [diagram: timelines, users, web, db] 1000 requests, 100 requests, 1000 requests
  37. backpressure [diagram: timelines, users, web, db] 1000 failed, 1000 failed
  38. backpressure [diagram: timelines, users, web, db] 100 ok, 100 ok, 100 ok + 900 failed/redirected/etc
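
The backpressure slides contrast an overloaded service that fails everything with one that serves what it can and rejects the rest quickly. A minimal sketch of the second behaviour, assuming the figure of 100 is a capacity limit; the `BoundedService` class is hypothetical.

```python
class BoundedService:
    """Admit requests up to a concurrency limit and reject the rest at once."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_admit(self):
        if self.in_flight >= self.max_in_flight:
            return False  # signal the caller to fail fast or go elsewhere
        self.in_flight += 1
        return True

    def release(self):
        self.in_flight -= 1


service = BoundedService(max_in_flight=100)

# A burst of 1000 requests arrives before any of them completes.
outcomes = [service.try_admit() for _ in range(1000)]
ok = outcomes.count(True)
rejected = outcomes.count(False)
print(ok, rejected)
```

The rejected callers learn immediately that the service is full, which lets them fail, redirect, or retry elsewhere instead of queueing behind work that cannot complete in time.
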
  39. request-level load balancing. lb algorithms: • round-robin • fewest connections • queue depth • exponentially-weighted moving average (ewma) • aperture
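
Of the algorithms listed, the exponentially-weighted moving average is the latency-aware one: each endpoint keeps a smoothed latency estimate and requests go to the cheapest endpoint. The sketch below is a simplified illustration, not linkerd's balancer. The `Endpoint` class, the smoothing factor, and the cost formula (smoothed latency times outstanding requests plus one) are assumptions.

```python
class Endpoint:
    """Tracks a smoothed latency estimate and in-flight requests."""

    def __init__(self, name, alpha=0.3):
        self.name = name
        self.alpha = alpha      # weight given to the newest observation
        self.ewma_ms = 0.0
        self.pending = 0

    def observe(self, latency_ms):
        # Standard EWMA update: blend the new sample into the estimate.
        self.ewma_ms = self.alpha * latency_ms + (1 - self.alpha) * self.ewma_ms

    def cost(self):
        # Penalise endpoints that are slow or already busy.
        return self.ewma_ms * (self.pending + 1)


def pick(endpoints):
    """Choose the endpoint with the lowest estimated cost."""
    return min(endpoints, key=lambda endpoint: endpoint.cost())


fast, slow = Endpoint("a"), Endpoint("b")
for _ in range(5):
    fast.observe(10.0)   # consistently quick replies
    slow.observe(200.0)  # consistently slow replies

chosen = pick([fast, slow])
print(chosen.name)
```

Unlike round-robin, this keeps traffic away from an endpoint that is up but slow, which is the failure mode connection-level balancing cannot see.
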
  40. github.com/linkerd/linkerd • Donated to CNCF in January 2017! • K8s Ingress API in the next release • gRPC and HTTP/2 battle testing • Fine-grained client policy API • Hitting 1.0 this next month! • Help us test RC1 this week
  41. github.com/linkerd/linkerd-tcp • Lightweight, service-discovery-aware TCP LB • Supports endpoint weighting • Modern TLS: ALPN, SNI, forward secrecy, … • Written in Rust: native, safe, fast, & tiny! • Currently beta: get involved!
  42. linkerd.io slack.linkerd.io monzo.com/careers