High Reliability Infrastructure migrations

Julia Evans
December 11, 2018

For companies with high availability requirements (99.99% uptime or higher), running new software in production comes with a lot of risks. But it’s possible to make significant infrastructure changes while maintaining the availability your customers expect! I’ll give you a toolbox for derisking migrations and making infrastructure changes with confidence, with examples from our Kubernetes & Envoy experience at Stripe.

Transcript

  1. high reliability infrastructure migrations

    Julia Evans (@b0rk)
  2. about me: infrastructure engineer at Stripe (payments company, billions of dollars)

    1 year
  3. our challenges: 10 millions, sub-millisecond latency, RELIABILITY, SECURITY

  4. 99.99% uptime = about 1 minute of downtime per week

  5. we made 2 changes:

    move some workloads to Kubernetes; use Envoy for all service-to-service networking
  6. [illustration, no legible text]

  7. reality: this was me

  8. [illegible] on this

  9. the problem is: [diagram labels: normal, WHAT]

  10. what could go wrong: [diagram labels: 99.99, normal, WHAT]

  11. the goal: [diagram labels: FIXED]

  12. how to get there:

    understand the design; run game days; classify your failures; have incidents only once; make incremental changes; have a rollback plan
  13. Understand Kubernetes design

  14. understand the design: K8s design [diagram labels: etcd, illegible, everything else]

  15. understand the design: ignore most new software. Kubernetes, Envoy, that's

  16. theory isn't enough

  17. learn how 1 system breaks

  18. cause problems on purpose

  19. Run game days

    game days: test how your system behaves under known failures; let you learn without duress; share knowledge
  20. Run game days

    terminate an etcd instance; push invalid configuration; destroy all apiserver instances (or just 1); container registry outage; take down Envoy control plane. Run these in QA but also in
  21. Run game days

    Kubernetes terminated every running pod in the cluster (pod eviction). We fixed it, then tested the fix
  22. classify your failure modes

  23. classify your failure modes: at the beginning

  24. all our failure modes:

    containers don't start; permissions errors or networking issues
  25. learn your failure mode

    reasons pods don't start (so many reasons): IAM rate limiting; scheduler bug; etcd is down; lots more
  26. classification, monitoring: heartbeat jobs

  27. Have every incident only once

  28. Have incidents only once

    if you don't understand your incidents, they will happen again
  29. Your problem space: [diagram labels: normal, WHAT]

  30. Your problem space: [diagram labels: normal, WHAT]

  31. Have incidents only once

    find a problem; find causes; implement remediations; problem never comes back (usually)
  32. Fix categories of incidents

  33. some Envoy issues: [partly illegible] thundering

  34. all HTTP/1.1 conn pool issues

  35. solution: use HTTP/2. Envoy is designed for HTTP/2

  36. [illustration, no legible text]

  37. Have incidents only once

    tell your coworkers what you learned: incident reports. Example: etcd, EBS issue, leader elections
  38. Have incidents only once

  39. Have incidents only once

    incidents teach you how to build a reliable system
  40. make incremental changes

  41. make incremental changes

    5% of traffic; 1 host; a non-critical service
  42. make incremental changes: establish an interface boundary

  43. make incremental changes [diagram, no legible text]

  44. make incremental changes [diagram labels: client]

  45. make incremental changes [diagram, no legible text]

  46. make incremental changes: no haunted forests

  47. make incremental changes: don't expose Kubernetes to developers

  48. reduce cognitive load; reduce support burden

  49. escape from YAML: skycfg

  50. YAML: what other attributes are supported? what k8s config does it generate?

    name: missing-review-finder
    owner: risk
    schedule: 30 0 * * *
    disabled: false
    command:
      - ruby
      - scripts/cron/risk-missing-review-finder
  51. code:

    return stripe_service(
        image = default_image,
        command = einhorn(
            henson_service = "home-srv",
            script = "home/srv.rb",
            workers = 8,
            port = 9768,
        ),
        iam_role = "homesrv.kube.%s.%s" % (
            ctx.vars["stripe.cluster"],
            ctx.vars["stripe.environment"],
        ),
        replicas = 3,
        cpu = kube.cores(4),
        mem = kube.gigabytes(16),
        block_egress = False,
    )
  52. subset of Python; typechecked; sandboxed

  53. github.com/stripe/skycfg

  54. always have a rollback plan

  55. have a rollback plan

  56. [illustration, no legible text]

  57. playbook:

    understand the design; run game days; classify your failures; have incidents only once; make incremental changes; have a rollback plan
  58. culture and leadership: it's ok to start out not being an expert, but you need to become one

    build an engine of learning; building that expertise takes time
  59. managers: make space for your team to learn

    jay, manager
  60. thanks a lot! ps: we're hiring in Seattle

    stripe.com/jobs
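
Slide 4 puts "99.99", "1 minute" and "1 week" side by side. The arithmetic behind that pairing can be checked with a short sketch. The function name and the choice of a one-week window are assumptions made here for illustration, not something taken from the talk.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds


def downtime_budget_seconds(availability_percent: float,
                            period_seconds: int = SECONDS_PER_WEEK) -> float:
    """Return the downtime (in seconds) allowed in a period at a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    # The budget is the fraction of the period that is allowed to be "down".
    return period_seconds * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_seconds(target)
        print(f"{target}% availability: {budget:.1f} s of downtime per week")
```

At 99.99% the weekly budget comes to roughly 60 seconds, which is why a single bad deploy can consume it.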
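
Slide 41 suggests starting a migration with a small share of traffic. One common way to do that, sketched below under stated assumptions, is deterministic hash bucketing: each request key always lands in the same bucket, so raising the percentage only adds keys to the new path, and setting it back to 0 acts as a rollback. The names `in_rollout` and `choose_backend` are hypothetical, and this is not a description of how Stripe or Envoy splits traffic.

```python
import hashlib


def in_rollout(request_key: str, percent: float) -> bool:
    """Deterministically decide whether a key belongs to the rollout cohort."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    # Map the key to one of 10,000 buckets (0.01% granularity).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


def choose_backend(request_key: str, percent_new: float) -> str:
    """Route a key to the new path only if it falls inside the rollout cohort."""
    return "new" if in_rollout(request_key, percent_new) else "old"


if __name__ == "__main__":
    keys = [f"request-{i}" for i in range(10_000)]
    on_new = sum(choose_backend(k, 5) == "new" for k in keys)
    print(f"{on_new} of {len(keys)} sample keys routed to the new path at 5%")
```

Because the bucket depends only on the key, a key that is in the cohort at 5% stays in it at 50%, which keeps the behaviour of any one caller stable as the rollout widens.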
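
Slide 50 asks what Kubernetes config the short cron definition generates. The sketch below shows one plausible expansion into a Kubernetes CronJob manifest. The input field names come from the slide; the mapping itself, the `owner` label, the `batch/v1` API version, the restart policy, and the image name are assumptions for illustration, not Stripe's actual generator.

```python
import json

# The short job definition from slide 50, as a Python dict.
EXAMPLE_JOB = {
    "name": "missing-review-finder",
    "owner": "risk",
    "schedule": "30 0 * * *",
    "disabled": False,
    "command": ["ruby", "scripts/cron/risk-missing-review-finder"],
}


def cronjob_manifest(job: dict, image: str) -> dict:
    """Expand a short job definition into a Kubernetes CronJob manifest (assumed mapping)."""
    container = {
        "name": job["name"],
        "image": image,  # hypothetical: the slide does not say where the image comes from
        "command": list(job["command"]),
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": job["name"], "labels": {"owner": job["owner"]}},
        "spec": {
            "schedule": job["schedule"],
            "suspend": bool(job.get("disabled", False)),  # assumed meaning of "disabled"
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "restartPolicy": "OnFailure",
                            "containers": [container],
                        }
                    }
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = cronjob_manifest(EXAMPLE_JOB, image="example.invalid/ruby-app:latest")
    print(json.dumps(manifest, indent=2))
```

Seven lines of input become several levels of nesting, which illustrates the cognitive-load argument on slides 47 and 48 for keeping the raw Kubernetes config behind an interface.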