High Reliability Infrastructure migrations

Julia Evans
December 11, 2018

For companies with high availability requirements (99.99% uptime or higher), running new software in production comes with a lot of risks. But it’s possible to make significant infrastructure changes while maintaining the availability your customers expect! I’ll give you a toolbox for derisking migrations and making infrastructure changes with confidence, with examples from our Kubernetes & Envoy experience at Stripe.

Transcript

  1. We made 2 changes: move some workloads to Kubernetes; use Envoy for all service-to-service networking
  3. How to get there: understand the design; run game days; classify your failures; have incidents only once; make incremental changes; have a rollback
  4. Run game days: game days test how your system behaves under known failures; let you learn without duress; share knowledge
  5. Run game days: terminate an etcd instance; push invalid configuration; destroy all apiserver instances, or just 1; container registry outage; take down Envoy control plane. Run these in QA but also in
  6. Learn your failure mode. Reasons pods don't start: IAM rate limiting; scheduler bug; etcd is down; so many reasons; lots more
  7. Have incidents only once: find a problem; find causes; implement remediations; problem never comes back (usually)
  9. Have incidents only once: tell your coworkers what you learned; incident reports; example: etcd EBS issue; leader elections
  10. YAML: what other attributes are supported? what k8s config does it generate?

        name: missing-review-finder
        owner: risk
        schedule: 30 0 * * *
        disabled: false
        command:
          - ruby
          - scripts/cron/risk-missing-review-finder
  11. Code:

        return stripe_service(
            image = default_image,
            command = einhorn(
                henson_service = "home-srv",
                script = "home/srv.rb",
                workers = 8,
                port = 9768,
            ),
            iam_role = "homesrv.kube.%s.%s" % (
                ctx.vars["stripe.cluster"],
                ctx.vars["stripe.environment"],
            ),
            replicas = 3,
            cpu = kube.cores(4),
            mem = kube.gigabytes(16),
            block_egress = False,
        )
  12. Playbook: understand the design; run game days; classify your failures; have incidents only once; make incremental changes; have a rollback
  13. Culture / leadership: it's ok to start out not being an expert, but you need to become one; build an engine of learning; building that expertise takes time
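
The "terminate an etcd instance" game day is easier to run when the expected outcome is written down first. etcd uses Raft, which needs a majority of members to stay available. The sketch below works through that arithmetic in Python. The cluster sizes are illustrative assumptions (the slides do not say how many etcd members were running), and the function names are hypothetical.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def survives(members: int, terminated: int) -> bool:
    """True if the cluster still has a majority after losing `terminated` members."""
    return members - terminated >= quorum(members)


# Illustrative cluster sizes, not taken from the talk.
for size in (3, 5):
    tolerated = size - quorum(size)
    print(f"{size} members: quorum {quorum(size)}, tolerates {tolerated} failure(s)")

print(survives(3, 1))  # True: 2 of 3 members is still a majority
print(survives(3, 2))  # False: 1 of 3 members is not a majority
```

If a game day contradicts the prediction (for example, the cluster becomes unavailable after losing a single member of three), that gap is exactly the kind of failure mode worth classifying.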
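
The YAML slide asks "what k8s config does it generate?". The slides do not show the generator, so the following is only a sketch of what such an expansion could look like, written in Python. The CronJob field names are standard Kubernetes ones, but the mapping itself is an assumption: `to_cronjob`, the placeholder image, the `owner` label, and the idea that `disabled` maps to `suspend` are all hypothetical.

```python
import json

# Fields copied from the YAML example on the slide.
cron = {
    "name": "missing-review-finder",
    "owner": "risk",
    "schedule": "30 0 * * *",
    "disabled": False,
    "command": ["ruby", "scripts/cron/risk-missing-review-finder"],
}

# Hypothetical placeholder: the slide does not say which image the job runs in.
DEFAULT_IMAGE = "example.invalid/cron-runner:latest"


def to_cronjob(cron: dict, image: str = DEFAULT_IMAGE) -> dict:
    """Expand a short cron definition into a Kubernetes CronJob manifest (as a dict)."""
    return {
        "apiVersion": "batch/v1",  # CronJob API group/version in current Kubernetes
        "kind": "CronJob",
        "metadata": {
            "name": cron["name"],
            # Assumption: ownership is recorded as a label.
            "labels": {"owner": cron["owner"]},
        },
        "spec": {
            "schedule": cron["schedule"],
            # Assumption: `disabled` corresponds to the CronJob `suspend` field.
            "suspend": cron["disabled"],
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [
                                {
                                    "name": cron["name"],
                                    "image": image,
                                    "command": list(cron["command"]),
                                }
                            ],
                        }
                    }
                }
            },
        },
    }


manifest = to_cronjob(cron)
print(json.dumps(manifest, indent=2))
```

Nine lines of YAML expand into a deeply nested manifest, which is one reason to make the generated output easy to inspect when you are trying to understand the design.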
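
"Make incremental changes" and "have a rollback" can be combined into a single loop: move a little traffic, check health, and return to the old system as soon as a step looks bad. The Python sketch below is one hedged way to express that idea, not the process used in the talk. The step percentages, the error threshold, and the `error_rate_at` callback are all hypothetical.

```python
from typing import Callable, List, Tuple


def incremental_rollout(
    steps: List[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float = 0.0001,  # hypothetical threshold, loosely mirroring a 99.99% target
) -> Tuple[int, List[int]]:
    """Shift traffic to the new system step by step, rolling back on a bad step.

    steps: increasing percentages of traffic to send to the new system.
    error_rate_at: returns the error rate observed at a given percentage.
    Returns (final percentage, list of percentages that were applied in order).
    """
    current = 0
    applied: List[int] = []
    for percent in steps:
        current = percent
        applied.append(current)
        if error_rate_at(current) > max_error_rate:
            # Rollback: send all traffic back to the old system and stop.
            current = 0
            applied.append(current)
            break
    return current, applied


# A migration that stays healthy reaches 100%.
print(incremental_rollout([1, 5, 25, 100], lambda p: 0.0))
# → (100, [1, 5, 25, 100])

# A migration that degrades at 25% is rolled back to 0%.
print(incremental_rollout([1, 5, 25, 100], lambda p: 0.0 if p < 25 else 0.02))
# → (0, [1, 5, 25, 0])
```

Keeping the steps small limits how many requests are exposed to a new failure mode before the rollback fires.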