
High Reliability Infrastructure migrations

Julia Evans
December 11, 2018


For companies with high availability requirements (99.99% uptime or higher), running new software in production comes with a lot of risks. But it’s possible to make significant infrastructure changes while maintaining the availability your customers expect! I’ll give you a toolbox for derisking migrations and making infrastructure changes with confidence, with examples from our Kubernetes & Envoy experience at Stripe.


Transcript











































  1. high reliability
    infrastructure
    migrations
    JULIA EVANS
    @b0rk











































  2. about me
    infrastructure
    engineer
    stripe
    payments company
    billions of dollars / year











































  3. our challenges
    10millions
    sub-millisecond latency
    RELIABILITY
    SECURITY











































  4. 99.99%
    1 minute / week
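
The arithmetic behind this slide, as a minimal sketch: a 99.99% availability target leaves 0.01% of any period as a downtime budget, which comes to roughly one minute per week. The function name `downtime_budget_minutes` is made up for this illustration.

```python
# Downtime allowed by an availability target over a given period.
MINUTES_PER_WEEK = 7 * 24 * 60  # 10080


def downtime_budget_minutes(availability_percent, period_minutes=MINUTES_PER_WEEK):
    """Return the minutes of downtime permitted in the period."""
    unavailable_fraction = 1 - availability_percent / 100
    return period_minutes * unavailable_fraction


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% -> {budget:.2f} minutes / week")
```

Each extra nine cuts the budget by a factor of ten, which is why a single bad deploy can use up the whole week's allowance.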











































  5. we made 2 changes
    move some workloads
    to Kubernetes
    use Envoy for all
    service-to-service networking






















































































  7. reality
    this was me






















































































  9. the problem is
    normal
    WHAT











































  10. what could go wrong
    99.99%
    normal
    WHAT











































  11. the goal
    FIXED











































  12. how to get there
    understand the design
    run game days
    classify your failures
    have incidents only once
    make incremental changes
    have a rollback












































  13. Understand
    Kubernetes
    design












































  14. understand the design
    K8s design
    etcd
    everything
    else











































  15. understand the design
    ignore most
    new software
    Kubernetes Envoy
    that's












































  16. theory isn't enough











































  17. learn how
    your system
    breaks











































  18. cause problems
    on purpose











































  19. Run game days
    game days
    test how your system behaves
    under known failures
    let you learn without duress
    share knowledge











































  20. Run game days
    terminate an etcd instance
    push invalid configuration
    destroy all apiserver instances (or just 1)
    container registry outage
    take down Envoy control plane
    Run these in QA but also in
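
One scenario from the list above, terminating an etcd instance, can be rehearsed on paper before touching a real cluster. The following is a toy simulation under stated assumptions: `ToyCluster` and `run_game_day` are hypothetical names, nothing here calls real etcd or Kubernetes, and the only behaviour modelled is that a majority of members must be alive for the store to keep accepting writes.

```python
# Toy game day: terminate members of a replicated store and check
# whether a majority (quorum) is still alive. This is a simulation,
# not a call into real etcd or Kubernetes.


class ToyCluster:
    def __init__(self, member_names):
        self.alive = {name: True for name in member_names}

    def terminate(self, name):
        """Inject the failure: mark one member as down."""
        self.alive[name] = False

    def has_quorum(self):
        """A majority of members must be alive to accept writes."""
        alive_count = sum(self.alive.values())
        return alive_count > len(self.alive) // 2


def run_game_day(cluster, victims):
    """Terminate each victim in turn and record whether quorum survived."""
    observations = []
    for name in victims:
        cluster.terminate(name)
        observations.append((name, cluster.has_quorum()))
    return observations


if __name__ == "__main__":
    cluster = ToyCluster(["etcd-0", "etcd-1", "etcd-2"])
    for victim, ok in run_game_day(cluster, ["etcd-0", "etcd-1"]):
        print(f"terminated {victim}: quorum {'held' if ok else 'LOST'}")
```

Writing down the expected observation before each step is the useful part: a real game day then becomes a check of whether the system matches that prediction.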











































  21. Run game days
    Kubernetes terminated
    every running pod
    in the cluster (pod eviction)
    We fixed it, then tested the
    fix











































  22. classify your
    failure modes












































  23. classify your failure modes
    at the beginning











































  24. all our failure modes
    containers don't start
    permissions errors
    networking issues











































  25. learn your failure mode
    Reasons pods don't start
    IAM rate limiting
    scheduler bug
    etcd is down
    lots more
    so many reasons
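
A small sketch of what classifying failures could look like in practice: free-text failure notes are sorted into the three broad failure modes named on the slides. The keyword lists and the names `FAILURE_MODES`, `classify`, and `tally` are assumptions made for this illustration, not something taken from the talk.

```python
# Sort free-text failure notes into the broad failure modes named on
# the slides. The keyword lists are illustrative assumptions.
from collections import Counter

FAILURE_MODES = {
    "containers don't start": ("scheduler", "etcd", "rate limit", "image pull"),
    "permissions errors": ("permission", "forbidden", "unauthorized"),
    "networking issues": ("timeout", "connection reset", "dns"),
}


def classify(note):
    """Return the first failure mode whose keywords appear in the note."""
    text = note.lower()
    for mode, keywords in FAILURE_MODES.items():
        if any(keyword in text for keyword in keywords):
            return mode
    return "unclassified"


def tally(notes):
    """Count how many notes fall into each failure mode."""
    return Counter(classify(note) for note in notes)


if __name__ == "__main__":
    notes = [
        "pod stuck pending: scheduler bug",
        "IAM rate limit hit while fetching credentials",
        "etcd is down",
        "403 Forbidden from the API",
        "DNS lookup failed",
    ]
    for mode, count in tally(notes).items():
        print(f"{mode}: {count}")
```

Anything that lands in "unclassified" is a prompt to either add a category or investigate further.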











































  26. classification
    monitoring
    heartbeat jobs
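
The idea of a heartbeat job can be sketched as follows: a trivial job runs on a schedule and records each success, and a checker flags any job that has gone quiet for too long. `HeartbeatMonitor` and the job names are hypothetical; a real setup would feed an alerting system, not a Python object.

```python
# Heartbeat monitoring sketch: each job records when it last ran
# successfully, and a checker flags jobs that have gone quiet.
import time


class HeartbeatMonitor:
    def __init__(self, max_silence_seconds):
        self.max_silence_seconds = max_silence_seconds
        self.last_seen = {}

    def beat(self, job_name, now=None):
        """Record a successful run of job_name."""
        self.last_seen[job_name] = time.time() if now is None else now

    def stale_jobs(self, now=None):
        """Return jobs whose last heartbeat is older than the threshold."""
        now = time.time() if now is None else now
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.max_silence_seconds
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(max_silence_seconds=120)
    monitor.beat("heartbeat-cluster-a", now=1000)
    monitor.beat("heartbeat-cluster-b", now=1100)
    print(monitor.stale_jobs(now=1150))
```

A missing heartbeat says that something in the chain (scheduling, image pull, startup, network) is broken, even when no individual component reports an error.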











































  27. Have
    every
    incident
    only once











































  28. Have incidents only once
    If you don't understand
    your incidents, they
    will happen again











































  29. Your problem space
    normal
    WHAT











































  30. Your problem space
    normal
    WHAT











































  31. Have incidents only once
    Find a problem
    Find causes
    Implement remediations
    Problem never comes back
    (usually)











































  32. Fix categories
    of incidents












































  33. some Envoy issues
    thundering











































  34. all HTTP/1.1 conn pool issues











































  35. solution
    Use HTTP/2
    Envoy is designed for HTTP/2
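
A rough way to see why HTTP/2 eases connection-pool pressure: HTTP/1.1 carries one in-flight request per connection (ignoring pipelining), while HTTP/2 multiplexes many concurrent streams over a single connection. The sketch below only does that arithmetic. The figure of 100 streams per connection is an assumed value, `connections_needed` is a hypothetical helper, and nothing here models Envoy's actual pooling behaviour.

```python
# Connections needed to carry N concurrent requests to one upstream.
import math


def connections_needed(concurrent_requests, streams_per_connection):
    """streams_per_connection is 1 for HTTP/1.1 without pipelining."""
    if concurrent_requests <= 0:
        return 0
    return math.ceil(concurrent_requests / streams_per_connection)


if __name__ == "__main__":
    in_flight = 250
    print("HTTP/1.1:", connections_needed(in_flight, 1))
    print("HTTP/2:  ", connections_needed(in_flight, 100))
```

Fewer connections means fewer places for pool limits, connection churn, and stale connections to cause trouble.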






















































































  37. Have incidents only once
    tell
    your coworkers
    what you
    learned
    incident reports
    example: etcd EBS issue
    leader elections











































  38. Have incidents only once











































  39. Have incidents only once
    incidents teach you
    how to build a
    reliable system












































  40. make
    incremental
    changes












































  41. make incremental changes
    5% of traffic
    1 host
    a non-critical
    service
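
Sending a small share of traffic to the new system can be done deterministically by hashing a stable key into buckets. This is a minimal sketch of that general technique, not a description of how Stripe routed traffic; `bucket` and `use_new_path` are hypothetical names.

```python
# Deterministic percentage rollout: hash a stable key into 100 buckets
# and send the lowest buckets to the new system.
import hashlib


def bucket(key):
    """Map a key to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def use_new_path(key, percent):
    """True if this key falls inside the rollout percentage."""
    return bucket(key) < percent


if __name__ == "__main__":
    keys = [f"request-{i}" for i in range(10_000)]
    share = sum(use_new_path(k, 5) for k in keys) / len(keys)
    print(f"share on new path: {share:.1%}")
```

Because the bucket depends only on the key, raising the percentage adds keys to the new path without moving any that were already there, and setting it to 0 sends everything back to the old path.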











































  42. make incremental changes
    establish an
    interface boundary












































  43. make incremental changes











































  44. make incremental changes
    client











































  45. make incremental changes











































  46. make incremental changes
    no haunted forests











































  47. make incremental changes
    don't expose
    Kubernetes to
    developers












































  48. reduce cognitive load
    reduce support burden












































  49. escape from
    YAML
    skycfg
    fun

















  50. YAML
    what other attributes are supported?
    what k8s config does it generate?
    name: missing-review-finder
    owner: risk
    schedule: 30 0 * * *
    disabled: false
    command:
      - ruby
      - scripts/cron/risk-missing-review-finder



  51. code
    return stripe_service(
        image = default_image,
        command = einhorn(
            henson_service = "home-srv",
            script = "home/srv.rb",
            workers = 8,
            port = 9768,
        ),
        iam_role = "homesrv.kube.%s.%s" % (
            ctx.vars["stripe.cluster"],
            ctx.vars["stripe.environment"],
        ),
        replicas = 3,
        cpu = kube.cores(4),
        mem = kube.gigabytes(16),
        block_egress = False,
    )











































  52. subset of Python
    typechecked
    sandboxed
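
The two questions on the YAML slide ("what other attributes are supported", "what k8s config does it generate") are easier to answer when the config is a function. The sketch below is plain Python, not skycfg's API: `cron_job` is a hypothetical helper, the image name is a placeholder, and the output follows the shape of a Kubernetes `CronJob` manifest. The supported attributes are the function's parameters, and the generated config is its return value.

```python
# Expand a short cron-job description into a Kubernetes CronJob manifest.
# Plain Python for illustration; this is not skycfg's API.


def cron_job(name, owner, schedule, command, disabled=False,
             image="example.invalid/app:latest"):  # hypothetical image
    """Validate the inputs and return the manifest as a dict."""
    if len(schedule.split()) != 5:
        raise ValueError(f"schedule needs 5 cron fields, got {schedule!r}")
    if not command or not all(isinstance(arg, str) for arg in command):
        raise TypeError("command must be a non-empty list of strings")
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name, "labels": {"owner": owner}},
        "spec": {
            "schedule": schedule,
            "suspend": disabled,
            "jobTemplate": {"spec": {"template": {"spec": {
                "restartPolicy": "Never",
                "containers": [
                    {"name": name, "image": image, "command": list(command)},
                ],
            }}}},
        },
    }


if __name__ == "__main__":
    import json

    manifest = cron_job(
        name="missing-review-finder",
        owner="risk",
        schedule="30 0 * * *",
        command=["ruby", "scripts/cron/risk-missing-review-finder"],
    )
    print(json.dumps(manifest, indent=2))
```

A malformed schedule or command fails when the config is evaluated, before anything reaches the cluster.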











































  53. github.com/stripe/skycfg
    skycfg
    fun











































  54. always have a
    rollback plan

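
One way to make a rollback plan concrete is to build the rollback into the rollout itself. The sketch below advances through traffic stages and reverts to 0% as soon as the observed error rate crosses a threshold. `staged_rollout` is a hypothetical helper, and `observe_error_rate` is a stand-in callback for real monitoring; the stages and threshold are illustrative values.

```python
# Staged rollout with an automatic rollback. `observe_error_rate` is a
# hypothetical callback standing in for real monitoring.


def staged_rollout(stages, observe_error_rate, max_error_rate):
    """Advance through stages; return (final_percent, history)."""
    history = []
    for percent in stages:
        error_rate = observe_error_rate(percent)
        history.append((percent, error_rate))
        if error_rate > max_error_rate:
            # Rollback: send all traffic back to the old path.
            return 0, history
    return stages[-1], history


if __name__ == "__main__":
    # Pretend the new path starts failing once it takes 25% of traffic.
    def fake_errors(percent):
        return 0.001 if percent < 25 else 0.05

    final, history = staged_rollout([1, 5, 25, 100], fake_errors, 0.01)
    print("final percent:", final)
    print("history:", history)
```

The rollback path is worth exercising before it is needed, for example during a game day.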











































  55. have a rollback plan











































  56. a rollback plan











































  57. playbook
    understand the design
    run game days
    classify your failures
    have incidents only once
    make incremental changes
    have a rollback












































  58. culture leadership
    it's ok to start out not
    being an expert
    but you need to become one
    build an engine of learning
    building that expertise takes time












































  59. managers: make space for
    your team to learn











































  60. thanks
    a lot
    ps: we're hiring in Seattle
    stripe.com/jobs