The Road to Success is paved with Small Improvements

This talk discusses Etsy's architectural past and how it evolved over the years. It also goes into detail about the tooling we wrote and leverage to make our continuous deployment culture work. It then describes our culture of blameless postmortems and how we strive to be a learning organization that integrates learning into the process of incident response and handling.

Daniel Schauenberg

January 14, 2015

Transcript

  1. The Road to Success
    is paved with
    Small Improvements
    Daniel Schauenberg • [email protected] • @mrtazz

  2. The Monolith

  3. We deploy quite a lot

  4. How comfortable
    are you
    deploying a change
    right now

  5. MTTR
    trumps
    MTBF

  6. If this is your
    first day at Etsy
    you deploy
    the site

  7. The Dark Past

  8. Hindsight is 20/20
    There would be no Etsy
    I wasn't around for this
    (the grain of salt disclaimer)

  9. dark
    less fun

  10. Architecture Overview
    Ubuntu
    Postgresql
    Lighttpd
    PHP/Python

  11. Single Big Database

  12. Business Logic
    in
    Stored Procedures

  13. DEV ! DBA ! OPS

  14. Stored Procedure
    Routing
    Middleware

  15. "Now you don't have
    to touch the
    database"

  16. A software
    manifestation of
    silos

  17. Site uptime
    wasn't good

  18. Trust
    the people

  19. Horizontal
    Scaling
    (A single box only scales so far)

  20. Master-Master
    replicated MySQL
    Shards

  21. Deploy
    !=
    Release

  22. Feature Flags

  23. if (Feature::isEnabled($feature)) {
    // new hawtness
    } else {
    // nothing to see here
    }
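
The idea behind "Deploy != Release" can be sketched in a few lines: code ships dark and a configuration lookup decides who sees it. This is a minimal illustration only, not Etsy's Feature API; the `FLAGS` table, the `percent` field, and `is_enabled` are hypothetical names, and the percentage rollout is an assumption added to show a gradual release.

```python
import zlib

# Hypothetical flag configuration; in practice this would live in a config
# file that is deployed together with the code.
FLAGS = {
    "new_checkout": {"enabled": True, "percent": 100},
    "new_search": {"enabled": True, "percent": 10},
    "old_experiment": {"enabled": False, "percent": 100},
}


def is_enabled(feature, user_id=0, flags=FLAGS):
    """Return True if the feature is released for this user."""
    config = flags.get(feature)
    if config is None or not config["enabled"]:
        return False  # unknown or disabled flags stay dark
    # Stable bucket in [0, 100) so a user gets the same answer on every request.
    bucket = zlib.crc32(f"{feature}:{user_id}".encode()) % 100
    return bucket < config["percent"]


if is_enabled("new_search", user_id=42):
    pass  # new code path
else:
    pass  # existing behaviour
```

Releasing then means editing the flag configuration rather than deploying new code, and turning a misbehaving feature off is the same small change in reverse.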

  24. repo open to
    everyone
    everybody can
    make changes
    ~30 regular
    contributors

  25. knife-spork
    knife-flip
    knife-lastrun
    knife-preflight
    knife-wip
    chef-whitelist

  26. Feature Flags
    in Chef

  27. if node.is_in_whitelist? "apache_upgrade"
    # new hawtness
    else
    # boring old stuff
    end

  28. Confidence
    in Dev Environment

  29. Developer VMs
    Every engineer has one
    Chef’d with the Etsy Stack
    Different sizes and Chef roles

  30. Continuous
    Integration

  31. QA and Unit
    Staging Smoker
    Prod Smoker

  32. > 2000 tests

  33. Continuous
    Integration as the
    bottleneck of
    deployment

  34. How to keep it fast?

  35. The Bobs
    (our builders)
    LXC Containers
    1 Jenkins Agent per Container
    4 containers per SSD

  36. > try
    https://github.com/etsy/Trylib

  37. The power of the CI
    Cluster
    The ease of
    your local
    environment

  38. Thanks
    Mozilla

  39. Deployinator

  40. Simple
    Web Application

  41. Ganglia
    &
    Graphite

  42. StatsD
    "how do you know this works in
    production?"

  43. Statsd::increment("foo")
    Statsd::timing("bar", 10)
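
For readers unfamiliar with what those two calls put on the wire, here is a rough Python sketch of the StatsD plain-text protocol: a counter is sent as `name:value|c` and a timer as `name:value|ms`, each in a UDP datagram. The function names and the default host are assumptions for illustration; 8125 is the port StatsD listens on by default.

```python
import socket


def counter_payload(name, value=1):
    """Format a StatsD counter increment, e.g. 'foo:1|c'."""
    return f"{name}:{value}|c"


def timing_payload(name, milliseconds):
    """Format a StatsD timer in milliseconds, e.g. 'bar:10|ms'."""
    return f"{name}:{milliseconds}|ms"


def send(payload, host="127.0.0.1", port=8125):
    """Send one metric over UDP without waiting for any reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("ascii"), (host, port))
```

Because UDP is fire-and-forget, instrumenting a code path this way adds very little overhead and does not fail the request when the metrics daemon is unreachable.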

  44. Dashboard
    Framework

  45. $graph = new Graph_Graphite_Simple(
    [
    'title' => 'Packets Received (nodejs)',
    'metrics' => 'stats.%s.packets_received',
    'limit_y_axis' => true,
    'stacked' => true,
    'width' => $width,
    ]
    );
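
A dashboard helper like this plausibly ends up producing a Graphite render URL. The sketch below shows one way that could look; it is not the dashboard framework's actual implementation. The `render_url` function, the `graphite.example.com` host, and the assumption that `%s` is filled with a host or cluster name are all hypothetical, while `target`, `title`, `width`, `from`, and `areaMode` are parameters of Graphite's render API.

```python
from urllib.parse import urlencode


def render_url(base_url, metric_template, source, title,
               width=600, stacked=False, since="-1h"):
    """Build a Graphite /render URL for a single metric."""
    params = [
        ("target", metric_template % source),  # fill the %s placeholder
        ("title", title),
        ("width", width),
        ("from", since),
    ]
    if stacked:
        params.append(("areaMode", "stacked"))
    return f"{base_url}/render?{urlencode(params)}"


url = render_url(
    "http://graphite.example.com",  # hypothetical Graphite host
    "stats.%s.packets_received",
    "web01",                        # hypothetical source name
    "Packets Received (nodejs)",
    stacked=True,
)
```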

  46. Deploy
    Dashboard

  47. “Finding your soul
    metric can take time.
    But in the end, it boils
    down to what
    matters most to your
    users.”
    Mathias Meyer, CEO Travis CI

  48. Find the subset of
    metrics people
    should look at after
    a deploy

  49. Stick them on a
    central place

  50. Syslog works really
    well

  51. Decide on a log format
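
The slide does not say which format was chosen. As one possible illustration, the sketch below uses space-separated `key=value` pairs, which stay readable in syslog and are easy to parse back; the helper names are hypothetical, and keys are assumed to contain neither spaces nor `=`.

```python
import shlex


def format_line(**fields):
    """Render fields as 'key=value' pairs, quoting values that need it."""
    parts = []
    for key, value in fields.items():
        parts.append(f"{key}={shlex.quote(str(value))}")
    return " ".join(parts)


def parse_line(line):
    """Parse a line produced by format_line back into a dict of strings."""
    return dict(token.split("=", 1) for token in shlex.split(line))


line = format_line(level="error", msg="checkout failed", status=500)
fields = parse_line(line)
```

Whatever format is picked, the benefit comes from every service using the same one, so a single parser can feed all logs into central tooling.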

  52. Awesome
    Tools

  53. Empower the
    Individual

  54. Information
    Overload

  55. Tools
    need
    attention

  56. realtalk:
    things break

  57. Complex
    Socio-Technical
    Systems

  58. “Knowledge and error
    flow from the same
    mental sources;
    only success can tell
    one from the other.”
    Ernst Mach, Erkenntnis und Irrtum (p. 116)

  59. Success and failure
    can only be
    determined
    a posteriori

  60. Things made sense at
    the time

  61. People don't come to
    work to do a bad job

  62. Nietzschean
    Anxiety

  63. So I always get off the hook
    whatever I do?

  64. “There is a difference
    between explaining
    and excusing human
    performance.”
    Sidney Dekker, The Field Guide to Understanding
    Human Error (p. 196)

  65. Blameless
    Postmortems

  66. Everybody
    is Invited

  67. What
    happened?

  68. Describe the past
    Don't excuse it away

  69. The Facilitator

  70. Guide the Discussion

  71. Look out for indicators of
    Old View thinking

  72. Counterfactuals

  73. she should have
    if they just had
    if he would have
    you failed to

  74. Hindsight Bias
    Confirmation Bias
    Outcome Bias

  75. there are many
    more

  76. Who is
    in charge?

  77. Taught Facilitator
    Course

  78. 3 x 90 minutes

  79. Remediation
    Items

  80. incorporate learning and
    takeaways from the
    meeting

  81. turn surprises into
    known factors

  82. https://github.com/etsy/morgue

  83. "Hey all, I just ran rm -rf
    $DIR/ and since the variable
    was empty I deleted my
    whole VM. This would have
    been bad in production.
    Don't do that."
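
The failure described here, an empty variable turning a scoped delete into a delete of everything, can be guarded against in code. In POSIX shell, writing `${DIR:?}` makes the command abort when the variable is unset or empty. The Python sketch below applies the same guard; `remove_tree` is a hypothetical helper, not something from the talk.

```python
import os
import shutil


def remove_tree(base_dir):
    """Delete a directory tree, refusing empty paths and the filesystem root."""
    if not base_dir or not base_dir.strip():
        raise ValueError("refusing to delete: directory variable is empty")
    target = os.path.realpath(base_dir)
    if target == os.path.realpath(os.sep):
        raise ValueError("refusing to delete the filesystem root")
    shutil.rmtree(target)
    return target
```

Sharing a surprise like this one turns it into a known factor, and a small guard such as the one above is the kind of remediation item that can come out of it.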

  84. Architecture
    Reviews

  85. Operability
    Reviews

  86. “It is also worth pointing
    out that the bias
    towards investigating
    failures rather than
    success itself
    represents a trade-off.”
    Erik Hollnagel, The ETTO Principle: Efficiency-Thoroughness
    Trade-Off

  87. Investigate
    Success

  88. Why did it work?

  89. Human Error is
    where you
    stopped looking

  90. Learning
    >
    Blaming

  91. How did we end up here?

  92. Overhauls
    &
    Iterations

  93. Culture
    &
    Tools
    (you can't really have one without the other)

  94. Humans are
    AWESOME

  95. Nobody comes
    to work to do
    a bad job

  96. Trust
    your
    Co-Workers

  97. There is a lot of
    knowledge in your
    engineering team

  98. Deploy
    (as often as it makes sense)

  99. Collaborate
    (even if you think you don't have to)

  100. Listen
    (to problems and experiences
    of your coworkers)

  101. codeascraft.com
    etsy.com/codeascraft/talks

  102. The Road to Success
    is paved with
    Small Improvements
    Daniel Schauenberg • [email protected] • @mrtazz
