The Road to Success is paved with Small Improvements

This talk discusses Etsy's architectural past and how it evolved over the years. It also goes into detail about the tooling we wrote and leverage to make our continuous deployment culture work. It then describes our culture of blameless postmortems and how we strive to be a learning organization that integrates learning into the process of incident response and handling.

Daniel Schauenberg

January 14, 2015

Transcript

  1. The Road to Success
    is paved with
    Small Improvements
    Daniel Schauenberg • [email protected] • @mrtazz

    View Slide

  5. The Monolith

    View Slide

  6. LAMP

    View Slide

  7. We deploy quite a lot

    View Slide

  8. How comfortable
    are you
    deploying a change
    right now

    View Slide

  9. MTTR
    trumps
    MTBF

    View Slide

  10. If this is your
    first day at Etsy
    you deploy
    the site

    View Slide

  12. The Dark Past

    View Slide

  14. Hindsight is 20/20
    There would be no Etsy
    I wasn't around for this
    (the grain of salt disclaimer)

    View Slide

  15. dark
    less fun

    View Slide

  16. Architecture Overview
    Ubuntu
    PostgreSQL
    Lighttpd
    PHP/Python

    View Slide

  17. Single Big Database

    View Slide

  18. Business Logic
    in
    Stored Procedures

    View Slide

  20. silos

    View Slide

  21. DEV ! DBA ! OPS

    View Slide

  22. Sprouter

    View Slide

  23. Stored Procedure
    Routing
    Middleware

    View Slide

  24. "Now you don't have
    to touch the
    database"

    View Slide

  25. A software
    manifestation of
    silos

    View Slide

  26. Site uptime
    wasn't good

    View Slide

  27. <3 uptime

    View Slide

  29. Trust
    the people

    View Slide

  31. Sprouter

    View Slide

  32. Master

    View Slide

  33. Horizontal
    Scaling
    (A single box only scales so far)

    View Slide

  34. Master-Master
    replicated MySQL
    Shards

    View Slide

  35. Flickr DNA

    View Slide

  36. Deploy
    !=
    Release

    View Slide

  37. Feature Flags

    View Slide

  38. if (Feature::isEnabled($feature)) {
        // new hawtness
    } else {
        // nothing to see here
    }

    View Slide
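
A minimal sketch of how a flag check like the one on slide 38 could separate deploy from release. It is written in Ruby rather than PHP, and everything in it is an assumption for illustration: the `FLAGS` table, the flag names, the `Feature.enabled?` signature, and the percentage rollout are hypothetical and are not taken from Etsy's code.

```ruby
require "zlib"

# Hypothetical flag table; names and percentages are illustrative only.
FLAGS = {
  "new_search"   => { enabled: true,  percent: 100 },
  "new_checkout" => { enabled: true,  percent: 10 },
  "old_idea"     => { enabled: false, percent: 100 }
}.freeze

module Feature
  # True when the flag is on and the user falls inside the rollout percentage.
  def self.enabled?(name, user_id, flags = FLAGS)
    flag = flags[name]
    return false if flag.nil? || !flag[:enabled]

    # CRC32 yields a stable bucket (0..99) per flag/user pair,
    # so a given user gets the same answer on every request.
    bucket = Zlib.crc32("#{name}:#{user_id}") % 100
    bucket < flag[:percent]
  end
end

if Feature.enabled?("new_checkout", 42)
  puts "new hawtness"
else
  puts "nothing to see here"
end
```

With a table like this, code for a feature can be deployed while its flag is off and released later by changing only the flag, which is one way to read "Deploy != Release".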

  39. Chef

    View Slide

  40. <3 Chef

    View Slide

  41. repo open to
    everyone
    everybody can
    make changes
    ~30 regular
    contributors

    View Slide

  43. knife-spork
    knife-flip
    knife-lastrun
    knife-preflight
    knife-wip
    chef-whitelist

    View Slide

  44. Feature Flags
    in Chef

    View Slide

  45. if node.is_in_whitelist? "apache_upgrade"
      # new hawtness
    else
      # boring old stuff
    end

    View Slide
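
A rough sketch of the idea behind a per-node whitelist check like the one on slide 45. This is not the chef-whitelist implementation: `WHITELIST`, the host patterns, and the `FakeNode` stand-in are all hypothetical, and a real Chef node object is not involved.

```ruby
# Hypothetical whitelist data: change name => host patterns allowed to receive it.
WHITELIST = {
  "apache_upgrade" => ["web01.example.com", "web1*.example.com"]
}.freeze

# Stand-in for a Chef node; only the hostname matters for this sketch.
FakeNode = Struct.new(:fqdn) do
  def is_in_whitelist?(change, whitelist = WHITELIST)
    patterns = whitelist.fetch(change, [])
    # File.fnmatch does shell-style glob matching on plain strings.
    patterns.any? { |pattern| File.fnmatch(pattern, fqdn) }
  end
end

node = FakeNode.new("web12.example.com")
if node.is_in_whitelist?("apache_upgrade")
  puts "new hawtness"
else
  puts "boring old stuff"
end
```

Widening the pattern list rolls a change out to more hosts without touching the recipe itself, which mirrors the feature-flag approach used for application code.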

  47. Confidence
    in Dev Environment

    View Slide

  48. Developer VMs
    Every engineer has one
    Chef’d with the Etsy Stack
    Different sizes and Chef roles

    View Slide

  49. Self
    Service

    View Slide

  50. Continuous
    Integration

    View Slide

  52. QA and Unit
    Staging Smoker
    Prod Smoker

    View Slide

  53. > 2000 tests

    View Slide

  54. Continuous
    Integration as the
    bottleneck of
    deployment

    View Slide

  55. How to keep it fast?

    View Slide

  56. SSDs!

    View Slide

  57. Not Quite

    View Slide

  58. The Bobs
    (our builders)
    LXC Containers
    1 Jenkins Agent per Container
    4 containers per SSD

    View Slide

  59. Try

    View Slide

  60. > try
    https://github.com/etsy/Trylib

    View Slide

  61. The power of the CI
    Cluster
    The ease of
    your local
    environment

    View Slide

  62. Thanks
    Mozilla

    View Slide

  63. IRC

    View Slide

  65. Deployinator

    View Slide

  68. Simple
    Web Application

    View Slide

  69. METRICS!

    View Slide

  71. Ganglia
    &
    Graphite

    View Slide

  73. StatsD
    "how do you know this works in
    production?"

    View Slide

  74. Statsd::increment("foo");
    Statsd::timing("bar", 10);

    View Slide
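
To show what calls like the two on slide 74 put on the wire, here is a small Ruby sketch of the plain-text StatsD line format (`<metric>:<value>|<type>`) sent over UDP. The `TinyStatsd` module and its method names are hypothetical, and the host and port are assumptions (8125 is the customary StatsD port).

```ruby
require "socket"

module TinyStatsd
  # Counter line: <metric>:<value>|c
  def self.counter_line(name, by = 1)
    "#{name}:#{by}|c"
  end

  # Timer line in milliseconds: <metric>:<value>|ms
  def self.timing_line(name, millis)
    "#{name}:#{millis}|ms"
  end

  # Fire-and-forget UDP send. Returns false instead of raising,
  # because a metrics failure should not take the application down.
  def self.send_line(line, host: "127.0.0.1", port: 8125)
    socket = UDPSocket.new
    socket.send(line, 0, host, port)
    true
  rescue SystemCallError, SocketError
    false
  ensure
    socket&.close
  end
end

TinyStatsd.send_line(TinyStatsd.counter_line("foo"))
TinyStatsd.send_line(TinyStatsd.timing_line("bar", 10))
```

Because the transport is UDP, the caller never waits for the metrics server, which keeps instrumentation cheap enough to add liberally.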

  75. Dashboard
    Framework

    View Slide

  76. $graph = new Graph_Graphite_Simple(
        [
            'title' => 'Packets Received (nodejs)',
            'metrics' => 'stats.%s.packets_received',
            'limit_y_axis' => true,
            'stacked' => true,
            'width' => $width,
        ]
    );

    View Slide

  77. Deploy
    Dashboard

    View Slide

  78. “Finding your soul
    metric can take time.
    But in the end, it boils
    down to what
    matters most to your
    users.”
    Mathias Meyer, CEO Travis CI

    View Slide

  79. Find the subset of
    metrics people
    should look at after
    a deploy

    View Slide

  80. Stick them in a
    central place

    View Slide

  81. Logs!

    View Slide

  82. Syslog works really
    well

    View Slide

  83. Decide on a log format

    View Slide
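
The slides do not say which log format was chosen. As one possible illustration of agreeing on a format, the Ruby sketch below emits a timestamp, a level, and sorted key=value pairs. The `format_log_line` helper and the field names are hypothetical.

```ruby
require "time"

# Hypothetical format: ISO 8601 timestamp, level, then sorted key="value" pairs.
def format_log_line(level, fields, now: Time.now.utc)
  pairs = fields.sort_by { |key, _| key.to_s }.map do |key, value|
    # Quote every value so embedded spaces stay parseable.
    "#{key}=#{value.to_s.inspect}"
  end
  ([now.iso8601, level.to_s.upcase] + pairs).join(" ")
end

puts format_log_line(:info, { event: "deploy", user: "alice" })
```

A fixed field order and quoting rule means the same line can be grepped by a person and parsed by a tool.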

  87. Awesome
    Tools

    View Slide

  88. Empower the
    Individual

    View Slide

  89. Information
    Overload

    View Slide

  90. Tools
    need
    attention

    View Slide

  91. Culture

    View Slide

  98. realtalk:
    things break

    View Slide

  99. New View

    View Slide

  100. Complex
    Socio-Technical
    Systems

    View Slide

  101. “Knowledge and error
    flow from the same
    mental sources;
    only success can
    tell one from
    the other.”
    Ernst Mach, Erkenntnis und Irrtum (p. 116)

    View Slide

  102. Success and failure
    can only be
    determined
    a posteriori

    View Slide

  103. Things made sense at
    the time

    View Slide

  104. People don't come to
    work to do a bad job

    View Slide

  105. Nietzschean
    Anxiety

    View Slide

  106. So I always get off the hook
    whatever I do?

    View Slide

  107. depends

    View Slide

  108. “There is a difference
    between explaining
    and excusing human
    performance.”
    Sidney Dekker, The Field Guide to Understanding
    Human Error (p. 196)

    View Slide

  109. Blameless
    Postmortems

    View Slide

  110. Open
    Meeting

    View Slide

  111. Everybody
    is Invited

    View Slide

  112. What
    happened?

    View Slide

  113. Timeline

    View Slide

  114. Describe the past
    Don't excuse it away

    View Slide

  115. The Facilitator

    View Slide

  116. Guide the Discussion

    View Slide

  117. Look out for indicators of
    Old View thinking

    View Slide

  118. Counterfactuals

    View Slide

  119. she should have
    if they just had
    if he would have
    you failed to

    View Slide

  120. Biases

    View Slide

  121. Hindsight Bias
    Confirmation Bias
    Outcome Bias

    View Slide

  122. there are many
    more

    View Slide

  123. Who is
    in charge?

    View Slide

  124. Etsy School

    View Slide

  125. Taught Facilitator
    Course

    View Slide

  126. 3 x 90 minutes

    View Slide

  127. Remediation
    Items

    View Slide

  128. incorporate learning and
    takeaways from the
    meeting

    View Slide

  130. turn surprises into
    known factors

    View Slide

  131. MORGUE

    View Slide

  139. https://github.com/etsy/morgue

    View Slide

  140. Near Miss

    View Slide

  141. "Hey all, I just ran rm -rf
    $DIR/ and since the variable
    was empty I deleted my
    whole VM. This would have
    been bad in production.
    Don't do that."

    View Slide
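
The near miss above comes from deleting under a path built from an empty variable. A hedged Ruby sketch of one possible guard follows; `safe_remove_contents` is a hypothetical helper, the root check assumes a Unix-like system, and the demo only touches a temporary directory.

```ruby
require "fileutils"
require "tmpdir"

# Refuses to delete when the directory is empty, unset, or the filesystem root.
def safe_remove_contents(dir)
  if dir.nil? || dir.strip.empty?
    raise ArgumentError, "refusing to delete: directory is empty or unset"
  end

  path = File.expand_path(dir)
  raise ArgumentError, "refusing to delete the filesystem root" if path == "/"

  # Remove each entry inside the directory, leaving the directory itself.
  Dir.children(path).each { |entry| FileUtils.rm_rf(File.join(path, entry)) }
end

# Demo on a throwaway directory so nothing real is touched.
Dir.mktmpdir do |tmp|
  File.write(File.join(tmp, "scratch.txt"), "x")
  safe_remove_contents(tmp)
  puts Dir.children(tmp).empty?
end
```

Failing loudly on an empty value turns the surprise from the quote into a known, checked condition.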

  142. Pre Mortem

    View Slide

  143. Architecture
    Reviews

    View Slide

  144. Operability
    Reviews

    View Slide

  145. “It is also worth pointing
    out that the bias
    towards investigating
    failures rather than
    success itself
    represents a trade-off.”
    Erik Hollnagel, The ETTO Principle: Efficiency-Thoroughness
    Trade-Off

    View Slide

  146. Investigate
    Success

    View Slide

  147. Why did it work?

    View Slide

  148. Human Error is
    where you
    stopped looking

    View Slide

  149. Learning
    >
    Blaming

    View Slide

  152. How did we end up here?

    View Slide

  154. Overhauls
    &
    Iterations

    View Slide

  155. Culture
    &
    Tools
    (you can't really have one without the other)

    View Slide

  156. Humans are
    AWESOME

    View Slide

  157. Nobody comes
    to work to do
    a bad job

    View Slide

  158. Trust
    your
    Co-Workers

    View Slide

  159. There is a lot of
    knowledge in your
    engineering team

    View Slide

  160. Deploy
    (as often as it makes sense)

    View Slide

  161. Collaborate
    (even if you think you don't have to)

    View Slide

  163. Listen
    (to problems and experiences
    of your coworkers)

    View Slide

  164. Thank you!

    View Slide

  165. codeascraft.com
    etsy.com/codeascraft/talks

    View Slide

  166. The Road to Success
    is paved with
    Small Improvements
    Daniel Schauenberg • [email protected] • @mrtazz

    View Slide