The Road to Success is paved with Small Improvements

This talk discusses Etsy's past architecture and how it evolved over the years. It details the tooling we wrote and rely on to make our continuous deployment culture work. It then describes our culture of blameless postmortems and how we strive to be a learning organization that integrates learning into incident response and handling.

Daniel Schauenberg

January 14, 2015

Transcript

  1. The Road to Success is paved with Small Improvements. Daniel Schauenberg • d@etsy.com • @mrtazz
  2. None
  3. None
  4. None
  5. The Monolith

  6. LAMP

  7. We deploy quite a lot

  8. How comfortable are you deploying a change right now?

  9. MTTR trumps MTBF

  10. If this is your first day at Etsy, you deploy the site
  11. None
  12. The Dark Past

  13. None
  14. Hindsight is 20/20; There would be no Etsy; I wasn't around for this (the grain of salt disclaimer)
  15. dark less fun

  16. Architecture Overview: Ubuntu, PostgreSQL, Lighttpd, PHP/Python

  17. Single Big Database

  18. Business Logic in Stored Procedures

  19. None
  20. silos

  21. DEV ! DBA ! OPS

  22. Sprouter

  23. Stored Procedure Routing Middleware

  24. "Now you don't have to touch the database"

  25. A software manifestation of silos

  26. Site uptime wasn't good

  27. <3 uptime

  28. None
  29. Trust the people

  30. None
  31. Sprouter

  32. Master

  33. Horizontal Scaling (A single box only scales so far)

  34. Master-Master replicated MySQL Shards

  35. Flickr DNA

  36. Deploy != Release

  37. Feature Flags

  38. if (Feature::isEnabled($feature)) {
          // new hawtness
      } else {
          // nothing to see here
      }
  39. Chef

  40. <3 Chef

  41. repo open to everyone; everybody can make changes; ~30 regular contributors
  42. None
  43. knife-spork, knife-flip, knife-lastrun, knife-preflight, knife-wip, chef-whitelist

  44. Feature Flags in Chef

  45. if node.is_in_whitelist? "apache_upgrade"
        # new hawtness
      else
        # boring old stuff
      end
  46. None
  47. Confidence in Dev Environment

  48. Developer VMs: every engineer has one; Chef’d with the Etsy stack; different sizes and Chef roles
  49. Self Service

  50. Continuous Integration

  51. None
  52. QA and Unit, Staging Smoker, Prod Smoker

  53. > 2000 tests

  54. Continuous Integration as the bottleneck of deployment

  55. How to keep it fast?

  56. SSDs!

  57. Not Quite

  58. The Bobs (our builders): LXC containers; 1 Jenkins agent per container; 4 containers per SSD
  59. Try

  60. > try https://github.com/etsy/Trylib

  61. The power of the CI cluster, the ease of your local environment
  62. Thanks Mozilla

  63. IRC

  64. None
  65. Deployinator

  66. None
  67. None
  68. Simple Web Application

  69. METRICS!

  70. None
  71. Ganglia & Graphite

  72. None
  73. StatsD: "how do you know this works in production?"

  74. Statsd::increment("foo");
      Statsd::timing("bar", 10);

  75. Dashboard Framework

  76. $graph = new Graph_Graphite_Simple([
          'title' => 'Packets Received (nodejs)',
          'metrics' => 'stats.%s.packets_received',
          'limit_y_axis' => true,
          'stacked' => true,
          'width' => $width,
      ]);
  77. Deploy Dashboard

  78. “Finding your soul metric can take time. But in the end, it boils down to what matters most to your users.” Mathias Meyer, CEO Travis CI
  79. Find the subset of metrics people should look at after a deploy
  80. Stick them in a central place

  81. Logs!

  82. Syslog works really well

  83. Decide on a log format

  84. None
  85. None
  86. None
  87. Awesome Tools

  88. Empower the Individual

  89. Information Overload

  90. Tools need attention

  91. Culture

  92. None
  93. None
  94. None
  95. None
  96. None
  97. None
  98. realtalk: things break

  99. New View

  100. Complex Socio-Technical Systems

  101. “Knowledge and error flow from the same mental sources; only success can tell one from the other.” Ernst Mach, Erkenntnis und Irrtum (p. 116)
  102. Success and failure can only be determined a posteriori

  103. Things made sense at the time

  104. People don't come to work to do a bad job

  105. Nietzschean Anxiety

  106. So I always get off the hook whatever I do?

  107. depends

  108. “There is a difference between explaining and excusing human performance.” Sidney Dekker, The Field Guide to Understanding Human Error (p. 196)
  109. Blameless Postmortems

  110. Open Meeting

  111. Everybody is Invited

  112. What happened?

  113. Timeline

  114. Describe the past. Don't excuse it away.

  115. The Facilitator

  116. Guide the Discussion

  117. Look out for indicators of Old View thinking

  118. Counterfactuals

  119. "she should have", "if they just had", "if he would have", "you failed to"
  120. Biases

  121. Hindsight Bias, Confirmation Bias, Outcome Bias

  122. there are many more

  123. Who is in charge?

  124. Etsy School

  125. Taught Facilitator Course

  126. 3 x 90 minutes

  127. Remediation Items

  128. incorporate learning and takeaways from the meeting

  129. None
  130. turn surprises into known factors

  131. MORGUE

  132. None
  133. None
  134. None
  135. None
  136. None
  137. None
  138. None
  139. https://github.com/etsy/morgue

  140. Near Miss

  141. "Hey all, I just ran rm -rf $DIR/ and since the variable was empty I deleted my whole VM. This would have been bad in production. Don't do that."
  142. Pre Mortem

  143. Architecture Reviews

  144. Operability Reviews

  145. “It is also worth pointing out that the bias towards investigating failures rather than success itself represents a trade-off.” Erik Hollnagel, The ETTO Principle: Efficiency-Thoroughness Trade-Off
  146. Investigate Success

  147. Why did it work?

  148. Human Error is where you stopped looking

  149. Learning > Blaming

  150. None
  151. None
  152. How did we end up here?

  153. None
  154. Overhauls & Iterations

  155. Culture & Tools (you can't really have one without the other)
  156. Humans are AWESOME

  157. Nobody comes to work to do a bad job

  158. Trust your Co-Workers

  159. There is a lot of knowledge in your engineering team

  160. Deploy (as often as it makes sense)

  161. Collaborate (even if you think you don't have to)

  162. None
  163. Listen (to problems and experiences of your coworkers)

  164. Thank you!

  165. codeascraft.com, etsy.com/codeascraft/talks

  166. The Road to Success is paved with Small Improvements. Daniel Schauenberg • d@etsy.com • @mrtazz
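
Slides 36 to 38 separate deploying code from releasing a feature by wrapping new behavior in a flag check. A minimal sketch of that idea, written in Ruby (the language of the Chef snippet on slide 45), might look like the following. The FeatureFlags class, its methods, and the "new_checkout" flag are hypothetical names chosen for illustration; this is not Etsy's actual Feature API.

```ruby
# Hypothetical in-memory flag registry, for illustration only.
class FeatureFlags
  def initialize(flags = {})
    # Normalize keys so symbols and strings address the same flag.
    @flags = flags.transform_keys(&:to_s)
  end

  # Unknown flags count as disabled, so freshly deployed code stays dark.
  def enabled?(name)
    @flags.fetch(name.to_s, false) == true
  end

  # Releasing a feature is a flag change, not a new deploy.
  def enable(name)
    @flags[name.to_s] = true
  end
end

# Both code paths are deployed; the flag decides which one runs.
def checkout_page(flags)
  if flags.enabled?("new_checkout")
    "new checkout"
  else
    "old checkout"
  end
end

flags = FeatureFlags.new("new_checkout" => false)
before = checkout_page(flags) # deployed, but not yet released
flags.enable("new_checkout")
after = checkout_page(flags)  # released without another deploy
puts before
puts after
```

Defaulting unknown flags to off is one possible design choice: a typo in a flag name then hides the new path instead of exposing it.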
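
Slide 74 increments a counter and records a timing through StatsD. As a rough sketch of what such calls put on the wire, the Ruby below builds datagrams in the plain-text StatsD format (name:value|type) and sends them over UDP. The StatsdSketch module, its method names, and the default host are assumptions for illustration, not the client library used in the talk; 8125 is the conventional StatsD port.

```ruby
require "socket"

# Hypothetical helper module, for illustration only.
module StatsdSketch
  # Counter datagram: "<name>:<delta>|c"
  def self.counter_payload(name, delta = 1)
    "#{name}:#{delta}|c"
  end

  # Timing datagram: "<name>:<milliseconds>|ms"
  def self.timing_payload(name, milliseconds)
    "#{name}:#{milliseconds}|ms"
  end

  # Fire-and-forget UDP send; host and port are assumed defaults.
  def self.send_metric(payload, host: "127.0.0.1", port: 8125)
    socket = UDPSocket.new
    socket.send(payload, 0, host, port)
  ensure
    socket&.close
  end
end

# Rough equivalents of the two calls on slide 74.
puts StatsdSketch.counter_payload("foo")
puts StatsdSketch.timing_payload("bar", 10)
```

Because the transport is UDP, the application does not wait for the metrics server to acknowledge anything, which keeps instrumentation cheap enough to add around any code path.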