Heretical Resilience

Presented at QCon New York 2018

Ryn Daniels

June 29, 2018

Transcript

  1. heretical resilience (to repair is human) Ryn Daniels - @rynchantress, QCon New York 2018

  3. blargh AKA: A Dramatic Retelling of The Time I Nearly Broke Etsy Dot Com (my side of the story)

  5. apache versions

  6. apache versions

  9. blargh

  10. blargh

  14. blargh

  15. blargh

  26. blargh

  27. blargh

  28. The Post-mortem aka: What the heck actually just happened?

  29. The Post-mortem aka: What the heck actually just happened? aka: what did we learn?

  30. how did the site stay up?

  33. Lesson 1: Always keep 7 servers out of config management, just in case.

  34. Lesson 1: Consider fallbacks for automation

  35. distrusting your automation
    • How will you detect problems?
    • How easily can you test your automation?
    • Can you turn the automation off?
    • Do you remember how to do the thing manually?

  36. How did we respond so fast?

  38. blargh

  39. Lesson 2: Create a Slack Team in charge of maintaining a proper amount of slack in case of incidents.

  40. Lesson 2: maintain adaptive capacity

  41. twiddling your thumbs
    • How do people ask each other for help?
    • Which teams have more or less slack?
    • What happens after work gets rearranged?

  42. what couldn't we see?

  47. Lesson 3: Buy a couple botnets to DDoS your monitoring tools every now and then.

  48. Lesson 3: understand the dependencies in your tooling

  49. watching the world burn
    • What do your monitoring/automation/orchestration tools depend on?
    • Who watches the watchers?
    • How do you communicate internally and externally?
    • Do you have backup tools?

  50. what actually went wrong with chef?

  52. Lesson 4: Always label your dragons.

  53. Lesson 4: make informed decisions about which yaks to shave.

  54. choosing your yaks wisely
    • Which teams have sufficient slack?
    • Can a problem be avoided if not solved?
    • What are the tradeoffs and opportunity costs?
    • Who has the precision yak razors?

  55. who digs into the weird things?

  56. Lesson 4.5: Hire the person who created the primary language your site is written in. (This always scales.)

  57. Lesson 4.5: Develop depth of inter-team relationships

  58. finding your own rasmus
    • Which areas only have one (or two) people who understand them?
    • How is information shared within your organization?
    • What behaviors are rewarded?

  59. what happened afterwards?

  61. Lesson 5: Give people ill-fitting clothing when they mess up.

  62. Lesson 5: encourage organizational learning

  63. a warning to others
    • How do people respond to incidents?
    • What happens after an incident?
    • How are remediation items prioritized?
    • What happens to the band-aid solutions?

  65. technology can be robust.* only humans can be resilient.
    *for some already-known, pre-defined subset of problems

  67.
    1. understand your automation
    2. maintain adaptive capacity
    3. know your dependencies
    4. build cross-team relationships
    5. always be learning

  69. Thank you!