Doing things the hard way

Video available on YouTube: https://www.youtube.com/watch?v=WCZu-LZsIkg

---

Our discipline is one of tropes and maxims—the commoditisation of infrastructure, the golden signals of monitoring, the breaking down of barriers spurred by DevOps.

Surely there are mistakes we won't make again. Surely we've left the bad times behind.

Some mistakes are just too tempting to avoid.

Motivated by examples from GoCardless—a company founded in 2011—we'll explore three failure modes:

- dividing product and infrastructure teams early in the company's life
- pinning our hopes on the big rework that never arrives
- forgetting the basics of SRE while seeking out hard problems

We'll explore what makes each failure mode so tempting, what it might look like if you're experiencing it, and approaches to dig yourself out.

Chris Sinjakli

June 06, 2018

Transcript

  1. Doing things the hard way @ChrisSinjo

  2. Hi

  3. @ChrisSinjo

  5. An SRE

  6. GOCARDLESS

  7. “Obvious” mistakes and why we make them

  9. Conference talks favour certain structures

  10. Conference talks favour self-contained narratives

  11. “How we fixed the unfixable” –Fixing Things Ltd

  12. “How we scaled our system 100x” –ScaleCorp

  13. These are great stories to tell!

  14. But there’s more…

  15. Mistakes

  16. The ones that were “obvious”

  17. The mistakes you never thought you’d make

  18. Except you did

  19. And I hope I can convince you

  20. This is normal

  21. The reasons are often reasonable

  22. Talking openly is important

  23. Context & biases

  24. Size: 25 → 215 total (8 → 60 eng)

  25. GOCARDLESS

  26. Hindsight

  27. Structure: 3 examples

  28. foreach(example):

  29. foreach(example): Define it

  30. foreach(example): Define it What it looks like

  31. foreach(example): Define it What it looks like Problems caused

  32. foreach(example): Define it What it looks like Problems caused Fixes

  33. Common themes Q&A

  35. So let’s get to it

  36. Failure mode 1: Early Infra/Product Divide

  37. You’re a young company

  38. You’ve built a product

  39. Your userbase is growing

  41. You’ve also built this other thing

  42. Your product needs it to work

  43. It caught you by surprise

  44. You have an infra!

  45. It takes up dev time

  47. You weren’t ready for this

  48. “Can’t someone make this go away?”

  49. “We need to hire a DevOps”

  50. It sounds silly

  51. But it literally happens

  52. “We have all this rubbish that’s distracting our devs.” –The least appealing job description ever

  54. The phrasing was clunky

  55. But the framing is common

  56. Convenience

  58. An understandable lever to pull

  59. But…

  60. Problems

  61. Now you have organisational problems

  62. Disconnect devs from production

  63. A new bottleneck

  64. Too much infra, too soon

  65. Solutions

  66. Assuming you can’t un-split

  67. Make infra contributions easy

  68. Make it obvious what needs changing

  69. Make experimentation easy

  70. Set aside time to coach

  71. Breaking my own rules

  72. Some up-front advice

  73. First infra hire: dev background

  74. Embed them in the existing team

  75. Don’t give them sole ownership of the pager

  76. Failure mode 2: Distracted by hard problems

  77. We hear it so often

  78. “Join us and solve hard problems” –Every job ad

  79. We assume hard problems are most important

  80. They frequently aren’t

  81. Outcome: we neglect the basics

  82. When I say “basics”…

  83. Observability

  84. Metrics Monitoring (Structured) Events/Logs

  85. Metrics Monitoring SLOs (Structured) Events/Logs

  86. Metrics Monitoring Goals (Structured) Events/Logs

  87. Metrics Monitoring Uptime (Structured) Events/Logs

  88. Metrics Monitoring Error rate (Structured) Events/Logs

  89. Metrics Monitoring Latency (Structured) Events/Logs

  90. Easy to defer

  91. It feels mundane - As a project - As ongoing work

  93. “So how does this improve the service?”

  94. “So how does this improve the service?” “We can measure it better.”

  95. “So how does this improve the service?” “We can measure it better.” “How does that improve it?”

  98. - Faster debugging - Shorter outages - Better project choice

  101. It feels mundane - As a project - As ongoing work

  103. Observability is ongoing work

  104. Problems

  105. Previously… https://www.youtube.com/watch?v=SAkNBiZzEX8

  106. Was 10-15s of downtime okay?

  107. Back to basics?

  108. - Faster debugging - Shorter outages - Better project choice

  109. - Slower debugging - Longer outages - Worse project choice

  110. Lack of confidence

  111. Solutions

  112. Post-mortem meta-analysis

  113. “It wasn’t clear where the problem was.” –Post-mortems 1, 2, 3

  114. “We couldn’t break the errors down by user.” –Post-mortems 2, 3, 4

  115. “It was a false alarm. Again.” –Post-mortems 3, 4, 5

  116. You can do better at the basics

  117. A cultural shift

  118. Definition of done

  119. Done when it’s shipped ↓ Done when it’s measured

  120. A huge shift

  121. Cultural change takes time

  122. Start somewhere

  123. There are other basics

  124. - Post-mortem analysis - Tracking toil - Tracking pages per shift

  128. Failure mode 3: The everything project

  129. Story-based

  130. Kinda painful to tell

  131. The most immediate impact

  132. You have an infra!

  133. You’re not happy with it :(

  134. It evolved haphazardly

  135. You know where the problems are

  136. You want to fix them

  137. Reshaping the core

  138. Previously… https://www.usenix.org/conference/srecon17americas/program/presentation/sinjakli

  139. The precursor

  140. Goal: Better deployment

  141. Containers Orchestrator (Mesos) Load balancing Staging-per-developer Developer UI

  143. Containers Orchestrator (Mesos) Load balancing Staging-per-developer Self-serve developer UI

  146. We were working on an Everything Project

  147. Problems

  148. Everything is seen in terms of The New World

  149. So nothing happens back in The Old World

  150. It feels efficient

  151. But it’s not

  152. Loss of impact

  153. Loss of confidence

  154. Loss of team morale

  156. Solutions

  158. Look for the smallest version

  159. Look for the valuable part

  160. For us: deployment

  161. Containers Orchestrator (Mesos) Load balancing Staging-per-developer Self-serve developer UI

  162. Containers Orchestrator (Mesos) Load balancing Staging-per-developer Developer UI

  163. https://gocardless.com/blog/from-idea-to-reality-containers-in-production-at-gocardless/

  164. Efficiency cannot come at the cost of everything else

  165. No stopping the world

  166. ✅ Long-term goals Short-term reality

  167. ✅ Long-term goals % Short-term reality

  169. Mistakes

  171. I’ve presented 3 “obvious” mistakes

  172. Not first Not last

  173. Each has an internal logic

  174. Conference talks favour self-contained narratives

  175. Even when talking about mistakes

  176. Technical mistakes are self-contained

  177. Us vs Them

  179. And I hope I have convinced you

  180. You won’t avoid every mistake

  181. We certainly didn’t

  182. It’s never perfect

  183. It’s perfectly fine to correct course

  184. Thank you @ChrisSinjo @GoCardlessEng

  185. https://gocardless.com/schemes

  186. We’re hiring @ChrisSinjo @GoCardlessEng

  187. Image credits
    • XOXO Festival Day 2 - CC-BY - https://www.flickr.com/photos/textfiles/15237123601/
    • USS Barry conducts a practice pipe-patching drills during MultiSail 17 - CC-BY - https://www.flickr.com/photos/usnavy/32480491984/
    • Train lever - CC-BY - https://www.flickr.com/photos/darkbuffet/2309897403/
    • Calendar - CC-BY - https://www.flickr.com/photos/dafnecholet/5374200948/

  188. Image credits
    • Unhappy man - CC0 - https://pixabay.com/en/unhappy-man-mask-sad-face-sitting-389944/
    • Stop sign - CC-BY - https://www.flickr.com/photos/wolfsavard/4812833180/
    • Rope - CC-BY - https://www.flickr.com/photos/49140926@N07/6798304070/

  189. Questions? @ChrisSinjo @GoCardlessEng
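
Slides 112 to 115 describe a post-mortem meta-analysis: reading across several post-mortems to spot complaints that keep coming back. A minimal Python sketch of that idea follows. The data layout, the constant names, and the `recurring_themes` helper are hypothetical and are not taken from the talk or from GoCardless tooling. Only the three quoted complaints and the post-mortem numbers attached to them come from the slides.

```python
from collections import defaultdict

# Complaints quoted on slides 113 to 115.
UNCLEAR_LOCATION = "It wasn't clear where the problem was."
NO_USER_BREAKDOWN = "We couldn't break the errors down by user."
FALSE_ALARM = "It was a false alarm. Again."

# Hypothetical tagging: post-mortem number -> complaints raised in it.
# The numbers match the attributions on the slides.
POST_MORTEMS = {
    1: [UNCLEAR_LOCATION],
    2: [UNCLEAR_LOCATION, NO_USER_BREAKDOWN],
    3: [UNCLEAR_LOCATION, NO_USER_BREAKDOWN, FALSE_ALARM],
    4: [NO_USER_BREAKDOWN, FALSE_ALARM],
    5: [FALSE_ALARM],
}


def recurring_themes(post_mortems, min_count=2):
    """Return {complaint: [post-mortem numbers]} for complaints that
    appear in at least `min_count` different post-mortems."""
    seen = defaultdict(list)
    for number in sorted(post_mortems):
        # set() so a complaint tagged twice in one post-mortem counts once.
        for complaint in set(post_mortems[number]):
            seen[complaint].append(number)
    return {c: nums for c, nums in seen.items() if len(nums) >= min_count}


if __name__ == "__main__":
    for complaint, numbers in recurring_themes(POST_MORTEMS).items():
        print(f"{complaint} (post-mortems {numbers})")
```

A complaint that shows up in several post-mortems is a candidate for the kind of basics work the talk argues for, such as better observability or alert tuning. The `min_count` threshold is an arbitrary choice for illustration.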