Doing things the hard way

Video available on YouTube: https://www.youtube.com/watch?v=WCZu-LZsIkg

---

Our discipline is one of tropes and maxims—the commoditisation of infrastructure, the golden signals of monitoring, the breaking down of barriers spurred by DevOps.

Surely there are mistakes we won't make again. Surely we've left the bad times behind.

Some mistakes are just too tempting to avoid.

Motivated by examples from GoCardless—a company founded in 2011—we'll explore three failure modes:

- dividing product and infrastructure teams early in the company's life
- pinning our hopes on the big rework that never arrives
- forgetting the basics of SRE while seeking out hard problems

We'll explore what makes each failure mode so tempting, what it might look like if you're experiencing it, and approaches to dig yourself out.

Chris Sinjakli

June 06, 2018

Transcript

  1. Doing things
    the hard way
    @ChrisSinjo

  2. Hi

  3. @ChrisSinjo

  4. @ChrisSinjo

  5. An SRE

  6. GOCARDLESS

  7. “Obvious” mistakes
    and
    why we make them

  8. (no transcribed text)

  9. Conference talks
    favour
    certain
    structures

  10. Conference talks
    favour
    self-contained
    narratives

  11. –Fixing Things Ltd
    “How we fixed the unfixable”

  12. –ScaleCorp
    “How we scaled our system 100x”

  13. These are
    great
    stories to tell!

  14. But there’s
    more…

  15. Mistakes

  16. The ones that
    were “obvious”

  17. The mistakes you
    never thought
    you’d make

  18. Except you did

  19. And I hope I
    can convince
    you

  20. This is normal

  21. The
    reasons
    are often
    reasonable

  22. Talking openly
    is important

  23. Context
    &
    biases

  24. Size:
    25 → 215 total
    (8 → 60 eng)

  25. GOCARDLESS

  26. Hindsight

  27. Structure:
    3 examples

  28. foreach(example):

  29. foreach(example):
    Define it

  30. foreach(example):
    Define it
    What it looks like

  31. foreach(example):
    Define it
    What it looks like
    Problems caused

  32. foreach(example):
    Define it
    What it looks like
    Problems caused
    Fixes

  33. Common themes
    Q&A

  34. Common themes
    Q&A

  35. So let’s get to it

  36. Early Infra/
    Product Divide
    Failure mode 1

  37. You’re a young
    company

  38. You’ve built a
    product

  39. Your userbase
    is growing

  40. (no transcribed text)

  41. You’ve also
    built this other
    thing

  42. Your product
    needs it to work

  43. It caught you
    by surprise

  44. You have an
    infra!

  45. It takes up
    dev time

  46. (no transcribed text)

  47. You weren’t
    ready for this

  48. “Can’t someone
    make this go
    away?”

  49. “We need to
    hire a
    DevOps”

  50. It sounds silly

  51. But it literally
    happens

  52. –The least appealing job
    description ever
    “We have all this rubbish that’s
    distracting our devs.”

  53. "

  54. The phrasing
    was clunky

  55. But the framing
    is common

  56. Convenience

  57. (no transcribed text)

  58. An understandable
    lever to pull

  59. But…

  60. Problems

  61. Now you have
    organisational
    problems

  62. Disconnect
    devs from
    production

  63. A new
    bottleneck

  64. Too much infra
    Too soon

  65. Solutions

  66. Assuming you
    can’t un-split

  67. Make infra
    contributions
    easy

  68. Make it obvious
    what needs
    changing

  69. Make
    experimentation
    easy

  70. Set aside time
    to coach

  71. Breaking my
    own rules

  72. Some up-front
    advice

  73. First infra hire:
    dev background

  74. Embed them in
    the existing
    team

  75. Don’t give them
    sole ownership
    of the pager

  76. Distracted by
    hard problems
    Failure mode 2

  77. We hear it so
    often

  78. –Every job ad
    “Join us and solve hard problems”

  79. We assume hard
    problems are
    most important

  80. They frequently
    aren’t

  81. Outcome: we
    neglect the
    basics

  82. When I say
    “basics”…

  83. Observability

  84. Metrics
    Monitoring
    (Structured)
    Events/Logs

  85. Metrics
    Monitoring SLOs
    (Structured)
    Events/Logs

  86. Metrics
    Monitoring Goals
    (Structured)
    Events/Logs

  87. Metrics
    Monitoring Uptime
    (Structured)
    Events/Logs

  88. Metrics
    Monitoring Error rate
    (Structured)
    Events/Logs

  89. Metrics
    Monitoring Latency
    (Structured)
    Events/Logs
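
As a rough illustration of the error-rate and latency measurements named on these slides, here is a minimal Python sketch. The `Request` record, its field names, the choice of 5xx responses as "failed", and the nearest-rank percentile method are all assumptions made for the example; the talk does not say how GoCardless computes these figures.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one served request (not from the talk)."""
    status: int         # HTTP status code returned to the client
    duration_ms: float  # time taken to serve the request


def error_rate(requests):
    """Fraction of requests that failed, counting 5xx statuses as failures."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if r.status >= 500)
    return failed / len(requests)


def latency_percentile(requests, pct):
    """Nearest-rank percentile of request duration, in milliseconds."""
    if not requests:
        raise ValueError("no requests to measure")
    durations = sorted(r.duration_ms for r in requests)
    rank = max(1, math.ceil(pct / 100 * len(durations)))
    return durations[rank - 1]


def meets_slo(requests, max_error_rate, max_p99_ms):
    """True if both the error-rate and p99 latency objectives are met."""
    return (error_rate(requests) <= max_error_rate
            and latency_percentile(requests, 99) <= max_p99_ms)
```

A check such as `meets_slo(requests, max_error_rate=0.01, max_p99_ms=500)` uses made-up thresholds; real objectives would come from what users of the service actually need.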

  90. Easy to defer

  91. It feels mundane
    - As a project
    - As ongoing work

  92. It feels mundane
    - As a project
    - As ongoing work

  93. “So how does this improve the service?”

  94. “So how does this improve the service?”
    “We can measure it better.”

  95. “So how does this improve the service?”
    “We can measure it better.”
    “How does that improve it?”

  96. “So how does this improve the service?”
    “We can measure it better.”
    “How does that improve it?”

  97. (no transcribed text)

  98. - Faster debugging
    - Shorter outages
    - Better project choice

  99. - Faster debugging
    - Shorter outages
    - Better project choice

  100. - Faster debugging
    - Shorter outages
    - Better project choice

  101. It feels mundane
    - As a project
    - As ongoing work

  102. It feels mundane
    - As a project
    - As ongoing work

  103. Observability
    is
    ongoing work

  104. Problems

  105. Previously…
    https://www.youtube.com/watch?v=SAkNBiZzEX8

  106. Was 10-15s of
    downtime okay?
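
One way to make that question answerable is an error budget derived from an availability SLO. The sketch below assumes a 99.9% target over a 30-day window purely for illustration; neither figure comes from the talk, and the function names are hypothetical.

```python
def error_budget_seconds(slo_target, window_days):
    """Seconds of downtime allowed by an availability target over a window."""
    window_seconds = window_days * 24 * 60 * 60
    return (1 - slo_target) * window_seconds


def budget_consumed(downtime_seconds, slo_target, window_days):
    """Fraction of the error budget used up by a single outage."""
    return downtime_seconds / error_budget_seconds(slo_target, window_days)


# Assumed numbers for illustration only: a 99.9% target over 30 days.
budget = error_budget_seconds(0.999, 30)
used = budget_consumed(15, 0.999, 30)
```

Under those assumed numbers the budget is about 2,592 seconds, so a 15-second outage would use roughly 0.6% of it. Without an agreed target there is nothing to compare the 10-15 seconds against.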

  107. Back to
    basics?

  108. - Faster debugging
    - Shorter outages
    - Better project choice

  109. - Slower debugging
    - Longer outages
    - Worse project choice

  110. Lack of
    confidence

  111. Solutions

  112. Post-mortem
    meta-analysis

  113. “It wasn’t clear where the
    problem was.”
    –Post-mortems 1, 2, 3

  114. “We couldn’t break the errors
    down by user.”
    –Post-mortems 2, 3, 4

  115. “It was a false alarm.
    Again.”
    –Post-mortems 3, 4, 5
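
The meta-analysis on the last three slides amounts to tagging each post-mortem with the problems it mentions and counting which tags recur. A minimal Python sketch follows. The tag names are hypothetical labels for the three quotes, and the data simply mirrors the post-mortem numbers shown on the slides.

```python
from collections import Counter

# Tags per post-mortem, mirroring the quotes on the slides above.
# The tag names themselves are invented for this example.
postmortems = {
    1: ["unclear-location"],
    2: ["unclear-location", "no-per-user-breakdown"],
    3: ["unclear-location", "no-per-user-breakdown", "false-alarm"],
    4: ["no-per-user-breakdown", "false-alarm"],
    5: ["false-alarm"],
}


def recurring_themes(postmortems, min_count=2):
    """Return (tag, count) pairs seen in at least min_count post-mortems."""
    counts = Counter(
        tag for tags in postmortems.values() for tag in set(tags)
    )
    recurring = [(tag, n) for tag, n in counts.items() if n >= min_count]
    # Most frequent first; ties broken alphabetically for stable output.
    return sorted(recurring, key=lambda item: (-item[1], item[0]))
```

Here every tag appears in three post-mortems, which is the signal the slides point at: the same observability gaps keep showing up across unrelated incidents.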

  116. You can do better
    at the basics

  117. A cultural shift

  118. Definition of
    done

  119. Done when it’s shipped

    Done when it’s
    measured

  120. A
    huge
    shift

  121. Cultural
    change
    takes time

  122. Start
    somewhere

  123. There are
    other basics

  124. - Post-mortem analysis
    - Tracking toil
    - Tracking pages per shift

  125. - Post-mortem analysis
    - Tracking toil
    - Tracking pages per shift

  126. - Post-mortem analysis
    - Tracking toil
    - Tracking pages per shift
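
Tracking pages per shift can start as a simple count of page timestamps bucketed by on-call shift. The sketch below assumes fixed-length shifts counted from a known start time; the 12-hour default and the function name are assumptions for the example, not details from the talk.

```python
from collections import Counter
from datetime import datetime, timedelta


def pages_per_shift(page_times, first_shift_start,
                    shift_length=timedelta(hours=12)):
    """Count pages in each shift, keyed by shift index (0 = first shift)."""
    counts = Counter()
    for fired_at in page_times:
        if fired_at < first_shift_start:
            continue  # ignore pages from before tracking began
        # Dividing one timedelta by another with // gives a whole number.
        shift_index = (fired_at - first_shift_start) // shift_length
        counts[shift_index] += 1
    return dict(counts)
```

A rising count per shift is one early sign that alerting is getting noisier, which connects back to the "false alarm" theme in the post-mortems.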

  127. (no transcribed text)

  128. The everything
    project
    Failure mode 3

  129. Story-based

  130. Kinda
    painful
    to tell

  131. The most
    immediate
    impact

  132. You have an
    infra!

  133. You’re not
    happy with it :(

  134. It evolved
    haphazardly

  135. You know
    where the
    problems are

  136. You want to
    fix them

  137. Reshaping
    the core

  138. Previously…
    https://www.usenix.org/conference/srecon17americas/program/presentation/sinjakli

  139. The
    precursor

  140. Goal:
    Better deployment

  141. Containers
    Orchestrator (Mesos)
    Load balancing
    Staging-per-developer
    Developer UI

  142. Containers
    Orchestrator (Mesos)
    Load balancing
    Staging-per-developer
    Developer UI

  143. Containers
    Orchestrator (Mesos)
    Load balancing
    Staging-per-developer
    Self-serve developer UI

  144. Containers
    Orchestrator (Mesos)
    Load balancing
    Staging-per-developer
    Self-serve developer UI

  145. Containers
    Orchestrator (Mesos)
    Load balancing
    Staging-per-developer
    Self-serve developer UI

  146. We were working
    on an
    Everything Project

  147. Problems

  148. Everything is seen in
    terms of
    The New World

  149. So nothing happens
    back in
    The Old World

  150. It feels
    efficient

  151. But it’s
    not

  152. Loss of
    impact

  153. Loss of
    confidence

  154. Loss of
    team morale

  155. (no transcribed text)

  156. Solutions

  157. (no transcribed text)

  158. Look for the
    smallest
    version

  159. Look for the
    valuable
    part

  160. For us:
    deployment

  161. Containers
    Orchestrator (Mesos)
    Load balancing
    Staging-per-developer
    Self-serve developer UI

  162. Containers
    Orchestrator (Mesos)
    Load balancing
    Staging-per-developer
    Developer UI

  163. https://gocardless.com/blog/from-idea-to-reality-containers-in-production-at-gocardless/

  164. Efficiency cannot come
    at the cost of
    everything else

  165. No stopping
    the world

  166. ✅ Long-term goals
    Short-term reality

  167. ✅ Long-term goals
    % Short-term reality

  168. (no transcribed text)

  169. Mistakes

  170. (no transcribed text)

  171. I’ve presented 3
    “obvious” mistakes

  172. Not first
    Not last

  173. Each has an
    internal logic

  174. Conference talks
    favour
    self-contained
    narratives

  175. Even when
    talking about
    mistakes

  176. Technical
    mistakes are
    self-contained

  177. Us vs Them

  178. Us vs Them

  179. And I hope I
    have convinced
    you

  180. You won’t avoid
    every
    mistake

  181. We certainly
    didn’t

  182. It’s never
    perfect

  183. It’s perfectly
    fine to correct
    course

  184. Thank you
    &❤
    @ChrisSinjo
    @GoCardlessEng

  185. https://gocardless.com/schemes

  186. We’re hiring
    &❤
    @ChrisSinjo
    @GoCardlessEng

  187. Image credits
    • XOXO Festival Day 2 - CC-BY - https://www.flickr.com/photos/textfiles/15237123601/
    • USS Barry conducts a practice pipe-patching drills during MultiSail 17 - CC-BY - https://www.flickr.com/photos/usnavy/32480491984/
    • Train lever - CC-BY - https://www.flickr.com/photos/darkbuffet/2309897403/
    • Calendar - CC-BY - https://www.flickr.com/photos/dafnecholet/5374200948/

  188. Image credits
    • Unhappy man - CC0 - https://pixabay.com/en/unhappy-man-mask-sad-face-sitting-389944/
    • Stop sign - CC-BY - https://www.flickr.com/photos/wolfsavard/4812833180/
    • Rope - CC-BY - https://www.flickr.com/photos/49140926@N07/6798304070/

  189. Questions?
    &❤
    @ChrisSinjo
    @GoCardlessEng
