The Road to Continuous Deployments

Engineering Excellence through Continuous Delivery

An experience report on how we built a sustainable culture of shipping great products and how you can too.

It's a playbook covering a wide range of topics that help build engineering rigor including, but not limited to:

* Pair programming
* Instrumentation
* Trunk-based development
* Test-driven development
* Feature flags
* Observability
* On-call rotation
* And much more...

Swanand Pagnis

September 08, 2022

Transcript

  1. The Road to Continuous
    Deployments
    Experience Report from CoLearn Engineering

    View Slide

  2. Swanand Pagnis
    👨💼 CTO at CoLearn


    🍻 meetup.com/Bangalore-Ruby-Users-Group/


    📔 info.pagnis.in


    👨🏫 postgres-workshop.com

    View Slide

  3. Background

    View Slide

  4. Company
    •Education Platform for K-12


    •In Indonesia (for now)


    •Live Classes


    •AI-Powered Homework Help

    View Slide

  5. >1 year ago
    •8-month-old codebases


    •Service Oriented Architecture


    •NodeJS backend + ReactJS Frontend


    •Native Android in Kotlin


    •Native iOS in Swift

    View Slide

  7. Now

    •~2-year-old codebases


    •Service Oriented Architecture


    •Rails/Django backend + ReactJS Frontend


    •Native Android in Kotlin. Flutter WIP.


    •Native iOS in Swift. Flutter WIP.
    Deprioritised, because 🚀 & 💰

    View Slide

  8. Legacy Systems
    •Built for an MVP stage


    •Came without thorough engineering
    practices baked in

    View Slide

  9. Growth
    •1 year period


    •Engineering 10 -> 36


    •Product 2 -> 9


    •Design + Content 4 -> 20

    View Slide

  10. Target Audience 🎯
    •Product Engineering Teams


    •Founders, CTOs, CPOs, VPs


    •Software Developers


    •Product Managers

    View Slide

  11. Why CD? 🤔

    View Slide

  12. What is this?

    View Slide

  13. Public Transport. ✅

    View Slide

  14. Every 1 Hour

    View Slide

  15. Every 5 Minutes

    View Slide

  16. Which is better? Every 1 Hour
    Every 5 Minutes

    View Slide

  17. How about this?

    View Slide

  18. This belt is continuous.


    Hop-on whenever.


    Hop-off wherever.

    View Slide

  19. The Bottom Line
    •Find and surface bugs faster


    •Repeatable, reliable delivery


    •Risk mitigation: "When the costs are
    non-linear, keep it small"

    View Slide
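
The "keep it small" point can be made concrete with a little arithmetic. This is a minimal sketch under an assumed cost model, not a measurement: `release_cost` and its quadratic exponent are hypothetical, chosen only to show what non-linear costs do to batch size.

```python
def release_cost(changes: int, exponent: float = 2.0) -> float:
    """Hypothetical model: the cost of a release grows as changes ** exponent."""
    return float(changes) ** exponent


# The same 10 changes, shipped together or one at a time.
one_big_release = release_cost(10)
ten_small_releases = sum(release_cost(1) for _ in range(10))

print(f"one big release:    {one_big_release}")
print(f"ten small releases: {ten_small_releases}")
```

Under this assumed model the single large release costs ten times as much as the ten small ones. With a linear model (exponent 1) the two would be equal, which is why the argument depends on the costs being non-linear.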

  24. • Failure ☠ 🏦


    • Major Repairs 😱💰


    • Minor Repairs 😟 💵


    • Preventive Maintenance 😅 🪙 ✅

    View Slide

  25. • Improved velocity 🏎


    • Better product via rapid
    iterations ♻


    • Improved code quality,
    reliability, architecture ☮

    View Slide

  26. What's the catch? 🎣

    View Slide

  29. • It needs rigour, which is not
    always possible
    • High inertia — needs time,
    effort, careful execution
    • High short term costs

    View Slide

  30. Methodology

    View Slide

  31. Pairing, TDD, Trunk
    Based Development,
    On-Call Rotation:


    Build Rigour

    View Slide

  32. Testing, Instrumentation,
    Observability, Feature
    Flags:


    Make Verification Easy

    View Slide

  33. Infrastructure as Code,
    Immutable Infra,
    Pipelines, Playbooks:


    Reduce Operating Friction

    View Slide

  34. A sustainable culture
    of building & shipping
    great products.

    View Slide

  35. 1. Build Rigour


    2. Make Verification Easy


    3. Low Operating Friction

    View Slide

  36. 1. Build Rigour

    View Slide

  37. 1. Build Rigour
    Putting the engineering in engineering

    View Slide

  42. 1. Pair Programming


    2. TDD


    3. Trunk Based Development


    4. On-Call Rotation

    View Slide

  43. Why did we pick it?

    View Slide

  44. ☝First, the results.

    View Slide

  45. Consistent high eNPS
    •Min 73, Max 87


    •Better connections, better work
    relationships


    •Pandemic-induced remote anxiety went
    down 📉

    View Slide

  46. Increasing Velocity
    •Pairing velocity caught up with
    non-pairing velocity


    •Fewer delivery streams, same overall
    speed

    View Slide

  47. Consistent


    upward


    trend.


    Towards the end,


    holidays and covid


    knocked us down.

    View Slide

  48. Fast Onboarding
    •Ship code to production in Week 1


    •New languages, frameworks in a
    sprint or two 🏎


    •Internal transfers with zero friction 🧈

    View Slide

  49. Low Tech Debt
    •Greenfield projects: ~0 tech debt 👌


    •Code quality up 📈


    •Documentation quality and quantity up 📈


    •Architecture has been flexible 💪

    View Slide

  50. Why did we pick it?

    View Slide

  51. A Combination Of
    •Prior experience


    •Established research


    •First principles thinking

    View Slide

  52. What does research say?
    •Improves design quality


    •Reduces defects (people spend less time on defective
    solutions)


    •Reduces staffing risk


    •Enhances technical skills


    •Improves team communications


    •Is considered more enjoyable at statistically significant levels.
    The Costs and Benefits of Pair Programming; Alistair Cockburn, Laurie Williams, Feb 2000

    View Slide

  53. How to do it?

    View Slide

  56. Use Driver Navigator
    •For an idea to go from Navigator's head to
    the code, it must go through Driver's hands.
    •Switch roles periodically. Say, every hour.
    •Senior / Junior is immaterial. Both get both
    roles.

    View Slide

  58. Use Driver Navigator
    •Avoid giving line by line instructions,
    convey the general idea.
    •Seniors take the responsibility of
    mentoring

    View Slide

  59. Use Driver Navigator
    •When chopping onions, don't say
    "cut the top off, now break in half,
    make a slice" etc.


    •Just say "finely chopped" or "diced"

    View Slide

  60. Use Driver Navigator
    •Mentoring happens with both driving
    and navigating.


    •I leave it to you to figure out what
    the differences are.

    View Slide

  61. Use Driver Navigator
    •Remote pairing is better than in-person
    because of the natural role selection


    •Sharing screen? 👉 driver. Other 👉 navigator.


    •Mobbing is incredibly easy in remote. Just
    join the call and you're ready! 👍

    View Slide

  65. Switch Pairs Every Sprint
    •Avoid pairing silos, they stall culture propagation
    •Often, a pair will be 💯, switch them anyway.
    •Exhaust senior-junior pairs first
    •When sprint ends, you swap even if WIP. This is
    an effective litmus test.

    View Slide
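
One way to guarantee that nobody stays in a pairing silo is to generate the rotation rather than pick pairs by hand. The sketch below uses the standard round-robin "circle" method so that everyone pairs with everyone else exactly once before any pair repeats. The function name and the engineer names are hypothetical, and it does not implement the "exhaust senior-junior pairs first" rule; that would need an extra ordering step.

```python
from collections import deque


def pairing_schedule(engineers):
    """Yield one list of pairs per sprint using the round-robin circle method."""
    people = list(engineers)
    if len(people) % 2:
        people.append(None)  # None marks the person working solo that sprint
    fixed, rotating = people[0], deque(people[1:])
    for _ in range(len(people) - 1):
        lineup = [fixed, *rotating]
        half = len(lineup) // 2
        # Pair the first with the last, the second with the second-to-last, etc.
        yield [(lineup[i], lineup[-1 - i]) for i in range(half)]
        rotating.rotate(1)  # shift everyone except the fixed person by one seat


sprints = list(pairing_schedule(["Asha", "Budi", "Citra", "Dewi"]))
for number, pairs in enumerate(sprints, start=1):
    print(f"sprint {number}: {pairs}")
```

For a team of four this produces three sprints that cover all six possible pairs, after which the cycle can start again.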

  66. DevX is Crucial
    •Let pairs figure out the balance between solo
    focussed work and pairing


    •Have routine health-checks about how
    people are pairing, their experience, etc


    •Let pairing feature in 1:1s and other
    discussions

    View Slide

  67. Pilot + Co-Pilot


    = Pairing

    View Slide

  68. You don't say


    "Turn the lever by 10°
    and push that button"

    View Slide

  69. You say


    "Raise the elevation by
    1000m"

    View Slide

  70. Pairing Summary
    •Use Driver-Navigator


    •Switch pairs every sprint; no silos


    •Routine health checks with team

    View Slide

  71. 1. Pair Programming


    2. TDD


    3. Trunk Based Development


    4. On-Call Rotation

    View Slide

  72. Why did we pick it?

    View Slide

  73. • TDD improves testability. This benefit
    alone is enough to embrace TDD.


    • TDD forces you to think in
    specifications, hence improving
    product thinking, along with code
    quality.

    View Slide
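
To illustrate "thinking in specifications", here is a minimal test-first sketch using Python's standard `unittest` module. The `grade` function and its 60% pass mark are hypothetical and not taken from the CoLearn codebase. In TDD the three tests are written first and read as the specification; the function body is then filled in with just enough code to make them pass.

```python
import unittest


def grade(correct: int, total: int) -> str:
    """Implementation written after the tests, with just enough logic to pass them."""
    if total <= 0:
        raise ValueError("total must be positive")
    return "pass" if correct / total >= 0.6 else "retry"


class GradeSpec(unittest.TestCase):
    """Each test name reads as one line of the specification."""

    def test_sixty_percent_or_more_is_a_pass(self):
        self.assertEqual(grade(6, 10), "pass")

    def test_below_sixty_percent_is_a_retry(self):
        self.assertEqual(grade(5, 10), "retry")

    def test_quiz_without_questions_is_rejected(self):
        with self.assertRaises(ValueError):
            grade(0, 0)


# Run the specification explicitly so the example works from any entry point.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(GradeSpec)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the function had to be callable from a test before it existed, it ends up with plain inputs and outputs and no hidden dependencies, which is the testability benefit the slide refers to.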

  74. What are the effects?

    View Slide

  77. • Clear & significant uptick in quality
    where TDD was followed vs where it
    wasn't.
    • Legacy or greenfield doesn't matter
    • TDD and Pairing are two incredible force
    multipliers, they feed into each other
    and create a strong positive gains loop.

    View Slide

  78. How to do it?

    View Slide

  79. View Slide

  81. 1. Have senior engineers who are
    experienced in TDD
    2. Pair programming. Duh.

    View Slide

  82. 1. Pair Programming


    2. TDD


    3. Trunk Based Development


    4. On-Call Rotation

    View Slide

  83. Why did we pick it?

    View Slide

  84. View Slide

  87. • Impedance mismatch between
    long-lived-branch + PR-based workflow
    and how high-trust teams operate
    • Build a sense of ownership in the
    codebase
    • Always be selling

    View Slide

  88. • Impedance mismatch between
    long-lived-branch + PR-based workflow
    and how high-trust teams operate


    • Build a sense of ownership in the
    codebase


    • Always be release ready

    View Slide

  89. What are the effects?

    View Slide

  90. • Code reviews are faster


    • Teams respond quicker to
    urgent and important bugs


    • We're running more iterations

    View Slide

  91. • Deploying to dev, stage has
    become slightly awkward
    because there's no 1:1 mapping


    • Turn-key environments have
    become a necessity rather than
    nice-to-have

    View Slide

  92. How to do it?

    View Slide

  93. Have fast builds
    •<1 min ideally, if possible


    •15 min from git push to production
    deploy including build


    •Enable focussed tests i.e. run a single
    test from a single file

    View Slide

  94. 💯 Dev Machines
    •Fast, capable laptops


    •Must have automated & manual
    testing setup


    •Enable setting up any dependency

    View Slide

  95. 1. Pair Programming


    2. TDD


    3. Trunk Based Development


    4. On-Call Rotation

    View Slide

  96. Observations?

    View Slide

  99. • Process still under iteration; no
    "yes, this works" yet
    • Settled on functional rotations:
    Backend, Frontend, Mobile, DevOps
    • PTOs, leaves, Weekends still pose a
    challenge from time to time

    View Slide

  100. • Team members that have done
    really well during on-call have
    also done really well in their
    performance reviews.


    • Correlation, yes. Causation? 🤷

    View Slide

  101. How to do it?

    View Slide

  102. Sharing our experience, not a
    walkthrough for on-call.


    Literature: PagerDuty Docs &
    Google's SRE Book

    View Slide

  103. • Start with a robust triage process. First
    response under 15 min.


    • Have a playbook where common
    problems and remedies are listed.


    • In B2C products, a handful of situations
    repeat like a persistent boomerang: FABs,
    Frequently Annoying Bugs.

    View Slide

  104. • Use managed services as much as
    possible; reduce operational on-call


    • Try hard for "follow-the-sun"
    model; i.e. no wee hours


    • All alerts must be actionable, keep
    adjusting until they are

    View Slide

  105. 1. Pair Programming


    2. TDD


    3. Trunk Based Development


    4. On-Call Rotation

    View Slide

  106. Every single process /
    methodology discussed so far
    has ancillary benefits that go
    way beyond just CD.

    View Slide

  107. 2. Make
    Verification Easy
    How do you know what you've done is
    working?

    View Slide

  108. View Slide

  113. 1. Testability


    2. Instrumentation


    3. Feature Flags


    4. Static Verification

    View Slide

  114. Why?

    View Slide

  115. • Testability is a core engineering
    principle.


    • To be able to answer questions
    about a system by probing the
    right points and looking at
    indicators

    View Slide

  116. • Cars, bridges, rack & pinion —
    you can't just restart them.


    • Neither can you go and update
    them at will

    View Slide

  117. Hit the hammer: $1.0


    Knowing where to hit: $9999.0

    View Slide

  118. • The more testable your environment
    is, the more people will actually test it.


    • Make it easy to test something and it
    will get tested.


    • Conversely, make it difficult to test
    and it's easy to slip.

    View Slide

  119. What are the
    observations?

    View Slide

  122. • Dev & Stage not being close enough to
    production has routinely caused problems
    • Static branches mapping to environments
    (dev, stage, main) seem 👍, but are a 👎
    • Opaque 3rd party dependencies are
    incredibly hard to test. e.g. WhatsApp
    business APIs

    View Slide

  125. • SoA + inter-service dependencies =
    complexity at a polynomial growth rate (or
    worse, factorial)
    • Cloud-Native systems are a pain to test,
    but they do offer instrumentation.
    • UIs are inherently hard to test, add probes
    (Metrics, Analytics, Traces, Errors, etc)

    View Slide
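
A quick calculation shows how fast the testing surface grows with the number of services. This sketch assumes the worst case, where any service may depend on any other: the number of pairwise links grows quadratically, and if the order in which services are involved also matters, the number of possible orderings grows factorially. The function names are illustrative only.

```python
from math import comb, factorial


def pairwise_links(services: int) -> int:
    """Number of distinct service-to-service pairs: n * (n - 1) / 2."""
    return comb(services, 2)


def interaction_orderings(services: int) -> int:
    """Number of orders in which all services could take part in one flow: n!."""
    return factorial(services)


for n in (3, 6, 12):
    print(f"{n:>2} services: {pairwise_links(n):>3} pairs, "
          f"{interaction_orderings(n):>11,} orderings")
```

Real systems rarely connect every service to every other, so these are upper bounds. The point is the growth rate rather than the exact numbers.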

  126. So, how to go about it?

    View Slide

  127. There are two main
    themes:


    1. Development time


    2. Runtime, in production

    View Slide

  128. Development Time
    •Use TDD


    •Add linters, code coverage to test builds


    •Postman / equivalent API tools are 💯


    •Powerful Type Systems*

    View Slide

  131. Runtime
    • Make good use of lower order environments
    •Heroku / Vercel style Review Apps are far
    more powerful than they seem
    •Dive down deep into important bugs and see
    how they could've been tested earlier.
    (Which is different from how to reproduce them)

    View Slide

  135. Runtime
    •Add traces, especially to lower-order environments.
    (Example: AWS's X-Ray)
    •Try and build idempotent units of work. APIs, Workers, etc.
    •Pay special attention to non-idempotent units of work. Add
    checks and balances. (Example: OTPs)
    •Eliminate String logs. All log statements are events, with
    key value pairs*

    View Slide

  136. In both Environments
    •Always test for contention:


    • What must happen sequentially? Does it?


    •Always test for coherence:


    • How much and what information do two systems
    need to collect from each other? Do they?

    View Slide

  137. Recommended Reading
    •Neil Gunther's work on Universal
    Scalability Law and Quantifying
    Scalability and Performance


    •Michael Nygard's "Release It!"

    View Slide
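
The contention and coherence questions above map onto the two coefficients of Gunther's Universal Scalability Law, where relative capacity at load N is N / (1 + α(N − 1) + βN(N − 1)). The sketch below simply evaluates that formula. The coefficient values are made up for illustration and are not measurements from any system in this talk.

```python
def usl_capacity(n: int, contention: float, coherence: float) -> float:
    """Relative capacity at n workers under the Universal Scalability Law.

    contention (alpha): share of work that must happen sequentially.
    coherence (beta): cost of keeping workers consistent with each other.
    """
    return n / (1 + contention * (n - 1) + coherence * n * (n - 1))


# Illustrative coefficients only.
for workers in (1, 2, 4, 8, 16, 32):
    print(f"{workers:>2} workers -> {usl_capacity(workers, 0.05, 0.01):.2f}x")
```

With a non-zero coherence term, capacity peaks and then falls as workers are added, which is why both questions are worth testing for explicitly.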

  138. 1. Testability


    2. Instrumentation


    3. Feature Flags


    4. Static Verification

    View Slide

  139. Why did we pick it?

    View Slide

  140. • Learning from other engineering
    disciplines


    • High velocity, but preferably not at
    a very high upfront cost


    • Wanted to build upfront, not after
    the fact

    View Slide

  141. What are the effects?

    View Slide

  142. • NewRelic routinely predicts a lot of problems
    before they occur


    • Tech spec quality has gone up — we add
    metrics and dashboarding right into tech
    specs


    • Had our share of goof-ups. e.g. Shipped a
    major feature, which nobody used in
    production 🤦

    View Slide

  143. From our playbook

    View Slide

  144. • Number of bugs has gone down*


    • Bug triage process is fast (and
    getting faster; median first
    response is down to 4 min)


    • Consistently low tech debt; and we
    assess and track regularly

    View Slide

  145. View Slide

  147. How to do it?

    View Slide

  151. • Have 3 levels of
    instrumentation:
    • Infra & Systems level
    • Code & Application level
    • Product & Business level

    View Slide

  154. • Have at least two kinds of
    thresholds:
    • Too low and too high
    • Too long and too short

    View Slide
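
A two-sided threshold can be sketched in a few lines. The idea is that a metric dropping to near zero (no signups, a job finishing suspiciously fast) is as much an alert as a spike. The `Band` class, the metric names and the limits below are hypothetical examples, not values from our dashboards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    """An acceptable range for a metric; values outside it should alert."""
    low: float
    high: float

    def check(self, value: float) -> str:
        if value < self.low:
            return "too_low"   # for durations, read this as "too short"
        if value > self.high:
            return "too_high"  # for durations, read this as "too long"
        return "ok"


signups_per_hour = Band(low=20, high=500)     # count-style metric
job_duration_seconds = Band(low=2, high=120)  # duration-style metric

print(signups_per_hour.check(3))        # suspiciously quiet
print(job_duration_seconds.check(0.1))  # finished too fast to have done real work
print(job_duration_seconds.check(45))   # within the band
```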

  155. • Envision your production dashboards
    before even writing a single line of code


    • We're running a trial with GQM
    technique


    • Answer the 🏅 question: How do you
    know what you've built is working?

    View Slide

  156. • NewRelic, Cloudwatch & friends
    are your friends


    • Keep Logs, Metrics, APMs in
    one place

    View Slide

  157. 1. Testability


    2. Instrumentation


    3. Feature Flags


    4. Static Verification

    View Slide

  158. Why?

    View Slide

  159. • Essential with Trunk Based
    Development


    • Works very well with product
    experimentation

    View Slide

  160. Observations.

    View Slide

  161. • We now deploy "under development"
    work to production on Day One


    • Having fewer technologies has helped
    in usage standardisation. Flipper is 🤘


    • Code gets littered with branching.
    Live with it.

    View Slide

  162. How to do it?

    View Slide

  166. 3 Kinds of Feature Flags
    1. Infra / systems level (types of CPUs, Aurora
    vs RDS, etc)
    2. Code level (tied with continuous
    deployments and trunk development)
    3. Product and business level (A/B tests,
    experimentation)

    View Slide
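
A code-level flag can be as small as the sketch below. This is not Flipper's API (Flipper is a Ruby library). It is a hypothetical in-process `FeatureFlags` class written in Python to show the mechanism: unfinished code ships to production behind a flag that defaults to off, and a percentage rollout turns it on gradually. Hashing the flag name with the user id keeps each user's experience stable between requests.

```python
import hashlib


class FeatureFlags:
    """Hypothetical in-process flag store with percentage rollouts."""

    def __init__(self):
        self._rollout = {}  # flag name -> percentage of users enabled (0-100)

    def set_rollout(self, flag: str, percent: int) -> None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._rollout[flag] = percent

    def enabled(self, flag: str, user_id: str) -> bool:
        percent = self._rollout.get(flag, 0)  # unknown flags default to off
        digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100  # stable 0-99 bucket
        return bucket < percent


flags = FeatureFlags()
flags.set_rollout("new_homework_flow", 0)  # merged to trunk, dark in production


def homework_page(user_id: str) -> str:
    if flags.enabled("new_homework_flow", user_id):
        return "new flow"
    return "current flow"


print(homework_page("user-42"))
```

Raising the rollout to 100 switches every user to the new branch without a deploy, and the `if` can be deleted once the feature has matured.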

  167. • Not all feature flags live forever; kill the
    code branches when the feature matures.


    • Database changes have to be 100%
    backward and forward compatible


    • Prefer SDKs, libraries, code sharing over
    a centralised service for feature flags

    View Slide
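
The compatibility rule for database changes is easiest to see with an additive migration. The sketch below uses Python's built-in `sqlite3` module and a hypothetical `students` table: the new column is nullable, so code deployed before the migration keeps working after it, and code that knows about the column can already write to it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")


def old_insert(name):
    """Code already in production; it has never heard of grade_level."""
    conn.execute("INSERT INTO students (name) VALUES (?)", (name,))


def new_insert(name, grade_level):
    """Code shipped after the migration."""
    conn.execute(
        "INSERT INTO students (name, grade_level) VALUES (?, ?)",
        (name, grade_level),
    )


old_insert("Ayu")
# Additive, nullable change: safe to apply while the old code is still running.
conn.execute("ALTER TABLE students ADD COLUMN grade_level INTEGER")
old_insert("Bima")      # old code still works after the migration
new_insert("Cahya", 9)  # new code can use the column right away

rows = conn.execute("SELECT name, grade_level FROM students ORDER BY id").fetchall()
print(rows)
```

Renaming or dropping a column in one step would break whichever version of the code is not expecting it, so such changes are usually split into an additive step now and a cleanup step after every deployed version has moved on.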

  168. 1. Testability


    2. Instrumentation


    3. Feature Flags


    4. Static Verification

    View Slide

  169. • Linters ✅


    • Source Code Analysis for Security ✅


    • Metrics ✅


    • Exploring: TLA+ 🔮


    • Formal Verification: At the moment 🛑

    View Slide

  170. 1. Testability


    2. Instrumentation


    3. Feature Flags


    4. Static Verification

    View Slide

  171. 3. Reduce
    Operating Friction
    Help the team focus on what's important

    View Slide

  172. 1. Pipelines


    2. Infrastructure as Code


    3. Playbooks

    View Slide

  174. • Automated pipelines ✅


    • Manual deployment? 👎


    • Manual approval? 👎


    • Manual configuration? 👎

    View Slide

  175. • API service? Pipeline.


    • iOS app? Pipeline.


    • React App? Pipeline.


    • Data pipeline? Well, duh!

    View Slide

  176. Make reliable
    deployments a
    foregone conclusion.

    View Slide

  177. What do we do?

    View Slide


  178. • ~50 deployments per day


    • Slowest deployment to prod is 15 min,
    fastest is 3 min — this includes ALL THE
    TESTING


    • Deployments are completely transparent.
    You push code and things happen. Teams
    can focus on product and problems.

    View Slide

  180. • Entire infra is managed from the
    pipeline, it's tied into the AWS
    ecosystem.
    • Remember the golden rule: Every git
    push goes to production under 15
    minutes flat, with no manual approval
    whatsoever.

    View Slide

  181. 1. Pipelines


    2. Infrastructure as Code


    3. Playbooks

    View Slide

  182. Why did we pick it?

    View Slide

  186. • Declarative infrastructure; same code quality
    focus on DevOps as well
    • We want the vertical teams to define and
    manage their infra and not be blocked by a
    horizontal team
    • Reduce operational on-call burden
    • DevOps team works on hard platform problems
    and security challenges

    View Slide

  187. • We picked: AWS CDK


    • CDK in Python keeps the barrier
    low for engineering.

    View Slide

  188. What are the effects?

    View Slide

  191. Fast Turnaround
    •New Rails project from scratch, goes from 0 to (dev
    + stage + production) in 2 hours.
    •This includes Load balancer, DNS, HTTPS, Secrets,
    Docker (Fargate) cluster, Redis, Workers, RDS
    PostgreSQL, and all the things.
    •Of these 2 hours, 45 min is taken by RDS to bring
    up the server

    View Slide

  192. Fast Turnaround
    •Adding a new AWS Lambda to dev + stage + prod:
    15 to 30 minutes. Git push and you're in production.


    •30 min when complex pieces like SQS / SNS are
    involved


    •New Redis server? Add code, git commit, 15 min
    later: ✅

    View Slide

  193. • Infrastructure thinking and action is fully
    absorbed into Engineering now.


    • DevOps team has spent <1% of their total
    time on on-call issues.


    • They're working on pieces like turn-key
    environments, load-testing setups, security
    compliance, performance optimisations

    View Slide

  194. • Lower-order/sub-prime
    environments have very high
    parity with Production in terms
    of infra.


    • Remember better testing? This
    makes it possible and easy.

    View Slide

  195. So, how do you do it?

    View Slide

  198. • Relentless focus on Developer
    Productivity over infra costs.
    • Even in pure monetary terms, it's
    cheaper
    • We routinely and constantly save costs
    because developers have the headspace
    to think about high impact problems.

    View Slide

  199. • Pick AWS CDK, pick Cloud-Native:
    the combination is
    wildly effective.


    • Similar combinations exist with
    other providers

    View Slide

  200. • Treat infra team as an
    engineering team, not a support
    team.


    • Actively help them avoid
    becoming Jira card pushers

    View Slide

  201. 1. Pipelines


    2. Infrastructure as Code


    3. Playbooks

    View Slide

  202. Playbooks for Nearly Everything
    •Product Engineering? ✅


    •Mobile Development? ✅


    •Onboarding and Off-boarding? ✅


    •Git Usage? ✅ (WIP)


    •Feature Flags? ✅ (WIP)

    View Slide

  203. Templates for Nearly Everything
    •Decision Records? ✅


    •New code repositories? ✅


    •PRDs? ✅


    •Jira User Stories? ✅


    •Interview Problems? ✅ (WIP)

    View Slide

  204. What is the idea?
    •Reduce decision fatigue by codifying frequent
    decisions.


    •Improve compliance through written
    procedures


    •Encourage participation by making it open
    and editable to all

    View Slide

  205. 1. Pipelines


    2. Infrastructure as Code


    3. Playbooks

    View Slide

  206. Summary

    View Slide

  207. Rationale
    1. Continuous Deployments are good for
    you.


    2. If you're not doing it, you're playing in hard
    mode.


    3. At minimum, think preventive maintenance

    View Slide

  208. Build Rigour:


    Pairing, TDD, Trunk
    Based Development,
    On-Call Rotation

    View Slide

  209. Make Verification Easy:


    Testing, Instrumentation,
    Observability, Feature
    Flags

    View Slide

  210. Reduce Operating Friction:


    Infrastructure as Code,
    Immutable Infra, Pipelines,
    Playbooks

    View Slide

  211. A sustainable culture
    of building & shipping
    great products.

    View Slide

  212. Thank you! 🙏

    View Slide

  213. Questions?

    View Slide