
The Road to Continuous Deployments


Engineering Excellence through Continuous Delivery

An experience report on how we built a sustainable culture of shipping great products, and how you can too.

It's a playbook covering a wide range of topics that help build engineering rigor, including but not limited to:

* Pair programming
* Instrumentation
* Trunk-based development
* Test-driven development
* Feature flags
* Observability
* On-call rotation
* And much more...

Swanand Pagnis

September 08, 2022

Transcript

  1. The Road to Continuous Deployments Experience Report from CoLearn Engineering

  2. Swanand Pagnis 👨💼 CTO at CoLearn 🍻 meetup.com/Bangalore-Ruby-Users-Group/ 📔 info.pagnis.in 👨🏫 postgres-workshop.com
  3. Background

  4. Company • Education Platform for K-12 • In Indonesia (for now) • Live Classes • AI-Powered Homework Help

  5. >1 year ago • 8-month-old codebases • Service Oriented Architecture • NodeJS backend + ReactJS Frontend • Native Android in Kotlin • Native iOS in Swift

  6. Now • ~2-year-old codebases • Service Oriented Architecture • Rails/Django backend + ReactJS Frontend • Native Android in Kotlin. Flutter WIP. • Native iOS in Swift. Flutter WIP.

  7. Now • ~2-year-old codebases • Service Oriented Architecture • Rails/Django backend + ReactJS Frontend • Native Android in Kotlin. Flutter WIP. • Native iOS in Swift. Flutter WIP. Deprioritised, because 🚀 & 💰

  8. Legacy Systems • Built for an MVP stage • Came without thorough engineering practices baked in

  9. Growth • 1 year period • Engineering 10 -> 36 • Product 2 -> 9 • Design + Content 4 -> 20

  10. Target Audience 🎯 • Product Engineering Teams • Founders, CTOs, CPOs, VPs • Software Developers • Product Managers
  11. Why CD? 🤔

  12. What is this?

  13. Public Transport. ✅

  14. Every 1 Hour

  15. Every 5 Minutes

  16. Which is better? Every 1 Hour Every 5 Minutes

  17. How about this?

  18. This belt is continuous. Hop-on whenever. Hop-off wherever.

  19. The Bottom Line • Find and surface bugs faster • Repeatable, reliable delivery • Risk mitigation: "When the costs are non-linear, keep it small"
  20. • Failure ☠ 🏦

  21. • Failure ☠ 🏦 • Major Repairs 😱💰

  22. • Failure ☠ 🏦 • Major Repairs 😱💰 • Minor Repairs 😟 💵

  23. • Failure ☠ 🏦 • Major Repairs 😱💰 • Minor Repairs 😟 💵 • Preventive Maintenance 😅 🪙

  24. • Failure ☠ 🏦 • Major Repairs 😱💰 • Minor Repairs 😟 💵 • Preventive Maintenance 😅 🪙 ✅

  25. • Improved velocity 🏎 • Better product via rapid iterations ♻ • Improved code quality, reliability, architecture ☮
  26. What's the catch? 🎣

  27. • It needs rigour, which is not always possible

  28. • It needs rigour, which is not always possible • High inertia — needs time, effort, careful execution

  29. • It needs rigour, which is not always possible • High inertia — needs time, effort, careful execution • High short-term costs
  30. Methodology

  31. Pairing, TDD, Trunk Based Development, On-Call Rotation: Build Rigour

  32. Testing, Instrumentation, Observability, Feature Flags: Make Verification Easy

  33. Infrastructure as Code, Immutable Infra, Pipelines, Playbooks: Reduce Operating Friction

  34. A sustainable culture of building & shipping great products.

  35. 1. Build Rigour 2. Make Verification Easy 3. Low Operating Friction
  36. 1. Build Rigour

  37. 1. Build Rigour Putting the engineering in engineering

  38. 1. Pair Programming

  39. 1. Pair Programming 2. TDD

  40. 1. Pair Programming 2. TDD 3. Trunk Based Development

  41. 1. Pair Programming 2. TDD 3. Trunk Based Development 4. On-Call Rotation

  42. 1. Pair Programming 2. TDD 3. Trunk Based Development 4. On-Call Rotation
  43. Why did we pick it?

  44. ☝First, the results.

  45. Consistent high eNPS • Min 73, Max 87 • Better connections, better work relationships • Pandemic-induced remote anxiety went down 📉

  46. Increasing Velocity • Pairing velocity caught up with non-pairing velocity • Fewer delivery streams, same overall speed

  47. Consistent upward trend. Towards the end, holidays and covid knocked us down.

  48. Fast Onboarding • Ship code to production in Week 1 • New languages, frameworks in a sprint or two 🏎 • Internal transfers with zero friction 🧈

  49. Low Tech Debt • Greenfield projects: ~0 tech debt 👌 • Code quality up 📈 • Documentation quality and quantity up 📈 • Architecture has been flexible 💪
  50. Why did we pick it?

  51. A Combination Of •Prior experience •Established research •First principles thinking

  52. What does research say? • Improves design quality • Reduces defects (people spend less time on defective solutions) • Reduces staffing risk • Enhances technical skills • Improves team communications • Is considered more enjoyable at statistically significant levels. The Costs and Benefits of Pair Programming; Alistair Cockburn, Laurie Williams, Feb 2000
  53. How to do it?

  54. Use Driver Navigator • For an idea to go from the Navigator's head to the code, it must go through the Driver's hands.

  55. Use Driver Navigator • For an idea to go from the Navigator's head to the code, it must go through the Driver's hands. • Switch roles periodically. Say, every hour.

  56. Use Driver Navigator • For an idea to go from the Navigator's head to the code, it must go through the Driver's hands. • Switch roles periodically. Say, every hour. • Senior / Junior is immaterial. Both get both roles.

  57. Use Driver Navigator • Avoid giving line-by-line instructions; convey the general idea.

  58. Use Driver Navigator • Avoid giving line-by-line instructions; convey the general idea. • Seniors take the responsibility of mentoring

  59. Use Driver Navigator • When chopping onions, don't say "cut the top off, now break in half, make a slice" etc. • Just say "finely chopped" or "diced"

  60. Use Driver Navigator • Mentoring happens with both driving and navigating. • I leave it to you to figure out what the differences are.

  61. Use Driver Navigator • Remote pairing is better than in-person because of the natural role selection • Sharing screen? 👉 driver. Other 👉 navigator. • Mobbing is incredibly easy in remote. Just join the call and you're ready! 👍

  62. Switch Pairs Every Sprint • Avoid pairing silos, they stall culture propagation

  63. Switch Pairs Every Sprint • Avoid pairing silos, they stall culture propagation • Often, a pair will be 💯, switch them anyway.

  64. Switch Pairs Every Sprint • Avoid pairing silos, they stall culture propagation • Often, a pair will be 💯, switch them anyway. • Exhaust senior-junior pairs first

  65. Switch Pairs Every Sprint • Avoid pairing silos, they stall culture propagation • Often, a pair will be 💯, switch them anyway. • Exhaust senior-junior pairs first • When the sprint ends, you swap even if WIP. This is an effective litmus test.

  66. DevX is Crucial • Let pairs figure out the balance between solo focussed work and pairing • Have routine health-checks about how people are pairing, their experience, etc • Let pairing feature in 1:1s and other discussions
  67. Pilot + Co-Pilot = Pairing

  68. You don't say "Turn the lever by 10° and push that button"
  69. You say "Raise the elevation by 1000m"

  70. Pairing Summary • Use Driver-Navigator • Switch pairs every sprint; no silos • Routine health checks with the team

  71. 1. Pair Programming 2. TDD 3. Trunk Based Development 4. On-Call Rotation
  72. Why did we pick it?

  73. • TDD improves testability. This benefit alone is enough to embrace TDD. • TDD forces you to think in specifications, hence improving product thinking, along with code quality.
  74. What are the effects?

  75. • Clear & significant uptick in quality where TDD was followed vs where it wasn't.

  76. • Clear & significant uptick in quality where TDD was followed vs where it wasn't. • Legacy or greenfield doesn't matter

  77. • Clear & significant uptick in quality where TDD was followed vs where it wasn't. • Legacy or greenfield doesn't matter • TDD and Pairing are two incredible force multipliers; they feed into each other and create a strong positive gains loop.
  78. How to do it?

  79. None
  80. 1. Have senior engineers who are experienced in TDD

  81. 1. Have senior engineers who are experienced in TDD 2. Pair programming. Duh.

  82. 1. Pair Programming 2. TDD 3. Trunk Based Development 4. On-Call Rotation
  83. Why did we pick it?

  84. None
  85. • Impedance mismatch between long-lived-branch + PR-based workflow and how high-trust teams operate

  86. • Impedance mismatch between long-lived-branch + PR-based workflow and how high-trust teams operate • Build a sense of ownership in the codebase

  87. • Impedance mismatch between long-lived-branch + PR-based workflow and how high-trust teams operate • Build a sense of ownership in the codebase • Always be selling

  88. • Impedance mismatch between long-lived-branch + PR-based workflow and how high-trust teams operate • Build a sense of ownership in the codebase • Always be release ready
  89. What are the effects?

  90. • Code reviews are faster • Teams respond quicker to urgent and important bugs • We're running more iterations

  91. • Deploying to dev, stage has become slightly awkward because there's no 1:1 mapping • Turn-key environments have become a necessity rather than nice-to-have
  92. How to do it?

  93. Have fast builds • <1 min ideally, if possible • 15 min from git push to production deploy, including build • Enable focussed tests, i.e. run a single test from a single file

  94. 💯 Dev Machines • Fast, capable laptops • Must have automated & manual testing setup • Enable setting up any dependency

  95. 1. Pair Programming 2. TDD 3. Trunk Based Development 4. On-Call Rotation
  96. Observations?

  97. • Process still under iteration; no "yes, this works" yet

  98. • Process still under iteration; no "yes, this works" yet • Settled on functional rotations: Backend, Frontend, Mobile, DevOps

  99. • Process still under iteration; no "yes, this works" yet • Settled on functional rotations: Backend, Frontend, Mobile, DevOps • PTOs, leaves, weekends still pose a challenge from time to time

  100. • Team members that have done really well during on-call have also done really well in their performance reviews. • Correlation, yes. Causation? 🤷
  101. How to do it?

  102. Sharing our experience, not a walkthrough for on-call. Literature: PagerDuty Docs & Google's SRE Book

  103. • Start with a robust triage process. First response under 15 min. • Have a playbook where common problems and remedies are listed. • In B2C products, a handful of situations repeat like a persistent boomerang. FAB: Frequently Annoying Bugs.

  104. • Use managed services as much as possible; reduce operational on-call • Try hard for a "follow-the-sun" model, i.e. no wee hours • All alerts must be actionable; keep adjusting until they are

  105. 1. Pair Programming 2. TDD 3. Trunk Based Development 4. On-Call Rotation

  106. Every single process / methodology discussed so far has ancillary benefits that go way beyond just CD.

  107. 2. Make Verification Easy How do you know what you've done is working?
  108. None
  109. 1. Testability

  110. 1. Testability 2. Instrumentation

  111. 1. Testability 2. Instrumentation 3. Feature Flags

  112. 1. Testability 2. Instrumentation 3. Feature Flags 4. Static Verification

  113. 1. Testability 2. Instrumentation 3. Feature Flags 4. Static Verification

  114. Why?

  115. • Testability is a core engineering principle. • To be able to answer questions about a system by probing the right points and looking at indicators

  116. • Cars, bridges, rack & pinion — you can't just restart them. • Neither can you go and update them at will

  117. Hit the hammer: $1.0 Knowing where to hit: $9999.0

  118. • The more testable your environment is, the more people will actually test it. • Make it easy to test something and it will get tested. • Conversely, make it difficult to test and it's easy to slip.
  119. What are the observations?

  120. • Not having Dev & Stage as close to production as possible has routinely caused problems

  121. • Not having Dev & Stage as close to production as possible has routinely caused problems • Static branches mapping to environments (dev, stage, main) seem 👍, but are a 👎

  122. • Not having Dev & Stage as close to production as possible has routinely caused problems • Static branches mapping to environments (dev, stage, main) seem 👍, but are a 👎 • Opaque 3rd-party dependencies are incredibly hard to test, e.g. WhatsApp business APIs

  123. • SoA + inter-service dependencies = complexity at a polynomial growth rate (or worse, factorial)

  124. • SoA + inter-service dependencies = complexity at a polynomial growth rate (or worse, factorial) • Cloud-Native systems are a pain to test, but they do offer instrumentation.

  125. • SoA + inter-service dependencies = complexity at a polynomial growth rate (or worse, factorial) • Cloud-Native systems are a pain to test, but they do offer instrumentation. • UIs are inherently hard to test; add probes (Metrics, Analytics, Traces, Errors, etc)
  126. So, how to go about it?

  127. There are two main themes: 1. Development time 2. Runtime, in production

  128. Development Time • Use TDD • Add linters, code coverage to test builds • Postman / equivalent API tools are 💯 • Powerful Type Systems*
  129. Runtime • Make good use of lower order environments

  130. Runtime • Make good use of lower order environments • Heroku / Vercel style Review Apps are far more powerful than they seem

  131. Runtime • Make good use of lower order environments • Heroku / Vercel style Review Apps are far more powerful than they seem • Dive down deep into important bugs and see how they could've been tested earlier. (Which is different from how to reproduce them)

  132. Runtime • Add traces, especially to lower-order environments. (Example: AWS's X-Ray)

  133. Runtime • Add traces, especially to lower-order environments. (Example: AWS's X-Ray) • Try and build idempotent units of work. APIs, Workers, etc.

  134. Runtime • Add traces, especially to lower-order environments. (Example: AWS's X-Ray) • Try and build idempotent units of work. APIs, Workers, etc. • Pay special attention to non-idempotent units of work. Add checks and balances. (Example: OTPs)
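The idempotent-units-of-work advice above can be sketched in a few lines. This is a minimal, hypothetical example (the store and names are illustrative, not from the talk): a worker keyed by an idempotency key, so a retried job returns the stored result instead of re-running its side effect. A real system would back the store with something durable, such as a database table with a unique index.

```python
# Minimal idempotent-worker sketch. The in-memory dict stands in for a
# durable store (Redis, a DB table with a unique index on the key, etc).
processed = {}  # idempotency_key -> stored result

def charge_user(user_id, amount):
    # Imagine a real side effect here (payment API call, email send, etc).
    return {"user_id": user_id, "charged": amount}

def handle(idempotency_key, user_id, amount):
    # A retry with the same key returns the stored result instead of
    # running the side effect a second time.
    if idempotency_key in processed:
        return processed[idempotency_key]
    result = charge_user(user_id, amount)
    processed[idempotency_key] = result
    return result

first = handle("order-42", user_id=7, amount=100)
retry = handle("order-42", user_id=7, amount=100)
assert first is retry  # the side effect ran exactly once
```

The same shape applies to APIs (an `Idempotency-Key` request header) and to queue consumers that may receive a message more than once.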
  135. Runtime • Add traces, especially to lower-order environments. (Example: AWS's X-Ray) • Try and build idempotent units of work. APIs, Workers, etc. • Pay special attention to non-idempotent units of work. Add checks and balances. (Example: OTPs) • Eliminate string logs. All log statements are events, with key-value pairs*
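The "logs are events, with key-value pairs" point can be sketched with the standard library alone. This is a minimal illustration, not the talk's actual setup; the field names are made up:

```python
import json
import logging

class EventFormatter(logging.Formatter):
    """Render every log record as a JSON event with key-value pairs."""
    def format(self, record):
        event = {"event": record.getMessage(), "level": record.levelname}
        # Key-value pairs travel on the record via the `extra` argument.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(EventFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Instead of a string log like "user 7 paid 100":
logger.info("payment.captured", extra={"fields": {"user_id": 7, "amount": 100}})
```

Because each line is a structured event, a log aggregator can filter and chart on `user_id` or `amount` directly instead of regex-parsing free text.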
  136. In both Environments • Always test for contention: • What must happen sequentially? Does it? • Always test for coherence: • How much and what information do two systems need to collect from each other? Do they?

  137. Recommended Reading • Neil Gunther's work on Universal Scalability Law and Quantifying Scalability and Performance • Michael Nygard's "Release It!"
  138. 1. Testability 2. Instrumentation 3. Feature Flags 4. Static Verification

  139. Why did we pick it?

  140. • Learning from other engineering disciplines • High velocity, but preferably not at a very high upfront cost • Wanted to build upfront, not after the fact
  141. What are the effects?

  142. • NewRelic routinely predicts a lot of problems before they occur • Tech spec quality has gone up — we add metrics and dashboarding right into tech specs • Had our share of goof-ups, e.g. shipped a major feature which nobody used in production 🤦
  143. From our playbook

  144. • Number of bugs has gone down* • Bug triage process is fast (and getting faster; median first response is down to 4 min) • Consistently low tech debt; and we assess and track regularly

  145. None

  146. • Number of bugs has gone down* • Bug triage process is fast (and getting faster; median first response is down to 4 min) • Consistently low tech debt; and we assess and track regularly
  147. How to do it?

  148. • Have 3 levels of instrumentation:

  149. • Have 3 levels of instrumentation: • Infra & Systems level

  150. • Have 3 levels of instrumentation: • Infra & Systems level • Code & Application level

  151. • Have 3 levels of instrumentation: • Infra & Systems level • Code & Application level • Product & Business level
  152. • Have at least two kinds of thresholds:

  153. • Have at least two kinds of thresholds: • Too low and too high

  154. • Have at least two kinds of thresholds: • Too low and too high • Too long and too short
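The two-sided-threshold idea above can be sketched as a small check. This is an illustrative example (the metric name and numbers are made up): a floor catches "suspiciously quiet" failures just as a ceiling catches overload.

```python
def check_threshold(name, value, low, high):
    """Alert when a metric is too low OR too high.
    The floor catches 'suspiciously quiet' failures (e.g. orders at zero
    usually means checkout is broken, not that customers are calm); the
    ceiling catches overload (e.g. an error rate spiking)."""
    if value < low:
        return f"ALERT {name}: {value} below floor {low}"
    if value > high:
        return f"ALERT {name}: {value} above ceiling {high}"
    return None  # within the healthy band, no alert

# Illustrative: orders per minute, with both bounds configured.
assert check_threshold("orders_per_min", 0, low=5, high=500) is not None
assert check_threshold("orders_per_min", 120, low=5, high=500) is None
```

The same shape applies to durations ("too long and too short"): a request that completes in 1 ms may be short-circuiting and failing silently, just as one taking 30 s is timing out.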
  155. • Envision your production dashboards before even writing a single line of code • We're running a trial with the GQM technique • Answer the 🏅 question: How do you know what you've built is working?

  156. • NewRelic, Cloudwatch & friends are your friends • Keep Logs, Metrics, APMs in one place
  157. 1. Testability 2. Instrumentation 3. Feature Flags 4. Static Verification

  158. Why?

  159. • Essential with Trunk Based Development • Works very well with product experimentation
  160. Observations.

  161. • We now deploy "under development" work to production on Day One • Having fewer technologies has helped in usage standardisation. Flipper is 🤘 • Code gets littered with branching. Live with it.
  162. How to do it?

  163. 3 Kinds of Feature Flags

  164. 3 Kinds of Feature Flags 1. Infra / systems level (types of CPUs, Aurora vs RDS, etc)

  165. 3 Kinds of Feature Flags 1. Infra / systems level (types of CPUs, Aurora vs RDS, etc) 2. Code level (tied with continuous deployments and trunk development)

  166. 3 Kinds of Feature Flags 1. Infra / systems level (types of CPUs, Aurora vs RDS, etc) 2. Code level (tied with continuous deployments and trunk development) 3. Product and business level (A/B tests, experimentation)

  167. • Not all feature flags live forever; kill the code branches when the feature matures. • Database changes have to be 100% backward and forward compatible • Prefer SDKs, libraries, code sharing over a centralised service for feature flags
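A code-level flag of the kind described above can be sketched in a few lines. This is a generic, hypothetical in-process flag store (the deck's actual tool is Flipper, a Ruby library), with a percentage rollout so unfinished trunk code stays dark for most users:

```python
import hashlib

# Illustrative flag config; in practice this lives in a shared store.
FLAGS = {"new_checkout": {"enabled": True, "percent": 10}}

def flag_on(name, actor_id):
    """Is this flag on for this actor? Hashing the actor gives each user
    a stable yes/no, i.e. a deterministic percentage rollout."""
    flag = FLAGS.get(name)
    if not flag or not flag["enabled"]:
        return False
    digest = hashlib.sha256(f"{name}:{actor_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < flag["percent"]

def checkout(user_id):
    if flag_on("new_checkout", user_id):
        return "new flow"  # merged to trunk, but dark for ~90% of users
    return "old flow"
```

This is the mechanism that makes trunk-based development and Day One production deploys safe: the branch point moves from version control into the code path, and the flag (plus its `if` branches) is deleted once the feature matures.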
  168. 1. Testability 2. Instrumentation 3. Feature Flags 4. Static Verification

  169. • Linters ✅ • Source Code Analysis for Security ✅ • Metrics ✅ • Exploring: TLA+ 🔮 • Formal Verification: At the moment 🛑
  170. 1. Testability 2. Instrumentation 3. Feature Flags 4. Static Verification

  171. 3. Reduce Operating Friction Help team focus on the important

  172. 1. Pipelines 2. Infrastructure as Code 3. Playbooks

  173. 1. Pipelines 2. Infrastructure as Code 3. Playbooks

  174. • Automated pipelines ✅ • Manual deployment? 👎 • Manual approval? 👎 • Manual configuration? 👎

  175. • API service? Pipeline. • iOS app? Pipeline. • React App? Pipeline. • Data pipeline? Well, duh!
  176. Make reliable deployments a foregone conclusion.

  177. What do we do?

  178. • ~50 deployments per day • Slowest deployment to prod is 15 min, fastest is 3 min — this includes ALL THE TESTING • Deployments are completely transparent. You push code and things happen. Teams can focus on product and problems.

  179. • Entire infra is managed from the pipeline; it's tied into the AWS ecosystem.

  180. • Entire infra is managed from the pipeline; it's tied into the AWS ecosystem. • Remember the golden rule: Every git push goes to production under 15 minutes flat, with no manual approval whatsoever.
  181. 1. Pipelines 2. Infrastructure as Code 3. Playbooks

  182. Why did we pick it?

  183. • Declarative infrastructure; same code quality focus on DevOps as well

  184. • Declarative infrastructure; same code quality focus on DevOps as well • We want the vertical teams to define and manage their infra and not be blocked by a horizontal team

  185. • Declarative infrastructure; same code quality focus on DevOps as well • We want the vertical teams to define and manage their infra and not be blocked by a horizontal team • Reduce operational on-call burden

  186. • Declarative infrastructure; same code quality focus on DevOps as well • We want the vertical teams to define and manage their infra and not be blocked by a horizontal team • Reduce operational on-call burden • DevOps team works on hard platform problems and security challenges

  187. • We picked: AWS CDK • CDK Python makes it a low barrier for engineering.
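As a flavour of what the CDK Python choice looks like, here is a minimal, hypothetical stack sketch — the construct IDs and the per-environment loop are illustrative, not from the talk, and an actual deployment needs AWS credentials plus `cdk deploy`:

```python
# Minimal AWS CDK v2 sketch: one stack holding a queue, instantiated
# per environment. Real stacks add VPCs, Fargate services, RDS, etc.
from aws_cdk import App, Duration, Stack
from aws_cdk import aws_sqs as sqs
from constructs import Construct

class WorkerStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # Declarative resource: the pipeline diffs and applies this on git push.
        sqs.Queue(self, "JobsQueue", visibility_timeout=Duration.seconds(300))

app = App()
# One stack instance per environment keeps dev/stage/prod at high parity.
for env_name in ("dev", "stage", "prod"):
    WorkerStack(app, f"worker-{env_name}")
app.synth()  # emits CloudFormation templates for the pipeline to deploy
```

Because the same class synthesises every environment, the "lower-order environments at high parity with production" result later in the deck falls out almost for free.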
  188. What are the effects?

  189. Fast Turnaround • New Rails project from scratch goes from 0 to (dev + stage + production) in 2 hours.

  190. Fast Turnaround • New Rails project from scratch goes from 0 to (dev + stage + production) in 2 hours. • This includes Load balancer, DNS, HTTPS, Secrets, Docker (Fargate) cluster, Redis, Workers, RDS PostgreSQL, and all the things.

  191. Fast Turnaround • New Rails project from scratch goes from 0 to (dev + stage + production) in 2 hours. • This includes Load balancer, DNS, HTTPS, Secrets, Docker (Fargate) cluster, Redis, Workers, RDS PostgreSQL, and all the things. • Out of these 2 hours, 45 min is taken by RDS to bring up the server

  192. Fast Turnaround • Adding a new AWS Lambda to dev + stage + prod: 15 to 30 minutes. Git push and you're in production. • 30 min when complex pieces like SQS / SNS are involved • New Redis server? Add code, git commit, 15 min later: ✅

  193. • Infrastructure thinking and action is fully absorbed into Engineering now. • DevOps team has spent <1% of their total time on on-call issues. • They're working on pieces like turn-key environments, load-testing setups, security compliance, performance optimisations

  194. • Lower-order/sub-prime environments are at very high parity with Production in terms of infra. • Remember better testing? This makes it possible and easy.
  195. So, how do you do it?

  196. • Relentless focus on Developer Productivity over infra costs.

  197. • Relentless focus on Developer Productivity over infra costs. • Even in pure monetary terms, it's cheaper

  198. • Relentless focus on Developer Productivity over infra costs. • Even in pure monetary terms, it's cheaper • We routinely and constantly save costs because developers have the headspace to think about high-impact problems.

  199. • Pick AWS CDK, Pick Cloud-Native: The combination is wildly effective. • Similar combinations exist with other providers

  200. • Treat the infra team as an engineering team, not a support team. • Actively help them avoid becoming Jira card pushers
  201. 1. Pipelines 2. Infrastructure as Code 3. Playbooks

  202. Playbooks for Nearly Everything • Product Engineering? ✅ • Mobile Development? ✅ • Onboarding and Off-boarding? ✅ • Git Usage? ✅ (WIP) • Feature Flags? ✅ (WIP)

  203. Templates for Nearly Everything • Decision Records? ✅ • New code repositories? ✅ • PRDs? ✅ • Jira User Stories? ✅ • Interview Problems? ✅ (WIP)

  204. What is the idea? • Reduce decision fatigue by codifying frequent decisions. • Improve compliance through written procedures • Encourage participation by making it open and editable to all
  205. 1. Pipelines 2. Infrastructure as Code 3. Playbooks

  206. Summary

  207. Rationale 1. Continuous Deployments are good for you. 2. If you're not doing it, you're playing in hard mode. 3. At minimum, think preventive maintenance
  208. Build Rigour: Pairing, TDD, Trunk Based Development, On-Call Rotation

  209. Make Verification Easy: Testing, Instrumentation, Observability, Feature Flags

  210. Reduce Operating Friction: Infrastructure as Code, Immutable Infra, Pipelines, Playbooks

  211. A sustainable culture of building & shipping great products.

  212. Thank you! 🙏

  213. Questions?