The Road to Continuous Deployments

Engineering Excellence through Continuous Delivery

An experience report on how we built a sustainable culture of shipping great products and how you can too.

It's a playbook covering a wide range of topics that help build engineering rigor including, but not limited to:

* Pair programming
* Instrumentation
* Trunk-based development
* Test-driven development
* Feature flags
* Observability
* On-call rotation
* And much more...

Swanand Pagnis

September 08, 2022

Transcript

  1. Company • Education Platform for K-12 • In Indonesia (for now) • Live Classes • AI-Powered Homework Help
  2. >1 year ago • 8-month-old codebases • Service-Oriented Architecture • NodeJS backend + ReactJS frontend • Native Android in Kotlin • Native iOS in Swift
  4. Now • ~2-year-old codebases • Service-Oriented Architecture • Rails/Django backend + ReactJS frontend • Native Android in Kotlin. Flutter WIP. • Native iOS in Swift. Flutter WIP. Deprioritised, because 🚀 & 💰
  5. The Bottom Line • Find and surface bugs faster • Repeatable, reliable delivery • Risk mitigation: "When the costs are non-linear, keep it small"
  7. • Failure ☠ 🏦 • Major Repairs 😱 💰 • Minor Repairs 😟 💵 • Preventive Maintenance 😅 🪙 ✅
  8. • Improved velocity 🏎 • Better product via rapid iterations ♻ • Improved code quality, reliability, architecture ☮
  10. • It needs rigour, which is not always possible • High inertia: needs time, effort, careful execution • High short-term costs
  11. Consistent high eNPS • Min 73, Max 87 • Better connections, better work relationships • Pandemic-induced remote anxiety went down 📉
  12. Fast Onboarding • Ship code to production in Week 1 • New languages, frameworks in a sprint or two 🏎 • Internal transfers with zero friction 🧈
  13. Low Tech Debt • Greenfield projects: ~0 tech debt 👌 • Code quality up 📈 • Documentation quality and quantity up 📈 • Architecture has been flexible 💪
  14. What does research say? • Improves design quality • Reduces defects (people spend less time on defective solutions) • Reduces staffing risk • Enhances technical skills • Improves team communications • Is considered more enjoyable at statistically significant levels. Source: "The Costs and Benefits of Pair Programming", Alistair Cockburn and Laurie Williams, Feb 2000
  17. Use Driver Navigator • For an idea to go from the Navigator's head to the code, it must go through the Driver's hands. • Switch roles periodically. Say, every hour. • Senior / Junior is immaterial. Both get both roles.
  18. Use Driver Navigator • Avoid giving line-by-line instructions; convey the general idea. • Seniors take the responsibility of mentoring
  19. Use Driver Navigator • When chopping onions, don't say "cut the top off, now break in half, make a slice" etc. • Just say "finely chopped" or "diced"
  20. Use Driver Navigator • Mentoring happens with both driving and navigating. • I leave it to you to figure out what the differences are.
  21. Use Driver Navigator • Remote pairing is better than in-person because of the natural role selection • Sharing screen? 👉 driver. Other 👉 navigator. • Mobbing is incredibly easy in remote. Just join the call and you're ready! 👍
  24. Switch Pairs Every Sprint • Avoid pairing silos, they stall culture propagation • Often, a pair will be 💯, switch them anyway. • Exhaust senior-junior pairs first • When the sprint ends, swap even if work is still in progress. This is an effective litmus test.
  25. DevX is Crucial • Let pairs figure out the balance between solo focussed work and pairing • Have routine health-checks about how people are pairing, their experience, etc. • Let pairing feature in 1:1s and other discussions
  26. • TDD improves testability. This benefit alone is enough to embrace TDD. • TDD forces you to think in specifications, hence improving product thinking, along with code quality.
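
A minimal sketch of what "thinking in specifications" can look like, using Python's standard `unittest` module. The `apply_discount` function and its rules are hypothetical, invented here for illustration; they are not from the talk. The idea is that each test name states one rule of the spec, and the implementation is only as large as the tests require.

```python
import unittest


def apply_discount(price_cents: int, percent: int) -> int:
    """Return the price after a percentage discount, in whole cents.

    Hypothetical example function: in a TDD flow it is written only
    after the tests below have described the behaviour we want.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Integer arithmetic rounds any fractional cent down.
    return price_cents * (100 - percent) // 100


class ApplyDiscountSpec(unittest.TestCase):
    """Each test name reads as one line of the specification."""

    def test_zero_percent_leaves_price_unchanged(self):
        self.assertEqual(apply_discount(1000, 0), 1000)

    def test_full_discount_makes_it_free(self):
        self.assertEqual(apply_discount(1000, 100), 0)

    def test_result_is_rounded_down_to_whole_cents(self):
        self.assertEqual(apply_discount(999, 10), 899)

    def test_percent_outside_0_to_100_is_rejected(self):
        with self.assertRaises(ValueError):
            apply_discount(1000, 101)
```

Assuming the block is saved as `test_discount.py` (a made-up file name), a single test can be run on its own with `python -m unittest test_discount.ApplyDiscountSpec.test_full_discount_makes_it_free`.
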
  28. • Clear & significant uptick in quality where TDD was followed vs where it wasn't. • Legacy or greenfield doesn't matter • TDD and Pairing are two incredible force multipliers; they feed into each other and create a strong positive gains loop.
  31. • Impedance mismatch between long-lived-branch + PR-based workflow and how high-trust teams operate • Build a sense of ownership in the codebase • Always be release ready (not "always be selling")
  32. • Code reviews are faster • Teams respond quicker to urgent and important bugs • We're running more iterations
  33. • Deploying to dev, stage has become slightly awkward because there's no 1:1 mapping between branches and environments • Turn-key environments have become a necessity rather than nice-to-have
  34. Have fast builds • < 1 min ideally, if possible • 15 min from git push to production deploy including build • Enable focussed tests, i.e. run a single test from a single file
  35. 💯 Dev Machines • Fast, capable laptops • Must have automated & manual testing setup • Enable setting up any dependency
  37. • Process still under iteration; no "yes, this works" yet • Settled on functional rotations: Backend, Frontend, Mobile, DevOps • PTOs, leaves, weekends still pose a challenge from time to time
  38. • Team members that have done really well during on-call have also done really well in their performance reviews. • Correlation, yes. Causation? 🤷
  39. • Start with a robust triage process. First response under 15 min. • Have a playbook where common problems and remedies are listed. • In B2C products, a handful of situations repeat like a persistent boomerang. FAB: Frequently Annoying Bugs.
  40. • Use managed services as much as possible; reduce operational on-call • Try hard for a "follow-the-sun" model, i.e. no wee hours • All alerts must be actionable; keep adjusting until they are
  41. • Testability is a core engineering principle. • It means being able to answer questions about a system by probing the right points and looking at indicators
  42. • Cars, bridges, rack & pinion: you can't just restart them. • Neither can you go and update them at will
  43. • The more testable your environment is, the more people will actually test it. • Make it easy to test something and it will get tested. • Conversely, make it difficult to test and it's easy to slip.
  45. • Dev & Stage not being close enough to production has routinely caused problems • Static branches mapping to environments (dev, stage, main) seem 👍, but are a 👎 • Opaque 3rd-party dependencies are incredibly hard to test, e.g. WhatsApp Business APIs
  47. • SOA + inter-service dependencies = complexity at a polynomial growth rate (or worse, factorial) • Cloud-Native systems are a pain to test, but they do offer instrumentation. • UIs are inherently hard to test; add probes (Metrics, Analytics, Traces, Errors, etc.)
  48. Development Time • Use TDD • Add linters, code coverage to test builds • Postman / equivalent API tools are 💯 • Powerful Type Systems*
  50. Runtime • Make good use of lower-order environments • Heroku / Vercel style Review Apps are far more powerful than they seem • Dive deep into important bugs and see how they could've been tested earlier. (Which is different from how to reproduce them.)
  53. Runtime • Add traces, especially to lower-order environments. (Example: AWS's X-Ray) • Try and build idempotent units of work. APIs, Workers, etc. • Pay special attention to non-idempotent units of work. Add checks and balances. (Example: OTPs) • Eliminate string logs. All log statements are events, with key-value pairs*
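
The last two ideas on this slide, idempotent units of work and logs as key-value events, can be sketched together. This is an illustrative sketch under stated assumptions, not the team's implementation: `log_event`, `PaymentProcessor`, the event names, and the in-memory store are all hypothetical, and a real service would keep idempotency keys in a durable store.

```python
import json
import time


def log_event(event: str, **fields) -> str:
    """Emit one log line as a JSON object instead of a free-form string."""
    record = {"event": event, "ts": time.time(), **fields}
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line


class PaymentProcessor:
    """Hypothetical unit of work made idempotent with a caller-supplied key."""

    def __init__(self):
        # idempotency_key -> stored result (in memory, for illustration only)
        self._results = {}

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A retry with the same key returns the stored result
        # instead of performing the charge a second time.
        if idempotency_key in self._results:
            log_event("charge.replayed", key=idempotency_key)
            return self._results[idempotency_key]

        result = {"status": "charged", "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        log_event("charge.created", key=idempotency_key,
                  amount_cents=amount_cents)
        return result
```

Because every log line is a JSON object, questions such as "how many charges were replayed?" become a filter on the `event` field rather than a text search.
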
  54. In both Environments • Always test for contention: What must happen sequentially? Does it? • Always test for coherence: How much and what information do two systems need to collect from each other? Do they?
  55. Recommended Reading • Neil Gunther's work on Universal Scalability Law and Quantifying Scalability and Performance • Michael Nygard's "Release It!"
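
Contention and coherence are the two penalty terms in Gunther's Universal Scalability Law, which gives relative capacity at load N as C(N) = N / (1 + α(N − 1) + βN(N − 1)), with α the contention coefficient and β the coherence coefficient. A small sketch of the formula follows; the function names and the coefficient values in the usage note are illustrative assumptions, not measurements from the talk.

```python
def usl_capacity(n: int, alpha: float, beta: float) -> float:
    """Relative capacity at n workers under the Universal Scalability Law.

    alpha: contention penalty (work that must happen sequentially).
    beta:  coherence penalty (cost of nodes exchanging state).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))


def peak_concurrency(alpha: float, beta: float) -> float:
    """Load at which capacity peaks: sqrt((1 - alpha) / beta), for beta > 0."""
    if beta <= 0:
        raise ValueError("beta must be positive for a finite peak")
    return ((1 - alpha) / beta) ** 0.5
```

With made-up coefficients such as `alpha=0.05` and `beta=0.001`, `peak_concurrency` lands near 31 workers; past that point adding workers reduces throughput, which is why the coherence question on the slide is worth testing explicitly.
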
  56. • Learning from other engineering disciplines • High velocity, but preferably not at a very high upfront cost • Wanted to build upfront, not after the fact
  57. • New Relic routinely predicts a lot of problems before they occur • Tech spec quality has gone up: we add metrics and dashboarding right into tech specs • Had our share of goof-ups, e.g. shipped a major feature which nobody used in production 🤦
  58. • Number of bugs has gone down* • Bug triage process is fast (and getting faster; median first response is down to 4 min) • Consistently low tech debt; and we assess and track regularly
  60. • Have 3 levels of instrumentation: • Infra & Systems level • Code & Application level • Product & Business level
  61. • Have at least two kinds of thresholds: • Too low and too high • Too long and too short
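
A two-sided threshold can be sketched as a band with a lower and an upper edge, so that a metric that is suspiciously quiet alerts just like one that is too loud. The `Band` class and the numbers in the usage note are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Band:
    """Acceptable range for a metric; crossing either edge is worth an alert."""
    low: float
    high: float

    def check(self, value: float) -> Optional[str]:
        """Return 'too_low', 'too_high', or None when the value is in range."""
        if value < self.low:
            return "too_low"
        if value > self.high:
            return "too_high"
        return None
```

For example, a nightly job with `Band(low=60, high=900)` seconds would alert when it runs for 5 seconds (it probably did no work) as well as when it runs for 20 minutes.
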
  62. • Envision your production dashboards before even writing a single line of code • We're running a trial with the GQM (Goal, Question, Metric) technique • Answer the 🏅 question: How do you know what you've built is working?
  63. • We now deploy "under development" work to production on Day One • Having fewer technologies has helped in usage standardisation. Flipper is 🤘 • Code gets littered with branching. Live with it.
  66. 3 Kinds of Feature Flags 1. Infra / systems level (types of CPUs, Aurora vs RDS, etc.) 2. Code level (tied with continuous deployments and trunk development) 3. Product and business level (A/B tests, experimentation)
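
A code-level flag is what lets unfinished work ship to production on Day One while staying dark. The sketch below is a generic in-process flag with percentage rollout; it is not Flipper's API, and `FeatureFlags`, `homework_help`, and the flag name `new_solver` are hypothetical names used only for illustration.

```python
import hashlib


class FeatureFlags:
    """Tiny in-process flag store with deterministic percentage rollout."""

    def __init__(self):
        # flag name -> percentage of actors for whom the flag is on (0..100)
        self._rollout = {}

    def set_rollout(self, flag: str, percent: int) -> None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._rollout[flag] = percent

    def is_enabled(self, flag: str, actor_id: str) -> bool:
        percent = self._rollout.get(flag, 0)  # unknown flags are off
        # Hash flag + actor so each actor gets a stable bucket per flag.
        digest = hashlib.sha256(f"{flag}:{actor_id}".encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return bucket < percent


def homework_help(flags: FeatureFlags, user_id: str) -> str:
    # The branch that "litters" the code until the feature matures.
    if flags.is_enabled("new_solver", user_id):
        return "new solver"
    return "old solver"
```

Hashing rather than random sampling keeps a given user on the same side of the flag across requests, which matters once the same mechanism is reused for product-level experiments.
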
  67. • Not all feature flags live forever; kill the code branches when the feature matures. • Database changes have to be 100% backward and forward compatible • Prefer SDKs, libraries, code sharing over a centralised service for feature flags
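
One common way to keep a schema change compatible in both directions is to make it purely additive, so code written before and after the migration can run against the same table. The sketch below uses Python's built-in `sqlite3` with a hypothetical `students` table; the table, columns, and function names are assumptions for illustration, and the talk does not say which migration approach the team uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)


def old_insert(db: sqlite3.Connection, name: str) -> None:
    """Code deployed before the migration; it names its columns explicitly."""
    db.execute("INSERT INTO students (name) VALUES (?)", (name,))


def new_insert(db: sqlite3.Connection, name: str, grade: int) -> None:
    """Code deployed after the migration; it also writes the new column."""
    db.execute(
        "INSERT INTO students (name, grade) VALUES (?, ?)", (name, grade)
    )


old_insert(conn, "Ayu")

# Additive migration: a new nullable column. Nothing is renamed or dropped,
# so old code keeps working and the deploy can be rolled back safely.
conn.execute("ALTER TABLE students ADD COLUMN grade INTEGER")

old_insert(conn, "Budi")      # old code still works after the migration
new_insert(conn, "Citra", 7)  # new code uses the new column

rows = conn.execute(
    "SELECT name, grade FROM students ORDER BY id"
).fetchall()
```

Rows written by the old code simply carry `NULL` in the new column, so the new code has to tolerate missing values until a backfill has run.
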
  68. • Linters ✅ • Source Code Analysis for Security ✅ • Metrics ✅ • Exploring: TLA+ 🔮 • Formal Verification: At the moment 🛑
  69. • Automated pipelines ✅ • Manual deployment? 👎 • Manual approval? 👎 • Manual configuration? 👎
  70. • API service? Pipeline. • iOS app? Pipeline. • React App? Pipeline. • Data pipeline? Well, duh!
  71. • ~50 deployments per day • Slowest deployment to prod is 15 min, fastest is 3 min; this includes ALL THE TESTING • Deployments are completely transparent. You push code and things happen. Teams can focus on product and problems.
  72. • Entire infra is managed from the pipeline; it's tied into the AWS ecosystem. • Remember the golden rule: Every git push goes to production in under 15 minutes flat, with no manual approval whatsoever.
  75. • Declarative infrastructure; same code quality focus on DevOps as well • We want the vertical teams to define and manage their infra and not be blocked by a horizontal team • Reduce operational on-call burden • DevOps team works on hard platform problems and security challenges
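
"Declarative" here means the team describes the state it wants and lets tooling work out the changes. A toy sketch of that idea follows: a `plan` function that diffs desired against current resources. It is a conceptual illustration only and does not reflect how AWS CDK or CloudFormation work internally; the resource names and shapes are made up.

```python
def plan(desired: dict, current: dict) -> list:
    """Compare desired and current resources and list the actions needed.

    Both arguments map a resource name to its configuration.
    """
    actions = []
    for name in sorted(desired.keys() | current.keys()):
        if name not in current:
            actions.append(("create", name))
        elif name not in desired:
            actions.append(("delete", name))
        elif desired[name] != current[name]:
            actions.append(("update", name))
        # Identical resources need no action.
    return actions


desired_state = {"redis": {"nodes": 1}, "queue": {}}
current_state = {"redis": {"nodes": 2}, "old-worker": {}}
changes = plan(desired_state, current_state)
```

Because the desired state lives in the repository, a vertical team can review an infra change the same way it reviews application code.
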
  76. • We picked: AWS CDK • CDK in Python keeps the barrier low for engineers.
  79. Fast Turnaround • New Rails project from scratch goes from 0 to (dev + stage + production) in 2 hours. • This includes Load balancer, DNS, HTTPS, Secrets, Docker (Fargate) cluster, Redis, Workers, RDS PostgreSQL, and all the things. • Out of these 2 hours, 45 min is taken by RDS to bring up the server
  80. Fast Turnaround • Adding a new AWS Lambda to dev + stage + prod: 15 to 30 minutes. Git push and you're in production. • 30 min when complex pieces like SQS / SNS are involved • New Redis server? Add code, git commit, 15 min later: ✅
  81. • Infrastructure thinking and action is fully absorbed into Engineering now. • DevOps team has spent < 1% of their total time on on-call issues. • They're working on pieces like turn-key environments, load-testing setups, security compliance, performance optimisations
  82. • Lower-order/sub-prime environments are at very high parity with Production in terms of infra. • Remember better testing? This makes it possible and easy.
  83. • Relentless focus on Developer Productivity over infra costs. • Even in pure monetary terms, it's cheaper • We routinely and constantly save costs because developers have the headspace to think about high impact problems.
  84. • Pick AWS CDK, Pick Cloud-Native: The combination is wildly effective. • Similar combinations exist with other providers
  85. • Treat infra team as an engineering team, not a support team. • Actively help them avoid becoming Jira card pushers
  86. Playbooks for Nearly Everything • Product Engineering? ✅ • Mobile Development? ✅ • Onboarding and Off-boarding? ✅ • Git Usage? ✅ (WIP) • Feature Flags? ✅ (WIP)
  87. Templates for Nearly Everything • Decision Records? ✅ • New code repositories? ✅ • PRDs? ✅ • Jira User Stories? ✅ • Interview Problems? ✅ (WIP)
  88. What is the idea? • Reduce decision fatigue by codifying frequent decisions. • Improve compliance through written procedures • Encourage participation by making it open and editable to all
  89. Rationale 1. Continuous Deployments are good for you. 2. If you're not doing it, you're playing in hard mode. 3. At minimum, think preventive maintenance