
The Road to Continuous Deployments


Engineering Excellence through Continuous Delivery

An experience report on how we built a sustainable culture of shipping great products, and how you can too.

It's a playbook covering a wide range of topics that help build engineering rigor, including but not limited to:

* Pair programming
* Instrumentation
* Trunk-based development
* Test-driven development
* Feature flags
* Observability
* On-call rotation
* And much more...

Swanand Pagnis

September 08, 2022

Transcript

  1. The Road to Continuous Deployments Experience Report from CoLearn Engineering

  2. Swanand Pagnis 👨💼 CTO at CoLearn 🍻 meetup.com/Bangalore-Ruby-Users-Group/ 📔 info.pagnis.in 👨🏫 postgres-workshop.com
  3. Background

  4. Company • Education Platform for K-12 • In Indonesia (for now) • Live Classes • AI-Powered Homework Help

  5. >1 year ago • 8-month-old codebases • Service Oriented Architecture • NodeJS backend + ReactJS Frontend • Native Android in Kotlin • Native iOS in Swift

  6. Now • ~2-year-old codebases • Service Oriented Architecture • Rails/Django backend + ReactJS Frontend • Native Android in Kotlin. Flutter WIP. • Native iOS in Swift. Flutter WIP.

  7. Now • ~2-year-old codebases • Service Oriented Architecture • Rails/Django backend + ReactJS Frontend • Native Android in Kotlin. Flutter WIP. • Native iOS in Swift. Flutter WIP. Deprioritised, because 🚀 & 💰

  8. Legacy Systems • Built for an MVP stage • Came without thorough engineering practices baked in

  9. Growth • 1 year period • Engineering 10 -> 36 • Product 2 -> 9 • Design + Content 4 -> 20

  10. Target Audience 🎯 • Product Engineering Teams • Founders, CTOs, CPOs, VPs • Software Developers • Product Managers
  11. Why CD? 🤔

  12. What is this?

  13. Public Transport. ✅

  14. Every 1 Hour

  15. Every 5 Minutes

  16. Which is better? Every 1 Hour Every 5 Minutes

  17. How about this?

  18. This belt is continuous. Hop-on whenever. Hop-off wherever.

  19. The Bottom Line • Find and surface bugs faster • Repeatable, reliable delivery • Risk mitigation: "When the costs are non-linear, keep it small"
  20. • Failure ☠ 🏦

  21. • Failure ☠ 🏦 • Major Repairs 😱💰

  22. • Failure ☠ 🏦 • Major Repairs 😱💰 • Minor Repairs 😟 💵

  23. • Failure ☠ 🏦 • Major Repairs 😱💰 • Minor Repairs 😟 💵 • Preventive Maintenance 😅 🪙

  24. • Failure ☠ 🏦 • Major Repairs 😱💰 • Minor Repairs 😟 💵 • Preventive Maintenance 😅 🪙 ✅

  25. • Improved velocity 🏎 • Better product via rapid iterations ♻ • Improved code quality, reliability, architecture ☮
  26. What's the catch? 🎣

  27. • It needs rigour, which is not always possible

  28. • It needs rigour, which is not always possible • High inertia — needs time, effort, careful execution

  29. • It needs rigour, which is not always possible • High inertia — needs time, effort, careful execution • High short-term costs
  30. Methodology

  31. Pairing, TDD, Trunk Based Development, On-Call Rotation: Build Rigour

  32. Testing, Instrumentation, Observability, Feature Flags: Make Verification Easy

  33. Infrastructure as Code, Immutable Infra, Pipelines, Playbooks: Reduce Operating Friction

  34. A sustainable culture of building & shipping great products.

  35. 1. Build Rigour 2. Make Verification Easy 3. Low Operating Friction
  36. 1. Build Rigour

  37. 1. Build Rigour Putting the engineering in engineering

  38. 1. Pair Programming

  39. 1. Pair Programming 2. TDD

  40. 1. Pair Programming 2. TDD 3. Trunk Based Development

  41. 1. Pair Programming 2. TDD 3. Trunk Based Development 4. On-Call Rotation

  42. 1. Pair Programming 2. TDD 3. Trunk Based Development 4. On-Call Rotation
  43. Why did we pick it?

  44. ☝First, the results.

  45. Consistent high eNPS • Min 73, Max 87 • Better connections, better work relationships • Pandemic-induced remote anxiety went down 📉

  46. Increasing Velocity • Pairing velocity caught up with non-pairing velocity • Fewer delivery streams, same overall speed

  47. Consistent upward trend. Towards the end, holidays and covid knocked us down.

  48. Fast Onboarding • Ship code to production in Week 1 • New languages, frameworks in a sprint or two 🏎 • Internal transfers with zero friction 🧈

  49. Low Tech Debt • Greenfield projects: ~0 tech debt 👌 • Code quality up 📈 • Documentation quality and quantity up 📈 • Architecture has been flexible 💪
  50. Why did we pick it?

  51. A Combination Of •Prior experience •Established research •First principles thinking

  52. What does research say? • Improves design quality • Reduces defects (people spend less time on defective solutions) • Reduces staffing risk • Enhances technical skills • Improves team communications • Is considered more enjoyable at statistically significant levels. The Costs and Benefits of Pair Programming; Alistair Cockburn, Laurie Williams, Feb 2000
  53. How to do it?

  54. Use Driver Navigator • For an idea to go from the Navigator's head to the code, it must go through the Driver's hands.

  55. Use Driver Navigator • For an idea to go from the Navigator's head to the code, it must go through the Driver's hands. • Switch roles periodically. Say, every hour.

  56. Use Driver Navigator • For an idea to go from the Navigator's head to the code, it must go through the Driver's hands. • Switch roles periodically. Say, every hour. • Senior / Junior is immaterial. Both get both roles.

  57. Use Driver Navigator • Avoid giving line-by-line instructions; convey the general idea.

  58. Use Driver Navigator • Avoid giving line-by-line instructions; convey the general idea. • Seniors take the responsibility of mentoring

  59. Use Driver Navigator • When chopping onions, don't say "cut the top off, now break in half, make a slice" etc. • Just say "finely chopped" or "diced"

  60. Use Driver Navigator • Mentoring happens with both driving and navigating. • I leave it to you to figure out what the differences are.

  61. Use Driver Navigator • Remote pairing is better than in-person because of the natural role selection • Sharing screen? 👉 driver. Other 👉 navigator. • Mobbing is incredibly easy in remote. Just join the call and you're ready! 👍

  62. Switch Pairs Every Sprint • Avoid pairing silos, they stall culture propagation

  63. Switch Pairs Every Sprint • Avoid pairing silos, they stall culture propagation • Often, a pair will be 💯, switch them anyway.

  64. Switch Pairs Every Sprint • Avoid pairing silos, they stall culture propagation • Often, a pair will be 💯, switch them anyway. • Exhaust senior-junior pairs first

  65. Switch Pairs Every Sprint • Avoid pairing silos, they stall culture propagation • Often, a pair will be 💯, switch them anyway. • Exhaust senior-junior pairs first • When the sprint ends, you swap even if WIP. This is an effective litmus test.

  66. DevX is Crucial • Let pairs figure out the balance between solo focussed work and pairing • Have routine health-checks about how people are pairing, their experience, etc • Let pairing feature in 1:1s and other discussions
  67. Pilot + Co-Pilot = Pairing

  68. You don't say "Turn the lever by 10° and push that button"
  69. You say "Raise the elevation by 1000m"

  70. Pairing Summary • Use Driver-Navigator • Switch pairs every sprint; no silos • Routine health checks with the team

  71. 1. Pair Programming 2. TDD 3. Trunk Based Development 4. On-Call Rotation
  72. Why did we pick it?

  73. • TDD improves testability. This benefit alone is enough to embrace TDD. • TDD forces you to think in specifications, hence improving product thinking, along with code quality.
  74. What are the effects?

  75. • Clear & significant uptick in quality where TDD was followed vs where it wasn't.

  76. • Clear & significant uptick in quality where TDD was followed vs where it wasn't. • Legacy or greenfield doesn't matter

  77. • Clear & significant uptick in quality where TDD was followed vs where it wasn't. • Legacy or greenfield doesn't matter • TDD and Pairing are two incredible force multipliers; they feed into each other and create a strong positive gains loop.
  78. How to do it?

  79. None
  80. 1. Have senior engineers who are experienced in TDD

  81. 1. Have senior engineers who are experienced in TDD 2. Pair programming. Duh.

  82. 1. Pair Programming 2. TDD 3. Trunk Based Development 4. On-Call Rotation
  83. Why did we pick it?

  84. None
  85. • Impedance mismatch between long-lived-branch + PR-based workflow and how high-trust teams operate

  86. • Impedance mismatch between long-lived-branch + PR-based workflow and how high-trust teams operate • Build a sense of ownership in the codebase

  87. • Impedance mismatch between long-lived-branch + PR-based workflow and how high-trust teams operate • Build a sense of ownership in the codebase • Always be selling

  88. • Impedance mismatch between long-lived-branch + PR-based workflow and how high-trust teams operate • Build a sense of ownership in the codebase • Always be release ready
  89. What are the effects?

  90. • Code reviews are faster • Teams respond quicker to urgent and important bugs • We're running more iterations

  91. • Deploying to dev, stage has become slightly awkward because there's no 1:1 mapping • Turn-key environments have become a necessity rather than nice-to-have
  92. How to do it?

  93. Have fast builds • <1 min ideally, if possible • 15 min from git push to production deploy, including build • Enable focussed tests, i.e. run a single test from a single file

  94. 💯 Dev Machines • Fast, capable laptops • Must have automated & manual testing setup • Enable setting up any dependency

  95. 1. Pair Programming 2. TDD 3. Trunk Based Development 4. On-Call Rotation
  96. Observations?

  97. • Process still under iteration; no "yes, this works" yet

  98. • Process still under iteration; no "yes, this works" yet • Settled on functional rotations: Backend, Frontend, Mobile, DevOps

  99. • Process still under iteration; no "yes, this works" yet • Settled on functional rotations: Backend, Frontend, Mobile, DevOps • PTOs, leaves, weekends still pose a challenge from time to time

  100. • Team members that have done really well during on-call have also done really well in their performance reviews. • Correlation, yes. Causation? 🤷
  101. How to do it?

  102. Sharing our experience, not a walkthrough for on-call. Literature: PagerDuty Docs & Google's SRE Book

  103. • Start with a robust triage process. First response under 15 min. • Have a playbook where common problems and remedies are listed. • In B2C products, a handful of situations repeat like a persistent boomerang. FAB: Frequently Annoying Bugs.

  104. • Use managed services as much as possible; reduce operational on-call • Try hard for a "follow-the-sun" model, i.e. no wee hours • All alerts must be actionable; keep adjusting until they are

  105. 1. Pair Programming 2. TDD 3. Trunk Based Development 4. On-Call Rotation

  106. Every single process / methodology discussed so far has ancillary benefits that go way beyond just CD.

  107. 2. Make Verification Easy How do you know what you've done is working?
  108. None
  109. 1. Testability

  110. 1. Testability 2. Instrumentation

  111. 1. Testability 2. Instrumentation 3. Feature Flags

  112. 1. Testability 2. Instrumentation 3. Feature Flags 4. Static Verification

  113. 1. Testability 2. Instrumentation 3. Feature Flags 4. Static Verification

  114. Why?

  115. • Testability is a core engineering principle. • To be able to answer questions about a system by probing the right points and looking at indicators

  116. • Cars, bridges, rack & pinion — you can't just restart them. • Neither can you go and update them at will

  117. Hit the hammer: $1.0 Knowing where to hit: $9999.0

  118. • The more testable your environment is, the more people will actually test it. • Make it easy to test something and it will get tested. • Conversely, make it difficult to test and it's easy to slip.
  119. What are the observations?

  120. • Not having Dev & Stage as close to production as possible has routinely caused problems

  121. • Not having Dev & Stage as close to production as possible has routinely caused problems • Static branches mapping to environments (dev, stage, main) seem 👍, but are a 👎

  122. • Not having Dev & Stage as close to production as possible has routinely caused problems • Static branches mapping to environments (dev, stage, main) seem 👍, but are a 👎 • Opaque 3rd-party dependencies are incredibly hard to test, e.g. WhatsApp business APIs

  123. • SoA + inter-service dependencies = complexity at a polynomial growth rate (or worse, factorial)

  124. • SoA + inter-service dependencies = complexity at a polynomial growth rate (or worse, factorial) • Cloud-Native systems are a pain to test, but they do offer instrumentation.

  125. • SoA + inter-service dependencies = complexity at a polynomial growth rate (or worse, factorial) • Cloud-Native systems are a pain to test, but they do offer instrumentation. • UIs are inherently hard to test; add probes (Metrics, Analytics, Traces, Errors, etc)
  126. So, how to go about it?

  127. There are two main themes: 1. Development time 2. Runtime, in production

  128. Development Time • Use TDD • Add linters, code coverage to test builds • Postman / equivalent API tools are 💯 • Powerful Type Systems*
  129. Runtime • Make good use of lower order environments

  130. Runtime • Make good use of lower order environments • Heroku / Vercel style Review Apps are far more powerful than they seem

  131. Runtime • Make good use of lower order environments • Heroku / Vercel style Review Apps are far more powerful than they seem • Dive down deep into important bugs and see how they could've been tested earlier. (Which is different from how to reproduce them)

  132. Runtime • Add traces, especially to lower-order environments. (Example: AWS's X-Ray)

  133. Runtime • Add traces, especially to lower-order environments. (Example: AWS's X-Ray) • Try and build idempotent units of work. APIs, Workers, etc.

  134. Runtime • Add traces, especially to lower-order environments. (Example: AWS's X-Ray) • Try and build idempotent units of work. APIs, Workers, etc. • Pay special attention to non-idempotent units of work. Add checks and balances. (Example: OTPs)
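The idempotent-units-of-work advice above can be sketched in a few lines. This is a minimal, hypothetical example (the store and names are illustrative, not from the talk): a worker keyed by an idempotency key, so a retried job returns the stored result instead of re-running its side effect. A real system would back the store with something durable, such as a database table with a unique index.

```python
# Minimal idempotent-worker sketch. The in-memory dict stands in for a
# durable store (Redis, a DB table with a unique index on the key, etc).
processed = {}  # idempotency_key -> stored result

def charge_user(user_id, amount):
    # Imagine a real side effect here (payment API call, email send, etc).
    return {"user_id": user_id, "charged": amount}

def handle(idempotency_key, user_id, amount):
    # A retry with the same key returns the stored result instead of
    # running the side effect a second time.
    if idempotency_key in processed:
        return processed[idempotency_key]
    result = charge_user(user_id, amount)
    processed[idempotency_key] = result
    return result

first = handle("order-42", user_id=7, amount=100)
retry = handle("order-42", user_id=7, amount=100)
assert first is retry  # the side effect ran exactly once
```

The same shape applies to APIs (an `Idempotency-Key` request header) and to queue consumers that may receive a message more than once.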
  135. Runtime • Add traces, especially to lower-order environments. (Example: AWS's X-Ray) • Try and build idempotent units of work. APIs, Workers, etc. • Pay special attention to non-idempotent units of work. Add checks and balances. (Example: OTPs) • Eliminate string logs. All log statements are events, with key-value pairs*
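The "logs are events, with key-value pairs" point can be sketched with the standard library alone. This is a minimal illustration, not the talk's actual setup; the field names are made up:

```python
import json
import logging

class EventFormatter(logging.Formatter):
    """Render every log record as a JSON event with key-value pairs."""
    def format(self, record):
        event = {"event": record.getMessage(), "level": record.levelname}
        # Key-value pairs travel on the record via the `extra` argument.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(EventFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Instead of a string log like "user 7 paid 100":
logger.info("payment.captured", extra={"fields": {"user_id": 7, "amount": 100}})
```

Because each line is a structured event, a log aggregator can filter and chart on `user_id` or `amount` directly instead of regex-parsing free text.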
  136. In both Environments • Always test for contention: • What must happen sequentially? Does it? • Always test for coherence: • How much and what information do two systems need to collect from each other? Do they?

  137. Recommended Reading • Neil Gunther's work on Universal Scalability Law and Quantifying Scalability and Performance • Michael Nygard's "Release It!"
  138. 1. Testability 2. Instrumentation 3. Feature Flags 4. Static Verification

  139. Why did we pick it?

  140. • Learning from other engineering disciplines • High velocity, but preferably not at a very high upfront cost • Wanted to build upfront, not after the fact
  141. What are the effects?

  142. • NewRelic routinely predicts a lot of problems before they occur • Tech spec quality has gone up — we add metrics and dashboarding right into tech specs • Had our share of goof-ups, e.g. shipped a major feature which nobody used in production 🤦
  143. From our playbook

  144. • Number of bugs has gone down* • Bug triage process is fast (and getting faster; median first response is down to 4 min) • Consistently low tech debt; and we assess and track regularly

  145. None

  146. • Number of bugs has gone down* • Bug triage process is fast (and getting faster; median first response is down to 4 min) • Consistently low tech debt; and we assess and track regularly
  147. How to do it?

  148. • Have 3 levels of instrumentation:

  149. • Have 3 levels of instrumentation: • Infra & Systems level

  150. • Have 3 levels of instrumentation: • Infra & Systems level • Code & Application level

  151. • Have 3 levels of instrumentation: • Infra & Systems level • Code & Application level • Product & Business level
  152. • Have at least two kinds of thresholds:

  153. • Have at least two kinds of thresholds: • Too low and too high

  154. • Have at least two kinds of thresholds: • Too low and too high • Too long and too short
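The two-sided-threshold idea above can be sketched as a small check. This is an illustrative example (the metric name and numbers are made up): a floor catches "suspiciously quiet" failures just as a ceiling catches overload.

```python
def check_threshold(name, value, low, high):
    """Alert when a metric is too low OR too high.
    The floor catches 'suspiciously quiet' failures (e.g. orders at zero
    usually means checkout is broken, not that customers are calm); the
    ceiling catches overload (e.g. an error rate spiking)."""
    if value < low:
        return f"ALERT {name}: {value} below floor {low}"
    if value > high:
        return f"ALERT {name}: {value} above ceiling {high}"
    return None  # within the healthy band, no alert

# Illustrative: orders per minute, with both bounds configured.
assert check_threshold("orders_per_min", 0, low=5, high=500) is not None
assert check_threshold("orders_per_min", 120, low=5, high=500) is None
```

The same shape applies to durations ("too long and too short"): a request that completes in 1 ms may be short-circuiting and failing silently, just as one taking 30 s is timing out.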
  155. • Envision your production dashboards before even writing a single line of code • We're running a trial with the GQM technique • Answer the 🏅 question: How do you know what you've built is working?

  156. • NewRelic, Cloudwatch & friends are your friends • Keep Logs, Metrics, APMs in one place
  157. 1. Testability 2. Instrumentation 3. Feature Flags 4. Static Verification

  158. Why?

  159. • Essential with Trunk Based Development • Works very well with product experimentation
  160. Observations.

  161. • We now deploy "under development" work to production on Day One • Having fewer technologies has helped in usage standardisation. Flipper is 🤘 • Code gets littered with branching. Live with it.
  162. How to do it?

  163. 3 Kinds of Feature Flags

  164. 3 Kinds of Feature Flags 1. Infra / systems level (types of CPUs, Aurora vs RDS, etc)

  165. 3 Kinds of Feature Flags 1. Infra / systems level (types of CPUs, Aurora vs RDS, etc) 2. Code level (tied with continuous deployments and trunk development)

  166. 3 Kinds of Feature Flags 1. Infra / systems level (types of CPUs, Aurora vs RDS, etc) 2. Code level (tied with continuous deployments and trunk development) 3. Product and business level (A/B tests, experimentation)

  167. • Not all feature flags live forever; kill the code branches when the feature matures. • Database changes have to be 100% backward and forward compatible • Prefer SDKs, libraries, code sharing over a centralised service for feature flags
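A code-level flag of the kind described above can be sketched in a few lines. This is a generic, hypothetical in-process flag store (the deck's actual tool is Flipper, a Ruby library), with a percentage rollout so unfinished trunk code stays dark for most users:

```python
import hashlib

# Illustrative flag config; in practice this lives in a shared store.
FLAGS = {"new_checkout": {"enabled": True, "percent": 10}}

def flag_on(name, actor_id):
    """Is this flag on for this actor? Hashing the actor gives each user
    a stable yes/no, i.e. a deterministic percentage rollout."""
    flag = FLAGS.get(name)
    if not flag or not flag["enabled"]:
        return False
    digest = hashlib.sha256(f"{name}:{actor_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < flag["percent"]

def checkout(user_id):
    if flag_on("new_checkout", user_id):
        return "new flow"  # merged to trunk, but dark for ~90% of users
    return "old flow"
```

This is the mechanism that makes trunk-based development and Day One production deploys safe: the branch point moves from version control into the code path, and the flag (plus its `if` branches) is deleted once the feature matures.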
  168. 1. Testability 2. Instrumentation 3. Feature Flags 4. Static Verification

  169. • Linters ✅ • Source Code Analysis for Security ✅ • Metrics ✅ • Exploring: TLA+ 🔮 • Formal Verification: At the moment 🛑
  170. 1. Testability 2. Instrumentation 3. Feature Flags 4. Static Verification

  171. 3. Reduce Operating Friction Help team focus on the important

  172. 1. Pipelines 2. Infrastructure as Code 3. Playbooks

  173. 1. Pipelines 2. Infrastructure as Code 3. Playbooks

  174. • Automated pipelines ✅ • Manual deployment? 👎 • Manual approval? 👎 • Manual configuration? 👎

  175. • API service? Pipeline. • iOS app? Pipeline. • React App? Pipeline. • Data pipeline? Well, duh!
  176. Make reliable deployments a foregone conclusion.

  177. What do we do?

  178. • ~50 deployments per day • Slowest deployment to prod is 15 min, fastest is 3 min — this includes ALL THE TESTING • Deployments are completely transparent. You push code and things happen. Teams can focus on product and problems.

  179. • Entire infra is managed from the pipeline; it's tied into the AWS ecosystem.

  180. • Entire infra is managed from the pipeline; it's tied into the AWS ecosystem. • Remember the golden rule: Every git push goes to production under 15 minutes flat, with no manual approval whatsoever.
  181. 1. Pipelines 2. Infrastructure as Code 3. Playbooks

  182. Why did we pick it?

  183. • Declarative infrastructure; same code quality focus on DevOps as well

  184. • Declarative infrastructure; same code quality focus on DevOps as well • We want the vertical teams to define and manage their infra and not be blocked by a horizontal team

  185. • Declarative infrastructure; same code quality focus on DevOps as well • We want the vertical teams to define and manage their infra and not be blocked by a horizontal team • Reduce operational on-call burden

  186. • Declarative infrastructure; same code quality focus on DevOps as well • We want the vertical teams to define and manage their infra and not be blocked by a horizontal team • Reduce operational on-call burden • DevOps team works on hard platform problems and security challenges

  187. • We picked: AWS CDK • CDK Python makes it a low barrier for engineering.
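As a flavour of what the CDK Python choice looks like, here is a minimal, hypothetical stack sketch — the construct IDs and the per-environment loop are illustrative, not from the talk, and an actual deployment needs AWS credentials plus `cdk deploy`:

```python
# Minimal AWS CDK v2 sketch: one stack holding a queue, instantiated
# per environment. Real stacks add VPCs, Fargate services, RDS, etc.
from aws_cdk import App, Duration, Stack
from aws_cdk import aws_sqs as sqs
from constructs import Construct

class WorkerStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # Declarative resource: the pipeline diffs and applies this on git push.
        sqs.Queue(self, "JobsQueue", visibility_timeout=Duration.seconds(300))

app = App()
# One stack instance per environment keeps dev/stage/prod at high parity.
for env_name in ("dev", "stage", "prod"):
    WorkerStack(app, f"worker-{env_name}")
app.synth()  # emits CloudFormation templates for the pipeline to deploy
```

Because the same class synthesises every environment, the "lower-order environments at high parity with production" result later in the deck falls out almost for free.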
  188. What are the effects?

  189. Fast Turnaround • New Rails project from scratch goes from 0 to (dev + stage + production) in 2 hours.

  190. Fast Turnaround • New Rails project from scratch goes from 0 to (dev + stage + production) in 2 hours. • This includes Load balancer, DNS, HTTPS, Secrets, Docker (Fargate) cluster, Redis, Workers, RDS PostgreSQL, and all the things.

  191. Fast Turnaround • New Rails project from scratch goes from 0 to (dev + stage + production) in 2 hours. • This includes Load balancer, DNS, HTTPS, Secrets, Docker (Fargate) cluster, Redis, Workers, RDS PostgreSQL, and all the things. • Out of these 2 hours, 45 min is taken by RDS to bring up the server

  192. Fast Turnaround • Adding a new AWS Lambda to dev + stage + prod: 15 to 30 minutes. Git push and you're in production. • 30 min when complex pieces like SQS / SNS are involved • New Redis server? Add code, git commit, 15 min later: ✅

  193. • Infrastructure thinking and action is fully absorbed into Engineering now. • DevOps team has spent <1% of their total time on on-call issues. • They're working on pieces like turn-key environments, load-testing setups, security compliance, performance optimisations

  194. • Lower-order/sub-prime environments are at very high parity with Production in terms of infra. • Remember better testing? This makes it possible and easy.
  195. So, how do you do it?

  196. • Relentless focus on Developer Productivity over infra costs.

  197. • Relentless focus on Developer Productivity over infra costs. • Even in pure monetary terms, it's cheaper

  198. • Relentless focus on Developer Productivity over infra costs. • Even in pure monetary terms, it's cheaper • We routinely and constantly save costs because developers have the headspace to think about high-impact problems.

  199. • Pick AWS CDK, Pick Cloud-Native: The combination is wildly effective. • Similar combinations exist with other providers

  200. • Treat the infra team as an engineering team, not a support team. • Actively help them avoid becoming Jira card pushers
  201. 1. Pipelines 2. Infrastructure as Code 3. Playbooks

  202. Playbooks for Nearly Everything • Product Engineering? ✅ • Mobile Development? ✅ • Onboarding and Off-boarding? ✅ • Git Usage? ✅ (WIP) • Feature Flags? ✅ (WIP)

  203. Templates for Nearly Everything • Decision Records? ✅ • New code repositories? ✅ • PRDs? ✅ • Jira User Stories? ✅ • Interview Problems? ✅ (WIP)

  204. What is the idea? • Reduce decision fatigue by codifying frequent decisions. • Improve compliance through written procedures • Encourage participation by making it open and editable to all
  205. 1. Pipelines 2. Infrastructure as Code 3. Playbooks

  206. Summary

  207. Rationale 1. Continuous Deployments are good for you. 2. If you're not doing it, you're playing in hard mode. 3. At minimum, think preventive maintenance
  208. Build Rigour: Pairing, TDD, Trunk Based Development, On-Call Rotation

  209. Make Verification Easy: Testing, Instrumentation, Observability, Feature Flags

  210. Reduce Operating Friction: Infrastructure as Code, Immutable Infra, Pipelines, Playbooks

  211. A sustainable culture of building & shipping great products.

  212. Thank you! 🙏

  213. Questions?