The Road to Continuous Deployments

Slide 1

Slide 1 text

The Road to Continuous Deployments Experience Report from CoLearn Engineering

Slide 2

Slide 2 text

Swanand Pagnis 👨‍💼 CTO at CoLearn 🍻 meetup.com/Bangalore-Ruby-Users-Group/ 📔 info.pagnis.in 👨‍🏫 postgres-workshop.com

Slide 3

Slide 3 text

Background

Slide 4

Slide 4 text

Company •Education Platform for K-12 •In Indonesia (for now) •Live Classes •AI-Powered Homework Help

Slide 5

Slide 5 text

>1 year ago •8-month-old codebases •Service Oriented Architecture •NodeJS backend + ReactJS Frontend •Native Android in Kotlin •Native iOS in Swift

Slide 6

Slide 6 text

Now • ~2-year-old codebases •Service Oriented Architecture •Rails/Django backend + ReactJS Frontend •Native Android in Kotlin. Flutter WIP. •Native iOS in Swift. Flutter WIP.

Slide 7

Slide 7 text

Now • ~2-year-old codebases •Service Oriented Architecture •Rails/Django backend + ReactJS Frontend •Native Android in Kotlin. Flutter WIP. •Native iOS in Swift. Flutter WIP. Deprioritised, because 🚀 & 💰

Slide 8

Slide 8 text

Legacy Systems •Built for an MVP stage •Came without thorough engineering practices baked in

Slide 9

Slide 9 text

Growth •1 year period •Engineering 10 -> 36 •Product 2 -> 9 •Design + Content 4 -> 20

Slide 10

Slide 10 text

Target Audience 🎯 •Product Engineering Teams •Founders, CTOs, CPOs, VPs •Software Developers •Product Managers

Slide 11

Slide 11 text

Why CD? 🤔

Slide 12

Slide 12 text

What is this?

Slide 13

Slide 13 text

Public Transport. ✅

Slide 14

Slide 14 text

Every 1 Hour

Slide 15

Slide 15 text

Every 5 Minutes

Slide 16

Slide 16 text

Which is better? Every 1 Hour Every 5 Minutes

Slide 17

Slide 17 text

How about this?

Slide 18

Slide 18 text

This belt is continuous. Hop-on whenever. Hop-off wherever.

Slide 19

Slide 19 text

The Bottom Line •Find and surface bugs faster •Repeatable, reliable delivery •Risk mitigation: "When the costs are non-linear, keep it small"

Slide 20

Slide 20 text

• Failure ☠ 🏦

Slide 21

Slide 21 text

• Failure ☠ 🏦 • Major Repairs 😱💰

Slide 22

Slide 22 text

• Failure ☠ 🏦 • Major Repairs 😱💰 • Minor Repairs 😟 💵

Slide 23

Slide 23 text

• Failure ☠ 🏦 • Major Repairs 😱💰 • Minor Repairs 😟 💵 • Preventive Maintenance 😅 🪙

Slide 24

Slide 24 text

• Failure ☠ 🏦 • Major Repairs 😱💰 • Minor Repairs 😟 💵 • Preventive Maintenance 😅 🪙 ✅

Slide 25

Slide 25 text

• Improved velocity 🏎 • Better product via rapid iterations ♻ • Improved code quality, reliability, architecture ☮

Slide 26

Slide 26 text

What's the catch? 🎣

Slide 27

Slide 27 text

• It needs rigour, which is not always possible

Slide 28

Slide 28 text

• It needs rigour, which is not always possible • High inertia — needs time, effort, careful execution

Slide 29

Slide 29 text

• It needs rigour, which is not always possible • High inertia — needs time, effort, careful execution • High short term costs

Slide 30

Slide 30 text

Methodology

Slide 31

Slide 31 text

Pairing, TDD, Trunk Based Development, On-Call Rotation: Build Rigour

Slide 32

Slide 32 text

Testing, Instrumentation, Observability, Feature Flags: Make Verification Easy

Slide 33

Slide 33 text

Infrastructure as Code, Immutable Infra, Pipelines, Playbooks: Reduce Operating Friction

Slide 34

Slide 34 text

A sustainable culture of building & shipping great products.

Slide 35

Slide 35 text

1. Build Rigour 2. Make Verification Easy 3. Low Operating Friction

Slide 36

Slide 36 text

1. Build Rigour

Slide 37

Slide 37 text

1. Build Rigour Putting the engineering in engineering

Slide 38

Slide 38 text

1. Pair Programming

Slide 39

Slide 39 text

1. Pair Programming 2. TDD

Slide 40

Slide 40 text

1. Pair Programming 2. TDD 3. Trunk Based Development

Slide 41

Slide 41 text

1. Pair Programming 2. TDD 3. Trunk Based Development 4. On-Call Rotation

Slide 42

Slide 42 text

1. Pair Programming 2. TDD 3. Trunk Based Development 4. On-Call Rotation

Slide 43

Slide 43 text

Why did we pick it?

Slide 44

Slide 44 text

☝First, the results.

Slide 45

Slide 45 text

Consistent high eNPS •Min 73, Max 87 •Better connections, better work relationships •Pandemic induced remote anxiety went down 📉

Slide 46

Slide 46 text

Increasing Velocity •Pairing velocity caught up with non-pairing velocity •Fewer delivery streams, same overall speed

Slide 47

Slide 47 text

Consistent upward trend. Towards the end, holidays and covid knocked us down.

Slide 48

Slide 48 text

Fast Onboarding •Ship code to production in Week 1 •New languages, frameworks in a sprint or two 🏎 •Internal transfers with zero friction 🧈

Slide 49

Slide 49 text

Low Tech Debt •Greenfield projects: ~ 0 tech debt 👌 •Code quality up 📈 •Documentation quality and quantity up 📈 •Architecture has been flexible 💪

Slide 50

Slide 50 text

Why did we pick it?

Slide 51

Slide 51 text

A Combination Of •Prior experience •Established research •First principles thinking

Slide 52

Slide 52 text

What does research say? •Improves design quality •Reduces defects (people spend less time on defective solutions) •Reduces staffing risk •Enhances technical skills •Improves team communications •Is considered more enjoyable at statistically significant levels. The Costs and Benefits of Pair Programming; Alistair Cockburn, Laurie Williams, Feb 2000

Slide 53

Slide 53 text

How to do it?

Slide 54

Slide 54 text

Use Driver Navigator •For an idea to go from Navigator's head to the code, it must go through Driver's hands.

Slide 55

Slide 55 text

Use Driver Navigator •For an idea to go from Navigator's head to the code, it must go through Driver's hands. •Switch roles periodically. Say, every hour.

Slide 56

Slide 56 text

Use Driver Navigator •For an idea to go from Navigator's head to the code, it must go through Driver's hands. •Switch roles periodically. Say, every hour. •Senior / Junior is immaterial. Both get both roles.

Slide 57

Slide 57 text

Use Driver Navigator •Avoid giving line by line instructions, convey the general idea.

Slide 58

Slide 58 text

Use Driver Navigator •Avoid giving line by line instructions, convey the general idea. •Seniors take the responsibility of mentoring

Slide 59

Slide 59 text

Use Driver Navigator •When chopping onions, don't say "cut the top off, now break in half, make a slice" etc. •Just say "finely chopped" or "diced"

Slide 60

Slide 60 text

Use Driver Navigator •Mentoring happens with both driving and navigating. •I leave it to you to figure out what the differences are.

Slide 61

Slide 61 text

Use Driver Navigator •Remote pairing is better than in-person because of the natural role selection •Sharing screen? 👉 driver. Other 👉 navigator. •Mobbing is incredibly easy in remote. Just join the call and you're ready! 👍

Slide 62

Slide 62 text

Switch Pairs Every Sprint •Avoid pairing silos, they stall culture propagation

Slide 63

Slide 63 text

Switch Pairs Every Sprint •Avoid pairing silos, they stall culture propagation •Often, a pair will be 💯, switch them anyway.

Slide 64

Slide 64 text

Switch Pairs Every Sprint •Avoid pairing silos, they stall culture propagation •Often, a pair will be 💯, switch them anyway. •Exhaust senior-junior pairs first

Slide 65

Slide 65 text

Switch Pairs Every Sprint •Avoid pairing silos, they stall culture propagation •Often, a pair will be 💯, switch them anyway. •Exhaust senior-junior pairs first •When sprint ends, you swap even if WIP. This is an effective litmus test.

Slide 66

Slide 66 text

DevX is Crucial •Let pairs figure out the balance between solo focussed work and pairing •Have routine health-checks about how people are pairing, their experience, etc. •Let pairing feature in 1:1s and other discussions

Slide 67

Slide 67 text

Pilot + Co-Pilot = Pairing

Slide 68

Slide 68 text

You don't say "Turn the lever by 10° and push that button"

Slide 69

Slide 69 text

You say "Raise the elevation by 1000m"

Slide 70

Slide 70 text

Pairing Summary •Use Driver-Navigator •Switch pairs every sprint; no silos •Routine health checks with team

Slide 71

Slide 71 text

1. Pair Programming 2. TDD 3. Trunk Based Development 4. On-Call Rotation

Slide 72

Slide 72 text

Why did we pick it?

Slide 73

Slide 73 text

• TDD improves testability. This benefit alone is enough to embrace TDD. • TDD forces you to think in specifications, hence improving product thinking, along with code quality.

Slide 74

Slide 74 text

What are the effects?

Slide 75

Slide 75 text

• Clear & significant uptick in quality where TDD was followed vs where it wasn't.

Slide 76

Slide 76 text

• Clear & significant uptick in quality where TDD was followed vs where it wasn't. • Legacy or greenfield doesn't matter

Slide 77

Slide 77 text

• Clear & significant uptick in quality where TDD was followed vs where it wasn't. • Legacy or greenfield doesn't matter • TDD and Pairing are two incredible force multipliers, they feed into each other and create a strong positive gains loop.

Slide 78

Slide 78 text

How to do it?

Slide 80

Slide 80 text

1. Have senior engineers who are experienced in TDD

Slide 81

Slide 81 text

1. Have senior engineers who are experienced in TDD 2. Pair programming. Duh.
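
A minimal sketch of what "thinking in specifications" can look like in practice. The function `homework_score` and its rules are hypothetical, invented for illustration and not taken from CoLearn's codebase. In a TDD cycle the test class would be written first and fail, and the function body would then be filled in until it passes.

```python
import unittest


def homework_score(correct: int, total: int) -> float:
    """Return the percentage of correct answers (hypothetical example)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * correct / total, 1)


class HomeworkScoreSpec(unittest.TestCase):
    # Each test reads as one line of the specification.
    def test_all_correct_is_100(self):
        self.assertEqual(homework_score(5, 5), 100.0)

    def test_partial_score_is_rounded_to_one_decimal(self):
        self.assertEqual(homework_score(1, 3), 33.3)

    def test_zero_total_is_rejected(self):
        with self.assertRaises(ValueError):
            homework_score(0, 0)
```

Saved as, say, `test_scores.py` (an assumed file name), the whole spec runs with `python -m unittest test_scores`, and a single test runs with `python -m unittest test_scores.HomeworkScoreSpec.test_zero_total_is_rejected`.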

Slide 82

Slide 82 text

1. Pair Programming 2. TDD 3. Trunk Based Development 4. On-Call Rotation

Slide 83

Slide 83 text

Why did we pick it?

Slide 85

Slide 85 text

• Impedance mismatch between long-lived-branch + PR-based workflow and how high-trust teams operate

Slide 86

Slide 86 text

• Impedance mismatch between long-lived-branch + PR-based workflow and how high-trust teams operate • Build a sense of ownership in the codebase

Slide 87

Slide 87 text

• Impedance mismatch between long-lived-branch + PR-based workflow and how high-trust teams operate • Build a sense of ownership in the codebase • Always be selling

Slide 88

Slide 88 text

• Impedance mismatch between long-lived-branch + PR-based workflow and how high-trust teams operate • Build a sense of ownership in the codebase • Always be release ready (not "selling")

Slide 89

Slide 89 text

What are the effects?

Slide 90

Slide 90 text

• Code reviews are faster • Teams respond quicker to urgent and important bugs • We're running more iterations

Slide 91

Slide 91 text

• Deploying to dev, stage has become slightly awkward because there's no 1:1 mapping • Turn-key environments have become a necessity rather than nice-to-have

Slide 92

Slide 92 text

How to do it?

Slide 93

Slide 93 text

Have fast builds • < 1 min ideally, if possible •15 min from git push to production deploy including build •Enable focussed tests i.e. run a single test from a single file

Slide 94

Slide 94 text

💯 Dev Machines •Fast, capable laptops •Must have automated & manual testing setup •Enable setting up any dependency

Slide 95

Slide 95 text

1. Pair Programming 2. TDD 3. Trunk Based Development 4. On-Call Rotation

Slide 96

Slide 96 text

Observations?

Slide 97

Slide 97 text

• Process still under iteration; no "yes, this works" yet

Slide 98

Slide 98 text

• Process still under iteration; no "yes, this works" yet • Settled on functional rotations: Backend, Frontend, Mobile, DevOps

Slide 99

Slide 99 text

• Process still under iteration; no "yes, this works" yet • Settled on functional rotations: Backend, Frontend, Mobile, DevOps • PTOs, leaves, Weekends still pose a challenge from time to time

Slide 100

Slide 100 text

• Team members that have done really well during on-call have also done really well in their performance reviews. • Correlation, yes. Causation? 🤷

Slide 101

Slide 101 text

How to do it?

Slide 102

Slide 102 text

Sharing our experience, not a walkthrough for on-call. Literature: PagerDuty Docs & Google's SRE Book

Slide 103

Slide 103 text

• Start with a robust triage process. First response under 15 min. • Have a playbook where common problems and remedies are listed. • In B2C products, a handful of situations repeat like a persistent boomerang. FAB. Frequently Annoying Bugs.

Slide 104

Slide 104 text

• Use managed services as much as possible; reduce operational on-call • Try hard for "follow-the-sun" model; i.e. no wee hours • All alerts must be actionable, keep adjusting until they are

Slide 105

Slide 105 text

1. Pair Programming 2. TDD 3. Trunk Based Development 4. On-Call Rotation

Slide 106

Slide 106 text

Every single process / methodology discussed so far has ancillary benefits that go way beyond just CD.

Slide 107

Slide 107 text

2. Make Verification Easy How do you know what you've done is working?

Slide 109

Slide 109 text

1. Testability

Slide 110

Slide 110 text

1. Testability 2. Instrumentation

Slide 111

Slide 111 text

1. Testability 2. Instrumentation 3. Feature Flags

Slide 112

Slide 112 text

1. Testability 2. Instrumentation 3. Feature Flags 4. Static Verification

Slide 113

Slide 113 text

1. Testability 2. Instrumentation 3. Feature Flags 4. Static Verification

Slide 114

Slide 114 text

Why?

Slide 115

Slide 115 text

• Testability is a core engineering principle. • To be able to answer questions about a system by probing the right points and looking at indicators

Slide 116

Slide 116 text

• Cars, bridges, rack & pinion — you can't just restart them. • Neither can you go and update them at will

Slide 117

Slide 117 text

Hit the hammer: $1.0 Knowing where to hit: $9999.0

Slide 118

Slide 118 text

• The more testable your environment is, the more people will actually test it. • Make it easy to test something and it will get tested. • Conversely, make it difficult to test and it's easy to slip.

Slide 119

Slide 119 text

What are the observations?

Slide 120

Slide 120 text

• Not having Dev & Stage as close to production has routinely caused problems

Slide 121

Slide 121 text

• Not having Dev & Stage as close to production has routinely caused problems • Static branches mapping to environments (dev, stage, main) seem 👍, but are a 👎

Slide 122

Slide 122 text

• Not having Dev & Stage as close to production has routinely caused problems • Static branches mapping to environments (dev, stage, main) seem 👍, but are a 👎 • Opaque 3rd party dependencies are incredibly hard to test. e.g. WhatsApp business APIs

Slide 123

Slide 123 text

• SoA + inter-service dependencies = complexity at a polynomial growth rate (or worse, factorial)

Slide 124

Slide 124 text

• SoA + inter-service dependencies = complexity at a polynomial growth rate (or worse, factorial) • Cloud-Native systems are a pain to test, but they do offer instrumentation.

Slide 125

Slide 125 text

• SoA + inter-service dependencies = complexity at a polynomial growth rate (or worse, factorial) • Cloud-Native systems are a pain to test, but they do offer instrumentation. • UIs are inherently hard to test, add probes (Metrics, Analytics, Traces, Errors, etc.)

Slide 126

Slide 126 text

So, how to go about it?

Slide 127

Slide 127 text

There are two main themes: 1. Development time 2. Runtime, in production

Slide 128

Slide 128 text

Development Time •Use TDD •Add linters, code coverage to test builds •Postman / equivalent API tools are 💯 •Powerful Type Systems*

Slide 129

Slide 129 text

Runtime • Make good use of lower order environments

Slide 130

Slide 130 text

Runtime • Make good use of lower order environments •Heroku / Vercel style Review Apps are far more powerful than they seem

Slide 131

Slide 131 text

Runtime • Make good use of lower order environments •Heroku / Vercel style Review Apps are far more powerful than they seem •Dive down deep into important bugs and see how they could've been tested earlier. (Which is different from how to reproduce them.)

Slide 132

Slide 132 text

Runtime •Add traces, especially to lower-order environments. (Example: AWS X-Ray)

Slide 133

Slide 133 text

Runtime •Add traces, especially to lower-order environments. (Example: AWS X-Ray) •Try and build idempotent units of work. APIs, Workers, etc.

Slide 134

Slide 134 text

Runtime •Add traces, especially to lower-order environments. (Example: AWS X-Ray) •Try and build idempotent units of work. APIs, Workers, etc. •Pay special attention to non-idempotent units of work. Add checks and balances. (Example: OTPs)
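
One common way to make a unit of work idempotent is to key it and remember the result, so that a retry returns the stored outcome instead of repeating the side effect. The sketch below is an in-memory illustration only. The class `IdempotentWorker`, the function `send_welcome_message` and the key format are assumptions made for this example, not something described in the talk.

```python
from typing import Callable, Dict, List


class IdempotentWorker:
    """Runs a unit of work at most once per idempotency key (in-memory sketch)."""

    def __init__(self, work: Callable[[str], str]) -> None:
        self._work = work
        self._results: Dict[str, str] = {}  # idempotency key -> stored result

    def run(self, key: str, payload: str) -> str:
        # A retry with the same key returns the stored result
        # instead of performing the work a second time.
        if key not in self._results:
            self._results[key] = self._work(payload)
        return self._results[key]


calls: List[str] = []


def send_welcome_message(student_id: str) -> str:
    calls.append(student_id)  # stands in for a real side effect
    return f"sent:{student_id}"


worker = IdempotentWorker(send_welcome_message)
first = worker.run("welcome-42", "student-42")
retry = worker.run("welcome-42", "student-42")  # e.g. a queue redelivery
```

After both calls, `calls` holds a single entry and `first == retry`. A real system would likely keep the keys in a durable store rather than a dict. Work that cannot be keyed this way, such as issuing a fresh OTP, is where the extra checks and balances come in.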

Slide 136

Slide 136 text

In both Environments •Always test for contention: • What must happen sequentially? Does it? •Always test for coherence: • How much and what information do two systems need to collect from each other? Do they?

Slide 137

Slide 137 text

Recommended Reading •Neil Gunther's work on Universal Scalability Law and Quantifying Scalability and Performance •Michael Nygard's "Release It!"
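
Contention and coherence are the two penalty terms in Gunther's Universal Scalability Law: relative capacity at load N is N / (1 + α(N − 1) + βN(N − 1)), where α models contention (serialised work) and β models coherence (cross-talk between nodes). A small sketch of that formula follows. The coefficient values in the usage lines are illustrative assumptions, not measurements.

```python
def usl_relative_capacity(n: int, alpha: float, beta: float) -> float:
    """Universal Scalability Law: capacity at load n relative to n = 1.

    alpha: contention coefficient (fraction of work that must be sequential)
    beta:  coherence coefficient (cost of keeping nodes in agreement)
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))


# Illustrative coefficients only.
at_32 = usl_relative_capacity(32, alpha=0.05, beta=0.001)
at_64 = usl_relative_capacity(64, alpha=0.05, beta=0.001)
```

With these example coefficients `at_64` comes out lower than `at_32`: past a certain point the coherence term dominates and adding capacity reduces throughput, which is why both properties are worth testing for explicitly.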

Slide 138

Slide 138 text

1. Testability 2. Instrumentation 3. Feature Flags 4. Static Verification

Slide 139

Slide 139 text

Why did we pick it?

Slide 140

Slide 140 text

• Learning from other engineering disciplines • High velocity, but preferably not at a very high upfront cost • Wanted to build upfront, not after the fact

Slide 141

Slide 141 text

What are the effects?

Slide 142

Slide 142 text

• NewRelic routinely predicts a lot of problems before they occur • Tech spec quality has gone up — we add metrics and dashboarding right into tech specs • Had our share of goof-ups. e.g. Shipped a major feature, which nobody used in production 🤦

Slide 143

Slide 143 text

From our playbook

Slide 144

Slide 144 text

• Number of bugs has gone down* • Bug triage process is fast (and getting faster; median first response is down to 4 min) • Consistently low tech debt; and we assess and track regularly

Slide 147

Slide 147 text

How to do it?

Slide 148

Slide 148 text

• Have 3 levels of instrumentation:

Slide 149

Slide 149 text

• Have 3 levels of instrumentation: • Infra & Systems level

Slide 150

Slide 150 text

• Have 3 levels of instrumentation: • Infra & Systems level • Code & Application level

Slide 151

Slide 151 text

• Have 3 levels of instrumentation: • Infra & Systems level • Code & Application level • Product & Business level

Slide 152

Slide 152 text

• Have at least two kinds of thresholds:

Slide 153

Slide 153 text

• Have at least two kinds of thresholds: • Too low and too high

Slide 154

Slide 154 text

• Have at least two kinds of thresholds: • Too low and too high • Too long and too short
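
A sketch of two-sided thresholds, assuming a hypothetical `Band` helper: a reading is flagged when it falls outside the band on either side, since a metric that is suspiciously low or a job that finishes suspiciously fast can be as much of a signal as the opposite. The metric names and numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Band:
    """An acceptable range; readings outside it on either side are flagged."""
    low: float
    high: float
    below: str = "too low"
    above: str = "too high"

    def check(self, value: float) -> Optional[str]:
        if value < self.low:
            return self.below
        if value > self.high:
            return self.above
        return None  # within the band, nothing to report


# Hypothetical metrics and limits.
signups_per_hour = Band(low=5, high=500)
job_duration_seconds = Band(low=2, high=120, below="too short", above="too long")

quiet_hour = signups_per_hour.check(0)       # flags a suspiciously quiet hour
instant_job = job_duration_seconds.check(0.1)  # flags a job that "finished" too fast
```

Here `quiet_hour` is `"too low"` and `instant_job` is `"too short"`. Either may mean the system is silently doing nothing.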

Slide 155

Slide 155 text

• Envision your production dashboards before even writing a single line of code • We're running a trial with the GQM (Goal, Question, Metric) technique • Answer the 🏅 question: How do you know what you've built is working?

Slide 156

Slide 156 text

• NewRelic, Cloudwatch & friends are your friends • Keep Logs, Metrics, APMs in one place

Slide 157

Slide 157 text

1. Testability 2. Instrumentation 3. Feature Flags 4. Static Verification

Slide 158

Slide 158 text

Why?

Slide 159

Slide 159 text

• Essential with Trunk Based Development • Works very well with product experimentation

Slide 160

Slide 160 text

Observations.

Slide 161

Slide 161 text

• We now deploy "under development" work to production on Day One • Having fewer technologies has helped in usage standardisation. Flipper is 🤘 • Code gets littered with branching. Live with it.

Slide 162

Slide 162 text

How to do it?

Slide 163

Slide 163 text

3 Kinds of Feature Flags

Slide 164

Slide 164 text

3 Kinds of Feature Flags 1. Infra / systems level (types of CPUs, Aurora vs RDS, etc)

Slide 165

Slide 165 text

3 Kinds of Feature Flags 1. Infra / systems level (types of CPUs, Aurora vs RDS, etc) 2. Code level (tied with continuous deployments and trunk development)

Slide 166

Slide 166 text

3 Kinds of Feature Flags 1. Infra / systems level (types of CPUs, Aurora vs RDS, etc) 2. Code level (tied with continuous deployments and trunk development) 3. Product and business level (A/B tests, experimentation)

Slide 167

Slide 167 text

• Not all feature flags live forever; kill the code branches when the feature matures. • Database changes have to be 100% backward and forward compatible • Prefer SDKs, libraries, code sharing over a centralised service for feature flags
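
A minimal sketch of a code-level flag checked in-process, in the spirit of preferring a library over a centralised service. The `FeatureFlags` class, the flag names and the percentage-rollout scheme are assumptions for illustration. This is not Flipper's API or CoLearn's implementation.

```python
import hashlib
from typing import Dict


class FeatureFlags:
    """In-process flag lookup; a hypothetical stand-in for a real flag library."""

    def __init__(self, rollout_percent: Dict[str, int]) -> None:
        self._rollout = rollout_percent  # flag name -> 0..100

    def enabled(self, flag: str, actor_id: str) -> bool:
        percent = self._rollout.get(flag, 0)  # unknown flags are off
        digest = hashlib.sha256(f"{flag}:{actor_id}".encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100  # stable per actor
        return bucket < percent


flags = FeatureFlags({"new_homework_flow": 0, "live_class_chat": 100})


def homework_page(student_id: str) -> str:
    if flags.enabled("new_homework_flow", student_id):
        return "new flow"  # under-development path, dark in production
    return "current flow"
```

With the rollout at 0, unfinished work can ship to production on day one while `homework_page` keeps serving the current flow. Raising the percentage exposes it gradually, and once the feature matures both the flag and the `if` branch get deleted.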

Slide 168

Slide 168 text

1. Testability 2. Instrumentation 3. Feature Flags 4. Static Verification

Slide 169

Slide 169 text

• Linters ✅ • Source Code Analysis for Security ✅ • Metrics ✅ • Exploring: TLA+ 🔮 • Formal Verification: At the moment 🛑

Slide 170

Slide 170 text

1. Testability 2. Instrumentation 3. Feature Flags 4. Static Verification

Slide 171

Slide 171 text

3. Reduce Operating Friction Help the team focus on what's important

Slide 172

Slide 172 text

1. Pipelines 2. Infrastructure as Code 3. Playbooks

Slide 173

Slide 173 text

1. Pipelines 2. Infrastructure as Code 3. Playbooks

Slide 174

Slide 174 text

• Automated pipelines ✅ • Manual deployment? 👎 • Manual approval? 👎 • Manual configuration? 👎

Slide 175

Slide 175 text

• API service? Pipeline. • iOS app? Pipeline. • React App? Pipeline. • Data pipeline? Well, duh!

Slide 176

Slide 176 text

Make reliable deployments a foregone conclusion.

Slide 177

Slide 177 text

What do we do?

Slide 178

Slide 178 text

• ~ 50 deployments per day • Slowest deployment to prod is 15 min, fastest is 3 min — this includes ALL THE TESTING • Deployments are completely transparent. You push code and things happen. Teams can focus on product and problems.

Slide 179

Slide 179 text

• Entire infra is managed from the pipeline, it's tied into the AWS ecosystem.

Slide 180

Slide 180 text

• Entire infra is managed from the pipeline, it's tied into the AWS ecosystem. • Remember the golden rule: Every git push goes to production under 15 minutes flat, with no manual approval whatsoever.

Slide 181

Slide 181 text

1. Pipelines 2. Infrastructure as Code 3. Playbooks

Slide 182

Slide 182 text

Why did we pick it?

Slide 183

Slide 183 text

• Declarative infrastructure; same code quality focus on DevOps as well

Slide 184

Slide 184 text

• Declarative infrastructure; same code quality focus on DevOps as well • We want the vertical teams to define and manage their infra and not be blocked by a horizontal team

Slide 185

Slide 185 text

• Declarative infrastructure; same code quality focus on DevOps as well • We want the vertical teams to define and manage their infra and not be blocked by a horizontal team • Reduce operational on-call burden

Slide 187

Slide 187 text

• We picked: AWS CDK • CDK Python keeps the barrier low for engineers.

Slide 188

Slide 188 text

What are the effects?

Slide 189

Slide 189 text

Fast Turnaround •New Rails project from scratch, goes from 0 to (dev + stage + production) in 2 hours.

Slide 190

Slide 190 text

Fast Turnaround •New Rails project from scratch, goes from 0 to (dev + stage + production) in 2 hours. •This includes Load balancer, DNS, HTTPS, Secrets, Docker (Fargate) cluster, Redis, Workers, RDS PostgreSQL, and all the things.

Slide 192

Slide 192 text

Fast Turnaround •Adding a new AWS Lambda to dev + stage + prod: 15 to 30 minutes. Git push and you're in production. •30 min when complex pieces like SQS / SNS are involved •New Redis server? Add code, git commit, 15 min later: ✅

Slide 193

Slide 193 text

• Infrastructure thinking and action is fully absorbed into Engineering now. • DevOps team has spent < 1% of their total time on on-call issues. • They're working on pieces like turn-key environments, load-testing setups, security compliance, performance optimisations

Slide 194

Slide 194 text

• Lower-order/sub-prime environments are at very high parity with Production in terms of infra. • Remember better testing? This makes it possible and easy.

Slide 195

Slide 195 text

So, how do you do it?

Slide 196

Slide 196 text

• Relentless focus on Developer Productivity over infra costs.

Slide 197

Slide 197 text

• Relentless focus on Developer Productivity over infra costs. • Even in pure monetary terms, it's cheaper

Slide 198

Slide 198 text

• Relentless focus on Developer Productivity over infra costs. • Even in pure monetary terms, it's cheaper • We routinely and constantly save costs because developers have the headspace to think about high impact problems.

Slide 199

Slide 199 text

• Pick AWS CDK, Pick Cloud-Native: The combination is wildly effective. • Similar combinations exist with other providers

Slide 200

Slide 200 text

• Treat infra team as an engineering team, not a support team. • Actively help them avoid becoming Jira card pushers

Slide 201

Slide 201 text

1. Pipelines 2. Infrastructure as Code 3. Playbooks

Slide 202

Slide 202 text

Playbooks for Nearly Everything •Product Engineering? ✅ •Mobile Development? ✅ •Onboarding and Off-boarding? ✅ •Git Usage? ✅ (WIP) •Feature Flags? ✅ (WIP)

Slide 203

Slide 203 text

Templates for Nearly Everything •Decision Records? ✅ •New code repositories? ✅ •PRDs? ✅ •Jira User Stories? ✅ •Interview Problems? ✅ (WIP)

Slide 204

Slide 204 text

What is the idea? •Reduce decision fatigue by codifying frequent decisions. •Improve compliance through written procedures •Encourage participation by making it open and editable to all

Slide 205

Slide 205 text

1. Pipelines 2. Infrastructure as Code 3. Playbooks

Slide 206

Slide 206 text

Summary

Slide 207

Slide 207 text

Rationale 1. Continuous Deployments are good for you. 2. If you're not doing it, you're playing in hard mode. 3. At minimum, think preventive maintenance

Slide 208

Slide 208 text

Build Rigour: Pairing, TDD, Trunk Based Development, On-Call Rotation

Slide 209

Slide 209 text

Make Verification Easy: Testing, Instrumentation, Observability, Feature Flags

Slide 210

Slide 210 text

Reduce Operating Friction: Infrastructure as Code, Immutable Infra, Pipelines, Playbooks