Compliance & Regulatory Standards Are NOT Incompatible With Modern Development Best Practices

by Charity Majors

Slide 1

Slide 1 text

@mipsytipsy Compliance & Regulatory Standards Are ✨Not✨ Incompatible With Modern Development Best Practices

Slide 2

Slide 2 text

@mipsytipsy engineer/cofounder/CTO https://charity.wtf

Slide 3

Slide 3 text

“The Sociotechnical Path to High-Performing Teams” “It Is Time To Fulfill The Promise Of Continuous Delivery” “Debugging Is A Team Sport” “On Call Does Not Have To Suck” “Testing in Production”

Slide 4

Slide 4 text

“Okay, could you talk about that stuff, but also explain how and why we can do these things in a heavily regulated environment?” YES I can!

Slide 5

Slide 5 text

Modern software development practices 1.Engineers owning their own code in production 2.Practicing observability-driven development 3.Testing in production 4.Separating deploys from releases using feature flags 5.Continuous deployment (or at least delivery)

Slide 6

Slide 6 text

Modern software development practices are ✨ALL✨ about FAST FEEDBACK LOOPS: getting your code into production as fast as possible after writing it.

Slide 7

Slide 7 text

These practices, which have gone mainstream just in the last five years, aren’t about being trendy or showing off on twitter. They represent thousands of people-years of research and experimentation into how to build better software. How well your team performs can make the difference between loving your job or hating it; an exciting career or stagnation; happy users or angry users; even the success or failure of your company.

Slide 8

Slide 8 text

Practice #1: Engineers owning their code in production • No dev/ops divide • You write it, you are on call for it • You kick off your own deploys • Systems are becoming too complex for anyone to operate systems they didn’t write, or write systems they don’t also operate.

Slide 9

Slide 9 text

Practice #2: Observability-driven development • Instrument your code as you go • After you deploy it, you go and look at it in production • Is it doing what you expected? • Does anything else look…weird?
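A minimal sketch of what "instrument your code as you go" can look like, using only the Python standard library. The `instrumented` decorator, the event field names, the `EVENTS` list, and the `render_invoice` function are all hypothetical, made up for illustration. A real setup would send these events to whatever observability tool you use instead of an in-memory list.

```python
import functools
import json
import time

EVENTS = []  # assumption: stand-in for a real observability backend


def instrumented(func):
    """Emit one structured event per call: name, outcome, duration."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        event = {"name": func.__name__}
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
            event["ok"] = True
            return result
        except Exception as exc:
            # Record what went wrong, then let the error propagate.
            event["ok"] = False
            event["error"] = type(exc).__name__
            raise
        finally:
            event["duration_ms"] = (time.perf_counter() - start) * 1000
            EVENTS.append(event)
    return wrapper


@instrumented
def render_invoice(amount_cents):
    # Hypothetical business function, instrumented as it is written.
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    return {"total": amount_cents}


render_invoice(500)
try:
    render_invoice(0)
except ValueError:
    pass

# "Go and look at it": one line per call, success or failure.
for event in EVENTS:
    print(json.dumps(event))
```

The point is the habit, not the decorator: every change ships with enough context that you can answer "is it doing what I expected?" right after the deploy.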

Slide 10

Slide 10 text

Practice #3: Testing in production • Everybody tests in production… • …but only some of us admit it. • Instrument your code. Get used to looking at it. • And not just when things are broken. Know what good looks like. • Close the loop by looking at your code after you deploy it, every time.

Slide 11

Slide 11 text

Practice #4: Separating deploys from releases using feature flags • The key to reliable software is shipping smaller diffs, more frequently. • Using feature flags is how you do this. • Deploy continuously and flexibly. Roll changes out to users gradually, by groups, opt-in, etc. • Get your diffs out swiftly, while honoring scheduled release dates for product features.
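Here is one way the deploy/release split could be sketched, assuming a flag is just a small dictionary. The `bucket` and `is_enabled` helpers, the flag fields, and the user names are hypothetical; real feature-flag services offer much richer targeting. The code is deployed with the flag at 0%, and "release" becomes a config change.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: dict, user_id: str) -> bool:
    """Opt-in users always see the feature; others by rollout percentage."""
    if user_id in flag.get("opt_in_users", ()):
        return True
    return bucket(flag["name"], user_id) < flag.get("rollout_percent", 0)


# Deployed but dark: the code is live, nobody sees it except opt-ins.
new_checkout = {
    "name": "new_checkout",
    "rollout_percent": 0,
    "opt_in_users": {"alice"},
}
print(is_enabled(new_checkout, "alice"))  # True: opted in
print(is_enabled(new_checkout, "bob"))    # False: rollout is at 0%

# Release day: flip the percentage, no deploy required.
new_checkout["rollout_percent"] = 100
print(is_enabled(new_checkout, "bob"))    # True: everyone is in
```

Hashing the flag name together with the user ID keeps each user's bucket stable across requests, so raising the percentage only ever adds users.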

Slide 12

Slide 12 text

Practice #5: Continuous Delivery (or even better, Continuous Deployment) • NO manual QA, Change Advisory Board, or approval gates • We have an ocean of evidence that these do nothing to make software better, and in fact make software worse. • Deploy as fast as possible, • As automated as possible. • If you haven’t read it, read it: →

Slide 13

Slide 13 text

Security: “Explain it to me like I’m five” (ELI5) Confidentiality, Integrity, Availability. “You must protect customer data.” “You must protect your code.” You must demonstrate that you have policies, procedures, and safeguards in place to protect customer data, and supply evidence you are actually following those policies, procedures, and safeguards.

Slide 14

Slide 14 text

ELI5 ✅ Regulations: GDPR, CCPA, HIPAA, PCI/DSS, etc.; state banking regulations ✅ Frameworks: SOC2, ISO 27001, NIST, FedRAMP, etc. ✅ Written policies for how you are going to comply with regulations (security team) ✅ Contractual terms/DPAs for big customers (legal team) ❌ We are NOT fucking around with FedRAMP or state banking regulations in this talk.

Slide 15

Slide 15 text

Frameworks are typically very loose on the specifics. E.g. “People should not be able to see private data unless they have a business need to do so” (but the definition of “business need” is left up to us). Or, “You need to be scanning your code for known vulnerabilities before it goes live.” None of them expressly forbid any modern development practices. However, modern practices may conflict with your own written policies, the ones that are being used to demonstrate compliance. They may also conflict with terms in your own customer contracts.

Slide 16

Slide 16 text

But! Frameworks can be used to achieve compliance with regulations. Policies are living documents. They should be subject to regular review and reconsideration. Contracts should be negotiated, not blindly signed. Is your security team reviewing contracts before signing them? Are YOU? Are you giving your teams guidance on where to push back?

Slide 17

Slide 17 text

Compliance standards exist for a reason. Our goal here is NOT to avoid or evade them. The problem is that elaborate security theater makes us slower and less competitive, while also making us no more (or even LESS!) secure. Always honor the spirit of the control, when devising a solution. As engineers, we may be best positioned to find the solution that is actually secure, not only theatrically secure.

Slide 18

Slide 18 text

“We can’t have continuous delivery because …” (borrowed from a Jez Humble slide circa 2017: Jez Humble, “Continuous Delivery Sounds Great But It Won’t Work Here”, DevOpsDays Seattle 2017) Stated Reasons: 1. We’re regulated 2. We’re not building websites 3. We have too much legacy 4. Our people are too stupid Actual Reasons: • Our culture sucks • Our architecture sucks • We haven’t tried • We don’t care enough

Slide 19

Slide 19 text

1. We’re regulated 2. We’re not building websites 3. We have too much legacy 4. Our people are too stupid But this is a solved problem. This was a solved problem a decade ago! Etsy (since 2013), Amazon, Stripe, HP firmware, Branch Insurance, Jack Henry, Moov, Honeycomb, US gov (!!), some of your competitors. You can be, too.

Slide 20

Slide 20 text

How Etsy did it (in 2013!): • Decouple the cardholder data and PCI/DSS regulations from the rest of the system • The systems that form the cardholder data environment (CDE) are separated from the rest of Etsy’s environments at the physical, network, source code, and logical infra levels • The CDE is built and operated by a cross-functional (xfn) team that is solely responsible for the CDE. Again, this limits the scope of the PCI DSS regulations to just this team. https://queue.acm.org/detail.cfm?id=3190610

Slide 21

Slide 21 text

How Branch Insurance does it: • Regulated by 36 states and DC, annual SOC2s • Production data and envs mostly isolated from most engineers; only TLs can analyze production telemetry for PII purposes (despite masking and filtering and tokenizing) • Every developer has their own AWS account, massive investment in testing. Trunk-based development. • Uses serverless extensively; pushes to trunk many times/day, pushes to prod many times/week, in under an hour end to end.
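To make "masking and filtering and tokenizing" concrete, here is a small sketch of tokenizing PII fields before telemetry leaves the service. This is not a description of Branch's actual implementation. The field names, the `scrub` and `tokenize` helpers, and the hard-coded key are assumptions for illustration only; a real key would come from a secrets manager.

```python
import hashlib
import hmac

SECRET_KEY = b"example-only-key"  # assumption: never hard-code a real key
PII_FIELDS = {"email", "ssn"}     # hypothetical list of sensitive fields


def tokenize(value: str) -> str:
    """Replace a value with a keyed, deterministic, non-reversible token."""
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return digest[:16]


def scrub(event: dict) -> dict:
    """Tokenize PII fields; pass everything else through unchanged."""
    return {
        key: tokenize(str(value)) if key in PII_FIELDS else value
        for key, value in event.items()
    }


raw = {"email": "pat@example.com", "plan": "auto", "latency_ms": 42}
safe = scrub(raw)
print(safe)
```

Because the token is deterministic, engineers can still group and count by user in their telemetry without ever seeing the underlying value.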

Slide 22

Slide 22 text

How Honeycomb does it: • Certified SOC2 Type 2. Subject to GDPR, HIPAA, CCPA, state regs • Auto-deploys once an hour off trunk via a cron job. Extensive investment into tests. Takes about an hour for code to go live. • Practices trunk-based development, short-lived branches, code reviews • Access Management policy based on least privilege model. Access to PII/production data is limited to those who have a business need for it, i.e. need it to do their jobs.

Slide 23

Slide 23 text

Stop blaming regulations and frameworks. It’s all about how we choose to interpret the standards.

Slide 24

Slide 24 text

Every situation is ✨unique✨ Interpretations vary based on risk tolerance. Far too often, the paperwork seems to matter more than the actual security of the implementation. ☹ The difficulty here is that every product, company, and architecture is sui generis, so we can’t apply cookie-cutter solutions; we need to actually understand each use case before we can negotiate a solution. Also, we are terrible about sharing the solutions we do find.

Slide 25

Slide 25 text

Architecture The biggest architectural obstacle to continuous delivery is when you want to ship a single line of code, but you have to deploy the whole world. Can you deploy the service you’re working on without having to deploy all the dependencies? Can you test the service you’re working on on your laptop, without needing an integrated environment?

Slide 26

Slide 26 text

Architectural considerations: • Use a well-designed PaaS, if you can • Design for testability and deployability • Invest heavily in your test suite • If you need to unbundle a monolith, do not rip and replace; redesign iteratively into services. • Make sure services have their own databases! • Bring security in to the discussion from day one.

Slide 27

Slide 27 text

In general, engineers shouldn’t need to be constantly thinking about compliance. Mostly just when setting up a new thing, or when gathering PII — does this matter, and where should I put it? Engineering performance and productivity, on the other hand, should ALWAYS be on our minds. Entropy is constantly eating away at our efficiency.

Slide 28

Slide 28 text

One thing is exceptionally clear: If you want category-defining, competition-crushing engineering excellence, your engineering leadership will have to engage with security and legal as partners.

Slide 29

Slide 29 text

We need engineering leaders who understand the existential urgency of a short cycle time, and will fight for it. Not just once or twice. Every day.

Slide 30

Slide 30 text

“How well does your team perform?” != “how good are you at engineering”

Slide 31

Slide 31 text

High-performing teams get to spend the majority of their time solving interesting, novel problems that move the business materially forward. Lower-performing teams spend almost all their time firefighting, waiting on code review, context switching, rolling back, rolling forwards, reproducing tricky bugs, solving problems they thought were fixed, responding to customer complaints, fixing flaky tests, running deploys by hand, fighting with their infrastructure, fighting with their tools, fighting with each other, debugging merge conflicts, triaging failed deploys, debugging and reproducing problems for each other when the rest of the team can’t use the debugging tools adequately, waiting on CI/CD to complete, waiting on tests to run, waiting on the queue to deploy, re-running tests because they aren’t sure if the one that failed is a real failure or not, paging in a different project to work on while your other project is stalled… basically everything BUT making progress on core business problems.

Slide 32

Slide 32 text

How high-performing is YOUR team? 🔥1: How frequently do you deploy? 🔥2: How long does it take for code to go live? 🔥3: How many of your deploys fail? 🔥4: How long does it take to recover from an outage? 🔥5: How often are you paged outside work hours? DORA metrics: https://dora.dev
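The first four questions can be answered from data you probably already have. A rough sketch, assuming you can export deploy and outage timestamps: the records below are invented sample data, and the variable names are hypothetical rather than any official DORA tooling. The fifth question (off-hours pages) would come from your paging system and is left out here.

```python
from datetime import datetime
from statistics import median

# Invented sample data: (merged to main, live in production, deploy failed?)
deploys = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 40), False),
    (datetime(2024, 1, 1, 13, 0), datetime(2024, 1, 1, 13, 50), True),
    (datetime(2024, 1, 2, 10, 0), datetime(2024, 1, 2, 11, 0), False),
    (datetime(2024, 1, 3, 15, 0), datetime(2024, 1, 3, 15, 30), False),
]
# Invented sample data: (outage started, outage resolved)
outages = [
    (datetime(2024, 1, 1, 13, 50), datetime(2024, 1, 1, 14, 20)),
]
WINDOW_DAYS = 3

# 1: How frequently do you deploy?
deploys_per_day = len(deploys) / WINDOW_DAYS

# 2: How long does it take for code to go live?
lead_minutes = [
    (live - merged).total_seconds() / 60 for merged, live, _ in deploys
]
median_lead_minutes = median(lead_minutes)

# 3: How many of your deploys fail?
change_failure_rate = sum(failed for _, _, failed in deploys) / len(deploys)

# 4: How long does it take to recover from an outage?
recovery_minutes = [
    (end - start).total_seconds() / 60 for start, end in outages
]
mean_recovery_minutes = sum(recovery_minutes) / len(recovery_minutes)

print(f"deploys/day:          {deploys_per_day:.2f}")
print(f"median lead time:     {median_lead_minutes:.0f} min")
print(f"change failure rate:  {change_failure_rate:.0%}")
print(f"mean time to recover: {mean_recovery_minutes:.0f} min")
```

Tracking the trend over time matters more than any single number.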

Slide 33

Slide 33 text

It really, really, really, really, really pays off to be on a high performing team. Like REALLY. [Figures: 2019 numbers; 2021 numbers]

Slide 34

Slide 34 text

How do we build high-performing teams? “Hire the smartest people you can find. Recruit from the best schools. Aggressively poach as much talent from FAANG as you can.”

Slide 35

Slide 35 text

Who is going to be a better engineer in two years? An engineer on an “Elite” team: 3000 deploys/year, 9 outages/year, 6 hours firefighting. An engineer on a “Medium” team: 5 deploys/year, 65 outages/year, firefighting: constant.

Slide 36

Slide 36 text

Q: What happens when an engineer from the “elite” yellow bubble joins a medium-performing team in the blue bubble? A: Your productivity tends to rise (or fall) to match that of the team you join.

Slide 37

Slide 37 text

Great teams make great engineers. ❤

Slide 38

Slide 38 text

Your ability to ship code swiftly and safely has less to do with your personal knowledge of algorithms and data structures, and more to do with the sociotechnical system you participate in. sociotechnical (n) “Technology is the sum of ways in which social groups construct the material objects of their civilizations. The things made are socially constructed just as much as technically constructed. The merging of these two things, construction and insight, is sociotechnology” (wikipedia)

Slide 39

Slide 39 text

Technical leadership should focus intensely on constructing and tightening the feedback loops at the heart of their system. The smallest unit of software delivery is the team.

Slide 40

Slide 40 text

which brings us to… ✨CI / CD✨ 💜 Shipping is the heartbeat of your company. 💜 Shipping new code should be as small, as common, as regular, as boring, as unremarkable as a heartbeat. and CI/CD is how we get there. Right? So … do YOU do CI/CD?!??

Slide 41

Slide 41 text

“YES! We do CI/CD.” …but do you really? “Well, we have a Circle-CI account?”

Slide 42

Slide 42 text

Most people are doing *CI*… sorta … But CI is only the prelude to the main course. The ENTIRE POINT of CI is to prepare the path for you to do CD. Continuous DELIVERY? At least. Better yet, Continuous Deployment.

Slide 43

Slide 43 text

If you aren’t going to hook CI up to production, honestly, why even bother with CI? Just run your tests continuously in a shell loop from your laptop. Same deal, less hassle. ¯\_(ツ)_/¯ Once you merge your code to main, it should be automatically deployed by default. No manual gates. ✨One hour or less✨ Continuous Deployment is what will change your life.

Slide 44

Slide 44 text

P.S.: Fear of deploys is the single largest source of technical debt in most organizations.

Slide 45

Slide 45 text

The “You Had One Job” of engineering leadership is tuning the feedback loops of our sociotechnical systems. The speed, coverage, and cadence of your CI/CD pipeline will set the high water mark for your team’s performance. It can’t get any better or faster than that, but it can definitely get slower and worse downstream.

Slide 46

Slide 46 text

That precious interval of time between when you wrote the code and when the code has been deployed is everything. [Diagram: wrote the code → deployed the code] This is the cornerstone of high performing teams.

Slide 47

Slide 47 text

At that moment when you finish solving a problem, your mental state holds everything: your original intent, motivation, implementation details tried and tossed, tradeoffs, variable names, etc. This lasts for … minutes? hours? 😬 Until you move on to the next problem, maybe.

Slide 48

Slide 48 text

Which is why engineers can find upwards of 80% of all bugs in that magical, fleeting interval, so long as they 1) have good observability tooling, 2) instrument their code, and 3) go and look at it. Ask yourself: 🌟 is it doing what I expected it to? 🌟 and does anything else look … weird? A predictable interval of a few minutes lets you hook into the body’s own intrinsic reward systems. Muscle memory. Dopamine hits! 🥰

Slide 49

Slide 49 text

https://deepsource.io/blog/exponential-cost-of-fixing-bugs/ The cost of finding and fixing bugs goes up exponentially with time elapsed since development.

Slide 50

Slide 50 text

If it takes you hours (or even days!) to get a single line of code out, welcome to the software development death spiral.

Slide 51

Slide 51 text

a longer interval between when code is written & deployed leads to … larger diffs … longer turnaround time for code review … multiple changes getting batched up and deployed at once … makes it hard to identify whose code is at fault … which severs ownership of changes … and soon requires specialists to deploy, run, monitor, and debug … more and more engineering cycles are spent waiting on each other … now we need to hire more engineers, managers, TPMs, project managers … more people and teams incur more coordination costs … more time spent paging state in and out of your brain … which all costs MORE TIME …😱

Slide 52

Slide 52 text

You can spend your life chasing symptoms and pathologies … large diffs, long review turnaround, batched up changes in a single deploy, complicated outage recovery processes, bloated org, coordination costs, tool proliferation, too many teams, burnout, boredom, boilerplate, unhappy customers, competitive losses, too little time spent on core business problems… Or you can fix it at the source. 60 minutes or bust.

Slide 53

Slide 53 text

A fast cycle time is an enormous competitive advantage. It is worth taking up this fight. ☺ I have never known a company where engineers were happy and customers were unhappy, or vice versa. Users’ and engineers’ happiness tends to rise and fall in tandem.

Slide 54

Slide 54 text

“We can’t do this because of regulations…” Bullshit. Engineers can be overly literal. You are interpreters between security, legal, and tech…not transcriptionists. YOU are the experts in your code. YOU are the experts in software development. YOU are responsible for resolving conflicting requirements from security, legal and dev.

Slide 55

Slide 55 text

Again: there is NO LAW or regulatory framework preventing you from following modern software development best practices. None. Zero. Zip.

Slide 56

Slide 56 text

We are all on the same side. This is about better security, not worse. Documentation is a HUGE part of what matters, so use this to your advantage. Document what you’re going to do up front, do what you say you’re going to do, then document that you did it.

Slide 57

Slide 57 text

How to drive change in your org: Start by… building relationships. Get to know your peers in security and legal. Understand the constraints they are working under. They are probably held responsible for a pile of nightmares that you have no idea even exists. ☠ Come to understand their pain, develop empathy for them. Then help them understand your pain and develop some empathy for you. Start small. Look for ways to demonstrate what you’re talking about with small wins that benefit everyone. This will take time…possibly years, at calcified organizations. And you won’t progress much without SOME cover from the top. Get anyone and everyone you can to read “Accelerate”.

Slide 58

Slide 58 text

P.S. Learn this phrase: “Compensating Controls” “I’m not following the letter of the law, but I have this other system that proves I’m following the spirit of the law”
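As an illustration of a compensating control, here is a sketch of automated pre-deploy checks that also write their own evidence, in place of a manual approval gate. The `Change` fields, the three checks, and the record format are hypothetical. Whether a control like this satisfies your auditors depends on your written policies and on the conversation you have with security and legal.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Change:
    commit: str
    author: str
    reviewer: str
    tests_passed: bool
    vuln_scan_clean: bool


def control_failures(change: Change) -> list:
    """Return the reasons this change may not deploy (empty means go)."""
    failures = []
    if not change.reviewer or change.reviewer == change.author:
        failures.append("needs review by someone other than the author")
    if not change.tests_passed:
        failures.append("test suite did not pass")
    if not change.vuln_scan_clean:
        failures.append("vulnerability scan reported findings")
    return failures


def evidence_record(change: Change) -> str:
    """Document that you did it: one audit line per deploy attempt."""
    failures = control_failures(change)
    record = asdict(change)
    record["allowed"] = not failures
    record["failures"] = failures
    return json.dumps(record, sort_keys=True)


good = Change("abc123", "dev1", "dev2", True, True)
bad = Change("def456", "dev1", "dev1", True, False)
print(evidence_record(good))
print(evidence_record(bad))
```

The record is the point: the pipeline enforces the policy and produces the proof in the same step, on every deploy.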

Slide 59

Slide 59 text

Instrument for observability. Engineers shouldn’t need full production access; you should be able to understand your software with just commit access and observability. Observability is what gives us the confidence to move swiftly, not blindly.

Slide 60

Slide 60 text

Good SLOs actually check multiple boxes for us. Executive visibility into important numbers, monitoring, alerts, etc … instead of needing a different system for each one, SLOs cover many.
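A small worked example of how one SLO can feed several of those boxes at once. The `error_budget` helper, the 99.9% target, the 25% alert threshold, and the request counts are all assumptions chosen for illustration.

```python
def error_budget(slo_target: float, total: int, failed: int) -> dict:
    """Summarize one SLO window: availability, budget left, alert state."""
    allowed_failures = (1 - slo_target) * total
    remaining = (allowed_failures - failed) / allowed_failures
    return {
        "availability": 1 - failed / total,   # the number executives ask for
        "budget_remaining": remaining,        # the number engineers plan with
        "slo_met": failed <= allowed_failures,
        "should_alert": remaining < 0.25,     # assumed alerting threshold
    }


# Assumed window: 1,000,000 requests, 250 failures, 99.9% target.
report = error_budget(slo_target=0.999, total=1_000_000, failed=250)
for name, value in report.items():
    print(f"{name}: {value}")
```

With a 99.9% target over a million requests, roughly 1,000 failures are allowed, so 250 failures leaves about three quarters of the budget. The same calculation serves the dashboard, the alert, and the compliance evidence.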

Slide 61

Slide 61 text

“How well does your team perform?” Your team’s performance is defined by your sociotechnical systems, and especially by the speed of your feedback loops. It isn’t just about the security or economic arguments…

Slide 62

Slide 62 text

High-performing teams spend the majority of their time solving interesting, novel problems that move the business materially forward. Everybody wants to be on teams like these. ❤

Slide 63

Slide 63 text

This is a quality of life issue. This is an ethical issue. We must build high-performing teams that are low in toil and high in Autonomy, Mastery, and Meaning. This begins with keeping your intervals low and your feedback loops tight.