Slide 1

Slide 1 text

The Path to Zero Touch Production
Rami McCarthy (@ramimacisabird)

Slide 2

Slide 2 text

I’m Rami 👋

Slide 3

Slide 3 text

Sometimes I look like this online 👋

Slide 4

Slide 4 text

@ramimacisabird https://ramimac.me

Slide 5

Slide 5 text

Let’s talk about: 🫵touching production🫵

Slide 6

Slide 6 text

Why do we care about touching production?

Slide 7

Slide 7 text

Mistakes happen

Slide 8

Slide 8 text

Have you ever applied the wrong Terraform workspace?

Slide 9

Slide 9 text

Have you ever deleted the wrong database?

Slide 10

Slide 10 text

Have you ever realized you were in the wrong terminal window?

Slide 11

Slide 11 text

Mistakes happen frequently
Google traced 13% of their incidents to human interaction

Slide 12

Slide 12 text

Access is attack surface

Slide 13

Slide 13 text

Sad fact: Access can be abused

Slide 14

Slide 14 text

Cloud Access Maturity Journey
1. Initial
2. Foundational
3. Established
4. Advanced
5. Zero Touch Production

Slide 15

Slide 15 text

Cloud Access Maturity Journey: Initial
• Everybody has Admin and can Shell to production
• Often using SSH over the internet
• Until you hit 10-20 engineers (in non-sensitive sectors)

Slide 16

Slide 16 text

Cloud Access Maturity Journey: Foundational
• Standing RBAC (Read, Engineer, Admin)
• Shell and Admin broadly available
• By the time you hit ~50 engineers

Slide 17

Slide 17 text

Cloud Access Maturity Journey: Established
• More granular RBAC (team based or engineering sub-roles)
• Shell and Admin start to get broken out
• With a few narrow capabilities layered in
• Technical controls for user lifecycle, JIT access
• By the time you hit ~500 engineers

Slide 18

Slide 18 text

Cloud Access Maturity Journey: Advanced
• Shell and Admin going away due to improved automation and observability
• Least-privileged standing access, granular RBAC or ABAC
• Common administrative actions are implemented with guardrails in internal tools
• JIT access is used for specific cloud resources and granular tools, including restrictive shells

Slide 19

Slide 19 text

Cloud Access Maturity Journey: Zero Touch Production

Slide 20

Slide 20 text

Cloud Access Maturity Journey: Zero Touch Production
Every change in production must either be:
• made by automation
• pre-validated by software
• made via an audited break-glass mechanism
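
The three permitted paths above can be read as a gate that every production change passes through. The following is a minimal sketch of that gate, not something from the talk: the names `ChangePath`, `ChangeRequest`, and `is_permitted`, and the fields on the request, are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import AbstractSet, Optional


class ChangePath(Enum):
    AUTOMATION = "automation"        # made by automation
    PRE_VALIDATED = "pre_validated"  # human-initiated, checked by software first
    BREAK_GLASS = "break_glass"      # emergency path, must be audited


@dataclass(frozen=True)
class ChangeRequest:
    # All fields are illustrative assumptions, not a real schema.
    actor: str
    path: ChangePath
    validation_passed: bool = False
    audit_record_id: Optional[str] = None


def is_permitted(change: ChangeRequest, automation_actors: AbstractSet[str]) -> bool:
    """Return True only if the change fits one of the three permitted paths."""
    if change.path is ChangePath.AUTOMATION:
        # Only known automation identities may claim the automation path.
        return change.actor in automation_actors
    if change.path is ChangePath.PRE_VALIDATED:
        return change.validation_passed
    if change.path is ChangePath.BREAK_GLASS:
        # Break-glass is allowed, but only with an audit record attached.
        return change.audit_record_id is not None
    return False
```

A human claiming the automation path, or a break-glass request with no audit record, is rejected; everything else about how validation or auditing happens is left out of this sketch.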

Slide 21

Slide 21 text

Cloud Access Maturity Journey: Zero Touch Production
Every change in production must be …
• pre-justified

Slide 22

Slide 22 text

Zero Touch Production Tactics: Safe Automation
• Rate Limiting
• Safety Checks
• Authority Delegation

Slide 23

Slide 23 text

Zero Touch Production Tactics: Safe Proxies
• Audit log
• Fine grained authorization
• Rate-limiting
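
To make the three safe-proxy properties concrete (audit log, fine-grained authorization, rate limiting), here is a small in-memory sketch. It is an assumption-heavy illustration rather than a real proxy: the `SafeProxy` class, its `allowed` mapping, and the sliding-window limit are all hypothetical.

```python
import time
from collections import defaultdict, deque


class SafeProxy:
    """Hypothetical proxy that sits between users and production commands."""

    def __init__(self, allowed, max_calls, window_seconds, clock=time.monotonic):
        self.allowed = allowed              # {user: set of permitted command names}
        self.max_calls = max_calls          # calls allowed per user per window
        self.window_seconds = window_seconds
        self.clock = clock
        self.audit_log = []                 # every attempt is recorded, allowed or not
        self._recent = defaultdict(deque)   # per-user timestamps of allowed calls

    def execute(self, user, command, handler):
        now = self.clock()
        decision = self._decide(user, command, now)
        self.audit_log.append(
            {"time": now, "user": user, "command": command, "decision": decision}
        )
        if decision != "allowed":
            raise PermissionError(f"{command!r} for {user!r}: {decision}")
        return handler()

    def _decide(self, user, command, now):
        # Fine-grained authorization: per-user, per-command allowlist.
        if command not in self.allowed.get(user, set()):
            return "denied: not authorized"
        # Rate limiting: sliding window over recent allowed calls.
        calls = self._recent[user]
        while calls and now - calls[0] >= self.window_seconds:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return "denied: rate limited"
        calls.append(now)
        return "allowed"
```

Denied attempts are logged before the exception is raised, so the audit log doubles as a source for the metrics discussed later in the talk.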

Slide 24

Slide 24 text

Moving up the maturity curve

Slide 25

Slide 25 text

It’s not this easy:

Slide 26

Slide 26 text

Security is about enabling the business

Slide 27

Slide 27 text

Engineers are often touching production for good reason

Slide 28

Slide 28 text

Moving up the maturity curve
1. Report & review metrics
2. Make shell access safer
3. Create shell alternatives & decompose roles
4. Reduce standing access
5. Improve DevEx of alternatives
6. Add friction to risky access

Slide 29

Slide 29 text

What makes this hard?
“Least privilege equals maximum effort” - Eric Brandwine

Slide 30

Slide 30 text

What makes this hard?
• Technical
• Social

Slide 31

Slide 31 text

What makes this hard? Technical: Transitive access

Slide 32

Slide 32 text

What makes this hard? Technical: Audit logging

Slide 33

Slide 33 text

What makes this hard? Technical: New moving parts

Slide 34

Slide 34 text

What makes this hard? Social
When ambient privilege exists, expect systems and users to become dependent on it.
- BeyondCorp and the long tail of Zero Trust

Slide 35

Slide 35 text

What makes this hard? Social: Prescriptive paths

Slide 36

Slide 36 text

What makes this hard? Social: Moving the cheese

Slide 37

Slide 37 text

Moving up the maturity curve
1. Report & review metrics
2. Make shell access safer
3. Create shell alternatives & decompose roles
4. Reduce standing access
5. Improve DevEx of alternatives
6. Add friction to risky access

Slide 38

Slide 38 text

Getting a Shell

Slide 39

Slide 39 text

Getting a Shell
• I’m muddling two concepts
• “Authentication” (e.g. SSH) vs. “Networking” (e.g. VPN)
• You can mix and match, make sure you have both
• Some AWS primitives handle both with a single tool

Slide 40

Slide 40 text

SSH
This is generally where you start, with port 22 open to the world, or maybe IP Allowlisting. launch-wizard-1 haunts my dreams
Docs: Connect to your Linux instance from Linux or macOS using SSH
Pros:
• Familiar to the average developer
• … at least you’re using authentication?
Cons:
• Key management
• Lack of centralized visibility
• Risky, long-lived credentials

SSH w/ Public Bastion
Definitely better than raw SSH
Docs: Linux Bastion Hosts on AWS
Pros:
• Less internet exposure
• Less to keep patched and hardened
Cons:
• See: SSH
• You’re still dangling a box out on the internet, and racing 0-days
• Even more key management

Slide 41

Slide 41 text

EC2 Instance Connect
Use IAM policies and principals to control SSH access to your instances
Docs: Connect to your Linux instance with EC2 Instance Connect
Pros:
• Solves public key management
• Use IAM to control access
• Audit connection requests
Cons:
• Doesn’t solve networking
• Ephemeral public keys can be confusing
• Instances need EC2IC installed
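
The "ephemeral public keys" point is easier to follow with the call in front of you. This sketch wraps the `send_ssh_public_key` operation of the boto3 `ec2-instance-connect` client; the wrapper name `push_ephemeral_key` is hypothetical, and the client is passed in so the example carries no AWS dependency of its own.

```python
def push_ephemeral_key(eic_client, instance_id: str, os_user: str, public_key: str) -> bool:
    """Push a one-time SSH public key to an instance via EC2 Instance Connect.

    `eic_client` is assumed to behave like boto3.client("ec2-instance-connect").
    Per the AWS docs the pushed key is only usable for about 60 seconds, so the
    caller should open the SSH connection right after this returns True.
    Older SDK versions may also require an AvailabilityZone argument.
    """
    response = eic_client.send_ssh_public_key(
        InstanceId=instance_id,
        InstanceOSUser=os_user,
        SSHPublicKey=public_key,
    )
    return bool(response.get("Success"))
```

Whether a caller may push a key at all is decided by IAM (the `ec2-instance-connect:SendSSHPublicKey` action), which is what replaces long-lived `authorized_keys` management. Network reachability to port 22 is still a separate problem, as the cons note.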

Slide 42

Slide 42 text

SSH over VPN
Docs: Getting started with Client VPN
Pros:
• No more internet exposure! (unless you’re managing a VPN concentrator)
• Service/destination agnostic
Cons:
• Now you have to admin a VPN
• DevEx hit
Cost: $$$

Slide 43

Slide 43 text

SSH over VPN (WireGuard)
It’s not a regular VPN, it’s a cool VPN!
Docs: Tailscale: AWS reference architecture
Pros:
• VPN, but better
• Mesh network
Cons:
• See: VPN
• Confuses (bad) auditors

Slide 44

Slide 44 text

SSM (Systems Manager Session Manager)
Docs: AWS Systems Manager Session Manager
Pros:
• No need to figure out networking
• Centralized access management tied to IAM
• Logging and auditing bundled
Cons:
• New to developers, and some weird devex in the CLI
• Slight pain in managing the Agent
Cost: $free.99
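
"Access management tied to IAM" means that who can open a session is expressed as an IAM policy rather than as SSH keys. Below is a sketch of such a policy, built as a Python dict. The helper name `session_manager_policy` and the example tag are hypothetical, and the policy is deliberately incomplete (for instance, it does not scope which session documents may be used).

```python
import json


def session_manager_policy(tag_key: str, tag_value: str) -> dict:
    """Build an IAM policy allowing Session Manager sessions to tagged instances.

    A sketch only: real policies usually need further statements and tighter
    resource ARNs for your accounts and regions.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Only instances carrying the given tag can be targeted.
                "Effect": "Allow",
                "Action": "ssm:StartSession",
                "Resource": "arn:aws:ec2:*:*:instance/*",
                "Condition": {
                    "StringEquals": {f"ssm:resourceTag/{tag_key}": tag_value}
                },
            },
            {
                # Users may only end or resume their own sessions. The
                # ${aws:username} variable assumes IAM users; federated
                # roles typically need a different variable.
                "Effect": "Allow",
                "Action": ["ssm:TerminateSession", "ssm:ResumeSession"],
                "Resource": "arn:aws:ssm:*:*:session/${aws:username}-*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(session_manager_policy("Team", "payments"), indent=2))
```

Scoping by tag is one way to get the team-based granularity described in the Established stage without maintaining per-instance access lists.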

Slide 45

Slide 45 text

EC2 Instance Connect Endpoint
Identity-aware TCP proxy, letting you SSH and RDP to non-routable instances
Docs: Secure Connectivity from Public to Private: Introducing EC2 Instance Connect Endpoint
Pros:
• No need to figure out networking, no public IP
• No need for an agent
• IAM authorization
Cons:
• Retroactively locked to SSH & RDP (RIP rdsconn)
• Clunky CLI
• Limited logging
Cost: $free.99
Note: tcp-over-ssh-over-websocket

Slide 46

Slide 46 text

ECS Exec
Uses SSM to establish a connection with a running container
Docs: Using Amazon ECS Exec to access your containers on AWS Fargate and Amazon EC2
Pros:
• Logs the commands and their output
• IAM authorization
Cons:
• Comparatively limited compatibility
• 1 session per PID
• Limited to ECS
Cost: $free.99

Slide 47

Slide 47 text

Zero-Trust Network Access
Provide secure access to corporate applications without a VPN
Docs: How Verified Access works
Pros:
• SSO integrated access
• Good for internal web apps
• Good for non-technical users
Cons:
• Complicated deployment (relatively)
Cost: $$$

Slide 48

Slide 48 text

Other Options

VDI
Docs: What is Amazon WorkSpaces?
Pros:
• Flexible, once you have them
• Good for non-technical users
• Can be turned into a safe “workbench”, but hard to limit data exfiltration
Cons:
• High administrative burden, high complexity
Cost: $$

Cloud9 IDE
Docs: EKS - Accessing a private only API server
Pros:
• Nice to have options I guess
• Good for cloud development environments!
Cons:
• Just don’t do this in prod, please

Slide 49

Slide 49 text

tl;dr
1. Use SSM to access instances, definitely by the time you hit Stage 3: Established
   • ECS Exec is good enough to add in if you’re using ECS, but isn’t going to replace SSM generically
2. It’s nice to have WireGuard as a flexible option with decent devex and intrinsic security properties
   • ZTNA or an internal identity-aware proxy is an alternative/supplement, if you’re dealing with a bunch of internal Web Apps

Slide 50

Slide 50 text

Moving up the maturity curve
1. Report & review metrics
2. Make shell access safer
3. Create shell alternatives & decompose roles
4. Reduce standing access
5. Improve DevEx of alternatives
6. Add friction to risky access

Slide 51

Slide 51 text

🤔 Why must you touch production?
Remember: Engineers are often touching production for good reason

Slide 52

Slide 52 text

🤔 Why must you touch production?
But: They don’t need an interactive session 95% of the time

Slide 53

Slide 53 text

Touching Production
Thesis: All access to production can be accomplished through one of six capabilities

Slide 54

Slide 54 text

Other Production Access
Capability: Technology
• Script Running: SSM RunCommand
• Asynchronous Jobs: Specific to your application/service; AWS Systems Manager Automation
• Internal tools platform: SSM Port Forwarding; https://clutch.sh/, Retool, Django Admin, etc.
• Workbench: VDI-based, or custom
• Read-only Analytics: IAM authentication for RDS, or replicate to a DataLake
• + JIT Breakglass

Slide 56

Slide 56 text

Script Running & Internal Tools
Example use cases

Slide 57

Slide 57 text

Script Running & Internal Tools
SSM RunCommand
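
As a rough illustration of script running without an interactive shell, the sketch below sends a pre-reviewed script through the boto3 SSM `send_command` operation using the AWS-managed `AWS-RunShellScript` document. The `REVIEWED_SCRIPTS` allowlist, its contents, and the `run_reviewed_script` helper are hypothetical; the SSM client is passed in so the example stays self-contained.

```python
# Hypothetical allowlist: names map to commands that went through code review.
REVIEWED_SCRIPTS = {
    "restart-app": ["sudo systemctl restart app.service"],
    "disk-usage": ["df -h"],
}


def run_reviewed_script(ssm_client, script_name, instance_ids, requested_by):
    """Run an allowlisted script on instances via SSM RunCommand.

    `ssm_client` is assumed to behave like boto3.client("ssm").
    Returns the CommandId, which can be used to fetch output later.
    """
    if script_name not in REVIEWED_SCRIPTS:
        # Arbitrary commands are rejected; that is the point of the allowlist.
        raise KeyError(f"{script_name!r} is not a reviewed script")
    response = ssm_client.send_command(
        InstanceIds=list(instance_ids),
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": REVIEWED_SCRIPTS[script_name]},
        Comment=f"{script_name} requested by {requested_by}",
    )
    return response["Command"]["CommandId"]
```

Because callers pick a script by name instead of typing commands, each run is both pre-validated and attributable, which lines up with the Zero Touch Production requirements earlier in the deck.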

Slide 58

Slide 58 text

Just-in-Time Access

Slide 59

Slide 59 text

Just-in-Time Access
• Essential to making a smooth migration
• Allows you to incrementally ratchet friction against sub-optimal paths while leaving them available
• Straightforward implementation of break glass
• Easier User Access Reviews

Slide 60

Slide 60 text

Just-in-Time Access
• Grant access in a scheduled way, or on-demand
• Pair with other controls, like multi-party approval
• Use as a choke-point for logging
• Generates metrics
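
The properties listed above (time-bound grants, multi-party approval, a single choke-point for logging) fit in a few lines of code. This is a toy in-memory sketch under stated assumptions: `JitBroker`, `Grant`, and the one-hour cap are hypothetical, and a real system would issue actual cloud credentials rather than a Python object.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Grant:
    user: str
    role: str
    approver: str
    expires_at: float  # epoch seconds

    def is_active(self, now: float) -> bool:
        return now < self.expires_at


@dataclass
class JitBroker:
    max_duration_s: float = 3600.0                    # assumed cap on any grant
    events: List[dict] = field(default_factory=list)  # choke-point log

    def request(self, user, role, approver, duration_s, now):
        # Multi-party approval: nobody approves their own request.
        if approver == user:
            self._log("denied", user, role, now, reason="self-approval")
            raise PermissionError("requests need a second party to approve")
        duration_s = min(duration_s, self.max_duration_s)
        grant = Grant(user, role, approver, now + duration_s)
        self._log("granted", user, role, now, expires_at=grant.expires_at)
        return grant

    def _log(self, outcome, user, role, now, **extra):
        self.events.append(
            {"outcome": outcome, "user": user, "role": role, "at": now, **extra}
        )
```

Since every request passes through `request`, the `events` list is the natural place to derive figures such as denied-request rate or access hours.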

Slide 61

Slide 61 text

Just-in-Time Access
rami.wiki/jit-cloud-access/

Slide 62

Slide 62 text

Moving up the maturity curve
1. Report & review metrics
2. Make shell access safer
3. Create shell alternatives & decompose roles
4. Reduce standing access
5. Improve DevEx of alternatives
6. Add friction to risky access

Slide 63

Slide 63 text

Developer Experience

Slide 64

Slide 64 text

Browser Extension
• Insert reminders to refresh Identity Center sessions
• Improve the disgusting AWS SSO login flow / auto click
• Manage browser tab containers for multiple simultaneous sessions

Slide 65

Slide 65 text

Unified (CLI) Wrapper
• You need something like aws-vault or granted early
• Wrapping it allows company-specific customization:
   • Smart role selection & integration with JIT access tooling -> improved discoverability
   • Smarter guidance based on error messages (no access to role vs. role not having access to action)
• Single entry point for all your various production access utilities, abstract away the underlying primitives
• Allows you to make implementation changes without interface changes
• Collect metrics!
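
The "smarter guidance based on error messages" idea can be sketched as a small classifier inside the wrapper. The example assumes AWS access-denied messages contain the phrase "is not authorized to perform: <action>", which is the commonly seen wording but is not guaranteed; the function name and the guidance strings are hypothetical.

```python
import re

# Assumed message shape: "... is not authorized to perform: service:Action ..."
_DENIED = re.compile(
    r"is not authorized to perform: (?P<action>[A-Za-z0-9-]+:[A-Za-z0-9*]+)"
)


def explain_access_denied(message: str) -> str:
    """Turn a raw access-denied message into guidance for the user."""
    match = _DENIED.search(message)
    if match is None:
        return "Unrecognized error; showing the raw message."
    action = match.group("action")
    if action == "sts:AssumeRole":
        # The user could not get into the role at all.
        return "You do not have access to this role. Request it through JIT."
    # The user is in a role, but the role lacks the permission.
    return f"Your role does not allow {action}. Pick a role that does, or request one."
```

Distinguishing the two cases matters because the fix differs: one is an access request for the role, the other is choosing or changing a role.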

Slide 66

Slide 66 text

Metrics

Slide 67

Slide 67 text

Metrics: Goal Metrics
• Number of employees (or Access Hours) with active production shell access
• Usage of legacy shell, as compared to shell alternatives
• Mean time to approval (in JIT), & auto-approval rate
• Incidents attributable to human action

Slide 68

Slide 68 text

Metrics: Guardrail Metrics
• Incident Mean Time To Repair (MTTR)
• Break-glass frequency
• Denied JIT access request rate
• CSAT
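
Several of these metrics fall straight out of JIT request records. The sketch below computes mean time to approval, auto-approval rate, denied-request rate, and access hours from a list of records; the `JitRequest` fields are assumed for illustration and do not reflect any particular tool's schema.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class JitRequest:
    # Hypothetical record shape for one JIT access request.
    requested_at: float        # epoch seconds
    decided_at: float          # epoch seconds
    approved: bool
    auto_approved: bool = False
    granted_hours: float = 0.0


def summarize(requests: Sequence[JitRequest]) -> dict:
    """Compute goal and guardrail metrics from JIT request records."""
    approved = [r for r in requests if r.approved]
    total = len(requests)
    waits = [r.decided_at - r.requested_at for r in approved]
    return {
        "mean_time_to_approval_s": mean(waits) if waits else None,
        "auto_approval_rate": (
            sum(r.auto_approved for r in approved) / len(approved) if approved else None
        ),
        "denied_rate": (total - len(approved)) / total if total else None,
        "access_hours": sum(r.granted_hours for r in approved),
    }
```

Tracking the goal metrics (approval time, auto-approval rate) next to the guardrail metric (denied rate) helps show whether added friction is landing on risky access or on everyone.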

Slide 69

Slide 69 text

Outstanding Challenges

Slide 70

Slide 70 text

Outstanding Challenges
How do we offer users a single identity/role with a superset of their effective access?

Slide 71

Slide 71 text

Outstanding Challenges
How do we help AWS Console users navigate Access Denied errors?

Slide 72

Slide 72 text

Outstanding Challenges
How do we bring JIT access inline to developer workflows?

Slide 73

Slide 73 text

Outstanding Challenges
How do we improve discovery, especially with granular tools?

Slide 74

Slide 74 text

Outstanding Challenges
How do we make it easier to deploy a safe workbench?

Slide 75

Slide 75 text

Takeaways

Slide 76

Slide 76 text

Takeaways
1. Zero touch production should be your North Star for cloud access
2. But, zero touch production is rarely your highest ROI for security
3. With six capabilities, you can deprecate production shell access
4. JIT and a unified cloud access devex are critical enablers
https://speakerdeck.com/ramimac/zero-touch

Slide 77

Slide 77 text

Thank you!
https://speakerdeck.com/ramimac/zero-touch
and thanks to Christophe Tafani-Dereeper, Daniel Grzelak, Shashwat Sehgal, Dhruv Ahuja, & Ian Mckay for early feedback on the CFP