The Path to Zero Touch Production

Zero Touch Prod is a Google-ism, and also a good idea. It's common that engineers, even at companies with strong security programs and cloud-native architecture, organically evolve operational processes that require they touch production daily.

As security practitioners, it's our job to keep our companies safe, both from bad actors and from humans making mistakes. Allowing humans to work directly in production infrastructure introduces mitigable risks.

This talk shares my universal theory of how to incrementally and collaboratively move a cloud-native organization to Zero Touch Prod. We'll talk about why people touch prod, how they touch prod, and what we can do about it. I'll include a summary of the various production access primitives available in AWS, when to use them, and how to do so safely. We'll dive deep on the implementation options for building blocks like JIT/Temporary Access and operational script running.

Wherever you are in your Access Journey, you'll walk away with practical and pragmatic next steps!

Rami McCarthy

June 18, 2024
Transcript

  1. @ramimacisabird Cloud Access Maturity Journey
    1. Initial
    2. Foundational
    3. Established
    4. Advanced
    5. Zero Touch Production
  2. Cloud Access Maturity Journey: Initial
    • Everybody has Admin and can Shell to production
    • Often using SSH over the internet
    • Until you hit 10-20 engineers (in non-sensitive sectors)
  3. Cloud Access Maturity Journey: Foundational
    • Standing RBAC (Read, Engineer, Admin)
    • Shell and Admin broadly available
    • By the time you hit ~50 engineers
  4. Cloud Access Maturity Journey: Established
    • More granular RBAC (team based or engineering sub-roles)
    • Shell and Admin start to get broken out
    • With a few narrow capabilities layered in
    • Technical controls for user lifecycle, JIT access
    • By the time you hit ~500 engineers
  5. Cloud Access Maturity Journey: Advanced
    • Shell and Admin going away due to improved automation and observability
    • Least-privileged standing access, granular RBAC or ABAC
    • Common administrative actions are implemented with guardrails in internal tools
    • JIT access is used for specific cloud resources and granular tools, including restrictive shells
  6. Cloud Access Maturity Journey: Zero Touch Production
    Every change in production must either be:
    • made by automation
    • pre-validated by software
    • made via an audited break-glass mechanism
  7. Cloud Access Maturity Journey: Zero Touch Production
    Every change in production must be …
    • pre-justified
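
The Zero Touch Production rule (every change is pre-justified and is made by automation, pre-validated by software, or made via an audited break-glass mechanism) can be sketched as a small check. This is a minimal illustration, not something from the talk: the `ChangePath` and `ProductionChange` names and the idea of a free-text justification field are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ChangePath(Enum):
    AUTOMATION = "automation"
    PREVALIDATED = "pre-validated by software"
    BREAK_GLASS = "audited break-glass"
    MANUAL = "manual"  # a human acting directly in production


@dataclass
class ProductionChange:
    # Hypothetical record of a change; the field names are assumptions.
    path: ChangePath
    justification: Optional[str] = None  # e.g. a ticket or incident reference
    audit_logged: bool = False


def is_zero_touch_compliant(change: ProductionChange) -> bool:
    """Return True if the change fits one of the three allowed paths."""
    if not change.justification:
        return False  # every change must be pre-justified
    if change.path in (ChangePath.AUTOMATION, ChangePath.PREVALIDATED):
        return True
    if change.path is ChangePath.BREAK_GLASS:
        return change.audit_logged  # break-glass only counts when audited
    return False  # direct manual changes are never compliant
```

A check like this would most likely live in a deploy pipeline or an access broker, where the path and justification are known before the change is applied.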
  8. Moving up the maturity curve
    1. Report & review metrics
    2. Make shell access safer
    3. Create shell alternatives & decompose roles
    4. Reduce standing access
    5. Improve DevEx of alternatives
    6. Add friction to risky access
  9. What makes this hard? Social
    When ambient privilege exists, expect systems and users to become dependent on it.
    (BeyondCorp and the long tail of Zero Trust)
  10. Moving up the maturity curve: 1. Report & review metrics 2. Make shell access safer 3. Create shell alternatives & decompose roles 4. Reduce standing access 5. Improve DevEx of alternatives 6. Add friction to risky access
  11. Getting a Shell
    • I’m muddling two concepts
    • “Authentication” (e.g. SSH) vs. “Networking” (e.g. VPN)
    • You can mix and match, make sure you have both
    • Some AWS primitives handle both with a single tool
  12. SSH
    This is generally where you start, with port 22 open to the world, or maybe IP Allowlisting. launch-wizard-1 haunts my dreams.
    Adopt: Options (Link 1, Link 2). Buy: Options (Link 1, Link 2).
    Docs: Connect to your Linux instance from Linux or macOS using SSH
    Pros:
    • Familiar to the average developer
    • … at least you’re using authentication?
    Cons:
    • Key management
    • Lack of centralized visibility
    • Risky, long-lived credentials
    SSH w/ Public Bastion
    Definitely better than raw SSH
    Docs: Linux Bastion Hosts on AWS
    Pros:
    • less internet exposure
    • less to keep patched and hardened
    Cons:
    • see: SSH
    • You’re still dangling a box out on the internet, and racing 0-days
    • Even more key management
  13. EC2 Instance Connect
    Use IAM policies and principals to control SSH access to your instances
    Docs: Connect to your Linux instance with EC2 Instance Connect
    Pros:
    • Solves public key management
    • Use IAM to control access
    • Audit connection requests
    Cons:
    • Doesn’t solve networking
    • Ephemeral public keys can be confusing
    • Instances need EC2IC installed
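
To make "use IAM to control access" concrete, here is a minimal sketch that builds an IAM policy document allowing one OS user to push an ephemeral key to one instance. The `ec2-instance-connect:SendSSHPublicKey` action and the `ec2:osuser` condition key are the ones AWS documents for EC2 Instance Connect; the function name and the example account, region, and instance values are placeholders chosen for illustration.

```python
import json


def instance_connect_policy(region: str, account_id: str,
                            instance_id: str, os_user: str) -> dict:
    """Build a policy letting a principal connect to one instance as one OS user."""
    instance_arn = f"arn:aws:ec2:{region}:{account_id}:instance/{instance_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Pushing the ephemeral public key is the access decision point.
                "Effect": "Allow",
                "Action": "ec2-instance-connect:SendSSHPublicKey",
                "Resource": instance_arn,
                "Condition": {"StringEquals": {"ec2:osuser": os_user}},
            },
            {
                # Lets the client look up the instance before connecting.
                "Effect": "Allow",
                "Action": "ec2:DescribeInstances",
                "Resource": "*",
            },
        ],
    }


# Placeholder values only; substitute real identifiers before use.
policy = instance_connect_policy("us-east-1", "123456789012",
                                 "i-0123456789abcdef0", "ec2-user")
print(json.dumps(policy, indent=2))
```

Scoping the resource to a single instance is a deliberately narrow choice here; a real deployment would more likely scope by tag or by team.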
  14. SSH over VPN
    Docs: Getting started with Client VPN
    Pros:
    • No more internet exposure! (unless you’re managing a VPN concentrator)
    • Service/destination agnostic
    Cons:
    • Now you have to admin a VPN
    • DevEx hit
    Cost: $$$
  15. SSH over VPN (WireGuard)
    It’s not a regular VPN, it’s a cool VPN!
    Docs: Tailscale: AWS reference architecture
    Pros:
    • VPN, but better
    • Mesh network
    Cons:
    • See: VPN
    • Confuses (bad) auditors
  16. SSM (Systems Manager Session Manager)
    Docs: AWS Systems Manager Session Manager
    Pros:
    • No need to figure out networking
    • Centralized access management tied to IAM
    • Logging and auditing bundled
    Cons:
    • New to developers, and some weird devex in the CLI
    • Slight pain in managing the Agent
    Cost: $free.99
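
One way "centralized access management tied to IAM" can look in practice is a tag-scoped Session Manager policy. The sketch below follows the pattern in the AWS documentation (`ssm:StartSession` limited by an `ssm:resourceTag/<key>` condition, plus the right to resume or end only your own sessions); the function name and the tag key and value are hypothetical, and the `${aws:username}` variable assumes IAM users rather than federated roles.

```python
def session_manager_policy(tag_key: str, tag_value: str) -> dict:
    """Build a policy allowing sessions only to instances carrying a given tag."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Only instances tagged tag_key=tag_value can be targeted.
                "Effect": "Allow",
                "Action": ["ssm:StartSession"],
                "Resource": ["arn:aws:ec2:*:*:instance/*"],
                "Condition": {
                    "StringLike": {f"ssm:resourceTag/{tag_key}": [tag_value]}
                },
            },
            {
                # Users may resume or terminate only sessions they started.
                # "${aws:username}" is an IAM policy variable, not Python
                # formatting, so it is left as a literal string.
                "Effect": "Allow",
                "Action": ["ssm:TerminateSession", "ssm:ResumeSession"],
                "Resource": ["arn:aws:ssm:*:*:session/${aws:username}-*"],
            },
        ],
    }


# Hypothetical tag; pick whatever tagging scheme your fleet already uses.
policy = session_manager_policy("Environment", "staging")
```

Treat this as a starting point under those assumptions: a production policy would usually also pin the allowed session document.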
  17. EC2 Instance Connect Endpoint
    Identity-aware TCP proxy, letting you SSH and RDP to non-routable instances
    Docs: Secure Connectivity from Public to Private: Introducing EC2 Instance Connect Endpoint
    Pros:
    • No need to figure out networking, no public IP
    • No need for an agent
    • IAM authorization
    Cons:
    • Retroactively locked to SSH & RDP (RIP rdsconn)
    • Clunky CLI
    • Limited logging
    Cost: $free.99
    tcp-over-ssh-over-websocket
  18. ECS Exec
    Uses SSM to establish a connection with a running container
    Docs: Using Amazon ECS Exec to access your containers on AWS Fargate and Amazon EC2
    Pros:
    • Logs the commands and command output
    • IAM authorization
    Cons:
    • Comparatively limited compatibility
    • 1 session per PID
    • Limited to ECS
    Cost: $free.99
  19. Zero-Trust Network Access
    Provide secure access to corporate applications without a VPN
    Docs: How Verified Access works
    Pros:
    • SSO integrated access
    • Good for internal web apps
    • Good for non-technical users
    Cons:
    • Complicated deployment (relatively)
    Cost: $$$
  20. Other Options
    VDI
    Docs: What is Amazon WorkSpaces?
    Pros:
    • Flexible, once you have them
    • Good for non-technical users
    • Can be turned into a safe “workbench”, but hard to limit data exfiltration
    Cons:
    • High administrative burden, high complexity
    Cost: $$
    Cloud9 IDE
    Docs: EKS - Accessing a private only API server
    Pros:
    • Nice to have options I guess
    • Good for cloud development environments!
    Cons:
    • Just don’t do this in prod, please
  21. tl;dr
    1. Use SSM to access instances, definitely by the time you hit Stage 3: Established
      • ECS Exec is good enough to add in if you’re using ECS, but isn’t going to replace SSM generically
    2. It’s nice to have WireGuard as a flexible option with decent devex and intrinsic security properties
      • ZTNA or an internal identity-aware proxy is an alternative/supplement, if you’re dealing with a bunch of internal Web Apps
  22. Moving up the maturity curve: 1. Report & review metrics 2. Make shell access safer 3. Create shell alternatives & decompose roles 4. Reduce standing access 5. Improve DevEx of alternatives 6. Add friction to risky access
  23. 🤔 Why must you touch production?
    But: They don’t need an interactive session 95% of the time
  24. Other Production Access (Capability: Technology)
    • Script Running: SSM RunCommand
    • Asynchronous Jobs: Specific to your application/service; AWS Systems Manager Automation
    • Internal tools platform: SSM Port Forwarding; https://clutch.sh/, Retool, Django Admin, etc.
    • Workbench: VDI-based, or custom
    • Read-only Analytics: IAM authentication for RDS, or replicate to a DataLake
    • + JIT Breakglass
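
Two of the shell alternatives above, SSM RunCommand and SSM Port Forwarding, map onto existing AWS CLI invocations. The sketch below only builds the argument lists, so it can be read or tested without AWS credentials. The `aws ssm send-command` and `aws ssm start-session` subcommands and the `AWS-RunShellScript` and `AWS-StartPortForwardingSession` documents are real; the helper names, tag values, and instance ID are placeholders for illustration.

```python
import json
import shlex
from typing import List


def run_command_args(tag_key: str, tag_value: str, commands: List[str]) -> List[str]:
    """Arguments to run shell commands on every instance carrying a tag."""
    return [
        "aws", "ssm", "send-command",
        "--document-name", "AWS-RunShellScript",
        "--targets", f"Key=tag:{tag_key},Values={tag_value}",
        "--parameters", json.dumps({"commands": commands}),
    ]


def port_forward_args(instance_id: str, remote_port: int, local_port: int) -> List[str]:
    """Arguments to forward a local port to a port on a private instance."""
    parameters = {
        "portNumber": [str(remote_port)],
        "localPortNumber": [str(local_port)],
    }
    return [
        "aws", "ssm", "start-session",
        "--target", instance_id,
        "--document-name", "AWS-StartPortForwardingSession",
        "--parameters", json.dumps(parameters),
    ]


# Print copy-pasteable commands; the values below are placeholders.
print(shlex.join(run_command_args("Role", "web", ["uptime"])))
print(shlex.join(port_forward_args("i-0123456789abcdef0", 8080, 8080)))
```

Building the argument list separately from executing it is a deliberate choice: an internal wrapper can log, review, or require approval for the exact command before handing it to `subprocess.run`.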
  26. Just-in-Time Access
    • Essential to making a smooth migration
    • Allows you to incrementally ratchet friction against sub-optimal paths while leaving them available
    • Straightforward implementation of break glass
    • Easier User Access Reviews
  27. Just-in-Time Access
    • Grant access in a scheduled way, or on-demand
    • Pair with other controls, like multi-party approval
    • Use as a choke-point for logging
    • Generates metrics
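
The JIT properties listed above (time-boxed grants, multi-party approval, a single choke-point for logging) can be sketched in a few dozen lines. This is an in-memory illustration under stated assumptions, not a description of any particular JIT product: `JitBroker`, `AccessRequest`, `Grant`, the four-hour cap, and the user and role names are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Set


@dataclass
class AccessRequest:
    requester: str
    role: str
    justification: str
    duration: timedelta
    approvers: Set[str] = field(default_factory=set)


@dataclass
class Grant:
    requester: str
    role: str
    expires_at: datetime

    def is_active(self, now: datetime) -> bool:
        return now < self.expires_at


class JitBroker:
    """Single choke-point: every approval and grant is written to audit_log."""

    def __init__(self, required_approvals: int = 1,
                 max_duration: timedelta = timedelta(hours=4)) -> None:
        self.required_approvals = required_approvals
        self.max_duration = max_duration
        self.audit_log: List[Dict[str, str]] = []

    def _log(self, event: str, request: AccessRequest, actor: str) -> None:
        self.audit_log.append({"event": event, "role": request.role,
                               "requester": request.requester, "actor": actor})

    def approve(self, request: AccessRequest, approver: str) -> None:
        if approver == request.requester:
            raise ValueError("requesters cannot approve their own access")
        request.approvers.add(approver)
        self._log("approved", request, approver)

    def grant(self, request: AccessRequest, now: datetime) -> Grant:
        if len(request.approvers) < self.required_approvals:
            self._log("denied", request, "broker")
            raise PermissionError("not enough approvals")
        # Cap the requested duration so access always expires.
        expires_at = now + min(request.duration, self.max_duration)
        self._log("granted", request, "broker")
        return Grant(request.requester, request.role, expires_at)


broker = JitBroker(required_approvals=1)
request = AccessRequest("alice", "prod-shell", "hypothetical incident",
                        timedelta(hours=12))
broker.approve(request, "bob")
start = datetime(2024, 6, 18, 12, 0, tzinfo=timezone.utc)
grant = broker.grant(request, start)  # expiry is capped at max_duration
```

Because every decision passes through the broker, the same log can feed user access reviews and the metrics discussed later in the talk.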
  28. Moving up the maturity curve: 1. Report & review metrics 2. Make shell access safer 3. Create shell alternatives & decompose roles 4. Reduce standing access 5. Improve DevEx of alternatives 6. Add friction to risky access
  29. Browser Extension
    • Insert reminders to refresh Identity Center sessions
    • Improve the disgusting AWS SSO login flow / auto click
    • Manage browser tab containers for multiple simultaneous sessions
  30. Unified (CLI) Wrapper
    • You need something like aws-vault or granted early
    • Wrapping it allows company-specific customization:
      • Smart role selection & integration with JIT access tooling -> improved discoverability
      • Smarter guidance based on error messages (no access to role vs. role not having access to action)
      • Single entry point for all your various production access utilities, abstract away the underlying primitives
      • Allows you to make implementation changes without interface changes
      • Collect metrics!
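
The "smarter guidance based on error messages" idea can be sketched as a small classifier that a wrapper runs on failed AWS calls. It relies on the common "is not authorized to perform: <action>" wording of AWS access-denied messages; the function name, category labels, hint text, and sample messages are all illustrative assumptions, and Identity Center setups may surface role-access failures with different wording.

```python
import re
from typing import Tuple

# Matches the action named in a typical AWS access-denied message.
_DENIED = re.compile(r"is not authorized to perform: ([A-Za-z0-9-]+:[A-Za-z0-9*]+)")


def explain_access_error(message: str) -> Tuple[str, str]:
    """Return (category, hint) for an error message from a failed AWS call."""
    match = _DENIED.search(message)
    if match is None:
        return ("unknown", "Not an access error we recognize; showing it as-is.")
    action = match.group(1)
    if action.lower() == "sts:assumerole":
        # The user could not get into the role at all.
        return ("no-role-access",
                "You do not have this role. Request it through JIT access.")
    # The user is in a role, but that role cannot perform the action.
    return ("role-lacks-action",
            f"Your current role cannot perform {action}. Try a different role.")


# Illustrative messages in the usual AWS format, with placeholder ARNs.
samples = [
    "User: arn:aws:iam::123456789012:user/alice is not authorized to perform: "
    "sts:AssumeRole on resource: arn:aws:iam::123456789012:role/Admin",
    "User: arn:aws:sts::123456789012:assumed-role/ReadOnly/alice is not "
    "authorized to perform: s3:PutObject on resource: arn:aws:s3:::example/key",
]
for sample in samples:
    print(explain_access_error(sample)[0])
```

The value is in the hint rather than the parsing: the wrapper can point the user at the JIT request flow or at a better-suited role instead of leaving them with a raw error.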
  31. Metrics: Goal Metrics
    • Number of employees (or Access Hours) with active production shell access
    • Usage of legacy shell, as compared to shell alternatives
    • Mean time to approval (in JIT), & auto-approval rate
    • Incidents attributable to human action
  32. Metrics: Guardrail Metrics
    • Incident Mean Time To Repair (MTTR)
    • Break-glass frequency
    • Denied JIT access request rate
    • CSAT
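
Several of the JIT-related goal and guardrail metrics (mean time to approval, auto-approval rate, denied request rate) fall out of the request log directly. The sketch below assumes a hypothetical `JitRequest` record with an outcome and a time-to-decision; it defines auto-approval rate as the share of approved requests that needed no human, and denied rate as a share of all requests. Those definitions are assumptions, not the talk's.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Dict, List, Optional, Union


@dataclass
class JitRequest:
    outcome: str  # "approved", "auto_approved", or "denied"
    time_to_decision: timedelta


def summarize(requests: List[JitRequest]) -> Dict[str, Optional[Union[timedelta, float]]]:
    """Compute JIT metrics; values are None when there is no data to divide by."""
    approved = [r for r in requests if r.outcome in ("approved", "auto_approved")]
    auto = [r for r in approved if r.outcome == "auto_approved"]
    denied = [r for r in requests if r.outcome == "denied"]

    mean_time_to_approval = None
    auto_approval_rate = None
    if approved:
        total = sum((r.time_to_decision for r in approved), timedelta())
        mean_time_to_approval = total / len(approved)
        auto_approval_rate = len(auto) / len(approved)

    denied_rate = len(denied) / len(requests) if requests else None
    return {
        "mean_time_to_approval": mean_time_to_approval,
        "auto_approval_rate": auto_approval_rate,
        "denied_rate": denied_rate,
    }


# Made-up sample data, only to show the shape of the output.
sample = [
    JitRequest("approved", timedelta(minutes=10)),
    JitRequest("auto_approved", timedelta(0)),
    JitRequest("denied", timedelta(minutes=5)),
    JitRequest("approved", timedelta(minutes=20)),
]
print(summarize(sample))
```

Reporting mean time to approval next to denied rate keeps the goal and guardrail views together, so added friction shows up as soon as it starts slowing people down.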
  33. Outstanding Challenges
    How do we offer users a single identity/role with a superset of their effective access?
  34. Takeaways
    https://speakerdeck.com/ramimac/zero-touch
    1. Zero touch production should be your North Star for cloud access
    2. But, zero touch production is rarely your highest ROI for security
    3. With six capabilities, you can deprecate production shell access
    4. JIT and a unified cloud access devex are critical enablers