Security & Chaos Engineering: A Novel Approach to Crafting Secure and Resilient Distributed Systems

Modern systems pose a number of thorny challenges, and securing the transformation from legacy monolithic systems to distributed systems demands a change in both mindset and engineering toolkit. The security engineering toolkit is unfortunately out of step with today's approach to building, securing, and operating distributed systems. The speed, scale, and complex operations within microservice architectures make it tremendously difficult for humans to mentally model their behavior. If that is even remotely true, how is it possible to adequately secure services that are not fully comprehended even by the engineering teams that built them? Security Chaos Engineering helps teams realign their assumptions with the actual state of operational security and build confidence that their security works the way they think it does. Chaos Engineering allows security teams to proactively experiment on recurring incident patterns and derive new information about previously unknown underlying factors by reversing the postmortem and preparation phases. This is done by developing live-fire exercises that can be measured, managed, and automated. It develops teams by building a learning culture around system failure, challenging engineering teams to proactively and safely discover system weaknesses before they disrupt business outcomes. In this session we will introduce a new concept known as Security Chaos Engineering and show how it can be applied to create highly secure, performant, and resilient distributed systems.

Aaron Rinehart

October 16, 2020
Transcript

  1. @aaronrinehart @verica_io #chaosengineering Security & Chaos Engineering

  2. @aaronrinehart @verica_io #chaosengineering • Combating Complexity in Software • Chaos

    Engineering • Resilience Engineering & Security • Security Chaos Engineering Areas Covered
  3. Aaron Rinehart CTO & Co-Founder • Former Chief Security

    Architect @UnitedHealth • Former DoD, NASA Safety & Reliability Engineering • Frequent speaker and author on Chaos Engineering & Security • O’Reilly Author: Chaos Engineering, Security Chaos Engineering Books • Pioneer behind Security Chaos Engineering • Led ChaoSlingr team at UnitedHealth @aaronrinehart @verica_io #chaosengineering
  4. Incidents, Outages, & Breaches are Costly

  5. An Obvious Problem

  6. Why do they seem to be happening more often?

  7. @aaronrinehart @verica_io #chaosengineering Combating Complexity in Software

  8. Our systems have evolved beyond human ability to mentally model

    their behavior.
  9. everyone else Our systems have evolved beyond human ability

    to mentally model their behavior.
  10. “The growth of complexity in society has got ahead of

    our understanding of how complex systems work and fail” -Sidney Dekker
  11. @aaronrinehart @verica_io #chaosengineering What Do you Mean by Complex Systems?

  12. None
  13. Circuit Breaker Patterns Continuous Delivery Distributed Systems Blue/Green Deployments

    Cloud Computing Service Mesh Containers Immutable Infrastructure Infracode Continuous Integration Microservice Architectures API Auto Canaries CI/CD DevOps Automation Pipelines Complex?
  14. Mostly Monolithic Requires Domain Knowledge Prevention focused Poorly Aligned Defense

    in Depth Stateful in nature DevSecOps not widely adopted Security? Expert Systems Adversary Focused
  15. Simplify?

  16. Software has officially taken over

  17. Software Only Increases in Complexity

  18. Accidental Essential Software Complexity

  19. “As the complexity of a system increases, the accuracy of

    any single agent’s own model of that system decreases” - Dr. David Woods (Woods' Theorem)
  20. What does this have to do with my systems?

  21. Question How well do you really understand your own systems?

  22. In Reality... Systems Engineering is Messy

  23. In the beginning...we think it looks like

  24. After a few months…. Hard Coded Passwords Identity Conflicts Lead

    Software Engineering finds a new job at Google New Security Tool Refactor Pricing 300 Microservices Δ-> 850 Microservices Cloud Provider API Outage WAF Outage -> Disabled Scalability Issues Network is Unreliable Autoscaling Keeps Breaking Large Customer Delayed Features DNS Resolution Errors Expired Certificate Regulatory Audit Rolling Sev1 Outage on Portal Code Freeze
  25. Years?…. Hard Coded Passwords Identity Conflicts Lead Software Engineering finds

    a new job at Google New Security Tool Refactor Pricing 300 Microservices Δ-> 4000 Microservices Cloud Provider API Outage Firewall Outage -> Disabled Scalability Issues Network is Unreliable Autoscaling Keeps Breaking Large Customer Outage Delayed Features DNS Resolution Errors Expired Certificate Regulatory Audit Rolling Sev1 Outages on Portal Code Freeze Hard Coded Passwords Identity Conflicts Lead Software Engineering finds a new job at Google New Security Tool Refactor Pricing 300 Microservices Δ-> 850 Microservices Cloud Provider API Outage WAF Outage -> Disabled Scalability Issues Network is Unreliable Autoscaling Keeps Breaking Large Customer Outage Delayed Features DNS Resolution Errors Expired Certificate Regulatory Audit Rolling Sev1 Outage on Portal Merger with competitor Misconfigured FW Rule Outage Database Outage Portal Retry Storm Outage Orphaned Documentation Corporate Reorg Budget Freeze Outsource overseas development Exposed Secrets on GitHub Code Freeze Migration to New CSP Upgrade to Java SE 12
  26. Our systems become more complex and messy than we remember

    them
  27. So what does all of this $&%* have to do

    with Security?
  28. The Normal Condition is to FAIL

  29. We need failure to Learn & Grow

  30. “things that have never happened before happen all the time”

    –Scott Sagan “The Limits of Safety”
  31. How do we typically discover when our security measures fail?

  32. Security incidents are not effective measures of detection because at

    that point it's already too late Security Incidents
  33. No System is inherently Secure by Default, it's Humans that

    make them that way.
  34. People Operate Differently when they expect things to fail

  35. None
  36. None
  37. @aaronrinehart @verica_io #chaosengineering Chaos Engineering

  38. “Chaos Engineering is the discipline of experimenting on a distributed

    system in order to build confidence in the system’s ability to withstand turbulent conditions” Chaos Engineering
  39. Chaos Engineering Is about establishing order from Chaos

  40. • Define steady state • Formulate hypothesis • Outline methodology

    • Identify blast radius • Observability is key • Readily abortable Chaos Monkey Story • During Business Hours • Born out of Netflix Cloud Transformation • Put well defined problems in front of engineers. • Terminate VMs on Random VPC Instances
  41. Who is doing Chaos?

  42. None
  43. November 2020

  44. Experimentation Testing vs. Instrumenting Chaos

  45. • Formulate hypothesis • Outline methodology • Identify blast radius

    • Observability is key • Readily abortable Chaos Pitfalls: Breaking things on Purpose “I'm pretty sure I won’t have a job very long if I break things on purpose all day.” -Casey Rosenthal The purpose of Chaos Engineering is NOT to “Break Things on Purpose”. If anything, we are trying to “Fix them on Purpose”! Reference: Nora Jones 8 Traps of Chaos Engineering
  46. SECURITY CHAOS ENGINEERING

  47. “It worked in Star Wars but it won’t work here”

    Hope is Not an Effective Strategy
  48. “Understand your system and where its security gaps are before

    an adversary does”
  49. WE OFTEN MISREMEMBER WHAT OUR SYSTEMS REALLY ARE, AND AS

    A RESULT THE OPPORTUNITY FOR ACCIDENTS & MISTAKES INCREASES
  50. Continuous Security Verification

  51. Reduce Uncertainty by Building Confidence in how the system actually

    functions
  52. Use Cases

  53. • Incident Response • Security Control Validation • Security Observability

    • Compliance Monitoring Use Cases
  54. “Response” is the problem with incident response. Incident Response

  55. We really don’t know very much No matter how much

    we prepare... Security Incidents are Subjective Where? Why? Who? What? How?
  56. Post Mortem = Preparation Flip the Model

  57. An Open Source Tool

  58. • ChatOps Integration • Configuration-as-Code • Example Code & Open

    Framework ChaoSlingr Product Features • Serverless App in AWS • 100% Native AWS • Configurable Operational Mode & Frequency • Opt-In | Opt-Out Model
  59. An Example: PortSlingr Experiment Hypothesis: A Misconfigured or Unauthorized

    Port Change in AWS
  60. Hypothesis: If someone accidentally or maliciously introduced a misconfigured port

    then we would immediately detect, block, and alert on the event. Alert SOC? Config Mgmt? Misconfigured Port Injection IR Triage Log data? Wait... Firewall?
  61. Alert SOC? Config Mgmt? Misconfigured Port Injection IR Triage Log

    data? Wait... Firewall? Experimentation Opportunities
  62. Stop looking for better answers and start asking better questions.

    - John Allspaw
  63. verica.io/book VERICA | CONTINUOUS VERIFICATION Get your copy of the

    O’Reilly Chaos Engineering Book Free Copy Compliments of Verica.io
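
The PortSlingr hypothesis on slides 59 and 60 (a misconfigured port is introduced, and the team expects to immediately detect, block, and alert on it) can be sketched as a small experiment loop that follows the steps from slide 40: define steady state, inject, observe, and roll back. This is only an illustrative sketch, not ChaoSlingr's actual code: `SecurityGroup`, `Controls`, and `run_port_experiment` are hypothetical names, and the "cloud" is an in-memory stand-in rather than real AWS calls.

```python
from dataclasses import dataclass, field


@dataclass
class SecurityGroup:
    """In-memory stand-in for a cloud security group (hypothetical, not an AWS API)."""
    name: str
    open_ports: set = field(default_factory=set)


@dataclass
class Controls:
    """Hypothetical security controls under test; each flag models whether it works."""
    approved_ports: frozenset
    detects: bool = True
    blocks: bool = True
    alerts: bool = True

    def react(self, group: SecurityGroup) -> dict:
        """Inspect the group and report which controls actually fired."""
        drift = group.open_ports - self.approved_ports
        detected = self.detects and bool(drift)
        blocked = detected and self.blocks
        if blocked:
            group.open_ports -= drift  # the control reverts the unauthorized change
        alerted = detected and self.alerts
        return {"detected": detected, "blocked": blocked, "alerted": alerted}


def run_port_experiment(group: SecurityGroup, controls: Controls, port: int) -> dict:
    """Inject one unauthorized port, record the reaction, and always roll back."""
    steady_state = set(group.open_ports)  # define steady state
    if not steady_state <= controls.approved_ports:
        raise RuntimeError("system is not in steady state; aborting experiment")
    group.open_ports.add(port)  # inject the misconfigured port
    try:
        observed = controls.react(group)
    finally:
        # Readily abortable: restore the group no matter what happened.
        group.open_ports = set(steady_state)
    observed["hypothesis_held"] = all(
        observed[key] for key in ("detected", "blocked", "alerted")
    )
    return observed


if __name__ == "__main__":
    group = SecurityGroup("web-tier", {443})
    # Assume alerting is silently broken, the kind of gap the experiment should expose.
    controls = Controls(approved_ports=frozenset({443}), alerts=False)
    print(run_port_experiment(group, controls, 2222))
```

The blast radius here is a single group, and the rollback runs in a `finally` block so the experiment leaves the system as it found it. A result where `hypothesis_held` is false is the useful outcome: it points at a control (SOC alerting, in this made-up run) that does not behave the way the team assumed.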