Slide 1

Slide 1 text

Security & Chaos Engineering
@aaronrinehart @verica_io #chaosengineering

Slide 2

Slide 2 text

Areas Covered
● Combating Complexity in Software
● Chaos Engineering
● Resilience Engineering & Security
● Security Chaos Engineering

Slide 3

Slide 3 text

Aaron Rinehart
CTO & Co-Founder
● Former Chief Security Architect @UnitedHealth
● Former DoD, NASA Safety & Reliability Engineering
● Frequent speaker and author on Chaos Engineering & Security
● O’Reilly Author: Chaos Engineering, Security Chaos Engineering Books
● Pioneer behind Security Chaos Engineering
● Led ChaoSlingr team at UnitedHealth

Slide 4

Slide 4 text

Incidents, Outages, & Breaches are Costly

Slide 5

Slide 5 text

An Obvious Problem

Slide 6

Slide 6 text

Why do they seem to be happening more often?

Slide 7

Slide 7 text

Combating Complexity in Software

Slide 8

Slide 8 text

Our systems have evolved beyond human ability to mentally model their behavior.

Slide 9

Slide 9 text

everyone else
Our systems have evolved beyond human ability to mentally model their behavior.

Slide 10

Slide 10 text

“The growth of complexity in society has got ahead of our understanding of how complex systems work and fail” - Sidney Dekker

Slide 11

Slide 11 text

What Do You Mean by Complex Systems?

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

Complex?
Circuit Breaker Patterns; Continuous Delivery; Distributed Systems; Blue/Green Deployments; Cloud Computing; Service Mesh; Containers; Immutable Infrastructure; Infracode; Continuous Integration; Microservice Architectures; API; Auto Canaries; CI/CD; DevOps; Automation Pipelines

Slide 14

Slide 14 text

Security?
Mostly Monolithic; Requires Domain Knowledge; Prevention focused; Poorly Aligned; Defense in Depth; Stateful in nature; DevSecOps not widely adopted; Expert Systems; Adversary Focused

Slide 15

Slide 15 text

Simplify?

Slide 16

Slide 16 text

Software has officially taken over

Slide 17

Slide 17 text

Software Only Increases in Complexity

Slide 18

Slide 18 text

Software Complexity
● Accidental
● Essential

Slide 19

Slide 19 text

Woods' Theorem: “As the complexity of a system increases, the accuracy of any single agent’s own model of that system decreases” - Dr. David Woods

Slide 20

Slide 20 text

What does this have to do with my systems?

Slide 21

Slide 21 text

Question: How well do you really understand your own systems?

Slide 22

Slide 22 text

In Reality... Systems Engineering is Messy

Slide 23

Slide 23 text

In the beginning...we think it looks like

Slide 24

Slide 24 text

After a few months...
Hard Coded Passwords; Identity Conflicts; Lead Software Engineer finds a new job at Google; New Security Tool; Refactor Pricing; 300 Microservices Δ-> 850 Microservices; Cloud Provider API Outage; WAF Outage -> Disabled; Scalability Issues; Network is Unreliable; Autoscaling Keeps Breaking; Large Customer; Delayed Features; DNS Resolution Errors; Expired Certificate; Regulatory Audit; Rolling Sev1 Outage on Portal; Code Freeze

Slide 25

Slide 25 text

Years?...
Hard Coded Passwords; Identity Conflicts; Lead Software Engineer finds a new job at Google; New Security Tool; Refactor Pricing; 300 Microservices Δ-> 4000 Microservices; Cloud Provider API Outage; Firewall Outage -> Disabled; Scalability Issues; Network is Unreliable; Autoscaling Keeps Breaking; Large Customer Outage; Delayed Features; DNS Resolution Errors; Expired Certificate; Regulatory Audit; Rolling Sev1 Outages on Portal; Code Freeze
Hard Coded Passwords; Identity Conflicts; Lead Software Engineer finds a new job at Google; New Security Tool; Refactor Pricing; 300 Microservices Δ-> 850 Microservices; Cloud Provider API Outage; WAF Outage -> Disabled; Scalability Issues; Network is Unreliable; Autoscaling Keeps Breaking; Large Customer Outage; Delayed Features; DNS Resolution Errors; Expired Certificate; Regulatory Audit; Rolling Sev1 Outage on Portal; Merger with competitor; Misconfigured FW Rule Outage; Database Outage; Portal Retry Storm Outage; Orphaned Documentation; Corporate Reorg; Budget Freeze; Outsource overseas development; Exposed Secrets on GitHub; Code Freeze; Migration to New CSP; Upgrade to Java SE 12

Slide 26

Slide 26 text

Our systems become more complex and messy than we remember them

Slide 27

Slide 27 text

So what does all of this $&%* have to do with Security?

Slide 28

Slide 28 text

The Normal Condition is to FAIL

Slide 29

Slide 29 text

We need failure to Learn & Grow

Slide 30

Slide 30 text

“things that have never happened before happen all the time” - Scott Sagan, “The Limits of Safety”

Slide 31

Slide 31 text

How do we typically discover when our security measures fail?

Slide 32

Slide 32 text

Security Incidents
Security incidents are not effective measures of detection because at that point it's already too late.

Slide 33

Slide 33 text

No System is inherently Secure by Default; it's Humans that make them that way.

Slide 34

Slide 34 text

People Operate Differently when they expect things to fail

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

Chaos Engineering

Slide 38

Slide 38 text

Chaos Engineering
“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s ability to withstand turbulent conditions”

Slide 39

Slide 39 text

Chaos Engineering is about establishing order from Chaos

Slide 40

Slide 40 text

● Define steady state
● Formulate hypothesis
● Outline methodology
● Identify blast radius
● Observability is key
● Readily abortable

Chaos Monkey Story
● During Business Hours
● Born out of Netflix Cloud Transformation
● Put well-defined problems in front of engineers
● Terminate VMs on Random VPC Instances
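The experiment elements listed on this slide (steady state, hypothesis, blast radius, abortability) can be sketched as a small harness. This is only an illustrative sketch, not Chaos Monkey or any Netflix tooling: the `ChaosExperiment` class, the in-memory `fleet`, and the helper names are all assumptions made up for this example.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ChaosExperiment:
    """Hypothetical harness; every name here is illustrative."""
    hypothesis: str
    steady_state: Callable[[], bool]   # True while the system looks healthy
    inject: Callable[[str], None]      # applies the fault to one target
    rollback: Callable[[str], None]    # undoes the fault on one target
    blast_radius: List[str]            # the only targets the run may touch
    log: List[str] = field(default_factory=list)

    def run(self) -> bool:
        # Never inject into a system that is already unhealthy.
        if not self.steady_state():
            self.log.append("aborted: not in steady state before injection")
            return False
        touched: List[str] = []
        try:
            for target in self.blast_radius:
                self.inject(target)
                touched.append(target)
                self.log.append(f"injected fault into {target}")
                # Abort as soon as the steady state is lost.
                if not self.steady_state():
                    self.log.append(f"hypothesis refuted after {target}")
                    return False
            self.log.append("hypothesis held")
            return True
        finally:
            # Readily abortable: always undo whatever was injected.
            for target in reversed(touched):
                self.rollback(target)
                self.log.append(f"rolled back {target}")


# Toy fleet standing in for VM instances (True means running).
fleet = {"vm-a": True, "vm-b": True, "vm-c": True}


def healthy() -> bool:
    # Assumed steady state: at least two instances are serving traffic.
    return sum(fleet.values()) >= 2


def terminate(name: str) -> None:
    fleet[name] = False


def restart(name: str) -> None:
    fleet[name] = True


victim = random.choice(sorted(fleet))
experiment = ChaosExperiment(
    hypothesis="Losing one instance does not break the service",
    steady_state=healthy,
    inject=terminate,
    rollback=restart,
    blast_radius=[victim],
)
result = experiment.run()
```

Keeping the blast radius as an explicit list means the harness cannot touch anything the operator did not name, and the `finally` block restores the fleet whether the hypothesis holds or not.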

Slide 41

Slide 41 text

Who is doing Chaos?

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

November 2020

Slide 44

Slide 44 text

Instrumenting Chaos
Testing vs. Experimentation

Slide 45

Slide 45 text

Chaos Pitfalls: Breaking Things on Purpose
“I'm pretty sure I won’t have a job very long if I break things on purpose all day.” - Casey Rosenthal
The purpose of Chaos Engineering is NOT to “Break Things on Purpose”. If anything, we are trying to “Fix them on Purpose”!
Reference: Nora Jones, 8 Traps of Chaos Engineering

Slide 46

Slide 46 text

SECURITY CHAOS ENGINEERING

Slide 47

Slide 47 text

Hope is Not an Effective Strategy
“It worked in Star Wars but it won’t work here”

Slide 48

Slide 48 text

“Understand your system and where its security gaps are before an adversary does”

Slide 49

Slide 49 text

WE OFTEN MISREMEMBER WHAT OUR SYSTEMS REALLY ARE, AND AS A RESULT THE OPPORTUNITY FOR ACCIDENTS & MISTAKES INCREASES

Slide 50

Slide 50 text

Continuous Security Verification

Slide 51

Slide 51 text

Reduce Uncertainty by Building Confidence in how the system actually functions

Slide 52

Slide 52 text

Use Cases

Slide 53

Slide 53 text

Use Cases
● Incident Response
● Security Control Validation
● Security Observability
● Compliance Monitoring

Slide 54

Slide 54 text

Incident Response
“Response” is the problem with incident response.

Slide 55

Slide 55 text

Security Incidents are Subjective
No matter how much we prepare... we really don’t know very much.
Where? Why? Who? What? How?

Slide 56

Slide 56 text

Flip the Model
Post Mortem = Preparation

Slide 57

Slide 57 text

An Open Source Tool

Slide 58

Slide 58 text

ChaoSlingr Product Features
● ChatOps Integration
● Configuration-as-Code
● Example Code & Open Framework
● Serverless App in AWS
● 100% Native AWS
● Configurable Operational Mode & Frequency
● Opt-In | Opt-Out Model

Slide 59

Slide 59 text

An Example: PortSlingr Experiment Hypothesis: A Misconfigured or Unauthorized Port Change in AWS

Slide 60

Slide 60 text

Hypothesis: If someone accidentally or maliciously introduced a misconfigured port then we would immediately detect, block, and alert on the event.
Diagram labels: Alert SOC?; Config Mgmt?; Misconfigured Port Injection; IR Triage; Log data?; Wait...; Firewall?
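The hypothesis above can be exercised against a toy model to show the shape of such an experiment. This is a hedged sketch only: it does not reproduce ChaoSlingr's or PortSlingr's code and makes no AWS calls. `SecurityGroup`, `run_port_experiment`, and the three control checks are hypothetical stand-ins, and what each control "sees" is hard-coded purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class SecurityGroup:
    """In-memory stand-in for a cloud security group (not a real AWS object)."""
    name: str
    open_ports: Set[int] = field(default_factory=set)


Check = Callable[[SecurityGroup, Set[int]], bool]


def run_port_experiment(
    group: SecurityGroup,
    approved_ports: Set[int],
    rogue_port: int,
    controls: Dict[str, Check],
) -> Dict[str, bool]:
    """Open an unapproved port, record which controls reacted, then revert."""
    if rogue_port in approved_ports:
        raise ValueError("rogue_port must be outside the approved baseline")
    group.open_ports.add(rogue_port)  # the injected misconfiguration
    try:
        return {name: check(group, approved_ports)
                for name, check in controls.items()}
    finally:
        group.open_ports.discard(rogue_port)  # always revert the change


# Hypothetical controls; their behavior is invented for this example.
def config_mgmt_detects_drift(group: SecurityGroup, approved: Set[int]) -> bool:
    # Flags any port that is open but not in the approved baseline.
    return bool(group.open_ports - approved)


def firewall_blocks(group: SecurityGroup, approved: Set[int]) -> bool:
    # This toy firewall has no rule covering unapproved ports.
    return False


def soc_alert_fires(group: SecurityGroup, approved: Set[int]) -> bool:
    # This toy alert rule only watches for an unexpected port 22.
    return 22 in (group.open_ports - approved)


controls: Dict[str, Check] = {
    "config_mgmt": config_mgmt_detects_drift,
    "firewall": firewall_blocks,
    "soc_alert": soc_alert_fires,
}

sg = SecurityGroup("web-tier", {443})
results = run_port_experiment(sg, {443}, 8080, controls)
gaps = sorted(name for name, reacted in results.items() if not reacted)
```

In this toy run only configuration management notices the change, so `gaps` names the firewall and the SOC alert as the places where the "detect, block, and alert" hypothesis did not hold.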

Slide 61

Slide 61 text

Experimentation Opportunities
Diagram labels: Alert SOC?; Config Mgmt?; Misconfigured Port Injection; IR Triage; Log data?; Wait...; Firewall?

Slide 62

Slide 62 text

Stop looking for better answers and start asking better questions. - John Allspaw

Slide 63

Slide 63 text

Get your copy of the O’Reilly Chaos Engineering Book
Free Copy Compliments of Verica.io
verica.io/book
VERICA | CONTINUOUS VERIFICATION