Adopt chaos engineering techniques in your daily work

DragonBe
February 03, 2020

What happens when your database server runs out of disk space? Will your customers still be able to purchase from your webshop when your web server is at maximum capacity? Are your surveys still valid after a bot has filled out your forms a couple of thousand times?

These are just a few of the many things that can and will go wrong in your production environments. Are you confident your systems will still deliver value to your customers when the worst possible thing happens? The only way to know for sure is to adopt chaos engineering techniques. As popularised by Netflix with their open-source Chaos Monkey and Simian Army tools, chaos engineering means putting our systems under constant stress to ensure that we can face disruptive disasters at any given time.

In this talk I walk through some of the disasters we faced in the past decade and how we learned to build resilience by design into all of our projects. I'll also share our learnings and our successes from the times Armageddon took place. It will be an exciting experience that lets you become Dr. Evil in your own company.

Transcript

  1. Michelangelo van Dam (@DragonBe)
    Adopt chaos engineering techniques in your daily work

  2. Michelangelo van Dam (@DragonBe)
    Michelangelo van Dam
    I'm a senior #php architect, co-founder and #ceo of @in2itvof, #community leader at @phpbenelux, coach at @CoderDojoBelgium, #digitalnomad, likes #coffee.

  3. Michelangelo van Dam (@DragonBe)
    What could possibly go wrong?

  4. Michelangelo van Dam (@DragonBe)
    Chaos Engineering
    Brief intro to the world of chaos engineering
    Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions.
    Source: Wikipedia

  6. Michelangelo van Dam (@DragonBe)
    Chaos Engineering
    Brief intro to the world of chaos engineering
    Chaos engineering is experimenting in production to withstand turbulent and unexpected conditions.
    Source: Wikipedia

  7. Michelangelo van Dam (@DragonBe)
    Chaos Engineering
    In other words...
    Doom 5 Artwork, courtesy of id Software

  8. Michelangelo van Dam (@DragonBe)
    Our hard lessons

  9. Michelangelo van Dam (@DragonBe)
    Sep 2009: Power surge destroys hardware
    Not too much power please
    Facts
    ● There was a brief power outage
    ● When the power came back on, there was a surge in voltage
    ● The surge immediately destroyed all hardware
    Constraints
    ● All devices had retail-grade power sockets with surge protection
    ● The UPS system protected the server, not the other devices
    Solution
    ● Rewired the office with built-in, professional power surge protection
    ● Upgraded the UPS to support static hardware (printer, phone, monitors, NAS devices)

  10. Michelangelo van Dam (@DragonBe)
    Apr 2011: AWS goes down, and so does everyone else
    Putting all eggs in different baskets ended up in the same basket after all
    Facts
    ● AWS had issues and was down for a whole week
    ● With AWS down, many other services were not accessible either
    ● We used these different services to avoid exactly this
    Constraints
    ● No information given by Amazon
    ● No information given by our service providers
    Solution
    ● We replicated critical services on-prem

  11. Michelangelo van Dam (@DragonBe)
    Mar 2012: Server down & inaccessible
    Running out of wiggle space
    Facts
    ● Server was under heavy load
    ● Logs were filling up disk space quickly
    ● The alert for < 20% disk space came too late
    Constraints
    ● We weren’t able to SSH into the server: no space left
    ● Logging was on the same server as the application
    ● Monitoring was on a different server, but behind the same switch
    Solution
    ● Set up a centralized logging server
    ● Moved monitoring to a different hosting provider
    ● Added mountable storage features to our servers
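
The late alert on this slide came from a fixed "< 20% free" threshold, which can be complemented by a rate-based check. The following is a minimal Python sketch, not the setup used in the talk: `DiskSample`, `seconds_until_full`, `should_alert`, and the one-hour lead time are hypothetical names and values chosen for illustration. It extrapolates linearly from two usage samples to estimate when the disk fills up, so fast-growing logs can trigger an alert while plenty of space is still free.

```python
import shutil
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiskSample:
    timestamp: float  # seconds since some fixed reference point
    used_bytes: int   # bytes in use at that moment


def free_ratio(path: str = ".") -> float:
    """Fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def seconds_until_full(older: DiskSample, newer: DiskSample,
                       total_bytes: int) -> Optional[float]:
    """Linear extrapolation of disk growth; None if usage is not growing."""
    elapsed = newer.timestamp - older.timestamp
    growth = newer.used_bytes - older.used_bytes
    if elapsed <= 0 or growth <= 0:
        return None
    bytes_per_second = growth / elapsed
    return (total_bytes - newer.used_bytes) / bytes_per_second


def should_alert(free: float, eta_seconds: Optional[float],
                 min_free: float = 0.20, min_eta: float = 3600.0) -> bool:
    """Alert on low free space OR on a short projected time until full."""
    if free < min_free:
        return True
    return eta_seconds is not None and eta_seconds < min_eta
```

With half the disk still free but a projected 40 seconds until full, `should_alert(0.5, 40.0)` already fires, which a static threshold alone would miss.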

  12. Michelangelo van Dam (@DragonBe)
    Feb 2013: Internet outage by national ISP
    Being offline in an online world
    Facts
    ● No internet for 14 hours
    ● Deadline to deliver
    ● QA and deploy tools were online
    Constraints
    ● Major ISP in Belgium
    ● Alternative locations had no internet either
    Solution
    ● Mobile hotspot in the office from a competing provider

  13. Michelangelo van Dam (@DragonBe)
    Feb 2020: Covid-19 Pandemic (Coronavirus)
    Stay in your home!
    Facts
    ● Viral infection that sickens and kills people globally
    ● China, Korea and Italy on lock-down
    ● Belgian government advises maintaining very good hygiene and avoiding unnecessary human contact
    ● Events all over the world are cancelled
    Constraints
    ● Businesses are not ready for a WFH workforce
    ● Events still depend on face-to-face interaction, not for the sessions but for the sponsors
    Solution
    ● Adopt a remote-first mentality in your company, school, hobby group and event organization

  14. Michelangelo van Dam (@DragonBe)
    2021: Supply-Chain Attacks (Still ongoing)
    Poisoning the well ☠
    Facts
    ● Successful compromise of source code used by many other applications (SolarWinds, Codecov)
    ● Failed attempts to breach open source languages (PHP, Python, Ruby)
    ● Active phishing campaigns ongoing
    Constraints
    ● A breach in one component impacts all dependent projects
    ● Hard to detect because it is provided by a trusted source and signatures match up
    Solution
    ● MFA for all code repositories required
    ● Commit signing with certificate or GPG key on hardware token (YubiKey)
    ● Library management, monitoring and alerting
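
One small building block of the "library management" item above is checking a downloaded dependency against a checksum that was pinned when the dependency was reviewed. The Python sketch below is an illustration only; `sha256_hex` and `verify_artifact` are hypothetical helper names, not part of any tool named in the talk. Note the limitation the slide itself points out: a pinned hash detects tampering after the pin was made, but it does not help when the trusted upstream release was already compromised.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the artifact bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, pinned_sha256: str) -> bool:
    """True only if the artifact matches the digest pinned at review time.

    hmac.compare_digest performs a constant-time comparison.
    """
    return hmac.compare_digest(sha256_hex(data), pinned_sha256.lower())
```

A typical use would be to record `sha256_hex(reviewed_bytes)` in a lock file and refuse to install anything for which `verify_artifact` returns `False`.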

  15. Michelangelo van Dam (@DragonBe)
    2021: Floods & wildfires (Still ongoing)
    Effects of climate change will affect everyone 🌍
    Facts
    ● Parts of eastern Belgium, the Netherlands and western Germany were flooded, including some data centers
    ● Wildfires in Eastern Europe, the US and Canada cause issues for networks and data centers
    Constraints
    ● No access to data and services because the DC is not reachable
    ● No backups accessible because they were kept in the DC
    ● Rebuilding infrastructure is hard when not using infrastructure as code
    Solution
    ● Multi-regional replication of applications
    ● Backups also kept at a different DC
    ● Ensure you can recreate your infrastructure and applications in an automated fashion

  16. Michelangelo van Dam (@DragonBe)
    We needed to prepare for chaos

  17. Michelangelo van Dam (@DragonBe)
    Inspired by Netflix
    Chaos Monkey & Chaos Engineering
    Don’t wait for chaos, create it
    Observe, learn and remediate
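
The "don't wait for chaos, create it" idea can be illustrated with a few lines of Python. This is a sketch in the spirit of Chaos Monkey, not Netflix's implementation: `pick_victim`, `unleash`, the instance names, and the `terminate` callback are all hypothetical. It picks one random instance, skips anything on a protected list, and hands the victim to whatever function actually shuts it down.

```python
import random
from typing import Callable, FrozenSet, Optional, Sequence


def pick_victim(instances: Sequence[str],
                protected: FrozenSet[str] = frozenset(),
                rng: Optional[random.Random] = None) -> Optional[str]:
    """Choose one random instance that is not on the protected list."""
    rng = rng or random.Random()
    candidates = [name for name in instances if name not in protected]
    if not candidates:
        return None
    return rng.choice(candidates)


def unleash(instances: Sequence[str],
            terminate: Callable[[str], None],
            protected: FrozenSet[str] = frozenset(),
            rng: Optional[random.Random] = None) -> Optional[str]:
    """Terminate one random unprotected instance and report which one."""
    victim = pick_victim(instances, protected, rng)
    if victim is not None:
        terminate(victim)
    return victim
```

Passing the `terminate` function in as a parameter keeps the sketch testable: in a dry run it can simply log the name, and the protected list is one way to keep the blast radius small.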

  18. Michelangelo van Dam (@DragonBe)
    Four areas of building resilience
    What are the areas where chaos can disrupt our operations?
    Networks, Infrastructure, Applications, Humans

  19. Michelangelo van Dam (@DragonBe)
    What can go wrong with networks?
    No internet connection: Crucial
    No IP available: Crucial
    Latency and timeouts: Moderate
    Wrong or bad encryption/certificates: Major
    Man-in-the-Middle attacks: Major

  20. Michelangelo van Dam (@DragonBe)
    What can go wrong with infrastructure?
    Hardware failure: Crucial
    Infrastructure down/not responsive: Crucial
    Resource overload: Moderate
    Bad configuration (public vs private): Major
    Active hacking and exploits: Major

  21. Michelangelo van Dam (@DragonBe)
    What can go wrong with applications?
    Tight coupling of services: Crucial
    Bad coding practices: Crucial
    Bugs and vulnerabilities: Moderate
    Unable to disable compute-intensive parts: Major
    Active hacking and exploits: Major

  22. Michelangelo van Dam (@DragonBe)
    What are our human challenges?
    Bus factor: Crucial
    Single Point of Failure / Bottleneck: Crucial
    Diseases and strikes: Major
    Insider threats / (in)voluntary data breach: Major
    Staffing & Contracting: Moderate

  23. Michelangelo van Dam (@DragonBe)
    Our remediation efforts

  24. Michelangelo van Dam (@DragonBe)
    Monitor everything
    If it moves, you track it
    “Knowing is good, but knowing everything is better.”
    Dave Eggers, from his book “The Circle”

  25. Michelangelo van Dam (@DragonBe)
    Monitoring elements
    Disk space, CPU, Memory
    Provision time, Deployment time, Network throughput
    Total logins, Successful logins, Failed logins
    Server requests, Page load, Data load
    Queue size, Cache ratio, DB query times
    Active sessions, Total sessions, Global ping times
    Alerts, HTTP responses, Much more...
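
A few of the elements listed above can be sampled with nothing but the Python standard library. The sketch below is illustrative and is not the monitoring stack shown in the talk: `collect_host_metrics` and `failed_login_ratio` are hypothetical names. It assumes a Unix-like host for the load average (the value is simply omitted where it is unavailable) and leaves out memory, which the standard library does not expose portably.

```python
import os
import shutil
import time
from typing import Dict


def collect_host_metrics(path: str = ".") -> Dict[str, float]:
    """Take one snapshot of basic host metrics for the given filesystem."""
    usage = shutil.disk_usage(path)
    metrics = {
        "timestamp": time.time(),
        "disk_free_ratio": usage.free / usage.total,
    }
    try:
        # 1-minute load average as a rough CPU pressure signal (Unix only).
        metrics["load_1m"] = os.getloadavg()[0]
    except (AttributeError, OSError):
        pass  # not available on this platform
    return metrics


def failed_login_ratio(successful: int, failed: int) -> float:
    """Share of login attempts that failed; 0.0 when there were none."""
    total = successful + failed
    return failed / total if total else 0.0
```

Ratios such as `failed_login_ratio` are often more useful on a dashboard than raw counts, because a sudden jump stands out regardless of overall traffic.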

  26. Michelangelo van Dam (@DragonBe)
    One of our dashboards

  27. Michelangelo van Dam (@DragonBe)
    Principles of Chaos Engineering
    1. Build a Hypothesis around Steady State Behavior
    2. Vary Real-world Events
    3. Run Experiments in Production
    4. Automate Experiments to Run Continuously
    5. Minimize Blast Radius
    Source: https://principlesofchaos.org/
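
The principles above map naturally onto a small experiment harness. The following Python sketch is one possible shape under stated assumptions, not an implementation from the talk or from principlesofchaos.org: `Experiment` and `run_experiment` are hypothetical names. It checks the steady-state hypothesis before doing anything, injects a fault, checks the hypothesis again, and always rolls back to keep the blast radius small.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    name: str
    steady_state: Callable[[], bool]  # hypothesis: True while the system is healthy
    inject: Callable[[], None]        # the real-world event to simulate
    rollback: Callable[[], None]      # undo the injected fault


def run_experiment(exp: Experiment) -> str:
    """Return 'aborted', 'passed' or 'failed' for one experiment run."""
    if not exp.steady_state():
        # The system is already unhealthy: do not add chaos on top.
        return "aborted"
    try:
        exp.inject()
        held = exp.steady_state()
    finally:
        # Roll back even if injection or the check raised an exception.
        exp.rollback()
    return "passed" if held else "failed"
```

Running such experiments continuously (principle 4) then amounts to scheduling `run_experiment` for a list of `Experiment` objects and alerting on every `"failed"` result.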

  28. Michelangelo van Dam (@DragonBe)
    How do we test network failures?
    Becoming evil at disrupting networks 👹
    Unplugging the main power from the ISP connection
    Configuring the route table to use non-routable gateways
    Setting DNS to 127.0.0.1
    SYN-flooding the network
    Creating a bad SSL/TLS certificate
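
Physical tests like the ones above can be complemented in code by injecting latency and timeouts around outgoing calls. The Python sketch below is a hypothetical illustration, not tooling from the talk: `call_with_chaos`, `fetch_with_fallback`, and their parameters are invented for this example. It delays a call, makes it fail with a given probability, and shows the kind of fallback path such a test is meant to exercise.

```python
import random
import time
from typing import Any, Callable, Optional


def call_with_chaos(call: Callable[[], Any], *,
                    failure_rate: float,
                    added_latency: float = 0.0,
                    rng: Optional[random.Random] = None,
                    sleep: Callable[[float], None] = time.sleep) -> Any:
    """Run `call`, but first add latency and possibly raise a timeout."""
    rng = rng or random.Random()
    if added_latency > 0:
        sleep(added_latency)
    if rng.random() < failure_rate:
        raise TimeoutError("injected network timeout")
    return call()


def fetch_with_fallback(call: Callable[[], Any], fallback: Any,
                        **chaos: Any) -> Any:
    """Return the live result, or the fallback if the network misbehaves."""
    try:
        return call_with_chaos(call, **chaos)
    except (TimeoutError, ConnectionError):
        return fallback
```

For example, `fetch_with_fallback(load_prices, cached_prices, failure_rate=0.2)` (with hypothetical `load_prices` and `cached_prices`) would serve cached data for roughly one call in five.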

  29. Michelangelo van Dam (@DragonBe)
    How do we test infrastructure failures?
    Have you turned it off and on again? No, just off 🚦
    Switching off the power to devices
    Changing credentials for services
    Turning off or destroying services
    Removing payment instructions from service providers
    Changing configurations in production to make services public

  30. Michelangelo van Dam (@DragonBe)
    How do we test application failures?
    Giving it all that you got
    Putting applications under constant stress
    Providing wrong values for forms and API calls
    Disabling JavaScript and CSS
    Constantly running automated penetration tests
    Switching configurations (db connection becomes storage connection)
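
"Providing wrong values for forms and API calls" can be automated by replaying a list of hostile inputs against a handler and recording every input that crashes it instead of being rejected cleanly. The Python sketch below is an assumption-laden illustration: `parse_quantity` is a hypothetical form handler accepting quantities from 1 to 100, and `HOSTILE_VALUES`, `ValidationError`, and `find_crashes` are names invented for this example.

```python
from typing import Any, Callable, Iterable, List, Tuple

# A small, hypothetical sample of inputs a form or API should survive.
HOSTILE_VALUES: List[Any] = [
    "", " ", "-1", "0", "101", "1e309", "9" * 400,
    "1; DROP TABLE orders", "<script>alert(1)</script>",
    None, [], {}, 3.5,
]


class ValidationError(ValueError):
    """Raised when input is rejected in a controlled way."""


def parse_quantity(raw: Any) -> int:
    """Hypothetical form handler: accepts whole quantities from 1 to 100."""
    if not isinstance(raw, str):
        raise ValidationError("quantity must be text")
    text = raw.strip()
    if not text.isascii() or not text.isdigit() or len(text) > 3:
        raise ValidationError("quantity must be a short whole number")
    value = int(text)
    if not 1 <= value <= 100:
        raise ValidationError("quantity out of range")
    return value


def find_crashes(handler: Callable[[Any], Any],
                 values: Iterable[Any]) -> List[Tuple[Any, Exception]]:
    """Return every input that made the handler fail in an uncontrolled way."""
    crashes = []
    for value in values:
        try:
            handler(value)
        except ValidationError:
            continue  # a clean rejection is the desired behaviour
        except Exception as exc:
            crashes.append((value, exc))
    return crashes
```

A handler that validates nothing, such as the built-in `int`, produces several entries in `find_crashes(int, HOSTILE_VALUES)`, while the guarded `parse_quantity` produces none.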

  31. Michelangelo van Dam (@DragonBe)
    How do we challenge the human aspect?
    You can only be diverse and welcoming if you have tested it
    Social Engineering Tests
    Table-top Exercises / Business DnD Game nights
    Extract data from systems & score their sensitivity
    Internal Workshop & Certification Programs
    Switching roles

  32. Michelangelo van Dam (@DragonBe)
    What about data?

  33. Michelangelo van Dam (@DragonBe)
    Assume breach, always!
    Adopt a zero-trust data policy
    Zombie Attack, courtesy of Perth Zombie Apocalypse Simulation

  34. Michelangelo van Dam (@DragonBe)
    Less is more
    The less data you have, the better you are protected against loss or corruption of that data
    Empty vault, courtesy of Hang The Bankers

  35. Michelangelo van Dam (@DragonBe)
    Tools and references

  36. Michelangelo van Dam (@DragonBe)
    A look into our chaos kitchen
    Gauntlt: Be mean to your code and like it
    Put your app under constant stress
    PHPever
    PHP Mutation Testing Framework

  37. Michelangelo van Dam (@DragonBe)
    Open source tools we use
    Phabricator
    OWASP Zed Attack Proxy

  38. Michelangelo van Dam (@DragonBe)
    Resources for more information

  39. Michelangelo van Dam (@DragonBe)
    Questions?