[Network Automation Meetup] Network Fire Drills with Chaos Engineering

Have you had to wake up to tackle late-night on-call incidents? Not again. Network partitions. Latency spikes. Application systems are so fragile. What can we do about it? What if we could know how systems behave under turbulent network conditions? Yes, we can. Ho Ming Li (HML) will introduce the practice of Chaos Engineering to proactively get ahead of network issues in your environment, so you can sleep better at night.

Ho Ming Li

July 16, 2019

Transcript

  1. 1.

    Network Fire Drills with Chaos Engineering. Ho Ming Li, Principal Solutions Architect, Gremlin. hml@gremlin.com, @horeal. July 2019
  2. 3.

    @horeal @gremlininc Pretty boring.
    - AWS/VPC/Subnets
    - Bastion Host
    - Split Staging and Production
    - Two regions (today, more in the future)
    - CloudFormation/Terraform (Infrastructure as Code)
    - Auditing and Privilege Escalation
  3. 5.
  4. 6.
  5. 7.

    I’m not Amazon. I’m not Netflix. I’m not _________. I don’t have their problems. … okay.
  6. 8.

    But… does it mean your systems never fail? You still need resilience in your systems. Availability? Data Integrity? Deliverability?
  7. 9.

    Think about Change Management. Think also about Operational Excellence. When your service is not resilient, you cannot deliver value to customers.
  8. 10.

    Why is it so hard? Existing methods take a reactive approach. Chaos Engineering instead takes a proactive approach to validate the knowns and surface the unknowns.
  9. 12.
  10. 15.

    Common Failure Scenarios
    1. Slowness from Database
    2. Network Partitions
    3. Managed Service Failure
    4. External Dependency Failure
  11. 18.

    External Dependencies
    - S3? Dyn? Cloudflare?
    - What if THEY fail?
    - Raise the bar.
    - Own the user experience.
  12. 19.

    Network
    - Traffic Control (TC): $ tc qdisc add dev eth0 root netem delay 1000ms 500ms
    - iptables: $ iptables -A OUTPUT -p tcp -d 157.240.0.0/16 -j DROP
    - PF (Mac): block quick from any to 157.240.0.0/16
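
The commands on the last slide inject a fault but do not show how to remove it. A fire drill is usually time-boxed and rolled back afterwards, so the following is a minimal sketch of that pattern rather than anything taken from the talk. The function names (`run`, `inject`, `rollback`, `drill`), the `DRY_RUN` switch, and the 60-second default are assumptions made for illustration; the interface `eth0` and the CIDR `157.240.0.0/16` are copied from the slide. Applying the commands for real needs root on a Linux host.

```shell
#!/usr/bin/env bash
# Sketch of a time-boxed network fire drill: inject a fault, wait, roll back.
# Everything except the tc/iptables "add" commands is an illustrative assumption.

IFACE="${IFACE:-eth0}"                        # interface from the slide
TARGET_CIDR="${TARGET_CIDR:-157.240.0.0/16}"  # destination range from the slide
DURATION="${DURATION:-60}"                    # assumed drill length in seconds
DRY_RUN="${DRY_RUN:-1}"                       # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Add 1000ms delay with 500ms jitter, and drop outbound TCP to the target range.
inject() {
  run tc qdisc add dev "$IFACE" root netem delay 1000ms 500ms
  run iptables -A OUTPUT -p tcp -d "$TARGET_CIDR" -j DROP
}

# Undo both changes: delete the netem qdisc and the matching iptables rule.
rollback() {
  run tc qdisc del dev "$IFACE" root netem
  run iptables -D OUTPUT -p tcp -d "$TARGET_CIDR" -j DROP
}

drill() {
  trap rollback EXIT   # safety net if the shell exits before the explicit rollback
  inject
  run sleep "$DURATION"
  rollback
  trap - EXIT          # rollback already ran; clear the handler
}
```

With the default `DRY_RUN=1`, calling `drill` only prints the five commands it would run, which is a cheap way to review a drill before anyone executes it with `DRY_RUN=0` as root.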