Chaos Engineering - DevOps @ Nike Day

A talk for the DevOps @ Nike Day.

Learn more about Chaos Engineering: gremlin.com/community
Join the Chaos Engineering Slack: gremlin.com/slack

Tammy Bütow

June 18, 2018

Transcript

  1. CHAOS ENGINEERING @tammybütow, Gremlin

  2. “Build, test, revise, build. That was our motto.” @tammybütow DEVOPS @ NIKE DAY
  3. “People constantly updated testing documents, and every day we made changes in the moment”
  4. NIKE HYPERADAPT 1.0

  5. @tammybütow DEVOPS @ NIKE DAY

  6. People Put Nike Hyperadapt 1.0 To The Test: • On The Court - 20 Basketball Research Athletes • Pounding Pavement - running at Nike campus • In The Gym - Nike Employee Training Classes • Walking And Working - 150 Employees
  7. How does this relate to Chaos Engineering?
  8. You can use Chaos Engineering to ensure your systems are as resilient as your sneakers.
  9. @tammybütow, Gremlin @tammybütow DEVOPS @ NIKE DAY

  10. TAMMY BÜTOW Principal SRE, Gremlin. Causing chaos in prod since 2009!
  11. GREMLIN • We are practitioners of Chaos Engineering • We build software that helps engineers build resilient systems • We offer 11 ways to inject chaos for your Chaos Engineering experiments
  12. PART 1: LAYING THE FOUNDATION
  13. It would be silly to give an Olympic pole-vaulter a broom and ban them from practicing!
  14. “Thoughtful planned experiments designed to reveal the weaknesses in our systems” - Kolton Andrus, Gremlin CEO
  15. Eventually systems will break in many undesired ways. Break them first on purpose with controlled chaos!
  16. DOGFOODING • Using your own product. • For us that means using Gremlin for our Chaos Engineering experiments. • Failure Fridays
  17. Failure Fridays are dedicated time for teams to collaboratively focus on using Chaos Engineering practices to reveal weaknesses in your services.
  18. WHY DO DISTRIBUTED SYSTEMS NEED CHAOS? • Unusual, hard-to-debug failures are common • Systems & companies scale rapidly and Chaos Engineering helps you learn along the way
  19. FULL-STACK CHAOS ENGINEERING • You can inject chaos at any layer. • API, App, Cache, Database, OS, Host, Network, Power & more.
  20. WHY RUN CHAOS ENGINEERING EXPERIMENTS?
  21. Are you confident that your metrics and alerting are as good as they should be? #pagerpain
  22. Are you confident your customers are getting as good an experience as they should be? #customerpain
  23. Are you losing money due to downtime and broken features? #businesspain
  24. HOW DO YOU RUN CHAOS ENGINEERING EXPERIMENTS?
  25. HOW TO RUN A CHAOS ENGINEERING EXPERIMENT • Form a hypothesis • Consider blast radius • Run experiment • Measure results • Find & fix issues or scale ⚡
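The five steps on slide 25 can be captured as a small record that a team fills in before and after each run. This is a minimal sketch in Python; the class name, field names, and the example metric `minutes_to_alert` are all hypothetical and are not part of Gremlin or any tool named in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class ChaosExperiment:
    """Hypothetical record of one experiment; every field name is illustrative."""
    hypothesis: str                  # step 1: form a hypothesis
    blast_radius: list[str]          # step 2: hosts or services in scope
    attack: str                      # step 3: what is run, e.g. "cpu"
    observations: dict[str, float] = field(default_factory=dict)  # step 4
    follow_ups: list[str] = field(default_factory=list)           # step 5

    def record(self, metric: str, value: float) -> None:
        """Store one measured result."""
        self.observations[metric] = value

    def hypothesis_held(self, metric: str, limit: float) -> bool:
        """True when the measured metric stayed at or below the agreed limit."""
        return self.observations[metric] <= limit


exp = ChaosExperiment(
    hypothesis="CPU alert fires within 5 minutes of sustained load",
    blast_radius=["kube-02"],  # start with a single node
    attack="cpu",
)
exp.record("minutes_to_alert", 3.0)  # made-up measurement for illustration
if not exp.hypothesis_held("minutes_to_alert", 5.0):
    exp.follow_ups.append("tune CPU alert threshold")
```

Keeping the blast radius as an explicit field makes it harder to start an experiment without first deciding how much of the system it may touch.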
  26. Don’t run before you can walk
  27. HOW TO CHOOSE A CHAOS EXPERIMENT • Identify top 5 critical systems • Choose 1 system • Whiteboard the system • Select attack: resource/state/network • Determine scope ⚡
  28. WHAT SHOULD WE MEASURE? • Availability: 500s • Service-specific KPIs • System metrics: CPU, IO, Disk • Customer complaints
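One way to turn "Availability: 500s" into a number is to count server-error responses against all requests seen during the experiment. The sketch below assumes HTTP status codes have already been collected into a list; the function name and the sample data are illustrative only.

```python
def availability(status_codes):
    """Fraction of requests that did not return a 5xx status."""
    if not status_codes:
        raise ValueError("no requests observed")
    server_errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return 1 - server_errors / len(status_codes)


# Made-up sample: 10 requests, two of them server errors (503 and 500).
sample = [200, 200, 503, 200, 500, 404, 200, 200, 200, 200]
ratio = availability(sample)  # 0.8, because 2 of 10 requests were 5xx
```

The 404 is counted as available here because it is a client error; whether that is the right choice depends on the service being measured.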
  29. EXAMPLE SYSTEM: KUBERNETES RETAIL STORE • User • Primary: kube-01 • Node: kube-02 • Node: kube-03 • Node: kube-04
  30. PART 2: RESOURCE CHAOS ENGINEERING
  31. RESOURCE CHAOS We can increase CPU, Disk, IO & Memory consumption to ensure monitoring is set up to catch problems. It is important to catch issues before they turn into high-severity incidents (unable to purchase new product!) and downtime for customers.
  32. CPU CHAOS
  33. LET’S CREATE A “KNOWN-KNOWN” EXPERIMENT https://github.com/tammybutow/chaosengineeringbootcamp
  34. CHAOS IN TOP
  35. LET’S KILL THE CHAOS NOW
  36. NO MORE CHAOS IN TOP
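The transcript does not include the script used in slides 33 to 36, so the following is only a stand-in: a minimal Python sketch that keeps one core busy for a bounded time and then stops by itself. The function name `burn_cpu` is hypothetical, and this is not the code from the linked repository.

```python
import time


def burn_cpu(seconds):
    """Spin in a tight loop for roughly `seconds`, then return on its own."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1  # busy work that keeps one core occupied
    return iterations


if __name__ == "__main__":
    # Kept very short here; lengthen it deliberately for a real experiment.
    burn_cpu(0.2)
```

While the loop runs, the Python process should appear near the top of `top`; once the deadline passes it disappears again, which mirrors the "chaos in top" and "no more chaos in top" slides. The built-in deadline acts as a safety stop in case nobody kills the process by hand.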

  37. DISK CHAOS
  38. DISK CHAOS
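The disk chaos slide carries no command in the transcript. As an illustration only, this Python sketch fills a fixed amount of space inside a throwaway temporary directory and always cleans up afterwards; the function name, file name, and sizes are assumptions chosen for the example.

```python
import os
import shutil
import tempfile


def fill_disk(directory, megabytes):
    """Write `megabytes` MiB of zeros into a new file and return its path."""
    path = os.path.join(directory, "chaos.fill")
    chunk = b"\0" * (1024 * 1024)
    with open(path, "wb") as handle:
        for _ in range(megabytes):
            handle.write(chunk)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the bytes actually reach the disk
    return path


workdir = tempfile.mkdtemp(prefix="disk-chaos-")
try:
    path = fill_disk(workdir, 5)  # small on purpose: 5 MiB
    size_mib = os.path.getsize(path) // (1024 * 1024)
finally:
    shutil.rmtree(workdir)  # always roll the experiment back
```

Capping the size and writing into a dedicated directory limits the blast radius. For a real experiment, point `directory` at the volume whose disk alerts you want to verify.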

  39. MEMORY CHAOS
  40. MEMORY CHAOS free -m
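To have something for `free -m` to show, a process needs to hold memory for a while. The sketch below is one assumed way to do that in Python; the function name is hypothetical and the 4096-byte page size is an assumption that holds on many Linux systems but not all.

```python
def hold_memory(megabytes):
    """Allocate `megabytes` MiB and touch each page so the memory is really used."""
    block = bytearray(megabytes * 1024 * 1024)
    page = 4096  # assumed page size
    for offset in range(0, len(block), page):
        block[offset] = 1
    return block


block = hold_memory(16)  # small on purpose: 16 MiB
held_mib = len(block) // (1024 * 1024)
del block  # dropping the reference ends the experiment
```

In a real run you would keep the block alive while watching `free -m` in another terminal, and only then release it.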

  41. PART 3: STATE CHAOS ENGINEERING
  42. PROCESS CHAOS
  43. PROCESS CHAOS Ways to create process chaos on purpose: • Kill one process • Loop kill a process • Spawn new processes • Fork bomb
  44. PROCESS CHAOS pkill -u chaos
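The simplest item in the list above, "kill one process", can be rehearsed safely against a child process you start yourself. This is a sketch assuming a POSIX system, where a process ended by a signal reports the negated signal number as its return code; the helper name is hypothetical.

```python
import signal
import subprocess
import sys


def kill_one_process(proc):
    """Send SIGTERM to a child process and return its exit status."""
    proc.send_signal(signal.SIGTERM)
    return proc.wait(timeout=5)


# Stand-in victim: a Python child that would otherwise sleep for a minute.
victim = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
status = kill_one_process(victim)
# On POSIX, status equals -signal.SIGTERM because the child died from the signal.
```

`pkill -u chaos` from slide 44 is broader: it signals every process owned by the user `chaos`, so it is worth rehearsing on a single known process first.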

  45. SHUTDOWN CHAOS
  46. SHUTDOWN CHAOS shutdown -h
  47. WHAT ARE OTHER WAYS YOU CAN TURN OFF A SERVER? WHAT IF YOU WANT TO TURN OFF EVERY SERVER WHEN IT’S ONE WEEK OLD?
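The "one week old" question comes down to comparing each server's launch time with the current time. The sketch below shows only that selection step, with made-up launch times for the node names from slide 29; how the shutdown is then triggered is left open, and all names here are illustrative.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=7)


def servers_due_for_shutdown(launch_times, now):
    """Return names of servers whose age is at least MAX_AGE, sorted by name."""
    return sorted(
        name for name, launched in launch_times.items()
        if now - launched >= MAX_AGE
    )


now = datetime(2018, 6, 18, tzinfo=timezone.utc)
fleet = {  # hypothetical launch times
    "kube-02": now - timedelta(days=8),
    "kube-03": now - timedelta(days=2),
    "kube-04": now - timedelta(days=7),
}
due = servers_due_for_shutdown(fleet, now)  # ["kube-02", "kube-04"]
```

Passing `now` in as an argument, rather than reading the clock inside the function, keeps the rule easy to test.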
  48. HALT, REBOOT & POWEROFF CHAOS halt
  49. WHAT ABOUT SHUTTING DOWN CONTAINERS AND K8S PODS?
  50. THE MANY WAYS TO KILL CONTAINERS • Kill self • Kill a container from the host • Use one container to kill another • Use one container to kill several containers • Use several containers to kill several
  51. The average lifespan of a container is 2.5 days, and they fail in many unexpected ways.
  52. TIME TRAVEL CHAOS
  53. TIME TRAVEL CHAOS AKA CLOCK SKEW ntpq
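Clock skew matters because much application logic compares timestamps. As a hedged illustration that is not taken from the talk, the sketch below checks a time-limited token against a clock that is correct, 30 minutes fast, or 5 minutes slow. The token rule and all names are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def token_is_valid(issued_at, ttl, now):
    """A token is valid from its issue time until issue time plus ttl."""
    return issued_at <= now < issued_at + ttl


true_now = datetime(2018, 6, 18, 12, 0, tzinfo=timezone.utc)
issued_at = true_now - timedelta(minutes=1)
ttl = timedelta(minutes=15)

results = {}
for label, skew in [
    ("in sync", timedelta(0)),
    ("30 min fast", timedelta(minutes=30)),
    ("5 min slow", timedelta(minutes=-5)),
]:
    skewed_now = true_now + skew  # what a host with a drifting clock believes
    results[label] = token_is_valid(issued_at, ttl, skewed_now)
```

Only the host that is in sync accepts the token. The fast host sees it as expired and the slow host sees it as not yet issued, which is the kind of failure a clock-skew experiment is meant to surface.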
  54. PART 4: NETWORK CHAOS ENGINEERING
  55. BLACKHOLE CHAOS
  56. BLACKHOLE CHAOS ip route show
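Slide 56 shows only `ip route show`, which lists the routing table. One common way to create a blackhole on Linux is an iproute2 blackhole route, and the sketch below builds the inject and rollback commands as argument lists without running them. Whether the talk used this mechanism is not stated, and the function name and the example network `192.0.2.0/24` (a documentation range) are assumptions.

```python
import ipaddress


def blackhole_commands(cidr):
    """Return (inject, rollback) argv lists for an iproute2 blackhole route."""
    network = ipaddress.ip_network(cidr)  # raises ValueError on bad input
    target = str(network)
    inject = ["ip", "route", "add", "blackhole", target]
    rollback = ["ip", "route", "del", "blackhole", target]
    return inject, rollback


inject, rollback = blackhole_commands("192.0.2.0/24")
print(" ".join(inject))    # command to review before running as root
print(" ".join(rollback))  # keep the rollback next to the attack
```

Generating both commands together means the rollback is available before the experiment starts. After injecting, the new route should appear in the output of `ip route show`.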

  57. DNS CHAOS
  58. DNS CHAOS
  59. DNS CHAOS
  60. LATENCY CHAOS
  61. LATENCY CHAOS mtr google.com
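`mtr` on slide 61 observes latency; it does not inject it. On Linux, one widely used way to add delay is the `netem` queueing discipline of `tc`. The sketch below only builds the commands as argument lists and does not execute them; the interface name `eth0`, the 100 ms value, and the function name are assumptions, and the talk does not say which mechanism was used.

```python
def latency_commands(interface, delay_ms):
    """Return (inject, rollback) argv lists that add a fixed netem delay."""
    if delay_ms <= 0:
        raise ValueError("delay_ms must be positive")
    inject = ["tc", "qdisc", "add", "dev", interface,
              "root", "netem", "delay", f"{delay_ms}ms"]
    rollback = ["tc", "qdisc", "del", "dev", interface, "root", "netem"]
    return inject, rollback


inject, rollback = latency_commands("eth0", 100)
print(" ".join(inject))    # review, then run as root on the chosen host
print(" ".join(rollback))  # removes the delay again
```

Running `mtr` against a destination before, during, and after the delay gives a simple way to confirm that the injected latency is visible.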

  62. PACKET LOSS CHAOS
  63. PACKET LOSS CHAOS
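The packet loss slides carry no commands in the transcript. To reason about what loss does to a client that retries, a small in-process simulation can help before touching a real network. The sketch below is purely illustrative: the loss model (each attempt dropped independently), the function name, and the numbers are all assumptions.

```python
import random


def send_with_retries(loss_rate, attempts, rng):
    """Return the attempt number that got through, or None if all were dropped."""
    for attempt in range(1, attempts + 1):
        if rng.random() >= loss_rate:  # this attempt was not dropped
            return attempt
    return None


rng = random.Random(42)  # seeded so repeated runs give the same result
results = [send_with_retries(0.3, 3, rng) for _ in range(1000)]
delivered = sum(1 for r in results if r is not None)
needed_retry = sum(1 for r in results if r is not None and r > 1)
print(f"delivered {delivered}/1000, {needed_retry} needed at least one retry")
```

Comparing `delivered` with `needed_retry` shows how retries can hide loss from availability numbers while still adding latency, which is a reason to measure both during a real packet loss experiment.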

  64. PART 5: COMPLEX OUTAGES
  65. We can combine different types of chaos engineering experiments to reproduce complicated outages. Reproducing an outage gives you confidence that you can handle it if or when it happens again.
  66. Let’s go back in time to look at some of the worst outage stories that kicked off the introduction of chaos engineering.
  67. DROPBOX’S WORST OUTAGE EVER Some master-replica pairs were impacted, which resulted in the site going down. https://blogs.dropbox.com/tech/2014/01/outage-post-mortem/
  68. UBER’S DATABASE OUTAGE 1. Master log replication to S3 failed 2. Logs backed up on the primary 3. Alerts fire to an engineer but are ignored 4. Disk fills up on database primary 5. Engineer deletes unarchived WAL files 6. Error in config prevents promotion (Matt Ranney, Uber, 2015)
  69. OUTAGES HAPPEN.
  70. THERE ARE MANY MORE OUTAGES YOU CAN READ ABOUT HERE: https://github.com/danluu/post-mortems
  71. HOW CAN YOU CONTINUE YOUR CHAOS ENGINEERING JOURNEY?
  72. JOIN THE CHAOS SLACK GREMLIN.COM/CHAOS
  73. LEARN WITH THE GREMLIN COMMUNITY GREMLIN.COM/COMMUNITY
  74. THANK YOU DEVOPS @ NIKE DAY @tammybütow #CHAOSENGINEERING