Failing Well

It's a fact of life: software breaks.

But all is not doom and gloom. How we detect and handle errors drastically impacts the quality of both our systems and our lives. Knowing what to track, when to page, and how to find system weaknesses is critical.

You’ll leave this talk with tactics for coping with failures on multiple levels. We'll see how error handling and alerting ground a reliable system. Then we'll automate testing and finally induce problems in live, running code to see where our expectations and reality diverge.

Failure is inevitable, but that doesn't mean you can't fail well!

Jason R Clark

June 23, 2017

Transcript

  1. @jasonrclark Failing Well http://bit.ly/failing-well

  2. @jasonrclark Exceptions

  3. ⚠ Ruby Ahead ⚠

  4. begin
       raise StandardError.new("Oops")
     rescue
       # DO SOMETHING
     ensure
       # Cleaning up
     end

  5. begin
       raise StandardError.new("Oops")
     rescue
       # DO SOMETHING
     ensure
       # Cleaning up
     end

  6. def methods_are_awesome!(really = true)
       raise StandardError.new("Oops")
     rescue
       # DO SOMETHING
     ensure
       # Cleaning up
     end

  7. begin
       raise StandardError.new("Oops")
     rescue
       # DO SOMETHING
     ensure
       # Cleaning up
     end

  8. begin
       raise "Heck" # == RuntimeError.new("Heck")
     rescue
       # DO SOMETHING
     ensure
       # Cleaning up
     end

  9. begin
       raise StandardError.new("Oops")
     rescue
       # DO SOMETHING
     ensure
       # Cleaning up
     end

  10. begin
        raise StandardError.new("Oops")
      rescue
        # DO SOMETHING
      ensure
        # Cleaning up
      end

  11. Let's Get More Specific

  12. begin
        raise StandardError.new("Oops")
      rescue StandardError => e
        puts e.message
      ensure
        # Cleaning up
      end

  13. begin
        raise StandardError.new("Oops")
      rescue Exception => e
        puts e.message
      ensure
        # Cleaning up
      end

  14. Ctrl+C => raise Interrupt.new

  15. Interrupt > SignalException > Exception

  16. Careful What You Catch!
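
The warning on slides 13 to 16 can be checked with a short sketch. It is not from the talk, and `handled_by` is a hypothetical helper introduced only for illustration. It shows that a bare `rescue` (equivalent to `rescue StandardError`) lets an `Interrupt` pass through, while `rescue Exception` swallows it.

```ruby
# Hypothetical helper (not from the talk): reports which rescue clause,
# if any, ended up handling the error raised by the block.
def handled_by
  yield
  :nothing_raised
rescue StandardError
  # A bare `rescue` behaves exactly like this clause.
  :standard_error_rescue
rescue Exception
  # Also catches Interrupt, SystemExit, NoMemoryError, and friends.
  :exception_rescue
end

# The ancestry shown on slide 15.
puts Interrupt.ancestors.take(3).inspect

puts handled_by { raise "Heck" }      # RuntimeError is a StandardError
puts handled_by { raise Interrupt }   # what Ctrl+C raises in the main thread
```

If the second clause were removed, the `Interrupt` would propagate to the caller, which is usually what you want for Ctrl+C.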

  17. begin
        raise StandardError.new("Oops")
      rescue
        # DO SOMETHING
      ensure
        # Every line will
        # Absolutely, positively
        # Get executed!
      end

  18. Nope!

  19. begin
        raise StandardError.new("Oops")
      rescue
        # DO SOMETHING
      ensure
        # Every line will
        # Absolutely, positively
        # Get executed!
      end

  20. Thread#raise

  21. Timeout#timeout

  22. Rack::Timeout

  23. Know What's Assured!
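
The "Nope!" on slide 18 can be reproduced with the standard library alone. The following is a minimal sketch, not code from the talk: the step names and durations are made up for illustration. `Timeout.timeout` delivers its exception asynchronously to the running thread, so if the deadline expires while the thread is inside an `ensure` block, the rest of that block is skipped.

```ruby
require "timeout"

steps = []

begin
  Timeout.timeout(0.1) do
    begin
      steps << :work
    ensure
      steps << :cleanup_started
      sleep 1                      # slow cleanup; the timeout fires during this call
      steps << :cleanup_finished   # skipped when the timeout interrupts the ensure
    end
  end
rescue Timeout::Error
  steps << :timed_out
end

puts steps.inspect
```

The same hazard applies to anything built on `Thread#raise`, which is why cleanup code that must run to completion should be kept short or protected from asynchronous interruption.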

  24. begin
        raise StandardError.new("Oops")
      rescue
        # DO SOMETHING
      ensure
        $! # => Exception in flight
      end

  25. begin
        raise StandardError.new("Oops")
      rescue
        # DO SOMETHING <= What goes here?
      ensure
        # Cleaning up
      end

  26. Record It!
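
One possible answer to "What goes here?" is sketched below. It is an assumption about what recording could look like, not the speaker's implementation: `record_failures` is a hypothetical wrapper, and the log goes to an in-memory `StringIO` only so the example is self-contained. The `rescue` clause records the error and re-raises it; the `ensure` clause consults `$!`, which is non-nil only while an exception is still propagating.

```ruby
require "logger"
require "stringio"

# Hypothetical wrapper (not from the talk): records the failure, then lets
# the caller decide what to do by re-raising.
def record_failures(logger)
  yield
rescue StandardError => e
  logger.error("#{e.class}: #{e.message}")
  logger.error(e.backtrace.first(5).join("\n")) if e.backtrace
  raise
ensure
  # $! is the exception in flight, or nil when the block finished cleanly.
  logger.warn("leaving with #{$!.class} in flight") if $!
end

out = StringIO.new
logger = Logger.new(out)

begin
  record_failures(logger) { raise StandardError.new("Oops") }
rescue StandardError
  # A real caller would retry, degrade, or give up here.
end

puts out.string
```

In a real application the logger call would typically be joined by a report to whatever error tracker the team already uses.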

  27. @jasonrclark Alerting

  28. What To Alert On?

  29. Errors, Availability, Latency

  30. (no text)

  31. Elasticsearch Cluster

  32. Elasticsearch Cluster O_o

  33. Elasticsearch Cluster O_o

  34. Avoid Duplication

  35. "Oh, that just happens. Ignore it."

  36. Alert Fatigue

  37. 1. Problems you can do something about

  38. ALERT!

  39. 2. Upstream problems you can't impact

  40. WARN

  41. 3. Other odd stuff?

  42. TRACK

  43. TRUST!

  44. LEARN!
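
The three-way split on slides 37 to 42 can be written down as a small routing rule. This is a sketch under assumptions: the `Event` struct, its fields, and the event names are hypothetical, chosen only to mirror the slide wording (problems you can act on, upstream problems you cannot impact, everything else).

```ruby
# Hypothetical event shape; the fields mirror the questions on the slides.
Event = Struct.new(:name, :actionable, :upstream, keyword_init: true)

# Returns :alert, :warn, or :track following the order on the slides.
def route(event)
  if event.actionable
    :alert   # 1. Problems you can do something about: page someone
  elsif event.upstream
    :warn    # 2. Upstream problems you can't impact: notify, don't page
  else
    :track   # 3. Other odd stuff: record it and review later
  end
end

events = [
  Event.new(name: "search cluster unreachable", actionable: true,  upstream: false),
  Event.new(name: "third-party API degraded",   actionable: false, upstream: true),
  Event.new(name: "unexpected nil in payload",  actionable: false, upstream: false),
]

events.each { |event| puts "#{route(event).to_s.upcase}: #{event.name}" }
```

Keeping the paging path limited to the first branch is one way to fight the alert fatigue mentioned on slide 36.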

  45. @jasonrclark Gamedays

  46. Practice Failing

  47. ... In Production

  48. 1. Identify Your Resources

  49. Application instances, Other services, Datastores, Caches, Files, CDN

  50. 2. What Can Go Wrong?

  51. Missing, Slow, Errors, Corrupt Data

  52. Risk Matrix
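
The slides do not show how the risk matrix is built, so the following is only one plausible reading: cross every resource from slide 49 with every failure mode from slide 51, then rank the scenarios. The likelihood and impact scores are made up for illustration and would come from your own team's judgment.

```ruby
RESOURCES = ["Application instances", "Other services", "Datastores",
             "Caches", "Files", "CDN"].freeze
FAILURE_MODES = ["Missing", "Slow", "Errors", "Corrupt Data"].freeze

# Made-up scores (1 = low, 3 = high), purely illustrative.
SCORES = {
  ["Datastores", "Slow"]         => { likelihood: 3, impact: 2 },
  ["Caches", "Missing"]          => { likelihood: 2, impact: 1 },
  ["Datastores", "Corrupt Data"] => { likelihood: 1, impact: 3 },
}.freeze
DEFAULT_SCORE = { likelihood: 1, impact: 1 }.freeze

# One row per (resource, failure mode) pair.
matrix = RESOURCES.product(FAILURE_MODES).map do |resource, mode|
  score = SCORES.fetch([resource, mode], DEFAULT_SCORE)
  { resource: resource, mode: mode, risk: score[:likelihood] * score[:impact] }
end

# The highest-risk scenarios are the first candidates for a gameday.
top = matrix.max_by(3) { |row| row[:risk] }
top.each do |row|
  puts format("%-12s %-13s risk=%d", row[:resource], row[:mode], row[:risk])
end
```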

  53. Running the Gameday

  54. Reality's best

  55. Generate load

  56. Break stuff!

  57. kill <pid>
      docker stop <id>

  58. kill -STOP <pid>

  59. kill -CONT <pid>
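
The signals on slides 57 to 59 can also be sent from a Ruby script, which helps when a gameday step needs to be repeatable. This is a sketch assuming a Unix-like host: the `sleep` child process stands in for a real dependency, and the pause length is arbitrary.

```ruby
# Stand-in for a dependency; a real gameday would target an actual service pid.
pid = spawn("sleep", "30")

Process.kill("STOP", pid)   # like `kill -STOP <pid>`: the process hangs but stays alive
sleep 0.2                   # window in which callers would see a hung dependency
Process.kill("CONT", pid)   # like `kill -CONT <pid>`: the process resumes

Process.kill("TERM", pid)   # like `kill <pid>`: the process goes away entirely
_, status = Process.wait2(pid)
puts "terminated by signal #{status.termsig}"
```

A paused process is often a more interesting test than a dead one, because connections stay open and callers only fail once their own timeouts fire.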

  60. iptables or tc

  61. https://github.com/tylertreat/Comcast

  62. Test Recovery Too!

  63. @jasonrclark Automating Failure

  64. (no text)

  65. Toxiproxy http://toxiproxy.io/

  66. Your App Valuable Resource

  67. Your App Valuable Resource Toxiproxy

  68. Your App Valuable Resource HTTP commands! Toxiproxy

  69. class MyApplication
        # ...
        config.elastic_url = ENV["ES_URL"] || "http://localhost:9200"
      end

  70. Toxiproxy.populate([
        { name: "elastic_search", listen: "127.0.0.1:22220", upstream: "127.0.0.1:9200" }
      ])

  71. Toxiproxy.populate([
        { name: "elastic_search", listen: "127.0.0.1:22220", upstream: "127.0.0.1:9200" }
      ])

  72. Toxiproxy.populate([
        { name: "elastic_search", listen: "127.0.0.1:22220", upstream: "127.0.0.1:9200" }
      ])

  73. Toxiproxy.populate([
        { name: "elastic_search", listen: "127.0.0.1:22220", upstream: "127.0.0.1:9200" }
      ])

  74. Rails.configuration.elastic_url = "http://127.0.0.1:22220"

      # Boot toxiproxy if it isn't running already
      mac = RbConfig::CONFIG["host_os"] =~ /darwin/
      bin = mac ? "bin/toxiproxy-darwin-amd64" : "bin/toxiproxy-linux-amd64"
      pid = spawn(bin)
      Process.detach(pid)

  75. Rails.configuration.elastic_url = "http://127.0.0.1:22220"

      # Boot toxiproxy if it isn't running already
      mac = RbConfig::CONFIG["host_os"] =~ /darwin/
      bin = mac ? "bin/toxiproxy-darwin-amd64" : "bin/toxiproxy-linux-amd64"
      pid = spawn(bin)
      Process.detach(pid)

  76. it "returns a clean error when down" do
        Toxiproxy[:elastic_search].down do
          post action, args
          assert_response 500
        end
      end

  77. it "returns a clean error on timeout" do
        Toxiproxy[:elastic_search].downstream(:latency, latency: 900).apply do
          post action, args
          assert_response 500
        end
      end
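
The tests on slides 76 and 77 expect a clean 500 when the proxied resource is down or slow, which implies the application translates connection failures and timeouts into a controlled response. The slides do not show that code, so this is a hypothetical sketch: `search_safely` and `SearchResult` are invented names, and the block stands in for the real Elasticsearch call.

```ruby
require "net/http"

# Hypothetical result type standing in for a controller response.
SearchResult = Struct.new(:status, :body)

# Runs the given search call and turns "resource down" or "resource slow"
# into a clean 500 instead of an unhandled exception.
def search_safely
  SearchResult.new(200, yield)
rescue Errno::ECONNREFUSED, Timeout::Error => e
  # Net::OpenTimeout and Net::ReadTimeout are subclasses of Timeout::Error.
  SearchResult.new(500, "search unavailable (#{e.class})")
end

down = search_safely { raise Errno::ECONNREFUSED }   # what a downed proxy looks like
slow = search_safely { raise Net::ReadTimeout }      # what added latency can look like
fine = search_safely { "3 hits" }

[down, slow, fine].each { |result| puts "#{result.status} #{result.body}" }
```

Only the failure classes named in the `rescue` are translated; anything else still propagates, so unexpected bugs are not hidden behind a generic 500.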
  78. @jasonrclark Exceptions Alerting Gamedays Automating Failure ???