Failing Well

It's a fact of life: software breaks.

But all is not doom and gloom. How we detect and handle errors drastically impacts the quality of both our systems and our lives. Knowing what to track, when to page, and how to find system weaknesses is critical.

You’ll leave this talk with tactics for coping with failures on multiple levels. We'll see how error handling and alerting ground a reliable system. Then we'll automate testing and finally induce problems in live, running code to see where our expectations and reality diverge.

Failure is inevitable, but that doesn't mean you can't fail well!

Jason R Clark

June 23, 2017

Transcript

  1. @jasonrclark
    Failing Well
    http://bit.ly/failing-well

  2. Exceptions

  3. ⚠ Ruby Ahead ⚠

  4.
    begin
      raise StandardError.new("Oops")
    rescue
      # DO SOMETHING
    ensure
      # Cleaning up
    end

  6.
    def methods_are_awesome!(really = true)
      raise StandardError.new("Oops")
    rescue
      # DO SOMETHING
    ensure
      # Cleaning up
    end

  8.
    begin
      raise "Heck" # == RuntimeError.new("Heck")
    rescue
      # DO SOMETHING
    ensure
      # Cleaning up
    end

  11. Let's Get More Specific

  12.
    begin
      raise StandardError.new("Oops")
    rescue StandardError => e
      puts e.message
    ensure
      # Cleaning up
    end

  13.
    begin
      raise StandardError.new("Oops")
    rescue Exception => e
      puts e.message
    ensure
      # Cleaning up
    end

  14. Ctrl+C => raise Interrupt.new

  15. Interrupt > SignalException > Exception

  16. Careful What You Catch!
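
The point of slides 12 through 16 can be sketched in a few lines, assuming nothing beyond core Ruby. The helper names `swallow_everything` and `rescue_standard_only` are hypothetical and exist only for this illustration. The sketch shows that `rescue Exception` also traps `Interrupt` (what Ctrl+C raises), while `rescue StandardError` lets it through.

```ruby
# Hypothetical helper: a deliberately broad rescue that also traps Interrupt.
def swallow_everything
  yield
  :finished
rescue Exception => e
  "swallowed #{e.class}"
end

# Hypothetical helper: only StandardError and its subclasses are rescued.
def rescue_standard_only
  yield
  :finished
rescue StandardError => e
  "swallowed #{e.class}"
end

# Interrupt inherits from SignalException, which inherits from Exception,
# so it is not a StandardError.
chain = Interrupt.ancestors.take(3)

broad = swallow_everything { raise Interrupt.new }

narrow =
  begin
    rescue_standard_only { raise Interrupt.new }
  rescue Interrupt
    "Interrupt propagated"
  end

puts chain.inspect
puts broad
puts narrow
```

With the broad rescue, a user pressing Ctrl+C would be silently ignored; with the narrow one, the interrupt reaches whoever is prepared to handle it.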

  17.
    begin
      raise StandardError.new("Oops")
    rescue
      # DO SOMETHING
    ensure
      # Every line will
      # Absolutely, positively
      # Get executed!
    end

  18. Nope!

  20. Thread#raise

  21. Timeout#timeout

  22. Rack::Timeout

  23. Know What's Assured!
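
A minimal sketch of how an `ensure` block can be cut short, assuming only core Ruby threads. One thread uses `Thread#raise` to inject an exception into another thread that is already inside its `ensure`; `Timeout.timeout` relies on the same kind of asynchronous interruption. The variable names and the bare `sleep` standing in for slow cleanup are assumptions made for the demo.

```ruby
steps = []
in_cleanup = Queue.new

worker = Thread.new do
  Thread.current.report_on_exception = false # keep stderr quiet for the demo
  begin
    steps << :work
  ensure
    steps << :cleanup_started
    in_cleanup << true         # tell the main thread we are inside ensure
    sleep                      # stands in for slow cleanup work
    steps << :cleanup_finished # skipped once the exception is injected
  end
end

in_cleanup.pop                         # wait until the worker is in its ensure
worker.raise(RuntimeError, "injected") # interrupt the cleanup from outside

outcome =
  begin
    worker.join
    :joined_cleanly
  rescue RuntimeError => e
    e.message
  end

puts steps.inspect
puts outcome
```

The cleanup starts but never finishes, which is why it pays to know which lines are truly assured to run.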

  24.
    begin
      raise StandardError.new("Oops")
    rescue
      # DO SOMETHING
    ensure
      $! # => Exception in flight (nil once a rescue has handled it)
    end
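
A small sketch of reading `$!` inside `ensure`, assuming core Ruby only. The helper `run_with_cleanup` and the `log` array are hypothetical. There is no `rescue` clause here, so the exception is still propagating when `ensure` runs and `$!` is set; when nothing failed, `$!` is `nil`.

```ruby
# Hypothetical helper: runs a block and notes in `log` whether it failed.
def run_with_cleanup(log)
  yield
ensure
  # $! is the exception currently propagating, or nil if there is none.
  log << ($! ? "failed: #{$!.message}" : "ok")
end

log = []
run_with_cleanup(log) { :fine }

begin
  run_with_cleanup(log) { raise StandardError.new("Oops") }
rescue StandardError
  # The error still reaches the caller; ensure only observed it.
end

puts log.inspect
```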

  25.
    begin
      raise StandardError.new("Oops")
    rescue
      # DO SOMETHING <= What goes here?
    ensure
      # Cleaning up
    end

  26. Record It!
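
One way to "record it", sketched with Ruby's standard `Logger`. The helper name `record_failures` is hypothetical, and writing to a `StringIO` is only so the example is self-contained; a real application would more likely send the error to its log files or an error-tracking service.

```ruby
require "logger"
require "stringio"

# Hypothetical helper: records the error, then re-raises it so callers
# can still react to the failure.
def record_failures(logger)
  yield
rescue StandardError => e
  logger.error("#{e.class}: #{e.message}")
  raise # a bare raise re-raises the rescued exception
end

sink = StringIO.new
logger = Logger.new(sink)

begin
  record_failures(logger) { raise StandardError.new("Oops") }
rescue StandardError
  # handled by the caller
end

puts sink.string
```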

  27. Alerting

  28. What To Alert On?

  29. Errors
    Availability
    Latency

  31. Elasticsearch Cluster

  32. Elasticsearch Cluster
    O_o

  34. Avoid Duplication

  35. "Oh, that just happens. Ignore it."

  36. Alert Fatigue

  37. 1. Problems you can do something about

  38. ALERT!

  39. 2. Upstream problems you can't impact

  40. WARN

  41. 3. Other odd stuff?

  42. TRACK
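
The three buckets on slides 37 through 42 can be sketched as a tiny classifier. The method name `severity_for` and its keyword arguments are hypothetical; how a team decides that a problem is "actionable" or "upstream" is left open here.

```ruby
# Hypothetical classifier following the three buckets on the slides.
def severity_for(actionable:, upstream: false)
  return :alert if actionable # 1. problems you can do something about
  return :warn if upstream    # 2. upstream problems you can't impact
  :track                      # 3. other odd stuff
end

puts severity_for(actionable: true)
puts severity_for(actionable: false, upstream: true)
puts severity_for(actionable: false)
```

Checking "actionable" first means that a page only goes out when someone can actually do something about it.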

  43. TRUST!

  44. LEARN!

  45. Gamedays

  46. Practice Failing

  47. ... In Production

  48. 1. Identify Your Resources

  49. Application instances
    Other services
    Datastores
    Caches
    Files
    CDN

  50. 2. What Can Go Wrong?

  51. Missing
    Slow
    Errors
    Corrupt Data

  52. Risk Matrix

  53. Running the Gameday

  54. Reality's best

  55. Generate load

  56. Break stuff!

  57. kill
    docker stop

  58. kill -STOP

  59. kill -CONT
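
The `kill -STOP` / `kill -CONT` pair can also be driven from Ruby with `Process.kill`. This is a sketch assuming a Unix-like system with a `sleep` executable on the PATH; the child process merely stands in for a real service. A stopped process stays alive but stops responding, which makes it a handy way to simulate a hung dependency.

```ruby
# Assumes a Unix-like system with a `sleep` executable on the PATH.
pid = spawn("sleep", "30")

Process.kill("STOP", pid) # freeze the process: alive, but unresponsive
_, status = Process.waitpid2(pid, Process::WUNTRACED)
frozen = status.stopped?

Process.kill("CONT", pid) # thaw it again
Process.kill("TERM", pid) # tidy up the demo process
_, final = Process.waitpid2(pid)
terminated = final.signaled?

puts "frozen: #{frozen}, terminated: #{terminated}"
```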

  60. iptables or tc

  61. https://github.com/tylertreat/Comcast

  62. Test Recovery Too!

  63. Automating Failure

  65. Toxiproxy
    http://toxiproxy.io/

  66. [Diagram: Your App, Valuable Resource]

  67. [Diagram: Your App, Toxiproxy, Valuable Resource]

  68. [Diagram: Your App, Toxiproxy, Valuable Resource; "HTTP commands!"]

  69.
    class MyApplication
      # ...
      config.elastic_url = ENV["ES_URL"] ||
        "http://localhost:9200"
    end

  70.
    Toxiproxy.populate([
      {
        name: "elastic_search",
        listen: "127.0.0.1:22220",
        upstream: "127.0.0.1:9200"
      }
    ])

  74.
    Rails.configuration.elastic_url =
      "http://127.0.0.1:22220"

    # Boot toxiproxy if it isn't running already
    mac = RbConfig::CONFIG["host_os"] =~ /darwin/
    bin = mac ? "bin/toxiproxy-darwin-amd64" :
                "bin/toxiproxy-linux-amd64"
    pid = spawn(bin)
    Process.detach(pid)

  76.
    it "returns a clean error when down" do
      Toxiproxy[:elastic_search].down do
        post action, args
        assert_response 500
      end
    end

  77.
    it "returns a clean error on timeout" do
      Toxiproxy[:elastic_search].
        downstream(:latency, latency: 900).apply do
        post action, args
        assert_response 500
      end
    end
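
The tests on slides 76 and 77 expect the application to answer with a clean 500 when its dependency is down or slow. Below is a sketch of the kind of client-side handling that could make such tests pass, using only Ruby's standard library rather than Toxiproxy. The wrapper name `status_from`, the half-second timeouts, and the closed-port trick for simulating an outage are all assumptions made for illustration.

```ruby
require "net/http"
require "socket"
require "uri"

# Hypothetical client wrapper: returns the upstream status code, or 500
# when the dependency is unreachable or too slow.
def status_from(url, timeout: 0.5)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port,
                  open_timeout: timeout, read_timeout: timeout) do |http|
    http.get(uri.request_uri).code.to_i
  end
rescue SystemCallError, SocketError, Net::OpenTimeout, Net::ReadTimeout
  500
end

# Simulate "down": grab a free local port, then close the listener so
# that connecting to it is refused.
server = TCPServer.new("127.0.0.1", 0)
port = server.addr[1]
server.close

result = status_from("http://127.0.0.1:#{port}")
puts result
```

Short, explicit timeouts matter as much as the rescue: without them, a slow dependency ties up the caller instead of producing a prompt error.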

  78. Exceptions
    Alerting
    Gamedays
    Automating Failure
    ???