
Berlin 2013 - Session - Mark McGranaghan

Monitorama
September 19, 2013


Transcript

  1. Fewer Better Systems
    Monitorama EU 2013
    Mark McGranaghan


  2. @mmcgrana


  3. Fewer Better Systems


  4. Unix


  5. everything is a file


  6. /var/db
    /usr/lib
    /dev/tcp
    /usr/bin
    /etc


  7. /dev/tcp


  8. problem
    problem
    problem


  9. everything is a ...


  10. failover


  11. primary secondary


  12. primary secondary


  13. primary
    secondary


  14. primary
    secondary?


  15. https://twitter.com/b6n/status/161899319459463168


  16. the best systems are used
    constantly


  17. Fewer Better Systems


  18. everything is a ...


  19. the best systems are used
    constantly


  20. logs / events
    alert criteria / metrics
    integration testing / QoS monitoring
    errors / results


  21. logs / events


  22. logs: stream of unstructured information
    events: stream of structured information


  23. logs
    64.242.88.10 - [19/Sep/2013:10:27:39] "GET /users/7 HTTP/1.1" 200
    [notice] SQL (0.5ms) SELECT users
    Completed in 64ms (View: 52, DB: 10) | 200 OK [/users/7]


  24. invent ways to encode data in text...


  25. data "data" | data - data [data] (data)


  26. meanwhile...


  27. Apache log parsers / analyzers
    Postgres log parsers / analyzers
    Redis log parsers / analyzers
    Heroku log parsers / analyzers
    ...


  28. everything is a ...


  29. events
    {
      :time "2013-09-19 10:27:39"
      :action "users.get"
      :user_id 7
      :method "GET"
      :path "/users/7"
      :ip "64.242.88.10"
      ...
    }


  30. 64.242.88.10 - [19/Sep/2013:10:27:39] "GET /users/7 HTTP/1.1" 200


  31. events
    {
      :time "2013-09-19 10:27:39"
      :action "users.get"
      :user_id 7
      :method "GET"
      :path "/users/7"
      :ip "64.242.88.10"
      ...
    }


  32. encode data as data, uniformly


  33. analyze with general tools

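
A minimal sketch of slides 29 to 33, assuming events are plain Ruby hashes serialized as one JSON document per line. The field names follow the event on the slides; the extra sample events and the helpers `encode`, `decode`, and `count_by` are hypothetical, not something the talk prescribes.

```ruby
require "json"

# Hypothetical events, shaped like the one on the slides.
EVENTS = [
  { time: "2013-09-19 10:27:39", action: "users.get", user_id: 7,
    method: "GET", path: "/users/7", ip: "64.242.88.10" },
  { time: "2013-09-19 10:27:40", action: "users.get", user_id: 8,
    method: "GET", path: "/users/8", ip: "64.242.88.10" },
  { time: "2013-09-19 10:27:41", action: "users.update", user_id: 7,
    method: "PUT", path: "/users/7", ip: "10.0.0.5" }
].freeze

# Encode data as data: one JSON document per line, no custom text format.
def encode(events)
  events.map { |e| JSON.generate(e) }.join("\n")
end

# Decode with a general-purpose parser instead of a per-source regex.
def decode(text)
  text.each_line.map { |line| JSON.parse(line, symbolize_names: true) }
end

# A general analysis step: count events by any field.
def count_by(events, field)
  events.group_by { |e| e[field] }.transform_values(&:size)
end

puts count_by(decode(encode(EVENTS)), :action).inspect
```

The same `count_by` works for `:ip`, `:path`, or any other field, which is the point of the "general tools" slide: nothing in the analysis depends on where the event came from.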

  34. open source


  35. http://fluentd.org


  36. {
        :time "2013-09-19 10:27:39",
        :tag "web.request",
        :record {
          :ip "64.242.88.10",
          :path "/users/7",
          ...
        }
      }


  37. Web apps ---+                 +--> file
                  |                 |
                  +-->           ---+
      /var/log ------>  Fluentd  ------> mail
                  +-->           ---+
                  |                 |
      Apache  ----                  +--> Fluentd
      http://fluentd.org

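
To make the fan-in / fan-out picture on slide 37 concrete, here is a rough sketch in plain Ruby. It is not Fluentd's API or configuration format: `Event`, `Router`, the prefix-matching rule, and the two in-memory sinks are all hypothetical, and only the `time` / `tag` / `record` shape comes from slide 36.

```ruby
# One event in the shape shown on slide 36: time, tag, record.
Event = Struct.new(:time, :tag, :record)

# Hypothetical router: each output subscribes to a tag prefix.
class Router
  def initialize
    @routes = []
  end

  # Register a sink (a block) for tags starting with the given prefix.
  def route(prefix, &sink)
    @routes << [prefix, sink]
    self
  end

  # Deliver the event to every matching sink; return how many matched.
  def emit(event)
    matching = @routes.select { |prefix, _| event.tag.start_with?(prefix) }
    matching.each { |_, sink| sink.call(event) }
    matching.size
  end
end

# In-memory stand-ins for the "file" and "mail" outputs in the diagram.
FILE_SINK = []
MAIL_SINK = []

ROUTER = Router.new
  .route("web.") { |e| FILE_SINK << e }
  .route("web.error") { |e| MAIL_SINK << e }

ROUTER.emit(Event.new("2013-09-19 10:27:39", "web.request",
                      { ip: "64.242.88.10", path: "/users/7" }))
puts "file: #{FILE_SINK.size}, mail: #{MAIL_SINK.size}"
```

Because every source produces the same event shape, adding a new input or output touches only the routing table, not a parser.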

  38. problem
    problem
    problem


  39. something happened at some time:
    event
    events as data, not text
    general-purpose event processing
    applicable to all information


  40. everything is a ...


  41. alert criteria / metrics


  42. alert criteria: measure, alert if out of bounds
    metrics: measure, store for analysis


  43. measure measure
    alert store


  44. measure measure
    alert store
    steady-state


  45. measure measure
    alert store
    alert!


  46. measure measure
    alert store
    steady-state


  47. measure
    alert
    store


  48. production


  49. View Slide

  50. every alert has time series
    alert time series come from metrics stack
    alert source data stored all the time

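
One way to read slides 42 to 50 is "measure once, store always, and make the alert a query over what was stored". The sketch below assumes an in-memory store and a mean-over-window threshold; `MetricStore`, `alerting?`, the metric name, and the sample values are all hypothetical.

```ruby
# Hypothetical in-memory metrics store: every measurement is kept,
# whether or not anything is alerting at the time.
class MetricStore
  def initialize
    @series = Hash.new { |hash, name| hash[name] = [] }
  end

  def record(name, time, value)
    @series[name] << [time, value]
  end

  # Values recorded for `name` at or after `since`.
  def values(name, since)
    @series[name].select { |t, _| t >= since }.map { |_, v| v }
  end
end

# An alert is a predicate over stored data, not a separate measurement path.
def alerting?(store, name, since:, max_mean:)
  window = store.values(name, since)
  return false if window.empty?
  window.sum.to_f / window.size > max_mean
end

STORE = MetricStore.new
[[0, 40], [10, 50], [20, 45], [30, 400], [40, 420]].each do |time, ms|
  STORE.record("web.latency_ms", time, ms)
end

puts alerting?(STORE, "web.latency_ms", since: 30, max_mean: 200)
```

Since the alert reads from the same series that is stored in steady state, the data behind any alert is already there for analysis when it fires.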

  51. the best systems are used
    constantly


  52. integration testing / QoS monitoring


  53. https://plus.google.com/112678702228711889851/posts/eVeouesvaVX


  54. integration testing: is it good for production?
    QoS monitoring: is it good in production?


  55. integration testing
    run through common user flows,
    assert no errors,
    ensure performance adequate


  56. quality of service (QoS) monitoring
    users running through flows
    asserting no/minimal errors,
    ensuring performance adequate


  57. integration
    prod
    staging
    user load
    QoS monitoring


  58. Integration
    prod
    staging
    user load
    QoS monitoring


  59. staging prod
    user load
    load
    gen
    QoS monitoring
    QoS
    monitoring
    load gen


  60. invest in load generation/replay
    invest in granular QoS monitoring
    applicable to all environments, all the time

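
Slides 55 to 60 suggest that integration testing and QoS monitoring are the same check applied to different traffic. A possible sketch, assuming each request in a user flow is recorded as an observation: `Observation`, `qos_report`, the thresholds, and the sample data are hypothetical, and the 95th percentile is only one plausible reading of "performance adequate".

```ruby
# One observed request in a user flow. In staging these would come from
# generated or replayed load; in production, from real users.
Observation = Struct.new(:flow, :ok, :latency_ms)

# Hypothetical QoS check: same thresholds, same code, any environment.
def qos_report(observations, max_error_rate:, max_p95_ms:)
  observations.group_by(&:flow).map do |flow, obs|
    error_rate = obs.count { |o| !o.ok }.to_f / obs.size
    sorted = obs.map(&:latency_ms).sort
    rank = (sorted.size * 95 + 99) / 100 # ceil(0.95 * n), nearest-rank p95
    p95 = sorted[rank - 1]
    healthy = error_rate <= max_error_rate && p95 <= max_p95_ms
    [flow, { error_rate: error_rate, p95_ms: p95, healthy: healthy }]
  end.to_h
end

# Generated load in staging: 20 successful signups, 10ms to 200ms.
STAGING = (1..20).map { |i| Observation.new("signup", true, i * 10) }

# Production: the same flow plus a checkout flow that is failing.
PROD = STAGING + [
  Observation.new("checkout", false, 900),
  Observation.new("checkout", true, 100)
]

puts qos_report(PROD, max_error_rate: 0.01, max_p95_ms: 250).inspect
```

Calling `qos_report(STAGING, ...)` before a deploy and `qos_report(PROD, ...)` continuously afterwards uses one system for both jobs, so it is exercised all the time.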

  61. the best systems are used
    constantly


  62. errors / results


  63. raise("it's tricky")


  64. errors: something happened, it was bad
    results: something happened, it was OK


  65. begin
        res = call_fn(arg)
        # handle result
      rescue => err
        # handle error
      end


  66. View Slide

  67. View Slide

  68. exceptions are only exceptional
    at small scale
    “1 in a billion” @ 100k op/s ≃ 10 times a day

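
The arithmetic behind slide 68, worked out as a small sketch. The function name is hypothetical; the inputs are the figures on the slide.

```ruby
SECONDS_PER_DAY = 24 * 60 * 60 # 86_400

# Expected daily occurrences of an event with odds "1 in `odds`"
# at a steady rate of `ops_per_second` operations.
def expected_per_day(ops_per_second, odds)
  (ops_per_second * SECONDS_PER_DAY).to_f / odds
end

# 100k op/s is 8.64 billion operations a day, so a "1 in a billion"
# event is expected about 8.64 times a day, roughly the slide's 10.
puts expected_per_day(100_000, 1_000_000_000)
```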

  69. begin
        res = call_fn(arg)
        # handle result
      rescue => err
        # handle error
      end


  70. open source


  71. http://golang.org


  72. http://golang.org
    res, err := RunOp(arg)
    if err != nil {
        // handle error
    }
    // handle result


  73. begin
        res = run_op(arg)
        # handle result
      rescue => err
        # handle error
      end


  74. locality?
    in general:
    not local in space - service-level errors etc
    not local in time - defined post hoc!


  75. what even is an error?
    you don’t know at dev-time
    when it’s just a result...
    emit event for later analysis


  76. treat “exceptions” / results symmetrically
    to the greatest extent possible
    expect to define errors at analysis-time,
    not just dev-time or run-time,
    based on results

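
A sketch of what treating exceptions and results symmetrically could look like in Ruby, loosely following the Go pattern on slide 72. The `Outcome` struct, the `run_op` wrapper (which reuses the slide's name but takes a block), and the in-memory event log are assumptions made for illustration.

```ruby
# Hypothetical outcome record: success and failure have the same shape.
Outcome = Struct.new(:op, :ok, :value, :duration_ms)

EVENT_LOG = []

# Run the block, capture whatever happened as data, and emit it as an event.
def run_op(op, log: EVENT_LOG)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  ok, value = begin
    [true, yield]
  rescue StandardError => err
    [false, err.class.name]
  end
  elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000
  outcome = Outcome.new(op, ok, value, elapsed_ms)
  log << outcome
  outcome
end

# "Error" is decided at analysis time, by a predicate over stored outcomes.
def errors(log, &is_error)
  log.select(&is_error)
end

run_op("divide") { 10 / 2 }
run_op("divide") { 10 / 0 }
puts errors(EVENT_LOG) { |o| !o.ok }.map(&:value).inspect
```

The predicate passed to `errors` can change after the fact, for example to also count successful but slow operations, without touching the code that ran them.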

  77. everything is a ...


  78. logs / events / metrics
    alert criteria / metrics
    integration testing / QoS monitoring
    errors / results


  79. a challenge


  80. everything is a ...


  81. the best systems are used
    constantly


  82. Fewer Better Systems
