
Berlin 2013 - Session - Mark McGranaghan

Monitorama
September 19, 2013

Transcript

  1. Fewer Better Systems Monitorama EU 2013 Mark McGranaghan

  2. @mmcgrana

  3. Fewer Better Systems

  4. Unix

  5. everything is a file

  6. /var/db /usr/lib /dev/tcp /usr/bin /etc

  7. /dev/tcp

  8. problem problem problem

  9. everything is a ...

  10. failover

  11. primary secondary

  12. primary secondary

  13. primary secondary

  14. primary secondary?

  15. https://twitter.com/b6n/status/161899319459463168

  16. the best systems are used constantly

  17. Fewer Better Systems

  18. everything is a ...

  19. the best systems are used constantly

  20. logs / events
    alert criteria / metrics
    integration testing / QoS monitoring
    errors / results
  21. logs / events

  22. logs: stream of unstructured information
    events: stream of structured information

  23. logs
    64.242.88.10 - [19/Sep/2013:10:27:39] "GET /users/7 HTTP/1.1" 200
    [notice] SQL (0.5ms) SELECT users
    Completed in 64ms (View: 52, DB: 10) | 200 OK [/users/7]
  24. invent ways to encode data in text...

  25. data "data" | data <data> - data [data] (data)

  26. meanwhile...

  27. Apache log parsers / analyzers
    Postgres log parsers / analyzers
    Redis log parsers / analyzers
    Heroku log parsers / analyzers
    ...
  28. everything is a ...

  29. events
    { :time "2013-09-19 10:27:39"
      :action "users.get"
      :user_id 7
      :method "GET"
      :path "/users/7"
      :ip "64.242.88.10"
      ... }
  30. 64.242.88.10 - [19/Sep/2013:10:27:39] "GET /users/7 HTTP/1.1" 200

  31. events
    { :time "2013-09-19 10:27:39"
      :action "users.get"
      :user_id 7
      :method "GET"
      :path "/users/7"
      :ip "64.242.88.10"
      ... }
  32. encode data as data, uniformly

  33. analyze with general tools

  34. open source

  35. http://fluentd.org

  36. { :time "2013-09-19 10:27:39",
      :tag "web.request",
      :record { :ip "64.242.88.10",
                :path "/users/7",
                ... } }
  37. Web apps  ---+                 +--> file
                 |                 |
                 +-->           ---+
    /var/log  ------>  Fluentd  ------> mail
                 +-->           ---+
                 |                 |
    Apache    ----                 +--> Fluentd
    http://fluentd.org
  38. problem problem problem

  39. something happened at some time: event
    events as data, not text
    general-purpose event processing
    applicable to all information
  40. everything is a ...

  41. alert criteria / metrics

  42. alert criteria: measure, alert if out of bounds
    metrics: measure, store for analysis
  43. measure measure alert store

  44. measure measure alert store steady-state

  45. measure measure alert store alert!

  46. measure measure alert store steady-state

  47. measure alert store

  48. production

  49. [image slide, no text]
  50. every alert has time series
    alert time series come from metrics stack
    alert source data stored all the time
  51. the best systems are used constantly

  52. integration testing / QoS monitoring

  53. https://plus.google.com/112678702228711889851/posts/eVeouesvaVX

  54. integration testing: is it good for production?
    QoS monitoring: is it good in production?
  55. integration testing
    run through common user flows, assert no errors, ensure performance adequate
  56. quality of service (QoS) monitoring
    users running through flows, asserting no/minimal errors, ensuring performance adequate
  57. integration prod staging user load QoS monitoring

  58. Integration prod staging user load QoS monitoring

  59. staging prod user load load gen QoS monitoring QoS monitoring

    load gen
  60. invest in load generation/replay
    invest in granular QoS monitoring
    applicable to all environments, all the time
  61. the best systems are used constantly

  62. errors / results

  63. raise(“it’s tricky”)

  64. errors: something happened, it was bad
    results: something happened, it was OK
  65. begin
      res = call_fn(arg)
      # handle result
    rescue => err
      # handle error
    end
  66. [image slide, no text]
  67. [image slide, no text]
  68. exceptions are only exceptional at small scale
    “1 in a billion” @ 100k op/s ≃ 10 times a day
  69. begin
      res = call_fn(arg)
      # handle result
    rescue => err
      # handle error
    end
  70. open source

  71. http://golang.org

  72. http://golang.org
    res, err := RunOp(arg)
    if err != nil {
      // handle error
    }
    // handle result
  73. begin
      res = run_op(arg)
      # handle result
    rescue => err
      # handle error
    end
  74. locality? in general:
    not local in space - service-level errors etc
    not local in time - defined post hoc!
  75. what even is an error?
    you don’t know at dev-time
    when it’s just a result... emit event for later analysis
  76. treat “exceptions” / results symmetrically to the greatest extent possible
    expect to define errors at analysis-time, not just dev-time or run-time, based on results
  77. everything is a ...

  78. logs / events / metrics
    alert criteria / metrics
    integration testing / QoS monitoring
    errors / results
  79. a challenge

  80. everything is a ...

  81. the best systems are used constantly

  82. Fewer Better Systems
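
Slides 29 to 33 argue for encoding events as data and analyzing them with general tools. A minimal Go sketch of that idea follows. It is an illustration, not code from the talk: the slides use a map notation, while this sketch assumes JSON as the encoding, and the names `Event`, `Encode`, and `CountBy` are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event carries the fields shown on the "events" slides.
// The JSON encoding is an assumption made for this sketch.
type Event struct {
	Time   string `json:"time"`
	Action string `json:"action"`
	UserID int    `json:"user_id"`
	Method string `json:"method"`
	Path   string `json:"path"`
	IP     string `json:"ip"`
}

// Encode serializes one event as a single JSON object.
func Encode(e Event) (string, error) {
	b, err := json.Marshal(e)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// CountBy is a hypothetical general-purpose tool: it counts events by
// the value of any field without knowing which system emitted them.
// Events that lack the field are skipped.
func CountBy(lines []string, field string) (map[string]int, error) {
	counts := make(map[string]int)
	for _, line := range lines {
		var m map[string]interface{}
		if err := json.Unmarshal([]byte(line), &m); err != nil {
			return nil, err
		}
		v, ok := m[field]
		if !ok {
			continue
		}
		counts[fmt.Sprint(v)]++
	}
	return counts, nil
}

func main() {
	line, err := Encode(Event{
		Time:   "2013-09-19 10:27:39",
		Action: "users.get",
		UserID: 7,
		Method: "GET",
		Path:   "/users/7",
		IP:     "64.242.88.10",
	})
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(line)

	counts, err := CountBy([]string{line}, "action")
	if err != nil {
		fmt.Println("count failed:", err)
		return
	}
	fmt.Println(counts)
}
```

The point of the sketch is that `CountBy` needs no Apache-, Postgres-, or Redis-specific parser: any producer that emits the same kind of structured event can be analyzed by the same function.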
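
Slides 42 to 50 suggest that alerting and metrics can share one measurement path, so that every alert already has a stored time series behind it. The sketch below shows one way that could look in Go. The types `Sample`, `Series`, and `Bounds`, the metric name, and the threshold values are all assumptions for illustration; the in-memory slice merely stands in for a metrics store.

```go
package main

import "fmt"

// Sample is one measurement. At is seconds since the start of the
// run (a hypothetical clock for this sketch).
type Sample struct {
	At    int64
	Value float64
}

// Series is the stored time series for one metric.
type Series struct {
	Name    string
	Samples []Sample
}

// Bounds is an alert criterion: values outside [Min, Max] alert.
type Bounds struct {
	Min, Max float64
}

// Record stores every sample, in steady state or not, and reports
// whether the sample is out of bounds. Storing and alerting use the
// same measurement, so an alert always has history to look at.
func (s *Series) Record(sm Sample, b Bounds) bool {
	s.Samples = append(s.Samples, sm)
	return sm.Value < b.Min || sm.Value > b.Max
}

func main() {
	latency := &Series{Name: "web.latency_ms"}
	bounds := Bounds{Min: 0, Max: 100}

	for i, v := range []float64{64, 58, 250} {
		if latency.Record(Sample{At: int64(i), Value: v}, bounds) {
			fmt.Printf("alert: %s=%v with %d stored samples of context\n",
				latency.Name, v, len(latency.Samples))
		}
	}
}
```

Because the store is written on every measurement, the storage path is exercised constantly rather than only when something goes wrong, which is the property slide 51 asks for.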
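
Slides 68 to 76 propose treating errors and results symmetrically and deciding what counts as an error at analysis time. The Go sketch below is one hedged reading of that advice. `RunOp` appears on slide 72, but its body here is invented, and `Result`, `Observe`, `CountBad`, and `ExpectedPerDay` are hypothetical names. `ExpectedPerDay` also works through the slide 68 arithmetic: 100,000 op/s times 86,400 s/day times 1e-9 is about 8.64 occurrences a day, which the slide rounds to roughly 10.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Result is emitted for every operation, whether it succeeded or not.
type Result struct {
	Op       string
	Duration time.Duration
	Err      error
}

// RunOp is a stand-in operation; its behavior is invented for this sketch.
func RunOp(arg int) (int, error) {
	if arg < 0 {
		return 0, errors.New("negative argument")
	}
	return arg * 2, nil
}

// Observe runs the operation and appends a Result either way, so
// successes and failures flow through the same path.
func Observe(events *[]Result, op string, arg int) (int, error) {
	start := time.Now()
	res, err := RunOp(arg)
	*events = append(*events, Result{Op: op, Duration: time.Since(start), Err: err})
	return res, err
}

// CountBad applies an error definition chosen at analysis time.
func CountBad(events []Result, isBad func(Result) bool) int {
	n := 0
	for _, r := range events {
		if isBad(r) {
			n++
		}
	}
	return n
}

// ExpectedPerDay returns the expected daily count of an outcome with
// the given per-operation probability at a steady operation rate.
func ExpectedPerDay(opsPerSec, probability float64) float64 {
	const secondsPerDay = 86400
	return opsPerSec * secondsPerDay * probability
}

func main() {
	var events []Result
	for _, arg := range []int{1, -1, 3} {
		if _, err := Observe(&events, "double", arg); err != nil {
			fmt.Println("handled:", err)
		}
	}

	// Two definitions of "bad" applied after the fact to the same results.
	failed := CountBad(events, func(r Result) bool { return r.Err != nil })
	slow := CountBad(events, func(r Result) bool { return r.Duration > time.Second })
	fmt.Println("failed:", failed, "slow:", slow)

	fmt.Printf("1-in-a-billion at 100k op/s: %.2f per day\n", ExpectedPerDay(100000, 1e-9))
}
```

Passing the predicate to `CountBad` keeps the definition of an error out of the code that runs the operation, so a later analysis can call something an error (too slow, for example) even though the operation returned normally.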