Many Moving Parts: Monitoring Complex Systems

Monitoring a simple app with, say, a single web server and a database host is, well, simple. How do you approach monitoring complex systems where the technology stack has more pieces than you can count on both hands? Where the system diagram needs to have memes on it to stop you from crying? Where the business depends on your system being available?

Laura Thomson

October 17, 2014


  1. Many Moving Parts: Monitoring Complex Systems. laura@mozilla.com, @lxt

  2. Many Moving Parts: Monitoring Complex Systems (at roflscale)

  3. Confession: I’m not a sysadmin.

  4. (Why does devops == ops, anyway?)

  5. None
  6. webapp, db: simple system, simple monitoring

  7. webapp, db, cache, slave: slightly more complex

  8. Most of us don’t work on simple stuff

  9. Most of us don’t work on simple stuff ...and if you do I hate you.
  10. Most of us don’t work on simple stuff ...and if you do I hate you. (just kidding)
  11. Some of our stuff looks like this

  12. Some of our stuff looks like this (avert your eyes now if you are easily scared)
  13. 13

  14. and this

  15. None
  16. It’s actually more complicated than that

  17. Socorro. Very Large Array at Socorro, New Mexico, USA. Photo taken by Hajor, 08.Aug.2004. Released under cc.by.sa and/or GFDL. Source: http://en.wikipedia.org/wiki/File:USA.NM.VeryLargeArray.02.jpg
  18. 18

  19. 19

  20. None
  21. 21

  22. None
  23. > 120 physical boxes (not cloud); ~8 developers + ops + QA + Hadoop ops
  24. 3000 crashes per minute; 3 million per day; crash size 150k - 20MB; ~800GB stored in PostgreSQL; ~110TB stored in HDFS
  25. like many complex systems, it’s a data pipeline (or firehose, if you prefer)
  26. None
  27. Typically 4k build and 10k test jobs per day; topped 100k/day in August. New record for pushes/commits: 690 in a single day (also in August). hg.mozilla.org has: 3,445 unique repos; 32,123,211 total commits; 865,594 unique commits; 1,223 unique root commits. ~6000 machines. 80% of build and 50% of test in cloud. Strategic use of AWS spot instances saved 75% on our bill.
  28. None
  29. None
  30. First Principles (First: Principles!)

  31. Pull for normal operating conditions. Push soft alerts for warnings. Pages for critical issues.
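The pull/push/page split on slide 31 can be read as a routing rule keyed on severity. A minimal sketch, assuming three severity levels and hypothetical channel names (`dashboard`, `soft_alert`, `page`) that are not from the talk:

```python
from enum import Enum


class Severity(Enum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


def route(severity: Severity) -> str:
    """Return the delivery channel for a check result.

    Channel names are illustrative placeholders, not a real API.
    """
    if severity is Severity.CRITICAL:
        return "page"        # interrupt a human right now
    if severity is Severity.WARNING:
        return "soft_alert"  # pushed (e.g. email or chat) but nobody is paged
    return "dashboard"       # nothing is pushed; people pull it when they look


if __name__ == "__main__":
    for level in Severity:
        print(level.name, "->", route(level))
```

Keeping the routing in one small function makes it easy to demote a noisy page to a soft alert later without touching the checks themselves.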
  32. Analytics tell you: “is this load normal?” Visualizing healthy system performance makes it easier to find anomalies.
  33. None
  34. monitoring coverage is logically equivalent to test coverage (and suffers the same issues)
  35. the things that are easy to monitor/test vs the things we really care about
  36. just “more” is not better: noise is enemy #1

  37. Thomson’s Hierarchy of Monitoring

  38. Diagnostic, Indirect, Threshold, Trend, Performance, Business

  39. Diagnostic

  40. Host DOWN. 500 ISE (Internal Server Error). Replication lag.

  41. You know where to look. You have a good idea about what to fix. Not always simple, but often well-defined.
  42. Indirect

  43. FILE_AGE CRITICAL: blah.log is M seconds old. Last record written N seconds ago.
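An indirect check like the FILE_AGE example on slide 43 can be sketched in a few lines. This is an illustration only: the thresholds (300 and 900 seconds) and the function name are assumptions, and the numeric statuses follow the common Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
import os
import time

OK, WARNING, CRITICAL = 0, 1, 2  # Nagios-style plugin status codes


def check_file_age(path, warn_after=300, crit_after=900, now=None):
    """Return (status, message) based on how long ago `path` was modified.

    `warn_after` and `crit_after` are in seconds and are made-up defaults.
    `now` can be passed in to make the check deterministic in tests.
    """
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        # A missing file usually means the writer never ran at all.
        return CRITICAL, f"FILE_AGE CRITICAL: {path} is missing or unreadable"
    if age >= crit_after:
        return CRITICAL, f"FILE_AGE CRITICAL: {path} is {age:.0f} seconds old"
    if age >= warn_after:
        return WARNING, f"FILE_AGE WARNING: {path} is {age:.0f} seconds old"
    return OK, f"FILE_AGE OK: {path} is {age:.0f} seconds old"


if __name__ == "__main__":
    status, message = check_file_age("blah.log")
    print(message)
    raise SystemExit(status)
```

The check says nothing about why the file is stale. It only tells you that something upstream of the writer has stopped, which is the point of an indirect monitor.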
  44. Something is wrong: maybe with the monitored component, maybe somewhere

  45. Why is this useful?

  46. High level exception handlers. The thing you don’t know to monitor yet. The thing you don’t know how to monitor.
  47. You know where to start looking. You might have to look deeper too.
  48. Threshold

  49. DISK WARNING - free space: (% used). More files on disk than there ought to be. Running out of inodes.
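Disk space and inode exhaustion from slide 49 can both be read from one system call. A rough sketch, assuming a Unix host (`os.statvfs` is not available on Windows) and hypothetical thresholds of 85% and 95%:

```python
import os


def classify(percent_used, warn_pct=85.0, crit_pct=95.0):
    """Map a usage percentage to a status string. Thresholds are assumptions."""
    if percent_used >= crit_pct:
        return "CRITICAL"
    if percent_used >= warn_pct:
        return "WARNING"
    return "OK"


def disk_report(path="/"):
    """Return status and percentage used for both blocks and inodes."""
    st = os.statvfs(path)
    # Blocks unavailable to unprivileged users are counted as used.
    space_pct = 100.0 * (st.f_blocks - st.f_bavail) / st.f_blocks if st.f_blocks else 0.0
    # Some filesystems report zero total inodes; treat that as "not applicable".
    inode_pct = 100.0 * (st.f_files - st.f_favail) / st.f_files if st.f_files else 0.0
    return {
        "space": (classify(space_pct), space_pct),
        "inodes": (classify(inode_pct), inode_pct),
    }


if __name__ == "__main__":
    for name, (status, pct) in disk_report("/").items():
        print(f"DISK {status} - {name}: {pct:.1f}% used")
```

Checking inodes alongside space catches the "more files than there ought to be" case, where free space looks fine but nothing new can be created.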
  50. Sometimes simple (disk space). Sometimes complex root cause (files). Sometimes hard to measure.
  51. 1% errors = normal, expected. 5% errors = something bad is happening.
  52. Error rates: count errors (statsd, etc) per time window. Monitor on counts (rate).
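Slides 51 and 52 suggest alerting on the error rate within a time window rather than on individual errors. A minimal in-memory sketch follows. It is not how statsd works internally; the class name, the 60-second window, and the 2% warning level are assumptions, while the 5% critical level comes from slide 51.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the fraction of failed events over a sliding time window."""

    def __init__(self, window_seconds=60, warn_rate=0.02, crit_rate=0.05):
        self.window_seconds = window_seconds
        self.warn_rate = warn_rate  # assumed: anything above ~1% baseline
        self.crit_rate = crit_rate  # from the slide: 5% means something bad
        self.events = deque()       # (timestamp, is_error), oldest first

    def record(self, timestamp, is_error):
        self.events.append((timestamp, bool(is_error)))

    def error_rate(self, now):
        # Drop events that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()
        if not self.events:
            return 0.0
        errors = sum(1 for _, is_error in self.events if is_error)
        return errors / len(self.events)

    def status(self, now):
        rate = self.error_rate(now)
        if rate >= self.crit_rate:
            return "critical"
        if rate > self.warn_rate:
            return "warning"
        return "ok"


if __name__ == "__main__":
    monitor = ErrorRateMonitor()
    for i in range(100):
        monitor.record(timestamp=i * 0.5, is_error=(i % 10 == 0))
    print(monitor.status(now=50))
```

Timestamps are passed in explicitly so the monitor can be exercised with recorded data as well as live traffic.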
  53. Trend

  54. Disk is 85% full. Did it get that way over months? Did it get that way in one night?
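The two questions on slide 54 can be answered by looking at the rate of change rather than the current value. A small sketch, assuming usage samples are available as (hours, percent used) pairs and using a simple two-point slope; the function name and the sample data are made up for illustration.

```python
def hours_until_full(samples, full_at=100.0):
    """Estimate hours until the disk reaches `full_at` percent.

    `samples` is a list of (hours, percent_used) pairs, oldest first.
    Returns None when usage is flat or shrinking.
    """
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive amount of time")
    rate = (u1 - u0) / (t1 - t0)  # percentage points per hour
    if rate <= 0:
        return None
    return (full_at - u1) / rate


if __name__ == "__main__":
    slow = [(0, 84.0), (720, 85.0)]  # one point over 30 days
    fast = [(0, 40.0), (8, 85.0)]    # 45 points overnight
    print("slow growth:", hours_until_full(slow))
    print("fast growth:", hours_until_full(fast))
```

Both disks read 85% full, but the first has on the order of 450 days left and the second has under three hours, so only the second deserves a page.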
  55. Trends are important. Rates of change are important.

  56. Top crashes (count). Explosive crashes (trend).
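The difference between "top" and "explosive" on slide 56 is the difference between ranking by count and ranking by growth. The sketch below is an illustration only and is not Socorro's actual algorithm; the signature names, `min_count`, and `growth_factor` are hypothetical.

```python
from collections import Counter


def top_crashes(current, n=3):
    """Signatures with the highest counts in the current window."""
    return [signature for signature, _ in Counter(current).most_common(n)]


def explosive_crashes(previous, current, min_count=50, growth_factor=5.0):
    """Signatures whose count grew sharply compared with the previous window.

    `min_count` filters out noise from rare crashes; both parameters are
    assumptions chosen for the example.
    """
    flagged = []
    for signature, count in current.items():
        before = previous.get(signature, 0)
        if count >= min_count and count >= growth_factor * max(before, 1):
            flagged.append(signature)
    return sorted(flagged)


if __name__ == "__main__":
    previous = {"sig_a": 1000, "sig_b": 900, "sig_c": 4}
    current = {"sig_a": 1050, "sig_b": 880, "sig_c": 400}
    print("top:", top_crashes(current, n=2))
    print("explosive:", explosive_crashes(previous, current))
```

In this example `sig_c` never makes the top list, yet it is the only signature that changed meaningfully, which is what a trend monitor is for.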

  57. Performance

  58. Page load times. Other component response times. X items processed/minute.

  59. Tooling is improving. Traditionally more for dev than ops. Needs threshold/trend alerting.
  60. Business

  61. Transactions/hour. Conversion rate. Volumes.

  62. Performance monitors and multiple-levels-of-indirection monitors: thresholds, trends, alerts

  63. Often these exist in human form. AUTOMATE. Better a page than an angry boss/customer.
  64. Testing

  65. None
  66. You’ve probably heard: Monitoring and testing converge

  67. Running tests on prod can be awesome, except when it isn’t (Knight) (be careful)
  68. Two kinds: safe for prod; not safe for prod (write, load, etc)
  69. Monitor as unit test: when you have an outage, add a monitor
  70. Resilience

  71. Gracefully degrade. Decouple unreliable components. Load/feature shed. Automate recovery.

  72. What changes? Thresholds more interesting than single alerts. Many alerts should become warnings (or risk alert fatigue).
  73. Questions? laura@mozilla.com @lxt