
Many Moving Parts: Monitoring Complex Systems

Monitoring a simple app with, say, a single web server and a database host is, well, simple. How do you approach monitoring complex systems, where the technology stack has more pieces than you can count on both hands? Where the system diagram needs memes on it to stop you from crying? Where the business depends on your system being available?

Laura Thomson

October 17, 2014

Transcript

  1. Many Moving Parts
    Monitoring Complex Systems

    [email protected]

    @lxt


  2. Many Moving Parts
    Monitoring Complex Systems
    (at roflscale)


  3. Confession: I’m not a sysadmin.


  4. (Why does devops == ops, anyway?)


  5. (image-only slide)

    (diagram: webapp → db)
    simple system, simple monitoring


    (diagram: webapp, db, cache, slave)
    slightly more complex


  8. Most of us don’t work on simple stuff



  9. Most of us don’t work on simple stuff

    ...and if you do I hate you.


  10. Most of us don’t work on simple stuff

    ...and if you do I hate you.

    (just kidding)


  11. Some of our stuff looks like this


  12. Some of our stuff looks like this

    (avert your eyes now if you are easily scared)


  13. (image-only slide)

  14. and this


  15. (image-only slide)

  16. It’s actually more complicated than that


  17. Socorro
    Very Large Array at Socorro, New Mexico, USA. Photo taken by Hajor, 08.Aug.2004. Released under cc.by.sa
    and/or GFDL. Source: http://en.wikipedia.org/wiki/File:USA.NM.VeryLargeArray.02.jpg


  18. (image-only slide)

  19. (image-only slide)

  20. (image-only slide)

  21. (image-only slide)

  22. (image-only slide)

  23. > 120 physical boxes (not cloud)

    ~8 developers + ops + QA + Hadoop ops


  24. 3000 crashes per minute

    3 million per day

    Crash size 150k - 20MB

    ~800GB stored in PostgreSQL

    ~110TB stored in HDFS


  25. like many complex systems

    it’s a data pipeline

    or firehose, if you prefer


  26. (image-only slide)

  27. typically 4k build and 10k test jobs per day

    topped 100k/day in August

    new record for pushes/commits: 690 in a single day (also in August)

    hg.mozilla.org has: 3,445 unique repos; 32,123,211 total commits;
    865,594 unique commits; 1,223 unique root commits

    ~6000 machines

    80% of build and 50% of test in cloud

    strategic use of AWS spot instances saved 75% on our bill


  28. (image-only slide)

  29. (image-only slide)

  30. First Principles

    (First: Principles!)


  31. Pull for normal operating conditions

    Push soft alerts for warnings

    Pages for critical issues
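    A minimal sketch of that split, with made-up function names (not from the talk): normal readings get recorded and pulled into dashboards, warnings go out as soft alerts, and only critical results page the on-call.

        # Hypothetical severity routing; record_metric / send_soft_alert /
        # send_page stand in for whatever tooling you actually use.
        OK, WARNING, CRITICAL = 0, 1, 2

        def route_check_result(name, severity, message,
                               record_metric, send_soft_alert, send_page):
            if severity == OK:
                record_metric(name, message)     # pulled into graphs/dashboards
            elif severity == WARNING:
                send_soft_alert(name, message)   # email/IRC; nobody gets woken up
            else:
                send_page(name, message)         # on-call gets paged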


  32. analytics tell you:

    “is this load normal?”

    visualize healthy system performance

    makes it easier to find anomalies


  33. (image-only slide)

  34. monitoring coverage

    is logically equivalent to

    test coverage

    (and suffers the same issues)


  35. the things that are easy to monitor/test

    vs

    the things we really care about


  36. just “more” is not better:

    noise is enemy #1


  37. Thomson’s Hierarchy of Monitoring


  38. Diagnostic

    Indirect

    Threshold

    Trend

    Performance

    Business


  39. Diagnostic


  40. Host DOWN

    500 ISE

    Replication lag
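    As one concrete diagnostic check, a replication-lag probe on a PostgreSQL replica might look roughly like this. This is a sketch, assuming psycopg2 and Nagios-style exit codes; the host, database, and thresholds are made up.

        import sys
        import psycopg2

        WARN_SECONDS, CRIT_SECONDS = 60, 300

        conn = psycopg2.connect("host=replica.example.com dbname=app")
        cur = conn.cursor()
        # Seconds since the last transaction was replayed on this replica.
        cur.execute("SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())")
        lag = int(cur.fetchone()[0] or 0)

        if lag >= CRIT_SECONDS:
            print("REPLICATION CRITICAL: %d seconds behind master" % lag)
            sys.exit(2)
        elif lag >= WARN_SECONDS:
            print("REPLICATION WARNING: %d seconds behind master" % lag)
            sys.exit(1)
        print("REPLICATION OK: %d seconds behind master" % lag)
        sys.exit(0)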



  41. You know where to look

    You have a good idea about what to fix

    Not always simple, but often well-defined


  42. Indirect


  43. FILE_AGE CRITICAL: blah.log is M seconds old

    Last record written N seconds ago
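    A sketch of that kind of indirect check, in the spirit of Nagios' check_file_age: if the log hasn't been written to recently, something upstream is probably stuck. The path and thresholds are made up.

        import os
        import sys
        import time

        LOG = "/var/log/processor/blah.log"   # illustrative path
        WARN, CRIT = 300, 900                 # seconds since last write

        age = int(time.time() - os.stat(LOG).st_mtime)
        if age >= CRIT:
            print("FILE_AGE CRITICAL: %s is %d seconds old" % (LOG, age))
            sys.exit(2)
        elif age >= WARN:
            print("FILE_AGE WARNING: %s is %d seconds old" % (LOG, age))
            sys.exit(1)
        print("FILE_AGE OK: %s is %d seconds old" % (LOG, age))
        sys.exit(0)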


  44. Something is wrong

    Maybe with the monitored component

    Maybe somewhere upstream


  45. Why is this useful?


  46. High level exception handlers

    The thing you don’t know to monitor yet

    The thing you don’t know how to monitor


  47. You know where to start looking

    You might have to look deeper too


  48. Threshold


  49. DISK WARNING - free space: (% used)

    More files on disk than there ought to be

    Running out of inodes
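    A sketch of a combined space and inode threshold check using os.statvfs (Unix only; the mount point and the 90%/95% thresholds are made up):

        import os
        import sys

        MOUNT = "/data"
        WARN_PCT, CRIT_PCT = 90, 95

        st = os.statvfs(MOUNT)
        space_pct = 100.0 * (1 - float(st.f_bavail) / st.f_blocks)
        inode_pct = 100.0 * (1 - float(st.f_favail) / st.f_files)
        worst = max(space_pct, inode_pct)

        msg = "%s %.0f%% space used, %.0f%% inodes used" % (MOUNT, space_pct, inode_pct)
        if worst >= CRIT_PCT:
            print("DISK CRITICAL - " + msg)
            sys.exit(2)
        elif worst >= WARN_PCT:
            print("DISK WARNING - " + msg)
            sys.exit(1)
        print("DISK OK - " + msg)
        sys.exit(0)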


  50. Sometimes simple (disk space)

    Sometimes complex root cause (files)

    Sometimes hard to measure


  51. 1% errors = normal, expected

    5% errors = something bad is happening


  52. Error rates

    Count errors (statsd, etc) per time window

    Monitor on counts (rate)
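    A sketch of the counting side, assuming the Python statsd client and an aggregator such as Graphite; the metric names are illustrative. The alerting side then computes errors/requests per window and compares it to thresholds like the 1% and 5% on the previous slide.

        import statsd

        metrics = statsd.StatsClient("localhost", 8125, prefix="webapp")

        def handle_request(request, process):
            metrics.incr("requests")
            try:
                return process(request)
            except Exception:
                metrics.incr("errors")   # alert when errors/requests exceeds ~5% per window
                raise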


  53. Trend


  54. Disk is 85% full

    Did it get that way over months?

    Did it get that way in one night?


  55. Trends are important

    Rates of change are important
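    A sketch of why the rate matters as much as the level: given two (timestamp, percent-used) samples, estimate how long until the disk fills. The numbers in the comments are illustrative.

        def hours_until_full(sample_a, sample_b, full_pct=100.0):
            """Estimate hours until 100% from two (unix_time, percent_used) samples."""
            (t0, pct0), (t1, pct1) = sample_a, sample_b
            rate_per_hour = (pct1 - pct0) / ((t1 - t0) / 3600.0)
            if rate_per_hour <= 0:
                return None                      # flat or shrinking: no ETA
            return (full_pct - pct1) / rate_per_hour

        # 80% -> 85% over a month: roughly 90 days of headroom; a warning at most.
        # 40% -> 85% overnight: a few hours left; page someone now.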


  56. Top crashes (count)

    Explosive crashes (trend)


  57. Performance


  58. Page load times

    Other component response times

    X items processed/minute
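    A sketch of instrumenting response times with the Python statsd client's timer (names are illustrative); the threshold/trend alerting then runs on the aggregated timings rather than inside the app.

        import statsd

        metrics = statsd.StatsClient("localhost", 8125, prefix="webapp")

        def render_report(report_id, build_report):
            # Records milliseconds per call; percentiles are computed upstream.
            with metrics.timer("views.report"):
                return build_report(report_id)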


  59. Tooling is improving

    Traditionally more for dev than ops

    Needs threshold/trend alerting


  60. Business


  61. Transactions/hour

    Conversion rate

    Volumes


  62. Performance monitors and

    multiple-levels-of-indirection monitors:

    Thresholds

    Trends

    Alerts
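    A sketch of automating one of those business-level monitors: compare this hour's transaction count with the same hour in recent weeks and alert on a large drop. The data source and the 25%/50% thresholds are illustrative.

        def check_transactions(current, same_hour_previous_weeks,
                               warn_drop=0.25, crit_drop=0.50):
            """Return (exit_code, message) comparing current volume to a baseline."""
            baseline = sum(same_hour_previous_weeks) / float(len(same_hour_previous_weeks))
            drop = (baseline - current) / baseline if baseline else 0.0
            if drop >= crit_drop:
                return 2, "CRITICAL: transactions down %.0f%% vs baseline" % (drop * 100)
            if drop >= warn_drop:
                return 1, "WARNING: transactions down %.0f%% vs baseline" % (drop * 100)
            return 0, "OK: %d transactions this hour (baseline %.0f)" % (current, baseline)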


  63. Often these exist in human form

    AUTOMATE

    Better a page than an angry boss/customer


  64. Testing


  65. (image-only slide)

  66. You’ve probably heard:

    Monitoring and testing converge


  67. Running tests on prod can be awesome

    except when it isn’t (Knight Capital)

    (be careful)


  68. two kinds:

    safe for prod

    not safe for prod (write, load, etc)


  69. Monitor as unit test:

    When you have an outage, add a monitor
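    A sketch of what that can look like: a prod-safe, read-only check written after an outage, runnable both as a test and as a monitor. The URL and the requests dependency are assumptions, not something from the talk.

        import sys
        import requests   # assumed third-party HTTP client

        def check_status(url="https://crash-stats.example.com/api/status"):
            resp = requests.get(url, timeout=10)
            if resp.status_code != 200:
                return 2, "CRITICAL: %s returned %d" % (url, resp.status_code)
            return 0, "OK: %s reachable" % url

        if __name__ == "__main__":
            code, message = check_status()
            print(message)
            sys.exit(code)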


  70. Resilience


  71. Gracefully degrade

    Decouple unreliable components

    Load/feature shed

    Automate recovery
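    A sketch of the first two items, graceful degradation around a decoupled, unreliable dependency; the names are illustrative.

        import logging

        log = logging.getLogger("resilience")

        def degrade_gracefully(fetch, fallback):
            """Return fetch()'s result, or a fallback if the dependency misbehaves."""
            try:
                return fetch()
            except Exception:
                log.warning("dependency failed; serving degraded response", exc_info=True)
                return fallback

        # e.g. sidebar = degrade_gracefully(fetch_recommendations, fallback=[])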


  72. What changes?

    Thresholds more interesting than single alerts

    Many alerts should become warnings

    (or risk alert fatigue)


  73. Questions?

    [email protected]

    @lxt
