Many Moving Parts: Monitoring Complex Systems

Monitoring a simple app with, say, a single web server and a database host is, well, simple. How do you approach monitoring complex systems where the technology stack has more pieces than you can count on both hands? Where the system diagram needs to have memes on it to stop you from crying? Where the business depends on your system being available?

Laura Thomson

October 17, 2014

Transcript

  1. Most of us don’t work on simple stuff... and if you do I hate you. (just kidding)
  3. Socorro. Very Large Array at Socorro, New Mexico, USA. Photo taken by Hajor, 08.Aug.2004. Released under cc.by.sa and/or GFDL. Source: http://en.wikipedia.org/wiki/File:USA.NM.VeryLargeArray.02.jpg
  7. 3000 crashes per minute; 3 million per day; crash size 150k - 20MB; ~800GB stored in PostgreSQL; ~110TB stored in HDFS
  8. Typically 4k build and 10k test jobs per day; topped 100k/day in August. New record for pushes/commits: 690 in a single day (also in August). hg.mozilla.org has: 3,445 unique repos; 32,123,211 total commits; 865,594 unique commits; 1,223 unique root commits. ~6000 machines. 80% of build and 50% of test in cloud. Strategic use of AWS spot instances saved 75% on our bill.
  9. Analytics tell you: “Is this load normal?” Visualizing healthy system performance makes it easier to find anomalies.
  10. You know where to look. You have a good idea about what to fix. Not always simple, but often well-defined.
  11. High level exception handlers. The thing you don’t know to monitor yet. The thing you don’t know how to monitor.
  12. DISK WARNING - free space: (% used). More files on disk than there ought to be. Running out of inodes.
  13. Disk is 85% full. Did it get that way over months? Did it get that way in one night?
  14. What changes? Thresholds more interesting than single alerts. Many alerts should become warnings (or risk alert fatigue).
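
The disk slides ask whether an 85% full disk got that way over months or in one night. One way to turn that question into a check is to look at the fill rate rather than the current value alone. The sketch below is illustrative only and is not taken from the talk: the names `Sample`, `fill_rate`, and `classify`, the 85% warning level, and the 24-hour urgency window are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    hours: float     # time of the reading, in hours since the first sample
    used_pct: float  # disk usage at that time, 0-100


def fill_rate(samples):
    """Percentage points gained per hour between the first and last sample."""
    first, last = samples[0], samples[-1]
    elapsed = last.hours - first.hours
    if elapsed <= 0:
        raise ValueError("need samples spanning a positive time range")
    return (last.used_pct - first.used_pct) / elapsed


def classify(samples, warn_pct=85.0, urgent_hours=24.0):
    """Return 'ok', 'warning', or 'alert'.

    Both thresholds are hypothetical defaults, not recommendations.
    """
    current = samples[-1].used_pct
    rate = fill_rate(samples)
    if rate > 0:
        # Linear extrapolation: hours until the disk reaches 100%.
        hours_to_full = (100.0 - current) / rate
        if hours_to_full <= urgent_hours:
            return "alert"
    if current >= warn_pct:
        return "warning"
    return "ok"


# Two disks that are both 85% full right now.
slow = [Sample(0, 70.0), Sample(24 * 90, 85.0)]  # 15 points over ~3 months
fast = [Sample(0, 40.0), Sample(8, 85.0)]        # 45 points in 8 hours

print(classify(slow))  # warning
print(classify(fast))  # alert
```

Both disks would trip the same static "85% used" check, but only the one that filled overnight pages someone; the slow one stays a warning, which is one way to limit alert fatigue. In a real check the samples might come from periodic readings such as `shutil.disk_usage`, and a linear extrapolation is only a rough estimate.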