Boston 2013 - Session - Laura Thomson

Monitorama
March 28, 2013

Transcript

  1. Most of us don’t work on simple stuff ...and if you do I hate you. (Wednesday, May 29, 13)
  2. Most of us don’t work on simple stuff ...and if you do I hate you. (just kidding)
  3. Some of our stuff looks like this (avert your eyes now if you are easily scared)
  4. Socorro. Very Large Array at Socorro, New Mexico, USA. Photo taken by Hajor, 08.Aug.2004. Released under cc.by.sa and/or GFDL. Source: http://en.wikipedia.org/wiki/File:USA.NM.VeryLargeArray.02.jpg
  5. > 120 physical boxes (not cloud). ~10 developers + DBAs + sysadmin team + QA + Hadoop ops.
  6. 3000 crashes per minute. 3 million per day. Crash size 150k - 20MB. ~800GB stored in PostgreSQL. ~110TB stored in HDFS.
  7. You know where to look. You have a good idea about what to fix. Not always simple, but often well-defined.
  8. FILE_AGE CRITICAL: blah.log is M seconds old. Last record in database N seconds ago.
  9. High level exception handlers. The thing you don’t know to monitor yet. The thing you don’t know how to monitor.
  10. You know where to start looking. You might have to look deeper too.
  11. DISK WARNING - free space: (% used). More files on disk than there ought to be.
  12. 1% errors = normal, expected. 5% errors = something bad is happening.
  13. Error rates. Count errors (statsd, etc) per window. Monitor on counts (rate).
  14. Disk is 85% full. Did it get that way over months? Did it get that way in one night?
  15. Tooling is improving. Traditionally more for dev than ops. Needs threshold/trend alerting for ops.
  16. Often these exist in human form. AUTOMATE. Better a page than an angry boss/customer.
  17. Running tests on prod can be awesome, except when it isn’t (Knight). (be careful)
  18. two kinds: safe for prod; not safe for prod (write, load, etc)
  19. Monitor as unit test: When you have a failure, add a monitor (coverage is hard to measure)
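
Items 12 to 14 of the transcript argue for alerting on rates and trends rather than on raw values. The following is a minimal Python sketch of that idea, not code from the talk or from Socorro. The function names, the OK/WARNING/CRITICAL labels, the use of the 1% and 5% figures as cut-offs, and the straight-line extrapolation between the first and last disk samples are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Figures from the slides: ~1% errors is normal, ~5% means something bad.
# Treating them as alert cut-offs is an assumption of this sketch.
NORMAL_RATE = 0.01
BAD_RATE = 0.05

SECONDS_PER_DAY = 86400.0


def error_rate_status(errors: int, total: int) -> str:
    """Classify one time window by its error rate, not its raw error count."""
    if total == 0:
        return "UNKNOWN"  # no events in the window, so no rate to judge
    rate = errors / total
    if rate >= BAD_RATE:
        return "CRITICAL"
    if rate > NORMAL_RATE:
        return "WARNING"
    return "OK"


def fill_rate_per_day(samples: List[Tuple[float, float]]) -> float:
    """Percentage points of disk filled per day.

    Each sample is (timestamp_in_seconds, percent_used). Only the first and
    last samples are used, which is a deliberately crude linear trend.
    """
    (t0, p0), (t1, p1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive amount of time")
    return (p1 - p0) / ((t1 - t0) / SECONDS_PER_DAY)


def days_until_full(samples: List[Tuple[float, float]]) -> Optional[float]:
    """Estimate days until 100% used, or None if usage is flat or shrinking."""
    rate = fill_rate_per_day(samples)
    if rate <= 0:
        return None
    return (100.0 - samples[-1][1]) / rate


# Two hypothetical histories that both end at "disk is 85% full".
over_months = [(0.0, 55.0), (90 * SECONDS_PER_DAY, 85.0)]
overnight = [(0.0, 55.0), (0.5 * SECONDS_PER_DAY, 85.0)]
```

Both sample histories would trip the same static 85% threshold, but `days_until_full(over_months)` comes out at roughly 45 days while `days_until_full(overnight)` is 0.25 days, which is the distinction item 14 is asking about. In practice the per-window counts would come from a metrics system such as the statsd mentioned in item 13; that wiring is left out here.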