Boston 2013 - Session - Laura Thomson

Monitorama
March 28, 2013

Transcript

  1. Many Moving Parts: Monitoring Complex Systems. laura@mozilla.com. Wednesday, May 29, 13
  2. Many Moving Parts: Monitoring Complex Systems (at roflscale)
  3. Confession: I’m not a sysadmin.
  4. (Why does devops == ops, anyway?)
  5. (no text on this slide)
  6. webapp, db: simple system, simple monitoring
  7. webapp, db, cache, slave: slightly more complex
  8. Most of us don’t work on simple stuff
  9. Most of us don’t work on simple stuff ...and if you do I hate you.
  10. Most of us don’t work on simple stuff ...and if you do I hate you. (just kidding)
  11. Some of our stuff looks like this
  12. Some of our stuff looks like this (avert your eyes now if you are easily scared)
  13. (no text on this slide)
  14. It’s actually more complicated than that
  15. Socorro. Very Large Array at Socorro, New Mexico, USA. Photo taken by Hajor, 08.Aug.2004. Released under cc.by.sa and/or GFDL. Source: http://en.wikipedia.org/wiki/File:USA.NM.VeryLargeArray.02.jpg
  16. (no text on this slide)
  17. (no text on this slide)
  18. (no text on this slide)
  19. Collection: collector, crashmover, filesystem, HBase
  20. Processing: HBase, PostgreSQL, ElasticSearch, monitor, processor, symbol store, minidumpstackwalk
  21. Reporting: HBase, PostgreSQL, ElasticSearch, middleware, webapp, memcache, crons, other data sources
  22. > 120 physical boxes (not cloud). ~10 developers + DBAs + sysadmin team + QA + Hadoop ops
  23. 3000 crashes per minute. 3 million per day. Crash size 150k - 20MB. ~800GB stored in PostgreSQL. ~110TB stored in HDFS
  24. Like many complex systems, it’s a data pipeline, or firehose, if you prefer
  25. (no text on this slide)
  26. (no text on this slide)
  27. Diagnostic, Indirect, Threshold, Trend, Performance, Business
  28. Diagnostic
  29. Host DOWN. 500 ISE. Replication lag
  30. You know where to look. You have a good idea about what to fix. Not always simple, but often well-defined
  31. Indirect
  32. FILE_AGE CRITICAL: blah.log is M seconds old. Last record in database N seconds ago
  33. Something is wrong. Maybe with the monitored component. Maybe somewhere upstream
  34. Why is this useful?
  35. High level exception handlers. The thing you don’t know to monitor yet. The thing you don’t know how to monitor
  36. You know where to start looking. You might have to look deeper too
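
The FILE_AGE alert on slide 32 can be sketched roughly as follows. This is a minimal illustration, not the check used in the talk: the function name `check_file_age`, its thresholds, and the `now` parameter are assumptions, and the 0/1/2 status codes follow the common Nagios plugin convention.

```python
import os
import time

# Nagios-style plugin status codes (a widely used convention).
OK, WARNING, CRITICAL = 0, 1, 2


def check_file_age(path, warn_seconds, crit_seconds, now=None):
    """Return (status, message) based on how long ago `path` was modified.

    Hypothetical helper: the name and parameters are illustrative only.
    `now` can be passed explicitly to make the check testable.
    """
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        # A missing file is treated as the worst case: nothing is writing it.
        return CRITICAL, f"FILE_AGE CRITICAL: {path} is missing or unreadable"
    if age >= crit_seconds:
        return CRITICAL, f"FILE_AGE CRITICAL: {path} is {int(age)} seconds old"
    if age >= warn_seconds:
        return WARNING, f"FILE_AGE WARNING: {path} is {int(age)} seconds old"
    return OK, f"FILE_AGE OK: {path} is {int(age)} seconds old"
```

A stale file says only that something stopped writing to it. The cause may be the monitored component or anything upstream of it, which is the point of slides 33 and 36.
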
  37. Threshold
  38. DISK WARNING - free space: (% used). More files on disk than there ought to be
  39. Sometimes simple (disk space). Sometimes complex root cause (files). Sometimes hard to measure
  40. 1% errors = normal, expected. 5% errors = something bad is happening
  41. Error rates: count errors (statsd, etc) per window. Monitor on counts (rate)
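
Slides 40 and 41 describe alerting on an error rate measured over a window. A minimal in-process sketch of that idea follows. The class name `ErrorRateMonitor`, the 60-second window, and the 5% default are assumptions made for illustration; a real deployment would more likely read the counts from statsd or a similar aggregator.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the fraction of failed events inside a sliding time window.

    Hypothetical class for illustration; not part of any library.
    """

    def __init__(self, window_seconds=60, alert_rate=0.05):
        self.window_seconds = window_seconds
        self.alert_rate = alert_rate
        self.events = deque()  # (timestamp, is_error), oldest first

    def record(self, timestamp, is_error):
        self.events.append((timestamp, is_error))
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()

    def error_rate(self, now):
        self._expire(now)
        if not self.events:
            return 0.0
        errors = sum(1 for _, is_error in self.events if is_error)
        return errors / len(self.events)

    def is_alerting(self, now):
        return self.error_rate(now) >= self.alert_rate
```

With these defaults, a steady 1% error rate stays quiet and a rate of 5% or more raises the alert, matching the "normal" and "something bad is happening" levels on slide 40.
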
  42. Trend
  43. Disk is 85% full. Did it get that way over months? Did it get that way in one night?
  44. Trends are important. Rates of change are important
  45. Top crashes (count). Explosive crashes (trend)
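
The disk question on slide 43 comes down to the rate of change rather than the current value. One way to sketch that is below, assuming samples of `(hours, percent_used)` pairs; the function names and the simple first-to-last slope are illustrative choices, not something taken from the talk.

```python
def fill_rate_per_hour(samples):
    """Average change in percent used per hour between first and last sample.

    `samples` is a list of (hours, percent_used) tuples, oldest first.
    Hypothetical helper for illustration.
    """
    (t0, p0), (t1, p1) = samples[0], samples[-1]
    if t1 == t0:
        raise ValueError("need samples taken at two different times")
    return (p1 - p0) / (t1 - t0)


def hours_until_full(samples):
    """Project when usage reaches 100%, or None if usage is flat or falling."""
    rate = fill_rate_per_hour(samples)
    if rate <= 0:
        return None
    return (100.0 - samples[-1][1]) / rate


# Both disks are 85% full, but the trend tells two different stories.
slow = [(0, 70.0), (720, 85.0)]   # 15 points over 30 days
fast = [(0, 40.0), (8, 85.0)]     # 45 points overnight
```

Here `hours_until_full(slow)` projects roughly 720 hours of headroom, while `hours_until_full(fast)` projects under three hours, so the same 85% reading warrants very different urgency.
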
  46. Performance
  47. Page load times. Other component response times. X items processed/minute
  48. Tooling is improving. Traditionally more for dev than ops. Needs threshold/trend alerting for ops
  49. Business
  50. Transactions/hour. Conversion rate. Volumes
  51. Just another performance monitor: Thresholds, Trends, Alerts
  52. Often these exist in human form. AUTOMATE. Better a page than an angry boss/customer
  53. (no text on this slide)
  54. You’ve probably heard: Monitoring and testing converge
  55. Running tests on prod can be awesome, except when it isn’t (Knight) (be careful)
  56. Two kinds: safe for prod; not safe for prod (write, load, etc)
  57. Monitor as unit test: when you have a failure, add a monitor (coverage is hard to measure)
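
The split on slide 56 between checks that are safe for production and checks that are not could be expressed as a small registry. This is a sketch under assumptions: the `check` decorator, the `run_checks` runner, and both example checks are hypothetical names invented for illustration, and the check bodies are placeholders.

```python
CHECKS = []  # (function, safe_for_prod) pairs, in registration order


def check(safe_for_prod):
    """Register a check and record whether it may run against production."""
    def register(func):
        CHECKS.append((func, safe_for_prod))
        return func
    return register


@check(safe_for_prod=True)
def homepage_responds():
    # Placeholder for a read-only probe, e.g. an HTTP GET.
    return True


@check(safe_for_prod=False)
def bulk_insert_load_test():
    # Placeholder for a write/load test that must stay off production.
    return True


def run_checks(environment):
    """Run every registered check allowed in `environment`."""
    results = {}
    for func, safe_for_prod in CHECKS:
        if environment == "prod" and not safe_for_prod:
            continue  # skip anything that writes or generates load
        results[func.__name__] = bool(func())
    return results
```

Following slide 57, each production failure would add one more registered check, much as a bug fix adds a regression test.
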
  58. Questions? laura@mozilla.com @lxt