Graphite & Friends

A talk about Graphite, its quirks, some tips for using, managing and scaling it, and some systems which integrate with it.

Mark Crossfield

April 29, 2014

Transcript

  1. MONITORING! GRAPHITE and friends

  2. Structure ☑ Why I’m here & what I’ll cover ☐ Context ☐ Monitoring ☐ Graphite ☐ Tips ☐ Managing / Scaling ☐ Feeding ☐ Front Ends

  3. context

  4. Mark Crossfield, @mrmanc http://markcrossfield.co.uk Auto Trader Engineer for 6 years, Continuous Delivery & Web

  5. Hiring! http://careers.autotrader.co.uk/

  6. AWESOME NEW OFFICES

  7. page impressions / month 1 billion

  8. unique monthly users 14.5 million

  9. searches / second at peak 1,500

  10. adverts 405k

  11. product & technology staff 323

  12. servers 2000

  13. code bases 250

  14. monitoring

  15. Velocity 2011 https://www.flickr.com/photos/kellan/5839797269/

  16. why monitor?

  17. None
  18. Edward A. Murphy, Jr.

  19. “Anything that can go wrong will go wrong”

    — Murphy’s Law
  20. design for failure

  21. “One Second Delay In Page-Load Can Cause 7% Loss In Customer Conversions” “47% of consumers expect a page to load in 2 seconds or less” http://blog.tagman.com/2012/03/just-one-second-delay-in-page-load-can-cause-7-loss-in-customer-conversions/

  22. 3 types of metric… http://code.flickr.net/2008/10/27/counting-timing/

  23. counters easy as pie to aggregate—just sum

  24. timers harder to aggregate, need percentiles; this can make scaling difficult

  25. gauges more nuanced—keep last value and use means
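
A minimal Python sketch of how the three metric types on the slides above might be aggregated over one interval. The function names are hypothetical, and the nearest-rank percentile is an assumption made for illustration; the talk does not say which method is used.

```python
def aggregate_counter(samples):
    """Counters: just sum the samples."""
    return sum(samples)


def aggregate_timer(samples, percentile=95):
    """Timers: take a percentile (nearest-rank method) of the raw samples."""
    ordered = sorted(samples)
    # Integer ceiling of percentile% of the sample count, at least 1.
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def aggregate_gauge(samples):
    """Gauges: keep the last value seen in the interval."""
    return samples[-1]
```

Counters from many hosts can be summed again later, but percentiles from different hosts cannot simply be combined, which is one reason timers make scaling harder.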

  26. monitoring dos & don’ts

  27. monitor user behaviour

  28. monitor business kpis

  29. monitor checkouts

  30. monitor enquiries

  31. monitor user experience

  32. monitor page load time

  33. monitor user errors

  34. monitor major changes

  35. monitor servers

  36. fine grained monitoring

  37. multivariate monitoring

  38. death by dashboards

  39. staring at charts, not code

  40. signal : noise

  41. mean means nothing

  42. percentiles
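
To illustrate why a mean can mislead where a percentile does not, here is a small sketch with made-up timings (the numbers are invented for the example, not taken from the talk).

```python
import statistics


def summarise(timings_ms):
    """Return mean, median and 95th percentile (nearest-rank) of the timings."""
    ordered = sorted(timings_ms)
    p95_rank = max(1, (95 * len(ordered) + 99) // 100)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[p95_rank - 1],
    }


# 18 fast requests and 2 very slow ones (hypothetical data).
summary = summarise([100] * 18 + [5000] * 2)
```

Here the mean lands at a value that no request actually experienced, while the median and the 95th percentile describe the typical and the slow requests respectively.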

  43. the graphite inquisition

  44. None
  45. “spikes”

  46. Monitoring != Alerting

  47. Alerting is complicated

  48. Alerting Repeat Delay

  49. Alerting Cross Host Roll Up

  50. Alerting Thresholds

  51. Alerting Three Sigma Thresholds

  52. Alerting Aberration Detection

  53. Alerting Acknowledgement

  54. Alerting Escalation

  55. Alerting Sample Size

  56. Alerting Timeshift

  57. Alerting SMS

  58. Alerting Email

  59. Alerting Warn

  60. Alerting Error

  61. Alerting Priority

  62. Alerting Scheduling

  63. Alerting Subscription

  64. Alerting Management
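
One of the alerting ideas listed above, three sigma thresholds, can be sketched in a few lines of Python. This is an illustrative assumption about what such a check might look like, not the alerting system used in the talk; the function names are hypothetical.

```python
import statistics


def three_sigma_bounds(history):
    """Return (low, high) bounds three standard deviations around the mean."""
    mean = statistics.mean(history)
    sigma = statistics.pstdev(history)
    return mean - 3 * sigma, mean + 3 * sigma


def is_aberrant(history, value):
    """True when the new value falls outside the three sigma band."""
    low, high = three_sigma_bounds(history)
    return value < low or value > high
```

A check like this still needs the other items on the list (sample size, repeat delay, escalation and so on) before it is useful as an alert.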

  65. graphite

  66. None
  67. Die, composer. Die.

  68. THESE ARE NOT THE DOCS YOU ARE LOOKING FOR

  69. Docs Moved

  70. Old, outdated docs / New docs

  71. Graphite Architecture

  72. aggregation precision reduces over time. One year: daily. One month: hourly. One week: 5min. One day: min.

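
The retention scheme on this slide can be written as a carbon-style retention string such as `1m:1d,5m:7d,1h:30d,1d:1y`. The sketch below is a hypothetical helper (not part of carbon or whisper) that works out how many points each archive would hold, assuming a 30-day month and a 365-day year.

```python
# Seconds per unit for single-letter suffixes (a simplification for this sketch).
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def parse_duration(text):
    """Convert a string such as '5m' or '30d' into seconds."""
    return int(text[:-1]) * UNIT_SECONDS[text[-1]]


def archive_sizes(retentions):
    """Return a list of (seconds_per_point, number_of_points) per archive."""
    sizes = []
    for archive in retentions.split(","):
        precision, history = archive.split(":")
        step = parse_duration(precision)
        sizes.append((step, parse_duration(history) // step))
    return sizes


sizes = archive_sizes("1m:1d,5m:7d,1h:30d,1d:1y")
```
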
  73. tips mostly plagiarised

  74. writing twice overwrites carbon does no aggregation for you

  75. feeding interval == graphite bucket this is no coincidence
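
Because carbon does no aggregation and a second write to the same bucket overwrites the first, a feeder has to combine samples per bucket itself. A minimal sketch, assuming counter-style metrics (summed) and carbon's plaintext protocol of `path value timestamp` lines; the host and port defaults and the function names are assumptions for the example.

```python
import socket
from collections import defaultdict


def bucket_start(timestamp, interval):
    """Floor a UNIX timestamp to the start of its bucket."""
    return int(timestamp) - int(timestamp) % interval


def preaggregate(samples, interval):
    """Sum (path, value, timestamp) samples per path and bucket."""
    totals = defaultdict(float)
    for path, value, timestamp in samples:
        totals[(path, bucket_start(timestamp, interval))] += value
    return dict(totals)


def to_plaintext(totals):
    """Render one 'path value timestamp' line per bucket."""
    return "".join(
        f"{path} {value} {timestamp}\n"
        for (path, timestamp), value in sorted(totals.items())
    )


def send(payload, host="localhost", port=2003):
    """Send the payload to a carbon plaintext listener (assumed address)."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(payload.encode("ascii"))
```

The `interval` passed in should match the finest retention configured for the metric, so that each bucket is written exactly once.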

  76. xFilesFactor sparse metrics might not appear http://obfuscurity.com/2012/04/Unhelpful-Graphite-Tip-9
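
A rough sketch of the xFilesFactor rule: when points are rolled up into a coarser archive, the aggregate is only stored if a large enough fraction of the source points is known. This is a simplified illustration rather than whisper's actual code; averaging is assumed as the aggregation method.

```python
def downsample(points, x_files_factor=0.5):
    """Average the known points, or return None if too few are known."""
    known = [p for p in points if p is not None]
    if not points or len(known) / len(points) < x_files_factor:
        return None
    return sum(known) / len(known)
```

A metric reported once every five minutes into one-minute buckets has only 20% of its points known, so with a factor of 0.5 it would vanish from the rolled-up archives.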

  77. carbon limits writing new metrics avoids swamping disk with write IO http://obfuscurity.com/2012/04/Unhelpful-Graphite-Tip-9

  78. graphite bookmarklet useful to load charts from images if your Graphite version is behind the times http://obfuscurity.com/2012/04/Unhelpful-Graphite-Tip-2

  79. timeShift(series, duration) e.g. show yesterday’s metric against today’s: timeShift(apache.http.requests, "-1day") http://graphite.readthedocs.org/en/0.9.12/functions.html#graphite.render.functions.timeShift

  80. groupByNode(series, node, aggregate) e.g. aggregate many series using one node: groupByNode(collectd.*.cpu.*.value, 2, "maxSeries") http://graphite.readthedocs.org/en/0.9.12/functions.html#graphite.render.functions.groupByNode

  81. cumulative(seriesList) show how a rate (e.g. per sec) adds up over the day http://graphite.readthedocs.org/en/0.9.12/functions.html#graphite.render.functions.cumulative

  82. host.cpu-[0-7].cpu-{user,system}.value wild cards allow filtering of nodes http://graphite.readthedocs.org/en/0.9.12/terminology.html#term-series-list

  83. teatime is 4pm Graphite understands the UNIX at-style time specification e.g. noon yesterday, now-2weeks http://oss.oetiker.ch/rrdtool/doc/rrdfetch.en.html#IAT_STYLE_TIME_SPECIFICATION

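
The functions and time specifications in the tips above are all passed to Graphite's render API as URL parameters (`target`, `from`, `until`, `format`). A small sketch of building such a URL; the base URL `http://graphite.example.com` and the helper name are hypothetical.

```python
from urllib.parse import urlencode


def render_url(base, targets, start="-24h", until="now", fmt="json"):
    """Build a render API URL with one 'target' parameter per expression."""
    params = [("target", target) for target in targets]
    params += [("from", start), ("until", until), ("format", fmt)]
    return f"{base}/render?{urlencode(params)}"


url = render_url(
    "http://graphite.example.com",  # hypothetical host
    ["apache.http.requests", 'timeShift(apache.http.requests, "-1day")'],
    start="noon yesterday",
)
```
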
  84. monitor carbon carbon records its own metrics per minute to graphite http://obfuscurity.com/2012/06/Watching-the-Carbon-Feed

  85. whisper-*.py create, dump, fetch, info, merge, resize, set-aggregation-method, update, diff https://github.com/graphite-project/whisper

  86. summarize be cautious around aggregation boundaries http://graphite.readthedocs.org/en/0.9.12/functions.html#graphite.render.functions.summarize
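
A toy illustration of why boundaries matter when summarizing. This is a simplified stand-in, not Graphite's implementation: points are summed into fixed-size buckets that either align to multiples of the interval or, with `align_to_from`, to the first point.

```python
def summarize(points, interval, align_to_from=False):
    """Sum (timestamp, value) points into buckets keyed by bucket start."""
    if not points:
        return {}
    origin = points[0][0] if align_to_from else 0
    buckets = {}
    for timestamp, value in points:
        start = timestamp - (timestamp - origin) % interval
        buckets[start] = buckets.get(start, 0) + value
    return dict(sorted(buckets.items()))


# Two hours of one-per-minute points, starting half past the hour.
points = [(1800 + 60 * i, 1) for i in range(120)]
hourly = summarize(points, 3600)
hourly_from_start = summarize(points, 3600, align_to_from=True)
```

With interval-aligned buckets the first and last hours are only half full, so they look like dips even though the underlying rate is constant.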

  87. holt winters intelligent aberration detection http://graphite.readthedocs.org/en/0.9.12/functions.html#graphite.render.functions.holtWintersAberration

  88. events(*tags) number of events matching tags at this point in time use to annotate your charts e.g. events("deploy", "change") http://graphite.readthedocs.org/en/0.9.12/functions.html#graphite.render.functions.events

  89. None
  90. None
  91. managing / scaling

  92. sqlite do not run in production

  93. sqlite migrating from is horrendous

  94. sqlite scaling doesn’t work

  95. manage.py helps you migrate django https://docs.djangoproject.com/en/dev/ref/django-admin/

  96. scale with one box while you can…

  97. Graphite Architecture

  98. relay distributes to many caches

  99. sharding relay distributes with consistent hashing

  100. scaling with sharding requires rebalance
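
A minimal consistent hash ring, to show why sharding this way needs a rebalance when a cache is added. This is a generic sketch, not carbon's own ring implementation; the class name, the MD5-based position function and the replica count are assumptions for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Map metric names to nodes via positions on a hash ring."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _position(key):
        # First 32 bits of the MD5 digest as an integer ring position.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest()[:8], 16)

    def add(self, node):
        """Place several virtual points for the node on the ring."""
        for i in range(self.replicas):
            bisect.insort(self._ring, (self._position(f"{node}:{i}"), node))

    def node_for(self, metric):
        """Return the node owning the first point at or after the metric."""
        index = bisect.bisect(self._ring, (self._position(metric), ""))
        return self._ring[index % len(self._ring)][1]
```

When a node is added, only the metrics that now hash to it change owner, but their existing data still lives on the old node until it is moved, hence the rebalance.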

  101. None
  102. None
  103. metrics relayed / minute 300,000

  104. logster metrics 30,000

  105. write iops / second / cache 8,000 (~8% of total activity)

  106. metrics to date 193,000

  107. carbon docs leave a lot to be desired

  108. blogs only slightly contradictory

  109. http://bitprophet.org/blog/2013/03/07/graphite/

  110. https://gist.github.com/obfuscurity/63399584ea4d95f921e4

  111. https://answers.launchpad.net/graphite/+question/178969

  112. http://grey-boundary.com/the-architecture-of-clustering-graphite/

  113. carbonate provides missing bits of carbon to assist scaling https://github.com/jssjr/carbonate

  114. carbon > mega carbon, whisper > ceres coming in graphite v0.10

  115. graphite-api functional goodness without the front end https://github.com/brutasse/graphite-api

  116. @obfuscurity great blogger and an authority on graphite http://obfuscurity.com/

  117. synthesize vagrant environment to experiment with https://github.com/obfuscurity/synthesize

  118. pip apparently easy installation

  119. None
  120. feeding need input

  121. label your axes, before it is too late: talk about units, frequencies, and include in the metric name

  122. Logster don’t let it be your golden hammer https://github.com/etsy/logster

  123. mmmmm python.

  124. logster invocations / minute 2,700

  125. lines parsed / second 11,000

  126. parsers process log deltas when invoked

  127. MetricLogster generalised log format for counts, times & gauges

  128. METRIC_COUNT metric=web.searches value=1
 METRIC_TIME metric=web.search_time value=12ms
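
A sketch of how log lines in the format above might be parsed in Python. It does not use Logster's own parser classes; the regular expression, the function name and the `METRIC_GAUGE` prefix are assumptions (the slide only shows `METRIC_COUNT` and `METRIC_TIME`).

```python
import re

# METRIC_GAUGE is assumed by analogy; only COUNT and TIME appear on the slide.
METRIC_LINE = re.compile(
    r"METRIC_(?P<kind>COUNT|TIME|GAUGE) "
    r"metric=(?P<name>\S+) "
    r"value=(?P<value>-?\d+(?:\.\d+)?)(?P<unit>[a-z]*)"
)


def parse_metric_line(line):
    """Return (kind, name, value, unit), or None if the line has no metric."""
    match = METRIC_LINE.search(line)
    if match is None:
        return None
    return (
        match.group("kind"),
        match.group("name"),
        float(match.group("value")),
        match.group("unit"),
    )
```

Using `search` rather than `match` lets the metric appear after a timestamp or log level prefix on the line.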

  129. statsd aggregates metrics in real time and sends to graphite
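
For reference, a sketch of what feeding statsd looks like from Python: each metric is a small UDP datagram of the form `name:value|type`, where the type is `c` (counter), `ms` (timer) or `g` (gauge). The helper names and the `localhost:8125` default are assumptions for the example.

```python
import socket

TYPE_SUFFIX = {"counter": "c", "timer": "ms", "gauge": "g"}


def statsd_packet(name, value, kind):
    """Format one statsd datagram, e.g. b'web.searches:1|c'."""
    return f"{name}:{value}|{TYPE_SUFFIX[kind]}".encode("ascii")


def send_udp(packet, host="localhost", port=8125):
    """Fire-and-forget send to a statsd listener (assumed address)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))
```
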

  130. scaled / ha? you’ll need to distribute and mirror the packets yourself

  131. event streaming consider something like riemann to decouple https://github.com/jdmaturen/reimann

  132. collectd can collect and send server metrics to graphite

  133. community actively producing lots of collectd plugins: apache, cpu, memory, disk etc

  134. 10s Frequency! be careful that collectd doesn’t flood graphite

  135. metrics by coda hale.

  136. http://metrics.codahale.com/

  137. graphite support, along with ganglia, slf4j etc

  138. instrument your java components.

  139. aggregates timers, histograms etc

  140. no statsd unless you need cross host aggregation

  141. actually this is awesome and we really ought to use it

  142. front ends

  143. image api for individual charts

  144. http://grafana.org/

  145. no backend awesome concept, better with Elasticsearch

  146. replaces composer graphite expression parser and builder

  147. interactive explore values, apply filters

  148. time synchronized windows across charts

  149. None
  150. None
  151. None
  152. None
  153. None
  154. ask me later for a demo

  155. tasseo quick and simple ruby dashboard https://github.com/obfuscurity/tasseo

  156. http://shopify.github.io/dashing/

  157. Etsy Kale = Skyline (anomaly detection) & Oculus (correlation) http://codeascraft.com/2013/06/11/introducing-kale/

  158. http://zachholman.com/posts/slide-design-for-developers/

  159. QUESTIONS?

  160. MONITORING!

  161. Dublin? http://monitorama.eu/#speakers Portland. http://monitorama.com/#speakers