DevOpsPorto Meetup5: The Road to Monitoring Nirvana by Pedro Araújo

Talk delivered by Pedro Araújo

DevOpsPorto

June 13, 2017

Transcript

  1. The road to monitoring Nirvana June 2017 Pedro Araújo

  2. Who am I? Studied Computer Engineering. Did web development for a couple of years. Moved to systems administration for a couple of years. Had a run at build and automation engineering. Landed in SRE, <3 it
  3. 135 million daily transactions
    4.7 billion daily API calls most weeks (55k/s, 100k/s peak)
    2.5 terabytes of daily log data output
    250,000-per-second time series data points
    14k nagios checks
  4. What this talk is not going to be about

  5. What is monitoring? Different things to different people

  6. Everyone starts with (nagios-style) checks Also, black-box monitoring
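
A Nagios-style check is a small program that tests one thing and reports its state through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The following is a minimal sketch of such a check for disk usage. The thresholds and the names `evaluate`, `check_disk` and `main` are illustrative assumptions, not taken from the talk.

```python
import shutil

# Nagios plugin convention: the exit code carries the check status.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(used_pct, warn=80.0, crit=90.0):
    """Map a disk usage percentage to a (status, message) pair.

    The 80/90 thresholds are hypothetical defaults.
    """
    if used_pct >= crit:
        return CRITICAL, f"DISK CRITICAL - {used_pct:.1f}% used"
    if used_pct >= warn:
        return WARNING, f"DISK WARNING - {used_pct:.1f}% used"
    return OK, f"DISK OK - {used_pct:.1f}% used"


def check_disk(path="/"):
    """Measure usage of the filesystem holding `path` and evaluate it."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # The check itself failed, so the state is unknown rather than bad.
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    return evaluate(100.0 * usage.used / usage.total)


def main():
    """Print the one-line status message and return the exit code."""
    status, message = check_disk()
    print(message)
    return status
```

A real plugin would end with `sys.exit(main())` so that the scheduler sees the status. Note that the check only yields a state at the moment it runs, and no history is kept.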

  7. (from: https://www.novell.com/coolsolutions/feature/16723.html)

  8. None
  9. Logs give you initial insights at the cost of brittle and heavy scripting
  10. RSS_URI="/rss"
        LOG_FILE="/var/log/httpd/access_log"
        LOG_DATE_FORMAT="%d/%b/%Y"
        DATE="-1 day"
        LOG_FDATE=`date -d "$DATE" "+${LOG_DATE_FORMAT}"`

        # Unique IPs requesting RSS, except those reporting "subscribers":
        IPSUBS=$( fgrep "$LOG_FDATE" "$LOG_FILE" \
          | fgrep " $RSS_URI" \
          | egrep -v '[0-9]+ subscribers' \
          | cut -d' ' -f 1 \
          | sort \
          | uniq \
          | wc -l )

        # Other user-agents reporting "subscribers", for which we'll use the entire
        # user-agent string for uniqueness:
        OTHERSUBS=$( fgrep "$LOG_FDATE" "$LOG_FILE" \
          | fgrep " $RSS_URI" \
          | fgrep -v 'subscribers; feed-id=' \
          | egrep '[0-9]+ subscribers' \
          | egrep -o '"[^"]+"$' \
          | sort -t\( -k2 -sr \
          | awk '!x[$1]++' \
          | egrep -o '[0-9]+ subscribers' \
          | awk '{s+=$1} END {print s}' )

    (from: https://gist.github.com/marcoarment/3783146 (abbr.))
  11. Client-side analytics
    Online: Crash reporting, Synthetic monitoring, RUM
    Offline: Detailed logging, Round-Robin Databases
  12. (from: https://megalytic.com/blog/tips-for-segmenting-stats-by-geography)

  13. (from: https://piwik.org/docs/piwik-tour/#toc-dashboard-widgets)

  14. Time series are like check results saved continuously. But we can do so much more
  15. None
  16. Named value at some time. Metric identity (name, dimensions), metric value, timestamp

        os.filesystem.size 1486469296 961130496 mount=/ type=Used
        os.filesystem.size 1486469296 8903143424 mount=/ type=Free
        os.filesystem.size 1486469296 1143103488 mount=/var type=Used
        os.filesystem.size 1486469296 249068044288 mount=/var type=Free
        os.filesystem.size 1486469296 0 mount=/dev/shm type=Used
        os.filesystem.size 1486469296 50682404864 mount=/dev/shm type=Free
        os.filesystem.size-inodes 1486469296 18862 mount=/ type=Used
        os.filesystem.size-inodes 1486469296 2602578 mount=/ type=Free
        os.filesystem.size-inodes 1486469296 14518 mount=/var type=Used
        os.filesystem.size-inodes 1486469296 66504522 mount=/var type=Free
        os.filesystem.size-inodes 1486469296 1 mount=/dev/shm type=Used
        os.filesystem.size-inodes 1486469296 12373633 mount=/dev/shm type=Free
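
The sample lines on this slide read as name, timestamp, value, then `key=value` dimensions. As a rough illustration of that data model, the sketch below parses one such line and derives the series identity from the name plus dimensions. `DataPoint`, `parse_datapoint` and `series_key` are hypothetical names introduced here, not part of any particular time series database.

```python
from typing import Dict, NamedTuple, Tuple


class DataPoint(NamedTuple):
    name: str                    # metric name, e.g. os.filesystem.size
    timestamp: int               # Unix time in seconds
    value: float                 # the measured value
    dimensions: Dict[str, str]   # e.g. {"mount": "/", "type": "Used"}


def parse_datapoint(line: str) -> DataPoint:
    """Parse 'name timestamp value key=value ...' into a DataPoint."""
    name, timestamp, value, *pairs = line.split()
    dimensions = dict(pair.split("=", 1) for pair in pairs)
    return DataPoint(name, int(timestamp), float(value), dimensions)


def series_key(point: DataPoint) -> Tuple[str, Tuple[Tuple[str, str], ...]]:
    """Metric identity: the name plus its (sorted) dimensions.

    Timestamp and value are deliberately excluded, so every sample of
    the same series maps to the same key.
    """
    return (point.name, tuple(sorted(point.dimensions.items())))
```

Two samples with the same name and dimensions but different timestamps share a `series_key`, which is what lets a store append them to a single series.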
  17. None
  18. Common datapoint types Counters

  19. Counter example

  20. Counter rated example
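
A counter only ever increases, so it is usually graphed as a rate: the increase between consecutive samples divided by the elapsed time. The sketch below shows one simple way to do that, assuming that a drop in value means the counter restarted from zero. The function name `counter_rates` and the sample data are illustrative.

```python
def counter_rates(samples):
    """Turn counter samples into per-second rates.

    `samples` is a list of (timestamp_seconds, counter_value) pairs
    sorted by time. Returns a list of (timestamp, rate) pairs, one per
    interval between consecutive samples.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order samples
        # A decrease means the counter was reset (e.g. a process restart);
        # assume it restarted from zero, so the increase is v1 itself.
        increase = v1 - v0 if v1 >= v0 else v1
        rates.append((t1, increase / elapsed))
    return rates


# Hypothetical request counter sampled every 10 seconds, with a reset at t=30.
samples = [(0, 100), (10, 150), (20, 250), (30, 20)]
print(counter_rates(samples))
```

Storing the raw counter and computing the rate at query time means a missed sample loses resolution, but not the events that happened in between.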

  21. Common datapoint types Gauges

  22. Gauge example

  23. Advanced datapoint types Histogram

  24. None
  25. Example histogram sample

        # HELP prometheus_local_storage_series_chunks_persisted The number of chunks persisted per series.
        # TYPE prometheus_local_storage_series_chunks_persisted histogram
        prometheus_local_storage_series_chunks_persisted_bucket{le="1"} 3.205911e+06
        prometheus_local_storage_series_chunks_persisted_bucket{le="2"} 3.652375e+06
        prometheus_local_storage_series_chunks_persisted_bucket{le="4"} 4.405614e+06
        prometheus_local_storage_series_chunks_persisted_bucket{le="8"} 5.66866e+06
        prometheus_local_storage_series_chunks_persisted_bucket{le="16"} 8.226382e+06
        prometheus_local_storage_series_chunks_persisted_bucket{le="32"} 8.73615e+06
        prometheus_local_storage_series_chunks_persisted_bucket{le="64"} 8.770525e+06
        prometheus_local_storage_series_chunks_persisted_bucket{le="128"} 8.770525e+06
        prometheus_local_storage_series_chunks_persisted_bucket{le="+Inf"} 8.770525e+06
        prometheus_local_storage_series_chunks_persisted_sum 5.5495433e+07
        prometheus_local_storage_series_chunks_persisted_count 8.770525e+06
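
The buckets in this sample are cumulative: each `le` bucket counts every observation less than or equal to its bound. The sketch below uses the numbers from the slide to derive a mean from `_sum` and `_count`, and to estimate a quantile by linear interpolation inside the bucket that contains the target rank. This is a simplified illustration of the general idea, not the exact algorithm of any particular monitoring system; `mean` and `quantile` are names chosen here.

```python
import math

# (upper bound, cumulative count) pairs copied from the sample on the slide.
BUCKETS = [
    (1, 3.205911e06), (2, 3.652375e06), (4, 4.405614e06), (8, 5.66866e06),
    (16, 8.226382e06), (32, 8.73615e06), (64, 8.770525e06),
    (128, 8.770525e06), (math.inf, 8.770525e06),
]
SUM = 5.5495433e07
COUNT = 8.770525e06


def mean(total, count):
    """Average observation: the _sum series divided by the _count series."""
    return total / count


def quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets.

    Assumes observations are spread evenly inside each bucket and that
    the lowest bucket starts at 0.
    """
    rank = q * buckets[-1][1]
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper_bound):
                # No finite upper bound to interpolate towards.
                return lower_bound
            in_bucket = cumulative - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return lower_bound


print("mean chunks per series:", mean(SUM, COUNT))
print("estimated median:", quantile(0.5, BUCKETS))
print("estimated p99:", quantile(0.99, BUCKETS))
```

The estimate can only be as precise as the bucket boundaries allow, which is why bucket layout matters when instrumenting.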
  26. Histogram example

  27. White-box monitoring Detailed insight via native instrumentation

  28. You can roll your own instrumentation.

        import time
        from contextlib import contextmanager

        # `increment`, `queue`, `compute_route` and `db` are assumed to be
        # defined elsewhere in the application.

        @contextmanager
        def op(what):
            start = time.time()
            yield
            # Add the elapsed seconds to a counter tagged with the operation name.
            increment('hitcount.total_s', value=(time.time() - start), tags=["op:" + what])

        while True:
            with op('receive'):
                req = queue.pop()
            with op('compute_route'):
                route = compute_route(req)
            with op('update'):
                db.execute('''
                    UPDATE hitcount
                    SET hits = hits + 1
                    WHERE route = ?
                ''', (route, ))
            with op('finish'):
                req.finish()

    (from: https://honeycomb.io/blog/2017/01/instrumentation-measuring-capacity-through-utilization/)
  29. Aspects help instrumentation.

        @Controller
        public class MyController {
          @RequestMapping("/")
          @TimeMethod(name = "app_duration_seconds", help = "Some helpful info here")
          public Object handleMain() {
            // Do something
          }
        }

        # Assuming the Prometheus Python client library (prometheus_client).
        from prometheus_client import Counter, Histogram

        c = Counter('request_failure_total', 'Description of counter')
        h = Histogram('request_latency_seconds', 'Description of histogram')

        @c.count_exceptions()
        @h.time()
        def businessFunction():
            # Do something
            pass
  30. Collectors or, how to get all that interesting data

  31. Event-based monitoring

  32. grep'ing logs across hosts is hard. I CAN HAZ LOG AGGREGATION
  33. (from: https://dzone.com/articles/getting-started-splunk)

  34. (from: http://blog.takipi.com/splunk-vs-elk-the-log-management-tools-decision-making-guide/)

  35. Micro-services - perf debugging is hard. Distributed tracing to the rescue
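
The core idea behind distributed tracing is that every unit of work records a span, and spans share a trace ID and point at their parent so the whole request can be reassembled later. The sketch below illustrates that in a single process, assuming an in-memory list in place of a tracing backend. It does not use the OpenTracing API; `span`, `FINISHED` and `handle_request` are hypothetical names.

```python
import time
import uuid
from contextlib import contextmanager

FINISHED = []  # stand-in for a tracing backend that collects finished spans


@contextmanager
def span(operation, parent=None):
    """Record one timed unit of work, linked to its parent span if any."""
    record = {
        # Child spans inherit the trace ID; a root span starts a new trace.
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"] if parent else None,
        "operation": operation,
        "start": time.time(),
    }
    try:
        yield record
    finally:
        record["duration_s"] = time.time() - record["start"]
        FINISHED.append(record)


def handle_request():
    """Simulate a request that fans out to two downstream operations."""
    with span("frontend.request") as root:
        with span("auth.check", parent=root):
            time.sleep(0.01)  # pretend to call an auth service
        with span("db.query", parent=root):
            time.sleep(0.02)  # pretend to query a database
    return root
```

Across real services the trace ID and parent span ID would travel with each request (typically in headers), so that spans emitted by different processes can be joined into one trace.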
  36. (from: http://opentracing.io/documentation/)

  37. (from: http://opentracing.io/documentation/)

  38. (from: http://jaeger.readthedocs.io/en/latest/#trace-detail-view)

  39. None
  40. Advanced visualisations

  41. (from: https://www.circonus.com/2012/09/understanding-data-with-histograms/)

  42. (from: http://www.brendangregg.com/frequencytrails.html)

  43. Anomaly detection

  44. (from: https://eng.uber.com/argos/)

  45. Alerting not covered here

  46. Conclusion

  47. None
  48. Thank you Pedro Araújo https://keybase.io/phcrva