DevOpsPorto Meetup5: The Road to Monitoring Nirvana by Pedro Araújo

Talk delivered by Pedro Araújo

DevOpsPorto

June 13, 2017

Transcript

  1. The road to monitoring Nirvana
    June 2017
    Pedro Araújo

  2. Who am I?
    Studied Computer Engineering
    Did web development for a couple of years
    Moved to systems administration for a couple of years
    Had a run at build and automation engineering
    Landed in SRE, <3 it

  3. 135 million daily transactions
    4.7 billion API calls per day in most weeks (55k/s average, 100k/s peak)
    2.5 terabytes of log data per day
    250,000 time series data points per second
    14k Nagios checks

  4. What this talk is not going to be about

  5. What is monitoring?
    Different things to different people

  6. Everyone starts with (Nagios-style) checks
    Also, black-box monitoring
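
A minimal sketch of what such a check can look like, assuming the usual Nagios plugin convention of status codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN). The function names, thresholds and message format are illustrative and not taken from the talk.

```python
import shutil

# Status codes used by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(used_percent, warn, crit):
    """Map a usage percentage to a (status code, message) pair."""
    if used_percent >= crit:
        return CRITICAL, f"DISK CRITICAL - {used_percent:.1f}% used"
    if used_percent >= warn:
        return WARNING, f"DISK WARNING - {used_percent:.1f}% used"
    return OK, f"DISK OK - {used_percent:.1f}% used"


def check_disk(path="/", warn=80.0, crit=90.0):
    """Check disk usage at `path` (hypothetical thresholds, in percent)."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # The check itself failed, so the state of the disk is unknown.
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    return evaluate(100.0 * usage.used / usage.total, warn, crit)


status, message = check_disk("/")
print(status, message)
```

A real plugin would print the message and exit with the status code. Note that only the latest state survives each run, which is the limitation time series address later in the deck.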

  7. (from: https://www.novell.com/coolsolutions/feature/16723.html)

  9. Logs give you initial insights
    at the cost of brittle and heavy scripting

  10. RSS_URI="/rss"
    LOG_FILE="/var/log/httpd/access_log"
    LOG_DATE_FORMAT="%d/%b/%Y"
    DATE="-1 day"
    LOG_FDATE=$(date -d "$DATE" "+${LOG_DATE_FORMAT}")
    # Unique IPs requesting RSS, except those reporting "subscribers":
    IPSUBS=$(
        fgrep "$LOG_FDATE" "$LOG_FILE" \
        | fgrep " $RSS_URI" \
        | egrep -v '[0-9]+ subscribers' \
        | cut -d' ' -f 1 \
        | sort \
        | uniq \
        | wc -l
    )
    # Other user-agents reporting "subscribers", for which we'll use the entire
    # user-agent string for uniqueness:
    OTHERSUBS=$(
        fgrep "$LOG_FDATE" "$LOG_FILE" \
        | fgrep " $RSS_URI" \
        | fgrep -v 'subscribers; feed-id=' \
        | egrep '[0-9]+ subscribers' \
        | egrep -o '"[^"]+"$' \
        | sort -t\( -k2 -sr \
        | awk '!x[$1]++' \
        | egrep -o '[0-9]+ subscribers' \
        | awk '{s+=$1} END {print s}'
    )
    (from: https://gist.github.com/marcoarment/3783146 (abbr.))

  11. Client-side analytics
    Online
      Crash reporting
      Synthetic monitoring
      RUM (real user monitoring)
    Offline
      Detailed logging
      Round-Robin Databases

  12. (from: https://megalytic.com/blog/tips-for-segmenting-stats-by-geography)

  13. (from: https://piwik.org/docs/piwik-tour/#toc-dashboard-widgets)

  14. Time series are like check results saved continuously
    But we can do so much more

  16. Named value at some time.
    Metric identity
      name
      dimensions
    Metric value
    Timestamp
    os.filesystem.size 1486469296 961130496 mount=/ type=Used
    os.filesystem.size 1486469296 8903143424 mount=/ type=Free
    os.filesystem.size 1486469296 1143103488 mount=/var type=Used
    os.filesystem.size 1486469296 249068044288 mount=/var type=Free
    os.filesystem.size 1486469296 0 mount=/dev/shm type=Used
    os.filesystem.size 1486469296 50682404864 mount=/dev/shm type=Free
    os.filesystem.size-inodes 1486469296 18862 mount=/ type=Used
    os.filesystem.size-inodes 1486469296 2602578 mount=/ type=Free
    os.filesystem.size-inodes 1486469296 14518 mount=/var type=Used
    os.filesystem.size-inodes 1486469296 66504522 mount=/var type=Free
    os.filesystem.size-inodes 1486469296 1 mount=/dev/shm type=Used
    os.filesystem.size-inodes 1486469296 12373633 mount=/dev/shm type=Free
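
The sample lines above read as "name timestamp value key=value ...". A minimal sketch of splitting one line into those parts, assuming whitespace-separated fields in exactly that order. `DataPoint`, `parse_line` and `series_key` are illustrative names, not part of any tool mentioned in the talk.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class DataPoint:
    name: str
    timestamp: int              # Unix time in seconds
    value: float
    dimensions: Dict[str, str]


def parse_line(line: str) -> DataPoint:
    """Parse 'name timestamp value key=value ...' into a DataPoint."""
    name, timestamp, value, *tags = line.split()
    # Split each tag on the first '=' only, so values may contain '='.
    dimensions = dict(tag.split("=", 1) for tag in tags)
    return DataPoint(name, int(timestamp), float(value), dimensions)


def series_key(point: DataPoint) -> Tuple:
    """Metric identity: the name plus the sorted dimensions."""
    return (point.name, tuple(sorted(point.dimensions.items())))


point = parse_line("os.filesystem.size 1486469296 961130496 mount=/ type=Used")
print(series_key(point), point.timestamp, point.value)
```

Two data points belong to the same time series exactly when their `series_key` matches, which is why `mount=/ type=Used` and `mount=/ type=Free` are tracked separately.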

  18. Common datapoint types
    Counters

  19. Counter example

  20. Counter example with a rate applied
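
A rough sketch of how a counter can be turned into a rate: divide the increase between consecutive samples by the elapsed time. The sample data is made up, and treating any drop as a restart from zero is an assumption about how the counter behaves.

```python
from typing import List, Tuple

Sample = Tuple[int, float]  # (Unix timestamp in seconds, counter value)


def rates(samples: List[Sample]) -> List[Tuple[int, float]]:
    """Convert consecutive counter samples into per-second rates."""
    out = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t1 <= t0:
            continue  # skip duplicate or out-of-order samples
        # Assumption: a drop means the counter restarted from zero,
        # so the new value is the whole increase since the restart.
        delta = v1 - v0 if v1 >= v0 else v1
        out.append((t1, delta / (t1 - t0)))
    return out


# Made-up samples taken every 10 seconds, with a restart before t=30.
samples = [(0, 100.0), (10, 150.0), (20, 250.0), (30, 20.0)]
print(rates(samples))
```

The raw counter only ever grows (until a restart), so it is the rate that is usually worth graphing.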

  21. Common datapoint types
    Gauges

  22. Gauge example

  23. Advanced datapoint types
    Histogram

  25. Example histogram sample
    # HELP prometheus_local_storage_series_chunks_persisted The number of chunks persisted per series.
    # TYPE prometheus_local_storage_series_chunks_persisted histogram
    prometheus_local_storage_series_chunks_persisted_bucket{le="1"} 3.205911e+06
    prometheus_local_storage_series_chunks_persisted_bucket{le="2"} 3.652375e+06
    prometheus_local_storage_series_chunks_persisted_bucket{le="4"} 4.405614e+06
    prometheus_local_storage_series_chunks_persisted_bucket{le="8"} 5.66866e+06
    prometheus_local_storage_series_chunks_persisted_bucket{le="16"} 8.226382e+06
    prometheus_local_storage_series_chunks_persisted_bucket{le="32"} 8.73615e+06
    prometheus_local_storage_series_chunks_persisted_bucket{le="64"} 8.770525e+06
    prometheus_local_storage_series_chunks_persisted_bucket{le="128"} 8.770525e+06
    prometheus_local_storage_series_chunks_persisted_bucket{le="+Inf"} 8.770525e+06
    prometheus_local_storage_series_chunks_persisted_sum 5.5495433e+07
    prometheus_local_storage_series_chunks_persisted_count 8.770525e+06
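
A sketch of what can be derived from the sample above: the mean comes from `_sum / _count`, and a quantile can be estimated from the cumulative `le` buckets. The linear interpolation inside a bucket is an assumption about how values are spread, so the result is an estimate only. `mean` and `quantile` are illustrative helper names.

```python
import math
from typing import List, Tuple

# (upper bound "le", cumulative count) pairs copied from the sample above.
BUCKETS: List[Tuple[float, float]] = [
    (1, 3.205911e+06), (2, 3.652375e+06), (4, 4.405614e+06),
    (8, 5.66866e+06), (16, 8.226382e+06), (32, 8.73615e+06),
    (64, 8.770525e+06), (128, 8.770525e+06), (math.inf, 8.770525e+06),
]
TOTAL_SUM = 5.5495433e+07
TOTAL_COUNT = 8.770525e+06


def mean(total_sum: float, total_count: float) -> float:
    """Average observation: sum of all values over their count."""
    return total_sum / total_count


def quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets."""
    rank = q * buckets[-1][1]
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # cannot interpolate into +Inf
            if count == lower_count:
                return upper_bound  # empty bucket, nothing to interpolate
            # Assume values are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


print("mean chunks per series:", mean(TOTAL_SUM, TOTAL_COUNT))
print("estimated median:", quantile(0.5, BUCKETS))
print("estimated p99:", quantile(0.99, BUCKETS))
```

For this sample the median lands in the (2, 4] bucket while the mean is above 6, which hints at a long tail that a single average would hide.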

  26. Histogram example

  27. White-box monitoring
    Detailed insight via native instrumentation

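  28. You can roll your own instrumentation.
    import time
    from contextlib import contextmanager

    # increment(), queue, db and compute_route() are assumed to be provided
    # by the application and its metrics client.

    @contextmanager
    def op(what):
        # Time the wrapped block and add the elapsed seconds to a counter
        # tagged with the operation name.
        start = time.time()
        yield
        increment('hitcount.total_s',
                  value=(time.time() - start),
                  tags=["op:" + what])

    while True:
        with op('receive'):
            req = queue.pop()
        with op('compute_route'):
            route = compute_route(req)
        with op('update'):
            db.execute('''
                UPDATE hitcount SET hits = hits + 1 WHERE route = ?
            ''', (route, ))
        with op('finish'):
            req.finish()
    (from: https://honeycomb.io/blog/2017/01/instrumentation-measuring-capacity-through-utilization/)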

  29. Aspects help instrumentation.
    // Java (Spring MVC): an annotation times the handler method.
    @Controller
    public class MyController {
        @RequestMapping("/")
        @TimeMethod(name = "app_duration_seconds", help = "Some helpful info here")
        public Object handleMain() {
            // Do something
        }
    }
    # Python (prometheus_client): decorators count exceptions and time calls.
    from prometheus_client import Counter, Histogram
    c = Counter('request_failure_total', 'Description of counter')
    h = Histogram('request_latency_seconds', 'Description of histogram')
    @c.count_exceptions()
    @h.time()
    def businessFunction():
        # Do something
        pass

  30. Collectors
    or, how to get all that interesting data

  31. Event-based monitoring

  32. grep'ing logs across hosts is hard
    I CAN HAZ LOG AGGREGATION

  33. (from: https://dzone.com/articles/getting-started-splunk)

  34. (from: http://blog.takipi.com/splunk-vs-elk-the-log-management-tools-decision-making-guide/)

  35. Microservices: performance debugging is hard
    Distributed tracing to the rescue

  36. (from: http://opentracing.io/documentation/)

  37. (from: http://opentracing.io/documentation/)

  38. (from: http://jaeger.readthedocs.io/en/latest/#trace-detail-view)

  40. Advanced visualisations

  41. (from: https://www.circonus.com/2012/09/understanding-data-with-histograms/)

  42. (from: http://www.brendangregg.com/frequencytrails.html)

  43. Anomaly detection

  44. (from: https://eng.uber.com/argos/)

  45. Alerting
    not covered here

  46. Conclusion

  48. Thank you
    Pedro Araújo
    https://keybase.io/phcrva
