DevOpsPorto Meetup5: The Road to Monitoring Nirvana by Pedro Araújo

Talk delivered by Pedro Araújo

DevOpsPorto

June 13, 2017

Transcript

  1. The road to monitoring Nirvana
    June 2017
    Pedro Araújo

  2. Who am I?
    Studied Computer Engineering
    Did web development for a couple of years
    Moved to systems administration for a couple of years
    Had a run at build and automation engineering
    Landed in SRE, <3 it

  3. 135 million daily transactions
    4.7 billion API calls on a typical day (55k/s, 100k/s peak)
    2.5 terabytes of daily log data output
    250,000 time series data points per second
    14k nagios checks

  4. What this talk is not going to be about

  5. What is monitoring?
    Different things to different people

  6. Everyone starts with (nagios-style) checks
    Also, black-box monitoring
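
A nagios-style check is usually a small program that probes a system from the outside and reports its state through an exit code plus one status line. The following is a minimal sketch, assuming the conventional plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names and the 80%/90% thresholds are hypothetical and do not come from the talk.

```python
import shutil

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(used_pct, warn_pct=80.0, crit_pct=90.0):
    """Map a disk usage percentage to (exit_code, status_line)."""
    if used_pct >= crit_pct:
        return CRITICAL, f"DISK CRITICAL - {used_pct:.1f}% used"
    if used_pct >= warn_pct:
        return WARNING, f"DISK WARNING - {used_pct:.1f}% used"
    return OK, f"DISK OK - {used_pct:.1f}% used"


def check_disk(path="/"):
    """Probe the filesystem holding `path` and classify the result."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # The probe itself failed, so the state is unknown rather than bad.
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, "DISK UNKNOWN - filesystem reports zero size"
    return classify(100.0 * usage.used / usage.total)
```

A wrapper script would print the status line and exit with the returned code, which is all the scheduler sees. The check says nothing about why the disk is filling up, which is what makes it black-box.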

  7. (from: https://www.novell.com/coolsolutions/feature/16723.html)

  8. Logs give you initial insights
    at the cost of brittle and heavy scripting

  9. RSS_URI="/rss"
    LOG_FILE="/var/log/httpd/access_log"
    LOG_DATE_FORMAT="%d/%b/%Y"
    DATE="-1 day"
    LOG_FDATE=$(date -d "$DATE" "+${LOG_DATE_FORMAT}")
    # Unique IPs requesting RSS, except those reporting "subscribers":
    IPSUBS=$(
        fgrep "$LOG_FDATE" "$LOG_FILE" \
            | fgrep " $RSS_URI" \
            | egrep -v '[0-9]+ subscribers' \
            | cut -d' ' -f 1 \
            | sort \
            | uniq \
            | wc -l
    )
    # Other user-agents reporting "subscribers", for which we'll use the entire
    # user-agent string for uniqueness:
    OTHERSUBS=$(
        fgrep "$LOG_FDATE" "$LOG_FILE" \
            | fgrep " $RSS_URI" \
            | fgrep -v 'subscribers; feed-id=' \
            | egrep '[0-9]+ subscribers' \
            | egrep -o '"[^"]+"$' \
            | sort -t\( -k2 -sr \
            | awk '!x[$1]++' \
            | egrep -o '[0-9]+ subscribers' \
            | awk '{s+=$1} END {print s}'
    )
    (from: https://gist.github.com/marcoarment/3783146 (abbr.))

  10. Client-side analytics
    Online
    Crash reporting
    Synthetic monitoring
    RUM
    Offline
    Detailed logging
    Round-Robin Databases

  11. (from: https://megalytic.com/blog/tips-for-segmenting-stats-by-geography)

  12. (from: https://piwik.org/docs/piwik-tour/#toc-dashboard-widgets)

  13. Time series are like check results saved continuously
    But we can do so much more

  14. Named value at some time.
    Metric identity
      name
      dimensions
    Metric value
    Timestamp
    os.filesystem.size 1486469296 961130496 mount=/ type=Used
    os.filesystem.size 1486469296 8903143424 mount=/ type=Free
    os.filesystem.size 1486469296 1143103488 mount=/var type=Used
    os.filesystem.size 1486469296 249068044288 mount=/var type=Free
    os.filesystem.size 1486469296 0 mount=/dev/shm type=Used
    os.filesystem.size 1486469296 50682404864 mount=/dev/shm type=Free
    os.filesystem.size-inodes 1486469296 18862 mount=/ type=Used
    os.filesystem.size-inodes 1486469296 2602578 mount=/ type=Free
    os.filesystem.size-inodes 1486469296 14518 mount=/var type=Used
    os.filesystem.size-inodes 1486469296 66504522 mount=/var type=Free
    os.filesystem.size-inodes 1486469296 1 mount=/dev/shm type=Used
    os.filesystem.size-inodes 1486469296 12373633 mount=/dev/shm type=Free
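
The sample lines above can be read as `name timestamp value key=value ...`. The sketch below parses one such line into its parts, assuming whitespace-separated fields in exactly that order. `DataPoint`, `parse_line` and `identity` are hypothetical names chosen for illustration, not part of any particular collector.

```python
from typing import NamedTuple


class DataPoint(NamedTuple):
    name: str
    timestamp: int      # Unix time in seconds
    value: float
    dimensions: dict    # e.g. {"mount": "/", "type": "Used"}


def parse_line(line):
    """Parse 'name timestamp value key=value ...' into a DataPoint."""
    name, timestamp, value, *tags = line.split()
    dimensions = dict(tag.split("=", 1) for tag in tags)
    return DataPoint(name, int(timestamp), float(value), dimensions)


def identity(point):
    """Metric identity: the name plus its dimensions, ignoring time and value."""
    return (point.name, tuple(sorted(point.dimensions.items())))
```

Two data points belong to the same time series when `identity` returns the same tuple for both, so `type=Used` and `type=Free` on the same mount are separate series.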

  15. Common datapoint types
    Counters

  16. Counter example

  17. Counter rated example
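
A counter only ever increases, so it is usually graphed as a rate: the increase between two consecutive samples divided by the time between them. The sketch below shows one way to do that, assuming samples arrive as `(timestamp_seconds, value)` pairs sorted by time and that a drop in value means the process restarted and the counter began again from zero. The function name `rate` is illustrative only.

```python
def rate(samples):
    """Turn sorted (timestamp, counter_value) samples into per-second rates.

    Returns a list of (timestamp, rate) pairs, one per consecutive sample pair.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            # Duplicate or out-of-order timestamps cannot produce a rate.
            continue
        delta = v1 - v0
        if delta < 0:
            # Counter reset: assume it restarted from zero, so the increase
            # since the reset is the new value itself.
            delta = v1
        rates.append((t1, delta / elapsed))
    return rates
```

For example, samples `[(0, 100), (10, 150)]` describe an increase of 50 over 10 seconds, which is a rate of 5 per second.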

  18. Common datapoint types
    Gauges

  19. Gauge example

  20. Advanced datapoint types
    Histogram

  21. Example histogram sample
    # HELP prometheus_local_storage_series_chunks_persisted The number of chunks persisted per series.
    # TYPE prometheus_local_storage_series_chunks_persisted histogram
    prometheus_local_storage_series_chunks_persisted_bucket{le="1"} 3.205911e+06
    prometheus_local_storage_series_chunks_persisted_bucket{le="2"} 3.652375e+06
    prometheus_local_storage_series_chunks_persisted_bucket{le="4"} 4.405614e+06
    prometheus_local_storage_series_chunks_persisted_bucket{le="8"} 5.66866e+06
    prometheus_local_storage_series_chunks_persisted_bucket{le="16"} 8.226382e+06
    prometheus_local_storage_series_chunks_persisted_bucket{le="32"} 8.73615e+06
    prometheus_local_storage_series_chunks_persisted_bucket{le="64"} 8.770525e+06
    prometheus_local_storage_series_chunks_persisted_bucket{le="128"} 8.770525e+06
    prometheus_local_storage_series_chunks_persisted_bucket{le="+Inf"} 8.770525e+06
    prometheus_local_storage_series_chunks_persisted_sum 5.5495433e+07
    prometheus_local_storage_series_chunks_persisted_count 8.770525e+06
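
The `_bucket` lines are cumulative: each one counts the observations less than or equal to its `le` bound. The sketch below estimates a quantile from such buckets by linear interpolation inside the bucket that contains the target rank, which is similar in spirit to what Prometheus's `histogram_quantile` does. It assumes the lowest bucket starts at 0 and that the list ends with the `+Inf` bucket; the function name and input layout are choices made for this example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, with float("inf") as the last upper bound.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # No finite upper edge: fall back to the last finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Bucket values copied from the sample above.
BUCKETS = [
    (1, 3.205911e+06), (2, 3.652375e+06), (4, 4.405614e+06),
    (8, 5.66866e+06), (16, 8.226382e+06), (32, 8.73615e+06),
    (64, 8.770525e+06), (128, 8.770525e+06), (float("inf"), 8.770525e+06),
]
median = histogram_quantile(0.5, BUCKETS)
mean = 5.5495433e+07 / 8.770525e+06  # _sum divided by _count
```

With the sample data the median estimate falls inside the (2, 4] bucket, while the mean from `_sum / _count` is a little above 6. The gap between the two is the kind of skew that a single average hides.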

  22. Histogram example

  23. White-box monitoring
    Detailed insight via native instrumentation

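  24. You can roll your own instrumentation.
    import time
    from contextlib import contextmanager

    # increment(), queue, db and compute_route() are assumed to be provided
    # by the surrounding application; the slide does not define them.
    @contextmanager
    def op(what):
        # Time the wrapped block and add the elapsed seconds to a counter
        # tagged with the operation name.
        start = time.time()
        yield
        increment('hitcount.total_s',
                  value=(time.time() - start),
                  tags=["op:" + what])

    while True:
        with op('receive'):
            req = queue.pop()
        with op('compute_route'):
            route = compute_route(req)
        with op('update'):
            db.execute('''
                UPDATE hitcount SET hits = hits + 1 WHERE route = ?
            ''', (route, ))
        with op('finish'):
            req.finish()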
    (from: https://honeycomb.io/blog/2017/01/instrumentation-measuring-capacity-through-utilization/)

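  25. Aspects help instrumentation.
    // Java (Spring): an annotation times the handler method.
    @Controller
    public class MyController {
        @RequestMapping("/")
        @TimeMethod(name = "app_duration_seconds", help = "Some helpful info here")
        public Object handleMain() {
            // Do something
        }
    }
    # Python, assuming the prometheus_client library: decorators count
    # raised exceptions and time each call.
    from prometheus_client import Counter, Histogram
    c = Counter('request_failure_total', 'Description of counter')
    h = Histogram('request_latency_seconds', 'Description of histogram')
    @c.count_exceptions()
    @h.time()
    def businessFunction():
        # Do something
        pass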

  26. Collectors
    or, how to get all that interesting data

  27. Event-based monitoring

  28. grep'ing logs across hosts is hard
    I CAN HAZ LOG AGGREGATION

  29. (from: https://dzone.com/articles/getting-started-splunk)

  30. (from: http://blog.takipi.com/splunk-vs-elk-the-log-management-tools-decision-making-guide/)

  31. Micro-services - perf debugging is hard
    Distributed tracing to the rescue

  32. (from: http://opentracing.io/documentation/)

  33. (from: http://opentracing.io/documentation/)

  34. (from: http://jaeger.readthedocs.io/en/latest/#trace-detail-view)

  35. Advanced visualisations

  36. (from: https://www.circonus.com/2012/09/understanding-data-with-histograms/)

  37. (from: http://www.brendangregg.com/frequencytrails.html)

  38. Anomaly detection

  39. (from: https://eng.uber.com/argos/)

  40. Alerting
    not covered here

  41. Thank you
    Pedro Araújo
    https://keybase.io/phcrva
