Teach your application eloquence. Logs, metrics, traces.

Most modern applications live in close cooperation with each other. We will talk about how to use modern techniques for monitoring application health effectively, and look at the tasks and typical implementation mistakes through the eyes of an infrastructure engineer. We will also consider the Ruby libraries that help to implement all of this.

ShaD

February 16, 2019
Transcript

  1. Who are we?
 • UK-based data security products and services company
 • Building security tools to prevent sensitive data leakage and to comply with data security regulations
 • Cryptographic tools, security consulting, training
 • We are cryptographers, system engineers, applied engineers, infrastructure engineers
 • We support the community, speak, teach, and open source a lot
  2. What we are going to talk about
 • Why do we need telemetry?
 • What are the different kinds of telemetry?
 • Borders of applicability of various types of telemetry
 • Approaches and mistakes
 • Implementation
  3. What is telemetry?
 «Gathering data on the use of applications and application components, measurements of start-up time and processing time, hardware, application crashes, and general usage statistics.»
  4. Why do we need telemetry at all?
 Who are the consumers?
 − developers
 − devops/sysadmins
 − analysts
 − security staff
 What purposes?
 − debug
 − monitor state and health
 − measure and tune performance
 − business analysis
 − intrusion detection
  5–6. It is worthwhile, indeed
 • speed up the development process
 • increase overall stability
 • reduce reaction time to crashes and intrusions
 • adequate business planning
 …but it comes at a price:
 • COST of development
 • COST of use
  7. What data do we have to export?
 … we can ask any specialist.
 — ALL! … will be their answer.
  8–11. Classification of information
 technical:
 − state
 − health
 − errors
 − performance
 − debug
 − events
 business:
 − SLI
 − user actions
 Consumers: developers, devops/sysadmins, analysts, security staff.
  12. SIEM — security staff’s main instrument
 Complex analysis:
 − correlation
 − threats
 − patterns
 − compliance
 Sources: applications, network devices, servers, environment.
  13–15. Telemetry evolution
 Logs:
 • each application has an individual log file
 • syslog:
 − message standard (RFC 3164, 2001)
 − aggregation
 • ELK (agents, collectors)
 • HTTP, JSON, protobuf
 Metrics:
 • reports into logs
 • agents, collectors, stores with proprietary protocols
 • SNMP
 • HTTP, protobuf
 • custom implementations
 Traces:
 • reports into logs
 • agents, collectors, stores with proprietary protocols
 • custom implementations
  16–19. Telemetry applicability
 Logs:
 • simplest
 • no external tools required
 • human readable
 • UNIX-style
 • compatible with tons of tools
 • queries
 • alerts
 Metrics:
 • minimal store size
 • low performance impact
 • performance measuring
 • health and state observing
 • special structures
 • queries
 • alerts
 Traces:
 • minimal store size
 • low performance impact
 • per-query metrics
 • low-level information
 • precise debugging and performance tuning
 + SIEM systems
  20. Logs : kinds of data
 • initial information about the application
 • state changes (start/ready/…/stop)
 • health changes
 • audit trail (security-relevant list of activities: financial operations, health care data transactions, changing keys, changing configuration)
 • user sessions (sign-in attempts, sign-out, actions)
 • unexpected actions (wrong URLs, sign-in failures, etc.)
 • various information in string format
  21. Logs : on start
 • new state: starting
 • application name
 • component name
 • commit hash / build number
 • configuration in use
 • deprecation warnings
 • running mode
  22. Logs : on ready
 • new state: ready
 • listening interfaces, ports and sockets
 • health
  23. Logs : on state or health change
 • new state
 • reason
 • URL to documentation
  24. Logs : on state or health change
 • new state
 • reason
 • URL to documentation
 Use a traffic-light system to highlight health states:
 • red — completely unhealthy
 • yellow — partially healthy, reduced functionality
 • green — completely healthy
  25. Logs : on shutdown
 • reason
 • status of preparation for shutdown
 • new state: stopped (final goodbye)
  26. Logs : each line
 • timestamps (ISO 8601, TZ, reasonable precision)
 • PID
 • application/component short name
 • application version (JSON, CEF, protobuf)
 • severity (CEF: 0→10, RFC 5427: 7→0)
 • event code (HTTP style)
 • human-readable message
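A minimal sketch of emitting one structured JSON line per event with the fields listed above, using only the Ruby standard library; the field names, component name and event-code scheme are illustrative assumptions, not a standard.

```ruby
require 'json'
require 'logger'
require 'time'

# One JSON log line per event, carrying the fields from the slide above.
logger = Logger.new($stderr)
logger.formatter = proc do |severity, time, _progname, payload|
  {
    ts:        time.utc.iso8601(3),   # ISO 8601 timestamp, millisecond precision
    pid:       Process.pid,
    component: 'billing-api',         # application/component short name (example value)
    version:   '1.4.2+build.317',     # example version string
    severity:  severity,
    code:      payload[:code],        # HTTP-style event code
    message:   payload[:message]
  }.to_json + "\n"
end

logger.info(code: 200, message: 'state changed: starting')
```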
  27. Logs : do not export!
 • passwords, tokens, any sensitive data — security risks
 • private data — legal risks
 Use:
 − masking
 − anonymisation / pseudonymisation
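One way to keep such values out of the logs is to mask them before the payload ever reaches the logger. A minimal sketch, assuming a hash payload; the list of sensitive keys and the mask string are illustrative assumptions.

```ruby
# Mask known-sensitive keys before logging; keys and mask format are examples only.
SENSITIVE_KEYS = %w[password token api_key card_number].freeze

def mask_sensitive(payload)
  payload.to_h.map do |key, value|
    SENSITIVE_KEYS.include?(key.to_s) ? [key, '***MASKED***'] : [key, value]
  end.to_h
end

mask_sensitive(user: 'alice', password: 'hunter2')
# => {:user=>"alice", :password=>"***MASKED***"}
```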
  28. Logs : consumers
 • Console
 • Files
 • General purpose collector/store/alert/search system
 • SIEM
  29. Logs : consumers and formats
 Consumers: console/STDERR, file, syslog, ELK, SIEM, socket/HTTP/custom.
 Formats: plain, syslog (RFC 3164), JSON, CEF, protobuf.
 syslog (RFC 3164), JSON and CEF are accepted by all of the consumers above; plain text and protobuf are accepted only by some of them.
  30. Logs : CEF
 • old (2009), but widely used standard
 • simple: easy to generate, easy to parse (supported even by devices without powerful CPUs)
 • well documented:
 − field name dictionaries
 − field types
 Format:
 CEF:Version|Device Vendor|Device Product|Device Version|Signature ID|Name|Severity|Extension
 Example:
 Sep 19 08:26:10 host CEF:0|security|threatmanager|1.0|100|worm successfully stopped|10|src=10.0.0.1 dst=2.1.2.2 spt=1232
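A small sketch of building such a line in Ruby, following the header template above. The escaping here is simplified (only '|' and '\' in header fields); consult the CEF white paper linked at the end for the full rules.

```ruby
# Build a CEF line from the header template above (simplified escaping).
def cef_line(vendor:, product:, version:, signature_id:, name:, severity:, extension: {})
  escape = ->(field) { field.to_s.gsub('\\') { '\\\\' }.gsub('|') { '\\|' } }
  header = [0, vendor, product, version, signature_id, name, severity].map(&escape)
  ext = extension.map { |k, v| "#{k}=#{v}" }.join(' ')
  "CEF:#{header.join('|')}|#{ext}"
end

puts cef_line(vendor: 'security', product: 'threatmanager', version: '1.0',
              signature_id: 100, name: 'worm successfully stopped', severity: 10,
              extension: { src: '10.0.0.1', dst: '2.1.2.2', spt: 1232 })
# => CEF:0|security|threatmanager|1.0|100|worm successfully stopped|10|src=10.0.0.1 dst=2.1.2.2 spt=1232
```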
  31. Logs : bear in mind [1/3]
 • Logs will be read by humans — often when a failure happens, with limited time to react. Be brief and eloquent. Give information that may help to solve the problem.
 • Logs will be searched. Don’t be a poet, be a technical specialist. Use expected words.
 • Logs will be parsed automatically; indeed, they will. There are too many different systems that want telemetry from your application.
 • Carefully classify the severity of events. Many error messages instead of warnings in non-critical situations will lead to ignoring information from the logs.
  32. Logs : bear in mind [2/3]
 • Whenever possible, build on existing standards. Grouping event codes according to the HTTP status code table is not a bad idea.
 • Logs are the first resource for analyzing security incidents.
 • Logs will be archived and stored for a long period of time. It will be almost impossible to cut off some pieces of data later.
 • Formats, transport protocols, paths and severity should be configurable.
  33. Logs : bear in mind [3/3]
 • Your application may run in many different environments with different logging standards (VM, Docker). The application should be able to direct all logs into one channel; splitting may be an option.
 • Do not implement log file rotation yourself. Provide a way to tell your application to gracefully recreate the log file after it has been rotated by an external service.
 • When big trouble occurs and nothing works, your application should be able to print readable logs in the simplest manner — to stderr/stdout.
  34. Logs : implementation
 • native Ruby methods
 • semantic_logger
 https://github.com/rocketjob/semantic_logger
 (a lot of destinations: DBs, HTTP, UDP, syslog)
 • ougai
 https://github.com/tilfin/ougai
 (JSON)
 • httplog
 https://github.com/trusche/httplog
 (HTTP logging, JSON support)
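A hedged sketch of wiring one of these up — here semantic_logger emitting JSON to stdout. Option and method names follow the gem's documentation, but they have shifted between versions, so verify against the version you install.

```ruby
require 'semantic_logger'

# Approximate semantic_logger setup: JSON-formatted lines to stdout,
# tagged with a component name (option names may vary by gem version).
SemanticLogger.default_level = :info
SemanticLogger.add_appender(io: $stdout, formatter: :json)

logger = SemanticLogger['billing-api']   # component short name (example value)
logger.info('state changed: ready', port: 8080, health: 'green')
```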
  35. Metrics : approaches
 • USE method: Utilization, Saturation, Errors
 • Google SRE book: Latency, Traffic, Errors, Saturation
 • RED method: Rate, Errors, Duration
  36. Metrics : utilization
 The average time that the resource was busy servicing work; usage of a resource.
 • Hardware resources: CPU, disk system, network interfaces
 • File system: capacity, usage
 • Memory: capacity, cache, heap, queue
 • Resources: file descriptors, threads, sockets, connections
  37. Metrics : traffic, rate
 A measure of how much demand is being placed on your system. (Google SRE book)
 The number of requests, per second, your services are serving. (RED Method)
 • normal operations:
 − requests
 − queries
 − transactions
 − sending network packets
 − processing flow bytes
  38. Metrics : latency, duration
 The time it takes to service a request. (Google SRE book)
 • latency of operations:
 − requests
 − queries
 − transactions
 − sending network packets
 − processing flow bytes
  39. Metrics : errors
 The count of error events. (USE Method)
 The rate of requests that fail, either explicitly, implicitly, or by policy. (Google SRE book)
 • error events:
 − hardware errors
 − software exceptions
 − invalid requests / input
 − authentication failures
 − invalid URLs
  40. Metrics : saturation
 • calculated value, a measure of the current load
 The degree to which the resource has extra work which it can't service, often queued. (USE Method)
 How "full" your service is. A measure of your system fraction, emphasizing the resources that are most constrained. (Google SRE book)
  41. Metrics : saturation
 • can be calculated internally or measured externally
 • high utilization is a problem
 • high saturation is a problem
 • a low utilization level does not guarantee that everything is OK
 • low saturation (in the case of a correct calculation) most likely indicates that everything is OK
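One common way to calculate saturation internally is to relate queued work to available capacity. The sketch below (queue length divided by queue capacity for a worker pool) is just one illustrative definition, not one the talk prescribes.

```ruby
# Illustrative saturation value for a worker pool with a bounded queue:
# 0.0 means the queue is empty, 1.0 means it is full (work is delayed or refused).
class WorkerPoolStats
  def initialize(queue, capacity)
    @queue    = queue      # e.g. a SizedQueue shared with the workers
    @capacity = capacity
  end

  def saturation
    @queue.size.to_f / @capacity
  end
end

queue = SizedQueue.new(100)
stats = WorkerPoolStats.new(queue, 100)
5.times { queue << :job }
stats.saturation  # => 0.05
```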
  42. OpenMetrics : based on Prometheus metric types
 • Gauge — single numerical value
 − memory used
 − fan speed
 − connections count
 • Counter — single monotonically increasing counter
 − operations done
 − errors occurred
 − requests processed
 • Histogram — incrementing counters per bucket
 − request counts per latency bucket
 − CPU load value counts per range bucket
 • Summary — similar to the Histogram, but φ-quantiles are calculated on the client side; calculating other quantiles is not possible
 https://openmetrics.io/
 https://prometheus.io/docs/concepts/metric_types/
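A hedged sketch of declaring the first three types with the prometheus-client gem. The constructor keywords follow recent gem versions and have changed over time, so check the gem's README for the version you actually install.

```ruby
require 'prometheus/client'

# Gauge / counter / histogram declared against the default registry
# (keyword names per recent prometheus-client versions; may differ in older ones).
registry = Prometheus::Client.registry

connections = Prometheus::Client::Gauge.new(
  :db_connections, docstring: 'Open database connections')
errors = Prometheus::Client::Counter.new(
  :errors_total, docstring: 'Errors occurred since start')
latency = Prometheus::Client::Histogram.new(
  :request_duration_seconds, docstring: 'Request latency',
  buckets: [0.01, 0.05, 0.1, 0.5, 1.0, 5.0])

[connections, errors, latency].each { |metric| registry.register(metric) }

connections.set(12)      # gauge: single numerical value
errors.increment         # counter: monotonically increasing
latency.observe(0.042)   # histogram: increments the matching bucket
```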
  43–45. Metrics : buckets
 Bucket upper bounds: <10, <20, <30, <40, <50, <60, <70, <80, <90, <100
 Observations per bucket: 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
 With one observation in each bucket, the 50th percentile falls at the <50 bucket and the 90th percentile at the <90 bucket.
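To make that concrete, here is a small, self-contained sketch of estimating a percentile from bucket counts by interpolating inside the bucket that contains the target rank (the same idea Prometheus's histogram_quantile uses). It is illustrative, not taken from the slides.

```ruby
# Estimate a quantile from histogram buckets (upper_bound => count),
# interpolating linearly inside the bucket that contains the target rank.
def estimate_quantile(buckets, q)
  total = buckets.values.sum
  target = q * total
  lower_bound = 0.0
  cumulative = 0
  buckets.sort.each do |upper_bound, count|
    if cumulative + count >= target
      fraction = count.zero? ? 0 : (target - cumulative) / count.to_f
      return lower_bound + (upper_bound - lower_bound) * fraction
    end
    cumulative += count
    lower_bound = upper_bound
  end
  lower_bound
end

buckets = { 10 => 1, 20 => 1, 30 => 1, 40 => 1, 50 => 1,
            60 => 1, 70 => 1, 80 => 1, 90 => 1, 100 => 1 }
estimate_quantile(buckets, 0.5)  # => 50.0 (the <50 bucket)
estimate_quantile(buckets, 0.9)  # => 90.0 (the <90 bucket)
```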
  46. Metrics : export data
 • current state
 • current health
 • event counters:
 − AAA events
 − unexpected actions (wrong URLs, sign-in failures)
 − errors during normal operations
 • performance metrics:
 − normal operations
 − queues
 − utilization, saturation
 − query latency
 • application info:
 − version
 − warnings/notifications (gauge)
  47. Metrics : formats
 • suggest using the Prometheus format
 − native for Prometheus
 − OpenMetrics — open source specification
 − simple and clear
 − HTTP-based
 − can be easily converted
 − libraries exist
 • Influx or a similar format if you really need to implement a push model
 • protobuf / gRPC
 − custom
 − high load
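As a sketch of what the pull model looks like in practice: the prometheus-client gem ships Rack middleware that serves the registry on /metrics in the Prometheus text format. The middleware class names below follow the gem's documentation for recent versions; verify against the version you install.

```ruby
# config.ru — minimal Rack app exposing the default registry on /metrics.
require 'rack'
require 'prometheus/middleware/collector'
require 'prometheus/middleware/exporter'

use Prometheus::Middleware::Collector   # records per-request HTTP metrics
use Prometheus::Middleware::Exporter    # serves /metrics in Prometheus text format

run ->(_env) { [200, { 'content-type' => 'text/plain' }, ["OK\n"]] }
```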

  48. Metrics : bear in mind [1/2]
 • Split statistics by type. For example, aggregating successful (relatively long) and failed (relatively short) durations may create the illusion of a performance increase when multiple failures occur.
 • Whenever possible, use saturation to determine the load of the system; utilization alone is not complete information.
 • Be sure to export the metrics of the component closest to the user. This makes it possible to evaluate the SLI.
 • Implement configurable bucket sizes.
  49. Metrics : bear in mind [2/2]
 • Export appropriate metrics as buckets. It lowers the polling rate and makes it possible to get statistics in percentiles.
 • Add units to metric names.
 • Whenever possible, use SI units.
 • Follow the naming standard. The Prometheus “Metric and label naming” document is a good base.
  50. Traces : definition
 In software engineering, tracing involves a specialized use of logging to record information about a program's execution. … There is not always a clear distinction between tracing and other forms of logging, except that the term tracing is almost never applied to logging that is a functional requirement of a program. — Wikipedia
  51. Traces : use cases
 • Debugging during development
 • Measuring and tuning performance
 • Analyzing failures and security incidents
 See also: https://www.cossacklabs.com/blog/how-to-implement-distributed-tracing.html
 − Approaches
 − Library comparison
 − Implementation example
 − Use cases
  52. Traces : kinds of data
 Per request/query tracking:
 • trace id
 • span id
 • parent span id
 • application info (product, component)
 • module name
 • method name
 • context data (session/request id, user id, …)
 • operation name and code
 • start time
 • end time
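As a plain-Ruby illustration of such a per-request record; the field names are examples only, not the wire format of any particular tracer (real tracers such as those on the next slides define their own).

```ruby
require 'securerandom'
require 'time'

# Illustrative span record carrying the per-request fields listed above.
Span = Struct.new(:trace_id, :span_id, :parent_span_id,
                  :product, :component, :module_name, :method_name,
                  :context, :operation, :start_time, :end_time,
                  keyword_init: true)

span = Span.new(
  trace_id:       SecureRandom.hex(16),
  span_id:        SecureRandom.hex(8),
  parent_span_id: nil,                      # nil for the root span
  product:        'billing', component: 'api',
  module_name:    'Invoices', method_name: 'create',
  context:        { request_id: 'r-42', user_id: 7 },
  operation:      'POST /invoices',
  start_time:     Time.now.utc
)
span.end_time = Time.now.utc
```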
  53. Traces : consumers
 • General purpose collectors:
 − Jaeger
 − Zipkin
 • Cloud collectors:
 − Google StackDriver
 − AWS X-Ray
 − Azure Application Insights
 • SIEM
  54. Traces : formats
 • Proprietary protocols:
 − Jaeger
 − Zipkin
 − Google StackDriver
 − AWS X-Ray
 − Azure Application Insights
 • JSON:
 − SIEM
 • protobuf/gRPC:
 − custom
  55. Traces : implementation
 • OpenCensus
 https://www.rubydoc.info/gems/opencensus
 (Zipkin, GC Stackdriver, JSON)
 • OpenTracing
 https://opentracing.io/guides/ruby/
 • Jaeger client
 https://github.com/salemove/jaeger-client-ruby
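An approximate sketch of manual spans with the opencensus gem; the method names follow its documentation (https://www.rubydoc.info/gems/opencensus), but treat this as an assumption and verify against the gem version you install.

```ruby
require 'opencensus'

# Approximate opencensus usage: a request trace with a root span and a nested span.
OpenCensus::Trace.start_request_trace do
  OpenCensus::Trace.in_span 'create_invoice' do |span|
    # traced work for the whole operation goes here
    OpenCensus::Trace.in_span 'store_to_db' do |child_span|
      # nested span: its parent span id points at 'create_invoice'
      sleep 0.01
    end
  end
end
```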
  56. Checklist : Logs
 □ Each line:
 □ timestamps (ISO 8601, TZ, reasonable precision)
 □ PID
 □ component name
 □ severity
 □ event code
 □ human-readable message
 □ Events to log:
 □ state changes (start/ready/pause/stop)
 □ health changes (new state, reason, doc URL)
 □ user sign-in attempts (including failed ones with reasons), actions, sign-out
 □ audit trail
 □ errors
 □ On start:
 □ product name, component name
 □ version (+build, +commit hash)
 □ running mode (debug/normal, daemon/)
 □ deprecation warnings
 □ which configuration is in use (ENV, file, configuration service)
 □ On ready: communication sockets and ports
 □ On exit: reason
 □ Do not log:
 □ passwords, tokens
 □ personal data
  57. Checklist : Metrics
 □ Data to export:
 □ application (version, warnings/notifications)
 □ utilization (resources, capacities, usage)
 □ saturation (internally calculated or appropriate metrics)
 □ rate (operations)
 □ errors
 □ latencies
 □ Split metrics by type
 □ Export as buckets when reasonable
 □ Configure the size of buckets
 □ Export metrics for SLI
 □ Determine the required resolution
 □ Normalize, use SI units, add units to names
 □ Prefer the poll model when possible
 □ Clear counters on restart
  58. Links [1/2]
 • Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
 https://static.googleusercontent.com/media/research.google.com/uk//pubs/archive/36356.pdf
 • How to Implement Tracing in a Modern Distributed Application
 https://www.cossacklabs.com/blog/how-to-implement-distributed-tracing.html
 • OpenTracing
 https://opentracing.io/
 • OpenMetrics
 https://github.com/RichiH/OpenMetrics
 • OpenCensus
 https://opencensus.io
  59. Links [2/2]
 • CEF
 https://kc.mcafee.com/resources/sites/MCAFEE/content/live/CORP_KNOWLEDGEBASE/78000/KB78712/en_US/CEF_White_Paper_20100722.pdf
 • Metrics : USE method
 http://www.brendangregg.com/usemethod.html
 • Google SRE book
 https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/
 • Metrics : RED method
 https://www.weave.works/blog/the-red-method-key-metrics-for-microservices-architecture/
 • MS Azure : monitoring and diagnostics
 https://docs.microsoft.com/en-us/azure/architecture/best-practices/monitoring
 • Prometheus : Metric and label naming
 https://prometheus.io/docs/practices/naming/