
Teach your application eloquence. Logs, metrics, traces.

ShaD
February 16, 2019

Most modern applications live in close cooperation with each other. We will talk about ways to use modern techniques effectively for monitoring the health of applications, and look at typical tasks and common implementation mistakes through the eyes of an infrastructure engineer. We will also cover the Ruby libraries that help to implement all of this.


Transcript

  1. Teach your application
    eloquence.
    Logs, metrics, traces.
    Dmytro Shapovalov
    Infrastructure Engineer @ Cossack Labs

  2. Who are we?
    • UK-based data security products and services company
    • Building security tools to prevent sensitive data leakage and to comply with data security regulations
    • Cryptographic tools, security consulting, training
    • We are cryptographers, system engineers, applied engineers, infrastructure engineers
    • We support the community: we speak, teach, and open source a lot

  3. What we are going to talk about
    • Why do we need telemetry?
    • What are the different kinds of telemetry?
    • Limits of applicability of the various types of telemetry
    • Approaches and mistakes
    • Implementation

  4. What is telemetry?
    «Gathering data on the use of applications and
    application components, measurements of start-up
    time and processing time, hardware, application
    crashes, and general usage statistics.»

  5. Why do we need telemetry at all?
    Who are the consumers?
    − developers
    − devops/sysadmins
    − analysts
    − security staff
    For what purposes?
    − debugging
    − monitoring state and health
    − measuring and tuning performance
    − business analysis
    − intrusion detection

  7. It is worthwhile, indeed
    • speed up the development process
    • increase overall stability
    • reduce the reaction time to crashes and intrusions
    • adequate business planning
    • COST of development
    • COST of use

  9. What data do we have to export?
    … we can ask any specialist.
    — ALL!
    … will be their answer.

  14. Classification of information
    technical:
    − state
    − health
    − errors
    − performance
    − debug
    − events
    business:
    − SLI
    − user actions
    Consumers: developers, devops/sysadmins, analysts, security staff

  15. SIEM — security staff’s main instrument
    Complex analysis:
    − correlation
    − threats
    − patterns
    − compliance
    Inputs: applications, network devices, servers, environment

  18. Telemetry evolution
    Logs:
    • each application has an individual log file
    • syslog: message standard (RFC 3164, 2001), aggregation
    • ELK (agents, collectors)
    • HTTP, JSON, protobuf
    Metrics:
    • reports into logs
    • agents, collectors, stores with proprietary protocols
    • SNMP
    • HTTP, protobuf
    • custom implementations
    Traces:
    • reports into logs
    • agents, collectors, stores with proprietary protocols
    • custom implementations

  22. Telemetry applicability
    Logs:
    • simplest
    • no external tools required
    • human readable
    • UNIX-style
    • compatible with tons of tools
    • queries
    • alerts
    Metrics:
    • minimal store size
    • low performance impact
    • performance measuring
    • health and state observing
    • special structures
    • queries
    • alerts
    Traces:
    • minimal store size
    • low performance impact
    • per-query metrics
    • low-level information
    • precise debugging and performance tuning
    + SIEM systems

  24. Telemetry flow
    creation → transport → aggregation → normalization → store → analyze + alerting → visualize → archive

  25. Logs

  26. Logs : kinds of data
    • initial information about the application
    • state changes (start/ready/…/stop)
    • health changes
    • audit trail (security-relevant list of activities: financial operations, health care data transactions, changing keys, changing configuration)
    • user sessions (sign-in attempts, sign-out, actions)
    • unexpected actions (wrong URLs, sign-in failures, etc.)
    • various information in string format

  27. Logs : on start
    • new state: starting
    • application name
    • component name
    • commit hash / build number
    • configuration in use
    • deprecation warnings
    • running mode
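
    For illustration, a minimal sketch of these “on start” entries using Ruby’s stdlib Logger; the component name, version, and GIT_COMMIT variable are placeholders for whatever your build pipeline actually provides.

      require 'logger'
      require 'time'

      logger = Logger.new($stdout)
      logger.progname = 'billing/api' # application/component short name
      logger.formatter = proc do |severity, time, progname, msg|
        # timestamp, PID, component, severity, message
        "#{time.utc.iso8601(3)} #{Process.pid} #{progname} #{severity} #{msg}\n"
      end

      logger.info("state=starting version=1.4.2 commit=#{ENV.fetch('GIT_COMMIT', 'unknown')}")
      logger.info('config=ENV mode=daemon')
      logger.warn('deprecation: key "db_host" is deprecated, use "database.host"')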

  28. Logs : on ready
    • new state: ready
    • listening interfaces, ports, and sockets
    • health

  30. Logs : on state or health change
    • new state
    • reason
    • URL to documentation
    Use a traffic-light system to highlight health states:
    ● red — completely unhealthy
    ● yellow — partially healthy, reduced functionality
    ● green — completely healthy

  31. Logs : on shutdown
    • reason
    • status of shutdown preparation
    • new state: stopped (the final goodbye)

  32. Logs : each line
    • timestamp (ISO 8601, TZ, reasonable precision)
    • PID
    • application/component short name
    • application version (JSON, CEF, protobuf)
    • severity (CEF: 0→10, RFC 5427: 7→0)
    • event code (HTTP style)
    • human-readable message
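
    Put together, one line might look like this (all values illustrative):

      2019-02-16T10:21:03.042+00:00 1234 billing/api WARN 429 request rate limit reached for client 10.0.0.5

    (timestamp, PID, component, severity, HTTP-style event code, message)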

  33. Logs : do not export!
    • passwords, tokens, any sensitive data — security risks
    • private data — legal risks
    Use:
    − masking
    − anonymisation / pseudonymisation
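
    A minimal masking sketch in Ruby; the list of sensitive keys here is an assumption, and a real project should keep an explicit, reviewed list:

      SENSITIVE_KEYS = %w[password token card_number ssn].freeze

      def mask(params)
        params.map { |k, v|
          [k, SENSITIVE_KEYS.include?(k.to_s) ? '***MASKED***' : v]
        }.to_h
      end

      mask('user' => 'alice', 'password' => 'hunter2')
      # => {"user"=>"alice", "password"=>"***MASKED***"}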

  34. Logs : consumers
    • Console
    • Files
    • General purpose collector/store/alert/search
    system.
    • SIEM

  35. Logs : consumers and formats
    format             console/STDERR  file  syslog  ELK  SIEM  socket/HTTP/custom
    plain                    ✓          ✓
    syslog (RFC 3164)        ✓          ✓      ✓      ✓    ✓           ✓
    JSON                     ✓          ✓      ✓      ✓    ✓           ✓
    CEF                      ✓          ✓      ✓      ✓    ✓           ✓
    protobuf                                               ✓           ✓

  36. Logs : CEF
    • old (2009), but widely used standard
    • simple: easy to generate, easy to parse (supported even by devices without powerful CPUs)
    • well documented: field name dictionaries, field types
    Format:
    CEF:Version|Device Vendor|Device Product|Device Version|Signature ID|Name|Severity|Extension
    Example:
    Sep 19 08:26:10 host CEF:0|security|threatmanager|1.0|100|worm successfully stopped|10|src=10.0.0.1 dst=2.1.2.2 spt=1232

  37. CEF naming, data formats
    +
    JSON/protobuf/… transport
    =
    painless logging
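
    As a sketch, a hypothetical Ruby helper that renders the CEF header shown above; pipes and backslashes in header fields must be escaped, per the CEF white paper:

      def cef(vendor:, product:, version:, sig_id:, name:, severity:, ext: {})
        # block form of gsub avoids backreference surprises in replacements
        esc = ->(s) { s.to_s.gsub('\\') { '\\\\' }.gsub('|') { '\|' } }
        header = ['CEF:0', vendor, product, version, sig_id, name, severity].map(&esc).join('|')
        "#{header}|#{ext.map { |k, v| "#{k}=#{v}" }.join(' ')}"
      end

      cef(vendor: 'security', product: 'threatmanager', version: '1.0',
          sig_id: 100, name: 'worm successfully stopped', severity: 10,
          ext: { src: '10.0.0.1', dst: '2.1.2.2', spt: 1232 })
      # => "CEF:0|security|threatmanager|1.0|100|worm successfully stopped|10|src=10.0.0.1 dst=2.1.2.2 spt=1232"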

  38. Logs : bear in mind [1/3]
    • Logs will be read by humans, often when a failure happens, with limited time to react. Be brief and eloquent. Give information that may help to solve the problem.
    • Logs will be searched. Don’t be a poet, be a technical specialist. Use expected words.
    • Logs will be parsed automatically; indeed, they will. There are too many different systems that want telemetry from your application.
    • Carefully classify the severity of events. Many error messages where warnings would do in non-critical situations will teach people to ignore the logs.

  39. Logs : bear in mind [2/3]
    • Whenever possible, build on existing standards. Grouping event codes according to the HTTP error code table is not a bad idea.
    • Logs are the first resource for analyzing security incidents.
    • Logs will be archived and stored for a long period of time. It will be almost impossible to cut out some pieces of data later.
    • Make it configurable: formats, transport protocols, paths, severity.

  40. Logs : bear in mind [3/3]
    • Your application may run in many different environments with different logging standards (VM, Docker). The application should be able to direct all logs into one channel. Splitting may be an option.
    • Do not implement log file rotation. Provide a way to tell your application to gracefully recreate the log file after it has been rotated by an external service.
    • When big trouble occurs and nothing works, your application should be able to print readable logs in the simplest manner — to stderr/stdout.
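
    A sketch of cooperating with an external rotation service: trap a signal, then reopen the file from the main loop rather than inside the trap handler, since loggers may take locks that are unsafe in trap context. The path and signal choice below are conventions, not requirements.

      require 'logger'

      LOG_PATH = '/var/log/myapp/app.log' # hypothetical path
      logger = Logger.new(LOG_PATH)
      reopen_requested = false

      Signal.trap('HUP') { reopen_requested = true } # e.g. sent by logrotate's postrotate

      # somewhere in the application's main loop:
      if reopen_requested
        logger.reopen(LOG_PATH) # stdlib Logger can reopen its log device
        reopen_requested = false
      end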

  41. Logs : implementation
    • native Ruby methods
    • semantic_logger: https://github.com/rocketjob/semantic_logger (a lot of destinations: DBs, HTTP, UDP, syslog)
    • ougai: https://github.com/tilfin/ougai (JSON)
    • httplog: https://github.com/trusche/httplog (HTTP logging, JSON support)
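
    A minimal sketch with semantic_logger, one of the gems above (JSON to stdout; the logger name and payload are illustrative):

      require 'semantic_logger'

      SemanticLogger.default_level = :info
      SemanticLogger.add_appender(io: $stdout, formatter: :json)

      logger = SemanticLogger['Billing::API']
      logger.info('state change', state: 'ready', port: 8443)
      logger.error('payment failed', code: 502, order_id: 'A-1021')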

  42. Metrics

  43. Metrics : approaches
    • USE method: Utilization, Saturation, Errors
    • Google SRE book (the four golden signals): Latency, Traffic, Errors, Saturation
    • RED method: Rate, Errors, Duration

  44. Metrics : utilization
    The average time that the resource was busy servicing work (USE Method); the usage of a resource.
    • Hardware resources: CPU, disk system, network interfaces
    • File system: capacity, usage
    • Memory: capacity, cache, heap, queue
    • Resources: file descriptors, threads, sockets, connections

  45. Metrics : traffic, rate
    A measure of how much demand is being placed on your system. (Google SRE book)
    The number of requests per second your services are serving. (RED Method)
    • normal operations:
    − requests
    − queries
    − transactions
    − sending network packets
    − processing flow bytes

  46. Metrics : latency, duration
    The time it takes to service a request. (Google SRE book)
    • latency of operations:
    − requests
    − queries
    − transactions
    − sending network packets
    − processing flow bytes

  47. Metrics : errors
    The count of error events. (USE Method)
    The rate of requests that fail, either explicitly, implicitly, or by policy. (Google SRE book)
    • error events:
    − hardware errors
    − software exceptions
    − invalid requests / input
    − authentication failures
    − invalid URLs

  48. Metrics : saturation
    The degree to which the resource has extra work which it can't service, often queued. (USE Method)
    How "full" your service is. A measure of your system fraction, emphasizing the resources that are most constrained. (Google SRE book)
    • calculated value, measure of current load

  49. Metrics : saturation
    • can be calculated internally or measured
    externally
    • high utilization is a problem
    • high saturation is a problem
    • low utilization level does not guarantee that
    everything is OK
    • low saturation (in the case of a correct calculation)
    most likely indicates that everything is OK
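
    As a toy example, a saturation gauge can be derived internally from any bounded resource (names are hypothetical):

      # fraction of a bounded work queue in use; alert well before 1.0
      def queue_saturation(queue_length, queue_capacity)
        queue_length.to_f / queue_capacity
      end

      queue_saturation(45, 100) # => 0.45, plenty of headroom
      queue_saturation(98, 100) # => 0.98, trouble even if CPU utilization looks fine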

  50. OpenMetrics : based on Prometheus metric types
    • Gauge: a single numerical value (memory used, fan speed, connection count)
    • Counter: a single monotonically increasing counter (operations done, errors occurred, requests processed)
    • Histogram: an incremented counter per bucket (request counts per latency bucket, CPU load counts per range bucket)
    • Summary: similar to the Histogram, but φ-quantiles are calculated on the client side; calculating other quantiles is not possible
    https://openmetrics.io/
    https://prometheus.io/docs/concepts/metric_types/

  51. OpenMetrics : Average vs Percentile
    [Charts: the same data summarized as an Average and as the 99th percentile; the average smooths away the spikes that the 99th percentile makes visible.]

  55. Metrics : buckets
    [Chart: observations counted into latency buckets <10 <20 <30 <40 <50 <60 <70 <80 <90 <100; the 50th and 90th percentiles are read off the cumulative counts, as sketched below.]
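
    A sketch of how a percentile is read off cumulative bucket counts; with one observation per bucket, as on the chart, the 50th percentile lands in the <50 bucket and the 90th in the <90 bucket:

      BOUNDS = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100].freeze

      def percentile_bucket(counts, pct)
        rank = (pct / 100.0) * counts.sum     # position of the percentile
        cumulative = 0
        counts.each_with_index do |count, i|
          cumulative += count
          return "<#{BOUNDS[i]}" if cumulative >= rank
        end
      end

      counts = Array.new(10, 1)     # one observation in every bucket
      percentile_bucket(counts, 50) # => "<50"
      percentile_bucket(counts, 90) # => "<90"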

  58. Metrics : export data
    • current state
    • current health
    • event counters:
    − AAA events
    − unexpected actions (wrong URLs, sign-in failures)
    − errors during normal operations
    • performance metrics:
    − normal operations
    − queues
    − utilization, saturation
    − query latency
    • application info:
    − version
    − warnings/notifications gauge

  59. Metrics : formats
    • suggest using the Prometheus format:
    − native for Prometheus
    − OpenMetrics — open source specification
    − simple and clear
    − HTTP-based
    − can be easily converted
    − libraries exist
    • Influx or a similar format if you really need to implement a push model
    • protobuf / gRPC: custom, high load
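
    For reference, a small sample of the Prometheus text exposition format (metric names are illustrative):

      # HELP http_requests_total Total number of HTTP requests handled.
      # TYPE http_requests_total counter
      http_requests_total{method="get",code="200"} 1027
      http_requests_total{method="get",code="500"} 3
      # HELP request_duration_seconds Request latency.
      # TYPE request_duration_seconds histogram
      request_duration_seconds_bucket{le="0.5"} 1012
      request_duration_seconds_bucket{le="+Inf"} 1027
      request_duration_seconds_sum 112.3
      request_duration_seconds_count 1027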


  60. Metrics : implementation
    • Prometheus Ruby client: https://github.com/prometheus/client_ruby
    • native Ruby methods
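
    A minimal sketch with the Prometheus Ruby client (the API of recent prometheus-client versions; metric names are illustrative):

      require 'prometheus/client'

      registry = Prometheus::Client.registry

      requests = Prometheus::Client::Counter.new(
        :http_requests_total, docstring: 'Total HTTP requests', labels: [:method])
      latency = Prometheus::Client::Histogram.new(
        :request_duration_seconds, docstring: 'Request latency',
        buckets: [0.1, 0.25, 0.5, 1, 2.5])
      registry.register(requests)
      registry.register(latency)

      requests.increment(labels: { method: 'get' })
      latency.observe(0.42)

    The gem also ships Rack middleware that exposes the registry on /metrics for Prometheus to poll.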

  61. Metrics : bear in mind [1/2]
    • Split statistics by type. For example, aggregating successful (relatively long) and failed (relatively short) durations may create the illusion of a performance increase when multiple failures occur.
    • Whenever possible, use saturation to determine system load. Utilization alone is incomplete information.
    • Be sure to export the metrics of the component closest to the user. This allows evaluating the SLI.
    • Implement configurable bucket sizes.

  62. Metrics : bear in mind [2/2]
    • Export appropriate metrics as buckets. It lowers the polling rate and makes it possible to get percentile statistics.
    • Add units to metric names.
    • Whenever possible, use SI units.
    • Follow the naming standard. The Prometheus “Metric and label naming” document is a good base.
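
    A few names that follow those conventions (illustrative, per the Prometheus naming guidelines):

      http_request_duration_seconds   # base SI unit in the name, not _ms
      process_resident_memory_bytes   # unit suffix
      api_errors_total                # counters end in _total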

  63. Traces

  64. Traces : definition
    In software engineering, tracing involves a
    specialized use of logging to record information
    about a program's execution.

    There is not always a clear distinction between
    tracing and other forms of logging, except that the
    term tracing is almost never applied to logging that is
    a functional requirement of a program.
    — Wikipedia

  65. Traces : use cases
    • debugging during development
    • measuring and tuning performance
    • analyzing failures and security incidents
    More in “How to Implement Tracing in a Modern Distributed Application”
    (https://www.cossacklabs.com/blog/how-to-implement-distributed-tracing.html):
    − approaches
    − library comparison
    − implementation example
    − use cases

  66. Traces : principles
    • Low overhead
    • Application-level transparency
    • Scalability

  67. Traces : spans in trace tree
    https://static.googleusercontent.com/media/research.google.com/uk/pubs/archive/36356.pdf

  68. Traces : kinds of data
    Per request/query tracking:
    • trace id
    • span id
    • parent span id
    • application info (product, component)
    • module name
    • method name
    • context data (session/request id, user id, …)
    • operation name and code
    • start time
    • end time
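
    An illustrative span record carrying these fields (names are hypothetical; every tracing system defines its own wire format):

      span = {
        trace_id:       '4bf92f3577b34da6a3ce929d0e0e4736',
        span_id:        '00f067aa0ba902b7',
        parent_span_id: nil,            # nil marks the root span
        product:        'billing',
        component:      'api',
        module:         'Payments',
        method:         'charge',
        context:        { request_id: 'r-7731', user_id: 42 },
        operation:      { name: 'POST /charge', code: 200 },
        start_time:     '2019-02-16T10:21:03.042Z',
        end_time:       '2019-02-16T10:21:03.090Z'
      }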

  69. Traces : what it looks like

  70. Traces : consumers
    • General purpose collectors: Jaeger, Zipkin
    • Cloud collectors: Google StackDriver, AWS X-Ray, Azure Application Insights
    • SIEM

  71. Traces : formats
    • Proprietary protocols: Jaeger, Zipkin, Google StackDriver, AWS X-Ray, Azure Application Insights
    • JSON: SIEM
    • protobuf/gRPC: custom

  72. Traces : implementation
    • OpenCensus: https://www.rubydoc.info/gems/opencensus (Zipkin, GC Stackdriver, JSON)
    • OpenTracing: https://opentracing.io/guides/ruby/
    • Jaeger client: https://github.com/salemove/jaeger-client-ruby
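
    A minimal sketch with the opencensus gem listed above: wrap an operation in a span. Exporter configuration (Zipkin, Stackdriver, …) is done separately, and the span and attribute names are illustrative:

      require 'opencensus'

      OpenCensus::Trace.start_request_trace do
        OpenCensus::Trace.in_span 'db.query' do |span|
          span.put_attribute 'query.table', 'orders'
          # ... run the query here ...
        end
      end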

  73. Checklists

  74. Checklist : Logs
    □ Each line:
      □ timestamp (ISO 8601, TZ, reasonable precision)
      □ PID
      □ component name
      □ severity
      □ event code
      □ human-readable message
    □ Events to log:
      □ state changes (start/ready/pause/stop)
      □ health changes (new state, reason, doc URL)
      □ user sign-in attempts (including failed ones, with reasons), actions, sign-out
      □ audit trail
      □ errors
    □ On start:
      □ product name, component name
      □ version (+build, +commit hash)
      □ running mode (debug/normal, daemon/)
      □ deprecation warnings
      □ which configuration is in use (ENV, file, configuration service)
    □ On ready: communication sockets and ports
    □ On exit: reason
    □ Do not log:
      □ passwords, tokens
      □ personal data

  75. Checklist : Metrics
    □ Data to export:
      □ application (version, warnings/notifications)
      □ utilization (resources, capacities, usage)
      □ saturation (internally calculated, or appropriate metrics)
      □ rate (operations)
      □ errors
      □ latencies
    □ Split metrics by type
    □ Export as buckets when reasonable
    □ Make bucket sizes configurable
    □ Export metrics for the SLI
    □ Determine the required resolution
    □ Normalize, use SI units, add units to names
    □ Prefer the poll model when possible
    □ Clear counters on restart

  76. Links [1/2]
    • Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
      https://static.googleusercontent.com/media/research.google.com/uk//pubs/archive/36356.pdf
    • How to Implement Tracing in a Modern Distributed Application
      https://www.cossacklabs.com/blog/how-to-implement-distributed-tracing.html
    • OpenTracing
      https://opentracing.io/
    • OpenMetrics
      https://github.com/RichiH/OpenMetrics
    • OpenCensus
      https://opencensus.io

  77. Links [2/2]
    • CEF
      https://kc.mcafee.com/resources/sites/MCAFEE/content/live/CORP_KNOWLEDGEBASE/78000/KB78712/en_US/CEF_White_Paper_20100722.pdf
    • Metrics : USE method
      http://www.brendangregg.com/usemethod.html
    • Google SRE book
      https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/
    • Metrics : RED method
      https://www.weave.works/blog/the-red-method-key-metrics-for-microservices-architecture/
    • MS Azure : monitoring and diagnostics
      https://docs.microsoft.com/en-us/azure/architecture/best-practices/monitoring
    • Prometheus : Metric and label naming
      https://prometheus.io/docs/practices/naming/

  78. Dmytro Shapovalov
    Infrastructure Engineer @ Cossack Labs
    Thank you!
    shadinua
    shad.in.ua
    shad.in.ua
