
Teach your application eloquence. Logs, metrics, traces.

ShaD
February 16, 2019


Most modern applications live in close cooperation with each other. We will talk about ways to use modern techniques effectively for monitoring the health of applications, and look at the tasks and typical implementation mistakes through the eyes of an infrastructure engineer. We will also consider the Ruby libraries that help to implement all of this.


Transcript

  1. Who are we? • UK-based data security products and services company
 • Building security tools to prevent sensitive data leakage and to comply with data security regulations
 • Cryptographic tools, security consulting, training
 • We are cryptographers, system engineers, applied engineers, infrastructure engineers
 • We support the community: we speak, teach, and open source a lot
  2. What we are going to talk about • Why do we need telemetry? • What are the different kinds of telemetry? • Boundaries of applicability of the various types of telemetry • Approaches and mistakes • Implementation
  3. What is telemetry? «Gathering data on the use of applications and application components, measurements of start-up time and processing time, hardware, application crashes, and general usage statistics.»
  4. Why do we need telemetry at all? Who are the consumers?
 − developers
 − devops/sysadmins
 − analysts
 − security staff
For what purposes?
 − debug
 − monitor state and health
 − measure and tune performance
 − business analysis
 − intrusion detection
  5. It is worthwhile, indeed • speed up the development process • increase overall stability • reduce reaction time to crashes and intrusions • adequate business planning
  6. It is worthwhile, indeed • speed up the development process • increase overall stability • reduce reaction time to crashes and intrusions • adequate business planning • COST of development • COST of use
  7. What data do we have to export? … we can ask any specialist. — ALL! … will be their answer.
  8. Classification of information technical:
 − state
 − health
 − errors
 − performance
 − debug
 − events
business:
 − SLI
 − user actions
  9. Classification of information technical:
 − state
 − health
 − errors
 − performance
 − debug
 − events
business:
 − SLI
 − user actions
consumers: developers, devops/sysadmins
  10. Classification of information technical:
 − state
 − health
 − errors
 − performance
 − debug
 − events
business:
 − SLI
 − user actions
consumers: developers, devops/sysadmins, analysts
  11. Classification of information technical:
 − state
 − health
 − errors
 − performance
 − debug
 − events
business:
 − SLI
 − user actions
consumers: developers, devops/sysadmins, analysts, security staff
  12. SIEM — security staff’s main instrument Complex analysis:
 − correlation
 − threats
 − patterns
 − compliance
Sources: applications, network devices, servers, environment
  13. Telemetry evolution Logs • each application has an individual log file • syslog:
 − message standard (RFC 3164, 2001)
 − aggregation • ELK (agents, collectors) • HTTP, JSON, protobuf
  14. Telemetry evolution Logs • each application has an individual log file • syslog:
 − message standard (RFC 3164, 2001)
 − aggregation • ELK (agents, collectors) • HTTP, JSON, protobuf
Metrics • reports into logs • agents, collectors, stores with proprietary protocols • SNMP • HTTP, protobuf • custom implementations
  15. Telemetry evolution Logs • each application has an individual log file • syslog:
 − message standard (RFC 3164, 2001)
 − aggregation • ELK (agents, collectors) • HTTP, JSON, protobuf
Metrics • reports into logs • agents, collectors, stores with proprietary protocols • SNMP • HTTP, protobuf • custom implementations
Traces • reports into logs • agents, collectors, stores with proprietary protocols • custom implementations
  16. Telemetry applicability Logs • simplest • no external tools required • human readable • UNIX-style • compatible with tons of tools • queries • alerts
  17. Telemetry applicability Logs • simplest • no external tools required • human readable • UNIX-style • compatible with tons of tools • queries • alerts
Metrics • minimal store size • low performance impact • performance measuring • health and state observing • special structures • queries • alerts
  18. Telemetry applicability Logs • simplest • no external tools required • human readable • UNIX-style • compatible with tons of tools • queries • alerts
Metrics • minimal store size • low performance impact • performance measuring • health and state observing • special structures • queries • alerts
Traces • minimal store size • low performance impact • per-query metrics • low-level information • precise debugging and performance tuning
  19. Telemetry applicability Logs • simplest • no external tools required • human readable • UNIX-style • compatible with tons of tools • queries • alerts
Metrics • minimal store size • low performance impact • performance measuring • health and state observing • special structures • queries • alerts
Traces • minimal store size • low performance impact • per-query metrics • low-level information • precise debugging and performance tuning
+ SIEM systems
  20. Logs : kinds of data • initial information about the application • state changes (start/ready/…/stop) • health changes • audit trail (security-relevant list of activities: financial operations, health care data transactions, changing keys, changing configuration) • user sessions (sign-in attempts, sign-out, actions) • unexpected actions (wrong URLs, sign-in failures, etc.) • various information in string format
  21. Logs : on start • new state: starting • application name • component name • commit hash / build number • configuration in use • deprecation warnings • running mode
  22. Logs : on ready • new state: ready • listen interfaces, ports and sockets • health
  23. Logs : on state or health change • new state • reason • URL to documentation
  24. Logs : on state or health change • new state • reason • URL to documentation
Use a traffic-light highlight system for health states:
 • red — completely unhealthy
 • yellow — partially healthy, reduced functionality
 • green — completely healthy
  25. Logs : on shutdown • reason • status of preparations for shutdown • new state: stopped (final goodbye)
  26. Logs : each line • timestamps (ISO8601, TZ, reasonable precision) • PID • application/component short name • application version (JSON, CEF, protobuf) • severity (CEF: 0→10, rfc5427: 7→0) • event code (HTTP style) • human-readable message
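A minimal sketch of such a line using only Ruby's standard library (Logger, JSON, Time); the application name, version and event code below are placeholders, and a real project would more likely use one of the gems from the implementation slide further on:

    require 'logger'
    require 'json'
    require 'time'

    # One JSON document per line, carrying the fields listed above.
    logger = Logger.new($stderr)
    logger.formatter = proc do |severity, time, _progname, payload|
      JSON.generate(
        time:     time.utc.iso8601(3),   # ISO 8601, UTC, millisecond precision
        pid:      Process.pid,
        app:      'example-app',         # placeholder component name
        version:  '1.2.3',               # placeholder version / build
        severity: severity,
        code:     payload[:code],        # HTTP-style event code
        message:  payload[:message]
      ) + "\n"
    end

    logger.info(code: 200, message: 'application started')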
  27. Logs : do not export! • passwords, tokens, any sensitive data — security risks • private data — legal risks
Use:
 − masking
 − anonymisation / pseudonymisation
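A tiny pure-Ruby masking helper as an illustration; the list of sensitive keys is an arbitrary example:

    # Replace values of sensitive keys before the hash reaches a log line.
    SENSITIVE_KEYS = %w[password token api_key card_number].freeze

    def mask(params)
      params.map do |key, value|
        [key, SENSITIVE_KEYS.include?(key.to_s) ? '[MASKED]' : value]
      end.to_h
    end

    mask('user' => 'alice', 'password' => 'secret')
    # => {"user"=>"alice", "password"=>"[MASKED]"}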
  28. Logs : consumers • Console • Files • General-purpose collector/store/alert/search system • SIEM
  29. Logs : consumers and formats (a matrix of formats vs. consumers)
Consumers: console/STDERR, file, syslog, ELK, SIEM, socket/HTTP/custom
Formats: plain, syslog (RFC 3164), JSON, CEF, protobuf
On the slide, syslog (RFC 3164), JSON and CEF are marked (✓) as fitting all of the consumers; protobuf and plain text only a few of them.
  30. Logs : CEF • old (2009), but widely used standard • simple: easy to generate, easy to parse (supported even by devices without powerful CPUs) • well documented:
 − field name dictionaries
 − field types
CEF:Version|Device Vendor|Device Product|Device Version|Signature ID|Name|Severity|Extension
Sep 19 08:26:10 host CEF:0|security|threatmanager|1.0|100|worm successfully stopped|10|src=10.0.0.1 dst=2.1.2.2 spt=1232
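A short Ruby sketch that builds a CEF line from the template above; it skips the escaping rules (for '|', '=' and newlines) that a complete implementation would need:

    # Assemble one CEF log line: header fields joined by '|', then key=value extensions.
    def cef_line(vendor:, product:, version:, sig_id:, name:, severity:, ext: {})
      extension = ext.map { |key, value| "#{key}=#{value}" }.join(' ')
      "CEF:0|#{vendor}|#{product}|#{version}|#{sig_id}|#{name}|#{severity}|#{extension}"
    end

    cef_line(vendor: 'security', product: 'threatmanager', version: '1.0',
             sig_id: 100, name: 'worm successfully stopped', severity: 10,
             ext: { src: '10.0.0.1', dst: '2.1.2.2', spt: 1232 })
    # => "CEF:0|security|threatmanager|1.0|100|worm successfully stopped|10|src=10.0.0.1 dst=2.1.2.2 spt=1232"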
  31. Logs : bear in mind [1/3] • Logs will be read by humans, often when a failure happens, with limited time to react. Be brief and eloquent. Give information that may help to solve the problem. • Logs will be searched. Don’t be a poet, be a technical specialist. Use expected words. • Logs will be parsed automatically; indeed, they will. There are too many different systems that want telemetry from your application. • Carefully classify the severity of events. Emitting many error messages instead of warnings in non-critical situations will lead to information from the logs being ignored.
  32. Logs : bear in mind [2/3] • Whenever possible, build on existing standards. Grouping event codes along the lines of the HTTP status code table is not a bad idea. • Logs are the first resource for analyzing security incidents. • Logs will be archived and stored for a long period of time. It will be almost impossible to cut out some pieces of data later. • Everything should be configurable: formats, transport protocols, paths, severity.
  33. Logs : bear in mind [3/3] • Your application may run in many different environments with different logging standards (VM, Docker). The application should be able to direct all logs into one channel; splitting may be an option. • Do not implement log file rotation yourself. Provide a way to tell your application to gracefully recreate the log file after it has been rotated by an external service. • When big trouble occurs and nothing works, your application should be able to print readable logs in the simplest manner: to stderr/stdout.
  34. Logs : implementation • native Ruby methods • semantic_logger
 https://github.com/rocketjob/semantic_logger
 (a lot of destinations: DBs, HTTP, UDP, syslog) • ougai
 https://github.com/tilfin/ougai
 (JSON) • httplog
 https://github.com/trusche/httplog
 (HTTP logging, JSON support)
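For example, a minimal sketch with ougai, assuming its Bunyan-style API (one JSON document per line); the field names are arbitrary:

    require 'ougai'

    logger = Ougai::Logger.new($stdout)

    # Message plus arbitrary structured fields.
    logger.info('user signed in', user_id: 42, session_id: 'abc123')

    # Exceptions are serialized with class, message and backtrace.
    begin
      raise 'connection refused'
    rescue => e
      logger.error('database unreachable', e)
    end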
  35. Metrics : approaches • USE method
 Utilization, Saturation, Errors • Google SRE book
 Latency, Traffic, Errors, Saturation • RED method
 Rate, Errors, Duration
  36. Metrics : utilization • Hardware resources: CPU, disk system, network interfaces • File system: capacity, usage • Memory: capacity, cache, heap, queue • Resources: file descriptors, threads, sockets, connections
The average time that the resource was busy servicing work. Usage of a resource.
  37. Metrics : traffic, rate • normal operations:
 − requests
 − queries
 − transactions
 − sending network packets
 − processing flow bytes
A measure of how much demand is being placed on your system. (Google SRE book)
The number of requests, per second, your services are serving. (RED Method)
  38. Metrics : latency, duration The time it takes to service a request. (Google SRE book) • latency of operations:
 − requests
 − queries
 − transactions
 − sending network packets
 − processing flow bytes
  39. Metrics : errors • error events:
 − hardware errors
 − software exceptions
 − invalid requests / input
 − authentication failures
 − invalid URLs
The count of error events. (USE Method)
The rate of requests that fail, either explicitly, implicitly, or by policy. (Google SRE book)
  40. Metrics : saturation • calculated value, a measure of current load
The degree to which the resource has extra work which it can't service, often queued. (USE Method)
How "full" your service is. A measure of your system fraction, emphasizing the resources that are most constrained. (Google SRE book)
  41. Metrics : saturation • can be calculated internally or measured externally • high utilization is a problem • high saturation is a problem • a low utilization level does not guarantee that everything is OK • low saturation (in the case of a correct calculation) most likely indicates that everything is OK
  42. OpenMetrics : based on Prometheus metric types • Gauge
 single numerical value
 − memory used
 − fan speed
 − connections count • Counter
 single monotonically increasing counter
 − operations done
 − errors occurred
 − requests processed • Histogram
 a counter incremented per bucket
 − requests count per latency bucket
 − CPU load values count per range bucket • Summary
 similar to the Histogram, but φ-quantiles are calculated on the client side; calculating other quantiles later is not possible
https://openmetrics.io/ https://prometheus.io/docs/concepts/metric_types/
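A minimal sketch of these types with the prometheus-client gem, assuming its keyword-argument API (roughly version 0.10 and later); metric names and bucket boundaries are arbitrary examples:

    require 'prometheus/client'

    registry = Prometheus::Client.registry

    # Gauge: a single value that can go up and down.
    connections = Prometheus::Client::Gauge.new(
      :db_connections, docstring: 'Open database connections')
    registry.register(connections)
    connections.set(12)

    # Counter: monotonically increasing.
    errors = Prometheus::Client::Counter.new(
      :errors_total, docstring: 'Errors occurred', labels: [:kind])
    registry.register(errors)
    errors.increment(labels: { kind: 'timeout' })

    # Histogram: observations counted into cumulative buckets.
    latency = Prometheus::Client::Histogram.new(
      :request_duration_seconds, docstring: 'Request latency',
      buckets: [0.01, 0.05, 0.1, 0.5, 1, 5])
    registry.register(latency)
    latency.observe(0.042)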
  43. Metrics : buckets
    bucket upper bounds:  <10  <20  <30  <40  <50  <60  <70  <80  <90  <100
  44. Metrics : buckets
    bucket upper bounds:  <10  <20  <30  <40  <50  <60  <70  <80  <90  <100
    observations:           1    1    1    1    1    1    1    1    1     1
  45. Metrics : buckets
    bucket upper bounds:  <10  <20  <30  <40  <50  <60  <70  <80  <90  <100
    observations:           1    1    1    1    1    1    1    1    1     1
    (with one observation per bucket, the 50th percentile falls in the <50 bucket and the 90th percentile in the <90 bucket)
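A small pure-Ruby sketch of the idea: estimating a percentile from cumulative bucket counters, the way histogram metrics are usually queried:

    # Cumulative histogram: each bucket counts observations <= its upper bound.
    BUCKETS = { 10 => 1, 20 => 2, 30 => 3, 40 => 4, 50 => 5,
                60 => 6, 70 => 7, 80 => 8, 90 => 9, 100 => 10 }

    # Upper bound of the first bucket whose cumulative count reaches
    # the requested share of all observations.
    def estimate_percentile(buckets, quantile)
      total  = buckets.values.max
      target = total * quantile
      buckets.find { |_bound, count| count >= target }&.first
    end

    estimate_percentile(BUCKETS, 0.50)  # => 50
    estimate_percentile(BUCKETS, 0.90)  # => 90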
  46. Metrics : export data • current state • current health • event counters:
 − AAA events
 − unexpected actions (wrong URLs, sign-in failures)
 − errors during normal operations • performance metrics:
 − normal operations
 − queues
 − utilization, saturation
 − query latency • application info (gauge):
 − version
 − warnings/notifications
  47. Metrics : formats • suggest using the Prometheus format
 − native for Prometheus
 − OpenMetrics — open source specification
 − simple and clear
 − HTTP-based
 − can be easily converted
 − libraries exist • Influx or a similar format if you really need to implement a push model • protobuf / gRPC
 − custom
 − high load
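For reference, a rough illustration of the Prometheus text exposition format; the metric names and numbers are invented for the example:

    # HELP http_requests_total Total HTTP requests processed
    # TYPE http_requests_total counter
    http_requests_total{code="200"} 1027
    http_requests_total{code="500"} 3
    # HELP request_duration_seconds Request latency
    # TYPE request_duration_seconds histogram
    request_duration_seconds_bucket{le="0.1"} 240
    request_duration_seconds_bucket{le="0.5"} 310
    request_duration_seconds_bucket{le="+Inf"} 320
    request_duration_seconds_sum 42.7
    request_duration_seconds_count 320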

  48. Metrics : bear in mind [1/2] • Split statistics by type. For example, aggregating successful (relatively long) and failed (relatively short) durations together may create the illusion of a performance increase when multiple failures occur. • Whenever possible, use saturation to determine the load on the system; utilization alone is not complete information. • Be sure to export the metrics of the component closest to the user. This allows you to evaluate the SLI. • Make bucket sizes configurable.
  49. Metrics : bear in mind [2/2] • Export appropriate metrics as buckets. This lowers the polling rate and makes it possible to get percentile statistics. • Add units to metric names. • Whenever possible, use SI units. • Follow a naming standard. The Prometheus “Metric and label naming” document is a good base.
  50. Traces : definition In software engineering, tracing involves a specialized use of logging to record information about a program's execution. … There is not always a clear distinction between tracing and other forms of logging, except that the term tracing is almost never applied to logging that is a functional requirement of a program. — Wikipedia
  51. Traces : use cases • Debugging during development • Measuring and tuning performance • Analyzing failures and security incidents
https://www.cossacklabs.com/blog/how-to-implement-distributed-tracing.html • Approaches • Library comparison • Implementation example • Use cases
  52. Traces : kinds of data Per request/query tracking: • trace id • span id • parent span id • application info (product, component) • module name • method name • context data (session/request id, user id, …) • operation name and code • start time • end time
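As a plain illustration, these per-span fields could be carried around as a Ruby hash; the keys follow this slide rather than any particular tracer's schema, and all values are placeholders:

    span = {
      trace_id:       'f067aa0ba902b7e3',   # shared by every span of one request
      span_id:        '00f067aa0ba902b7',
      parent_span_id: nil,                  # nil for the root span
      product:        'example-product',
      component:      'api',
      module_name:    'orders',
      method_name:    'create',
      context:        { request_id: 'r-123', user_id: 42 },
      operation:      'POST /orders',
      code:           201,
      start_time:     Time.now,
      end_time:       nil
    }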
  53. Traces : consumers • General purpose collectors:
 − Jaeger
 − Zipkin • Cloud collectors:
 − Google StackDriver
 − AWS X-Ray
 − Azure Application Insights • SIEM
  54. Traces : formats • Proprietary protocols:
 − Jaeger
 − Zipkin
 − Google StackDriver
 − AWS X-Ray
 − Azure Application Insights • JSON:
 − SIEM • protobuf/gRPC:
 − custom
  55. Traces : implementation • OpenCensus
 https://www.rubydoc.info/gems/opencensus
 (Zipkin, GC Stackdriver, JSON) • OpenTracing
 https://opentracing.io/guides/ruby/ • Jaeger client
 https://github.com/salemove/jaeger-client-ruby
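A minimal sketch against the opentracing gem's API; with nothing assigned, the global tracer stays a no-op, and wiring in a concrete backend (for example the Jaeger client above) is left out. Operation, tag and log names are arbitrary:

    require 'opentracing'

    # In production, assign a real tracer built by a backend-specific client:
    # OpenTracing.global_tracer = some_tracer

    OpenTracing.start_active_span('handle_request') do |scope|
      scope.span.set_tag('http.method', 'GET')
      scope.span.set_tag('user.id', 42)

      OpenTracing.start_active_span('db_query') do |child|
        child.span.log_kv(event: 'query.start', statement: 'SELECT 1')
        # ... run the query here ...
      end
    end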
  56. Checklist : Logs □ Each line:
 □ timestamps (ISO8601, TZ, reasonable precision)
 □ PID
 □ component name
 □ severity
 □ event code
 □ human-readable message
□ Events to log:
 □ state changes (start/ready/pause/stop)
 □ health changes (new state, reason, doc URL)
 □ user sign-in attempts (including failed ones, with reasons), actions, sign-out
 □ audit trail
 □ errors
□ On start:
 □ product name, component name
 □ version (+build, +commit hash)
 □ running mode (debug/normal, daemon/)
 □ deprecation warnings
 □ which configuration is in use (ENV, file, configuration service)
□ On ready: communication sockets and ports
□ On exit: reason
□ Do not log:
 □ passwords, tokens
 □ personal data
  57. Checklist : Metrics □ Data to export:
 □ application info (version, warnings/notifications)
 □ utilization (resources, capacities, usage)
 □ saturation (internally calculated or appropriate metrics)
 □ rate (operations)
 □ errors
 □ latencies
□ Split metrics by type
□ Export as buckets when reasonable
□ Make bucket sizes configurable
□ Export metrics for SLI
□ Determine the required resolution
□ Normalize, use SI units, add units to names
□ Prefer the poll model when possible
□ Clear counters on restart
  58. Links [1/2] • Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
 https://static.googleusercontent.com/media/research.google.com/uk//pubs/archive/36356.pdf • How to Implement Tracing in a Modern Distributed Application
 https://www.cossacklabs.com/blog/how-to-implement-distributed-tracing.html • OpenTracing
 https://opentracing.io/ • OpenMetrics
 https://github.com/RichiH/OpenMetrics • OpenCensus
 https://opencensus.io
  59. Links [2/2] • CEF
 https://kc.mcafee.com/resources/sites/MCAFEE/content/live/CORP_KNOWLEDGEBASE/78000/KB78712/en_US/CEF_White_Paper_20100722.pdf • Metrics : USE method
 http://www.brendangregg.com/usemethod.html • Google SRE book
 https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/ • Metrics : RED method
 https://www.weave.works/blog/the-red-method-key-metrics-for-microservices-architecture/ • MS Azure : monitoring and diagnostics
 https://docs.microsoft.com/en-us/azure/architecture/best-practices/monitoring • Prometheus : Metric and label naming
 https://prometheus.io/docs/practices/naming/