Slide 1

Slide 1 text

Teach your application eloquence. Logs, metrics, traces. Dmytro Shapovalov Infrastructure Engineer @ Cossack Labs

Slide 2

Slide 2 text

Who are we? • A UK-based data security products and services company
 • Building security tools to prevent sensitive data leakage and to comply with data security regulations
 • Cryptographic tools, security consulting, training
 • We are cryptographers, system engineers, applied engineers, infrastructure engineers
 • We support the community, speak, teach, and open source a lot

Slide 3

Slide 3 text

What we are going to talk about • Why do we need telemetry? • What are the different kinds of telemetry? • Boundaries of applicability of various types of telemetry • Approaches and mistakes • Implementation

Slide 4

Slide 4 text

What is telemetry? «Gathering data on the use of applications and application components, measurements of start-up time and processing time, hardware, application crashes, and general usage statistics.»

Slide 5

Slide 5 text

Why do we need telemetry at all? Who are the consumers?
 − developers
 − devops/sysadmins
 − analysts
 − security staff
 What purposes?
 − debug
 − monitor state and health
 − measure and tune performance
 − business analysis
 − intrusion detection

Slide 6

Slide 6 text

It is worthwhile, indeed • speed up the development process • increase overall stability • reduce reaction time to crashes and intrusions • adequate business planning

Slide 7

Slide 7 text

It is worthwhile, indeed • speed up the development process • increase overall stability • reduce reaction time to crashes and intrusions • adequate business planning • COST of development • COST of use

Slide 8

Slide 8 text

What data do we have to export? … we can ask any specialist.

Slide 9

Slide 9 text

What data do we have to export? … we can ask any specialist. — ALL! … will be their answer.

Slide 10

Slide 10 text

Classification of information technical:
 − state
 − health
 − errors
 − performance
 − debug
 − events

Slide 11

Slide 11 text

Classification of information technical:
 − state
 − health
 − errors
 − performance
 − debug
 − events
 business:
 − SLI
 − user actions

Slide 12

Slide 12 text

Classification of information technical:
 − state
 − health
 − errors
 − performance
 − debug
 − events
 business:
 − SLI
 − user actions
 developers, devops/sysadmins

Slide 13

Slide 13 text

Classification of information technical:
 − state
 − health
 − errors
 − performance
 − debug
 − events
 business:
 − SLI
 − user actions
 developers, devops/sysadmins, analysts

Slide 14

Slide 14 text

Classification of information technical:
 − state
 − health
 − errors
 − performance
 − debug
 − events
 business:
 − SLI
 − user actions
 developers, devops/sysadmins, analysts, security staff

Slide 15

Slide 15 text

SIEM — security staff’s main instrument Complex analysis:
 − correlation
 − threats
 − patterns
 − compliance
 Sources: applications, network devices, servers, environment

Slide 16

Slide 16 text

Telemetry evolution Logs • each application has an individual log file • syslog:
 − message standard (RFC 3164, 2001)
 − aggregation • ELK (agents, collectors) • HTTP, JSON, protobuf

Slide 17

Slide 17 text

Telemetry evolution Logs • each application has an individual log file • syslog:
 − message standard (RFC 3164, 2001)
 − aggregation • ELK (agents, collectors) • HTTP, JSON, protobuf
Metrics • reports into logs • agents, collectors, stores with proprietary protocols • SNMP • HTTP, protobuf • custom implementations

Slide 18

Slide 18 text

Telemetry evolution Logs • each application has an individual log file • syslog:
 − message standard (RFC 3164, 2001)
 − aggregation • ELK (agents, collectors) • HTTP, JSON, protobuf
Metrics • reports into logs • agents, collectors, stores with proprietary protocols • SNMP • HTTP, protobuf • custom implementations
Traces • reports into logs • agents, collectors, stores with proprietary protocols • custom implementations

Slide 19

Slide 19 text

Telemetry applicability Logs • simplest • no external tools required • human readable • UNIX-style • compatible with tons of tools • queries • alerts

Slide 20

Slide 20 text

Telemetry applicability Logs • simplest • no external tools required • human readable • UNIX-style • compatible with tons of tools • queries • alerts
Metrics • minimal store size • low performance impact • performance measuring • health and state observing • special structures • queries • alerts

Slide 21

Slide 21 text

Telemetry applicability Logs • simplest • no external tools required • human readable • UNIX-style • compatible with tons of tools • queries • alerts
Metrics • minimal store size • low performance impact • performance measuring • health and state observing • special structures • queries • alerts
Traces • minimal store size • low performance impact • per-query metrics • low-level information • precise debugging and performance tuning

Slide 22

Slide 22 text

Telemetry applicability Logs • simplest • no external tools required • human readable • UNIX-style • compatible with tons of tools • queries • alerts
Metrics • minimal store size • low performance impact • performance measuring • health and state observing • special structures • queries • alerts
Traces • minimal store size • low performance impact • per-query metrics • low-level information • precise debugging and performance tuning
+ SIEM systems

Slide 23

Slide 23 text

Telemetry flow creation

Slide 24

Slide 24 text

Telemetry flow: creation → transport → aggregation → normalization → store → analyze + alerting → visualize → archive

Slide 25

Slide 25 text

Logs

Slide 26

Slide 26 text

Logs : kinds of data • initial information about the application • state changes (start/ready/…/stop) • health changes • audit trail (security-relevant list of activities: financial operations, health care data transactions, changing keys, changing configuration) • user sessions (sign-in attempts, sign-out, actions) • unexpected actions (wrong URLs, sign-in failures, etc.) • various information in string format

Slide 27

Slide 27 text

Logs : on start • new state: starting • application name • component name • commit hash / build number • configuration in use • deprecation warnings • running mode

Slide 28

Slide 28 text

Logs : on ready • new state: ready • listen interfaces, ports and sockets • health

Slide 29

Slide 29 text

Logs : on state or health change • new state • reason • URL to documentation

Slide 30

Slide 30 text

Logs : on state or health change • new state • reason • URL to documentation Use a traffic-light highlight system for health states:
 ● red — completely unhealthy
 ● yellow — partially healthy, reduced functionality
 ● green — completely healthy

Slide 31

Slide 31 text

Logs : on shutdown • reason • status of shutdown preparation • new state: stopped (final goodbye)

Slide 32

Slide 32 text

Logs : each line • timestamps (ISO8601, TZ, reasonable precision) • PID • application/component short name • application version (JSON, CEF, protobuf) • severity (CEF: 0→10, RFC 5427: 7→0) • event code (HTTP style) • human-readable message
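A minimal sketch, in plain Ruby, of emitting one log line with these fields as JSON; the field names and the application name are illustrative choices, not a standard:

```ruby
require 'json'
require 'time'

# Emit a single structured log line to STDERR with the fields listed above.
# Field and application names here are examples; match your own pipeline.
def log_line(severity:, event_code:, message:)
  line = {
    ts: Time.now.utc.iso8601(3),   # ISO 8601, UTC, millisecond precision
    pid: Process.pid,
    app: 'example-app',            # application/component short name (example)
    version: '1.2.3',              # build/version identifier (example)
    severity: severity,            # e.g. 'info', 'warning', 'error'
    code: event_code,              # HTTP-style event code
    msg: message                   # human-readable message
  }
  $stderr.puts(line.to_json)
end

log_line(severity: 'info', event_code: 200, message: 'component ready')
```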

Slide 33

Slide 33 text

Logs : do not export! • passwords, tokens, any sensitive data — security risks • private data — legal risks Use: − masking − anonymisation / pseudonymisation
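A minimal masking sketch in plain Ruby; the list of sensitive keys is a hypothetical example, and production systems usually do this inside the logging library or a dedicated filter:

```ruby
# Mask values of sensitive keys before a payload reaches the logger.
# The key list below is a hypothetical example; maintain your own.
SENSITIVE_KEYS = %w[password token authorization card_number].freeze

def mask_sensitive(payload)
  payload.map do |key, value|
    if SENSITIVE_KEYS.include?(key.to_s.downcase)
      [key, '***MASKED***']
    else
      [key, value]
    end
  end.to_h
end

mask_sensitive(user: 'alice', password: 's3cret')
# => {:user=>"alice", :password=>"***MASKED***"}
```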

Slide 34

Slide 34 text

Logs : consumers • Console • Files • General purpose collector/store/alert/search system. • SIEM

Slide 35

Slide 35 text

Logs : consumers and formats
 Consumers: console/STDERR, file, syslog, ELK, SIEM, socket/HTTP/custom
 plain ✓
 syslog (RFC3164) ✓ ✓ ✓ ✓ ✓ ✓
 JSON ✓ ✓ ✓ ✓ ✓ ✓
 CEF ✓ ✓ ✓ ✓ ✓ ✓
 protobuf ✓ ✓

Slide 36

Slide 36 text

Logs : CEF • old (2009), but widely used standard • simple: easy to generate, easy to parse (supported even by devices without powerful CPUs) • well documented:
 − field name dictionaries
 − field types
 CEF:Version|Device Vendor|Device Product|Device Version|Signature ID|Name|Severity|Extension
 Sep 19 08:26:10 host CEF:0|security|threatmanager|1.0|100|worm successfully stopped|10|src=10.0.0.1 dst=2.1.2.2 spt=1232
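Because CEF is just a pipe-delimited header plus key=value extensions, generating it needs no special library. A sketch with simplified escaping (the white paper also requires escaping backslashes and, inside extensions, the '=' sign):

```ruby
# Build a CEF:0 line: a pipe-delimited header plus key=value extension pairs.
# Escaping is simplified (only '|' in header fields); see the CEF spec for full rules.
def cef_line(vendor:, product:, version:, signature_id:, name:, severity:, extensions: {})
  header = [vendor, product, version, signature_id, name, severity]
           .map { |field| field.to_s.gsub('|') { '\\|' } }
  ext = extensions.map { |key, value| "#{key}=#{value}" }.join(' ')
  "CEF:0|#{header.join('|')}|#{ext}"
end

puts cef_line(vendor: 'security', product: 'threatmanager', version: '1.0',
              signature_id: 100, name: 'worm successfully stopped', severity: 10,
              extensions: { src: '10.0.0.1', dst: '2.1.2.2', spt: 1232 })
# => CEF:0|security|threatmanager|1.0|100|worm successfully stopped|10|src=10.0.0.1 dst=2.1.2.2 spt=1232
```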

Slide 37

Slide 37 text

CEF naming, data formats + JSON/protobuf/… transport = painless logging

Slide 38

Slide 38 text

Logs : bear in mind [1/3] • Logs will be read by humans, often when a failure happens and with limited time to react. Be brief and eloquent. Give information that may help solve the problem. • Logs will be searched. Don’t be a poet, be a technical specialist. Use expected words. • Logs will be parsed automatically; indeed, they will. There are too many different systems that want telemetry from your application. • Carefully classify the severity of events. Emitting error messages instead of warnings in non-critical situations will lead to information from the logs being ignored.

Slide 39

Slide 39 text

Logs : bear in mind [2/3] • Whenever possible, build on existing standards. Grouping event codes according to the HTTP status code table is not a bad idea. • Logs are the first resource for analyzing security incidents. • Logs will be archived and stored for a long period of time. It will be almost impossible to cut out some pieces of data later. • These should be configurable: formats, transport protocols, paths, severity.

Slide 40

Slide 40 text

Logs : bear in mind [3/3] • Your application may run in many different environments with different logging standards (VM, Docker). The application should be able to direct all logs into one channel; splitting may be an option. • Do not implement log file rotation yourself. Provide a way to tell your application to gracefully recreate its log file after an external service has rotated it (see the sketch below). • When big trouble occurs and nothing works, your application should be able to print readable logs in the simplest manner — to stderr/stdout.
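A sketch of cooperating with an external rotation service using only the Ruby standard library; the log path is an example, and reacting to SIGHUP is a common convention rather than a requirement:

```ruby
require 'logger'

LOG_PATH = '/var/log/example-app/app.log'   # example path
logger = Logger.new(LOG_PATH)

# Keep the signal handler trivial: set a flag, let the main loop do the work.
reopen_requested = false
Signal.trap('HUP') { reopen_requested = true }

loop do
  if reopen_requested
    logger.reopen(LOG_PATH)   # start writing a fresh file after external rotation
    reopen_requested = false
  end
  logger.info('heartbeat')
  sleep 60
end
```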

Slide 41

Slide 41 text

Logs : implementation • native Ruby methods • semantic_logger
 https://github.com/rocketjob/semantic_logger
 (a lot of destinations: DBs, HTTP, UDP, syslog) • ougai
 https://github.com/tilfin/ougai
 (JSON) • httplog
 https://github.com/trusche/httplog
 (HTTP logging, JSON support)
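For the simplest case, the standard library is enough. A sketch that keeps the output machine-parseable by swapping in a JSON formatter; semantic_logger and ougai provide the same idea out of the box, with more destinations:

```ruby
require 'logger'
require 'json'
require 'time'

logger = Logger.new($stderr)
logger.level = Logger::INFO

# Replace the default formatter so every line is a single JSON document.
logger.formatter = proc do |severity, time, progname, msg|
  { ts: time.utc.iso8601(3), severity: severity,
    app: progname || 'example-app', pid: Process.pid, msg: msg }.to_json + "\n"
end

logger.info('starting, commit=abc1234, mode=normal')
```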

Slide 42

Slide 42 text

Metrics

Slide 43

Slide 43 text

Metrics : approaches • USE method
 Utilization, Saturation, Errors • Google SRE book
 Latency, Traffic, Errors, Saturation • RED method
 Rate, Errors, Duration

Slide 44

Slide 44 text

Metrics : utilization • Hardware resources: CPU, disk system, network interfaces • File system: capacity, usage • Memory: capacity, cache, heap, queue • Resources: file descriptors, threads, sockets, connections The average time that the resource was busy servicing work; the usage of a resource.

Slide 45

Slide 45 text

Metrics : traffic, rate • normal operations: 
 − requests
 − queries
 − transactions
 − sending network packets
 − processing flow bytes A measure of how much demand is being placed on your system. (Google SRE book) The number of requests per second your services are serving. (RED Method)

Slide 46

Slide 46 text

Metrics : latency, duration The time it takes to service a request. (Google SRE book) • latency of operations: 
 − requests
 − queries
 − transactions
 − sending network packets
 − processing flow bytes

Slide 47

Slide 47 text

Metrics : errors • error events:
 − hardware errors
 − software exceptions
 − invalid requests / input
 − authentication failures
 − invalid URLs The count of error events. (USE Method) The rate of requests that fail, either explicitly, implicitly, or by policy. (Google SRE book)

Slide 48

Slide 48 text

Metrics : saturation • calculated value, measure of current load The degree to which the resource has extra work which it can't service, often queued. (USE Method) How "full" your service is. A measure of your system fraction, emphasizing the resources that are most constrained. (Google SRE book)

Slide 49

Slide 49 text

Metrics : saturation • can be calculated internally or measured externally • high utilization is a problem • high saturation is a problem • low utilization level does not guarantee that everything is OK • low saturation (in the case of a correct calculation) most likely indicates that everything is OK
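One common way to derive saturation internally is as the ratio of queued work to capacity. A minimal sketch; the formula is an illustrative assumption, and the right definition depends on the resource being modelled:

```ruby
# Saturation as the fraction of capacity already consumed by queued work.
# Illustrative only: pick a formula that matches the resource in question.
def saturation(queue_length, capacity)
  return 1.0 if capacity.zero?
  [queue_length.to_f / capacity, 1.0].min
end

saturation(15, 100)   # => 0.15  (most likely everything is OK)
saturation(250, 100)  # => 1.0   (more work queued than the system can service)
```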

Slide 50

Slide 50 text

OpenMetrics : based on Prometheus metric types • Gauge
 single numerical value
 − memory used
 − fan speed
 − connections count • Counter
 single monotonically increasing counter
 − operations done
 − errors occurred
 − requests processed • Histogram
 increment counter per buckets
 − requests count per latency buckets
 − CPU load values count per range buckets • Summary
 similar to the Histogram, but φ-quantiles are calculated on the client side; calculating other quantiles is not possible
 https://openmetrics.io/ https://prometheus.io/docs/concepts/metric_types/
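A sketch of the first three types with the Prometheus Ruby client; metric names are examples, and the constructor keywords shown match recent versions of the prometheus-client gem (older versions use positional docstrings):

```ruby
require 'prometheus/client'

registry = Prometheus::Client.registry

# Gauge: a single value that can go up and down.
connections = Prometheus::Client::Gauge.new(
  :app_connections, docstring: 'Current number of open connections')

# Counter: monotonically increasing.
errors = Prometheus::Client::Counter.new(
  :app_errors_total, docstring: 'Total number of errors')

# Histogram: increments the counter of the bucket the observation falls into.
latency = Prometheus::Client::Histogram.new(
  :app_request_duration_seconds,
  docstring: 'Request latency in seconds',
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5])

[connections, errors, latency].each { |metric| registry.register(metric) }

connections.set(42)
errors.increment
latency.observe(0.27)
```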

Slide 51

Slide 51 text

OpenMetrics : Average vs Percentile Average

Slide 52

Slide 52 text

OpenMetrics : Average vs Percentile Average

Slide 53

Slide 53 text

OpenMetrics : Average vs Percentile Average 99th percentile

Slide 54

Slide 54 text

OpenMetrics : Average vs Percentile Average 99th percentile

Slide 55

Slide 55 text

Metrics : buckets <10 < 20 < 30 < 40 < 50 < 60 < 70 < 80 < 90 < 100

Slide 56

Slide 56 text

Metrics : buckets <10 < 20 < 30 < 40 < 50 < 60 < 70 < 80 < 90 < 100 1 1 1 1 1 1 1 1 1 1

Slide 57

Slide 57 text

Metrics : buckets
 <10 < 20 < 30 < 40 < 50 < 60 < 70 < 80 < 90 < 100
 1 1 1 1 1 1 1 1 1 1
 (with one sample per bucket, the 50th percentile falls in the <50 bucket and the 90th percentile in the <90 bucket)
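Percentiles are then read straight off the bucket counters. A plain Ruby sketch of estimating a percentile from the counts shown above; Prometheus-style histograms additionally interpolate within the matching bucket:

```ruby
# Estimate a percentile from per-bucket counts (upper bounds as on the slide).
BOUNDS = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100].freeze
COUNTS = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1].freeze

def percentile_upper_bound(percent, bounds, counts)
  target  = percent / 100.0 * counts.sum
  running = 0
  bounds.each_with_index do |bound, i|
    running += counts[i]
    return bound if running >= target   # first bucket whose cumulative count covers the target
  end
  bounds.last
end

percentile_upper_bound(50, BOUNDS, COUNTS)  # => 50
percentile_upper_bound(90, BOUNDS, COUNTS)  # => 90
```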

Slide 58

Slide 58 text

Metrics : export data • current state • current health • event counters:
 − AAA events
 − unexpected actions (wrong URLs, sign-in failures)
 − errors during normal operations • performance metrics
 − normal operations
 − queues
 − utilization, saturation
 − query latency • application info:
 − version
 − warnings/notifications gauge

Slide 59

Slide 59 text

Metrics : formats • suggest using Prometheus format
 − native for Prometheus
 − OpenMetrics — open source specification
 − simple and clear
 − HTTP-based
 − can be easily converted
 − libraries exist • Influx or a similar format if you really need to implement a push model • protobuf / gRPC
 − custom
 − high load


Slide 60

Slide 60 text

Metrics : implementation • Prometheus Ruby client
 https://github.com/prometheus/client_ruby • native Ruby methods
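A minimal pull-model setup using the Rack middleware bundled with the prometheus-client gem; the application behind it is just a placeholder:

```ruby
# config.ru — expose the default registry at /metrics in Prometheus text format.
require 'rack'
require 'prometheus/middleware/collector'
require 'prometheus/middleware/exporter'

use Prometheus::Middleware::Collector   # per-request counters and latency histogram
use Prometheus::Middleware::Exporter    # serves the registry on GET /metrics

run ->(env) { [200, { 'content-type' => 'text/plain' }, ['OK']] }
```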

Slide 61

Slide 61 text

Metrics : bear in mind [1/2] • Split statistics by type. For example, aggregating successful (relatively long) and failed (relatively short) durations may create the illusion of a performance increase when multiple failures occur. • Whenever possible, use saturation to determine the load on the system; utilization alone is incomplete information. • Be sure to export the metrics of the component closest to the user. This allows evaluating the SLI. • Implement configurable bucket sizes.

Slide 62

Slide 62 text

Metrics : bear in mind [2/2] • Export appropriate metrics as buckets. This lowers the polling rate and makes it possible to get statistics as percentiles. • Add units to metric names. • Whenever possible, use SI units. • Follow the naming standard. The Prometheus “Metric and label naming” document is a good base.

Slide 63

Slide 63 text

Traces

Slide 64

Slide 64 text

Traces : definition In software engineering, tracing involves a specialized use of logging to record information about a program's execution. … There is not always a clear distinction between tracing and other forms of logging, except that the term tracing is almost never applied to logging that is a functional requirement of a program. — Wikipedia

Slide 65

Slide 65 text

Traces : use cases • Debugging during development • Measuring and tuning performance • Analyzing failures and security incidents https://www.cossacklabs.com/blog/how-to-implement-distributed-tracing.html • Approaches • Library comparison • Implementation example • Use cases

Slide 66

Slide 66 text

Traces : principles • Low overhead • Application-level transparency • Scalability

Slide 67

Slide 67 text

Traces : spans in trace tree https://static.googleusercontent.com/media/research.google.com/uk/pubs/archive/36356.pdf

Slide 68

Slide 68 text

Traces : kinds of data Per request/query tracking: • trace id • span id • parent span id • application info (product, component) • module name • method name • context data (session/request id, user id, …) • operation name and code • start time • end time
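As a plain Ruby illustration (not any particular tracing library's API), a span carrying these fields might look like this:

```ruby
# Illustrative span record with the per-request fields listed above;
# real tracing libraries define their own span types and wire formats.
Span = Struct.new(
  :trace_id, :span_id, :parent_span_id,
  :product, :component, :module_name, :method_name,
  :context,                     # e.g. { request_id: ..., user_id: ... }
  :operation_name, :operation_code,
  :start_time, :end_time,
  keyword_init: true
)

span = Span.new(
  trace_id: 'b6c4a1d2e3f40516', span_id: '01', parent_span_id: nil,
  product: 'example-app', component: 'api', module_name: 'billing',
  method_name: 'charge', context: { request_id: 'r-42' },
  operation_name: 'POST /charge', operation_code: 200,
  start_time: Time.now, end_time: nil
)
```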

Slide 69

Slide 69 text

Traces : what it looks like

Slide 70

Slide 70 text

Traces : consumers • General purpose collectors:
 − Jaeger
 − Zipkin • Cloud collectors:
 − Google StackDriver
 − AWS X-Ray
 − Azure Application Insights • SIEM

Slide 71

Slide 71 text

Traces : formats • Proprietary protocols:
 − Jaeger
 − Zipkin
 − Google StackDriver
 − AWS X-Ray
 − Azure Application Insights • JSON:
 − SIEM • protobuf/gRPC:
 − custom

Slide 72

Slide 72 text

Traces : implementation • OpenCensus
 https://www.rubydoc.info/gems/opencensus
 (Zipkin, GC Stackdriver, JSON) • OpenTracing
 https://opentracing.io/guides/ruby/ • Jaeger client
 https://github.com/salemove/jaeger-client-ruby
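A sketch using the opencensus gem's documented API; method names may differ between versions, and in a Rack application the bundled middleware usually starts the request trace for you:

```ruby
require 'opencensus'

# Start a root trace for the incoming request, then wrap units of work in spans.
OpenCensus::Trace.start_request_trace do
  OpenCensus::Trace.in_span 'handle_request' do |span|
    span.put_attribute 'user_id', 'u-42'    # context data attached to the span
    OpenCensus::Trace.in_span 'db_query' do |child|
      # nested span: its parent span id points at 'handle_request'
    end
  end
end
```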

Slide 73

Slide 73 text

Checklists

Slide 74

Slide 74 text

Checklist : Logs □ Each line:
 □ timestamps (ISO8601, TZ, reasonable precision)
 □ PID
 □ component name
 □ severity
 □ event code
 □ human-readable message
□ Events to log:
 □ state changes (start/ready/pause/stop)
 □ health changes (new state, reason, doc URL)
 □ user sign-in attempts (including failed ones, with reasons), actions, sign-out
 □ audit trail
 □ errors
□ On start:
 □ product name, component name
 □ version (+build, +commit hash)
 □ running mode (debug/normal, daemon/)
 □ deprecation warnings
 □ which configuration is in use (ENV, file, configuration service)
□ On ready: communication sockets and ports
□ On exit: reason
□ Do not log:
 □ passwords, tokens
 □ personal data

Slide 75

Slide 75 text

Checklist : Metrics □ Data to export:
 □ application (version, warnings/notifications)
 □ utilization (resources, capacities, usage)
 □ saturation (internally calculated or appropriate metrics)
 □ rate (operations)
 □ errors
 □ latencies
□ Split metrics by type
□ Export as buckets when reasonable
□ Configurable bucket sizes
□ Export metrics for SLI
□ Determine the required resolution
□ Normalize, use SI units, add units to names
□ Prefer the poll model if possible
□ Clear counters on restart

Slide 76

Slide 76 text

Links [1/2] • Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
 https://static.googleusercontent.com/media/research.google.com/uk//pubs/archive/36356.pdf • How to Implement Tracing in a Modern Distributed Application
 https://www.cossacklabs.com/blog/how-to-implement-distributed-tracing.html • OpenTracing
 https://opentracing.io/ • OpenMetrics
 https://github.com/RichiH/OpenMetrics • OpenCensus
 https://opencensus.io

Slide 77

Slide 77 text

Links [2/2] • CEF
 https://kc.mcafee.com/resources/sites/MCAFEE/content/live/CORP_KNOWLEDGEBASE/78000/KB78712/en_US/CEF_White_Paper_20100722.pdf • Metrics : USE method
 http://www.brendangregg.com/usemethod.html • Google SRE book
 https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/ • Metrics : RED method
 https://www.weave.works/blog/the-red-method-key-metrics-for-microservices-architecture/ • MS Azure : monitoring and diagnostics
 https://docs.microsoft.com/en-us/azure/architecture/best-practices/monitoring • Prometheus : Metric and label naming
 https://prometheus.io/docs/practices/naming/

Slide 78

Slide 78 text

Dmytro Shapovalov Infrastructure Engineer @ Cossack Labs Thank you! shadinua shad.in.ua shad.in.ua