Observability: more data, more questions

Observability: more data, more questions
Albert W. Alberts
Amsterdam, June 19th 2019

Albert Alberts
Architect at KPN since January 1999:
- Previous functions: developer, designer, software architect
- Worked on several KPN patents
- Currently architect for VPC, SDN and the KPN API Store
Meetup organizer:
- devNetNoord, a developer community (589 members)
- domoticaGrunn, a home automation community (240 members)
Spare time: (open water) swimming, water polo, cycling

developer.kpn.com

Observability
“Observability” is a superset of “monitoring”, providing certain benefits and insights that “monitoring” tools cannot deliver.

Observability Logging Tracing Metrics

Observability Metrics Logging Tracing Alerting

Data Information Information on your platform

Service Levels: the base of Site Reliability Engineering (SRE)
SLA (Service Level Agreement): contract between customer and supplier
SLO (Service Level Objective): availability and reliability goals
SLI (Service Level Indicator): probes measured over time
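A minimal sketch of how an SLI built from probes can be compared with an SLO. The probe data, the 98.5% objective and all names are assumptions for illustration, not figures from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    timestamp: float  # seconds since some epoch (hypothetical probe schedule)
    ok: bool          # did the probe succeed?


def availability_sli(results):
    """SLI: fraction of probes that succeeded over the measured period."""
    if not results:
        raise ValueError("no probe results to measure")
    return sum(1 for r in results if r.ok) / len(results)


def error_budget_remaining(sli, slo):
    """Share of the error budget left: 1.0 is untouched, below 0 means the SLO is breached."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


# Hypothetical data: 1000 probes, every 100th one fails.
probes = [ProbeResult(timestamp=float(i), ok=(i % 100 != 0)) for i in range(1000)]
sli = availability_sli(probes)
budget = error_budget_remaining(sli, slo=0.985)
print(f"SLI={sli:.3f}, error budget remaining={budget:.0%}")
```

The SLA itself stays out of the code: it is the contract that says what happens when the SLO is missed.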

Monitoring
Monitoring is about known-unknowns and actionable alerts.
Monitoring is for symptom-based alerting: Your monitoring system should address two questions: what’s broken, and why? The “what’s broken” indicates the symptom; the “why” indicates a (possibly intermediate) cause.

Observability
Observability is about unknown-unknowns: empowering you to ask new questions and answer questions yet to be formulated …

Metrics
Metrics are raw measurements of resource usage or behavior that can be observed and collected over time.

Monitoring
Monitoring is the process of collecting, aggregating, and analyzing metrics to improve awareness of your components' characteristics and behavior.
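As an illustration of the step from raw metrics to monitoring, the sketch below aggregates timestamped samples into a small summary. The sample values and the choice of a nearest-rank 95th percentile are assumptions, not something the talk prescribes.

```python
import statistics


def summarize(samples):
    """Aggregate (timestamp, value) samples into count, min, max, mean and p95."""
    values = sorted(value for _, value in samples)
    if not values:
        raise ValueError("no samples to aggregate")
    # Nearest-rank percentile: the smallest value with at least 95% of samples at or below it.
    rank = -(-95 * len(values) // 100)  # integer ceil(0.95 * n)
    return {
        "count": len(values),
        "min": values[0],
        "max": values[-1],
        "mean": statistics.fmean(values),
        "p95": values[rank - 1],
    }


# Hypothetical CPU usage samples, one per second.
cpu_samples = [(t, float(t + 1)) for t in range(20)]
summary = summarize(cpu_samples)
print(summary)
```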

Logging
A log is an immutable, timestamped record of discrete events that happened over time.
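The definition can be made concrete with a small sketch: a frozen dataclass gives an immutable record, and the timestamp is fixed when the event is created. The field names and the JSON-lines output are assumptions chosen for illustration.

```python
import json
from dataclasses import FrozenInstanceError, asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LogEvent:
    timestamp: str  # ISO 8601, set once at creation
    level: str
    message: str


def make_event(level, message, now=None):
    """Create an event stamped with the current UTC time (or a given time, for testing)."""
    now = now or datetime.now(timezone.utc)
    return LogEvent(timestamp=now.isoformat(), level=level, message=message)


def to_json_line(event):
    """Serialize one event as a single JSON line, ready for a log shipper."""
    return json.dumps(asdict(event), sort_keys=True)


event = make_event("INFO", "service started",
                   now=datetime(2019, 6, 19, 12, 0, tzinfo=timezone.utc))
print(to_json_line(event))

try:
    event.message = "rewritten history"  # immutability: this is rejected
except FrozenInstanceError:
    print("log events cannot be modified after creation")
```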

Tracing
A trace is a representation of a series of causally related distributed events that encode the end-to-end request flow through a distributed system.
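A rough sketch of that idea: each event (span) carries a shared trace id and a pointer to its parent, which is enough to rebuild the request flow. The span fields and the service names are hypothetical, not the data model of any particular tracing product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root of the request
    name: str
    start_ms: int
    end_ms: int


def request_flow(spans, trace_id):
    """Return (depth, name) pairs for one trace, parents before their children."""
    children = {}
    for span in spans:
        if span.trace_id == trace_id:
            children.setdefault(span.parent_id, []).append(span)

    flow = []

    def walk(parent_id, depth):
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            flow.append((depth, span.name))
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return flow


# Hypothetical spans, deliberately out of order and mixed with another trace.
spans = [
    Span("t1", "c", "a", "backend", 12, 40),
    Span("t1", "a", None, "api-gateway", 0, 45),
    Span("t2", "x", None, "other-request", 3, 9),
    Span("t1", "d", "c", "database", 15, 30),
    Span("t1", "b", "a", "auth", 2, 10),
]
flow = request_flow(spans, "t1")
for depth, name in flow:
    print("  " * depth + name)
```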

Alerting
Alerting is the responsive component of a monitoring system that performs actions based on changes in metric values.
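A minimal sketch of "actions based on changes in metric values": a threshold rule that only acts when its state flips, so a sustained breach does not repeat the alert. The rule name, threshold and callback are assumptions for illustration.

```python
class ThresholdAlert:
    """Calls `action` when a metric crosses the threshold, and again when it recovers."""

    def __init__(self, name, threshold, action):
        self.name = name
        self.threshold = threshold
        self.action = action
        self.firing = False

    def observe(self, value):
        breached = value > self.threshold
        if breached != self.firing:  # act on state changes only
            self.firing = breached
            self.action(self.name, "FIRING" if breached else "RESOLVED", value)


sent = []
alert = ThresholdAlert("cpu_percent", 90.0,
                       lambda name, state, value: sent.append((name, state, value)))

for value in [50.0, 95.0, 97.0, 60.0]:
    alert.observe(value)

print(sent)
```

The second breach (97.0) produces no extra action because the rule is already firing.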

Monitoring

Paessler PRTG stream (metrics/monitoring) virtual machine bandwidth system stats connectivity
certificates aggregation alerting notification action dashboard

Paessler dashboard virtual machine bandwidth system stats connectivity certificates aggregation
alerting notification action dashboard

Logging

Logging stream (logging) rsyslog log files virtual machines central log
aggregation local log aggregation

Tracing failed attempts

Tracing with JavaScript Object Notation {JSON} Elastic process info virtual
machine (nested) JSON fluentd Kibana

Tracing with JavaScript Object Notation {JSON} Elastic process info virtual
machine (nested) JSON fluentd Kibana Grafana InfluxDB

Tracing with JavaScript Object Notation {JSON}
process info, virtual machine, (nested) JSON
Nested JSON and dynamic structure are hard for the backend systems to “read”.
Keep it simple, understand TSDB and close the dev-ops gap.
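One way to "keep it simple" is to flatten nested documents into plain key/value fields before they reach the backend. The sketch below is an illustration under that assumption; the document shape and the underscore separator are made up, and lists are passed through untouched.

```python
def flatten(doc, prefix="", sep="_"):
    """Turn nested dicts into one flat dict with joined key names."""
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Hypothetical nested process info from a virtual machine.
nested = {
    "host": "vm-01",
    "process": {
        "name": "nginx",
        "cpu": {"user": 1.5, "system": 0.5},
    },
}
flat = flatten(nested)
print(flat)
```

Flat, predictable field names map directly onto the tags and fields of a time series database, where a dynamic nested structure does not.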

Tracing

Tracing data flow rsyslog Telegraf InfluxDB Elastic Filebeat Grafana process
info log files system stats virtual machine Dashboard

What did work and what didn’t? rsyslog Telegraf InfluxDB Elastic
Filebeat Grafana process info log files system stats virtual machine Dashboard

Notification rsyslog Telegraf InfluxDB Elastic Filebeat Grafana process info log
files system stats virtual machine Dashboard Notification

Slack notification rsyslog Telegraf InfluxDB Elastic Filebeat Grafana process info
log files system stats virtual machine Dashboard Notification
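In this data flow the notification is presumably sent by the dashboard tooling itself. As an illustration of what ends up at Slack, the sketch below builds a message for a Slack incoming webhook, which accepts a JSON body with a `text` field. The webhook URL is a placeholder and the message text is invented.

```python
import json
import urllib.request


def build_slack_request(webhook_url, text):
    """Build (but do not send) the POST request for a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify(webhook_url, text, timeout=5.0):
    """Send the message; returns True when Slack answers with HTTP 200."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status == 200


# Placeholder URL: a real one is issued per Slack workspace and must be kept secret.
WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"
request = build_slack_request(WEBHOOK_URL, "cpu_percent on vm-01 is above 90%")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```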

Example NGINX data flow log files process info rsyslog Telegraf
InfluxDB Elastic Filebeat Grafana system stats Dashboard Notification
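In the flow above the log shippers do the parsing. To show what kind of fields come out of an NGINX access log, here is a small sketch that parses one line in NGINX's default "combined" format. The sample line is invented, and a custom `log_format` would need a different pattern.

```python
import re
from datetime import datetime

# Pattern for the default "combined" access log format.
COMBINED = re.compile(
    r'(?P<remote_addr>\S+) \S+ (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of typed fields, or None when the line does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["body_bytes_sent"] = int(fields["body_bytes_sent"])
    fields["time"] = datetime.strptime(fields.pop("time_local"), "%d/%b/%Y:%H:%M:%S %z")
    return fields


# Invented sample line.
sample = ('203.0.113.7 - - [19/Jun/2019:10:15:32 +0200] '
          '"GET /api/v1/status HTTP/1.1" 200 612 "-" "curl/7.64.0"')
parsed = parse_line(sample)
print(parsed)
```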

NGINX in production

Getting Apigee Management info management API Apigee Management Grafana InfluxDB
Python scripts
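The talk's own scripts are not shown. As a hedged sketch of the last hop only, the code below formats already-fetched management figures as InfluxDB line protocol. The measurement, tag and field names are hypothetical, and fetching the data from the management API is left out.

```python
def _escape(text, chars):
    """Backslash-escape the given special characters."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _field_value(value):
    """Format a field value: booleans, integers (i suffix), floats, quoted strings."""
    if isinstance(value, bool):  # check bool first, since bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line(measurement, tags, fields, timestamp_ns):
    """Build one line: measurement,tags fields timestamp (nanoseconds)."""
    if not fields:
        raise ValueError("line protocol needs at least one field")
    head = _escape(measurement, ", ")
    for key, value in sorted(tags.items()):
        head += f",{_escape(key, ',= ')}={_escape(str(value), ',= ')}"
    body = ",".join(
        f"{_escape(key, ',= ')}={_field_value(value)}"
        for key, value in sorted(fields.items())
    )
    return f"{head} {body} {timestamp_ns}"


# Hypothetical figures, as a script might have collected them.
line = to_line(
    "apigee_proxy",
    tags={"org": "example-org", "env": "prod"},
    fields={"deployed_proxies": 12, "avg_latency_ms": 35.5},
    timestamp_ns=1560938400000000000,
)
print(line)
```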

Wrap up

What did the 4 pillars teach us?
Monitoring: Hardware metrics alone are not enough.
Logging: Logging is required for security and auditing, but is also very useful for analytics and the team dashboard.
Tracing: Tracing is hard. Keep it simple, understand TSDB and close the dev-ops gap.
Alerting: Slack is for non-critical alerting. Critical alerting goes via SMS, pager or phone.

Observability lessons
• Get a platform dashboard on-screen.
• Beware of the anti-pattern of monitoring everything.
• Understand the social and financial implications.
• From unknown-unknowns to known-unknowns and more questions.

Data Information Knowledge From unknown-unknowns to known-unknowns and more questions
???

Thank you for your attention

References:
KPN: KPN API Store
Medium: “Monitoring and Observability”
O’Reilly: “Distributed Systems Observability”
Twitter: Observability at Twitter Part 1 & Part 2
Google: “SRE fundamentals: SLIs, SLAs and SLOs”
Google: “SRE vs. DevOps: competing standards or close friends?”
Google: “Site Reliability Engineering” & “The Site Reliability Workbook” (free online books)
Forrester: “How To Apply Google's Site Reliability Engineering Approach To Your Infrastructure”
Digital Ocean: “An Introduction to Metrics, Monitoring, and Alerting”
Honeycomb: “Observability: A Manifesto”
InfoQ: “Three Pillars with Zero Answers: Rethinking Observability”
Scalyr: “Observability?! – Where do we go from here?”
Netflix Tech Blog: “Lessons from Building Observability Tools at Netflix”
OpenAPM: “OpenAPM Landscape”
Paessler: “paessler.com”
