Observability: more data, more questions

Slide 1

Slide 1 text

Observability: more data, more questions
Albert W. Alberts
Amsterdam, June 19th 2019

Slide 2

Slide 2 text

Albert Alberts, architect at KPN since January 1999:
- Previous functions: developer, designer, software architect
- Worked on several KPN patents
- Currently architect for VPC, SDN and the KPN API Store
Meetup organizer:
- devNetNoord, a developer community (589 members)
- domoticaGrunn, a home automation community (240 members)
Spare time: (open water) swimming, water polo, cycling

Slide 3

Slide 3 text

developer.kpn.com

Slide 4

Slide 4 text

Observability
“Observability” is a superset of “monitoring”, providing benefits and insights that “monitoring” tools cannot deliver.

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

Observability: logging, tracing, metrics

Slide 7

Slide 7 text

Observability: metrics, logging, tracing, alerting

Slide 8

Slide 8 text

Data, Information
Information on your platform

Slide 9

Slide 9 text

Service Levels: the base of Site Reliability Engineering (SRE)
- SLA (Service Level Agreement): contract between customer and supplier
- SLO (Service Level Objective): availability and reliability goals
- SLI (Service Level Indicator): probes measured over time
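To make the relation between SLI and SLO concrete, here is a minimal sketch in Python. It assumes the SLI is simply the fraction of successful probes in a measurement window. The `Probe` type, the function names and the 99% target are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    timestamp: int  # seconds since epoch (hypothetical field)
    ok: bool        # True when the probe succeeded


def availability_sli(probes):
    """SLI: fraction of successful probes in the measured window."""
    if not probes:
        raise ValueError("no probes in window")
    return sum(p.ok for p in probes) / len(probes)


def meets_slo(sli, slo_target):
    """SLO check: is the measured indicator at or above the objective?"""
    return sli >= slo_target


def budget_consumed(sli, slo_target):
    """Share of the error budget (1 - SLO) used up by the observed failures."""
    return (1 - sli) / (1 - slo_target)


# 1000 probes, 5 of them failing
window = [Probe(timestamp=i, ok=(i % 200 != 0)) for i in range(1000)]
sli = availability_sli(window)
print(sli, meets_slo(sli, 0.99), round(budget_consumed(sli, 0.99), 2))
```

The SLA is the contract built on top of such objectives, so it is not modelled here.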

Slide 10

Slide 10 text

Monitoring is about known-unknowns and actionable alerts
Monitoring is for symptom-based alerting: your monitoring system should address two questions: what’s broken, and why? The “what’s broken” indicates the symptom; the “why” indicates a (possibly intermediate) cause.

Slide 11

Slide 11 text

Observability is about unknown-unknowns
Empowering you to ask new questions and answer questions yet to be formulated …

Slide 12

Slide 12 text

Metrics
Metrics are raw measurements of resource usage or behavior that can be observed and collected over time.

Slide 13

Slide 13 text

Monitoring
Monitoring is the process of collecting, aggregating, and analyzing metrics to improve awareness of your components' characteristics and behavior.

Slide 14

Slide 14 text

Logging
A log is an immutable, timestamped record of discrete events that happened over time.
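A small Python sketch of what "immutable, timestamped record" can look like in practice. The `LogEvent` fields and the JSON-per-line output are assumptions made for illustration. The deck itself ships logs with rsyslog and does not prescribe a record format.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: the record cannot be changed after creation
class LogEvent:
    timestamp: str
    host: str
    severity: str
    message: str


def make_event(host, severity, message, now=None):
    """Create one discrete event, stamped with the current UTC time by default."""
    now = now or datetime.now(timezone.utc)
    return LogEvent(now.isoformat(), host, severity, message)


def to_line(event):
    """Serialize the event as one flat JSON object per line."""
    return json.dumps(asdict(event), sort_keys=True)


print(to_line(make_event("vm-01", "info", "service started")))
```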

Slide 15

Slide 15 text

Tracing
A trace is a representation of a series of causally related distributed events that encode the end-to-end request flow through a distributed system.
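As an illustration of "causally related events", the Python sketch below models a trace as spans that share a trace id and point to their parent span, then prints the request flow as an indented tree. The `Span` fields and the service names are hypothetical. The deck does not show its own trace format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by all events of one request
    span_id: str
    parent_id: Optional[str]  # None for the entry point of the request
    service: str
    start_ms: int
    duration_ms: int


def render_trace(spans, trace_id):
    """Return indented lines showing the end-to-end flow of one request."""
    children = {}
    for span in spans:
        if span.trace_id == trace_id:
            children.setdefault(span.parent_id, []).append(span)

    lines = []

    def walk(parent_id, depth):
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            lines.append(f"{'  ' * depth}{span.service} ({span.duration_ms} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


spans = [
    Span("t1", "a", None, "gateway", 0, 120),
    Span("t1", "c", "a", "backend", 25, 90),
    Span("t1", "b", "a", "auth", 5, 15),
    Span("t1", "d", "c", "db", 30, 40),
    Span("t2", "x", None, "gateway", 10, 8),  # a different request
]
print("\n".join(render_trace(spans, "t1")))
```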

Slide 16

Slide 16 text

Alerting
Alerting is the responsive component of a monitoring system that performs actions based on changes in metric values.
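A minimal Python sketch of "actions based on changes in metric values". It assumes a simple static threshold and only reports state changes, so a metric that stays above the threshold does not trigger a new action on every sample. The threshold and sample values are made up.

```python
def evaluate(values, threshold):
    """Return (index, state) for every change between firing and resolved."""
    events = []
    firing = False
    for index, value in enumerate(values):
        breached = value > threshold
        if breached and not firing:
            events.append((index, "firing"))
        elif not breached and firing:
            events.append((index, "resolved"))
        firing = breached
    return events


# CPU usage in percent, sampled over time (hypothetical data)
print(evaluate([10, 50, 95, 97, 60, 99], threshold=90))
```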

Slide 17

Slide 17 text

Monitoring

Slide 18

Slide 18 text

Paessler PRTG stream (metrics/monitoring)
Diagram labels: virtual machine, bandwidth, system stats, connectivity, certificates, aggregation, alerting, notification, action, dashboard

Slide 19

Slide 19 text

Paessler dashboard
Diagram labels: virtual machine, bandwidth, system stats, connectivity, certificates, aggregation, alerting, notification, action, dashboard

Slide 20

Slide 20 text

Logging

Slide 21

Slide 21 text

Logging stream (logging)
Diagram labels: rsyslog, log files, virtual machines, central log aggregation, local log aggregation

Slide 22

Slide 22 text

Tracing failed attempts

Slide 23

Slide 23 text

Tracing with JavaScript Object Notation {JSON}
Diagram labels: Elastic, process info, virtual machine, (nested) JSON, fluentd, Kibana

Slide 24

Slide 24 text

Tracing with JavaScript Object Notation {JSON}
Diagram labels: Elastic, process info, virtual machine, (nested) JSON, fluentd, Kibana, Grafana, InfluxDB

Slide 25

Slide 25 text

Tracing with JavaScript Object Notation {JSON}
Diagram labels: process info, virtual machine, (nested) JSON
Nested JSON and dynamic structure are hard for the backend systems to “read”.
Keep it simple, understand TSDB and close the dev-ops gap.
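One way to "keep it simple" is to flatten nested JSON into plain key/value pairs before sending it to a time series database. The Python sketch below is an assumption about how that could be done, not the approach used in the deck. The `flatten` helper, the `_` separator and the sample document are hypothetical.

```python
def flatten(obj, prefix="", sep="_"):
    """Turn nested dicts and lists into one flat dict with joined key names."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        elif isinstance(value, list):
            # Treat list positions as keys: ports[0] becomes ports_0
            indexed = {str(i): item for i, item in enumerate(value)}
            flat.update(flatten(indexed, name, sep))
        else:
            flat[name] = value
    return flat


nested = {
    "host": "vm-01",
    "process": {"name": "nginx", "cpu": {"user": 1.5, "system": 0.5}},
    "ports": [80, 443],
}
print(flatten(nested))
```

Flat field names map directly onto the tags and fields of a TSDB, at the cost of losing the original nesting.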

Slide 26

Slide 26 text

Tracing

Slide 27

Slide 27 text

Tracing data flow
Diagram labels: rsyslog, Telegraf, InfluxDB, Elastic, Filebeat, Grafana, process info, log files, system stats, virtual machine, dashboard

Slide 28

Slide 28 text

What did work and what didn’t?
Diagram labels: rsyslog, Telegraf, InfluxDB, Elastic, Filebeat, Grafana, process info, log files, system stats, virtual machine, dashboard

Slide 29

Slide 29 text

Notification
Diagram labels: rsyslog, Telegraf, InfluxDB, Elastic, Filebeat, Grafana, process info, log files, system stats, virtual machine, dashboard, notification

Slide 30

Slide 30 text

Slack notification
Diagram labels: rsyslog, Telegraf, InfluxDB, Elastic, Filebeat, Grafana, process info, log files, system stats, virtual machine, dashboard, notification
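The slide does not say how the Slack message is sent. As a rough sketch, a notification can be posted to a Slack incoming webhook as a JSON body with a `text` field. The Python below uses only the standard library. The webhook URL is a placeholder and the function names are hypothetical.

```python
import json
import urllib.request

# Placeholder: a real incoming-webhook URL is issued by Slack per workspace
WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"


def build_slack_request(webhook_url, text):
    """Build the HTTP POST for a Slack incoming webhook without sending it."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify(webhook_url, text):
    """Send the notification and return the HTTP status code."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


request = build_slack_request(WEBHOOK_URL, "Disk usage 91% on vm-01")
print(request.get_method(), request.data.decode("utf-8"))
```

Building the request separately from sending it keeps the payload easy to inspect and test without network access.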

Slide 31

Slide 31 text

Example NGINX data flow
Diagram labels: log files, process info, rsyslog, Telegraf, InfluxDB, Elastic, Filebeat, Grafana, system stats, dashboard, notification

Slide 32

Slide 32 text

NGINX in production

Slide 33

Slide 33 text

Getting Apigee Management info
Diagram labels: management API, Apigee Management, Grafana, InfluxDB, Python scripts
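The Python scripts themselves are not shown in the deck. As a sketch of the last step only, the code below formats already collected numbers as InfluxDB line protocol (`measurement,tags fields timestamp`). The measurement name, tags and fields are hypothetical, and nothing is assumed here about the shape of the Apigee management API responses.

```python
def _escape(value):
    """Escape commas, equals signs and spaces in tag keys, tag values and field keys."""
    return str(value).replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def _format_field(value):
    """Render a field value: booleans, integers (i suffix), floats, quoted strings."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build one line of InfluxDB line protocol with sorted tags and fields."""
    tag_part = "".join(f",{_escape(k)}={_escape(v)}" for k, v in sorted(tags.items()))
    field_part = ",".join(
        f"{_escape(k)}={_format_field(v)}" for k, v in sorted(fields.items())
    )
    return f"{_escape(measurement)}{tag_part} {field_part} {timestamp_ns}"


print(to_line_protocol(
    "api_traffic",                                   # hypothetical measurement
    {"env": "prod", "proxy": "weather v1"},          # hypothetical tags
    {"requests": 1200, "avg_latency_ms": 35.5},      # hypothetical fields
    1560945600000000000,
))
```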

Slide 34

Slide 34 text

Wrap up

Slide 35

Slide 35 text

What did the 4 pillars teach us?
Monitoring: Hardware metrics alone are not enough.
Logging: Logging is required for security and auditing, but is also very useful for analytics and the team dashboard.
Tracing: Tracing is hard. Keep it simple, understand TSDB and close the dev-ops gap.
Alerting: Slack is for non-critical alerting. Critical alerting goes via SMS, pager or phone.
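The alerting lesson can be expressed as a small routing table. This Python sketch is illustrative only: the severity names, the channel identifiers and the choice to treat unknown severities as critical are assumptions, not something stated in the deck.

```python
# Hypothetical mapping from alert severity to notification channels
ROUTES = {
    "critical": ["sms", "pager", "phone"],
    "warning": ["slack"],
    "info": ["slack"],
}


def channels_for(severity):
    """Return the channels for a severity; unknown severities count as critical."""
    # Assumption: escalating an unknown severity is safer than dropping it
    return ROUTES.get(severity, ROUTES["critical"])


print(channels_for("warning"), channels_for("critical"))
```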

Slide 36

Slide 36 text

Observability lessons
• Get a platform dashboard on-screen.
• Beware of the anti-pattern of monitoring everything.
• Understand the social and financial implications.
• From unknown-unknowns to known-unknowns and more questions.

Slide 37

Slide 37 text

Data, Information, Knowledge, ???
From unknown-unknowns to known-unknowns and more questions

Slide 38

Slide 38 text

Thank you for your attention

Slide 39

Slide 39 text

References:
- KPN: KPN API Store
- Medium: “Monitoring and Observability”
- O’Reilly: “Distributed Systems Observability”
- Twitter: Observability at Twitter Part 1 & Part 2
- Google: “SRE fundamentals: SLIs, SLAs and SLOs”
- Google: “SRE vs. DevOps: competing standards or close friends?”
- Google: “Site Reliability Engineering” & “The Site Reliability Workbook” (free online books)
- Forrester: “How To Apply Google's Site Reliability Engineering Approach To Your Infrastructure”
- Digital Ocean: “An Introduction to Metrics, Monitoring, and Alerting”
- Honeycomb: “Observability: A Manifesto”
- InfoQ: “Three Pillars with Zero Answers: Rethinking Observability”
- Scalyr: “Observability?! – Where do we go from here?”
- Netflix Tech Blog: “Lessons from Building Observability Tools at Netflix”
- OpenAPM: “OpenAPM Landscape”
- Paessler: “paessler.com”