Observability: more data, more questions

This presentation was given at the second API-days event in Amsterdam.

In this presentation I talk about KPN's API-Store and our observability experiences with the platform.

Albert W. Alberts

June 19, 2019

Transcript

  1. Observability: more data, more questions. Albert W. Alberts, Amsterdam, June 19th 2019
  2. Albert Alberts, architect. At KPN since January 1999:
    - Previous functions: developer, designer, software architect
    - Worked on several KPN patents
    - Currently architect for VPC, SDN and the KPN API Store
    Meetup organizer:
    - devNetNoord, a developer community (589 members)
    - domoticaGrunn, a home automation community (240 members)
    Spare time: (open water) swimming, water polo, cycling
    albert.alberts@kpn.com
  3. developer.kpn.com

  4. Observability: “Observability” is a superset of “monitoring”, providing certain benefits and insights that “monitoring” tools cannot deliver.
  5. None
  6. Observability: Logging, Tracing, Metrics

  7. Observability: Metrics, Logging, Tracing, Alerting

  8. Data Information Information on your platform

  9. Service Levels: the base of Site Reliability Engineering (SRE)
    - SLA (Service Level Agreement): contract between customer and supplier
    - SLO (Service Level Objective): availability and reliability goals
    - SLI (Service Level Indicator): probes measured over time
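The relation between an SLI and an SLO can be made concrete with a small calculation. The following is a minimal sketch, not the KPN implementation: the `ProbeResult` type, the function names, the probe data and the 99.5% objective are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """One probe measurement (hypothetical shape, for illustration only)."""
    timestamp: float  # seconds since epoch
    success: bool


def availability_sli(probes):
    """SLI: fraction of successful probes in the measurement window."""
    if not probes:
        raise ValueError("no probes in window")
    return sum(1 for p in probes if p.success) / len(probes)


def error_budget_remaining(sli, slo):
    """Share of the failure budget allowed by the SLO that is still unspent.

    1.0 means nothing spent, 0.0 means the SLO is exactly met,
    and a negative value means the SLO is violated.
    """
    allowed = 1.0 - slo
    if allowed <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    spent = 1.0 - sli
    return (allowed - spent) / allowed


# Made-up data: 10,000 probes, every 1,000th one fails (10 failures).
probes = [ProbeResult(float(i), i % 1000 != 0) for i in range(10_000)]
sli = availability_sli(probes)
slo = 0.995  # assumed objective
print(f"SLI={sli:.4f} SLO={slo} budget left={error_budget_remaining(sli, slo):.0%}")
```

The SLA itself stays outside the code: it is the contract that states what happens when the objective is missed.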
  10. Monitoring is about known-unknowns and actionable alerts. Monitoring is for symptom-based alerting: your monitoring system should address two questions: what’s broken, and why? The “what’s broken” indicates the symptom; the “why” indicates a (possibly intermediate) cause.
  11. Observability is about unknown-unknowns: empowering you to ask new questions and answer questions yet to be formulated …
  12. Metrics: Metrics are raw measurements of resource usage or behavior that can be observed and collected over time.
  13. Monitoring: Monitoring is the process of collecting, aggregating, and analyzing metrics to improve awareness of your components' characteristics and behavior.
  14. Logging: A log is an immutable, timestamped record of discrete events that happened over time.
  15. Tracing: A trace is a representation of a series of causally related distributed events that encode the end-to-end request flow through a distributed system.
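To illustrate what "causally related events" means in practice, here is a minimal sketch that rebuilds a request flow from spans. The `Span` fields, the service names and the timings are hypothetical; real tracing systems use a similar trace-id and parent-id scheme but differ in detail.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One timed event in a trace (hypothetical shape, for illustration)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    start_ms: int
    duration_ms: int


def request_flow(spans, trace_id):
    """Return (depth, span) pairs of one trace in causal order."""
    children = defaultdict(list)
    for span in spans:
        if span.trace_id == trace_id:
            children[span.parent_id].append(span)

    flow = []

    def walk(parent_id, depth):
        # Siblings are ordered by start time; children follow their parent.
        for span in sorted(children[parent_id], key=lambda s: s.start_ms):
            flow.append((depth, span))
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return flow


# Spans arrive out of order, as they would from independent hosts.
spans = [
    Span("t1", "d", "c", "database", 12, 20),
    Span("t1", "b", "a", "auth", 2, 5),
    Span("t1", "a", None, "gateway", 0, 40),
    Span("t1", "c", "a", "backend", 10, 25),
    Span("t2", "x", None, "gateway", 0, 3),  # unrelated request
]
flow = request_flow(spans, "t1")
for depth, span in flow:
    print("  " * depth + f"{span.service} ({span.duration_ms} ms)")
```

The shared `trace_id` is what ties events from different machines to one request; without it the same records are just unrelated log lines.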
  16. Alerting: Alerting is the responsive component of a monitoring system that performs actions based on changes in metric values.
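A simple way to picture "actions based on changes in metric values" is a threshold rule that only fires after several consecutive breaches. This is a sketch under assumptions: the function name, the 90% threshold, the three-sample window and the CPU values are invented for the example.

```python
def alert_transitions(values, threshold, for_samples=3):
    """Return (index, state) pairs for each change of alert state.

    The alert fires once `for_samples` consecutive values exceed the
    threshold and resolves on the first value at or below it.
    """
    firing = False
    breaches = 0
    transitions = []
    for index, value in enumerate(values):
        breaches = breaches + 1 if value > threshold else 0
        if not firing and breaches >= for_samples:
            firing = True
            transitions.append((index, "firing"))
        elif firing and breaches == 0:
            firing = False
            transitions.append((index, "resolved"))
    return transitions


# Made-up CPU usage samples in percent; the single spike at index 1 is ignored.
cpu_percent = [40, 95, 50, 92, 96, 97, 99, 60, 55]
transitions = alert_transitions(cpu_percent, threshold=90, for_samples=3)
print(transitions)
```

Requiring consecutive breaches is one common way to keep short spikes from producing alerts nobody can act on.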
  17. Monitoring

  18. Paessler PRTG stream (metrics/monitoring): virtual machine, bandwidth, system stats, connectivity, certificates, aggregation, alerting, notification, action, dashboard

  19. Paessler dashboard: virtual machine, bandwidth, system stats, connectivity, certificates, aggregation, alerting, notification, action, dashboard
  20. Logging

  21. Logging stream (logging): rsyslog, log files, virtual machines, central log aggregation, local log aggregation
  22. Tracing failed attempts

  23. Tracing with JavaScript Object Notation {JSON}: Elastic, process info, virtual machine, (nested) JSON, fluentd, Kibana

  24. Tracing with JavaScript Object Notation {JSON}: Elastic, process info, virtual machine, (nested) JSON, fluentd, Kibana, Grafana, InfluxDB

  25. Tracing with JavaScript Object Notation {JSON}: process info, virtual machine, (nested) JSON
    - Nested JSON and dynamic structure are hard for the backend systems to “read”.
    - Keep it simple, understand TSDB and close the dev-ops gap.
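One way to "keep it simple" for a time-series backend is to flatten nested JSON into fixed, flat field names before shipping it. The sketch below shows the idea only; the `flatten` helper, the underscore separator and the sample event are assumptions, not what the platform actually used.

```python
import json


def flatten(obj, prefix="", sep="_"):
    """Flatten nested dicts and lists into a single-level dict.

    Keys are joined with `sep`; list items get their index as key.
    """
    flat = {}
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        flat[prefix] = obj
        return flat
    for key, value in items:
        name = f"{prefix}{sep}{key}" if prefix else str(key)
        flat.update(flatten(value, name, sep))
    return flat


# Hypothetical process-info event as it might be emitted on a virtual machine.
raw = '{"host": "vm-01", "process": {"name": "nginx", "cpu": {"user": 1.5, "system": 0.5}}, "ports": [80, 443]}'
flat_event = flatten(json.loads(raw))
print(flat_event)
```

Flat, predictable field names map directly onto the tags and fields of a time-series database, which is harder with documents whose structure changes per event.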
  26. Tracing

  27. Tracing data flow: rsyslog, Telegraf, InfluxDB, Elastic, Filebeat, Grafana, process info, log files, system stats, virtual machine, Dashboard
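InfluxDB accepts points in its text-based line protocol (`measurement,tags fields timestamp`), which Telegraf can also emit. The slide does not say how the data was encoded, so the following is only a sketch of that format: the helper names, the measurement, tags and values are invented, and the escaping covers the common cases rather than every edge case.

```python
def _escape(text, chars):
    """Backslash-escape the given characters in a key or tag value."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _field_value(value):
    """Encode a field value: booleans, integers (suffix i), floats, strings."""
    if isinstance(value, bool):  # bool must be tested before int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line(measurement, tags, fields, timestamp_ns):
    """Build one line-protocol record with a nanosecond timestamp."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # sorted tag keys are recommended by InfluxDB
        head += f",{_escape(key, ',= ')}={_escape(str(tags[key]), ',= ')}"
    body = ",".join(
        f"{_escape(key, ',= ')}={_field_value(value)}" for key, value in fields.items()
    )
    return f"{head} {body} {timestamp_ns}"


# Hypothetical process-info point for one virtual machine.
line = to_line(
    "process",
    {"name": "nginx", "host": "vm-01"},
    {"cpu_user": 1.5, "threads": 4},
    1560902400000000000,
)
print(line)
```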
  28. What did work and what didn’t? rsyslog, Telegraf, InfluxDB, Elastic, Filebeat, Grafana, process info, log files, system stats, virtual machine, Dashboard

  29. Notification: rsyslog, Telegraf, InfluxDB, Elastic, Filebeat, Grafana, process info, log files, system stats, virtual machine, Dashboard, Notification

  30. Slack notification: rsyslog, Telegraf, InfluxDB, Elastic, Filebeat, Grafana, process info, log files, system stats, virtual machine, Dashboard, Notification

  31. Example NGINX data flow: log files, process info, rsyslog, Telegraf, InfluxDB, Elastic, Filebeat, Grafana, system stats, Dashboard, Notification
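For the NGINX log files in this flow, the first step is usually turning an access-log line into named fields. The sketch below assumes NGINX's predefined `combined` log format; the slide does not state which format was configured, and the sample line is made up.

```python
import re
from datetime import datetime

# Pattern for NGINX's predefined "combined" access-log format.
COMBINED = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
)


def parse_line(line):
    """Parse one access-log line into a dict, or return None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["body_bytes_sent"] = int(record["body_bytes_sent"])
    # %b (month name) depends on the locale; English month names are assumed.
    record["time"] = datetime.strptime(
        record.pop("time_local"), "%d/%b/%Y:%H:%M:%S %z"
    )
    return record


# Made-up sample line using a documentation-only IP address.
sample = '203.0.113.7 - - [19/Jun/2019:10:15:32 +0200] "GET /v1/ping HTTP/1.1" 200 612 "-" "curl/7.64.0"'
record = parse_line(sample)
print(record["status"], record["request"], record["time"].isoformat())
```

Once parsed, fields such as `status` are natural candidates for tags or counters in the time-series database, while the raw line can still go to the log store.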
  32. NGINX in production

  33. Getting Apigee Management info: management API, Apigee Management, Grafana, InfluxDB, Python scripts
  34. Wrap up

  35. What did the 4 pillars teach us?
    - Monitoring: Hardware metrics alone are not enough.
    - Logging: Logging is required for security and auditing, but is also very useful for analytics and the team dashboard.
    - Tracing: Tracing is hard. Keep it simple, understand TSDB and close the dev-ops gap.
    - Alerting: Slack is for non-critical alerting; critical alerting goes via SMS, pager or phone.
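The alerting lesson amounts to a routing rule on severity. A minimal sketch, assuming a three-level severity scale and placeholder channel names; the `route` and `dispatch` helpers and the sender callables are hypothetical stand-ins for real Slack or SMS integrations.

```python
from enum import Enum


class Severity(Enum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3


def route(severity, critical_channel="sms"):
    """Pick the channel for an alert: Slack unless it is critical.

    `critical_channel` is whichever interrupting channel is configured,
    for example "sms", "pager" or "phone".
    """
    if severity is Severity.CRITICAL:
        return critical_channel
    return "slack"


def dispatch(severity, message, senders, critical_channel="sms"):
    """Send the message through the routed channel and return its name."""
    channel = route(severity, critical_channel)
    senders[channel](message)
    return channel


# Stand-in senders that only record what would have been sent.
outbox = []
senders = {
    "slack": lambda msg: outbox.append(("slack", msg)),
    "sms": lambda msg: outbox.append(("sms", msg)),
}
dispatch(Severity.WARNING, "disk 80% full", senders)
dispatch(Severity.CRITICAL, "API gateway down", senders)
print(outbox)
```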
  36. Observability lessons
    • Get a platform dashboard on-screen.
    • Beware of the anti-pattern of monitoring everything.
    • Understand the social and financial implications.
    • From unknown-unknowns to known-unknowns and more questions.
  37. Data, Information, Knowledge, ???: from unknown-unknowns to known-unknowns and more questions
  38. Thank you for your attention

  39. References:
    - KPN: KPN API Store
    - Medium: “Monitoring and Observability”
    - O’Reilly: “Distributed Systems Observability”
    - Twitter: Observability at Twitter Part 1 & Part 2
    - Google: “SRE fundamentals: SLIs, SLAs and SLOs”
    - Google: “SRE vs. DevOps: competing standards or close friends?”
    - Google: “Site Reliability Engineering” & “The Site Reliability Workbook” (free online books)
    - Forrester: “How To Apply Google's Site Reliability Engineering Approach To Your Infrastructure”
    - Digital Ocean: “An Introduction to Metrics, Monitoring, and Alerting”
    - Honeycomb: “Observability: A Manifesto”
    - InfoQ: “Three Pillars with Zero Answers: Rethinking Observability”
    - Scalyr: “Observability?! – Where do we go from here?”
    - Netflix Tech Blog: “Lessons from Building Observability Tools at Netflix”
    - OpenAPM: “OpenAPM Landscape”
    - Paessler: “paessler.com”