Observability: more data, more questions

This presentation was given at the second API-days event in Amsterdam.

In this presentation I talk about KPN's API-Store and our observability experiences with the platform.

Albert W. Alberts

June 19, 2019

Transcript

  1. 2.

    Albert Alberts
    Architect at KPN since January 1999:
    - Previous functions: developer, designer, software architect
    - Worked on several KPN patents
    - Currently architect for VPC, SDN and the KPN API Store
    Meetup organizer:
    - devNetNoord, a developer community (589 members)
    - domoticaGrunn, a home automation community (240 members)
    Spare time: (open water) swimming, water polo, cycling
    albert.alberts@kpn.com
  2. 5.
  3. 9.

    Service Levels: the base of Site Reliability Engineering (SRE)
    - SLA (Service Level Agreement): contract between customer and supplier
    - SLO (Service Level Objective): availability and reliability goals
    - SLI (Service Level Indicator): probes measured over time
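
The relationship between an SLI and an SLO can be sketched in a few lines. This is a minimal illustration, not something shown in the talk: the `ProbeWindow` shape, the function names and the numbers are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ProbeWindow:
    """Probe results for one measurement window (hypothetical shape)."""
    total: int       # number of probes sent in the window
    successful: int  # number of probes that succeeded


def availability_sli(window: ProbeWindow) -> float:
    """SLI: the fraction of successful probes measured over the window."""
    if window.total == 0:
        raise ValueError("no probes in window")
    return window.successful / window.total


def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO check: does the measured indicator reach the objective?"""
    return sli >= slo_target


# Assumed numbers: 10,000 probes, 10 failures, objective of 99.5 %.
window = ProbeWindow(total=10_000, successful=9_990)
sli = availability_sli(window)
print(f"SLI={sli:.4f} meets SLO: {meets_slo(sli, 0.995)}")
```

An SLA would sit on top of this as the contractual consequence of missing the objective, which is a business rule rather than a calculation.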
  4. 10.

    Monitoring
    Monitoring is about known-unknowns and actionable alerts.
    Monitoring is for symptom-based alerting: your monitoring system should address two questions: what’s broken, and why? The “what’s broken” indicates the symptom; the “why” indicates a (possibly intermediate) cause.
  5. 11.

    Observability
    Observability is about unknown-unknowns: empowering you to ask new questions and answer questions yet to be formulated …
  6. 12.

    Metrics
    Metrics are raw measurements of resource usage or behavior that can be observed and collected over time.
  7. 13.

    Monitoring
    Monitoring is the process of collecting, aggregating, and analyzing metrics to improve awareness of your components' characteristics and behavior.
  8. 15.

    Tracing
    A trace is a representation of a series of causally related distributed events that encode the end-to-end request flow through a distributed system.
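
One way to picture "causally related events" is as spans that point at their parent. The sketch below is an assumption-laden illustration (the `Span` fields, service names and operations are hypothetical and do not come from the talk); it only shows how parent links are enough to reconstruct the end-to-end request flow.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    """One event in a trace; parent_id encodes the causal link."""
    span_id: str
    parent_id: Optional[str]  # None marks the root of the trace
    service: str
    operation: str


def render_trace(spans: List[Span]) -> List[str]:
    """Rebuild the request flow as an indented call tree."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children.get(parent_id, []):
            lines.append("  " * depth + f"{span.service}: {span.operation}")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Hypothetical request passing through three services.
trace = [
    Span("a", None, "gateway", "GET /orders"),
    Span("b", "a", "auth", "check token"),
    Span("c", "a", "orders", "list"),
    Span("d", "c", "db", "select"),
]
print("\n".join(render_trace(trace)))
```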
  9. 16.

    Alerting
    Alerting is the responsive component of a monitoring system that performs actions based on changes in metric values.
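
A minimal sketch of "actions based on changes in metric values" follows. The `ThresholdAlert` class, the metric name and the threshold are assumptions for illustration; a real monitoring system would add things like evaluation windows and repeat intervals. The point shown is that the action runs on a state change, not on every sample.

```python
from typing import Callable, List


class ThresholdAlert:
    """Runs an action when a metric crosses a threshold in either direction."""

    def __init__(self, name: str, threshold: float,
                 action: Callable[[str], None]) -> None:
        self.name = name
        self.threshold = threshold
        self.action = action
        self.firing = False

    def observe(self, value: float) -> None:
        breached = value > self.threshold
        if breached and not self.firing:
            # Transition OK -> firing: act once, not on every sample.
            self.action(f"{self.name}: FIRING (value={value}, "
                        f"threshold={self.threshold})")
        elif not breached and self.firing:
            # Transition firing -> OK.
            self.action(f"{self.name}: RESOLVED (value={value})")
        self.firing = breached


# Collect messages in a list instead of sending them anywhere.
sent: List[str] = []
alert = ThresholdAlert("cpu_load", threshold=0.9, action=sent.append)
for sample in [0.5, 0.95, 0.97, 0.6]:
    alert.observe(sample)
print(sent)
```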
  10. 18.

    Paessler PRTG stream (metrics/monitoring)
    Diagram components: virtual machine, bandwidth, system stats, connectivity, certificates, aggregation, alerting, notification, action, dashboard
  11. 20.
  12. 24.

    Tracing with JavaScript Object Notation {JSON}
    Diagram components: Elastic, process info, virtual machine, (nested) JSON, fluentd, Kibana, Grafana, InfluxDB
  13. 25.

    Tracing with JavaScript Object Notation {JSON}
    Diagram components: process info, virtual machine, (nested) JSON
    Nested JSON and dynamic structures are hard for the backend systems to “read”.
    Keep it simple, understand TSDB and close the dev-ops gap.
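
One common way to "keep it simple" for a time-series backend is to flatten nested JSON into fixed, dotted field names before shipping it. The sketch below is an illustration under that assumption; the `flatten` helper and the sample event are hypothetical and are not taken from the KPN platform.

```python
import json
from typing import Any, Dict


def flatten(obj: Any, prefix: str = "", sep: str = ".") -> Dict[str, Any]:
    """Turn nested dicts/lists into a single-level dict with dotted keys."""
    flat: Dict[str, Any] = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            name = f"{prefix}{sep}{key}" if prefix else str(key)
            flat.update(flatten(value, name, sep))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            name = f"{prefix}{sep}{index}" if prefix else str(index)
            flat.update(flatten(value, name, sep))
    else:
        # Scalar leaf: store it under the accumulated key.
        flat[prefix] = obj
    return flat


# Hypothetical process-info event as an application might emit it.
event = json.loads(
    '{"process": {"name": "gateway", "stats": {"cpu": 0.42, "mem_mb": 512}},'
    ' "vm": "vm-01"}'
)
flat_event = flatten(event)
print(flat_event)
```

Flat keys map directly onto tags and fields in a TSDB, whereas a structure whose shape changes per event forces the backend to guess a schema.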
  14. 26.
  15. 27.

    Tracing data flow
    Diagram components: rsyslog, Telegraf, InfluxDB, Elastic, Filebeat, Grafana, process info, log files, system stats, virtual machine, Dashboard
  16. 28.

    What did work and what didn’t?
    Diagram components: rsyslog, Telegraf, InfluxDB, Elastic, Filebeat, Grafana, process info, log files, system stats, virtual machine, Dashboard
  17. 29.

    Notification
    Diagram components: rsyslog, Telegraf, InfluxDB, Elastic, Filebeat, Grafana, process info, log files, system stats, virtual machine, Dashboard, Notification
  18. 30.

    Slack notification
    Diagram components: rsyslog, Telegraf, InfluxDB, Elastic, Filebeat, Grafana, process info, log files, system stats, virtual machine, Dashboard, Notification
  19. 31.

    Example NGINX data flow
    Diagram components: log files, process info, rsyslog, Telegraf, InfluxDB, Elastic, Filebeat, Grafana, system stats, Dashboard, Notification
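
To make the log-file leg of this flow concrete, the sketch below turns one NGINX access-log line into an InfluxDB line-protocol record. On the platform described in the slides this work is done by the listed tools (for example Telegraf), so the code is only an illustration of the transformation. It assumes NGINX's default "combined" log format; the measurement name `nginx_access`, the choice of tags and fields, and the sample line are hypothetical.

```python
import re
from datetime import datetime

# Matches the leading part of NGINX's "combined" access-log format.
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) \S+ (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes_sent>\d+)'
)


def to_line_protocol(log_line: str) -> str:
    """Convert one access-log line to: measurement,tags fields timestamp_ns."""
    m = LOG_PATTERN.match(log_line)
    if m is None:
        raise ValueError("line does not match the combined log format")
    ts = datetime.strptime(m["time_local"], "%d/%b/%Y:%H:%M:%S %z")
    ns = int(ts.timestamp()) * 1_000_000_000
    # String field values are double-quoted; escape backslashes and quotes.
    path = m["path"].replace("\\", "\\\\").replace('"', '\\"')
    return (f'nginx_access,method={m["method"]},status={m["status"]} '
            f'bytes_sent={m["bytes_sent"]}i,path="{path}" {ns}')


sample = ('203.0.113.7 - - [19/Jun/2019:10:15:30 +0200] '
          '"GET /v1/status HTTP/1.1" 200 612 "-" "curl/7.64.0"')
line = to_line_protocol(sample)
print(line)
```

Low-cardinality values (method, status) are used as tags and the request path as a field here; that split is a design choice for the example, since high-cardinality tags tend to be expensive in a TSDB.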
  20. 34.
  21. 35.

    What did the 4 pillars teach us?
    Monitoring: Hardware metrics alone are not enough.
    Logging: Logging is required for security and auditing, but is also very useful for analytics and the team dashboard.
    Tracing: Tracing is hard. Keep it simple, understand TSDB and close the dev-ops gap.
    Alerting: Slack is for non-critical alerting. Critical alerting goes via SMS, pager or phone.
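
The alerting lesson amounts to a routing rule keyed on severity. A minimal sketch, assuming a three-level severity scale and channel names chosen for this example (neither is specified in the slides):

```python
from enum import Enum
from typing import List


class Severity(Enum):
    INFO = "info"
    WARNING = "warning"
    CRITICAL = "critical"


def channels_for(severity: Severity) -> List[str]:
    """Pick notification channels: chat for non-critical, paging for critical."""
    if severity is Severity.CRITICAL:
        # Critical alerts must reach a person, so use interrupting channels.
        return ["sms", "pager", "phone"]
    # Everything else can wait until someone reads the team channel.
    return ["slack"]


for level in Severity:
    print(level.value, "->", channels_for(level))
```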
  22. 36.

    Observability lessons
    • Get a platform dashboard on-screen.
    • Beware of the anti-pattern of monitoring everything.
    • Understand the social and financial implications.
    • From unknown-unknowns to known-unknowns and more questions
  23. 39.

    References:
    - KPN: KPN API Store
    - Medium: “Monitoring and Observability”
    - O’Reilly: “Distributed Systems Observability”
    - Twitter: Observability at Twitter Part 1 & Part 2
    - Google: “SRE fundamentals: SLIs, SLAs and SLOs”
    - Google: “SRE vs. DevOps: competing standards or close friends?”
    - Google: “Site Reliability Engineering” & “The Site Reliability Workbook” (free online books)
    - Forrester: “How To Apply Google's Site Reliability Engineering Approach To Your Infrastructure”
    - Digital Ocean: “An Introduction to Metrics, Monitoring, and Alerting”
    - Honeycomb: “Observability: A Manifesto”
    - InfoQ: “Three Pillars with Zero Answers: Rethinking Observability”
    - Scalyr: “Observability?! – Where do we go from here?”
    - Netflix Tech Blog: “Lessons from Building Observability Tools at Netflix”
    - OpenAPM: “OpenAPM Landscape”
    - Paessler: “paessler.com”