Beyond Dashboarding The Grafana Observability Stack by Steve Caron

Slide 1

Slide 1 text

Beyond Dashboarding: The Grafana Observability Stack
Steve Caron, Staff Solutions Engineer

Slide 2

Slide 2 text

How most people started with Grafana

Slide 3

Slide 3 text

● Loki for logs
● Grafana for visualizations
● Tempo for traces
● Mimir for metrics

Slide 4

Slide 4 text

Grafana Mimir

Slide 5

Slide 5 text

Running Prometheus at scale
Prometheus is great but…
● Limited horizontal scalability: out of the box, Prometheus scales only vertically.
● No robust federation: a centralised view of metrics can only be achieved using hierarchical federation or cross-service federation.
● Not designed for long-term retention: Prometheus uses local storage. Retention is traditionally set to 15 days and rarely above 30 days.
● No security model: no authentication mechanism or role-based access controls for protecting your data.

Slide 6

Slide 6 text

Prometheus on steroids
Mimir = Prometheus +
● Durable storage
● Blazing fast query performance
● Production-proven dashboards, alerts, and playbooks
● High availability
● Horizontal scalability
● Real multi-tenancy

Slide 7

Slide 7 text

Running Prometheus at scale
[Diagram: Applications 1 to N in Region A and in Region B, connected by remote write and query arrows]
For Prometheus users
● Leverage your existing investment by using Prometheus as a metrics forwarder.
● 100% compatible with your existing queries, alerts and recording rules.
For all
● Get started in a few clicks using the Grafana Agent (embeds the Prometheus agent).
● Query your Mimir metrics using Grafana.
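
Because Mimir is compatible with existing Prometheus queries, a client can talk to it much like it talks to Prometheus. The sketch below only builds the URL and headers for an instant query; it sends nothing. It assumes Mimir's default `/prometheus` API prefix and the `X-Scope-OrgID` tenant header. The host name, tenant ID, and the helper name `build_mimir_query` are hypothetical.

```python
from urllib.parse import urlencode


def build_mimir_query(base_url: str, tenant: str, promql: str) -> tuple[str, dict[str, str]]:
    """Return the URL and headers for an instant PromQL query against Mimir."""
    # Assumption: the Prometheus-compatible API is served under the
    # default /prometheus prefix.
    query_string = urlencode({"query": promql})
    url = f"{base_url.rstrip('/')}/prometheus/api/v1/query?{query_string}"
    # Multi-tenancy: the tenant ID travels in the X-Scope-OrgID header.
    headers = {"X-Scope-OrgID": tenant}
    return url, headers


if __name__ == "__main__":
    # Hypothetical host and tenant, for illustration only.
    url, headers = build_mimir_query(
        "http://mimir.example.internal:8080", "team-a", "up"
    )
    print(url)
    print(headers)
```

The same PromQL string that works against a local Prometheus is passed through unchanged; only the base URL and the tenant header differ.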

Slide 8

Slide 8 text

Grafana Loki

Slide 9

Slide 9 text

Who did we make Loki for?
● DevOps: effective debugging and troubleshooting of applications
● SRE: visualise and alert on service/app performance metrics
● DataEng: build actionable insights from log data and other supported data sources

Slide 10

Slide 10 text

Why do they like Loki?
● Format agnostic
● Efficient at scale
● Built for correlation
● Logs as metrics

Slide 11

Slide 11 text

Under the hood
2019-12-11T10:01:02.123456789Z {app="nginx", env="dev"} GET /about 1034 Debug "page not found"
● Timestamp (indexed): nanosecond precision
● Labels/Selectors (indexed): key-value pairs
● Log content (unindexed): JSON, logfmt, custom, etc.
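
To make the timestamp / labels / content split concrete, here is a minimal sketch that shapes the example entry into the JSON body accepted by Loki's push endpoint (`/loki/api/v1/push`): labels go into `stream`, and each log line is paired with a nanosecond timestamp written as a string. The helper name `to_push_payload` is hypothetical, and nothing is sent over the network.

```python
import json
from datetime import datetime, timezone


def to_push_payload(labels: dict[str, str], entries: list[tuple[int, str]]) -> dict:
    """Build a push body: one stream, identified by its label set."""
    return {
        "streams": [
            {
                # Labels identify the stream; this is the indexed part.
                "stream": labels,
                # Each value is [timestamp in ns as a string, raw log line].
                "values": [[str(ts_ns), line] for ts_ns, line in entries],
            }
        ]
    }


# 2019-12-11T10:01:02.123456789Z expressed in nanoseconds since the epoch.
seconds = int(datetime(2019, 12, 11, 10, 1, 2, tzinfo=timezone.utc).timestamp())
ts_ns = seconds * 1_000_000_000 + 123_456_789

payload = to_push_payload(
    {"app": "nginx", "env": "dev"},
    [(ts_ns, 'GET /about 1034 Debug "page not found"')],
)

if __name__ == "__main__":
    print(json.dumps(payload, indent=2))
```

The log line itself is stored as an opaque string, which is why Loki can stay format agnostic.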

Slide 12

Slide 12 text

Get the most out of your logs with LogQL
● Inspired by PromQL syntax for effortless correlation between metrics and logs.
● Build metrics from logs and unlock new use cases.
● Use your LogQL queries to create advanced alerting rules.
{app="nginx",instance="1.1.1.1"} != "Googlebot/" | json | request_time >= 100 and status == 200
● Label matchers: {app="nginx",instance="1.1.1.1"}
● Line filters: != "Googlebot/"
● Parser: | json
● Label filters: | request_time >= 100 and status == 200
*Successful requests with a latency above 100 ms (Googlebot requests excluded)
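
The four stages of the query `{app="nginx",instance="1.1.1.1"} != "Googlebot/" | json | request_time >= 100 and status == 200` can be emulated in a few lines. This is an illustrative sketch of the stage order only, not how Loki evaluates queries: the sample streams and the function name `run_pipeline` are made up, and lines that fail to parse are simply skipped as a simplification.

```python
import json


def run_pipeline(streams):
    """Emulate label matchers -> line filter -> json parser -> label filter."""
    results = []
    for labels, lines in streams:
        # Label matchers: select streams by their indexed labels.
        if labels.get("app") != "nginx" or labels.get("instance") != "1.1.1.1":
            continue
        for line in lines:
            # Line filter: drop lines containing the substring "Googlebot/".
            if "Googlebot/" in line:
                continue
            # Parser: extract fields from the JSON log line.
            try:
                fields = json.loads(line)
            except json.JSONDecodeError:
                continue  # simplification: skip unparsable lines
            # Label filter: compare the extracted fields.
            if fields.get("request_time", 0) >= 100 and fields.get("status") == 200:
                results.append(fields)
    return results


# Hypothetical sample data.
sample_streams = [
    (
        {"app": "nginx", "instance": "1.1.1.1"},
        [
            '{"status": 200, "request_time": 250, "agent": "curl/8.0"}',
            '{"status": 200, "request_time": 300, "agent": "Googlebot/2.1"}',
            '{"status": 500, "request_time": 400, "agent": "curl/8.0"}',
            '{"status": 200, "request_time": 20, "agent": "curl/8.0"}',
        ],
    ),
    ({"app": "nginx", "instance": "2.2.2.2"},
     ['{"status": 200, "request_time": 900, "agent": "curl/8.0"}']),
]

if __name__ == "__main__":
    for row in run_pipeline(sample_streams):
        print(row)
```

Only the first sample line survives all four stages: the others are dropped by the line filter, the label filter, or the label matchers.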

Slide 13

Slide 13 text

Promtail
Makes log collection easy with...
● Target discovery for Kubernetes, Syslog, files and more
● Automatically attach labels to your log lines
● Advanced pipeline mechanism for parsing, transforming and filtering your logs
● Build and expose custom metrics from your log data
But Loki is open: logstash, Lambda

Slide 14

Slide 14 text

Grafana Tempo

Slide 15

Slide 15 text

What is distributed tracing? A way to observe requests as they propagate through a distributed system

Slide 16

Slide 16 text

How to get started with distributed tracing?
● Instrument: instrument your code using agents and libraries to generate spans for your services.
● Collect: use tracing pipelines to collect, transform and enrich spans.
● Store: store all the traces for querying and building more insights.
● Visualize: use Grafana to detect and investigate service issues. Correlate your traces with metrics and log data.

Slide 17

Slide 17 text

Metrics Generation
● traces_spanmetrics_calls_total - Counter, total count of the span (Rate, Error)
● traces_spanmetrics_latency - Histogram, duration of the span (Duration) (includes exemplars)
● traces_spanmetrics_size_total - Counter, total size of spans ingested (Volume)
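
As a rough illustration of how a call counter, a latency histogram, and a size counter can be derived from spans, the sketch below aggregates a list of span records. It is not Tempo's implementation: the span fields, the label set (service, span name, status), the bucket boundaries, and the function name are all assumptions chosen for the example.

```python
from bisect import bisect_left
from collections import Counter, defaultdict

# Hypothetical histogram upper bounds, in seconds.
BUCKETS = [0.1, 0.5, 1.0, float("inf")]


def generate_span_metrics(spans):
    """Aggregate spans into calls_total, latency buckets, and size_total."""
    calls_total = Counter()
    latency_buckets = defaultdict(lambda: [0] * len(BUCKETS))
    size_total = Counter()
    for span in spans:
        key = (span["service"], span["name"], span["status"])
        # Counter: one increment per span (rate and error views).
        calls_total[key] += 1
        # Histogram: cumulative buckets, so every bucket whose upper
        # bound is >= the duration is incremented.
        first = bisect_left(BUCKETS, span["duration_s"])
        for i in range(first, len(BUCKETS)):
            latency_buckets[key][i] += 1
        # Counter: bytes ingested (volume view).
        size_total[key] += span["size_bytes"]
    return calls_total, latency_buckets, size_total


# Hypothetical spans.
sample_spans = [
    {"service": "auth", "name": "login", "status": "ok", "duration_s": 0.05, "size_bytes": 200},
    {"service": "auth", "name": "login", "status": "ok", "duration_s": 0.7, "size_bytes": 300},
    {"service": "auth", "name": "login", "status": "error", "duration_s": 0.2, "size_bytes": 100},
]

if __name__ == "__main__":
    calls, latency, size = generate_span_metrics(sample_spans)
    print(dict(calls))
    print(dict(latency))
    print(dict(size))
```

Rate and error ratios then come from the counter, and duration percentiles from the histogram buckets.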

Slide 18

Slide 18 text

TraceQL
● Inspired by PromQL and LogQL
● Extract insights from traces interactively
● Analyze traces based on their structure
{ .namespace = "prod" } > { .service.name = "auth" && { .http.status_code = 500 }
{ .http.status_code = 500 } | count() > 1
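
The query `{ .http.status_code = 500 } | count() > 1` reads as: within each trace, select the spans whose HTTP status is 500, then keep the trace only if more than one span matched. A minimal Python emulation of that logic, using made-up trace data and a hypothetical function name, might look like this (it is a sketch of the semantics, not of Tempo's query engine):

```python
def traces_with_repeated_500s(traces):
    """Emulate: { .http.status_code = 500 } | count() > 1"""
    matched = {}
    for trace_id, spans in traces.items():
        # Spanset filter: keep spans with the matching attribute value.
        spanset = [s for s in spans if s.get("http.status_code") == 500]
        # Aggregate filter: keep the trace if more than one span matched.
        if len(spanset) > 1:
            matched[trace_id] = spanset
    return matched


# Hypothetical traces: trace ID -> list of span attribute dicts.
sample_traces = {
    "trace-a": [
        {"service.name": "gateway", "http.status_code": 500},
        {"service.name": "auth", "http.status_code": 500},
    ],
    "trace-b": [
        {"service.name": "gateway", "http.status_code": 200},
        {"service.name": "auth", "http.status_code": 500},
    ],
}

if __name__ == "__main__":
    print(list(traces_with_repeated_500s(sample_traces)))
```

Here only `trace-a` is returned, because `trace-b` contains a single span with status 500.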

Slide 19

Slide 19 text

How to run the Grafana stack?
Monolithic mode
● Simplest deployment mode
● All components in one single process
● Great for testing
Microservices mode
● Maximum scalability
● Separate read/write paths
● Recommended for production deployments and large volumes
Grafana Cloud
● Fully managed by Grafana
● Available in 7 regions
● Free-forever tier (50GB logs and traces per month, 10K active series, 3 users)

Slide 20

Slide 20 text

Simplified architecture

Slide 21

Slide 21 text

How to collect your telemetry data
The community way or the Grafana way

Slide 22

Slide 22 text

Anatomy of the Grafana Agent
● Metrics - Shares the same codebase as the Prometheus Agent.
● Logs - Embeds Promtail, the log forwarder built by Grafana for Loki.
● Traces - Based on the OpenTelemetry Collector.

Slide 23

Slide 23 text

Demo

Slide 24

Slide 24 text

An open source, highly scalable and cost-efficient continuous profiling database

Slide 25

Slide 25 text

An open source web SDK for frontend application observability
1.5M+ NPM downloads

Slide 26

Slide 26 text

Open source eBPF auto-instrumentation for application observability

Slide 27

Slide 27 text

Thank you!
Have more questions? Join us at community.grafana.com or on the Grafana public Slack: slack.grafana.com #channel
Get involved: grafana/ community.grafana.com