

Making users happy with Service Level Indicators and observability

A single sign-on system like Keycloak is a central component of the application landscape. Users depend on it to log in to their applications. Detailed monitoring and analysis options help prevent outages or resolve them quickly.
Over the past 12 months, the Keycloak team has implemented numerous technical improvements regarding logs, metrics, and traces. We also provide a guide to service-level indicators and a matching Grafana dashboard for defining and monitoring behavior from the user perspective.
This talk offers practical advice on using Keycloak and is also a case study of how observability can be implemented for custom-built applications.


Alexander Schwartz

June 28, 2025

Transcript

  1. Making users happy with service level indicators and observability
     Alexander Schwartz | Principal Software Engineer | IBM
     We Are Developers World Congress | Berlin (DE) | 2025-07-09
  2. The single sign-on situation
     • All applications depend on it
     • All users depend on it
     • There are usage spikes
  3. Keycloak is an Open Source Identity and Access Management Solution
     🎂 Initial commit 2013-07-02
     🏆 Cloud Native Computing Foundation Incubating project since April 2023
     📜 Apache License, Version 2.0
     ⭐ 28k GitHub stars
  4. Recipe to make users happy
     • Login works
     • Login is fast
     plus a lot of small details like token refresh, up-to-date user data, logout, password reset, auditing, …
  5. Availability 👍
     Measured by external tests to an application endpoint
     SLO: “Keycloak should be available X% of the time within 30 days.”
     Negatively affected by:
     • Software upgrades of your service
     • Unavailable infrastructure
     • Denial of service attacks
     • …
     https://www.keycloak.org/observability/keycloak-service-level-indicators
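     As a quick illustration (my arithmetic, not a number from the talk): with X = 99.9 %, a 30-day window leaves an error budget of 30 × 24 × 60 × 0.1 % ≈ 43 minutes of tolerated unavailability.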
  6. Error Rate ⚖
     Reported by the application or a load balancer
     SLO: “The rate of errors due to server problems for authentication requests should be less than X% within 30 days.”
     Negatively affected by:
     • Problems with neighboring systems
     • …
  7. Latency ⏱
     Reported by the application or a load balancer
     SLO: “X% of authentication related requests should be faster than Y ms within 30 days.”
     Negatively affected by:
     • Database or other infrastructure load
     • Application load
     • Lack of scalability
     • …
  8. Measuring Availability 👍
     1. Enable metrics. Prometheus will report the “up” metric. (A minimal startup sketch follows this slide.)
     count_over_time(
       sum(up{container="keycloak", namespace="$namespace"} > 0)[30d:15s]
     )
     /
     count_over_time(vector(1)[30d:15s])
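     A minimal sketch of the “enable metrics” step, assuming Keycloak 25+ started via kc.sh (the Prometheus-compatible endpoint is then served by the management interface under /metrics; adjust for your distribution):

       # Enable the built-in metrics endpoint at startup (sketch; other options omitted)
       bin/kc.sh start --metrics-enabled=true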
  9. Measuring Error Rate ⚖
     1. Identify relevant URLs and response codes
     sum(
       rate(
         http_server_requests_seconds_count{
           uri=~"/realms/...",
           outcome="SERVER_ERROR",
           container="keycloak",
           namespace="$namespace"}[30d]
       )
     ) without (...)
     /
     sum( // without filtering by outcome ) without (...)
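     A hedged, concrete version of the error-rate query; the uri regular expression and the labels dropped via "without" are my illustrative assumptions, not the exact values elided on the slide:

       # Share of server errors among authentication-related requests over 30 days
       sum(
         rate(http_server_requests_seconds_count{
               uri=~"/realms/.+/protocol/openid-connect/.*",
               outcome="SERVER_ERROR",
               container="keycloak", namespace="$namespace"}[30d])
       ) without (uri, method, status, outcome)
       /
       sum(
         rate(http_server_requests_seconds_count{
               uri=~"/realms/.+/protocol/openid-connect/.*",
               container="keycloak", namespace="$namespace"}[30d])
       ) without (uri, method, status, outcome)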
  10. Measuring Latency ⏱
     1. Enable histograms for URL requests and add a service level objective
     2. Use the same URLs and response codes as before
     sum(
       rate(
         http_server_requests_seconds_bucket{
           uri=~"/realms/...",
           le="0.25",
           container="keycloak",
           namespace="$namespace"}[30d]
       )
     ) without (...)
     /
     sum( // use http_server_requests_seconds_count instead ) without (...)
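     A hedged, concrete version of the latency query, assuming histogram buckets for HTTP requests have been enabled as described in the Keycloak guides; again, the uri pattern and dropped labels are illustrative assumptions:

       # Share of authentication-related requests completing within 250 ms over 30 days
       sum(
         rate(http_server_requests_seconds_bucket{
               uri=~"/realms/.+/protocol/openid-connect/.*",
               le="0.25",
               container="keycloak", namespace="$namespace"}[30d])
       ) without (uri, method, status, outcome, le)
       /
       sum(
         rate(http_server_requests_seconds_count{
               uri=~"/realms/.+/protocol/openid-connect/.*",
               container="keycloak", namespace="$namespace"}[30d])
       ) without (uri, method, status, outcome)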
  11. Analyzing Errors
     You will find information by:
     • Analyzing application logs
     • Visualizing metrics for correlation
     • …
     • Enabling tracing and exemplars
  12. Analyzing Latencies
     You will find information by:
     • Additional logging (for example Hibernate slow queries)
     • Visualizing metrics for correlation
     • …
     • Enabling tracing and exemplars
  13. Enabling tracing
     • Enable tracing for your application.
     • In production, apply sampling to trace, for example, 1% of all requests to reduce overhead and limit the collected data (a minimal configuration sketch follows this slide).
     • Independent of the sampling, 100% of the logs will contain trace IDs, and they will propagate between systems.
     • Set up firewalls and proxies to limit propagation of trace information.
     • Set up a destination for the tracing data (for example Jaeger or Tempo).
     • Use web UIs to filter by duration, errors, URIs, and export/import single traces.
     https://www.keycloak.org/observability/tracing
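     A minimal configuration sketch for the tracing and sampling steps above, assuming Keycloak 26+ with its OpenTelemetry-based tracing; verify the option names against the linked guide for your version, and treat the endpoint as a hypothetical OTLP collector address:

       # Enable tracing and sample roughly 1% of requests to limit overhead
       bin/kc.sh start \
         --tracing-enabled=true \
         --tracing-endpoint=http://tempo.example.internal:4317 \
         --tracing-sampler-ratio=0.01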
  14. Tracing implementation in Keycloak
     • Added custom tags for searching (Client ID, Realm name, User session ID, Token ID, Authentication Session ID, …)
     • Default spans for REST, HTTP client calls, and SQL
     • Custom spans for LDAP, transaction handling, password handling
  15. Enabling Exemplars
     • Enable metrics and tracing for your application.
     • Activate exemplar handling in Prometheus (or similar); a minimal sketch follows this slide.
     • Scrape the metrics in a format with exemplar support (OpenMetrics).
     • Set up a metrics datasource in Grafana (or similar) that links to traces.
     • Set up dashboards in Grafana (or similar) to show exemplars.
     https://www.keycloak.org/observability/exemplars
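     A minimal sketch for the Prometheus side of the exemplar steps above: exemplar storage sits behind a feature flag, and the target has to be scraped in the OpenMetrics format so exemplars come along with the samples:

       # Start Prometheus with exemplar storage enabled (feature flag in current Prometheus releases)
       prometheus --enable-feature=exemplar-storage --config.file=prometheus.yml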
  16. Create a Dashboard with SLOs
     • Keycloak provides an example Grafana dashboard
     • Drill down to essential troubleshooting metrics
       ◦ Keycloak metrics including user event metrics
       ◦ JVM metrics
       ◦ Database metrics
       ◦ HTTP metrics
       ◦ Infinispan metrics
     https://www.keycloak.org/observability/grafana-dashboards
     https://www.keycloak.org/observability/metrics-for-troubleshooting
  17. Happy users, happy DevOps team!
     • Track the metrics that matter for your users
     • Make SLOs the basis of your alerting
     • Chase tail latencies and errors with exemplars
  18. Simple numbers, happy management
     • Have one set of numbers to share with your management
     • Optimize your infrastructure costs with the given latency target
     • Balance feature development vs. SLI optimization
  19. Links
     • Keycloak https://www.keycloak.org/
     • Keycloak Observability https://www.keycloak.org/guides#observability
     Kudos to the Quarkus observability efforts (thank you to Bruno Baptista and Erin Schnabel) and the Keycloak team (Martin Bartosz, Michal Hajas, Ryan Emerson, Kamesh Akella)
     Slides: