Highly available Identity and Access Management with multi-site Keycloak deployments in the cloud

A single sign-on solution for your customers and employees shouldn't be a single point of failure in your architecture. Keycloak, a popular Open Source Identity and Access Management solution that provides single sign-on, amongst other capabilities, is no exception to this.
A clustered Keycloak deployment in a single site or datacenter provides sufficient availability for many. An increasing number of organizations need to utilize multiple sites for improved resiliency or to meet legal requirements. In 2023, Keycloak overhauled its multi-site capabilities for public cloud infrastructures, tested them thoroughly and provided deployment blueprints to the community. They show how to set up an AWS infrastructure and deploy Keycloak across multiple sites.
This talk presents, from an architect's and a developer's perspective, how we approached the problem, which architecture we chose, the challenges we faced and which tools helped us along the way. Expect to dive into concepts like load shedding, cache stampedes, and automated failover. See tools like Gatling, Helm, OpenTelemetry, Kubernetes Operators and AWS infrastructure in action. We will also provide an outlook for the next steps in our journey.
These insights will help you to improve your Keycloak deployments as well as design and test your own applications so they can withstand high load and site failures.

Alexander Schwartz

April 18, 2024


  1. Highly available Identity and Access Management with multi-site Keycloak deployments in the cloud

    Ryan Emerson, Alexander Schwartz | Principal Software Engineers | Red Hat
    Devoxx France | 2024-04-18
  2. Authenticate and authorize users for services

    Diagram labels: Login, Request, Verify token, < Token >, API, Cloud Services
    • AuthZ + AuthN
    • Manage users, credentials, permissions, ...
    • Handle user registration, password reset, …
    • Integrate with existing security infrastructure
  3. Day 1: Single-Sign-On is cool!

    • Users need to remember only one password
    • Authenticate only once per day
    • Add second factor for authentication for security
    • Theme the frontend to match your needs
    Makes sense already for a single application!
  4. Day 2: Become flexible in your setup

    • Integrate LDAP and Kerberos
    • Brokerage to existing SAML services
    • Brokerage to existing OIDC services
    • Integrate existing custom stores
    Reuse existing user stores!
  5. Day 3: Eliminate daily churn

    • User password recovery (even when using LDAP)
    • Self-registration for users
    • User data self-management
    Resolve the need for calls and tickets!
  6. Single Process

    Diagram labels: User sessions cached locally in-memory, Database (Persistent State), Keycloak, Users
    • Sessions frequently accessed, cached in-memory for reduced latency
    • Infrequently updated data persisted:
      ◦ Users, Groups, Roles…
  7. N Keycloak Processes

    Diagram labels: Clustered session state, Database (Persistent State), Single Availability Zone
    • Deploy with Kubernetes
    • Sessions replicated between Keycloak processes
    • Can tolerate K8s node/pod failures
    • Increased performance and resilience
  8. Multi Site

    Diagram labels: Availability Zone 1, Availability Zone 2, Persistent State, ?, ?
    • Deploy Keycloak to multiple availability zones
    • How to maintain session and persistent state?
  9. Multi Site - Active/Passive

    Diagram labels: AZ-1 Active, AZ-2 Passive
    • Keycloak site “Active” or “Passive”
    • All user requests forwarded to Active site
    • Greatly simplifies write semantics
    • Avoid data contention
  10. Multi Site - Active/Passive

    Diagram labels: AZ-1 Active, AZ-2 Active
    • Passive becomes Active after failover
    • Users connect to backup only after failover
  11. User Connections

    Diagram labels: Health Check, Health Check, AZ-1, AZ-2
    • All user requests go via the Route53 hostname
    • Route53 determines which site is “Active”
    • Requests always routed to Active site
    • Periodic health checks determine site health
  12. User Connections

    Diagram labels: AZ-1, AZ-2, Health Check, Health Check
    • On failover, DNS routing is updated and a new Active site is established
    • DNS caching can lead to longer failover times for some users
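The routing behaviour on slides 11 and 12 can be pictured with a small model. This is only a sketch and not the Route53 API: the class `FailoverRouter`, the site names and the failure threshold are hypothetical, and it assumes that failover is sticky, meaning the backup site stays Active until an operator intervenes.

```java
/** Hypothetical model of health-check-driven active/passive routing (not the Route53 API). */
final class FailoverRouter {
    private final String primarySite;
    private final String backupSite;
    private final int failureThreshold;   // assumption: consecutive failed checks before failover
    private int consecutiveFailures;
    private boolean failedOver;

    FailoverRouter(String primarySite, String backupSite, int failureThreshold) {
        if (failureThreshold < 1) {
            throw new IllegalArgumentException("failureThreshold must be at least 1");
        }
        this.primarySite = primarySite;
        this.backupSite = backupSite;
        this.failureThreshold = failureThreshold;
    }

    /** Records the outcome of one periodic health check against the primary site. */
    synchronized void recordPrimaryHealthCheck(boolean healthy) {
        consecutiveFailures = healthy ? 0 : consecutiveFailures + 1;
        if (consecutiveFailures >= failureThreshold) {
            // Assumption: once failed over, stay on the backup until someone resets it manually.
            failedOver = true;
        }
    }

    /** Returns the site that should currently receive all user requests. */
    synchronized String activeSite() {
        return failedOver ? backupSite : primarySite;
    }
}
```

A single failed check does not move traffic in this model; only a run of failures does. The DNS caching effect mentioned on the slide is not modelled here: clients that cached the old answer would keep using the old site until their cached entry expires.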
  13. Infinispan

    • In-memory Key/Value Cache
    • Advanced clustering capabilities
    • Independent Project
    • Kubernetes Operator
    • Spring Boot, Quarkus and more!
    • Apache 2.0 License
    Diagram labels: App, App
    • Embedded Mode
    • Client/Server
  14. Active/Passive Session Replication

    Diagram labels: AZ-1 Active, AZ-2 Passive, Session writes replicated across availability zones, Read/Writes, Writes
    • Utilise Infinispan Server
    • Infinispan Cross-Site replication used to sync session state
    • Infinispan server supports advanced admin operations for Cross-Site failover management
    • Session data now survives Keycloak pod restarts
  15. Aurora DB

    Diagram labels: AWS Region, Aurora, Availability Zone 1 (Writer), Availability Zone 2 (Reader), Reads, Writes, Reads
    • All reads and writes handled by a single “Writer” instance
    • Writer instance hosted in same AZ as Active site
    • Data written to both Availability Zones to allow failover
  16. Aurora DB

    Diagram labels: AWS Region, Aurora, Availability Zone 1 (Writer), Availability Zone 2 (Writer), Reads, Writes, Reads, Writes
    • Election of new writer instance managed by Aurora
    • Failover takes ~ 1m
    • AWS JDBC wrapper ensures Keycloak pods drop old connections on failover
    • No additional Keycloak semantics required
  17. HA Architecture

    Diagram labels: AWS Region, Availability Zone 1, Availability Zone 2
    • Connect to Route53 DNS
    • Forward req to Active Keycloak
    • Sessions replicated across AZs
    • Aurora multi-az replication
    • Failover to Passive
  18. Active/Passive Benefits

    • Greatly simplifies Database semantics
    • Session keys synchronously replicated, A/P prevents contention
    • Split brain resolved by taking passive site offline
      ◦ SRE input required to bring site online
  19. Overload situations requiring load shedding

    • Overload: More requests incoming than can be handled
    • Observation: Requests queue up, memory usage increases, client requests time out.
    • Remedy: Drop requests by replying with 503
    Flowchart: Queue length > max? If no: enqueue and process the request, then respond with the result. If yes: respond with 503 immediately.
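The decision in the load-shedding flowchart can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not Keycloak's implementation: the class `LoadShedder` and its limit are hypothetical, and a semaphore counting requests that are queued or in flight stands in for the "queue length" on the slide.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical load shedder: answers 503 at once when too many requests are already pending. */
final class LoadShedder {
    static final int OK = 200;
    static final int SERVICE_UNAVAILABLE = 503;

    // One permit per request that may be queued or in flight at the same time.
    private final Semaphore slots;

    LoadShedder(int maxPending) {
        this.slots = new Semaphore(maxPending);
    }

    /** Runs the work if there is capacity and returns the HTTP status to send back. */
    int handle(Runnable work) {
        // tryAcquire never blocks, so an overloaded server rejects without queueing more work.
        if (!slots.tryAcquire()) {
            return SERVICE_UNAVAILABLE;
        }
        try {
            work.run();
            return OK;
        } finally {
            slots.release();
        }
    }
}
```

The important property is that the rejection path is cheap and never waits: a rejected request costs no memory in a queue, so accepted requests keep finishing within their time budget.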
  20. Cache stampede protection when restarting pods

    • Cache stampede: When restarting under high load, parallel requests access the database while the cache is empty
    • Observation: Timeouts, exhaustion of DB connections.
    • Remedy: JVM locking if the same resource is about to be fetched from the database.
    Flowchart: Pending request? If no: fetch information from the database. If yes: block until the pending request returns.
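One common way to realise the "pending request?" check is to keep a map of in-flight loads per key, so that only the first caller touches the database and later callers wait for its result. The sketch below assumes this approach; the class `SingleFlightLoader` and its `databaseLoader` function are hypothetical names, and the slides do not say how Keycloak implements its locking.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical stampede guard: at most one database fetch per key runs at any time. */
final class SingleFlightLoader<K, V> {
    private final ConcurrentHashMap<K, CompletableFuture<V>> pending = new ConcurrentHashMap<>();
    private final Function<K, V> databaseLoader;   // stand-in for the real database query

    SingleFlightLoader(Function<K, V> databaseLoader) {
        this.databaseLoader = databaseLoader;
    }

    V load(K key) {
        CompletableFuture<V> mine = new CompletableFuture<>();
        CompletableFuture<V> existing = pending.putIfAbsent(key, mine);
        if (existing != null) {
            // Pending request for this key: block until the first caller has its result.
            return existing.join();
        }
        try {
            // No pending request: this caller fetches from the database.
            V value = databaseLoader.apply(key);
            mine.complete(value);
            return value;
        } catch (RuntimeException e) {
            mine.completeExceptionally(e);   // waiting callers fail too instead of hanging
            throw e;
        } finally {
            pending.remove(key, mine);
        }
    }
}
```

This sketch only deduplicates concurrent fetches; a real cache would also store the loaded value so that later requests skip the database entirely.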
  21. Tackling blocking probes and metrics

    • Overload: Too many requests make responses slow or lead to load shedding.
    • Observation: Blocking probes stop working and Pods restart. Metrics become unavailable.
    • Remedy: Use non-blocking probes that don’t enqueue in an overload situation. Disable load shedding for metrics.
    Symptoms: Kubernetes events similar to:
      Liveness probe failed / Timeout / n times in the last x seconds
      Container xxx failed liveness probe, will be restarted
  22. Ephemeral environments

    You can have as many environments as you want, but they will be deleted automatically at the end of the day.
  23. Measure, record and repeat

    • Add metrics collection and searchable logs as early as possible.
    • Capturing all insights automatically after a run in an ephemeral environment is hard, but saves time in the long run.
    • Nightly runs for performance and functional tests show if there are regressions.
  24. Java profiler to analyze performance

    • Flame graphs help to capture activity.
    • Wall-clock async profiling works even in containers with reduced profiling permissions.
    • Cryostat.io for simplified integration in container environments.
  25. OpenTelemetry Java agent for metrics and traces

    Adds instrumentation to a lot of well-known libraries, even if your Java application doesn’t support tracing out of the box.
  26. GitHub Actions for automation

    • Nightly CI testing
    • Recording the results
    • Setting up environment
    • Workflow dispatch via CLI and UI
  27. Running Gatling on multiple nodes

    • Gatling running on multiple ephemeral nodes to overcome OS network stack limitations under high load. (~ 250 connections per second per load driving host)
    https://github.com/keycloak/keycloak-benchmark/tree/main/ansible
  28. Kubernetes for scheduling deployments

    • Utilize status information in resources and from Operators:
      kubectl wait --for=condition=... --timeout=1200s <resource>
    • Red Hat OpenShift Service on AWS (ROSA) for ephemeral environments (logging, metrics, etc. as bundled add-ons)
  29. Grafana for interactive dashboards

    • Create dashboards and store them as JSON to share them with the team.
    • Plot histograms as heat maps to visualize SLOs.
    • Jump from metrics to traces to logs.
  30. Helm for fast turnaround in deployment

    • Simple templating language
    • Try out changes to your charts in seconds
    • Our charts have grown in variants for development and performance testing and are no longer suitable for a production deployment
  31. Future roadmap for Keycloak HA and the cloud

    Ideas for enhancements (but no planned releases):
    • Support active-active deployments and evaluate possible benefits
    • Simplify deployment and upgrades
    • Support more cloud platforms and databases
    • Provide security-hardened blueprints
    Join the Keycloak Birds of a Feather session today at 8:00 p.m.
  32. Links

    • Keycloak https://www.keycloak.org/
    • Keycloak Benchmark Project https://www.keycloak.org/keycloak-benchmark/
    • Keycloak High Availability Guide https://www.keycloak.org/high-availability/introduction
    • Keycloak Book 2nd Edition https://www.packtpub.com/product/kc/9781804616444
    • Infinispan https://infinispan.org/