Running a highly available Identity and Access Management with Keycloak

A single sign-on solution for your customers and employees should be designed for high availability, without a single point of failure. Keycloak is no exception.
A clustered Keycloak deployment in a single site provides sufficient availability for many. An increasing number of organizations need to utilize multiple sites for improved resiliency or to meet legal requirements. Keycloak overhauled its capabilities and now provides deployment blueprints to the community.
This talk presents how we approached the problem, and the challenges we faced. Expect to dive into concepts like load shedding, cache stampedes, and automated failover. See tools like Gatling, Helm, OpenTelemetry, Kubernetes Operators and cloud infrastructure in action. We will also provide an outlook for the next steps in our journey.
These insights will help you to improve your Keycloak deployments as well as design and test your own applications so they can withstand high load and site failures.

Alexander Schwartz

July 09, 2024

Transcript

  1. Running a highly available Identity and Access Management with Keycloak

    Alexander Schwartz | Principal Software Engineer | Red Hat INNOQ technology night | 2024-10-23
  2. Authenticate and authorize users for services

    • AuthZ + AuthN
    • Manage users, credentials, permissions, ...
    • Handle user registration, password reset, …
    • Integrate to existing security infrastructure
    (Diagram labels: Login, Request, Verify token, <Token>, API, Cloud Services)
  3. Day 1: Single-Sign-On is cool!

    • Users need to remember only one password
    • Authenticate only once per day
    • Add second factor for authentication for security
    • Theme the frontend to match your needs
    Makes sense already for a single application!
  4. Day 2: Become flexible in your setup

    • Integrate LDAP and Kerberos
    • Brokerage to existing SAML services
    • Brokerage to existing OIDC services
    • Integrate existing custom stores
    • Organisations for B2B and B2B2C setups
    Reuse existing user stores!
  5. Day 3: Eliminate daily churn

    • User password recovery (even when using LDAP)
    • Self-registration for users
    • User data self-management
    Resolve the need for calls and tickets!
  6. Single Process

    • Sessions frequently accessed, cached in-memory for reduced latency
    • Infrequently updated data persisted:
      ◦ Users, Groups, Roles…
    (Diagram labels: User sessions cached locally in-memory, Database, Persistent State, Keycloak, Users)
  7. N Keycloak Processes

    • Deploy with Kubernetes
    • Sessions replicated between Keycloak processes
    • Can tolerate K8s node/pod failures
    • Increased performance and resilience
    (Diagram labels: Clustered session state, Database, Persistent State, Single Availability Zone)
  8. Multi Site

    • Deploy Keycloak to multiple availability zones
    • How to maintain session and persistent state?
    (Diagram labels: Availability Zone 1, Availability Zone 2, Persistent State, ?, ?)
  9. Multi Site - Active/Active

    • Both sites are active
    • Data is replicated synchronously to the other site
    (Diagram labels: AZ-1, AZ-2)
  10. Infinispan

    • In-memory Key/Value Cache
    • Advanced clustering capabilities
    • Independent Project
    • Kubernetes Operator
    • Spring Boot, Quarkus and more!
    • Apache 2.0 License
    (Diagram labels: App, App; Embedded Mode; Client/Server)
  11. Session Replication and Cache Invalidation

    • Utilise Infinispan Server
    • Infinispan Cross-Site replication used to replicate the session state
    • Infinispan server supports advanced admin operations for Cross-Site failover management
    • Additional monitoring to detect and act when the connection is broken, and to do fencing to determine the active site
    (Diagram labels: AZ-1, AZ-2, Session writes replicated across availability zones, Read/Writes, Writes)
  12. (Diagram labels: AWS Region, Aurora DB, Aurora, Availability Zone 1: Writer, Availability Zone 2: Reader, Reads, Writes, Reads)

    • All read/writes handled by a single “Writer” instance
    • Writer instance hosted in same AZ as Active site
    • Data written to both Availability Zones to allow failover
  13. (Diagram labels: AWS Region, Aurora DB, Aurora, Availability Zone 1: Writer, Availability Zone 2: Writer, Reads, Writes, Reads, Writes)

    • Election of new writer instance managed by Aurora
    • Failover takes about 1 minute
    • AWS JDBC wrapper ensures Keycloak pods drop old connections on failover
    • No additional Keycloak semantics required
  14. HA Architecture

    (Diagram labels: AWS Region; Availability Zone 1; Availability Zone 2; Connect to LB; Forward req to Active Keycloak; Sessions replicated across AZs; Aurora multi-az replication; Failover to Passive)
  15. Overload situations requiring load shedding

    • Overload: More requests incoming than can be handled
    • Observation: Requests queue up, memory usage increases, client requests time out.
    • Remedy: Drop requests by replying with 503
    (Flowchart: Queue length > max? no → Enqueue and process request, respond with result; yes → respond with 503 immediately)
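The load-shedding decision described on this slide can be sketched in a few lines of Java. This is a minimal illustration, not Keycloak's implementation: the class name `LoadShedder`, the `handle` method, and the choice to count in-flight requests instead of inspecting a real request queue are all assumptions made for the example.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: rejects work once too many requests are queued or running.
final class LoadShedder {
    static final int OK = 200;
    static final int SERVICE_UNAVAILABLE = 503;

    private final int maxInFlight;                       // stand-in for "max queue length"
    private final AtomicInteger inFlight = new AtomicInteger();

    LoadShedder(int maxInFlight) {
        this.maxInFlight = maxInFlight;
    }

    /** Runs the handler if there is capacity, otherwise answers 503 right away. */
    int handle(Runnable handler) {
        if (inFlight.incrementAndGet() > maxInFlight) {
            inFlight.decrementAndGet();                  // undo the reservation
            return SERVICE_UNAVAILABLE;                  // shed load without doing any work
        }
        try {
            handler.run();                               // "enqueue and process request"
            return OK;                                   // "respond with result"
        } finally {
            inFlight.decrementAndGet();                  // free the slot even on failure
        }
    }
}
```

The point of the design is that a rejected request costs almost nothing: it never waits in a queue and holds no memory, so callers get a fast 503 and can retry later instead of timing out.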
  16. Cache stampede protection when restarting pods

    • Cache stampede: When restarting under high load, parallel requests access the database while the cache is empty
    • Observation: Timeouts, exhaustion of DB connections.
    • Remedy: JVM locking if the same resource is about to be fetched from the database.
    (Flowchart: Pending request? no → Fetch information from database; yes → Block for pending request to return)
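The "block for the pending request" remedy on this slide can be illustrated with a small in-JVM loader. It is a sketch under stated assumptions, not Keycloak's code: the class name `StampedeGuard`, its `get` method, and the use of `CompletableFuture` as the waiting mechanism are chosen for the example, and the database is represented by a plain `Function` that must return non-null values.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: at most one database fetch per key at a time in this JVM.
final class StampedeGuard<K, V> {
    private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<K, CompletableFuture<V>> pending = new ConcurrentHashMap<>();
    private final Function<K, V> databaseLoader;         // stand-in for the real DB query

    StampedeGuard(Function<K, V> databaseLoader) {
        this.databaseLoader = databaseLoader;
    }

    V get(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            return cached;                               // cache hit, no database access
        }
        CompletableFuture<V> mine = new CompletableFuture<>();
        CompletableFuture<V> other = pending.putIfAbsent(key, mine);
        if (other != null) {
            return other.join();                         // pending request: block until it returns
        }
        try {
            V value = cache.get(key);                    // re-check: a load may have just finished
            if (value == null) {
                value = databaseLoader.apply(key);       // the only caller that hits the database
                cache.put(key, value);
            }
            mine.complete(value);                        // wake up all waiting callers
            return value;
        } catch (RuntimeException e) {
            mine.completeExceptionally(e);               // waiters fail instead of hanging
            throw e;
        } finally {
            pending.remove(key, mine);
        }
    }
}
```

With this in place, a burst of parallel requests for the same entry after a restart results in a single database query, while the other callers wait for that result instead of each taking a database connection.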
  17. Tackling blocking probes and metrics

    • Overload: Too many requests make responses slow or lead to load shedding.
    • Observation: Blocking probes stop working and Pods restart. Metrics become unavailable.
    • Remedy: Use non-blocking probes that don’t enqueue in an overload situation. Disable load-shedding for metrics.
    Symptoms: Kubernetes events similar to:
      Liveness probe failed / Timeout / n times in the last x seconds
      Container xxx failed liveness probe, will be restarted
  18. Ephemeral environments

    You can have as many environments as you want, but they will be deleted automatically at the end of the day.
  19. Measure, record and repeat

    • Add metrics collection and searchable logs as early as possible.
    • Capturing all insights automatically after a run in an ephemeral environment is hard, but saves time in the long run.
    • Nightly runs for performance and functional tests show if there are regressions.
  20. Java profiler to analyze performance

    • Flame graphs help to capture activity.
    • Wall-clock async profiling works even in containers with reduced profiling permissions.
    • Cryostat.io for simplified integration in container environments.
  21. OpenTelemetry Java agent for metrics and traces

    Adds instrumentation to a lot of well-known libraries, even if your Java application doesn’t support tracing out of the box.
  22. GitHub Actions for automation

    • Nightly CI testing
    • Recording the results
    • Setting up environment
    • Workflow dispatch via CLI and UI
  23. Running Gatling on multiple nodes

    • Gatling running on multiple ephemeral nodes to overcome OS network stack limitations under high load (~ 250 connections per second per load driving host).
    https://github.com/keycloak/keycloak-benchmark/tree/main/ansible
  24. Kubernetes for scheduling deployments

    • Utilize status information in resources and from Operators:
      kubectl wait --for=condition=... --timeout=1200s <resource>
    • Red Hat OpenShift Service on AWS (ROSA) for ephemeral environments (logging, metrics, etc. as bundled add-ons)
  25. Grafana for interactive dashboards

    • Create dashboards and store them as JSON to share them with the team.
    • Plot histograms as heat maps to visualize SLOs.
    • Jump from metrics to traces to logs.
  26. Helm for fast turnaround in deployment

    • Simple templating language
    • Try out changes to your charts in seconds
    • Our charts have grown into variants for development and performance testing and are no longer suitable for a production deployment
  27. Links

    • Keycloak https://www.keycloak.org/
    • Keycloak Benchmark Project https://www.keycloak.org/keycloak-benchmark/
    • Keycloak High Availability Guide https://www.keycloak.org/high-availability/introduction
    • Keycloak Book 2nd Edition https://www.packtpub.com/product/kc/9781804616444
    • Infinispan https://infinispan.org/
    Slides:
  28. Conferences & Events

    • KubeCon North America 🏠 Salt Lake City (US) 📅 2024-11-12…15 https://events.linuxfoundation.org/
    • KeyConf24 🏠 Vienna (AT) & Online 📅 2024-09-19 https://keyconf.dev/
    • Keycloak DevDay 🏠 Darmstadt (DE) 📅 2025-03-06 https://keycloak-day.dev/
    Meetup
    • Keycloak Hour of Code 🏠 Online 📅 Every 1-2 months https://www.meetup.com/keycloak-hour-of-code/