Highly available Identity and Access Management with multi-site Keycloak deployments in the cloud

Slide 1

Slide 1 text

Highly available Identity and Access Management with multi-site Keycloak deployments in the cloud
Ryan Emerson, Alexander Schwartz | Principal Software Engineers | Red Hat
Devoxx France | 2024-04-18

Slide 2

Slide 2 text

What is Identity and Access Management (IAM), and do I need one?

Slide 3

Slide 3 text

Authenticate and authorize users for services
[Diagram: Login, Request, Verify token, Token, API, Cloud Services]
● AuthZ + AuthN
● Manage users, credentials, permissions, ...
● Handle user registration, password reset, ...
● Integrate with existing security infrastructure

Slide 4

Slide 4 text

Day 1: Single-Sign-On is cool!
● Users need to remember only one password
● Authenticate only once per day
● Add second factor for authentication for security
● Theme the frontend to match your needs
Makes sense already for a single application!

Slide 5

Slide 5 text

Keycloak provides the login screen for your apps

Slide 6

Slide 6 text

Day 2: Become flexible in your setup
● Integrate LDAP and Kerberos
● Brokerage to existing SAML services
● Brokerage to existing OIDC services
● Integrate existing custom stores
Reuse existing user stores!

Slide 7

Slide 7 text

Slide 8

Slide 8 text

Use Brokerage for existing providers

Slide 9

Slide 9 text

Skip the form with Kerberos/SPNEGO!
This page intentionally left blank.

Slide 10

Slide 10 text

Day 3: Eliminate daily churn
● User password recovery (even when using LDAP)
● Self-registration for users
● User data self-management
Resolve the need for calls and tickets!

Slide 11

Slide 11 text

Password recovery and self-registration

Slide 12

Slide 12 text

Declarative User Profile configuration

Slide 13

Slide 13 text

User Profile for admins, registration, and users

Slide 14

Slide 14 text

Keycloak is now a critical component in your infrastructure. You want it to be available 24/7.

Slide 15

Slide 15 text

Single Process
[Diagram: Users, Keycloak (user sessions cached locally in-memory), Database (persistent state)]
● Sessions frequently accessed, cached in-memory for reduced latency
● Infrequently updated data persisted:
○ Users, Groups, Roles…

Slide 16

Slide 16 text

N Keycloak Processes
[Diagram: Single Availability Zone, clustered session state, Database (persistent state)]
● Deploy with Kubernetes
● Sessions replicated between Keycloak processes
● Can tolerate K8s node/pod failures
● Increased performance and resilience

Slide 17

Slide 17 text

Tolerating Availability Zone Failures

Slide 18

Slide 18 text

Multi Site
[Diagram: Availability Zone 1, Availability Zone 2, Persistent State, with open questions between them]
● Deploy Keycloak to multiple availability zones
● How to maintain session and persistent state?

Slide 19

Slide 19 text

Multi Site - Active/Passive
[Diagram: AZ-1 Active, AZ-2 Passive]
● A Keycloak site is either “Active” or “Passive”
● All user requests are forwarded to the Active site
● Greatly simplifies write semantics
● Avoids data contention

Slide 20

Slide 20 text

Multi Site - Active/Passive
[Diagram: AZ-1 Active, AZ-2 Active]
● Passive becomes Active after failover
● Users connect to backup only after failover

Slide 21

Slide 21 text

Managing User Connections

Slide 22

Slide 22 text

User Connections
[Diagram: AZ-1 and AZ-2, each with a Health Check]
● All user requests go via a Route53 hostname
● Route53 determines which site is “Active”
● Requests are always routed to the Active site
● Periodic health checks determine site health

Slide 23

Slide 23 text

User Connections
[Diagram: AZ-1 and AZ-2, each with a Health Check]
● On failover, DNS routing is updated and a new Active site is established
● DNS caching can lead to longer failover times for some users
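The health-check-driven routing described on the two slides above can be sketched as a small state machine. This is a simplified model, not Route53's actual algorithm: the class name, the failure threshold of three, and the decision to never fail back automatically are assumptions made for illustration.

```python
class FailoverRouter:
    """Resolves one hostname to the Active site and fails over on repeated failed checks."""

    def __init__(self, active: str, passive: str, failure_threshold: int = 3) -> None:
        self.active = active
        self.passive = passive
        self.failure_threshold = failure_threshold  # assumed value
        self.failed_over = False
        self._consecutive_failures = 0

    def record_health_check(self, healthy: bool) -> None:
        """Record one periodic health check result for the current Active site."""
        if self.failed_over:
            # Assumption: bringing the old site back is a manual step.
            return
        self._consecutive_failures = 0 if healthy else self._consecutive_failures + 1
        if self._consecutive_failures >= self.failure_threshold:
            # Fail over: the former Passive site becomes the new Active site.
            self.active, self.passive = self.passive, self.active
            self.failed_over = True

    def resolve(self) -> str:
        """Site that new DNS lookups are routed to."""
        return self.active
```

Clients that cached the old DNS answer keep using it until their cache entry expires, which is why some users see a longer failover time.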

Slide 24

Slide 24 text

Session Failover

Slide 25

Slide 25 text

Infinispan
● In-memory Key/Value Cache
● Advanced clustering capabilities
● Independent Project
● Kubernetes Operator
● Spring Boot, Quarkus and more!
● Apache 2.0 License
[Diagram: App, App]
● Embedded Mode
● Client/Server

Slide 26

Slide 26 text

Active/Passive Session Replication
[Diagram: AZ-1 Active, AZ-2 Passive, Read/Writes, Writes; session writes replicated across availability zones]
● Utilise Infinispan Server
● Infinispan Cross-Site replication used to sync session state
● Infinispan server supports advanced admin operations for Cross-Site failover management
● Session data now survives Keycloak pod restarts
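The idea of replicating session writes from the Active site to the Passive site can be sketched with two in-memory stores. This is a toy model that does not use the Infinispan API; the `SiteCache` class and its fields are hypothetical, and failure handling for an unreachable backup site is left out.

```python
from typing import Dict, Optional


class SiteCache:
    """Session cache of one site; all names here are hypothetical."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.sessions: Dict[str, dict] = {}
        self.backup: Optional["SiteCache"] = None

    def put(self, session_id: str, session: dict) -> None:
        # Synchronous replication: the write returns only after the backup
        # site holds a copy, so a failover does not lose the session.
        if self.backup is not None:
            self.backup.sessions[session_id] = dict(session)
        self.sessions[session_id] = dict(session)

    def get(self, session_id: str) -> Optional[dict]:
        return self.sessions.get(session_id)


active = SiteCache("az-1")
passive = SiteCache("az-2")
active.backup = passive

# A login on the Active site is immediately visible on the Passive site.
active.put("session-42", {"user": "alice"})
```

Because only the Active site accepts user requests, the two sites never write the same session key concurrently.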

Slide 27

Slide 27 text

Database Failover

Slide 28

Slide 28 text

Aurora DB
[Diagram: AWS Region, Aurora, Availability Zone 1 (Writer), Availability Zone 2 (Reader), Reads, Writes, Reads]
● All reads and writes are handled by a single “Writer” instance
● The Writer instance is hosted in the same AZ as the Active site
● Data is written to both Availability Zones to allow failover

Slide 29

Slide 29 text

Aurora DB
[Diagram: AWS Region, Aurora, Availability Zone 1 (Writer), Availability Zone 2 (Writer), Reads, Writes, Reads, Writes]
● Election of the new writer instance is managed by Aurora
● Failover takes about 1 minute
● The AWS JDBC wrapper ensures Keycloak pods drop old connections on failover
● No additional Keycloak semantics required

Slide 30

Slide 30 text

[Diagram: Aurora DB (Persisted State), AZ-1 Active, AZ-2 Passive, Read/Writes, Writes]

Slide 31

Slide 31 text

Architecture Overview

Slide 32

Slide 32 text

HA Architecture
[Diagram: AWS Region, Availability Zone 1, Availability Zone 2]
● Connect to Route53 DNS
● Forward requests to Active Keycloak
● Sessions replicated across AZs
● Aurora multi-AZ replication
● Failover to Passive

Slide 33

Slide 33 text

Why Active/Passive?

Slide 34

Slide 34 text

Active/Passive Benefits
● Greatly simplifies Database semantics
● Session keys synchronously replicated, A/P prevents contention
● Split brain resolved by taking passive site offline
○ SRE input required to bring site online

Slide 35

Slide 35 text

Surprising system behaviors under load

Slide 36

Slide 36 text

Overload situations requiring load shedding
● Overload: More requests incoming than can be handled
● Observation: Requests queue up, memory usage increases, client requests time out.
● Remedy: Drop requests by replying with 503
[Flowchart: Queue length > max? yes: respond with 503 immediately; no: enqueue and process request, then respond with result]
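The flowchart on this slide translates almost directly into code. The following is a minimal single-threaded sketch, not Keycloak's implementation; the class name and the convention that `offer` returns `None` when a request was accepted are assumptions.

```python
from collections import deque
from typing import Callable, Deque, Optional, Tuple


class LoadSheddingQueue:
    """Bounded request queue that rejects work instead of letting it pile up."""

    def __init__(self, max_length: int) -> None:
        self.max_length = max_length
        self._pending: Deque[str] = deque()

    def offer(self, request: str) -> Optional[int]:
        """Enqueue the request, or return 503 immediately when the queue is full."""
        if len(self._pending) >= self.max_length:
            return 503
        self._pending.append(request)
        return None

    def process_next(self, handler: Callable[[str], str]) -> Tuple[int, str]:
        """Take the oldest queued request and respond with the handler's result."""
        request = self._pending.popleft()
        return 200, handler(request)
```

Rejecting early keeps memory usage bounded and gives clients a fast, explicit signal to retry later instead of a timeout.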

Slide 37

Slide 37 text

Cache stampede protection when restarting pods
● Cache stampede: When restarting under high load, parallel requests access the database while the cache is empty
● Observation: Timeouts, exhaustion of DB connections.
● Remedy: JVM locking if the same resource is about to be fetched from the database.
[Flowchart: Pending request? no: fetch information from database; yes: block for pending request to return]
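The remedy can be sketched as a cache that takes a per-key lock before loading, so that concurrent requests for the same missing entry wait for the first one instead of all hitting the database. The slide describes JVM locking inside Keycloak; this Python version with `threading.Lock` only illustrates the pattern, and the class and parameter names are hypothetical.

```python
import threading
from typing import Callable, Dict


class SingleFlightCache:
    """Cache where at most one caller loads a given missing key at a time."""

    def __init__(self, loader: Callable[[str], str]) -> None:
        self._loader = loader  # stands in for the database query
        self._values: Dict[str, str] = {}
        self._key_locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()  # protects the lock table itself

    def get(self, key: str) -> str:
        # Fast path: the entry is already cached.
        if key in self._values:
            return self._values[key]
        # Find or create the lock for this key.
        with self._guard:
            lock = self._key_locks.setdefault(key, threading.Lock())
        # Only the first caller loads; the others block here and then
        # find the value in the cache.
        with lock:
            if key not in self._values:
                self._values[key] = self._loader(key)
            return self._values[key]
```

With this in place, a freshly restarted pod issues one database query per distinct resource rather than one per incoming request.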

Slide 38

Slide 38 text

Tackling blocking probes and metrics
● Overload: Too many requests make responses slow or lead to load shedding.
● Observation: Blocking probes stop working and Pods restart. Metrics become unavailable.
● Remedy: Use non-blocking probes that don’t enqueue in an overload situation. Disable load-shedding for metrics.
Symptoms: Kubernetes events similar to:
Liveness probe failed / Timeout / n times in the last x seconds
Container xxx failed liveness probe, will be restarted
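One way to picture the remedy is an admission check that routes probe and metrics requests around the load-shedding queue. This is a sketch of the idea only: the path prefixes and the returned labels are assumptions, not Keycloak configuration.

```python
# Hypothetical path prefixes that must stay responsive under overload.
UNSHED_PREFIXES = ("/health", "/metrics")


def admit(path: str, queue_length: int, max_queue_length: int) -> str:
    """Decide how to treat an incoming request under the current load."""
    if path.startswith(UNSHED_PREFIXES):
        # Probes and metrics are answered directly and never wait in the queue.
        return "handle-directly"
    if queue_length >= max_queue_length:
        # Regular traffic is shed when the queue is full.
        return "reject-503"
    return "enqueue"
```

Kubernetes then keeps seeing a live pod during overload, while regular clients receive 503 responses until the queue drains.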

Slide 39

Slide 39 text

Good habits

Slide 40

Slide 40 text

Up-to-date documentation
Use with static site publishing as you, including onboarding.
antora.org

Slide 41

Slide 41 text

Ephemeral environments
You can have as many environments as you want, but they will be deleted automatically at the end of the day.

Slide 42

Slide 42 text

Measure, record and repeat
● Add metrics collection and searchable logs as early as possible.
● Automatically capturing all insights after a run in an ephemeral environment is hard, but saves time in the long run.
● Nightly runs for performance and functional tests show if there are regressions.

Slide 43

Slide 43 text

Tools that were most helpful

Slide 44

Slide 44 text

Java profiler to analyze performance
● Flame graphs help to capture activity.
● Wall-clock async profiling works even in containers with reduced profiling permissions.
● Cryostat.io for simplified integration in container environments.

Slide 45

Slide 45 text

OpenTelemetry Java agent for metrics and traces
Adds instrumentation to a lot of well-known libraries, even if your Java application doesn’t support tracing out of the box.

Slide 46

Slide 46 text

GitHub Actions for automation
● Nightly CI testing
● Recording the results
● Setting up environment
● Workflow dispatch via CLI and UI

Slide 47

Slide 47 text

Running Gatling on multiple nodes
● Gatling running on multiple ephemeral nodes to overcome OS network stack limitations under high load (~ 250 connections per second per load driving host)
https://github.com/keycloak/keycloak-benchmark/tree/main/ansible
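The per-host figure on this slide gives a quick way to size the load-driving fleet. The sketch below treats 250 connections per second per host as a fixed limit, which is a simplification of the approximate value quoted above; the function name is hypothetical.

```python
CONNECTIONS_PER_SECOND_PER_HOST = 250  # approximate figure from the slide


def load_driver_hosts(target_connections_per_second: int) -> int:
    """Smallest number of hosts whose combined rate reaches the target."""
    # Integer ceiling division avoids floating-point rounding.
    return -(-target_connections_per_second // CONNECTIONS_PER_SECOND_PER_HOST)
```

For example, `load_driver_hosts(1000)` returns 4, and `load_driver_hosts(1001)` returns 5.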

Slide 48

Slide 48 text

Kubernetes for scheduling deployments
● Utilize status information in resources and from Operators:
kubectl wait --for=condition=... --timeout=1200s
● Red Hat OpenShift Service on AWS (ROSA) for ephemeral environments (logging, metrics, etc. as bundled add-ons)

Slide 49

Slide 49 text

Grafana for interactive dashboards
● Create dashboards and store them as JSON to share them with the team.
● Plot histograms as heat maps to visualize SLOs.
● Jump from metrics to traces to logs.
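Latency histograms relate to SLOs because the share of requests under a latency threshold can be read directly from cumulative bucket counts. The sketch below assumes buckets in the common cumulative form (upper bound, count of requests at or below that bound); the function name and the sample numbers in the usage note are made up for illustration.

```python
from typing import List, Tuple


def fraction_within(buckets: List[Tuple[float, int]], threshold_seconds: float) -> float:
    """Share of requests that completed within the threshold.

    buckets: (upper_bound_seconds, cumulative_count) pairs in ascending order;
    the last pair carries the total request count.
    """
    total = buckets[-1][1]
    if total == 0:
        return 1.0  # no traffic means nothing violated the objective
    within = 0
    for upper_bound, cumulative_count in buckets:
        # Keep the count of the largest bucket that fits under the threshold.
        if upper_bound <= threshold_seconds:
            within = cumulative_count
    return within / total
```

With the made-up buckets `[(0.1, 900), (0.25, 980), (0.5, 995), (float("inf"), 1000)]`, a threshold of 0.25 seconds yields 0.98, which would satisfy an objective such as "95% of requests within 250 ms".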

Slide 50

Slide 50 text

Helm for fast turnaround in deployment
● Simple templating language
● Try out changes to your charts in seconds
● Our charts have grown into variants for development and performance testing and are no longer suitable for a production deployment

Slide 51

Slide 51 text

The Future

Slide 52

Slide 52 text

Future roadmap for Keycloak HA and the cloud
Ideas for enhancements (but no planned releases):
● Support active-active deployments and evaluate possible benefits
● Simplify deployment and upgrades
● Support more cloud platforms and databases
● Provide security-hardened blueprints
Join the Keycloak Birds of a Feather session today at 8:00 p.m.

Slide 53

Slide 53 text

Links
● Keycloak https://www.keycloak.org/
● Keycloak Benchmark Project https://www.keycloak.org/keycloak-benchmark/
● Keycloak High Availability Guide https://www.keycloak.org/high-availability/introduction
● Keycloak Book 2nd Edition https://www.packtpub.com/product/kc/9781804616444
● Infinispan https://infinispan.org/

Slide 54

Slide 54 text

Contact
Alexander Schwartz, Principal Software Engineer
https://www.ahus1.de
@ahus1de
Ryan Emerson, Principal Software Engineer
github.com/ryanemerson

Slide 55

Slide 55 text

Questions and answers