

Tudor Golubenco
March 05, 2026

From 20s to sub-second: Speeding up CloudNativePG wake-up times

CloudNativePG (CNPG) is a popular and battle-tested operator for running Postgres on top of Kubernetes. At Xata, we’ve been using CNPG and extending it in several different directions. One of them is implementing a scale-to-zero plugin: clusters hibernate when inactive and they wake up automatically when new connections are created.
After adding scale-to-zero, we faced the challenge that CNPG's wake-up from hibernation is rather slow. This is an optimization story about how we reduced the wake-up times from 20+ seconds to under one second.


Transcript

  1. Xata - Postgres by the thousands. Spin up tens of thousands of writable Postgres copies with real data, for the price of one. The Postgres platform for agent-driven development, testing, and training.
     • Fast copy-on-write branching
     • Scale-to-zero
     • High performance/cost ratio
     • 100% PostgreSQL
     • Separation of storage from compute
     • Near-instant cold starts
  2. Xata - for staging/dev/agents. Keep production where it is:
     • AWS RDS / Aurora
     • GCP Cloud SQL
     • Azure Database
     Add fast branching functionality. Anonymization is included:
     • https://github.com/xataio/pgstream
     • Postgres replication with DDL
  3. Xata - for production workloads. You can run both production and staging/dev branching on Xata:
     • HA, automatic failover
     • Upgrades
     • Backups, PITR
     • Very good performance/cost ratio
  4. Scale-to-zero. Only pay for the compute when you actually need it:
     • Postgres shuts down in case of inactivity
     • New connections wake it up automatically
  5. So what? Dev branches are almost free:
     • Compute, thanks to scale-to-zero
     • Storage, thanks to CoW
     Production use cases:
     • Free tier
     • AI agents that need Postgres occasionally
  6. CloudNativePG operator (CNPG)
     • Open source Postgres operator for K8s
     • Originally created by EDB
     • CNCF "Sandbox" project
     • Handles (level 5):
       ◦ Read replicas
       ◦ Switchover / failover
       ◦ Upgrades
       ◦ Connection pooling
       ◦ Backups / PITR
       ◦ etc.
  7. Xata additions for CNPG
     • Branch operator
       ◦ Xata abstraction on top of clusters
       ◦ Manages related objects: network policies, secrets, object store, etc.
     • SQL gateway
       ◦ Routes to the target cluster
       ◦ Serverless driver (SQL over HTTP/WebSocket)
     • Scale-to-zero plugin
       ◦ Automatically hibernates / wakes up clusters based on activity
  8. Xata Scale-to-zero plugin for CNPG. CNPG declarative hibernation:
     kubectl annotate cluster <cluster-name> --overwrite cnpg.io/hibernation=on
     The pod (compute) is deleted; the PVC and volumes (storage) are kept.
     Xata Scale-to-zero plugin:
     ◦ Open source: https://github.com/xataio/cnpg-i-scale-to-zero
     ◦ Adds a sidecar that monitors for active connections
     ◦ If there are no active connections for a given period of time, hibernates the cluster
     Wake-up procedure:
     ◦ The SQL gateway accepts connections for hibernated clusters
     ◦ Wakes the cluster up (via the annotation) and forwards the connection
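The sidecar's idle-detection loop can be sketched as follows (a minimal sketch, not the plugin's actual code; `count_active_connections` stands in for a query against `pg_stat_activity`, and `hibernate` for setting the `cnpg.io/hibernation=on` annotation):

```python
import time

def monitor_idle(count_active_connections, hibernate,
                 idle_timeout=300.0, poll_interval=10.0,
                 now=time.monotonic, sleep=time.sleep):
    """Hibernate the cluster once no active connections have been
    observed for idle_timeout seconds."""
    last_active = now()
    while True:
        if count_active_connections() > 0:
            last_active = now()  # activity seen, reset the idle timer
        elif now() - last_active >= idle_timeout:
            hibernate()          # e.g. set cnpg.io/hibernation=on
            return
        sleep(poll_interval)
```

The clock and sleep are injectable so the loop can be exercised without real waits; in the sidecar the check would run against Postgres directly.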
  9. Problem: it was slow. Goal: <1 second. Wake-up times after scale-to-zero, at the beginning of the project:
     min=25.369s max=33.860s stddev=2.703923s
  10. Low-hanging fruit
     Init containers (up to 10 seconds):
     • The Barman plugin uses an init container
     • Its startup probe is hardcoded to 10 seconds
     • Solution: don't use the CNPG Barman plugin; switch to in-image pgBackRest
     Startup probe (up to 10 seconds):
     • CNPG calls pg_isready to see when Postgres is ready
     • By default, the first check happens after 10 seconds
     • Solution: reduced it to 1 second. For sub-second wake-ups we'll need to not rely on the startup probes.
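The difference between a fixed probe delay and a tight polling loop can be sketched like this (a minimal sketch with hypothetical names; `check_ready` stands in for invoking `pg_isready`):

```python
import time

def wait_until_ready(check_ready, timeout=5.0, interval=0.1,
                     now=time.monotonic, sleep=time.sleep):
    """Poll a readiness check at a tight interval instead of waiting
    for a fixed 10s (or 1s) probe tick. Returns the elapsed time once
    the check passes; raises TimeoutError otherwise."""
    start = now()
    while now() - start < timeout:
        if check_ready():
            return now() - start
        sleep(interval)
    raise TimeoutError("postgres did not become ready in time")
```

With a 100ms interval the readiness latency is bounded by the polling granularity rather than by the probe schedule.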
  11. Culprit: CSI driver volume attachments. CSI (K8s Container Storage Interface):
     • On volume attachment, it waits for the next "node status update"
     • Node status updates happen once every nodeStatusUpdateFrequency
     • Default: 10 seconds
     • https://github.com/kubernetes/kubernetes/issues/28141
     (The pattern was observed when adding a random wait between the test runs.)
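For reference, the 10-second default comes from the kubelet's `nodeStatusUpdateFrequency` setting. It can be tuned via the KubeletConfiguration; the fragment below is shown only to illustrate the knob, since lowering it adds API-server load and is not the approach taken here:

```yaml
# KubeletConfiguration fragment - illustrates where the 10s wait comes from.
# Lowering it reduces attach latency at the cost of more node status traffic;
# the approach taken in this talk was a simplified CSI driver instead.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
nodeStatusUpdateFrequency: "10s"   # default; a volume attach can wait up to this long
```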
  12. Solution: Simplified CSI driver
     Normal CSI sequence: CreateVolume → ControllerPublish → NodeStage → NodePublish
     Simplified CSI sequence: CreateVolume → NodePublish
     Timings (milliseconds):
     Phase                       Min    Avg    P50    P90    Max
     -------------------------   ----   ----   ----   ----   ----
     Pod created                 1023   1311   1144   1594   2537
     PodReadyToStartContainers    243    847    684   1558   1747
     Initialized (init done)      642   1347   1293   1826   2538
     PG ready (accept conns)     2032   2749   2764   3128   3525
  13. Postgres startup times. Measured on a tiny DB with minimal write activity.
     Without CHECKPOINT on hibernation:
     • pg_ctl stop -m immediate
     • WAL processing on startup
     • 700ms - 1200ms
     With CHECKPOINT on hibernation:
     • pg_ctl stop -m fast
     • 200ms - 350ms
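The two shutdown paths correspond to commands along these lines (illustrative command fragments; the data directory path is an assumption):

```shell
# Immediate shutdown: no shutdown checkpoint, so the next start must
# replay WAL (the 700-1200ms case above).
pg_ctl stop -D /var/lib/postgresql/data -m immediate

# Checkpoint first, then a fast shutdown: the next start has (almost)
# no WAL to process (the 200-350ms case above).
psql -c 'CHECKPOINT;'
pg_ctl stop -D /var/lib/postgresql/data -m fast
```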
  14. Summary so far. Wake-up times went down from ~25s to 3s on average.
     Improved:
     • Latency due to init containers
     • Latency added at the gateway
     • Volume attachment delays
     • Latency added by CNPG probes
     Conclusion:
     • We can't reach <1s wake-up times by focusing on any of the above
     • The overhead of pod scheduling, volume mounting, and init container starts is too high
  15. Pools of CNPG clusters
     • Multiple cluster pools, one per configuration 'class'
     • Pools are managed by a PoolOperator
     • A branch has a running Cluster associated with it - not tied to the same name as the branch
     • Cluster PVCs are backed by Xatastor volumes, mounted over NVMe-oF
  16. On hibernation
     • On hibernation, the CNPG cluster associated with the branch is deleted
     • The volume associated with the cluster is retained
  17. On wake-up
     • On wake-up, a Cluster is taken from the correct pool
     • The Cluster is associated with the Branch
     • The clusters service orchestrates the NVMe-oF volume mount into the cluster's existing PVC and a restart of the Postgres container inside the pod
     • The ClusterPool controller creates a replacement cluster for the pool
  18. Volume slots. Idea:
     • Mount an empty volume on the host
     • An entrypoint script waits for the pgdata folder to be available
     • Establish the NVMe-oF connection to the storage system
     • When the pgdata folder exists, let Postgres start
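The entrypoint wait described above can be sketched as a small shell function (a sketch, not the actual script; the PGDATA path is an assumption):

```shell
#!/bin/sh
# Entry-point sketch: the pod starts with an empty volume slot mounted;
# Postgres must not start until the NVMe-oF attach makes pgdata appear.
wait_for_pgdata() {
    # Poll until the storage system publishes the data directory.
    while [ ! -d "$1" ]; do
        sleep 0.1
    done
}

# In the real entrypoint, something along the lines of:
#   wait_for_pgdata "${PGDATA:-/var/lib/postgresql/data/pgdata}"
#   exec postgres -D "$PGDATA"
```

Because the volume is mounted before the data arrives, the pod-scheduling and container-start cost is paid while the cluster sits in the warm pool, not on wake-up.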
  19. Summary: ~500ms wake-up times
     In the critical path:
     • NVMe-oF connection (~100-150ms)
     • Postgres startup (~200-350ms)
     • Readiness / gateway connection (?)
     The rest of the operations are done when adding to the warm pool:
     • K8s scheduling
     • Pulling images
     • Pod initialization