How We Benchmarked Quarkus: Patterns and anti-patterns

Holly Cummins, Eric Deandrea, Francesco Nigro
Quarkus Insights, April 20, 2026

@holly_cummins @edeandrea

tl;dr

Quarkus is fast + efficient.

quarkusio/spring-quarkus-perf-comparison

The backstory: What problem were we solving?

before:
- what Quarkus version is this?
- when was it measured?
- where is the source code?
- how can I reproduce?
- what framework is this?

before:
- why is throughput missing?
- is throughput so bad Quarkus had to hide the numbers?

shadow-benchmarking

home-rolled benchmark after home-rolled benchmark

duplicated effort

benchmarking anti-patterns

We needed a benchmark.

We made a benchmark.

after:
- open source
- traceable

The backstory: What could possibly go wrong?

Building a benchmark is easy. Building a good benchmark is hard.

Benchmarks are like puppies

decisions, decisions, decisions

decisions:
- application code
- how the app is executed
- execution environment

Every decision changes the numbers.

How should we make decisions?

Guiding Principles
- Parity: like-for-like comparison; app code should be equivalent
- Normalness: representative of a typical app; no occult tuning
- High quality: model best app dev practices; model best performance practices
- Test framework, not infrastructure: results not dominated by database; aim to be CPU-bound

reproducibility
- can we get the same answer repeatedly?
- is there noise in the results from things outside our control?

relevance
- are we reporting useful metrics?
- does this help us make a decision?
- is it answering a question we actually care about?

realism
- is this close to real-world?
- is this representative of the way applications will be run?

realism, reproducibility, relevance:
- best practices for performance testing
- parity (i.e. fairness)
- test the framework, not the infrastructure
- normalness
- best practices for application code

It’s easy to make all three worse: realism, reproducibility, relevance.

For improvements, choose one :( reproducibility, realism, or relevance.

echo "1" > /sys/devices/system/cpu/intel_pstate/no_turbo
… --cpuset-cpus 1,4,6,8

"But Francesco, no one would run a real application like that?"

Benchmarking is a post-truth discipline.

Some of our learnings

Mistakes we’ve made

Decisions we’ve made

- out-of-the-box or tuned?
- how long should warmup be?
- pinning work to cores
- transactional reads?
- database container networking
- deterministic CPU frequency (spoiler: it isn’t by default)
- how much data should be in the database?
- stale database images (oops)

Decisions are hard (and unfair). Can we just … measure both ways?

- JVM vs. native
- Spring 4 vs. Spring 3
- out of the box vs. tuned
- with virtual threads vs. no virtual threads
- AOT (Leyden AOT, Spring AOT) vs. normal

Combinatorics.

Our current measurement matrix is … a lot.

[Measurement matrix; columns: Out-of-the-box, Tuned]
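To get a feel for how quickly "measure both ways" grows, here is a minimal sketch that enumerates the options named on the preceding slides. It assumes every dimension is independent, which the real matrix almost certainly is not (some options only apply to one framework), and the dictionary keys are labels invented for this illustration.

```python
from itertools import product

# Options as named on the slides. Treating them as fully independent
# dimensions is an assumption made only for this illustration.
dimensions = {
    "runtime": ["JVM", "native"],
    "spring_version": ["Spring 3", "Spring 4"],
    "tuning": ["out-of-the-box", "tuned"],
    "virtual_threads": [True, False],
    "aot": ["normal", "Leyden AOT", "Spring AOT"],
}

# Every combination is one potential benchmark configuration.
combinations = [
    dict(zip(dimensions, values)) for values in product(*dimensions.values())
]

print(len(combinations))  # 2 * 2 * 2 * 2 * 3 = 48
```

Each extra two-way decision doubles the count, so pruning combinations that make no sense is likely unavoidable in practice.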

How universal are the results?

“It depends.”

The ultimate validation

“They could run 3 times denser deployments without sacrificing availability and response times of services”, Thornsten reiterated.

I tried your benchmark, and Quarkus is only 1.4x faster

decisions:
- application code
- how the app is executed
- execution environment

hardware schedulers make a big difference

The mistakes we didn’t make

How not to benchmark

don’t do this:
- running on a laptop
- other work running on the same machine (the load driver counts!)
- not having a clear question
- not measuring what you think you’re measuring (a measurement of the wrong bottleneck is a useless measurement; solution: active benchmarking)
- coordinated omission
- varying multiple things at once
- measuring multiple things at once
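Coordinated omission is the least self-explanatory item on that list, so here is a toy simulation of it. This is a sketch under stated assumptions, not how the benchmark's load driver works: the function names, the 10 ms send interval, and the single 500 ms stall are all made up for illustration. A load generator that waits for each response before sending the next request only times requests from the moment they were actually sent, so requests delayed behind a stall look fast.

```python
import math


def simulate(service_times_ms, interval_ms):
    """One connection that intends to send a request every interval_ms."""
    naive, corrected = [], []
    connection_free_at = 0.0
    for i, service in enumerate(service_times_ms):
        intended_start = i * interval_ms
        # The request cannot be sent until the previous one has finished.
        actual_start = max(intended_start, connection_free_at)
        finish = actual_start + service
        naive.append(finish - actual_start)        # ignores time spent waiting to send
        corrected.append(finish - intended_start)  # measured from the intended send time
        connection_free_at = finish
    return naive, corrected


def percentile(values, pct):
    """Nearest-rank percentile."""
    ordered = sorted(values)
    index = max(0, math.ceil(pct * len(ordered) / 100) - 1)
    return ordered[index]


# Hypothetical workload: 100 requests of 1 ms each, with one 500 ms stall.
service_times = [1.0] * 100
service_times[10] = 500.0

naive, corrected = simulate(service_times, interval_ms=10.0)
print(percentile(naive, 99))      # 1.0: the stall is almost invisible
print(percentile(corrected, 99))  # 491.0: the queued requests are counted
```

The naive numbers record only one slow request, while the corrected numbers show that dozens of requests were held up behind it, which is what a user would have experienced.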

What next?
- more varied + complex application
- energy measurements
- cost measurements

Questions?
