@holly_cummins @edeandrea
The backstory
What could possibly go wrong?
Slide 30
Slide 30 text
Building a benchmark is easy.
Building a good benchmark is hard.
Slide 35
Slide 35 text
Benchmarks are
like puppies
Slide 43
Slide 43 text
decisions,
decisions,
decisions
Slide 49
Slide 49 text
decisions
application code
how the app is executed
execution environment
Slide 51
Slide 51 text
Every decision
changes the numbers.
Slide 53
Slide 53 text
How should we
make decisions?
Slide 59
Slide 59 text
Guiding Principles
Parity
- Like-for-like comparison
- App code should be equivalent
Normalness
- Representative of a typical app
- No occult tuning
High quality
- Model best app dev practices
- Model best performance practices
Test framework, not infrastructure
- Results not dominated by database
- Aim to be CPU-bound
Slide 60
Slide 60 text
reproducibility
- can we get the same answer repeatedly?
- is there noise in the results from things outside our control?
relevance
- are we reporting useful metrics?
- does this help us make a decision?
- is it answering a question we actually care about?
realism
- is this close to real-world?
- is this representative of the way applications will be run?
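One way to put a number on the reproducibility questions on this slide ("can we get the same answer repeatedly?", "is there noise in the results?") is the coefficient of variation across repeated runs of the same configuration. The sketch below is only an illustration: the throughput figures, the `coefficient_of_variation` helper, and the 5% tolerance are all assumptions, not values from the talk.

```python
import statistics


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean (unitless)."""
    return statistics.stdev(samples) / statistics.mean(samples)


# Hypothetical throughput (requests/second) from five runs of the
# same benchmark configuration on two different machines.
quiet_machine = [1000.0, 1010.0, 990.0, 1005.0, 995.0]
noisy_machine = [1000.0, 1200.0, 800.0, 1100.0, 900.0]

THRESHOLD = 0.05  # assumed tolerance: 5% run-to-run variation

for name, runs in [("quiet", quiet_machine), ("noisy", noisy_machine)]:
    cv = coefficient_of_variation(runs)
    verdict = "ok" if cv <= THRESHOLD else "too noisy to compare"
    print(f"{name}: cv={cv:.3f} -> {verdict}")
```

If the variation between identical runs is larger than the difference between the two things being compared, the comparison cannot support a decision.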
Slide 61
Slide 61 text
realism
reproducibility
relevance
- best practices for performance testing
- parity (i.e. fairness)
- test the framework, not the infrastructure
- normalness
- best practices for application code
Slide 62
Slide 62 text
It’s easy to make
all three worse
realism
reproducibility
relevance
Slide 63
Slide 63 text
For improvements,
choose one :(
reproducibility
realism
relevance
Slide 67
Slide 67 text
echo "1" > /sys/devices/system/cpu/intel_pstate/no_turbo
…
--cpuset-cpus 1,4,6,8
But Francesco, no one
would run a real
application like that?
reproducibility
realism
relevance
Slide 68
Slide 68 text
Benchmarking is a
post-truth discipline.
realism
reproducibility
relevance
Slide 69
Slide 69 text
Some of our learnings
Slide 70
Slide 70 text
Mistakes we’ve made
Slide 71
Slide 71 text
Decisions we’ve made
Slide 72
Slide 72 text
reproducibility
realism
relevance
out-of-the-box or tuned?
Slide 73
Slide 73 text
reproducibility
realism
relevance
how long should warmup be?
Slide 74
Slide 74 text
reproducibility
realism
relevance
pinning work to cores
reproducibility
realism
relevance
deterministic CPU frequency
(spoiler: it isn’t by default)
Slide 78
Slide 78 text
reproducibility
realism
relevance
how much data should
be in the database?
Slide 79
Slide 79 text
reproducibility
realism
relevance
stale database images
(oops)
Slide 80
Slide 80 text
Decisions are hard (and unfair).
Can we just … measure both ways?
Slide 82
Slide 82 text
JVM
Native
Slide 83
Slide 83 text
Spring 4
Spring 3
Slide 84
Slide 84 text
Out of the box
Tuned
Slide 85
Slide 85 text
With Virtual Threads
No Virtual Threads
Slide 88
Slide 88 text
AOT
Normal
Leyden AOT
Spring AOT
Slide 89
Slide 89 text
Combinatorics.
Slide 90
Slide 90 text
Our current
measurement
matrix is … a lot.
Out-of-the-box Tuned
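To see why the matrix grows so quickly, the options named on the preceding slides can be multiplied out. This is a rough sketch only: it assumes every dimension is independent of the others, which is unlikely to hold in practice (some combinations presumably do not apply), and the dictionary keys are made-up labels rather than names from the benchmark itself.

```python
from itertools import product

# Option values are taken from the slides; the grouping into
# independent dimensions is an assumption for illustration.
dimensions = {
    "runtime": ["JVM", "Native"],
    "spring_version": ["Spring 3", "Spring 4"],
    "tuning": ["Out of the box", "Tuned"],
    "virtual_threads": ["With Virtual Threads", "No Virtual Threads"],
    "aot": ["Normal", "Leyden AOT", "Spring AOT"],
}

# Every combination of one value per dimension is one benchmark run.
configurations = list(product(*dimensions.values()))

print(f"{len(dimensions)} dimensions -> {len(configurations)} configurations")
for config in configurations[:3]:
    print(dict(zip(dimensions, config)))
```

Each additional two-way choice doubles the number of runs, before counting the repeats needed for reproducibility.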
Slide 91
Slide 91 text
How universal are the results?
Slide 92
Slide 92 text
“It depends.”
Slide 94
Slide 94 text
The ultimate validation
“They could run 3 times denser deployments without
sacrificing availability and response times of services”,
Thorsten reiterated.
Slide 96
Slide 96 text
I tried your
benchmark, and Quarkus
is only 1.4x faster
Slide 97
Slide 97 text
decisions
application code
how the app is executed
execution environment
Slide 101
Slide 101 text
hardware schedulers
make a big difference
Slide 102
Slide 102 text
The mistakes we didn’t make
Slide 103
Slide 103 text
How not to benchmark
Slide 104
Slide 104 text
reproducibility
realism
relevance
Running on a laptop
don’t do this
Slide 105
Slide 105 text
reproducibility
realism
relevance
don’t do this
Other work running on
the same machine
(the load driver counts!)
Slide 106
Slide 106 text
reproducibility
realism
relevance
Not having a clear question
don’t do this
Slide 107
Slide 107 text
reproducibility
realism
relevance
Not measuring what you
think you’re measuring
don’t do this
Slide 108
Slide 108 text
reproducibility
realism
relevance
A measurement of the wrong
bottleneck is a useless measurement
Slide 109
Slide 109 text
solution:
active benchmarking
Slide 110
Slide 110 text
reproducibility
realism
relevance
Coordinated omission
don’t do this
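Coordinated omission arises when a load generator waits for each response before sending the next request: a stalled server also stalls the measurement, so the slow period is under-sampled. The sketch below is a pure simulation under stated assumptions (a single connection, a fixed intended send interval, invented service times); the `simulate` function is hypothetical and is not taken from any real load-testing tool.

```python
def simulate(service_times_ms, interval_ms):
    """Simulate a single-connection load generator.

    Request i is intended to start at i * interval_ms, but it cannot
    be sent until the previous response has arrived.
    Returns (naive, corrected) latency lists in milliseconds.
    """
    naive, corrected = [], []
    free_at = 0.0  # time at which the connection is free again
    for i, service in enumerate(service_times_ms):
        intended = i * interval_ms
        sent = max(intended, free_at)      # a stall delays the send
        done = sent + service
        naive.append(done - sent)          # ignores time spent waiting to send
        corrected.append(done - intended)  # measures from the intended start
        free_at = done
    return naive, corrected


# Hypothetical workload: 1 ms responses, except for one 100 ms stall.
service_times = [1.0] * 5 + [100.0] + [1.0] * 14
naive, corrected = simulate(service_times, interval_ms=10.0)

slow_naive = sum(1 for latency in naive if latency > 5.0)
slow_corrected = sum(1 for latency in corrected if latency > 5.0)
print("mean latency, measured from actual send  :", sum(naive) / len(naive))
print("mean latency, measured from intended send:", sum(corrected) / len(corrected))
print("requests slower than 5 ms:", slow_naive, "vs", slow_corrected)
```

In this made-up run the naive measurement reports a single slow request, while measuring from the intended send time shows that the requests queued behind the stall were slow as well.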
Slide 111
Slide 111 text
reproducibility
realism
relevance
Varying multiple
things at once
don’t do this
Slide 112
Slide 112 text
reproducibility
realism
relevance
Measuring multiple
things at once
don’t do this
Slide 113
Slide 113 text
What next?
Slide 114
Slide 114 text
reproducibility
realism
relevance
more varied + complex application