Benchmarking

Benchmarking Alex Gaynor June 12th, 2013 Wednesday, June 12, 13
Hi everyone, thanks for coming out. I’m Alex, I’ve been a Racker for about a month now! If I say something that doesn’t make sense, or you want more details, please just interrupt me as I go, shout over me if you have to.

What is benchmarking?
So, I’m going to talk about benchmarking. Everyone knows what benchmarking is, of course, right? But I want to dive in and really nail down a definition.

“The process of measuring and assessing the performance of a software system.”
So, benchmarking is about measurement and assessment. There are two primary reasons to benchmark: to compare two software systems, or to compare one software system against a goal.

Why do I care about benchmarking?
• I work on PyPy and Topaz
• People tend to do it very badly, leading to misleading, and often dishonest, results
• I care about performance
I hack on a project called PyPy, a high performance Python implementation, and on Topaz, a high performance Ruby implementation. Because their purpose is performance, we spend a lot of time measuring performance. And I spend a lot of time reading other people doing it badly. That’s not to say we’re perfect, but I think we spend more time thinking about these issues than a lot of other people. If you care about the performance of your software, you need automated, reproducible benchmarking, just like you need automated, reproducible tests if you care about your program working.

Good benchmarking is really really hard.
Show of hands, who thinks you should benchmark on a quiet system (one with nothing else running)? And who thinks you should benchmark on a noisy system, one with a lot of other stuff going on? (Assumption: most people say quiet.) What if your software is designed to run in the background on an otherwise busy machine, say monitoring software on a server, or anti-virus on a desktop machine? Now the real world system you want to model is noisy, and different software responds differently to a noisy system: your instruction cache is going to get busted, you’re going to miss more on branch prediction, there’s potentially more IO wait on disk, and tons of other factors are at play.

Benchmarking is Science
How many people in this room consider themselves scientists, in a traditional sense (people in lab coats)? Benchmarking is your chance to do science.

Scientific Method
So, in all likelihood the last time you thought about this was college or maybe even high school, but the hallmark of science is the scientific method. I think if we look at existing benchmark techniques in light of this, we start to see many flaws.

Existing Methodology
• Write a script
• Run it a few times
• Average times
• Announce your software is 10x faster than the competition
This is, I would say, par for the course for a lot of existing benchmarking. There’s a lot that’s worse than this, and I’ll get into that more. So what’s wrong with this? There are a few things.

Bad Statistics
First, there’s a lack of statistical rigor in almost all benchmarks. When we do real science we look at things like standard deviations, and we run what are called statistical hypothesis tests to see if the differences we think we’re observing are really statistically significant.
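As a rough illustration of why the average alone is not enough, here is a minimal Python sketch that reports a standard deviation next to each mean. The timings are invented for the example, and `summarize` is a hypothetical helper, not part of any existing tool.

```python
import statistics

# Hypothetical timings in seconds; invented purely for illustration.
old_times = [1.02, 0.98, 1.01, 0.99, 1.00]
new_times = [0.70, 1.25, 0.85, 1.20, 0.75]


def summarize(times):
    """Return (mean, sample standard deviation) for a list of timings."""
    return statistics.mean(times), statistics.stdev(times)


old_mean, old_sd = summarize(old_times)
new_mean, new_sd = summarize(new_times)

# Print the spread alongside the mean so a reader can judge whether
# the gap between the two means is larger than the noise.
print(f"old: {old_mean:.3f}s +/- {old_sd:.3f}s")
print(f"new: {new_mean:.3f}s +/- {new_sd:.3f}s")
```

In this made-up data the new mean is lower, but its standard deviation is several times larger than the gap between the means, so "the new version is faster" is not a conclusion these numbers support on their own.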

Bad Benchmarks
The second problem is our benchmarks are often bad. The goal of a benchmark is to model the real system, so you can assess the performance of a real system before it’s real. Oftentimes our benchmarks don’t do this. We do things like turn off durability in our database, which we’d never do in real life. Or we just run different code: we think one part of our application is the bottleneck, so we only benchmark it, and then we’re wrong.

Bad reporting
The third problem is we’re dishonest in our conclusions. We say things like “10x faster than X” when the honest conclusion is, “On this specific benchmark we had a geometric mean of 10x faster with such and such confidence”. If you run one benchmark there is no way you can make global performance claims, unless your application has literally one possible code path (it doesn’t).
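To make the "geometric mean" wording concrete, here is a small Python sketch that summarizes per-benchmark speedups. The benchmark names and ratios are hypothetical, and `geometric_mean_speedup` is an illustrative helper; `statistics.geometric_mean` requires Python 3.8 or newer.

```python
import statistics

# Hypothetical speedups (old_time / new_time) per benchmark; invented numbers.
speedups = {"json_parse": 12.0, "regex_match": 8.0, "startup": 0.9}


def geometric_mean_speedup(ratios):
    """Summarize a collection of speedup ratios with their geometric mean."""
    return statistics.geometric_mean(list(ratios))


overall = geometric_mean_speedup(speedups.values())
slowest = min(speedups, key=speedups.get)

# Report the summary together with the set it was computed over and the
# worst case, instead of a bare "Nx faster" headline.
print(f"geometric mean speedup over {len(speedups)} benchmarks: {overall:.2f}x")
print(f"worst case: {slowest} at {speedups[slowest]:.2f}x")
```

The geometric mean is the usual choice for averaging ratios because a 2x speedup and a 2x slowdown cancel out to 1x, which an arithmetic mean would not do. Note that in this invented data one benchmark got slower, which a headline number alone would hide.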

Other common mistakes
• Not controlling independent variables
• Not understanding what we’re benchmarking
• Non-representative environments
These are another couple of common pitfalls:
* Porting a Java+MySQL system to C+Redis and then claiming C is faster than Java, or Redis is faster than MySQL, or whatever. You can claim this whole system is faster than this other system, but without controlling for your variables, you can’t make other claims.
* Another common one I see is people write microbenchmarks and they don’t realize the whole thing is constant folded; they’re literally benchmarking the empty loop.
* Running on your MacBook Pro and thinking it’ll be the same as a virtualized cloud server, which is particularly bad when IO comes into play. Or, “This is taking a while, I’ll go watch cat videos”.
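The constant folding pitfall can be checked directly. The sketch below assumes CPython, whose compiler folds constant expressions, and inspects the compiled code for a hypothetical one-line "benchmark" body; other implementations (PyPy included) optimize differently, so treat this as an illustration rather than a general test.

```python
import dis

# A hypothetical microbenchmark body that the compiler can evaluate
# ahead of time, because both operands are constants.
code = compile("x = 2 * 3", "<bench>", "exec")

# If folding happened, the product is stored as a constant, so there is
# no multiplication left to time when the code runs.
folded = 6 in code.co_consts
opnames = [ins.opname for ins in dis.get_instructions(code)]

print("constant folded:", folded)
print("instructions:", opnames)
```

If the instruction list shows only a constant load and a store, timing this snippet measures loop and interpreter overhead, not multiplication.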

Macro vs. Micro-benchmarks
• Macro-benchmarks consider “complete” software systems.
• Micro-benchmarks target individual functions.
• Similar to unit vs. integration tests.
The long and the short of it is: you should never micro-benchmark code you didn’t write, you should never micro-benchmark code you don’t know to be a bottleneck, and you should never micro-benchmark code you aren’t willing to rewrite.

How to do benchmarking better
So, that’s a whole bunch of ways it’s possible to get it wrong (and ways I *have* done it wrong). Now I want to talk about how to possibly do it better. I won’t promise this will be good, but I think we can do better.

A question
The first step in the scientific method is to articulate a question we want to answer. For example, “How can we improve performance of the list_containers API method for users with no containers?”

A hypothesis
This is the part where we figure out a specific, testable idea: “Does removing 2 SQL queries from the list_containers API call improve its performance when called with a user who has no containers?”

A prediction
This should be “It will improve performance”, unless you have some sort of bizarre scenario in which you want to slow down your code, in which case let me tell you about my good friend sleep(3).

Testing
This is usually where people go wrong. This is the part where you write your benchmark, make the changes you want, and measure.

Writing a benchmark
• Minimal
• Reproducible
• Stable
• Immutable
You want to only cover the code that’s in your hypothesis. You want something that can be run many times by many people (no random REPL one-liners). And you want to get numerically stable results. Best practice here is to record N readings from a single process (to account for variance due to GC) and run the full process multiple times. If you’re using the same benchmark over many runs, and keeping a history, you must never change the benchmark.
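One way to sketch the "N readings per process, several fresh processes" practice in Python is shown below. Everything here is an assumption made for illustration: `WORKLOAD` is a stand-in for the code under your hypothesis, and `CHILD_SCRIPT` and `run_fresh_processes` are hypothetical names, not an existing harness.

```python
import json
import subprocess
import sys

# Hypothetical workload; stands in for the code named in the hypothesis.
WORKLOAD = "sum(i * i for i in range(10000))"

# Script executed inside each fresh interpreter: it takes N readings of
# the workload and prints them as a JSON list on stdout.
CHILD_SCRIPT = """
import json, sys, time
n = int(sys.argv[1])
readings = []
for _ in range(n):
    start = time.perf_counter()
    {workload}
    readings.append(time.perf_counter() - start)
print(json.dumps(readings))
"""


def run_fresh_processes(processes, n):
    """Start `processes` new interpreters, each recording `n` readings."""
    script = CHILD_SCRIPT.format(workload=WORKLOAD)
    results = []
    for _ in range(processes):
        completed = subprocess.run(
            [sys.executable, "-c", script, str(n)],
            check=True,
            capture_output=True,
            text=True,
        )
        results.append(json.loads(completed.stdout))
    return results
```

Calling `run_fresh_processes(5, 20)` would give five lists of twenty timings each. Keeping the per-process lists separate, rather than flattening them, lets you see both the variation inside one process and the variation between processes.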

Change your code
Do whatever you have to do, but change nothing besides what your hypothesis was about. If your hypothesis is to remove 2 SQL queries, don’t also refactor something in some other corner.

Run your benchmark
Run on a quiet system (if you need to benchmark a noisy one, do both; don’t skip the quiet one) that matches the system you want to model. Run your before (control group!) and your after, do multiple runs, and collect a ton of data.

Analysis
This is the part where you do your analysis. You want to compute standard deviations and run a statistical hypothesis test. Also look at the high level numbers: how do the means compare? Your intuition is not useless. I am not a statistician; if someone here is, we should chat, because the statistical methodology is the place I’m weakest. You probably want to do the math to compute how much faster or slower one was.
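The talk does not name a specific test, so the following is only one possible choice: a sketch of Welch's t-test, which compares two sample means without assuming equal variances. The timings are invented and `welch_t` is a hypothetical helper built on the standard library.

```python
import math
import statistics


def welch_t(before, after):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    m1, m2 = statistics.mean(before), statistics.mean(after)
    v1, v2 = statistics.variance(before), statistics.variance(after)
    n1, n2 = len(before), len(after)
    se1, se2 = v1 / n1, v2 / n2  # squared standard error of each mean
    t = (m1 - m2) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical timings in milliseconds; invented for illustration.
before = [10.1, 9.9, 10.0, 10.2, 9.8]
after = [9.0, 9.1, 8.9, 9.2, 8.8]

t, df = welch_t(before, after)
speedup = statistics.mean(before) / statistics.mean(after)
print(f"t = {t:.2f}, df = {df:.1f}, speedup = {speedup:.2f}x")
```

Turning the t statistic into a p-value needs the t distribution's CDF, which the standard library does not provide; SciPy's `scipy.stats.ttest_ind(before, after, equal_var=False)` performs the same test and returns the p-value directly. Timing data is often skewed, so a t-test is an assumption worth checking rather than a default.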

Conclusion
Now you know whether you were right or wrong.

Things benchmarks can’t tell you
• Whether the tradeoff was worth it
• Whether you’re as performant as possible
• Whether they’re correct
So, benchmarks give you a lot of information, particularly if you do them well, but they don’t answer a few questions:
* Did you make a mess of your code to get this result?
* Have you reached the maximum for how fast you can get a result?
* Benchmarks don’t tell you when you screwed up any of the stuff we discussed.

Help us build better infrastructure
There’s not a lot of good infrastructure for doing real, statistically rigorous benchmarking.

Q and A?
