Slide 1

Benchmarking
Alex Gaynor
June 12th, 2013

Hi everyone, thanks for coming out. I’m Alex; I’ve been a Racker for about a month now! If I say something that doesn’t make sense, or you want more details, please just interrupt me as I go, shout over me if you have to.

Slide 2

What is benchmarking?

So, I’m going to talk about benchmarking. Everyone knows what benchmarking is, of course, right? But I want to dive in and really nail down a definition.

Slide 3

“The process of measuring and assessing the performance of a software system.”

So, benchmarking is about measurement and assessment. There are two primary reasons to benchmark: to compare two software systems, or to compare one software system against a goal.

Slide 4

Why do I care about benchmarking?
• I work on PyPy and Topaz
• People tend to do it very badly, leading to misleading, and often dishonest, results
• I care about performance

I hack on a project called PyPy, a high-performance Python implementation, and Topaz, a high-performance Ruby implementation. Because their purpose is performance, we spend a lot of time measuring performance. And I spend a lot of time reading about other people doing it badly. Not to say we’re perfect, but I think we spend more time thinking about these issues than a lot of other people. If you care about the performance of your software, you need automated, reproducible benchmarking, just like you need automated, reproducible tests if you care about your program working.

Slide 5

Good benchmarking is really, really hard.

Show of hands: who thinks you should benchmark on a quiet system (one with nothing else running)? And who thinks you should benchmark on a noisy system, one with a lot of other stuff going on? (Assuming most people say quiet:) what if your software is designed to run in the background on an otherwise busy machine, say monitoring software on a server, or anti-virus on a desktop machine? Now the real-world system you want to model is noisy. Different software responds differently to a noisy system: your instruction cache is going to get busted, you’re going to miss more on branch prediction, you’ll potentially see more IO wait on disk, and there are tons of other factors at play.

Slide 6

Benchmarking is Science

How many people in this room consider themselves scientists, in the traditional sense (people in lab coats)? Benchmarking is your chance to do science.

Slide 7

Scientific Method

So, in all likelihood the last time you thought about this was college, or maybe even high school, but the hallmark of science is the scientific method. I think if we look at existing benchmarking techniques in light of it, we start to see many flaws.

Slide 8

Existing Methodology
• Write a script
• Run it a few times
• Average the times
• Announce your software is 10x faster than the competition

This is, I would say, par for the course for a lot of existing benchmarking. There’s a lot that’s worse than this, and I’ll get into that more. So what’s wrong with this? There are a few things.
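
To make the anti-pattern concrete, here is a minimal sketch of that methodology in Python; the workload function and run count are hypothetical stand-ins. The next few slides pick apart what this gets wrong.

    import time

    def workload():
        # Hypothetical stand-in for "the script": sum some numbers.
        return sum(range(1_000_000))

    # Run it a few times and average the wall-clock times.
    timings = []
    for _ in range(3):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)

    average = sum(timings) / len(timings)
    print(f"average: {average:.4f}s")  # ...and announce you're 10x faster.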

Slide 9

Bad Statistics

First, there’s a lack of statistical rigor in almost all benchmarks. When we do real science, we look at things like standard deviations, and we run what are called statistical hypothesis tests to see if the differences we think we’re observing are really statistically significant.
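
A toy illustration of why the average alone misleads (the numbers are made up): these two sets of timings have the same mean, but very different spreads, which only the standard deviation reveals.

    import statistics

    # Two hypothetical sets of timings (seconds) with identical means.
    stable = [1.00, 1.01, 0.99, 1.00, 1.00]
    noisy = [0.50, 1.50, 0.70, 1.30, 1.00]

    for name, times in [("stable", stable), ("noisy", noisy)]:
        # Same mean (1.0s), wildly different standard deviations.
        print(name, statistics.mean(times), statistics.stdev(times))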

Slide 10

Bad Benchmarks

The second problem is that our benchmarks are often bad. The goal of a benchmark is to model the real system, so we can assess the performance of a real system before it’s real. Often our benchmarks don’t do this. We do things like turn off durability in our database, which we’d never do in real life, or we just run different code: we think one part of our application is the bottleneck, so we only benchmark it, and then we’re wrong.

Slide 11

Bad reporting

The third problem is that we’re dishonest in our conclusions. We say things like “10x faster than X” when the honest conclusion is “on this specific benchmark we had a geometric mean of 10x faster, with such-and-such confidence”. If you run one benchmark, there is no way you can make global performance claims, unless your application has literally one possible code path (it doesn’t).
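
As an aside on the geometric mean: it is the conventional way to aggregate speedups across a suite of benchmarks, because unlike the arithmetic mean it treats ratios symmetrically. A quick sketch, with invented per-benchmark speedups (statistics.geometric_mean needs Python 3.8+):

    import statistics

    # Hypothetical per-benchmark speedups: 2.0 means "twice as fast".
    speedups = [1.5, 4.0, 0.9, 12.0]

    # Geometric mean: ~2.85x, versus a flattering 4.6x arithmetic mean.
    print(statistics.geometric_mean(speedups))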

Slide 12

Other common mistakes
• Not controlling independent variables
• Not understanding what we’re benchmarking
• Non-representative environments

Here are a few more common pitfalls:
• Porting a Java+MySQL system to C+Redis and then claiming C is faster than Java, or Redis is faster than MySQL, or whatever. You can claim this whole system is faster than that other system, but without controlling your variables, you can’t make other claims.
• Another common one I see is people writing microbenchmarks without realizing the whole thing is constant-folded: they’re literally benchmarking the empty loop (see the sketch below).
• Running on your MacBook Pro and thinking it’ll be the same as a virtualized cloud server, which is particularly bad when IO comes into play. Or, “this is taking a while, I’ll go watch cat videos”.
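
A concrete version of the constant-folding trap, using CPython’s timeit module: CPython’s peephole optimizer folds 1 + 1 into a constant at compile time, so the first measurement times loading a constant, not an addition. Pulling the operands out into variables defeats the folding.

    import timeit

    # Folded at compile time: this benchmarks loading a constant.
    print(timeit.timeit("1 + 1"))

    # Forces a real addition at runtime.
    print(timeit.timeit("x + y", setup="x = 1; y = 1"))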

Slide 13

Macro- vs. micro-benchmarks
• Macro-benchmarks consider “complete” software systems.
• Micro-benchmarks target individual functions.
• Similar to unit vs. integration tests.

The long and the short of it is: you should never micro-benchmark code you didn’t write, you should never micro-benchmark code you don’t know to be a bottleneck, and you should never micro-benchmark code you aren’t willing to rewrite.

Slide 14

How to do benchmarking right better

So, that’s a whole bunch of ways it’s possible to get it wrong (and ways I *have* done it wrong). Now I want to talk about how to possibly do it better. I won’t promise this will be good, but I think we can do better.

Slide 15

A question

The first step in the scientific method is to articulate a question we want to answer. For example, “How can we improve the performance of the list_containers API method for users with no containers?”

Slide 16

A hypothesis

This is the part where we figure out a specific, testable idea: “Does removing 2 SQL queries from the list_containers API call improve its performance when called for a user who has no containers?”

Slide 17

A prediction

This should be “it will improve performance”, unless you have some sort of bizarre scenario in which you want to slow down your code, in which case let me tell you about my good friend sleep(3).

Slide 18

Testing

This is usually where people go wrong. This is the part where you write your benchmark, make the changes you want, and measure.

Slide 19

Writing a benchmark
• Minimal
• Reproducible
• Stable
• Immutable

You want to cover only the code that’s in your hypothesis. You want something that can be run many times by many people (no random REPL one-liners). And you want numerically stable results: best practice here is to record N readings from a single process (to account for variance due to GC) and to run the full process multiple times, as in the sketch below. If you’re using the same benchmark over many runs and keeping a history, you must never change the benchmark.
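
Here is a minimal sketch of a harness with that shape, with a hypothetical stand-in workload (swap in the real code from your hypothesis): N in-process readings per invocation, with the full process rerun separately by an outer driver.

    import time

    def workload():
        # Stand-in for the code in your hypothesis, e.g. the
        # list_containers call: only this code, nothing else.
        sum(range(100_000))

    def benchmark(fn, n=50):
        # Record n wall-clock readings from a single process, so
        # per-process effects like GC variance show up in the data.
        readings = []
        for _ in range(n):
            start = time.perf_counter()
            fn()
            readings.append(time.perf_counter() - start)
        return readings

    if __name__ == "__main__":
        for reading in benchmark(workload):
            print(reading)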

Slide 20

Change your code

Do whatever you have to do, but change nothing besides what your hypothesis was about. If your hypothesis is about removing 2 SQL queries, don’t also refactor something in some other corner.

Slide 21

Run your benchmark

Use a quiet system (if you need to benchmark a noisy one, do both; don’t skip the quiet one), and one that matches the system you want to model. Run your before version (control group!) and your after version, do multiple runs, and collect a ton of data.
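
A sketch of what such a driver could look like, assuming the benchmark above is saved as bench.py and prints one reading per line; the filenames and run count are made up.

    import subprocess
    import sys

    RUNS = 10  # fresh-process runs per version

    def collect(label):
        # Start a fresh process for each run so per-process noise
        # (warmup, GC state, caches) is sampled rather than hidden.
        with open(f"{label}.txt", "w") as out:
            for _ in range(RUNS):
                subprocess.run([sys.executable, "bench.py"],
                               stdout=out, check=True)

    # With the control version checked out:
    collect("before")
    # ...then switch to the changed version and run:
    collect("after")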

Slide 22

Analysis

This is the part where you do your analysis. You want to compute standard deviations, run a statistical hypothesis test, and look at the high-level numbers: how do the means compare? Your intuition is not useless. I am not a statistician; if someone here is, we should chat, because statistical methodology is the place where I’m weakest. You probably also want to do the math to compute how much faster or slower one version was.
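
One plausible shape for that analysis, assuming readings collected into before.txt and after.txt as in the driver sketch above, and using SciPy’s independent two-sample t-test (Welch’s variant, since the variances may differ):

    import statistics
    from scipy import stats

    def load(path):
        with open(path) as f:
            return [float(line) for line in f]

    before = load("before.txt")
    after = load("after.txt")

    # High-level numbers first: means and standard deviations.
    for name, data in [("before", before), ("after", after)]:
        print(name, statistics.mean(data), statistics.stdev(data))

    # Welch's t-test: is the difference in means statistically significant?
    t, p = stats.ttest_ind(before, after, equal_var=False)
    speedup = statistics.mean(before) / statistics.mean(after)
    print(f"t={t:.3f} p={p:.4f} speedup={speedup:.2f}x")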

Slide 23

Conclusion

Now you know whether you were right or wrong.

Slide 24

Things benchmarks can’t tell you
• Whether the tradeoff was worth it
• Whether you’re as performant as possible
• Whether they’re correct

So, benchmarks give you a lot of information, particularly if you do them well, but they don’t answer a few questions:
• Did you make a mess of your code to get this result?
• Have you reached a maximum for how fast you can achieve a result?
• Benchmarks don’t tell you when you’ve screwed up any of the stuff we discussed.

Slide 25

Help us build better infrastructure

There’s not a lot of good infrastructure for doing real, statistically rigorous benchmarking.

Slide 26

Q and A?