Hi everyone, thanks for coming out. I’m Alex, I’ve been a Racker for about a month now! If I say something that doesn’t make sense, or you want more details, please just interrupt me as I go, shout over me if you have to.
software system.” Wednesday, June 12, 13
So, benchmarking is about measurement and assessment. There are two primary reasons to benchmark: to compare two software systems, or to compare one software system against a goal.
PyPy and Topaz
• People tend to do it very badly, leading to misleading, and often dishonest, results
• I care about performance
I hack on a project called PyPy, a high-performance Python implementation, and on Topaz, a high-performance Ruby implementation. Because their purpose is performance, we spend a lot of time measuring performance. And I spend a lot of time reading benchmarks from other people who do it badly. That’s not to say we’re perfect, but I think we spend more time thinking about these issues than a lot of other people. If you care about the performance of your software, you need automated, reproducible benchmarking, just like you need automated, reproducible tests if you care about your program working.
Show of hands, who thinks you should benchmark on a quiet system (one with nothing else running)? And who thinks you should benchmark on a noisy system, one with a lot of other stuff going on? (Assumption: most people say quiet.) What if your software is designed to run in the background on an otherwise busy machine, say monitoring software on a server, or anti-virus on a desktop machine? Now the real-world system you want to model is noisy, and different software responds differently to a noisy system: your instruction cache is going to get busted, you’re going to miss more on branch prediction, you’ll potentially wait more on disk IO, and there are tons of other factors at play.
the last time you thought about this was college or maybe even high school, but the hallmark of science is the scientific method. I think if we look at existing benchmark techniques in light of this we start to see many flaws.
few times
• Average times
• Announce your software is 10x faster than the competition
This is, I would say, par for the course for a lot of existing benchmarking. There’s a lot that’s worse than this, and I’ll get into that more. So what’s wrong with this? There are a few things.
of statistical rigor in almost all benchmarks. When we do real science we look at things like standard deviations and run what are called statistical hypothesis tests to see if the differences we think we’re observing are really statistically significant.
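To make the idea of a hypothesis test concrete, here is a minimal sketch of one option, a permutation test on the difference in means. The talk does not name a specific test, so this choice is an assumption, and the function name `permutation_test` and the timing numbers are invented purely for illustration.

```python
import random
import statistics


def permutation_test(a, b, rounds=5_000, seed=0):
    """Estimate a two-sided p-value for the difference in mean timings."""
    rng = random.Random(seed)  # fixed seed so the analysis is reproducible
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(rounds):
        # Reassign the readings to the two groups at random.
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.mean(left) - statistics.mean(right)) >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (rounds + 1)


# Hypothetical timings in seconds, not real measurements.
old_build = [1.02, 0.98, 1.05, 1.01, 0.99, 1.03, 1.00, 1.04]
new_build = [0.91, 0.95, 0.89, 0.93, 0.94, 0.90, 0.92, 0.96]

p_value = permutation_test(old_build, new_build)
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value here only says the difference is unlikely to be shuffling noise for these particular readings; it says nothing about whether the benchmark itself models the real system.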
our benchmarks are often bad. The goal of a benchmark is to model the real system, so you can assess its performance before it’s real. Oftentimes our benchmarks don’t do this. We do things like turn off durability in our database, which we’d never do in real life. Or we just run different code: we think one part of our application is the bottleneck, so we only benchmark it, and then we’re wrong.
we’re dishonest in our conclusions. We say things like “10x faster than X” when the honest conclusion is “On this specific benchmark we had a geometric mean of 10x faster with such and such confidence”. If you run one benchmark there is no way you can make global performance claims, unless your application has literally one possible code path (it doesn’t).
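As a rough illustration of reporting a geometric mean alongside the individual results, here is a small sketch. The benchmark names and timings are hypothetical, and the layout of the `timings` dictionary is just one convenient assumption.

```python
import statistics

# Hypothetical per-benchmark timings in seconds: (baseline, candidate).
timings = {
    "json_parse": (2.0, 0.5),
    "template_render": (1.0, 1.0),
    "regex_match": (0.5, 1.0),
}

# Speedup > 1 means the candidate is faster on that benchmark.
speedups = {name: base / cand for name, (base, cand) in timings.items()}
geomean = statistics.geometric_mean(list(speedups.values()))

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
print(f"geometric mean over {len(speedups)} benchmarks: {geomean:.2f}x")
```

Printing every ratio next to the summary keeps the claim honest: in this made-up data one benchmark gets slower, which a single headline number would hide.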
understanding what we’re benchmarking
• Non-representative environments
These are another couple of common pitfalls:
* Porting a Java+MySQL system to C+Redis and then claiming C is faster than Java, or Redis is faster than MySQL, or whatever. You can claim this whole system is faster than that other system, but without controlling for your variables, you can’t make other claims.
* Another common one I see is people writing microbenchmarks and not realizing the whole thing is constant folded; they’re literally benchmarking the empty loop.
* Running on your MacBook Pro and thinking it’ll be the same as a virtualized cloud server, which is particularly bad when IO comes into play. Or, “This is taking a while, I’ll go watch cat videos”.
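The constant-folding pitfall can be seen on a small scale in CPython's bytecode compiler. This is only a sketch, and it assumes CPython 3; a JIT such as PyPy folds far more aggressively and at run time, so treat this as an illustration of the idea rather than of any particular optimizer.

```python
import dis

# A microbenchmark body that looks like it performs arithmetic.
source = "seconds_per_day = 60 * 60 * 24"
code = compile(source, "<microbenchmark>", "exec")

# Assumption: CPython 3. The compiler evaluates the product ahead of time,
# so the finished result is stored among the code object's constants.
print("constants:", code.co_consts)

# Print the bytecode so you can check what the timed statement really does.
dis.dis(code)
```

If the disassembly shows only a constant load and a store, timing that statement in a loop measures the loop, not the multiplication.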
• Micro-benchmarks target individual functions.
• Similar to unit vs. integration tests.
The long and the short of it is: you should never micro-benchmark code you didn’t write, you should never micro-benchmark code you don’t know to be a bottleneck, and you should never micro-benchmark code you aren’t willing to rewrite.
So, that’s a whole bunch of ways it’s possible to get it wrong (and ways I *have* done it wrong). Now I want to talk about how to possibly do it better. I won’t promise this will be good, but I think we can do better.
Immutable
You want to only cover the code that’s in your hypothesis. You want something that can be run many times by many people (no random REPL one-liners). And you want to get numerically stable results. Best practice here is to record N readings from a single process (to account for variance due to GC) and run the full process multiple times. If you’re using the same benchmark over many runs, and keeping a history, you must never change the benchmark.
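The "N readings per process, several processes" advice could be sketched roughly as follows. Everything here is an assumption made for illustration: the `workload` function is a hypothetical stand-in for the code in your hypothesis, and `run_benchmark` and its parameters are invented names, not part of any existing tool.

```python
import json
import subprocess
import sys

# Child process: takes N readings of a stand-in workload, prints them as JSON.
CHILD = r"""
import json, sys, time

def workload():
    # Hypothetical stand-in for the code named in your hypothesis.
    return sum(i * i for i in range(20_000))

readings = []
for _ in range(int(sys.argv[1])):
    start = time.perf_counter()
    workload()
    readings.append(time.perf_counter() - start)
print(json.dumps(readings))
"""


def run_benchmark(processes=3, readings_per_process=5):
    """Return one list of readings per fresh interpreter process."""
    results = []
    for _ in range(processes):
        out = subprocess.run(
            [sys.executable, "-c", CHILD, str(readings_per_process)],
            capture_output=True, text=True, check=True,
        )
        results.append(json.loads(out.stdout))
    return results


if __name__ == "__main__":
    for i, readings in enumerate(run_benchmark()):
        print(f"process {i}: {[f'{r:.6f}' for r in readings]}")
```

Keeping the readings grouped by process, rather than flattening them, lets you look at variance within a process and between processes separately.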
you do your analysis. You want to compute standard deviations, run a statistical hypothesis test, and look at high-level numbers: how do the means compare? Your intuition is not useless. I am not a statistician; if someone here is, we should chat, because the statistical methodology is the place I’m weakest. You probably want to do the math to compute how much faster or slower one was.
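For the high-level numbers, a minimal sketch using only the standard library might look like this. The readings are hypothetical, and `summarize` is an invented helper, not something from the talk.

```python
import statistics


def summarize(label, samples):
    """Print and return the mean and sample standard deviation."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    print(f"{label}: mean={mean:.4f}s stdev={stdev:.4f}s n={len(samples)}")
    return mean, stdev


# Hypothetical readings in seconds, not real measurements.
baseline = [2.00, 2.10, 1.90, 2.05, 1.95]
candidate = [1.00, 1.05, 0.95, 1.02, 0.98]

base_mean, _ = summarize("baseline", baseline)
cand_mean, _ = summarize("candidate", candidate)

# Ratio of means: values above 1 mean the candidate is faster.
speedup = base_mean / cand_mean
print(f"candidate is {speedup:.2f}x the speed of baseline on this benchmark")
```

The ratio on its own is not a conclusion; it only means something next to the standard deviations and a significance test on the same data.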
worth it
• Whether you’re as performant as possible
• Whether they’re correct
So, benchmarks give you a lot of information, particularly if you do them well, but they don’t answer a few questions:
* Did you make a mess of your code to get this result?
* Have you reached a maximum for how fast you can achieve a result?
* Benchmarks don’t tell you when you screwed up any of the stuff we discussed.