Benchmarks: Avoiding Lies & Damn Lies

Setting up effective benchmarks takes planning; running them takes careful execution. This talk describes steps that help in benchmarking any code, with examples using Perl 5.

Steven Lembark

July 09, 2022

Transcript

  1. Benchmarks: Avoiding Lies & Damn Lies
    Steven Lembark
    Workhorse Computing
    [email protected]

  2. What are “benchmarks”?
    Generally, two kinds:
    Performance (a.k.a., “lies”, “damn lies”).
    Functionality (different subject).
    Different needs.
    Similar requirements.

  3. Functionality Benchmarks
    “Does it do what I need?”
    Utility not speed.
    “Integration” testing (vs. unit).
    Nice with Perl using HTTP, Selenium, DBI.
    Not what I’m describing here.

  4. Performance benchmarks.
    Good:
    Time for specific tasks.
    Objective, realistic tasks.
    Garbage:
    “Twice the speed of our competition.”

  5. Perl is nice for testing
    Making other code run.
    “Duct tape with timers.”
    Perl makes it manageable with:
    %ENV
    Forks
    Sockets & Pipes

  6. Designing & Execution
    Code & environment.
    Unstable environments == unusable timings.
    Background noise easily swamps data.
    Watch system around the test.
    Repeat tests.

  7. You may have all the time(1) you need
    Simple end-to-end test:
    time your_thing_here;
    Subjective vs. objective time.
    Multiple iterations to get averages.

  8. What kind of time do you have?
    Wallclock: Observed by user.
    User: What your program runs.
    System: Time for kernel services.
    You cannot control wallclock.
    Includes latency from timeslice, stolen VM time.

  9. Baseline: Time to do nothing
    Check startup time.
    Affected by O/S, disk.
    Run multiple times:
    see effects of buffering.
    $ time perl -e 0
    real 0m0.005s
    user 0m0.000s
    sys 0m0.000s
    $ time bash /dev/null
    real 0m0.005s
    user 0m0.000s
    sys 0m0.000s

  10. What does startup time tell us?
    Opterons are fast?
    Perl and bash block at the same rate?
    Not much by themselves.
    Differences can be telling.
    Stop until you explain any differences.

  11. Control overhead
    tmpfs on linux minimizes I/O overhead.
    Unloaded system minimizes contention.
    High-priority VM minimizes stolen time.
    Taskset minimizes L1/L2 turnover.

  12. Basic performance
    “How long does X take?”, you ask.
    “Well, it depends.”
    “On what?”
    “On what it is.”

  13. Basic performance
    Time for hardware?
    Time for software?
    Time for I/O?
    Creating realistic tests requires knowing!
    All you may know is that “it runs too slowly”.

  14. Step one: Use a reasonable perl.
    Centos has 5.8...
    RHEL’s built with 5.00503 compat, -O0, -g.

  15. Step one: Use a reasonable perl.
    Centos has 5.8...
    RHEL’s built with 5.00503 compat, -O0, -g.
    Simple lesson: BUILD YOUR OWN!!!
    Perl, Python, R, Postgres, MySQL, whatever.

  16. Step two: use Benchmark;
    This has the basic tools you need.
    use Benchmark ':hireswallclock';
    Do what it takes to use hireswallclock.
    Recompile Perl, hack the kernel, whatever.
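A minimal, self-contained sketch of the module in use; the unit of work here is a hypothetical stand-in for your own code:

```perl
#!/usr/bin/env perl
# Minimal Benchmark usage; $work is a stand-in for your own code.
use v5.10;
use strict;
use warnings;

# :hireswallclock uses Time::HiRes for sub-second wallclock resolution.
use Benchmark qw( :hireswallclock timethis );

my $work = sub { my $sum = 0; $sum += $_ for 1 .. 100 };

# run the sub 100_000 times; prints wallclock, user, system time and rate.
timethis 100_000, $work;
```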

  17. Minimize confusion
    Test atomic units of code.
    Establish a baseline
    Even if you test end-to-end.

  18. Basic baseline
    Running on a VM?
    Your benchmark could be time-sliced!
    timethis 1_000_000, sub{};
    Should be near-zero time at 100% CPU.

  19. Only a million?
    Well, maybe more...
    System load can affect reasonable counts.
    Run enough to get a valid time.
    DB<2> timethis 1_000_000, sub{};
    timethis 1000000: -0.0285478 wallclock secs
    (-0.03 usr + 0.00 sys = -0.03 CPU) @ -33333333.33/s
    (n=1000000)
    (warning: too few iterations for a reliable count)

  20. Baseline kernel calls.
    Run a million each:
    sub { open my $fh, '<', '/dev/null' };
    sub
    {
    open my $fh, '>', "/var/tmp/$$";
    unlink "/var/tmp/$$";
    };
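The two baselines above can be wrapped into a runnable sketch; the scratch path and the reduced pass count (the slide uses a million) are assumptions, so scale to your system:

```perl
#!/usr/bin/env perl
# Kernel-call baselines; the scratch path and the count (reduced from the
# slide's million) are assumptions -- scale to taste.
use v5.10;
use strict;
use warnings;
use Benchmark qw( :hireswallclock timethis );

my $count = 100_000;
my $path  = "/var/tmp/bench.$$";

# cheapest open(2): read an always-present device.
timethis $count, sub { open my $fh, '<', '/dev/null' or die $! };

# open(2) + unlink(2) per pass: create and remove a scratch file.
timethis $count, sub
{
    open my $fh, '>', $path or die $!;
    close $fh;
    unlink $path or die $!;
};
```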

  21. Baseline kernel calls.
    Watch for “IO Wait” time during the test.
    This can block the entire system.
    Make sure IO Wait is yours.
    Or run the test when there isn’t any.

  22. Top is your friend
    So is procinfo-ng with “-D”:
    delta counts, current memory
    Notice if your process gets 100% CPU.
    Notice if the process jumps between cores.
    Notice if your task forks, threads.
    Look for non-zero I/O wait times.

  23. Red flags
    High I/O wait.
    Runnable jobs > number of cores.
    High stolen time.
    Lots of paging/swapping.
    Large changes in swap used.

  24. Fixing red-flags
    Run on specific cores:
    taskset -c X your_test_code;
    taskset -c N-M your_test_code;
    Use multi-core on same CPU for threads/forks.

  25. Memory hog
    Force non-running jobs out of core.
    Malloc a huge data area and exit:
    my @a = ( 'Foo' ) x 2 ** 32;
    exit 0;
    Then run your test quickly:

  26. Counting a baseline for the baseline
    Benchmark has its own baseline.
    Suggest using your own.
    Examine top or procinfo to estimate “stolen” time.

  27. Bad situation for a benchmark:
    18+ jobs on 4 cores with 0% idle.
    top - 15:32:52 up 1 day, 19:20, 19 users, load average: 18.35, 6.20, 2.79
    Tasks: 202 total, 6 running, 196 sleeping, 0 stopped, 0 zombie
    %Cpu0 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
    %Cpu1 : 94.1 us, 5.9 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
    %Cpu2 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
    %Cpu3 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
    KiB Mem : 16065988 total, 10037580 free, 3800400 used, 2228008 buff/cache
    KiB Swap: 33425404 total, 33425404 free, 0 used. 10689004 avail Mem
    PID USER NI RES SWAP %MEM %CPU TIME+ nTH S COMMAND
    6973 lembark 0 128776 0 0.8 62.5 0:05.20 1 R /usr/libexec/gcc/+
    7774 lembark 0 30004 0 0.2 62.5 0:00.32 1 R /usr/libexec/gcc/+
    4093 lembark 0 50640 0 0.3 56.2 0:06.19 1 R lib/unicore/mktab+

  28. procinfo viewing an overloaded system
    $ make -wk -j all test install; # building perl
    Memory: Total Used Free Buffers
    RAM: 16065988 6257268 9808720 0
    Swap: 33425404 0 33425404
    Bootup: Fri Jun 22 20:11:57 2018 Load average: 22.83 6.59 3.47 23/470
    22355
    user : 00:00:19.22 95.9% page in : 0
    nice : 00:00:00.00 0.0% page out: 67
    system: 00:00:00.80 4.0% page act: 598
    IOwait: 00:00:00.00 0.0% page dea: 0
    hw irq: 00:00:00.00 0.0% page flt: 307906
    sw irq: 00:00:00.00 0.0% swap in : 0
    idle : 00:00:00.00 0.0% swap out: 0
    uptime: 1d 19:33:38.00 context : 15811

  29. Prove reports times.
    find t -type f -name '*.t' | xargs -l1 prove;
    ...
    t/re/overload.t .. ok
    All tests successful.
    Files=1, Tests=87,
    0 wallclock secs ( 0.02 usr + 0.00 sys = 0.02 CPU)
    Result: PASS
    ...
    Tests isolate the parts of the code being tested.

  30. Inserting timings
    my $t0 = Benchmark->new;
    do{ something … };
    my $t1 = Benchmark->new;
    say timestr timediff $t1, $t0;
    Notice the order: t1 – t0.
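Benchmark also ships cmpthese, which runs several alternatives and prints a comparison table; both subs here are hypothetical examples of joining the numbers 1..100 into one string:

```perl
#!/usr/bin/env perl
# cmpthese compares alternatives at a glance; both subs are hypothetical.
use v5.10;
use strict;
use warnings;
use Benchmark qw( :hireswallclock cmpthese );

cmpthese -1,        # negative count: run each sub for about 1 CPU second
{
    interp => sub { my $str = "@{[ 1 .. 100 ]}" },
    join   => sub { my $str = join ' ', 1 .. 100 },
};
```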

  31. Object::Exercise
    The “benchmark” directive
    Gives time for each stage of test.
    Nice for timing progressive operations.

  32. End to end tests
    Add Benchmark to your #! code.
    time(1)
    Catch: No accounting for human time.
    Need runtime for things like web service.

  33. Timing back-end
    Exclude latency or measure it explicitly:
    get_request;
    push @timz, Benchmark->new;
    # ... handle the request ...
    push @timz, Benchmark->new;
    send_reply;
    Compute timediff on way out.
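A sketch of the pattern, with hypothetical stubs standing in for the real request handlers:

```perl
#!/usr/bin/env perl
# Timing a request handler while excluding I/O latency; get_request and
# send_reply are hypothetical stubs for the real service code.
use v5.10;
use strict;
use warnings;
use Benchmark qw( :hireswallclock timediff timestr );

sub get_request { select undef, undef, undef, 0.01 }  # fake network latency
sub send_reply  { }

my @timz;

get_request;                    # latency excluded: timer starts after it
push @timz, Benchmark->new;

# ... handle the request here ...

push @timz, Benchmark->new;     # timer stops before the reply goes out
send_reply;

# t1 - t0, reported on the way out.
say timestr timediff @timz[ 1, 0 ];
```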

  34. Devel::NYTProf
    By Tim Bunce.
    “New York second” resolution: better than :hireswallclock.
    His talk on profiling is an excellent introduction.

  35. Summary:
    Benchmarks don’t have to be damn lies.
    Control the environment.
    Establish baselines for units of work.
    Use Benchmark with “:hireswallclock”.
    Watch the system to verify isolation.
    taskset(1) and tmpfs (see mount(1)) can help.
