Slide 1

Understanding Rust Performance
Steve Jenson (@stevej)

Slide 4

BIOGRAPHY
• Engineer at Buoyant
• Work on Rust and Scala code

Slide 5

LINKERD-TCP
• TCP proxy written in Rust
• Designed to work in cloud-native environments
• Protocols coming!
• Apache licensed
• github.com/linkerd/linkerd-tcp

Slide 6

TACHO
• Stats library for instrumenting applications
• Counter: how many times something happened
• Timing: how long something takes (with a histogram)
• Gauge: a value at a point in time
• Apache licensed
• github.com/linkerd/tacho

Slide 7

TACHO example/multithread.rs

thread::spawn(move || {
    let mut prior = None;
    for i in 0..10_000_000 {
        let t0 = Timing::start();
        current_iter.set(i);
        loop_counter.incr(1);
        if let Some(p) = prior {
            loop_iter_us.add(p);
        }
        prior = Some(t0.elapsed_us());
    }
    if let Some(p) = prior {
        loop_iter_us.add(p);
    }
    work_done_tx.send(()).expect("could not send");
});

Slide 8

OUTLINE
• Causes of slowness
• Rust-specific pitfalls
• Tools
• IPC

Slide 9

CAUSES OF SLOWNESS

Slide 10

LOCK CONTENTION
CPU UTILIZATION
MEMORY STALLS

Slide 11

MEMORY STALLS

Slide 12

MEMORY HIERARCHY LATENCY GUIDELINES
• Register: 0.5 nanoseconds
• Last-Level Cache: 10 nanoseconds
• RAM: 100 nanoseconds
From "Numbers Every Programmer Should Know" by Jeff Dean

Slide 13

LOCK CONTENTION

Slide 14

LOCK CONTENTION
• spin loops
• blocking waits

Slide 15

CPU UTILIZATION

Slide 16

CPU UTILIZATION
• Can hide memory latency (slow instructions)
• Can hide lock contention (spin loops)
• Idleness is often counted as useful work
• "90% utilized" can also mean 80% waiting for RAM or disk

Slide 17

RUST-SPECIFIC PITFALLS

Slide 18

#[derive(Copy)] on large structs
• Copy semantics can be a life-saver
• Overuse can kill memory bandwidth
• Most common reason: "It was small when I first derived it!"

Slide 19

#[derive(Copy)] on large structs

#[derive(Copy)]
struct Person {
    user_id: u64,
    name: &'static str,
    dna: Vec<u8>, // Whoops! 800MB! Should be a reference
}
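A quick way to catch this pitfall is to check the struct's size with std::mem::size_of. A minimal sketch: InlinePerson and BorrowedPerson are illustrative stand-ins (with the slide's 800MB payload shrunk to an 800-byte array), not types from the talk.

```rust
use std::mem::size_of;

// Illustrative stand-in: data stored inline means every move or
// assignment copies the whole payload.
#[allow(dead_code)]
#[derive(Copy, Clone)]
struct InlinePerson {
    user_id: u64,
    dna: [u8; 800], // copied in full on every copy
}

// Borrowing instead keeps the struct pointer-sized.
#[allow(dead_code)]
struct BorrowedPerson<'a> {
    user_id: u64,
    dna: &'a [u8], // just a pointer + length
}

fn main() {
    println!("inline:   {} bytes", size_of::<InlinePerson>());
    println!("borrowed: {} bytes", size_of::<BorrowedPerson<'static>>());
    assert_eq!(size_of::<InlinePerson>(), 808);
    assert_eq!(size_of::<BorrowedPerson<'static>>(), 24);
}
```

A size assertion like this in a test can flag the "it was small when I first derived it" drift before it costs memory bandwidth.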

Slide 20

clone() in a loop
• Can saturate memory bandwidth
• clone() can be an easy way to satisfy the borrow checker
• Thankfully, easy to spot in a profile

Slide 21

clone() in a loop

for person in &people {
    friends.push(person.clone());
}
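The usual fix is to borrow rather than clone. A minimal sketch: the Person type and the two helper functions are stand-ins for illustration, not the talk's actual code.

```rust
// Stand-in type: each clone() copies the String's heap buffer.
#[derive(Clone)]
struct Person {
    name: String,
}

// Clones every element: one heap allocation per iteration.
fn friends_cloned(people: &[Person]) -> Vec<Person> {
    let mut friends = Vec::new();
    for person in people {
        friends.push(person.clone());
    }
    friends
}

// Borrows instead: pushes pointer-sized references with no per-element
// heap traffic, as long as the lifetimes allow it.
fn friends_borrowed(people: &[Person]) -> Vec<&Person> {
    let mut friends = Vec::new();
    for person in people {
        friends.push(person);
    }
    friends
}

fn main() {
    let people = vec![
        Person { name: "alice".to_string() },
        Person { name: "bob".to_string() },
    ];
    assert_eq!(friends_cloned(&people).len(), 2);
    assert_eq!(friends_borrowed(&people)[1].name, "bob");
}
```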

Slide 22

DEFAULT HASHER IN THE STANDARD HASHMAP
• Cryptographically strong for DoS protection
• Well-known trade-off for Rustaceans
• Surprises programmers new to Rust
• Lots of great alternatives!

Slide 23

DEFAULT HASHER IN THE STANDARD HASHMAP

let map: HashMap<String, u32, DefaultState<FnvHasher>> =
    Default::default();
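The mechanism for swapping the hasher is HashMap's third type parameter. A sketch using std's BuildHasherDefault with a deliberately trivial (and collision-prone) hasher, just to show the plumbing; real code would reach for a crate such as fnv.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// Toy hasher: fast but weak. Do not use where DoS resistance matters.
#[derive(Default)]
struct AddHasher(u64);

impl Hasher for AddHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 = self.0.wrapping_mul(31).wrapping_add(b as u64);
        }
    }
}

// A HashMap parameterized on the custom hasher instead of SipHash.
type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<AddHasher>>;

fn main() {
    let mut map: FastMap<String, u32> = FastMap::default();
    map.insert("hits".to_string(), 1);
    assert_eq!(map.get("hits"), Some(&1));
}
```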

Slide 24

EXPENSIVE ARGUMENTS TO expect()
• Don't use expensive expressions as arguments to expect()
• Not specific to expect(); be mindful of eager evaluation

Slide 25

EXPENSIVE ARGUMENTS TO expect()

let index = self.to_byte_index(index)
    .expect(&format!("invalid index! {:?} in {:?}", index, s));
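One way to defer the cost is unwrap_or_else, which only builds the message on the error path. A hedged sketch: byte_index here is a stand-in for the slide's to_byte_index, which isn't shown in the talk.

```rust
// Stand-in: byte offset of the nth character, if it exists.
fn byte_index(s: &str, index: usize) -> Option<usize> {
    s.char_indices().nth(index).map(|(i, _)| i)
}

fn lookup(s: &str, index: usize) -> usize {
    // The closure (and its format machinery) runs only on failure,
    // unlike expect(&format!(...)), which formats eagerly every call.
    byte_index(s, index)
        .unwrap_or_else(|| panic!("invalid index! {:?} in {:?}", index, s))
}

fn main() {
    // 'é' is two bytes in UTF-8, so the third char starts at byte 3.
    assert_eq!(lookup("héllo", 2), 3);
}
```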

Slide 26

PREALLOCATE Vec WHEN POSSIBLE
• If you have a sense of how many items you'll need, use that as your initial capacity

Slide 27

PREALLOCATE Vec WHEN POSSIBLE

let buf = {
    let sz = self.buffer_size.unwrap_or(DEFAULT_BUFFER_SIZE);
    vec![0u8; sz]
};
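The same idea for push-style growth is Vec::with_capacity, which reserves once instead of reallocating as the vector grows. A minimal sketch: DEFAULT_BUFFER_SIZE is an assumed value, not one from the talk.

```rust
// Assumed constant for illustration.
const DEFAULT_BUFFER_SIZE: usize = 8 * 1024;

fn make_buf(requested: Option<usize>) -> Vec<u8> {
    let sz = requested.unwrap_or(DEFAULT_BUFFER_SIZE);
    // One allocation up front; no reallocation while filling.
    let mut buf = Vec::with_capacity(sz);
    buf.resize(sz, 0u8);
    buf
}

fn main() {
    let buf = make_buf(None);
    assert_eq!(buf.len(), DEFAULT_BUFFER_SIZE);
    assert!(buf.capacity() >= DEFAULT_BUFFER_SIZE);
}
```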

Slide 28

MAC TOOLS
• Instruments
• cargo bench
• cargo benchcmp

LINUX TOOLS
• perf
• FlameGraphs
• VTune
• cargo bench
• cargo benchcmp

Slide 29

CARGO BENCH AND BENCHCMP

Slide 30

CARGO BENCH
• Microbenchmarking tool
• Part of the standard tooling
• Great for important parts of your API

Slide 31

CARGO BENCH

#[bench]
fn bench_counter_create(b: &mut Bencher) {
    let (metrics, _) = super::new();
    b.iter(move || {
        let _ = metrics.counter("counter_name");
    });
}

Slide 32

CARGO BENCH

test tests::bench_counter_create       ... bench:     143 ns/iter (+/- 71)
test tests::bench_counter_create_x1000 ... bench: 429,854 ns/iter (+/- 277,291)
test tests::bench_counter_update       ... bench:      23 ns/iter (+/- 10)
test tests::bench_counter_update_x1000 ... bench:     955 ns/iter (+/- 141)
test tests::bench_gauge_create         ... bench:     136 ns/iter (+/- 19)
test tests::bench_gauge_create_x1000   ... bench: 415,618 ns/iter (+/- 301,114)
test tests::bench_gauge_update         ... bench:      17 ns/iter (+/- 7)
test tests::bench_gauge_update_x1000   ... bench:   3,327 ns/iter (+/- 519)
test tests::bench_scope_clone          ... bench:      64 ns/iter (+/- 11)
test tests::bench_scope_clone_x1000    ... bench: 177,623 ns/iter (+/- 103,595)
test tests::bench_scope_label          ... bench:     164 ns/iter (+/- 91)
test tests::bench_scope_label_x1000    ... bench: 269,845 ns/iter (+/- 54,753)
test tests::bench_stat_add_x1000       ... bench:   2,575 ns/iter (+/- 425)
test tests::bench_stat_create          ... bench:     131 ns/iter (+/- 45)
test tests::bench_stat_create_x1000    ... bench: 412,913 ns/iter (+/- 121,406)
test tests::bench_stat_update          ... bench:      47 ns/iter (+/- 4)
test tests::bench_stat_update_x1000    ... bench:   2,694 ns/iter (+/- 1,243)

Slide 33

CARGO BENCHCMP
• Compare two cargo bench runs
• Great for avoiding performance regressions
• github.com/BurntSushi/cargo-benchcmp

Slide 34

CARGO BENCHCMP

Slide 35

HOW TO CONSTRUCT A MACROBENCHMARK

Slide 36

HOW TO CONSTRUCT A MACROBENCHMARK
• Microbenchmarks are limited in utility
• Measure your code running in context
• Exercise a reasonable subset of your API
• In one of our macrobenchmarks, we loop 10,000,000 times and do work each loop

Slide 37

HOW TO CONSTRUCT A MACROBENCHMARK
• cargo build --release --example multithread
• Always use release builds for profiling
• For symbols, add this to your Cargo.toml:

[profile.release]
debug = true

Slide 38

HOW TO CONSTRUCT A MACROBENCHMARK
• tacho has two macrobenchmarks
• single-threaded (simple.rs)
• multi-threaded (multithread.rs)

Slide 39

TACHO example/multithread.rs

thread::spawn(move || {
    let mut prior = None;
    for i in 0..10_000_000 {
        let t0 = Timing::start();
        current_iter.set(i);
        loop_counter.incr(1);
        if let Some(p) = prior {
            loop_iter_us.add(p);
        }
        prior = Some(t0.elapsed_us());
    }
    if let Some(p) = prior {
        loop_iter_us.add(p);
    }
    work_done_tx.send(()).expect("could not send");
});

Slide 41

INSTRUCTIONS PER CYCLE
• IPC is a useful empirical metric
• How many instructions are completed every clock cycle

Slide 42

BASIC CPU ARCHITECTURE
• Executes instructions serially
• How we learn CPU architecture in school
• Not how it works on modern Intel CPUs
• Deep pipelines, dependent on other work

Slide 45

HOW DO WE KNOW IF WE'RE HITTING PEAK PERFORMANCE?
• Instructions can depend on other instructions
• Performance is dictated by keeping the pipeline full
• So how do we know we're doing well?

Slide 46

INTEL PERFORMANCE COUNTERS

Slide 47

INTEL PERFORMANCE COUNTERS
• Intel engineers had the same question
• Added Performance Monitor Counters
• How often certain events happen
• Allow you to calculate ratios

Slide 48

INTEL PERFORMANCE COUNTERS
• The number of counters is daunting
• Hundreds of counters
• 400+ pages of documentation
• Allow you to calculate derived metrics

Slide 49

INSTRUCTIONS PER CYCLE

Slide 50

INSTRUCTIONS PER CYCLE
• How many instructions the core can "retire" per cycle
• < 1.0 often means stalled on memory
• > 1.0 often means instruction bound
• You can learn this empirically!
• On a 3-wide core, the theoretical max IPC is 3.0

Slide 52

INSTRUMENTS (MAC)

Slide 55

INTEL PMCS
• Available directly in Instruments: Counter > Recording Options > Events
• Can create formulas from PMCs

Slide 58

INSTRUMENTS (MAC) TAKEAWAY
• Lots of easy-to-use performance tools
• Unfortunately, many are specific to Cocoa programming
• A simple way to access performance counters

Slide 59

PERF (LINUX)

Slide 60

PERF
• Linux kernel and user space
• Sampling profiler with configurable sampling rate
• Constantly being improved

Slide 61

PERF STAT - IPC

$ sudo perf stat target/release/examples/multithread

Performance counter stats for 'target/release/examples/multithread':

     12268.515738   task-clock (msec)   #    1.601 CPUs utilized
          374,342   context-switches    #    0.031 M/sec
                3   cpu-migrations      #    0.000 K/sec
              505   page-faults         #    0.041 K/sec
   26,206,859,982   cycles              #    2.136 GHz
   13,711,393,152   instructions        #    0.52  instructions per cycle
    2,838,706,433   branches            #  231.381 M/sec
        7,730,077   branch-misses       #    0.27% of all branches

      7.663850635 seconds time elapsed
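The 0.52 figure is just the instructions counter divided by the cycles counter. A quick sanity check of that arithmetic:

```rust
// IPC = instructions retired / cycles elapsed. The figures are the
// perf stat counters from the multithread example above.
fn ipc(instructions: u64, cycles: u64) -> f64 {
    instructions as f64 / cycles as f64
}

fn main() {
    let v = ipc(13_711_393_152, 26_206_859_982);
    // Matches perf's reported 0.52 instructions per cycle; well below
    // 1.0, which often indicates time stalled on memory.
    assert!((v - 0.52).abs() < 0.01);
    println!("IPC = {:.2}", v);
}
```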

Slide 62

PERF CACHE MISSES/HITS

Slide 63

PERF CACHE MISSES/HITS

Slide 64

PERF TAKEAWAY
• Deep tooling
• Low overhead
• Kernel and user space
• Linux-specific
• Scheduler analysis
• IO and network subsystems

Slide 65

FLAMEGRAPHS

Slide 66

FLAMEGRAPHS
• Sample what's on the CPU
• Aggregate the call stacks
• Gives you a sense of the shape of your program
• The color change has no semantic value
• Mouse-over for extra info
• Can drill into stacks
• Peak is what's on the CPU at sample time
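On Linux, the usual recipe for the steps above is perf plus Brendan Gregg's FlameGraph scripts. A sketch, assuming the scripts (stackcollapse-perf.pl, flamegraph.pl) are cloned and on PATH and the binary was built with debug symbols:

```shell
# Sample on-CPU stacks at 99 Hz with call graphs.
perf record -F 99 -g -- target/release/examples/multithread

# Fold the recorded stacks and render the interactive SVG.
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
```

The -F 99 rate avoids lockstep with common timer frequencies; open flame.svg in a browser to drill into stacks.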

Slide 69

FLAMEGRAPHS TAKEAWAY
• Really useful for looking at a long-running program
• Netflix has pioneered this technique for measuring the health of their online services
• Needs symbols!

Slide 70

VTUNE

Slide 71

VTUNE
• Made by Intel
• Helps make sense of the many performance counters
• Tooltips!
• GUI (works with ssh X forwarding on macOS)
• CLI with CSV output
• Free for open source developers

Slide 79

VTUNE TAKEAWAY
• Intel knows their CPUs better than anyone
• VTune is detailed and powerful
• Overwhelming at first
• Helpful tooltips!

Slide 80

LESSON LEARNED
• While preparing this talk, I learned something!
• VTune highlighted a "Remote Cache" issue
• Oh no! One of my threads was running on a different socket!
• Cache hit rate improved with taskset
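taskset pins a process (and its threads) to a chosen CPU set. A sketch, assuming cores 0-3 sit on one socket; check the actual topology with lscpu before picking the mask:

```shell
# Pin the benchmark's threads to cores 0-3 so they share a last-level
# cache instead of bouncing cache lines across sockets.
taskset -c 0-3 target/release/examples/multithread
```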

Slide 82

BEFORE TASKSET

Slide 83

AFTER TASKSET

Slide 84

TAKEAWAYS
• Performance is hard to understand
• Need an empirical measurement
• IPC is one empirical measurement
• The best tool is the one you use

Slide 85

THANKS!

A special thank you to Eliza Weisman for the Instruments walk-through, screenshots, and feedback

Slide 86

@linkerd • github.com/linkerd • linkerd.io
@buoyantIO • buoyant.io