Improving Rust Performance Through Profiling and Benchmarking

This talk will compare and contrast common industry tool support for profiling and debugging Rust applications. We'll discuss our experiences finding and fixing performance problems in a production Rust application.

Steve Jenson

August 19, 2017

Transcript

  1. LINKERD-TCP • TCP proxy written in Rust • Designed to work in cloud-native environments • Protocols coming! • Apache licensed • github.com/linkerd/linkerd-tcp
  2. TACHO • Stats library for instrumenting applications • Counter - how many times something happened • Timing - how long something takes (w/ histogram) • Gauge - value at a point in time • Apache licensed • github.com/linkerd/tacho
  3. TACHO example/multithread.rs

        thread::spawn(move || {
            let mut prior = None;
            for i in 0..10_000_000 {
                let t0 = Timing::start();
                current_iter.set(i);
                loop_counter.incr(1);
                if let Some(p) = prior {
                    loop_iter_us.add(p);
                }
                prior = Some(t0.elapsed_us());
            }
            if let Some(p) = prior {
                loop_iter_us.add(p);
            }
            work_done_tx.send(()).expect("could not send");
        });
  4. MEMORY HIERARCHY LATENCY GUIDELINES • Register: 0.5 nanoseconds • Last-Level Cache: 10 nanoseconds • RAM: 100 nanoseconds • Source: "Numbers every programmer should know" by Jeff Dean
  5. CPU UTILIZATION • Can hide memory latency (slow instructions) • Can hide lock contention (spin loops) • Idleness is often counted as useful work • 90% utilized can also mean 80% waiting for RAM or disk
  6. #[derive(Copy)] on large structs • Copy semantics can be a life-saver • Overuse can kill memory bandwidth • Most common reason: "It was small when I first derived!"
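  7. #[derive(Copy)] on large structs

        // Copy requires Clone, and every field must itself be Copy.
        #[derive(Clone, Copy)]
        struct Person<'a> {
            user_id: u64,
            name: &'a str,
            // Whoops! 800MB! Should be a reference.
            // An inline array stands in for the slide's `Vec[u8]`, which cannot be Copy.
            dna: [u8; 800_000_000],
        }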
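A minimal sketch of why a large `Copy` type is costly, assuming a hypothetical `Sample` struct with a 4 KiB inline payload (neither the type nor the functions come from the talk). Passing it by value may copy every byte on each call, while passing a reference moves only a pointer.

```rust
use std::mem::size_of;

// Hypothetical type for illustration: a Copy struct with a large inline payload.
#[derive(Clone, Copy)]
struct Sample {
    id: u64,
    payload: [u8; 4096],
}

// Takes the struct by value: each call may copy all of its bytes
// (unless the optimizer manages to elide the copy).
fn checksum_by_value(s: Sample) -> u64 {
    s.id + s.payload.iter().map(|&b| b as u64).sum::<u64>()
}

// Takes a reference: only a pointer is passed.
fn checksum_by_ref(s: &Sample) -> u64 {
    s.id + s.payload.iter().map(|&b| b as u64).sum::<u64>()
}

fn main() {
    let s = Sample { id: 7, payload: [1; 4096] };
    println!("size_of::<Sample>()  = {} bytes", size_of::<Sample>());
    println!("size_of::<&Sample>() = {} bytes", size_of::<&Sample>());
    // `s` is still usable after the by-value call because it is Copy.
    assert_eq!(checksum_by_value(s), checksum_by_ref(&s));
}
```

Both functions return the same result; the difference only shows up as memory traffic in a profile.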
  8. clone() in a loop • Saturates memory bandwidth • clone() can be an easy way to satisfy the borrow checker • Thankfully, easy to spot in a profile
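As an illustration of the pattern, here is a small sketch with a hypothetical `Record` type (not from the talk). The first function clones inside the loop just to read a field; the second borrows and does the same work without allocating.

```rust
// Hypothetical record type for illustration.
#[derive(Clone)]
struct Record {
    payload: Vec<u8>,
}

// Clones every record just to read it: one heap allocation and a full
// copy of `payload` per iteration.
fn total_len_cloning(records: &[Record]) -> usize {
    let mut total = 0;
    for r in records {
        let owned = r.clone();
        total += owned.payload.len();
    }
    total
}

// Borrows each record instead: no allocation, no copy.
fn total_len_borrowing(records: &[Record]) -> usize {
    records.iter().map(|r| r.payload.len()).sum()
}

fn main() {
    let records = vec![Record { payload: vec![0; 1024] }; 100];
    assert_eq!(total_len_cloning(&records), total_len_borrowing(&records));
    println!("total bytes: {}", total_len_borrowing(&records));
}
```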
  9. DEFAULT HASHER IN THE STANDARD HASHMAP • Cryptographically strong for DoS protection • Well-known trade-off for Rustaceans • Surprises programmers new to Rust • Lots of great alternatives!
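  10. DEFAULT HASHER IN THE STANDARD HASHMAP

        use fnv::FnvHasher; // from the `fnv` crate
        use std::collections::HashMap;
        use std::hash::BuildHasherDefault;

        // BuildHasherDefault replaces the older, unstable DefaultState.
        let map: HashMap<Vec<u8>, u32, BuildHasherDefault<FnvHasher>> = Default::default();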
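To show what plugging in a different hasher involves, here is a std-only sketch with a hand-written 64-bit FNV-1a hasher. The `Fnv1a` type and `FnvMap` alias are hypothetical names for this example; in practice the `fnv` crate provides a ready-made equivalent, and a non-cryptographic hasher like this gives up the DoS protection of the default.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// A hand-written 64-bit FNV-1a hasher, for illustration only.
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3); // 64-bit FNV prime
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

// A HashMap that builds a fresh `Fnv1a` for every hash computation.
type FnvMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;

fn main() {
    let mut map: FnvMap<Vec<u8>, u32> = FnvMap::default();
    map.insert(b"key".to_vec(), 1);
    *map.entry(b"key".to_vec()).or_insert(0) += 1;
    println!("{:?}", map.get(&b"key"[..]));
}
```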
  11. EXPENSIVE ARGUMENTS TO expect() • Don’t use expensive expansions as arguments to expect() • Not specific to expect(), be mindful of eagerness
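A small sketch of the eagerness problem, using a hypothetical `describe` function as a stand-in for an expensive message builder. Arguments to `expect()` are evaluated before the call, so the message is built even on the success path; moving the work into a closure passed to `unwrap_or_else` defers it to the failure path.

```rust
use std::cell::Cell;

// Stand-in for an expensive message builder; counts how often it runs.
fn describe(calls: &Cell<u32>, id: u32) -> String {
    calls.set(calls.get() + 1);
    format!("no value for request {}", id)
}

// Eager: the message is built on every call, even when `value` is Some.
fn eager(value: Option<u32>, calls: &Cell<u32>) -> u32 {
    value.expect(&describe(calls, 42))
}

// Lazy: the closure only runs on the failure path.
fn lazy(value: Option<u32>, calls: &Cell<u32>) -> u32 {
    value.unwrap_or_else(|| panic!("{}", describe(calls, 42)))
}

fn main() {
    let calls = Cell::new(0);
    eager(Some(1), &calls);
    println!("eager built the message {} time(s)", calls.get());
    calls.set(0);
    lazy(Some(1), &calls);
    println!("lazy built the message {} time(s)", calls.get());
}
```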
  12. PREALLOCATE Vec WHEN POSSIBLE • If you have a sense of how many items you’ll need, use that as your initial capacity
  13. PREALLOCATE Vec WHEN POSSIBLE

        let buf = {
            let sz = self.buffer_size.unwrap_or(DEFAULT_BUFFER_SIZE);
            vec![0u8; sz]
        };
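For a vector that is grown with `push` rather than filled up front, `Vec::with_capacity` gives the same benefit. The sketch below (the `fill_preallocated` helper is hypothetical) checks that the buffer never moves while it is filled to its requested capacity.

```rust
// Fills a vector that was sized up front and reports whether its buffer moved.
fn fill_preallocated(n: usize) -> (Vec<u64>, bool) {
    let mut v: Vec<u64> = Vec::with_capacity(n);
    let start = v.as_ptr();
    for i in 0..n as u64 {
        v.push(i);
    }
    // If the pointer changed, the vector had to reallocate along the way.
    let moved = v.as_ptr() != start;
    (v, moved)
}

fn main() {
    let (v, moved) = fill_preallocated(1_000);
    println!(
        "len = {}, capacity = {}, reallocated = {}",
        v.len(),
        v.capacity(),
        moved
    );
}
```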
  14. MAC TOOLS • Instruments • cargo bench • cargo benchcmp
      LINUX TOOLS • perf • FlameGraphs • VTune • cargo bench • cargo benchcmp
  15. CARGO BENCH • Microbenchmarking tool • Part of the standard tooling • Great for important parts of your API
  16. CARGO BENCH

        #[bench]
        fn bench_counter_create(b: &mut Bencher) {
            let (metrics, _) = super::new();
            b.iter(move || {
                let _ = metrics.counter("counter_name");
            });
        }
  17. CARGO BENCH

        test tests::bench_counter_create       ... bench:         143 ns/iter (+/- 71)
        test tests::bench_counter_create_x1000 ... bench:     429,854 ns/iter (+/- 277,291)
        test tests::bench_counter_update       ... bench:          23 ns/iter (+/- 10)
        test tests::bench_counter_update_x1000 ... bench:         955 ns/iter (+/- 141)
        test tests::bench_gauge_create         ... bench:         136 ns/iter (+/- 19)
        test tests::bench_gauge_create_x1000   ... bench:     415,618 ns/iter (+/- 301,114)
        test tests::bench_gauge_update         ... bench:          17 ns/iter (+/- 7)
        test tests::bench_gauge_update_x1000   ... bench:       3,327 ns/iter (+/- 519)
        test tests::bench_scope_clone          ... bench:          64 ns/iter (+/- 11)
        test tests::bench_scope_clone_x1000    ... bench:     177,623 ns/iter (+/- 103,595)
        test tests::bench_scope_label          ... bench:         164 ns/iter (+/- 91)
        test tests::bench_scope_label_x1000    ... bench:     269,845 ns/iter (+/- 54,753)
        test tests::bench_stat_add_x1000       ... bench:       2,575 ns/iter (+/- 425)
        test tests::bench_stat_create          ... bench:         131 ns/iter (+/- 45)
        test tests::bench_stat_create_x1000    ... bench:     412,913 ns/iter (+/- 121,406)
        test tests::bench_stat_update          ... bench:          47 ns/iter (+/- 4)
        test tests::bench_stat_update_x1000    ... bench:       2,694 ns/iter (+/- 1,243)
  18. CARGO BENCHCMP • Compare two cargo bench runs • Great for avoiding performance regressions • github.com/BurntSushi/cargo-benchcmp
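To give a feel for what comparing two runs involves, here is a rough sketch that parses `cargo bench` result lines and reports the percentage change. It is not how cargo-benchcmp is implemented; the helper names and the second run's numbers are invented for this example.

```rust
// Parses one line of `cargo bench` output, e.g.
// "test tests::bench_counter_create ... bench: 143 ns/iter (+/- 71)",
// into (benchmark name, ns per iteration).
fn parse_bench_line(line: &str) -> Option<(String, u64)> {
    let mut parts = line.split_whitespace();
    if parts.next()? != "test" {
        return None;
    }
    let name = parts.next()?.to_string();
    // Skip ahead to the "bench:" marker; the next token is the timing.
    parts.find(|&tok| tok == "bench:")?;
    let ns: u64 = parts.next()?.replace(',', "").parse().ok()?;
    Some((name, ns))
}

// Percentage change from `old` to `new`; positive means slower.
fn percent_change(old: u64, new: u64) -> f64 {
    (new as f64 - old as f64) / old as f64 * 100.0
}

fn main() {
    let before = "test tests::bench_counter_create ... bench: 143 ns/iter (+/- 71)";
    // Hypothetical second run, invented for this example.
    let after = "test tests::bench_counter_create ... bench: 120 ns/iter (+/- 50)";
    let (name, old) = parse_bench_line(before).unwrap();
    let (_, new) = parse_bench_line(after).unwrap();
    println!(
        "{}: {} -> {} ns/iter ({:+.2}%)",
        name,
        old,
        new,
        percent_change(old, new)
    );
}
```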
  19. HOW TO CONSTRUCT A MACROBENCHMARK • Microbenchmarks are limited in utility • Measure your code running in context • Exercise a reasonable subset of your API • In one of our macrobenchmarks, we loop 10,000,000 times and do work each loop
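The shape of such a macrobenchmark can be sketched with the standard library alone. This is an assumption-laden stand-in, not Tacho's example: the `run_macrobenchmark` function is hypothetical, and an atomic counter takes the place of the instrumented work.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;
use std::time::{Duration, Instant};

// Runs `iterations` loop bodies on a worker thread and returns the final
// counter value together with the wall-clock time for the whole run.
fn run_macrobenchmark(iterations: u64) -> (u64, Duration) {
    let counter = Arc::new(AtomicU64::new(0));
    let worker_counter = Arc::clone(&counter);
    let (done_tx, done_rx) = mpsc::channel();

    let started = Instant::now();
    let worker = thread::spawn(move || {
        for _ in 0..iterations {
            // Stand-in for the instrumented work done on each iteration.
            worker_counter.fetch_add(1, Ordering::Relaxed);
        }
        done_tx.send(()).expect("could not send");
    });

    done_rx.recv().expect("worker exited without signalling");
    worker.join().expect("worker panicked");
    (counter.load(Ordering::Relaxed), started.elapsed())
}

fn main() {
    let (count, elapsed) = run_macrobenchmark(10_000_000);
    println!("{} iterations in {:?}", count, elapsed);
}
```

A binary like this, built in release mode, is what gets handed to the profilers rather than timed by hand.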
  20. HOW TO CONSTRUCT A MACROBENCHMARK • cargo build --release --example multithread • Always use release builds for profiling • For symbols, add this to your Cargo.toml:

        [profile.release]
        debug = true
  21. HOW TO CONSTRUCT A MACROBENCHMARK • tacho has two macrobenchmarks • single-threaded (simple.rs) • multi-threaded (multithread.rs)
  22. TACHO example/multithread.rs

        thread::spawn(move || {
            let mut prior = None;
            for i in 0..10_000_000 {
                let t0 = Timing::start();
                current_iter.set(i);
                loop_counter.incr(1);
                if let Some(p) = prior {
                    loop_iter_us.add(p);
                }
                prior = Some(t0.elapsed_us());
            }
            if let Some(p) = prior {
                loop_iter_us.add(p);
            }
            work_done_tx.send(()).expect("could not send");
        });
  23. INSTRUCTIONS PER CYCLE • IPC is a useful empirical metric • How many instructions are completed every clock cycle
  24. BASIC CPU ARCHITECTURE • Executes instructions serially • How we learn CPU architecture in school • Not how it works on modern Intel CPUs • Deep pipelines, dependent on other work
  25. HOW DO WE KNOW IF WE’RE HITTING PEAK PERFORMANCE? • Instructions can depend on other instructions • And performance is dictated by a full pipeline • So how do we know we’re doing well?
  26. INTEL PERFORMANCE COUNTERS • Intel engineers had the same question • Added Performance Monitor Counters • How often certain events happen • Allow you to calculate ratios
  27. INTEL PERFORMANCE COUNTERS • Number of counters is daunting • Hundreds of counters • 400+ pages of documentation • Allow you to calculate derived metrics
  28. INSTRUCTIONS PER CYCLE • How many instructions can the core “retire” per cycle • < 1.0 often means memory stalled • > 1.0 often means instruction stalled • You can learn this empirically! • On a 3 wide core, theoretical max IPC of 3.0
  29. INTEL PMCS • Available directly in Instruments • Counter • Recording Options • Events • Can create formula from PMCs
  30. INSTRUMENTS (MAC) TAKEAWAY • Lots of easy-to-use performance tools • Unfortunately, many specific to Cocoa programming • A simple way to access Performance Counters
  31. PERF • Linux kernel and user space • Sampling profiler with configurable sampling rate • Constantly being improved
  32. PERF STAT — IPC

        $ sudo perf stat target/release/examples/multithread

         Performance counter stats for 'target/release/examples/multithread':

              12268.515738      task-clock (msec)     #    1.601 CPUs utilized
                   374,342      context-switches      #    0.031 M/sec
                         3      cpu-migrations        #    0.000 K/sec
                       505      page-faults           #    0.041 K/sec
            26,206,859,982      cycles                #    2.136 GHz
            13,711,393,152      instructions          #    0.52  instructions per cycle
             2,838,706,433      branches              #  231.381 M/sec
                 7,730,077      branch-misses         #    0.27% of all branches

               7.663850635 sec time elapsed
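The derived figures in the `perf stat` output are simple ratios of the raw counters. As a small worked example (the function names are my own), IPC is instructions divided by cycles, and "CPUs utilized" is CPU time divided by wall-clock time:

```rust
// Instructions retired per clock cycle.
fn ipc(instructions: u64, cycles: u64) -> f64 {
    instructions as f64 / cycles as f64
}

// Average number of CPUs kept busy: CPU time divided by wall-clock time.
fn cpus_utilized(task_clock_ms: f64, elapsed_s: f64) -> f64 {
    task_clock_ms / 1000.0 / elapsed_s
}

fn main() {
    // Counter values taken from the perf stat slide.
    let instructions = 13_711_393_152;
    let cycles = 26_206_859_982;
    println!("IPC: {:.2}", ipc(instructions, cycles));
    println!(
        "CPUs utilized: {:.3}",
        cpus_utilized(12268.515738, 7.663850635)
    );
}
```

With these counters the IPC rounds to 0.52, which by the rule of thumb on the IPC slide suggests the benchmark is often stalled on memory.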
  33. PERF TAKEAWAY • Deep tooling • Low overhead • Kernel and user space • Linux-specific • Scheduler analysis • IO and network subsystems
  34. FLAMEGRAPHS • Sample what’s on the CPU • Aggregate the call stacks • Gives you a sense of the shape of your program • The color change has no semantic value • Mouse-over for extra info • Can drill into stacks • Peak is what’s on the CPU at sample time
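The "aggregate the call stacks" step can be sketched in a few lines. The samples below are invented, and `fold_stacks` is a hypothetical helper rather than part of any FlameGraph tooling; it produces the semicolon-joined "folded" form with one count per distinct stack.

```rust
use std::collections::BTreeMap;

// Collapses sampled call stacks (outermost frame first) into
// "frame;frame;frame" keys with a count of how often each was seen.
fn fold_stacks(samples: &[Vec<&str>]) -> BTreeMap<String, u64> {
    let mut folded = BTreeMap::new();
    for stack in samples {
        *folded.entry(stack.join(";")).or_insert(0) += 1;
    }
    folded
}

fn main() {
    // Hypothetical samples, invented for this example.
    let samples = vec![
        vec!["main", "run", "hash"],
        vec!["main", "run", "hash"],
        vec!["main", "run", "clone"],
        vec!["main", "idle"],
    ];
    for (stack, count) in fold_stacks(&samples) {
        println!("{} {}", stack, count);
    }
}
```

The width of each box in a flame graph corresponds to these counts, which is why wide plateaus point at where the CPU time goes.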
  35. FLAMEGRAPHS TAKEAWAY • Really useful for looking at a long-running program • Netflix has pioneered this technique for measuring the health of their online services • Needs symbols!
  36. VTUNE • Made by Intel • Helps make sense of the many Performance Counters • Tooltips! • GUI (works with ssh X forwarding on macOS) • CLI with CSV • Free for open source developers
  37. VTUNE TAKEAWAY • Intel knows their CPUs better than anyone • VTune is detailed and powerful • Overwhelming at first • Helpful tooltips!
  38. LESSON LEARNED • While preparing this talk, I learned something! • VTune highlighted a ‘Remote Cache’ issue • Oh no! One of my threads was running on a different socket! • Cache hit rate improved with taskset
  39. TAKEAWAYS • Performance is hard to understand • Need an empirical measurement • IPC is one empirical measurement • The best tool is the one you use
  40. THANKS! A special thank you to Eliza Weisman for the Instruments walk-through, screenshots, and feedback