Osna2020 - What Do You Mean by "Cache Friendly"?

What Do You Mean by “Cache Friendly”? – Osnabrück C++
July 2020 © Björn Fahller @bjorn_fahller 1/207 What Do You Mean by “Cache Friendly”? Björn Fahller

    typedef uint32_t (*timer_cb)(void*);

    struct timer {
        uint32_t deadline = 0;
        timer_cb callback = nullptr;
        void* userp = nullptr;
        struct timer* next = this;
        struct timer* prev = this;
    };

    static timer timeouts{};   // sentinel node of a circular doubly linked list

    timer* schedule_timer(uint32_t deadline, timer_cb cb, void* userp)
    {
        // Walk backwards from the tail, past every timer that is after the new deadline.
        timer* iter = timeouts.prev;
        while (iter != &timeouts && is_after(iter->deadline, deadline))
            iter = iter->prev;
        // add_behind() is not shown on the slide; it is assumed to return the new node.
        return add_behind(iter, deadline, cb, userp);
    }

    void cancel_timer(timer* t)
    {
        // Unlink the node from its neighbours, then free it.
        t->next->prev = t->prev;
        t->prev->next = t->next;
        delete t;
    }
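The slide uses is_after() and add_behind() without showing them. The following is only a sketch of what they might look like, not the author's code. It assumes that is_after(lh, rh) means "lh is later than rh" on a wrapping 32-bit clock, and that add_behind() links a newly allocated node directly after the given position and returns it. The struct is repeated so the block stands alone.

```cpp
#include <cstdint>

typedef std::uint32_t (*timer_cb)(void*);

struct timer {
    std::uint32_t deadline = 0;
    timer_cb callback = nullptr;
    void* userp = nullptr;
    timer* next = this;
    timer* prev = this;
};

// Assumed semantics: true when lh is later than rh, tolerating wrap-around.
// lh counts as later if it is ahead of rh by less than half the 32-bit range.
inline bool is_after(std::uint32_t lh, std::uint32_t rh)
{
    return lh != rh && (lh - rh) < 0x80000000u;
}

// Assumed semantics: allocate a node and link it in directly after pos.
inline timer* add_behind(timer* pos, std::uint32_t deadline, timer_cb cb, void* userp)
{
    timer* t = new timer{deadline, cb, userp, pos->next, pos};
    pos->next->prev = t;
    pos->next = t;
    return t;
}
```

With these two helpers the schedule_timer() from the slide keeps the list sorted by deadline, earliest first after the sentinel.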

Simplistic model of cache behaviour

    const int* hot = 0x4004;
    const int* cold = 0x4048;
    int* also_cold = 0x4080;
    int a = *hot;
    int c = *cold;
    *also_cold = a;
    also_cold[1] = c;

[Animation: a small cache in front of memory lines 0x4000 to 0x40F0, spaced 0x10 apart. The cache starts out holding the lines 0x3A10, 0x4000, 0x4010 and 0x4FF0. Over the frames, 0x3A10 and 0x4010 leave the cache and 0x4040 and 0x4080 appear, so it ends up holding 0x4000, 0x4FF0, 0x4040 and 0x4080.]

Simplistic model of cache behaviour

Includes
• The cache is small
• and consists of fixed size lines
• and data access hit is very fast
• and data access miss is very slow

Excludes
• Multiple levels of caches
• Associativity
• Threading

All models are wrong, but some are useful
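The simplistic model can be played with in a few lines of code. The sketch below is a toy, not a description of real hardware: it assumes a fully associative cache with least-recently-used eviction (the slides do not name an eviction policy), and toy_cache is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Toy model: a handful of fixed-size lines, fully associative,
// least-recently-used eviction (an assumption made for this sketch).
class toy_cache {
public:
    toy_cache(std::size_t line_count, std::uint32_t line_size)
        : line_count_(line_count), line_size_(line_size)
    {
        assert(line_count > 0 && line_size > 0);
    }

    // Returns true on a hit. On a miss the line is loaded,
    // evicting the least recently used line if the cache is full.
    bool access(std::uint32_t address)
    {
        const std::uint32_t line = address - address % line_size_;
        auto it = std::find(lines_.begin(), lines_.end(), line);
        if (it != lines_.end()) {
            lines_.erase(it);          // move the line to the front: most recently used
            lines_.push_front(line);
            ++hits_;
            return true;
        }
        if (lines_.size() == line_count_)
            lines_.pop_back();         // evict the least recently used line
        lines_.push_front(line);
        ++misses_;
        return false;
    }

    std::size_t hits() const { return hits_; }
    std::size_t misses() const { return misses_; }

private:
    std::size_t line_count_;
    std::uint32_t line_size_;
    std::deque<std::uint32_t> lines_;  // front = most recently used
    std::size_t hits_ = 0;
    std::size_t misses_ = 0;
};
```

Constructing toy_cache(4, 16) and feeding it the addresses 0x4004, 0x4048, 0x4080 and 0x4084 replays the four accesses of the earlier slide; which of them hit depends on what the cache was warmed up with beforehand.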

Analysis of implementation

    // Fire the earliest timer, if any. Returns false when no timers remain.
    bool shoot_first()
    {
        if (timeouts.next == &timeouts) return false;
        timer* t = timeouts.next;
        t->callback(t->userp);
        cancel_timer(t);
        return true;
    }

    int main()
    {
        std::random_device rd;
        std::mt19937 gen(rd());
        std::uniform_int_distribution<uint32_t> dist;
        for (int k = 0; k < 10; ++k) {
            timer* prev = nullptr;
            for (int i = 0; i < 20'000; ++i) {
                timer* t = schedule_timer(dist(gen), [](void*){return 0U;}, nullptr);
                if (i & 1) cancel_timer(prev);   // cancel every other timer
                prev = t;
            }
            while (shoot_first())
                ;
        }
    }

Analysis of implementation

    valgrind --tool=callgrind --cache-sim=yes --dump-instr=yes --branch-sim=yes

• --tool=callgrind: Essentially a profiler that collects info about call hierarchies, number of calls, and time spent. The CPU simulator is not cycle accurate, so see timing results as a broad picture.
• --cache-sim=yes: Simulates a CPU cache, flattened to 2 levels, L1 and LL. It shows you where you get cache misses. L1 is by default a model of your host CPU L1, but you can change size, line-size, and associativity.
• --dump-instr=yes: Collects statistics per instruction instead of per source line. Can help pinpoint bottlenecks.
• --branch-sim=yes: Simulates a branch predictor.

Very slow!

Live demo

    typedef uint32_t (*timer_cb)(void*);

    typedef struct timer {
        uint32_t deadline;      // 4 bytes
                                // 4 bytes padding for alignment
        timer_cb callback;      // 8 bytes
        void* userp;            // 8 bytes
        struct timer* next;     // 8 bytes
        struct timer* prev;     // 8 bytes
    } timer;                    // sum = 40 bytes

66% of all L1d cache misses
Rule of thumb: Follow pointer => cache miss

33% of all L1d cache misses
Rule of thumb: Data that is accessed together should be stored together
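The byte counts on the slide can be checked by the compiler instead of by hand. This is a small sketch that assumes a 64-bit platform with 8-byte pointers; timer_node and padding_bytes are hypothetical names, and the struct simply mirrors the one on the slide.

```cpp
#include <cstddef>
#include <cstdint>

typedef std::uint32_t (*timer_cb)(void*);

// Mirrors the struct on the slide (hypothetical name to keep the block standalone).
struct timer_node {
    std::uint32_t deadline;
    timer_cb callback;
    void* userp;
    timer_node* next;
    timer_node* prev;
};

// Padding = total size minus the sum of the member sizes.
constexpr std::size_t padding_bytes =
    sizeof(timer_node)
    - (sizeof(std::uint32_t) + sizeof(timer_cb) + sizeof(void*) + 2 * sizeof(timer_node*));

// Offset of the first pointer-sized member; anything between the end of
// deadline and this offset is alignment padding.
constexpr std::size_t callback_offset = offsetof(timer_node, callback);
```

On a typical 64-bit build, sizeof(timer_node) should come out as 40 and padding_bytes as 4, matching the slide. Other ABIs may give different numbers.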

Chasing pointers is expensive. Let’s get rid of the pointers.

    typedef uint32_t (*timer_cb)(void*);
    typedef uint32_t timer;

    struct timer_data {
        uint32_t deadline;
        timer id;
        void* userp;
        timer_cb callback;
    };   // 24 bytes per entry. No pointer chasing

    std::vector<timer_data> timeouts;   // Linear structure
    uint32_t next_id = 0;

    // Linear insertion sort
    timer schedule_timer(uint32_t deadline, timer_cb cb, void* userp)
    {
        auto idx = timeouts.size();
        timeouts.push_back({});
        while (idx > 0 && is_after(timeouts[idx-1].deadline, deadline)) {
            timeouts[idx] = std::move(timeouts[idx-1]);
            --idx;
        }
        const timer id = next_id++;
        timeouts[idx] = timer_data{deadline, id, userp, cb};
        return id;   // return the id that was stored, so cancel_timer() can find it
    }

    // Linear search
    void cancel_timer(timer t)
    {
        auto i = std::find_if(timeouts.begin(), timeouts.end(),
                              [t](const auto& e) { return e.id == t; });
        if (i != timeouts.end())   // erasing end() would be undefined behaviour
            timeouts.erase(i);
    }

Analysis of implementation

    perf stat -e cycles,instructions,l1d-loads,l1d-load-misses

• Presents statistics from the whole run of the program, using counters from the HW and the Linux kernel.
• cycles, instructions: Number of cycles per instruction is a proxy for how much the CPU is working or waiting.
• l1d-loads, l1d-load-misses: Number of reads from L1d cache, and number of misses. Speculative execution can make these numbers confusing.

Very fast!

Analysis of implementation

    perf record -e cycles,instructions,l1d-loads,l1d-load-misses --call-graph=lbr

• Records where in your program the counters are gathered.
• --call-graph=lbr: Records call graph info, instead of just location. LBR requires no special compilation flags.

Very fast!

Linear search is expensive. Maybe try binary search?

    typedef uint32_t (*timer_cb)(void*);

    struct timer_data {
        uint32_t deadline;
        uint32_t id;
        void* userp;
        timer_cb callback;
    };

    struct timer {
        uint32_t deadline;
        uint32_t id;
    };

    std::vector<timer_data> timeouts;
    uint32_t next_id = 0;

    timer schedule_timer(uint32_t deadline, timer_cb cb, void* userp)
    {
        timer_data element{deadline, next_id, userp, cb};
        // Binary search for insertion point
        auto i = std::lower_bound(timeouts.begin(), timeouts.end(), element, is_after);
        // Linear insertion
        timeouts.insert(i, element);
        return {deadline, next_id++};
    }

    void cancel_timer(timer t)
    {
        timer_data element{t.deadline, t.id, nullptr, nullptr};
        // Binary search for timers with the same deadline
        auto [lo, hi] = std::equal_range(timeouts.begin(), timeouts.end(), element, is_after);
        // Linear search for matching id
        auto i = std::find_if(lo, hi, [t](const auto& e) { return e.id == t.id; });
        if (i != hi) {
            // Linear removal
            timeouts.erase(i);
        }
    }

• Searches not visible in profiling.
• Number of reads reduced.
• Number of cache misses high.
• memmove() dominates.

Failed branch predictions can lead to cache entry eviction!

Maybe try a map<>?

    typedef uint32_t (*timer_cb)(void*);

    struct timer_data {
        void* userp;
        timer_cb callback;
    };

    struct is_after {
        bool operator()(uint32_t lh, uint32_t rh) const { return lh < rh; }
    };

    using timer_map = std::multimap<uint32_t, timer_data, is_after>;
    using timer = timer_map::iterator;

    static timer_map timeouts;

    timer schedule_timer(uint32_t deadline, timer_cb cb, void* userp)
    {
        return timeouts.insert(std::make_pair(deadline, timer_data{userp, cb}));
    }

    void cancel_timer(timer t)
    {
        timeouts.erase(t);
    }

    bool shoot_first()
    {
        if (timeouts.empty()) return false;
        auto i = timeouts.begin();
        i->second.callback(i->second.userp);
        timeouts.erase(i);
        return true;
    }

Faster, but lots of cache misses when comparing keys and rebalancing the tree.

What did I say about chasing pointers?

[Figure: Execution time. Series: linear, bsearch, map. X axis: number of elements (ticks 1, 10, 100, 1000, 10000). Y axis: seconds (ticks 1.00E-08 to 1.00E-02).]

[Figure: Performance relative to linear. Series: bsearch/linear, map/linear. X axis: number of elements (ticks 1, 10, 100, 1000, 10000). Y axis: time relative to linear (ticks 0 to 2).]

Can we get log(n) lookup without chasing pointers?

Enter the HEAP

[Figure: binary tree holding the values 3 5 8 6 10 10 14 9 15 13 12 11, with 3 at the root]

• Perfectly balanced partially sorted tree
• Every node is sorted after or same as its parent
• No relation between siblings
• At most one node with only one child, and that child is the last node
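The ordering property in the list above can be tested with the standard library. The sketch below assumes the tree is stored level by level in a std::vector (root first) and that "sorted after" means numerically greater; is_min_heap is a hypothetical helper name. std::is_heap checks a max-heap by default, so std::greater is passed to check for a min-heap instead.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// True when no element is smaller than its parent, i.e. the smallest value is at the root.
inline bool is_min_heap(const std::vector<std::uint32_t>& level_order)
{
    return std::is_heap(level_order.begin(), level_order.end(),
                        std::greater<std::uint32_t>{});
}
```

Siblings may appear in any order, which is why many different arrays holding the same values all pass this check.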

July 2020 © Björn Fahller @bjorn_fahller 120/ Insertion:
[Figure: the heap tree, animated over slides 114 to 120 while the new value 7 is inserted]
• Create space
• Trickle down greater nodes
• Insert into space
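The three insertion steps can be sketched on an array-backed heap as follows. This is an illustration under stated assumptions rather than the talk's implementation: the heap is a 0-based `std::vector<int>`, and `heap_insert` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not from the slides: insert into a min-heap stored
// level by level in a vector, root at index 0.
inline void heap_insert(std::vector<int>& heap, int value) {
    heap.push_back(value);                   // create space at the end
    std::size_t hole = heap.size() - 1;
    while (hole > 0) {
        const std::size_t parent = (hole - 1) / 2;
        if (!(value < heap[parent])) break;  // parent is not greater: stop
        heap[hole] = heap[parent];           // trickle the greater node down
        hole = parent;
    }
    heap[hole] = value;                      // insert into the space
}

// Small self-check: inserting a new minimum moves it all the way to the root.
inline void heap_insert_demo() {
    std::vector<int> h{5, 6, 8};
    heap_insert(h, 3);
    assert(h.front() == 3);
}
```

Only the nodes on the path from the new leaf to the root are touched, which is where the log(n) cost comes from.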

July 2020 © Björn Fahller @bjorn_fahller 128/ Pop top:
[Figure: the heap tree, animated over slides 123 to 128 while the top value 3 is removed]
• Remove top
• Trickle up lesser child
• Move-insert last into hole
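A sketch of the pop steps on the same kind of array-backed heap. As before, this is an assumption-laden illustration: a 0-based `std::vector<int>`, and a hypothetical function name `heap_pop`. The last step reuses the insertion logic, because the last element may be smaller than the parent of the hole it lands in.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not from the slides: remove and return the smallest
// element of a min-heap stored level by level in a vector, root at index 0.
inline int heap_pop(std::vector<int>& heap) {
    assert(!heap.empty());
    const int top = heap.front();            // remove top: the root is now a hole
    const int last = heap.back();
    heap.pop_back();
    const std::size_t n = heap.size();
    if (n == 0) return top;

    std::size_t hole = 0;
    for (;;) {                               // trickle up the lesser child
        std::size_t child = 2 * hole + 1;
        if (child >= n) break;               // the hole has reached the bottom
        if (child + 1 < n && heap[child + 1] < heap[child]) ++child;
        heap[hole] = heap[child];
        hole = child;
    }
    while (hole > 0) {                       // move-insert last into the hole
        const std::size_t parent = (hole - 1) / 2;
        if (!(last < heap[parent])) break;
        heap[hole] = heap[parent];
        hole = parent;
    }
    heap[hole] = last;
    return top;
}
```

Popping repeatedly yields the elements in non-decreasing order, which is a convenient way to test such a function.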

July 2020 © Björn Fahller @bjorn_fahller 137/
[Figure: the heap stored in an array]
index: 1 2 3 4 5 6 7 8 9 10 11 12
value: 5 6 7 9 10 8 14 10 12 13 11 15
Addressing: The index of a parent node is half (rounded down) of that of a child.
Array indexes! No pointer chasing!
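The addressing rule translates directly into integer arithmetic. A minimal sketch with 1-based indexes as on the slide; the function names are chosen here for illustration and are not from the talk.

```cpp
#include <cassert>
#include <cstddef>

// 1-based indexes: the root lives at index 1.
constexpr std::size_t parent_index(std::size_t idx) { return idx / 2; }
constexpr std::size_t left_child_index(std::size_t idx) { return 2 * idx; }
constexpr std::size_t right_child_index(std::size_t idx) { return 2 * idx + 1; }

// Integer division rounds down, so both children map back to the same parent.
static_assert(parent_index(2) == 1 && parent_index(3) == 1);
static_assert(parent_index(12) == 6);
static_assert(left_child_index(6) == 12 && right_child_index(6) == 13);
```

No node stores a pointer: the position in the array is the only link between parent and child.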

July 2020 © Björn Fahller @bjorn_fahller 145/ The heap is not searchable, so how do we handle cancellation?
[Figure: an array labelled "actions"]

    struct timer_action {
        uint32_t (*callback)(void*);
        void* userp;
    };

    struct timeout {
        uint32_t deadline;
        uint32_t action_index;
    };

Only 8 bytes per element of working data in the heap.
Cancel by setting callback to nullptr.
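One way the side array of actions could work is sketched below. Everything beyond `timer_action` is an assumption: the class name `action_store`, the free list, and `cancel` are invented for illustration, while `push`, `remove` and `operator[]` mirror how the talk's code uses its `actions` object.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using std::uint32_t;

struct timer_action {
    uint32_t (*callback)(void*);
    void* userp;
};

// Hypothetical sketch of the "actions" array. Slots are addressed by index,
// so a heap entry only needs to carry a 4-byte action_index.
class action_store {
public:
    uint32_t push(uint32_t (*cb)(void*), void* userp) {
        if (!free_.empty()) {
            const uint32_t idx = free_.back();   // reuse a released slot
            free_.pop_back();
            slots_[idx] = timer_action{cb, userp};
            return idx;
        }
        slots_.push_back(timer_action{cb, userp});
        return static_cast<uint32_t>(slots_.size() - 1);
    }
    // Cancelling only clears the callback. The heap entry stays where it is
    // and is skipped when it reaches the top.
    void cancel(uint32_t idx) { slots_[idx].callback = nullptr; }
    // Release the slot once the matching heap entry has been popped.
    void remove(uint32_t idx) { free_.push_back(idx); }
    timer_action& operator[](uint32_t idx) { return slots_[idx]; }

private:
    std::vector<timer_action> slots_;
    std::vector<uint32_t> free_;
};
```

The point of the split is that the heap operations never touch the callback or user pointer, so the data they do touch stays small.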

July 2020 © Björn Fahller @bjorn_fahller 147/

    struct timer_data {
        uint32_t deadline;
        uint32_t action_index;
    };

    struct is_after {
        // std::priority_queue keeps the element that compares greatest on top,
        // so this must return true when lh expires after rh. That puts the
        // earliest deadline on top.
        bool operator()(const timer_data& lh, const timer_data& rh) const {
            return lh.deadline > rh.deadline;
        }
    };

    // Container adapter that implements a heap
    std::priority_queue<timer_data, std::vector<timer_data>, is_after> timeouts;

    timer schedule_timer(uint32_t deadline, timer_cb cb, void* userp) {
        auto action_index = actions.push(cb, userp);
        timeouts.push(timer_data{deadline, action_index});
        return action_index;
    }
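`std::priority_queue` places the element that its comparator ranks greatest at `top()`, so a timer queue needs a comparator that ranks later deadlines as "less". A small sketch of that ordering; the names `expires_after`, `timeout_queue` and `earliest_deadline` are hypothetical, and deadline wrap-around is ignored for simplicity.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

using std::uint32_t;

struct timer_data {
    uint32_t deadline;
    uint32_t action_index;
};

// Hypothetical comparator: true when lh expires after rh, which makes the
// earliest deadline the "greatest" element and therefore the top.
struct expires_after {
    bool operator()(const timer_data& lh, const timer_data& rh) const {
        return lh.deadline > rh.deadline;
    }
};

using timeout_queue =
    std::priority_queue<timer_data, std::vector<timer_data>, expires_after>;

// Returns the deadline that would fire first.
inline uint32_t earliest_deadline(const timeout_queue& q) {
    assert(!q.empty());
    return q.top().deadline;
}
```

With a plain `<` in the comparator the queue would hand out the latest deadline first.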

July 2020 © Björn Fahller @bjorn_fahller 150/

    bool shoot_first() {
        // Pop off any cancelled items
        while (!timeouts.empty()) {
            auto& t = timeouts.top();
            auto& action = actions[t.action_index];
            if (action.callback) break;
            actions.remove(t.action_index);
            timeouts.pop();
        }
        if (timeouts.empty()) return false;
        // Fire the earliest remaining timer
        auto& t = timeouts.top();
        auto& action = actions[t.action_index];
        action.callback(action.userp);
        actions.remove(t.action_index);
        timeouts.pop();
        return true;
    }

July 2020 © Björn Fahller @bjorn_fahller 154/ A lot fewer of everything, and nearly twice as fast too
[Plot: Execution time. x: number of elements (1 to 100000); y: seconds (0 to 0.03); series: linear, bsearch, map, heap]
[Plot: Relative execution time. x: number of elements; y: relative time (0 to 1.6); series: heap/linear, heap/map]

July 2020 © Björn Fahller @bjorn_fahller 157/ A lot fewer of everything, and nearly twice as fast too. But there are many cache misses in the adjust-heap functions. Can we do better?

July 2020 © Björn Fahller @bjorn_fahller 179/ [Figure: the heap array with values 5 6 7 9 10 8 14 at indexes 1 to 7; indexes 0 and 8 to 23 are shown without values]

July 2020 © Björn Fahller @bjorn_fahller 180/

    class timeout_store {
        static constexpr size_t block_size = 8;
        static constexpr size_t block_mask = block_size - 1U;
        static size_t block_offset(size_t idx) { return idx & block_mask; }
        static size_t block_base(size_t idx) { return idx & ~block_mask; }
        static bool is_block_root(size_t idx) { return block_offset(idx) == 1; }
        static bool is_block_leaf(size_t idx) { return (idx & (block_size >> 1)) != 0U; }
        ...
    };

[Figure: one block of 8 elements, labelled with the offsets 1 2 3 4 5 6 7 0]

July 2020 © Björn Fahller @bjorn_fahller 188/

    class timeout_store {
        static constexpr size_t block_size = 8;
        static constexpr size_t block_mask = block_size - 1U;
        static size_t block_offset(size_t idx);
        static size_t block_base(size_t idx);
        static bool is_block_root(size_t idx);
        static bool is_block_leaf(size_t idx);

        static size_t left_child_of(size_t idx) {
            if (!is_block_leaf(idx)) return idx + block_offset(idx);
            auto base = block_base(idx) + 1;
            return base * block_size + child_no(idx) * block_size * 2 + 1;
        }

        static size_t parent_of(size_t idx) {
            auto const node_root = block_base(idx);
            if (!is_block_root(idx)) return node_root + block_offset(idx) / 2;
            auto parent_base = block_base(node_root / block_size - 1);
            auto child = ((idx - block_size) / block_size - parent_base) / 2;
            return parent_base + block_size / 2 + child;
        }
        ...
    };
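The block arithmetic is easier to trust after checking that `parent_of` undoes `left_child_of`. The sketch below copies the helpers as free `constexpr` functions so they can be tested in isolation. One part is an assumption: `child_no` is elided on the slides, and the definition here (the position of a leaf within the bottom row of its block, 0 to 3) is inferred from how `parent_of` recovers it.

```cpp
#include <cassert>
#include <cstddef>

using std::size_t;

constexpr size_t block_size = 8;
constexpr size_t block_mask = block_size - 1U;

constexpr size_t block_offset(size_t idx) { return idx & block_mask; }
constexpr size_t block_base(size_t idx) { return idx & ~block_mask; }
constexpr bool is_block_root(size_t idx) { return block_offset(idx) == 1; }
constexpr bool is_block_leaf(size_t idx) { return (idx & (block_size >> 1)) != 0U; }

// Assumption: not shown on the slides, inferred from parent_of.
constexpr size_t child_no(size_t idx) { return idx & (block_mask >> 1); }

constexpr size_t left_child_of(size_t idx) {
    if (!is_block_leaf(idx)) return idx + block_offset(idx);  // stays in the block
    const size_t base = block_base(idx) + 1;
    return base * block_size + child_no(idx) * block_size * 2 + 1;  // root of a child block
}

constexpr size_t parent_of(size_t idx) {
    const size_t node_root = block_base(idx);
    if (!is_block_root(idx)) return node_root + block_offset(idx) / 2;
    const size_t parent_base = block_base(node_root / block_size - 1);
    const size_t child = ((idx - block_size) / block_size - parent_base) / 2;
    return parent_base + block_size / 2 + child;
}

// Inside a block the layout is an ordinary 7-element heap at offsets 1 to 7.
static_assert(left_child_of(1) == 2 && left_child_of(3) == 6);
// A leaf's left child is the root (offset 1) of another block.
static_assert(left_child_of(4) == 9 && parent_of(9) == 4);
```

Under this reading, walking from a block root to a block leaf crosses three tree levels while staying inside one 8-element block.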

July 2020 © Björn Fahller @bjorn_fahller 193/

    class timeout_store {
        ...
        using allocator = align_allocator<64>::type<timer_data>;
        std::vector<timer_data, allocator> bheap_store;
    };

    template <size_t N>
    struct align_allocator {
        template <typename T>
        struct type {
            using value_type = T;
            static constexpr std::align_val_t alignment{N};
            T* allocate(size_t n) {
                return static_cast<T*>(operator new(n*sizeof(T), alignment));
            }
            void deallocate(T* p, size_t) {
                operator delete(p, alignment);
            }
        };
    };

Aligned operator new and delete came with C++17
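The reason for the 64-byte alignment is that a block of eight 8-byte elements then covers exactly one cache line, assuming 64-byte lines as on typical x86-64 hardware. The sketch below restates the allocator so it is self-contained and adds a check of the resulting alignment. The converting constructor and comparison operators are additions that the standard Allocator requirements ask for; `aligned_timer_vector` and `starts_on_boundary` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

struct timer_data {
    std::uint32_t deadline;
    std::uint32_t action_index;
};

template <std::size_t N>
struct align_allocator {
    template <typename T>
    struct type {
        using value_type = T;
        static constexpr std::align_val_t alignment{N};
        type() = default;
        template <typename U>
        constexpr type(const type<U>&) noexcept {}  // addition: rebinding support
        T* allocate(std::size_t n) {
            return static_cast<T*>(operator new(n * sizeof(T), alignment));
        }
        void deallocate(T* p, std::size_t) noexcept { operator delete(p, alignment); }
        // Additions: all instances are interchangeable.
        friend bool operator==(const type&, const type&) noexcept { return true; }
        friend bool operator!=(const type&, const type&) noexcept { return false; }
    };
};

using aligned_timer_vector =
    std::vector<timer_data, align_allocator<64>::type<timer_data>>;

// True when the vector's storage starts on an n-byte boundary.
template <typename Vec>
bool starts_on_boundary(const Vec& v, std::size_t n) {
    return reinterpret_cast<std::uintptr_t>(v.data()) % n == 0;
}

// Eight elements fill one 64-byte block.
static_assert(8 * sizeof(timer_data) == 64);
```

Without the aligned allocation a block could straddle two cache lines, and touching one block would cost two line fills instead of one.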

July 2020 © Björn Fahller @bjorn_fahller 197/ About 2/3 as many cache misses, but not much speed difference
[Plot: Execution time. x: number of elements (1 to 100000); y: seconds (log scale, up to 0.1); series: linear, bsearch, map, heap, bheap]
[Plot: Execution time relative to map. x: number of elements; y: factor (0 to 0.9); series: heap/map, bheap/map]

July 2020 © Björn Fahller @bjorn_fahller 205/ What do I mean by cache friendly?
• For small data sets, linear search in contiguous memory is the fastest. Algorithm complexity matters for large data sets.
• Avoid pointer chasing. You have to wait for the pointer to load to know where to go, and it’s likely to be a cache miss.
• Keep your working data set small. Sometimes split data structures.
• Use as much of a cache entry as you can.
• Fewer evicted cache lines means more data in hot cache for the rest of the program.
• Mispredicted branches can evict cache entries.
• Measure measure measure.

July 2020 © Björn Fahller @bjorn_fahller 206/ Resources
Ulrich Drepper - “What every programmer should know about memory” http://www.akkadia.org/drepper/cpumemory.pdf
Milian Wolff - “Linux perf for Qt Developers” https://www.youtube.com/watch?v=L4NClVxqdMw
Travis Downs - “Cache counters rant” https://gist.github.com/travisdowns/90a588deaaa1b93559fe2b8510f2a739
Emery Berger - “Performance Matters” https://www.youtube.com/watch?v=r-TLSBdHe1A

July 2020 © Björn Fahller @bjorn_fahller 207/ What Do You Mean by “Cache Friendly”? Björn Fahller @bjorn_fahller @rollbear
