Slide 1

Slide 1 text

Zero alloc pathfinding @quasilyte 2023

Slide 2

Slide 2 text

What this talk is: a dive into specialized data structures and micro-optimizations

Slide 3

Slide 3 text

What this talk is: a dive into specialized data structures and micro-optimizations
…but not this: something that you would easily use in your job

Slide 4

Slide 4 text

Agenda 1. A very short intro 2. Existing libraries 3. Why my library is so fast 4. How to overcome some of its limitations

Slide 5

Slide 5 text

Image source: www.redblobgames.com/pathfinding/a-star Pathfinding finds some way to get from point A to point B. Depending on the algorithm, the paths can have different properties.

Slide 6

Slide 6 text

About me & pathfinding Open source!

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Roboden & pathfinding ● Can have large maps (scrollable, several screens) ● Maps are generated, there are no key waypoints ● Different kinds of landscape (mountains, lava, forests, …) ● Hundreds of units that are active in real time ● Fixed 60 ticks per second

Slide 10

Slide 10 text

Pathfinding libraries 1. github.com/quasilyte/pathing (my library) 2. github.com/fzipp/astar 3. github.com/beefsack/go-astar 4. github.com/s0rg/grid 5. github.com/solarlune/paths

Slide 11

Slide 11 text

Let’s benchmark libraries!
Tests (all on a 50x50 grid map):
● no_walls - the trivial solution is the best one
● simple_wall - go around a simple wall
● multi_wall - several simple walls

Slide 12

Slide 12 text

Benchmark details (for nerds)
Sources: github.com/quasilyte/pathing/_bench
OS: Linux Mint 21.1
CPU: x86-64 12th Gen Intel(R) Core(TM) i5-1235U
Tools used: “go test -bench”, benchstat
Turbo boost: disabled (intel_pstate/no_turbo=1)
Go version: devel go1.21-c30faf9c54

Slide 13

Slide 13 text

Pathfinding benchmarks: CPU time (ns/op)
LIBRARY    no_walls   simple_wall   multi_wall
pathing    3525       2084          2688
astar
go-astar
grid
paths

Slide 14

Slide 14 text

Pathfinding benchmarks: CPU time (ns/op)
LIBRARY    no_walls        simple_wall      multi_wall
pathing    3525            2084             2688
astar      (x268) 948367   (x745) 1554290   (x685) 1842812
go-astar
grid
paths
Read (xN) as “N times slower”

Slide 15

Slide 15 text

Pathfinding benchmarks: CPU time (ns/op)
LIBRARY    no_walls          simple_wall       multi_wall
pathing    3525              2084              2688
astar      (x268) 948367     (x745) 1554290    (x685) 1842812
go-astar   (x128) 453939     (x450) 939300     (x343) 1032581
grid       (x514) 1816039    (x553) 1154117    (x442) 1189989
paths      (x1868) 6588751   (x2474) 5158604   (x2274) 6114856

Slide 16

Slide 16 text

Pathfinding benchmarks: allocations (bytes allocated, number of allocations)
LIBRARY    no_walls         simple_wall      multi_wall
pathing    0                0                0
astar      337336 B, 2008   511908 B, 3677   722690 B, 3600
go-astar   43653 B, 529     93122 B, 1347    130731 B, 1557
grid       996889 B, 2976   551976 B, 1900   740523 B, 1759
paths      235168 B, 7199   194768 B, 6368   230416 B, 7001

Slide 17

Slide 17 text

quasilyte/pathing ● 128 to 2474 times faster than the alternatives in these benchmarks ● Does no heap allocations to build a path ● Optimized for very big grid maps (thousands of cells) ● Simple and zero-cost layers system ● Works out of the box, no need to implement an interface

Slide 18

Slide 18 text

Why are other libraries so much slower? Instead of talking about this…

Slide 19

Slide 19 text

Why are other libraries so much slower? (Instead of talking about this…)
Why is pathing so fast? (…we’ll focus on this)

Slide 20

Slide 20 text

Greedy BFS performance-critical parts ● Matrix to store cell information (the grid) ● Result path representation ● Priority queue for “frontier” ● Map to store the visited cells

Slide 21

Slide 21 text

Greedy BFS performance-critical parts ● Matrix to store cell information (the grid) ● Result path representation ● Priority queue for “frontier” ● Map to store the visited cells

Slide 22

Slide 22 text

Grid map
{0,0} {1,0} {2,0} {3,0}
{0,1} {1,1} {2,1} {3,1}
{0,2} {1,2} {2,2} {3,2}
Map legend: Plains, Sand, Lava

Slide 23

Slide 23 text

Grid map
{0,0} {1,0} {2,0} {3,0}
{0,1} {1,1} {2,1} {3,1}
{0,2} {1,2} {2,2} {3,2}
Some units can only traverse the plains.

Slide 24

Slide 24 text

Grid map
{0,0} {1,0} {2,0} {3,0}
{0,1} {1,1} {2,1} {3,1}
{0,2} {1,2} {2,2} {3,2}
Other units can traverse sand as well

Slide 25

Slide 25 text

Grid map
{0,0} {1,0} {2,0} {3,0}
{0,1} {1,1} {2,1} {3,1}
{0,2} {1,2} {2,2} {3,2}
Flying units can probably cross lava too

Slide 26

Slide 26 text

● Same grid map ● Different cell interpretation ???

Slide 27

Slide 27 text

● Same grid map ● Different cell interpretation Enter grid cell mappers (“layers”)

Slide 28

Slide 28 text

Defining the grid cell types
type CellType int
const (
    CellPlain CellType = iota // 0
    CellSand                  // 1
    CellLava                  // 2
)

Slide 29

Slide 29 text

Storing the grid cell data ● We have only 3 tile types => ● 2 bits per cell are enough

Slide 30

Slide 30 text

Choosing a grid cell data size (in bits)
SIZE (bits)   VALUE RANGE          MAP SIZE (3600 CELLS)
2             0-3 (2^2 values)     900 bytes
3             0-7 (2^3 values)     1350 bytes
4             0-15 (2^4 values)    1800 bytes
5             0-31 (2^5 values)    2250 bytes
6             0-63 (2^6 values)    2700 bytes
A typical cache line size is only 64 bytes

Slide 31

Slide 31 text

Flat array storage (2 bits / cell)
Array memory: cells 0-3 are stored in Bytes[0], cells 4-7 in Bytes[1]
i := y*g.numCols + x
byteIndex := i / 4
shift := (i % 4) * 2
cell := (g.bytes[byteIndex] >> shift) & 0b11
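The indexing math above can be wrapped into a tiny grid type. This is a minimal sketch, not the library's actual code: the Grid type, NewGrid, GetCell and SetCell names are assumptions made for illustration.

```go
package main

import "fmt"

// Grid is a hypothetical 2-bits-per-cell grid (names are illustrative).
type Grid struct {
	numCols int
	numRows int
	bytes   []byte
}

// NewGrid allocates enough bytes to hold numCols*numRows 2-bit cells.
func NewGrid(numCols, numRows int) *Grid {
	numCells := numCols * numRows
	return &Grid{
		numCols: numCols,
		numRows: numRows,
		bytes:   make([]byte, (numCells+3)/4), // 4 cells per byte, rounded up
	}
}

// GetCell returns the raw 2-bit value (0-3) stored at {x,y}.
func (g *Grid) GetCell(x, y int) byte {
	i := y*g.numCols + x
	shift := uint(i%4) * 2
	return (g.bytes[i/4] >> shift) & 0b11
}

// SetCell stores the low 2 bits of v at {x,y}.
func (g *Grid) SetCell(x, y int, v byte) {
	i := y*g.numCols + x
	shift := uint(i%4) * 2
	b := g.bytes[i/4]
	b &^= 0b11 << shift      // clear the old 2 bits
	b |= (v & 0b11) << shift // write the new value
	g.bytes[i/4] = b
}

func main() {
	g := NewGrid(60, 60) // 3600 cells
	g.SetCell(5, 7, 2)
	fmt.Println(len(g.bytes), g.GetCell(5, 7), g.GetCell(6, 7)) // prints: 900 2 0
}
```

Note that 3600 cells take exactly 900 bytes here, which matches the 2-bit row of the size table.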

Slide 32

Slide 32 text

Grid access patterns {x,y} {x+1,y} {x,y-1} {x-1,y} {x,y+1} These cells may be inside the same byte => good cache locality

Slide 33

Slide 33 text

Grid access patterns {x,y} {x+1,y} {x,y-1} {x-1,y} {x,y+1} These cells may not even fit into the same cache line (64 bytes)

Slide 34

Slide 34 text

Grid access patterns {x,y} {x,y+1} index = numCols*y + x Map width is our limiting factor

Slide 35

Slide 35 text

Cache lines vs flat array storage
SIZE (bits)   MAX WIDTH (cells, 32x32 pixels)
2             256 cells (8192 pixels)
3             ~170 cells (5440 pixels)
4             128 cells (4096 pixels)
5             ~102 cells (3264 pixels)
6             ~85 cells (2720 pixels)
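The widths in this table follow from dividing the bits of one cache line by the cell size. A small sketch of that arithmetic, assuming a 64-byte cache line and 32x32 pixel cells as on the slide (the function name is made up for the example):

```go
package main

import "fmt"

const cacheLineBytes = 64 // assumed typical cache line size

// maxRowWidth returns how many cells of the given bit size
// fit into a single cache line (rounded down).
func maxRowWidth(bitsPerCell int) int {
	return cacheLineBytes * 8 / bitsPerCell
}

func main() {
	const cellPixels = 32
	for bits := 2; bits <= 6; bits++ {
		w := maxRowWidth(bits)
		fmt.Printf("%d bits: %d cells (%d pixels)\n", bits, w, w*cellPixels)
	}
}
```

When a map row is no wider than this, {x,y} and {x,y+1} are at most one cache line apart in the flat array.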

Slide 36

Slide 36 text

Querying the grid cell value 1. Get a raw value from the grid at {x,y} (0-3 for 2 bits)

Slide 37

Slide 37 text

Querying the grid cell value 1. Get a raw value from the grid at {x,y} (0-3 for 2 bits) 2. Map that through the cell mapper (get a 0-255 value)

Slide 38

Slide 38 text

Querying the grid cell value 1. Get a raw value from the grid at {x,y} (0-3 for 2 bits) 2. Map that through the cell mapper (get a 0-255 value) 3. Use that value during the pathfinding

Slide 39

Slide 39 text

Interpreting the cell value ● After the mapping, cell value is in [0,255] range ● 0 means that this cell can’t be traversed ● Any other value specifies the traversal cost For the simplest cases, 0 and 1 are enough.

Slide 40

Slide 40 text

Defining the layer
type GridLayer [4]byte
func (l GridLayer) Get(i int) byte {
    return l[i]
}

Slide 41

Slide 41 text

Declaring layers (user code)
var NormalLayer = pathing.GridLayer{
    CellPlain: 1,
    CellSand:  0,
    CellLava:  0,
}

Slide 42

Slide 42 text

Declaring layers (user code)
var FlyingLayer = pathing.GridLayer{
    CellPlain: 1,
    CellSand:  1,
    CellLava:  1,
}

Slide 43

Slide 43 text

A better way to define the layer
type GridLayer uint32
func (l GridLayer) Get(i int) byte {
    return byte(l >> (uint32(i) * 8))
}
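With the uint32 form, a layer can no longer be written as an array literal, so some constructor is needed. A minimal sketch, where MakeGridLayer is a helper assumed for this example rather than a confirmed part of the library API:

```go
package main

import "fmt"

type CellType int

const (
	CellPlain CellType = iota // 0
	CellSand                  // 1
	CellLava                  // 2
)

// GridLayer packs four 1-byte traversal costs into one uint32.
type GridLayer uint32

// MakeGridLayer is a hypothetical constructor: costs[i] goes into byte i.
func MakeGridLayer(costs [4]byte) GridLayer {
	return GridLayer(uint32(costs[0]) |
		uint32(costs[1])<<8 |
		uint32(costs[2])<<16 |
		uint32(costs[3])<<24)
}

// Get maps a raw cell value (0-3) to its traversal cost; 0 means "blocked".
func (l GridLayer) Get(i int) byte {
	return byte(l >> (uint32(i) * 8))
}

func main() {
	normal := MakeGridLayer([4]byte{CellPlain: 1, CellSand: 0, CellLava: 0})
	flying := MakeGridLayer([4]byte{CellPlain: 1, CellSand: 1, CellLava: 1})
	// The same raw cell value is interpreted differently by each layer.
	fmt.Println(normal.Get(int(CellLava)), flying.Get(int(CellLava))) // prints: 0 1
}
```

The whole layer now fits in a register, and Get is a shift plus a truncation instead of an indexed memory load.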

Slide 44

Slide 44 text

Greedy BFS performance-critical parts ● Matrix to store cell information (the grid) ● Result path representation ● Priority queue for “frontier” ● Map to store the visited cells

Slide 45

Slide 45 text

How to represent the path? Array (slice) of points

Slide 46

Slide 46 text

How to represent the path? Packed deltas

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

A path with points: {0,0} {1,0} {2,0} {2,1} {2,2} {2,3}
Map (figure): cells 0,0 1,0 2,0 2,1 2,2 with the start marked S and the finish marked F

Slide 49

Slide 49 text

A path with points: {0,0} {1,0} {2,0} {2,1} {2,2} {2,3}
A path with deltas (“steps”): Up Right Right Down Down Down
Map (figure): cells 0,0 1,0 2,0 2,1 2,2 with the start marked S and the finish marked F

Slide 50

Slide 50 text

Defining cardinal directions (no diagonals for now)
type Direction int
// Just 4 values, so 2 bits per direction will suffice!
const (
    DirRight Direction = iota // 0
    DirDown                   // 1
    DirLeft                   // 2
    DirUp                     // 3
)

Slide 51

Slide 51 text

[Figure: 64 two-bit slots numbered 1-64, split into the first 8 bytes and the second 8 bytes]
We could store up to 64 “steps” in just 16 bytes (2 registers)

Slide 52

Slide 52 text

[Figure: two-bit slots numbered 1-56, followed by the pos and len bytes]
But we also need pos+len to iterate over the path

Slide 53

Slide 53 text

Defining the Path structure
const gridPathBytes = (16 - 2)           // 14
const gridPathMaxLen = gridPathBytes * 4 // 56
type GridPath struct {
    bytes [gridPathBytes]byte
    len   byte
    pos   byte
}
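To show how steps could be packed into and read back from this structure, here is a minimal sketch. The push, HasNext and Next methods are written for this example; the library's real method set and internal step order may differ.

```go
package main

import (
	"fmt"
	"unsafe"
)

type Direction int

const (
	DirRight Direction = iota
	DirDown
	DirLeft
	DirUp
)

const (
	gridPathBytes  = 16 - 2            // 14
	gridPathMaxLen = gridPathBytes * 4 // 56
)

type GridPath struct {
	bytes [gridPathBytes]byte
	len   byte
	pos   byte
}

// push appends a 2-bit step; it reports false when the path is full.
func (p *GridPath) push(d Direction) bool {
	if p.len >= gridPathMaxLen {
		return false
	}
	shift := (p.len % 4) * 2
	p.bytes[p.len/4] |= byte(d) << shift
	p.len++
	return true
}

// HasNext reports whether there are unread steps left.
func (p *GridPath) HasNext() bool { return p.pos < p.len }

// Next returns the current step and advances the read position.
func (p *GridPath) Next() Direction {
	shift := (p.pos % 4) * 2
	d := Direction((p.bytes[p.pos/4] >> shift) & 0b11)
	p.pos++
	return d
}

func main() {
	var p GridPath
	for _, d := range []Direction{DirUp, DirRight, DirRight, DirDown, DirDown, DirDown} {
		p.push(d)
	}
	fmt.Println(unsafe.Sizeof(p)) // prints: 16
	for p.HasNext() {
		fmt.Print(p.Next(), " ") // prints: 3 0 0 1 1 1
	}
	fmt.Println()
}
```

The path is a plain 16-byte value, so it can be returned and copied without touching the heap.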

Slide 54

Slide 54 text

Our Path type advantages ● Value semantics - just pass it without a pointer ● Copying is trivial - a pair of 64-bit MOVQ instructions ● No heap allocations needed ● Requires no real memory fetching, unlike dynamic arrays

Slide 55

Slide 55 text

Our Path type disadvantages
● Limited max path length
SIZE (bytes)   MAX PATH LENGTH (steps)
16             56
24             88
32             120
There is an advantage in keeping it under 64

Slide 56

Slide 56 text

Greedy BFS performance-critical parts ● Matrix to store cell information (the grid) ● Result path representation ● Priority queue for “frontier” ● Map to store the visited cells

Slide 57

Slide 57 text

Frontier & priority queue ● Push(coord, score) ● Pop( ) -> (coord, score) ● Reset( ) - to re-use the memory Score is a Manhattan distance from coord to the destination. (*) The exact score calculation depends on the algorithm.

Slide 58

Slide 58 text

How do we use priority queue here? ● Push all current possible moves to pqueue ● Investigate all tempting routes first If some move brings us closer to the finish, we’ll check that route first. This is needed to reduce the average computation steps performed by the algorithm.
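To make the role of the frontier concrete, here is a deliberately unoptimized greedy BFS sketch. It uses a built-in map as the visited set and plain slices as score buckets, that is, the kind of baseline this talk later replaces with specialized structures. All names here are made up for the example.

```go
package main

import "fmt"

type Coord struct{ X, Y int }

func abs(v int) int {
	if v < 0 {
		return -v
	}
	return v
}

// manhattan is the score: the distance from a to the destination b.
func manhattan(a, b Coord) int { return abs(a.X-b.X) + abs(a.Y-b.Y) }

// greedyBFS returns the number of steps in the found path.
// walls[y][x] == true marks a blocked cell; from and to must be inside the grid.
func greedyBFS(walls [][]bool, from, to Coord) (int, bool) {
	h, w := len(walls), len(walls[0])
	cameFrom := map[Coord]Coord{from: from} // visited set (unoptimized)
	buckets := make([][]Coord, w+h)         // frontier, indexed by score
	push := func(c Coord) {
		score := manhattan(c, to)
		buckets[score] = append(buckets[score], c)
	}
	popMin := func() (Coord, bool) {
		for i := range buckets {
			if n := len(buckets[i]); n > 0 {
				c := buckets[i][n-1]
				buckets[i] = buckets[i][:n-1]
				return c, true
			}
		}
		return Coord{}, false
	}
	push(from)
	for {
		cur, ok := popMin() // the most tempting route goes first
		if !ok {
			return 0, false // the frontier is empty: no path
		}
		if cur == to {
			break
		}
		for _, d := range [4]Coord{{1, 0}, {0, 1}, {-1, 0}, {0, -1}} {
			next := Coord{cur.X + d.X, cur.Y + d.Y}
			if next.X < 0 || next.Y < 0 || next.X >= w || next.Y >= h || walls[next.Y][next.X] {
				continue
			}
			if _, seen := cameFrom[next]; seen {
				continue
			}
			cameFrom[next] = cur
			push(next)
		}
	}
	steps := 0
	for c := to; c != from; c = cameFrom[c] {
		steps++
	}
	return steps, true
}

func main() {
	walls := [][]bool{
		{false, false, false, false},
		{false, true, true, false},
		{false, false, false, false},
	}
	fmt.Println(greedyBFS(walls, Coord{0, 1}, Coord{3, 1}))
}
```

Every call here allocates a map and several slices, which is exactly the cost the following slides try to eliminate.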

Slide 59

Slide 59 text

Step 0

Slide 60

Slide 60 text

x Step 1.1

Slide 61

Slide 61 text

x 3 3: {1,1} Step 1.2

Slide 62

Slide 62 text

5 x 3 3: {1,1} 5: {0,0} Step 1.3

Slide 63

Slide 63 text

5 x 3 3: {1,1} 5: {0,0} Step 2 - pop min

Slide 64

Slide 64 text

5 x 3 2 2: {2,1} 5: {0,0} Step 2.1

Slide 65

Slide 65 text

5 x 3 2 2 2: {2,1} {1,2} 5: {0,0} Step 2.2

Slide 66

Slide 66 text

5 x 4 3 2 2 2: {2,1}, {1,2} 4: {0,1} 5: {0,0} Step 2.3

Slide 67

Slide 67 text

5 x 4 3 2 2 2: {2,1}, {1,2} 4: {0,1} 5: {0,0} Step 3 - pop min

Slide 68

Slide 68 text

5 x 4 3 2 1 2 1: {3,1} 2: {1,2} 4: {0,1} 5: {0,0} Step 3.1

Slide 69

Slide 69 text

5 x 4 3 2 1 2 1 1: {3,1} {2,2} 2: {1,2} 4: {0,1} 5: {0,0} Step 3.2

Slide 70

Slide 70 text

How to implement priority queue for this? Min heap

Slide 71

Slide 71 text

How to implement priority queue for this? Bucket queue

Slide 72

Slide 72 text

Defining PriorityQueue
type PriorityQueue[T any] struct {
    buckets [64][]T
    mask    uint64
}

Slide 73

Slide 73 text

func (q *PriorityQueue[T]) Push(priority int, value T) {
    i := uint(priority) & 0b111111
    q.buckets[i] = append(q.buckets[i], value)
    q.mask |= 1 << i
}
This masking removes the bounds check

Slide 74

Slide 74 text

Actions: (none)
Queue state: mask: 0; buckets: (empty)

Slide 75

Slide 75 text

Actions: Push(1, "foo")
Queue state: mask: 0b10; buckets: { 1: {"foo"} }

Slide 76

Slide 76 text

Actions: Push(1, "foo"), Push(4, "bar")
Queue state: mask: 0b10010; buckets: { 1: {"foo"}, 4: {"bar"} }

Slide 77

Slide 77 text

Actions: Push(1, "foo"), Push(4, "bar"), Push(1, "baz")
Queue state: mask: 0b10010 (unchanged!); buckets: { 1: {"foo", "baz"}, 4: {"bar"} }

Slide 78

Slide 78 text

func (q *PriorityQueue[T]) Pop() T {
    i := uint(bits.TrailingZeros64(q.mask))
    if i < uint(len(q.buckets)) {
        e := q.buckets[i][len(q.buckets[i])-1]
        q.buckets[i] = q.buckets[i][:len(q.buckets[i])-1]
        if len(q.buckets[i]) == 0 {
            q.mask &^= 1 << i
        }
        return e
    }
    var zero T // T{} is not valid for an unconstrained type parameter
    return zero
}
TrailingZeros64 is basically a BSF+CMOV instruction pair on x86-64

Slide 79

Slide 79 text

Actions: Push(1, "foo"), Push(4, "bar"), Push(1, "baz")
Queue state: mask: 0b10010; buckets: { 1: {"foo", "baz"}, 4: {"bar"} }

Slide 80

Slide 80 text

Actions: Push(1, "foo"), Push(4, "bar"), Push(1, "baz"), Pop() // => "baz"
Queue state: mask: 0b10010 (unchanged!); buckets: { 1: {"foo"}, 4: {"bar"} }

Slide 81

Slide 81 text

Actions: Push(1, "foo"), Push(4, "bar"), Push(1, "baz"), Pop() // => "baz", Pop() // => "foo"
Queue state: mask: 0b10000 (bit 1 becomes zero!); buckets: { 4: {"bar"} }

Slide 82

Slide 82 text

Actions: Push(1, "foo"), Push(4, "bar"), Push(1, "baz"), Pop() // => "baz", Pop() // => "foo", Pop() // => "bar"
Queue state: mask: 0b00000 (all bits are 0); buckets: (empty)

Slide 83

Slide 83 text

How Pop( ) calculates the bucket in O(1) 0 0 1 0 1 0 1 0 mask uint8 bucket elements

Slide 84

Slide 84 text

How Pop( ) calculates the bucket in O(1) 0 0 1 0 1 0 1 0 mask uint8 Pop#1 bits.TrailingZeros(0b00101010) => 1

Slide 85

Slide 85 text

How Pop( ) calculates the bucket in O(1) 0 0 1 0 1 0 1 0 mask uint8 Pop#2 bits.TrailingZeros(0b00101010) => 1

Slide 86

Slide 86 text

How Pop( ) calculates the bucket in O(1) 0 0 1 0 1 0 1 0 mask uint8 Pop#3 bits.TrailingZeros(0b00101010) => 1

Slide 87

Slide 87 text

How Pop( ) calculates the bucket in O(1) 0 0 1 0 1 0 0 0 mask uint8 Became zero

Slide 88

Slide 88 text

How Pop( ) calculates the bucket in O(1) 0 0 1 0 1 0 0 0 mask uint8 Pop#4 bits.TrailingZeros(0b00101000) => 3

Slide 89

Slide 89 text

How Pop( ) calculates the bucket in O(1) 0 0 1 0 0 0 0 0 mask uint8 Became zero

Slide 90

Slide 90 text

How Pop( ) calculates the bucket in O(1) 0 0 1 0 0 0 0 0 mask uint8 Pop#5 bits.TrailingZeros(0b00100000) => 5

Slide 91

Slide 91 text

func (q *PriorityQueue[T]) Reset() {
    offset := uint(bits.TrailingZeros64(q.mask))
    q.mask >>= offset
    i := offset
    for q.mask != 0 {
        if i < uint(len(q.buckets)) {
            q.buckets[i] = q.buckets[i][:0]
        }
        q.mask >>= 1
        i++
    }
    q.mask = 0
}
100% memory re-use

Slide 92

Slide 92 text

Bucket64 properties ● Push( ) - O(1) ● Pop( ) - O(1) ● Reset( ) - O(1)* (*) Reset cost is bounded by the number of buckets (64), not by the number of queued elements. Note that we’re using the mask to skip reslicing batches of buckets, so it’s quite fast.
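Putting the pieces from the previous slides together gives a small runnable program that replays the "foo"/"bar"/"baz" walkthrough. This is a sketch assembled from the slide code; the only changes are the generic struct definition and returning a zero value via a variable.

```go
package main

import (
	"fmt"
	"math/bits"
)

type PriorityQueue[T any] struct {
	buckets [64][]T
	mask    uint64 // bit i is set when buckets[i] is non-empty
}

func (q *PriorityQueue[T]) Push(priority int, value T) {
	i := uint(priority) & 0b111111 // keeps i in [0,63]
	q.buckets[i] = append(q.buckets[i], value)
	q.mask |= 1 << i
}

func (q *PriorityQueue[T]) Pop() T {
	// The lowest set bit is the non-empty bucket with the min priority.
	i := uint(bits.TrailingZeros64(q.mask))
	if i < uint(len(q.buckets)) {
		b := q.buckets[i]
		e := b[len(b)-1]
		q.buckets[i] = b[:len(b)-1]
		if len(q.buckets[i]) == 0 {
			q.mask &^= 1 << i
		}
		return e
	}
	var zero T // the queue is empty
	return zero
}

func (q *PriorityQueue[T]) Reset() {
	offset := uint(bits.TrailingZeros64(q.mask))
	q.mask >>= offset
	i := offset
	for q.mask != 0 {
		if i < uint(len(q.buckets)) {
			q.buckets[i] = q.buckets[i][:0] // keep the capacity
		}
		q.mask >>= 1
		i++
	}
	q.mask = 0
}

func main() {
	var q PriorityQueue[string]
	q.Push(1, "foo")
	q.Push(4, "bar")
	q.Push(1, "baz")
	fmt.Printf("%#b\n", q.mask)           // prints: 0b10010
	fmt.Println(q.Pop(), q.Pop(), q.Pop()) // prints: baz foo bar
	fmt.Printf("%#b\n", q.mask)           // prints: 0b0
}
```

After Reset the bucket slices keep their capacity, so a warmed-up queue can be refilled without new allocations.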

Slide 93

Slide 93 text

No content

Slide 94

Slide 94 text

Greedy BFS performance-critical parts ● Matrix to store cell information (the grid) ● Result path representation ● Priority queue for “frontier” ● Map to store the visited cells

Slide 95

Slide 95 text

How to implement “visited set”? map[coord]data

Slide 96

Slide 96 text

How to implement “visited set”? [ ]data

Slide 97

Slide 97 text

How to implement “visited set”? [ ]data

Slide 98

Slide 98 text

2D (or 1D) array properties ● Get(coord) - O(1) ● Set(coord, value) - O(1) ● Reset( ) - O(n) Re-using a big array requires an expensive memset(0), which makes plain arrays impractical for us.

Slide 99

Slide 99 text

CoordMap benchmarks, size=96*96 (ns/op)
CONTAINER   SET   GET   RESET
array       10    4     7200
Reset( ) for an array can be quite slow

Slide 100

Slide 100 text

Sparse map to the rescue! https://research.swtch.com/sparse

Slide 101

Slide 101 text

Sparse map properties ● Get(coord) - O(1) ● Set(coord, value) - O(1) ● Reset( ) - O(1)
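For readers who have not seen the trick, here is a minimal sparse-dense map sketch in the spirit of the linked article. The type and method names are assumptions for this example, and keys are flat cell indexes rather than coordinates.

```go
package main

import "fmt"

type sparseElem struct {
	key   uint32
	value uint8
}

// SparseMap is a hypothetical sparse-dense map for keys in [0, size).
type SparseMap struct {
	dense  []sparseElem // the actual entries, in insertion order
	sparse []uint32     // sparse[key] is an index into dense (may be stale)
}

func NewSparseMap(size int) *SparseMap {
	return &SparseMap{
		dense:  make([]sparseElem, 0, size),
		sparse: make([]uint32, size),
	}
}

// Get validates the possibly stale index by checking the key stored in dense.
func (m *SparseMap) Get(key uint32) (uint8, bool) {
	i := m.sparse[key]
	if int(i) < len(m.dense) && m.dense[i].key == key {
		return m.dense[i].value, true
	}
	return 0, false
}

func (m *SparseMap) Set(key uint32, value uint8) {
	i := m.sparse[key]
	if int(i) < len(m.dense) && m.dense[i].key == key {
		m.dense[i].value = value
		return
	}
	m.sparse[key] = uint32(len(m.dense))
	m.dense = append(m.dense, sparseElem{key, value})
}

// Reset is O(1): stale sparse entries are rejected by the check in Get.
func (m *SparseMap) Reset() { m.dense = m.dense[:0] }

func main() {
	m := NewSparseMap(96 * 96)
	m.Set(5, 2)
	fmt.Println(m.Get(5))
	m.Reset()
	fmt.Println(m.Get(5))
}
```

The extra indirection through the dense slice is where the Set and Get overhead comes from.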

Slide 102

Slide 102 text

CoordMap benchmarks, size=96*96 (ns/op)
CONTAINER      SET        GET        RESET
array          10         4          7200
sparse-dense   (+25) 35   (+13) 17   1
sparse-dense Set( ) & Get( ) have some extra overhead

Slide 103

Slide 103 text

Generations-based map to the rescue! https://quasilyte.dev/blog/post/gen-map/

Slide 104

Slide 104 text

Elements Gen value gen

Slide 105

Slide 105 text

0 0 0 0 0 0 Elements 1 Gen value gen The zero-value container gen is 1, element gens are 0

Slide 106

Slide 106 text

0 0 A 1 0 0 0 Elements 1 Gen Set( ) assigns the value & updates the gen counter value gen

Slide 107

Slide 107 text

0 0 A 1 0 0 0 Elements 1 Gen Get( ) compares the elem and container gen values value gen Gen mismatch => not found

Slide 108

Slide 108 text

0 0 A 1 0 0 0 Elements 1 Gen Get( ) compares the elem and container gen values value gen Gen match => return A

Slide 109

Slide 109 text

0 0 A 1 0 0 0 Elements 2 Gen No memory reset is needed, Get( ) will work correctly right away value gen gen++

Slide 110

Slide 110 text

Generation map properties ● Get(coord) - O(1) ● Set(coord, value) - O(1) ● Reset( ) - O(1) minimal overhead
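The generation idea fits in a few lines of code. Below is a minimal sketch that assumes a uint16 counter and a flat element slice; the names and the (x, y) signatures are illustrative, not the library's API.

```go
package main

import (
	"fmt"
	"math"
)

type mapElem struct {
	value uint8
	gen   uint16
}

// CoordMap is a hypothetical generation-based map over a fixed grid.
type CoordMap struct {
	elems   []mapElem
	gen     uint16
	numCols int
}

func NewCoordMap(numCols, numRows int) *CoordMap {
	return &CoordMap{
		elems:   make([]mapElem, numCols*numRows),
		gen:     1, // element gens start at 0, so nothing matches yet
		numCols: numCols,
	}
}

// Get treats an element as present only if its gen matches the container gen.
func (m *CoordMap) Get(x, y int) (uint8, bool) {
	e := m.elems[y*m.numCols+x]
	if e.gen == m.gen {
		return e.value, true
	}
	return 0, false
}

// Set stores the value and stamps it with the current generation.
func (m *CoordMap) Set(x, y int, value uint8) {
	m.elems[y*m.numCols+x] = mapElem{value: value, gen: m.gen}
}

// Reset bumps the generation; a real clear happens only on counter overflow.
func (m *CoordMap) Reset() {
	if m.gen == math.MaxUint16 {
		for i := range m.elems {
			m.elems[i] = mapElem{}
		}
		m.gen = 1
		return
	}
	m.gen++
}

func main() {
	m := NewCoordMap(96, 96)
	m.Set(3, 4, 2)
	fmt.Println(m.Get(3, 4))
	m.Reset()
	fmt.Println(m.Get(3, 4))
}
```

Compared to the sparse-dense map, Get and Set touch a single array element, so they stay close to the plain array numbers.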

Slide 111

Slide 111 text

CoordMap benchmarks, size=96*96 (ns/op)
CONTAINER      SET        GET        RESET
array          10         4          7200
sparse-dense   (+25) 35   (+13) 17   1
gen-map        (+4) 14    (+6) 10    1
Generations-based map is ~2 times faster than sparse-dense map

Slide 112

Slide 112 text

CoordMap size estimations ● Grid stores each cell in 2 bits ● CoordMap needs ~8 bytes per cell We’re paying 32 times more memory per cell for the CoordMap! Can we still have an ~unlimited Grid size after that?

Slide 113

Slide 113 text

= x20 cells Grid 160x140 cells

Slide 114

Slide 114 text

= x20 cells Max path length (shown as 60 cells)

Slide 115

Slide 115 text

= x20 cells Max search zone (it’s less than Grid)

Slide 116

Slide 116 text

CoordMap max size ● Max search area is N=((56*2)^2)+2 => 12546 cells ● The max size is 12546 * 8 => 100368 bytes (~0.1 MB) After that, you can increase the Grid size as much as you want; the CoordMap won’t get bigger.

Slide 117

Slide 117 text

Defining coord map element
type mapElem struct {
    value uint8  // Direction stored as uint8
    gen   uint32 // Generation counter
}
// Sizeof(mapElem) => 8 (due to padding)

Slide 118

Slide 118 text

Defining coord map element
type mapElem struct {
    value uint8  // Direction stored as uint8
    gen   uint16 // Generation counter
}
// Sizeof(mapElem) => 4 (1 byte wasted)

Slide 119

Slide 119 text

Handling the overflow
type mapElem struct {
    value uint8 // Direction stored as uint8
    gen   uint8 // Generation counter
}
// Sizeof(mapElem) => 2 (no bytes wasted)
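The three Sizeof values can be checked directly with unsafe.Sizeof. The sketch below also recomputes the CoordMap size for the 12546-cell search area from the earlier estimate; the type and function names are made up for this example.

```go
package main

import (
	"fmt"
	"unsafe"
)

type elemGen32 struct {
	value uint8
	gen   uint32
}

type elemGen16 struct {
	value uint8
	gen   uint16
}

type elemGen8 struct {
	value uint8
	gen   uint8
}

// maxSearchCells is N=((56*2)^2)+2 from the size estimate.
const maxSearchCells = (56*2)*(56*2) + 2 // 12546

// elemSizes returns the sizes of the uint32, uint16 and uint8 variants.
func elemSizes() [3]int {
	return [3]int{
		int(unsafe.Sizeof(elemGen32{})),
		int(unsafe.Sizeof(elemGen16{})),
		int(unsafe.Sizeof(elemGen8{})),
	}
}

func main() {
	for _, size := range elemSizes() {
		fmt.Printf("%d bytes/elem => %d bytes total\n", size, size*maxSearchCells)
	}
}
```

Going from a uint32 to a uint8 counter cuts the map from 100368 to 25092 bytes, at the price of a real clear every 255 resets.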

Slide 120

Slide 120 text

Handling the overflow
func (m *CoordMap) Reset() {
    // Use a constant to match the gen type.
    if m.gen == math.MaxUint16 {
        m.clear() // A real memclr; happens very rarely
    } else {
        m.gen++
    }
}

Slide 121

Slide 121 text

Handling the overflow
func (m *CoordMap) clear() {
    m.gen = 1
    clear(m.elems) // Sets elem.gen to 0
}

Slide 122

Slide 122 text

Gen counter size Smaller counter => less memory overhead per element, but real memory clears will happen more often. Counters like uint32-uint64 make real memory clears almost impossible, but they will consume more memory per element.

Slide 123

Slide 123 text

Overcoming the limitations

Slide 124

Slide 124 text

What if you need a path longer than 56 cells?

Slide 125

Slide 125 text

Build multiple paths!

Slide 126

Slide 126 text

Build multiple paths!

Slide 127

Slide 127 text

Build multiple paths!

Slide 128

Slide 128 text

More than 4 tile types in a game? Use per-biome layer sets!
Jungle biome: 0 Plains, 1 Forest, 2 Water, 3 Mountains
Inferno biome: 0 Sand, 1 Lava, 2 Volcano, 3 Mountains

Slide 129

Slide 129 text

When you see an L-shaped turn, a diagonal move is possible (hint: use a Peek2 function)

Slide 130

Slide 130 text

If both adjacent cells are free, do a diagonal move

Slide 131

Slide 131 text

If one of them is not free, don’t do a diagonal move
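One way to read the rule from the last three slides as code: look at the next two steps, and if they form an L-shaped turn with both side cells free, replace them with a single diagonal move. This is only a sketch. The tryDiagonal function, the isFree callback and the deltas table are hypothetical, and in real code the two upcoming steps would come from something like the Peek2 function mentioned above.

```go
package main

import "fmt"

type Coord struct{ X, Y int }

type Direction int

const (
	DirRight Direction = iota
	DirDown
	DirLeft
	DirUp
)

// deltas maps a direction to its {dx,dy} offset (y grows downwards).
var deltas = [4]Coord{
	DirRight: {1, 0},
	DirDown:  {0, 1},
	DirLeft:  {-1, 0},
	DirUp:    {0, -1},
}

// tryDiagonal checks whether two upcoming steps can be merged into
// one diagonal move starting at pos. It returns the diagonal target.
func tryDiagonal(pos Coord, first, second Direction, isFree func(Coord) bool) (Coord, bool) {
	a, b := deltas[first], deltas[second]
	if a.X*b.X+a.Y*b.Y != 0 {
		return Coord{}, false // not an L-shaped turn (straight line or reversal)
	}
	sideA := Coord{pos.X + a.X, pos.Y + a.Y}
	sideB := Coord{pos.X + b.X, pos.Y + b.Y}
	if !isFree(sideA) || !isFree(sideB) {
		return Coord{}, false // one of the adjacent cells is blocked
	}
	return Coord{pos.X + a.X + b.X, pos.Y + a.Y + b.Y}, true
}

func main() {
	blocked := map[Coord]bool{{0, 1}: true}
	isFree := func(c Coord) bool { return !blocked[c] }
	fmt.Println(tryDiagonal(Coord{2, 2}, DirRight, DirDown, isFree)) // both sides free
	fmt.Println(tryDiagonal(Coord{0, 0}, DirRight, DirDown, isFree)) // {0,1} is blocked
}
```

The path itself still stores only cardinal steps; the diagonal is decided by the unit at movement time.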

Slide 132

Slide 132 text

Diagonal moves in action

Slide 133

Slide 133 text

The closing notes

Slide 134

Slide 134 text

Zero allocs? ● Paths are just an on-stack value (16 bytes)

Slide 135

Slide 135 text

Zero allocs? ● Paths are just an on-stack value (16 bytes) ● Unlike builtin maps, sparse maps allow 100% re-use

Slide 136

Slide 136 text

Zero allocs? ● Paths are just an on-stack value (16 bytes) ● Unlike builtin maps, sparse maps allow 100% re-use ● Priority queue is also re-use friendly

Slide 137

Slide 137 text

A* vs Greedy BFS
A* pros:
● Optimal paths
● Always finds the path
A* cons:
● More expensive to compute
● Requires more memory

Slide 138

Slide 138 text

A* (21) Average BFS (27) My BFS (21)

Slide 139

Slide 139 text

A* vs Greedy BFS performance
LIBRARY         no_walls   simple_wall   multi_wall
pathing/bfs     3525       2084          2688
pathing/astar   20140      3415          13310

Slide 140

Slide 140 text

Useful links ● My library source code ● Sparse set/map explained ● Generations map explained ● Great A* and greedy BFS introduction ● Morton space-filling curve ● Roboden game source code ● Awesome Ebitengine list

Slide 141

Slide 141 text

Useful links (2) ● Factorio path finding details ● How to get success in life

Slide 142

Slide 142 text

Zero alloc pathfinding @quasilyte 2023 Roboden game Free & Source-Available “GOOD GAEM” - quasilyte