Slide 1

Code and memory optimization tricks
Evgeny Muralev, Software Engineer, Sperasoft Inc.

Slide 2

About me
• Software engineer at Sperasoft
• Worked on code for EA Sports games (FIFA, NFL, Madden); now on a Ubisoft AAA title
• Indie game developer in my free time

Slide 3

About us
• Our clients: Electronic Arts, Riot Games, Wargaming, BioWare, Ubisoft, Disney, Sony
• Our projects: Dragon Age: Inquisition, FIFA 14, The Sims 4, Mass Effect 2, League of Legends, Grand Theft Auto V
• Office locations: USA, Poland, Russia
• The facts: founded in 2004, 300+ employees
• Sperasoft online: sperasoft.com, linkedin.com/company/sperasoft, twitter.com/sperasoft, facebook.com/sperasoft

Slide 4

Agenda
• Brief architecture overview
• Optimizing for data cache
• Optimizing branches (and I-cache)

Slide 5

Developing a AAA title
• Fixed performance requirements
• Min 30 fps (33.3 ms per frame)
• Performance is king
• A LOT of work to do in one frame!

Slide 6

Make code faster?…
• Improved hardware
  • Wait for another generation
  • Fixed on consoles
• Improved algorithms
  • Very important
• Hardware-aware optimization
  • Optimizing for a (limited) range of hardware
  • Microoptimizations for a specific architecture!

Slide 7

Brief overview
[Diagram: memory hierarchy. CPU registers → L1 I-cache / L1 D-cache (~2 cycles) → unified L2 I/D (~20 cycles) → RAM (~200 cycles)]

Slide 8

Brief overview
• Last level cache (LLC) miss cost: ~200 cycles
• Intel Skylake instruction latencies:
  • ADDPS/ADDSS: 4 cycles
  • MULPS/MULSS: 4 cycles
  • DIVPS/DIVSS: 11 cycles
  • SQRTPS/SQRTSS: 13 cycles

Slide 9

Brief overview

Slide 10

Brief overview
Intel Skylake case study:

Level       | Capacity / Associativity           | Fastest latency | Peak bandwidth
L1/D        | 32 KB / 8-way                      | 4 cycles        | 96 B/cycle (2x32 B load + 1x32 B store)
L1/I        | 32 KB / 8-way                      | N/A             | N/A
L2          | 256 KB / 4-way                     | 12 cycles       | 64 B/cycle
L3 (shared) | up to 2 MB per core / up to 16-way | 44 cycles       | 32 B/cycle

Source: http://www.intel.ru/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

Slide 11

Brief overview
• Out-of-order execution cannot hide big latencies like access to main memory
• That's why the processor always tries to prefetch ahead
• Both instructions and data

Slide 12

Optimizing for data cache
• Linear data access is the best you can do to help hardware prefetching
• The processor recognizes the pattern and preloads data for the next iterations beforehand

Vec4D in[SIZE];             // Offset from origin
float ChebyshevDist[SIZE];  // Chebyshev distance from origin

for (auto i = 0; i < SIZE; ++i)
{
    ChebyshevDist[i] = Max(in[i].x, in[i].y, in[i].z, in[i].w);
}

Slide 13

Optimizing for data cache
• Access patterns must be trivial
  • Triggering prefetching after every cache miss would pollute the cache
• Prefetching cannot happen across page boundaries
  • Might trigger an invalid page table walk (on a TLB miss)

Slide 14

Optimizing for data cache
• What about traversal of pointer-based data structures?
• Spoiler: it sucks

Slide 15

Optimizing for data cache

struct GameActor
{
    // Data…
    GameActor* next;
};

while (current != nullptr)
{
    // Do some operations on the current actor…
    current = current->next;
}

• Prefetching is blocked: current->next->next is not known until current->next has been loaded
• Cache miss (LLC miss!) on every iteration
• Increases the chance of TLB misses*
  *Depending on your memory allocator

Slide 16

Optimizing for data cache
Array vs linked list traversal
[Chart: time against number of elements; linear data access scales far better than random (pointer-chasing) access]
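A minimal benchmark sketch of this comparison (sizes and the timing harness are assumptions; note that a freshly built std::list may still land nearly contiguously in memory, so real-world pointer chasing is often even worse):

#include <chrono>
#include <cstdio>
#include <list>
#include <vector>

int main()
{
    constexpr int N = 1'000'000;
    std::vector<int> arr(N, 1);                  // contiguous: prefetcher-friendly
    std::list<int> lst(arr.begin(), arr.end()); // node-based: pointer chasing

    auto measure = [](auto&& container) {
        const auto start = std::chrono::steady_clock::now();
        long long sum = 0;
        for (int v : container) sum += v;        // same work, different memory layout
        const auto end = std::chrono::steady_clock::now();
        std::printf("sum=%lld time=%lld us\n", sum,
            (long long)std::chrono::duration_cast<std::chrono::microseconds>(end - start).count());
    };

    measure(arr); // expect this to be significantly faster
    measure(lst);
}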

Slide 17

Optimizing for data cache
Ways to prefetch:
• Load from memory:
  auto data = *pointerToData;
• Special instructions; use intrinsics:
  _mm_prefetch(const char* p, int hint) (the hint is configurable!)
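A sketch of the intrinsic on x86 (declared in xmmintrin.h; PrefetchLine is a hypothetical wrapper of ours):

#include <xmmintrin.h> // _mm_prefetch, _MM_HINT_*

inline void PrefetchLine(const void* p)
{
    // _MM_HINT_T0 pulls the line into all cache levels; other hints:
    // _MM_HINT_T1 (L2 and up), _MM_HINT_T2 (L3 and up),
    // _MM_HINT_NTA (non-temporal, to minimize cache pollution)
    _mm_prefetch(static_cast<const char*>(p), _MM_HINT_T0);
}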

Slide 18

Optimizing for data cache
• Load from memory != prefetch instructions
• Prefetch instructions may differ depending on the H/W vendor
• E.g. from the Intel guide on prefetch instructions:
  • They usually retire after virtual-to-physical address translation is completed
  • In case of an exception, such as a page fault, a software prefetch retires without prefetching any data

Slide 19

Optimizing for data cache

while (current != nullptr)
{
    Prefetch(current->next);
    // Trivial ALU computations on the current actor…
    current = current->next;
}

• Probably won't help
• The computations don't overlap the memory access time enough
• Remember: an LLC miss is ~200 cycles vs ~3-4 cycles for trivial ALU operations

Slide 20

Optimizing for data cache

while (current != nullptr)
{
    Prefetch(current->next);
    // HighLatencyComputation…
    current = current->next;
}

• May help around high-latency work
• Make sure the data is not evicted from the cache before use

Slide 21

Optimizing for data cache
• Prefetch far enough ahead to overlap the memory access time
• Prefetch near enough that the data is not evicted from the cache before use
• Do NOT overprefetch
  • Prefetching is not free
  • Polluting the cache
• Always profile when using software prefetching
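A sketch of tuning prefetch distance over an array of scattered actor pointers (GameActor, Update, and PREFETCH_DIST are assumptions; the right distance comes from profiling):

#include <xmmintrin.h>

struct GameActor { /* Data… */ int hp; };
void Update(GameActor& actor); // assumed high-latency per-actor work

void UpdateActors(GameActor* const* actors, int count)
{
    // Far enough ahead to hide latency, near enough to survive in cache
    constexpr int PREFETCH_DIST = 4;
    for (int i = 0; i < count; ++i)
    {
        if (i + PREFETCH_DIST < count)
            _mm_prefetch(reinterpret_cast<const char*>(actors[i + PREFETCH_DIST]), _MM_HINT_T0);
        Update(*actors[i]);
    }
}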

Slide 22

Optimizing for data cache
• The cache operates on blocks called "cache lines"
• When accessing "a", the whole cache line containing it is loaded
• You can expect a 64-byte-wide cache line on x64
[Diagram: element "a" in RAM; loading it brings its entire cache line into the cache]
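If you want that number in code rather than as a folklore constant, C++17 exposes it (a sketch; the constant is optional for implementations, hence the fallback):

#include <cstddef>
#include <new> // std::hardware_destructive_interference_size (C++17, optional)

#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t CacheLineSize = std::hardware_destructive_interference_size;
#else
constexpr std::size_t CacheLineSize = 64; // safe assumption on x64
#endif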

Slide 23

Optimizing for data cache
Example of poor data layout:

struct FooBonus
{
    float fooBonus;
    float otherData[15];
};

// For every character…
// Assume we have an array of FooBonus structs: FooArray
float Sum{0.0f};
for (auto i = 0; i < SIZE; ++i)
{
    Sum += FooArray[i].fooBonus;
}

Slide 24

Optimizing for data cache
• 64-byte offset between loads
• Each load hits a separate cache line
• 60 of 64 bytes are wasted

addss xmm6, dword ptr [rax-40h]
addss xmm6, dword ptr [rax]
addss xmm6, dword ptr [rax+40h]
addss xmm6, dword ptr [rax+80h]
addss xmm6, dword ptr [rax+0C0h]
addss xmm6, dword ptr [rax+100h]
addss xmm6, dword ptr [rax+140h]
addss xmm6, dword ptr [rax+180h]
add rax, 200h
cmp rax, rcx
jl main+0A0h

*MSVC loves x8 loop unrolling

Slide 25

Optimizing for data cache
• Look for patterns in how your data is accessed
• Split the data based on access patterns
  • Data used together should be located together
  • Look for the most common case

Slide 26

Optimizing for data cache
Cold fields:

struct FooBonus
{
    MiscData* otherData;
    float fooBonus;
    // + 4 bytes of padding for memory alignment on 64-bit
};

struct MiscData
{
    float otherData[15];
};

Slide 27

Optimizing for data cache
• 12-byte offset between loads
• Much less bandwidth is wasted
• Can we do better?!

addss xmm6, dword ptr [rax-0Ch]
addss xmm6, dword ptr [rax]
addss xmm6, dword ptr [rax+0Ch]
addss xmm6, dword ptr [rax+18h]
addss xmm6, dword ptr [rax+24h]
addss xmm6, dword ptr [rax+30h]
addss xmm6, dword ptr [rax+3Ch]
addss xmm6, dword ptr [rax+48h]
add rax, 60h
cmp rax, rcx
jl main+0A0h

Slide 28

Optimizing for data cache
• Maybe there is no need for a pointer to the cold fields?
• Make use of Structure of Arrays
• Store and index different arrays

struct FooBonus
{
    float fooBonus;
};

struct MiscData
{
    float otherData[15];
};
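A sketch of how the split arrays are then used (the array names are assumptions; hot and cold data share the same index):

FooBonus HotArray[SIZE];   // 4 bytes per element, densely packed
MiscData ColdArray[SIZE];  // touched only when actually needed

float Sum{0.0f};
for (auto i = 0; i < SIZE; ++i)
{
    Sum += HotArray[i].fooBonus; // streams 16 elements per 64-byte cache line
}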

Slide 29

Optimizing for data cache
• 100% bandwidth utilization
• If everything is 64-byte aligned

addss xmm6, dword ptr [rax-4]
addss xmm6, dword ptr [rax]
addss xmm6, dword ptr [rax+4]
addss xmm6, dword ptr [rax+8]
addss xmm6, dword ptr [rax+0Ch]
addss xmm6, dword ptr [rax+10h]
addss xmm6, dword ptr [rax+14h]
addss xmm6, dword ptr [rax+18h]
add rax, 20h
cmp rax, rcx
jl main+0A0h

Slide 30

Optimizing for data cache
[Chart: bandwidth/data utilization; time against number of elements for attempts 1-3, with each layout improvement reducing traversal time]

Slide 31

Optimizing for data cache
Poor data utilization means:
• Wasted bandwidth
• Increased probability of TLB misses
• More cache misses due to crossing page boundaries

Slide 32

Optimizing for data cache
• Recognize data access patterns:
  • Just analyze the data and how it's used
  • Add logging to getters/setters
  • Collect any other useful data (time/counters)

float GameCharacter::GetStamina() const
{
    // Active only in debug build
    CollectData("GameCharacter::Stamina");
    return Stamina;
}

Slide 33

Optimizing for data cache
What to consider:
• What data is accessed together?
• How often is the data accessed?
• From where is it accessed?

Slide 34

Optimizing branches
Instruction lifetime:
• Instruction fetch
• Decoding
• Execution
• Memory access
• Retirement
*Of course it is more complex on real hardware

Slide 35

Optimizing branches
[Pipeline diagram, cycle 1: I1 in IF]

Slide 36

Optimizing branches
[Pipeline diagram, cycle 2: I2 in IF, I1 in ID]

Slide 37

Optimizing branches
[Pipeline diagram, cycle 3: I3 in IF, I2 in ID, I1 in EX]

Slide 38

Optimizing branches
[Pipeline diagram, cycle 4: I4 in IF, I3 in ID, I2 in EX, I1 in MEM]

Slide 39

Optimizing branches
[Pipeline diagram, cycle 5: I5 in IF, I4 in ID, I3 in EX, I2 in MEM, I1 in WB]

Slide 40

Optimizing branches

// Instruction A
if (Condition == true)
{
    // Instruction B
    // Instruction C
}
else
{
    // Instruction D
    // Instruction E
}

• Which instructions should be fetched after instruction A?
• The condition hasn't been evaluated yet
• The processor speculatively chooses one of the paths
• A wrong guess is called a branch misprediction
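The cost is easy to observe with the classic experiment below (a sketch; the size, threshold, and timing setup are assumptions): the same loop is much faster over sorted data, where the branch becomes predictable, than over shuffled data.

#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

int64_t SumAboveThreshold(const std::vector<int>& data, int threshold)
{
    int64_t sum = 0;
    for (int v : data)
    {
        if (v >= threshold) // outcome is effectively random on shuffled data
            sum += v;
    }
    return sum;
}

int main()
{
    std::vector<int> data(1'000'000);
    std::mt19937 rng{42};
    std::uniform_int_distribution<int> dist(0, 255);
    for (int& v : data) v = dist(rng);

    // Time this call on `data` as-is (shuffled), then sort and time again;
    // the sorted version typically wins by a large factor.
    std::sort(data.begin(), data.end());
    volatile int64_t sum = SumAboveThreshold(data, 128);
    (void)sum;
}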

Slide 41

Optimizing branches
[Pipeline diagram: a mispredicted branch; the speculatively fetched instructions B and C are discarded and the pipeline restarts from D]
• Pipeline flush
• A lot of wasted cycles

Slide 42

Optimizing branches
• Try to remove branches altogether
  • Especially hard-to-predict branches
• Reduces the chance of branch misprediction
• Doesn't consume resources of the Branch Target Buffer

Slide 43

Optimizing branches
Know bit tricks! Example: negate a number based on a flag value

Branchy version:

int In;
int Out;
bool bDontNegate;

Out = In;
if (bDontNegate == false)
{
    Out = -Out;
}

Branchless version:

int In;
int Out;
bool bDontNegate;

Out = (bDontNegate ^ (bDontNegate - 1)) * In;

https://graphics.stanford.edu/~seander/bithacks.html
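Why it works (a compilable sanity check; the helper name is ours): the bool promotes to int, so b ^ (b - 1) is 1 ^ 0 = +1 when b is true, and 0 ^ -1 = -1 when b is false.

#include <cassert>

int ConditionalNegate(int In, bool bDontNegate)
{
    // +1 * In when bDontNegate, otherwise -1 * In
    return (bDontNegate ^ (bDontNegate - 1)) * In;
}

int main()
{
    assert(ConditionalNegate(42, true) == 42);
    assert(ConditionalNegate(42, false) == -42);
}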

Slide 44

Optimizing branches
• Compute both branches

Example: X = (A < B) ? CONST1 : CONST2
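In C++ the same selection can be written branch-free with an integer mask (a sketch mirroring the logic of the assembly on the next slide; the function name is ours):

int Select(int A, int B, int CONST1, int CONST2)
{
    int mask = -static_cast<int>(A >= B);       // 0 if A < B, all ones otherwise
    return (mask & (CONST2 - CONST1)) + CONST1; // CONST1 if A < B, else CONST2
}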

Slide 45

Optimizing branches
X = (A < B) ? CONST1 : CONST2

Branchy version:

cmp a, b         ; Condition
jbe L30          ; Conditional branch
mov ebx, const1  ; ebx holds X
jmp L31          ; Unconditional branch
L30:
mov ebx, const2
L31:

Branchless version (conditional instructions setCC and cmovCC):

xor ebx, ebx     ; Clear ebx (X in the C code)
cmp A, B
setge bl         ; ebx = 0 or 1; OR use the complement condition
sub ebx, 1       ; ebx = 11..11 or 00..00
and ebx, const3  ; const3 = const1 - const2
add ebx, const2  ; ebx = const1 or const2

http://www.intel.ru/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

Slide 46

Optimizing branches
• SIMD mask + blending example

X = (A < B) ? CONST1 : CONST2

// Create selector: mask = 0xFFFFFFFF if (a < b), 0 otherwise
mask = _mm_cmplt_ps(a, b);
// Blend values using the mask
res = _mm_blendv_ps(const2, const1, mask);
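A self-contained version of the snippet (a sketch; _mm_blendv_ps requires SSE4.1, and the function name is ours):

#include <smmintrin.h> // SSE4.1: _mm_blendv_ps (also pulls in _mm_cmplt_ps)

__m128 SelectLess(__m128 a, __m128 b, __m128 const1, __m128 const2)
{
    __m128 mask = _mm_cmplt_ps(a, b);           // per lane: all ones if a < b
    return _mm_blendv_ps(const2, const1, mask); // per lane: const1 where mask is set
}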

Slide 47

Optimizing branches
Compute-both summary:
• Do it only for hard-to-predict branches
• Obviously you have to compute both results
• Introduces a data dependency blocking out-of-order execution
• Profile!

Slide 48

Optimizing branches
Example: need to update a squad
• Blue nodes: archers
• Red nodes: swordsmen

Slide 49

Optimizing branches
• Branching every iteration?
• Bad performance for hard-to-predict branches

struct CombatActor
{
    // Data…
    EUnitType Type; // ARCHER or SWORDSMAN
};

struct Squad
{
    CombatActor Units[SIZE][SIZE];
};

void UpdateArmy(const Squad& squad)
{
    for (auto i = 0; i < SIZE; ++i)
    {
        for (auto j = 0; j < SIZE; ++j)
        {
            const auto& Unit = squad.Units[i][j];
            switch (Unit.Type)
            {
            case EUnitType::ARCHER:
                // Process archer
                break;
            case EUnitType::SWORDSMAN:
                // Process swordsman
                break;
            default:
                // Handle default
                break;
            }
        }
    }
}

Slide 50

Optimizing branches
• Split! And process separately
• No branching in the processing methods
• + Better utilization of the I-cache!

struct CombatActor
{
    // Data…
    EUnitType Type; // ARCHER or SWORDSMAN
};

struct Squad
{
    CombatActor Archers[A_SIZE];
    CombatActor Swordsmen[S_SIZE];
};

void UpdateArchers(const Squad& squad)
{
    // Just iterate and process, no branching here
    // Update archers
}

void UpdateSwordsmen(const Squad& squad)
{
    // Just iterate and process, no branching here
    // Update swordsmen
}

Slide 51

Optimizing branches
For very predictable branches:
• Generally prefer predicted-not-taken conditional branches
• Depending on the architecture, a predicted-taken branch may incur slightly more latency

Slide 52

Optimizing branches
• Imagine cmp dword ptr [data], 0 is likely to evaluate to "false"
• Prefer predicted not taken

Predicted not taken:

; function prologue
cmp dword ptr [data], 0
je END
; set of some ALU instructions…
; …
END:
; function epilogue

Predicted taken:

; function prologue
cmp dword ptr [data], 0
jne COMP
jmp END
COMP:
; set of some ALU instructions…
; …
END:
; function epilogue
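In C++ you usually steer this layout through compiler hints rather than hand-written assembly; a sketch using the standard C++20 attributes (HandleRareCase is a hypothetical helper):

void HandleRareCase(int* data); // hypothetical slow path

void Process(int* data)
{
    if (*data == 0) [[unlikely]] // compiler moves this path out of the fall-through
    {
        HandleRareCase(data);
    }
    // common path continues straight through (predicted not taken)
}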

Slide 53

Optimizing branches
• Study the branch predictor on the target architecture
• Consider whether you really need a branch
  • Compute both results
  • Bit/math hacks
• Study the data and split it
  • Based on access patterns
  • Based on the performed computation

Slide 54

Conclusion
• Know your hardware
  • Architecture matters!
• Design code around data, not abstractions
  • Hardware is a real thing

Slide 55

Resources
• http://www.agner.org/optimize/microarchitecture.pdf
• https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
• http://support.amd.com/TechDocs/47414_15h_sw_opt_guide.pdf
• http://www.intel.ru/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
• https://graphics.stanford.edu/~seander/bithacks.html

Slide 56

Questions?
• E-mail: [email protected]
• Twitter: @EvgenyGD
• Web: evgenymuralev.com

Slide 57

Sperasoft
• www.sperasoft.com
• Follow us on Twitter, LinkedIn and Facebook!