Code and Memory Optimisation Tricks

Sperasoft
October 03, 2016

Presentation by Evgeny Muralev at DevGamm2016 conference

#programming #software #IT #gamedev


Transcript

  1. About me • Software engineer at Sperasoft • Worked on code for EA Sports games (FIFA, NFL, Madden); now on a Ubisoft AAA title • Indie game developer in my free time
  2. About us • Our Clients: Electronic Arts, Riot Games, Wargaming, BioWare, Ubisoft, Disney, Sony • Our Projects: Dragon Age: Inquisition, FIFA 14, SIMS 4, Mass Effect 2, League of Legends, Grand Theft Auto V • Our Office Locations: USA, Poland, Russia • The Facts: Founded in 2004, 300+ employees • Sperasoft online: sperasoft.com, linkedin.com/company/sperasoft, twitter.com/sperasoft, facebook.com/sperasoft
  3. Developing a AAA title • Fixed performance requirements • Min 30 fps (33.3 ms per frame) • Performance is king • A LOT of work to do in one frame!
  4. Make code faster?… • Improved hardware: wait for another generation; fixed on consoles • Improved algorithms: very important • Hardware-aware optimization: optimizing for a (limited) range of hardware; micro-optimizations for a specific architecture!
  5. Brief overview • Memory hierarchy: CPU registers → L1 I-cache / L1 D-cache → L2 (unified I/D) → RAM • Rough access latencies: L1 ~2 cycles, L2 ~20 cycles, RAM ~200 cycles
  6. • Last level cache (LLC) miss cost ~200 cycles • Intel Skylake instruction latencies: ADDPS/ADDSS 4 cycles, MULPS/MULSS 4 cycles, DIVPS/DIVSS 11 cycles, SQRTPS/SQRTSS 13 cycles Brief overview
  7. Intel Skylake case study:

    Level        Capacity / Associativity             Fastest latency   Peak bandwidth (B/cycle)
    L1/D         32 KB / 8-way                        4 cycles          96 (2x32 load + 1x32 store)
    L1/I         32 KB / 8-way                        N/A               N/A
    L2           256 KB / 4-way                       12 cycles         64
    L3 (shared)  up to 2 MB per core / up to 16-way   44 cycles         32

    http://www.intel.ru/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

    Brief overview
  8. • Out-of-order execution cannot hide big latencies like access to main memory • That’s why the processor always tries to prefetch ahead • Both instructions and data Brief overview
  9. • Linear data access is the best you can do to help hardware prefetching • The processor recognizes the pattern and preloads data for the next iterations beforehand

    Vec4D in[SIZE];              // Offset from origin
    float ChebyshevDist[SIZE];   // Chebyshev distance from origin
    for (auto i = 0; i < SIZE; ++i) {
        ChebyshevDist[i] = Max(in[i].x, in[i].y, in[i].z, in[i].w);
    }

    Optimizing for data cache
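A self-contained sketch of that loop, assuming SIZE, a Vec4D of four plain floats, and a Max helper built on std::max (the slide leaves all three undefined):

    #include <algorithm>
    #include <cstddef>

    constexpr std::size_t SIZE = 1024;  // assumed; not given on the slide

    struct Vec4D { float x, y, z, w; };

    // Largest of the four components, via std::max.
    // (Assumes non-negative components, as on the slide.)
    inline float Max(float a, float b, float c, float d) {
        return std::max(std::max(a, b), std::max(c, d));
    }

    Vec4D in[SIZE];              // Offset from origin
    float ChebyshevDist[SIZE];   // Chebyshev distance from origin

    void ComputeDistances() {
        // Linear walk over both arrays: the access pattern the
        // hardware prefetcher recognizes best.
        for (std::size_t i = 0; i < SIZE; ++i) {
            ChebyshevDist[i] = Max(in[i].x, in[i].y, in[i].z, in[i].w);
        }
    }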
  10. • Access patterns must be trivial • Triggering prefetching after every cache miss will pollute the cache • Prefetching cannot happen across page boundaries • Might trigger an invalid page table walk (on a TLB miss) Optimizing for data cache
  11. • Prefetching is blocked: current->next->next is not known until current->next is loaded • Cache miss every iteration! • Increases the chance of TLB misses (*depending on your memory allocator)

    struct GameActor {
        // Data…
        GameActor* next;
    };

    while (current != nullptr) {
        // Do some operations on current actor…
        current = current->next;  // LLC miss!
    }

    Optimizing for data cache
  12. Two ways to prefetch: • Load from memory: auto data = *pointerToData; • Special instructions: use intrinsics such as _mm_prefetch(void* p, enum _mm_hint h); the hint is configurable! Optimizing for data cache
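A minimal sketch of the intrinsic applied to the linked-list walk from the previous slide; _mm_prefetch comes from <xmmintrin.h> and takes a locality hint (_MM_HINT_T0 through _MM_HINT_NTA):

    #include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_*

    struct GameActor {
        // Data…
        GameActor* next;
    };

    void UpdateActors(GameActor* current) {
        while (current != nullptr) {
            // Hint the core to pull the next node toward L1 (T0 hint)
            // while the current one is being processed. Prefetching a
            // null pointer is harmless: prefetches never fault.
            _mm_prefetch(reinterpret_cast<const char*>(current->next),
                         _MM_HINT_T0);
            // ... process *current ...
            current = current->next;
        }
    }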
  13. • Load from memory != prefetch instructions • Prefetch instructions may differ depending on the H/W vendor • e.g. from the Intel guide on prefetch instructions: • They usually retire after virtual-to-physical address translation is completed • In case of an exception such as a page fault, a software prefetch retires without prefetching any data Optimizing for data cache
  14. • Probably won’t help • Computations don’t overlap memory access time enough • Remember: LLC miss ~200 cycles vs trivial ALU ops ~3-4 cycles

    while (current != nullptr) {
        Prefetch(current->next);
        // Trivial ALU computations on current actor
        current = current->next;
    }

    Optimizing for data cache
  15. • May help around high-latency operations • Make sure data is not evicted from the cache before use

    while (current != nullptr) {
        Prefetch(current->next);
        // HighLatencyComputation…
        current = current->next;
    }

    Optimizing for data cache
  16. • Prefetch far enough ahead to overlap memory access time • Prefetch near enough that it is not evicted from the data cache • Do NOT over-prefetch • Prefetching is not free • Pollutes the cache • Always profile when using software prefetching Optimizing for data cache
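One common way to apply these rules to a flat array is a fixed prefetch distance; the distance of 8 elements below is an arbitrary assumption and has to be tuned per platform by profiling:

    #include <xmmintrin.h>
    #include <cstddef>

    void ProcessAll(float* data, std::size_t count) {
        constexpr std::size_t kPrefetchDist = 8;  // assumption: tune on target H/W
        for (std::size_t i = 0; i < count; ++i) {
            if (i + kPrefetchDist < count) {
                // Far enough ahead to overlap memory latency with work,
                // near enough that the line survives in cache until use.
                _mm_prefetch(
                    reinterpret_cast<const char*>(&data[i + kPrefetchDist]),
                    _MM_HINT_T0);
            }
            // ... expensive work on data[i] ...
        }
    }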
  17. • The cache operates on blocks called "cache lines" • When accessing a single value a, the whole cache line containing it is loaded from RAM into the cache • You can expect a 64-byte-wide cache line on x64 Optimizing for data cache
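A small illustration of working with lines, assuming the usual 64-byte size: alignas pins hot data to a line boundary, so touching one field pulls in exactly that structure's line and nothing of its neighbours:

    #include <cstddef>

    constexpr std::size_t kCacheLine = 64;  // assumed 64-byte lines (typical x64)

    // Aligned and sized to exactly one line: a single access loads
    // this struct's line only, not half of an adjacent object's.
    struct alignas(kCacheLine) HotData {
        float values[16];  // 16 x 4 bytes = 64 bytes
    };

    static_assert(sizeof(HotData) == kCacheLine, "exactly one cache line");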
  18. Example of poor data layout:

    struct FooBonus {
        float fooBonus;
        float otherData[15];
    };

    // For every character…
    // Assume we have array<FooBonus> FooArray;
    float Sum{0.0f};
    for (auto i = 0; i < SIZE; ++i) {
        Sum += FooArray[i].fooBonus;
    }

    Optimizing for data cache
  19. • 64-byte offset between loads • Each load hits a separate cache line • 60 of 64 bytes are wasted

    addss xmm6,dword ptr [rax-40h]
    addss xmm6,dword ptr [rax]
    addss xmm6,dword ptr [rax+40h]
    addss xmm6,dword ptr [rax+80h]
    addss xmm6,dword ptr [rax+0C0h]
    addss xmm6,dword ptr [rax+100h]
    addss xmm6,dword ptr [rax+140h]
    addss xmm6,dword ptr [rax+180h]
    add rax,200h
    cmp rax,rcx
    jl main+0A0h

    *MSVC loves 8x loop unrolling

    Optimizing for data cache
  20. • Look for patterns in how your data is accessed • Split the data based on access patterns • Data used together should be located together • Look for the most common case Optimizing for data cache
  21. Cold fields:

    struct FooBonus {
        MiscData* otherData;
        float fooBonus;      // + 4 bytes of padding for memory alignment on 64-bit
    };

    struct MiscData {
        float otherData[15];
    };

    Optimizing for data cache
  22. • 12-byte offset between loads • Much less bandwidth is wasted • Can we do better?!

    addss xmm6,dword ptr [rax-0Ch]
    addss xmm6,dword ptr [rax]
    addss xmm6,dword ptr [rax+0Ch]
    addss xmm6,dword ptr [rax+18h]
    addss xmm6,dword ptr [rax+24h]
    addss xmm6,dword ptr [rax+30h]
    addss xmm6,dword ptr [rax+3Ch]
    addss xmm6,dword ptr [rax+48h]
    add rax,60h
    cmp rax,rcx
    jl main+0A0h

    Optimizing for data cache
  23. • Maybe there is no need for a pointer to the cold fields at all? • Make use of Structure of Arrays • Store and index separate arrays

    struct FooBonus {
        float fooBonus;
    };

    struct MiscData {
        float otherData[15];
    };

    Optimizing for data cache
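A sketch of the resulting Structure-of-Arrays layout: the hot field gets its own contiguous array, so every byte of every cache line fetched by the summation loop is useful (the array names and SIZE are assumptions):

    #include <cstddef>

    constexpr std::size_t SIZE = 1024;

    struct MiscData { float otherData[15]; };

    // Parallel arrays indexed together: element i of each array
    // belongs to the same logical entity.
    float fooBonus[SIZE];   // hot: read every frame
    MiscData misc[SIZE];    // cold: rarely touched

    float SumBonuses() {
        float sum = 0.0f;
        // Dense linear scan: 100% of each fetched line is fooBonus data.
        for (std::size_t i = 0; i < SIZE; ++i) {
            sum += fooBonus[i];
        }
        return sum;
    }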
  24. • 100% bandwidth utilization • If everything is 64-byte aligned

    addss xmm6,dword ptr [rax-4]
    addss xmm6,dword ptr [rax]
    addss xmm6,dword ptr [rax+4]
    addss xmm6,dword ptr [rax+8]
    addss xmm6,dword ptr [rax+0Ch]
    addss xmm6,dword ptr [rax+10h]
    addss xmm6,dword ptr [rax+14h]
    addss xmm6,dword ptr [rax+18h]
    add rax,20h
    cmp rax,rcx
    jl main+0A0h

    Optimizing for data cache
  25. Poor data utilization means: • Wasted bandwidth • Increased probability of TLB misses • More cache misses due to crossing page boundaries Optimizing for data cache
  26. • Recognize data access patterns: • Analyze the data and how it is used • Add logging to getters/setters • Collect any other useful data (time/counters)

    float GameCharacter::GetStamina() const {
        // Active only in debug build
        CollectData("GameCharacter::Stamina");
        return Stamina;
    }

    Optimizing for data cache
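One possible shape for that hook, assuming CollectData is nothing more than a per-label hit counter that compiles away outside debug builds (the slide does not show its implementation):

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    #ifdef _DEBUG
    // Hypothetical debug-only counter: inspect in the debugger or
    // dump at shutdown to see which fields are actually hot.
    inline void CollectData(const std::string& label) {
        static std::unordered_map<std::string, std::uint64_t> counters;
        ++counters[label];
    }
    #else
    inline void CollectData(const std::string&) {}  // no-op in release
    #endif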
  27. What to consider: • What data is accessed together? • How often is the data accessed? • From where is it accessed? Optimizing for data cache
  28. Instruction lifetime: • Instruction fetch • Decoding • Execution • Memory access • Retirement *Of course it is more complex on real hardware Optimizing branches
  29. A five-stage pipeline (IF, ID, EX, MEM, WB) filling up:

    cycle 1:  IF=I1
    cycle 2:  IF=I2  ID=I1
    cycle 3:  IF=I3  ID=I2  EX=I1

    Optimizing branches
  30. One cycle later, I1 reaches the memory stage:

    cycle 4:  IF=I4  ID=I3  EX=I2  MEM=I1

    Optimizing branches
  31. With the pipeline full, one instruction completes every cycle:

    cycle 5:  IF=I5  ID=I4  EX=I3  MEM=I2  WB=I1

    Optimizing branches
  32. • What instructions should be fetched after instruction A? • The condition hasn’t been evaluated yet • The processor speculatively chooses one of the paths • A wrong guess is called a branch misprediction

    // Instruction A
    if (Condition == true) {
        // Instruction B
        // Instruction C
    } else {
        // Instruction D
        // Instruction E
    }

    Optimizing branches
  33. Mispredicted branch!

    cycle 1:  IF=A
    cycle 2:  IF=B  ID=A
    cycle 3:  IF=C  ID=B  EX=A    (branch A resolves: mispredicted!)
    cycle 4:  MEM=A               (B and C flushed, bubbles in the pipeline)
    cycle 5:  IF=D  WB=A          (fetch restarts on the correct path)

    • Pipeline flush • A lot of wasted cycles Optimizing branches
  34. • Try to remove branches altogether • Especially hard-to-predict branches • Reduces the chance of branch misprediction • Doesn’t take up Branch Target Buffer resources Optimizing branches
  35. Know bit tricks! Example: negate a number based on a flag value

    Branchy version:

    int In;
    int Out;
    bool bDontNegate;

    Out = In;
    if (bDontNegate == false) {
        Out *= -1;
    }

    Branchless version:

    int In;
    int Out;
    bool bDontNegate;

    Out = (bDontNegate ^ (bDontNegate - 1)) * In;

    https://graphics.stanford.edu/~seander/bithacks.html

    Optimizing branches
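A compilable version of the branchless form with a quick sanity check; the cast makes the bool-to-int arithmetic explicit:

    #include <cassert>

    // result = bDontNegate ? In : -In, with no branch.
    int ConditionalNegate(int In, bool bDontNegate) {
        const int f = static_cast<int>(bDontNegate);  // 0 or 1
        // f == 1: (1 ^ 0)  ==  1  ->  In *  1
        // f == 0: (0 ^ -1) == -1  ->  In * -1
        return (f ^ (f - 1)) * In;
    }

    int main() {
        assert(ConditionalNegate(5, true)  ==  5);
        assert(ConditionalNegate(5, false) == -5);
        return 0;
    }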
  36. • Compute both branches and select the result • Example: X = (A < B) ? CONST1 : CONST2 Optimizing branches
  37. • Conditional instructions (setCC and cmovCC) X = (A < B) ? CONST1 : CONST2

    Branchy version:

    cmp a, b            ; Condition
    jbe L30             ; Conditional branch
    mov ebx, const1     ; ebx holds X
    jmp L31             ; Unconditional branch
    L30: mov ebx, const2
    L31:

    Branchless version:

    xor ebx, ebx        ; Clear ebx (X in the C code)
    cmp A, B
    setge bl            ; When ebx = 0 or 1
                        ; OR the complement condition
    sub ebx, 1          ; ebx = 11..11 or 00..00
    and ebx, const3     ; const3 = const1 - const2
    add ebx, const2     ; ebx = const1 or const2

    http://www.intel.ru/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

    Optimizing branches
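In C++ the same selection is just a ternary; optimizers commonly turn this shape into cmovCC when both operands are already computed, although that choice is up to the compiler:

    // Often compiles to cmp + cmovCC rather than a jump:
    // both constants are ready, the condition merely selects one.
    int Select(int a, int b, int const1, int const2) {
        return (a < b) ? const1 : const2;
    }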
  38. • SIMD mask + blending example X = (A < B) ? CONST1 : CONST2

    // create selector: each mask lane = 0xFFFFFFFF if (a < b), 0 otherwise
    mask = _mm_cmplt_ps(a, b);
    // blend values using the mask
    res = _mm_blendv_ps(const2, const1, mask);

    Optimizing branches
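A self-contained version with the correct intrinsic spellings; _mm_blendv_ps is SSE4.1, so it needs <smmintrin.h>:

    #include <smmintrin.h>  // SSE4.1: _mm_blendv_ps (pulls in _mm_cmplt_ps too)

    // res[i] = (a[i] < b[i]) ? const1[i] : const2[i], four lanes at once.
    __m128 SelectLanes(__m128 a, __m128 b, __m128 const1, __m128 const2) {
        // Each mask lane is all-ones where a < b, all-zeros otherwise.
        const __m128 mask = _mm_cmplt_ps(a, b);
        // blendv takes from the second source where the mask is set.
        return _mm_blendv_ps(const2, const1, mask);
    }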
  39. Compute-both summary: • Do it only for hard-to-predict branches • Obviously you have to compute both results • Introduces a data dependency, blocking out-of-order execution • Profile! Optimizing branches
  40. Example: need to update a squad • Blue nodes are archers • Red nodes are swordsmen Optimizing branches
  41. • Branching every iteration? • Bad performance for hard-to-predict branches

    struct CombatActor {
        // Data…
        EUnitType Type;  // ARCHER or SWORDSMAN
    };

    struct Squad {
        CombatActor Units[SIZE][SIZE];
    };

    void UpdateArmy(const Squad& squad) {
        for (auto i = 0; i < SIZE; ++i)
            for (auto j = 0; j < SIZE; ++j) {
                const auto& Unit = squad.Units[i][j];
                switch (Unit.Type) {
                case EUnitType::ARCHER:
                    // Process archer
                    break;
                case EUnitType::SWORDSMAN:
                    // Process swordsman
                    break;
                default:
                    // Handle default
                    break;
                }
            }
    }

    Optimizing branches
  42. • Split! And process each type separately • No branching in the processing methods • + Better utilization of the I-cache!

    struct CombatActor {
        // Data…
        EUnitType Type;  // ARCHER or SWORDSMAN
    };

    struct Squad {
        CombatActor Archers[A_SIZE];
        CombatActor Swordsmen[S_SIZE];
    };

    void UpdateArchers(const Squad& squad) {
        // Just iterate and process, no branching here
        // Update archers
    }

    void UpdateSwordsmen(const Squad& squad) {
        // Just iterate and process, no branching here
        // Update swordsmen
    }

    Optimizing branches
  43. For very predictable branches: • Generally prefer predicted-not-taken conditional branches • Depending on the architecture, a predicted-taken branch may have slightly higher latency Optimizing branches
  44. • Imagine cmp dword ptr [data], 0 is likely to evaluate to "false" • Prefer the predicted-not-taken form

    Predicted not taken:

    ; function prologue
    cmp dword ptr [data], 0
    je END
    ; set of some ALU instructions…
    ; …
    END:
    ; function epilogue

    Predicted taken:

    ; function prologue
    cmp dword ptr [data], 0
    jne COMP
    jmp END
    COMP:
    ; set of some ALU instructions…
    ; …
    END:
    ; function epilogue

    Optimizing branches
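Compilers can be nudged toward the not-taken layout: GCC and Clang expose __builtin_expect (C++20 later standardized [[likely]]/[[unlikely]]), so an expected-false condition compiles into a forward branch that falls through on the hot path; HandleRareCase below is a hypothetical stand-in:

    void HandleRareCase();  // hypothetical rare-path handler

    void Foo(int data) {
        // GCC/Clang: the 0 says "expect false", so the compiler keeps
        // the hot path as straight-line code behind a not-taken branch.
        if (__builtin_expect(data == 0, 0)) {
            HandleRareCase();
        }
        // ... common-case ALU work continues here ...
    }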
  45. • Study the branch predictor on your target architecture • Consider whether you really need a branch • Compute both results • Bit/math hacks • Study the data and split it • Based on access patterns • Based on the computation performed Optimizing branches
  46. Conclusion • Know your hardware • Architecture matters! • Design

    code around data, not abstractions • Hardware is a real thing