Slide 24
• 64-byte offset between consecutive loads
• Each load lands on a separate cache line
• 60 of the 64 bytes in each cache line are wasted
addss xmm6,dword ptr [rax-40h]
addss xmm6,dword ptr [rax]
addss xmm6,dword ptr [rax+40h]
addss xmm6,dword ptr [rax+80h]
addss xmm6,dword ptr [rax+0C0h]
addss xmm6,dword ptr [rax+100h]
addss xmm6,dword ptr [rax+140h]
addss xmm6,dword ptr [rax+180h]
add rax,200h
cmp rax,rcx
jl main+0A0h
*MSVC loves 8x loop unrolling, hence the eight addss instructions per iteration
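A minimal C++ sketch of the kind of source that could lead to a load pattern like the listing above. The slide does not show the original source, so the `Padded` struct, its field names, and the 64-byte cache line size are assumptions made for illustration. The idea is only that one 4-byte float is read from each 64-byte element, so 60 bytes of every fetched line go unused.

```cpp
#include <cstddef>
#include <vector>

// Assumed cache line size; 64 bytes matches the 40h stride in the listing.
constexpr std::size_t kCacheLineBytes = 64;

// Hypothetical layout: one hot float followed by 60 bytes of data the loop
// never reads, so consecutive `value` fields sit 64 bytes apart.
struct Padded {
    float value;
    char  cold[60];
};
static_assert(sizeof(Padded) == kCacheLineBytes,
              "one element per 64-byte stride");

// Bytes fetched but never used for each float that is read.
constexpr std::size_t kWastedBytesPerLine = kCacheLineBytes - sizeof(float);

// Strided sum: every iteration touches a new cache line to use 4 bytes of it.
float sum_padded(const std::vector<Padded>& items) {
    float sum = 0.0f;
    for (const Padded& p : items) {
        sum += p.value;
    }
    return sum;
}

// Packed sum: 16 floats share each cache line, so every fetched byte is used.
float sum_packed(const std::vector<float>& values) {
    float sum = 0.0f;
    for (float v : values) {
        sum += v;
    }
    return sum;
}
```

Both functions compute the same result; the difference is only in how many cache lines are pulled in per useful float. Moving the hot field into its own contiguous array, as in `sum_packed`, is one common way to avoid the waste.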
Optimizing for data cache