SETCON'19 - Alexander Stepaniuk - Effective Memory

Maksim

May 10, 2019

Transcript

  1. Memory

    In computing, memory refers to the computer hardware devices used to store information for immediate use in a computer (www.wikipedia.org). [Image] Powered by EPAM
  7. Memory hierarchy

    L0: registers
    L1: on-chip L1 cache (SRAM)
    L2: off-chip L2 cache (SRAM)
    L3: off-chip L3 cache shared by multiple cores (SRAM)
    L4: main memory (DRAM)
    L5: local secondary storage (local disks)
    L6: remote secondary storage (distributed file servers, web servers)
    Toward L0: smaller, faster, costlier. Toward L6: larger, slower, cheaper.
  9. Keynotes

    Memory is hierarchical. The fastest memory is the smallest and the most expensive; the slowest memory is the largest and the cheapest. Data locality is the tendency of a program to access the same values, or nearby storage locations, repeatedly.
  14. CPU Cache

    Intel Core i7 cache hierarchy. The processor package holds Core 0 … Core 3; each core has its own registers, L1 d-cache, L1 i-cache and L2 unified cache. All cores share the L3 unified cache, which sits in front of main memory.
    L1 cache: 32 KB, 8-way, access: 4 cycles
    L2 cache: 256 KB, 8-way, access: 11 cycles
    L3 cache: 8 MB, 16-way, access: 30-40 cycles
  15. Cache unit organization

    Intel Core i7 cache hierarchy (same diagram as slide 14: per-core registers, L1 d-cache, L1 i-cache and L2 unified cache; shared L3 unified cache; main memory).
  16. Cache unit organization

    A cache unit is a sequence of entries. Each entry consists of Tag, Data and Flags.
  17. Cache entry structure

    Each entry in the cache unit holds Tag, Data and Flags. The Data field is the cache line (payload). Common cache line sizes: 32, 64 and 128 bytes.
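
One way to picture how an address finds its entry is to split it into a tag, a set index and an offset inside the cache line. The sketch below assumes the L1 figures quoted on slide 14 (32 KB, 8-way) and a 64-byte line, which is only one of the common sizes listed above. The names `CacheAddress` and `split_address` are hypothetical.

```cpp
#include <cstdint>

// Assumed geometry: 32 KB, 8-way set-associative cache with 64-byte lines.
// Size and associativity follow the L1 figures quoted in the deck; the
// 64-byte line is an assumption (one of the common sizes).
constexpr std::uint64_t kCacheBytes = 32 * 1024;
constexpr std::uint64_t kWays       = 8;
constexpr std::uint64_t kLineBytes  = 64;
constexpr std::uint64_t kSets       = kCacheBytes / (kWays * kLineBytes);

struct CacheAddress {
    std::uint64_t tag;     // compared against the Tag field of the entries in the set
    std::uint64_t set;     // selects which group of entries to search
    std::uint64_t offset;  // byte position inside the cache line (payload)
};

// Split a byte address into tag, set index and line offset.
constexpr CacheAddress split_address(std::uint64_t address) {
    return CacheAddress{
        address / (kLineBytes * kSets),
        (address / kLineBytes) % kSets,
        address % kLineBytes,
    };
}
```

With this geometry there are 64 sets, so addresses 0 and 63 fall into the same line, while address 64 starts the next line in the next set.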
  18. Memory read flow

    Cache hit: the CPU (ALU) sends a word address to the L1 cache, and the L1 cache returns the word (data) in 4 cycles.
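  26. Memory read flow

    Cache hit: the CPU (ALU) sends a word address to the L1 cache and gets the word (data) back in 4 cycles.
    Cache miss: the L1 cache does not hold the word, so it sends the line address to the L2 cache (and further down the hierarchy if needed: …). The line (data) comes back to L1, which then returns the word (data) to the CPU. The miss path costs 11+ cycles.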
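
The hit and miss paths can be imitated with a toy model. The sketch below is a direct-mapped cache that only remembers which line each slot holds; it is not how the Core i7 caches are organized (those are set-associative), and the sizes and the names `ToyCache` and `count_misses` are assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of the read flow: a direct-mapped cache that tracks only which
// line number each slot currently holds.
class ToyCache {
public:
    ToyCache(std::size_t line_bytes, std::size_t line_count)
        : line_bytes_(line_bytes), slots_(line_count, ~std::uint64_t{0}) {}

    // Returns true on a cache hit. On a miss the whole line is "loaded" from
    // the next level, so later reads of neighbouring addresses will hit.
    bool read(std::uint64_t address) {
        const std::uint64_t line = address / line_bytes_;
        std::uint64_t& slot = slots_[line % slots_.size()];
        if (slot == line) {
            ++hits_;
            return true;
        }
        slot = line;  // fetch the line, evicting whatever was there
        ++misses_;
        return false;
    }

    std::size_t hits() const { return hits_; }
    std::size_t misses() const { return misses_; }

private:
    std::size_t line_bytes_;
    std::vector<std::uint64_t> slots_;  // all-ones marks an empty slot
    std::size_t hits_ = 0;
    std::size_t misses_ = 0;
};

// Reads `count` addresses spaced `stride_bytes` apart and returns the misses.
std::size_t count_misses(std::size_t stride_bytes, std::size_t count) {
    ToyCache cache(64, 512);  // assumed: 64-byte lines, 512 slots (32 KB)
    for (std::size_t i = 0; i < count; ++i) {
        cache.read(i * stride_bytes);
    }
    return cache.misses();
}
```

In this model, reading 1024 consecutive 4-byte values misses once per 64-byte line (64 misses), while reading 1024 values spaced 64 bytes apart misses on every access.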
  29. Example: Matrix multiplication

    A = | a00 a01 a02 |     B = | b00 b01 |
        | a10 a11 a12 |         | b10 b11 |
                                | b20 b21 |
    C = AB = | a00 b00 + a01 b10 + a02 b20    a00 b01 + a01 b11 + a02 b21 |
             | a10 b00 + a11 b10 + a12 b20    a10 b01 + a11 b11 + a12 b21 |
  30. Example: Matrix multiplication

    C = AB could be implemented as:
    for i in 0..n
      for j in 0..m
        for k in 0..p
          C[i][j] = C[i][j] + A[i][k] * B[k][j];
  32. Example: Matrix multiplication

    for i in 0..n
      for j in 0..m
        for k in 0..p
          C[i][j] = C[i][j] + A[i][k] * B[k][j];
    Simple change improves performance dramatically. What is happening?
  35. Example: Matrix multiplication

    Matrix:
      a[0,0]   a[0,1]   … a[0,m-1]
      a[1,0]   a[1,1]   … a[1,m-1]
      …
      a[n-1,0] a[n-1,1] … a[n-1,m-1]
    In memory (row after row):
      Row 0: a[0,0] a[0,1] … a[0,m-1] | Row 1: a[1,0] a[1,1] … a[1,m-1] | … | Row n-1: a[n-1,0] a[n-1,1] … a[n-1,m-1]
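
The row-after-row layout can be written down as an index formula: element a[i,j] of an n x m matrix sits at offset i * m + j in one flat buffer. A minimal sketch, where the `Matrix` type and its members are hypothetical names:

```cpp
#include <cstddef>
#include <vector>

// Row-major layout: row 0 first, then row 1, and so on, in one flat buffer.
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<double> data;  // rows * cols elements, stored row after row

    Matrix(std::size_t n, std::size_t m) : rows(n), cols(m), data(n * m, 0.0) {}

    // Position of a[i, j] inside `data`.
    std::size_t offset(std::size_t i, std::size_t j) const { return i * cols + j; }

    double& at(std::size_t i, std::size_t j) { return data[offset(i, j)]; }
    double at(std::size_t i, std::size_t j) const { return data[offset(i, j)]; }
};
```

Neighbours within a row are adjacent in memory, while neighbours within a column are `cols` elements apart, which is why walking down a column touches a new cache line far more often.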
  46. Example: Matrix multiplication

    for i in 0..n
      for j in 0..m
        for k in 0..p
          C[i][j] = C[i][j] + A[i][k] * B[k][j];
    C[i][j]: can be factored out of the inner loop by the compiler.
    A[i][k]: iterating over the items of a single row. Row data are cached.
    B[k][j]: iterating over different rows. A cache miss is highly likely.
  55. Example: Matrix multiplication

    for i in 0..n
      for k in 0..p
        for j in 0..m
          C[i][j] = C[i][j] + A[i][k] * B[k][j];
    C[i][j]: iterating over a single row. Cached.
    A[i][k]: factored out of the inner loop.
    B[k][j]: again iterating over a single row. Cached.
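
The two loop orders can be put side by side in C++. This is a sketch that assumes flat row-major storage in `std::vector<double>` (A is n x p, B is p x m, C is n x m); the function names are hypothetical. Both versions add the products for each C element in the same order of k, so they produce identical results and differ only in how they walk memory.

```cpp
#include <cstddef>
#include <vector>

using Flat = std::vector<double>;  // row-major storage, an assumption of this sketch

// C += A * B with the original loop order: i, j, k.
// The inner loop walks B down a column, touching a different row each step.
void multiply_ijk(const Flat& A, const Flat& B, Flat& C,
                  std::size_t n, std::size_t m, std::size_t p) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < m; ++j)
            for (std::size_t k = 0; k < p; ++k)
                C[i * m + j] += A[i * p + k] * B[k * m + j];
}

// Same arithmetic with the two inner loops swapped: i, k, j.
// The inner loop walks one row of C and one row of B sequentially.
void multiply_ikj(const Flat& A, const Flat& B, Flat& C,
                  std::size_t n, std::size_t m, std::size_t p) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < p; ++k) {
            const double a = A[i * p + k];  // invariant in the inner loop
            for (std::size_t j = 0; j < m; ++j)
                C[i * m + j] += a * B[k * m + j];
        }
}
```

How large the speed-up is depends on matrix size, compiler and hardware, so it is worth measuring both versions on the target machine.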
  56. Keynote

    Keep the data locality principle in mind to get better performance when working with memory.
  58. Cache Coherence

    Intel Core i7 cache hierarchy: each core (Core 0 … Core 3) has its own registers, L1 d-cache, L1 i-cache and L2 unified cache; the L3 unified cache and main memory are shared.
  65. Cache Coherence

    Core 0 and Core 1 each hold a copy of the same variable in their L1 caches, and the L3 cache holds a copy as well. Steps shown on the diagram, in order: Write, Invalidate, Read, Write, Load, Load.
  69. False Sharing Antipattern

    struct Point { int X; int Y; };
    vector<Point> data(SIZE);
    Let's increment X and Y for each item in the vector:
    for (auto& p : data) { ++p.X; ++p.Y; }
  71. False Sharing Antipattern

    struct Point { int X; int Y; };
    vector<Point> data(SIZE);
    // Thread 1 (affinity 1)
    for (auto& p : data) { ++p.X; }
    // Thread 2 (affinity 4)
    for (auto& p : data) { ++p.Y; }
  73. False Sharing Antipattern

    // Thread 1 (affinity 1)
    for (auto& p : data) { ++p.X; }
    // Thread 2 (affinity 4)
    for (auto& p : data) { ++p.Y; }
    versus
    for (auto& p : data) { ++p.X; ++p.Y; }
    The single-threaded implementation is faster.
  75. False Sharing Antipattern: Simple fix

    // Thread 1: first half of the vector
    for (auto& p : first_n(data, SIZE/2)) { ++p.X; ++p.Y; }
    // Thread 2: second half of the vector (last_n is a hypothetical counterpart of first_n)
    for (auto& p : last_n(data, SIZE/2)) { ++p.X; ++p.Y; }
  76. Keynote

    False sharing occurs when threads on different processors modify different data that reside on the same cache line.
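
Padding is a different remedy from the range split shown on the slides: instead of giving each thread its own part of the data, each independently written field gets its own cache line. The sketch below assumes a 64-byte line (one of the common sizes, not a given) and uses hypothetical names throughout.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache line size; check the target hardware before relying on it.
constexpr std::size_t kCacheLine = 64;

// Two counters packed side by side: they usually share one cache line, so
// threads updating `x` and `y` separately keep invalidating each other's copy.
struct PackedCounters {
    int x;
    int y;
};

// Each counter is aligned to its own cache line, so a write to `x` never
// touches the line that holds `y`.
struct PaddedCounters {
    alignas(kCacheLine) int x;
    alignas(kCacheLine) int y;
};

// Index of the cache line in which an object starts.
inline std::uintptr_t cache_line_of(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
}
```

The cost of this layout is memory: `PaddedCounters` takes two full cache lines where the packed version needs only a few bytes.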
  82. Virtual Memory

    Physical memory: pages 0 to 15.
    Process 1: virtual pages 0, 1, 2.
    Process 2: virtual pages 0, 1, 2, 3.
    …
    Process N: virtual pages 0, 1.
  85. Virtual Memory

    Physical memory: pages 0 to 15. Process 1: virtual pages 0, 1, 2.
    Process 1 Pages Table (virtual page -> physical page):
      0 -> 2
      1 -> 4
      2 -> 11
    Label on the diagram: Hot memory.
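
The lookup that the pages table performs can be sketched as follows. The table contents match the Process 1 example; the 4096-byte page size and the names `PageTable`, `process1_page_table` and `translate` are assumptions, not something the slides state.

```cpp
#include <cstdint>
#include <unordered_map>

constexpr std::uint64_t kPageSize = 4096;  // assumed page size

// Virtual page number -> physical page number.
using PageTable = std::unordered_map<std::uint64_t, std::uint64_t>;

// The Process 1 table from the slide: 0 -> 2, 1 -> 4, 2 -> 11.
inline PageTable process1_page_table() {
    return {{0, 2}, {1, 4}, {2, 11}};
}

// Translates a virtual address. Returns false if the page is not mapped;
// otherwise stores the physical address in `physical_address`.
inline bool translate(const PageTable& table, std::uint64_t virtual_address,
                      std::uint64_t& physical_address) {
    const std::uint64_t page = virtual_address / kPageSize;
    const std::uint64_t offset = virtual_address % kPageSize;
    const auto entry = table.find(page);
    if (entry == table.end()) {
        return false;
    }
    physical_address = entry->second * kPageSize + offset;
    return true;
}
```

Only the page number is translated; the offset inside the page is carried over unchanged.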
  86. Translation Lookaside Buffer (TLB)

    Process Pages Table: 0 -> 2, 1 -> 4, 2 -> 11, followed by further entries (…).
    TLB: holds a copy of the entries 0 -> 2, 1 -> 4, 2 -> 11.
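
The TLB behaves like a small cache placed in front of the pages table. The toy model below is a sketch under simplifying assumptions: it never evicts entries, whereas a real TLB is a small fixed-size hardware structure. `ToyTlb` and its members are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

using PageTable = std::unordered_map<std::uint64_t, std::uint64_t>;

// Toy TLB: remembers page translations that were already looked up, so that
// repeated accesses to the same page skip the slower page-table lookup.
class ToyTlb {
public:
    explicit ToyTlb(PageTable table) : table_(std::move(table)) {}

    // Returns true and sets `frame` if `page` is mapped.
    bool lookup(std::uint64_t page, std::uint64_t& frame) {
        const auto cached = entries_.find(page);
        if (cached != entries_.end()) {
            ++hits_;
            frame = cached->second;
            return true;
        }
        ++misses_;
        const auto walked = table_.find(page);  // fall back to the page table
        if (walked == table_.end()) {
            return false;
        }
        entries_[page] = walked->second;  // remember the translation
        frame = walked->second;
        return true;
    }

    std::size_t hits() const { return hits_; }
    std::size_t misses() const { return misses_; }

private:
    PageTable table_;    // the full mapping (stands in for the pages table)
    PageTable entries_;  // translations cached so far (unbounded in this toy)
    std::size_t hits_ = 0;
    std::size_t misses_ = 0;
};
```

The first access to a page counts as a TLB miss and goes to the pages table; later accesses to the same page are served from the cached entry.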
  88. Swap Memory

    Physical memory: pages 0 to 10. Swap file: pages 0 to 3. Process 1: virtual pages 0, 1, 2.
    Process 1 Pages Table:
      0 -> 2
      1 -> 4
      2 -> Swap Page (2)