SETCON'19 - Alexander Stepaniuk - Effective Memory

Effective Memory Aliaksander Stsepaniuk EPAM Systems Powered by EPAM

Memory
In computing, memory refers to the computer hardware devices used to store information for immediate use in a computer. (www.wikipedia.org)

Agenda
• Memory hierarchy
• CPU Cache
• Virtual memory

Memory hierarchy 01

Memory hierarchy
• L0: registers
• L1: on-chip L1 cache (SRAM)
• L2: off-chip L2 cache (SRAM)
• L3: off-chip L3 cache shared by multiple cores (SRAM)
• L4: main memory (DRAM)
• L5: local secondary storage (local disks)
• L6: remote secondary storage (distributed file servers, web servers)
Toward L0: smaller, faster, costlier. Toward L6: larger, slower, cheaper.

Principles of locality
• Spatial locality
• Temporal locality

Keynotes
Memory is hierarchical. The fastest memory is the smallest. The slowest memory is the cheapest.
The data locality principle describes a phenomenon in which the same values, or related storage locations, are frequently accessed.
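To make the two kinds of locality concrete, here is a minimal sketch that sums a matrix stored row by row in one contiguous vector. The function names and the flat-vector layout are assumptions made for illustration, not part of the deck. Both functions return the same sum; only the order in which memory is visited differs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums a rows x cols matrix stored row by row in one contiguous vector.
// The inner loop visits adjacent addresses (spatial locality), and
// `total` is reused on every iteration (temporal locality).
long long sum_row_order(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            total += m[i * cols + j];
    return total;
}

// Same result, but the inner loop jumps `cols` elements between accesses,
// so consecutive accesses are far apart in memory for wide matrices.
long long sum_column_order(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t j = 0; j < cols; ++j)
        for (std::size_t i = 0; i < rows; ++i)
            total += m[i * cols + j];
    return total;
}
```

On large matrices the row-order version is usually the faster one, though the size of the gap depends on the hardware and the compiler.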

CPU Cache 02

CPU Cache
Intel Core i7 Cache Hierarchy
Processor package:
• Core 0 … Core 3, each with registers, L1 d-cache, L1 i-cache and L2 unified cache
• L3 unified cache
Main memory (outside the processor package)

• L1 cache: 32 KB, 8-way. Access: 4 cycles
• L2 cache: 256 KB, 8-way. Access: 11 cycles
• L3 cache: 8 MB, 16-way. Access: 30-40 cycles

Cache unit organization
A cache unit is a sequence of entries: Tag | Data | Flags, Tag | Data | Flags, …

Cache entry structure
• Tag
• Data: the cache line (payload)
• Flags
Common cache line sizes: 32, 64 and 128 bytes
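As an illustration of how a tag relates to an address, the sketch below splits an address into tag, set index and byte offset. It assumes a conventional set-associative mapping (low bits select the byte within the line, the next bits select the set), which the deck does not spell out; `CacheGeometry`, `AddressParts` and `split_address` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Geometry of a set-associative cache (illustrative model only).
struct CacheGeometry {
    std::uint64_t size_bytes;  // total capacity
    std::uint64_t ways;        // associativity
    std::uint64_t line_bytes;  // cache line size
    std::uint64_t sets() const { return size_bytes / (ways * line_bytes); }
};

struct AddressParts {
    std::uint64_t tag;        // stored in the entry, compared on lookup
    std::uint64_t set_index;  // which set the line may live in
    std::uint64_t offset;     // byte position inside the cache line
};

// Assumed mapping: address = (tag * sets + set_index) * line_bytes + offset.
AddressParts split_address(std::uint64_t address, const CacheGeometry& g) {
    assert(g.line_bytes > 0 && g.ways > 0 && g.sets() > 0);
    const std::uint64_t line_number = address / g.line_bytes;
    return AddressParts{line_number / g.sets(), line_number % g.sets(),
                        address % g.line_bytes};
}
```

With the L1 figures quoted earlier (32 KB, 8-way) and an assumed 64-byte line, this model has 64 sets, so address 0x12345 splits into tag 18, set 13 and offset 5.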

Memory read flow
• Cache hit: the CPU (ALU) sends a word address to the L1 cache and gets the word (data) back. 4 cycles.
• Cache miss: the CPU (ALU) sends a word address to the L1 cache; the L1 cache sends the line address to the L2 cache (and, if needed, further down the hierarchy), receives the line (data) and returns the word (data) to the CPU. 11+ cycles.
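A rough way to see why the hit rate matters is to average the two paths above. The sketch below is a simplification: it assumes every L1 miss is served by L2 at the 11-cycle figure, ignoring L3 and main memory, so it gives a lower bound. The function name and the default arguments are illustrative.

```cpp
#include <cassert>

// Expected load latency in cycles under the simplifying assumption that
// an L1 miss is always served by L2 (deeper levels are ignored).
double expected_load_cycles(double l1_hit_rate,
                            double l1_cycles = 4.0,
                            double l2_cycles = 11.0) {
    assert(l1_hit_rate >= 0.0 && l1_hit_rate <= 1.0);
    return l1_hit_rate * l1_cycles + (1.0 - l1_hit_rate) * l2_cycles;
}
```

For example, a 95% L1 hit rate gives 0.95 * 4 + 0.05 * 11 = 4.35 cycles per load in this model.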

Example: Matrix multiplication

$$
A = \begin{pmatrix} a_{00} & a_{01} & a_{02} \\ a_{10} & a_{11} & a_{12} \end{pmatrix},
\quad
B = \begin{pmatrix} b_{00} & b_{01} \\ b_{10} & b_{11} \\ b_{20} & b_{21} \end{pmatrix}
$$

$$
C = AB =
\begin{pmatrix}
a_{00} b_{00} + a_{01} b_{10} + a_{02} b_{20} & a_{00} b_{01} + a_{01} b_{11} + a_{02} b_{21} \\
a_{10} b_{00} + a_{11} b_{10} + a_{12} b_{20} & a_{10} b_{01} + a_{11} b_{11} + a_{12} b_{21}
\end{pmatrix}
$$

Example: Matrix multiplication
C = AB could be implemented as:

    for i in 0..n
      for j in 0..m
        for k in 0..p
          C[i][j] = C[i][j] + A[i][k] * B[k][j];

A simple change to the loop order improves performance dramatically. What is happening?

Example: Matrix multiplication
Matrix:

    a[0,0]    a[0,1]    …  a[0,m-1]
    a[1,0]    a[1,1]    …  a[1,m-1]
    …
    a[n-1,0]  a[n-1,1]  …  a[n-1,m-1]

In memory:

    Row 0: a[0,0] a[0,1] … a[0,m-1] | Row 1: a[1,0] a[1,1] … a[1,m-1] | … | Row n-1: a[n-1,0] a[n-1,1] … a[n-1,m-1]
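The row-by-row layout can be expressed as simple index arithmetic. The helpers below are illustrative names, not from the deck; they show how far apart two neighbouring elements are in memory depending on whether you step along a row or down a column.

```cpp
#include <cassert>
#include <cstddef>

// Flat index of element [i][j] in a row-by-row layout with `cols` columns.
constexpr std::size_t row_major_index(std::size_t i, std::size_t j, std::size_t cols) {
    return i * cols + j;
}

// Byte distance between a[i][j] and a[i][j+1]: the next element in the row.
constexpr std::size_t row_step_bytes(std::size_t elem_bytes) {
    return elem_bytes;
}

// Byte distance between a[i][j] and a[i+1][j]: a whole row further on.
constexpr std::size_t column_step_bytes(std::size_t cols, std::size_t elem_bytes) {
    return cols * elem_bytes;
}

// Number of consecutive row elements that fit in one cache line.
std::size_t elements_per_line(std::size_t line_bytes, std::size_t elem_bytes) {
    assert(elem_bytes > 0);
    return line_bytes / elem_bytes;
}
```

Assuming 8-byte elements and a 64-byte cache line, eight neighbours in a row share one line, while stepping down a column of a 1000-column matrix moves 8000 bytes each time and lands on a different line.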

Example: Matrix multiplication

    for i in 0..n
      for j in 0..m
        for k in 0..p
          C[i][j] = C[i][j] + A[i][k] * B[k][j];

• C[i][j]: can be factored out of the inner loop by the compiler.
• A[i][k]: iterating over the items in a single row. Row data are cached.
• B[k][j]: iterating over different rows. A cache miss is highly likely.

Example: Matrix multiplication

    for i in 0..n
      for k in 0..p
        for j in 0..m
          C[i][j] = C[i][j] + A[i][k] * B[k][j];

• C[i][j]: iterating over a single row. Cached.
• A[i][k]: factored out of the inner loop.
• B[k][j]: again iterating over a single row. Cached.
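The two loop orders can be written out in C++ as follows. This is a minimal sketch that assumes matrices are stored row by row in flat `std::vector<double>` buffers; the `Matrix` alias and the function names are illustrative. Both functions compute the same product; they differ only in how they walk through memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;  // flat, row-by-row storage (assumption)

// C (n x m) = A (n x p) * B (p x m), loops in i, j, k order.
// The inner loop walks down a column of B, one full row apart per step.
Matrix multiply_ijk(const Matrix& A, const Matrix& B,
                    std::size_t n, std::size_t p, std::size_t m) {
    assert(A.size() == n * p && B.size() == p * m);
    Matrix C(n * m, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < m; ++j) {
            double acc = 0.0;  // C[i][j] kept out of the inner loop
            for (std::size_t k = 0; k < p; ++k)
                acc += A[i * p + k] * B[k * m + j];
            C[i * m + j] = acc;
        }
    return C;
}

// Same product, loops in i, k, j order.
// The inner loop walks along one row of C and one row of B.
Matrix multiply_ikj(const Matrix& A, const Matrix& B,
                    std::size_t n, std::size_t p, std::size_t m) {
    assert(A.size() == n * p && B.size() == p * m);
    Matrix C(n * m, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < p; ++k) {
            const double a = A[i * p + k];  // A[i][k] kept out of the inner loop
            for (std::size_t j = 0; j < m; ++j)
                C[i * m + j] += a * B[k * m + j];
        }
    return C;
}
```

Timing both versions on matrices that are much larger than the caches is the simplest way to observe the difference on a particular machine.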

Keynote
Consider the data locality principle to get better performance when working with memory.

Cache Coherence
Core 0 and Core 1 each hold a copy of the same variable in their own L1 cache, and the shared L3 cache holds a copy as well.
Sequence shown on the slide: Write → Invalidate → Read → Write → Load → Load.
A write by one core invalidates the copy in the other core's L1 cache, so the other core's next read misses and the up-to-date value has to be written back and loaded again through the shared cache.

False Sharing Antipattern

    struct Point {
        int X;
        int Y;
    };
    vector<Point> data(SIZE);

Let's increment X and Y for each item in the vector:

    for (auto& p : data) {
        ++p.X;
        ++p.Y;
    }

The same work split between two threads, each pinned to its own core:

    // Thread 1 (affinity 1)
    for (auto& p : data) {
        ++p.X;
    }

    // Thread 2 (affinity 4)
    for (auto& p : data) {
        ++p.Y;
    }

Single-threaded implementation is faster.

False Sharing Antipattern: Simple fix

    // Thread 1: first half of the data
    for (auto& p : first_n(data, SIZE/2)) {
        ++p.X;
        ++p.Y;
    }

    // Thread 2: second half of the data
    // (last_n is a hypothetical helper mirroring first_n)
    for (auto& p : last_n(data, SIZE - SIZE/2)) {
        ++p.X;
        ++p.Y;
    }

Keynote
False sharing occurs when threads on different processors modify different data that reside on the same cache line.
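Whether two fields can falsely share a line comes down to their offsets. The sketch below checks this for the `Point` layout and for a padded variant. Padding is a different technique from the partitioning fix shown in the deck, and the 64-byte line size, `PaddedPoint` and the helper names are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kAssumedLineBytes = 64;  // assumption: 64-byte cache line

struct Point {
    int X;
    int Y;
};

// Hypothetical padded variant: each field starts on its own cache line.
struct PaddedPoint {
    alignas(kAssumedLineBytes) int X;
    alignas(kAssumedLineBytes) int Y;
};

// True if two byte offsets inside a line-aligned object fall on the same line.
bool on_same_line(std::size_t offset_a, std::size_t offset_b,
                  std::size_t line_bytes = kAssumedLineBytes) {
    assert(line_bytes > 0);
    return offset_a / line_bytes == offset_b / line_bytes;
}

bool point_fields_share_line() {
    return on_same_line(offsetof(Point, X), offsetof(Point, Y));
}

bool padded_fields_share_line() {
    return on_same_line(offsetof(PaddedPoint, X), offsetof(PaddedPoint, Y));
}
```

The padded layout avoids the shared line at the cost of a much larger struct (two full cache lines per point), which is one reason partitioning the data between threads is often the simpler fix.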

Virtual Memory 03

Virtual Memory
Physical memory: frames 0 to 15
Process 1: pages 0 to 2
Process 2: pages 0 to 3
…
Process N: pages 0 and 1

Process 1 Pages Table (page → physical frame):
0 → 2
1 → 4
2 → 11

Hot memory
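The page table lookup can be sketched as a small translation function. The page size is not given in the deck, so it is passed as a parameter here (4096 bytes in the usage note is an assumption); `PageTable`, `Translation` and `translate` are illustrative names.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Maps a virtual page number to a physical frame number.
using PageTable = std::unordered_map<std::uint64_t, std::uint64_t>;

struct Translation {
    bool mapped;                     // false if the page has no table entry
    std::uint64_t physical_address;  // meaningful only when mapped is true
};

// Splits the virtual address into page number and offset, looks the page up,
// and rebuilds the physical address from the frame number and the same offset.
Translation translate(const PageTable& table, std::uint64_t virtual_address,
                      std::uint64_t page_bytes) {
    assert(page_bytes > 0);
    const std::uint64_t page = virtual_address / page_bytes;
    const std::uint64_t offset = virtual_address % page_bytes;
    const auto entry = table.find(page);
    if (entry == table.end()) return Translation{false, 0};
    return Translation{true, entry->second * page_bytes + offset};
}
```

With the Process 1 table above and 4096-byte pages, virtual address 4196 (page 1, offset 100) maps to frame 4, that is physical address 4 * 4096 + 100 = 16484.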

Translation Lookaside Buffer (TLB)
Process Pages Table (page → physical frame):
0 → 2
1 → 4
2 → 11
…

TLB:
0 → 2
1 → 4
2 → 11
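Conceptually, the TLB is a small cache of recent page table entries. The toy model below illustrates that idea only: the FIFO eviction policy, the `Tlb` class and its method names are assumptions, and real TLBs are hardware structures with their own replacement policies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>

using PageTable = std::unordered_map<std::uint64_t, std::uint64_t>;

// Toy TLB: remembers the most recently inserted page -> frame translations.
class Tlb {
public:
    explicit Tlb(std::size_t capacity) : capacity_(capacity) { assert(capacity > 0); }

    // Returns the frame for `page`; the page table is consulted only on a miss.
    // Assumes `page` is present in `table` (table.at throws otherwise).
    std::uint64_t frame_for(std::uint64_t page, const PageTable& table) {
        const auto cached = entries_.find(page);
        if (cached != entries_.end()) {
            ++hits_;
            return cached->second;
        }
        ++misses_;
        const std::uint64_t frame = table.at(page);  // the slow page table lookup
        if (entries_.size() == capacity_) {          // evict the oldest entry (FIFO)
            entries_.erase(order_.front());
            order_.pop_front();
        }
        entries_.emplace(page, frame);
        order_.push_back(page);
        return frame;
    }

    std::size_t hits() const { return hits_; }
    std::size_t misses() const { return misses_; }

private:
    std::size_t capacity_;
    std::size_t hits_ = 0;
    std::size_t misses_ = 0;
    std::unordered_map<std::uint64_t, std::uint64_t> entries_;
    std::deque<std::uint64_t> order_;
};
```

Repeated accesses to the same few pages are served from the TLB; touching more distinct pages than it can hold forces lookups in the page table again.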

Swap Memory
Physical memory: frames 0 to 10
Swap file: pages 0 to 3
Process 1: pages 0 to 2

Process 1 Pages Table:
0 → 2
1 → 4
2 → Swap Page (2)
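A page table entry that points into the swap file can be modelled with a residency flag. The sketch below is a simplified illustration, not how any particular operating system implements paging; `PageEntry`, `touch_page` and the caller-supplied free frame are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One page table entry: either a physical frame or a slot in the swap file.
struct PageEntry {
    bool resident;           // true: `location` is a physical frame number
    std::uint64_t location;  // false: `location` is a swap file page number
};

// Simulates an access to `page`. Returns true if the access caused a page
// fault, in which case the page is brought into `free_frame` (chosen by the
// caller) and the entry is updated to point at physical memory.
bool touch_page(std::vector<PageEntry>& table, std::size_t page,
                std::uint64_t free_frame) {
    assert(page < table.size());
    PageEntry& entry = table[page];
    if (entry.resident) return false;     // already in physical memory
    entry = PageEntry{true, free_frame};  // page fault: swap the page in
    return true;
}
```

Using the table above, touching page 2 faults once and makes it resident; later accesses to the same page no longer fault until it is swapped out again.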

Thank You
