
How to Become a GPU Engineer in an Hour

CocoaHeads
February 13, 2018


The modern world would not be what it is without GPU computing. Modern console games, VR, AR, cryptocurrencies, machine learning: all of this runs on hot graphics processors.

Yet GPUs are not very popular among mobile developers: many think the topic is too hard, and some don't even realize that the iPhone has a GPU at all.

This talk aims to introduce a broad developer audience to GPU programming, with a focus on mobile platforms and, of course, the trendy topics.


Transcript

  1. Agenda • Computer graphics history • Modern rendering • Apple

    side of things • What is GPGPU? • Metal Compute Shaders • Hype train 2
  2. 1977 Atari 2600 128 bytes of RAM including call stack

    and the state of game world • Typical resolution: 160x192 • 128 colors in palette • 160 * 192 * 7 bits = 26 880 bytes per frame No framebuffer, graphics were generated in real-time. Literally. 5

  9. 1977 Atari 2600 VCS could only display five interactive objects

    at any one time: • 2 «player» sprites • 2 «missile» sprites • 1 «ball» sprite 12
  10. «Racing the beam» The VCS could only display five interactive objects at any one time: two «player» sprites, two «missile» sprites, and one «ball». But once the electron beam had drawn a sprite, the program could shift the position of that sprite horizontally and redraw it 13
  13. Blanking intervals were the only time programmers could do anything that didn't involve drawing graphics on the screen, such as reading joystick inputs, computing player movements, and scoring 16
  14. 1977 Nintendo Entertainment System 1983 • 8-bit colors • Still

    no framebuffer • PPU (Picture Processing Unit) • Tiled graphics (aka «character graphics») • Operates with tiles of 8x8 (or 8x16) pixels • 8 sprites per scanline • Another advantage is collision detection 
 (movable/non-movable sprites) 18
  16. 1977 Second Generation - Shaded Solids 1983 1987 1991 …

    • Very expensive, mostly used in professional simulators • Vertex lighting • Rasterization of filled polygons • Depth buffer and blending 21
  17. 1977 First «GPU» 1983 1987 1991 … 1999 NVidia releases the first «Graphics Processing Unit», the GeForce 256 • Defined what a GPU should be • Processed 10 million polygons per second • Vertex transform • Lighting • Barely programmable 23
  18. 1977 1983 1987 1991 … 1999 2001 GeForce 3 (and later GeForce FX), the first programmable GPUs • Introduced the concept of shaders • Vertex and fragment operations • Macro assembly language • Very limited:
    ADDR R0.xyz, eyePosition.xyzx, -f[TEX0].xyzx;
    DP3R R0.w, R0.xyzx, R0.xyzx;
    RSQR R0.w, R0.w;
    MULR R0.xyz, R0.w, R0.xyzx;
    ADDR R1.xyz, lightPosition.xyzx, -f[TEX0].xyzx;
    DP3R R0.w, R1.xyzx, R1.xyzx;
    RSQR R0.w, R0.w;
    MADR R0.xyz, R0.w, R1.xyzx, R0.xyzx;
    MULR R1.xyz, R0.w, R1.xyzx;
    DP3R R0.w, R1.xyzx, f[TEX1].xyzx;
    MAXR R0.w, R0.w, {0}.x; 24
  19. Recent trends • Aug02: 121M transistors, 500 MHz, 8 GFLOPS • Jan03: 130M transistors, 475 MHz, 20 GFLOPS • Dec03: 222M transistors, 400 MHz, 53 GFLOPS • 1.8x increase in transistors • 20% decrease in clock speed • 6.6x GFLOPS speedup 26
  20. • GPUs are very limited in what they can do

    • Can only draw primitives: triangles, lines, points • Highly optimized for floating point operations 28
  21. Custom shader example Next, we take a gradient texture and sample it at a height proportional to the current displacement 50
  24. 2007 OpenGL ES 1.1 (iPhone 2G) 2010 OpenGL ES 2.0

    (iPhone 4) 2016 OpenGL ES 3.0 (iPhone 7) 61
  26. Low CPU overhead Modern GPU features Do expensive tasks less

    often Optimized for CPU behaviour Thinnest possible API 65
  33. MTLDevice MTLCommandQueue = MTLCommandBuffer Made on a per-queue basis:
    guard let commandBuffer = commandQueue.makeCommandBuffer() else {
        fatalError("Could not create command buffer")
    } 80
  35. Pipeline state • Represents the GPU state that needs to be set for the current command • Must be initialized with shader functions • Each pipeline state type has its own optional parameters • Usually cached and reused 86
  36. Pipeline state
    // Create a reusable pipeline state for rendering geometry
    let stateDescriptor = MTLRenderPipelineDescriptor()
    stateDescriptor.vertexFunction = vertexFunc
    stateDescriptor.fragmentFunction = fragmentFunc 87
  37. // Send the buffer to the command queue
    commandBuffer.commit()
    // Wait until all commands are executed
    commandBuffer.waitUntilCompleted()
    // Or subscribe to the completion event
    commandBuffer.addCompletedHandler { _ in } 95
  38. Early GPGPU 1999-2001 • Hoff (1999): Voronoi diagrams on NVIDIA TNT2 • Larsen & McAllister (2001): first GPU matrix multiplication (8-bit) • Rumpf & Strzodka (2001): first GPU PDEs (diffusion, image segmentation) • NVIDIA SDK Game of Life, Shallow Water (Greg James, 2001) 97
  39. Early GPGPU 1999-2001 • You needed a PhD in computer graphics to do this • Financial companies hired game developers 98
  40. 2002 1999-2001 Packing data into RGBA texture channels: R G B A | 0.17 0.21 0.1 0.2 | 0.1 0.21 0.2 0.0 102
  41. 2002 1999-2001 2007 CUDA • First GPU architecture and software platform designed for computing • First C/C++ language and compiler for GPUs • 2007 marked the start of a massive surge in GPGPU development 103
  42. 2002 1999-2001 2007 CUDA (diagram: a fragment program with input/output registers vs. a CUDA thread program with a thread ID) 104
  43. Metal Compute Shaders • Act just like a fragment or vertex shader, but general-purpose • Written with the kernel keyword • Suitable for highly parallel tasks • Can be put in the same command buffer as render/blit commands 105
  44. Task: multiply every element of a float buffer by a given value. A purely parallel job, well suited for compute shaders 106
  45. 1. Declare a class ArrayProcessor 2. Take MTLDevice or MTLCommandQueue as an injected dependency 3. Cache static objects in init(commandQueue:) 108
  46. public class ArrayProcessor {
        public let commandQueue: MTLCommandQueue
        public let device: MTLDevice
        public let bufferMultiplierPipelineState: MTLComputePipelineState
        public init(commandQueue: MTLCommandQueue) { … }
        …
    } 109
  47. Next, implement encoding the GPU work on the CPU side 4. Prepare a container type for the kernel’s parameters
    fileprivate struct Uniforms {
        public let multiplier: Float
        public let count: UInt32
    }
    NOTE: Be careful with Swift’s memory layout; use C/C++ declarations to avoid tricky bugs 110
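To make that warning concrete, here is a small sketch (illustrative, not from the talk) that checks the Swift-side layout of the uniforms struct before copying it into a Metal buffer. For this particular pair of 4-byte fields, Swift and the Metal Shading Language happen to agree; a Bool or a SIMD type in the struct would introduce padding and break a naive memcpy:

```swift
// Sketch: verify the Swift uniforms struct has the tightly packed
// 8-byte layout the Metal kernel expects (two 4-byte fields, no padding).
struct Uniforms {
    let multiplier: Float  // offset 0, 4 bytes
    let count: UInt32      // offset 4, 4 bytes
}

print(MemoryLayout<Uniforms>.size)       // 8
print(MemoryLayout<Uniforms>.stride)     // 8
print(MemoryLayout<Uniforms>.alignment)  // 4
```

If size or stride differs from what the shader-side struct declares, the kernel will read garbage past the first mismatched field.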
  48. 5. Encode compute kernel command into command queue public class

    ArrayProcessor { … public func process(array: [Float], multiplier: Float) { … } … } MTLDevice MTLCommandQueue MTLComputePipelineState 111
  49. public class ArrayProcessor {
        public func process(array: [Float], multiplier: Float) {
        }
    }
    Objects involved: MTLDevice, MTLCommandQueue, MTLComputePipelineState, the array buffer, the uniform buffer, MTLComputeCommandEncoder 112–117
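Filling in the empty process(array:multiplier:) body with all of those objects, a minimal sketch might look like this (assumptions beyond the slides: device, commandQueue and bufferMultiplierPipelineState are the cached properties from the initializer, the buffers use shared storage, error handling is reduced to an early return, and the method returns the result array for convenience):

```swift
import Metal

extension ArrayProcessor {
    public func process(array: [Float], multiplier: Float) -> [Float] {
        var input = array
        var uniforms = Uniforms(multiplier: multiplier, count: UInt32(input.count))

        // Wrap the data and the kernel parameters in MTLBuffers.
        guard let arrayBuffer = device.makeBuffer(bytes: &input,
                                                  length: MemoryLayout<Float>.stride * input.count),
              let uniformBuffer = device.makeBuffer(bytes: &uniforms,
                                                    length: MemoryLayout<Uniforms>.stride),
              let commandBuffer = commandQueue.makeCommandBuffer(),
              let encoder = commandBuffer.makeComputeCommandEncoder()
        else { return array }

        // Bind the pipeline state and the two buffers the kernel reads.
        encoder.setComputePipelineState(bufferMultiplierPipelineState)
        encoder.setBuffer(arrayBuffer, offset: 0, index: 0)
        encoder.setBuffer(uniformBuffer, offset: 0, index: 1)

        // One thread per element, rounded up to whole threadgroups.
        let width = bufferMultiplierPipelineState.threadExecutionWidth
        let groups = MTLSize(width: (input.count + width - 1) / width, height: 1, depth: 1)
        encoder.dispatchThreadgroups(groups,
                                     threadsPerThreadgroup: MTLSize(width: width, height: 1, depth: 1))
        encoder.endEncoding()

        commandBuffer.commit()
        commandBuffer.waitUntilCompleted()

        // Read the results back from the shared buffer.
        let pointer = arrayBuffer.contents().bindMemory(to: Float.self, capacity: input.count)
        return Array(UnsafeBufferPointer(start: pointer, count: input.count))
    }
}
```

waitUntilCompleted() is fine for a demo; in production code you would use a completed handler instead of blocking the CPU, as the tips slide later advises.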
  55. Threads and threadgroups • Metal executes your kernel function over a 1D, 2D or 3D grid • Each point in the grid represents a single instance of your kernel function, called a thread • Threads are organized into threadgroups that can share a common block of memory 118
  56. Threads and threadgroups
    kernel void myKernel(uint2 threadgroup_position_in_grid [[ threadgroup_position_in_grid ]],
                         uint2 thread_position_in_threadgroup [[ thread_position_in_threadgroup ]],
                         uint2 threads_per_threadgroup [[ threads_per_threadgroup ]]) 120
  58. Threads and threadgroups Threads in a threadgroup are executed in a SIMD fashion (Single Instruction, Multiple Data): on a divergent if, all threads execute both branches, so keep divergence to a minimum 124
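As an illustrative Metal sketch of that point (not from the talk): both kernels below do the same work, but in the first one threads of a SIMD group that disagree on the condition serialize through both branches, while the second keeps a uniform instruction stream via MSL's select built-in:

```metal
#include <metal_stdlib>
using namespace metal;

// Divergent: threads that disagree on the condition execute both branches.
kernel void scaleDivergent(device float* data [[ buffer(0) ]],
                           uint i [[ thread_position_in_grid ]])
{
    if (data[i] > 0.5f) {
        data[i] = data[i] * 2.0f;
    } else {
        data[i] = 0.0f;
    }
}

// Branchless: select(a, b, cond) yields b when cond is true, a otherwise.
kernel void scaleUniform(device float* data [[ buffer(0) ]],
                         uint i [[ thread_position_in_grid ]])
{
    data[i] = select(0.0f, data[i] * 2.0f, data[i] > 0.5f);
}
```

For a branch this cheap the compiler may well generate the same code for both; the pattern matters when the branches contain real work.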
  59. Threads and threadgroups The division of threadgroups into SIMD groups is defined by Metal. The SIMD group size is returned by threadExecutionWidth on the compute pipeline state object. All you have to do is define the threadgroup size 125
  60. 6. Calculate threadgroup count and size
    let executionWidth = bufferMultiplierPipelineState.threadExecutionWidth
    let threadgroupsPerGrid = MTLSize(width: (buffer.count + executionWidth - 1) / executionWidth,
                                      height: 1, depth: 1)
    let threadsPerThreadgroup = MTLSize(width: executionWidth, height: 1, depth: 1) 126
  61. computeEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup) 127
  62. 7. Write your kernels
    kernel void bufferMultiplier(device float* inputBuffer [[ buffer(0) ]],
                                 const device BufferMultiplierUniforms& uniforms [[ buffer(1) ]],
                                 const uint threadIndex [[ thread_position_in_grid ]])
    {
        if (threadIndex >= uniforms.bufferSize) {
            return;
        }
        const float initialValue = inputBuffer[threadIndex];
        inputBuffer[threadIndex] = initialValue * uniforms.multiplier;
    } 128
  63. Benchmarks What we will be playing with:
    var inputBuffer = [Float](repeating: 1.0, count: 1_000_000)
    let multiplier: Float = 2.0
    What we will be comparing against:
    // CPU implementation
    for i in 0..<inputBuffer.count {
        inputBuffer[i] = inputBuffer[i] * multiplier
    } 129
  64. Benchmarks • Metal finished in 0.006s • The CPU finished in 0.1s • i.e. the CPU is ~17 times slower 130
  65. Benchmarks • Metal finished in 0.003s • The CPU finished in 0.0001s • i.e. the CPU is ~30 times faster 132
  66. Tips 1. Beware of memory alignment 2. Beware of CPU-side encoding overhead 3. Keep code divergence to a minimum 4. Use half instead of float whenever possible 5. Avoid using ints 6. Calculate threadgroup sizes thoughtfully 7. Cache reusable CPU-side objects 8. Don’t wait for the GPU to finish execution 133
  67. Metal Performance Shaders 134 • A framework of data-parallel algorithms

    for the GPU • Optimized for iOS • As simple as calling a library function
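As an illustrative sketch of how little code an MPS call takes (assumptions: a valid device, commandBuffer, sourceTexture and destinationTexture already exist; the sigma value is arbitrary):

```swift
import Metal
import MetalPerformanceShaders

// Sketch: a device-tuned Gaussian blur in a single encode call.
let blur = MPSImageGaussianBlur(device: device, sigma: 4.0)
blur.encode(commandBuffer: commandBuffer,
            sourceTexture: sourceTexture,
            destinationTexture: destinationTexture)
commandBuffer.commit()
```

The framework picks the best kernel variant for the current GPU, so this usually beats a hand-written blur shader.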
  68. CoreML 141 • Easy to use • Wide range of

    desktop frameworks • Almost as fast as manual encoding • GPU/CPU optimizations • Is not customizable • Sometimes buggy • Zero control
  69. CoreML 142 • Easy to use • Wide range of

    desktop frameworks • Almost as fast as manual encoding • GPU/CPU optimizations • Is not customizable • Sometimes buggy • Zero control DEPRECATED
  70. CoreML 144 • Easy to use • Wide range of

    desktop frameworks • Almost as fast as manual encoding • GPU/CPU optimizations • Is customizable • Still a bit buggy • Zero control