
How to Become a GPU Engineer in an Hour

CocoaHeads
February 13, 2018


The modern world would not be what it is without GPU computing. Modern console games, VR, AR, cryptocurrencies, machine learning: all of this runs on hot graphics processors.

Yet GPUs are not very popular among mobile developers: many think the topic is too hard, and some don't even realize that the iPhone has a GPU at all.

This talk aims to introduce a broad developer audience to GPU programming, with a focus on mobile platforms and, of course, the trendy topics.


Transcript

  1. Agenda • Computer graphics history • Modern rendering • Apple

    side of things • What is GPGPU? • Metal Compute Shaders • Hype train 2
  2. 1977 Atari 2600 128 bytes of RAM including call stack

    and the state of game world • Typical resolution: 160x192 • 128 colors in palette • 160 * 192 * 7 bits = 26 880 bytes per frame No framebuffer, graphics were generated in real-time. Literally. 5

  9. 1977 Atari 2600 VCS could only display five interactive objects

    at any one time: • 2 «player» sprites • 2 «missile» sprites • 1 «ball» sprite 12
  10. «Racing the beam» The VCS could only display five interactive objects at any one time: two «player» sprites, two «missile» sprites, and one «ball». But once the electron beam had drawn a sprite, the program could shift the position of that sprite horizontally and redraw it 13
  13. Blanking intervals were the only time programmers could do anything that didn't involve drawing graphics on the screen, such as reading joystick inputs, computing player movements, and scoring 16
  14. 1977 Nintendo Entertainment System 1983 • 8-bit colors • Still

    no framebuffer • PPU (Picture Processing Unit) • Tiled graphics (aka «character graphics») • Operates with tiles of 8x8 (or 8x16) pixels • 8 sprites per scanline • Another advantage is collision detection 
 (movable/non-movable sprites) 18
  16. 1977 Second Generation - Shaded Solids 1983 1987 1991 …

    • Very expensive, mostly used in professional simulators • Vertex lighting • Rasterization of filled polygons • Depth buffer and blending 21
  17. 1977 First «GPU» 1983 1987 1991 … 1999 NVidia releases the first «Graphics Processing Unit», the GeForce 256 • Defined what a GPU should be • Processed 10 million polygons per second • Vertex transform • Lighting • Barely programmable 23
  18. 1977 1983 1987 1991 … 1999 2001 GeForce 3 (and later GeForce FX), the first programmable GPUs • Introduced the concept of shaders • Vertex and fragment operations • Macro assembly language • Very limited:
    ADDR R0.xyz, eyePosition.xyzx, -f[TEX0].xyzx;
    DP3R R0.w, R0.xyzx, R0.xyzx;
    RSQR R0.w, R0.w;
    MULR R0.xyz, R0.w, R0.xyzx;
    ADDR R1.xyz, lightPosition.xyzx, -f[TEX0].xyzx;
    DP3R R0.w, R1.xyzx, R1.xyzx;
    RSQR R0.w, R0.w;
    MADR R0.xyz, R0.w, R1.xyzx, R0.xyzx;
    MULR R1.xyz, R0.w, R1.xyzx;
    DP3R R0.w, R1.xyzx, f[TEX1].xyzx;
    MAXR R0.w, R0.w, {0}.x; 24
  19. Recent trends • Aug02: 121M transistors, 500 MHz, 8 GFLOPS • Jan03: 130M transistors, 475 MHz, 20 GFLOPS • Dec03: 222M transistors, 400 MHz, 53 GFLOPS • 1.8x increase in transistors • 20% decrease in clock speed • 6.6x GFLOPS speedup 26
  20. • GPUs are very limited in what they can do

    • Can only draw primitives: triangles, lines, points • Highly optimized for floating point operations 28
  21. Custom shader example Next, we take a gradient texture and sample it at a height proportional to the current displacement 50
  24. 2007 OpenGL ES 1.1 (iPhone 2G) 2010 OpenGL ES 2.0

    (iPhone 4) 2016 OpenGL ES 3.0 (iPhone 7) 61
  26. Low CPU overhead Modern GPU features Do expensive tasks less

    often Optimized for CPU behaviour Thinnest possible API 65
  33. MTLDevice MTLCommandQueue = MTLCommandBuffer Made on a per-queue basis:
    guard let commandBuffer = commandQueue.makeCommandBuffer() else {
        fatalError("Could not create command buffer")
    } 80
  35. Pipeline state • Represents the GPU state that needs to be set for the current command • Must be initialized with shader functions • Each pipeline state type has its own optional parameters • Usually cached and reused 86
  36. Pipeline state
    // Create a reusable pipeline state for rendering geometry
    let stateDescriptor = MTLRenderPipelineDescriptor()
    stateDescriptor.vertexFunction = vertexFunc
    stateDescriptor.fragmentFunction = fragmentFunc 87
  37. // Send the buffer to the command queue
    commandBuffer.commit()
    // Wait until all commands are executed
    commandBuffer.waitUntilCompleted()
    // Or subscribe to the completion event
    commandBuffer.addCompletedHandler { _ in } 95
  38. Early GPGPU 1999-2001 • Hoff (1999): Voronoi diagrams on NVIDIA TNT2 • Larsen & McAllister (2001): first GPU matrix multiplication (8-bit) • Rumpf & Strzodka (2001): first GPU PDEs (diffusion, image segmentation) • NVIDIA SDK Game of Life, Shallow Water (Greg James, 2001) 97
  39. Early GPGPU 1999-2001 • You needed a PhD in computer graphics to do this • Financial companies hired game developers 98
  40. 2002 1999-2001 Packing data into RGBA texture channels: R G B A | 0.17 0.21 0.1 0.2 | 0.1 0.21 0.2 0.0 102
  41. 2002 1999-2001 2007 CUDA • First GPU architecture and software platform designed for computing • First C/C++ language and compiler for GPUs • 2007 marked the start of a massive surge in GPGPU development 103
  42. 2002 1999-2001 2007 CUDA (diagram: a fragment program with input/output registers vs. a CUDA thread program with a thread ID) 104
  43. Metal Compute Shaders • Act just like a fragment or vertex shader, but general-purpose • Written with the kernel keyword • Suitable for highly parallel tasks • Can be put in the same command buffer as render/blit commands 105
  44. Task: multiply every element of a float buffer by a given value. A purely parallel job, well suited for compute shaders 106
  45. 1. Declare a class ArrayProcessor 2. Take MTLDevice or MTLCommandQueue as an injected dependency 3. Cache static objects in init(commandQueue:) 108
  46. public class ArrayProcessor {
        public let commandQueue: MTLCommandQueue
        public let device: MTLDevice
        public let bufferMultiplierPipelineState: MTLComputePipelineState
        public init(commandQueue: MTLCommandQueue) { … }
        …
    } 109
  47. Next, implement encoding the GPU work on the CPU side 4. Prepare a container type for the kernel’s parameters
    fileprivate struct Uniforms {
        public let multiplier: Float
        public let count: UInt32
    }
    NOTE: Be careful with Swift’s memory layout; use C/C++ declarations to avoid tricky bugs 110
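To make that warning concrete, here is a small sketch (illustrative, not from the talk) that checks the Swift-side layout of the uniforms struct before copying it into a Metal buffer. For this particular pair of 4-byte fields, Swift and the Metal Shading Language happen to agree; a Bool or a SIMD type in the struct would introduce padding and break a naive memcpy:

```swift
// Sketch: verify the Swift uniforms struct has the tightly packed
// 8-byte layout the Metal kernel expects (two 4-byte fields, no padding).
struct Uniforms {
    let multiplier: Float  // offset 0, 4 bytes
    let count: UInt32      // offset 4, 4 bytes
}

print(MemoryLayout<Uniforms>.size)       // 8
print(MemoryLayout<Uniforms>.stride)     // 8
print(MemoryLayout<Uniforms>.alignment)  // 4
```

If size or stride differs from what the shader-side struct declares, the kernel will read garbage past the first mismatched field.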
  48. 5. Encode compute kernel command into command queue public class

    ArrayProcessor { … public func process(array: [Float], multiplier: Float) { … } … } MTLDevice MTLCommandQueue MTLComputePipelineState 111
  49. public class ArrayProcessor {
        public func process(array: [Float], multiplier: Float) {
        }
    }
    Objects involved: MTLDevice, MTLCommandQueue, MTLComputePipelineState, the array buffer, the uniform buffer, MTLComputeCommandEncoder 112–117
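Filling in the empty process(array:multiplier:) body with all of those objects, a minimal sketch might look like this (assumptions beyond the slides: device, commandQueue and bufferMultiplierPipelineState are the cached properties from the initializer, the buffers use shared storage, error handling is reduced to an early return, and the method returns the result array for convenience):

```swift
import Metal

extension ArrayProcessor {
    public func process(array: [Float], multiplier: Float) -> [Float] {
        var input = array
        var uniforms = Uniforms(multiplier: multiplier, count: UInt32(input.count))

        // Wrap the data and the kernel parameters in MTLBuffers.
        guard let arrayBuffer = device.makeBuffer(bytes: &input,
                                                  length: MemoryLayout<Float>.stride * input.count),
              let uniformBuffer = device.makeBuffer(bytes: &uniforms,
                                                    length: MemoryLayout<Uniforms>.stride),
              let commandBuffer = commandQueue.makeCommandBuffer(),
              let encoder = commandBuffer.makeComputeCommandEncoder()
        else { return array }

        // Bind the pipeline state and the two buffers the kernel reads.
        encoder.setComputePipelineState(bufferMultiplierPipelineState)
        encoder.setBuffer(arrayBuffer, offset: 0, index: 0)
        encoder.setBuffer(uniformBuffer, offset: 0, index: 1)

        // One thread per element, rounded up to whole threadgroups.
        let width = bufferMultiplierPipelineState.threadExecutionWidth
        let groups = MTLSize(width: (input.count + width - 1) / width, height: 1, depth: 1)
        encoder.dispatchThreadgroups(groups,
                                     threadsPerThreadgroup: MTLSize(width: width, height: 1, depth: 1))
        encoder.endEncoding()

        commandBuffer.commit()
        commandBuffer.waitUntilCompleted()

        // Read the results back from the shared buffer.
        let pointer = arrayBuffer.contents().bindMemory(to: Float.self, capacity: input.count)
        return Array(UnsafeBufferPointer(start: pointer, count: input.count))
    }
}
```

waitUntilCompleted() is fine for a demo; in production code you would use a completed handler instead of blocking the CPU, as the tips slide later advises.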
  55. Threads and threadgroups • Metal executes your kernel function over a 1D, 2D or 3D grid • Each point in the grid represents a single instance of your kernel function, called a thread • Threads are organized into threadgroups that can share a common block of memory 118
  56. Threads and threadgroups
    kernel void myKernel(uint2 threadgroup_position_in_grid [[ threadgroup_position_in_grid ]],
                         uint2 thread_position_in_threadgroup [[ thread_position_in_threadgroup ]],
                         uint2 threads_per_threadgroup [[ threads_per_threadgroup ]]) 120
  58. Threads and threadgroups Threads in a threadgroup are executed in a SIMD fashion (Single Instruction, Multiple Data): on a divergent if, all threads execute both branches, so keep divergence to a minimum 124
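As an illustrative Metal sketch of that point (not from the talk): both kernels below do the same work, but in the first one threads of a SIMD group that disagree on the condition serialize through both branches, while the second keeps a uniform instruction stream via MSL's select built-in:

```metal
#include <metal_stdlib>
using namespace metal;

// Divergent: threads that disagree on the condition execute both branches.
kernel void scaleDivergent(device float* data [[ buffer(0) ]],
                           uint i [[ thread_position_in_grid ]])
{
    if (data[i] > 0.5f) {
        data[i] = data[i] * 2.0f;
    } else {
        data[i] = 0.0f;
    }
}

// Branchless: select(a, b, cond) yields b when cond is true, a otherwise.
kernel void scaleUniform(device float* data [[ buffer(0) ]],
                         uint i [[ thread_position_in_grid ]])
{
    data[i] = select(0.0f, data[i] * 2.0f, data[i] > 0.5f);
}
```

For a branch this cheap the compiler may well generate the same code for both; the pattern matters when the branches contain real work.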
  59. Threads and threadgroups The division of threadgroups into SIMD groups is defined by Metal. The SIMD group size is returned by threadExecutionWidth on the compute pipeline state object. All you have to do is define the threadgroup size 125
  60. 6. Calculate threadgroup count and size
    let executionWidth = bufferMultiplierPipelineState.threadExecutionWidth
    let threadgroupsPerGrid = MTLSize(width: (buffer.count + executionWidth - 1) / executionWidth,
                                      height: 1, depth: 1)
    let threadsPerThreadgroup = MTLSize(width: executionWidth, height: 1, depth: 1) 126
  61. computeEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup) 127
  62. 7. Write your kernels
    kernel void bufferMultiplier(device float* inputBuffer [[ buffer(0) ]],
                                 const device BufferMultiplierUniforms& uniforms [[ buffer(1) ]],
                                 const uint threadIndex [[ thread_position_in_grid ]])
    {
        if (threadIndex >= uniforms.bufferSize) {
            return;
        }
        const float initialValue = inputBuffer[threadIndex];
        inputBuffer[threadIndex] = initialValue * uniforms.multiplier;
    } 128
  63. Benchmarks What we will be playing with:
    var inputBuffer = [Float](repeating: 1.0, count: 1_000_000)
    let multiplier: Float = 2.0
    What we will be comparing against:
    // CPU implementation
    for i in 0..<inputBuffer.count {
        inputBuffer[i] = inputBuffer[i] * multiplier
    } 129
  64. Benchmarks • Metal finished in 0.006s • The CPU finished in 0.1s • i.e. the CPU is ~17 times slower 130
  65. Benchmarks • Metal finished in 0.003s • The CPU finished in 0.0001s • i.e. the CPU is ~30 times faster 132
  66. Tips 1. Beware of memory alignment 2. Beware of CPU-side encoding overhead 3. Keep code divergence to a minimum 4. Use half instead of float whenever possible 5. Avoid using ints 6. Calculate threadgroup sizes thoughtfully 7. Cache reusable CPU-side objects 8. Don’t wait for the GPU to finish execution 133
  67. Metal Performance Shaders 134 • A framework of data-parallel algorithms

    for the GPU • Optimized for iOS • As simple as calling a library function
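As an illustrative sketch of how little code an MPS call takes (assumptions: a valid device, commandBuffer, sourceTexture and destinationTexture already exist; the sigma value is arbitrary):

```swift
import Metal
import MetalPerformanceShaders

// Sketch: a device-tuned Gaussian blur in a single encode call.
let blur = MPSImageGaussianBlur(device: device, sigma: 4.0)
blur.encode(commandBuffer: commandBuffer,
            sourceTexture: sourceTexture,
            destinationTexture: destinationTexture)
commandBuffer.commit()
```

The framework picks the best kernel variant for the current GPU, so this usually beats a hand-written blur shader.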
  68. CoreML 141 • Easy to use • Wide range of

    desktop frameworks • Almost as fast as manual encoding • GPU/CPU optimizations • Is not customizable • Sometimes buggy • Zero control
  69. CoreML 142 • Easy to use • Wide range of

    desktop frameworks • Almost as fast as manual encoding • GPU/CPU optimizations • Is not customizable • Sometimes buggy • Zero control DEPRECATED
  70. CoreML 144 • Easy to use • Wide range of

    desktop frameworks • Almost as fast as manual encoding • GPU/CPU optimizations • Is customizable • Still a bit buggy • Zero control