Compute Kernel with Metal

Slide 1

Slide 1 text

Compute Kernel with Metal
Kaz Yoshikawa
Feb 2017 – Yokohama iOS Developers Meet-up

Slide 2

Slide 2 text

Understanding Computing Kernel

Slide 3

Slide 3 text

Executive Summary
• A computing method that runs on the GPU
• Massively parallel computing
• May not be suitable for complex algorithms

Slide 4

Slide 4 text

Players
• MTLDevice
• MTLCommandQueue
• MTLCommandBuffer
• MTLLibrary
• MTLFunction
• MTLComputePipelineState
• MTLComputeCommandEncoder
• MTLBuffer
• MTLTexture

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

Device, CommandQueue and CommandBuffer
• MTLDevice
  • MTLCreateSystemDefaultDevice()
• MTLCommandQueue
  • device.makeCommandQueue()
• MTLCommandBuffer
  • commandQueue.makeCommandBuffer()
[Diagram: MTLCommandQueue, MTLCommandBuffer, MTLComputeCommandEncoder]

Slide 7

Slide 7 text

Library, Function and MTLComputePipelineState
• MTLLibrary
  • try! device.makeLibrary(source: shaderSource, options: nil)
  • device.newDefaultLibrary()
• MTLFunction
  • library.makeFunction(name: "bezier_kernel")
• MTLComputePipelineState
  • try! device.makeComputePipelineState(function: function)
[Diagram: MTLLibrary, MTLFunction, MTLComputePipelineState]

Slide 8

Slide 8 text

Compute Pipeline State
• Blueprint for computing parameters
  • Buffers
  • Textures
  • etc…

Slide 9

Slide 9 text

Computing Kernel

Slide 10

Slide 10 text

Computing Kernel
• The shading language is based on a subset of C++14
• Restrictions (not supported):
  • lambda expressions, dynamic_cast operator, type identification, recursive function calls, new and delete operators, noexcept operator, goto statement, register and thread_local storage qualifiers, virtual function qualifier, derived classes and exception handling

Slide 11

Slide 11 text

Scalar Types
• bool, char, int8_t, unsigned char, uchar
• short, unsigned short, ushort
• int, unsigned int, uint – 32 bit
• half – 16-bit half precision, float – 32-bit single precision
• size_t, ptrdiff_t, void
• no double

Slide 12

Slide 12 text

Vector and Matrix Types
• booln
• charn, shortn, ucharn, ushortn, uintn
• halfn, floatn
• halfnxm, floatnxm
* n and m are 2, 3 or 4

Slide 13

Slide 13 text

A Glance at the Code

    // Defining a structure
    struct MyVertexIn {
        float4 position;
        float4 color;
    };

    struct MyVertexOut {
        float4 position;
        float4 color;
    };

    // Defining a kernel function; [[buffer(n)]] specifies the buffer index
    kernel void my_compute_kernel(
        constant MyVertexIn *vertices [[buffer(0)]],
        device MyVertexOut *outVertexes [[buffer(1)]],
        uint id [[thread_position_in_grid]])
    {
        …
    }

* Just to give an idea, not working code

Slide 14

Slide 14 text

Qualifiers

    kernel void my_compute_kernel(
        constant MyVertexIn *vertices [[buffer(0)]],
        device MyVertexOut *outVertexes [[buffer(1)]],
        uint id [[thread_position_in_grid]])
    {
        …
    }

Slide 15

Slide 15 text

Address Space
• device Address Space
  • Buffer memory objects allocated from the device memory pool that are both readable and writeable
• threadgroup Address Space
  • Variables allocated in the threadgroup address space in a kernel function are allocated for each threadgroup executing the kernel, are shared by all threads in that threadgroup, and exist only for the lifetime of the threadgroup that is executing the kernel
• constant Address Space
  • Buffer memory objects allocated from the device memory pool that are read-only
• thread Address Space
  • The per-thread memory address space

Slide 16

Slide 16 text

Compute Command Encoder
• MTLComputeCommandEncoder

    let encoder = commandBuffer.makeComputeCommandEncoder()
    encoder.setComputePipelineState(computePipelineState)
    encoder.setBuffer(elementsBuffer, offset: 0, at: 0)
    encoder.setBuffer(vertexBuffer, offset: 0, at: 1)

Slide 17

Slide 17 text

Thread Group
• A kernel requires the task to be broken into small pieces

    let threadgroupsPerGrid = MTLSizeMake(elements.count, 1, 1)
    let threadsPerThreadgroup = MTLSizeMake(1, 1, 1)
    encoder.dispatchThreadgroups(threadgroupsPerGrid,
                                 threadsPerThreadgroup: threadsPerThreadgroup)

* I still do not fully understand this part
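The arithmetic behind the two sizes can be sketched without Metal at all. A dispatch launches (threadgroups per grid) × (threads per threadgroup) threads, so for N work items and a group width W the number of groups is N / W rounded up, and the kernel has to ignore the surplus threads in the last group. The helper name `threadgroupLayout` below is hypothetical; in a real app the group width would typically be taken from the pipeline state's `threadExecutionWidth`, kept at or below `maxTotalThreadsPerThreadgroup`.

```swift
/// Hypothetical helper: splits `itemCount` work items into 1-D threadgroups.
/// `groupWidth` stands in for a value chosen from the pipeline state
/// (for example its threadExecutionWidth).
func threadgroupLayout(itemCount: Int, groupWidth: Int)
    -> (groups: Int, threadsPerGroup: Int, idleThreads: Int) {
    precondition(itemCount >= 0 && groupWidth > 0)
    // Round up so that the last, partially filled group is still dispatched.
    let groups = (itemCount + groupWidth - 1) / groupWidth
    // Threads launched beyond itemCount must return early inside the kernel.
    let idleThreads = groups * groupWidth - itemCount
    return (groups, groupWidth, idleThreads)
}

let layout = threadgroupLayout(itemCount: 1000, groupWidth: 32)
print(layout.groups, layout.threadsPerGroup, layout.idleThreads) // 32 32 24
```

These numbers would then feed `MTLSizeMake(layout.groups, 1, 1)` and `MTLSizeMake(layout.threadsPerGroup, 1, 1)`.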

Slide 18

Slide 18 text

Commit
• Finally ready to commit

    encoder.endEncoding()
    commandBuffer.commit()

• Wait or add a completion handler…

    commandBuffer.waitUntilCompleted()

    commandBuffer.addCompletedHandler { (buffer) in
        // do some work here
    }

• Check the buffer
  • there must be something good in there!
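Putting the players together, the whole round trip might look like the sketch below. It is not the code from the talk: the `double_values` kernel and the `doubledOnGPU` function are made up for illustration, and it uses the spellings of current SDKs (factory methods return optionals, and `setBuffer` takes `index:` where the slides use `at:`). On a platform without Metal the function simply returns nil.

```swift
#if canImport(Metal)
import Metal
#endif

// Illustrative kernel (not from the talk): doubles each input value.
let kernelSource = """
#include <metal_stdlib>
using namespace metal;

kernel void double_values(device const float *input  [[buffer(0)]],
                          device float       *output [[buffer(1)]],
                          constant uint      &count  [[buffer(2)]],
                          uint id [[thread_position_in_grid]]) {
    if (id >= count) { return; }   // skip padding threads in the last group
    output[id] = input[id] * 2.0f;
}
"""

/// Returns nil when Metal is unavailable or any setup step fails.
func doubledOnGPU(_ input: [Float]) -> [Float]? {
#if canImport(Metal)
    // Device, queue, library, function and pipeline state.
    guard !input.isEmpty,
          let device = MTLCreateSystemDefaultDevice(),
          let queue = device.makeCommandQueue(),
          let library = try? device.makeLibrary(source: kernelSource, options: nil),
          let function = library.makeFunction(name: "double_values"),
          let pipeline = try? device.makeComputePipelineState(function: function)
    else { return nil }

    // Buffers, command buffer and encoder.
    let byteCount = input.count * MemoryLayout<Float>.stride
    guard let inBuffer = device.makeBuffer(bytes: input, length: byteCount, options: []),
          let outBuffer = device.makeBuffer(length: byteCount, options: []),
          let commandBuffer = queue.makeCommandBuffer(),
          let encoder = commandBuffer.makeComputeCommandEncoder()
    else { return nil }

    var count = UInt32(input.count)
    encoder.setComputePipelineState(pipeline)
    encoder.setBuffer(inBuffer, offset: 0, index: 0)
    encoder.setBuffer(outBuffer, offset: 0, index: 1)
    encoder.setBytes(&count, length: MemoryLayout<UInt32>.size, index: 2)

    // Enough groups to cover every element, rounded up.
    let width = pipeline.threadExecutionWidth
    let groups = MTLSize(width: (input.count + width - 1) / width, height: 1, depth: 1)
    encoder.dispatchThreadgroups(groups,
                                 threadsPerThreadgroup: MTLSize(width: width, height: 1, depth: 1))
    encoder.endEncoding()

    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()

    // Read the results back from the shared buffer.
    let result = outBuffer.contents().bindMemory(to: Float.self, capacity: input.count)
    return Array(UnsafeBufferPointer(start: result, count: input.count))
#else
    return nil
#endif
}

if let doubled = doubledOnGPU([1, 2, 3]) {
    print(doubled)
}
```

If a completion handler is preferred over blocking, note that `addCompletedHandler` has to be registered before `commit()` is called.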

Slide 19

Slide 19 text

Computing Bezier Positions
A live example

Slide 20

Slide 20 text

Goal
• Give the shader an array of path elements or equivalent
• Produce many consecutive positions using a kernel
• Use the Bezier calculation method from my Qiita article

Slide 21

Slide 21 text

http://qiita.com/codelynx/items/f7e6a844aac3746a6b79

Slide 22

Slide 22 text

Strategies
• Path elements buffer
• Vertex buffer
• CPU estimates the length of path elements
• A kernel produces vertices for a path element
• There may be a better way…
[Diagram]
Element Buffer: #0 [p0 … p3, 0] · #1 [p0 … p3, m1] · #n [p0 … p3, m2]
Vertex Buffer: #0 [pt … pt] · #m1 [pt …] · #m2 [… pt]
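One way to read the diagram is that each path element carries the index where its vertices start in the shared vertex buffer (0, m1, m2, …). Under that assumption, the CPU side reduces to a running sum over the estimated vertex counts. The function name `vertexOffsets` and the sample counts are hypothetical.

```swift
/// Hypothetical helper: given the CPU's estimate of how many vertices each
/// path element needs, returns where each element starts writing in the
/// shared vertex buffer, plus the total number of vertices to allocate.
func vertexOffsets(for counts: [Int]) -> (offsets: [Int], total: Int) {
    var offsets: [Int] = []
    offsets.reserveCapacity(counts.count)
    var running = 0
    for count in counts {
        offsets.append(running)   // an element starts where the previous one ended
        running += count
    }
    return (offsets, running)
}

let plan = vertexOffsets(for: [12, 30, 8])
print(plan.offsets, plan.total) // [0, 12, 42] 50
```

Because every element gets its own disjoint range, the kernel threads can write their vertices without synchronizing with each other.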

Slide 23

Slide 23 text

bezier_kernel shader

    kernel void bezier_kernel(
        constant PathElement *elements [[buffer(0)]],
        device Vertex *outVertexes [[buffer(1)]],
        uint id [[thread_position_in_grid]])
    {
        PathElement element = elements[id];
        int numberOfVertexes = element.numberOfVertexes;
        float2 p0 = element.p0;
        float2 p1 = element.p1;
        float2 p2 = element.p2;
        float2 p3 = element.p3;
        switch (element.type) {
        case PathElementTypeLineTo:
            … snip …
            break;
        case PathElementTypeQuadCurveTo:
            for (int index = 0; index < numberOfVertexes; index++) {
                float t = float(index) / float(numberOfVertexes);
                float2 q1 = p0 + (p1 - p0) * t;
                float2 q2 = p1 + (p2 - p1) * t;
                float2 r = q1 + (q2 - q1) * t;
                float w = w1 + (w2 - w1) * t;
                Vertex v = Vertex(half2(r.x, r.y), half(w));
                outVertexes[…] = v;
            }
            break;
        case PathElementTypeCurveTo:
            … snip …
            break;
        }
    }

* Not the whole code
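Since a shader is hard to debug, a CPU version of the same math is handy for checking the kernel's output. The sketch below covers only the quadratic case, using repeated linear interpolation as in the loop above; the function name `quadraticBezierPoints` and the sample control points are made up for illustration.

```swift
/// CPU reference for the quadratic case: repeated linear interpolation.
func quadraticBezierPoints(_ p0: SIMD2<Float>, _ p1: SIMD2<Float>, _ p2: SIMD2<Float>,
                           count: Int) -> [SIMD2<Float>] {
    var points: [SIMD2<Float>] = []
    points.reserveCapacity(count)
    for index in 0..<count {
        // t runs over [0, 1), matching the kernel loop.
        let t = Float(index) / Float(count)
        let q1 = p0 + (p1 - p0) * t      // point on the first control segment
        let q2 = p1 + (p2 - p1) * t      // point on the second control segment
        points.append(q1 + (q2 - q1) * t) // point on the curve
    }
    return points
}

let points = quadraticBezierPoints(SIMD2(0, 0), SIMD2(1, 2), SIMD2(2, 0), count: 4)
// points[2] is the curve at t = 0.5, which lies at x = 1, y = 1.
print(points[2].x, points[2].y)
```

Comparing these values with the contents of the vertex buffer after the command buffer completes is a cheap substitute for a breakpoint.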

Slide 24

Slide 24 text

Playground!
• Yellow – Core Graphics
• Red – Compute Kernel
• Looks good
https://github.com/codelynx/BezierKernelPlayground

Slide 25

Slide 25 text

Other Considerations

Slide 26

Slide 26 text

Double or Triple Buffering
• Avoid access collisions between CPU and GPU
[Timeline diagrams: with Buffer#1 alone, CPU and GPU take turns over time; with Buffer#1 and Buffer#2, CPU and GPU work at the same time on different buffers]
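A common way to arrange this is a small ring of buffers guarded by a counting semaphore: the CPU waits for a free slot before writing, and the slot is handed back once the GPU is done with it. The sketch below shows only the bookkeeping, with slot indices standing in for real MTLBuffer objects; the `BufferRing` class is hypothetical, and `acquire()` is assumed to be called from a single producer thread.

```swift
import Dispatch

/// Hypothetical ring of reusable slots guarded by a counting semaphore.
final class BufferRing {
    private let semaphore: DispatchSemaphore
    private let slotCount: Int
    private var next = 0

    init(slotCount: Int) {
        self.slotCount = slotCount
        self.semaphore = DispatchSemaphore(value: slotCount)
    }

    /// Blocks until a slot is free, then returns its index for the CPU to fill.
    func acquire() -> Int {
        semaphore.wait()
        let slot = next
        next = (next + 1) % slotCount
        return slot
    }

    /// Call when the GPU has finished with a slot, e.g. from a completed handler.
    func release() {
        semaphore.signal()
    }
}

let ring = BufferRing(slotCount: 3)
let first = ring.acquire()
let second = ring.acquire()
let third = ring.acquire()
ring.release()                 // the GPU finished with the oldest slot
let fourth = ring.acquire()    // so the CPU can reuse it
print(first, second, third, fourth) // 0 1 2 0

// Hand every slot back before the ring goes away.
ring.release(); ring.release(); ring.release()
```

With three slots the CPU can run up to two frames ahead of the GPU; a fourth `acquire()` without a matching `release()` would block until the GPU catches up.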

Slide 27

Slide 27 text

Buffer Management
• Memory resources are finite; recycle buffers where possible
• The system crashes at device.makeBuffer() rather than returning nil
• It is hard to find out the reason (as of iOS 10)
• Save memory resources and be a good citizen

Slide 28

Slide 28 text

Wrap Up

Slide 29

Slide 29 text

Wrap Up
• Compute shaders are much easier than rendering shaders
• Memory management can be a pain if you want to go one step further toward a high-performance shader
• Be aware of memory alignment
• Shaders are hard to debug – no breakpoints and no printf()
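As a small illustration of the alignment point: a three-component float vector occupies 16 bytes, not 12, both in Swift's SIMD3<Float> and in the shading language's float3, so a struct that looks like 16 bytes of data can take 32 bytes per element. The `Particle` struct below is a made-up example, not a type from the talk.

```swift
// Swift-side mirror of a hypothetical shader struct. SIMD3<Float> is padded
// to 16 bytes, matching float3 in the Metal Shading Language.
struct Particle {
    var position: SIMD3<Float>
    var mass: Float
}

print(MemoryLayout<SIMD3<Float>>.size, MemoryLayout<SIMD3<Float>>.alignment) // 16 16
print(MemoryLayout<Particle>.stride) // 32
```

Sizing buffers with `MemoryLayout<T>.stride` rather than adding up field sizes by hand keeps the CPU and GPU views of the data in step.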