Compute Kernel with Metal

codelynx
February 11, 2017

These are the presentation materials from the Yokohama iOS developer meeting in February 2017. They give a very basic idea of Metal compute kernels and show example code.

Transcript

  1. Compute Kernel with Metal
    Kaz Yoshikawa
    Feb 2017 – Yokohama iOS Developers Meet-up

  2. Understanding Computing Kernel

  3. Executive Summary
    • Computing method using the GPU
    • Massively parallel computing
    • May not be suitable for complex algorithms

  4. Players
    MTLDevice
    MTLCommandQueue
    MTLCommandBuffer
    MTLLibrary
    MTLFunction
    MTLComputePipelineState
    MTLComputeCommandEncoder
    MTLBuffer
    MTLTexture

  6. Device, CommandQueue and CommandBuffer
    • MTLDevice
      • MTLCreateSystemDefaultDevice()
    • MTLCommandQueue
      • device.makeCommandQueue()
    • MTLCommandBuffer
      • commandQueue.makeCommandBuffer()
    [Figure: MTLCommandQueue, MTLCommandBuffer, MTLComputeCommandEncoder]
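
A minimal Swift sketch of the three calls on this slide. It assumes the current Swift API, in which makeCommandQueue() and makeCommandBuffer() return optionals; the helper name makeCommandObjects is hypothetical and not from the talk.

```swift
import Metal

// Hypothetical helper for illustration: creates the device, a command queue,
// and one command buffer. Returns nil when no Metal device is available.
func makeCommandObjects() -> (device: MTLDevice, queue: MTLCommandQueue, commandBuffer: MTLCommandBuffer)? {
    guard let device = MTLCreateSystemDefaultDevice(),
          let queue = device.makeCommandQueue(),
          let commandBuffer = queue.makeCommandBuffer() else {
        return nil
    }
    return (device, queue, commandBuffer)
}
```

The device and the queue are usually created once and kept for the lifetime of the app, while a command buffer is single-use: a new one is made for every batch of work.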

  7. Library, Function and MTLComputePipelineState
    • MTLLibrary
      • try! device.makeLibrary(source: shaderSource, options: nil)
      • device.newDefaultLibrary() (renamed makeDefaultLibrary() in Swift 4)
    • MTLFunction
      • library.makeFunction(name: "bezier_kernel")
    • MTLComputePipelineState
      • try! device.makeComputePipelineState(function: function)
    [Figure: MTLLibrary, MTLFunction, MTLComputePipelineState]
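
The library → function → pipeline state chain can be sketched as follows, with errors propagated instead of try!. The kernel double_values, the error type PipelineError, and the helper makePipelineState are all made up for this illustration; only the Metal calls themselves come from the slide.

```swift
import Metal

// Hypothetical kernel for this sketch only: doubles each float in a buffer.
let shaderSource = """
#include <metal_stdlib>
using namespace metal;

kernel void double_values(device float *values [[buffer(0)]],
                          uint id [[thread_position_in_grid]])
{
    values[id] = values[id] * 2.0f;
}
"""

// Hypothetical error type for a missing kernel name.
enum PipelineError: Error {
    case functionNotFound(String)
}

// Compiles the source at run time, looks up the kernel by name,
// and builds the compute pipeline state for it.
func makePipelineState(device: MTLDevice, functionName: String) throws -> MTLComputePipelineState {
    let library = try device.makeLibrary(source: shaderSource, options: nil)
    guard let function = library.makeFunction(name: functionName) else {
        throw PipelineError.functionNotFound(functionName)
    }
    return try device.makeComputePipelineState(function: function)
}
```

Building a pipeline state is relatively expensive, so it is normally done once per kernel and the result is reused.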

  8. Compute Pipeline State
    • Blueprint for compute parameters
    • Buffers
    • Textures
    • etc…

  9. Computing Kernel

  10. Computing Kernel
    • Shading language based on a subset of C++14
    • Restrictions (unsupported features)
      • lambda expressions, dynamic_cast operator, type identification, recursive function calls, new and delete operators, noexcept operator, goto statement, register and thread_local storage qualifiers, virtual function qualifier, derived classes, and exception handling

  11. Scalar Types
    • bool, char, int8_t, unsigned char, uchar
    • short, unsigned short, ushort
    • int, unsigned int, uint – 32bit
    • half – 16bit half precision, float – 32bit single precision
    • size_t, ptrdiff_t, void
    • no double

  12. Vector and Matrix Types
    • booln
    • charn, shortn, intn, ucharn, ushortn, uintn
    • halfn, floatn
    • halfnxm, floatnxm
    * n and m are each 2, 3, or 4

  13. A Glance at the Code
    // Defining structures
    struct MyVertexIn {
        float4 position;
        float4 color;
    };

    struct MyVertexOut {
        float4 position;
        float4 color;
    };

    // Defining a kernel function; [[buffer(n)]] specifies the buffer index
    kernel void my_compute_kernel(
        constant MyVertexIn *vertices [[buffer(0)]],
        device MyVertexOut *outVertexes [[buffer(1)]],
        uint id [[thread_position_in_grid]]
    )
    {
        …
    }
    * Just to give an idea, not working code (the float4 widths and the buffer indices 0 and 1 are illustrative)

  14. Qualifiers
    kernel void my_compute_kernel(
        constant MyVertexIn *vertices [[buffer(0)]],
        device MyVertexOut *outVertexes [[buffer(1)]],
        uint id [[thread_position_in_grid]]
    )
    {
        …
    }

  15. Address Space
    • device Address Space
      • Buffer memory objects allocated from the device memory pool that are both readable and writeable
    • threadgroup Address Space
      • Variables allocated in the threadgroup address space in a kernel function are allocated for each threadgroup executing the kernel, are shared by all threads in a threadgroup, and exist only for the lifetime of the threadgroup that is executing the kernel
    • constant Address Space
      • Buffer memory objects allocated from the device memory pool that are read-only
    • thread Address Space
      • The per-thread memory address space

  16. Compute Command Encoder
    • MTLComputeCommandEncoder
    let encoder = commandBuffer.makeComputeCommandEncoder()
    encoder.setComputePipelineState(computePipelineState)
    encoder.setBuffer(elementsBuffer, offset: 0, at: 0)
    encoder.setBuffer(vertexBuffer, offset: 0, at: 1)

  17. Thread Group
    • A kernel requires the task to be broken into small pieces
    let threadgroupsPerGrid = MTLSizeMake(elements.count, 1, 1)
    let threadsPerThreadgroup = MTLSizeMake(1, 1, 1)
    encoder.dispatchThreadgroups(threadgroupsPerGrid,
        threadsPerThreadgroup: threadsPerThreadgroup)
    * I still do not fully understand this part
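
For orientation: the total number of kernel invocations is threadgroups per grid multiplied by threads per threadgroup, and each invocation receives its own thread_position_in_grid. A common approach is to pick a threadgroup width larger than one (for example the pipeline's threadExecutionWidth) and round the number of groups up. The sketch below shows only that rounding step in plain Swift; the function name threadgroupCount is hypothetical.

```swift
// Hypothetical helper: number of threadgroups of `width` threads needed
// to cover `count` work items (ceiling division).
func threadgroupCount(forItems count: Int, threadsPerThreadgroup width: Int) -> Int {
    precondition(count >= 0 && width > 0, "count must be non-negative and width positive")
    return (count + width - 1) / width
}

// 1000 items with 64 threads per group need 16 groups (1024 threads in total).
let groups = threadgroupCount(forItems: 1000, threadsPerThreadgroup: 64)
```

Because the last group can overshoot the item count (1024 threads for 1000 items here), a kernel dispatched this way should compare its id against the real count and return early for the surplus threads.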

  18. Commit
    • Finally ready to commit
    encoder.endEncoding()
    commandBuffer.commit()
    • Wait, or add a completion handler (register the handler before calling commit())…
    commandBuffer.waitUntilCompleted()
    commandBuffer.addCompletedHandler { (buffer) in
        // do some work here
    }
    • Check the buffer
    • There must be something good in there!
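
Putting the encoder, dispatch, and commit steps together, here is a small end-to-end sketch. It is not the talk's Bezier code: the kernel double_values and the function runDoubling are hypothetical, and it uses the current Swift labels (setBuffer(_:offset:index:) rather than the at: label shown on the slides).

```swift
import Metal

// Hypothetical kernel: doubles each value and ignores surplus threads.
let doublingSource = """
#include <metal_stdlib>
using namespace metal;

kernel void double_values(device float *values [[buffer(0)]],
                          constant uint &count [[buffer(1)]],
                          uint id [[thread_position_in_grid]])
{
    if (id < count) { values[id] *= 2.0f; }
}
"""

// Runs the kernel over `input` and returns the result,
// or nil if Metal is unavailable or any setup step fails.
func runDoubling(_ input: [Float]) -> [Float]? {
    guard !input.isEmpty,
          let device = MTLCreateSystemDefaultDevice(),
          let queue = device.makeCommandQueue(),
          let library = try? device.makeLibrary(source: doublingSource, options: nil),
          let function = library.makeFunction(name: "double_values"),
          let pipeline = try? device.makeComputePipelineState(function: function),
          let buffer = device.makeBuffer(bytes: input,
                                         length: input.count * MemoryLayout<Float>.stride,
                                         options: .storageModeShared),
          let commandBuffer = queue.makeCommandBuffer(),
          let encoder = commandBuffer.makeComputeCommandEncoder() else {
        return nil
    }

    // Bind the pipeline and the kernel arguments.
    var count = UInt32(input.count)
    encoder.setComputePipelineState(pipeline)
    encoder.setBuffer(buffer, offset: 0, index: 0)
    encoder.setBytes(&count, length: MemoryLayout<UInt32>.size, index: 1)

    // Enough threadgroups to cover every element (rounded up).
    let width = min(pipeline.maxTotalThreadsPerThreadgroup, 64)
    let groups = (input.count + width - 1) / width
    encoder.dispatchThreadgroups(MTLSize(width: groups, height: 1, depth: 1),
                                 threadsPerThreadgroup: MTLSize(width: width, height: 1, depth: 1))
    encoder.endEncoding()

    // Submit and block until the GPU has finished.
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()

    // "Check the buffer": read the results back from shared memory.
    let pointer = buffer.contents().bindMemory(to: Float.self, capacity: input.count)
    return Array(UnsafeBufferPointer(start: pointer, count: input.count))
}
```

Blocking with waitUntilCompleted() keeps the sketch short; in an app that must stay responsive, a completion handler added before commit() is usually the better fit.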

  19. Computing Bezier Positions
    Live example

  20. Goal
    • Give the shader an array of path elements or equivalent
    • Produce many consecutive positions using a kernel
    • Use the Bezier calculation method from my Qiita article

  21. http://qiita.com/codelynx/items/f7e6a844aac3746a6b79

  22. Strategies
    • Path elements buffer
    • Vertex buffer
    • CPU estimates the length of path elements
    • A kernel produces vertices for a path element
    • There may be a better way…
    [Figure: Element Buffer with rows "#0 p0 … p3 0", "#1 p0 … p3 m1", "#n p0 … p3 m2"; Vertex Buffer with runs of pt starting at #0, #m1 and #m2]
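
One reading of the figure, offered as an assumption rather than the author's stated design: the last column of each element row (0, m1, m2) is the index in the vertex buffer where that element's vertices start. Under that reading, the CPU side can derive the offsets from the per-element vertex counts with a running sum. The function name vertexOffsets is hypothetical.

```swift
// Hypothetical helper: given how many vertices each path element will
// produce, returns each element's starting index in the vertex buffer
// and the total number of vertices to allocate.
func vertexOffsets(forCounts counts: [Int]) -> (offsets: [Int], total: Int) {
    var offsets: [Int] = []
    offsets.reserveCapacity(counts.count)
    var running = 0
    for count in counts {
        offsets.append(running)   // this element starts where the previous one ended
        running += count
    }
    return (offsets, running)
}

// Three elements producing 3, 5 and 2 vertices start at 0, 3 and 8.
let layout = vertexOffsets(forCounts: [3, 5, 2])
```

With the offsets fixed up front, every kernel thread writes to its own disjoint range of the vertex buffer, so the threads need no synchronization with each other.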

  23. bezier_kernel shader
    kernel void bezier_kernel(
        constant PathElement *elements [[buffer(0)]],
        device Vertex *outVertexes [[buffer(1)]],
        uint id [[thread_position_in_grid]]
    )
    {
        PathElement element = elements[id];
        int numberOfVertexes = element.numberOfVertexes;
        float2 p0 = element.p0;
        float2 p1 = element.p1;
        float2 p2 = element.p2;
        float2 p3 = element.p3;

        switch (element.type)
        {
        case PathElementTypeLineTo:
            …snip…
            break;
        case PathElementTypeQuadCurveTo:
            for (int index = 0; index < numberOfVertexes; index++)
            {
                float t = float(index) / float(numberOfVertexes);
                float2 q1 = p0 + (p1 - p0) * t;
                float2 q2 = p1 + (p2 - p1) * t;
                float2 r = q1 + (q2 - q1) * t;
                float w = w1 + (w2 - w1) * t;
                Vertex v = Vertex(half2(r.x, r.y), half(w));
                outVertexes[…] = v;
            }
            break;
        case PathElementTypeCurveTo:
            …snip…
            break;
        }
    }
    * Not the whole code (the point and width indices shown here are assumptions, and the outVertexes subscript is not shown)
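
For checking the kernel's output on the CPU, the quadratic case can be mirrored in plain Swift. This is a sketch of the same repeated linear interpolation (de Casteljau) that the loop above performs, not code from the talk; Point, lerp, and quadCurvePoints are hypothetical names, and the line-width interpolation is left out.

```swift
// Hypothetical 2D point type standing in for float2.
struct Point: Equatable {
    var x: Float
    var y: Float
}

// Linear interpolation between a and b at parameter t.
func lerp(_ a: Point, _ b: Point, _ t: Float) -> Point {
    return Point(x: a.x + (b.x - a.x) * t, y: a.y + (b.y - a.y) * t)
}

// Samples `count` points on the quadratic Bezier curve p0, p1, p2 at
// t = index / count, so t starts at 0 and stops short of 1, as in the kernel loop.
func quadCurvePoints(p0: Point, p1: Point, p2: Point, count: Int) -> [Point] {
    return (0..<count).map { index -> Point in
        let t = Float(index) / Float(count)
        let q1 = lerp(p0, p1, t)
        let q2 = lerp(p1, p2, t)
        return lerp(q1, q2, t)
    }
}

let samples = quadCurvePoints(p0: Point(x: 0, y: 0),
                              p1: Point(x: 1, y: 2),
                              p2: Point(x: 2, y: 0),
                              count: 4)
```

Since t never reaches 1, the end point of one element is not emitted; it appears as the first sample (t = 0) of the element that follows.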

  24. Playground!
    • Yellow – Core Graphics
    • Red – Compute Kernel
    • Looks good
    https://github.com/codelynx/BezierKernelPlayground

  25. Other Considerations

  26. Double or Triple Buffering
    • Avoid access collisions between the CPU and the GPU
    [Figure: two timelines. With a single buffer, the CPU and GPU take turns on Buffer#1. With two buffers, the CPU and GPU work at the same time, alternating between Buffer#1 and Buffer#2]
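
A common way to implement this is a counting semaphore that limits how many buffers are in flight. The sketch below is an assumption about how one might do it, not code from the talk; BufferRing is a hypothetical helper that hands out slot indices in round-robin order. It expects acquire() to be called from a single thread, and it relies on command buffers finishing in the order they were committed, so the oldest slot is the one that becomes free.

```swift
import Dispatch

// Hypothetical helper: hands out slot indices 0..<count in round-robin
// order and blocks the caller while every slot is still in flight.
final class BufferRing {
    let count: Int
    private let semaphore: DispatchSemaphore
    private var nextSlot = 0

    init(count: Int) {
        precondition(count > 0, "need at least one buffer")
        self.count = count
        self.semaphore = DispatchSemaphore(value: count)
    }

    // Call on the CPU side before writing into a buffer.
    func acquire() -> Int {
        semaphore.wait()
        let slot = nextSlot
        nextSlot = (nextSlot + 1) % count
        return slot
    }

    // Call once the GPU has finished with a slot,
    // for example from a command buffer's completion handler.
    func release() {
        semaphore.signal()
    }
}

// Intended use with Metal (shown as comments to keep the sketch CPU-only):
//   let slot = ring.acquire()
//   ... fill buffers[slot] and encode the work ...
//   commandBuffer.addCompletedHandler { _ in ring.release() }
//   commandBuffer.commit()
```

With a count of 2 or 3 the CPU can prepare the next buffer while the GPU is still reading the previous one, which is the overlap the second timeline shows.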

  27. Buffer Management
    • Memory is finite; recycle buffers where possible
    • The app crashes inside device.makeBuffer() rather than returning nil
    • It is hard to find out the reason (as of iOS 10)
    • Save memory resources and be a good citizen

  28. Wrap Up

  29. Wrap Up
    • Compute shaders are much easier than rendering shaders
    • Memory management can be a pain if you want to go one step further toward a high-performance shader
    • Be aware of memory alignment
    • Shaders are hard to debug: no breakpoints and no printf()

  30. One More Thing

  31. MetalBlendTester
    https://github.com/codelynx/MetalBlendTester

  32. Thanks
    Kaz Yoshikawa
