Slide 1

Slide 1 text

Simon Gladman | flexmonkey.blogspot.com | @FlexMonkey
iOS GPU Programming with Swift & Metal
Is that a Cray in your pocket…?
Simon Gladman for Swift Summit, March 2015
Twitter: @FlexMonkey
Blog: flexmonkey.blogspot.com
GitHub: github.com/FlexMonkey

Slide 2

Slide 2 text

What is Metal?
• Metal is a framework for GPU programming
• It sits at a lower level of abstraction than technologies such as SpriteKit and SceneKit
• More work for the developer, but more powerful
• Designed for GPU-accelerated 3D graphics and data-parallel computation
• It’s tied to iOS and requires an A7 processor or later
• Shaders are written in a C++-based language

Slide 3

Slide 3 text

CPU versus GPU
• CPUs handle a few software threads at a time
• GPUs handle hundreds or thousands of threads in parallel

Slide 4

Slide 4 text

Different Types of Shader
• Vertex
  • Takes the three-dimensional position of each vertex of each triangle in a scene and maps it to a two-dimensional screen coordinate.
• Fragment
  • Computes the colour of each pixel.
  • Manages textures, shadows, highlights and shading.

Slide 5

Slide 5 text

Different Types of Shader
• Kernel
  • Allows us to build programs that operate in parallel on arrays or grids of data

Slide 6

Slide 6 text

Setting up Metal in a Swift Project
• Create a device
    let device: MTLDevice = MTLCreateSystemDefaultDevice()
• Create a default library
    let defaultLibrary: MTLLibrary = device.newDefaultLibrary()
• Create a command queue
    let commandQueue: MTLCommandQueue = device.newCommandQueue()

Slide 7

Slide 7 text

Setting up Metal: The kernel function
• Define the kernel function
    let kernelFunction: MTLFunction = defaultLibrary.newFunctionWithName("particleRendererShader")
• Define the pipeline state
    let pipelineState: MTLComputePipelineState = device.newComputePipelineStateWithFunction(kernelFunction!, error: nil)

Slide 8

Slide 8 text

Setting up Metal: The Command Buffer
• Define the command buffer
    let commandBuffer: MTLCommandBuffer = commandQueue.commandBuffer()

Slide 9

Slide 9 text

Setting up Metal: The Command Encoder
• Define the command encoder
    let commandEncoder: MTLComputeCommandEncoder = commandBuffer.computeCommandEncoder()
    commandEncoder.setComputePipelineState(pipelineState)

Slide 10

Slide 10 text

Preparing the kernel function: resources
• Set some parameters
    var particleBrightness: Float = 0.8
    let particleBrightnessBuffer: MTLBuffer = device.newBufferWithBytes(&particleBrightness,
        length: sizeof(Float), options: nil)
    commandEncoder.setBuffer(particleBrightnessBuffer, offset: 0, atIndex: 2)

Slide 11

Slide 11 text

Texture instantiation
• Initialise textures
    let textureDescriptor = MTLTextureDescriptor.texture2DDescriptorWithPixelFormat(
        MTLPixelFormat.RGBA8Unorm,
        width: Int(imageSide),
        height: Int(imageSide),
        mipmapped: false)
    var textureA: MTLTexture = device.newTextureWithDescriptor(textureDescriptor)

Slide 12

Slide 12 text

Passing the texture to the kernel function
• Define input and output textures
    var textureA: MTLTexture!
    commandEncoder.setTexture(textureA, atIndex: 0)

Slide 13

Slide 13 text

Creating a Simple Particle System
• Create a Particle type in Swift
    struct Particle {
        var positionX: Float = 0
        var positionY: Float = 0
        var velocityX: Float = 0
        var velocityY: Float = 0
    }
• Create its equivalent in Metal
    struct Particle {
        float positionX;
        float positionY;
        float velocityX;
        float velocityY;
    };
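The two structs above only work together if they share the same memory layout. A minimal sketch, in current Swift rather than the Swift 1.x of the slides, of how that could be checked on the Swift side; the constant names are illustrative and not from the talk:

```swift
// Swift mirror of the Metal-side Particle struct from the slide.
struct Particle {
    var positionX: Float = 0
    var positionY: Float = 0
    var velocityX: Float = 0
    var velocityY: Float = 0
}

// Four 4-byte floats with no padding. A Metal struct of four `float`
// members occupies the same 16 bytes, so array elements line up one-to-one.
let particleSize = MemoryLayout<Particle>.size
let particleStride = MemoryLayout<Particle>.stride
let particleAlignment = MemoryLayout<Particle>.alignment

print(particleSize, particleStride, particleAlignment) // prints "16 16 4"
```

If a field is added on one side only, the stride changes and every element after the first is read at the wrong offset, so a check like this is cheap insurance.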

Slide 14

Slide 14 text

Creating a Simple Particle System
• Create and populate an array of Particle instances in Swift
    var particles = [Particle]()

    func setUpParticles() {
        for _ in 0 ..< particleCount {
            let positionX = Float(arc4random() % UInt32(imageSide))
            let positionY = Float(arc4random() % UInt32(imageSide))
            let velocityX = (Float(arc4random() % 10) - 5) / 10.0
            let velocityY = (Float(arc4random() % 10) - 5) / 10.0
            let particle = Particle(
                positionX: positionX, positionY: positionY,
                velocityX: velocityX, velocityY: velocityY)
            particles.append(particle)
        }
    }

Slide 15

Slide 15 text

Create one buffer for input into kernel function
• Determine length of buffer
    let particleVectorByteLength = particles.count * sizeofValue(particles[0])
• Create and populate the buffer
    var inVectorBuffer = device.newBufferWithBytes(
        &particles, length: particleVectorByteLength, options: nil)
• Pass the buffer to the shader
    commandEncoder.setBuffer(inVectorBuffer, offset: 0, atIndex: 0)
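The buffer length above is simply element count times element size. A small sketch of the same calculation in current Swift, where `MemoryLayout` replaces `sizeofValue`; the four-element array is an arbitrary example, and `storageByteCount` is only a cross-check, not something the talk uses:

```swift
struct Particle {
    var positionX: Float = 0
    var positionY: Float = 0
    var velocityX: Float = 0
    var velocityY: Float = 0
}

let particles = [Particle](repeating: Particle(), count: 4)

// stride (not size) is the distance between consecutive array elements,
// and unlike sizeofValue(particles[0]) it does not trap on an empty array.
let particleVectorByteLength = particles.count * MemoryLayout<Particle>.stride

// Cross-check against the number of bytes the array's storage exposes.
let storageByteCount = particles.withUnsafeBytes { $0.count }

print(particleVectorByteLength, storageByteCount) // prints "64 64"
```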

Slide 16

Slide 16 text

Create one buffer for output from kernel function
• Define the Swift variable to receive the kernel results
    var resultdata = [Particle](
        count: particles.count, repeatedValue: Particle())
• Create the buffer
    var outVectorBuffer = device.newBufferWithBytes(
        &resultdata, length: particleVectorByteLength, options: nil)
• Pass the buffer to the shader
    commandEncoder.setBuffer(outVectorBuffer, offset: 0, atIndex: 1)

Slide 17

Slide 17 text

Setting up Metal: Threadgroups
• Two-dimensional thread groups are used for image processing
    threadGroupCount = MTLSize(width: 8, height: 8, depth: 1)
    threadGroups = MTLSize(width: 1024 / threadGroupCount.width,
        height: 1024 / threadGroupCount.height,
        depth: 1)
• A one-dimensional thread group lets Swift pass Metal a one-dimensional array
    threadGroupCount = MTLSize(width: 32, height: 1, depth: 1)
    threadGroups = MTLSize(width: (4096 + 31) / 32, height: 1, depth: 1)
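The `(4096 + 31) / 32` expression above is a ceiling division: it rounds the group count up so that every item gets a thread. A sketch of that arithmetic as a helper; the function name and the 1000-item case are illustrative, not from the talk:

```swift
/// Number of threadgroups needed to cover `itemCount` items when each
/// group runs `threadsPerGroup` threads (integer ceiling division).
func threadgroupCount(covering itemCount: Int, threadsPerGroup: Int) -> Int {
    precondition(itemCount >= 0 && threadsPerGroup > 0)
    return (itemCount + threadsPerGroup - 1) / threadsPerGroup
}

// The two cases from the slide.
let particleGroups = threadgroupCount(covering: 4096, threadsPerGroup: 32)     // 128
let imageGroupsPerSide = threadgroupCount(covering: 1024, threadsPerGroup: 8)  // 128

// A count that is not a multiple of the group width still gets covered,
// at the cost of a few surplus threads in the last group.
let unevenGroups = threadgroupCount(covering: 1000, threadsPerGroup: 32)       // 32
let surplusThreads = unevenGroups * 32 - 1000                                  // 24
```

When there are surplus threads, the kernel needs to compare its id against the item count before touching the buffers; with 4096 and 1024 the division is exact and no check is needed.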

Slide 18

Slide 18 text

Executing the kernel function
• Execute the shader
    commandEncoder.dispatchThreadgroups(threadGroups,
        threadsPerThreadgroup: threadGroupCount)
    commandEncoder.endEncoding()
    commandBuffer.commit()
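For readers on a current toolchain, the setup steps from the preceding slides can be put together roughly as follows. This is a sketch using the present-day Swift names for the same Metal calls, not the code from the talk. The kernel name `moveParticles`, the `particleCount` argument with its bounds check, compiling the shader from a source string, and the function `runParticleStep` are all assumptions made for illustration. It returns nil where Metal is unavailable.

```swift
#if canImport(Metal)
import Metal
#endif

struct Particle {
    var positionX: Float = 0
    var positionY: Float = 0
    var velocityX: Float = 0
    var velocityY: Float = 0
}

// Hypothetical kernel: moves each particle by its velocity.
let shaderSource = """
#include <metal_stdlib>
using namespace metal;
struct Particle { float positionX; float positionY; float velocityX; float velocityY; };
kernel void moveParticles(const device Particle *inParticle [[ buffer(0) ]],
                          device Particle *outParticle [[ buffer(1) ]],
                          constant uint &particleCount [[ buffer(2) ]],
                          uint id [[ thread_position_in_grid ]])
{
    if (id >= particleCount) { return; }   // surplus threads do nothing
    Particle p = inParticle[id];
    p.positionX += p.velocityX;
    p.positionY += p.velocityY;
    outParticle[id] = p;
}
"""

/// Runs one GPU step over `particles`; nil if Metal cannot be used.
func runParticleStep(_ particles: [Particle]) -> [Particle]? {
    #if canImport(Metal)
    let byteLength = particles.count * MemoryLayout<Particle>.stride
    // Device, library, kernel function, pipeline state, queue and buffers.
    guard !particles.isEmpty,
          let device = MTLCreateSystemDefaultDevice(),
          let library = try? device.makeLibrary(source: shaderSource, options: nil),
          let function = library.makeFunction(name: "moveParticles"),
          let pipelineState = try? device.makeComputePipelineState(function: function),
          let commandQueue = device.makeCommandQueue(),
          let inBuffer = device.makeBuffer(bytes: particles, length: byteLength,
                                           options: .storageModeShared),
          let outBuffer = device.makeBuffer(length: byteLength,
                                            options: .storageModeShared),
          let commandBuffer = commandQueue.makeCommandBuffer(),
          let commandEncoder = commandBuffer.makeComputeCommandEncoder()
    else { return nil }

    // Bind the pipeline and resources to the indices the kernel declares.
    var particleCount = UInt32(particles.count)
    commandEncoder.setComputePipelineState(pipelineState)
    commandEncoder.setBuffer(inBuffer, offset: 0, index: 0)
    commandEncoder.setBuffer(outBuffer, offset: 0, index: 1)
    commandEncoder.setBytes(&particleCount, length: MemoryLayout<UInt32>.stride, index: 2)

    // One-dimensional dispatch, 32 threads per group, rounded up.
    let threadGroupCount = MTLSize(width: 32, height: 1, depth: 1)
    let threadGroups = MTLSize(width: (particles.count + 31) / 32, height: 1, depth: 1)
    commandEncoder.dispatchThreadgroups(threadGroups, threadsPerThreadgroup: threadGroupCount)
    commandEncoder.endEncoding()
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()   // block until the GPU has finished

    // Read the results back out of the shared output buffer.
    let result = outBuffer.contents().bindMemory(to: Particle.self, capacity: particles.count)
    return Array(UnsafeBufferPointer(start: result, count: particles.count))
    #else
    return nil
    #endif
}
```

Blocking with `waitUntilCompleted()` keeps the sketch short; a real-time renderer would more likely use a completion handler so the CPU is not stalled every frame.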

Slide 19

Slide 19 text

Accessing the arrays in the shader
• When the kernel function executes, both the input and output arrays are available.
    kernel void particleRendererShader(
        const device Particle *inParticle [[ buffer(0) ]],
        device Particle *outParticle [[ buffer(1) ]],
        constant float &particleBrightness [[buffer(2)]],
        uint id [[thread_position_in_grid]])
    {
        const Particle thisParticle = inParticle[id];
        outParticle[id].positionX = thisParticle.positionX + thisParticle.velocityX;
        outParticle[id].positionY = thisParticle.positionY + thisParticle.velocityY;
    }
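A CPU version of the same update is handy for checking what the GPU returns. The sketch below is such a reference in plain Swift; `particleStep` is an illustrative name. One deliberate difference: it also carries each velocity across to the output, whereas the kernel above writes only the positions and leaves the velocities to whatever the output buffer already holds.

```swift
struct Particle: Equatable {
    var positionX: Float = 0
    var positionY: Float = 0
    var velocityX: Float = 0
    var velocityY: Float = 0
}

/// CPU stand-in for the kernel: each loop iteration plays the role of one
/// GPU thread, reading inParticles[id] and writing the matching output slot.
func particleStep(_ inParticles: [Particle]) -> [Particle] {
    var outParticles = inParticles   // copies velocities across (assumption)
    for id in inParticles.indices {
        let thisParticle = inParticles[id]
        outParticles[id].positionX = thisParticle.positionX + thisParticle.velocityX
        outParticles[id].positionY = thisParticle.positionY + thisParticle.velocityY
    }
    return outParticles
}

let moved = particleStep([Particle(positionX: 1, positionY: 2,
                                   velocityX: 0.5, velocityY: -0.5)])
print(moved[0].positionX, moved[0].positionY) // prints "1.5 1.5"
```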

Slide 20

Slide 20 text

Writing to the output texture
• Set the color of the pixel at the particle coordinates
    kernel void particleRendererShader(
        texture2d<float, access::write> outTexture [[texture(0)]],
        const device Particle *inParticle [[ buffer(0) ]],
        uint id [[thread_position_in_grid]])
    {
        const uint2 particlePosition(inParticle[id].positionX, inParticle[id].positionY);
        const float4 outColor(1.0, 1.0, 0.0, 1.0); // RGBA yellow
        outTexture.write(outColor, particlePosition);
    }
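To make the texture write concrete, here is a CPU sketch of what writing one RGBA8 pixel amounts to: a byte offset computed from the coordinate, then four bytes. The `RGBAImage` type is hypothetical, and its range check and clamping are assumptions of this sketch rather than documented behaviour of `outTexture.write`.

```swift
/// A tiny RGBA8 image: 4 bytes per pixel, rows stored one after another.
struct RGBAImage {
    let width: Int
    let height: Int
    var bytes: [UInt8]

    init(width: Int, height: Int) {
        self.width = width
        self.height = height
        bytes = [UInt8](repeating: 0, count: width * height * 4)
    }

    /// Writes a colour given as 0...1 floats; coordinates outside the image
    /// are ignored (an assumption made for this sketch).
    mutating func write(r: Float, g: Float, b: Float, a: Float, x: Int, y: Int) {
        guard x >= 0, x < width, y >= 0, y < height else { return }
        let offset = (y * width + x) * 4
        for (channel, value) in [r, g, b, a].enumerated() {
            // Clamp before converting so the UInt8 initialiser cannot trap.
            let clamped = min(max(value, 0), 1)
            bytes[offset + channel] = UInt8((clamped * 255).rounded())
        }
    }
}

var image = RGBAImage(width: 4, height: 4)
image.write(r: 1, g: 1, b: 0, a: 1, x: 1, y: 2)   // RGBA yellow at (1, 2)
let yellowPixel = Array(image.bytes[36 ..< 40])    // (2 * 4 + 1) * 4 = 36
```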

Slide 21

Slide 21 text

Accessing the updated array in Swift
• Once the kernel function has completed, we can access the updated array
    var data = NSData(bytesNoCopy: outVectorBuffer.contents(),
        length: particles.count * sizeof(Particle), freeWhenDone: false)
    data.getBytes(&particles, length: particles.count * sizeof(Particle))
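In current Swift, `contents()` returns an `UnsafeMutableRawPointer`, so the NSData round trip above can be replaced by binding that memory to `Particle`. The sketch below shows the idea without needing a GPU: a plain heap allocation stands in for the buffer contents, and both function names are illustrative.

```swift
struct Particle {
    var positionX: Float = 0
    var positionY: Float = 0
    var velocityX: Float = 0
    var velocityY: Float = 0
}

/// Copies `count` particles out of raw memory, as one would from a buffer's contents.
func readParticles(from contents: UnsafeMutableRawPointer, count: Int) -> [Particle] {
    let typed = contents.bindMemory(to: Particle.self, capacity: count)
    return Array(UnsafeBufferPointer(start: typed, count: count))
}

/// Fills a stand-in allocation the way a kernel might, then reads it back.
func demoReadBack() -> [Particle] {
    let count = 3
    let raw = UnsafeMutableRawPointer.allocate(
        byteCount: count * MemoryLayout<Particle>.stride,
        alignment: MemoryLayout<Particle>.alignment)
    defer { raw.deallocate() }

    let typed = raw.bindMemory(to: Particle.self, capacity: count)
    for i in 0 ..< count {
        (typed + i).initialize(to: Particle(positionX: Float(i), positionY: 0,
                                            velocityX: 0, velocityY: 0))
    }
    return readParticles(from: raw, count: count)
}

let readBack = demoReadBack()
```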

Slide 22

Slide 22 text

Converting the output texture to a UIImage
• Create a UIImage from the texture
    region = MTLRegionMake2D(0, 0, Int(imageSide), Int(imageSide))
    textureA.getBytes(&imageBytes, bytesPerRow: bytesPerRowInt,
        fromRegion: region, mipmapLevel: 0)
    let providerRef = CGDataProviderCreateWithCFData(
        NSData(bytes: &self.imageBytes, length: self.providerLength))
    imageRef = CGImageCreate(imageSide, imageSide, bitsPerComponent,
        bitsPerPixel, bytesPerRow, rgbColorSpace, bitmapInfo, providerRef,
        nil, false, renderingIntent)
    imageView.image = UIImage(CGImage: imageRef)!

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

Issues
• The main performance cost isn’t running the kernel function; it’s moving the data between the GPU and CPU

Slide 25

Slide 25 text

Improving Performance With Shared Memory
• To eliminate this bottleneck, we can share memory between the GPU and CPU
• No data being copied: improved performance!
• posix_memalign() allocates memory for use by both GPU and CPU

Slide 26

Slide 26 text

Shared GPU/CPU Memory
• Declaring buffers and pointers
    let particleCount: Int = 2097152
    var particlesMemory: UnsafeMutablePointer<Void> = nil
    let particlesMemoryByteSize: UInt = UInt(particleCount) * UInt(sizeof(Particle))
    var particlesVoidPtr: COpaquePointer!
    var particlesParticlePtr: UnsafeMutablePointer<Particle>!
    var particlesParticleBufferPtr: UnsafeMutableBufferPointer<Particle>!
• posix_memalign() for shared memory
    posix_memalign(&particlesMemory, 0x4000, particlesMemoryByteSize)
    particlesVoidPtr = COpaquePointer(particlesMemory)
    particlesParticlePtr = UnsafeMutablePointer(particlesVoidPtr)
    particlesParticleBufferPtr = UnsafeMutableBufferPointer(
        start: particlesParticlePtr, count: particleCount)
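The same allocation is shorter in current Swift, where raw pointers replace `COpaquePointer`. A sketch under a few assumptions: the helper names are illustrative, the 0x4000 alignment is taken from the slide, and the byte size is rounded up to a whole number of pages because, as far as I am aware, a no-copy Metal buffer expects a page-multiple length as well as a page-aligned pointer.

```swift
import Foundation

struct Particle {
    var positionX: Float = 0
    var positionY: Float = 0
    var velocityX: Float = 0
    var velocityY: Float = 0
}

let pageSize = 0x4000   // alignment used on the slide (16 KB)

/// Rounds a byte count up to a whole number of pages.
func roundedUpToPage(_ byteCount: Int) -> Int {
    return (byteCount + pageSize - 1) / pageSize * pageSize
}

/// Allocates page-aligned storage for `count` particles.
/// The caller owns the memory and must release it with free().
func allocateSharedParticles(count: Int)
    -> (memory: UnsafeMutableRawPointer,
        particles: UnsafeMutableBufferPointer<Particle>,
        byteSize: Int)? {
    guard count > 0 else { return nil }
    let byteSize = roundedUpToPage(count * MemoryLayout<Particle>.stride)

    var memory: UnsafeMutableRawPointer? = nil
    guard posix_memalign(&memory, pageSize, byteSize) == 0,
          let base = memory else { return nil }

    // View the raw allocation as particles and give each a defined value.
    let typed = base.bindMemory(to: Particle.self, capacity: count)
    typed.initialize(repeating: Particle(), count: count)
    return (base, UnsafeMutableBufferPointer(start: typed, count: count), byteSize)
}
```

The returned `memory` and `byteSize` are what would be handed to the no-copy buffer call, while `particles` gives the CPU indexed access to the very same bytes.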

Slide 27

Slide 27 text

Shared GPU/CPU Memory
• Populating data
    for index in particlesParticleBufferPtr.startIndex ..< particlesParticleBufferPtr.endIndex {
        let particle = Particle( … )
        particlesParticleBufferPtr[index] = particle
    }
• Passing into kernel shader
    let particlesBufferNoCopy = device.newBufferWithBytesNoCopy(
        particlesMemory, length: Int(particlesMemoryByteSize),
        options: nil, deallocator: nil)
    commandEncoder.setBuffer(particlesBufferNoCopy, offset: 0, atIndex: 0)
    commandEncoder.setBuffer(particlesBufferNoCopy, offset: 0, atIndex: 1)

Slide 28

Slide 28 text

Image processing inside particle compute shader
• If the number of particles is more than the number of pixels
    uint id [[thread_position_in_grid]] // passed in constructor

    uint2 textureCoordinate(fast::floor(id / imageWidth), id % int(imageWidth));
    if (textureCoordinate.x < imageWidth && textureCoordinate.y < imageWidth)
    {
        float4 accumColor = inTexture.read(textureCoordinate);
        for (int j = -1; j <= 1; j++)
        {
            for (int i = -1; i <= 1; i++)
            {
                uint2 kernelIndex(textureCoordinate.x + i, textureCoordinate.y + j);
                accumColor.rgb += inTexture.read(kernelIndex).rgb;
            }
        }
        accumColor.rgb = (accumColor.rgb / 10.5f);
        accumColor.a = 1.0f;
        outTexture.write(accumColor, textureCoordinate);
    }
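The shader fragment above does two things: it turns a one-dimensional thread id into a pixel coordinate, and it averages a 3 × 3 neighbourhood. Ten samples are summed (the centre once, then all nine cells of the neighbourhood) and divided by 10.5, so each pass also dims the image slightly. A CPU sketch on a greyscale grid; the function names are illustrative, and treating out-of-image neighbours as black is an assumption of this sketch:

```swift
/// Maps a thread id to a coordinate the way the slide does:
/// x = id / imageWidth, y = id % imageWidth.
func textureCoordinate(forThread id: Int, imageWidth: Int) -> (x: Int, y: Int) {
    return (id / imageWidth, id % imageWidth)
}

/// One blur-and-fade pass over a square greyscale grid stored row by row.
func blurAndFade(_ pixels: [Float], imageWidth: Int) -> [Float] {
    precondition(pixels.count == imageWidth * imageWidth)
    var out = pixels
    for id in 0 ..< imageWidth * imageWidth {
        let c = textureCoordinate(forThread: id, imageWidth: imageWidth)
        var accum = pixels[c.y * imageWidth + c.x]   // centre sample, read first
        for j in -1 ... 1 {
            for i in -1 ... 1 {
                let x = c.x + i, y = c.y + j
                // Assumption: neighbours outside the image contribute nothing.
                guard x >= 0, x < imageWidth, y >= 0, y < imageWidth else { continue }
                accum += pixels[y * imageWidth + x]
            }
        }
        out[c.y * imageWidth + c.x] = accum / 10.5   // slightly more than 10: fade
    }
    return out
}

let blurred = blurAndFade([Float](repeating: 1, count: 9), imageWidth: 3)
```

Because the loop visits every id exactly once, the row-major versus column-major ordering of the mapping does not matter here; it only has to be a one-to-one cover of the image.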

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

Advanced Particle Systems
• Particle array is passed into kernel function
    const device Particle *inParticles [[ buffer(0) ]]
• The current index is also passed into kernel function
    uint id [[thread_position_in_grid]]
• Accessing an item
    Particle thisParticle = inParticles[id];
• Looping over entire array
    for (uint i = 0; i < 4096; i++)
    {
        const Particle otherParticle = inParticles[i];
        […]
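Because every thread can see the whole array, each particle can inspect all the others, at the cost of work that grows with the square of the particle count. A CPU sketch of that pattern, counting the neighbours within a radius; the function name and the sample positions are illustrative, and the loop uses the array's own count rather than the hard-coded 4096:

```swift
struct Particle {
    var positionX: Float = 0
    var positionY: Float = 0
    var velocityX: Float = 0
    var velocityY: Float = 0
}

/// Counts the particles closer than `radius` to particle `id`,
/// visiting every other particle in the array once.
func neighbourCount(of id: Int, in particles: [Particle], radius: Float) -> Int {
    let thisParticle = particles[id]
    var count = 0
    for i in particles.indices where i != id {
        let dx = particles[i].positionX - thisParticle.positionX
        let dy = particles[i].positionY - thisParticle.positionY
        if (dx * dx + dy * dy).squareRoot() < radius {
            count += 1
        }
    }
    return count
}

let swarm = [Particle(positionX: 0, positionY: 0),
             Particle(positionX: 3, positionY: 4),
             Particle(positionX: 10, positionY: 0)]
let counts = swarm.indices.map { neighbourCount(of: $0, in: swarm, radius: 6) }
```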

Slide 31

Slide 31 text

Advanced Particle Systems
    kernel void particleRendererShader(texture2d<float, access::write> outTexture [[texture(0)]],
                                       texture2d<float, access::read> inTexture [[texture(1)]],
                                       const device Particle *inParticles [[ buffer(0) ]],
                                       device Particle *outParticles [[ buffer(1) ]],
                                       constant SwarmGenome &genomeOne [[buffer(2)]],
                                       constant SwarmGenome &genomeTwo [[buffer(3)]],
                                       constant SwarmGenome &genomeThree [[buffer(4)]],
                                       constant float &particleBrightness [[buffer(5)]],
                                       constant float &gravityWellX [[buffer(6)]],
                                       constant float &gravityWellY [[buffer(7)]],
                                       uint id [[thread_position_in_grid]])
    {
        Particle inParticle = inParticles[id];
        const uint2 particlePosition(inParticle.positionX, inParticle.positionY);
        const int type = int(inParticle.type);
        const float4 outColor((type == 0 ? particleBrightness : particleBrightness / 2.0),
                              (type == 1 ? particleBrightness : particleBrightness / 2.0),
                              (type == 2 ? particleBrightness : particleBrightness / 2.0),
                              1.0);

        float neigbourCount = 0;
        float localCentreX = 0;
        float localCentreY = 0;
        float localDx = 0;
        float localDy = 0;
        float tempAx = 0;
        float tempAy = 0;

        if (gravityWellX > 1 && gravityWellY > 1)
        {
            const float dist = fast::distance(float2(inParticle.positionX, inParticle.positionY),
                                              float2(gravityWellX, gravityWellY));
            const float factor = (1 / (dist < 1 ? 1 : dist)) * 10;
            inParticle.velocityX = inParticle.velocityX + (inParticle.positionX - gravityWellX) * factor;
            inParticle.velocityY = inParticle.velocityY + (inParticle.positionY - gravityWellY) * factor;
        }

        const SwarmGenome genome = type == 0 ? genomeOne : type == 1 ? genomeTwo : genomeThree;

        for (uint i = 0; i < 4096; i++)
        {
            if (i != id)
            {
                const Particle candidateNeighbour = inParticles[i];
                const float dist = fast::distance(float2(inParticle.positionX, inParticle.positionY),
                                                  float2(candidateNeighbour.positionX, candidateNeighbour.positionY));
                if (dist < genome.radius * 100)
                {
                    localCentreX = localCentreX + candidateNeighbour.positionX;
                    localCentreY = localCentreY + candidateNeighbour.positionY;
                    localDx = localDx + candidateNeighbour.velocityX;
                    localDy = localDy + candidateNeighbour.velocityY;
                    neigbourCount = neigbourCount + 1.0f;

                    float foo = (dist < 1 ? 1 : dist) * (0.001 + genome.c3_seperation) * 50.0f;
                    tempAx = tempAx + (inParticle.positionX - candidateNeighbour.positionX) / foo;
                    tempAy = tempAy + (inParticle.positionY - candidateNeighbour.positionY) / foo;

                    const float randomThree = fast::abs(fast::cos(candidateNeighbour.velocityX + candidateNeighbour.velocityY));
                    if (randomThree < genome.c4_steering)
                    {
                        const int randomOne = fast::cos(candidateNeighbour.positionX + candidateNeighbour.velocityY);
                        const int randomTwo = fast::sin(candidateNeighbour.positionY + candidateNeighbour.velocityX);
                        tempAx = tempAx + randomOne * 3;
                        tempAy = tempAy + randomTwo * 3;
                    }
                }
            }
        }

        if (neigbourCount > 0)
        {
            localCentreX = localCentreX / neigbourCount;
            localCentreY = localCentreY / neigbourCount;
            localDx = localDx / neigbourCount;
            localDy = localDy / neigbourCount;

            tempAx = tempAx + (localCentreX - inParticle.positionX) * genome.c1_cohesion;
            tempAy = tempAy + (localCentreY - inParticle.positionY) * genome.c1_cohesion;
            tempAx = tempAx + (localDx - inParticle.velocityX) * genome.c2_alignment;
            tempAy = tempAy + (localDy - inParticle.velocityY) * genome.c2_alignment;

            // accelerate
            inParticle.velocityX2 += tempAx;
            inParticle.velocityY2 += tempAy;

            float d = fast::sqrt(inParticle.velocityX2 * inParticle.velocityX2 +
                                 inParticle.velocityY2 * inParticle.velocityY2);
            if (d == 0)
            {
                d = 0.001f;
            }

            float accelerateMultiplier = ((genome.normalSpeed * 4.0f) - d) / d * genome.c5_paceKeeping;
            inParticle.velocityX2 += inParticle.velocityX2 * accelerateMultiplier;
            inParticle.velocityY2 += inParticle.velocityY2 * accelerateMultiplier;
        }

        inParticle.velocityX = inParticle.velocityX2;
        inParticle.velocityY = inParticle.velocityY2;
        inParticle.positionX += inParticle.velocityX;
        inParticle.positionY += inParticle.velocityY;

        outParticles[id] = inParticle;

        if (outParticles[id].positionX <= 0)
        {
            outParticles[id].positionX = 800;
        }
        else if (outParticles[id].positionX >= 800)
        {
            outParticles[id].positionX = 0;
        }

        if (outParticles[id].positionY <= 0)
        {
            outParticles[id].positionY = 800;
        }
        else if (outParticles[id].positionY >= 800)
        {
            outParticles[id].positionY = 0;
        }

        const float4 inColor = inTexture.read(particlePosition).rgba;
        outTexture.write(inColor + outColor, particlePosition);

        const float4 inColor2 = inTexture.read(particlePosition - uint2(1, 1)).rgba;
        outTexture.write(inColor2 + outColor, particlePosition - uint2(1, 1));

        const float4 inColor3 = inTexture.read(particlePosition - uint2(0, 1)).rgba;
        outTexture.write(inColor3 + outColor, particlePosition - uint2(0, 1));

        const float4 inColor4 = inTexture.read(particlePosition - uint2(1, 0)).rgba;
        outTexture.write(inColor4 + outColor, particlePosition - uint2(1, 0));
    }
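Two steps near the end of the kernel above are easy to miss in the full listing: the pace-keeping rule, which nudges each particle's speed towards `genome.normalSpeed * 4`, and the wrap-around at the 800-pixel edges. A CPU sketch of just those two steps, with illustrative function names and with the genome fields passed as plain parameters:

```swift
/// Scales a velocity towards the target speed (normalSpeed * 4), following
/// the kernel's accelerateMultiplier calculation.
func paceKept(velocityX: Float, velocityY: Float,
              normalSpeed: Float, paceKeeping: Float) -> (x: Float, y: Float) {
    var d = (velocityX * velocityX + velocityY * velocityY).squareRoot()
    if d == 0 {
        d = 0.001   // avoid dividing by zero for a stationary particle
    }
    let accelerateMultiplier = ((normalSpeed * 4.0) - d) / d * paceKeeping
    return (velocityX + velocityX * accelerateMultiplier,
            velocityY + velocityY * accelerateMultiplier)
}

/// Wraps a coordinate so that a particle leaving one edge re-enters at the other.
func wrapped(_ position: Float, limit: Float = 800) -> Float {
    if position <= 0 { return limit }
    if position >= limit { return 0 }
    return position
}

// Speed 5 with a target of 10 and a pace-keeping weight of 0.5:
// the multiplier is (10 - 5) / 5 * 0.5 = 0.5, so the velocity grows by half.
let paced = paceKept(velocityX: 3, velocityY: 4, normalSpeed: 2.5, paceKeeping: 0.5)
```

A particle already moving faster than the target gets a negative multiplier and slows down, so the rule pulls speeds towards the target from both sides.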

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

Acknowledgements
• http://metalbyexample.com/
• http://www.raywenderlich.com/
• http://memkite.com/
• http://bingweb.binghamton.edu/~sayama/
• Ida, Beren and Morgan for Swift Summit!

Slide 34

Slide 34 text

iOS GPU Programming with Swift & Metal
Simon Gladman for Swift Summit, March 2015
Twitter: @FlexMonkey
Blog: flexmonkey.blogspot.com
GitHub: github.com/FlexMonkey