
Tim Mattson

A review of open industry standards for programming heterogeneous platforms.

Multicore World 2013

February 19, 2013

Transcript

  1. Programming Heterogeneous computers: CUDA, OpenCL, OpenACC, and OpenMP

    Tim Mattson, kayak instructor (ACA certified, Advanced Open Ocean), Intel Labs
  2. Disclaimer: READ THIS … it's very important

    • The views expressed in this talk are those of the speaker and not his employer.
    • I am in a research group. I know nothing about Intel products and if asked about them, I'll probably lie.
  3. It's a Heterogeneous world

    [Block diagram: CPU, GPU, GMCH, ICH, and DRAM. GMCH = graphics memory control hub; ICH = input/output control hub.]
    • A modern platform includes:
      – One or more CPUs
      – One or more GPUs
    Source: SC10 OpenCL tutorial, Gaster and Mattson, 2010
  4. Heterogeneity: beyond the GPU

    • FPGA
    • DSP
    • Many-core CPU (MIC & Xeon Phi™)
    • Integrated CPU + FPGA (Intel® Atom™ Processor E6x5C Series)
    … and who knows what the future will bring?
    *Other names and brands may be claimed as the property of others
  5. SW for Heterogeneous platforms

    With so much diversity, why would any sane programmer lock themselves to a single vendor's platform?
  6. NVIDIA® CUDA™

    • Ian Buck¹ and company at NVIDIA® put GPGPU computing "on the map" with CUDA™ in 2006.
    • CUDA™ is pushing beyond NVIDIA® GPUs:
      – NVIDIA® released their LLVM compiler for CUDA™ as open source².
      – PGI compilers³ support CUDA™ programs for x86 CPUs.
    NVIDIA® maintains complete control over development of CUDA™. It is a proprietary solution.
    ¹ Ian Buck is widely acknowledged as the father of CUDA. It is closely related to his Ph.D. work on the Brook GPU streaming language, described in his dissertation with Pat Hanrahan's group at Stanford.
    ² www.eetimes.com/electronics-news/4372661/Nvidia-contributes-CUDA-compiler-to-open-source, 5/9/2012
    ³ http://www.pgroup.com/resources/cuda-x86.htm
    *Other names and brands may be claimed as the property of others
  7. Outline

    • Don't reward BAD behavior … avoid proprietary SW programming models!
    • Open standards for heterogeneous computing:
      – OpenCL
      – Programming with directives: OpenACC and OpenMP
  8. OpenCL Working Group within Khronos

    • Diverse industry participation … processor vendors, system OEMs, middleware vendors, application developers.
    • OpenCL became an important standard "on release" by virtue of the market coverage of the companies behind it.
    Third party names are the property of their owners.
  9. The BIG idea behind OpenCL

    • OpenCL execution model … execute a kernel at each point in a problem domain.
      – E.g., process a 1024 x 1024 image with one kernel invocation per pixel, or 1024 x 1024 = 1,048,576 kernel executions.

    Traditional loops:

        void trad_vadd(int n, const float *a, const float *b, float *c)
        {
            int i;
            for (i = 0; i < n; i++)
                c[i] = a[i] + b[i];
        }

    OpenCL kernel parallelism:

        kernel void vec_add(global const float *a,
                            global const float *b,
                            global float *c)
        {
            int id = get_global_id(0);
            c[id] = a[id] + b[id];
        }
        // execute over "n" work-items
  10. OpenCL Platform Model

    • One Host + one or more Compute Devices
      – Each Compute Device is composed of one or more Compute Units
      – Each Compute Unit is further divided into one or more Processing Elements
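    For concreteness, a minimal host-side sketch (not from the deck; variable names are mine) that asks a device how many Compute Units it exposes, assuming the standard OpenCL 1.x API:

        #include <stdio.h>
        #include <CL/cl.h>

        int main(void) {
            cl_platform_id platform;
            cl_device_id device;
            cl_uint num_cu;

            // grab the first platform and its first device
            clGetPlatformIDs(1, &platform, NULL);
            clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

            // query the number of compute units on that device
            clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                            sizeof(num_cu), &num_cu, NULL);
            printf("Compute units on this device: %u\n", num_cu);
            return 0;
        }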
  11. Execution Model

    • Host defines a command queue and associates it with a context (devices, kernels, memory, etc.).
    • Host enqueues commands to the command queue.
    • Kernel execution commands launch work-items: i.e., a kernel for each point in an abstract index space.
    • Work-items execute together as a work-group.
    [Figure: a (Gy by Gx) index space; each work-group (wx, wy) of size Sx by Sy contains work-items with local IDs (sx, sy) ranging from (0, 0) to (Sx - 1, Sy - 1), so a work-item's global position is (wx·Sx + sx, wy·Sy + sy).]
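    An illustrative kernel fragment (mine, not from the deck) showing how those index-space quantities surface in OpenCL C; get_global_id(0) is exactly the wx·Sx + sx of the figure:

        kernel void where_am_i(global int *out) {
            size_t gid = get_global_id(0);    // global index: wx*Sx + sx
            size_t wid = get_group_id(0);     // work-group index: wx
            size_t lid = get_local_id(0);     // index within the group: sx
            size_t sz  = get_local_size(0);   // work-group size: Sx
            out[gid] = (int)(wid * sz + lid); // equals gid by construction
        }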
  12. OpenCL vs. CUDA Terminology

    [Same index-space figure as the previous slide, annotated with the corresponding CUDA terms.]
    • OpenCL command queue ↔ CUDA stream
    • OpenCL index space (NDRange) ↔ CUDA grid
    • OpenCL work-group ↔ CUDA thread block
    • OpenCL work-items ↔ CUDA threads
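    To make the mapping concrete, here is the same vector-add index computation written in each model (an illustrative sketch, not from the deck):

        /* OpenCL: one work-item per element */
        kernel void vadd_cl(global const float *a, global const float *b,
                            global float *c) {
            int id = get_global_id(0);
            c[id] = a[id] + b[id];
        }

        /* CUDA: one thread per element; the block and thread indices
           combine to give the same global index                      */
        __global__ void vadd_cu(const float *a, const float *b, float *c) {
            int id = blockIdx.x * blockDim.x + threadIdx.x;
            c[id] = a[id] + b[id];
        }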
  13. OpenCL Memory model

    • Implements a relaxed-consistency, shared-memory model:
      – Global memory: visible to the host and to compute devices
      – Private memory: local to each work-item
      – Local memory: shared within a work-group
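    A kernel sketch (illustrative, not from the deck) that touches all three spaces; the barrier is needed so every work-item's write to local memory is visible to the rest of its work-group before anyone reads it:

        kernel void reverse_in_group(global float *data,
                                     local  float *scratch) {
            int gid = get_global_id(0);       // indexes global memory
            int lid = get_local_id(0);        // gid, lid live in private memory
            int n   = get_local_size(0);

            scratch[lid] = data[gid];         // stage into local memory
            barrier(CLK_LOCAL_MEM_FENCE);     // group-wide synchronization
            data[gid] = scratch[n - 1 - lid]; // read a neighbor's element
        }

    (On the host, the local buffer is sized, not copied: clSetKernelArg(kernel, 1, n * sizeof(float), NULL).)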
  14. Vector Addition - Host

    • The host program … the code that runs on the host to:
      – Set up the environment for the OpenCL program
      – Create and manage kernels
    • 5 simple steps in a basic host program:
      1. Define the platform … platform = devices + context + queues
      2. Create and build the program (dynamic library for kernels)
      3. Set up memory objects
      4. Define the kernel (attach the kernel function to its arguments)
      5. Submit commands … move memory objects and execute kernels
    Our goal is extreme portability, so we expose everything (i.e., we are a bit verbose). But most of a host code is the same from one application to the next … the re-use makes the verbosity a non-issue.
  15. Vector Addition - Host Program

        // create the OpenCL context on a GPU device
        context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU,
                                          NULL, NULL, NULL);

        // get the list of GPU devices
        clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
        devices = malloc(cb);
        clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

        // create a command-queue
        cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

        // allocate the buffer memory objects
        memobjs[0] = clCreateBuffer(context,
                         CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                         sizeof(cl_float)*n, srcA, NULL);
        memobjs[1] = clCreateBuffer(context,
                         CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                         sizeof(cl_float)*n, srcB, NULL);
        memobjs[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                         sizeof(cl_float)*n, NULL, NULL);

        // create the program
        program = clCreateProgramWithSource(context, 1,
                         &program_source, NULL, NULL);

        // build the program
        err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

        // create the kernel
        kernel = clCreateKernel(program, "vec_add", NULL);

        // set the args values
        err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
        err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
        err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&memobjs[2]);

        // set work-item dimensions
        global_work_size[0] = n;

        // execute kernel
        err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                  global_work_size, NULL, 0, NULL, NULL);

        // read output array
        err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
                  n*sizeof(cl_float), dst, 0, NULL, NULL);
  16. Vector Addition - Host Program

    The same listing as the previous slide, with each section labeled:
    • Define the platform and queues
    • Define memory objects
    • Create the program
    • Build the program
    • Create and set up the kernel
    • Execute the kernel
    • Read results on the host
    It's complicated, but most of this is "boilerplate" and not as bad as it looks.
  17. Host programs can be "ugly"

    Our goal is extreme portability, so we expose everything (i.e., we are a bit verbose). But most of a host code is the same from one application to the next … this re-use makes the verbosity less of an issue.
  18. OpenCL summary

    [Figure: the pieces of an OpenCL program. A Context holds Programs (e.g., dp_mul compiled to both CPU and GPU program binaries), Kernels with their argument values, Memory Objects (buffers and images), and Command Queues (in-order and out-of-order) feeding each Compute Device (CPU, GPU).]

        __kernel void dp_mul(global const float *a,
                             global const float *b,
                             global float *c)
        {
            int id = get_global_id(0);
            c[id] = a[id] * b[id];
        }

    Third party names are the property of their owners.
  19. Outline

    • Don't reward BAD behavior … avoid proprietary SW programming models!
    • Open standards for heterogeneous computing:
      – OpenCL
      – Programming with directives: OpenACC and OpenMP
  20. The serial "vadd" program

    • Let's add two vectors together …. C = A + B

        void vadd(int n, const float *a, const float *b, float *c)
        {
            int i;
            for (i = 0; i < n; i++)
                c[i] = a[i] + b[i];
        }

        int main(){
            float *a, *b, *c;
            int n = 10000;
            // allocate and fill a and b
            vadd(n, a, b, c);
        }
  21. The OpenCL vadd program

        kernel void vec_add(global const float *a,
                            global const float *b,
                            global float *c)
        {
            int id = get_global_id(0);
            c[id] = a[id] + b[id];
        }
        // execute over "n" work-items

    • Host program plus kernel just to add two vectors
  22. The OpenCL vadd program

    • The same kernel as the previous slide … a host program plus a kernel just to add two vectors.
    Shouldn't a compiler with some hints be able to generate all this?
    • Host: basic setup ops for most cases are the same from one program to the next.
    • Kernel: wrap some "glue code" around the body of a loop.
  23. Directive driven programming of Heterogeneous systems

    • The Portland Group (PGI) introduced proprietary directives for programming GPUs.
    • OpenMP (with help from PGI) launched a working group to define "accelerator directives" in OpenMP.
    • A subset of the participants grew tired of the cautious, slow, and methodical approach in the OpenMP group … and split off to form their own group (OpenACC):
      – NVIDIA, Cray, PGI, and CAPS
    • They launched the OpenACC directive set in November of 2011 at SC11.
    • At SC12:
      – The OpenACC group released a review draft of OpenACC 2.0.
      – The OpenACC and OpenMP groups publicly stated their intent to rejoin the two efforts.
      – The OpenMP group released a technical report with the OpenMP accelerator directives.
  24. Directive driven "vadd" program

    • Let's add two vectors together …. C = A + B

        void vadd(int n, const float *a,
                  const float *b, float *restrict c)
        {
            int i;
        #pragma acc kernels
            for (i = 0; i < n; i++)
                c[i] = a[i] + b[i];
        }

        int main(){
            float *a, *b, *c;
            int n = 10000;
            // allocate and fill a and b
            vadd(n, a, b, c);
        }

    • "restrict" assures the compiler that c is not aliased with other pointers.
    • "#pragma acc kernels" turns the loop into a kernel, moves data to a device, and launches the kernel. The host waits here until the kernel is done; then the output array c is copied back to the host.
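    As a usage note: with the PGI compilers of the day, such a program would typically be built with something like the line below (the -acc and -Minfo=accel flags are PGI's; the exact invocation may vary by version):

        pgcc -acc -Minfo=accel vadd.c -o vadd

    The -Minfo=accel report tells you which loops the compiler actually turned into kernels and what data it decided to move.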
  25. A more complicated example: Jacobi iteration

    Solve Laplace's equation for heat diffusion using an explicit finite difference, relaxation method (a four point stencil).

        while (err > tol && iter < iter_max) {
            err = 0.0;
            for (int j = 1; j < n-1; j++) {
                for (int i = 1; i < M-1; i++) {
                    Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                         A[j-1][i] + A[j+1][i]);
                    err = max(err, abs(Anew[j][i] - A[j][i]));
                }
            }
            for (int j = 1; j < n-1; j++) {
                for (int i = 1; i < M-1; i++) {
                    A[j][i] = Anew[j][i];
                }
            }
            iter++;
        }

    Source: Mark Harris of NVIDIA®, "Getting Started with OpenACC", GPU Technology Conf., 2012
  26. A more complicated example: Jacobi iteration: OpenMP (shared memory CPU)

        while (err > tol && iter < iter_max) {
            err = 0.0;
        #pragma omp parallel for reduction(max:err)
            for (int j = 1; j < n-1; j++) {
                for (int i = 1; i < M-1; i++) {
                    Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                         A[j-1][i] + A[j+1][i]);
                    err = max(err, abs(Anew[j][i] - A[j][i]));
                }
            }
        #pragma omp parallel for
            for (int j = 1; j < n-1; j++) {
                for (int i = 1; i < M-1; i++) {
                    A[j][i] = Anew[j][i];
                }
            }
            iter++;
        }

    Source: Mark Harris of NVIDIA®, "Getting Started with OpenACC", GPU Technology Conf., 2012
  27. A more complicated example: Jacobi iteration: OpenACC (GPU)

        while (err > tol && iter < iter_max) {
            err = 0.0;
        #pragma acc kernels reduction(max:err)
            for (int j = 1; j < n-1; j++) {
                for (int i = 1; i < M-1; i++) {
                    Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                         A[j-1][i] + A[j+1][i]);
                    err = max(err, abs(Anew[j][i] - A[j][i]));
                }
            }
        #pragma acc kernels
            for (int j = 1; j < n-1; j++) {
                for (int i = 1; i < M-1; i++) {
                    A[j][i] = Anew[j][i];
                }
            }
            iter++;
        }

    Source: Mark Harris of NVIDIA®, "Getting Started with OpenACC", GPU Technology Conf., 2012
  28. A more complicated example: Jacobi iteration: OpenACC (GPU)

    The same code as the previous slide, annotated:
    • A and Anew are copied between the host and the GPU on each iteration.
    • Performance was poor due to excess memory movement overhead.
    Source: Mark Harris of NVIDIA®, "Getting Started with OpenACC", GPU Technology Conf., 2012
  29. A more complicated example: Jacobi iteration: OpenACC (GPU)

        #pragma acc data copy(A), create(Anew)
        while (err > tol && iter < iter_max) {
            err = 0.0;
        #pragma acc kernels reduction(max:err)
            for (int j = 1; j < n-1; j++) {
                for (int i = 1; i < M-1; i++) {
                    Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                         A[j-1][i] + A[j+1][i]);
                    err = max(err, abs(Anew[j][i] - A[j][i]));
                }
            }
        #pragma acc kernels
            for (int j = 1; j < n-1; j++) {
                for (int i = 1; i < M-1; i++) {
                    A[j][i] = Anew[j][i];
                }
            }
            iter++;
        }

    • Create a data region on the GPU: copy A once onto the GPU, and create Anew on the device (no copy from the host).
    • Copy A back out to the host … but only once.
    Source: Mark Harris of NVIDIA®, "Getting Started with OpenACC", GPU Technology Conf., 2012
  30. Results

    Unfortunately, they didn't try very hard to optimize the OpenMP program, as seen in its maximum speedup of only 1.76 on 6 threads.
    Source: Mark Harris of NVIDIA®, "Getting Started with OpenACC", GPU Technology Conf., 2012
  31. Jacobi iteration: A more carefully written OpenMP program

        #pragma omp parallel firstprivate(iter)
        {
        #pragma omp for
            for (int j = 1; j < n-1; j++)
                for (int i = 1; i < M-1; i++)
                    A[j][i] = A_init;

            while (err > tol && iter < iter_max) {
                err = 0.0;
        #pragma omp for reduction(max:err)
                for (int j = 1; j < n-1; j++)
                    for (int i = 1; i < M-1; i++) {
                        Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                             A[j-1][i] + A[j+1][i]);
                        err = max(err, abs(Anew[j][i] - A[j][i]));
                    }
        #pragma omp single
                { Atmp = Anew; Anew = A; A = Atmp; }
                iter++;
            }
        }

    • Pull thread creation out of the while loop.
    • Exploit the shared address space … swap pointers instead of copying.
    • Initialize A with the same threads and for-loop structures as in the computation so data is aligned with threads.
  32. OpenACC vs. OpenMP

    • OpenACC suffers from a form of the CUDA™ problem … it is focused on GPUs only.
    • OpenACC is an open standard (which is great), but it is backed by only a small subset of the industry … not the broad coverage of OpenMP or OpenCL.
    • The OpenMP accelerator directives:
      – Mesh with the existing OpenMP directives, so you can use both in a single program.
      – Are designed to support many-core CPUs (such as MIC), multicore CPUs, and GPUs.
      – Support a wider range of algorithms (though OpenACC 2.0 closes this gap).
    • So … the OpenMP directive set will hopefully displace the OpenACC directives as it is finalized and deployed into the market.
  33. A more complicated example: Jacobi iteration: OpenMP accelerator directives

        #pragma omp target data map(A, Anew)
        while (err > tol && iter < iter_max) {
            err = 0.0;
        #pragma omp target
        #pragma omp parallel for reduction(max:err)
            for (int j = 1; j < n-1; j++) {
                for (int i = 1; i < M-1; i++) {
                    Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                         A[j-1][i] + A[j+1][i]);
                    err = max(err, abs(Anew[j][i] - A[j][i]));
                }
            }
        #pragma omp target
        #pragma omp parallel for
            for (int j = 1; j < n-1; j++) {
                for (int i = 1; i < M-1; i++) {
                    A[j][i] = Anew[j][i];
                }
            }
            iter++;
        }

    • Create a data region on the GPU: map A and Anew onto the target device.
    • Copy A back out to the host … but only once.
    • Uses existing OpenMP constructs such as parallel and for.
  34. Conclusion (I'd rather be surfing)

    • Industry standards for heterogeneous platforms are in place.
      – Don't reward bad behavior … insist on standards and make your life easier.
    • OpenACC is ready … on Cray, PGI, CAPS, and NVIDIA.
    • OpenMP accelerator directives SHOULD roll out in 2013.
  35. Vector Addition - Host

    • The host program … the code that runs on the host to:
      – Set up the environment for the OpenCL program
      – Create and manage kernels
    • 5 simple steps in a basic host program:
      1. Define the platform … platform = devices + context + queues
      2. Create and build the program (dynamic library for kernels)
      3. Set up memory objects
      4. Define the kernel (attach the kernel function to its arguments)
      5. Submit commands … move memory objects and execute kernels
    Our goal is extreme portability, so we expose everything (i.e., we are a bit verbose). But most of a host code is the same from one application to the next … the re-use makes the verbosity a non-issue.
  36. 1. Define the platform

    • Grab the first available platform:

        err = clGetPlatformIDs(1, &firstPlatformId, &numPlatforms);

    • Use the first CPU device the platform provides:

        err = clGetDeviceIDs(firstPlatformId, CL_DEVICE_TYPE_CPU,
                             1, &device_id, NULL);

    • Create a simple context with a single device:

        cl_context_properties props[] = {
            CL_CONTEXT_PLATFORM, (cl_context_properties)firstPlatformId, 0 };
        context = clCreateContext(props, 1, &device_id, NULL, NULL, &err);

    • Create a simple command queue to feed our compute device:

        commands = clCreateCommandQueue(context, device_id, 0, &err);
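    Each of these calls reports a cl_int status; a tiny checking helper (hypothetical, not from the slides; assumes <stdio.h>, <stdlib.h>, and <CL/cl.h>) keeps a real host program honest:

        /* Hypothetical helper: abort with a message if an OpenCL call failed. */
        static void check(cl_int err, const char *what) {
            if (err != CL_SUCCESS) {
                fprintf(stderr, "%s failed (error %d)\n", what, (int)err);
                exit(EXIT_FAILURE);
            }
        }

        /* usage:
           err = clGetPlatformIDs(1, &firstPlatformId, &numPlatforms);
           check(err, "clGetPlatformIDs");                              */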
  37. 2. Create and Build the program

    • Define source code for the kernel-program as a string literal (great for toy programs) or read it from a file (common in real apps).
    • Build the program object:

        program = clCreateProgramWithSource(context, 1,
                      (const char **) &KernelSource, NULL, &err);

    • Compile the program to create a "dynamic library" from which specific kernels can be pulled:

        err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    • Fetch and print error messages (if (err != CL_SUCCESS)):

        size_t len;
        char buffer[2048];
        clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG,
                              sizeof(buffer), buffer, &len);
        printf("%s\n", buffer);
  38. 3. Setup Memory Objects

    • For vector addition we need 3 memory objects … one for each input vector (A and B) and one for the output vector (C).
    • Create input vectors and assign values on the host:

        float a_data[LENGTH], b_data[LENGTH], c_res[LENGTH];
        for (i = 0; i < count; i++) {
            a_data[i] = rand() / (float)RAND_MAX;
            b_data[i] = rand() / (float)RAND_MAX;
        }

    • Define OpenCL memory objects:

        a_in  = clCreateBuffer(context, CL_MEM_READ_ONLY,
                               sizeof(float) * count, NULL, NULL);
        b_in  = clCreateBuffer(context, CL_MEM_READ_ONLY,
                               sizeof(float) * count, NULL, NULL);
        c_out = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                               sizeof(float) * count, NULL, NULL);
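    An alternative, which the full listing on slide 15 uses, is to combine allocation with the host-to-device copy by passing CL_MEM_COPY_HOST_PTR, removing the need for explicit write commands in step 5:

        /* Create the input buffers and copy the host data in one call. */
        a_in = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              sizeof(float) * count, a_data, &err);
        b_in = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              sizeof(float) * count, b_data, &err);
        /* no clEnqueueWriteBuffer needed for a_in and b_in afterwards */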
  39. 4. Define the kernel

    • Create a kernel object from the kernel function "vadd":

        kernel = clCreateKernel(program, "vadd", &err);

    • Attach the arguments of the kernel function "vadd" to memory objects:

        err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), &a_in);
        err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &b_in);
        err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &c_out);
        err |= clSetKernelArg(kernel, 3, sizeof(unsigned int), &count);
  40. 5. Submit commands

    • Write buffers from the host into global memory (as non-blocking operations):

        err = clEnqueueWriteBuffer(commands, a_in, CL_FALSE, 0,
                  sizeof(float) * count, a_data, 0, NULL, NULL);
        err = clEnqueueWriteBuffer(commands, b_in, CL_FALSE, 0,
                  sizeof(float) * count, b_data, 0, NULL, NULL);

    • Enqueue the kernel for execution (note: in-order queue, so this is OK):

        err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL,
                  &global, &local, 0, NULL, NULL);

    • Read back the result (as a blocking operation). Use the fact that we have an in-order queue, which assures the previous commands are done before the read begins:

        err = clEnqueueReadBuffer(commands, c_out, CL_TRUE, 0,
                  sizeof(float) * count, c_res, 0, NULL, NULL);
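    If the queue were out-of-order, or if you simply want a hard synchronization point (for timing, say), the standard clFinish call blocks the host until every enqueued command has completed:

        clFinish(commands);  /* host blocks here until the queue drains */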
  41. Vector Addition - Host Program

    (Backup slide: the full host-program listing from slide 15, repeated.)

  42. Vector Addition - Host Program

    (Backup slide: the annotated listing from slide 16, repeated. Define the platform and queues; define memory objects; create and build the program; create and set up the kernel; execute the kernel; read results on the host. It's complicated, but most of this is "boilerplate" and not as bad as it looks.)