
Tim Mattson

A review of open industry standards for programming heterogeneous platforms.

Multicore World 2013

February 19, 2013

Transcript

  1. Programming Heterogeneous computers: CUDA, OpenCL, OpenACC, and OpenMP

    Tim Mattson, kayak instructor (ACA certified, Advanced Open Ocean), Intel Labs
  2. Disclaimer: READ THIS … it's very important

    • The views expressed in this talk are those of the speaker and not his employer.
    • I am in a research group. I know nothing about Intel products and if asked about them, I'll probably lie.
  3. It's a Heterogeneous world

    [Block diagram: CPU, GPU, GMCH, ICH, and DRAM. GMCH = graphics memory control hub; ICH = input/output control hub.]
    • A modern platform includes:
      – One or more CPUs
      – One or more GPUs
    Source: SC10 OpenCL tutorial, Gaster and Mattson, 2010
  4. Heterogeneity: beyond the GPU

    • FPGA
    • DSP
    • Many-core CPU (MIC & Xeon Phi™)
    • Integrated CPU + FPGA (Intel® Atom™ Processor E6x5C Series)
    … and who knows what the future will bring?
    *Other names and brands may be claimed as the property of others
  5. SW for Heterogeneous platforms

    With so much diversity, why would any sane programmer lock themselves to a single vendor's platform?
  6. NVIDIA® CUDA™

    • Ian Buck¹ and company at NVIDIA® put GPGPU computing "on the map" with CUDA™ in 2006.
    • CUDA™ is pushing beyond NVIDIA® GPUs:
      – NVIDIA® released their LLVM compiler for CUDA™ as open source².
      – PGI compilers³ support CUDA™ programs for x86 CPUs.
    NVIDIA® maintains complete control over development of CUDA™. It is a proprietary solution.
    ¹ Ian Buck is widely acknowledged as the father of CUDA. It is closely related to his Ph.D. work on the Brook GPU streaming language, described in his dissertation with Pat Hanrahan's group at Stanford.
    ² www.eetimes.com/electronics-news/4372661/Nvidia-contributes-CUDA-compiler-to-open-source, 5/9/2012
    ³ http://www.pgroup.com/resources/cuda-x86.htm
    *Other names and brands may be claimed as the property of others
  7. Outline

    • Don't reward BAD behavior … avoid proprietary SW programming models!
    • Open standards for heterogeneous computing:
      – OpenCL
      – Programming with directives: OpenACC and OpenMP
  8. OpenCL Working Group within Khronos

    • Diverse industry participation … processor vendors, system OEMs, middleware vendors, application developers.
    • OpenCL became an important standard "on release" by virtue of the market coverage of the companies behind it.
    Third party names are the property of their owners.
  9. The BIG idea behind OpenCL

    • OpenCL execution model … execute a kernel at each point in a problem domain.
      – E.g., process a 1024 x 1024 image with one kernel invocation per pixel, or 1024 x 1024 = 1,048,576 kernel executions.

    Traditional loops:

        void trad_vadd(int n, const float *a, const float *b, float *c)
        {
            int i;
            for (i = 0; i < n; i++)
                c[i] = a[i] + b[i];
        }

    OpenCL kernel parallelism:

        kernel void vec_add(global const float *a,
                            global const float *b,
                            global float *c)
        {
            int id = get_global_id(0);
            c[id] = a[id] + b[id];
        }
        // execute over "n" work-items
  10. OpenCL Platform Model

    • One Host + one or more Compute Devices
      – Each Compute Device is composed of one or more Compute Units
      – Each Compute Unit is further divided into one or more Processing Elements
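    For concreteness, a minimal host-side sketch (not from the deck; variable names are mine) that asks a device how many Compute Units it exposes, assuming the standard OpenCL 1.x API:

        #include <stdio.h>
        #include <CL/cl.h>

        int main(void) {
            cl_platform_id platform;
            cl_device_id device;
            cl_uint num_cu;

            // grab the first platform and its first device
            clGetPlatformIDs(1, &platform, NULL);
            clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

            // query the number of compute units on that device
            clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                            sizeof(num_cu), &num_cu, NULL);
            printf("Compute units on this device: %u\n", num_cu);
            return 0;
        }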
  11. Execution Model

    • Host defines a command queue and associates it with a context (devices, kernels, memory, etc.).
    • Host enqueues commands to the command queue.
    • Kernel execution commands launch work-items: i.e., a kernel for each point in an abstract index space.
    • Work-items execute together as a work-group.
    [Figure: a (Gy by Gx) index space; each work-group (wx, wy) of size Sx by Sy contains work-items with local IDs (sx, sy) ranging from (0, 0) to (Sx - 1, Sy - 1), so a work-item's global position is (wx·Sx + sx, wy·Sy + sy).]
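    An illustrative kernel fragment (mine, not from the deck) showing how those index-space quantities surface in OpenCL C; get_global_id(0) is exactly the wx·Sx + sx of the figure:

        kernel void where_am_i(global int *out) {
            size_t gid = get_global_id(0);    // global index: wx*Sx + sx
            size_t wid = get_group_id(0);     // work-group index: wx
            size_t lid = get_local_id(0);     // index within the group: sx
            size_t sz  = get_local_size(0);   // work-group size: Sx
            out[gid] = (int)(wid * sz + lid); // equals gid by construction
        }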
  12. OpenCL vs. CUDA Terminology

    [Same index-space figure as the previous slide, annotated with the corresponding CUDA terms.]
    • OpenCL command queue ↔ CUDA stream
    • OpenCL index space (NDRange) ↔ CUDA grid
    • OpenCL work-group ↔ CUDA thread block
    • OpenCL work-items ↔ CUDA threads
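    To make the mapping concrete, here is the same vector-add index computation written in each model (an illustrative sketch, not from the deck):

        /* OpenCL: one work-item per element */
        kernel void vadd_cl(global const float *a, global const float *b,
                            global float *c) {
            int id = get_global_id(0);
            c[id] = a[id] + b[id];
        }

        /* CUDA: one thread per element; the block and thread indices
           combine to give the same global index                      */
        __global__ void vadd_cu(const float *a, const float *b, float *c) {
            int id = blockIdx.x * blockDim.x + threadIdx.x;
            c[id] = a[id] + b[id];
        }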
  13. OpenCL Memory model

    • Implements a relaxed-consistency, shared-memory model:
      – Global memory: visible to the host and to compute devices
      – Private memory: local to each work-item
      – Local memory: shared within a work-group
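    A kernel sketch (illustrative, not from the deck) that touches all three spaces; the barrier is needed so every work-item's write to local memory is visible to the rest of its work-group before anyone reads it:

        kernel void reverse_in_group(global float *data,
                                     local  float *scratch) {
            int gid = get_global_id(0);       // indexes global memory
            int lid = get_local_id(0);        // gid, lid live in private memory
            int n   = get_local_size(0);

            scratch[lid] = data[gid];         // stage into local memory
            barrier(CLK_LOCAL_MEM_FENCE);     // group-wide synchronization
            data[gid] = scratch[n - 1 - lid]; // read a neighbor's element
        }

    (On the host, the local buffer is sized, not copied: clSetKernelArg(kernel, 1, n * sizeof(float), NULL).)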
  14. Vector Addition - Host

    • The host program … the code that runs on the host to:
      – Set up the environment for the OpenCL program
      – Create and manage kernels
    • 5 simple steps in a basic host program:
      1. Define the platform … platform = devices + context + queues
      2. Create and build the program (dynamic library for kernels)
      3. Set up memory objects
      4. Define the kernel (attach the kernel function to its arguments)
      5. Submit commands … move memory objects and execute kernels
    Our goal is extreme portability, so we expose everything (i.e., we are a bit verbose). But most of a host code is the same from one application to the next … the re-use makes the verbosity a non-issue.
  15. Vector Addition - Host Program

        // create the OpenCL context on a GPU device
        context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU,
                                          NULL, NULL, NULL);

        // get the list of GPU devices
        clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
        devices = malloc(cb);
        clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

        // create a command-queue
        cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

        // allocate the buffer memory objects
        memobjs[0] = clCreateBuffer(context,
                         CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                         sizeof(cl_float)*n, srcA, NULL);
        memobjs[1] = clCreateBuffer(context,
                         CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                         sizeof(cl_float)*n, srcB, NULL);
        memobjs[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                         sizeof(cl_float)*n, NULL, NULL);

        // create the program
        program = clCreateProgramWithSource(context, 1,
                         &program_source, NULL, NULL);

        // build the program
        err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

        // create the kernel
        kernel = clCreateKernel(program, "vec_add", NULL);

        // set the args values
        err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
        err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
        err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&memobjs[2]);

        // set work-item dimensions
        global_work_size[0] = n;

        // execute kernel
        err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                  global_work_size, NULL, 0, NULL, NULL);

        // read output array
        err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
                  n*sizeof(cl_float), dst, 0, NULL, NULL);
  16. Vector Addition - Host Program

    The same listing as the previous slide, with each section labeled:
    • Define the platform and queues
    • Define memory objects
    • Create the program
    • Build the program
    • Create and set up the kernel
    • Execute the kernel
    • Read results on the host
    It's complicated, but most of this is "boilerplate" and not as bad as it looks.
  17. Host programs can be "ugly"

    Our goal is extreme portability, so we expose everything (i.e., we are a bit verbose). But most of a host code is the same from one application to the next … this re-use makes the verbosity less of an issue.
  18. OpenCL summary

    [Figure: the pieces of an OpenCL program. A Context holds Programs (e.g., dp_mul compiled to both CPU and GPU program binaries), Kernels with their argument values, Memory Objects (buffers and images), and Command Queues (in-order and out-of-order) feeding each Compute Device (CPU, GPU).]

        __kernel void dp_mul(global const float *a,
                             global const float *b,
                             global float *c)
        {
            int id = get_global_id(0);
            c[id] = a[id] * b[id];
        }

    Third party names are the property of their owners.
  19. Outline

    • Don't reward BAD behavior … avoid proprietary SW programming models!
    • Open standards for heterogeneous computing:
      – OpenCL
      – Programming with directives: OpenACC and OpenMP
  20. The serial "vadd" program

    • Let's add two vectors together …. C = A + B

        void vadd(int n, const float *a, const float *b, float *c)
        {
            int i;
            for (i = 0; i < n; i++)
                c[i] = a[i] + b[i];
        }

        int main(){
            float *a, *b, *c;
            int n = 10000;
            // allocate and fill a and b
            vadd(n, a, b, c);
        }
  21. The OpenCL vadd program

        kernel void vec_add(global const float *a,
                            global const float *b,
                            global float *c)
        {
            int id = get_global_id(0);
            c[id] = a[id] + b[id];
        }
        // execute over "n" work-items

    • Host program plus kernel just to add two vectors
  22. The OpenCL vadd program

    • The same kernel as the previous slide … a host program plus a kernel just to add two vectors.
    Shouldn't a compiler with some hints be able to generate all this?
    • Host: basic setup ops for most cases are the same from one program to the next.
    • Kernel: wrap some "glue code" around the body of a loop.
  23. Directive driven programming of Heterogeneous systems

    • The Portland Group (PGI) introduced proprietary directives for programming GPUs.
    • OpenMP (with help from PGI) launched a working group to define "accelerator directives" in OpenMP.
    • A subset of the participants grew tired of the cautious, slow, and methodical approach in the OpenMP group … and split off to form their own group (OpenACC):
      – NVIDIA, Cray, PGI, and CAPS
    • They launched the OpenACC directive set in November of 2011 at SC11.
    • At SC12:
      – The OpenACC group released a review draft of OpenACC 2.0.
      – The OpenACC and OpenMP groups publicly stated their intent to rejoin the two efforts.
      – The OpenMP group released a technical report with the OpenMP accelerator directives.
  24. Directive driven "vadd" program

    • Let's add two vectors together …. C = A + B

        void vadd(int n, const float *a,
                  const float *b, float *restrict c)
        {
            int i;
        #pragma acc kernels
            for (i = 0; i < n; i++)
                c[i] = a[i] + b[i];
        }

        int main(){
            float *a, *b, *c;
            int n = 10000;
            // allocate and fill a and b
            vadd(n, a, b, c);
        }

    • "restrict" assures the compiler that c is not aliased with other pointers.
    • "#pragma acc kernels" turns the loop into a kernel, moves data to a device, and launches the kernel. The host waits here until the kernel is done; then the output array c is copied back to the host.
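    As a usage note: with the PGI compilers of the day, such a program would typically be built with something like the line below (the -acc and -Minfo=accel flags are PGI's; the exact invocation may vary by version):

        pgcc -acc -Minfo=accel vadd.c -o vadd

    The -Minfo=accel report tells you which loops the compiler actually turned into kernels and what data it decided to move.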
  25. A more complicated example: Jacobi iteration

    Solve Laplace's equation for heat diffusion using an explicit finite difference, relaxation method (a four point stencil).

        while (err > tol && iter < iter_max) {
            err = 0.0;
            for (int j = 1; j < n-1; j++) {
                for (int i = 1; i < M-1; i++) {
                    Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                         A[j-1][i] + A[j+1][i]);
                    err = max(err, abs(Anew[j][i] - A[j][i]));
                }
            }
            for (int j = 1; j < n-1; j++) {
                for (int i = 1; i < M-1; i++) {
                    A[j][i] = Anew[j][i];
                }
            }
            iter++;
        }

    Source: Mark Harris of NVIDIA®, "Getting Started with OpenACC", GPU Technology Conf., 2012
  26. A more complicated example: Jacobi iteration: OpenMP (shared memory CPU)

        while (err > tol && iter < iter_max) {
            err = 0.0;
        #pragma omp parallel for reduction(max:err)
            for (int j = 1; j < n-1; j++) {
                for (int i = 1; i < M-1; i++) {
                    Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                         A[j-1][i] + A[j+1][i]);
                    err = max(err, abs(Anew[j][i] - A[j][i]));
                }
            }
        #pragma omp parallel for
            for (int j = 1; j < n-1; j++) {
                for (int i = 1; i < M-1; i++) {
                    A[j][i] = Anew[j][i];
                }
            }
            iter++;
        }

    Source: Mark Harris of NVIDIA®, "Getting Started with OpenACC", GPU Technology Conf., 2012
  27. A more complicated example: Jacobi iteration: OpenACC (GPU)

        while (err > tol && iter < iter_max) {
            err = 0.0;
        #pragma acc kernels reduction(max:err)
            for (int j = 1; j < n-1; j++) {
                for (int i = 1; i < M-1; i++) {
                    Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                         A[j-1][i] + A[j+1][i]);
                    err = max(err, abs(Anew[j][i] - A[j][i]));
                }
            }
        #pragma acc kernels
            for (int j = 1; j < n-1; j++) {
                for (int i = 1; i < M-1; i++) {
                    A[j][i] = Anew[j][i];
                }
            }
            iter++;
        }

    Source: Mark Harris of NVIDIA®, "Getting Started with OpenACC", GPU Technology Conf., 2012
  28. A more complicated example: Jacobi iteration: OpenACC (GPU)

    The same code as the previous slide, annotated:
    • A and Anew are copied between the host and the GPU on each iteration.
    • Performance was poor due to excess memory movement overhead.
    Source: Mark Harris of NVIDIA®, "Getting Started with OpenACC", GPU Technology Conf., 2012
  29. A more complicated example: Jacobi iteration: OpenACC (GPU)

        #pragma acc data copy(A), create(Anew)
        while (err > tol && iter < iter_max) {
            err = 0.0;
        #pragma acc kernels reduction(max:err)
            for (int j = 1; j < n-1; j++) {
                for (int i = 1; i < M-1; i++) {
                    Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                         A[j-1][i] + A[j+1][i]);
                    err = max(err, abs(Anew[j][i] - A[j][i]));
                }
            }
        #pragma acc kernels
            for (int j = 1; j < n-1; j++) {
                for (int i = 1; i < M-1; i++) {
                    A[j][i] = Anew[j][i];
                }
            }
            iter++;
        }

    • Create a data region on the GPU: copy A once onto the GPU, and create Anew on the device (no copy from the host).
    • Copy A back out to the host … but only once.
    Source: Mark Harris of NVIDIA®, "Getting Started with OpenACC", GPU Technology Conf., 2012
  30. Results

    Unfortunately, they didn't try very hard to optimize the OpenMP program, as seen in its maximum speedup of only 1.76 on 6 threads.
    Source: Mark Harris of NVIDIA®, "Getting Started with OpenACC", GPU Technology Conf., 2012
  31. Jacobi iteration: A more carefully written OpenMP program

        #pragma omp parallel firstprivate(iter)
        {
        #pragma omp for
            for (int j = 1; j < n-1; j++)
                for (int i = 1; i < M-1; i++)
                    A[j][i] = A_init;

            while (err > tol && iter < iter_max) {
                err = 0.0;
        #pragma omp for reduction(max:err)
                for (int j = 1; j < n-1; j++)
                    for (int i = 1; i < M-1; i++) {
                        Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                             A[j-1][i] + A[j+1][i]);
                        err = max(err, abs(Anew[j][i] - A[j][i]));
                    }
        #pragma omp single
                { Atmp = Anew; Anew = A; A = Atmp; }
                iter++;
            }
        }

    • Pull thread creation out of the while loop.
    • Exploit the shared address space … swap pointers instead of copying.
    • Initialize A with the same threads and for-loop structures as in the computation so data is aligned with threads.
  32. OpenACC vs. OpenMP

    • OpenACC suffers from a form of the CUDA™ problem … it is focused on GPUs only.
    • OpenACC is an open standard (which is great), but it is backed by only a small subset of the industry … not the broad coverage of OpenMP or OpenCL.
    • The OpenMP accelerator directives:
      – Mesh with the existing OpenMP directives, so you can use both in a single program.
      – Are designed to support many-core CPUs (such as MIC), multicore CPUs, and GPUs.
      – Support a wider range of algorithms (though OpenACC 2.0 closes this gap).
    • So … the OpenMP directive set will hopefully displace the OpenACC directives as it is finalized and deployed into the market.
  33. A more complicated example: Jacobi iteration: OpenMP accelerator directives

        #pragma omp target data map(A, Anew)
        while (err > tol && iter < iter_max) {
            err = 0.0;
        #pragma omp target
        #pragma omp parallel for reduction(max:err)
            for (int j = 1; j < n-1; j++) {
                for (int i = 1; i < M-1; i++) {
                    Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                         A[j-1][i] + A[j+1][i]);
                    err = max(err, abs(Anew[j][i] - A[j][i]));
                }
            }
        #pragma omp target
        #pragma omp parallel for
            for (int j = 1; j < n-1; j++) {
                for (int i = 1; i < M-1; i++) {
                    A[j][i] = Anew[j][i];
                }
            }
            iter++;
        }

    • Create a data region on the GPU: map A and Anew onto the target device.
    • Copy A back out to the host … but only once.
    • Uses existing OpenMP constructs such as parallel and for.
  34. Conclusion (I'd rather be surfing)

    • Industry standards for heterogeneous platforms are in place.
      – Don't reward bad behavior … insist on standards and make your life easier.
    • OpenACC is ready … on Cray, PGI, CAPS, and NVIDIA.
    • OpenMP accelerator directives SHOULD roll out in 2013.
  35. Vector Addition - Host

    • The host program … the code that runs on the host to:
      – Set up the environment for the OpenCL program
      – Create and manage kernels
    • 5 simple steps in a basic host program:
      1. Define the platform … platform = devices + context + queues
      2. Create and build the program (dynamic library for kernels)
      3. Set up memory objects
      4. Define the kernel (attach the kernel function to its arguments)
      5. Submit commands … move memory objects and execute kernels
    Our goal is extreme portability, so we expose everything (i.e., we are a bit verbose). But most of a host code is the same from one application to the next … the re-use makes the verbosity a non-issue.
  36. 1. Define the platform

    • Grab the first available platform:

        err = clGetPlatformIDs(1, &firstPlatformId, &numPlatforms);

    • Use the first CPU device the platform provides:

        err = clGetDeviceIDs(firstPlatformId, CL_DEVICE_TYPE_CPU,
                             1, &device_id, NULL);

    • Create a simple context with a single device:

        cl_context_properties props[] = {
            CL_CONTEXT_PLATFORM, (cl_context_properties)firstPlatformId, 0 };
        context = clCreateContext(props, 1, &device_id, NULL, NULL, &err);

    • Create a simple command queue to feed our compute device:

        commands = clCreateCommandQueue(context, device_id, 0, &err);
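    Each of these calls reports a cl_int status; a tiny checking helper (hypothetical, not from the slides; assumes <stdio.h>, <stdlib.h>, and <CL/cl.h>) keeps a real host program honest:

        /* Hypothetical helper: abort with a message if an OpenCL call failed. */
        static void check(cl_int err, const char *what) {
            if (err != CL_SUCCESS) {
                fprintf(stderr, "%s failed (error %d)\n", what, (int)err);
                exit(EXIT_FAILURE);
            }
        }

        /* usage:
           err = clGetPlatformIDs(1, &firstPlatformId, &numPlatforms);
           check(err, "clGetPlatformIDs");                              */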
  37. 2. Create and Build the program

    • Define source code for the kernel-program as a string literal (great for toy programs) or read it from a file (common in real apps).
    • Build the program object:

        program = clCreateProgramWithSource(context, 1,
                      (const char **) &KernelSource, NULL, &err);

    • Compile the program to create a "dynamic library" from which specific kernels can be pulled:

        err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    • Fetch and print error messages (if (err != CL_SUCCESS)):

        size_t len;
        char buffer[2048];
        clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG,
                              sizeof(buffer), buffer, &len);
        printf("%s\n", buffer);
  38. 3. Setup Memory Objects

    • For vector addition we need 3 memory objects … one for each input vector (A and B) and one for the output vector (C).
    • Create input vectors and assign values on the host:

        float a_data[LENGTH], b_data[LENGTH], c_res[LENGTH];
        for (i = 0; i < count; i++) {
            a_data[i] = rand() / (float)RAND_MAX;
            b_data[i] = rand() / (float)RAND_MAX;
        }

    • Define OpenCL memory objects:

        a_in  = clCreateBuffer(context, CL_MEM_READ_ONLY,
                               sizeof(float) * count, NULL, NULL);
        b_in  = clCreateBuffer(context, CL_MEM_READ_ONLY,
                               sizeof(float) * count, NULL, NULL);
        c_out = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                               sizeof(float) * count, NULL, NULL);
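    An alternative, which the full listing on slide 15 uses, is to combine allocation with the host-to-device copy by passing CL_MEM_COPY_HOST_PTR, removing the need for explicit write commands in step 5:

        /* Create the input buffers and copy the host data in one call. */
        a_in = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              sizeof(float) * count, a_data, &err);
        b_in = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              sizeof(float) * count, b_data, &err);
        /* no clEnqueueWriteBuffer needed for a_in and b_in afterwards */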
  39. 4. Define the kernel

    • Create a kernel object from the kernel function "vadd":

        kernel = clCreateKernel(program, "vadd", &err);

    • Attach the arguments of the kernel function "vadd" to memory objects:

        err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), &a_in);
        err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &b_in);
        err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &c_out);
        err |= clSetKernelArg(kernel, 3, sizeof(unsigned int), &count);
  40. 5. Submit commands

    • Write buffers from the host into global memory (as non-blocking operations):

        err = clEnqueueWriteBuffer(commands, a_in, CL_FALSE, 0,
                  sizeof(float) * count, a_data, 0, NULL, NULL);
        err = clEnqueueWriteBuffer(commands, b_in, CL_FALSE, 0,
                  sizeof(float) * count, b_data, 0, NULL, NULL);

    • Enqueue the kernel for execution (note: in-order queue, so this is OK):

        err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL,
                  &global, &local, 0, NULL, NULL);

    • Read back the result (as a blocking operation). Use the fact that we have an in-order queue, which assures the previous commands are done before the read begins:

        err = clEnqueueReadBuffer(commands, c_out, CL_TRUE, 0,
                  sizeof(float) * count, c_res, 0, NULL, NULL);
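    If the queue were out-of-order, or if you simply want a hard synchronization point (for timing, say), the standard clFinish call blocks the host until every enqueued command has completed:

        clFinish(commands);  /* host blocks here until the queue drains */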
  41. Vector Addition - Host Program

    (Backup slide: the full host-program listing from slide 15, repeated.)

  42. Vector Addition - Host Program

    (Backup slide: the annotated listing from slide 16, repeated. Define the platform and queues; define memory objects; create and build the program; create and set up the kernel; execute the kernel; read results on the host. It's complicated, but most of this is "boilerplate" and not as bad as it looks.)