Slide 1

Programming Heterogeneous Computers: CUDA, OpenCL, OpenACC, and OpenMP

Tim Mattson, Kayak instructor (ACA certified, Advanced Open Ocean)
Intel Labs

Slide 2

Disclaimer: READ THIS … it's very important
• The views expressed in this talk are those of the speaker and not his employer.
• I am in a research group. I know nothing about Intel products, and if asked about them, I'll probably lie.

Slide 3

It's a Heterogeneous world

[Block diagram: CPUs, a GPU, and DRAM connected through the GMCH and ICH. GMCH = graphics memory control hub; ICH = input/output control hub.]

• A modern platform includes:
  – One or more CPUs
  – One or more GPUs

Source: SC10 OpenCL tutorial, Gaster and Mattson, 2010

Slide 4

Heterogeneity: beyond the GPU
• FPGA
• DSP
• Many-core CPU (MIC & Xeon Phi™)
• Integrated CPU + FPGA (Intel® Atom™ Processor E6x5C Series)

… and who knows what the future will bring?

*Other names and brands may be claimed as the property of others.

Slide 5

SW for Heterogeneous platforms

With so much diversity, why would any sane programmer lock themselves to a single vendor's platform?

Slide 6

NVIDIA® CUDA™
• Ian Buck¹ and company at NVIDIA® put GPGPU computing "on the map" with CUDA™ in 2006.
• CUDA™ is pushing beyond NVIDIA® GPUs:
  – NVIDIA® released their LLVM compiler for CUDA™ as open source².
  – PGI compilers³ support CUDA™ programs for x86 CPUs.

NVIDIA® maintains complete control over development of CUDA™. It is a proprietary solution.

¹ Ian Buck is widely acknowledged as the father of CUDA. It is closely related to his Ph.D. work on the Brook GPU streaming language, described in his dissertation with Pat Hanrahan's group at Stanford.
² www.eetimes.com/electronics-news/4372661/Nvidia-contributes-CUDA-compiler-to-open-source, 5/9/2012
³ http://www.pgroup.com/resources/cuda-x86.htm

*Other names and brands may be claimed as the property of others.

Slide 7

Outline
• Don't reward BAD behavior … avoid proprietary SW programming models!
• Open standards for heterogeneous computing
  – OpenCL
  – Programming with directives: OpenACC and OpenMP

Slide 8

OpenCL Working Group within Khronos
• Diverse industry participation …
  – Processor vendors, system OEMs, middleware vendors, application developers.
• OpenCL became an important standard "on release" by virtue of the market coverage of the companies behind it.

[Logos of the member companies, including Apple.]

Third party names are the property of their owners.

Slide 9

The BIG idea behind OpenCL
• OpenCL execution model … execute a kernel at each point in a problem domain.
  – E.g., process a 1024 x 1024 image with one kernel invocation per pixel, or 1024 x 1024 = 1,048,576 kernel executions.

The traditional, loop-oriented version of vector addition:

void trad_vadd(int n, const float *a, const float *b, float *c)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

Slide 10

OpenCL Platform Model
• One Host + one or more Compute Devices
  – Each Compute Device is composed of one or more Compute Units
  – Each Compute Unit is further divided into one or more Processing Elements
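
To make the platform hierarchy concrete, the host can query it at runtime. A minimal sketch using standard OpenCL 1.x calls (error checking omitted; not from the original slides):

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id   device;
    cl_uint        compute_units;

    clGetPlatformIDs(1, &platform, NULL);                 /* first available platform */
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1,
                   &device, NULL);                        /* first device it provides */
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(compute_units), &compute_units, NULL);
    printf("compute units on this device: %u\n", compute_units);
    return 0;
}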

Slide 11

Execution Model
• Host defines a command queue and associates it with a context (devices, kernels, memory, etc.).
• Host enqueues commands to the command queue.
• Kernel execution commands launch work-items: i.e., a kernel for each point in an abstract index space.

[Figure: a (Gx by Gy) index space. Work-items execute together as a work-group. A work-group at position (wx, wy) has size Sx by Sy; the work-item at local position (sx, sy) within it has global position (wx·Sx + sx, wy·Sy + sy), from (sx, sy) = (0, 0) up to (Sx - 1, Sy - 1).]
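
The index arithmetic in the figure is exactly what OpenCL's built-in work-item functions return inside a kernel. A small illustrative sketch (the kernel name and output buffer are hypothetical, not from the slides):

kernel void id_demo(global int *ok)
{
    size_t gid   = get_global_id(0);   // position in the global index space
    size_t group = get_group_id(0);    // which work-group (wx)
    size_t lid   = get_local_id(0);    // position within the work-group (sx)
    size_t lsize = get_local_size(0);  // work-group size (Sx)

    // the relation from the figure: global = group * size + local
    ok[gid] = (gid == group * lsize + lid);
}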

Slide 12

OpenCL vs. CUDA Terminology
• Host defines a command queue and associates it with a context (devices, kernels, memory, etc.).
• Host enqueues commands to the command queue.
• Kernel execution commands launch work-items: i.e., a kernel for each point in an abstract index space.

[Same (Gx by Gy) index-space figure as the previous slide, with the CUDA name next to each OpenCL concept:]

OpenCL           CUDA
command queue    stream
index space      grid
work-item        thread
work-group       thread block
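
To see how close the two models are, here is the vec_add kernel from later in this deck written as a CUDA kernel; a sketch for comparison only, not from the original slides:

__global__ void vec_add(const float *a, const float *b, float *c)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
    c[id] = a[id] + b[id];
}
// launched from the host as, e.g.: vec_add<<<n/256, 256>>>(a, b, c);
// (assuming n is a multiple of 256)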

Slide 13

OpenCL Memory model
• Implements a relaxed-consistency, shared-memory model.
  – Global memory: visible to host and compute devices
  – Private memory: local to each work-item
  – Local memory: shared within a work-group
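
A sketch showing all three regions in one kernel (illustrative only; the local buffer would be sized by the host, e.g. with clSetKernelArg(kernel, 2, sizeof(float)*group_size, NULL)):

kernel void memory_demo(global const float *in,
                        global float *out,
                        local  float *tmp)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    tmp[lid] = in[gid];            // stage data in work-group local memory
    barrier(CLK_LOCAL_MEM_FENCE);  // make it visible to the whole work-group

    float x = tmp[lid];            // x lives in private memory
    out[gid] = 2.0f * x;           // result back to global memory
}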

Slide 14

Vector Addition - Host
• The host program … the code that runs on the host to:
  – Set up the environment for the OpenCL program
  – Create and manage kernels
• 5 simple steps in a basic host program:
  1. Define the platform … platform = devices + context + queues
  2. Create and build the program (dynamic library for kernels)
  3. Set up memory objects
  4. Define the kernel (attach the kernel function to its arguments)
  5. Submit commands … move memory objects and execute kernels

Our goal is extreme portability, so we expose everything (i.e., we are a bit verbose). But most of a host code is the same from one application to the next … the re-use makes the verbosity a non-issue.

Slide 15

Vector Addition - Host Program

// create the OpenCL context on a GPU device
context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

// get the list of GPU devices associated with the context
clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
devices = malloc(cb);
clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

// create a command-queue
cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

// allocate the buffer memory objects
memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float)*n, srcA, NULL);
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float)*n, srcB, NULL);
memobjs[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                            sizeof(cl_float)*n, NULL, NULL);

// create the program
program = clCreateProgramWithSource(context, 1, &program_source, NULL, NULL);

// build the program
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

// create the kernel
kernel = clCreateKernel(program, "vec_add", NULL);

// set the args values
err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &memobjs[0]);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *) &memobjs[1]);
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *) &memobjs[2]);

// set work-item dimensions
global_work_size[0] = n;

// execute kernel
err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, global_work_size,
                             NULL, 0, NULL, NULL);

// read output array
err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
                          n*sizeof(cl_float), dst, 0, NULL, NULL);

Slide 16

Vector Addition - Host Program (annotated)

// ---- Define platform and queues ----
context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);
clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
devices = malloc(cb);
clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);
cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

// ---- Define memory objects ----
memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float)*n, srcA, NULL);
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float)*n, srcB, NULL);
memobjs[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                            sizeof(cl_float)*n, NULL, NULL);

// ---- Create the program ----
program = clCreateProgramWithSource(context, 1, &program_source, NULL, NULL);

// ---- Build the program ----
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

// ---- Create and set up the kernel ----
kernel = clCreateKernel(program, "vec_add", NULL);
err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &memobjs[0]);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *) &memobjs[1]);
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *) &memobjs[2]);
global_work_size[0] = n;

// ---- Execute the kernel ----
err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, global_work_size,
                             NULL, 0, NULL, NULL);

// ---- Read results on the host ----
err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
                          n*sizeof(cl_float), dst, 0, NULL, NULL);

It's complicated, but most of this is "boilerplate" and not as bad as it looks.

Slide 17

Host programs can be "ugly"

Our goal is extreme portability, so we expose everything (i.e., we are a bit verbose). But most of a host code is the same from one application to the next … this re-use makes the verbosity less of an issue.

Slide 18

OpenCL summary

[Summary diagram: within a GPU context — Programs (the dp_mul source below, compiled into dp_mul CPU and GPU program binaries), Kernels (dp_mul with arg[0], arg[1], arg[2] values attached), Memory Objects (images and buffers), and Command Queues (in-order and out-of-order) feeding a compute device (CPU or GPU).]

__kernel void dp_mul(global const float *a,
                     global const float *b,
                     global float *c)
{
    int id = get_global_id(0);
    c[id] = a[id] * b[id];
}

Third party names are the property of their owners.

Slide 19

Outline
• Don't reward BAD behavior … avoid proprietary SW programming models!
• Open standards for heterogeneous computing
  – OpenCL
  – Programming with directives: OpenACC and OpenMP

Slide 20

The serial "vadd" program
• Let's add two vectors together … C = A + B

void vadd(int n, const float *a, const float *b, float *c)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

Slide 21

The OpenCL vadd program
• Host program plus kernel just to add two vectors

kernel void vec_add(global const float *a,
                    global const float *b,
                    global float *c)
{
    int id = get_global_id(0);
    c[id] = a[id] + b[id];
}
// execute over "n" work-items

Slide 22

The OpenCL vadd program
• Host program plus kernel just to add two vectors

kernel void vec_add(global const float *a,
                    global const float *b,
                    global float *c)
{
    int id = get_global_id(0);
    c[id] = a[id] + b[id];
}
// execute over "n" work-items

Shouldn't a compiler with some hints be able to generate all this?
• Host: basic setup ops for most cases are the same from one program to the next.
• Kernel: wrap some "glue code" around the body of a loop.

Slide 23

Directive driven programming of Heterogeneous systems
• The Portland Group (PGI) introduced proprietary directives for programming GPUs.
• OpenMP (with help from PGI) launched a working group to define "accelerator directives" in OpenMP.
• A subset of the participants grew tired of the cautious, slow, and methodical approach of the OpenMP group … and split off to form their own group (OpenACC):
  – NVIDIA, Cray, PGI, and CAPS
• They launched the OpenACC directive set in November of 2011 at SC11.
• At SC12:
  – The OpenACC group released a review draft of OpenACC 2.0.
  – The OpenACC and OpenMP groups publicly stated their intent to rejoin the two efforts.
  – The OpenMP group released a technical report with the OpenMP accelerator directives.

Slide 24

Directive driven "vadd" program
• Let's add two vectors together … C = A + B

void vadd(int n, const float *a, const float *b, float *restrict c)
{
    int i;
#pragma acc kernels
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

Slide 25

A more complicated example: Jacobi iteration

while (err > tol && iter < iter_max) {
    err = 0.0;
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                 A[j-1][i] + A[j+1][i]);
            err = fmax(err, fabs(Anew[j][i] - A[j][i]));
        }
    }
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            A[j][i] = Anew[j][i];
        }
    }
    iter++;
}

Source: Mark Harris of NVIDIA®, "Getting Started with OpenACC", GPU Technology Conf., 2012

Slide 26

A more complicated example: Jacobi iteration: OpenMP (shared memory CPU)

while (err > tol && iter < iter_max) {
    err = 0.0;
#pragma omp parallel for reduction(max:err)
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                 A[j-1][i] + A[j+1][i]);
            err = fmax(err, fabs(Anew[j][i] - A[j][i]));
        }
    }
#pragma omp parallel for
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            A[j][i] = Anew[j][i];
        }
    }
    iter++;
}

Slide 27

A more complicated example: Jacobi iteration: OpenACC (GPU)

while (err > tol && iter < iter_max) {
    err = 0.0;
#pragma acc kernels reduction(max:err)
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                 A[j-1][i] + A[j+1][i]);
            err = fmax(err, fabs(Anew[j][i] - A[j][i]));
        }
    }
#pragma acc kernels
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            A[j][i] = Anew[j][i];
        }
    }
    iter++;
}

Slide 28

Results

[Performance chart from the cited talk.]

Source: Mark Harris of NVIDIA®, "Getting Started with OpenACC", GPU Technology Conf., 2012

Slide 29

A more complicated example: Jacobi iteration: OpenACC (GPU)

while (err > tol && iter < iter_max) {
    err = 0.0;
#pragma acc kernels reduction(max:err)
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                 A[j-1][i] + A[j+1][i]);
            err = fmax(err, fabs(Anew[j][i] - A[j][i]));
        }
    }
#pragma acc kernels
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            A[j][i] = Anew[j][i];
        }
    }
    iter++;
}

Slide 30

A more complicated example: Jacobi iteration: OpenACC (GPU)

#pragma acc data copy(A), create(Anew)
while (err > tol && iter < iter_max) {
    err = 0.0;
#pragma acc kernels reduction(max:err)
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                 A[j-1][i] + A[j+1][i]);
            err = fmax(err, fabs(Anew[j][i] - A[j][i]));
        }
    }
#pragma acc kernels
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            A[j][i] = Anew[j][i];
        }
    }
    iter++;
}

Slide 31

Results

Unfortunately, they didn't try very hard to optimize the OpenMP program, as seen by its max speedup of 1.76 on 6 threads.

[Performance chart from the cited talk.]

Source: Mark Harris of NVIDIA®, "Getting Started with OpenACC", GPU Technology Conf., 2012

Slide 32

Jacobi iteration: A more carefully written OpenMP program

#pragma omp parallel firstprivate(iter)
{
    // first touch inside the parallel region, so pages of A and Anew
    // are placed near the threads that will later work on them
    #pragma omp for
    for (int j = 1; j < n-1; j++)
        for (int i = 1; i < m-1; i++)
            Anew[j][i] = A[j][i];

    while (err > tol && iter < iter_max) {
        #pragma omp barrier
        #pragma omp single
        err = 0.0;

        #pragma omp for reduction(max:err)
        for (int j = 1; j < n-1; j++)
            for (int i = 1; i < m-1; i++) {
                Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                     A[j-1][i] + A[j+1][i]);
                err = fmax(err, fabs(Anew[j][i] - A[j][i]));
            }

        #pragma omp for
        for (int j = 1; j < n-1; j++)
            for (int i = 1; i < m-1; i++)
                A[j][i] = Anew[j][i];

        iter++;
    }
}

• The parallel region is created once, outside the while loop, so the team of threads is not forked and joined on every iteration.

Slide 33

OpenACC vs. OpenMP
• OpenACC suffers from a form of the CUDA™ problem … it is focused on GPUs only.
• OpenACC is an open standard (which is great), but it's backed by only a small subset of the industry … not the broad coverage of OpenMP or OpenCL.
• The OpenMP accelerator directives:
  – Mesh with the existing OpenMP directives, so you can use both in a single program.
  – Are designed to support many-core CPUs (such as MIC), multicore CPUs, and GPUs.
  – Support a wider range of algorithms (though OpenACC 2.0 closes this gap).
• So … the OpenMP directive set will hopefully displace the OpenACC directives as it is finalized and deployed into the market.
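
To make the comparison concrete, here is the vadd loop under each directive set. The OpenMP form follows the accelerator directives as drafted around SC12, so the exact syntax may differ from what was finally standardized:

/* OpenACC */
#pragma acc kernels
for (i = 0; i < n; i++)
    c[i] = a[i] + b[i];

/* OpenMP accelerator directives (draft syntax) */
#pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
#pragma omp parallel for
for (i = 0; i < n; i++)
    c[i] = a[i] + b[i];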

Slide 34

A more complicated example: Jacobi iteration: OpenMP accelerator directives

#pragma omp target data map(A, Anew)
while (err > tol && iter < iter_max) {
    err = 0.0;
#pragma omp target
#pragma omp parallel for reduction(max:err)
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                 A[j-1][i] + A[j+1][i]);
            err = fmax(err, fabs(Anew[j][i] - A[j][i]));
        }
    }
#pragma omp target
#pragma omp parallel for
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            A[j][i] = Anew[j][i];
        }
    }
    iter++;
}

Slide 35

Conclusion: I'd rather be surfing
• Industry standards for heterogeneous platforms are in place.
  – Don't reward bad behavior … insist on standards and make your life easier.
• OpenACC is ready … on Cray, PGI, CAPS, and NVIDIA®.
• OpenMP accelerator directives SHOULD roll out in 2013.

Slide 36

Backup
• OpenCL: detailed host program example

Slide 37

Vector Addition - Host
• The host program … the code that runs on the host to:
  – Set up the environment for the OpenCL program
  – Create and manage kernels
• 5 simple steps in a basic host program:
  1. Define the platform … platform = devices + context + queues
  2. Create and build the program (dynamic library for kernels)
  3. Set up memory objects
  4. Define the kernel (attach the kernel function to its arguments)
  5. Submit commands … move memory objects and execute kernels

Our goal is extreme portability, so we expose everything (i.e., we are a bit verbose). But most of a host code is the same from one application to the next … the re-use makes the verbosity a non-issue.

Slide 38

1. Define the platform

• Grab the first available platform:

err = clGetPlatformIDs(1, &firstPlatformId, &numPlatforms);

• Use the first CPU device the platform provides:

err = clGetDeviceIDs(firstPlatformId, CL_DEVICE_TYPE_CPU, 1, &device_id, NULL);

• Create a simple context with a single device (clCreateContext takes a properties list naming the platform):

cl_context_properties props[] =
    { CL_CONTEXT_PLATFORM, (cl_context_properties) firstPlatformId, 0 };
context = clCreateContext(props, 1, &device_id, NULL, NULL, &err);

• Create a simple command queue to feed our compute device:

commands = clCreateCommandQueue(context, device_id, 0, &err);

Slide 39

2. Create and Build the program

• Define source code for the kernel-program as a string literal (great for toy programs) or read it from a file (common in real apps).

• Build the program object:

program = clCreateProgramWithSource(context, 1, (const char **) &KernelSource, NULL, &err);

• Compile the program to create a "dynamic library" from which specific kernels can be pulled:

err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

• Fetch and print error messages (if (err != CL_SUCCESS)):

size_t len;
char buffer[2048];
clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG,
                      sizeof(buffer), buffer, &len);
printf("%s\n", buffer);

Slide 40

3. Setup Memory Objects

• For vector addition, 3 memory objects … one for each input vector (A and B) and one for the output vector (C).

• Create input vectors and assign values on the host:

float a_data[LENGTH], b_data[LENGTH], c_res[LENGTH];
for (i = 0; i < count; i++) {
    a_data[i] = rand() / (float)RAND_MAX;
    b_data[i] = rand() / (float)RAND_MAX;
}

• Define OpenCL memory objects:

a_in  = clCreateBuffer(context, CL_MEM_READ_ONLY,  sizeof(float) * count, NULL, NULL);
b_in  = clCreateBuffer(context, CL_MEM_READ_ONLY,  sizeof(float) * count, NULL, NULL);
c_out = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(float) * count, NULL, NULL);
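
As an aside, the input buffers could instead be filled at creation time with CL_MEM_COPY_HOST_PTR, as the host program in the backup slides does; a sketch:

a_in = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                      sizeof(float) * count, a_data, &err);
// copies a_data into the buffer now, so no later clEnqueueWriteBuffer is needed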

Slide 41

4. Define the kernel

• Create a kernel object from the kernel function "vadd":

kernel = clCreateKernel(program, "vadd", &err);

• Attach the arguments of the kernel function "vadd" to memory objects:

err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), &a_in);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &b_in);
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &c_out);
err |= clSetKernelArg(kernel, 3, sizeof(unsigned int), &count);

Slide 42

5. Submit commands

• Write buffers from the host into global memory (as non-blocking operations):

err = clEnqueueWriteBuffer(commands, a_in, CL_FALSE, 0, sizeof(float) * count,
                           a_data, 0, NULL, NULL);
err = clEnqueueWriteBuffer(commands, b_in, CL_FALSE, 0, sizeof(float) * count,
                           b_data, 0, NULL, NULL);

• Enqueue the kernel for execution (note: in-order queue, so this is OK):

err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, &local,
                             0, NULL, NULL);

• Read back the result (as a blocking operation). Use the fact that we have an in-order queue, which assures the previous commands are done before the read begins:

err = clEnqueueReadBuffer(commands, c_out, CL_TRUE, 0, sizeof(float) * count,
                          c_res, 0, NULL, NULL);
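
If the queue were out-of-order (or the work were spread across several queues), the ordering would have to be made explicit with events; a sketch, not from the original slides:

cl_event kernel_done;
err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, &local,
                             0, NULL, &kernel_done);   // capture a completion event
err = clEnqueueReadBuffer(commands, c_out, CL_TRUE, 0,
                          sizeof(float) * count, c_res,
                          1, &kernel_done, NULL);      // the read waits on that event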

Slide 43

Vector Addition - Host Program

// create the OpenCL context on a GPU device
context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

// get the list of GPU devices associated with the context
clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
devices = malloc(cb);
clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

// create a command-queue
cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

// allocate the buffer memory objects
memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float)*n, srcA, NULL);
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float)*n, srcB, NULL);
memobjs[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                            sizeof(cl_float)*n, NULL, NULL);

// create the program
program = clCreateProgramWithSource(context, 1, &program_source, NULL, NULL);

// build the program
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

// create the kernel
kernel = clCreateKernel(program, "vec_add", NULL);

// set the args values
err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &memobjs[0]);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *) &memobjs[1]);
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *) &memobjs[2]);

// set work-item dimensions
global_work_size[0] = n;

// execute kernel
err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, global_work_size,
                             NULL, 0, NULL, NULL);

// read output array
err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
                          n*sizeof(cl_float), dst, 0, NULL, NULL);

Slide 44

Vector Addition - Host Program (annotated)

// ---- Define platform and queues ----
context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);
clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
devices = malloc(cb);
clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);
cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

// ---- Define memory objects ----
memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float)*n, srcA, NULL);
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float)*n, srcB, NULL);
memobjs[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                            sizeof(cl_float)*n, NULL, NULL);

// ---- Create the program ----
program = clCreateProgramWithSource(context, 1, &program_source, NULL, NULL);

// ---- Build the program ----
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

// ---- Create and set up the kernel ----
kernel = clCreateKernel(program, "vec_add", NULL);
err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &memobjs[0]);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *) &memobjs[1]);
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *) &memobjs[2]);
global_work_size[0] = n;

// ---- Execute the kernel ----
err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, global_work_size,
                             NULL, 0, NULL, NULL);

// ---- Read results on the host ----
err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
                          n*sizeof(cl_float), dst, 0, NULL, NULL);

It's complicated, but most of this is "boilerplate" and not as bad as it looks.