Parallel Programming with (Py)OpenCL for Fun and Profit by Gordon Inggs

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 1/74 Parallel Programming with (Py)OpenCL for
Fun and Parallel Programming with (Py)OpenCL for Fun and Pro t Pro t Gordon Inggs Github: Gordonei Day Job(s): Science Data, City of Cape Town Application Acceleration Consultant, My Lounge/Back Garden* *weather permitting

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 2/74 Talk Outline Talk Outline Why
PyOpenCL (~5 mins) How to x with PyOpenCL (~15 mins) Program CPUs, GPUs and more Manipulate Memory Do things concurrently in parallel

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 3/74 Github Repo: Gordonei/pyconza-october-2018 (https://github.com/Gordonei/pyconza- october-2018)
Why PyOpenCL? ( Why PyOpenCL? ( Why Heterogeneous Compute? ) Why Heterogeneous Compute? ) ≈

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 4/74

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 5/74 Bad News: The Free Lunch
is over Bad News: The Free Lunch is over

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 6/74 Free Lunch is over graph
Source: Herb Sutter, The Free Lunch is Over

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 7/74 Good New: There are alternatives
Good New: There are alternatives

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 8/74 Performance-wise Performance-wise

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 10/74 Source: Karl Rupp, ViennaCL Energy-wise
Energy-wise

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 12/74 Source: Karl Rupp, ViennaCL

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 13/74 Performance vs Energy Performance vs
Energy

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 14/74 Sounds great, but... Sounds great,
but...

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 15/74 The Challenges of Heterogeneous Computing
The Challenges of Heterogeneous Computing 1. The Orientation Problem - turning things on! 2. The IO Problem - moving data around! 3. The Conceptual Problem - what am I doing?!? Programming Fancy Devices Programming Fancy Devices

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 16/74 In this section: What an
OpenCL program is How to compile an OpenCL program How to run an OpenCL program

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 17/74 Khronos Group Promoters Khronos Group
Promoters

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 18/74 Selecting platforms &/ devices Selecting
platforms &/ devices Looking at what is available, then selecting the NVIDIA one. In [1]: import pyopencl import numpy

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 19/74 In [2]: ocl_platforms = (platform.name
for platform in pyopencl.get_platforms()) print("\n".join(ocl_platforms)) In [3]: nvidia_platform = [platform for platform in pyopencl.get_platforms() if platform.name == "NVIDIA CUDA"][0] Intel(R) OpenCL Portable Computing Language NVIDIA CUDA OpenCL Programming Abstractions OpenCL Programming Abstractions

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 20/74 Building Programs Building Programs Using
the PyOpenCL bindings to create an OpenCL context, then declaring OpenCL kernel code, and compiling it.

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 21/74 The code is for a
simple vector sum, i.e. = + c⃗ a⃗ b ⃗ In [4]: nvidia_context = pyopencl.Context(devices=nvidia_devices) program_source = """ kernel void sum(global float *a, global float *b, global float *c){ int gid = get_global_id(0); c[gid] = a[gid] + b[gid]; } """ nvidia_program_source = pyopencl.Program(nvidia_context, program_source) nvidia_program = nvidia_program_source.build() In [5]: program_kernel_names = nvidia_program.get_info(pyopencl.program_info.KERNEL_NAME S) print("Kernel Names: {}".format(program_kernel_names)) Kernel Names: sum

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 22/74 OpenCL Runtime Abstractions OpenCL Runtime
Abstractions

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 23/74 Running Programs Running Programs Actually
running the code using the Python bindings. Some memory and data needs to be setup rst, but I'll elaborate on this later.

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 24/74 In [6]: def run_ocl_kernel(queue, kernel,
global_size, input_tuples, output_tuples, local_size = (32,)): # copying data onto the device for (array, buffer) in input_tuples: pyopencl.enqueue_copy(queue, src=array, dest=buffer) # running program on the device kernel arguments = [buffer for ( buffer) in input tuples] Trust, but verify: Trust, but verify:

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 25/74 In [7]: def check_sum_results(a,b,c): c_ref
= a + b err = numpy.abs(c - c_ref) if((err.sum() > 0.0).any()): print("result does not match") else: print("result matches!") In [8]: # Synthetic data setup N = int(2**20) a = numpy.random.rand(N).astype(numpy.float32) b = numpy.random.rand(N).astype(numpy.float32) c = numpy.empty_like(a) # Device Memory setup a_nvidia_buffer = pyopencl.Buffer(nvidia_context, flags=pyopencl.mem_flags.READ_ONLY, size=a.nbytes) b_nvidia_buffer = pyopencl.Buffer(nvidia_context,

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 26/74 flags=pyopencl.mem_flags.READ_ONLY, size=b.nbytes) c_nvidia_buffer = pyopencl.Buffer(nvidia_context,
flags=pyopencl.mem_flags.WRITE_ONLY, size=c.nbytes) In [9]: nvidia_queue = pyopencl.CommandQueue(nvidia_context) input_tuples = ((a, a_nvidia_buffer), (b, b_nvidia_buffer), ) output_tuples = ((c, c_nvidia_buffer),) run_ocl_kernel(nvidia_queue, nvidia_program.sum, (N,), input_tuples, output_tupl es) In [10]: check_sum_results(a, b, c) In [11]: %timeit run_ocl_kernel(nvidia_queue, nvidia_program.sum, (N,), input_tuples, out put_tuples) result matches! 2.65 ms ± 507 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) How to Manipulate Memory How to Manipulate Memory

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 27/74 In this section: Moving data
between the host and OpenCL device Specifying storage on the device being used

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 29/74 OpenCL Memory Abstractions OpenCL Memory
Abstractions

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 30/74 Using global memory Using global
memory

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 31/74 In [12]: def create_input_memory(context, input_arrays):
return [(array, pyopencl.Buffer(context, flags=pyopencl.mem_flags.READ_ONLY, size=array.nbytes)) for array in input_arrays] In [13]: def create_output_memory(context, output_arrays): return [(array, pyopencl.Buffer(context, flags=pyopencl.mem_flags.WRITE_ONLY, size=array.nbytes)) for array in output_arrays]

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 32/74 In [15]: a = numpy.random.rand(N).astype(numpy.float32)
b = numpy.random.rand(N).astype(numpy.float32) c = numpy.empty_like(a) # Device Memory setup input_tuples = create_input_memory(nvidia_context, (a,b,)) output_tuples = create_output_memory(nvidia_context, (c,),) run_ocl_kernel(nvidia_queue, nvidia_program.sum, (N,), input_tuples, output_tupl es) check_sum_results(a,b,c) result matches!

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 33/74 Batching Batching kernel void sum_batched(global
float *a, global float *b, global float *c){ int gid = get_global_id(0)*BATCH_SIZE; for(int i=0; i<BATCH_SIZE;++i) c[gid + i] = a[gid + i] + b[gid + i]; }

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 34/74 In [18]: %timeit run_ocl_kernel(nvidia_queue, nvidia_program.sum_batched,
(N//batch_siz e,), input_tuples, output_tuples) 3.62 ms ± 543 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) Using local and private memory Using local and private memory

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 35/74 Batched + Private Memory Batched
+ Private Memory

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 36/74 kernel void sum_batched_private(global float *a,
global float *b, global float *c){ int gid = get_global_id(0)*BATCH_SIZE; float a_tmp[BATCH_SIZE], b_tmp[BATCH_SIZE], c_tmp[BATCH_SIZE]; for(int i=0; i<BATCH_SIZE;++i){ a_tmp[i] = a[gid + i]; b_tmp[i] = b[gid + i]; } for(int i=0; i<BATCH_SIZE;++i) c_tmp[i] = a_tmp[i] + b_tmp[i]; for(int i=0; i<BATCH_SIZE;++i) c[gid + i] = c_tmp[i]; } How to Do Things in Parallel How to Do Things in Parallel In this section:

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 37/74 Inspecting the characteristics of the
available devices Use the work-items and work-groups abstraction to chose between task and data parallelism.

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 39/74 Device characteristics Device characteristics Using
the host code API to inspect the characteristics of the device.

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 40/74 In [21]: intel_platform = [platform
for platform in pyopencl.get_platforms() if platform.name == "Intel(R) OpenCL"][0]

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 41/74 In [22]: for device in
intel_platform.get_devices(): #print out all of the device name properties, except the device type for property_name in sorted(name_properties.keys() - {"Device Type"}): property_string_args = (property_name,device.get_info(name_properties[pr operty_name])) print("{}: {}".format(*property_string_args)) #look up the device type print("Device Types: {}".format(device_types[device.get_info(name_properties ["Device Type"])])) #print out all of the processing properties for property_name in sorted(processing_properties.keys()):

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 42/74 property_string_args = (property_name,device.get_info(processing_propert ies[property_name])) print("{}:
{}".format(*property_string_args)) #print out all of the memory properties for property_name in sorted(memory_properties.keys()): property_string_args = (property_name,device.get_info(memory_properties[ property_name])) print("{}: {}".format(*property_string_args)) print("\n") Device Name: Intel(R) HD Graphics Device Platform: <pyopencl.Platform 'Intel(R) OpenCL' at 0x321d400> Device Types: GPU Available Compute Units: 23 Clockrate: 1100 Available Constant Memory: 4294959103 Available Global Memory: 13321633792 Available Local Memory: 65536 Device Name: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz Device Platform: <pyopencl.Platform 'Intel(R) OpenCL' at 0x321d400> Device Types: CPU Available Compute Units: 8 Clockrate: 2800 Available Constant Memory: 131072 Available Global Memory: 16662528000 Available Local Memory: 32768

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 44/74 In [23]: !clinfo Number of
platforms 3 Platform Name Intel(R) OpenCL Platform Vendor Intel(R) Corporation Platform Version OpenCL 2.0 Platform Profile FULL_PROFILE Platform Extensions cl_khr_3d_image_writes cl_kh r_byte_addressable_store cl_khr_depth_images cl_khr_fp64 cl_khr_global_int32_b ase_atomics cl_khr_global_int32_extended_atomics cl_khr_icd cl_khr_image2d_fro m_buffer cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics c l_khr_spir Platform Extensions function suffix INTEL Platform Name Portable Computing Language Platform Vendor The pocl project Platform Version OpenCL 1.2 pocl 1.1 None+Ass erts, LLVM 6.0.0, SPIR, SLEEF, DISTRO, POCL_DEBUG Platform Profile FULL_PROFILE Platform Extensions cl_khr_icd Platform Extensions function suffix POCL Platform Name NVIDIA CUDA Platform Vendor NVIDIA Corporation Platform Version OpenCL 1.2 CUDA 9.1.84 Platform Profile FULL_PROFILE Platform Extensions cl_khr_global_int32_base_ato mics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_k hr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_k hr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query c

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 45/74 l_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer Platform Extensions
function suffix NV Platform Name Intel(R) OpenCL Number of devices 2 Device Name Intel(R) HD Graphics Device Vendor Intel(R) Corporation Device Vendor ID 0x8086 Device Version OpenCL 2.0 Driver Version r5.0.63503 Device OpenCL C Version OpenCL C 2.0 Device Type GPU Device Profile FULL_PROFILE Device Available Yes Compiler Available Yes Linker Available Yes Max compute units 23 Max clock frequency 1100MHz Device Partition (core) Max number of sub-devices 0 Supported partition types by <unknown> (0x558D0000000 0) Supported affinity domains 0x558D00000000 Max work item dimensions 3 Max work item sizes 256x256x256 Max work group size 256 Preferred work group size multiple 32 Sub-group sizes (Intel) 8, 16, 32 Preferred / native vector sizes char 16 / 16 short 8 / 8

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 46/74 int 4 / 4 long
1 / 1 half 8 / 8 (cl_khr_ fp16) float 1 / 1 double 1 / 1 (cl_khr_ fp64) Half-precision Floating-point support (cl_khr_fp16) Denormals Yes Infinity and NANs Yes Round to nearest Yes Round to zero Yes Round to infinity Yes IEEE754-2008 fused multiply-add Yes Support is emulated in software No Single-precision Floating-point support (core) Denormals Yes Infinity and NANs Yes Round to nearest Yes Round to zero Yes Round to infinity Yes IEEE754-2008 fused multiply-add Yes Support is emulated in software No Correctly-rounded divide and sqrt operations Yes Double-precision Floating-point support (cl_khr_fp64) Denormals Yes Infinity and NANs Yes Round to nearest Yes Round to zero Yes Round to infinity Yes IEEE754-2008 fused multiply-add Yes

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 47/74 Support is emulated in software
No Address bits 64, Little-Endian Global memory size 13321633792 (12.41GiB) Error Correction support No Max memory allocation 4294959103 (4GiB) Unified memory for Host and Device Yes Shared Virtual Memory (SVM) capabilities (core) Coarse-grained buffer sharing Yes Fine-grained buffer sharing No Fine-grained system sharing No Atomics No Minimum alignment for any data type 128 bytes Alignment of base address 1024 bits (128 bytes) Preferred alignment for atomics SVM 64 bytes Global 64 bytes Local 64 bytes Max size for global variable 65536 (64KiB) Preferred total size of global vars 4294959103 (4GiB) Global Memory cache type Read/Write Global Memory cache size 524288 (512KiB) Global Memory cache line size 64 bytes Image support Yes Max number of samplers per kernel 16 Max size for 1D images from buffer 268434943 pixels Max 1D or 2D image array size 2048 images Base address alignment for 2D image buffers 4 bytes Pitch alignment for 2D image buffers 4 pixels Max 2D image size 16384x16384 pixels Max planar YUV image size 16384x16380 pixels Max 3D image size 16384x16384x2048 pixels

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 48/74 Max number of read image
args 128 Max number of write image args 128 Max number of read/write image args 128 Max number of pipe args 16 Max active pipe reservations 1 Max pipe packet size 1024 Local memory type Local Local memory size 65536 (64KiB) Max number of constant args 8 Max constant buffer size 4294959103 (4GiB) Max size of kernel argument 1024 Queue properties (on host) Out-of-order execution Yes Profiling Yes Queue properties (on device) Out-of-order execution Yes Profiling Yes Preferred size 131072 (128KiB) Max size 67108864 (64MiB) Max queues on device 1 Max events on device 1024 Prefer user sync for interop Yes Profiling timer resolution 83ns Execution capabilities Run OpenCL kernels Yes Run native kernels No SPIR versions 1.2 printf() buffer size 4194304 (4MiB) Built-in kernels block_motion_estimate_intel; block_advanced_motion_estimate_check_intel;block_advanced_motion_estimate_bidi rectional_check_intel

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 49/74 Motion Estimation accelerator version (Intel)
2 Device-side AVC Motion Estimation version 1 Supports texture sampler use Yes Supports preemption No Device Extensions cl_intel_accelerator cl_inte l_advanced_motion_estimation cl_intel_device_side_avc_motion_estimation cl_int el_driver_diagnostics cl_intel_media_block_io cl_intel_motion_estimation cl_in tel_planar_yuv cl_intel_packed_yuv cl_intel_required_subgroup_size cl_intel_su bgroups cl_intel_subgroups_short cl_intel_va_api_media_sharing cl_khr_3d_image _writes cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_fp16 cl_khr_f p64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_k hr_icd cl_khr_image2d_from_buffer cl_khr_local_int32_base_atomics cl_khr_local _int32_extended_atomics cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_khr_ spir cl_khr_subgroups Device Name Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz Device Vendor Intel(R) Corporation Device Vendor ID 0x8086 Device Version OpenCL 2.0 (Build 475) Driver Version 1.2.0.475 Device OpenCL C Version OpenCL C 2.0 Device Type CPU Device Profile FULL_PROFILE Device Available Yes Compiler Available Yes Linker Available Yes Max compute units 8 Max clock frequency 2800MHz Device Partition (core) Max number of sub-devices 8

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 50/74 Supported partition types by counts,
equally, by names (Intel) Max work item dimensions 3 Max work item sizes 8192x8192x8192 Max work group size 8192 Preferred work group size multiple 128 Preferred / native vector sizes char 1 / 32 short 1 / 16 int 1 / 8 long 1 / 4 half 0 / 0 (n/a) float 1 / 8 double 1 / 4 (cl_khr_ fp64) Half-precision Floating-point support (n/a) Single-precision Floating-point support (core) Denormals Yes Infinity and NANs Yes Round to nearest Yes Round to zero No Round to infinity No IEEE754-2008 fused multiply-add No Support is emulated in software No Correctly-rounded divide and sqrt operations No Double-precision Floating-point support (cl_khr_fp64) Denormals Yes Infinity and NANs Yes Round to nearest Yes Round to zero Yes Round to infinity Yes

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 51/74 IEEE754-2008 fused multiply-add Yes Support
is emulated in software No Address bits 64, Little-Endian Global memory size 16662528000 (15.52GiB) Error Correction support No Max memory allocation 4165632000 (3.88GiB) Unified memory for Host and Device Yes Shared Virtual Memory (SVM) capabilities (core) Coarse-grained buffer sharing Yes Fine-grained buffer sharing No Fine-grained system sharing No Atomics No Minimum alignment for any data type 128 bytes Alignment of base address 1024 bits (128 bytes) Preferred alignment for atomics SVM 64 bytes Global 64 bytes Local 0 bytes Max size for global variable 65536 (64KiB) Preferred total size of global vars 65536 (64KiB) Global Memory cache type Read/Write Global Memory cache size 262144 (256KiB) Global Memory cache line size 64 bytes Image support Yes Max number of samplers per kernel 480 Max size for 1D images from buffer 260352000 pixels Max 1D or 2D image array size 2048 images Base address alignment for 2D image buffers 64 bytes Pitch alignment for 2D image buffers 64 pixels Max 2D image size 16384x16384 pixels Max 3D image size 2048x2048x2048 pixels

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 52/74 Max number of read image
args 480 Max number of write image args 480 Max number of read/write image args 480 Max number of pipe args 16 Max active pipe reservations 32767 Max pipe packet size 1024 Local memory type Global Local memory size 32768 (32KiB) Max number of constant args 480 Max constant buffer size 131072 (128KiB) Max size of kernel argument 3840 (3.75KiB) Queue properties (on host) Out-of-order execution Yes Profiling Yes Local thread execution (Intel) Yes Queue properties (on device) Out-of-order execution Yes Profiling Yes Preferred size 4294967295 (4GiB) Max size 4294967295 (4GiB) Max queues on device 4294967295 Max events on device 4294967295 Prefer user sync for interop No Profiling timer resolution 1ns Execution capabilities Run OpenCL kernels Yes Run native kernels Yes SPIR versions 1.2 printf() buffer size 1048576 (1024KiB) Built-in kernels Device Extensions cl_khr_icd cl_khr_global_int

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 53/74 32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_a tomics cl_khr_local_int32_extended_atomics
cl_khr_byte_addressable_store cl_kh r_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spi r cl_khr_fp64 cl_khr_image2d_from_buffer Platform Name Portable Computing Language Number of devices 1 Device Name pthread-Intel(R) Core(TM) i7 -7700HQ CPU @ 2.80GHz Device Vendor GenuineIntel Device Vendor ID 0x8086 Device Version OpenCL 1.2 pocl HSTR: pthrea d-x86_64-pc-linux-gnu-skylake Driver Version 1.1 Device OpenCL C Version OpenCL C 1.2 pocl Device Type CPU Device Profile FULL_PROFILE Device Available Yes Compiler Available Yes Linker Available Yes Max compute units 8 Max clock frequency 3800MHz Device Partition (core) Max number of sub-devices 8 Supported partition types equally, by counts Max work item dimensions 3 Max work item sizes 4096x4096x4096 Max work group size 4096 Preferred work group size multiple 8 Preferred / native vector sizes char 16 / 16

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 54/74 short 16 / 16 int
8 / 8 long 4 / 4 half 0 / 0 (n/a) float 8 / 8 double 4 / 4 (cl_khr_ fp64) Half-precision Floating-point support (n/a) Single-precision Floating-point support (core) Denormals Yes Infinity and NANs Yes Round to nearest Yes Round to zero Yes Round to infinity Yes IEEE754-2008 fused multiply-add Yes Support is emulated in software No Correctly-rounded divide and sqrt operations Yes Double-precision Floating-point support (cl_khr_fp64) Denormals Yes Infinity and NANs Yes Round to nearest Yes Round to zero Yes Round to infinity Yes IEEE754-2008 fused multiply-add Yes Support is emulated in software No Address bits 64, Little-Endian Global memory size 14515044352 (13.52GiB) Error Correction support No Max memory allocation 4294967296 (4GiB) Unified memory for Host and Device Yes Minimum alignment for any data type 128 bytes

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 55/74 Alignment of base address 1024
bits (128 bytes) Global Memory cache type Read/Write Global Memory cache size 6291456 (6MiB) Global Memory cache line size 64 bytes Image support Yes Max number of samplers per kernel 16 Max size for 1D images from buffer 268435456 pixels Max 1D or 2D image array size 2048 images Max 2D image size 16384x16384 pixels Max 3D image size 2048x2048x2048 pixels Max number of read image args 128 Max number of write image args 128 Local memory type Global Local memory size 4194304 (4MiB) Max number of constant args 8 Max constant buffer size 4194304 (4MiB) Max size of kernel argument 1024 Queue properties Out-of-order execution No Profiling Yes Prefer user sync for interop Yes Profiling timer resolution 1ns Execution capabilities Run OpenCL kernels Yes Run native kernels Yes SPIR versions 1.2 printf() buffer size 1048576 (1024KiB) Built-in kernels Device Extensions cl_khr_byte_addressable_stor e cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr _local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 56/74 writes cl_khr_spir cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended
_atomics cl_khr_fp64 Platform Name NVIDIA CUDA Number of devices 1 Device Name GeForce GTX 1050 Device Vendor NVIDIA Corporation Device Vendor ID 0x10de Device Version OpenCL 1.2 CUDA Driver Version 390.48 Device OpenCL C Version OpenCL C 1.2 Device Type GPU Device Topology (NV) PCI-E, 01:00.0 Device Profile FULL_PROFILE Device Available Yes Compiler Available Yes Linker Available Yes Max compute units 5 Max clock frequency 1493MHz Compute Capability (NV) 6.1 Device Partition (core) Max number of sub-devices 1 Supported partition types None Max work item dimensions 3 Max work item sizes 1024x1024x64 Max work group size 1024 Preferred work group size multiple 32 Warp size (NV) 32 Preferred / native vector sizes char 1 / 1 short 1 / 1

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 57/74 int 1 / 1 long
1 / 1 half 0 / 0 (n/a) float 1 / 1 double 1 / 1 (cl_khr_ fp64) Half-precision Floating-point support (n/a) Single-precision Floating-point support (core) Denormals Yes Infinity and NANs Yes Round to nearest Yes Round to zero Yes Round to infinity Yes IEEE754-2008 fused multiply-add Yes Support is emulated in software No Correctly-rounded divide and sqrt operations Yes Double-precision Floating-point support (cl_khr_fp64) Denormals Yes Infinity and NANs Yes Round to nearest Yes Round to zero Yes Round to infinity Yes IEEE754-2008 fused multiply-add Yes Support is emulated in software No Address bits 64, Little-Endian Global memory size 4238737408 (3.948GiB) Error Correction support No Max memory allocation 1059684352 (1011MiB) Unified memory for Host and Device No Integrated memory (NV) No Minimum alignment for any data type 128 bytes

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 58/74 Alignment of base address 4096
bits (512 bytes) Global Memory cache type Read/Write Global Memory cache size 81920 (80KiB) Global Memory cache line size 128 bytes Image support Yes Max number of samplers per kernel 32 Max size for 1D images from buffer 134217728 pixels Max 1D or 2D image array size 2048 images Max 2D image size 16384x32768 pixels Max 3D image size 16384x16384x16384 pixels Max number of read image args 256 Max number of write image args 16 Local memory type Local Local memory size 49152 (48KiB) Registers per block (NV) 65536 Max number of constant args 9 Max constant buffer size 65536 (64KiB) Max size of kernel argument 4352 (4.25KiB) Queue properties Out-of-order execution Yes Profiling Yes Prefer user sync for interop No Profiling timer resolution 1000ns Execution capabilities Run OpenCL kernels Yes Run native kernels No Kernel execution timeout (NV) Yes Concurrent copy and kernel execution (NV) Yes Number of async copy engines 2 printf() buffer size 1048576 (1024KiB) Built-in kernels

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 59/74 Device Extensions cl_khr_global_int32_base_ato mics cl_khr_global_int32_extended_atomics
cl_khr_local_int32_base_atomics cl_k hr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_k hr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query c l_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer NULL platform behavior clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) No platform clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) No platform clCreateContext(NULL, ...) [default] No platform clCreateContext(NULL, ...) [other] Success [INTEL] clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) No platform clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No platform clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) No platform clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No platform clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No platform clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) No platform

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 60/74 Workitems vs Workgroups Workitems vs
Workgroups Exploring different types of parallelism, with our trusty vector sum. OpenCL Task vs Data Parallelism Abstractions OpenCL Task vs Data Parallelism Abstractions

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 61/74 Vectorising Vectorising kernel void sum16(global
float16 *a, global float16 *b, global float16 *c){ int gid = get_global_id(0); c[gid] = a[gid] + b[gid]; }

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 62/74 In [24]: %timeit run_ocl_kernel(nvidia_queue, nvidia_program.sum16,
(N//16,), input_tuple s, output_tuples) 2.36 ms ± 528 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) Vectorised + Local Memory Vectorised + Local Memory

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 63/74 kernel void sum16_local(global float16 *a,
global float16 *b, global float16 *c){ int wid = get_group_id(0)*WG_SIZE; int lid = get_local_id(0); local float16 a_local[WG_SIZE], b_local[WG_SIZE], c_local[WG_SIZE]; // Copying on event_t copyon[2]; In [30]: %timeit run_ocl_kernel(nvidia_queue, nvidia_program.sum16_local, (N//16,), input _tuples, output_tuples) 1.97 ms ± 51.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 64/74 But, before we think that
we're too clever: But, before we think that we're too clever: What about good old numpy ? In [26]: %timeit (a+b) 668 µs ± 157 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 65/74 Increasing the computational work Increasing
the computational work Now we make the kernel: Make = c⃗ ( + ) a⃗ b ⃗ x kernel void sum16_pow(global float16 *a, global float16 *b, global float16 *c){ int gid = get_global_id(0); c[gid] = pow(a[gid] + b[gid], POW); } x = 1000 In [28]: %timeit run_ocl_kernel(nvidia_queue, nvidia_program.sum16_pow, (N//16,), input_t uples, output_tuples) 3.26 ms ± 538 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 66/74 In [29]: %timeit (a+b)**power Ca(t)veats
Ca(t)veats

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 71/74 Thank you! Thank you!

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 72/74 ?

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 73/74 Recommended Reading List Recommended Reading
List

10/13/2018 PyConZA_OpenCL_Talk slides ﬁle:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf#/ 74/74 Gene Amdahl, Validity of the
Single Processor Approach to Achieving Large-Scale Computing Capabilities Page and Luk, Compiling Occam into eld-programmable gate arrays Herb Sutter, The Free Lunch is Over Asanovic et al., The Landscape of Parallel Computing Research: A View from Berkeley Tsugio Makimoto, The Hot Decade of Field Programmable Technologies Lee et al., Debunking the 100X GPU vs CPU myth Che et al., Rodinia: A Benchmark Suite for Heterogeneous Computing Thomas et al., Hardware architectures for Monte-Carlo based nancial simulations Mike Giles, Some (strong) opinions on HPC and the use of GPUs

Parallel Programming with (Py)OpenCL for Fun an...

Parallel Programming with (Py)OpenCL for Fun and Profit by Gordon Inggs

More Decks by Pycon ZA

Other Decks in Programming

Featured

Transcript