Slide 1

10/13/2018 PyConZA_OpenCL_Talk slides — file:///home/neil/Trees/ctpug/Pycon_organisers/PyConZA_2018_Slides/Reveal.js/PyConZA_Inggs_Slides/PyConZA_OpenCL_Talk.slides.html?print-pdf

Parallel Programming with (Py)OpenCL for Fun and Profit

Gordon Inggs
Github: Gordonei
Day Job(s): Data Science, City of Cape Town; Application Acceleration Consultant, My Lounge/Back Garden*
*weather permitting

Slide 2

Talk Outline

- Why PyOpenCL (~5 mins)
- How to fix with PyOpenCL (~15 mins)
  - Program CPUs, GPUs and more
  - Manipulate Memory
  - Do things concurrently in parallel

Slide 3

Github Repo: Gordonei/pyconza-october-2018 (https://github.com/Gordonei/pyconza-october-2018)

Why PyOpenCL? (≈ Why Heterogeneous Compute?)

Slide 4


Slide 5

Bad News: The Free Lunch is over

Slide 6

[Graph: The Free Lunch is Over] Source: Herb Sutter, The Free Lunch is Over

Slide 7

Good News: There are alternatives

Slide 8

Performance-wise

Slide 9


Slide 10

Source: Karl Rupp, ViennaCL

Energy-wise

Slide 11


Slide 12

Source: Karl Rupp, ViennaCL

Slide 13

Performance vs Energy

Slide 14

Sounds great, but...

Slide 15

The Challenges of Heterogeneous Computing

1. The Orientation Problem - turning things on!
2. The IO Problem - moving data around!
3. The Conceptual Problem - what am I doing?!?

Programming Fancy Devices

Slide 16

In this section:

- What an OpenCL program is
- How to compile an OpenCL program
- How to run an OpenCL program

Slide 17

Khronos Group Promoters

Slide 18

Selecting platforms &/ devices

Looking at what is available, then selecting the NVIDIA one.

In [1]:
import pyopencl
import numpy
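The selection shown on the next slide amounts to filtering `pyopencl.get_platforms()` by name. A dependency-free sketch of that logic, using a hypothetical `select_platform` helper and stand-in platform objects (so it runs without an OpenCL driver installed):

```python
def select_platform(platforms, name_fragment):
    """Return the first platform whose name contains name_fragment.

    platforms: any objects with a .name attribute, e.g. the result of
    pyopencl.get_platforms(); stand-ins are used below for illustration.
    """
    matches = [p for p in platforms if name_fragment in p.name]
    if not matches:
        raise RuntimeError("No OpenCL platform matching {!r}".format(name_fragment))
    return matches[0]


class FakePlatform:
    # stand-in for pyopencl.Platform, for illustration only
    def __init__(self, name):
        self.name = name


available = [FakePlatform("Intel(R) OpenCL"),
             FakePlatform("Portable Computing Language"),
             FakePlatform("NVIDIA CUDA")]
print(select_platform(available, "NVIDIA").name)  # NVIDIA CUDA
```

Matching on a fragment rather than the exact string makes the selection a little more robust across driver versions, at the cost of possibly matching more than one platform.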

Slide 19

In [2]:
ocl_platforms = (platform.name
                 for platform in pyopencl.get_platforms())
print("\n".join(ocl_platforms))

Intel(R) OpenCL
Portable Computing Language
NVIDIA CUDA

In [3]:
nvidia_platform = [platform
                   for platform in pyopencl.get_platforms()
                   if platform.name == "NVIDIA CUDA"][0]
nvidia_devices = nvidia_platform.get_devices()

OpenCL Programming Abstractions

Slide 20

Building Programs

Using the PyOpenCL bindings to create an OpenCL context, then declaring OpenCL kernel code, and compiling it.

Slide 21

The code is for a simple vector sum, i.e. c⃗ = a⃗ + b⃗

In [4]:
nvidia_context = pyopencl.Context(devices=nvidia_devices)

program_source = """
kernel void sum(global float *a,
                global float *b,
                global float *c){
  int gid = get_global_id(0);
  c[gid] = a[gid] + b[gid];
}
"""
nvidia_program_source = pyopencl.Program(nvidia_context, program_source)
nvidia_program = nvidia_program_source.build()

In [5]:
program_kernel_names = nvidia_program.get_info(pyopencl.program_info.KERNEL_NAMES)
print("Kernel Names: {}".format(program_kernel_names))

Kernel Names: sum
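What the `sum` kernel computes can be mirrored on the host; `sum_host` below is a hypothetical reference, not part of the talk's code, showing the work each OpenCL work-item does for one `gid`:

```python
def sum_host(a, b):
    # each OpenCL work-item computes c[gid] = a[gid] + b[gid];
    # on the host that is one element of this elementwise sum
    return [x + y for x, y in zip(a, b)]


print(sum_host([1.0, 2.0], [3.0, 4.0]))  # [4.0, 6.0]
```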

Slide 22

OpenCL Runtime Abstractions

Slide 23

Running Programs

Actually running the code using the Python bindings. Some memory and data need to be set up first, but I'll elaborate on this later.

Slide 24

In [6]:
def run_ocl_kernel(queue, kernel, global_size,
                   input_tuples, output_tuples,
                   local_size = (32,)):
    # copying data onto the device
    for (array, buffer) in input_tuples:
        pyopencl.enqueue_copy(queue, src=array, dest=buffer)

    # running program on the device
    kernel_arguments = [buffer for (_, buffer) in input_tuples]
    kernel_arguments += [buffer for (_, buffer) in output_tuples]
    kernel(queue, global_size, local_size, *kernel_arguments)

    # copying data off the device
    for (array, buffer) in output_tuples:
        pyopencl.enqueue_copy(queue, src=buffer, dest=array)

    # waiting for everything to finish
    queue.finish()

Trust, but verify:

Slide 25

In [7]:
def check_sum_results(a, b, c):
    c_ref = a + b
    err = numpy.abs(c - c_ref)
    if err.sum() > 0.0:
        print("result does not match")
    else:
        print("result matches!")

In [8]:
# Synthetic data setup
N = int(2**20)
a = numpy.random.rand(N).astype(numpy.float32)
b = numpy.random.rand(N).astype(numpy.float32)
c = numpy.empty_like(a)

# Device Memory setup
a_nvidia_buffer = pyopencl.Buffer(nvidia_context,
                                  flags=pyopencl.mem_flags.READ_ONLY,
                                  size=a.nbytes)
b_nvidia_buffer = pyopencl.Buffer(nvidia_context,

Slide 26

                                  flags=pyopencl.mem_flags.READ_ONLY,
                                  size=b.nbytes)
c_nvidia_buffer = pyopencl.Buffer(nvidia_context,
                                  flags=pyopencl.mem_flags.WRITE_ONLY,
                                  size=c.nbytes)

In [9]:
nvidia_queue = pyopencl.CommandQueue(nvidia_context)

input_tuples = ((a, a_nvidia_buffer),
                (b, b_nvidia_buffer), )
output_tuples = ((c, c_nvidia_buffer),)

run_ocl_kernel(nvidia_queue, nvidia_program.sum, (N,), input_tuples, output_tuples)

In [10]:
check_sum_results(a, b, c)

result matches!

In [11]:
%timeit run_ocl_kernel(nvidia_queue, nvidia_program.sum, (N,), input_tuples, output_tuples)

2.65 ms ± 507 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

How to Manipulate Memory

Slide 27

In this section:

- Moving data between the host and OpenCL device
- Specifying storage on the device being used

Slide 28


Slide 29

OpenCL Memory Abstractions

Slide 30

Using global memory

Slide 31

In [12]:
def create_input_memory(context, input_arrays):
    return [(array, pyopencl.Buffer(context,
                                    flags=pyopencl.mem_flags.READ_ONLY,
                                    size=array.nbytes))
            for array in input_arrays]

In [13]:
def create_output_memory(context, output_arrays):
    return [(array, pyopencl.Buffer(context,
                                    flags=pyopencl.mem_flags.WRITE_ONLY,
                                    size=array.nbytes))
            for array in output_arrays]

Slide 32

In [15]:
a = numpy.random.rand(N).astype(numpy.float32)
b = numpy.random.rand(N).astype(numpy.float32)
c = numpy.empty_like(a)

# Device Memory setup
input_tuples = create_input_memory(nvidia_context, (a, b,))
output_tuples = create_output_memory(nvidia_context, (c,),)

run_ocl_kernel(nvidia_queue, nvidia_program.sum, (N,), input_tuples, output_tuples)
check_sum_results(a, b, c)

result matches!
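Exact equality happens to hold in `check_sum_results` because host and device perform the same single float32 addition, but device results should generally be compared with a tolerance. A dependency-free sketch, with a hypothetical `within_tolerance` helper:

```python
def within_tolerance(x, y, tol=1e-6):
    # elementwise |x - y| <= tol, for same-length sequences of floats
    return all(abs(xi - yi) <= tol for xi, yi in zip(x, y))


print(within_tolerance([0.1 + 0.2], [0.3]))  # True, despite rounding error
```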

Slide 33

Batching

kernel void sum_batched(global float *a,
                        global float *b,
                        global float *c){
  int gid = get_global_id(0)*BATCH_SIZE;
  for(int i=0; i<BATCH_SIZE; i++)
    c[gid + i] = a[gid + i] + b[gid + i];
}
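The batched kernel's index arithmetic, emulated in plain Python with a hypothetical `sum_batched_host` helper, shows each work-item handling a contiguous BATCH_SIZE slice:

```python
def sum_batched_host(a, b, batch_size):
    # one "work-item" per batch: work-item gid covers
    # elements [gid*batch_size, (gid+1)*batch_size)
    n = len(a)
    assert n % batch_size == 0, "global size must divide evenly into batches"
    c = [0.0] * n
    for gid in range(n // batch_size):
        base = gid * batch_size
        for i in range(batch_size):
            c[base + i] = a[base + i] + b[base + i]
    return c


print(sum_batched_host([1, 2, 3, 4], [10, 20, 30, 40], 2))  # [11, 22, 33, 44]
```

This is why the timing call on the next slide launches only N//batch_size work-items: the total work is unchanged, just grouped differently.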

Slide 34

In [18]:
%timeit run_ocl_kernel(nvidia_queue, nvidia_program.sum_batched, (N//batch_size,), input_tuples, output_tuples)

3.62 ms ± 543 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using local and private memory

Slide 35

Batched + Private Memory

Slide 36

kernel void sum_batched_private(global float *a,
                                global float *b,
                                global float *c){
  int gid = get_global_id(0)*BATCH_SIZE;
  float a_tmp[BATCH_SIZE], b_tmp[BATCH_SIZE], c_tmp[BATCH_SIZE];

  for(int i=0; i<BATCH_SIZE; i++){
    a_tmp[i] = a[gid + i];
    b_tmp[i] = b[gid + i];
  }

  for(int i=0; i<BATCH_SIZE; i++)
    c_tmp[i] = a_tmp[i] + b_tmp[i];

  for(int i=0; i<BATCH_SIZE; i++)
    c[gid + i] = c_tmp[i];
}

Slide 37

In this section:

- Inspecting the characteristics of the available devices
- Using the work-items and work-groups abstraction to choose between task and data parallelism
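The work-item/work-group split can be sketched numerically; `work_group_count` is a hypothetical helper reflecting the OpenCL 1.x rule that the global size must be a whole multiple of the local (work-group) size:

```python
def work_group_count(global_size, local_size):
    # number of work-groups the runtime launches along one dimension
    if global_size % local_size != 0:
        raise ValueError("global size must be a multiple of local size")
    return global_size // local_size


# e.g. the talk's N = 2**20 work-items with the default local size of 32
print(work_group_count(2**20, 32))  # 32768
```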

Slide 38


Slide 39

Device characteristics

Using the host code API to inspect the characteristics of the device.

Slide 40

In [21]:
intel_platform = [platform
                  for platform in pyopencl.get_platforms()
                  if platform.name == "Intel(R) OpenCL"][0]

Slide 41

In [22]:
for device in intel_platform.get_devices():
    # print out all of the device name properties, except the device type
    for property_name in sorted(name_properties.keys() - {"Device Type"}):
        property_string_args = (property_name,
                                device.get_info(name_properties[property_name]))
        print("{}: {}".format(*property_string_args))

    # look up the device type
    print("Device Types: {}".format(device_types[device.get_info(name_properties["Device Type"])]))

    # print out all of the processing properties
    for property_name in sorted(processing_properties.keys()):

Slide 42

        property_string_args = (property_name,
                                device.get_info(processing_properties[property_name]))
        print("{}: {}".format(*property_string_args))

    # print out all of the memory properties
    for property_name in sorted(memory_properties.keys()):
        property_string_args = (property_name,
                                device.get_info(memory_properties[property_name]))
        print("{}: {}".format(*property_string_args))

    print("\n")

Device Name: Intel(R) HD Graphics
Device Platform:
Device Types: GPU
Available Compute Units: 23
Clockrate: 1100
Available Constant Memory: 4294959103
Available Global Memory: 13321633792
Available Local Memory: 65536

Device Name: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Device Platform:
Device Types: CPU
Available Compute Units: 8
Clockrate: 2800
Available Constant Memory: 131072
Available Global Memory: 16662528000
Available Local Memory: 32768

Slide 43


Slide 44

In [23]:
!clinfo

Number of platforms                             3
  Platform Name                                 Intel(R) OpenCL
  Platform Vendor                               Intel(R) Corporation
  Platform Version                              OpenCL 2.0
  Platform Profile                              FULL_PROFILE
  Platform Extensions                           cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_icd cl_khr_image2d_from_buffer cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_spir
  Platform Extensions function suffix           INTEL

  Platform Name                                 Portable Computing Language
  Platform Vendor                               The pocl project
  Platform Version                              OpenCL 1.2 pocl 1.1 None+Asserts, LLVM 6.0.0, SPIR, SLEEF, DISTRO, POCL_DEBUG
  Platform Profile                              FULL_PROFILE
  Platform Extensions                           cl_khr_icd
  Platform Extensions function suffix           POCL

  Platform Name                                 NVIDIA CUDA
  Platform Vendor                               NVIDIA Corporation
  Platform Version                               OpenCL 1.2 CUDA 9.1.84
  Platform Profile                              FULL_PROFILE
  Platform Extensions                           cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer
  Platform Extensions function suffix           NV

Slide 45

  Platform Name                                 Intel(R) OpenCL
Number of devices                               2
  Device Name                                   Intel(R) HD Graphics
  Device Vendor                                 Intel(R) Corporation
  Device Vendor ID                              0x8086
  Device Version                                OpenCL 2.0
  Driver Version                                r5.0.63503
  Device OpenCL C Version                       OpenCL C 2.0
  Device Type                                   GPU
  Device Profile                                FULL_PROFILE
  Device Available                              Yes
  Compiler Available                            Yes
  Linker Available                              Yes
  Max compute units                             23
  Max clock frequency                           1100MHz
  Device Partition                              (core)
    Max number of sub-devices                   0
    Supported partition types                   by (0x558D00000000)
    Supported affinity domains                  0x558D00000000
  Max work item dimensions                      3
  Max work item sizes                           256x256x256
  Max work group size                           256
  Preferred work group size multiple            32
  Sub-group sizes (Intel)                       8, 16, 32
  Preferred / native vector sizes
    char                                        16 / 16
    short                                       8 / 8

Slide 46

    int                                         4 / 4
    long                                        1 / 1
    half                                        8 / 8 (cl_khr_fp16)
    float                                       1 / 1
    double                                      1 / 1 (cl_khr_fp64)
  Half-precision Floating-point support         (cl_khr_fp16)
    Denormals                                   Yes
    Infinity and NANs                           Yes
    Round to nearest                            Yes
    Round to zero                               Yes
    Round to infinity                           Yes
    IEEE754-2008 fused multiply-add             Yes
    Support is emulated in software             No
  Single-precision Floating-point support       (core)
    Denormals                                   Yes
    Infinity and NANs                           Yes
    Round to nearest                            Yes
    Round to zero                               Yes
    Round to infinity                           Yes
    IEEE754-2008 fused multiply-add             Yes
    Support is emulated in software             No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support       (cl_khr_fp64)
    Denormals                                   Yes
    Infinity and NANs                           Yes
    Round to nearest                            Yes
    Round to zero                               Yes
    Round to infinity                           Yes
    IEEE754-2008 fused multiply-add             Yes

Slide 47

    Support is emulated in software             No
  Address bits                                  64, Little-Endian
  Global memory size                            13321633792 (12.41GiB)
  Error Correction support                      No
  Max memory allocation                         4294959103 (4GiB)
  Unified memory for Host and Device            Yes
  Shared Virtual Memory (SVM) capabilities      (core)
    Coarse-grained buffer sharing               Yes
    Fine-grained buffer sharing                 No
    Fine-grained system sharing                 No
    Atomics                                     No
  Minimum alignment for any data type           128 bytes
  Alignment of base address                     1024 bits (128 bytes)
  Preferred alignment for atomics
    SVM                                         64 bytes
    Global                                      64 bytes
    Local                                       64 bytes
  Max size for global variable                  65536 (64KiB)
  Preferred total size of global vars           4294959103 (4GiB)
  Global Memory cache type                      Read/Write
  Global Memory cache size                      524288 (512KiB)
  Global Memory cache line size                 64 bytes
  Image support                                 Yes
    Max number of samplers per kernel           16
    Max size for 1D images from buffer          268434943 pixels
    Max 1D or 2D image array size               2048 images
    Base address alignment for 2D image buffers  4 bytes
    Pitch alignment for 2D image buffers        4 pixels
    Max 2D image size                           16384x16384 pixels
    Max planar YUV image size                   16384x16380 pixels
    Max 3D image size                           16384x16384x2048 pixels

Slide 48

    Max number of read image args               128
    Max number of write image args              128
    Max number of read/write image args         128
  Max number of pipe args                       16
  Max active pipe reservations                  1
  Max pipe packet size                          1024
  Local memory type                             Local
  Local memory size                             65536 (64KiB)
  Max number of constant args                   8
  Max constant buffer size                      4294959103 (4GiB)
  Max size of kernel argument                   1024
  Queue properties (on host)
    Out-of-order execution                      Yes
    Profiling                                   Yes
  Queue properties (on device)
    Out-of-order execution                      Yes
    Profiling                                   Yes
    Preferred size                              131072 (128KiB)
    Max size                                    67108864 (64MiB)
  Max queues on device                          1
  Max events on device                          1024
  Prefer user sync for interop                  Yes
  Profiling timer resolution                    83ns
  Execution capabilities
    Run OpenCL kernels                          Yes
    Run native kernels                          No
    SPIR versions                               1.2
  printf() buffer size                          4194304 (4MiB)
  Built-in kernels                              block_motion_estimate_intel;block_advanced_motion_estimate_check_intel;block_advanced_motion_estimate_bidirectional_check_intel

Slide 49

  Motion Estimation accelerator version (Intel)  2
    Device-side AVC Motion Estimation version   1
      Supports texture sampler use              Yes
      Supports preemption                       No
  Device Extensions                             cl_intel_accelerator cl_intel_advanced_motion_estimation cl_intel_device_side_avc_motion_estimation cl_intel_driver_diagnostics cl_intel_media_block_io cl_intel_motion_estimation cl_intel_planar_yuv cl_intel_packed_yuv cl_intel_required_subgroup_size cl_intel_subgroups cl_intel_subgroups_short cl_intel_va_api_media_sharing cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_fp16 cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_icd cl_khr_image2d_from_buffer cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_khr_spir cl_khr_subgroups

  Device Name                                   Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  Device Vendor                                 Intel(R) Corporation
  Device Vendor ID                              0x8086
  Device Version                                OpenCL 2.0 (Build 475)
  Driver Version                                1.2.0.475
  Device OpenCL C Version                       OpenCL C 2.0
  Device Type                                   CPU
  Device Profile                                FULL_PROFILE
  Device Available                              Yes
  Compiler Available                            Yes
  Linker Available                              Yes
  Max compute units                             8
  Max clock frequency                           2800MHz
  Device Partition                              (core)
    Max number of sub-devices                   8

Slide 50

    Supported partition types                   by counts, equally, by names (Intel)
  Max work item dimensions                      3
  Max work item sizes                           8192x8192x8192
  Max work group size                           8192
  Preferred work group size multiple            128
  Preferred / native vector sizes
    char                                        1 / 32
    short                                       1 / 16
    int                                         1 / 8
    long                                        1 / 4
    half                                        0 / 0 (n/a)
    float                                       1 / 8
    double                                      1 / 4 (cl_khr_fp64)
  Half-precision Floating-point support         (n/a)
  Single-precision Floating-point support       (core)
    Denormals                                   Yes
    Infinity and NANs                           Yes
    Round to nearest                            Yes
    Round to zero                               No
    Round to infinity                           No
    IEEE754-2008 fused multiply-add             No
    Support is emulated in software             No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support       (cl_khr_fp64)
    Denormals                                   Yes
    Infinity and NANs                           Yes
    Round to nearest                            Yes
    Round to zero                               Yes
    Round to infinity                           Yes

Slide 51

    IEEE754-2008 fused multiply-add             Yes
    Support is emulated in software             No
  Address bits                                  64, Little-Endian
  Global memory size                            16662528000 (15.52GiB)
  Error Correction support                      No
  Max memory allocation                         4165632000 (3.88GiB)
  Unified memory for Host and Device            Yes
  Shared Virtual Memory (SVM) capabilities      (core)
    Coarse-grained buffer sharing               Yes
    Fine-grained buffer sharing                 No
    Fine-grained system sharing                 No
    Atomics                                     No
  Minimum alignment for any data type           128 bytes
  Alignment of base address                     1024 bits (128 bytes)
  Preferred alignment for atomics
    SVM                                         64 bytes
    Global                                      64 bytes
    Local                                       0 bytes
  Max size for global variable                  65536 (64KiB)
  Preferred total size of global vars           65536 (64KiB)
  Global Memory cache type                      Read/Write
  Global Memory cache size                      262144 (256KiB)
  Global Memory cache line size                 64 bytes
  Image support                                 Yes
    Max number of samplers per kernel           480
    Max size for 1D images from buffer          260352000 pixels
    Max 1D or 2D image array size               2048 images
    Base address alignment for 2D image buffers  64 bytes
    Pitch alignment for 2D image buffers        64 pixels
    Max 2D image size                           16384x16384 pixels
    Max 3D image size                           2048x2048x2048 pixels

Slide 52

    Max number of read image args               480
    Max number of write image args              480
    Max number of read/write image args         480
  Max number of pipe args                       16
  Max active pipe reservations                  32767
  Max pipe packet size                          1024
  Local memory type                             Global
  Local memory size                             32768 (32KiB)
  Max number of constant args                   480
  Max constant buffer size                      131072 (128KiB)
  Max size of kernel argument                   3840 (3.75KiB)
  Queue properties (on host)
    Out-of-order execution                      Yes
    Profiling                                   Yes
    Local thread execution (Intel)              Yes
  Queue properties (on device)
    Out-of-order execution                      Yes
    Profiling                                   Yes
    Preferred size                              4294967295 (4GiB)
    Max size                                    4294967295 (4GiB)
  Max queues on device                          4294967295
  Max events on device                          4294967295
  Prefer user sync for interop                  No
  Profiling timer resolution                    1ns
  Execution capabilities
    Run OpenCL kernels                          Yes
    Run native kernels                          Yes
    SPIR versions                               1.2
  printf() buffer size                          1048576 (1024KiB)
  Built-in kernels
  Device Extensions                             cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer

Slide 53

  Platform Name                                 Portable Computing Language
Number of devices                               1
  Device Name                                   pthread-Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  Device Vendor                                 GenuineIntel
  Device Vendor ID                              0x8086
  Device Version                                OpenCL 1.2 pocl HSTR: pthread-x86_64-pc-linux-gnu-skylake
  Driver Version                                1.1
  Device OpenCL C Version                       OpenCL C 1.2 pocl
  Device Type                                   CPU
  Device Profile                                FULL_PROFILE
  Device Available                              Yes
  Compiler Available                            Yes
  Linker Available                              Yes
  Max compute units                             8
  Max clock frequency                           3800MHz
  Device Partition                              (core)
    Max number of sub-devices                   8
    Supported partition types                   equally, by counts
  Max work item dimensions                      3
  Max work item sizes                           4096x4096x4096
  Max work group size                           4096
  Preferred work group size multiple            8
  Preferred / native vector sizes
    char                                        16 / 16

Slide 54

    short                                       16 / 16
    int                                         8 / 8
    long                                        4 / 4
    half                                        0 / 0 (n/a)
    float                                       8 / 8
    double                                      4 / 4 (cl_khr_fp64)
  Half-precision Floating-point support         (n/a)
  Single-precision Floating-point support       (core)
    Denormals                                   Yes
    Infinity and NANs                           Yes
    Round to nearest                            Yes
    Round to zero                               Yes
    Round to infinity                           Yes
    IEEE754-2008 fused multiply-add             Yes
    Support is emulated in software             No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support       (cl_khr_fp64)
    Denormals                                   Yes
    Infinity and NANs                           Yes
    Round to nearest                            Yes
    Round to zero                               Yes
    Round to infinity                           Yes
    IEEE754-2008 fused multiply-add             Yes
    Support is emulated in software             No
  Address bits                                  64, Little-Endian
  Global memory size                            14515044352 (13.52GiB)
  Error Correction support                      No
  Max memory allocation                         4294967296 (4GiB)
  Unified memory for Host and Device            Yes
  Minimum alignment for any data type           128 bytes

Slide 55

  Alignment of base address                     1024 bits (128 bytes)
  Global Memory cache type                      Read/Write
  Global Memory cache size                      6291456 (6MiB)
  Global Memory cache line size                 64 bytes
  Image support                                 Yes
    Max number of samplers per kernel           16
    Max size for 1D images from buffer          268435456 pixels
    Max 1D or 2D image array size               2048 images
    Max 2D image size                           16384x16384 pixels
    Max 3D image size                           2048x2048x2048 pixels
    Max number of read image args               128
    Max number of write image args              128
  Local memory type                             Global
  Local memory size                             4194304 (4MiB)
  Max number of constant args                   8
  Max constant buffer size                      4194304 (4MiB)
  Max size of kernel argument                   1024
  Queue properties
    Out-of-order execution                      No
    Profiling                                   Yes
  Prefer user sync for interop                  Yes
  Profiling timer resolution                    1ns
  Execution capabilities
    Run OpenCL kernels                          Yes
    Run native kernels                          Yes
    SPIR versions                               1.2
  printf() buffer size                          1048576 (1024KiB)
  Built-in kernels
  Device Extensions                             cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_spir cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp64

Slide 56

  Platform Name                                 NVIDIA CUDA
Number of devices                               1
  Device Name                                   GeForce GTX 1050
  Device Vendor                                 NVIDIA Corporation
  Device Vendor ID                              0x10de
  Device Version                                OpenCL 1.2 CUDA
  Driver Version                                390.48
  Device OpenCL C Version                       OpenCL C 1.2
  Device Type                                   GPU
  Device Topology (NV)                          PCI-E, 01:00.0
  Device Profile                                FULL_PROFILE
  Device Available                              Yes
  Compiler Available                            Yes
  Linker Available                              Yes
  Max compute units                             5
  Max clock frequency                           1493MHz
  Compute Capability (NV)                       6.1
  Device Partition                              (core)
    Max number of sub-devices                   1
    Supported partition types                   None
  Max work item dimensions                      3
  Max work item sizes                           1024x1024x64
  Max work group size                           1024
  Preferred work group size multiple            32
  Warp size (NV)                                32
  Preferred / native vector sizes
    char                                        1 / 1
    short                                       1 / 1

Slide 57

Slide 57 text

    int 1 / 1
    long 1 / 1
    half 0 / 0 (n/a)
    float 1 / 1
    double 1 / 1 (cl_khr_fp64)
  Half-precision Floating-point support (n/a)
  Single-precision Floating-point support (core)
    Denormals Yes
    Infinity and NANs Yes
    Round to nearest Yes
    Round to zero Yes
    Round to infinity Yes
    IEEE754-2008 fused multiply-add Yes
    Support is emulated in software No
    Correctly-rounded divide and sqrt operations Yes
  Double-precision Floating-point support (cl_khr_fp64)
    Denormals Yes
    Infinity and NANs Yes
    Round to nearest Yes
    Round to zero Yes
    Round to infinity Yes
    IEEE754-2008 fused multiply-add Yes
    Support is emulated in software No
  Address bits 64, Little-Endian
  Global memory size 4238737408 (3.948GiB)
  Error Correction support No
  Max memory allocation 1059684352 (1011MiB)
  Unified memory for Host and Device No
  Integrated memory (NV) No
  Minimum alignment for any data type 128 bytes

Slide 58

Slide 58 text

  Alignment of base address 4096 bits (512 bytes)
  Global Memory cache type Read/Write
  Global Memory cache size 81920 (80KiB)
  Global Memory cache line size 128 bytes
  Image support Yes
  Max number of samplers per kernel 32
  Max size for 1D images from buffer 134217728 pixels
  Max 1D or 2D image array size 2048 images
  Max 2D image size 16384x32768 pixels
  Max 3D image size 16384x16384x16384 pixels
  Max number of read image args 256
  Max number of write image args 16
  Local memory type Local
  Local memory size 49152 (48KiB)
  Registers per block (NV) 65536
  Max number of constant args 9
  Max constant buffer size 65536 (64KiB)
  Max size of kernel argument 4352 (4.25KiB)
  Queue properties
    Out-of-order execution Yes
    Profiling Yes
  Prefer user sync for interop No
  Profiling timer resolution 1000ns
  Execution capabilities
    Run OpenCL kernels Yes
    Run native kernels No
  Kernel execution timeout (NV) Yes
  Concurrent copy and kernel execution (NV) Yes
    Number of async copy engines 2
  printf() buffer size 1048576 (1024KiB)
  Built-in kernels

Slide 59

Slide 59 text

  Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) No platform
  clCreateContext(NULL, ...) [default] No platform
  clCreateContext(NULL, ...) [other] Success [INTEL]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) No platform

Slide 60

Slide 60 text

Workitems vs Workgroups

Exploring different types of parallelism with our trusty vector sum.

OpenCL Task vs Data Parallelism Abstractions
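The two levels of indexing relate in a simple way: a work-item's global id is its workgroup's offset plus its position within the group (ignoring the global work offset). The helper below is an illustrative stand-in, not code from the talk; it mirrors what OpenCL's built-in `get_global_id` returns for one dimension:

```python
def get_global_id(group_id, local_id, local_size):
    """Illustrative version of OpenCL's index relation for one dimension:
    global id = workgroup offset + local id within the workgroup."""
    return group_id * local_size + local_id

# Work-item 3 of workgroup 2, with 64 work-items per group,
# is global work-item 2*64 + 3 = 131.
assert get_global_id(group_id=2, local_id=3, local_size=64) == 131
```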

Slide 61

Slide 61 text

Vectorising

kernel void sum16(global float16 *a, global float16 *b, global float16 *c){
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}
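What each work-item of `sum16` does can be sketched in plain Python — `sum16_emulated` is an illustrative stand-in, not code from the talk. Each of the N//16 work-items adds one `float16` vector, i.e. 16 consecutive floats:

```python
def sum16_emulated(a, b):
    """Illustrative emulation of the sum16 kernel: work-item gid adds
    the 16 consecutive floats that make up its float16 vector."""
    n = len(a)
    c = [0.0] * n
    for gid in range(n // 16):       # one iteration per work-item
        for lane in range(16):       # the 16 lanes of one float16 vector
            i = 16 * gid + lane
            c[i] = a[i] + b[i]
    return c

# 32 floats -> only two "work-items" are needed
a = [float(i) for i in range(32)]
b = [2.0 * i for i in range(32)]
assert sum16_emulated(a, b) == [3.0 * i for i in range(32)]
```

On a real device the 16 lanes are not a loop: they map onto the hardware's vector units, which is where the speed-up comes from.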

Slide 62

Slide 62 text

In [24]: %timeit run_ocl_kernel(nvidia_queue, nvidia_program.sum16, (N//16,), input_tuples, output_tuples)
2.36 ms ± 528 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Vectorised + Local Memory

Slide 63

Slide 63 text

kernel void sum16_local(global float16 *a, global float16 *b, global float16 *c){
    int wid = get_group_id(0)*WG_SIZE;
    int lid = get_local_id(0);
    local float16 a_local[WG_SIZE], b_local[WG_SIZE], c_local[WG_SIZE];

    // Copying on
    event_t copyon[2];
    copyon[0] = async_work_group_copy(a_local, a + wid, WG_SIZE, 0);
    copyon[1] = async_work_group_copy(b_local, b + wid, WG_SIZE, 0);
    wait_group_events(2, copyon);

    c_local[lid] = a_local[lid] + b_local[lid];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Copying off
    event_t copyoff = async_work_group_copy(c + wid, c_local, WG_SIZE, 0);
    wait_group_events(1, &copyoff);
}

In [30]: %timeit run_ocl_kernel(nvidia_queue, nvidia_program.sum16_local, (N//16,), input_tuples, output_tuples)
1.97 ms ± 51.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
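The staging pattern the kernel uses — copy a workgroup's tile into fast local memory, compute there, copy the result back out — can be sketched in plain Python. `WG_SIZE` and `sum16_local_emulated` here are illustrative stand-ins, not code from the talk:

```python
WG_SIZE = 4  # assumed workgroup size; in the kernel it is a compile-time constant

def sum16_local_emulated(a, b):
    """Illustrative emulation of sum16_local: each workgroup stages its
    tile of a and b in "local memory", sums there, then copies back."""
    n_vectors = len(a) // 16           # number of float16 vectors
    c = [0.0] * len(a)
    for group in range(n_vectors // WG_SIZE):
        lo = group * WG_SIZE * 16      # this workgroup's tile of floats
        hi = lo + WG_SIZE * 16
        a_local = a[lo:hi]             # async_work_group_copy "on"
        b_local = b[lo:hi]
        # every work-item adds its float16 vector (flattened here)
        c_local = [x + y for x, y in zip(a_local, b_local)]
        c[lo:hi] = c_local             # async_work_group_copy "off"
    return c

# 128 floats = 8 float16 vectors = 2 workgroups of WG_SIZE 4
assert sum16_local_emulated([1.0] * 128, [2.0] * 128) == [3.0] * 128
```

The payoff on a GPU is that each global value is read once into local memory rather than repeatedly from slow global memory; for a single add the benefit is modest, as the timing above shows.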

Slide 64

Slide 64 text

But, before we think that we're too clever:

What about good old numpy?

In [26]: %timeit (a+b)
668 µs ± 157 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
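Outside a Jupyter notebook, the same kind of measurement can be made with the stdlib `timeit` module; the statement and sizes below are illustrative stand-ins for the notebook's `%timeit` call:

```python
import timeit

# Stand-in workload: timing any expression works the same way as %timeit.
setup = "a = list(range(100000)); b = list(range(100000))"
stmt = "[x + y for x, y in zip(a, b)]"

# Best of 5 runs of 10 loops each, reported per loop like %timeit does.
best = min(timeit.repeat(stmt, setup=setup, repeat=5, number=10)) / 10
print(f"{best * 1e6:.1f} µs per loop")
```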

Slide 65

Slide 65 text

Increasing the computational work

Now we make the kernel compute c⃗ = (a⃗ + b⃗)^x:

kernel void sum16_pow(global float16 *a, global float16 *b, global float16 *c){
    int gid = get_global_id(0);
    c[gid] = pow(a[gid] + b[gid], POW);
}

x = 1000

In [28]: %timeit run_ocl_kernel(nvidia_queue, nvidia_program.sum16_pow, (N//16,), input_tuples, output_tuples)
3.26 ms ± 538 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
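Per element, the new kernel simply raises the sum to the power x, so there is far more arithmetic per byte moved — which is what shifts the balance towards the device. A plain-Python sketch (`sum_pow_emulated` is illustrative, with `POW` standing in for the kernel's compile-time constant):

```python
POW = 1000  # the slide's x; in the kernel, POW is a compile-time constant

def sum_pow_emulated(a, b, power=POW):
    """Illustrative elementwise version of sum16_pow:
    each output element is (a + b) ** power."""
    return [(x + y) ** power for x, y in zip(a, b)]

# One float16 vector's worth of lanes: (0.5 + 0.5) ** 1000 == 1.0 exactly.
assert sum_pow_emulated([0.5] * 16, [0.5] * 16) == [1.0] * 16
```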

Slide 66

Slide 66 text

In [29]: %timeit (a+b)**power

Ca(t)veats

Slide 67

Slide 67 text


Slide 68

Slide 68 text


Slide 69

Slide 69 text


Slide 70

Slide 70 text


Slide 71

Slide 71 text

Thank you!

Slide 72

Slide 72 text

?

Slide 73

Slide 73 text

Recommended Reading List

Slide 74

Slide 74 text

Gene Amdahl, Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities
Page and Luk, Compiling Occam into field-programmable gate arrays
Herb Sutter, The Free Lunch is Over
Asanovic et al., The Landscape of Parallel Computing Research: A View from Berkeley
Tsugio Makimoto, The Hot Decade of Field Programmable Technologies
Lee et al., Debunking the 100X GPU vs CPU myth
Che et al., Rodinia: A Benchmark Suite for Heterogeneous Computing
Thomas et al., Hardware architectures for Monte-Carlo based financial simulations
Mike Giles, Some (strong) opinions on HPC and the use of GPUs