
Parallel Programming with (Py)OpenCL for Fun and Profit by Gordon Inggs

PyCon ZA
October 12, 2018

Overview
It's never been easier to use all manner of interesting computing devices, such as multicore CPUs, GPUs and FPGAs, using OpenCL, an open heterogeneous computing standard supported by major hardware vendors: Intel, NVIDIA, AMD, ARM, etc. And it's never been easier to use OpenCL from Python via the excellent bindings, PyOpenCL.


Transcript

  1. Parallel Programming with (Py)OpenCL for Fun and Profit
     Gordon Inggs
     Github: Gordonei
     Day Job(s): Data Science, City of Cape Town; Application Acceleration Consultant, My Lounge/Back Garden* (*weather permitting)

  2. The Challenges of Heterogeneous Computing
     1. The Orientation Problem - turning things on!
     2. The IO Problem - moving data around!
     3. The Conceptual Problem - what am I doing?!?

     Programming Fancy Devices

  3. In [2]: ocl_platforms = (platform.name for platform in pyopencl.get_platforms())
             print("\n".join(ocl_platforms))
     In [3]: nvidia_platform = [platform for platform in pyopencl.get_platforms()
                                if platform.name == "NVIDIA CUDA"][0]

     Intel(R) OpenCL
     Portable Computing Language
     NVIDIA CUDA

     OpenCL Programming Abstractions

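     Note: the nvidia_devices list used in the next cell is not shown in the transcript; a minimal sketch of how it would typically be obtained from the nvidia_platform found above:

     nvidia_devices = nvidia_platform.get_devices()
     print("\n".join(device.name for device in nvidia_devices))
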
  4. The code is for a simple vector sum, i.e. c⃗ = a⃗ + b⃗

     In [4]: nvidia_context = pyopencl.Context(devices=nvidia_devices)
             program_source = """
             kernel void sum(global float *a, global float *b, global float *c){
                 int gid = get_global_id(0);
                 c[gid] = a[gid] + b[gid];
             }
             """
             nvidia_program_source = pyopencl.Program(nvidia_context, program_source)
             nvidia_program = nvidia_program_source.build()
     In [5]: program_kernel_names = nvidia_program.get_info(pyopencl.program_info.KERNEL_NAMES)
             print("Kernel Names: {}".format(program_kernel_names))
     Kernel Names: sum

  5. In [6]: def run_ocl_kernel(queue, kernel, global_size,
                                input_tuples, output_tuples,
                                local_size=(32,)):
                 # copying data onto the device
                 for (array, buffer) in input_tuples:
                     pyopencl.enqueue_copy(queue, src=array, dest=buffer)

                 # running program on the device
                 kernel_arguments = [buffer for (_, buffer) in input_tuples]

     Trust, but verify:

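     Note: the slide is cut off mid-function. Based on how run_ocl_kernel is called on the following slides, a sketch of how the full helper plausibly continues (copy inputs in, pass the buffers as kernel arguments, copy outputs back); everything past the point shown above is an assumption:

     def run_ocl_kernel(queue, kernel, global_size,
                        input_tuples, output_tuples,
                        local_size=(32,)):
         # copying data onto the device
         for (array, buffer) in input_tuples:
             pyopencl.enqueue_copy(queue, src=array, dest=buffer)

         # running program on the device: kernel arguments are the device buffers,
         # inputs first, then outputs
         kernel_arguments = [buffer for (_, buffer) in input_tuples]
         kernel_arguments += [buffer for (_, buffer) in output_tuples]
         kernel(queue, global_size, local_size, *kernel_arguments)

         # copying the results back to the host arrays
         for (array, buffer) in output_tuples:
             pyopencl.enqueue_copy(queue, src=buffer, dest=array)

         # block until the queued transfers and kernel have finished
         queue.finish()
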
  6. In [7]: def check_sum_results(a, b, c):
                 c_ref = a + b
                 err = numpy.abs(c - c_ref)
                 if (err.sum() > 0.0).any():
                     print("result does not match")
                 else:
                     print("result matches!")
     In [8]: # Synthetic data setup
             N = int(2**20)
             a = numpy.random.rand(N).astype(numpy.float32)
             b = numpy.random.rand(N).astype(numpy.float32)
             c = numpy.empty_like(a)

             # Device Memory setup
             a_nvidia_buffer = pyopencl.Buffer(nvidia_context,
                                               flags=pyopencl.mem_flags.READ_ONLY,
                                               size=a.nbytes)
             b_nvidia_buffer = pyopencl.Buffer(nvidia_context,
                                               flags=pyopencl.mem_flags.READ_ONLY,
                                               size=b.nbytes)

  7.         c_nvidia_buffer = pyopencl.Buffer(nvidia_context,
                                               flags=pyopencl.mem_flags.WRITE_ONLY,
                                               size=c.nbytes)
     In [9]: nvidia_queue = pyopencl.CommandQueue(nvidia_context)
             input_tuples = ((a, a_nvidia_buffer), (b, b_nvidia_buffer),)
             output_tuples = ((c, c_nvidia_buffer),)
             run_ocl_kernel(nvidia_queue, nvidia_program.sum, (N,), input_tuples, output_tuples)
     In [10]: check_sum_results(a, b, c)
     result matches!
     In [11]: %timeit run_ocl_kernel(nvidia_queue, nvidia_program.sum, (N,), input_tuples, output_tuples)
     2.65 ms ± 507 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

     How to Manipulate Memory

  8. In [12]: def create_input_memory(context, input_arrays):
                  return [(array, pyopencl.Buffer(context,
                                                  flags=pyopencl.mem_flags.READ_ONLY,
                                                  size=array.nbytes))
                          for array in input_arrays]
     In [13]: def create_output_memory(context, output_arrays):
                  return [(array, pyopencl.Buffer(context,
                                                  flags=pyopencl.mem_flags.WRITE_ONLY,
                                                  size=array.nbytes))
                          for array in output_arrays]

  9. In [15]: a = numpy.random.rand(N).astype(numpy.float32)
              b = numpy.random.rand(N).astype(numpy.float32)
              c = numpy.empty_like(a)

              # Device Memory setup
              input_tuples = create_input_memory(nvidia_context, (a, b,))
              output_tuples = create_output_memory(nvidia_context, (c,))

              run_ocl_kernel(nvidia_queue, nvidia_program.sum, (N,), input_tuples, output_tuples)
              check_sum_results(a, b, c)
     result matches!

  10. Batching

      kernel void sum_batched(global float *a, global float *b, global float *c){
          int gid = get_global_id(0)*BATCH_SIZE;

          for(int i=0; i<BATCH_SIZE; ++i)
              c[gid + i] = a[gid + i] + b[gid + i];
      }

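      Note: BATCH_SIZE is a preprocessor constant, not a kernel argument, and the transcript does not show how it is set. A sketch of the usual approach, passing it as a build option; batched_program_source and the batch_size value are assumptions for illustration:

      batch_size = 16  # hypothetical value; the slides do not show the one used
      nvidia_program = pyopencl.Program(nvidia_context, batched_program_source).build(
          options=["-DBATCH_SIZE={}".format(batch_size)]
      )
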
  11. In [18]: %timeit run_ocl_kernel(nvidia_queue, nvidia_program.sum_batched, (N//batch_size,), input_tuples, output_tuples)
      3.62 ms ± 543 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

      Using local and private memory

  12. kernel void sum_batched_private(global float *a, global float *b, global float *c){
          int gid = get_global_id(0)*BATCH_SIZE;

          float a_tmp[BATCH_SIZE], b_tmp[BATCH_SIZE], c_tmp[BATCH_SIZE];

          for(int i=0; i<BATCH_SIZE; ++i){
              a_tmp[i] = a[gid + i];
              b_tmp[i] = b[gid + i];
          }

          for(int i=0; i<BATCH_SIZE; ++i)
              c_tmp[i] = a_tmp[i] + b_tmp[i];

          for(int i=0; i<BATCH_SIZE; ++i)
              c[gid + i] = c_tmp[i];
      }

      How to Do Things in Parallel

      In this section:

  13. In [22]: for device in intel_platform.get_devices():
                   # print out all of the device name properties, except the device type
                   for property_name in sorted(name_properties.keys() - {"Device Type"}):
                       property_string_args = (property_name,
                                               device.get_info(name_properties[property_name]))
                       print("{}: {}".format(*property_string_args))

                   # look up the device type
                   print("Device Types: {}".format(
                       device_types[device.get_info(name_properties["Device Type"])]))

                   # print out all of the processing properties
                   for property_name in sorted(processing_properties.keys()):

  14.                  property_string_args = (property_name,
                                               device.get_info(processing_properties[property_name]))
                       print("{}: {}".format(*property_string_args))

                   # print out all of the memory properties
                   for property_name in sorted(memory_properties.keys()):
                       property_string_args = (property_name,
                                               device.get_info(memory_properties[property_name]))
                       print("{}: {}".format(*property_string_args))

                   print("\n")

      Device Name: Intel(R) HD Graphics
      Device Platform: <pyopencl.Platform 'Intel(R) OpenCL' at 0x321d400>
      Device Types: GPU
      Available Compute Units: 23
      Clockrate: 1100
      Available Constant Memory: 4294959103
      Available Global Memory: 13321633792
      Available Local Memory: 65536

      Device Name: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
      Device Platform: <pyopencl.Platform 'Intel(R) OpenCL' at 0x321d400>
      Device Types: CPU
      Available Compute Units: 8
      Clockrate: 2800
      Available Constant Memory: 131072
      Available Global Memory: 16662528000
      Available Local Memory: 32768

  15. In [23]: !clinfo Number of

    platforms 3 Platform Name Intel(R) OpenCL Platform Vendor Intel(R) Corporation Platform Version OpenCL 2.0 Platform Profile FULL_PROFILE Platform Extensions cl_khr_3d_image_writes cl_kh r_byte_addressable_store cl_khr_depth_images cl_khr_fp64 cl_khr_global_int32_b ase_atomics cl_khr_global_int32_extended_atomics cl_khr_icd cl_khr_image2d_fro m_buffer cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics c l_khr_spir Platform Extensions function suffix INTEL Platform Name Portable Computing Language Platform Vendor The pocl project Platform Version OpenCL 1.2 pocl 1.1 None+Ass erts, LLVM 6.0.0, SPIR, SLEEF, DISTRO, POCL_DEBUG Platform Profile FULL_PROFILE Platform Extensions cl_khr_icd Platform Extensions function suffix POCL Platform Name NVIDIA CUDA Platform Vendor NVIDIA Corporation Platform Version OpenCL 1.2 CUDA 9.1.84 Platform Profile FULL_PROFILE Platform Extensions cl_khr_global_int32_base_ato mics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_k hr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_k hr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query c
  16. l_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer Platform Extensions

    function suffix NV Platform Name Intel(R) OpenCL Number of devices 2 Device Name Intel(R) HD Graphics Device Vendor Intel(R) Corporation Device Vendor ID 0x8086 Device Version OpenCL 2.0 Driver Version r5.0.63503 Device OpenCL C Version OpenCL C 2.0 Device Type GPU Device Profile FULL_PROFILE Device Available Yes Compiler Available Yes Linker Available Yes Max compute units 23 Max clock frequency 1100MHz Device Partition (core) Max number of sub-devices 0 Supported partition types by <unknown> (0x558D0000000 0) Supported affinity domains 0x558D00000000 Max work item dimensions 3 Max work item sizes 256x256x256 Max work group size 256 Preferred work group size multiple 32 Sub-group sizes (Intel) 8, 16, 32 Preferred / native vector sizes char 16 / 16 short 8 / 8
  17. int 4 / 4 long

    1 / 1 half 8 / 8 (cl_khr_ fp16) float 1 / 1 double 1 / 1 (cl_khr_ fp64) Half-precision Floating-point support (cl_khr_fp16) Denormals Yes Infinity and NANs Yes Round to nearest Yes Round to zero Yes Round to infinity Yes IEEE754-2008 fused multiply-add Yes Support is emulated in software No Single-precision Floating-point support (core) Denormals Yes Infinity and NANs Yes Round to nearest Yes Round to zero Yes Round to infinity Yes IEEE754-2008 fused multiply-add Yes Support is emulated in software No Correctly-rounded divide and sqrt operations Yes Double-precision Floating-point support (cl_khr_fp64) Denormals Yes Infinity and NANs Yes Round to nearest Yes Round to zero Yes Round to infinity Yes IEEE754-2008 fused multiply-add Yes
  18. Support is emulated in software

    No Address bits 64, Little-Endian Global memory size 13321633792 (12.41GiB) Error Correction support No Max memory allocation 4294959103 (4GiB) Unified memory for Host and Device Yes Shared Virtual Memory (SVM) capabilities (core) Coarse-grained buffer sharing Yes Fine-grained buffer sharing No Fine-grained system sharing No Atomics No Minimum alignment for any data type 128 bytes Alignment of base address 1024 bits (128 bytes) Preferred alignment for atomics SVM 64 bytes Global 64 bytes Local 64 bytes Max size for global variable 65536 (64KiB) Preferred total size of global vars 4294959103 (4GiB) Global Memory cache type Read/Write Global Memory cache size 524288 (512KiB) Global Memory cache line size 64 bytes Image support Yes Max number of samplers per kernel 16 Max size for 1D images from buffer 268434943 pixels Max 1D or 2D image array size 2048 images Base address alignment for 2D image buffers 4 bytes Pitch alignment for 2D image buffers 4 pixels Max 2D image size 16384x16384 pixels Max planar YUV image size 16384x16380 pixels Max 3D image size 16384x16384x2048 pixels
  19. Max number of read image

    args 128 Max number of write image args 128 Max number of read/write image args 128 Max number of pipe args 16 Max active pipe reservations 1 Max pipe packet size 1024 Local memory type Local Local memory size 65536 (64KiB) Max number of constant args 8 Max constant buffer size 4294959103 (4GiB) Max size of kernel argument 1024 Queue properties (on host) Out-of-order execution Yes Profiling Yes Queue properties (on device) Out-of-order execution Yes Profiling Yes Preferred size 131072 (128KiB) Max size 67108864 (64MiB) Max queues on device 1 Max events on device 1024 Prefer user sync for interop Yes Profiling timer resolution 83ns Execution capabilities Run OpenCL kernels Yes Run native kernels No SPIR versions 1.2 printf() buffer size 4194304 (4MiB) Built-in kernels block_motion_estimate_intel; block_advanced_motion_estimate_check_intel;block_advanced_motion_estimate_bidi rectional_check_intel
  20. Motion Estimation accelerator version (Intel)

    2 Device-side AVC Motion Estimation version 1 Supports texture sampler use Yes Supports preemption No Device Extensions cl_intel_accelerator cl_inte l_advanced_motion_estimation cl_intel_device_side_avc_motion_estimation cl_int el_driver_diagnostics cl_intel_media_block_io cl_intel_motion_estimation cl_in tel_planar_yuv cl_intel_packed_yuv cl_intel_required_subgroup_size cl_intel_su bgroups cl_intel_subgroups_short cl_intel_va_api_media_sharing cl_khr_3d_image _writes cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_fp16 cl_khr_f p64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_k hr_icd cl_khr_image2d_from_buffer cl_khr_local_int32_base_atomics cl_khr_local _int32_extended_atomics cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_khr_ spir cl_khr_subgroups Device Name Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz Device Vendor Intel(R) Corporation Device Vendor ID 0x8086 Device Version OpenCL 2.0 (Build 475) Driver Version 1.2.0.475 Device OpenCL C Version OpenCL C 2.0 Device Type CPU Device Profile FULL_PROFILE Device Available Yes Compiler Available Yes Linker Available Yes Max compute units 8 Max clock frequency 2800MHz Device Partition (core) Max number of sub-devices 8
  21. Supported partition types by counts,

    equally, by names (Intel) Max work item dimensions 3 Max work item sizes 8192x8192x8192 Max work group size 8192 Preferred work group size multiple 128 Preferred / native vector sizes char 1 / 32 short 1 / 16 int 1 / 8 long 1 / 4 half 0 / 0 (n/a) float 1 / 8 double 1 / 4 (cl_khr_ fp64) Half-precision Floating-point support (n/a) Single-precision Floating-point support (core) Denormals Yes Infinity and NANs Yes Round to nearest Yes Round to zero No Round to infinity No IEEE754-2008 fused multiply-add No Support is emulated in software No Correctly-rounded divide and sqrt operations No Double-precision Floating-point support (cl_khr_fp64) Denormals Yes Infinity and NANs Yes Round to nearest Yes Round to zero Yes Round to infinity Yes
  22. IEEE754-2008 fused multiply-add Yes Support

    is emulated in software No Address bits 64, Little-Endian Global memory size 16662528000 (15.52GiB) Error Correction support No Max memory allocation 4165632000 (3.88GiB) Unified memory for Host and Device Yes Shared Virtual Memory (SVM) capabilities (core) Coarse-grained buffer sharing Yes Fine-grained buffer sharing No Fine-grained system sharing No Atomics No Minimum alignment for any data type 128 bytes Alignment of base address 1024 bits (128 bytes) Preferred alignment for atomics SVM 64 bytes Global 64 bytes Local 0 bytes Max size for global variable 65536 (64KiB) Preferred total size of global vars 65536 (64KiB) Global Memory cache type Read/Write Global Memory cache size 262144 (256KiB) Global Memory cache line size 64 bytes Image support Yes Max number of samplers per kernel 480 Max size for 1D images from buffer 260352000 pixels Max 1D or 2D image array size 2048 images Base address alignment for 2D image buffers 64 bytes Pitch alignment for 2D image buffers 64 pixels Max 2D image size 16384x16384 pixels Max 3D image size 2048x2048x2048 pixels
  23. Max number of read image

    args 480 Max number of write image args 480 Max number of read/write image args 480 Max number of pipe args 16 Max active pipe reservations 32767 Max pipe packet size 1024 Local memory type Global Local memory size 32768 (32KiB) Max number of constant args 480 Max constant buffer size 131072 (128KiB) Max size of kernel argument 3840 (3.75KiB) Queue properties (on host) Out-of-order execution Yes Profiling Yes Local thread execution (Intel) Yes Queue properties (on device) Out-of-order execution Yes Profiling Yes Preferred size 4294967295 (4GiB) Max size 4294967295 (4GiB) Max queues on device 4294967295 Max events on device 4294967295 Prefer user sync for interop No Profiling timer resolution 1ns Execution capabilities Run OpenCL kernels Yes Run native kernels Yes SPIR versions 1.2 printf() buffer size 1048576 (1024KiB) Built-in kernels Device Extensions cl_khr_icd cl_khr_global_int
  24. 32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics

    cl_khr_byte_addressable_store cl_kh r_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spi r cl_khr_fp64 cl_khr_image2d_from_buffer Platform Name Portable Computing Language Number of devices 1 Device Name pthread-Intel(R) Core(TM) i7 -7700HQ CPU @ 2.80GHz Device Vendor GenuineIntel Device Vendor ID 0x8086 Device Version OpenCL 1.2 pocl HSTR: pthrea d-x86_64-pc-linux-gnu-skylake Driver Version 1.1 Device OpenCL C Version OpenCL C 1.2 pocl Device Type CPU Device Profile FULL_PROFILE Device Available Yes Compiler Available Yes Linker Available Yes Max compute units 8 Max clock frequency 3800MHz Device Partition (core) Max number of sub-devices 8 Supported partition types equally, by counts Max work item dimensions 3 Max work item sizes 4096x4096x4096 Max work group size 4096 Preferred work group size multiple 8 Preferred / native vector sizes char 16 / 16
  25. short 16 / 16 int

    8 / 8 long 4 / 4 half 0 / 0 (n/a) float 8 / 8 double 4 / 4 (cl_khr_ fp64) Half-precision Floating-point support (n/a) Single-precision Floating-point support (core) Denormals Yes Infinity and NANs Yes Round to nearest Yes Round to zero Yes Round to infinity Yes IEEE754-2008 fused multiply-add Yes Support is emulated in software No Correctly-rounded divide and sqrt operations Yes Double-precision Floating-point support (cl_khr_fp64) Denormals Yes Infinity and NANs Yes Round to nearest Yes Round to zero Yes Round to infinity Yes IEEE754-2008 fused multiply-add Yes Support is emulated in software No Address bits 64, Little-Endian Global memory size 14515044352 (13.52GiB) Error Correction support No Max memory allocation 4294967296 (4GiB) Unified memory for Host and Device Yes Minimum alignment for any data type 128 bytes
  26. Alignment of base address 1024

    bits (128 bytes) Global Memory cache type Read/Write Global Memory cache size 6291456 (6MiB) Global Memory cache line size 64 bytes Image support Yes Max number of samplers per kernel 16 Max size for 1D images from buffer 268435456 pixels Max 1D or 2D image array size 2048 images Max 2D image size 16384x16384 pixels Max 3D image size 2048x2048x2048 pixels Max number of read image args 128 Max number of write image args 128 Local memory type Global Local memory size 4194304 (4MiB) Max number of constant args 8 Max constant buffer size 4194304 (4MiB) Max size of kernel argument 1024 Queue properties Out-of-order execution No Profiling Yes Prefer user sync for interop Yes Profiling timer resolution 1ns Execution capabilities Run OpenCL kernels Yes Run native kernels Yes SPIR versions 1.2 printf() buffer size 1048576 (1024KiB) Built-in kernels Device Extensions cl_khr_byte_addressable_stor e cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr _local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_
  27. writes cl_khr_spir cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended

    _atomics cl_khr_fp64 Platform Name NVIDIA CUDA Number of devices 1 Device Name GeForce GTX 1050 Device Vendor NVIDIA Corporation Device Vendor ID 0x10de Device Version OpenCL 1.2 CUDA Driver Version 390.48 Device OpenCL C Version OpenCL C 1.2 Device Type GPU Device Topology (NV) PCI-E, 01:00.0 Device Profile FULL_PROFILE Device Available Yes Compiler Available Yes Linker Available Yes Max compute units 5 Max clock frequency 1493MHz Compute Capability (NV) 6.1 Device Partition (core) Max number of sub-devices 1 Supported partition types None Max work item dimensions 3 Max work item sizes 1024x1024x64 Max work group size 1024 Preferred work group size multiple 32 Warp size (NV) 32 Preferred / native vector sizes char 1 / 1 short 1 / 1
  28. int 1 / 1 long

    1 / 1 half 0 / 0 (n/a) float 1 / 1 double 1 / 1 (cl_khr_ fp64) Half-precision Floating-point support (n/a) Single-precision Floating-point support (core) Denormals Yes Infinity and NANs Yes Round to nearest Yes Round to zero Yes Round to infinity Yes IEEE754-2008 fused multiply-add Yes Support is emulated in software No Correctly-rounded divide and sqrt operations Yes Double-precision Floating-point support (cl_khr_fp64) Denormals Yes Infinity and NANs Yes Round to nearest Yes Round to zero Yes Round to infinity Yes IEEE754-2008 fused multiply-add Yes Support is emulated in software No Address bits 64, Little-Endian Global memory size 4238737408 (3.948GiB) Error Correction support No Max memory allocation 1059684352 (1011MiB) Unified memory for Host and Device No Integrated memory (NV) No Minimum alignment for any data type 128 bytes
  29. Alignment of base address 4096

    bits (512 bytes) Global Memory cache type Read/Write Global Memory cache size 81920 (80KiB) Global Memory cache line size 128 bytes Image support Yes Max number of samplers per kernel 32 Max size for 1D images from buffer 134217728 pixels Max 1D or 2D image array size 2048 images Max 2D image size 16384x32768 pixels Max 3D image size 16384x16384x16384 pixels Max number of read image args 256 Max number of write image args 16 Local memory type Local Local memory size 49152 (48KiB) Registers per block (NV) 65536 Max number of constant args 9 Max constant buffer size 65536 (64KiB) Max size of kernel argument 4352 (4.25KiB) Queue properties Out-of-order execution Yes Profiling Yes Prefer user sync for interop No Profiling timer resolution 1000ns Execution capabilities Run OpenCL kernels Yes Run native kernels No Kernel execution timeout (NV) Yes Concurrent copy and kernel execution (NV) Yes Number of async copy engines 2 printf() buffer size 1048576 (1024KiB) Built-in kernels
  30. Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics

    cl_khr_local_int32_base_atomics cl_k hr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_k hr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query c l_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer NULL platform behavior clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) No platform clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) No platform clCreateContext(NULL, ...) [default] No platform clCreateContext(NULL, ...) [other] Success [INTEL] clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) No platform clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No platform clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) No platform clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No platform clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No platform clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) No platform
  31. Workitems vs Workgroups

      Exploring different types of parallelism, with our trusty vector sum.

      OpenCL Task vs Data Parallelism Abstractions

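      Note: on the host side the workitem/workgroup split is just the global and local sizes passed when a kernel is enqueued; a brief sketch (reusing the run_ocl_kernel helper's local_size parameter, with arbitrary sizes) of what varying the workgroup size looks like:

      # each workgroup is local_size work-items; there are N / local_size workgroups
      for local_size in [(32,), (64,), (128,)]:
          run_ocl_kernel(nvidia_queue, nvidia_program.sum, (N,),
                         input_tuples, output_tuples,
                         local_size=local_size)
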
  32. kernel void sum16_local(global float16 *a, global float16 *b, global float16 *c){
          int wid = get_group_id(0)*WG_SIZE;
          int lid = get_local_id(0);

          local float16 a_local[WG_SIZE], b_local[WG_SIZE], c_local[WG_SIZE];

          // Copying on
          event_t copyon[2];

      In [30]: %timeit run_ocl_kernel(nvidia_queue, nvidia_program.sum16_local, (N//16,), input_tuples, output_tuples)
      1.97 ms ± 51.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

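      Note: the sum16_local kernel is also cut off in the transcript, just after the copy-on events are declared. A sketch of how the body plausibly continues, using async work-group copies into and out of local memory; the actual slide content past the point shown is an assumption:

      // copy this workgroup's slice of a and b into local memory
      copyon[0] = async_work_group_copy(a_local, a + wid, WG_SIZE, 0);
      copyon[1] = async_work_group_copy(b_local, b + wid, WG_SIZE, 0);
      wait_group_events(2, copyon);

      // each work-item sums one float16 element
      c_local[lid] = a_local[lid] + b_local[lid];
      barrier(CLK_LOCAL_MEM_FENCE);

      // copy the workgroup's results back out to global memory
      event_t copyoff = async_work_group_copy(c + wid, c_local, WG_SIZE, 0);
      wait_group_events(1, &copyoff);
      }
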
  33. But, before we think that we're too clever:

      What about good old numpy?

      In [26]: %timeit (a+b)
      668 µs ± 157 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

  34. Increasing the computational work

      Now we make the kernel compute c⃗ = (a⃗ + b⃗)^x

      kernel void sum16_pow(global float16 *a, global float16 *b, global float16 *c){
          int gid = get_global_id(0);
          c[gid] = pow(a[gid] + b[gid], POW);
      }

      x = 1000

      In [28]: %timeit run_ocl_kernel(nvidia_queue, nvidia_program.sum16_pow, (N//16,), input_tuples, output_tuples)
      3.26 ms ± 538 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

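      Note: as with BATCH_SIZE, POW is a compile-time constant. A sketch of how x = 1000 might be supplied when building; pow_program_source is an assumed name for a string containing the kernel above:

      x = 1000
      nvidia_program = pyopencl.Program(nvidia_context, pow_program_source).build(
          options=["-DPOW={}".format(x)]
      )
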
  35. Gene Amdahl, Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities
      Page and Luk, Compiling Occam into field-programmable gate arrays
      Herb Sutter, The Free Lunch is Over
      Asanovic et al., The Landscape of Parallel Computing Research: A View from Berkeley
      Tsugio Makimoto, The Hot Decade of Field Programmable Technologies
      Lee et al., Debunking the 100X GPU vs CPU myth
      Che et al., Rodinia: A Benchmark Suite for Heterogeneous Computing
      Thomas et al., Hardware architectures for Monte-Carlo based financial simulations
      Mike Giles, Some (strong) opinions on HPC and the use of GPUs