
Comparison of Parallel Programming APIs


Slides for the defense of my BSc Thesis "Comparison of Parallel Programming APIs" at the Faculty of Information Technology (BUT). They present the work carried out developing a video stabilization algorithm based on Gray-Code Bit-Plane Matching and porting it to different parallel programming frameworks (OpenMP, CUDA and OpenCL) to compare the speedup obtained.

Samuel Alfageme

August 30, 2016



Transcript

  1. COMPARISON OF PARALLEL PROGRAMMING APIS
    SAMUEL ALFAGEME SAINZ (UNIVERSIDAD DE VALLADOLID, SPAIN)
    BRNO UNIVERSITY OF TECHNOLOGY – FACULTY OF INFORMATION TECHNOLOGY – DEPARTMENT OF INTELLIGENT SYSTEMS
  2. Motivation
    • Nowadays, many parallel frameworks are available, popular and mature enough to be considered for industry.
    • Growing need to exploit parallelism: the hardware is already there, but the software side depends on developers' ability and knowledge.
    • Need for time- and power-saving algorithms for consumer applications like video cameras or smartphones.
  3. Practical Approach: Video Stabilization
    FAST DIGITAL IMAGE STABILIZER BASED ON GRAY-CODED BIT-PLANE MATCHING: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=793546
    Why?
    • Based on many regular operations ☞ exploitable parallelism.
    • Video stabilization is a recurring and important problem in the market.
  4. Gray-Code Bit-Plane Matching algorithm
    1. Video wrapper: VIDEO.mp4 ☞ single frames (.png), color information removed.
    2. Algorithm output: Global Motion Vectors (GMV)
    • Median of the 4 Local Motion Vectors and the previous frame's GMV
    • Measures the instability of each frame over the previous one
    [Diagram: each grayscale frame is decomposed into its regular bit-planes (0–7), which are then converted to Gray-code bit-planes; only the relevant planes are matched.]
  5. Meaning of the Global Motion Vector
    • Reference frame (n) vs. compared frame (n+4): the GC-BPM algorithm yields a GMV = (-5,-9), and the stabilized window is displaced accordingly.
    [Frames extracted from Michael Messer's Vine: https://vine.co/v/iVWbAjB1XmH]
  6. Implementation
    • Two versions of the sequential algorithm:
      • v1. Using OpenCV functions
      • v2. Own helper functions
    • Parallel solutions with their particular requirements (e.g. C99 compliance)
    • Modular approach to the full application
    • Single entry point for all implementations to help run the experiments: $ ./sequential /path/to/video.mp4 <API-code>
    • Makefile that eases compiling the application
    • Toolbox:
      • timecollector.py as a code description of the experiments
      • Gnuplot scripts to generate the speedup graphics
      • Test code & examples for the helper functions
  7. Parallelization
    3 options considered:
    1. Parallelize the 4 local motion vector calculations
    2. Distribute the filling in of the correlation matrix
    3. Calculate the GMV for all frames in parallel
    [Diagram: processing elements (PE #0 … PE #3) share the current frame, the previous frame's subimage, the search window and the correlation matrix C; each iteration calculates a search window, compares bit-planes, counts the black pixels and fills one entry of C, from C(0,0) in the first iteration to C(2p,2p) in the last.]
  8. Problems found
    • Removing the OpenCV dependencies from the core was crucial to parallelize on the GPU
    • Need to implement and test computer-vision helper functions (xor2img(), count1s(), …)
    • The OpenCV-free sequential algorithm runs considerably slower
    • Combining different APIs in the same program can be extremely hard
    • No CUDA-capable machine was found to run all the experiments before the deadline
    • Running the algorithms on my laptop could only produce significant data for OpenMP (up to 4 threads)
  9. Results
    • Obtained last week on a CUDA server provided by GTI – UVa
      • Access provided thanks to Dr. Mario Martínez Zarzuela
    • Available on the project CD & for download at http://alumnos.inf.uva.es/samalfa/thesis/results.zip
    • Machine characterization (cuttlas.gti.tel.uva.es):
      • 2x Intel Xeon X5650 – 6 cores / 12 threads
      • NVIDIA GeForce GTX 970 – Maxwell microarchitecture
    • Raw data generated by script (results.csv)
    • Speedup calculations & graphics
  10. Speedup experiments
    • OpenMP: 1, 2, 4, 8 and 12 threads
    • CUDA: bi-dimensional blocks of 128x128, 256x256 and 512x512 threads
    • OpenCL: executed on the GPU with the same configuration as CUDA
    [Plots: speedup achieved vs. number of processors (OpenMP), threads per block (CUDA) and work-group size (OpenCL).]
  11. Normalized speedup
    [Bar chart: speedup normalized to the number of experiments (1–5). Series values, grouped by API: OpenMP 0.96910227, 1.893293297, 3.73415816, 7.129033256, 10.70272883; CUDA 18.6549191, 19.17890871, 18.83332244; OpenCL 12.08180988, 12.25493491, 12.02863412.]
  12. Conclusions
    • We achieved the biggest speedups, in order, with these versions: CUDA >> OpenCL > OpenMP
    • The parallel approach does not take full advantage of the GPU's capabilities, as the correlation matrix does not fill the whole grid for small video resolutions.
    • Nearly constant speedup for different block/work-group sizes
    • Consumer applications could benefit hugely from relatively simple parallel implementations
  13. Future work lines
    • Complete the computer-vision application to show the results of the video stabilization
    • Look into algorithms that handle rotational and zooming video
    • Improve the performance:
      • Determine the cause of the slowdown when removing the OpenCV dependencies (see 4.2.1)
      • Aim for bigger video resolutions – split the frames into more search windows
      • Combine parallel options to increase performance on the GPU
  14. Reviewer Questions
    • Which parallel toolkit is best for the GC-BPM algorithm (in terms of speedup)?
      • CUDA proved to be the fastest of the implementations, by a factor of +6 (~4 s) compared to the OpenCL version.
    • Which parallel APIs/frameworks do you recommend for parallel programming, and why?
      • Definitely not OpenCL:
        • The online compilation model for kernels is painful to debug
        • Documentation is limited and underdeveloped
      • OpenMP is a really easy and reliable way to parallelize code on the CPU:
        • Almost the same speedup as OpenCL, without many of its disadvantages
      • CUDA is here to stay and it is worth knowing:
        • It has to become more architecture-independent
        • A really powerful and accessible way into GPGPU