
Comparison of Parallel Programming APIs


Slides for the defense of my BSc Thesis "Comparison of Parallel Programming APIs" at the Faculty of Information Technology (BUT). They present the work carried out developing a video stabilization algorithm based on Gray-Code Bit-Plane Matching and porting it to different parallel programming frameworks (OpenMP, CUDA and OpenCL) to compare the speedup obtained.

Samuel Alfageme

August 30, 2016



Transcript

  1. COMPARISON OF PARALLEL PROGRAMMING APIS
    SAMUEL ALFAGEME SAINZ (UNIVERSIDAD DE VALLADOLID, SPAIN)
    BRNO UNIVERSITY OF TECHNOLOGY – FACULTY OF INFORMATION TECHNOLOGY – DEPARTMENT OF INTELLIGENT SYSTEMS
  2. Motivation
    • Nowadays, many parallel frameworks are available, popular and mature enough to be considered for industry.
    • Growing need to exploit parallelism: the hardware is already there, but the software side depends on developers' ability and knowledge.
    • Need for time- and power-saving algorithms for consumer applications like video cameras or smartphones.
  3. Practical Approach: Video Stabilization
    FAST DIGITAL IMAGE STABILIZER BASED ON GRAY-CODED BIT-PLANE MATCHING: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=793546
    Why?
    • Based on many regular operations ☞ exploitable parallelism.
    • Video stabilization is a recurring and important problem in the market.
  4. Gray-Code Bit-Plane Matching algorithm
    1. Video wrapper: VIDEO.mp4 ☞ single frames (.png), color information removed.
    2. Algorithm output: Global Motion Vectors (GMV)
    • Median of the 4 Local Motion Vectors and the previous frame's GMV
    • Measures the instability of each frame over the previous one
    [Diagram: each grayscale frame is decomposed into its regular bit-planes (0–7), which are then converted to Gray-code bit-planes; only the relevant planes are matched.]
  5. Meaning of the Global Motion Vector
    • Reference frame (n) vs. compared frame (n+4): the GC-BPM algorithm yields a GMV = (-5,-9), and the stabilized window is displaced accordingly.
    [Frames extracted from Michael Messer's Vine: https://vine.co/v/iVWbAjB1XmH]
  6. Implementation
    • Two versions of the sequential algorithm:
      • v1. Using OpenCV functions
      • v2. Own helper functions
    • Parallel solutions with their particular requirements (e.g. C99 compliance)
    • Modular approach to the full application
    • Single entry point for all implementations to help run the experiments: $ ./sequential /path/to/video.mp4 <API-code>
    • Makefile that eases compiling the application
    • Toolbox:
      • timecollector.py as a code description of the experiments
      • Gnuplot scripts to generate the speedup graphics
      • Test code & examples for the helper functions
  7. Parallelization
    3 options considered:
    1. Parallelize the 4 local motion vector calculations
    2. Distribute the filling in of the correlation matrix
    3. Calculate the GMV for all frames in parallel
    [Diagram: processing elements (PE #0 … PE #3) share the current frame, the previous frame's subimage, the search window and the correlation matrix C; each iteration calculates a search window, compares bit-planes, counts the black pixels and fills one entry of C, from C(0,0) in the first iteration to C(2p,2p) in the last.]
  8. Problems found
    • Removing the OpenCV dependencies from the core was crucial to parallelize on the GPU
    • Need to implement and test computer-vision helper functions (xor2img(), count1s(), …)
    • The OpenCV-free sequential algorithm runs considerably slower
    • Combining different APIs in the same program can be extremely hard
    • No CUDA-capable machine was found to run all the experiments before the deadline
    • Running the algorithms on my laptop could only produce significant data for OpenMP (up to 4 threads)
  9. Results
    • Obtained last week on a CUDA server provided by GTI – UVa
      • Access provided thanks to Dr. Mario Martínez Zarzuela
    • Available on the project CD & for download at http://alumnos.inf.uva.es/samalfa/thesis/results.zip
    • Machine characterization (cuttlas.gti.tel.uva.es):
      • 2x Intel Xeon X5650 – 6 cores / 12 threads
      • NVIDIA GeForce GTX 970 – Maxwell microarchitecture
    • Raw data generated by script (results.csv)
    • Speedup calculations & graphics
  10. Speedup experiments
    • OpenMP: 1, 2, 4, 8 and 12 threads
    • CUDA: bi-dimensional blocks of 128x128, 256x256 and 512x512 threads
    • OpenCL: executed on the GPU with the same configuration as CUDA
    [Plots: speedup achieved vs. number of processors (OpenMP), threads per block (CUDA) and work-group size (OpenCL).]
  11. Normalized speedup
    [Bar chart: speedup normalized to the number of experiments (1–5). Series values, grouped by API: OpenMP 0.96910227, 1.893293297, 3.73415816, 7.129033256, 10.70272883; CUDA 18.6549191, 19.17890871, 18.83332244; OpenCL 12.08180988, 12.25493491, 12.02863412.]
  12. Conclusions
    • We achieved the biggest speedups, in order, with these versions: CUDA >> OpenCL > OpenMP
    • The parallel approach does not take full advantage of the GPU's capabilities, as the correlation matrix does not fill the whole grid for small video resolutions.
    • Nearly constant speedup for different block/work-group sizes
    • Consumer applications could benefit hugely from relatively simple parallel implementations
  13. Future work lines
    • Complete the computer-vision application to show the results of the video stabilization
    • Look into algorithms that handle rotational and zooming video
    • Improve the performance:
      • Determine the cause of the slowdown when removing the OpenCV dependencies (see 4.2.1)
      • Aim for bigger video resolutions – split the frames into more search windows
      • Combine parallel options to increase performance on the GPU
  14. Reviewer Questions
    • Which parallel toolkit is best for the GC-BPM algorithm (in terms of speedup)?
      • CUDA proved to be the fastest of the implementations, by a factor of +6 (~4 s) compared to the OpenCL version.
    • Which parallel APIs/frameworks do you recommend for parallel programming, and why?
      • Definitely not OpenCL:
        • The online compilation model for kernels is painful to debug
        • Documentation is limited and underdeveloped
      • OpenMP is a really easy and reliable way to parallelize code on the CPU:
        • Almost the same speedup as OpenCL, without many of its disadvantages
      • CUDA is here to stay and it is worth knowing:
        • It has to become more architecture-independent
        • A really powerful and accessible way into GPGPU