GPU Computing
in Python
Glib Ivashkevych
junior researcher, NSC KIPT
Slide 2
Parallel revolution
The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software
Herb Sutter, March 2005
Slide 3
When serial code hits the wall: the power wall.

Now, Intel is embarked on a course already adopted by some of its major rivals: obtaining more computing power by stamping multiple processors on a single chip rather than straining to increase the speed of a single processor.
Paul S. Otellini, Intel's CEO, May 2004
Slide 4
July 2006: Intel launches Core 2 Duo (Conroe)
Feb 2007: Nvidia releases CUDA SDK
Nov 2008: Tsubame, the first GPU-accelerated supercomputer
Dec 2008: OpenCL 1.0 specification released
Today: 50 GPU-powered supercomputers in the Top500
Slide 5
It's very clear that we are close to the tipping point. If we're not at a tipping point, we're racing at it.
Jen-Hsun Huang, NVIDIA co-founder and CEO, March 2013

Heterogeneous computing becomes a standard in HPC, and programming has changed.
Slide 6
Heterogeneous computing

[Diagram: the host (CPU plus main memory) connected to the device (GPU: multiprocessors with cores and GPU memory)]
Slide 7
CPU:
general purpose
sophisticated design and scheduling
perfect for task parallelism

GPU:
highly parallel
huge memory bandwidth
lightweight scheduling
perfect for data parallelism
Slide 8
Anatomy of a GPU: multiprocessors

[Diagram: GPU composed of multiprocessors (MPs), each with its own shared memory]

A GPU is composed of tens of multiprocessors (streaming multiprocessors), each of which is composed of tens of cores: hundreds of cores in total.
Slide 9
CUDA (Compute Unified Device Architecture) is a hierarchy of:
computation
memory
synchronization
Python
fast development
huge number of packages: data analysis, linear algebra, special functions, etc.
metaprogramming

Convenient, but not that fast at number crunching.
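The gap can be illustrated with a short, hypothetical benchmark (the array size and timings are arbitrary): the same elementwise computation in a pure-Python loop and in NumPy, which moves the loop into compiled code.

```python
import time
import numpy as np

n = 1_000_000
xs = list(range(n))

# Pure-Python loop: every iteration goes through the interpreter.
t0 = time.perf_counter()
squares_py = [x * x for x in xs]
t_py = time.perf_counter() - t0

# NumPy: one vectorized call, the loop runs in compiled C code.
arr = np.arange(n, dtype=np.int64)
t0 = time.perf_counter()
squares_np = arr * arr
t_np = time.perf_counter() - t0

print(f"pure Python: {t_py:.4f} s, NumPy: {t_np:.4f} s")
```

On a typical machine the NumPy version is one to two orders of magnitude faster, while producing identical results.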
Slide 13
PyCUDA
Wrapper package around the CUDA API
Convenient abstractions: GPUArray, random number generation, reductions & scans, etc.
Automatic cleanup, initialization and error checking, kernel caching
Completeness
Slide 14
GPUArray
NumPy-like interface for GPU arrays
Convenient creation and manipulation routines
Elementwise operations
Cleanup
Slide 15
SourceModule
Abstraction to create, compile and run GPU code
GPU code to compile is passed as a string
Control over nvcc compiler options
Convenient interface to get kernels
Slide 16
Metaprogramming
GPU code can be generated at runtime
PyCUDA uses the mako template engine internally
Any template engine can be used to generate GPU source code; see also codepy
Generate more flexible and better-optimized code
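As an illustration, here is a sketch of the idea using only the standard library's string.Template (mako or any other engine works the same way) to generate specialized kernel source at runtime; the kernel names and parameters are made up for the example:

```python
from string import Template

# A kernel template: type, name and operation are filled in at runtime.
kernel_tmpl = Template("""
__global__ void ${name}(${dtype} *out, const ${dtype} *a, const ${dtype} *b)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    out[i] = a[i] ${op} b[i];
}
""")

# Generate specialized variants without writing each kernel by hand.
add_f32_src = kernel_tmpl.substitute(name="add_f32", dtype="float", op="+")
mul_f64_src = kernel_tmpl.substitute(name="mul_f64", dtype="double", op="*")

# Each generated string can then be handed to pycuda.compiler.SourceModule.
print(add_f32_src)
```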
Slide 17
Installation
numpy, mako, and the CUDA driver & toolkit are required
Boost.Python is optional
Dev packages are needed if you build from source
Also: PyOpenCL, pyfft
Slide 18
GPU computing resources
Documentation
Intro to Parallel Programming, by David Luebke (Nvidia) and John Owens (UC Davis)
Heterogeneous Parallel Programming, by Wen-mei W. Hwu (UIUC)
Several excellent books