VexCL implementation, Meeting C++ 2014

VexCL OpenCL/CUDA source code generation from C++ expressions Denis Demidov
Supercomputer Center of Russian Academy of Sciences Meeting C++, December 2014

Modern GPGPU frameworks CUDA Proprietary architecture by NVIDIA Requires NVIDIA
hardware More mature, many libraries Kernels are written in C++ OpenCL Open standard Supports wide range of hardware Code is much more verbose Kernels are written in C99

Modern GPGPU frameworks CUDA Proprietary architecture by NVIDIA Requires NVIDIA
hardware More mature, many libraries Kernels are written in C++ Kernels are compiled to PTX together with host program OpenCL Open standard Supports wide range of hardware Code is much more verbose Kernels are written in C99 Kernels are compiled at runtime, adding an initialization overhead The latter distinction is usually considered to be an OpenCL drawback. But it also allows us to generate more eﬃcient kernels at runtime! VexCL takes care of this part.

VexCL — a vector expression template library for OpenCL/CUDA Created
for ease of C++ based GPGPU development: Convenient notation for vector expressions OpenCL/CUDA JIT code generation Easily combined with existing libraries/code Header-only The source code is publicly available under MIT license: https://github.com/ddemidov/vexcl Fork m e on G itH ub

VexCL interface

Hello VexCL: vector sum 1 #include <iostream> 2 #include <vector>
3 #include <vexcl/vexcl.hpp> 4 5 int main() { 6 const size t n = 1024 ∗ 1024; 7 8 // Get compute devices supporting double precision: 9 vex :: Context ctx(vex :: Filter :: DoublePrecision ); 10 if (!ctx) throw std:: runtime error (”No compute devices available”); 11 12 // Prepare input data, transfer it to the device(s): 13 std :: vector<double> a(n, 1), b(n, 2), c(n); 14 vex :: vector<double> A(ctx, a), B(ctx, b), C(ctx, n); 15 16 // Launch compute kernel: 17 C = A + B; 18 19 // Get result back to host: 20 vex :: copy(C, c); 21 std :: cout << c[42] << std::endl; 22 }

Initialization Multi-device and multi-platform computations are supported. VexCL context is
initialized from combination of device ﬁlters. Device ﬁlter is a boolean functor acting on const cl :: Device&. Initialize VexCL context on selected devices 1 vex :: Context ctx( vex :: Filter :: Any );

initialized from combination of device ﬁlters. Device ﬁlter is a boolean functor acting on const cl :: Device&. Initialize VexCL context on selected devices 1 vex :: Context ctx( vex :: Filter :: Type(CL DEVICE TYPE GPU) );

initialized from combination of device ﬁlters. Device ﬁlter is a boolean functor acting on const cl :: Device&. Initialize VexCL context on selected devices 1 vex :: Context ctx( 2 vex :: Filter :: Type(CL DEVICE TYPE ACCELERATOR) && 3 vex :: Filter :: Platform(” Intel ”) 4 );

initialized from combination of device ﬁlters. Device ﬁlter is a boolean functor acting on const cl :: Device&. Initialize VexCL context on selected devices 1 vex :: Context ctx( 2 vex :: Filter :: DoublePrecision && 3 [ ](const cl :: Device &d) { 4 return d. getInfo<CL DEVICE GLOBAL MEM SIZE>() >= 16 GB; 5 });

Memory and work splitting 1 vex :: Context ctx( vex
:: Filter :: Name(”Tesla”) ); 2 3 vex :: vector<double> x(ctx, N); 4 vex :: vector<double> y(ctx, N); 5 6 x = vex::element index() ∗ (1.0 / N); 7 y = sin(2 ∗ x) + sqrt(1 − x ∗ x); x y

:: Filter :: Type(CL DEVICE TYPE GPU) ); 2 3 vex :: vector<double> x(ctx, N); 4 vex :: vector<double> y(ctx, N); 5 6 x = vex::element index() ∗ (1.0 / N); 7 y = sin(2 ∗ x) + sqrt(1 − x ∗ x); x y

:: Filter :: DoublePrecision ); 2 3 vex :: vector<double> x(ctx, N); 4 vex :: vector<double> y(ctx, N); 5 6 x = vex::element index() ∗ (1.0 / N); 7 y = sin(2 ∗ x) + sqrt(1 − x ∗ x); x y

What vector expressions are supported? All vectors in an expression
have to be compatible: Have same size Located on same devices What may be used: Vectors, scalars, constants Arithmetic, logical operators Built-in OpenCL functions User-deﬁned functions Random number generators Slicing and permutations Reduce to a scalar (sum, min, max) Reduce across chosen dimensions Stencil operations Sparse matrix – vector products Fast Fourier Transform Sort, scan, reduce by key

A more involved example: Monte Carlo π Compute approximate value
of π: area of circle area of square = πr2 (2r)2 = π 4 , π = 4 area of circle area of square ≈ 4 points in circle all points 0.0 0.5 1.0 0.5 1.0 1 vex :: Random<cl double2> rnd; 2 vex :: Reductor<size t, vex :: SUM> sum(ctx); 3 4 double pi = 4.0 ∗ sum( length( rnd(vex :: element index(0, n), seed) ) < 1 ) / n;

OpenCL/CUDA code is generated at runtime This expression: 1 x
= 2 ∗ y − sin(z); when compiled with: g++ −DVEXCL BACKEND OPENCL −lOpenCL ... generates, compiles, and launches the following kernel at runtime: 1 kernel void vexcl vector kernel 2 ( 3 ulong n, 4 global double ∗ prm 1, // x 5 int prm 2, // 2 6 global double ∗ prm 3, // y 7 global double ∗ prm 4 // z 8 ) 9 { 10 for(ulong idx = get global id(0); idx < n; idx += get global size(0)) 11 { 12 prm 1[idx] = ( ( prm 2 ∗ prm 3[idx] ) − sin( prm 4[idx] ) ); 13 } 14 }

OpenCL/CUDA code is generated at runtime This expression: 1 x
= 2 ∗ y − sin(z); when compiled with: g++ −DVEXCL BACKEND CUDA −lcuda ... generates, compiles, and launches the following kernel at runtime: 1 extern ”C” global void vexcl vector kernel ( 2 ulong n, 3 double ∗ prm 1, // x 4 int prm 2, // 2 5 double ∗ prm 3, // y 6 double ∗ prm 4 // z 7 ) 8 { 9 for(ulong idx = blockDim.x ∗ blockIdx.x + threadIdx.x, grid size = blockDim.x ∗ gridDim.x; 10 idx < n; idx += grid size) 11 { 12 prm 1[idx] = ( ( prm 2 ∗ prm 3[idx] ) − sin( prm 4[idx] ) ); 13 } 14 }

How does it work?

Expression templates

Expression templates How to implement a DSL in C++ eﬀectively?
The idea is quite old: 1995: Todd Veldhuizen, Expression templates, C++ Report. 1996: Blitz++ is a C++ class library for scientiﬁc computing which provides performance on par with Fortran 77/90. Today: Armadillo, Blaze, Boost.uBLAS, Eigen, MTL, etc. But how does it work?

Simple example: Vector arithmetic We want to be able to
write: 1 x = y + z; and it has to be as eﬀective as: 1 for( size t i = 0; i < n; ++i) 2 x[ i ] = y[i ] + z[i ];

C++ allows us to overload operators! 1 vector operator+(const vector
&a, const vector &b) { 2 vector tmp( a. size () ); 3 for( size t i = 0; i < a.size (); ++i) 4 tmp[i] = a[i] + b[i ]; 5 return tmp; 6 } Any problems?

C++ allows us to overload operators! 1 vector operator+(const vector
&a, const vector &b) { 2 vector tmp( a. size () ); 3 for( size t i = 0; i < a.size (); ++i) 4 tmp[i] = a[i] + b[i ]; 5 return tmp; 6 } Any problems? Extra memory allocation Extra memory I/O 1 a = x + y + z; 2 temporary vectors 8 × n memory reads/writes 1 for( size t i = 0; i < n; ++i) 2 a[ i ] = x[i ] + y[i ] + z[i ]; no temporaries 4 × n memory reads/writes

Lazy evaluation v0.1 The idea: postpone the actual evaluation until
assignment.

assignment. 1 struct vsum { 2 const vector &lhs; 3 const vector &rhs; 4 };

assignment. 1 struct vsum { 2 const vector &lhs; 3 const vector &rhs; 4 }; 5 6 vsum operator+(const vector &a, const vector &b) { 7 return vsum{a, b}; 8 }

assignment. 1 struct vsum { 2 const vector &lhs; 3 const vector &rhs; 4 }; 5 6 vsum operator+(const vector &a, const vector &b) { 7 return vsum{a, b}; 8 } 9 10 const vector& vector :: operator=(const vsum &s) { 11 for( size t i = 0; i < n; ++i) 12 data[ i ] = s.lhs [ i ] + s.rhs[ i ]; 13 return ∗this ; 14 }

Not general enough What will happen here? 1 a =
x + y + z;

Not general enough What will happen here? 1 a =
x + y + z; lazy_v1.cpp:38:15: error: invalid operands to binary expression (’vsum’ and ’vector’) a = x + y + z; ~~~~~ ^ ~ lazy_v1.cpp:12:12: note: candidate function not viable: no known conversion from ’vsum’ to ’const vector’ for 1st argument vsum operator+(const vector &a, const vector &b) { ^ 1 error generated.

Lazy evaluation v0.2 1 template <class LHS, class RHS> 2
struct vsum { 3 const LHS &lhs; 4 const RHS &rhs; 5 6 double operator[ ]( size t i) const { return lhs [ i ] + rhs[i ]; } 7 };

struct vsum { 3 const LHS &lhs; 4 const RHS &rhs; 5 6 double operator[ ]( size t i) const { return lhs [ i ] + rhs[i ]; } 7 }; 8 9 template <class LHS, class RHS> 10 vsum<LHS, RHS> operator+(const LHS &a, const RHS &b) { 11 return vsum<LHS, RHS>{a, b}; 12 }

struct vsum { 3 const LHS &lhs; 4 const RHS &rhs; 5 6 double operator[ ]( size t i) const { return lhs [ i ] + rhs[i ]; } 7 }; 8 9 template <class LHS, class RHS> 10 vsum<LHS, RHS> operator+(const LHS &a, const RHS &b) { 11 return vsum<LHS, RHS>{a, b}; 12 } 13 14 template<class Expr> 15 const vector& vector :: operator=(const Expr &expr) { 16 for(int i = 0; i < n; ++i) data[i] = expr[i ]; 17 return ∗this ; 18 }

Still not general enough There are times in life when
addition alone is not enough

Lazy evaluation v0.3 1 struct plus { 2 static double
apply(double a, double b) { return a + b; } 3 };

apply(double a, double b) { return a + b; } 3 }; 4 5 template <class LHS, class OP, class RHS> 6 struct binary op { 7 const LHS &lhs; 8 const RHS &rhs; 9 10 double operator[ ]( size t i) const { return OP::apply(lhs [ i ], rhs[ i ]); } 11 };

apply(double a, double b) { return a + b; } 3 }; 4 5 template <class LHS, class OP, class RHS> 6 struct binary op { 7 const LHS &lhs; 8 const RHS &rhs; 9 10 double operator[ ]( size t i) const { return OP::apply(lhs [ i ], rhs[ i ]); } 11 }; 12 13 template <class LHS, class RHS> 14 binary op<LHS, plus, RHS> operator+(const LHS &a, const RHS &b) { 15 return binary op<LHS, plus, RHS>{a, b}; 16 }

Expression templates are trees The expression in the RHS of:
1 a = x + y; ... is of type: binary op< vector , plus , vector > + x y

1 a = x + y − z; ... is of type: binary op< binary op< vector , plus , vector > , minus , vector > − z + x y

1 a = x + y − z; ... is of type: binary op< binary op< vector , plus , vector > , minus , vector > − z + x y for( size t i = 0; i < n; ++i) a[ i ] = [ i ]

1 a = x + y − z; ... is of type: binary op< binary op< vector , plus , vector > , minus , vector > − z + x y for( size t i = 0; i < n; ++i) a[ i ] = [ i ] [ i ]

1 a = x + y − z; ... is of type: binary op< binary op< vector , plus , vector > , minus , vector > − z + x y for( size t i = 0; i < n; ++i) a[ i ] = [ i ] [ i ] [ i ]

1 a = x + y − z; ... is of type: binary op< binary op< vector , plus , vector > , minus , vector > − z + x y #pragma omp parallel for for( size t i = 0; i < n; ++i) a[ i ] = [ i ] [ i ] [ i ]

1 a = x + y − z; ... is of type: binary op< binary op< vector , plus , vector > , minus , vector > − z + x y #pragma omp parallel for for( size t i = 0; i < n; ++i) a[ i ] = [ i ] [ i ] [ i ] The C++ compiler walks the tree. binary op :: operator[ ] does the actual work.

So far, so good It is now possible to: 1
v = a ∗ x + b ∗ y; 2 3 double c = (x + y)[42]; ... and its as eﬀective as: 1 for( size t i = 0; i < n; ++i) 2 v[ i ] = a[i] ∗ x[ i ] + b[i] ∗ y[ i ]; 3 4 double c = x[42] + y[42]; No temporaries involved. Optimizing compiler is able to inline everything.

So far, so good It is now possible to: 1
v = a ∗ x + b ∗ y; 2 3 double c = (x + y)[42]; ... and its as eﬀective as: 1 for( size t i = 0; i < n; ++i) 2 v[ i ] = a[i] ∗ x[ i ] + b[i] ∗ y[ i ]; 3 4 double c = x[42] + y[42]; No temporaries involved. Optimizing compiler is able to inline everything. But how is that related to OpenCL code generation?

OpenCL code generation

How does OpenCL work? 1 A compute kernel is compiled
at runtime from C99 source. 2 Kernel parameters are set through API calls. 3 Kernel is launched on a compute device.

How does OpenCL work? 1 A compute kernel is compiled
at runtime from C99 source. 2 Kernel parameters are set through API calls. 3 Kernel is launched on a compute device. Source may be read from a ﬁle, or stored in a static string, or generated.

Generating kernel source from an expression We want this expression:
1 a = x + y − z; . . . to result in this kernel: 1 kernel void vexcl vector kernel ( 2 ulong n, 3 global double ∗ res , 4 global double ∗ prm1, 5 global double ∗ prm2, 6 global double ∗ prm3 7 ) 8 { 9 for( size t idx = get global id(0); idx < n; idx += get global size(0)) { 10 res [ idx ] = ( ( prm1[idx] + prm2[idx] ) − prm3[idx] ); 11 } 12 } − z + x y

Declaring parameters Each terminal knows what parameters it needs: 1
/∗static∗/ void vector :: declare params(std :: ostream &src, unsigned &pos) { 2 src << ”,\n global double ∗ prm” << ++pos; 3 } An expression just asks its terminals to do the work: 4 template <class LHS, class OP, class RHS> 5 /∗static∗/ void binary op<LHS, OP, RHS>::declare params( 6 std :: ostream &src, unsigned &pos) 7 { 8 LHS::declare params(src , pos); 9 RHS::declare params(src, pos); 10 }

Building string representation for expression 1 struct plus { 2
static std :: string to string (std :: ostream &src) { src << ” + ”; } 3 };

static std :: string to string (std :: ostream &src) { src << ” + ”; } 3 }; 4 5 /∗static∗/ void vector :: to string (std :: ostream &src, unsigned &pos) { 6 src << ”prm” << ++pos << ”[idx]”; 7 }

static std :: string to string (std :: ostream &src) { src << ” + ”; } 3 }; 4 5 /∗static∗/ void vector :: to string (std :: ostream &src, unsigned &pos) { 6 src << ”prm” << ++pos << ”[idx]”; 7 } 8 9 template <class LHS, class OP, class RHS> 10 /∗static∗/ void binary op<LHS, OP, RHS>::to string(std::ostream &src, unsigned &pos) { 11 src << ”( ”; 12 LHS:: to string (src , pos); 13 OP:: to string (src ); 14 RHS::to string (src , pos); 15 src << ” )”; 16 }

Generating kernel source 1 template <class LHS, class RHS> 2
std :: string kernel source () { 3 std :: ostringstream src ; 4 5 src << ”kernel void vexcl vector kernel (\n ulong n”; 6 unsigned pos = 0; 7 LHS::declare params(src , pos); 8 RHS::declare params(src, pos); 9 src << ”)\n{\n” 10 ” for( size t idx = get global id (0); idx < n; idx += get global size (0)) {\n” 11 ” ”; 12 pos = 0; 13 LHS:: to string (src , pos); src << ” = ”; 14 RHS::to string (src , pos); src << ”;\n”; 15 src << ” }\n}\n”; 16 17 return src . str (); 18 }

Setting kernel arguments 1 void vector :: set args (cl
:: Kernel &krn, unsigned &pos) { 2 krn.setArg(pos++, buﬀer); 3 } 4 5 template <class LHS, class OP, class RHS> 6 void binary op<LHS, OP, RHS>::set args(cl::Kernel &krn, unsigned &pos) { 7 lhs . set args (krn, pos); 8 rhs. set args (krn, pos); 9 } The methods are not static anymore!

Gluing it all together 1 template <class Expr> 2 const
vector& vector :: operator=(const Expr &expr) { 3 static cl :: Kernel krn = build kernel (device , kernel source <This, Expr>()); 4 5 unsigned pos = 0; 6 7 krn.setArg(pos++, size); // n 8 krn.setArg(pos++, buffer); // result 9 expr. set args (krn, pos); // other parameters 10 11 queue.enqueueNDRangeKernel(krn, cl::NullRange, buffer . size (), cl :: NullRange); 12 13 return ∗this ; 14 } Kernel is generated and compiled once, applied many times: Each kernel is uniquely identified by its expression type. Hence, we may use local static variable to cache the kernel.

What you saw is not what you get The actual
implementation is a bit more complicated: There are unary, binary, n-ary expressions. There are special terminals requiring either global or local preambles. There are builtin and user-deﬁned functions. . . . Boost.Proto is used to keep the code mess contained.

Boost.Proto Boost.Proto is a framework for building Embedded Domain-Speciﬁc Languages
in C++. It provides tools for constructing, type-checking, transforming and executing expression templates. Hello Proto 1 #include <iostream> 2 #include <boost/proto/proto.hpp> 3 namespace proto = boost::proto; 4 5 int main() { 6 auto expr = proto:: as expr(6) ∗ 7 − 2; 7 proto :: display expr ( expr ); 8 9 proto :: default context ctx; 10 std :: cout << proto::eval( expr, ctx ) << std::endl; 11 } Output: minus( multiplies( terminal(6) , terminal(7) ) , terminal(2) ) 40

Creating custom Boost.Proto context 1. Deﬁne custom Proto context: 1
struct print expression context { 2 template <typename Expr, typename Tag = typename Expr::proto tag> struct eval {}; 3 4 template <typename Expr> 5 struct eval<Expr, proto::tag :: plus> { 6 typedef void result type ; 7 void operator()(const Expr &expr, print expression context &ctx) const { 8 cout << ”( ”; 9 proto :: eval( proto :: left (expr ), ctx ); 10 cout << ” + ”; 11 proto :: eval( proto :: right (expr ), ctx ); 12 cout << ” )”; 13 } 14 }; 15 16 template <typename Expr> 17 struct eval<Expr, proto::tag :: terminal> { 18 typedef void result type ; 19 void operator()(const Expr &expr, print expression context &ctx) const { 20 cout << proto::eval( expr, proto :: default context () ); 21 } 22 }; 23 };

Creating custom Boost.Proto context 2. Use the context: 1 auto
expr = proto:: as expr(6) ∗ 7 − 2; 2 3 print expression context ctx; 4 proto :: eval( expr, ctx ); Output: ( ( 6 * 7 ) + 5 )

Generating code with Boost.Proto 1 template <class Term> struct partial
vector expr { 2 static void get(std :: ostream &os, unsigned &pos) { os << ”prm ” << ++pos; } 3 }; 4 5 struct print expression context { 6 // ... 7 unsigned pos = 0; 8 9 template <typename Expr> 10 struct eval<Expr, proto::tag :: terminal> { 11 typedef void result type ; 12 void operator()(const Expr &expr, print expression context &ctx) const { 13 partial vector expr <typename proto::result of :: value<Expr>::type>::get(cout, ctx.pos); 14 } 15 }; 16 }; 1 proto :: eval( proto :: as expr(6) ∗ 7 − 2, ctx ); ( ( prm_1 * prm_2 ) + prm_3 )

Almost there. . . The code is very close to
the one actually used in VexCL We can leave deﬁnition of print expression context intact. Only partial vector expr needs to be specialized for each new terminal type. We will need several contexts similar to print expression context : Declaring kernel parameters, Setting kernel arguments, . . .

General structure of OpenCL code VexCL code: 1 VEX FUNCTION(double,
sqr, (double, x)(double, y), 2 return x ∗ x + y ∗ y; 3 ); 4 auto tmp = vex::make temp<1>(x); 5 y = sqr(sin(tmp), cos(tmp)); OpenCL code: 1 double sqr(double x, double y) { return x ∗ x + y ∗ y; } 2 kernel void vexcl vector kernel (ulong n, 3 global double ∗ prm 1, 4 global double ∗ prm 2 1 5 ) 6 { 7 for(ulong idx = get global id(0); 8 idx < n; idx += get global size(0)) 9 { 10 double temp 1 = prm 2 1[idx]; 11 prm 1[idx] = sqr( sin( temp 1 ), cos( temp 1 ) ); 12 } 13 } Each terminal should tell VexCL how to use it: Does the terminal need global preamble? What kernel parameters does it need, and how to set them? Does it need local preamble? What contribution does it make to the expression string? This is done by specializing a number of templated structs.

Example: element index <vexcl/element index.hpp> 1 x = sin( 2
∗ M PI ∗ vex::element index() / n ); 1 namespace vex { 2 struct elem index { 3 size t offset ; 4 }; 5 6 inline auto element index( size t offset = 0) { 7 return boost:: proto :: as expr<vector domain>( elem index{offset} ); 8 } 9 10 namespace traits { 11 template <> struct is vector expr terminal < elem index > : std :: true type {}; 12 } 13 }

Example: element index <vexcl/element index.hpp> 1 x = sin( 2
∗ M PI ∗ vex::element index() / n ); 1 namespace vex { 2 struct elem index { 3 size t offset ; 4 }; 5 6 inline boost:: proto :: result of :: as expr<elem index, vector domain>::type 7 element index( size t offset = 0) { 8 return boost:: proto :: as expr<vector domain>( elem index{offset} ); 9 } 10 11 namespace traits { 12 template <> struct is vector expr terminal < elem index > : std :: true type {}; 13 } 14 }

Parameter declaration 1 template <> 2 struct kernel param declaration
< elem index > 3 { 4 static std :: string get( 5 const elem index&, const cl :: Device&, const std :: string &prm name, 6 detail :: kernel generator state ptr 7 ) 8 { 9 std :: ostringstream s; 10 s << ”,\n\t” << type name<size t>() << ” ” << prm name; 11 return s. str (); 12 } 13 };

Contribution to expression string 1 template <> 2 struct partial
vector expr < elem index > 3 { 4 static std :: string get( 5 const elem index&, const cl :: Device&, const std :: string &prm name, 6 detail :: kernel generator state ptr 7 ) 8 { 9 std :: ostringstream s; 10 s << ”(” << prm name << ” + idx)”; 11 return s. str (); 12 } 13 };

Setting kernel arguments 1 template <> 2 struct kernel arg
setter < elem index > 3 { 4 static void set( 5 const elem index &term, cl :: Kernel &kernel, unsigned /∗device∗/, 6 size t index offset , unsigned &position, 7 detail :: kernel generator state ptr 8 ) 9 { 10 kernel.setArg( position ++, term.offset + index offset ); 11 } 12 };

And we are done! vex :: element index() is now
fully deﬁned: We may use it in vector expressions of arbitrary complexity. Also, we have an expression engine (a set of proto contexts): Adding new terminal types is straightforward.

Other uses for expression templates in VexCL

Device ﬁlters 1 vex :: Context ctx( 2 vex ::
Filter :: Env && vex::Filter :: DoublePrecision 3 ); Mini-DSL inside VexCL Terminals are functors Only logical operators are provided

Symbolic variables An instance of vex :: symbolic<> dumps any
arithmetic operations to a std :: ostream: 1 vex :: generator :: set recorder ( std :: cout ); 2 vex :: symbolic<double> x, y = 6; 3 x = sin(y ∗ 7); double var1; double var2 = 6; var1 = sin( var2 * 7 ); We may feed symbolic variables to a generic algorithm and obtain its string representation. Generate VexCL function: 1 auto sqr = vex::generator :: make function<double(double, double)>( 2 [ ](const auto &x, const auto &y){ return x ∗ x + y ∗ y; } ); 3 x = sqr( sin(y), cos(z) ); Generate eﬀective compute kernel from a generic algorithm.

Conclusion Advantages of code generation: You get the exact kernel
you need. Flexibility of generic C++ code with C99 backend. There is no need to use vendor-speciﬁc solutions. Performance! 0 100 200 300 Compute time (sec) cuBLAS Thrust VexCL VexCL (symbolic) Integrate large number of ODEs on a GPU with Boost.odeint Fork m e on G itH ub

Benchmarking performance

Parameter study for the Lorenz attractor system Lorenz attractor system
˙ x = −σ (x − y) , ˙ y = Rx − y − xz, ˙ z = −bz + xy. Solve large number of Lorenz systems, each for a diﬀerent value of R. Use Boost.odeint. 20 10 0 10 20 20 0 20 0 10 20 30 40 50 Lorenz attractor trajectory

CUBLAS implementation CUBLAS is a highly optimized BLAS implementation from
NVIDIA. Its disadvantage is that it has ﬁxed number of kernels/functions. Hence, linear combinations (used internally by odeint): x0 = α1x1 + α2x2 + · · · + αnxn are implemented as: cublasDset (...); // x0 = 0 cublasDaxpy (...); // x0 = x0 + α1 ∗ x1 ... cublasDaxpy (...); // x0 = x0 + αn ∗ xn

Thrust implementation It is possible to fuse linear combination kernels
with Thrust: Thrust 1 struct scale sum2 { 2 const double a1, a2; 3 scale sum2(double a1, double a2) : a1(a1), a2(a2) { } 4 template<class Tuple> 5 host device void operator()(Tuple t) const { 6 thrust :: get<0>(t) = a1 ∗ thrust::get<1>(t) + a2 ∗ thrust::get<2>(t); 7 } 8 }; 9 10 thrust :: for each ( 11 thrust :: make zip iterator ( 12 thrust :: make tuple( x0.begin (), x1.begin (), x2.begin() ) 13 ), 14 thrust :: make zip iterator ( 15 thrust :: make tuple( x0.end(), x1.end(), x2.end() ) 16 ), 17 scale sum2(a1, a2) 18 );

Thrust implementation It is possible to fuse linear combination kernels
with Thrust: Thrust 1 struct scale sum2 { 2 const double a1, a2; 3 scale sum2(double a1, double a2) : a1(a1), a2(a2) { } 4 template<class Tuple> 5 host device void operator()(Tuple t) const { 6 thrust :: get<0>(t) = a1 ∗ thrust::get<1>(t) + a2 ∗ thrust::get<2>(t); 7 } 8 }; 9 10 thrust :: for each ( 11 thrust :: make zip iterator ( 12 thrust :: make tuple( x0.begin (), x1.begin (), x2.begin() ) 13 ), 14 thrust :: make zip iterator ( 15 thrust :: make tuple( x0.end(), x1.end(), x2.end() ) 16 ), 17 scale sum2(a1, a2) 18 ); VexCL 1 x0 = a1 ∗ x1 + a2 ∗ x2;

VexCL symbolic implementation Pass VexCL symbolic variables to a Boost.odeint
algorithm. Make single integration step. Generate monolithic kernel that does one full time step. Intermediate variables are no longer stored in global memory.

Conclusion Advantages of code generation: You get the exact kernel
you need. Flexibility of generic C++ code with C99 backend. There is no need to use vendor-speciﬁc solutions. Performance! 0 100 200 300 Compute time (sec) cuBLAS Thrust VexCL VexCL (symbolic) Integrate large number of ODEs on a GPU with Boost.odeint Fork m e on G itH ub

VexCL implementation, Meeting C++ 2014

VexCL implementation, Meeting C++ 2014

More Decks by Denis Demidov

Other Decks in Programming

Featured

Transcript