Slide 1

Slide 1 text

VexCL: OpenCL/CUDA source code generation from C++ expressions. Denis Demidov, Supercomputer Center of the Russian Academy of Sciences. Meeting C++, December 2014

Slide 3

Slide 3 text

Modern GPGPU frameworks. CUDA: proprietary architecture by NVIDIA; requires NVIDIA hardware; more mature, with many libraries; kernels are written in C++; kernels are compiled to PTX together with the host program. OpenCL: open standard; supports a wide range of hardware; code is much more verbose; kernels are written in C99; kernels are compiled at runtime, adding an initialization overhead. The latter distinction is usually considered an OpenCL drawback, but it also allows us to generate more efficient kernels at runtime! VexCL takes care of this part.

Slide 4

Slide 4 text

VexCL — a vector expression template library for OpenCL/CUDA. Created for ease of C++-based GPGPU development: convenient notation for vector expressions; OpenCL/CUDA JIT code generation; easily combined with existing libraries/code; header-only. The source code is publicly available under the MIT license: https://github.com/ddemidov/vexcl

Slide 5

Slide 5 text

VexCL interface

Slide 6

Slide 6 text

Hello VexCL: vector sum

    #include <iostream>
    #include <vector>
    #include <vexcl/vexcl.hpp>

    int main() {
        const size_t n = 1024 * 1024;

        // Get compute devices supporting double precision:
        vex::Context ctx(vex::Filter::DoublePrecision);
        if (!ctx) throw std::runtime_error("No compute devices available");

        // Prepare input data, transfer it to the device(s):
        std::vector<double> a(n, 1), b(n, 2), c(n);
        vex::vector<double> A(ctx, a), B(ctx, b), C(ctx, n);

        // Launch compute kernel:
        C = A + B;

        // Get result back to host:
        vex::copy(C, c);
        std::cout << c[42] << std::endl;
    }
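
A typical build line for this example (a sketch: the OpenCL backend is the default, and depending on the VexCL version Boost libraries may also be needed on the link line):

    g++ -std=c++11 hello.cpp -o hello -lOpenCL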

Slide 12

Slide 12 text

Initialization. Multi-device and multi-platform computations are supported. A VexCL context is initialized from a combination of device filters. A device filter is a boolean functor acting on a const cl::Device&. Initialize a VexCL context on selected devices:

    vex::Context ctx( vex::Filter::Any );

Slide 13

Slide 13 text

Initialization. Multi-device and multi-platform computations are supported. A VexCL context is initialized from a combination of device filters. A device filter is a boolean functor acting on a const cl::Device&. Initialize a VexCL context on selected devices:

    vex::Context ctx( vex::Filter::Type(CL_DEVICE_TYPE_GPU) );

Slide 14

Slide 14 text

Initialization. Multi-device and multi-platform computations are supported. A VexCL context is initialized from a combination of device filters. A device filter is a boolean functor acting on a const cl::Device&. Initialize a VexCL context on selected devices:

    vex::Context ctx(
        vex::Filter::Type(CL_DEVICE_TYPE_ACCELERATOR) &&
        vex::Filter::Platform("Intel")
    );

Slide 15

Slide 15 text

Initialization. Multi-device and multi-platform computations are supported. A VexCL context is initialized from a combination of device filters. A device filter is a boolean functor acting on a const cl::Device&. Initialize a VexCL context on selected devices:

    vex::Context ctx(
        vex::Filter::DoublePrecision &&
        [](const cl::Device &d) {
            return d.getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>() >= 16_GB;
        });

Slide 16

Slide 16 text

Memory and work splitting

    vex::Context ctx( vex::Filter::Name("Tesla") );

    vex::vector<double> x(ctx, N);
    vex::vector<double> y(ctx, N);

    x = vex::element_index() * (1.0 / N);
    y = sin(2 * x) + sqrt(1 - x * x);

[figure: memory layout of x and y on the selected device]

Slide 17

Slide 17 text

Memory and work splitting

    vex::Context ctx( vex::Filter::Type(CL_DEVICE_TYPE_GPU) );

    vex::vector<double> x(ctx, N);
    vex::vector<double> y(ctx, N);

    x = vex::element_index() * (1.0 / N);
    y = sin(2 * x) + sqrt(1 - x * x);

[figure: x and y partitioned across the selected GPUs]

Slide 18

Slide 18 text

Memory and work splitting

    vex::Context ctx( vex::Filter::DoublePrecision );

    vex::vector<double> x(ctx, N);
    vex::vector<double> y(ctx, N);

    x = vex::element_index() * (1.0 / N);
    y = sin(2 * x) + sqrt(1 - x * x);

[figure: x and y partitioned across all selected devices]

Slide 19

Slide 19 text

What vector expressions are supported? All vectors in an expression have to be compatible: they must have the same size and be located on the same devices. What may be used: vectors, scalars, constants; arithmetic and logical operators; built-in OpenCL functions; user-defined functions; random number generators; slicing and permutations; reduction to a scalar (sum, min, max); reduction across chosen dimensions; stencil operations; sparse matrix-vector products; fast Fourier transform; sort, scan, reduce by key.
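
A sketch combining a few of these features (assuming the documented VexCL API; vex::Reductor and vex::sort come with the library's headers):

    #include <vexcl/vexcl.hpp>

    void demo(vex::Context &ctx, size_t n) {
        vex::vector<double> x(ctx, n), y(ctx, n);

        x = vex::element_index() * (1.0 / n);   // 0, 1/n, 2/n, ...
        y = 2 * sin(x) + 1;                     // built-in functions, scalars

        vex::Reductor<double, vex::SUM> sum(ctx);
        double dot = sum(x * y);                // reduce an expression to a scalar
        (void)dot;

        vex::sort(y);                           // sort a device vector
    }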

Slide 20

Slide 20 text

A more involved example: Monte Carlo π. Compute an approximate value of π: area of circle / area of square = πr² / (2r)² = π/4, hence π = 4 · (area of circle) / (area of square) ≈ 4 · (points in circle) / (all points).

[figure: random points in the unit square, those inside the circle highlighted]

    vex::Random<cl_double2> rnd;
    vex::Reductor<size_t, vex::SUM> sum(ctx);

    double pi = 4.0 * sum( length( rnd(vex::element_index(0, n), seed) ) < 1 ) / n;
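
For comparison, a minimal host-side version of the same estimate (plain C++, illustration only):

    #include <cstddef>
    #include <random>

    double host_pi(std::size_t n) {
        std::mt19937_64 rng(42);
        std::uniform_real_distribution<double> u(0, 1);
        std::size_t in_circle = 0;
        for(std::size_t i = 0; i < n; ++i) {
            double x = u(rng), y = u(rng);
            if (x * x + y * y < 1) ++in_circle; // point falls inside the quarter circle
        }
        return 4.0 * in_circle / n;
    }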

Slide 21

Slide 21 text

OpenCL/CUDA code is generated at runtime. This expression:

    x = 2 * y - sin(z);

when compiled with:

    g++ -DVEXCL_BACKEND_OPENCL -lOpenCL ...

generates, compiles, and launches the following kernel at runtime:

    kernel void vexcl_vector_kernel
    (
        ulong n,
        global double * prm_1, // x
        int prm_2,             // 2
        global double * prm_3, // y
        global double * prm_4  // z
    )
    {
        for(ulong idx = get_global_id(0); idx < n; idx += get_global_size(0))
        {
            prm_1[idx] = ( ( prm_2 * prm_3[idx] ) - sin( prm_4[idx] ) );
        }
    }

Slide 22

Slide 22 text

OpenCL/CUDA code is generated at runtime. This expression:

    x = 2 * y - sin(z);

when compiled with:

    g++ -DVEXCL_BACKEND_CUDA -lcuda ...

generates, compiles, and launches the following kernel at runtime:

    extern "C" __global__ void vexcl_vector_kernel(
        ulong n,
        double * prm_1, // x
        int prm_2,      // 2
        double * prm_3, // y
        double * prm_4  // z
    )
    {
        for(ulong idx = blockDim.x * blockIdx.x + threadIdx.x,
                  grid_size = blockDim.x * gridDim.x;
            idx < n; idx += grid_size)
        {
            prm_1[idx] = ( ( prm_2 * prm_3[idx] ) - sin( prm_4[idx] ) );
        }
    }

Slide 23

Slide 23 text

How does it work?

Slide 24

Slide 24 text

Expression templates

Slide 25

Slide 25 text

Expression templates. How can one implement a DSL in C++ efficiently? The idea is quite old. 1995: Todd Veldhuizen, "Expression templates", C++ Report. 1996: Blitz++, a C++ class library for scientific computing which provides performance on par with Fortran 77/90. Today: Armadillo, Blaze, Boost.uBLAS, Eigen, MTL, etc. But how does it work?

Slide 26

Slide 26 text

Simple example: vector arithmetic. We want to be able to write:

    x = y + z;

and it has to be as efficient as:

    for(size_t i = 0; i < n; ++i)
        x[i] = y[i] + z[i];

Slide 28

Slide 28 text

C++ allows us to overload operators!

    vector operator+(const vector &a, const vector &b) {
        vector tmp( a.size() );
        for(size_t i = 0; i < a.size(); ++i)
            tmp[i] = a[i] + b[i];
        return tmp;
    }

Any problems? Extra memory allocation and extra memory I/O.

    a = x + y + z;

2 temporary vectors, 8 × n memory reads/writes: (x + y) reads 2n values and writes n into the first temporary, adding z reads 2n and writes n into the second, and the final assignment reads n and writes n.

    for(size_t i = 0; i < n; ++i)
        a[i] = x[i] + y[i] + z[i];

No temporaries, 4 × n memory reads/writes.

Slide 32

Slide 32 text

Lazy evaluation v0.1. The idea: postpone the actual evaluation until assignment.

    struct vsum {
        const vector &lhs;
        const vector &rhs;
    };

    vsum operator+(const vector &a, const vector &b) {
        return vsum{a, b};
    }

    const vector& vector::operator=(const vsum &s) {
        for(size_t i = 0; i < n; ++i)
            data[i] = s.lhs[i] + s.rhs[i];
        return *this;
    }

Slide 34

Slide 34 text

Not general enough. What will happen here?

    a = x + y + z;

    lazy_v1.cpp:38:15: error: invalid operands to binary expression
        ('vsum' and 'vector')
        a = x + y + z;
            ~~~~~ ^ ~
    lazy_v1.cpp:12:12: note: candidate function not viable: no known conversion
        from 'vsum' to 'const vector' for 1st argument
    vsum operator+(const vector &a, const vector &b) {
         ^
    1 error generated.

Slide 37

Slide 37 text

Lazy evaluation v0.2

    template <class LHS, class RHS>
    struct vsum {
        const LHS &lhs;
        const RHS &rhs;

        double operator[](size_t i) const { return lhs[i] + rhs[i]; }
    };

    template <class LHS, class RHS>
    vsum<LHS, RHS> operator+(const LHS &a, const RHS &b) {
        return vsum<LHS, RHS>{a, b};
    }

    template <class Expr>
    const vector& vector::operator=(const Expr &expr) {
        for(size_t i = 0; i < n; ++i) data[i] = expr[i];
        return *this;
    }
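
With the templated vsum, the expression that failed two slides ago now composes: the result of x + y is itself indexable, so it is a valid operand for the next operator+ (illustration):

    // type of (x + y + z): vsum<vsum<vector, vector>, vector>
    a = x + y + z;   // one loop in vector::operator=, no temporaries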

Slide 38

Slide 38 text

Still not general enough. There are times in life when addition alone is not enough.

Slide 41

Slide 41 text

Lazy evaluation v0.3

    struct plus {
        static double apply(double a, double b) { return a + b; }
    };

    template <class LHS, class OP, class RHS>
    struct binary_op {
        const LHS &lhs;
        const RHS &rhs;

        double operator[](size_t i) const { return OP::apply(lhs[i], rhs[i]); }
    };

    template <class LHS, class RHS>
    binary_op<LHS, plus, RHS> operator+(const LHS &a, const RHS &b) {
        return binary_op<LHS, plus, RHS>{a, b};
    }
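
Each further operator now takes only two small definitions; for example, the minus used on the next slides could look like this (a sketch following the same pattern):

    struct minus {
        static double apply(double a, double b) { return a - b; }
    };

    template <class LHS, class RHS>
    binary_op<LHS, minus, RHS> operator-(const LHS &a, const RHS &b) {
        return binary_op<LHS, minus, RHS>{a, b};
    }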

Slide 42

Slide 42 text

Expression templates are trees. The expression in the RHS of:

    a = x + y;

... is of type:

    binary_op< vector, plus, vector >

[figure: tree with + at the root and leaves x, y]

Slide 48

Slide 48 text

Expression templates are trees. The expression in the RHS of:

    a = x + y - z;

... is of type:

    binary_op< binary_op< vector, plus, vector >, minus, vector >

[figure: tree with − at the root, z as its right child, and a + node with leaves x, y on the left]

    #pragma omp parallel for
    for(size_t i = 0; i < n; ++i)
        a[i] = ( x[i] + y[i] ) - z[i];

The C++ compiler walks the tree. binary_op::operator[] does the actual work.

Slide 50

Slide 50 text

So far, so good. It is now possible to write:

    v = a * x + b * y;

    double c = (x + y)[42];

... and it is as efficient as:

    for(size_t i = 0; i < n; ++i)
        v[i] = a[i] * x[i] + b[i] * y[i];

    double c = x[42] + y[42];

No temporaries involved. An optimizing compiler is able to inline everything. But how is that related to OpenCL code generation?

Slide 51

Slide 51 text

OpenCL code generation

Slide 53

Slide 53 text

How does OpenCL work? 1. A compute kernel is compiled at runtime from C99 source. 2. Kernel parameters are set through API calls. 3. The kernel is launched on a compute device. The source may be read from a file, stored in a static string, or generated.
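
In terms of the raw OpenCL C API, the three steps look roughly like this (a sketch with error handling omitted; ctx, dev, queue, and buf are assumed to be set up already):

    // 1. Compile a kernel at runtime from source:
    const char *src =
        "kernel void f(global double *x) { x[get_global_id(0)] = 1; }";
    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel krn = clCreateKernel(prog, "f", &err);

    // 2. Set kernel parameters through API calls:
    clSetKernelArg(krn, 0, sizeof(cl_mem), &buf);

    // 3. Launch the kernel on a compute device:
    size_t gsize = 1024;
    clEnqueueNDRangeKernel(queue, krn, 1, NULL, &gsize, NULL, 0, NULL, NULL);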

Slide 54

Slide 54 text

Generating kernel source from an expression. We want this expression:

    a = x + y - z;

... to result in this kernel:

    kernel void vexcl_vector_kernel(
        ulong n,
        global double * res,
        global double * prm1,
        global double * prm2,
        global double * prm3
    )
    {
        for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0)) {
            res[idx] = ( ( prm1[idx] + prm2[idx] ) - prm3[idx] );
        }
    }

Slide 55

Slide 55 text

Declaring parameters. Each terminal knows what parameters it needs:

    /*static*/ void vector::declare_params(std::ostream &src, unsigned &pos) {
        src << ",\n global double * prm" << ++pos;
    }

An expression just asks its terminals to do the work:

    template <class LHS, class OP, class RHS>
    /*static*/ void binary_op<LHS, OP, RHS>::declare_params(
            std::ostream &src, unsigned &pos)
    {
        LHS::declare_params(src, pos);
        RHS::declare_params(src, pos);
    }

Slide 58

Slide 58 text

Building a string representation for an expression

    struct plus {
        static void to_string(std::ostream &src) { src << " + "; }
    };

    /*static*/ void vector::to_string(std::ostream &src, unsigned &pos) {
        src << "prm" << ++pos << "[idx]";
    }

    template <class LHS, class OP, class RHS>
    /*static*/ void binary_op<LHS, OP, RHS>::to_string(std::ostream &src, unsigned &pos) {
        src << "( ";
        LHS::to_string(src, pos);
        OP::to_string(src);
        RHS::to_string(src, pos);
        src << " )";
    }

Slide 59

Slide 59 text

Generating kernel source

    template <class LHS, class RHS>
    std::string kernel_source() {
        std::ostringstream src;

        src << "kernel void vexcl_vector_kernel(\n ulong n";
        unsigned pos = 0;
        LHS::declare_params(src, pos);
        RHS::declare_params(src, pos);
        src << ")\n{\n"
               " for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0)) {\n"
               " ";
        pos = 0;
        LHS::to_string(src, pos); src << " = ";
        RHS::to_string(src, pos); src << ";\n";
        src << " }\n}\n";

        return src.str();
    }
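
For the toy scheme above, instantiating kernel_source for a = x + y - z would emit something like the following (a sketch obtained by tracing the formatting calls by hand):

    kernel void vexcl_vector_kernel(
     ulong n,
     global double * prm1,
     global double * prm2,
     global double * prm3,
     global double * prm4)
    {
     for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0)) {
     prm1[idx] = ( ( prm2[idx] + prm3[idx] ) - prm4[idx] );
     }
    }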

Slide 62

Slide 62 text

Setting kernel arguments

    void vector::set_args(cl::Kernel &krn, unsigned &pos) {
        krn.setArg(pos++, buffer);
    }

    template <class LHS, class OP, class RHS>
    void binary_op<LHS, OP, RHS>::set_args(cl::Kernel &krn, unsigned &pos) {
        lhs.set_args(krn, pos);
        rhs.set_args(krn, pos);
    }

The methods are not static anymore!

Slide 63

Slide 63 text

Gluing it all together

    template <class Expr>
    const vector& vector::operator=(const Expr &expr) {
        static cl::Kernel krn = build_kernel(device, kernel_source<vector, Expr>());

        unsigned pos = 0;

        krn.setArg(pos++, size);   // n
        krn.setArg(pos++, buffer); // result
        expr.set_args(krn, pos);   // other parameters

        queue.enqueueNDRangeKernel(krn, cl::NullRange, buffer.size(), cl::NullRange);

        return *this;
    }

The kernel is generated and compiled once, applied many times: each kernel is uniquely identified by its expression type. Hence, we may use a local static variable to cache the kernel.

Slide 65

Slide 65 text

What you saw is not what you get. The actual implementation is a bit more complicated: there are unary, binary, and n-ary expressions; there are special terminals requiring either global or local preambles; there are built-in and user-defined functions; etc. Boost.Proto is used to keep the code mess contained.

Slide 66

Slide 66 text

Boost.Proto. Boost.Proto is a framework for building embedded domain-specific languages in C++. It provides tools for constructing, type-checking, transforming and executing expression templates. Hello Proto:

    #include <iostream>
    #include <boost/proto/proto.hpp>
    namespace proto = boost::proto;

    int main() {
        auto expr = proto::as_expr(6) * 7 - 2;
        proto::display_expr( expr );

        proto::default_context ctx;
        std::cout << proto::eval( expr, ctx ) << std::endl;
    }

Output:

    minus(
        multiplies(
            terminal(6)
          , terminal(7)
        )
      , terminal(2)
    )
    40

Slide 68

Slide 68 text

Creating a custom Boost.Proto context. 1. Define the custom Proto context:

    struct print_expression_context {
        // primary template; specializations for the remaining tags
        // (multiplies, minus, ...) follow the same pattern as plus below
        template <class Expr, class Tag = typename proto::tag_of<Expr>::type>
        struct eval {};

        template <class Expr>
        struct eval<Expr, proto::tag::plus> {
            typedef void result_type;
            void operator()(const Expr &expr, print_expression_context &ctx) const {
                cout << "( ";
                proto::eval( proto::left(expr), ctx );
                cout << " + ";
                proto::eval( proto::right(expr), ctx );
                cout << " )";
            }
        };

        template <class Expr>
        struct eval<Expr, proto::tag::terminal> {
            typedef void result_type;
            void operator()(const Expr &expr, print_expression_context &ctx) const {
                cout << proto::eval( expr, proto::default_context() );
            }
        };
    };

Slide 69

Slide 69 text

Creating a custom Boost.Proto context. 2. Use the context:

    auto expr = proto::as_expr(6) * 7 - 2;

    print_expression_context ctx;
    proto::eval( expr, ctx );

Output:

    ( ( 6 * 7 ) - 2 )

Slide 70

Slide 70 text

Generating code with Boost.Proto

    template <class T> struct partial_vector_expr {
        static void get(std::ostream &os, unsigned &pos) { os << "prm_" << ++pos; }
    };

    struct print_expression_context {
        // ...
        unsigned pos = 0;

        template <class Expr>
        struct eval<Expr, proto::tag::terminal> {
            typedef void result_type;
            void operator()(const Expr &expr, print_expression_context &ctx) const {
                partial_vector_expr<
                    typename proto::result_of::value<Expr>::type
                    >::get(cout, ctx.pos);
            }
        };
    };

    proto::eval( proto::as_expr(6) * 7 - 2, ctx );

    ( ( prm_1 * prm_2 ) - prm_3 )

Slide 73

Slide 73 text

Almost there... The code is very close to the one actually used in VexCL. We can leave the definition of print_expression_context intact; only partial_vector_expr needs to be specialized for each new terminal type. We will need several contexts similar to print_expression_context: declaring kernel parameters, setting kernel arguments, etc.

Slide 74

Slide 74 text

General structure of the OpenCL code. VexCL code:

    VEX_FUNCTION(double, sqr, (double, x)(double, y),
        return x * x + y * y;
    );
    auto tmp = vex::make_temp<1>(x);
    y = sqr(sin(tmp), cos(tmp));

OpenCL code:

    double sqr(double x, double y) { return x * x + y * y; }
    kernel void vexcl_vector_kernel(ulong n,
        global double * prm_1,
        global double * prm_2_1
    )
    {
        for(ulong idx = get_global_id(0);
            idx < n; idx += get_global_size(0))
        {
            double temp_1 = prm_2_1[idx];
            prm_1[idx] = sqr( sin( temp_1 ), cos( temp_1 ) );
        }
    }

Each terminal should tell VexCL how to use it: Does the terminal need a global preamble? What kernel parameters does it need, and how are they set? Does it need a local preamble? What contribution does it make to the expression string? This is done by specializing a number of templated structs.

Slide 80

Slide 80 text

Example: element_index

    x = sin( 2 * M_PI * vex::element_index() / n );

    namespace vex {
        struct elem_index {
            size_t offset;
        };

        inline auto element_index(size_t offset = 0) {
            return boost::proto::as_expr( elem_index{offset} );
        }

        namespace traits {
            template <> struct is_vector_expr_terminal< elem_index > : std::true_type {};
        }
    }

Slide 84

Slide 84 text

Example: element_index

    x = sin( 2 * M_PI * vex::element_index() / n );

    namespace vex {
        struct elem_index {
            size_t offset;
        };

        inline boost::proto::result_of::as_expr<elem_index>::type
        element_index(size_t offset = 0) {
            return boost::proto::as_expr( elem_index{offset} );
        }

        namespace traits {
            template <> struct is_vector_expr_terminal< elem_index > : std::true_type {};
        }
    }

Slide 85

Slide 85 text

Parameter declaration

    template <>
    struct kernel_param_declaration< elem_index >
    {
        static std::string get(
            const elem_index&, const cl::Device&, const std::string &prm_name,
            detail::kernel_generator_state_ptr
        )
        {
            std::ostringstream s;
            s << ",\n\t" << type_name<size_t>() << " " << prm_name;
            return s.str();
        }
    };

Slide 86

Slide 86 text

Contribution to the expression string

    template <>
    struct partial_vector_expr< elem_index >
    {
        static std::string get(
            const elem_index&, const cl::Device&, const std::string &prm_name,
            detail::kernel_generator_state_ptr
        )
        {
            std::ostringstream s;
            s << "(" << prm_name << " + idx)";
            return s.str();
        }
    };

Slide 87

Slide 87 text

Setting kernel arguments

    template <>
    struct kernel_arg_setter< elem_index >
    {
        static void set(
            const elem_index &term, cl::Kernel &kernel, unsigned /*device*/,
            size_t index_offset, unsigned &position,
            detail::kernel_generator_state_ptr
        )
        {
            kernel.setArg( position++, term.offset + index_offset );
        }
    };

Slide 88

Slide 88 text

And we are done! vex::element_index() is now fully defined: we may use it in vector expressions of arbitrary complexity. Also, we have an expression engine (a set of Proto contexts): adding new terminal types is straightforward.
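
For instance, the new terminal composes with everything shown earlier (illustration; x is assumed to be a vex::vector<double> of size n):

    // a sine wave plus a shifted linear ramp, fused into one kernel:
    x = sin( 2 * M_PI * vex::element_index() / n )
      + 0.1 * vex::element_index(n / 2) / n;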

Slide 89

Slide 89 text

Other uses for expression templates in VexCL

Slide 90

Slide 90 text

Device filters

    vex::Context ctx(
        vex::Filter::Env && vex::Filter::DoublePrecision
    );

A mini-DSL inside VexCL: terminals are functors; only logical operators are provided.
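
Because filters are just functors combined with logical operators, arbitrary predicates compose (a sketch; Filter::GPU and Filter::Platform are VexCL filters, the lambda is an ad-hoc predicate):

    vex::Context ctx(
        vex::Filter::GPU &&
        vex::Filter::Platform("NVIDIA") &&
        [](const cl::Device &d) {
            // ad-hoc predicate: require at least 4 compute units
            return d.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>() >= 4;
        });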

Slide 91

Slide 91 text

Symbolic variables. An instance of vex::symbolic<> dumps any arithmetic operations to a std::ostream:

    vex::generator::set_recorder( std::cout );
    vex::symbolic<double> x, y = 6;
    x = sin(y * 7);

    double var1;
    double var2 = 6;
    var1 = sin( var2 * 7 );

We may feed symbolic variables to a generic algorithm and obtain its string representation. Generate a VexCL function:

    auto sqr = vex::generator::make_function<double(double, double)>(
        [](const auto &x, const auto &y){ return x * x + y * y; } );
    x = sqr( sin(y), cos(z) );

Generate an effective compute kernel from a generic algorithm.
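
The same recording trick applied to a generic algorithm (a sketch; axpy is a hypothetical generic routine, and the exact recorded variable names are illustrative):

    // generic algorithm; knows nothing about VexCL:
    template <class T>
    void axpy(const T &a, const T &x, T &y) { y = a * x + y; }

    vex::generator::set_recorder( std::cout );
    vex::symbolic<double> a = 2, x = 3, y = 4;
    axpy(a, x, y);   // records roughly: var3 = ( var1 * var2 ) + var3;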

Slide 93

Slide 93 text

Benchmarking performance

Slide 94

Slide 94 text

Parameter study for the Lorenz attractor system. The Lorenz attractor system:

    ẋ = −σ (x − y),
    ẏ = R x − y − x z,
    ż = −b z + x y.

Solve a large number of Lorenz systems, each for a different value of R. Use Boost.odeint.

[figure: a Lorenz attractor trajectory]

Slide 95

Slide 95 text

CUBLAS implementation. CUBLAS is a highly optimized BLAS implementation from NVIDIA. Its disadvantage is that it provides a fixed set of kernels/functions. Hence, linear combinations (used internally by odeint):

    x0 = α1 x1 + α2 x2 + · · · + αn xn

are implemented as:

    cublasDset (...);  // x0 = 0
    cublasDaxpy(...);  // x0 = x0 + α1 * x1
    ...
    cublasDaxpy(...);  // x0 = x0 + αn * xn
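
Spelled out for the two-term combination x0 = a1*x1 + a2*x2 used on the next slide, the sequence looks roughly like this (cuBLAS v2 API; handle and device pointers are assumed; note that every call re-reads and re-writes x0 through global memory):

    double zero = 0;
    cublasDscal(handle, n, &zero, x0, 1);        // x0 = 0
    cublasDaxpy(handle, n, &a1, x1, 1, x0, 1);   // x0 += a1 * x1
    cublasDaxpy(handle, n, &a2, x2, 1, x0, 1);   // x0 += a2 * x2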

Slide 97

Slide 97 text

Thrust implementation. It is possible to fuse linear combination kernels with Thrust. Thrust:

    struct scale_sum2 {
        const double a1, a2;

        scale_sum2(double a1, double a2) : a1(a1), a2(a2) { }

        template <class Tuple>
        __host__ __device__ void operator()(Tuple t) const {
            thrust::get<0>(t) = a1 * thrust::get<1>(t) + a2 * thrust::get<2>(t);
        }
    };

    thrust::for_each(
        thrust::make_zip_iterator(
            thrust::make_tuple( x0.begin(), x1.begin(), x2.begin() )
        ),
        thrust::make_zip_iterator(
            thrust::make_tuple( x0.end(), x1.end(), x2.end() )
        ),
        scale_sum2(a1, a2)
    );

VexCL:

    x0 = a1 * x1 + a2 * x2;

Slide 98

Slide 98 text

VexCL symbolic implementation. Pass VexCL symbolic variables to a Boost.odeint algorithm. Make a single integration step. Generate a monolithic kernel that does one full time step. Intermediate variables are no longer stored in global memory.
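
A sketch of the idea for the Lorenz system (the glue to odeint is omitted; vex::symbolic and set_recorder are used as on the earlier slide, and sigma and b are plain doubles):

    std::ostringstream body;
    vex::generator::set_recorder( body );

    vex::symbolic<double> x, y, z, R;   // state and the swept parameter
    vex::symbolic<double> dx, dy, dz;   // derivatives

    // One evaluation of the right-hand side; every operation is
    // recorded into `body` and later pasted into a single kernel:
    dx = -sigma * (x - y);
    dy = R * x - y - x * z;
    dz = -b * z + x * y;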

Slide 99

Slide 99 text

Conclusion. Advantages of code generation: You get the exact kernel you need. Flexibility of generic C++ code with a C99 backend. There is no need to use vendor-specific solutions. Performance!

[figure: compute time (sec) for cuBLAS, Thrust, VexCL, and VexCL (symbolic); integrating a large number of ODEs on a GPU with Boost.odeint]