VexCL implementation, Meeting C++ 2014

A90f9e7d204e6a93aa60dec78e149c85?s=47 Denis Demidov
December 05, 2014

VexCL implementation, Meeting C++ 2014

Generating OpenCL/CUDA source code from C++ expressions in VexCL

http://www.youtube.com/watch?v=W8onyk2Rraw

VexCL is an opensource C++ vector expression template library for OpenCL/CUDA. It has been created for ease of GPGPU development with C++ and provides convenient and intuitive notation for linear algebra operations, vector arithmetic and various parallel primitives. This talk discusses C++ techniques that VexCL uses to effectively generate OpenCL/CUDA compute kernels from user expressions.

A90f9e7d204e6a93aa60dec78e149c85?s=128

Denis Demidov

December 05, 2014
Tweet

Transcript

  1. VexCL OpenCL/CUDA source code generation from C++ expressions Denis Demidov

    Supercomputer Center of Russian Academy of Sciences Meeting C++, December 2014
  2. Modern GPGPU frameworks CUDA Proprietary architecture by NVIDIA Requires NVIDIA

    hardware More mature, many libraries Kernels are written in C++ OpenCL Open standard Supports wide range of hardware Code is much more verbose Kernels are written in C99
  3. Modern GPGPU frameworks CUDA Proprietary architecture by NVIDIA Requires NVIDIA

    hardware More mature, many libraries Kernels are written in C++ Kernels are compiled to PTX together with host program OpenCL Open standard Supports wide range of hardware Code is much more verbose Kernels are written in C99 Kernels are compiled at runtime, adding an initialization overhead The latter distinction is usually considered to be an OpenCL drawback. But it also allows us to generate more efficient kernels at runtime! VexCL takes care of this part.
  4. VexCL — a vector expression template library for OpenCL/CUDA Created

    for ease of C++ based GPGPU development: Convenient notation for vector expressions OpenCL/CUDA JIT code generation Easily combined with existing libraries/code Header-only The source code is publicly available under MIT license: https://github.com/ddemidov/vexcl Fork m e on G itH ub
  5. VexCL interface

  6. Hello VexCL: vector sum 1 #include <iostream> 2 #include <vector>

    3 #include <vexcl/vexcl.hpp> 4 5 int main() { 6 const size t n = 1024 ∗ 1024; 7 8 // Get compute devices supporting double precision: 9 vex :: Context ctx(vex :: Filter :: DoublePrecision ); 10 if (!ctx) throw std:: runtime error (”No compute devices available”); 11 12 // Prepare input data, transfer it to the device(s): 13 std :: vector<double> a(n, 1), b(n, 2), c(n); 14 vex :: vector<double> A(ctx, a), B(ctx, b), C(ctx, n); 15 16 // Launch compute kernel: 17 C = A + B; 18 19 // Get result back to host: 20 vex :: copy(C, c); 21 std :: cout << c[42] << std::endl; 22 }
  7. Hello VexCL: vector sum 1 #include <iostream> 2 #include <vector>

    3 #include <vexcl/vexcl.hpp> 4 5 int main() { 6 const size t n = 1024 ∗ 1024; 7 8 // Get compute devices supporting double precision: 9 vex :: Context ctx(vex :: Filter :: DoublePrecision ); 10 if (!ctx) throw std:: runtime error (”No compute devices available”); 11 12 // Prepare input data, transfer it to the device(s): 13 std :: vector<double> a(n, 1), b(n, 2), c(n); 14 vex :: vector<double> A(ctx, a), B(ctx, b), C(ctx, n); 15 16 // Launch compute kernel: 17 C = A + B; 18 19 // Get result back to host: 20 vex :: copy(C, c); 21 std :: cout << c[42] << std::endl; 22 }
  8. Hello VexCL: vector sum 1 #include <iostream> 2 #include <vector>

    3 #include <vexcl/vexcl.hpp> 4 5 int main() { 6 const size t n = 1024 ∗ 1024; 7 8 // Get compute devices supporting double precision: 9 vex :: Context ctx(vex :: Filter :: DoublePrecision ); 10 if (!ctx) throw std:: runtime error (”No compute devices available”); 11 12 // Prepare input data, transfer it to the device(s): 13 std :: vector<double> a(n, 1), b(n, 2), c(n); 14 vex :: vector<double> A(ctx, a), B(ctx, b), C(ctx, n); 15 16 // Launch compute kernel: 17 C = A + B; 18 19 // Get result back to host: 20 vex :: copy(C, c); 21 std :: cout << c[42] << std::endl; 22 }
  9. Hello VexCL: vector sum 1 #include <iostream> 2 #include <vector>

    3 #include <vexcl/vexcl.hpp> 4 5 int main() { 6 const size t n = 1024 ∗ 1024; 7 8 // Get compute devices supporting double precision: 9 vex :: Context ctx(vex :: Filter :: DoublePrecision ); 10 if (!ctx) throw std:: runtime error (”No compute devices available”); 11 12 // Prepare input data, transfer it to the device(s): 13 std :: vector<double> a(n, 1), b(n, 2), c(n); 14 vex :: vector<double> A(ctx, a), B(ctx, b), C(ctx, n); 15 16 // Launch compute kernel: 17 C = A + B; 18 19 // Get result back to host: 20 vex :: copy(C, c); 21 std :: cout << c[42] << std::endl; 22 }
  10. Hello VexCL: vector sum 1 #include <iostream> 2 #include <vector>

    3 #include <vexcl/vexcl.hpp> 4 5 int main() { 6 const size t n = 1024 ∗ 1024; 7 8 // Get compute devices supporting double precision: 9 vex :: Context ctx(vex :: Filter :: DoublePrecision ); 10 if (!ctx) throw std:: runtime error (”No compute devices available”); 11 12 // Prepare input data, transfer it to the device(s): 13 std :: vector<double> a(n, 1), b(n, 2), c(n); 14 vex :: vector<double> A(ctx, a), B(ctx, b), C(ctx, n); 15 16 // Launch compute kernel: 17 C = A + B; 18 19 // Get result back to host: 20 vex :: copy(C, c); 21 std :: cout << c[42] << std::endl; 22 }
  11. Hello VexCL: vector sum 1 #include <iostream> 2 #include <vector>

    3 #include <vexcl/vexcl.hpp> 4 5 int main() { 6 const size t n = 1024 ∗ 1024; 7 8 // Get compute devices supporting double precision: 9 vex :: Context ctx(vex :: Filter :: DoublePrecision ); 10 if (!ctx) throw std:: runtime error (”No compute devices available”); 11 12 // Prepare input data, transfer it to the device(s): 13 std :: vector<double> a(n, 1), b(n, 2), c(n); 14 vex :: vector<double> A(ctx, a), B(ctx, b), C(ctx, n); 15 16 // Launch compute kernel: 17 C = A + B; 18 19 // Get result back to host: 20 vex :: copy(C, c); 21 std :: cout << c[42] << std::endl; 22 }
  12. Initialization Multi-device and multi-platform computations are supported. VexCL context is

    initialized from combination of device filters. Device filter is a boolean functor acting on const cl :: Device&. Initialize VexCL context on selected devices 1 vex :: Context ctx( vex :: Filter :: Any );
  13. Initialization Multi-device and multi-platform computations are supported. VexCL context is

    initialized from combination of device filters. Device filter is a boolean functor acting on const cl :: Device&. Initialize VexCL context on selected devices 1 vex :: Context ctx( vex :: Filter :: Type(CL DEVICE TYPE GPU) );
  14. Initialization Multi-device and multi-platform computations are supported. VexCL context is

    initialized from combination of device filters. Device filter is a boolean functor acting on const cl :: Device&. Initialize VexCL context on selected devices 1 vex :: Context ctx( 2 vex :: Filter :: Type(CL DEVICE TYPE ACCELERATOR) && 3 vex :: Filter :: Platform(” Intel ”) 4 );
  15. Initialization Multi-device and multi-platform computations are supported. VexCL context is

    initialized from combination of device filters. Device filter is a boolean functor acting on const cl :: Device&. Initialize VexCL context on selected devices 1 vex :: Context ctx( 2 vex :: Filter :: DoublePrecision && 3 [ ](const cl :: Device &d) { 4 return d. getInfo<CL DEVICE GLOBAL MEM SIZE>() >= 16 GB; 5 });
  16. Memory and work splitting 1 vex :: Context ctx( vex

    :: Filter :: Name(”Tesla”) ); 2 3 vex :: vector<double> x(ctx, N); 4 vex :: vector<double> y(ctx, N); 5 6 x = vex::element index() ∗ (1.0 / N); 7 y = sin(2 ∗ x) + sqrt(1 − x ∗ x); x y
  17. Memory and work splitting 1 vex :: Context ctx( vex

    :: Filter :: Type(CL DEVICE TYPE GPU) ); 2 3 vex :: vector<double> x(ctx, N); 4 vex :: vector<double> y(ctx, N); 5 6 x = vex::element index() ∗ (1.0 / N); 7 y = sin(2 ∗ x) + sqrt(1 − x ∗ x); x y
  18. Memory and work splitting 1 vex :: Context ctx( vex

    :: Filter :: DoublePrecision ); 2 3 vex :: vector<double> x(ctx, N); 4 vex :: vector<double> y(ctx, N); 5 6 x = vex::element index() ∗ (1.0 / N); 7 y = sin(2 ∗ x) + sqrt(1 − x ∗ x); x y
  19. What vector expressions are supported? All vectors in an expression

    have to be compatible: Have same size Located on same devices What may be used: Vectors, scalars, constants Arithmetic, logical operators Built-in OpenCL functions User-defined functions Random number generators Slicing and permutations Reduce to a scalar (sum, min, max) Reduce across chosen dimensions Stencil operations Sparse matrix – vector products Fast Fourier Transform Sort, scan, reduce by key
  20. A more involved example: Monte Carlo π Compute approximate value

    of π: area of circle area of square = πr2 (2r)2 = π 4 , π = 4 area of circle area of square ≈ 4 points in circle all points 0.0 0.5 1.0 0.5 1.0 1 vex :: Random<cl double2> rnd; 2 vex :: Reductor<size t, vex :: SUM> sum(ctx); 3 4 double pi = 4.0 ∗ sum( length( rnd(vex :: element index(0, n), seed) ) < 1 ) / n;
  21. OpenCL/CUDA code is generated at runtime This expression: 1 x

    = 2 ∗ y − sin(z); when compiled with: g++ −DVEXCL BACKEND OPENCL −lOpenCL ... generates, compiles, and launches the following kernel at runtime: 1 kernel void vexcl vector kernel 2 ( 3 ulong n, 4 global double ∗ prm 1, // x 5 int prm 2, // 2 6 global double ∗ prm 3, // y 7 global double ∗ prm 4 // z 8 ) 9 { 10 for(ulong idx = get global id(0); idx < n; idx += get global size(0)) 11 { 12 prm 1[idx] = ( ( prm 2 ∗ prm 3[idx] ) − sin( prm 4[idx] ) ); 13 } 14 }
  22. OpenCL/CUDA code is generated at runtime This expression: 1 x

    = 2 ∗ y − sin(z); when compiled with: g++ −DVEXCL BACKEND CUDA −lcuda ... generates, compiles, and launches the following kernel at runtime: 1 extern ”C” global void vexcl vector kernel ( 2 ulong n, 3 double ∗ prm 1, // x 4 int prm 2, // 2 5 double ∗ prm 3, // y 6 double ∗ prm 4 // z 7 ) 8 { 9 for(ulong idx = blockDim.x ∗ blockIdx.x + threadIdx.x, grid size = blockDim.x ∗ gridDim.x; 10 idx < n; idx += grid size) 11 { 12 prm 1[idx] = ( ( prm 2 ∗ prm 3[idx] ) − sin( prm 4[idx] ) ); 13 } 14 }
  23. How does it work?

  24. Expression templates

  25. Expression templates How to implement a DSL in C++ effectively?

    The idea is quite old: 1995: Todd Veldhuizen, Expression templates, C++ Report. 1996: Blitz++ is a C++ class library for scientific computing which provides performance on par with Fortran 77/90. Today: Armadillo, Blaze, Boost.uBLAS, Eigen, MTL, etc. But how does it work?
  26. Simple example: Vector arithmetic We want to be able to

    write: 1 x = y + z; and it has to be as effective as: 1 for( size t i = 0; i < n; ++i) 2 x[ i ] = y[i ] + z[i ];
  27. C++ allows us to overload operators! 1 vector operator+(const vector

    &a, const vector &b) { 2 vector tmp( a. size () ); 3 for( size t i = 0; i < a.size (); ++i) 4 tmp[i] = a[i] + b[i ]; 5 return tmp; 6 } Any problems?
  28. C++ allows us to overload operators! 1 vector operator+(const vector

    &a, const vector &b) { 2 vector tmp( a. size () ); 3 for( size t i = 0; i < a.size (); ++i) 4 tmp[i] = a[i] + b[i ]; 5 return tmp; 6 } Any problems? Extra memory allocation Extra memory I/O 1 a = x + y + z; 2 temporary vectors 8 × n memory reads/writes 1 for( size t i = 0; i < n; ++i) 2 a[ i ] = x[i ] + y[i ] + z[i ]; no temporaries 4 × n memory reads/writes
  29. Lazy evaluation v0.1 The idea: postpone the actual evaluation until

    assignment.
  30. Lazy evaluation v0.1 The idea: postpone the actual evaluation until

    assignment. 1 struct vsum { 2 const vector &lhs; 3 const vector &rhs; 4 };
  31. Lazy evaluation v0.1 The idea: postpone the actual evaluation until

    assignment. 1 struct vsum { 2 const vector &lhs; 3 const vector &rhs; 4 }; 5 6 vsum operator+(const vector &a, const vector &b) { 7 return vsum{a, b}; 8 }
  32. Lazy evaluation v0.1 The idea: postpone the actual evaluation until

    assignment. 1 struct vsum { 2 const vector &lhs; 3 const vector &rhs; 4 }; 5 6 vsum operator+(const vector &a, const vector &b) { 7 return vsum{a, b}; 8 } 9 10 const vector& vector :: operator=(const vsum &s) { 11 for( size t i = 0; i < n; ++i) 12 data[ i ] = s.lhs [ i ] + s.rhs[ i ]; 13 return ∗this ; 14 }
  33. Not general enough What will happen here? 1 a =

    x + y + z;
  34. Not general enough What will happen here? 1 a =

    x + y + z; lazy_v1.cpp:38:15: error: invalid operands to binary expression (’vsum’ and ’vector’) a = x + y + z; ~~~~~ ^ ~ lazy_v1.cpp:12:12: note: candidate function not viable: no known conversion from ’vsum’ to ’const vector’ for 1st argument vsum operator+(const vector &a, const vector &b) { ^ 1 error generated.
  35. Lazy evaluation v0.2 1 template <class LHS, class RHS> 2

    struct vsum { 3 const LHS &lhs; 4 const RHS &rhs; 5 6 double operator[ ]( size t i) const { return lhs [ i ] + rhs[i ]; } 7 };
  36. Lazy evaluation v0.2 1 template <class LHS, class RHS> 2

    struct vsum { 3 const LHS &lhs; 4 const RHS &rhs; 5 6 double operator[ ]( size t i) const { return lhs [ i ] + rhs[i ]; } 7 }; 8 9 template <class LHS, class RHS> 10 vsum<LHS, RHS> operator+(const LHS &a, const RHS &b) { 11 return vsum<LHS, RHS>{a, b}; 12 }
  37. Lazy evaluation v0.2 1 template <class LHS, class RHS> 2

    struct vsum { 3 const LHS &lhs; 4 const RHS &rhs; 5 6 double operator[ ]( size t i) const { return lhs [ i ] + rhs[i ]; } 7 }; 8 9 template <class LHS, class RHS> 10 vsum<LHS, RHS> operator+(const LHS &a, const RHS &b) { 11 return vsum<LHS, RHS>{a, b}; 12 } 13 14 template<class Expr> 15 const vector& vector :: operator=(const Expr &expr) { 16 for(int i = 0; i < n; ++i) data[i] = expr[i ]; 17 return ∗this ; 18 }
  38. Still not general enough There are times in life when

    addition alone is not enough
  39. Lazy evaluation v0.3 1 struct plus { 2 static double

    apply(double a, double b) { return a + b; } 3 };
  40. Lazy evaluation v0.3 1 struct plus { 2 static double

    apply(double a, double b) { return a + b; } 3 }; 4 5 template <class LHS, class OP, class RHS> 6 struct binary op { 7 const LHS &lhs; 8 const RHS &rhs; 9 10 double operator[ ]( size t i) const { return OP::apply(lhs [ i ], rhs[ i ]); } 11 };
  41. Lazy evaluation v0.3 1 struct plus { 2 static double

    apply(double a, double b) { return a + b; } 3 }; 4 5 template <class LHS, class OP, class RHS> 6 struct binary op { 7 const LHS &lhs; 8 const RHS &rhs; 9 10 double operator[ ]( size t i) const { return OP::apply(lhs [ i ], rhs[ i ]); } 11 }; 12 13 template <class LHS, class RHS> 14 binary op<LHS, plus, RHS> operator+(const LHS &a, const RHS &b) { 15 return binary op<LHS, plus, RHS>{a, b}; 16 }
  42. Expression templates are trees The expression in the RHS of:

    1 a = x + y; ... is of type: binary op< vector , plus , vector > + x y
  43. Expression templates are trees The expression in the RHS of:

    1 a = x + y − z; ... is of type: binary op< binary op< vector , plus , vector > , minus , vector > − z + x y
  44. Expression templates are trees The expression in the RHS of:

    1 a = x + y − z; ... is of type: binary op< binary op< vector , plus , vector > , minus , vector > − z + x y for( size t i = 0; i < n; ++i) a[ i ] = [ i ]
  45. Expression templates are trees The expression in the RHS of:

    1 a = x + y − z; ... is of type: binary op< binary op< vector , plus , vector > , minus , vector > − z + x y for( size t i = 0; i < n; ++i) a[ i ] = [ i ] [ i ]
  46. Expression templates are trees The expression in the RHS of:

    1 a = x + y − z; ... is of type: binary op< binary op< vector , plus , vector > , minus , vector > − z + x y for( size t i = 0; i < n; ++i) a[ i ] = [ i ] [ i ] [ i ]
  47. Expression templates are trees The expression in the RHS of:

    1 a = x + y − z; ... is of type: binary op< binary op< vector , plus , vector > , minus , vector > − z + x y #pragma omp parallel for for( size t i = 0; i < n; ++i) a[ i ] = [ i ] [ i ] [ i ]
  48. Expression templates are trees The expression in the RHS of:

    1 a = x + y − z; ... is of type: binary op< binary op< vector , plus , vector > , minus , vector > − z + x y #pragma omp parallel for for( size t i = 0; i < n; ++i) a[ i ] = [ i ] [ i ] [ i ] The C++ compiler walks the tree. binary op :: operator[ ] does the actual work.
  49. So far, so good It is now possible to: 1

    v = a ∗ x + b ∗ y; 2 3 double c = (x + y)[42]; ... and its as effective as: 1 for( size t i = 0; i < n; ++i) 2 v[ i ] = a[i] ∗ x[ i ] + b[i] ∗ y[ i ]; 3 4 double c = x[42] + y[42]; No temporaries involved. Optimizing compiler is able to inline everything.
  50. So far, so good It is now possible to: 1

    v = a ∗ x + b ∗ y; 2 3 double c = (x + y)[42]; ... and its as effective as: 1 for( size t i = 0; i < n; ++i) 2 v[ i ] = a[i] ∗ x[ i ] + b[i] ∗ y[ i ]; 3 4 double c = x[42] + y[42]; No temporaries involved. Optimizing compiler is able to inline everything. But how is that related to OpenCL code generation?
  51. OpenCL code generation

  52. How does OpenCL work? 1 A compute kernel is compiled

    at runtime from C99 source. 2 Kernel parameters are set through API calls. 3 Kernel is launched on a compute device.
  53. How does OpenCL work? 1 A compute kernel is compiled

    at runtime from C99 source. 2 Kernel parameters are set through API calls. 3 Kernel is launched on a compute device. Source may be read from a file, or stored in a static string, or generated.
  54. Generating kernel source from an expression We want this expression:

    1 a = x + y − z; . . . to result in this kernel: 1 kernel void vexcl vector kernel ( 2 ulong n, 3 global double ∗ res , 4 global double ∗ prm1, 5 global double ∗ prm2, 6 global double ∗ prm3 7 ) 8 { 9 for( size t idx = get global id(0); idx < n; idx += get global size(0)) { 10 res [ idx ] = ( ( prm1[idx] + prm2[idx] ) − prm3[idx] ); 11 } 12 } − z + x y
  55. Declaring parameters Each terminal knows what parameters it needs: 1

    /∗static∗/ void vector :: declare params(std :: ostream &src, unsigned &pos) { 2 src << ”,\n global double ∗ prm” << ++pos; 3 } An expression just asks its terminals to do the work: 4 template <class LHS, class OP, class RHS> 5 /∗static∗/ void binary op<LHS, OP, RHS>::declare params( 6 std :: ostream &src, unsigned &pos) 7 { 8 LHS::declare params(src , pos); 9 RHS::declare params(src, pos); 10 }
  56. Building string representation for expression 1 struct plus { 2

    static std :: string to string (std :: ostream &src) { src << ” + ”; } 3 };
  57. Building string representation for expression 1 struct plus { 2

    static std :: string to string (std :: ostream &src) { src << ” + ”; } 3 }; 4 5 /∗static∗/ void vector :: to string (std :: ostream &src, unsigned &pos) { 6 src << ”prm” << ++pos << ”[idx]”; 7 }
  58. Building string representation for expression 1 struct plus { 2

    static std :: string to string (std :: ostream &src) { src << ” + ”; } 3 }; 4 5 /∗static∗/ void vector :: to string (std :: ostream &src, unsigned &pos) { 6 src << ”prm” << ++pos << ”[idx]”; 7 } 8 9 template <class LHS, class OP, class RHS> 10 /∗static∗/ void binary op<LHS, OP, RHS>::to string(std::ostream &src, unsigned &pos) { 11 src << ”( ”; 12 LHS:: to string (src , pos); 13 OP:: to string (src ); 14 RHS::to string (src , pos); 15 src << ” )”; 16 }
  59. Generating kernel source 1 template <class LHS, class RHS> 2

    std :: string kernel source () { 3 std :: ostringstream src ; 4 5 src << ”kernel void vexcl vector kernel (\n ulong n”; 6 unsigned pos = 0; 7 LHS::declare params(src , pos); 8 RHS::declare params(src, pos); 9 src << ”)\n{\n” 10 ” for( size t idx = get global id (0); idx < n; idx += get global size (0)) {\n” 11 ” ”; 12 pos = 0; 13 LHS:: to string (src , pos); src << ” = ”; 14 RHS::to string (src , pos); src << ”;\n”; 15 src << ” }\n}\n”; 16 17 return src . str (); 18 }
  60. Generating kernel source 1 template <class LHS, class RHS> 2

    std :: string kernel source () { 3 std :: ostringstream src ; 4 5 src << ”kernel void vexcl vector kernel (\n ulong n”; 6 unsigned pos = 0; 7 LHS::declare params(src , pos); 8 RHS::declare params(src, pos); 9 src << ”)\n{\n” 10 ” for( size t idx = get global id (0); idx < n; idx += get global size (0)) {\n” 11 ” ”; 12 pos = 0; 13 LHS:: to string (src , pos); src << ” = ”; 14 RHS::to string (src , pos); src << ”;\n”; 15 src << ” }\n}\n”; 16 17 return src . str (); 18 }
  61. Generating kernel source 1 template <class LHS, class RHS> 2

    std :: string kernel source () { 3 std :: ostringstream src ; 4 5 src << ”kernel void vexcl vector kernel (\n ulong n”; 6 unsigned pos = 0; 7 LHS::declare params(src , pos); 8 RHS::declare params(src, pos); 9 src << ”)\n{\n” 10 ” for( size t idx = get global id (0); idx < n; idx += get global size (0)) {\n” 11 ” ”; 12 pos = 0; 13 LHS:: to string (src , pos); src << ” = ”; 14 RHS::to string (src , pos); src << ”;\n”; 15 src << ” }\n}\n”; 16 17 return src . str (); 18 }
  62. Setting kernel arguments 1 void vector :: set args (cl

    :: Kernel &krn, unsigned &pos) { 2 krn.setArg(pos++, buffer); 3 } 4 5 template <class LHS, class OP, class RHS> 6 void binary op<LHS, OP, RHS>::set args(cl::Kernel &krn, unsigned &pos) { 7 lhs . set args (krn, pos); 8 rhs. set args (krn, pos); 9 } The methods are not static anymore!
  63. Gluing it all together 1 template <class Expr> 2 const

    vector& vector :: operator=(const Expr &expr) { 3 static cl :: Kernel krn = build kernel (device , kernel source <This, Expr>()); 4 5 unsigned pos = 0; 6 7 krn.setArg(pos++, size); // n 8 krn.setArg(pos++, buffer); // result 9 expr. set args (krn, pos); // other parameters 10 11 queue.enqueueNDRangeKernel(krn, cl::NullRange, buffer . size (), cl :: NullRange); 12 13 return ∗this ; 14 } Kernel is generated and compiled once, applied many times: Each kernel is uniquely identified by its expression type. Hence, we may use local static variable to cache the kernel.
  64. Gluing it all together 1 template <class Expr> 2 const

    vector& vector :: operator=(const Expr &expr) { 3 static cl :: Kernel krn = build kernel (device , kernel source <This, Expr>()); 4 5 unsigned pos = 0; 6 7 krn.setArg(pos++, size); // n 8 krn.setArg(pos++, buffer); // result 9 expr. set args (krn, pos); // other parameters 10 11 queue.enqueueNDRangeKernel(krn, cl::NullRange, buffer . size (), cl :: NullRange); 12 13 return ∗this ; 14 } Kernel is generated and compiled once, applied many times: Each kernel is uniquely identified by its expression type. Hence, we may use local static variable to cache the kernel.
  65. What you saw is not what you get The actual

    implementation is a bit more complicated: There are unary, binary, n-ary expressions. There are special terminals requiring either global or local preambles. There are builtin and user-defined functions. . . . Boost.Proto is used to keep the code mess contained.
  66. Boost.Proto Boost.Proto is a framework for building Embedded Domain-Specific Languages

    in C++. It provides tools for constructing, type-checking, transforming and executing expression templates. Hello Proto 1 #include <iostream> 2 #include <boost/proto/proto.hpp> 3 namespace proto = boost::proto; 4 5 int main() { 6 auto expr = proto:: as expr(6) ∗ 7 − 2; 7 proto :: display expr ( expr ); 8 9 proto :: default context ctx; 10 std :: cout << proto::eval( expr, ctx ) << std::endl; 11 } Output: minus( multiplies( terminal(6) , terminal(7) ) , terminal(2) ) 40
  67. Boost.Proto Boost.Proto is a framework for building Embedded Domain-Specific Languages

    in C++. It provides tools for constructing, type-checking, transforming and executing expression templates. Hello Proto 1 #include <iostream> 2 #include <boost/proto/proto.hpp> 3 namespace proto = boost::proto; 4 5 int main() { 6 auto expr = proto:: as expr(6) ∗ 7 − 2; 7 proto :: display expr ( expr ); 8 9 proto :: default context ctx; 10 std :: cout << proto::eval( expr, ctx ) << std::endl; 11 } Output: minus( multiplies( terminal(6) , terminal(7) ) , terminal(2) ) 40
  68. Creating custom Boost.Proto context 1. Define custom Proto context: 1

    struct print expression context { 2 template <typename Expr, typename Tag = typename Expr::proto tag> struct eval {}; 3 4 template <typename Expr> 5 struct eval<Expr, proto::tag :: plus> { 6 typedef void result type ; 7 void operator()(const Expr &expr, print expression context &ctx) const { 8 cout << ”( ”; 9 proto :: eval( proto :: left (expr ), ctx ); 10 cout << ” + ”; 11 proto :: eval( proto :: right (expr ), ctx ); 12 cout << ” )”; 13 } 14 }; 15 16 template <typename Expr> 17 struct eval<Expr, proto::tag :: terminal> { 18 typedef void result type ; 19 void operator()(const Expr &expr, print expression context &ctx) const { 20 cout << proto::eval( expr, proto :: default context () ); 21 } 22 }; 23 };
  69. Creating custom Boost.Proto context 2. Use the context: 1 auto

    expr = proto:: as expr(6) ∗ 7 − 2; 2 3 print expression context ctx; 4 proto :: eval( expr, ctx ); Output: ( ( 6 * 7 ) + 5 )
  70. Generating code with Boost.Proto 1 template <class Term> struct partial

    vector expr { 2 static void get(std :: ostream &os, unsigned &pos) { os << ”prm ” << ++pos; } 3 }; 4 5 struct print expression context { 6 // ... 7 unsigned pos = 0; 8 9 template <typename Expr> 10 struct eval<Expr, proto::tag :: terminal> { 11 typedef void result type ; 12 void operator()(const Expr &expr, print expression context &ctx) const { 13 partial vector expr <typename proto::result of :: value<Expr>::type>::get(cout, ctx.pos); 14 } 15 }; 16 }; 1 proto :: eval( proto :: as expr(6) ∗ 7 − 2, ctx ); ( ( prm_1 * prm_2 ) + prm_3 )
  71. Generating code with Boost.Proto 1 template <class Term> struct partial

    vector expr { 2 static void get(std :: ostream &os, unsigned &pos) { os << ”prm ” << ++pos; } 3 }; 4 5 struct print expression context { 6 // ... 7 unsigned pos = 0; 8 9 template <typename Expr> 10 struct eval<Expr, proto::tag :: terminal> { 11 typedef void result type ; 12 void operator()(const Expr &expr, print expression context &ctx) const { 13 partial vector expr <typename proto::result of :: value<Expr>::type>::get(cout, ctx.pos); 14 } 15 }; 16 }; 1 proto :: eval( proto :: as expr(6) ∗ 7 − 2, ctx ); ( ( prm_1 * prm_2 ) + prm_3 )
  72. Generating code with Boost.Proto 1 template <class Term> struct partial

    vector expr { 2 static void get(std :: ostream &os, unsigned &pos) { os << ”prm ” << ++pos; } 3 }; 4 5 struct print expression context { 6 // ... 7 unsigned pos = 0; 8 9 template <typename Expr> 10 struct eval<Expr, proto::tag :: terminal> { 11 typedef void result type ; 12 void operator()(const Expr &expr, print expression context &ctx) const { 13 partial vector expr <typename proto::result of :: value<Expr>::type>::get(cout, ctx.pos); 14 } 15 }; 16 }; 1 proto :: eval( proto :: as expr(6) ∗ 7 − 2, ctx ); ( ( prm_1 * prm_2 ) + prm_3 )
  73. Almost there. . . The code is very close to

    the one actually used in VexCL We can leave definition of print expression context intact. Only partial vector expr needs to be specialized for each new terminal type. We will need several contexts similar to print expression context : Declaring kernel parameters, Setting kernel arguments, . . .
  74. General structure of OpenCL code VexCL code: 1 VEX FUNCTION(double,

    sqr, (double, x)(double, y), 2 return x ∗ x + y ∗ y; 3 ); 4 auto tmp = vex::make temp<1>(x); 5 y = sqr(sin(tmp), cos(tmp)); OpenCL code: 1 double sqr(double x, double y) { return x ∗ x + y ∗ y; } 2 kernel void vexcl vector kernel (ulong n, 3 global double ∗ prm 1, 4 global double ∗ prm 2 1 5 ) 6 { 7 for(ulong idx = get global id(0); 8 idx < n; idx += get global size(0)) 9 { 10 double temp 1 = prm 2 1[idx]; 11 prm 1[idx] = sqr( sin( temp 1 ), cos( temp 1 ) ); 12 } 13 } Each terminal should tell VexCL how to use it: Does the terminal need global preamble? What kernel parameters does it need, and how to set them? Does it need local preamble? What contribution does it make to the expression string? This is done by specializing a number of templated structs.
  75. General structure of OpenCL code VexCL code: 1 VEX FUNCTION(double,

    sqr, (double, x)(double, y), 2 return x ∗ x + y ∗ y; 3 ); 4 auto tmp = vex::make temp<1>(x); 5 y = sqr(sin(tmp), cos(tmp)); OpenCL code: 1 double sqr(double x, double y) { return x ∗ x + y ∗ y; } 2 kernel void vexcl vector kernel (ulong n, 3 global double ∗ prm 1, 4 global double ∗ prm 2 1 5 ) 6 { 7 for(ulong idx = get global id(0); 8 idx < n; idx += get global size(0)) 9 { 10 double temp 1 = prm 2 1[idx]; 11 prm 1[idx] = sqr( sin( temp 1 ), cos( temp 1 ) ); 12 } 13 } Each terminal should tell VexCL how to use it: Does the terminal need global preamble? What kernel parameters does it need, and how to set them? Does it need local preamble? What contribution does it make to the expression string? This is done by specializing a number of templated structs.
  76. General structure of OpenCL code VexCL code: 1 VEX FUNCTION(double,

    sqr, (double, x)(double, y), 2 return x ∗ x + y ∗ y; 3 ); 4 auto tmp = vex::make temp<1>(x); 5 y = sqr(sin(tmp), cos(tmp)); OpenCL code: 1 double sqr(double x, double y) { return x ∗ x + y ∗ y; } 2 kernel void vexcl vector kernel (ulong n, 3 global double ∗ prm 1, 4 global double ∗ prm 2 1 5 ) 6 { 7 for(ulong idx = get global id(0); 8 idx < n; idx += get global size(0)) 9 { 10 double temp 1 = prm 2 1[idx]; 11 prm 1[idx] = sqr( sin( temp 1 ), cos( temp 1 ) ); 12 } 13 } Each terminal should tell VexCL how to use it: Does the terminal need global preamble? What kernel parameters does it need, and how to set them? Does it need local preamble? What contribution does it make to the expression string? This is done by specializing a number of templated structs.
  77. General structure of OpenCL code VexCL code: 1 VEX FUNCTION(double,

    sqr, (double, x)(double, y), 2 return x ∗ x + y ∗ y; 3 ); 4 auto tmp = vex::make temp<1>(x); 5 y = sqr(sin(tmp), cos(tmp)); OpenCL code: 1 double sqr(double x, double y) { return x ∗ x + y ∗ y; } 2 kernel void vexcl vector kernel (ulong n, 3 global double ∗ prm 1, 4 global double ∗ prm 2 1 5 ) 6 { 7 for(ulong idx = get global id(0); 8 idx < n; idx += get global size(0)) 9 { 10 double temp 1 = prm 2 1[idx]; 11 prm 1[idx] = sqr( sin( temp 1 ), cos( temp 1 ) ); 12 } 13 } Each terminal should tell VexCL how to use it: Does the terminal need global preamble? What kernel parameters does it need, and how to set them? Does it need local preamble? What contribution does it make to the expression string? This is done by specializing a number of templated structs.
  78. General structure of OpenCL code VexCL code: 1 VEX FUNCTION(double,

    sqr, (double, x)(double, y), 2 return x ∗ x + y ∗ y; 3 ); 4 auto tmp = vex::make temp<1>(x); 5 y = sqr(sin(tmp), cos(tmp)); OpenCL code: 1 double sqr(double x, double y) { return x ∗ x + y ∗ y; } 2 kernel void vexcl vector kernel (ulong n, 3 global double ∗ prm 1, 4 global double ∗ prm 2 1 5 ) 6 { 7 for(ulong idx = get global id(0); 8 idx < n; idx += get global size(0)) 9 { 10 double temp 1 = prm 2 1[idx]; 11 prm 1[idx] = sqr( sin( temp 1 ), cos( temp 1 ) ); 12 } 13 } Each terminal should tell VexCL how to use it: Does the terminal need global preamble? What kernel parameters does it need, and how to set them? Does it need local preamble? What contribution does it make to the expression string? This is done by specializing a number of templated structs.
  79. General structure of OpenCL code VexCL code: 1 VEX FUNCTION(double,

    sqr, (double, x)(double, y), 2 return x ∗ x + y ∗ y; 3 ); 4 auto tmp = vex::make temp<1>(x); 5 y = sqr(sin(tmp), cos(tmp)); OpenCL code: 1 double sqr(double x, double y) { return x ∗ x + y ∗ y; } 2 kernel void vexcl vector kernel (ulong n, 3 global double ∗ prm 1, 4 global double ∗ prm 2 1 5 ) 6 { 7 for(ulong idx = get global id(0); 8 idx < n; idx += get global size(0)) 9 { 10 double temp 1 = prm 2 1[idx]; 11 prm 1[idx] = sqr( sin( temp 1 ), cos( temp 1 ) ); 12 } 13 } Each terminal should tell VexCL how to use it: Does the terminal need global preamble? What kernel parameters does it need, and how to set them? Does it need local preamble? What contribution does it make to the expression string? This is done by specializing a number of templated structs.
  80. Example: element index <vexcl/element index.hpp> 1 x = sin( 2

    ∗ M PI ∗ vex::element index() / n ); 1 namespace vex { 2 struct elem index { 3 size t offset ; 4 }; 5 6 inline auto element index( size t offset = 0) { 7 return boost:: proto :: as expr<vector domain>( elem index{offset} ); 8 } 9 10 namespace traits { 11 template <> struct is vector expr terminal < elem index > : std :: true type {}; 12 } 13 }
  81. Example: element index <vexcl/element index.hpp> 1 x = sin( 2

    ∗ M PI ∗ vex::element index() / n ); 1 namespace vex { 2 struct elem index { 3 size t offset ; 4 }; 5 6 inline auto element index( size t offset = 0) { 7 return boost:: proto :: as expr<vector domain>( elem index{offset} ); 8 } 9 10 namespace traits { 11 template <> struct is vector expr terminal < elem index > : std :: true type {}; 12 } 13 }
  82. Example: element index <vexcl/element index.hpp> 1 x = sin( 2

    ∗ M PI ∗ vex::element index() / n ); 1 namespace vex { 2 struct elem index { 3 size t offset ; 4 }; 5 6 inline auto element index( size t offset = 0) { 7 return boost:: proto :: as expr<vector domain>( elem index{offset} ); 8 } 9 10 namespace traits { 11 template <> struct is vector expr terminal < elem index > : std :: true type {}; 12 } 13 }
  83. Example: element index <vexcl/element index.hpp> 1 x = sin( 2

    ∗ M PI ∗ vex::element index() / n ); 1 namespace vex { 2 struct elem index { 3 size t offset ; 4 }; 5 6 inline auto element index( size t offset = 0) { 7 return boost:: proto :: as expr<vector domain>( elem index{offset} ); 8 } 9 10 namespace traits { 11 template <> struct is vector expr terminal < elem index > : std :: true type {}; 12 } 13 }
  84. Example: element index <vexcl/element index.hpp> 1 x = sin( 2

    ∗ M PI ∗ vex::element index() / n ); 1 namespace vex { 2 struct elem index { 3 size t offset ; 4 }; 5 6 inline boost:: proto :: result of :: as expr<elem index, vector domain>::type 7 element index( size t offset = 0) { 8 return boost:: proto :: as expr<vector domain>( elem index{offset} ); 9 } 10 11 namespace traits { 12 template <> struct is vector expr terminal < elem index > : std :: true type {}; 13 } 14 }
  85. Parameter declaration 1 template <> 2 struct kernel param declaration

    < elem index > 3 { 4 static std :: string get( 5 const elem index&, const cl :: Device&, const std :: string &prm name, 6 detail :: kernel generator state ptr 7 ) 8 { 9 std :: ostringstream s; 10 s << ”,\n\t” << type name<size t>() << ” ” << prm name; 11 return s. str (); 12 } 13 };
  86. Contribution to expression string 1 template <> 2 struct partial

    vector expr < elem index > 3 { 4 static std :: string get( 5 const elem index&, const cl :: Device&, const std :: string &prm name, 6 detail :: kernel generator state ptr 7 ) 8 { 9 std :: ostringstream s; 10 s << ”(” << prm name << ” + idx)”; 11 return s. str (); 12 } 13 };
  87. Setting kernel arguments 1 template <> 2 struct kernel arg

    setter < elem index > 3 { 4 static void set( 5 const elem index &term, cl :: Kernel &kernel, unsigned /∗device∗/, 6 size t index offset , unsigned &position, 7 detail :: kernel generator state ptr 8 ) 9 { 10 kernel.setArg( position ++, term.offset + index offset ); 11 } 12 };
  88. And we are done! vex :: element index() is now

    fully defined: We may use it in vector expressions of arbitrary complexity. Also, we have an expression engine (a set of proto contexts): Adding new terminal types is straightforward.
  89. Other uses for expression templates in VexCL

  90. Device filters 1 vex :: Context ctx( 2 vex ::

    Filter :: Env && vex::Filter :: DoublePrecision 3 ); Mini-DSL inside VexCL Terminals are functors Only logical operators are provided
  91. Symbolic variables An instance of vex :: symbolic<> dumps any

    arithmetic operations to a std :: ostream: 1 vex :: generator :: set recorder ( std :: cout ); 2 vex :: symbolic<double> x, y = 6; 3 x = sin(y ∗ 7); double var1; double var2 = 6; var1 = sin( var2 * 7 ); We may feed symbolic variables to a generic algorithm and obtain its string representation. Generate VexCL function: 1 auto sqr = vex::generator :: make function<double(double, double)>( 2 [ ](const auto &x, const auto &y){ return x ∗ x + y ∗ y; } ); 3 x = sqr( sin(y), cos(z) ); Generate effective compute kernel from a generic algorithm.
  92. Conclusion Advantages of code generation: You get the exact kernel

    you need. Flexibility of generic C++ code with C99 backend. There is no need to use vendor-specific solutions. Performance! 0 100 200 300 Compute time (sec) cuBLAS Thrust VexCL VexCL (symbolic) Integrate large number of ODEs on a GPU with Boost.odeint Fork m e on G itH ub
  93. Benchmarking performance

  94. Parameter study for the Lorenz attractor system Lorenz attractor system

    ˙ x = −σ (x − y) , ˙ y = Rx − y − xz, ˙ z = −bz + xy. Solve large number of Lorenz systems, each for a different value of R. Use Boost.odeint. 20 10 0 10 20 20 0 20 0 10 20 30 40 50 Lorenz attractor trajectory
  95. CUBLAS implementation CUBLAS is a highly optimized BLAS implementation from

    NVIDIA. Its disadvantage is that it has fixed number of kernels/functions. Hence, linear combinations (used internally by odeint): x0 = α1x1 + α2x2 + · · · + αnxn are implemented as: cublasDset (...); // x0 = 0 cublasDaxpy (...); // x0 = x0 + α1 ∗ x1 ... cublasDaxpy (...); // x0 = x0 + αn ∗ xn
  96. Thrust implementation It is possible to fuse linear combination kernels

    with Thrust: Thrust 1 struct scale sum2 { 2 const double a1, a2; 3 scale sum2(double a1, double a2) : a1(a1), a2(a2) { } 4 template<class Tuple> 5 host device void operator()(Tuple t) const { 6 thrust :: get<0>(t) = a1 ∗ thrust::get<1>(t) + a2 ∗ thrust::get<2>(t); 7 } 8 }; 9 10 thrust :: for each ( 11 thrust :: make zip iterator ( 12 thrust :: make tuple( x0.begin (), x1.begin (), x2.begin() ) 13 ), 14 thrust :: make zip iterator ( 15 thrust :: make tuple( x0.end(), x1.end(), x2.end() ) 16 ), 17 scale sum2(a1, a2) 18 );
  97. Thrust implementation It is possible to fuse linear combination kernels

    with Thrust: Thrust 1 struct scale sum2 { 2 const double a1, a2; 3 scale sum2(double a1, double a2) : a1(a1), a2(a2) { } 4 template<class Tuple> 5 host device void operator()(Tuple t) const { 6 thrust :: get<0>(t) = a1 ∗ thrust::get<1>(t) + a2 ∗ thrust::get<2>(t); 7 } 8 }; 9 10 thrust :: for each ( 11 thrust :: make zip iterator ( 12 thrust :: make tuple( x0.begin (), x1.begin (), x2.begin() ) 13 ), 14 thrust :: make zip iterator ( 15 thrust :: make tuple( x0.end(), x1.end(), x2.end() ) 16 ), 17 scale sum2(a1, a2) 18 ); VexCL 1 x0 = a1 ∗ x1 + a2 ∗ x2;
  98. VexCL symbolic implementation Pass VexCL symbolic variables to a Boost.odeint

    algorithm. Make single integration step. Generate monolithic kernel that does one full time step. Intermediate variables are no longer stored in global memory.
  99. Conclusion Advantages of code generation: You get the exact kernel

    you need. Flexibility of generic C++ code with C99 backend. There is no need to use vendor-specific solutions. Performance! 0 100 200 300 Compute time (sec) cuBLAS Thrust VexCL VexCL (symbolic) Integrate large number of ODEs on a GPU with Boost.odeint Fork m e on G itH ub