VexCL implementation, University of Texas, 2013

VexCL implementation, University of Texas, 2013

General ideas behind VexCL implementation

A90f9e7d204e6a93aa60dec78e149c85?s=128

Denis Demidov

October 22, 2013
Tweet

Transcript

  1. VexCL – Automatic OpenCL Code Generation Denis Demidov Supercomputer Center

    of Russian Academy of Sciences Kazan Federal University October 2013, Austin, Texas
  2. 1 Expression templates 2 OpenCL code generation 3 Extending VexCL

  3. Expression templates How to implement a DSL in C++ effectively?

    The idea is quite old: Todd Veldhuizen, Expression templates, C++ Report, 1995 First (?) implementation: Blitz++ is a C++ class library for scientific computing which provides performance on par with Fortran 77/90. Today: std::valarray, Boost.uBLAS, MTL, Eigen, Armadillo, etc. How does it work?
  4. Simple example: Vector addition We want to be able to

    write: 1 x = y + z; And it has to be as effective as: 1 for(size t i = 0; i < n; ++i) 2 x[i ] = y[i] + z[i ];
  5. C++ allows us to overload operators! 1 const vector operator+(const

    vector &a, const vector &b) { 2 assert(a. size () == b.size()); 3 vector c( a. size () ); 4 for(size t i = 0; i < a.size (); ++i) 5 c[ i ] = a[i] + b[i ]; 6 return c; 7 }
  6. C++ allows us to overload operators! 1 const vector operator+(const

    vector &a, const vector &b) { 2 assert(a. size () == b.size()); 3 vector c( a. size () ); 4 for(size t i = 0; i < a.size (); ++i) 5 c[ i ] = a[i] + b[i ]; 6 return c; 7 } Any problems?
  7. C++ allows us to overload operators! 1 const vector operator+(const

    vector &a, const vector &b) { 2 assert(a. size () == b.size()); 3 vector c( a. size () ); 4 for(size t i = 0; i < a.size (); ++i) 5 c[ i ] = a[i] + b[i ]; 6 return c; 7 } Any problems? Extra memory allocation
  8. C++ allows us to overload operators! 1 const vector operator+(const

    vector &a, const vector &b) { 2 assert(a. size () == b.size()); 3 vector c( a. size () ); 4 for(size t i = 0; i < a.size (); ++i) 5 c[ i ] = a[i] + b[i ]; 6 return c; 7 } Any problems? Extra memory allocation Extra memory I/O
  9. Lazy evaluation v0.1 1 struct vsum { 2 const vector

    &a; 3 const vector &b; 4 vsum(const vector &a, const vector &b) : a(a), b(b) {} 5 };
  10. Lazy evaluation v0.1 1 struct vsum { 2 const vector

    &a; 3 const vector &b; 4 vsum(const vector &a, const vector &b) : a(a), b(b) {} 5 }; 6 7 const vsum operator+(const vector &a, const vector &b) { 8 return vsum(a, b); 9 }
  11. Lazy evaluation v0.1 1 struct vsum { 2 const vector

    &a; 3 const vector &b; 4 vsum(const vector &a, const vector &b) : a(a), b(b) {} 5 }; 6 7 const vsum operator+(const vector &a, const vector &b) { 8 return vsum(a, b); 9 } 10 11 const vector& vector::operator=(const vsum &s) { 12 for(size t i = 0; i < data.size (); ++i) 13 data[i ] = s.a[ i ] + s.b[i ]; 14 return ∗this; 15 }
  12. Not general enough What happens if we write this? 1

    a = x + y + z;
  13. Not general enough What happens if we write this? 1

    a = x + y + z; lazy_v1.cpp:38:15: error: invalid operands to binary expression (’const vsum’ and ’vector’) a = x + y + z; ~~~~~ ^ ~ lazy_v1.cpp:12:12: note: candidate function not viable: no known conversion from ’const vsum’ to ’const vector’ for 1st argument const vsum operator+(const vector &a, const vector &b) { ^ 1 error generated. Sidenote: Clang error messages are awesome!
  14. Lazy evaluation v0.2 1 template <class LHS, class RHS> 2

    struct vsum { 3 const LHS &lhs; 4 const RHS &rhs; 5 vsum(const LHS &lhs, const RHS &rhs) : lhs(lhs), rhs(rhs) {} 6 double operator[](size t i) const { 7 return lhs[i] + rhs[i ]; 8 } 9 };
  15. Lazy evaluation v0.2 1 template <class LHS, class RHS> 2

    struct vsum { 3 const LHS &lhs; 4 const RHS &rhs; 5 vsum(const LHS &lhs, const RHS &rhs) : lhs(lhs), rhs(rhs) {} 6 double operator[](size t i) const { 7 return lhs[i] + rhs[i ]; 8 } 9 }; 10 11 template <class LHS, class RHS> 12 const vsum<LHS, RHS> operator+(const LHS &a, const RHS &b) { 13 return vsum<LHS, RHS>(a, b); 14 }
  16. Lazy evaluation v0.2 1 template <class LHS, class RHS> 2

    struct vsum { 3 const LHS &lhs; 4 const RHS &rhs; 5 vsum(const LHS &lhs, const RHS &rhs) : lhs(lhs), rhs(rhs) {} 6 double operator[](size t i) const { 7 return lhs[i] + rhs[i ]; 8 } 9 }; 10 11 template <class LHS, class RHS> 12 const vsum<LHS, RHS> operator+(const LHS &a, const RHS &b) { 13 return vsum<LHS, RHS>(a, b); 14 } 15 16 template<class Expr> 17 const vector& vector::operator=(const Expr &expr) { 18 for(int i = 0; i < data.size (); ++i) data[i] = expr[i ]; 19 return ∗this; 20 }
  17. Still not general enough There are times in life when

    addition alone is not enough. . .
  18. Lazy evaluation v0.3 1 struct plus { 2 static double

    apply(double a, double b) { return a + b; } 3 };
  19. Lazy evaluation v0.3 1 struct plus { 2 static double

    apply(double a, double b) { return a + b; } 3 }; 4 5 template <class LHS, class OP, class RHS> 6 struct binary op { 7 const LHS &lhs; 8 const RHS &rhs; 9 binary op(const LHS &lhs, const RHS &rhs) : lhs(lhs), rhs(rhs) {} 10 double operator[](size t i) const { 11 return OP::apply(lhs[i], rhs[ i ]); 12 } 13 };
  20. Lazy evaluation v0.3 1 struct plus { 2 static double

    apply(double a, double b) { return a + b; } 3 }; 4 5 template <class LHS, class OP, class RHS> 6 struct binary op { 7 const LHS &lhs; 8 const RHS &rhs; 9 binary op(const LHS &lhs, const RHS &rhs) : lhs(lhs), rhs(rhs) {} 10 double operator[](size t i) const { 11 return OP::apply(lhs[i], rhs[ i ]); 12 } 13 }; 14 15 template <class LHS, class RHS> 16 const binary op<LHS, plus, RHS> operator+(const LHS &a, const RHS &b) { 17 return binary op<LHS, plus, RHS>(a, b); 18 }
  21. Expression templates are trees The expression in the RHS of:

    1 a = x + y; ... is of type: binary op< vector, plus, vector > + x y
  22. Expression templates are trees The expression in the RHS of:

    1 a = x + y − z; ... is of type: binary op< binary op< vector, plus, vector > , minus , vector > − z + x y
  23. Expression templates are trees The expression in the RHS of:

    1 a = x + y − z; ... is of type: binary op< binary op< vector, plus, vector > , minus , vector > − z + x y for(size t i = 0; i < n; ++i) a[ i ] = [ i ]
  24. Expression templates are trees The expression in the RHS of:

    1 a = x + y − z; ... is of type: binary op< binary op< vector, plus, vector > , minus , vector > − z + x y for(size t i = 0; i < n; ++i) a[ i ] = [ i ] [ i ]
  25. Expression templates are trees The expression in the RHS of:

    1 a = x + y − z; ... is of type: binary op< binary op< vector, plus, vector > , minus , vector > − z + x y for(size t i = 0; i < n; ++i) a[ i ] = [ i ] [ i ] [ i ]
  26. Expression templates are trees The expression in the RHS of:

    1 a = x + y − z; ... is of type: binary op< binary op< vector, plus, vector > , minus , vector > − z + x y #pragma omp parallel for for(size t i = 0; i < n; ++i) a[ i ] = [ i ] [ i ] [ i ]
  27. So far, so good It is now possible to: 1

    v = a ∗ x + b ∗ y; 2 3 double c = (x + y)[42]; ... and its as effective as: 1 for(size t i = 0; i < n; ++i) 2 v[i ] = a[i] ∗ x[i ] + b[i] ∗ y[i ]; 3 4 double c = x[42] + y[42]; No temporaries involved. Optimizing compiler is able to inline everything.
  28. 1 Expression templates 2 OpenCL code generation 3 Extending VexCL

  29. How does OpenCL work? 1 A compute kernel is compiled

    at runtime from C99 source. 2 Kernel parameters are set through API calls. 3 Kernel is launched on a compute device.
  30. How does OpenCL work? 1 A compute kernel is compiled

    at runtime from C99 source. 2 Kernel parameters are set through API calls. 3 Kernel is launched on a compute device. Source may be read from a file, or stored in a static string, or generated.
  31. Generating kernel source from an expression We want this expression:

    1 a = x + y − z; . . . to result in this kernel: 1 kernel void vexcl vector kernel( 2 ulong n, 3 global double ∗ res, 4 global double ∗ prm1, 5 global double ∗ prm2, 6 global double ∗ prm3 7 ) 8 { 9 for(size t idx = get global id(0); idx < n; idx += get global size(0)) { 10 res[idx] = ( ( prm1[idx] + prm2[idx] ) − prm3[idx] ); 11 } 12 } − z + x y
  32. Declaring parameters Each terminal knows what parameters it needs: 1

    /∗static∗/ void vector::prm decl(std::ostream &src, unsigned &pos) { 2 src << ”,\n global double ∗ prm” << ++pos; 3 } An expression just asks its terminals to do the work: 4 template <class LHS, class OP, class RHS> 5 /∗static∗/ void binary op<LHS, OP, RHS>::prm decl( 6 std :: ostream &src, unsigned &pos) 7 { 8 LHS::prm decl(src, pos); 9 RHS::prm decl(src, pos); 10 }
  33. Building expression string 1 struct plus { 2 static std

    :: string string () { return ”+”; } 3 };
  34. Building expression string 1 struct plus { 2 static std

    :: string string () { return ”+”; } 3 }; 4 5 /∗static∗/ void vector::make expr(std::ostream &src, unsigned &pos) { 6 src << ”prm” << ++pos << ”[idx]”; 7 }
  35. Building expression string 1 struct plus { 2 static std

    :: string string () { return ”+”; } 3 }; 4 5 /∗static∗/ void vector::make expr(std::ostream &src, unsigned &pos) { 6 src << ”prm” << ++pos << ”[idx]”; 7 } 8 9 template <class LHS, class OP, class RHS> 10 /∗static∗/ void binary op<LHS, OP, RHS>::make expr( 11 std :: ostream &src, unsigned &pos) const 12 { 13 src << ”( ”; 14 LHS::make expr(src, pos); 15 src << ” ” << OP::string() << ” ”;; 16 RHS::make expr(src, pos); 17 src << ” )”; 18 }
  36. Source generation 1 template <class LHS, class RHS> 2 std

    :: string kernel source() { 3 std :: ostringstream src; 4 5 src << ”kernel void vexcl vector kernel(\n ulong n”; 6 unsigned pos = 0; 7 LHS::prm decl(src, pos); 8 RHS::prm decl(src, pos); 9 src << ”)\n{\n” 10 ” for( size t idx = get global id (0); idx < n; idx += get global size(0)) {\n” 11 ” ”; 12 pos = 0; 13 LHS::make expr(src, pos); src << ” = ”; 14 RHS::make expr(src, pos); src << ”;\n”; 15 src << ” }\n}\n”; 16 17 return src.str (); 18 }
  37. Source generation 1 template <class LHS, class RHS> 2 std

    :: string kernel source() { 3 std :: ostringstream src; 4 5 src << ”kernel void vexcl vector kernel(\n ulong n”; 6 unsigned pos = 0; 7 LHS::prm decl(src, pos); 8 RHS::prm decl(src, pos); 9 src << ”)\n{\n” 10 ” for( size t idx = get global id (0); idx < n; idx += get global size(0)) {\n” 11 ” ”; 12 pos = 0; 13 LHS::make expr(src, pos); src << ” = ”; 14 RHS::make expr(src, pos); src << ”;\n”; 15 src << ” }\n}\n”; 16 17 return src.str (); 18 }
  38. Source generation 1 template <class LHS, class RHS> 2 std

    :: string kernel source() { 3 std :: ostringstream src; 4 5 src << ”kernel void vexcl vector kernel(\n ulong n”; 6 unsigned pos = 0; 7 LHS::prm decl(src, pos); 8 RHS::prm decl(src, pos); 9 src << ”)\n{\n” 10 ” for( size t idx = get global id (0); idx < n; idx += get global size(0)) {\n” 11 ” ”; 12 pos = 0; 13 LHS::make expr(src, pos); src << ” = ”; 14 RHS::make expr(src, pos); src << ”;\n”; 15 src << ” }\n}\n”; 16 17 return src.str (); 18 }
  39. Setting kernel arguments 1 void vector:: set args(cl :: Kernel

    &krn, unsigned &pos) { 2 krn.setArg(pos++, buffer); 3 } 4 5 template <class LHS, class OP, class RHS> 6 void binary op<LHS, OP, RHS>::set args(cl::Kernel &krn, unsigned &pos) { 7 lhs . set args(krn, pos); 8 rhs. set args(krn, pos); 9 }
  40. Caching compiled kernels Each kernel is uniquely identified by its

    expression type: Expression template captures terminal types and operations. 1 template <class Expr> 2 const vector& vector::operator=(const Expr &expr) { 3 static cl :: Kernel kernel = build kernel(device, kernel source<This, Expr>()); 4 5 unsigned pos = 0; 6 7 kernel.setArg(pos++, size); // n 8 kernel.setArg(pos++, buffer); // res 9 expr.set args(kernel, pos); // other parameters 10 11 queue.enqueueNDRangeKernel(kernel, cl::NullRange, buffer.size(), cl::NullRange); 12 13 return ∗this; 14 }
  41. Caching compiled kernels Each kernel is uniquely identified by its

    expression type: Expression template captures terminal types and operations. 1 template <class Expr> 2 const vector& vector::operator=(const Expr &expr) { 3 static cl :: Kernel kernel = build kernel(device, kernel source<This, Expr>()); 4 5 unsigned pos = 0; 6 7 kernel.setArg(pos++, size); // n 8 kernel.setArg(pos++, buffer); // res 9 expr.set args(kernel, pos); // other parameters 10 11 queue.enqueueNDRangeKernel(kernel, cl::NullRange, buffer.size(), cl::NullRange); 12 13 return ∗this; 14 }
  42. What you saw is not what you get The actual

    implementation is a bit more complicated: There are unary, binary, ternary expressions. There are special terminals requiring either global or local preambles. There are builtin and user-defined functions. . . . Boost.Proto is used to keep the code mess contained. But you got the general idea.
  43. 1 Expression templates 2 OpenCL code generation 3 Extending VexCL

  44. General structure of OpenCL code VexCL code: 1 VEX FUNCTION(

    sqr, double(double, double), ”return prm1 ∗ prm1 + prm2 ∗ prm2;” ); 2 auto tmp = vex::make temp<1>( x ); 3 y = sqr( sin(tmp), cos(tmp) ); OpenCL code: 1 double func1(double prm1, double prm2) { 2 return prm1 ∗ prm1 + prm2 ∗ prm2; 3 } 4 kernel void vexcl vector kernel( 5 ulong n, 6 global double ∗ res, 7 global double ∗ prm1 8 ) 9 { 10 for(size t idx = get global id(0); idx < n; idx += get global size(0)) { 11 double temp1 = prm1[idx]; 12 res[idx] = func1( ( sin( temp1 ), cos( temp1 ) ) ); 13 } 14 }
  45. General structure of OpenCL code VexCL code: 1 VEX FUNCTION(

    sqr, double(double, double), ”return prm1 ∗ prm1 + prm2 ∗ prm2;” ); 2 auto tmp = vex::make temp<1>( x ); 3 y = sqr( sin(tmp), cos(tmp) ); OpenCL code: 1 double func1(double prm1, double prm2) { 2 return prm1 ∗ prm1 + prm2 ∗ prm2; 3 } 4 kernel void vexcl vector kernel( 5 ulong n, 6 global double ∗ res, 7 global double ∗ prm1 8 ) 9 { 10 for(size t idx = get global id(0); idx < n; idx += get global size(0)) { 11 double temp1 = prm1[idx]; 12 res[idx] = func1( ( sin( temp1 ), cos( temp1 ) ) ); 13 } 14 }
  46. General structure of OpenCL code VexCL code: 1 VEX FUNCTION(

    sqr, double(double, double), ”return prm1 ∗ prm1 + prm2 ∗ prm2;” ); 2 auto tmp = vex::make temp<1>( x ); 3 y = sqr( sin(tmp), cos(tmp) ); OpenCL code: 1 double func1(double prm1, double prm2) { 2 return prm1 ∗ prm1 + prm2 ∗ prm2; 3 } 4 kernel void vexcl vector kernel( 5 ulong n, 6 global double ∗ res, 7 global double ∗ prm1 8 ) 9 { 10 for(size t idx = get global id(0); idx < n; idx += get global size(0)) { 11 double temp1 = prm1[idx]; 12 res[idx] = func1( ( sin( temp1 ), cos( temp1 ) ) ); 13 } 14 }
  47. General structure of OpenCL code VexCL code: 1 VEX FUNCTION(

    sqr, double(double, double), ”return prm1 ∗ prm1 + prm2 ∗ prm2;” ); 2 auto tmp = vex::make temp<1>( x ); 3 y = sqr( sin(tmp), cos(tmp) ); OpenCL code: 1 double func1(double prm1, double prm2) { 2 return prm1 ∗ prm1 + prm2 ∗ prm2; 3 } 4 kernel void vexcl vector kernel( 5 ulong n, 6 global double ∗ res, 7 global double ∗ prm1 8 ) 9 { 10 for(size t idx = get global id(0); idx < n; idx += get global size(0)) { 11 double temp1 = prm1[idx]; 12 res[idx] = func1( ( sin( temp1 ), cos( temp1 ) ) ); 13 } 14 }
  48. General structure of OpenCL code VexCL code: 1 VEX FUNCTION(

    sqr, double(double, double), ”return prm1 ∗ prm1 + prm2 ∗ prm2;” ); 2 auto tmp = vex::make temp<1>( x ); 3 y = sqr( sin(tmp), cos(tmp) ); OpenCL code: 1 double func1(double prm1, double prm2) { 2 return prm1 ∗ prm1 + prm2 ∗ prm2; 3 } 4 kernel void vexcl vector kernel( 5 ulong n, 6 global double ∗ res, 7 global double ∗ prm1 8 ) 9 { 10 for(size t idx = get global id(0); idx < n; idx += get global size(0)) { 11 double temp1 = prm1[idx]; 12 res[idx] = func1( ( sin( temp1 ), cos( temp1 ) ) ); 13 } 14 }
  49. Extending VexCL (adding terminal types) Each terminal should tell VexCL

    how to use it by specializing several struct templates: Does it need global preamble? What kernel parameters does it need? Does it need local preamble? What contribution does it make to expression string? How to set kernel parameters?
  50. Example: element index <vexcl/element index.hpp> 1 x = sin( 2

    ∗ M PI ∗ vex::element index() / n );
  51. None
  52. None
  53. None
  54. None
  55. None
  56. None
  57. None
  58. None
  59. None
  60. None