VexCL implementation, University of Texas, 2013

VexCL – Automatic OpenCL Code Generation Denis Demidov Supercomputer Center
of Russian Academy of Sciences Kazan Federal University October 2013, Austin, Texas

1 Expression templates 2 OpenCL code generation 3 Extending VexCL

Expression templates How to implement a DSL in C++ eﬀectively?
The idea is quite old: Todd Veldhuizen, Expression templates, C++ Report, 1995 First (?) implementation: Blitz++ is a C++ class library for scientiﬁc computing which provides performance on par with Fortran 77/90. Today: std::valarray, Boost.uBLAS, MTL, Eigen, Armadillo, etc. How does it work?

Simple example: Vector addition We want to be able to
write: 1 x = y + z; And it has to be as eﬀective as: 1 for(size t i = 0; i < n; ++i) 2 x[i ] = y[i] + z[i ];

C++ allows us to overload operators! 1 const vector operator+(const
vector &a, const vector &b) { 2 assert(a. size () == b.size()); 3 vector c( a. size () ); 4 for(size t i = 0; i < a.size (); ++i) 5 c[ i ] = a[i] + b[i ]; 6 return c; 7 }

vector &a, const vector &b) { 2 assert(a. size () == b.size()); 3 vector c( a. size () ); 4 for(size t i = 0; i < a.size (); ++i) 5 c[ i ] = a[i] + b[i ]; 6 return c; 7 } Any problems?

vector &a, const vector &b) { 2 assert(a. size () == b.size()); 3 vector c( a. size () ); 4 for(size t i = 0; i < a.size (); ++i) 5 c[ i ] = a[i] + b[i ]; 6 return c; 7 } Any problems? Extra memory allocation

vector &a, const vector &b) { 2 assert(a. size () == b.size()); 3 vector c( a. size () ); 4 for(size t i = 0; i < a.size (); ++i) 5 c[ i ] = a[i] + b[i ]; 6 return c; 7 } Any problems? Extra memory allocation Extra memory I/O

Lazy evaluation v0.1 1 struct vsum { 2 const vector
&a; 3 const vector &b; 4 vsum(const vector &a, const vector &b) : a(a), b(b) {} 5 };

&a; 3 const vector &b; 4 vsum(const vector &a, const vector &b) : a(a), b(b) {} 5 }; 6 7 const vsum operator+(const vector &a, const vector &b) { 8 return vsum(a, b); 9 }

&a; 3 const vector &b; 4 vsum(const vector &a, const vector &b) : a(a), b(b) {} 5 }; 6 7 const vsum operator+(const vector &a, const vector &b) { 8 return vsum(a, b); 9 } 10 11 const vector& vector::operator=(const vsum &s) { 12 for(size t i = 0; i < data.size (); ++i) 13 data[i ] = s.a[ i ] + s.b[i ]; 14 return ∗this; 15 }

Not general enough What happens if we write this? 1
a = x + y + z;

Not general enough What happens if we write this? 1
a = x + y + z; lazy_v1.cpp:38:15: error: invalid operands to binary expression (’const vsum’ and ’vector’) a = x + y + z; ~~~~~ ^ ~ lazy_v1.cpp:12:12: note: candidate function not viable: no known conversion from ’const vsum’ to ’const vector’ for 1st argument const vsum operator+(const vector &a, const vector &b) { ^ 1 error generated. Sidenote: Clang error messages are awesome!

Lazy evaluation v0.2 1 template <class LHS, class RHS> 2
struct vsum { 3 const LHS &lhs; 4 const RHS &rhs; 5 vsum(const LHS &lhs, const RHS &rhs) : lhs(lhs), rhs(rhs) {} 6 double operator[](size t i) const { 7 return lhs[i] + rhs[i ]; 8 } 9 };

struct vsum { 3 const LHS &lhs; 4 const RHS &rhs; 5 vsum(const LHS &lhs, const RHS &rhs) : lhs(lhs), rhs(rhs) {} 6 double operator[](size t i) const { 7 return lhs[i] + rhs[i ]; 8 } 9 }; 10 11 template <class LHS, class RHS> 12 const vsum<LHS, RHS> operator+(const LHS &a, const RHS &b) { 13 return vsum<LHS, RHS>(a, b); 14 }

struct vsum { 3 const LHS &lhs; 4 const RHS &rhs; 5 vsum(const LHS &lhs, const RHS &rhs) : lhs(lhs), rhs(rhs) {} 6 double operator[](size t i) const { 7 return lhs[i] + rhs[i ]; 8 } 9 }; 10 11 template <class LHS, class RHS> 12 const vsum<LHS, RHS> operator+(const LHS &a, const RHS &b) { 13 return vsum<LHS, RHS>(a, b); 14 } 15 16 template<class Expr> 17 const vector& vector::operator=(const Expr &expr) { 18 for(int i = 0; i < data.size (); ++i) data[i] = expr[i ]; 19 return ∗this; 20 }

Still not general enough There are times in life when
addition alone is not enough. . .

Lazy evaluation v0.3 1 struct plus { 2 static double
apply(double a, double b) { return a + b; } 3 };

apply(double a, double b) { return a + b; } 3 }; 4 5 template <class LHS, class OP, class RHS> 6 struct binary op { 7 const LHS &lhs; 8 const RHS &rhs; 9 binary op(const LHS &lhs, const RHS &rhs) : lhs(lhs), rhs(rhs) {} 10 double operator[](size t i) const { 11 return OP::apply(lhs[i], rhs[ i ]); 12 } 13 };

apply(double a, double b) { return a + b; } 3 }; 4 5 template <class LHS, class OP, class RHS> 6 struct binary op { 7 const LHS &lhs; 8 const RHS &rhs; 9 binary op(const LHS &lhs, const RHS &rhs) : lhs(lhs), rhs(rhs) {} 10 double operator[](size t i) const { 11 return OP::apply(lhs[i], rhs[ i ]); 12 } 13 }; 14 15 template <class LHS, class RHS> 16 const binary op<LHS, plus, RHS> operator+(const LHS &a, const RHS &b) { 17 return binary op<LHS, plus, RHS>(a, b); 18 }

Expression templates are trees The expression in the RHS of:
1 a = x + y; ... is of type: binary op< vector, plus, vector > + x y

1 a = x + y − z; ... is of type: binary op< binary op< vector, plus, vector > , minus , vector > − z + x y

1 a = x + y − z; ... is of type: binary op< binary op< vector, plus, vector > , minus , vector > − z + x y for(size t i = 0; i < n; ++i) a[ i ] = [ i ]

1 a = x + y − z; ... is of type: binary op< binary op< vector, plus, vector > , minus , vector > − z + x y for(size t i = 0; i < n; ++i) a[ i ] = [ i ] [ i ]

1 a = x + y − z; ... is of type: binary op< binary op< vector, plus, vector > , minus , vector > − z + x y for(size t i = 0; i < n; ++i) a[ i ] = [ i ] [ i ] [ i ]

1 a = x + y − z; ... is of type: binary op< binary op< vector, plus, vector > , minus , vector > − z + x y #pragma omp parallel for for(size t i = 0; i < n; ++i) a[ i ] = [ i ] [ i ] [ i ]

So far, so good It is now possible to: 1
v = a ∗ x + b ∗ y; 2 3 double c = (x + y)[42]; ... and its as eﬀective as: 1 for(size t i = 0; i < n; ++i) 2 v[i ] = a[i] ∗ x[i ] + b[i] ∗ y[i ]; 3 4 double c = x[42] + y[42]; No temporaries involved. Optimizing compiler is able to inline everything.

How does OpenCL work? 1 A compute kernel is compiled
at runtime from C99 source. 2 Kernel parameters are set through API calls. 3 Kernel is launched on a compute device.

How does OpenCL work? 1 A compute kernel is compiled
at runtime from C99 source. 2 Kernel parameters are set through API calls. 3 Kernel is launched on a compute device. Source may be read from a ﬁle, or stored in a static string, or generated.

Generating kernel source from an expression We want this expression:
1 a = x + y − z; . . . to result in this kernel: 1 kernel void vexcl vector kernel( 2 ulong n, 3 global double ∗ res, 4 global double ∗ prm1, 5 global double ∗ prm2, 6 global double ∗ prm3 7 ) 8 { 9 for(size t idx = get global id(0); idx < n; idx += get global size(0)) { 10 res[idx] = ( ( prm1[idx] + prm2[idx] ) − prm3[idx] ); 11 } 12 } − z + x y

Declaring parameters Each terminal knows what parameters it needs: 1
/∗static∗/ void vector::prm decl(std::ostream &src, unsigned &pos) { 2 src << ”,\n global double ∗ prm” << ++pos; 3 } An expression just asks its terminals to do the work: 4 template <class LHS, class OP, class RHS> 5 /∗static∗/ void binary op<LHS, OP, RHS>::prm decl( 6 std :: ostream &src, unsigned &pos) 7 { 8 LHS::prm decl(src, pos); 9 RHS::prm decl(src, pos); 10 }

Building expression string 1 struct plus { 2 static std
:: string string () { return ”+”; } 3 };

:: string string () { return ”+”; } 3 }; 4 5 /∗static∗/ void vector::make expr(std::ostream &src, unsigned &pos) { 6 src << ”prm” << ++pos << ”[idx]”; 7 }

:: string string () { return ”+”; } 3 }; 4 5 /∗static∗/ void vector::make expr(std::ostream &src, unsigned &pos) { 6 src << ”prm” << ++pos << ”[idx]”; 7 } 8 9 template <class LHS, class OP, class RHS> 10 /∗static∗/ void binary op<LHS, OP, RHS>::make expr( 11 std :: ostream &src, unsigned &pos) const 12 { 13 src << ”( ”; 14 LHS::make expr(src, pos); 15 src << ” ” << OP::string() << ” ”;; 16 RHS::make expr(src, pos); 17 src << ” )”; 18 }

Source generation 1 template <class LHS, class RHS> 2 std
:: string kernel source() { 3 std :: ostringstream src; 4 5 src << ”kernel void vexcl vector kernel(\n ulong n”; 6 unsigned pos = 0; 7 LHS::prm decl(src, pos); 8 RHS::prm decl(src, pos); 9 src << ”)\n{\n” 10 ” for( size t idx = get global id (0); idx < n; idx += get global size(0)) {\n” 11 ” ”; 12 pos = 0; 13 LHS::make expr(src, pos); src << ” = ”; 14 RHS::make expr(src, pos); src << ”;\n”; 15 src << ” }\n}\n”; 16 17 return src.str (); 18 }

Setting kernel arguments 1 void vector:: set args(cl :: Kernel
&krn, unsigned &pos) { 2 krn.setArg(pos++, buﬀer); 3 } 4 5 template <class LHS, class OP, class RHS> 6 void binary op<LHS, OP, RHS>::set args(cl::Kernel &krn, unsigned &pos) { 7 lhs . set args(krn, pos); 8 rhs. set args(krn, pos); 9 }

Caching compiled kernels Each kernel is uniquely identified by its
expression type: Expression template captures terminal types and operations. 1 template <class Expr> 2 const vector& vector::operator=(const Expr &expr) { 3 static cl :: Kernel kernel = build kernel(device, kernel source<This, Expr>()); 4 5 unsigned pos = 0; 6 7 kernel.setArg(pos++, size); // n 8 kernel.setArg(pos++, buffer); // res 9 expr.set args(kernel, pos); // other parameters 10 11 queue.enqueueNDRangeKernel(kernel, cl::NullRange, buffer.size(), cl::NullRange); 12 13 return ∗this; 14 }

What you saw is not what you get The actual
implementation is a bit more complicated: There are unary, binary, ternary expressions. There are special terminals requiring either global or local preambles. There are builtin and user-deﬁned functions. . . . Boost.Proto is used to keep the code mess contained. But you got the general idea.

General structure of OpenCL code VexCL code: 1 VEX FUNCTION(
sqr, double(double, double), ”return prm1 ∗ prm1 + prm2 ∗ prm2;” ); 2 auto tmp = vex::make temp<1>( x ); 3 y = sqr( sin(tmp), cos(tmp) ); OpenCL code: 1 double func1(double prm1, double prm2) { 2 return prm1 ∗ prm1 + prm2 ∗ prm2; 3 } 4 kernel void vexcl vector kernel( 5 ulong n, 6 global double ∗ res, 7 global double ∗ prm1 8 ) 9 { 10 for(size t idx = get global id(0); idx < n; idx += get global size(0)) { 11 double temp1 = prm1[idx]; 12 res[idx] = func1( ( sin( temp1 ), cos( temp1 ) ) ); 13 } 14 }

Extending VexCL (adding terminal types) Each terminal should tell VexCL
how to use it by specializing several struct templates: Does it need global preamble? What kernel parameters does it need? Does it need local preamble? What contribution does it make to expression string? How to set kernel parameters?

Example: element index <vexcl/element index.hpp> 1 x = sin( 2
∗ M PI ∗ vex::element index() / n );

VexCL implementation, University of Texas, 2013

VexCL implementation, University of Texas, 2013

More Decks by Denis Demidov

Other Decks in Programming

Featured

Transcript