Slide 1

Slide 1 text

VexCL – Automatic OpenCL Code Generation Denis Demidov Supercomputer Center of Russian Academy of Sciences Kazan Federal University October 2013, Austin, Texas

Slide 2

Slide 2 text

1 Expression templates 2 OpenCL code generation 3 Extending VexCL

Slide 3

Slide 3 text

Expression templates How to implement a DSL in C++ effectively? The idea is quite old: Todd Veldhuizen, Expression templates, C++ Report, 1995 First (?) implementation: Blitz++ is a C++ class library for scientific computing which provides performance on par with Fortran 77/90. Today: std::valarray, Boost.uBLAS, MTL, Eigen, Armadillo, etc. How does it work?

Slide 4

Slide 4 text

Simple example: Vector addition We want to be able to write: 1 x = y + z; And it has to be as effective as: 1 for(size t i = 0; i < n; ++i) 2 x[i ] = y[i] + z[i ];

Slide 5

Slide 5 text

C++ allows us to overload operators! 1 const vector operator+(const vector &a, const vector &b) { 2 assert(a. size () == b.size()); 3 vector c( a. size () ); 4 for(size t i = 0; i < a.size (); ++i) 5 c[ i ] = a[i] + b[i ]; 6 return c; 7 }

Slide 6

Slide 6 text

C++ allows us to overload operators! 1 const vector operator+(const vector &a, const vector &b) { 2 assert(a. size () == b.size()); 3 vector c( a. size () ); 4 for(size t i = 0; i < a.size (); ++i) 5 c[ i ] = a[i] + b[i ]; 6 return c; 7 } Any problems?

Slide 7

Slide 7 text

C++ allows us to overload operators! 1 const vector operator+(const vector &a, const vector &b) { 2 assert(a. size () == b.size()); 3 vector c( a. size () ); 4 for(size t i = 0; i < a.size (); ++i) 5 c[ i ] = a[i] + b[i ]; 6 return c; 7 } Any problems? Extra memory allocation

Slide 8

Slide 8 text

C++ allows us to overload operators! 1 const vector operator+(const vector &a, const vector &b) { 2 assert(a. size () == b.size()); 3 vector c( a. size () ); 4 for(size t i = 0; i < a.size (); ++i) 5 c[ i ] = a[i] + b[i ]; 6 return c; 7 } Any problems? Extra memory allocation Extra memory I/O

Slide 9

Slide 9 text

Lazy evaluation v0.1 1 struct vsum { 2 const vector &a; 3 const vector &b; 4 vsum(const vector &a, const vector &b) : a(a), b(b) {} 5 };

Slide 10

Slide 10 text

Lazy evaluation v0.1 1 struct vsum { 2 const vector &a; 3 const vector &b; 4 vsum(const vector &a, const vector &b) : a(a), b(b) {} 5 }; 6 7 const vsum operator+(const vector &a, const vector &b) { 8 return vsum(a, b); 9 }

Slide 11

Slide 11 text

Lazy evaluation v0.1 1 struct vsum { 2 const vector &a; 3 const vector &b; 4 vsum(const vector &a, const vector &b) : a(a), b(b) {} 5 }; 6 7 const vsum operator+(const vector &a, const vector &b) { 8 return vsum(a, b); 9 } 10 11 const vector& vector::operator=(const vsum &s) { 12 for(size t i = 0; i < data.size (); ++i) 13 data[i ] = s.a[ i ] + s.b[i ]; 14 return ∗this; 15 }

Slide 12

Slide 12 text

Not general enough What happens if we write this? 1 a = x + y + z;

Slide 13

Slide 13 text

Not general enough What happens if we write this? 1 a = x + y + z; lazy_v1.cpp:38:15: error: invalid operands to binary expression (’const vsum’ and ’vector’) a = x + y + z; ~~~~~ ^ ~ lazy_v1.cpp:12:12: note: candidate function not viable: no known conversion from ’const vsum’ to ’const vector’ for 1st argument const vsum operator+(const vector &a, const vector &b) { ^ 1 error generated. Sidenote: Clang error messages are awesome!

Slide 14

Slide 14 text

Lazy evaluation v0.2 1 template 2 struct vsum { 3 const LHS &lhs; 4 const RHS &rhs; 5 vsum(const LHS &lhs, const RHS &rhs) : lhs(lhs), rhs(rhs) {} 6 double operator[](size t i) const { 7 return lhs[i] + rhs[i ]; 8 } 9 };

Slide 15

Slide 15 text

Lazy evaluation v0.2 1 template 2 struct vsum { 3 const LHS &lhs; 4 const RHS &rhs; 5 vsum(const LHS &lhs, const RHS &rhs) : lhs(lhs), rhs(rhs) {} 6 double operator[](size t i) const { 7 return lhs[i] + rhs[i ]; 8 } 9 }; 10 11 template 12 const vsum operator+(const LHS &a, const RHS &b) { 13 return vsum(a, b); 14 }

Slide 16

Slide 16 text

Lazy evaluation v0.2 1 template 2 struct vsum { 3 const LHS &lhs; 4 const RHS &rhs; 5 vsum(const LHS &lhs, const RHS &rhs) : lhs(lhs), rhs(rhs) {} 6 double operator[](size t i) const { 7 return lhs[i] + rhs[i ]; 8 } 9 }; 10 11 template 12 const vsum operator+(const LHS &a, const RHS &b) { 13 return vsum(a, b); 14 } 15 16 template 17 const vector& vector::operator=(const Expr &expr) { 18 for(int i = 0; i < data.size (); ++i) data[i] = expr[i ]; 19 return ∗this; 20 }

Slide 17

Slide 17 text

Still not general enough There are times in life when addition alone is not enough. . .

Slide 18

Slide 18 text

Lazy evaluation v0.3 1 struct plus { 2 static double apply(double a, double b) { return a + b; } 3 };

Slide 19

Slide 19 text

Lazy evaluation v0.3 1 struct plus { 2 static double apply(double a, double b) { return a + b; } 3 }; 4 5 template 6 struct binary op { 7 const LHS &lhs; 8 const RHS &rhs; 9 binary op(const LHS &lhs, const RHS &rhs) : lhs(lhs), rhs(rhs) {} 10 double operator[](size t i) const { 11 return OP::apply(lhs[i], rhs[ i ]); 12 } 13 };

Slide 20

Slide 20 text

Lazy evaluation v0.3 1 struct plus { 2 static double apply(double a, double b) { return a + b; } 3 }; 4 5 template 6 struct binary op { 7 const LHS &lhs; 8 const RHS &rhs; 9 binary op(const LHS &lhs, const RHS &rhs) : lhs(lhs), rhs(rhs) {} 10 double operator[](size t i) const { 11 return OP::apply(lhs[i], rhs[ i ]); 12 } 13 }; 14 15 template 16 const binary op operator+(const LHS &a, const RHS &b) { 17 return binary op(a, b); 18 }

Slide 21

Slide 21 text

Expression templates are trees The expression in the RHS of: 1 a = x + y; ... is of type: binary op< vector, plus, vector > + x y

Slide 22

Slide 22 text

Expression templates are trees The expression in the RHS of: 1 a = x + y − z; ... is of type: binary op< binary op< vector, plus, vector > , minus , vector > − z + x y

Slide 23

Slide 23 text

Expression templates are trees The expression in the RHS of: 1 a = x + y − z; ... is of type: binary op< binary op< vector, plus, vector > , minus , vector > − z + x y for(size t i = 0; i < n; ++i) a[ i ] = [ i ]

Slide 24

Slide 24 text

Expression templates are trees The expression in the RHS of: 1 a = x + y − z; ... is of type: binary op< binary op< vector, plus, vector > , minus , vector > − z + x y for(size t i = 0; i < n; ++i) a[ i ] = [ i ] [ i ]

Slide 25

Slide 25 text

Expression templates are trees The expression in the RHS of: 1 a = x + y − z; ... is of type: binary op< binary op< vector, plus, vector > , minus , vector > − z + x y for(size t i = 0; i < n; ++i) a[ i ] = [ i ] [ i ] [ i ]

Slide 26

Slide 26 text

Expression templates are trees The expression in the RHS of: 1 a = x + y − z; ... is of type: binary op< binary op< vector, plus, vector > , minus , vector > − z + x y #pragma omp parallel for for(size t i = 0; i < n; ++i) a[ i ] = [ i ] [ i ] [ i ]

Slide 27

Slide 27 text

So far, so good It is now possible to: 1 v = a ∗ x + b ∗ y; 2 3 double c = (x + y)[42]; ... and its as effective as: 1 for(size t i = 0; i < n; ++i) 2 v[i ] = a[i] ∗ x[i ] + b[i] ∗ y[i ]; 3 4 double c = x[42] + y[42]; No temporaries involved. Optimizing compiler is able to inline everything.

Slide 28

Slide 28 text

1 Expression templates 2 OpenCL code generation 3 Extending VexCL

Slide 29

Slide 29 text

How does OpenCL work? 1 A compute kernel is compiled at runtime from C99 source. 2 Kernel parameters are set through API calls. 3 Kernel is launched on a compute device.

Slide 30

Slide 30 text

How does OpenCL work? 1 A compute kernel is compiled at runtime from C99 source. 2 Kernel parameters are set through API calls. 3 Kernel is launched on a compute device. Source may be read from a file, or stored in a static string, or generated.

Slide 31

Slide 31 text

Generating kernel source from an expression We want this expression: 1 a = x + y − z; . . . to result in this kernel: 1 kernel void vexcl vector kernel( 2 ulong n, 3 global double ∗ res, 4 global double ∗ prm1, 5 global double ∗ prm2, 6 global double ∗ prm3 7 ) 8 { 9 for(size t idx = get global id(0); idx < n; idx += get global size(0)) { 10 res[idx] = ( ( prm1[idx] + prm2[idx] ) − prm3[idx] ); 11 } 12 } − z + x y

Slide 32

Slide 32 text

Declaring parameters Each terminal knows what parameters it needs: 1 /∗static∗/ void vector::prm decl(std::ostream &src, unsigned &pos) { 2 src << ”,\n global double ∗ prm” << ++pos; 3 } An expression just asks its terminals to do the work: 4 template 5 /∗static∗/ void binary op::prm decl( 6 std :: ostream &src, unsigned &pos) 7 { 8 LHS::prm decl(src, pos); 9 RHS::prm decl(src, pos); 10 }

Slide 33

Slide 33 text

Building expression string 1 struct plus { 2 static std :: string string () { return ”+”; } 3 };

Slide 34

Slide 34 text

Building expression string 1 struct plus { 2 static std :: string string () { return ”+”; } 3 }; 4 5 /∗static∗/ void vector::make expr(std::ostream &src, unsigned &pos) { 6 src << ”prm” << ++pos << ”[idx]”; 7 }

Slide 35

Slide 35 text

Building expression string 1 struct plus { 2 static std :: string string () { return ”+”; } 3 }; 4 5 /∗static∗/ void vector::make expr(std::ostream &src, unsigned &pos) { 6 src << ”prm” << ++pos << ”[idx]”; 7 } 8 9 template 10 /∗static∗/ void binary op::make expr( 11 std :: ostream &src, unsigned &pos) const 12 { 13 src << ”( ”; 14 LHS::make expr(src, pos); 15 src << ” ” << OP::string() << ” ”;; 16 RHS::make expr(src, pos); 17 src << ” )”; 18 }

Slide 36

Slide 36 text

Source generation 1 template 2 std :: string kernel source() { 3 std :: ostringstream src; 4 5 src << ”kernel void vexcl vector kernel(\n ulong n”; 6 unsigned pos = 0; 7 LHS::prm decl(src, pos); 8 RHS::prm decl(src, pos); 9 src << ”)\n{\n” 10 ” for( size t idx = get global id (0); idx < n; idx += get global size(0)) {\n” 11 ” ”; 12 pos = 0; 13 LHS::make expr(src, pos); src << ” = ”; 14 RHS::make expr(src, pos); src << ”;\n”; 15 src << ” }\n}\n”; 16 17 return src.str (); 18 }

Slide 37

Slide 37 text

Source generation 1 template 2 std :: string kernel source() { 3 std :: ostringstream src; 4 5 src << ”kernel void vexcl vector kernel(\n ulong n”; 6 unsigned pos = 0; 7 LHS::prm decl(src, pos); 8 RHS::prm decl(src, pos); 9 src << ”)\n{\n” 10 ” for( size t idx = get global id (0); idx < n; idx += get global size(0)) {\n” 11 ” ”; 12 pos = 0; 13 LHS::make expr(src, pos); src << ” = ”; 14 RHS::make expr(src, pos); src << ”;\n”; 15 src << ” }\n}\n”; 16 17 return src.str (); 18 }

Slide 38

Slide 38 text

Source generation 1 template 2 std :: string kernel source() { 3 std :: ostringstream src; 4 5 src << ”kernel void vexcl vector kernel(\n ulong n”; 6 unsigned pos = 0; 7 LHS::prm decl(src, pos); 8 RHS::prm decl(src, pos); 9 src << ”)\n{\n” 10 ” for( size t idx = get global id (0); idx < n; idx += get global size(0)) {\n” 11 ” ”; 12 pos = 0; 13 LHS::make expr(src, pos); src << ” = ”; 14 RHS::make expr(src, pos); src << ”;\n”; 15 src << ” }\n}\n”; 16 17 return src.str (); 18 }

Slide 39

Slide 39 text

Setting kernel arguments 1 void vector:: set args(cl :: Kernel &krn, unsigned &pos) { 2 krn.setArg(pos++, buffer); 3 } 4 5 template 6 void binary op::set args(cl::Kernel &krn, unsigned &pos) { 7 lhs . set args(krn, pos); 8 rhs. set args(krn, pos); 9 }

Slide 40

Slide 40 text

Caching compiled kernels Each kernel is uniquely identified by its expression type: Expression template captures terminal types and operations. 1 template 2 const vector& vector::operator=(const Expr &expr) { 3 static cl :: Kernel kernel = build kernel(device, kernel source()); 4 5 unsigned pos = 0; 6 7 kernel.setArg(pos++, size); // n 8 kernel.setArg(pos++, buffer); // res 9 expr.set args(kernel, pos); // other parameters 10 11 queue.enqueueNDRangeKernel(kernel, cl::NullRange, buffer.size(), cl::NullRange); 12 13 return ∗this; 14 }

Slide 41

Slide 41 text

Caching compiled kernels Each kernel is uniquely identified by its expression type: Expression template captures terminal types and operations. 1 template 2 const vector& vector::operator=(const Expr &expr) { 3 static cl :: Kernel kernel = build kernel(device, kernel source()); 4 5 unsigned pos = 0; 6 7 kernel.setArg(pos++, size); // n 8 kernel.setArg(pos++, buffer); // res 9 expr.set args(kernel, pos); // other parameters 10 11 queue.enqueueNDRangeKernel(kernel, cl::NullRange, buffer.size(), cl::NullRange); 12 13 return ∗this; 14 }

Slide 42

Slide 42 text

What you saw is not what you get The actual implementation is a bit more complicated: There are unary, binary, ternary expressions. There are special terminals requiring either global or local preambles. There are builtin and user-defined functions. . . . Boost.Proto is used to keep the code mess contained. But you got the general idea.

Slide 43

Slide 43 text

1 Expression templates 2 OpenCL code generation 3 Extending VexCL

Slide 44

Slide 44 text

General structure of OpenCL code VexCL code: 1 VEX FUNCTION( sqr, double(double, double), ”return prm1 ∗ prm1 + prm2 ∗ prm2;” ); 2 auto tmp = vex::make temp<1>( x ); 3 y = sqr( sin(tmp), cos(tmp) ); OpenCL code: 1 double func1(double prm1, double prm2) { 2 return prm1 ∗ prm1 + prm2 ∗ prm2; 3 } 4 kernel void vexcl vector kernel( 5 ulong n, 6 global double ∗ res, 7 global double ∗ prm1 8 ) 9 { 10 for(size t idx = get global id(0); idx < n; idx += get global size(0)) { 11 double temp1 = prm1[idx]; 12 res[idx] = func1( ( sin( temp1 ), cos( temp1 ) ) ); 13 } 14 }

Slide 45

Slide 45 text

General structure of OpenCL code VexCL code: 1 VEX FUNCTION( sqr, double(double, double), ”return prm1 ∗ prm1 + prm2 ∗ prm2;” ); 2 auto tmp = vex::make temp<1>( x ); 3 y = sqr( sin(tmp), cos(tmp) ); OpenCL code: 1 double func1(double prm1, double prm2) { 2 return prm1 ∗ prm1 + prm2 ∗ prm2; 3 } 4 kernel void vexcl vector kernel( 5 ulong n, 6 global double ∗ res, 7 global double ∗ prm1 8 ) 9 { 10 for(size t idx = get global id(0); idx < n; idx += get global size(0)) { 11 double temp1 = prm1[idx]; 12 res[idx] = func1( ( sin( temp1 ), cos( temp1 ) ) ); 13 } 14 }

Slide 46

Slide 46 text

General structure of OpenCL code VexCL code: 1 VEX FUNCTION( sqr, double(double, double), ”return prm1 ∗ prm1 + prm2 ∗ prm2;” ); 2 auto tmp = vex::make temp<1>( x ); 3 y = sqr( sin(tmp), cos(tmp) ); OpenCL code: 1 double func1(double prm1, double prm2) { 2 return prm1 ∗ prm1 + prm2 ∗ prm2; 3 } 4 kernel void vexcl vector kernel( 5 ulong n, 6 global double ∗ res, 7 global double ∗ prm1 8 ) 9 { 10 for(size t idx = get global id(0); idx < n; idx += get global size(0)) { 11 double temp1 = prm1[idx]; 12 res[idx] = func1( ( sin( temp1 ), cos( temp1 ) ) ); 13 } 14 }

Slide 47

Slide 47 text

General structure of OpenCL code VexCL code: 1 VEX FUNCTION( sqr, double(double, double), ”return prm1 ∗ prm1 + prm2 ∗ prm2;” ); 2 auto tmp = vex::make temp<1>( x ); 3 y = sqr( sin(tmp), cos(tmp) ); OpenCL code: 1 double func1(double prm1, double prm2) { 2 return prm1 ∗ prm1 + prm2 ∗ prm2; 3 } 4 kernel void vexcl vector kernel( 5 ulong n, 6 global double ∗ res, 7 global double ∗ prm1 8 ) 9 { 10 for(size t idx = get global id(0); idx < n; idx += get global size(0)) { 11 double temp1 = prm1[idx]; 12 res[idx] = func1( ( sin( temp1 ), cos( temp1 ) ) ); 13 } 14 }

Slide 48

Slide 48 text

General structure of OpenCL code VexCL code: 1 VEX FUNCTION( sqr, double(double, double), ”return prm1 ∗ prm1 + prm2 ∗ prm2;” ); 2 auto tmp = vex::make temp<1>( x ); 3 y = sqr( sin(tmp), cos(tmp) ); OpenCL code: 1 double func1(double prm1, double prm2) { 2 return prm1 ∗ prm1 + prm2 ∗ prm2; 3 } 4 kernel void vexcl vector kernel( 5 ulong n, 6 global double ∗ res, 7 global double ∗ prm1 8 ) 9 { 10 for(size t idx = get global id(0); idx < n; idx += get global size(0)) { 11 double temp1 = prm1[idx]; 12 res[idx] = func1( ( sin( temp1 ), cos( temp1 ) ) ); 13 } 14 }

Slide 49

Slide 49 text

Extending VexCL (adding terminal types) Each terminal should tell VexCL how to use it by specializing several struct templates: Does it need global preamble? What kernel parameters does it need? Does it need local preamble? What contribution does it make to expression string? How to set kernel parameters?

Slide 50

Slide 50 text

Example: element index 1 x = sin( 2 ∗ M PI ∗ vex::element index() / n );

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

No content

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

No content