VexCL: Experiences in Developing a C++ Wrapper Library for OpenCL

Transcript

VexCL Experiences in Developing a C++ Wrapper Library for OpenCL

Denis Demidov Kazan Federal University / Institute of System Research, Russian Academy of Sciences 06.2016, Lausanne

VexCL — a vector expression template library for OpenCL/CUDA Fork

m e on G itH ub Created for ease of C++ based GPGPU development: Convenient notation for vector expressions OpenCL/CUDA JIT code generation Easily combined with existing libraries/code Header-only Supported backends: OpenCL (Khronos C++ bindings, Boost.Compute) NVIDIA CUDA The source code is available under MIT license: https://github.com/ddemidov/vexcl

Motivation/Hello World OpenCL 1 #include <iostream> 2 #include <vector> 3

#include <string> 4 5 #define __CL_ENABLE_EXCEPTIONS 6 #include <CL/cl.hpp> 7 8 int main() { 9 // Get list of OpenCL platforms. 10 std::vector<cl::Platform> platform; 11 cl::Platform::get(&platform); 12 13 if (platform.empty()) 14 throw std::runtime_error("No OpenCL platforms"); 15 16 // Get ﬁrst available GPU. 17 cl::Context context; 18 std::vector<cl::Device> device; 19 for(auto p = platform.begin(); device.empty() && p != platform.end(); p++) { 20 std::vector<cl::Device> pldev; 21 try { 22 p->getDevices(CL_DEVICE_TYPE_GPU, &pldev); 23 for(auto d = pldev.begin(); device.empty() && d != pldev.end(); d++) { 24 if (!d->getInfo<CL_DEVICE_AVAILABLE>()) continue; 25 device.push_back(*d); 26 context = cl::Context(device); 27 } 28 } catch(...) { 29 device.clear(); 30 } 31 } 32 if (device.empty()) throw std::runtime_error("No GPUs"); 33 34 // Create command queue. 35 cl::CommandQueue queue(context, device[0]); 36 37 // Prepare input data, transfer it to the device. 38 const size_t N = 1 << 20; 39 std::vector<float> a(N, 1), b(N, 2), c(N); 40 41 cl::Buffer A(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 42 a.size() * sizeof(float), a.data()); 43 44 cl::Buffer B(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 45 b.size() * sizeof(float), b.data()); 46 47 cl::Buffer C(context, CL_MEM_READ_WRITE, 48 c.size() * sizeof(float)); 49 50 // Create kernel source 51 const char source[] = R"( 52 kernel void add( 53 ulong n, 54 global const float *a, 55 global const float *b, 56 global float *c 57 ) 58 { 59 ulong i = get_global_id(0); 60 if (i < n) c[i] = a[i] + b[i]; 61 } 62 )"; 63 64 // Compile OpenCL program for the device. 65 cl::Program program(context, cl::Program::Sources( 66 1, std::make_pair(source, strlen(source)) 67 )); 68 try { 69 program.build(device); 70 } catch (const cl::Error&) { 71 std::cerr 72 << "OpenCL compilation error" << std::endl 73 << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device[0]) 74 << std::endl; 75 throw std::runtime_error("OpenCL build error"); 76 } 77 cl::Kernel add(program, "add"); 78 79 // Set kernel parameters. 80 add.setArg(0, static_cast<cl_ulong>(N)); 81 add.setArg(1, A); 82 add.setArg(2, B); 83 add.setArg(3, C); 84 85 // Launch kernel on the compute device. 86 queue.enqueueNDRangeKernel(add, cl::NullRange, N, cl::NullRange); 87 88 // Get result back to host. 89 queue.enqueueReadBuffer(C, CL_TRUE, 0, c.size() * sizeof(float), c.data()); 90 std::cout << c[42] << std::endl; // Should get '3' here. 91 } VexCL 1 #include <iostream> 2 #include <vector> 3 #include <vexcl/vexcl.hpp> 4 5 int main() { 6 // Get ﬁrst available GPU 7 vex::Context ctx(vex::Filter::GPU && vex::Filter::Count(1)); 8 if (!ctx) throw std::runtime_error("No GPUs"); 9 10 // Prepare and copy input data 11 const size_t n = 1024 * 1024; 12 std::vector<float> a(n, 1), b(n, 2), c(n); 13 vex::vector<float> A(ctx, a), B(ctx, b), C(ctx, n); 14 15 // Launch compute kernel 16 C = A + B; 17 18 // Get results back to host 19 vex::copy(C, c); 20 std::cout << c[42] << std::endl; 21 }

Motivation/Hello World OpenCL 1 #include <iostream> 2 #include <vector> 3

OpenCL/CUDA code is generated from user expressions Expression: x =

2 * y - sin(z); when compiled with: g++ -DVEXCL_BACKEND_OPENCL -lOpenCL ... Generated kernel: 1 kernel void vexcl_vector_kernel 2 ( 3 ulong n, 4 global double * prm_1, // x 5 int prm_2, // 2 6 global double * prm_3, // y 7 global double * prm_4 // z 8 ) 9 { 10 for(ulong idx = get_global_id(0); idx < n; idx += get_global_size(0)) 11 { 12 prm_1[idx] = ( ( prm_2 * prm_3[idx] ) - sin( prm_4[idx] ) ); 13 } 14 }

OpenCL/CUDA code is generated from user expressions Expression: x =

2 * y - sin(z); when compiled with: g++ -DVEXCL_BACKEND_CUDA -lcuda ... Generated kernel: 1 extern "C" __global__ void vexcl_vector_kernel ( 2 ulong n, 3 double * prm_1, // x 4 int prm_2, // 2 5 double * prm_3, // y 6 double * prm_4 // z 7 ) 8 { 9 for(ulong idx = blockDim.x*blockIdx.x + threadIdx.x, grid_size = blockDim.x*gridDim.x; 10 idx < n; idx += grid_size) 11 { 12 prm_1[idx] = ( ( prm_2 * prm_3[idx] ) - sin( prm_4[idx] ) ); 13 } 14 }

Monte Carlo π Compute approximate value of π: area of

circle area of square = πr2 (2r)2 = π 4 , π = 4 area of circle area of square ≈ 4 points in circle all points 0.0 0.5 1.0 0.5 1.0 1 vex::Random<cl_double2> rnd; 2 vex::Reductor<size_t, vex::SUM> sum(ctx); 3 double pi = 4.0 * sum( length( rnd(vex::element_index(0, n), seed) ) < 1 ) / n;

Expression templates VexCL uses expression templates to generate compute kernels.

Todd Veldhuizen, Expression templates, C++ Report, 1995. Example: vector operations 1 template <class LHS, class OP, class RHS> 2 struct binary_op { 3 const LHS &lhs; 4 const RHS &rhs; 5 }; 6 7 template <class LHS, class RHS> 8 binary_op<LHS, plus, RHS> operator+(const LHS &a, const RHS &b) { 9 return binary_op<LHS, plus, RHS>{a, b}; 10 }

Vector operations 1 struct plus { static double apply(double a,

double b) { return a + b; } }; 2 3 template <class LHS, class OP, class RHS> 4 double binary_op<LHS, OP, RHS>::operator[](size_t i) const { 5 return OP::apply(lhs[i], rhs[i]); 6 } 7 vector& vector::operator=(const Expr &expr) { 8 for(size_t i = 0; i < _n; ++i) _data[i] = expr[i]; 9 } a = x + y - z; − z + x y

Vector operations 1 struct plus { static double apply(double a,

double b) { return a + b; } }; 2 3 template <class LHS, class OP, class RHS> 4 double binary_op<LHS, OP, RHS>::operator[](size_t i) const { 5 return OP::apply(lhs[i], rhs[i]); 6 } 7 vector& vector::operator=(const Expr &expr) { 8 for(size_t i = 0; i < _n; ++i) _data[i] = expr[i]; 9 } a = x + y - z; − z + x y for(size_t i = 0; i < n; ++i) a[i] = [i]

Vector operations 1 struct plus { static double apply(double a,

double b) { return a + b; } }; 2 3 template <class LHS, class OP, class RHS> 4 double binary_op<LHS, OP, RHS>::operator[](size_t i) const { 5 return OP::apply(lhs[i], rhs[i]); 6 } 7 vector& vector::operator=(const Expr &expr) { 8 for(size_t i = 0; i < _n; ++i) _data[i] = expr[i]; 9 } a = x + y - z; − z + x y for(size_t i = 0; i < n; ++i) a[i] = [i] [i]

Vector operations 1 struct plus { static double apply(double a,

double b) { return a + b; } }; 2 3 template <class LHS, class OP, class RHS> 4 double binary_op<LHS, OP, RHS>::operator[](size_t i) const { 5 return OP::apply(lhs[i], rhs[i]); 6 } 7 vector& vector::operator=(const Expr &expr) { 8 for(size_t i = 0; i < _n; ++i) _data[i] = expr[i]; 9 } a = x + y - z; − z + x y for(size_t i = 0; i < n; ++i) a[i] = [i] [i] [i]

Vector operations 1 struct plus { static double apply(double a,

double b) { return a + b; } }; 2 3 template <class LHS, class OP, class RHS> 4 double binary_op<LHS, OP, RHS>::operator[](size_t i) const { 5 return OP::apply(lhs[i], rhs[i]); 6 } 7 vector& vector::operator=(const Expr &expr) { 8 for(size_t i = 0; i < _n; ++i) _data[i] = expr[i]; 9 } a = x + y - z; − z + x y #pragma omp parallel for for(size_t i = 0; i < n; ++i) a[i] = [i] [i] [i]

Source code generation Expression: a = x + y -

z; Kernel: 1 kernel void vexcl_vector_kernel( 2 ulong n, 3 global double * prm_1, 4 global double * prm_2, 5 global double * prm_3, 6 global double * prm_4 7 ) 8 { 9 for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0)) { 10 prm_1[idx] = ( ( prm_2[idx] + prm_3[idx] ) - prm_4[idx] ); 11 } 12 } − z + x y

Declaring parameters Each terminal knows what parameters it needs: 1

/∗static∗/ void vector::declare_params(std::ostream &src, unsigned &pos) 2 { 3 src << ",\n global double * prm" << ++pos; 4 } An expression just asks its terminals to do the job: 5 template <class LHS, class OP, class RHS> 6 /∗static∗/ void binary_op<LHS, OP, RHS>::declare_params(std::ostream &src, unsigned &pos) 7 { 8 LHS::declare_params(src, pos); 9 RHS::declare_params(src, pos); 10 }

Building string representation for an expression 1 struct plus {

2 static std::string to_string(std::ostream &src) { src << " + "; } 3 }; 4 5 /∗static∗/ void vector::to_string(std::ostream &src, unsigned &pos) { 6 src << "prm" << ++pos << "[idx]"; 7 } 8 9 template <class LHS, class OP, class RHS> 10 /∗static∗/ void binary_op<LHS, OP, RHS>::to_string(std::ostream &src, unsigned &pos) { 11 src << "( "; 12 LHS::to_string(src, pos); 13 OP::to_string(src); 14 RHS::to_string(src, pos); 15 src << " )"; 16 }

Generating kernel source 1 template <class LHS, class RHS> 2

std::string kernel_source() { 3 std::ostringstream src; 4 5 src << "kernel void vexcl_vector_kernel(\n ulong n"; 6 unsigned pos = 0; 7 LHS::declare_params(src, pos); 8 RHS::declare_params(src, pos); 9 src << ")\n{\n" 10 " for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0)) {\n" 11 " "; 12 pos = 0; 13 LHS::to_string(src, pos); src << " = "; 14 RHS::to_string(src, pos); src << ";\n"; 15 src << " }\n}\n"; 16 17 return src.str(); 18 }

Generating/launching the kernel on demand 1 template <class Expr> 2

const vector& vector::operator=(const Expr &expr) { 3 static cl::Kernel krn = build_kernel(device, kernel_source<This, Expr>()); 4 5 unsigned pos = 0; 6 7 krn.setArg(pos++, size); // n 8 this->set_args(krn, pos); // result 9 expr.set_args(krn, pos); // other parameters 10 11 queue.enqueueNDRangeKernel(krn, cl::NullRange, buffer.size(), cl::NullRange); 12 13 return *this; 14 } The kernel is uniquely identiﬁed by the expression type. Local static variable may be used to cache the kernel. The kernel is generated and compiled once, applied many times.

Using Boost.Proto Problem: this does not scale very well There

are unary, binary, n-ary expressions. There are special terminals requiring either global or local preambles. There are builtin and user-deﬁned functions. We need to be able to combine all of the above in an expression. Solution: Boost.Proto Framework for building Embedded Domain-Speciﬁc Languages in C++. Provides tools for constructing, type-checking, transforming and executing expression templates.

Proto grammar 1 struct vector_grammar 2 : proto::or_< 3 proto::and_<

4 proto::terminal<proto::_>, 5 proto::if_<is_vector_expr_terminal<proto::_value>() > 6 >, 7 proto::or_< 8 proto::unary_plus<vector_grammar>, 9 proto::negate<vector_grammar>, 10 ... 11 proto::plus <vector_grammar, vector_grammar>, 12 proto::minus<vector_grammar, vector_grammar>, 13 ... 14 >, 15 ... 16 > 17 {};

Proto contexts 1 template <class Term> struct partial_vector_expr { 2

static void get(std::ostream &os, unsigned &pos) { os << "prm_" << ++pos; } 3 }; 4 struct vector_expr_context { 5 std::ostream &os; unsigned pos = 0; 6 template <typename Expr, typename Tag = typename Expr::proto_tag> struct eval; 7 8 template <typename Expr> struct eval<Expr, boost::proto::tag::plus> { 9 typedef void result_type; 10 void operator()(const Expr &expr, vector_expr_context &ctx) const { 11 os << "( "; proto::eval(proto::left (expr), ctx); 12 os << " + "; proto::eval(proto::right(expr), ctx); os << " )"; 13 } 14 }; 15 ... 16 template <typename Expr> struct eval<Expr, boost::proto::tag::terminal> { 17 typedef void result_type; 18 void operator()(const Expr &expr, vector_expr_context &ctx) const { 19 partial_vector_expr<typename proto::result_of::value<Expr>::type>::get(os, ctx.pos); 20 } 21 }; 22 };

Proto contexts 1 template <class Term> struct partial_vector_expr { 2

static void get(std::ostream &os, unsigned &pos) { os << "prm_" << ++pos; } 3 }; 4 struct vector_expr_context { 5 std::ostream &os; unsigned pos = 0; 6 template <typename Expr, typename Tag = typename Expr::proto_tag> struct eval; 7 8 template <typename Expr> struct eval<Expr, boost::proto::tag::plus> { 9 typedef void result_type; 10 void operator()(const Expr &expr, vector_expr_context &ctx) const { 11 os << "( "; proto::eval(proto::left (expr), ctx); 12 os << " + "; proto::eval(proto::right(expr), ctx); os << " )"; 13 } 14 }; 15 ... 16 template <typename Expr> struct eval<Expr, boost::proto::tag::terminal> { 17 typedef void result_type; 18 void operator()(const Expr &expr, vector_expr_context &ctx) const { 19 partial_vector_expr<typename proto::result_of::value<Expr>::type>::get(os, ctx.pos); 20 } 21 }; 22 };

Adding new terminals Once the DSL engine is set up,

adding new terminals is easy: x = sin( 2 * M_PI * vex::element index() / n ); Let Proto know the terminal is allowed in vector expressions: 1 struct elem_index { 2 size_t offset; 3 size_t length; 4 }; 5 6 template <> struct is_vector_expr_terminal< elem_index > : std::true_type {}; 7 8 inline auto element_index(size_t offset = 0, size_t length = 0) { 9 return proto::as_expr<vector_domain>( elem_index{offset, length} ); 10 }

Parameter declaration: 1 template <> struct kernel_param_declaration<elem_index> { 2 static

std::string get( 3 const elem_index&, const cl::Device&, const std::string &prm_name) 4 { 5 return ",\n\t" + type_name<size_t>() + " " + prm_name; 6 } 7 }; Contribution to expression string: 1 template <> struct partial_vector_expr<elem_index> { 2 static std::string get( 3 const elem_index&, const cl::Device&, const std::string &prm_name) 4 { 5 return "(" + prm_name + " + idx)"; 6 } 7 };

Setting kernel arguments: 1 template <> struct kernel_arg_setter<elem_index> { 2

static void set( 3 const elem_index &term, cl::Kernel &krn, unsigned /∗ device ∗/, 4 size_t index_offset, unsigned &position) 5 { 6 krn.setArg(position++, term.offset + index_offset); 7 } 8 }; Done! We can use vex::element_index() in arbitrary vector expressions.

Supporting diﬀerent backends Supported backends: OpenCL (Khronos C++ bindings) OpenCL

(Boost.Compute) NVIDIA CUDA Diﬀerences between backends: Host-side API Compute kernel language JIT compilation support

Abstraction layer Any backend in VexCL consists of: filter.hpp Device

ﬁlters device vector.hpp Memory abstraction source.hpp Source generation compiler.hpp Compiling compute kernels kernel.hpp Compute kernel context.hpp Host-side objects

Source generation 1 class source_generator { 2 source_generator &kernel(const std::string

&name); 3 4 template <typename Return> 5 source_generator& function(const std::string &name); 6 7 template <typename Param> 8 source_generator& parameter(const std::string &name); 9 10 source_generator& open(const std::string &bracket); 11 source_generator& close(const std::string &bracket); 12 13 source_generator& new_line(); 14 source_generator& grid_stride_loop( 15 const std::string &idx = "idx", const std::string &bnd = "n"); 16 17 template <class T> 18 friend source_generator& operator<<(source_generator &src, const T &t); 19 };

1 source_generator src; 2 src.kernel("mul2") 3 .open("(") 4 .parameter< size_t

>("n") 5 .parameter< global_ptr<double> >("x") 6 .parameter< global_ptr<const double> >("y") 7 .close(")"); 8 src.open("{").grid_stride_loop("idx", "n").open("{"); 9 src.new_line() << "x[idx] = 2 * y[idx];"; 10 src.close("}").close("}"); 11 std::cout << src.str() << std::endl; OpenCL 1 kernel void mul2 ( 2 ulong n, 3 global double * x, 4 global const double * y 5 ) 6 { 7 for(ulong idx = get_global_id(0); idx < n; idx += get_global_size(0)) { 8 x[idx] = 2 * y[idx]; 9 } 10 }

1 source_generator src; 2 src.kernel("mul2") 3 .open("(") 4 .parameter< size_t

>("n") 5 .parameter< global_ptr<double> >("x") 6 .parameter< global_ptr<const double> >("y") 7 .close(")"); 8 src.open("{").grid_stride_loop("idx", "n").open("{"); 9 src.new_line() << "x[idx] = 2 * y[idx];"; 10 src.close("}").close("}"); 11 std::cout << src.str() << std::endl; CUDA 1 extern "C" __global__ void mul2 ( 2 ulong n, 3 double * x, 4 const double * y, 5 ) 6 { 7 for(ulong idx=blockDim.x*blockIdx.x+threadIdx.x, grid_size=blockDim.x*gridDim.x; idx<n; idx+=grid_size) { 8 x[idx] = 2 * y[idx]; 9 } 10 }

Compiling compute kernels JIT compilation: OpenCL: builtin functionality CUDA: generated

sources are saved to disk and passed to nvcc to create PTX module. Oﬄine kernel caching: OpenCL: save/load binaries CUDA: free side eﬀect of the compilation process SHA1 hash of the source is used as unique id of the cached kernel.

Writing generic code OpenCL compute kernel language is a dialect

of C99: No templates, no function overloads, etc. CUDA backend uses same subset of C++ by necessity. Can we still write generic code?

Writing generic code OpenCL compute kernel language is a dialect

of C99: No templates, no function overloads, etc. CUDA backend uses same subset of C++ by necessity. Can we still write generic code? 1 template <class T> 2 std::string mul2_source() { 3 source_generator src; 4 src.kernel("mul2") 5 .open("(") 6 .parameter< size_t >("n") 7 .parameter< global_ptr<T> >("x") 8 .parameter< global_ptr<const T> >("y") 9 .close(")"); 10 src.open("{").grid_stride_loop("idx", "n").open("{"); 11 src.new_line() << "x[idx] = 2 * y[idx];"; 12 src.close("}").close("}"); 13 return src.str(); 14 }

Writing generic code OpenCL compute kernel language is a dialect

of C99: No templates, no function overloads, etc. CUDA backend uses same subset of C++ by necessity. Can we still write generic code? 1 struct mul2_t { 2 template <class Expr> 3 auto operator()(Expr &&expr) const { 4 typedef typename vex::detail::return_type<Expr>::type value_type; 5 VEX_FUNCTION(value_type, _mul2, (value_type, x), return 2 * x; ); 6 return _mul2(expr); 7 } 8 } const mul2; v = mul2(x + y);

Writing generic code OpenCL compute kernel language is a dialect

of C99: No templates, no function overloads, etc. CUDA backend uses same subset of C++ by necessity. Can we still write generic code? 1 struct mul2_t { 2 template <class Expr> 3 auto operator()(Expr &&expr) const { 4 typedef typename vex::detail::return_type<Expr>::type value_type; 5 VEX_FUNCTION(value_type, _mul2, (value_type, x), return 2 * x; ); 6 return _mul2(expr); 7 } 8 } const mul2; v = mul2(x + y); We don’t need compiler’s help when we can generate our own code!

Limitations of expressions templates Impossible to alter execution details for

an expression Which queue to use here? 1 vex::vector<double> x({q1}, n); 2 vex::vector<double> y({q2}, n); 3 x = 2 * y; One-by-one execution may be ineﬀective: x and y below are read twice. How to fuse the expressions into a single kernel? 1 u = x + y; 2 v = x - y;

Limitations of expressions templates Impossible to alter execution details for

an expression Which queue to use here? 1 vex::vector<double> x({q1}, n); 2 vex::vector<double> y({q2}, n); 3 x = 2 * y; vex::enqueue({q2}, x) = 2 * y; One-by-one execution may be ineﬀective: x and y below are read twice. How to fuse the expressions into a single kernel? 1 u = x + y; 2 v = x - y;

Limitations of expressions templates Impossible to alter execution details for

an expression Which queue to use here? 1 vex::vector<double> x({q1}, n); 2 vex::vector<double> y({q2}, n); 3 x = 2 * y; vex::enqueue({q2}, x) = 2 * y; One-by-one execution may be ineﬀective: x and y below are read twice. How to fuse the expressions into a single kernel? 1 u = x + y; 2 v = x - y; vex::tie(u, v) = std::tie(x + y, x - y);

VexCL: Experiences in Developing a C++ Wrapper ...

VexCL: Experiences in Developing a C++ Wrapper Library for OpenCL

More Decks by Denis Demidov

Other Decks in Programming

Featured

Transcript