Upgrade to Pro — share decks privately, control downloads, hide ads and more …

VexCL: Experiences in Developing a C++ Wrapper Library for OpenCL

VexCL: Experiences in Developing a C++ Wrapper Library for OpenCL

VexCL is a C++ vector expression template library for fast GPGPU prototyping and development. It provides convenient notation for various vector operations, and generates OpenCL/CUDA kernels on the fly from C++ expressions. Among the challenges met during the library development were the need to support multiple backends (OpenCL/CUDA), the fact that the OpenCL compute kernel language is C99 and thus lacks some conveniences of C++, etc. The talk provides brief overview of the library implementation and the C++ techniques used to overcome the difficulties.

Denis Demidov

June 09, 2016
Tweet

More Decks by Denis Demidov

Other Decks in Programming

Transcript

  1. VexCL Experiences in Developing a C++ Wrapper Library for OpenCL

    Denis Demidov Kazan Federal University / Institute of System Research, Russian Academy of Sciences 06.2016, Lausanne
  2. VexCL — a vector expression template library for OpenCL/CUDA Fork

    m e on G itH ub Created for ease of C++ based GPGPU development: Convenient notation for vector expressions OpenCL/CUDA JIT code generation Easily combined with existing libraries/code Header-only Supported backends: OpenCL (Khronos C++ bindings, Boost.Compute) NVIDIA CUDA The source code is available under MIT license: https://github.com/ddemidov/vexcl
  3. Motivation/Hello World OpenCL 1 #include <iostream> 2 #include <vector> 3

    #include <string> 4 5 #define __CL_ENABLE_EXCEPTIONS 6 #include <CL/cl.hpp> 7 8 int main() { 9 // Get list of OpenCL platforms. 10 std::vector<cl::Platform> platform; 11 cl::Platform::get(&platform); 12 13 if (platform.empty()) 14 throw std::runtime_error("No OpenCL platforms"); 15 16 // Get first available GPU. 17 cl::Context context; 18 std::vector<cl::Device> device; 19 for(auto p = platform.begin(); device.empty() && p != platform.end(); p++) { 20 std::vector<cl::Device> pldev; 21 try { 22 p->getDevices(CL_DEVICE_TYPE_GPU, &pldev); 23 for(auto d = pldev.begin(); device.empty() && d != pldev.end(); d++) { 24 if (!d->getInfo<CL_DEVICE_AVAILABLE>()) continue; 25 device.push_back(*d); 26 context = cl::Context(device); 27 } 28 } catch(...) { 29 device.clear(); 30 } 31 } 32 if (device.empty()) throw std::runtime_error("No GPUs"); 33 34 // Create command queue. 35 cl::CommandQueue queue(context, device[0]); 36 37 // Prepare input data, transfer it to the device. 38 const size_t N = 1 << 20; 39 std::vector<float> a(N, 1), b(N, 2), c(N); 40 41 cl::Buffer A(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 42 a.size() * sizeof(float), a.data()); 43 44 cl::Buffer B(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 45 b.size() * sizeof(float), b.data()); 46 47 cl::Buffer C(context, CL_MEM_READ_WRITE, 48 c.size() * sizeof(float)); 49 50 // Create kernel source 51 const char source[] = R"( 52 kernel void add( 53 ulong n, 54 global const float *a, 55 global const float *b, 56 global float *c 57 ) 58 { 59 ulong i = get_global_id(0); 60 if (i < n) c[i] = a[i] + b[i]; 61 } 62 )"; 63 64 // Compile OpenCL program for the device. 65 cl::Program program(context, cl::Program::Sources( 66 1, std::make_pair(source, strlen(source)) 67 )); 68 try { 69 program.build(device); 70 } catch (const cl::Error&) { 71 std::cerr 72 << "OpenCL compilation error" << std::endl 73 << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device[0]) 74 << std::endl; 75 throw std::runtime_error("OpenCL build error"); 76 } 77 cl::Kernel add(program, "add"); 78 79 // Set kernel parameters. 80 add.setArg(0, static_cast<cl_ulong>(N)); 81 add.setArg(1, A); 82 add.setArg(2, B); 83 add.setArg(3, C); 84 85 // Launch kernel on the compute device. 86 queue.enqueueNDRangeKernel(add, cl::NullRange, N, cl::NullRange); 87 88 // Get result back to host. 89 queue.enqueueReadBuffer(C, CL_TRUE, 0, c.size() * sizeof(float), c.data()); 90 std::cout << c[42] << std::endl; // Should get '3' here. 91 } VexCL 1 #include <iostream> 2 #include <vector> 3 #include <vexcl/vexcl.hpp> 4 5 int main() { 6 // Get first available GPU 7 vex::Context ctx(vex::Filter::GPU && vex::Filter::Count(1)); 8 if (!ctx) throw std::runtime_error("No GPUs"); 9 10 // Prepare and copy input data 11 const size_t n = 1024 * 1024; 12 std::vector<float> a(n, 1), b(n, 2), c(n); 13 vex::vector<float> A(ctx, a), B(ctx, b), C(ctx, n); 14 15 // Launch compute kernel 16 C = A + B; 17 18 // Get results back to host 19 vex::copy(C, c); 20 std::cout << c[42] << std::endl; 21 }
  4. Motivation/Hello World OpenCL 1 #include <iostream> 2 #include <vector> 3

    #include <string> 4 5 #define __CL_ENABLE_EXCEPTIONS 6 #include <CL/cl.hpp> 7 8 int main() { 9 // Get list of OpenCL platforms. 10 std::vector<cl::Platform> platform; 11 cl::Platform::get(&platform); 12 13 if (platform.empty()) 14 throw std::runtime_error("No OpenCL platforms"); 15 16 // Get first available GPU. 17 cl::Context context; 18 std::vector<cl::Device> device; 19 for(auto p = platform.begin(); device.empty() && p != platform.end(); p++) { 20 std::vector<cl::Device> pldev; 21 try { 22 p->getDevices(CL_DEVICE_TYPE_GPU, &pldev); 23 for(auto d = pldev.begin(); device.empty() && d != pldev.end(); d++) { 24 if (!d->getInfo<CL_DEVICE_AVAILABLE>()) continue; 25 device.push_back(*d); 26 context = cl::Context(device); 27 } 28 } catch(...) { 29 device.clear(); 30 } 31 } 32 if (device.empty()) throw std::runtime_error("No GPUs"); 33 34 // Create command queue. 35 cl::CommandQueue queue(context, device[0]); 36 37 // Prepare input data, transfer it to the device. 38 const size_t N = 1 << 20; 39 std::vector<float> a(N, 1), b(N, 2), c(N); 40 41 cl::Buffer A(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 42 a.size() * sizeof(float), a.data()); 43 44 cl::Buffer B(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 45 b.size() * sizeof(float), b.data()); 46 47 cl::Buffer C(context, CL_MEM_READ_WRITE, 48 c.size() * sizeof(float)); 49 50 // Create kernel source 51 const char source[] = R"( 52 kernel void add( 53 ulong n, 54 global const float *a, 55 global const float *b, 56 global float *c 57 ) 58 { 59 ulong i = get_global_id(0); 60 if (i < n) c[i] = a[i] + b[i]; 61 } 62 )"; 63 64 // Compile OpenCL program for the device. 65 cl::Program program(context, cl::Program::Sources( 66 1, std::make_pair(source, strlen(source)) 67 )); 68 try { 69 program.build(device); 70 } catch (const cl::Error&) { 71 std::cerr 72 << "OpenCL compilation error" << std::endl 73 << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device[0]) 74 << std::endl; 75 throw std::runtime_error("OpenCL build error"); 76 } 77 cl::Kernel add(program, "add"); 78 79 // Set kernel parameters. 80 add.setArg(0, static_cast<cl_ulong>(N)); 81 add.setArg(1, A); 82 add.setArg(2, B); 83 add.setArg(3, C); 84 85 // Launch kernel on the compute device. 86 queue.enqueueNDRangeKernel(add, cl::NullRange, N, cl::NullRange); 87 88 // Get result back to host. 89 queue.enqueueReadBuffer(C, CL_TRUE, 0, c.size() * sizeof(float), c.data()); 90 std::cout << c[42] << std::endl; // Should get '3' here. 91 } VexCL 1 #include <iostream> 2 #include <vector> 3 #include <vexcl/vexcl.hpp> 4 5 int main() { 6 // Get first available GPU 7 vex::Context ctx(vex::Filter::GPU && vex::Filter::Count(1)); 8 if (!ctx) throw std::runtime_error("No GPUs"); 9 10 // Prepare and copy input data 11 const size_t n = 1024 * 1024; 12 std::vector<float> a(n, 1), b(n, 2), c(n); 13 vex::vector<float> A(ctx, a), B(ctx, b), C(ctx, n); 14 15 // Launch compute kernel 16 C = A + B; 17 18 // Get results back to host 19 vex::copy(C, c); 20 std::cout << c[42] << std::endl; 21 }
  5. Motivation/Hello World OpenCL 1 #include <iostream> 2 #include <vector> 3

    #include <string> 4 5 #define __CL_ENABLE_EXCEPTIONS 6 #include <CL/cl.hpp> 7 8 int main() { 9 // Get list of OpenCL platforms. 10 std::vector<cl::Platform> platform; 11 cl::Platform::get(&platform); 12 13 if (platform.empty()) 14 throw std::runtime_error("No OpenCL platforms"); 15 16 // Get first available GPU. 17 cl::Context context; 18 std::vector<cl::Device> device; 19 for(auto p = platform.begin(); device.empty() && p != platform.end(); p++) { 20 std::vector<cl::Device> pldev; 21 try { 22 p->getDevices(CL_DEVICE_TYPE_GPU, &pldev); 23 for(auto d = pldev.begin(); device.empty() && d != pldev.end(); d++) { 24 if (!d->getInfo<CL_DEVICE_AVAILABLE>()) continue; 25 device.push_back(*d); 26 context = cl::Context(device); 27 } 28 } catch(...) { 29 device.clear(); 30 } 31 } 32 if (device.empty()) throw std::runtime_error("No GPUs"); 33 34 // Create command queue. 35 cl::CommandQueue queue(context, device[0]); 36 37 // Prepare input data, transfer it to the device. 38 const size_t N = 1 << 20; 39 std::vector<float> a(N, 1), b(N, 2), c(N); 40 41 cl::Buffer A(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 42 a.size() * sizeof(float), a.data()); 43 44 cl::Buffer B(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 45 b.size() * sizeof(float), b.data()); 46 47 cl::Buffer C(context, CL_MEM_READ_WRITE, 48 c.size() * sizeof(float)); 49 50 // Create kernel source 51 const char source[] = R"( 52 kernel void add( 53 ulong n, 54 global const float *a, 55 global const float *b, 56 global float *c 57 ) 58 { 59 ulong i = get_global_id(0); 60 if (i < n) c[i] = a[i] + b[i]; 61 } 62 )"; 63 64 // Compile OpenCL program for the device. 65 cl::Program program(context, cl::Program::Sources( 66 1, std::make_pair(source, strlen(source)) 67 )); 68 try { 69 program.build(device); 70 } catch (const cl::Error&) { 71 std::cerr 72 << "OpenCL compilation error" << std::endl 73 << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device[0]) 74 << std::endl; 75 throw std::runtime_error("OpenCL build error"); 76 } 77 cl::Kernel add(program, "add"); 78 79 // Set kernel parameters. 80 add.setArg(0, static_cast<cl_ulong>(N)); 81 add.setArg(1, A); 82 add.setArg(2, B); 83 add.setArg(3, C); 84 85 // Launch kernel on the compute device. 86 queue.enqueueNDRangeKernel(add, cl::NullRange, N, cl::NullRange); 87 88 // Get result back to host. 89 queue.enqueueReadBuffer(C, CL_TRUE, 0, c.size() * sizeof(float), c.data()); 90 std::cout << c[42] << std::endl; // Should get '3' here. 91 } VexCL 1 #include <iostream> 2 #include <vector> 3 #include <vexcl/vexcl.hpp> 4 5 int main() { 6 // Get first available GPU 7 vex::Context ctx(vex::Filter::GPU && vex::Filter::Count(1)); 8 if (!ctx) throw std::runtime_error("No GPUs"); 9 10 // Prepare and copy input data 11 const size_t n = 1024 * 1024; 12 std::vector<float> a(n, 1), b(n, 2), c(n); 13 vex::vector<float> A(ctx, a), B(ctx, b), C(ctx, n); 14 15 // Launch compute kernel 16 C = A + B; 17 18 // Get results back to host 19 vex::copy(C, c); 20 std::cout << c[42] << std::endl; 21 }
  6. Motivation/Hello World OpenCL 1 #include <iostream> 2 #include <vector> 3

    #include <string> 4 5 #define __CL_ENABLE_EXCEPTIONS 6 #include <CL/cl.hpp> 7 8 int main() { 9 // Get list of OpenCL platforms. 10 std::vector<cl::Platform> platform; 11 cl::Platform::get(&platform); 12 13 if (platform.empty()) 14 throw std::runtime_error("No OpenCL platforms"); 15 16 // Get first available GPU. 17 cl::Context context; 18 std::vector<cl::Device> device; 19 for(auto p = platform.begin(); device.empty() && p != platform.end(); p++) { 20 std::vector<cl::Device> pldev; 21 try { 22 p->getDevices(CL_DEVICE_TYPE_GPU, &pldev); 23 for(auto d = pldev.begin(); device.empty() && d != pldev.end(); d++) { 24 if (!d->getInfo<CL_DEVICE_AVAILABLE>()) continue; 25 device.push_back(*d); 26 context = cl::Context(device); 27 } 28 } catch(...) { 29 device.clear(); 30 } 31 } 32 if (device.empty()) throw std::runtime_error("No GPUs"); 33 34 // Create command queue. 35 cl::CommandQueue queue(context, device[0]); 36 37 // Prepare input data, transfer it to the device. 38 const size_t N = 1 << 20; 39 std::vector<float> a(N, 1), b(N, 2), c(N); 40 41 cl::Buffer A(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 42 a.size() * sizeof(float), a.data()); 43 44 cl::Buffer B(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 45 b.size() * sizeof(float), b.data()); 46 47 cl::Buffer C(context, CL_MEM_READ_WRITE, 48 c.size() * sizeof(float)); 49 50 // Create kernel source 51 const char source[] = R"( 52 kernel void add( 53 ulong n, 54 global const float *a, 55 global const float *b, 56 global float *c 57 ) 58 { 59 ulong i = get_global_id(0); 60 if (i < n) c[i] = a[i] + b[i]; 61 } 62 )"; 63 64 // Compile OpenCL program for the device. 65 cl::Program program(context, cl::Program::Sources( 66 1, std::make_pair(source, strlen(source)) 67 )); 68 try { 69 program.build(device); 70 } catch (const cl::Error&) { 71 std::cerr 72 << "OpenCL compilation error" << std::endl 73 << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device[0]) 74 << std::endl; 75 throw std::runtime_error("OpenCL build error"); 76 } 77 cl::Kernel add(program, "add"); 78 79 // Set kernel parameters. 80 add.setArg(0, static_cast<cl_ulong>(N)); 81 add.setArg(1, A); 82 add.setArg(2, B); 83 add.setArg(3, C); 84 85 // Launch kernel on the compute device. 86 queue.enqueueNDRangeKernel(add, cl::NullRange, N, cl::NullRange); 87 88 // Get result back to host. 89 queue.enqueueReadBuffer(C, CL_TRUE, 0, c.size() * sizeof(float), c.data()); 90 std::cout << c[42] << std::endl; // Should get '3' here. 91 } VexCL 1 #include <iostream> 2 #include <vector> 3 #include <vexcl/vexcl.hpp> 4 5 int main() { 6 // Get first available GPU 7 vex::Context ctx(vex::Filter::GPU && vex::Filter::Count(1)); 8 if (!ctx) throw std::runtime_error("No GPUs"); 9 10 // Prepare and copy input data 11 const size_t n = 1024 * 1024; 12 std::vector<float> a(n, 1), b(n, 2), c(n); 13 vex::vector<float> A(ctx, a), B(ctx, b), C(ctx, n); 14 15 // Launch compute kernel 16 C = A + B; 17 18 // Get results back to host 19 vex::copy(C, c); 20 std::cout << c[42] << std::endl; 21 }
  7. Motivation/Hello World OpenCL 1 #include <iostream> 2 #include <vector> 3

    #include <string> 4 5 #define __CL_ENABLE_EXCEPTIONS 6 #include <CL/cl.hpp> 7 8 int main() { 9 // Get list of OpenCL platforms. 10 std::vector<cl::Platform> platform; 11 cl::Platform::get(&platform); 12 13 if (platform.empty()) 14 throw std::runtime_error("No OpenCL platforms"); 15 16 // Get first available GPU. 17 cl::Context context; 18 std::vector<cl::Device> device; 19 for(auto p = platform.begin(); device.empty() && p != platform.end(); p++) { 20 std::vector<cl::Device> pldev; 21 try { 22 p->getDevices(CL_DEVICE_TYPE_GPU, &pldev); 23 for(auto d = pldev.begin(); device.empty() && d != pldev.end(); d++) { 24 if (!d->getInfo<CL_DEVICE_AVAILABLE>()) continue; 25 device.push_back(*d); 26 context = cl::Context(device); 27 } 28 } catch(...) { 29 device.clear(); 30 } 31 } 32 if (device.empty()) throw std::runtime_error("No GPUs"); 33 34 // Create command queue. 35 cl::CommandQueue queue(context, device[0]); 36 37 // Prepare input data, transfer it to the device. 38 const size_t N = 1 << 20; 39 std::vector<float> a(N, 1), b(N, 2), c(N); 40 41 cl::Buffer A(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 42 a.size() * sizeof(float), a.data()); 43 44 cl::Buffer B(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 45 b.size() * sizeof(float), b.data()); 46 47 cl::Buffer C(context, CL_MEM_READ_WRITE, 48 c.size() * sizeof(float)); 49 50 // Create kernel source 51 const char source[] = R"( 52 kernel void add( 53 ulong n, 54 global const float *a, 55 global const float *b, 56 global float *c 57 ) 58 { 59 ulong i = get_global_id(0); 60 if (i < n) c[i] = a[i] + b[i]; 61 } 62 )"; 63 64 // Compile OpenCL program for the device. 65 cl::Program program(context, cl::Program::Sources( 66 1, std::make_pair(source, strlen(source)) 67 )); 68 try { 69 program.build(device); 70 } catch (const cl::Error&) { 71 std::cerr 72 << "OpenCL compilation error" << std::endl 73 << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device[0]) 74 << std::endl; 75 throw std::runtime_error("OpenCL build error"); 76 } 77 cl::Kernel add(program, "add"); 78 79 // Set kernel parameters. 80 add.setArg(0, static_cast<cl_ulong>(N)); 81 add.setArg(1, A); 82 add.setArg(2, B); 83 add.setArg(3, C); 84 85 // Launch kernel on the compute device. 86 queue.enqueueNDRangeKernel(add, cl::NullRange, N, cl::NullRange); 87 88 // Get result back to host. 89 queue.enqueueReadBuffer(C, CL_TRUE, 0, c.size() * sizeof(float), c.data()); 90 std::cout << c[42] << std::endl; // Should get '3' here. 91 } VexCL 1 #include <iostream> 2 #include <vector> 3 #include <vexcl/vexcl.hpp> 4 5 int main() { 6 // Get first available GPU 7 vex::Context ctx(vex::Filter::GPU && vex::Filter::Count(1)); 8 if (!ctx) throw std::runtime_error("No GPUs"); 9 10 // Prepare and copy input data 11 const size_t n = 1024 * 1024; 12 std::vector<float> a(n, 1), b(n, 2), c(n); 13 vex::vector<float> A(ctx, a), B(ctx, b), C(ctx, n); 14 15 // Launch compute kernel 16 C = A + B; 17 18 // Get results back to host 19 vex::copy(C, c); 20 std::cout << c[42] << std::endl; 21 }
  8. OpenCL/CUDA code is generated from user expressions Expression: x =

    2 * y - sin(z); when compiled with: g++ -DVEXCL_BACKEND_OPENCL -lOpenCL ... Generated kernel: 1 kernel void vexcl_vector_kernel 2 ( 3 ulong n, 4 global double * prm_1, // x 5 int prm_2, // 2 6 global double * prm_3, // y 7 global double * prm_4 // z 8 ) 9 { 10 for(ulong idx = get_global_id(0); idx < n; idx += get_global_size(0)) 11 { 12 prm_1[idx] = ( ( prm_2 * prm_3[idx] ) - sin( prm_4[idx] ) ); 13 } 14 }
  9. OpenCL/CUDA code is generated from user expressions Expression: x =

    2 * y - sin(z); when compiled with: g++ -DVEXCL_BACKEND_CUDA -lcuda ... Generated kernel: 1 extern "C" __global__ void vexcl_vector_kernel ( 2 ulong n, 3 double * prm_1, // x 4 int prm_2, // 2 5 double * prm_3, // y 6 double * prm_4 // z 7 ) 8 { 9 for(ulong idx = blockDim.x*blockIdx.x + threadIdx.x, grid_size = blockDim.x*gridDim.x; 10 idx < n; idx += grid_size) 11 { 12 prm_1[idx] = ( ( prm_2 * prm_3[idx] ) - sin( prm_4[idx] ) ); 13 } 14 }
  10. Monte Carlo π Compute approximate value of π: area of

    circle area of square = πr2 (2r)2 = π 4 , π = 4 area of circle area of square ≈ 4 points in circle all points 0.0 0.5 1.0 0.5 1.0 1 vex::Random<cl_double2> rnd; 2 vex::Reductor<size_t, vex::SUM> sum(ctx); 3 double pi = 4.0 * sum( length( rnd(vex::element_index(0, n), seed) ) < 1 ) / n;
  11. Expression templates VexCL uses expression templates to generate compute kernels.

    Todd Veldhuizen, Expression templates, C++ Report, 1995. Example: vector operations 1 template <class LHS, class OP, class RHS> 2 struct binary_op { 3 const LHS &lhs; 4 const RHS &rhs; 5 }; 6 7 template <class LHS, class RHS> 8 binary_op<LHS, plus, RHS> operator+(const LHS &a, const RHS &b) { 9 return binary_op<LHS, plus, RHS>{a, b}; 10 }
  12. Vector operations 1 struct plus { static double apply(double a,

    double b) { return a + b; } }; 2 3 template <class LHS, class OP, class RHS> 4 double binary_op<LHS, OP, RHS>::operator[](size_t i) const { 5 return OP::apply(lhs[i], rhs[i]); 6 } 7 vector& vector::operator=(const Expr &expr) { 8 for(size_t i = 0; i < _n; ++i) _data[i] = expr[i]; 9 } a = x + y - z; − z + x y
  13. Vector operations 1 struct plus { static double apply(double a,

    double b) { return a + b; } }; 2 3 template <class LHS, class OP, class RHS> 4 double binary_op<LHS, OP, RHS>::operator[](size_t i) const { 5 return OP::apply(lhs[i], rhs[i]); 6 } 7 vector& vector::operator=(const Expr &expr) { 8 for(size_t i = 0; i < _n; ++i) _data[i] = expr[i]; 9 } a = x + y - z; − z + x y for(size_t i = 0; i < n; ++i) a[i] = [i]
  14. Vector operations 1 struct plus { static double apply(double a,

    double b) { return a + b; } }; 2 3 template <class LHS, class OP, class RHS> 4 double binary_op<LHS, OP, RHS>::operator[](size_t i) const { 5 return OP::apply(lhs[i], rhs[i]); 6 } 7 vector& vector::operator=(const Expr &expr) { 8 for(size_t i = 0; i < _n; ++i) _data[i] = expr[i]; 9 } a = x + y - z; − z + x y for(size_t i = 0; i < n; ++i) a[i] = [i] [i]
  15. Vector operations 1 struct plus { static double apply(double a,

    double b) { return a + b; } }; 2 3 template <class LHS, class OP, class RHS> 4 double binary_op<LHS, OP, RHS>::operator[](size_t i) const { 5 return OP::apply(lhs[i], rhs[i]); 6 } 7 vector& vector::operator=(const Expr &expr) { 8 for(size_t i = 0; i < _n; ++i) _data[i] = expr[i]; 9 } a = x + y - z; − z + x y for(size_t i = 0; i < n; ++i) a[i] = [i] [i] [i]
  16. Vector operations 1 struct plus { static double apply(double a,

    double b) { return a + b; } }; 2 3 template <class LHS, class OP, class RHS> 4 double binary_op<LHS, OP, RHS>::operator[](size_t i) const { 5 return OP::apply(lhs[i], rhs[i]); 6 } 7 vector& vector::operator=(const Expr &expr) { 8 for(size_t i = 0; i < _n; ++i) _data[i] = expr[i]; 9 } a = x + y - z; − z + x y #pragma omp parallel for for(size_t i = 0; i < n; ++i) a[i] = [i] [i] [i]
  17. Source code generation Expression: a = x + y -

    z; Kernel: 1 kernel void vexcl_vector_kernel( 2 ulong n, 3 global double * prm_1, 4 global double * prm_2, 5 global double * prm_3, 6 global double * prm_4 7 ) 8 { 9 for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0)) { 10 prm_1[idx] = ( ( prm_2[idx] + prm_3[idx] ) - prm_4[idx] ); 11 } 12 } − z + x y
  18. Declaring parameters Each terminal knows what parameters it needs: 1

    /∗static∗/ void vector::declare_params(std::ostream &src, unsigned &pos) 2 { 3 src << ",\n global double * prm" << ++pos; 4 } An expression just asks its terminals to do the job: 5 template <class LHS, class OP, class RHS> 6 /∗static∗/ void binary_op<LHS, OP, RHS>::declare_params(std::ostream &src, unsigned &pos) 7 { 8 LHS::declare_params(src, pos); 9 RHS::declare_params(src, pos); 10 }
  19. Building string representation for an expression 1 struct plus {

    2 static std::string to_string(std::ostream &src) { src << " + "; } 3 }; 4 5 /∗static∗/ void vector::to_string(std::ostream &src, unsigned &pos) { 6 src << "prm" << ++pos << "[idx]"; 7 } 8 9 template <class LHS, class OP, class RHS> 10 /∗static∗/ void binary_op<LHS, OP, RHS>::to_string(std::ostream &src, unsigned &pos) { 11 src << "( "; 12 LHS::to_string(src, pos); 13 OP::to_string(src); 14 RHS::to_string(src, pos); 15 src << " )"; 16 }
  20. Generating kernel source 1 template <class LHS, class RHS> 2

    std::string kernel_source() { 3 std::ostringstream src; 4 5 src << "kernel void vexcl_vector_kernel(\n ulong n"; 6 unsigned pos = 0; 7 LHS::declare_params(src, pos); 8 RHS::declare_params(src, pos); 9 src << ")\n{\n" 10 " for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0)) {\n" 11 " "; 12 pos = 0; 13 LHS::to_string(src, pos); src << " = "; 14 RHS::to_string(src, pos); src << ";\n"; 15 src << " }\n}\n"; 16 17 return src.str(); 18 }
  21. Generating/launching the kernel on demand 1 template <class Expr> 2

    const vector& vector::operator=(const Expr &expr) { 3 static cl::Kernel krn = build_kernel(device, kernel_source<This, Expr>()); 4 5 unsigned pos = 0; 6 7 krn.setArg(pos++, size); // n 8 this->set_args(krn, pos); // result 9 expr.set_args(krn, pos); // other parameters 10 11 queue.enqueueNDRangeKernel(krn, cl::NullRange, buffer.size(), cl::NullRange); 12 13 return *this; 14 } The kernel is uniquely identified by the expression type. Local static variable may be used to cache the kernel. The kernel is generated and compiled once, applied many times.
  22. Using Boost.Proto Problem: this does not scale very well There

    are unary, binary, n-ary expressions. There are special terminals requiring either global or local preambles. There are builtin and user-defined functions. We need to be able to combine all of the above in an expression. Solution: Boost.Proto Framework for building Embedded Domain-Specific Languages in C++. Provides tools for constructing, type-checking, transforming and executing expression templates.
  23. Proto grammar 1 struct vector_grammar 2 : proto::or_< 3 proto::and_<

    4 proto::terminal<proto::_>, 5 proto::if_<is_vector_expr_terminal<proto::_value>() > 6 >, 7 proto::or_< 8 proto::unary_plus<vector_grammar>, 9 proto::negate<vector_grammar>, 10 ... 11 proto::plus <vector_grammar, vector_grammar>, 12 proto::minus<vector_grammar, vector_grammar>, 13 ... 14 >, 15 ... 16 > 17 {};
  24. Proto contexts 1 template <class Term> struct partial_vector_expr { 2

    static void get(std::ostream &os, unsigned &pos) { os << "prm_" << ++pos; } 3 }; 4 struct vector_expr_context { 5 std::ostream &os; unsigned pos = 0; 6 template <typename Expr, typename Tag = typename Expr::proto_tag> struct eval; 7 8 template <typename Expr> struct eval<Expr, boost::proto::tag::plus> { 9 typedef void result_type; 10 void operator()(const Expr &expr, vector_expr_context &ctx) const { 11 os << "( "; proto::eval(proto::left (expr), ctx); 12 os << " + "; proto::eval(proto::right(expr), ctx); os << " )"; 13 } 14 }; 15 ... 16 template <typename Expr> struct eval<Expr, boost::proto::tag::terminal> { 17 typedef void result_type; 18 void operator()(const Expr &expr, vector_expr_context &ctx) const { 19 partial_vector_expr<typename proto::result_of::value<Expr>::type>::get(os, ctx.pos); 20 } 21 }; 22 };
  25. Proto contexts 1 template <class Term> struct partial_vector_expr { 2

    static void get(std::ostream &os, unsigned &pos) { os << "prm_" << ++pos; } 3 }; 4 struct vector_expr_context { 5 std::ostream &os; unsigned pos = 0; 6 template <typename Expr, typename Tag = typename Expr::proto_tag> struct eval; 7 8 template <typename Expr> struct eval<Expr, boost::proto::tag::plus> { 9 typedef void result_type; 10 void operator()(const Expr &expr, vector_expr_context &ctx) const { 11 os << "( "; proto::eval(proto::left (expr), ctx); 12 os << " + "; proto::eval(proto::right(expr), ctx); os << " )"; 13 } 14 }; 15 ... 16 template <typename Expr> struct eval<Expr, boost::proto::tag::terminal> { 17 typedef void result_type; 18 void operator()(const Expr &expr, vector_expr_context &ctx) const { 19 partial_vector_expr<typename proto::result_of::value<Expr>::type>::get(os, ctx.pos); 20 } 21 }; 22 };
  26. Adding new terminals Once the DSL engine is set up,

    adding new terminals is easy: x = sin( 2 * M_PI * vex::element index() / n ); Let Proto know the terminal is allowed in vector expressions: 1 struct elem_index { 2 size_t offset; 3 size_t length; 4 }; 5 6 template <> struct is_vector_expr_terminal< elem_index > : std::true_type {}; 7 8 inline auto element_index(size_t offset = 0, size_t length = 0) { 9 return proto::as_expr<vector_domain>( elem_index{offset, length} ); 10 }
  27. Parameter declaration: 1 template <> struct kernel_param_declaration<elem_index> { 2 static

    std::string get( 3 const elem_index&, const cl::Device&, const std::string &prm_name) 4 { 5 return ",\n\t" + type_name<size_t>() + " " + prm_name; 6 } 7 }; Contribution to expression string: 1 template <> struct partial_vector_expr<elem_index> { 2 static std::string get( 3 const elem_index&, const cl::Device&, const std::string &prm_name) 4 { 5 return "(" + prm_name + " + idx)"; 6 } 7 };
  28. Setting kernel arguments: 1 template <> struct kernel_arg_setter<elem_index> { 2

    static void set( 3 const elem_index &term, cl::Kernel &krn, unsigned /∗ device ∗/, 4 size_t index_offset, unsigned &position) 5 { 6 krn.setArg(position++, term.offset + index_offset); 7 } 8 }; Done! We can use vex::element_index() in arbitrary vector expressions.
  29. Supporting different backends Supported backends: OpenCL (Khronos C++ bindings) OpenCL

    (Boost.Compute) NVIDIA CUDA Differences between backends: Host-side API Compute kernel language JIT compilation support
  30. Abstraction layer Any backend in VexCL consists of: filter.hpp Device

    filters device vector.hpp Memory abstraction source.hpp Source generation compiler.hpp Compiling compute kernels kernel.hpp Compute kernel context.hpp Host-side objects
  31. Source generation 1 class source_generator { 2 source_generator &kernel(const std::string

    &name); 3 4 template <typename Return> 5 source_generator& function(const std::string &name); 6 7 template <typename Param> 8 source_generator& parameter(const std::string &name); 9 10 source_generator& open(const std::string &bracket); 11 source_generator& close(const std::string &bracket); 12 13 source_generator& new_line(); 14 source_generator& grid_stride_loop( 15 const std::string &idx = "idx", const std::string &bnd = "n"); 16 17 template <class T> 18 friend source_generator& operator<<(source_generator &src, const T &t); 19 };
  32. 1 source_generator src; 2 src.kernel("mul2") 3 .open("(") 4 .parameter< size_t

    >("n") 5 .parameter< global_ptr<double> >("x") 6 .parameter< global_ptr<const double> >("y") 7 .close(")"); 8 src.open("{").grid_stride_loop("idx", "n").open("{"); 9 src.new_line() << "x[idx] = 2 * y[idx];"; 10 src.close("}").close("}"); 11 std::cout << src.str() << std::endl; OpenCL 1 kernel void mul2 ( 2 ulong n, 3 global double * x, 4 global const double * y 5 ) 6 { 7 for(ulong idx = get_global_id(0); idx < n; idx += get_global_size(0)) { 8 x[idx] = 2 * y[idx]; 9 } 10 }
  33. 1 source_generator src; 2 src.kernel("mul2") 3 .open("(") 4 .parameter< size_t

    >("n") 5 .parameter< global_ptr<double> >("x") 6 .parameter< global_ptr<const double> >("y") 7 .close(")"); 8 src.open("{").grid_stride_loop("idx", "n").open("{"); 9 src.new_line() << "x[idx] = 2 * y[idx];"; 10 src.close("}").close("}"); 11 std::cout << src.str() << std::endl; CUDA 1 extern "C" __global__ void mul2 ( 2 ulong n, 3 double * x, 4 const double * y, 5 ) 6 { 7 for(ulong idx=blockDim.x*blockIdx.x+threadIdx.x, grid_size=blockDim.x*gridDim.x; idx<n; idx+=grid_size) { 8 x[idx] = 2 * y[idx]; 9 } 10 }
  34. Compiling compute kernels JIT compilation: OpenCL: builtin functionality CUDA: generated

    sources are saved to disk and passed to nvcc to create PTX module. Offline kernel caching: OpenCL: save/load binaries CUDA: free side effect of the compilation process SHA1 hash of the source is used as unique id of the cached kernel.
  35. Writing generic code OpenCL compute kernel language is a dialect

    of C99: No templates, no function overloads, etc. CUDA backend uses same subset of C++ by necessity. Can we still write generic code?
  36. Writing generic code OpenCL compute kernel language is a dialect

    of C99: No templates, no function overloads, etc. CUDA backend uses same subset of C++ by necessity. Can we still write generic code? 1 template <class T> 2 std::string mul2_source() { 3 source_generator src; 4 src.kernel("mul2") 5 .open("(") 6 .parameter< size_t >("n") 7 .parameter< global_ptr<T> >("x") 8 .parameter< global_ptr<const T> >("y") 9 .close(")"); 10 src.open("{").grid_stride_loop("idx", "n").open("{"); 11 src.new_line() << "x[idx] = 2 * y[idx];"; 12 src.close("}").close("}"); 13 return src.str(); 14 }
  37. Writing generic code OpenCL compute kernel language is a dialect

    of C99: No templates, no function overloads, etc. CUDA backend uses same subset of C++ by necessity. Can we still write generic code? 1 struct mul2_t { 2 template <class Expr> 3 auto operator()(Expr &&expr) const { 4 typedef typename vex::detail::return_type<Expr>::type value_type; 5 VEX_FUNCTION(value_type, _mul2, (value_type, x), return 2 * x; ); 6 return _mul2(expr); 7 } 8 } const mul2; v = mul2(x + y);
  38. Writing generic code OpenCL compute kernel language is a dialect

    of C99: No templates, no function overloads, etc. CUDA backend uses same subset of C++ by necessity. Can we still write generic code? 1 struct mul2_t { 2 template <class Expr> 3 auto operator()(Expr &&expr) const { 4 typedef typename vex::detail::return_type<Expr>::type value_type; 5 VEX_FUNCTION(value_type, _mul2, (value_type, x), return 2 * x; ); 6 return _mul2(expr); 7 } 8 } const mul2; v = mul2(x + y); We don’t need compiler’s help when we can generate our own code!
  39. Limitations of expressions templates Impossible to alter execution details for

    an expression Which queue to use here? 1 vex::vector<double> x({q1}, n); 2 vex::vector<double> y({q2}, n); 3 x = 2 * y; One-by-one execution may be ineffective: x and y below are read twice. How to fuse the expressions into a single kernel? 1 u = x + y; 2 v = x - y;
  40. Limitations of expressions templates Impossible to alter execution details for

    an expression Which queue to use here? 1 vex::vector<double> x({q1}, n); 2 vex::vector<double> y({q2}, n); 3 x = 2 * y; vex::enqueue({q2}, x) = 2 * y; One-by-one execution may be ineffective: x and y below are read twice. How to fuse the expressions into a single kernel? 1 u = x + y; 2 v = x - y;
  41. Limitations of expressions templates Impossible to alter execution details for

    an expression Which queue to use here? 1 vex::vector<double> x({q1}, n); 2 vex::vector<double> y({q2}, n); 3 x = 2 * y; vex::enqueue({q2}, x) = 2 * y; One-by-one execution may be ineffective: x and y below are read twice. How to fuse the expressions into a single kernel? 1 u = x + y; 2 v = x - y; vex::tie(u, v) = std::tie(x + y, x - y);