Slide 1

VexCL: Experiences in Developing a C++ Wrapper Library for OpenCL

Denis Demidov
Kazan Federal University / Institute of System Research, Russian Academy of Sciences

06.2016, Lausanne

Slide 2

VexCL — a vector expression template library for OpenCL/CUDA

Created for ease of C++ based GPGPU development:
- Convenient notation for vector expressions
- OpenCL/CUDA JIT code generation
- Easily combined with existing libraries/code
- Header-only

Supported backends:
- OpenCL (Khronos C++ bindings, Boost.Compute)
- NVIDIA CUDA

The source code is available under the MIT license:
https://github.com/ddemidov/vexcl

Slides 3–7

Motivation/Hello World

OpenCL version:

#include <iostream>
#include <vector>
#include <cstring>

#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>

int main() {
    // Get list of OpenCL platforms.
    std::vector<cl::Platform> platform;
    cl::Platform::get(&platform);

    if (platform.empty())
        throw std::runtime_error("No OpenCL platforms");

    // Get first available GPU.
    cl::Context context;
    std::vector<cl::Device> device;
    for(auto p = platform.begin(); device.empty() && p != platform.end(); p++) {
        std::vector<cl::Device> pldev;
        try {
            p->getDevices(CL_DEVICE_TYPE_GPU, &pldev);
            for(auto d = pldev.begin(); device.empty() && d != pldev.end(); d++) {
                if (!d->getInfo<CL_DEVICE_AVAILABLE>()) continue;
                device.push_back(*d);
                context = cl::Context(device);
            }
        } catch(...) {
            device.clear();
        }
    }
    if (device.empty()) throw std::runtime_error("No GPUs");

    // Create command queue.
    cl::CommandQueue queue(context, device[0]);

    // Prepare input data, transfer it to the device.
    const size_t N = 1 << 20;
    std::vector<float> a(N, 1), b(N, 2), c(N);

    cl::Buffer A(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
            a.size() * sizeof(float), a.data());

    cl::Buffer B(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
            b.size() * sizeof(float), b.data());

    cl::Buffer C(context, CL_MEM_READ_WRITE,
            c.size() * sizeof(float));

    // Create kernel source.
    const char source[] = R"(
        kernel void add(
            ulong n,
            global const float *a,
            global const float *b,
            global float *c
            )
        {
            ulong i = get_global_id(0);
            if (i < n) c[i] = a[i] + b[i];
        }
        )";

    // Compile OpenCL program for the device.
    cl::Program program(context, cl::Program::Sources(
                1, std::make_pair(source, strlen(source))
                ));
    try {
        program.build(device);
    } catch (const cl::Error&) {
        std::cerr
            << "OpenCL compilation error" << std::endl
            << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device[0])
            << std::endl;
        throw std::runtime_error("OpenCL build error");
    }
    cl::Kernel add(program, "add");

    // Set kernel parameters.
    add.setArg(0, static_cast<cl_ulong>(N));
    add.setArg(1, A);
    add.setArg(2, B);
    add.setArg(3, C);

    // Launch kernel on the compute device.
    queue.enqueueNDRangeKernel(add, cl::NullRange, N, cl::NullRange);

    // Get result back to host.
    queue.enqueueReadBuffer(C, CL_TRUE, 0, c.size() * sizeof(float), c.data());
    std::cout << c[42] << std::endl; // Should get '3' here.
}

VexCL version:

#include <iostream>
#include <vector>
#include <vexcl/vexcl.hpp>

int main() {
    // Get first available GPU.
    vex::Context ctx(vex::Filter::GPU && vex::Filter::Count(1));
    if (!ctx) throw std::runtime_error("No GPUs");

    // Prepare and copy input data.
    const size_t n = 1024 * 1024;
    std::vector<float> a(n, 1), b(n, 2), c(n);
    vex::vector<float> A(ctx, a), B(ctx, b), C(ctx, n);

    // Launch compute kernel.
    C = A + B;

    // Get results back to host.
    vex::copy(C, c);
    std::cout << c[42] << std::endl;
}

Slide 8

OpenCL/CUDA code is generated from user expressions

Expression:

x = 2 * y - sin(z);

When compiled with:

g++ -DVEXCL_BACKEND_OPENCL -lOpenCL ...

Generated kernel:

kernel void vexcl_vector_kernel
(
    ulong n,
    global double * prm_1, // x
    int prm_2,             // 2
    global double * prm_3, // y
    global double * prm_4  // z
)
{
    for(ulong idx = get_global_id(0); idx < n; idx += get_global_size(0))
    {
        prm_1[idx] = ( ( prm_2 * prm_3[idx] ) - sin( prm_4[idx] ) );
    }
}

Slide 9

OpenCL/CUDA code is generated from user expressions

Expression:

x = 2 * y - sin(z);

When compiled with:

g++ -DVEXCL_BACKEND_CUDA -lcuda ...

Generated kernel:

extern "C" __global__ void vexcl_vector_kernel
(
    ulong n,
    double * prm_1, // x
    int prm_2,      // 2
    double * prm_3, // y
    double * prm_4  // z
)
{
    for(ulong idx = blockDim.x * blockIdx.x + threadIdx.x,
              grid_size = blockDim.x * gridDim.x;
        idx < n; idx += grid_size)
    {
        prm_1[idx] = ( ( prm_2 * prm_3[idx] ) - sin( prm_4[idx] ) );
    }
}

Slide 10

Monte Carlo π

Compute an approximate value of π. For a circle of radius $r$ inscribed in a square:

$\frac{\text{area of circle}}{\text{area of square}} = \frac{\pi r^2}{(2r)^2} = \frac{\pi}{4}$,

so

$\pi = 4\,\frac{\text{area of circle}}{\text{area of square}} \approx 4\,\frac{\text{points in circle}}{\text{all points}}$.

[Figure: random points in the unit square; axes from 0.0 to 1.0.]

In VexCL this fits in a couple of lines: rnd produces a random 2D point per index, the comparison yields 1 when the point falls inside the unit circle and 0 otherwise, and the reduction counts the hits.

vex::Random<cl_double2> rnd;
vex::Reductor<size_t, vex::SUM> sum(ctx);

double pi = 4.0 * sum( length( rnd(vex::element_index(0, n), seed) ) < 1 ) / n;

Slide 11

Expression templates

VexCL uses expression templates to generate compute kernels.

Todd Veldhuizen, Expression Templates, C++ Report, 1995.

Example: vector operations

template <class LHS, class OP, class RHS>
struct binary_op {
    const LHS &lhs;
    const RHS &rhs;
};

template <class LHS, class RHS>
binary_op<LHS, plus, RHS> operator+(const LHS &a, const RHS &b) {
    return binary_op<LHS, plus, RHS>{a, b};
}
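A minimal sketch of what this buys us (assuming a `minus` functor analogous to the `plus` defined on the following slides): constructing an expression performs no arithmetic, it only records the shape of the expression tree in the type.

vector x(n), y(n), z(n), a(n);

// No computation happens here; the type of e encodes the tree:
//   binary_op< binary_op<vector, plus, vector>, minus, vector >
auto e = x + y - z;

a = e; // evaluation is deferred to vector::operator=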

Slides 12–16

Vector operations

struct plus { static double apply(double a, double b) { return a + b; } };

template <class LHS, class OP, class RHS>
double binary_op<LHS, OP, RHS>::operator[](size_t i) const {
    return OP::apply(lhs[i], rhs[i]);
}

template <class Expr>
vector& vector::operator=(const Expr &expr) {
    for(size_t i = 0; i < _n; ++i) _data[i] = expr[i];
    return *this;
}

a = x + y - z;

[Figure: expression tree for the assignment, with "−" at the root and "+" (over x and y) and z as its children.]

Indexing the expression unfolds the tree, so the assignment becomes a single host-side loop that can then be parallelized:

#pragma omp parallel for
for(size_t i = 0; i < n; ++i)
    a[i] = x[i] + y[i] - z[i];

Slide 17

Source code generation

Expression:

a = x + y - z;

Kernel:

kernel void vexcl_vector_kernel(
    ulong n,
    global double * prm_1,
    global double * prm_2,
    global double * prm_3,
    global double * prm_4
)
{
    for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0)) {
        prm_1[idx] = ( ( prm_2[idx] + prm_3[idx] ) - prm_4[idx] );
    }
}

[Figure: the same expression tree as on the previous slides.]

Slide 18

Declaring parameters

Each terminal knows what parameters it needs:

/*static*/ void vector::declare_params(std::ostream &src, unsigned &pos)
{
    src << ",\n global double * prm" << ++pos;
}

An expression just asks its terminals to do the job:

template <class LHS, class OP, class RHS>
/*static*/ void binary_op<LHS, OP, RHS>::declare_params(std::ostream &src, unsigned &pos)
{
    LHS::declare_params(src, pos);
    RHS::declare_params(src, pos);
}

Slide 19

Building a string representation for an expression

struct plus {
    static void to_string(std::ostream &src) { src << " + "; }
};

/*static*/ void vector::to_string(std::ostream &src, unsigned &pos) {
    src << "prm" << ++pos << "[idx]";
}

template <class LHS, class OP, class RHS>
/*static*/ void binary_op<LHS, OP, RHS>::to_string(std::ostream &src, unsigned &pos) {
    src << "( ";
    LHS::to_string(src, pos);
    OP::to_string(src);
    RHS::to_string(src, pos);
    src << " )";
}

Slide 20

Generating kernel source

template <class LHS, class RHS>
std::string kernel_source() {
    std::ostringstream src;

    src << "kernel void vexcl_vector_kernel(\n ulong n";
    unsigned pos = 0;
    LHS::declare_params(src, pos);
    RHS::declare_params(src, pos);
    src << ")\n{\n"
           "  for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0)) {\n"
           "    ";
    pos = 0;
    LHS::to_string(src, pos); src << " = ";
    RHS::to_string(src, pos); src << ";\n";
    src << "  }\n}\n";

    return src.str();
}

Slide 21

Generating/launching the kernel on demand

template <class Expr>
const vector& vector::operator=(const Expr &expr) {
    static cl::Kernel krn = build_kernel(device, kernel_source<vector, Expr>());

    unsigned pos = 0;

    krn.setArg(pos++, size);  // n
    this->set_args(krn, pos); // result
    expr.set_args(krn, pos);  // other parameters

    queue.enqueueNDRangeKernel(krn, cl::NullRange, buffer.size(), cl::NullRange);

    return *this;
}

- The kernel is uniquely identified by the expression type.
- A local static variable may be used to cache the kernel.
- The kernel is generated and compiled once, applied many times.
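A standalone sketch of the caching idiom (not VexCL code): a function-local static is initialized exactly once per instantiated type, so each distinct expression type compiles its kernel on first use only.

#include <iostream>
#include <string>
#include <typeinfo>

// One cache entry per template instantiation: the static local is
// initialized on the first call for each distinct Expr type.
template <class Expr>
const std::string& cached_kernel() {
    static const std::string kernel = [] {
        std::cout << "compiling kernel for " << typeid(Expr).name() << "\n";
        return "<source for " + std::string(typeid(Expr).name()) + ">";
    }();
    return kernel;
}

int main() {
    cached_kernel<int>();    // "compiling kernel for ..." printed once
    cached_kernel<int>();    // cached, nothing printed
    cached_kernel<double>(); // different type, compiled again
}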

Slide 22

Using Boost.Proto

Problem: the hand-rolled approach does not scale very well.
- There are unary, binary, n-ary expressions.
- There are special terminals requiring either global or local preambles.
- There are builtin and user-defined functions.
- We need to be able to combine all of the above in a single expression (see the sketch below).

Solution: Boost.Proto
- A framework for building Embedded Domain-Specific Languages in C++.
- Provides tools for constructing, type-checking, transforming and executing expression templates.
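An example of the kind of composition the DSL has to admit. VEX_FUNCTION is VexCL's actual macro for defining device functions (it also appears later in this talk); the function name `squared` and the surrounding vectors are made up here.

// A user-defined device function...
VEX_FUNCTION(double, squared, (double, v), return v * v;);

// ...freely mixed with builtin functions and arithmetic in one expression:
z = sin(x) + squared(2 * y);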

Slide 23

Proto grammar

struct vector_grammar
    : proto::or_<
          proto::and_<
              proto::terminal< proto::_ >,
              proto::if_< is_vector_expr_terminal< proto::_value >() >
          >,
          proto::or_<
              proto::unary_plus< vector_grammar >,
              proto::negate< vector_grammar >,
              ...
              proto::plus< vector_grammar, vector_grammar >,
              proto::minus< vector_grammar, vector_grammar >,
              ...
          >,
          ...
      >
{};

Slides 24–25

Proto contexts

template <class T> struct partial_vector_expr {
    static void get(std::ostream &os, unsigned &pos) { os << "prm_" << ++pos; }
};

struct vector_expr_context {
    std::ostream &os; unsigned pos = 0;

    template <typename Expr, typename Tag = typename proto::tag_of<Expr>::type>
    struct eval;

    template <typename Expr>
    struct eval<Expr, proto::tag::plus> {
        typedef void result_type;
        void operator()(const Expr &expr, vector_expr_context &ctx) const {
            ctx.os << "( ";  proto::eval(proto::left (expr), ctx);
            ctx.os << " + "; proto::eval(proto::right(expr), ctx);
            ctx.os << " )";
        }
    };
    ...
    template <typename Expr>
    struct eval<Expr, proto::tag::terminal> {
        typedef void result_type;
        void operator()(const Expr &expr, vector_expr_context &ctx) const {
            partial_vector_expr<
                typename proto::result_of::value<Expr>::type
            >::get(ctx.os, ctx.pos);
        }
    };
};

Slide 26

Adding new terminals

Once the DSL engine is set up, adding new terminals is easy:

x = sin( 2 * M_PI * vex::element_index() / n );

Let Proto know the terminal is allowed in vector expressions:

struct elem_index {
    size_t offset;
    size_t length;
};

template <> struct is_vector_expr_terminal< elem_index > : std::true_type {};

inline auto element_index(size_t offset = 0, size_t length = 0) {
    return proto::as_expr( elem_index{offset, length} );
}

Slide 27

Parameter declaration:

template <> struct kernel_param_declaration< elem_index > {
    static std::string get(
            const elem_index&, const cl::Device&, const std::string &prm_name)
    {
        return ",\n\t" + type_name<size_t>() + " " + prm_name;
    }
};

Contribution to the expression string:

template <> struct partial_vector_expr< elem_index > {
    static std::string get(
            const elem_index&, const cl::Device&, const std::string &prm_name)
    {
        return "(" + prm_name + " + idx)";
    }
};

Slide 28

Setting kernel arguments:

template <> struct kernel_arg_setter< elem_index > {
    static void set(
            const elem_index &term, cl::Kernel &krn, unsigned /* device */,
            size_t index_offset, unsigned &position)
    {
        krn.setArg(position++, term.offset + index_offset);
    }
};

Done! We can use vex::element_index() in arbitrary vector expressions, as in the example below.
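For example (reusing the expression from slide 26, with ctx and n as in the earlier examples):

vex::vector<double> x(ctx, n);

// element_index() supplies the global index of each element as a
// vector-expression terminal; this fills x with one period of a sine wave.
x = sin( 2 * M_PI * vex::element_index() / n );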

Slide 29

Supporting different backends

Supported backends:
- OpenCL (Khronos C++ bindings)
- OpenCL (Boost.Compute)
- NVIDIA CUDA

Differences between backends:
- Host-side API
- Compute kernel language
- JIT compilation support

Slide 30

Abstraction layer

Any backend in VexCL consists of:

filter.hpp         Device filters
device_vector.hpp  Memory abstraction
source.hpp         Source generation
compiler.hpp       Compiling compute kernels
kernel.hpp         Compute kernel
context.hpp        Host-side objects

Slide 31

Source generation

class source_generator {
    source_generator& kernel(const std::string &name);

    template <class Return>
    source_generator& function(const std::string &name);

    template <class Prm>
    source_generator& parameter(const std::string &name);

    source_generator& open(const std::string &bracket);
    source_generator& close(const std::string &bracket);

    source_generator& new_line();
    source_generator& grid_stride_loop(
            const std::string &idx = "idx", const std::string &bnd = "n");

    template <class T>
    friend source_generator& operator<<(source_generator &src, const T &t);
};

Slide 32

source_generator src;

src.kernel("mul2")
   .open("(")
       .parameter< size_t >("n")
       .parameter< global_ptr< double > >("x")
       .parameter< global_ptr< const double > >("y")
   .close(")");
src.open("{").grid_stride_loop("idx", "n").open("{");
src.new_line() << "x[idx] = 2 * y[idx];";
src.close("}").close("}");

std::cout << src.str() << std::endl;

Generated OpenCL source:

kernel void mul2
(
    ulong n,
    global double * x,
    global const double * y
)
{
    for(ulong idx = get_global_id(0); idx < n; idx += get_global_size(0)) {
        x[idx] = 2 * y[idx];
    }
}

Slide 33

source_generator src;

src.kernel("mul2")
   .open("(")
       .parameter< size_t >("n")
       .parameter< global_ptr< double > >("x")
       .parameter< global_ptr< const double > >("y")
   .close(")");
src.open("{").grid_stride_loop("idx", "n").open("{");
src.new_line() << "x[idx] = 2 * y[idx];";
src.close("}").close("}");

std::cout << src.str() << std::endl;

Generated CUDA source:

extern "C" __global__ void mul2
(
    ulong n,
    double * x,
    const double * y
)
{
    for(ulong idx = blockDim.x * blockIdx.x + threadIdx.x,
              grid_size = blockDim.x * gridDim.x;
        idx < n; idx += grid_size)
    {
        x[idx] = 2 * y[idx];
    }
}

Slide 34

Compiling compute kernels

JIT compilation:
- OpenCL: built-in functionality.
- CUDA: generated sources are saved to disk and passed to nvcc to create a PTX module.

Offline kernel caching:
- OpenCL: save/load binaries.
- CUDA: comes as a free side effect of the compilation process.

A SHA1 hash of the source is used as the unique id of the cached kernel (see the sketch below).
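A minimal sketch of deriving such a cache id. It assumes Boost's detail-namespace SHA1 implementation (available in Boost of that era; the digest API changed in Boost 1.86); the helper name cache_id is made up here.

#include <iomanip>
#include <sstream>
#include <string>
#include <boost/uuid/detail/sha1.hpp>

// Hex SHA1 digest of the generated kernel source; any change to the
// source yields a different id, hence a different cache entry.
inline std::string cache_id(const std::string &src) {
    boost::uuids::detail::sha1 sha;
    sha.process_bytes(src.data(), src.size());

    unsigned int digest[5];
    sha.get_digest(digest);

    std::ostringstream os;
    for(unsigned int word : digest)
        os << std::hex << std::setw(8) << std::setfill('0') << word;
    return os.str();
}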

Slides 35–38

Writing generic code

The OpenCL compute kernel language is a dialect of C99: no templates, no function overloads, etc. The CUDA backend uses the same subset of C++ by necessity. Can we still write generic code?

Yes: host-side templates can parameterize the generated source,

template <class T>
std::string mul2_source() {
    source_generator src;
    src.kernel("mul2")
       .open("(")
           .parameter< size_t >("n")
           .parameter< global_ptr< T > >("x")
           .parameter< global_ptr< const T > >("y")
       .close(")");
    src.open("{").grid_stride_loop("idx", "n").open("{");
    src.new_line() << "x[idx] = 2 * y[idx];";
    src.close("}").close("}");
    return src.str();
}

and generic functors can work at the level of vector expressions:

struct mul2_t {
    template <class Expr>
    auto operator()(Expr &&expr) const {
        typedef typename vex::detail::return_type<Expr>::type value_type;
        VEX_FUNCTION(value_type, _mul2, (value_type, x), return 2 * x; );
        return _mul2(expr);
    }
} const mul2;

v = mul2(x + y);

We don't need the compiler's help when we can generate our own code!

Slides 39–41

Limitations of expression templates

It is impossible to alter execution details for an expression. Which queue should be used here?

vex::vector<double> x({q1}, n);
vex::vector<double> y({q2}, n);

x = 2 * y;

The solution is an explicit wrapper:

vex::enqueue({q2}, x) = 2 * y;

One-by-one execution may be ineffective: x and y below are read twice. How do we fuse the expressions into a single kernel?

u = x + y;
v = x - y;

By tying the outputs together:

vex::tie(u, v) = std::tie(x + y, x - y);

Slide 42

Questions?