
VexCL: GPGPU without the agonizing pain

Denis Demidov
September 16, 2015


VexCL is an open-source C++ vector expression template library for OpenCL/CUDA. It was created to ease GPGPU development with C++ and provides convenient, intuitive notation for linear algebra operations, vector arithmetic, and various parallel primitives. This talk gives a brief overview of the VexCL interface and discusses the C++ techniques that VexCL uses to generate OpenCL/CUDA compute kernels from user expressions.

CIMNE, Barcelona, 2015


  1. VexCL: GPGPU without the agonizing pain
     Denis Demidov, Institute of System Research, Russian Academy of Sciences
     16.09.2015, CIMNE, Barcelona
  2–3. Modern GPGPU frameworks
     CUDA: proprietary architecture by NVIDIA; requires NVIDIA hardware; more mature, many libraries; kernels are written in C++ and compiled to PTX together with the host program.
     OpenCL: open standard; supports a wide range of hardware; host code is much more verbose; kernels are written in C99 and compiled at runtime, adding an initialization overhead.
     The latter distinction is usually considered an OpenCL drawback. But it also allows us to generate more efficient kernels at runtime! VexCL takes care of this part.
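The runtime-compilation advantage can be sketched in plain C++ without any OpenCL calls: since kernel source is just a string assembled by the host, runtime values can be baked into the code as literals instead of being passed as arguments. The function and kernel names below are invented for illustration.

```cpp
#include <sstream>
#include <string>

// Build an OpenCL kernel source at runtime, embedding the scale factor
// as a literal instead of passing it as a kernel argument. This kind of
// specialization is only possible when kernels are compiled at runtime.
std::string make_scale_kernel(double factor) {
    std::ostringstream src;
    src << "kernel void scale(ulong n, global double *x) {\n"
        << "    for(size_t i = get_global_id(0); i < n; i += get_global_size(0))\n"
        << "        x[i] *= " << factor << ";\n"
        << "}\n";
    return src.str();
}
```

The resulting string would then be handed to the OpenCL runtime compiler, exactly as in the `cl::Program` step shown later in the deck.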
  4. VexCL — a vector expression template library for OpenCL/CUDA
     Created for ease of C++ based GPGPU development:
     - Convenient notation for vector expressions
     - OpenCL/CUDA JIT code generation
     - Easily combined with existing libraries/code
     - Header-only
     Supported backends: OpenCL (Khronos C++ bindings, Boost.Compute), NVIDIA CUDA.
     The source code is available under the MIT license: https://github.com/ddemidov/vexcl
  5. Hello OpenCL: vector sum
     Compute the sum of two vectors in parallel: A, B, and C are large vectors; compute C = A + B.
     Overview of (any) OpenCL solution:
     1. Initialize the OpenCL context
     2. Allocate memory
     3. Transfer input data
     4. Run computations
     5. Get the results
  6. Hello OpenCL: vector sum
     1. Query platforms:
        std::vector<cl::Platform> platform;
        cl::Platform::get(&platform);

        if (platform.empty())
            throw std::runtime_error("No OpenCL platforms");
  7. Hello OpenCL: vector sum
     2. Get the first available GPU:
        cl::Context context;
        std::vector<cl::Device> device;
        for(auto p = platform.begin(); device.empty() && p != platform.end(); p++) {
            std::vector<cl::Device> pldev;
            try {
                p->getDevices(CL_DEVICE_TYPE_GPU, &pldev);
                for(auto d = pldev.begin(); device.empty() && d != pldev.end(); d++) {
                    if (!d->getInfo<CL_DEVICE_AVAILABLE>()) continue;
                    device.push_back(*d);
                    context = cl::Context(device);
                }
            } catch(...) {
                device.clear();
            }
        }
        if (device.empty()) throw std::runtime_error("No GPUs");
  8. Hello OpenCL: vector sum
     3. Create command queue:
        cl::CommandQueue queue(context, device[0]);
     4. Prepare and copy input data:
        const size_t N = 1 << 20;
        std::vector<float> a(N, 1), b(N, 2), c(N);

        cl::Buffer A(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                     a.size() * sizeof(float), a.data());

        cl::Buffer B(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                     b.size() * sizeof(float), b.data());

        cl::Buffer C(context, CL_MEM_READ_WRITE,
                     c.size() * sizeof(float));
  9. Hello OpenCL: vector sum
     5. Create kernel source:
        const char source[] = R"(
            kernel void add(
                ulong n,
                global const float *a,
                global const float *b,
                global float *c
            )
            {
                ulong i = get_global_id(0);
                if (i < n) c[i] = a[i] + b[i];
            }
        )";
  10. Hello OpenCL: vector sum
     6. Compile the kernel:
        cl::Program program(context, cl::Program::Sources(
            1, std::make_pair(source, strlen(source))
        ));
        try {
            program.build(device);
        } catch (const cl::Error&) {
            std::cerr
                << "OpenCL compilation error" << std::endl
                << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device[0])
                << std::endl;
            throw std::runtime_error("OpenCL build error");
        }
        cl::Kernel add(program, "add");
  11. Hello OpenCL: vector sum
     7. Set kernel arguments:
        add.setArg(0, static_cast<cl_ulong>(N));
        add.setArg(1, A);
        add.setArg(2, B);
        add.setArg(3, C);
     8. Launch the kernel:
        queue.enqueueNDRangeKernel(add, cl::NullRange, N, cl::NullRange);
     9. Get result back to host:
        queue.enqueueReadBuffer(C, CL_TRUE, 0, c.size() * sizeof(float), c.data());
        std::cout << c[42] << std::endl; // Should get '3' here.
  13. Hello VexCL: vector sum
     OpenCL: the complete program assembled from steps 1–9 on the previous slides (about 90 lines).
     VexCL:
        #include <iostream>
        #include <vector>
        #include <vexcl/vexcl.hpp>

        int main() {
            // Get first available GPU
            vex::Context ctx(vex::Filter::GPU && vex::Filter::Count(1));
            if (!ctx) throw std::runtime_error("No GPUs");

            // Prepare and copy input data
            const size_t n = 1024 * 1024;
            std::vector<float> a(n, 1), b(n, 2), c(n);
            vex::vector<float> A(ctx, a), B(ctx, b), C(ctx, n);

            // Launch compute kernel
            C = A + B;

            // Get results back to host
            vex::copy(C, c);
            std::cout << c[42] << std::endl;
        }
  18–21. Initialization
     Multi-device and multi-platform computations are supported. A VexCL context is initialized from a combination of device filters. A device filter is a boolean functor acting on const vex::backend::device&.
     Initialize a VexCL context on selected devices:
        vex::Context ctx( vex::Filter::Any );

        vex::Context ctx( vex::Filter::GPU );

        vex::Context ctx( vex::Filter::Accelerator && vex::Filter::Platform("Intel") );

        vex::Context ctx(
            vex::Filter::DoublePrecision &&
            [](const vex::backend::device &d) {
                return d.getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>() >= 16_GB;
            });
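The idea of filters as composable boolean functors can be sketched in plain host-side C++. This is a toy illustration only: the `Device` struct and filter names below are invented, and this is not VexCL's actual implementation.

```cpp
#include <functional>

// Toy stand-in for vex::backend::device, for illustration only.
struct Device { bool is_gpu; bool has_double; };

// A filter is simply a predicate over a device.
using Filter = std::function<bool(const Device&)>;

// Combining two filters with && yields a filter requiring both,
// mirroring how VexCL composes device filters. (Defining such an
// operator for std::function is fine in a toy sketch like this one.)
Filter operator&&(Filter a, Filter b) {
    return [a, b](const Device &d) { return a(d) && b(d); };
}

const Filter gpu              = [](const Device &d) { return d.is_gpu; };
const Filter double_precision = [](const Device &d) { return d.has_double; };
```

A context would then keep every device for which the combined predicate returns true.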
  22–24. Memory and work splitting
     The same code runs with any of these contexts; vectors and the associated work are split across all devices the context selected:
        vex::Context ctx( vex::Filter::Name("Tesla") );
        // or: vex::Context ctx( vex::Filter::Type(CL_DEVICE_TYPE_GPU) );
        // or: vex::Context ctx( vex::Filter::DoublePrecision );

        vex::vector<double> x(ctx, N);
        vex::vector<double> y(ctx, N);

        x = vex::element_index() * (1.0 / N);
        y = sin(2 * x) + sqrt(1 - x * x);
     (Figures: plots of x and y, partitioned across the selected devices.)
  25. Copies between host and device memory
        vex::vector<double> d(ctx, n);
        std::vector<double> h(n);
        double a[100];
     STL-like range copies:
        vex::copy(d.begin(), d.end(), h.begin());
        vex::copy(d.begin(), d.begin() + 100, a);
     Simple copies:
        vex::copy(d, h);
        vex::copy(h, d);
     Map OpenCL buffer to host pointer:
        auto p = d.map(devnum);
        std::sort(&p[0], &p[d.part_size(devnum)]);
     Access single element (slow):
        double v = d[42];
        d[0] = 0;
  26. What vector expressions are supported?
     All vectors in an expression have to be compatible:
     - same size
     - located on the same devices
     What may be used:
     - vectors, scalars, constants
     - arithmetic and logical operators
     - built-in functions
     - user-defined functions
     - random number generators
     - slicing and permutations
     - reduction to a scalar (sum, min, max)
     - reduction across chosen dimensions
     - stencil operations
     - sparse matrix–vector products
     - fast Fourier transform
     - sort, scan, reduce by key
  27. Builtin operations and functions
     This expression:
        x = 2 * y - sin(z);
     results in this kernel (export VEXCL_SHOW_KERNELS=1 to see the generated code):
        kernel void vexcl_vector_kernel(
            ulong n,
            global double * prm_1,
            int prm_2,
            global double * prm_3,
            global double * prm_4
        )
        {
            for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0)) {
                prm_1[idx] = ( ( prm_2 * prm_3[idx] ) - sin( prm_4[idx] ) );
            }
        }
     (Figure: the expression tree — a "−" node over "∗" and "sin", with leaves 2, y, z.)
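This slide shows the core trick of the library: operators build an expression tree at compile time, and the tree is then traversed to emit kernel source. Below is a miniature host-only sketch of that technique. All names are invented for illustration; the real VexCL machinery is far more general.

```cpp
#include <sstream>
#include <string>

// Terminals of the expression tree.
struct Vec  { std::string name; };  // device vector stand-in
struct Scal { double v; };          // scalar stand-in

// Node types: operators merely record their operands.
template <class L, class R> struct Add { L l; R r; };
template <class L, class R> struct Sub { L l; R r; };
template <class L, class R> struct Mul { L l; R r; };
template <class A>          struct Sin { A a; };

template <class L, class R> Add<L,R> operator+(L l, R r) { return {l, r}; }
template <class L, class R> Sub<L,R> operator-(L l, R r) { return {l, r}; }
template <class L, class R> Mul<L,R> operator*(L l, R r) { return {l, r}; }
template <class A>          Sin<A>   sin(A a)            { return {a}; }

// Traverse the tree and render OpenCL-like source, one overload per node.
std::string emit(const Vec &v)  { return v.name + "[idx]"; }
std::string emit(const Scal &s) { std::ostringstream o; o << s.v; return o.str(); }
template <class L, class R>
std::string emit(const Add<L,R> &e) { return "( " + emit(e.l) + " + " + emit(e.r) + " )"; }
template <class L, class R>
std::string emit(const Sub<L,R> &e) { return "( " + emit(e.l) + " - " + emit(e.r) + " )"; }
template <class L, class R>
std::string emit(const Mul<L,R> &e) { return "( " + emit(e.l) + " * " + emit(e.r) + " )"; }
template <class A>
std::string emit(const Sin<A> &e)   { return "sin( " + emit(e.a) + " )"; }
```

With `Vec y{"y"}, z{"z"};`, emitting `Scal{2} * y - sin(z)` produces the same shape as the kernel body on the slide, and wrapping the string in a kernel skeleton and compiling it at runtime is exactly what VexCL automates.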
  28. Element indices
     vex::element_index(size_t offset = 0, size_t size = 0) returns the index of a vector element. The numbering starts with offset and is continuous across devices. The optional size may be used to explicitly set the expression size.
     Linear function:
        vex::vector<double> X(ctx, N);
        double x0 = 0, dx = 1e-3;
        X = x0 + dx * vex::element_index();
     Single period of the sine function:
        X = sin(2 * M_PI / N * vex::element_index());
  29. User-defined functions
     Defining a function:
        VEX_FUNCTION( double, sqr, (double, x)(double, y),
            return x * x + y * y;
        );
     Using the function:
        Z = sqrt( sqr(X, Y) );
  30. User functions are translated to OpenCL functions
        Z = sqrt( sqr(X, Y) );
     gets translated to:
        double sqr(double x, double y) {
            return x * x + y * y;
        }

        kernel void vexcl_vector_kernel(
            ulong n,
            global double * prm_1,
            global double * prm_2,
            global double * prm_3
        )
        {
            for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0)) {
                prm_1[idx] = sqrt( sqr( prm_2[idx], prm_3[idx] ) );
            }
        }
     (Figure: the expression tree — "sqrt" over "sqr", with leaves x, y.)
  31. Functions may be not only convenient, but also efficient
     The same example without using a function:
        Z = sqrt( X * X + Y * Y );
     gets translated to:
        kernel void vexcl_vector_kernel(
            ulong n,
            global double * prm_1,
            global double * prm_2,
            global double * prm_3,
            global double * prm_4,
            global double * prm_5
        )
        {
            for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0)) {
                prm_1[idx] = sqrt( ( ( prm_2[idx] * prm_3[idx] ) + ( prm_4[idx] * prm_5[idx] ) ) );
            }
        }
     Note that X and Y are now each passed twice, as separate kernel parameters.
  32. Tagged terminals
     The programmer may help VexCL recognize identical terminals by tagging them:
        using vex::tag;
        Z = sqrt(tag<1>(X) * tag<1>(X) +
                 tag<2>(Y) * tag<2>(Y));
     or, equivalently:
        auto x = tag<1>(X);
        auto y = tag<2>(Y);
        Z = sqrt(x * x + y * y);
     The generated kernel now takes each vector only once:
        kernel void vexcl_vector_kernel(
            ulong n,
            global double * prm_1,
            global double * prm_2,
            global double * prm_3
        )
        {
            for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0)) {
                prm_1[idx] = sqrt( ( ( prm_2[idx] * prm_2[idx] ) + ( prm_3[idx] * prm_3[idx] ) ) );
            }
        }
  33. Reusing intermediate results
     Some expressions may include several occurrences of the same subexpression:
        Z = log(X) * (log(X) + Y);
     Here log(X) will be computed twice. One could tag X and hope that the compiler is smart enough...
     (Figure: the expression tree — "∗" over "+" with two "log" nodes, leaves x, x, y.)
  34. Temporaries
     But it is also possible to introduce a temporary variable explicitly:
        auto tmp = vex::make_temp<1>( log(X) );
        Z = tmp * (tmp + Y);
     Generated kernel:
        kernel void vexcl_vector_kernel(
            ulong n,
            global double * prm_1,
            global double * prm_2,
            global double * prm_3
        )
        {
            for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0)) {
                double temp_1 = log( prm_2[idx] );
                prm_1[idx] = ( temp_1 * ( temp_1 + prm_3[idx] ) );
            }
        }
  35–36. Permutations (single device only)
     vex::permutation(expr) takes an arbitrary integral-valued expression and returns a functor:
        auto reverse = vex::permutation(N - 1 - vex::element_index());
        y = reverse(x);
     Permutations are writable:
        reverse(y) = x;
  37–39. Slicing (single device only)
     When working with dense multidimensional matrices, it is common practice to store them in contiguous 1D arrays. An instance of the vex::slicer<NDIM> class provides access to sub-blocks of such a matrix.
     n-by-n matrix and a slicer:
        vex::vector<double> x(ctx, n * n);
        vex::slicer<2> slice(vex::extents[n][n]); // Can be used with any vector of appropriate size
     Access a row or column of the matrix:
        using vex::_;
        y = slice[42](x);     // 42nd row
        y = slice[_][42](x);  // 42nd column
        slice[_][10](x) = y;  // Slices are writable
     Use ranges to select sub-blocks:
        z = slice[vex::range(0, 2, n)][vex::range(10, 20)](x);
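Behind a slicer is ordinary index arithmetic: in a row-major n-by-n matrix stored as a flat array, element (i, j) lives at i*n + j, so a row is a contiguous run and a column is a run with stride n. A host-side sketch of that arithmetic (illustrative helpers, not VexCL code):

```cpp
#include <cstddef>
#include <vector>

// Row i of a row-major n-by-n matrix stored in a flat array:
// elements i*n ... i*n + n-1, stride 1.
std::vector<double> row(const std::vector<double> &x, std::size_t n, std::size_t i) {
    return std::vector<double>(x.begin() + i * n, x.begin() + (i + 1) * n);
}

// Column j: elements j, j + n, j + 2n, ..., stride n.
std::vector<double> col(const std::vector<double> &x, std::size_t n, std::size_t j) {
    std::vector<double> c(n);
    for (std::size_t i = 0; i < n; ++i) c[i] = x[i * n + j];
    return c;
}
```

A slicer expression generates the same index computation inside the kernel, so no data is copied on the device.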
  40. Random number generation
     VexCL provides [1] counter-based random number generators from the Random123 suite [2]. The generators are stateless; mixing functions are applied to element indices. Implemented families: threefry and philox. Both pass TestU01/BigCrush; up to 2^64 independent streams with a period of 2^128. Performance: ≈ 10^10 samples/sec (Tesla K20c).
     vex::Random<T,G> — uniform distribution.
     vex::RandomNormal<T,G> — normal distribution.
        vex::Random<double> rnd;
        vex::vector<double> x(ctx, n);

        x = rnd(vex::element_index(), std::rand());
     [1] Contributed by Pascal Germroth [email protected]
     [2] D. E. Shaw Research, http://www.deshawresearch.com/resources random123.html
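The counter-based idea can be demonstrated on the host: a stateless mixing function maps (counter, key) to a value, so element i of a vector can be generated independently and in parallel as mix(i, seed). The mixing function below is a SplitMix64-style toy for illustration, not the actual Philox or Threefry algorithm:

```cpp
#include <cstdint>

// Toy stateless counter-based generator. Same (counter, key) always
// gives the same output, so there is no generator state to carry
// between elements -- each one is an independent function evaluation.
double counter_rng(std::uint64_t counter, std::uint64_t key) {
    std::uint64_t z = counter * 0x9E3779B97F4A7C15ull + key;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ull;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBull;
    z ^= z >> 31;
    // Top 53 bits scaled to a uniform double in [0, 1).
    return (z >> 11) * (1.0 / 9007199254740992.0);
}
```

This statelessness is what makes `x = rnd(vex::element_index(), seed)` a single data-parallel expression.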
  41. Reductions Class vex::Reductor<T, kind> allows one to reduce an arbitrary vector expression

    to a single value of type T. Supported reduction kinds: SUM, SUM_Kahan, MIN, MAX Inner product 1 vex::Reductor<double, vex::SUM> sum(ctx); 2 double s = sum(x * y); Number of elements in x between 0 and 1 1 vex::Reductor<size_t, vex::SUM> sum(ctx); 2 size_t n = sum( (x > 0) && (x < 1) ); Maximum distance from origin 1 vex::Reductor<double, vex::MAX> max(ctx); 2 double d = max( sqrt(x * x + y * y) );
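For reference, the last two reductions written as plain sequential loops; this is what the reduction kernels compute, minus the parallel tree reduction (the function names are illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// sum( (x > 0) && (x < 1) ): number of elements strictly between 0 and 1.
std::size_t count_in_unit_interval(const std::vector<double> &x) {
    std::size_t n = 0;
    for (double v : x) if (v > 0 && v < 1) ++n;
    return n;
}

// max( sqrt(x * x + y * y) ): maximum distance from the origin.
double max_distance(const std::vector<double> &x, const std::vector<double> &y) {
    double d = 0;
    for (std::size_t i = 0; i < x.size(); ++i)
        d = std::max(d, std::sqrt(x[i] * x[i] + y[i] * y[i]));
    return d;
}
```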
  42. Monte Carlo π Compute approximate value of π: area of

    circle / area of square = πr² / (2r)² = π/4, so π = 4 · (area of circle) / (area of square) ≈ 4 · (points in circle) / (all points). (Figure: random points in the unit square; those inside the quarter circle of radius 1 are counted.) 1 vex::Random<cl_double2> rnd; // Generates 2D points in [0, 1] × [0, 1] 2 vex::Reductor<size_t, vex::SUM> sum(ctx); 3 4 double pi = 4.0 * sum( length( rnd(vex::element_index(0, n), seed) ) < 1 ) / n;
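The same estimator can be checked on the host; the sketch below substitutes the standard library's std::mt19937_64 for the counter-based philox generator, so it only mirrors the math, not VexCL's RNG:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>

// Monte Carlo estimate of pi: the fraction of uniform points in the unit
// square that falls inside the quarter circle of radius 1 approaches pi/4.
double estimate_pi(std::size_t n, unsigned seed) {
    std::mt19937_64 rng(seed);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::size_t inside = 0;
    for (std::size_t i = 0; i < n; ++i) {
        double x = uni(rng), y = uni(rng);
        if (x * x + y * y < 1.0) ++inside;   // point is inside the quarter circle
    }
    return 4.0 * static_cast<double>(inside) / static_cast<double>(n);
}
```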
  43. Monte Carlo π: the generated kernel 1 #if defined(cl_khr_fp64) 2

    #pragma OPENCL EXTENSION cl_khr_fp64: enable 3 #elif defined(cl_amd_fp64) 4 #pragma OPENCL EXTENSION cl_amd_fp64: enable 5 #endif 6 7 void philox_uint_4_10 8 ( 9 uint * ctr, 10 uint * key 11 ) 12 { 13 uint m[4]; 14 m[0] = mul_hi(0xD2511F53, ctr[0]); 15 m[1] = 0xD2511F53 * ctr[0]; 16 m[2] = mul_hi(0xCD9E8D57, ctr[2]); 17 m[3] = 0xCD9E8D57 * ctr[2]; 18 ctr[0] = m[2] ^ ctr[1] ^ key[0]; 19 ctr[1] = m[3]; 20 ctr[2] = m[0] ^ ctr[3] ^ key[1]; 21 ctr[3] = m[1]; 22 key[0] += 0x9E3779B9; 23 key[1] += 0xBB67AE85; 24 m[0] = mul_hi(0xD2511F53, ctr[0]); 25 m[1] = 0xD2511F53 * ctr[0]; 26 m[2] = mul_hi(0xCD9E8D57, ctr[2]); 27 m[3] = 0xCD9E8D57 * ctr[2]; 28 ctr[0] = m[2] ^ ctr[1] ^ key[0]; 29 ctr[1] = m[3]; 30 ctr[2] = m[0] ^ ctr[3] ^ key[1]; 31 ctr[3] = m[1]; 32 key[0] += 0x9E3779B9; 33 key[1] += 0xBB67AE85; 34 m[0] = mul_hi(0xD2511F53, ctr[0]); 35 m[1] = 0xD2511F53 * ctr[0]; 36 m[2] = mul_hi(0xCD9E8D57, ctr[2]); 37 m[3] = 0xCD9E8D57 * ctr[2]; 38 ctr[0] = m[2] ^ ctr[1] ^ key[0]; 39 ctr[1] = m[3]; 40 ctr[2] = m[0] ^ ctr[3] ^ key[1]; 41 ctr[3] = m[1]; 42 key[0] += 0x9E3779B9; 43 key[1] += 0xBB67AE85; 44 m[0] = mul_hi(0xD2511F53, ctr[0]); 45 m[1] = 0xD2511F53 * ctr[0]; 46 m[2] = mul_hi(0xCD9E8D57, ctr[2]); 47 m[3] = 0xCD9E8D57 * ctr[2]; 48 ctr[0] = m[2] ^ ctr[1] ^ key[0]; 49 ctr[1] = m[3]; 50 ctr[2] = m[0] ^ ctr[3] ^ key[1]; 51 ctr[3] = m[1]; 52 key[0] += 0x9E3779B9; 53 key[1] += 0xBB67AE85; 54 m[0] = mul_hi(0xD2511F53, ctr[0]); 55 m[1] = 0xD2511F53 * ctr[0]; 56 m[2] = mul_hi(0xCD9E8D57, ctr[2]); 57 m[3] = 0xCD9E8D57 * ctr[2]; 58 ctr[0] = m[2] ^ ctr[1] ^ key[0]; 59 ctr[1] = m[3]; 60 ctr[2] = m[0] ^ ctr[3] ^ key[1]; 61 ctr[3] = m[1]; 62 key[0] += 0x9E3779B9; 63 key[1] += 0xBB67AE85; 64 m[0] = mul_hi(0xD2511F53, ctr[0]); 65 m[1] = 0xD2511F53 * ctr[0]; 66 m[2] = mul_hi(0xCD9E8D57, ctr[2]); 67 m[3] = 0xCD9E8D57 * ctr[2]; 68 ctr[0] = m[2] ^ ctr[1] ^ key[0]; 69 ctr[1] = m[3]; 70 ctr[2] = m[0] ^ ctr[3] ^ key[1]; 71 ctr[3] = m[1]; 72 key[0] += 0x9E3779B9; 73 key[1] += 0xBB67AE85; 74 m[0] = mul_hi(0xD2511F53, ctr[0]); 75 m[1] = 0xD2511F53 * ctr[0]; 76 m[2] = mul_hi(0xCD9E8D57, ctr[2]); 77 m[3] = 0xCD9E8D57 * ctr[2]; 78 ctr[0] = m[2] ^ ctr[1] ^ key[0]; 79 ctr[1] = m[3]; 80 ctr[2] = m[0] ^ ctr[3] ^ key[1]; 81 ctr[3] = m[1]; 82 key[0] += 0x9E3779B9; 83 key[1] += 0xBB67AE85; 84 m[0] = mul_hi(0xD2511F53, ctr[0]); 85 m[1] = 0xD2511F53 * ctr[0]; 86 m[2] = mul_hi(0xCD9E8D57, ctr[2]); 87 m[3] = 0xCD9E8D57 * ctr[2]; 88 ctr[0] = m[2] ^ ctr[1] ^ key[0]; 89 ctr[1] = m[3]; 90 ctr[2] = m[0] ^ ctr[3] ^ key[1]; 91 ctr[3] = m[1]; 92 key[0] += 0x9E3779B9; 93 key[1] += 0xBB67AE85; 94 m[0] = mul_hi(0xD2511F53, ctr[0]); 95 m[1] = 0xD2511F53 * ctr[0]; 96 m[2] = mul_hi(0xCD9E8D57, ctr[2]); 97 m[3] = 0xCD9E8D57 * ctr[2]; 98 ctr[0] = m[2] ^ ctr[1] ^ key[0]; 99 ctr[1] = m[3]; 100 ctr[2] = m[0] ^ ctr[3] ^ key[1]; 101 ctr[3] = m[1]; 102 key[0] += 0x9E3779B9; 103 key[1] += 0xBB67AE85; 104 m[0] = mul_hi(0xD2511F53, ctr[0]); 105 m[1] = 0xD2511F53 * ctr[0]; 106 m[2] = mul_hi(0xCD9E8D57, ctr[2]); 107 m[3] = 0xCD9E8D57 * ctr[2]; 108 ctr[0] = m[2] ^ ctr[1] ^ key[0]; 109 ctr[1] = m[3]; 110 ctr[2] = m[0] ^ ctr[3] ^ key[1]; 111 ctr[3] = m[1]; 112 } 113 double2 random_double2_philox 114 ( 115 ulong prm1, 116 ulong prm2 117 ) 118 { 119 union 120 { 121 uint ctr[4]; 122 ulong res_i[2]; 123 double res_f[2]; 124 double2 res; 125 } ctr; 126 uint key[2]; 127 ctr.ctr[0] = prm1; ctr.ctr[1] = prm2; 128 ctr.ctr[2] = prm1; ctr.ctr[3] = prm2; 129 key[0] = 0x12345678; 130 key[1] = 0x12345678; 131 philox_uint_4_10(ctr.ctr, key); 132 ctr.res_f[0] = ctr.res_i[0] / 18446744073709551615.0; 133 ctr.res_f[1] = ctr.res_i[1] / 18446744073709551615.0; 134 return ctr.res; 135 } 136 ulong SUM_ulong 137 ( 138 ulong prm1, 139 ulong prm2 140 ) 141 { 142 return prm1 + prm2; 143 } 144 kernel void vexcl_reductor_kernel 145 ( 146 ulong n, 147 ulong prm_1, 148 ulong prm_2, 149 double prm_3, 150 global ulong * g_odata, 151 local ulong * smem 152 ) 153 { 154 ulong mySum = (ulong)0; 155 for(ulong idx = get_global_id(0); idx < n; idx += get_global_size(0)) 156 { 157 mySum = SUM_ulong(mySum, ( length( random_double2_philox( (prm_1 + idx), prm_2 ) ) < prm_3 )); 158 } 159 local ulong * sdata = smem; 160 size_t tid = get_local_id(0); 161 size_t block_size = get_local_size(0); 162 sdata[tid] = mySum; 163 barrier(CLK_LOCAL_MEM_FENCE); 164 if (block_size >= 1024) 165 { 166 if (tid < 512) { sdata[tid] = mySum = SUM_ulong(mySum, sdata[tid + 512]); } 167 barrier(CLK_LOCAL_MEM_FENCE); 168 } 169 if (block_size >= 512) 170 { 171 if (tid < 256) { sdata[tid] = mySum = SUM_ulong(mySum, sdata[tid + 256]); } 172 barrier(CLK_LOCAL_MEM_FENCE); 173 } 174 if (block_size >= 256) 175 { 176 if (tid < 128) { sdata[tid] = mySum = SUM_ulong(mySum, sdata[tid + 128]); } 177 barrier(CLK_LOCAL_MEM_FENCE); 178 } 179 if (block_size >= 128) 180 { 181 if (tid < 64) { sdata[tid] = mySum = SUM_ulong(mySum, sdata[tid + 64]); } 182 barrier(CLK_LOCAL_MEM_FENCE); 183 } 184 if (tid < 32) 185 { 186 volatile local ulong * smem = sdata; 187 if (block_size >= 64) { smem[tid] = mySum = SUM_ulong(mySum, smem[tid + 32]); } 188 if (block_size >= 32) { smem[tid] = mySum = SUM_ulong(mySum, smem[tid + 16]); } 189 if (block_size >= 16) { smem[tid] = mySum = SUM_ulong(mySum, smem[tid + 8]); } 190 if (block_size >= 8) { smem[tid] = mySum = SUM_ulong(mySum, smem[tid + 4]); } 191 if (block_size >= 4) { smem[tid] = mySum = SUM_ulong(mySum, smem[tid + 2]); } 192 if (block_size >= 2) { smem[tid] = mySum = SUM_ulong(mySum, smem[tid + 1]); } 193 } 194 if (tid == 0) g_odata[get_group_id(0)] = sdata[0]; 195 }
  44. Sparse matrix – vector products Additive expressions Class vex::SpMat<T>

    holds a representation of a sparse matrix on the compute devices. The constructor accepts a matrix in the common CRS format: row pointers, column indices, and values of the nonzero entries. Construct matrix 1 vex::SpMat<double> A(ctx, n, n, row.data(), col.data(), val.data()); Compute residual value 2 // vex::vector<double> u, f, r; 3 r = f - A * u; 4 double res = max( fabs(r) );
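The CRS convention the constructor expects is easy to pin down with a sequential reference product, where row[i] and row[i+1] delimit the nonzeros of row i (spmv is an illustrative name, not a VexCL function):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y = A * x for a sparse matrix in CRS format.
std::vector<double> spmv(
    const std::vector<std::size_t> &row, // row pointers, size n + 1
    const std::vector<std::size_t> &col, // column indices of nonzeros
    const std::vector<double>      &val, // values of nonzeros
    const std::vector<double>      &x)
{
    std::size_t n = row.size() - 1;
    std::vector<double> y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = row[i]; j < row[i + 1]; ++j)
            y[i] += val[j] * x[col[j]];
    return y;
}
```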
  45. Inlining sparse matrix – vector products Single device only SpMV

    may only be used in additive expressions: it needs data exchange between compute devices, which is impossible to implement within a single kernel. This restriction may be lifted for single-device contexts: r = f - vex::make_inline(A * u); double res = max( fabs(r) ); // Do not store intermediate results: res = max( fabs( f - vex::make_inline(A * u) ) );
  46. Raw pointer arithmetic Single device only The raw_pointer(const vector<T>&) function returns

    a pointer to the vector’s data inside a vector expression. It may be used to implement complex access patterns. 1D Laplace operator: 1 auto ptr = vex::raw_pointer(x); 2 auto idx = vex::element_index(); 3 4 auto left = if_else(idx > 0, idx - 1, idx); 5 auto right = if_else(idx + 1 < n, idx + 1, idx); 6 7 y = 2 * ptr[idx] - ptr[left] - ptr[right];
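The same stencil as an ordinary host-side loop, clamping boundary indices exactly as the if_else expressions above do (laplace1d is an illustrative reference, not part of VexCL):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y[i] = 2 x[i] - x[i-1] - x[i+1], with neighbor indices clamped at the ends.
std::vector<double> laplace1d(const std::vector<double> &x) {
    std::size_t n = x.size();
    std::vector<double> y(n);
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t left  = (i > 0)     ? i - 1 : i;
        std::size_t right = (i + 1 < n) ? i + 1 : i;
        y[i] = 2 * x[i] - x[left] - x[right];
    }
    return y;
}
```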
  47. N-body problem with raw pointers yi = Σj≠i exp(−|xi − xj|)

    1 VEX_FUNCTION(double, nbody, (size_t, n)(size_t, i)(double*, x), 2 double sum = 0, myval = x[i]; 3 for(size_t j = 0; j < n; ++j) 4 if (j != i) sum += exp(-fabs(x[j] - myval)); 5 return sum; 6 ); 7 8 y = nbody(x.size(), vex::element_index(), raw_pointer(x));
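The body of the VEX_FUNCTION is already plain C; wrapped into a host-side reference it looks like this (nbody here is an ordinary C++ function, not the generated device function):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// y[i] = sum over all j != i of exp(-|x[i] - x[j]|)
std::vector<double> nbody(const std::vector<double> &x) {
    std::size_t n = x.size();
    std::vector<double> y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            if (j != i) y[i] += std::exp(-std::fabs(x[i] - x[j]));
    return y;
}
```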
  48. Fast Fourier Transform Single device only VexCL provides an FFT implementation³:

    Arbitrary vector expressions as input Multidimensional transforms Arbitrary sizes 2–3 times slower than NVIDIA’s CUFFT, about the same speed as AMD’s clFFT Solve Poisson equation with FFT: 1 vex::FFT<double, cl_double2> fft(ctx, n); 2 vex::FFT<cl_double2, double> ifft(ctx, n, vex::inverse); 3 4 vex::vector<double> rhs(ctx, n), u(ctx, n), K(ctx, n); 5 // ... initialize vectors ... 6 7 u = ifft( K * fft(rhs) ); ³Contributed by Pascal Germroth [email protected]
  49. Summary VexCL allows one to write compact and readable code. It is

    great for fast prototyping of scientific GPGPU applications. The performance is often on par with manually written kernels. [1] D. Demidov, K. Ahnert, K. Rupp, and P. Gottschling. Programming CUDA and OpenCL: A Case Study Using Modern C++ Libraries. SIAM J. Sci. Comput., 35(5):C453 – C472, 2013. doi:10.1137/120903683 [2] K. Ahnert, D. Demidov, and M. Mulansky. Solving ordinary differential equations on GPUs. In Numerical Computations with GPUs (pp. 125–157). Springer, 2014. doi:10.1007/978-3-319-06548-9_7
  50. Expression templates How to implement a DSL in C++ effectively?

    The idea is quite old: 1995: Todd Veldhuizen, Expression templates, C++ Report. 1996: Blitz++ is a C++ class library for scientific computing which provides performance on par with Fortran 77/90. Today: Armadillo, Blaze, Boost.uBLAS, Eigen, MTL, etc. But how does it work?
  51. Simple example: Vector arithmetic We want to be able to

    write: 1 x = y + z; and it has to be as efficient as: 1 for(size_t i = 0; i < n; ++i) 2 x[i] = y[i] + z[i];
  52. C++ allows us to overload operators! 1 vector operator+(const vector

    &a, const vector &b) { 2 vector tmp( a.size() ); 3 for(size_t i = 0; i < a.size(); ++i) 4 tmp[i] = a[i] + b[i]; 5 return tmp; 6 } Any problems?
  53. C++ allows us to overload operators! 1 vector operator+(const vector

    &a, const vector &b) { 2 vector tmp( a.size() ); 3 for(size_t i = 0; i < a.size(); ++i) 4 tmp[i] = a[i] + b[i]; 5 return tmp; 6 } Any problems? Extra memory allocation Extra memory I/O 1 a = x + y + z; 2 temporary vectors 8 × n memory reads/writes 1 for(size_t i = 0; i < n; ++i) 2 a[i] = x[i] + y[i] + z[i]; no temporaries 4 × n memory reads/writes
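The cost of the temporaries is easy to make visible with an instrumented vector type that counts allocations; the class below exists only for this measurement and is not part of any library:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

static int allocations = 0; // incremented by every newly constructed vec

struct vec {
    std::vector<double> data;
    explicit vec(std::size_t n) : data(n) { ++allocations; }
    double  operator[](std::size_t i) const { return data[i]; }
    double &operator[](std::size_t i)       { return data[i]; }
    std::size_t size() const { return data.size(); }
};

// Eager operator+: allocates a full temporary for every partial sum.
vec operator+(const vec &a, const vec &b) {
    vec tmp(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) tmp[i] = a[i] + b[i];
    return tmp;
}
```

Evaluating `a = x + y + z` with this eager operator constructs one temporary for `x + y` and another for the outer sum, exactly the two extra allocations counted above.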
  54. Lazy evaluation v0.1 The idea: postpone the actual evaluation until

    assignment. 1 struct vsum { 2 const vector &lhs; 3 const vector &rhs; 4 };
  55. Lazy evaluation v0.1 The idea: postpone the actual evaluation until

    assignment. 1 struct vsum { 2 const vector &lhs; 3 const vector &rhs; 4 }; 5 6 vsum operator+(const vector &a, const vector &b) { 7 return vsum{a, b}; 8 }
  56. Lazy evaluation v0.1 The idea: postpone the actual evaluation until

    assignment. 1 struct vsum { 2 const vector &lhs; 3 const vector &rhs; 4 }; 5 6 vsum operator+(const vector &a, const vector &b) { 7 return vsum{a, b}; 8 } 9 10 const vector& vector::operator=(const vsum &s) { 11 for(size_t i = 0; i < n; ++i) 12 data[i] = s.lhs[i] + s.rhs[i]; 13 return *this; 14 }
  57. Problem: it’s not composable What will happen here? 1 a

    = x + y + z; lazy_v1.cpp:38:15: error: invalid operands to binary expression ( vsum and vector ) a = x + y + z; ˜˜˜˜˜ ˆ ˜ lazy_v1.cpp:12:12: note: candidate function not viable: no known conversion from vsum to const vector for 1st argument vsum operator+(const vector &a, const vector &b) { ˆ 1 error generated.
  58. Lazy evaluation v0.2 1 template <class LHS, class RHS> 2

    struct vsum { 3 const LHS &lhs; 4 const RHS &rhs; 5 6 double operator[](size_t i) const { return lhs[i] + rhs[i]; } 7 };
  59. Lazy evaluation v0.2 1 template <class LHS, class RHS> 2

    struct vsum { 3 const LHS &lhs; 4 const RHS &rhs; 5 6 double operator[](size_t i) const { return lhs[i] + rhs[i]; } 7 }; 8 9 template <class LHS, class RHS> 10 vsum<LHS, RHS> operator+(const LHS &a, const RHS &b) { 11 return vsum<LHS, RHS>{a, b}; 12 }
  60. Lazy evaluation v0.2 1 template <class LHS, class RHS> 2

    struct vsum { 3 const LHS &lhs; 4 const RHS &rhs; 5 6 double operator[](size_t i) const { return lhs[i] + rhs[i]; } 7 }; 8 9 template <class LHS, class RHS> 10 vsum<LHS, RHS> operator+(const LHS &a, const RHS &b) { 11 return vsum<LHS, RHS>{a, b}; 12 } 13 14 template<class Expr> 15 const vector& vector::operator=(const Expr &expr) { 16 for(int i = 0; i < n; ++i) data[i] = expr[i]; 17 return *this; 18 }
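Padded out with the toy vector class it operates on, the v0.2 sketch becomes a complete program (everything below is a host-side stand-in for illustration, not vex::vector):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace toy {

struct vector {
    std::vector<double> data;
    explicit vector(std::size_t n) : data(n) {}
    double  operator[](std::size_t i) const { return data[i]; }
    double &operator[](std::size_t i)       { return data[i]; }

    // Assignment from any expression providing operator[]:
    // this single loop evaluates the whole expression tree.
    template <class Expr>
    vector& operator=(const Expr &expr) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = expr[i];
        return *this;
    }
};

// Lazy sum node: evaluation happens element-wise in operator[].
template <class LHS, class RHS>
struct vsum {
    const LHS &lhs;
    const RHS &rhs;
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

template <class LHS, class RHS>
vsum<LHS, RHS> operator+(const LHS &a, const RHS &b) {
    return vsum<LHS, RHS>{a, b};
}

} // namespace toy
```

Because operator+ is now a template, `x + y + z` nests: the second `+` accepts the vsum produced by the first, which fixes the composability problem of v0.1.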
  61. Lazy evaluation v0.3 1 struct plus { 2 static double

    apply(double a, double b) { return a + b; } 3 };
  62. Lazy evaluation v0.3 1 struct plus { 2 static double

    apply(double a, double b) { return a + b; } 3 }; 4 5 template <class LHS, class OP, class RHS> 6 struct binary_op { 7 const LHS &lhs; 8 const RHS &rhs; 9 10 double operator[](size_t i) const { return OP::apply(lhs[i], rhs[i]); } 11 };
  63. Lazy evaluation v0.3 1 struct plus { 2 static double

    apply(double a, double b) { return a + b; } 3 }; 4 5 template <class LHS, class OP, class RHS> 6 struct binary_op { 7 const LHS &lhs; 8 const RHS &rhs; 9 10 double operator[](size_t i) const { return OP::apply(lhs[i], rhs[i]); } 11 }; 12 13 template <class LHS, class RHS> 14 binary_op<LHS, plus, RHS> operator+(const LHS &a, const RHS &b) { 15 return binary_op<LHS, plus, RHS>{a, b}; 16 }
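Adding a second operation functor shows why the generalization pays off; a compilable host-side sketch (again a toy vector for illustration, not vex::vector):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace et {

struct vector {
    std::vector<double> data;
    explicit vector(std::size_t n) : data(n) {}
    double  operator[](std::size_t i) const { return data[i]; }
    double &operator[](std::size_t i)       { return data[i]; }

    // Evaluate any expression providing operator[].
    template <class Expr>
    vector& operator=(const Expr &expr) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = expr[i];
        return *this;
    }
};

struct plus  { static double apply(double a, double b) { return a + b; } };
struct minus { static double apply(double a, double b) { return a - b; } };

// One node type covers every binary operation via the OP functor.
template <class LHS, class OP, class RHS>
struct binary_op {
    const LHS &lhs;
    const RHS &rhs;
    double operator[](std::size_t i) const { return OP::apply(lhs[i], rhs[i]); }
};

template <class LHS, class RHS>
binary_op<LHS, plus, RHS> operator+(const LHS &a, const RHS &b) {
    return binary_op<LHS, plus, RHS>{a, b};
}

template <class LHS, class RHS>
binary_op<LHS, minus, RHS> operator-(const LHS &a, const RHS &b) {
    return binary_op<LHS, minus, RHS>{a, b};
}

} // namespace et
```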
  64. Expression templates are trees The expression in the RHS of:

    1 a = x + y; ... is of type: binary_op< vector, plus, vector > + x y
  65. Expression templates are trees The expression in the RHS of:

    1 a = x + y - z; ... is of type: binary_op< binary_op< vector, plus, vector > , minus , vector > − z + x y
  70. Expression templates are trees The expression in the RHS of:

    1 a = x + y - z; ... is of type: binary_op< binary_op< vector, plus, vector > , minus , vector > − z + x y #pragma omp parallel for for(size_t i = 0; i < n; ++i) a[i] = ( x[i] + y[i] ) - z[i]; The C++ compiler walks the tree. binary_op::operator[] does the actual work.
  71. So far, so good It is now possible to: 1

    v = a * x + b * y; 2 3 double c = (x + y)[42]; ... and it’s as efficient as: 1 for(size_t i = 0; i < n; ++i) 2 v[i] = a[i] * x[i] + b[i] * y[i]; 3 4 double c = x[42] + y[42]; No temporaries involved. An optimizing compiler is able to inline everything.
  72. So far, so good It is now possible to: 1

    v = a * x + b * y; 2 3 double c = (x + y)[42]; ... and it’s as efficient as: 1 for(size_t i = 0; i < n; ++i) 2 v[i] = a[i] * x[i] + b[i] * y[i]; 3 4 double c = x[42] + y[42]; No temporaries involved. An optimizing compiler is able to inline everything. But how is that related to OpenCL code generation?
  73. How does OpenCL work? 1 A compute kernel is compiled

    at runtime from C99 source. 2 Kernel parameters are set through API calls. 3 Kernel is launched on a compute device.
  74. How does OpenCL work? 1 A compute kernel is compiled

    at runtime from C99 source. 2 Kernel parameters are set through API calls. 3 Kernel is launched on a compute device. Source may be read from a file, or stored in a static string, or generated.
  75. Generating kernel source from an expression We want this expression:

    1 a = x + y - z; . . . to result in this kernel: 1 kernel void vexcl_vector_kernel( 2 ulong n, 3 global double * res, 4 global double * prm1, 5 global double * prm2, 6 global double * prm3 7 ) 8 { 9 for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0)) { 10 res[idx] = ( ( prm1[idx] + prm2[idx] ) - prm3[idx] ); 11 } 12 } − z + x y
  76. Declaring parameters Each terminal knows what parameters it needs: 1

    /*static*/ void vector::declare_params(std::ostream &src, unsigned &pos) { 2 src << ",\n global double * prm" << ++pos; 3 } An expression just asks its terminals to do the work: 4 template <class LHS, class OP, class RHS> 5 /*static*/ void binary_op<LHS, OP, RHS>::declare_params( 6 std::ostream &src, unsigned &pos) 7 { 8 LHS::declare_params(src, pos); 9 RHS::declare_params(src, pos); 10 }
  77. Building string representation for expression 1 struct plus { 2

    static void to_string(std::ostream &src) { src << " + "; } 3 };
  78. Building string representation for expression 1 struct plus { 2

    static void to_string(std::ostream &src) { src << " + "; } 3 }; 4 5 /*static*/ void vector::to_string(std::ostream &src, unsigned &pos) { 6 src << "prm" << ++pos << "[idx]"; 7 }
  79. Building string representation for expression 1 struct plus { 2

    static void to_string(std::ostream &src) { src << " + "; } 3 }; 4 5 /*static*/ void vector::to_string(std::ostream &src, unsigned &pos) { 6 src << "prm" << ++pos << "[idx]"; 7 } 8 9 template <class LHS, class OP, class RHS> 10 /*static*/ void binary_op<LHS, OP, RHS>::to_string(std::ostream &src, unsigned &pos) { 11 src << "( "; 12 LHS::to_string(src, pos); 13 OP::to_string(src); 14 RHS::to_string(src, pos); 15 src << " )"; 16 }
  80. Generating kernel source 1 template <class LHS, class RHS> 2

    std::string kernel_source() { 3 std::ostringstream src; 4 5 src << "kernel void vexcl_vector_kernel(\n ulong n"; 6 unsigned pos = 0; 7 LHS::declare_params(src, pos); 8 RHS::declare_params(src, pos); 9 src << ")\n{\n" 10 " for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0)) {\n" 11 " "; 12 pos = 0; 13 LHS::to_string(src, pos); src << " = "; 14 RHS::to_string(src, pos); src << ";\n"; 15 src << " }\n}\n"; 16 17 return src.str(); 18 }
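The code generator can be exercised on the host without any OpenCL at all. Below is a compilable reduction of the sketch, wrapped in an illustrative namespace; it is a simplification of what VexCL actually emits:

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

namespace gen {

// Terminal: a device vector contributes one "global double * prmN" parameter.
struct vector {
    static void declare_params(std::ostream &src, unsigned &pos) {
        src << ",\n  global double * prm" << ++pos;
    }
    static void to_string(std::ostream &src, unsigned &pos) {
        src << "prm" << ++pos << "[idx]";
    }
};

struct plus { static void to_string(std::ostream &src) { src << " + "; } };

template <class LHS, class OP, class RHS>
struct binary_op {
    static void declare_params(std::ostream &src, unsigned &pos) {
        LHS::declare_params(src, pos);
        RHS::declare_params(src, pos);
    }
    static void to_string(std::ostream &src, unsigned &pos) {
        src << "( ";
        LHS::to_string(src, pos);
        OP::to_string(src);
        RHS::to_string(src, pos);
        src << " )";
    }
};

// Assemble the full kernel source for "LHS = RHS;".
template <class LHS, class RHS>
std::string kernel_source() {
    std::ostringstream src;
    src << "kernel void vexcl_vector_kernel(\n  ulong n";
    unsigned pos = 0;
    LHS::declare_params(src, pos);
    RHS::declare_params(src, pos);
    src << ")\n{\n"
           "  for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0)) {\n"
           "    ";
    pos = 0;
    LHS::to_string(src, pos);
    src << " = ";
    RHS::to_string(src, pos);
    src << ";\n  }\n}\n";
    return src.str();
}

} // namespace gen
```

Instantiating it for the expression type of `a = x + y` produces the parameter list prm1..prm3 and the body `prm1[idx] = ( prm2[idx] + prm3[idx] );`.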
  83. Setting kernel arguments 1 void vector::set_args(cl::Kernel &krn, unsigned &pos) {

    2 krn.setArg(pos++, buffer); 3 } 4 5 template <class LHS, class OP, class RHS> 6 void binary_op<LHS, OP, RHS>::set_args(cl::Kernel &krn, unsigned &pos) { 7 lhs.set_args(krn, pos); 8 rhs.set_args(krn, pos); 9 } The methods are not static anymore!
  84. Gluing it all together 1 template <class Expr> 2 const

    vector& vector::operator=(const Expr &expr) { 3 static cl::Kernel krn = build_kernel(device, kernel_source<This, Expr>()); 4 5 unsigned pos = 0; 6 7 krn.setArg(pos++, size); // n 8 krn.setArg(pos++, buffer); // result 9 expr.set_args(krn, pos); // other parameters 10 11 queue.enqueueNDRangeKernel(krn, cl::NullRange, buffer.size(), cl::NullRange); 12 13 return *this; 14 } Each kernel is uniquely identified by its expression type. Hence, we may use a local static variable to cache the kernel. The kernel is generated and compiled once, and applied many times.
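The caching trick rests on a C++ rule: a function template gets one static local variable per instantiation, initialized exactly once on first call. A host-side illustration, with a counter standing in for the expensive compile step (all names below are illustrative, not VexCL's):

```cpp
#include <cassert>
#include <string>

static int build_count = 0; // stands in for expensive JIT kernel compilation

struct fake_kernel { std::string src; };

fake_kernel build_kernel(const std::string &src) {
    ++build_count;             // in VexCL this would invoke the OpenCL compiler
    return fake_kernel{src};
}

// One static kernel per distinct expression type Expr:
// built on the first call, reused on every later call.
template <class Expr>
fake_kernel& cached_kernel() {
    static fake_kernel krn = build_kernel(Expr::source());
    return krn;
}

struct expr_a { static std::string source() { return "a = x + y;"; } };
struct expr_b { static std::string source() { return "a = x - y;"; } };
```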
  86. What you saw is not what you get The actual

    implementation is a bit more complicated: There are unary, binary, n-ary expressions. There are special terminals requiring either global or local preambles. There are builtin and user-defined functions. . . . Boost.Proto is used to keep the code mess contained.
  87. Projects already using VexCL AMGCL — implementation of several variants

    of algebraic multigrid: https://github.com/ddemidov/amgcl Antioch — A New Templated Implementation Of Chemistry for Hydrodynamics: https://github.com/libantioch/antioch Boost.odeint — numerical solution of ordinary differential equations: https://github.com/boostorg/odeint All of these libraries use the same generic code for CPU and GPU paths.
  88. AMGCL: an Algebraic MultiGrid implementation Solves large sparse linear systems

    Constructs an AMG hierarchy from Lego-like blocks Supports several backends (OpenMP, VexCL, ViennaCL, CUDA, HPX, . . . ) 1 // Create solver for the given matrix (once) 2 amgcl::make_solver< 3 amgcl::backend::builtin<double>, // OpenMP-based builtin backend 4 amgcl::coarsening::smoothed_aggregation, 5 amgcl::relaxation::spai0, 6 amgcl::solver::bicgstab 7 > solve( boost::tie(n, ptr, col, val) ); 8 9 // Solve the system for the given RHS (multiple times) 10 solve(rhs, x); // std::vector<double> rhs, x;
  89. AMGCL: an Algebraic MultiGrid implementation Solves large sparse linear systems

    Constructs an AMG hierarchy from Lego-like blocks Supports several backends (OpenMP, VexCL, ViennaCL, CUDA, HPX, . . . ) 1 // Create solver for the given matrix (once) 2 amgcl::make_solver< 3 amgcl::backend::vexcl<double>, // VexCL backend 4 amgcl::coarsening::smoothed_aggregation, 5 amgcl::relaxation::spai0, 6 amgcl::solver::bicgstab 7 > solve( boost::tie(n, ptr, col, val) ); 8 9 // Solve the system for the given RHS (multiple times) 10 solve(rhs, x); // vex::vector<double> rhs, x;