Slide 1

Slide 1 text

VexCL: GPGPU without the agonizing pain*
Denis Demidov
Institute of System Research, Russian Academy of Sciences
16.09.2015, CIMNE, Barcelona

Slide 2

Slide 2 text

Modern GPGPU frameworks

CUDA:
- Proprietary architecture by NVIDIA
- Requires NVIDIA hardware
- More mature, many libraries
- Kernels are written in C++

OpenCL:
- Open standard
- Supports a wide range of hardware
- Code is much more verbose
- Kernels are written in C99

Slide 3

Slide 3 text

Modern GPGPU frameworks

CUDA:
- Proprietary architecture by NVIDIA
- Requires NVIDIA hardware
- More mature, many libraries
- Kernels are written in C++
- Kernels are compiled to PTX together with the host program

OpenCL:
- Open standard
- Supports a wide range of hardware
- Code is much more verbose
- Kernels are written in C99
- Kernels are compiled at runtime, adding an initialization overhead

The latter distinction is usually considered an OpenCL drawback. But it also allows us to generate more efficient kernels at runtime! VexCL takes care of this part.

Slide 4

Slide 4 text

VexCL — a vector expression template library for OpenCL/CUDA

Created for ease of C++ based GPGPU development:
- Convenient notation for vector expressions
- OpenCL/CUDA JIT code generation
- Easily combined with existing libraries/code
- Header-only

Supported backends:
- OpenCL (Khronos C++ bindings, Boost.Compute)
- NVIDIA CUDA

The source code is available under the MIT license:
https://github.com/ddemidov/vexcl

Slide 5

Slide 5 text

Motivating example

Slide 6

Slide 6 text

Hello OpenCL: vector sum

Compute the sum of two vectors in parallel: A, B, and C are large vectors. Compute C = A + B.

Overview of (any) OpenCL solution:
1. Initialize OpenCL context
2. Allocate memory
3. Transfer input data
4. Run computations
5. Get the results

Slide 7

Slide 7 text

Hello OpenCL: vector sum

1. Query platforms

std::vector<cl::Platform> platform;
cl::Platform::get(&platform);

if (platform.empty())
    throw std::runtime_error("No OpenCL platforms");

Slide 8

Slide 8 text

Hello OpenCL: vector sum

2. Get the first available GPU

cl::Context context;
std::vector<cl::Device> device;
for(auto p = platform.begin(); device.empty() && p != platform.end(); p++) {
    std::vector<cl::Device> pldev;
    try {
        p->getDevices(CL_DEVICE_TYPE_GPU, &pldev);
        for(auto d = pldev.begin(); device.empty() && d != pldev.end(); d++) {
            if (!d->getInfo<CL_DEVICE_AVAILABLE>()) continue;
            device.push_back(*d);
            context = cl::Context(device);
        }
    } catch(...) {
        device.clear();
    }
}
if (device.empty()) throw std::runtime_error("No GPUs");

Slide 9

Slide 9 text

Hello OpenCL: vector sum

3. Create command queue

cl::CommandQueue queue(context, device[0]);

4. Prepare and copy input data

const size_t N = 1 << 20;
std::vector<float> a(N, 1), b(N, 2), c(N);

cl::Buffer A(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
        a.size() * sizeof(float), a.data());

cl::Buffer B(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
        b.size() * sizeof(float), b.data());

cl::Buffer C(context, CL_MEM_READ_WRITE,
        c.size() * sizeof(float));

Slide 10

Slide 10 text

Hello OpenCL: vector sum

5. Create kernel source

const char source[] = R"(
    kernel void add(
        ulong n,
        global const float *a,
        global const float *b,
        global float *c
        )
    {
        ulong i = get_global_id(0);
        if (i < n) c[i] = a[i] + b[i];
    }
    )";

Slide 11

Slide 11 text

Hello OpenCL: vector sum

6. Compile the kernel

cl::Program program(context, cl::Program::Sources(
        1, std::make_pair(source, strlen(source))
        ));
try {
    program.build(device);
} catch (const cl::Error&) {
    std::cerr
        << "OpenCL compilation error" << std::endl
        << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device[0])
        << std::endl;
    throw std::runtime_error("OpenCL build error");
}
cl::Kernel add(program, "add");

Slide 12

Slide 12 text

Hello OpenCL: vector sum

7. Set kernel arguments

add.setArg(0, static_cast<cl_ulong>(N));
add.setArg(1, A);
add.setArg(2, B);
add.setArg(3, C);

8. Launch the kernel

queue.enqueueNDRangeKernel(add, cl::NullRange, N, cl::NullRange);

9. Get result back to host

queue.enqueueReadBuffer(C, CL_TRUE, 0, c.size() * sizeof(float), c.data());
std::cout << c[42] << std::endl; // Should get '3' here.


Slide 14

Slide 14 text

Hello VexCL: vector sum

OpenCL:

#include <iostream>
#include <vector>
#include <stdexcept>

#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>

int main() {
    // Get list of OpenCL platforms.
    std::vector<cl::Platform> platform;
    cl::Platform::get(&platform);

    if (platform.empty())
        throw std::runtime_error("No OpenCL platforms");

    // Get first available GPU.
    cl::Context context;
    std::vector<cl::Device> device;
    for(auto p = platform.begin(); device.empty() && p != platform.end(); p++) {
        std::vector<cl::Device> pldev;
        try {
            p->getDevices(CL_DEVICE_TYPE_GPU, &pldev);
            for(auto d = pldev.begin(); device.empty() && d != pldev.end(); d++) {
                if (!d->getInfo<CL_DEVICE_AVAILABLE>()) continue;
                device.push_back(*d);
                context = cl::Context(device);
            }
        } catch(...) {
            device.clear();
        }
    }
    if (device.empty()) throw std::runtime_error("No GPUs");

    // Create command queue.
    cl::CommandQueue queue(context, device[0]);

    // Prepare input data, transfer it to the device.
    const size_t N = 1 << 20;
    std::vector<float> a(N, 1), b(N, 2), c(N);

    cl::Buffer A(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
            a.size() * sizeof(float), a.data());

    cl::Buffer B(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
            b.size() * sizeof(float), b.data());

    cl::Buffer C(context, CL_MEM_READ_WRITE,
            c.size() * sizeof(float));

    // Create kernel source.
    const char source[] = R"(
        kernel void add(
            ulong n,
            global const float *a,
            global const float *b,
            global float *c
            )
        {
            ulong i = get_global_id(0);
            if (i < n) c[i] = a[i] + b[i];
        }
        )";

    // Compile OpenCL program for the device.
    cl::Program program(context, cl::Program::Sources(
            1, std::make_pair(source, strlen(source))
            ));
    try {
        program.build(device);
    } catch (const cl::Error&) {
        std::cerr
            << "OpenCL compilation error" << std::endl
            << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device[0])
            << std::endl;
        throw std::runtime_error("OpenCL build error");
    }
    cl::Kernel add(program, "add");

    // Set kernel parameters.
    add.setArg(0, static_cast<cl_ulong>(N));
    add.setArg(1, A);
    add.setArg(2, B);
    add.setArg(3, C);

    // Launch kernel on the compute device.
    queue.enqueueNDRangeKernel(add, cl::NullRange, N, cl::NullRange);

    // Get result back to host.
    queue.enqueueReadBuffer(C, CL_TRUE, 0, c.size() * sizeof(float), c.data());
    std::cout << c[42] << std::endl; // Should get '3' here.
}

VexCL:

#include <iostream>
#include <vector>
#include <vexcl/vexcl.hpp>

int main() {
    // Get first available GPU.
    vex::Context ctx(vex::Filter::GPU && vex::Filter::Count(1));
    if (!ctx) throw std::runtime_error("No GPUs");

    // Prepare and copy input data.
    const size_t n = 1024 * 1024;
    std::vector<float> a(n, 1), b(n, 2), c(n);
    vex::vector<float> A(ctx, a), B(ctx, b), C(ctx, n);

    // Launch compute kernel.
    C = A + B;

    // Get results back to host.
    vex::copy(C, c);
    std::cout << c[42] << std::endl;
}


Slide 19

Slide 19 text

What can it do?

Slide 20

Slide 20 text

VexCL interface

Slide 21

Slide 21 text

Initialization

Multi-device and multi-platform computations are supported. A VexCL context is initialized from a combination of device filters. A device filter is a boolean functor acting on const vex::backend::device&.

Initialize VexCL context on selected devices:

vex::Context ctx( vex::Filter::Any );

Slide 22

Slide 22 text

Initialization

Initialize VexCL context on selected devices:

vex::Context ctx( vex::Filter::GPU );

Slide 23

Slide 23 text

Initialization

Initialize VexCL context on selected devices:

vex::Context ctx(vex::Filter::Accelerator && vex::Filter::Platform("Intel"));

Slide 24

Slide 24 text

Initialization

Initialize VexCL context on selected devices:

vex::Context ctx(
    vex::Filter::DoublePrecision &&
    [](const vex::backend::device &d) {
        return d.getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>() >= 16_GB;
    });

Slide 25

Slide 25 text

Memory and work splitting

vex::Context ctx( vex::Filter::Name("Tesla") );

vex::vector<double> x(ctx, N);
vex::vector<double> y(ctx, N);

x = vex::element_index() * (1.0 / N);
y = sin(2 * x) + sqrt(1 - x * x);

Slide 26

Slide 26 text

Memory and work splitting

vex::Context ctx( vex::Filter::Type(CL_DEVICE_TYPE_GPU) );

vex::vector<double> x(ctx, N);
vex::vector<double> y(ctx, N);

x = vex::element_index() * (1.0 / N);
y = sin(2 * x) + sqrt(1 - x * x);

Slide 27

Slide 27 text

Memory and work splitting

vex::Context ctx( vex::Filter::DoublePrecision );

vex::vector<double> x(ctx, N);
vex::vector<double> y(ctx, N);

x = vex::element_index() * (1.0 / N);
y = sin(2 * x) + sqrt(1 - x * x);

Slide 28

Slide 28 text

Copies between host and device memory

vex::vector<double> d(ctx, n);
std::vector<double> h(n);
double a[100];

STL-like range copies:

vex::copy(d.begin(), d.end(), h.begin());
vex::copy(d.begin(), d.begin() + 100, a);

Simple copies:

vex::copy(d, h);
vex::copy(h, d);

Map OpenCL buffer to host pointer:

auto p = d.map(devnum);
std::sort(&p[0], &p[d.part_size(devnum)]);

Access single element (slow):

double v = d[42];
d[0] = 0;

Slide 29

Slide 29 text

What vector expressions are supported?

All vectors in an expression have to be compatible:
- Have the same size
- Be located on the same devices

What may be used:
- Vectors, scalars, constants
- Arithmetic, logical operators
- Built-in functions
- User-defined functions
- Random number generators
- Slicing and permutations
- Reduce to a scalar (sum, min, max)
- Reduce across chosen dimensions
- Stencil operations
- Sparse matrix-vector products
- Fast Fourier Transform
- Sort, scan, reduce by key

Slide 30

Slide 30 text

Builtin operations and functions

This expression:

x = 2 * y - sin(z);

. . . results in this kernel (export VEXCL_SHOW_KERNELS=1 to see the generated code):

kernel void vexcl_vector_kernel(
    ulong n,
    global double * prm_1,
    int prm_2,
    global double * prm_3,
    global double * prm_4
    )
{
    for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0)) {
        prm_1[idx] = ( ( prm_2 * prm_3[idx] ) - sin( prm_4[idx] ) );
    }
}

Slide 31

Slide 31 text

Element indices

vex::element_index(size_t offset = 0, size_t size = 0) returns the index of a vector element. The numbering starts with offset and is continuous across devices. The optional size may be used to explicitly set the expression size.

Linear function:

vex::vector<double> X(ctx, N);
double x0 = 0, dx = 1e-3;
X = x0 + dx * vex::element_index();

Single period of the sine function:

X = sin(2 * M_PI / N * vex::element_index());

Slide 32

Slide 32 text

User-defined functions

Defining a function:

VEX_FUNCTION( double, sqr, (double, x)(double, y),
    return x * x + y * y;
    );

Using the function:

Z = sqrt( sqr(X, Y) );

Slide 33

Slide 33 text

User functions are translated to OpenCL functions

Z = sqrt( sqr(X, Y) );

. . . gets translated to:

double sqr(double x, double y) {
    return x * x + y * y;
}

kernel void vexcl_vector_kernel(
    ulong n,
    global double * prm_1,
    global double * prm_2,
    global double * prm_3
    )
{
    for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0)) {
        prm_1[idx] = sqrt( sqr( prm_2[idx], prm_3[idx] ) );
    }
}

Slide 34

Slide 34 text

Functions may be not only convenient, but also efficient

Same example without using a function:

Z = sqrt( X * X + Y * Y );

. . . gets translated to:

kernel void vexcl_vector_kernel(
    ulong n,
    global double * prm_1,
    global double * prm_2,
    global double * prm_3,
    global double * prm_4,
    global double * prm_5
    )
{
    for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0)) {
        prm_1[idx] = sqrt( ( ( prm_2[idx] * prm_3[idx] ) + ( prm_4[idx] * prm_5[idx] ) ) );
    }
}

Each occurrence of a terminal becomes a separate kernel parameter: X and Y are passed, and read, twice.

Slide 35

Slide 35 text

Tagged terminals

The programmer may help VexCL recognize identical terminals by tagging them:

using vex::tag;
Z = sqrt(tag<1>(X) * tag<1>(X) +
         tag<2>(Y) * tag<2>(Y));

or, equivalently:

auto x = tag<1>(X);
auto y = tag<2>(Y);
Z = sqrt(x * x + y * y);

kernel void vexcl_vector_kernel(
    ulong n,
    global double * prm_1,
    global double * prm_2,
    global double * prm_3
    )
{
    for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0)) {
        prm_1[idx] = sqrt( ( ( prm_2[idx] * prm_2[idx] ) + ( prm_3[idx] * prm_3[idx] ) ) );
    }
}

Slide 36

Slide 36 text

Reusing intermediate results

Some expressions include the same subexpression several times:

Z = log(X) * (log(X) + Y);

log(X) will be computed twice here. One could tag X and hope that the compiler is smart enough. . .

Slide 37

Slide 37 text

Temporaries

But it is also possible to introduce a temporary variable explicitly:

auto tmp = vex::make_temp<1>( log(X) );
Z = tmp * (tmp + Y);

kernel void vexcl_vector_kernel(
    ulong n,
    global double * prm_1,
    global double * prm_2,
    global double * prm_3
    )
{
    for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0)) {
        double temp_1 = log( prm_2[idx] );
        prm_1[idx] = ( temp_1 * ( temp_1 + prm_3[idx] ) );
    }
}

Slide 40

Slide 40 text

Permutations (single device only)

vex::permutation(expr) takes an arbitrary integer-valued expression and returns a functor:

auto reverse = vex::permutation(N - 1 - vex::element_index());
y = reverse(x);

Permutations are writable:

reverse(y) = x;

Slide 43

Slide 43 text

Slicing (single device only)

When working with dense multidimensional matrices, it is common practice to store them in contiguous 1D arrays. An instance of the vex::slicer class provides access to sub-blocks of such a matrix.

n-by-n matrix and a slicer:

vex::vector<double> x(ctx, n * n);
vex::slicer<2> slice(vex::extents[n][n]); // Can be used with any vector of appropriate size

Access a row or column of the matrix:

using vex::_;
y = slice[42](x);     // 42nd row
y = slice[_][42](x);  // 42nd column
slice[_][10](x) = y;  // Slices are writable

Use ranges to select sub-blocks:

z = slice[vex::range(0, 2, n)][vex::range(10, 20)](x);

Slide 44

Slide 44 text

Random number generation

VexCL provides counter-based random number generators from the Random123 suite [1,2]. The generators are stateless; mixing functions are applied to element indices. Implemented families: threefry and philox. Both pass TestU01/BigCrush; up to 2^64 independent streams with a period of 2^128. Performance: about 10^10 samples/sec (Tesla K20c).

vex::Random — uniform distribution.
vex::RandomNormal — normal distribution.

vex::Random<double, vex::random::philox> rnd;
vex::vector<double> x(ctx, n);

x = rnd(vex::element_index(), std::rand());

[1] Contributed by Pascal Germroth [email protected]
[2] D. E. Shaw Research, http://www.deshawresearch.com/resources_random123.html

Slide 45

Slide 45 text

Reductions

The vex::Reductor<T, kind> class reduces an arbitrary vector expression to a single value of type T. Supported reduction kinds: SUM, SUM_Kahan, MIN, MAX.

Inner product:

vex::Reductor<double, vex::SUM> sum(ctx);
double s = sum(x * y);

Number of elements in x between 0 and 1:

vex::Reductor<size_t, vex::SUM> sum(ctx);
size_t n = sum( (x > 0) && (x < 1) );

Maximum distance from origin:

vex::Reductor<double, vex::MAX> max(ctx);
double d = max( sqrt(x * x + y * y) );

Slide 46

Slide 46 text

Monte Carlo π

Compute an approximate value of π. For a circle of radius r inscribed in a square:

    (area of circle) / (area of square) = πr² / (2r)² = π/4,

so

    π = 4 · (area of circle) / (area of square) ≈ 4 · (points in circle) / (all points).

vex::Random<cl_double2, vex::random::philox> rnd; // Generates 2D points in [0, 1] × [0, 1]
vex::Reductor<size_t, vex::SUM> sum(ctx);

double pi = 4.0 * sum( length( rnd(vex::element_index(0, n), seed) ) < 1 ) / n;

Slide 47

Slide 47 text

Monte Carlo π: the generated kernel 1 #if defined(cl_khr_fp64) 2 #pragma OPENCL EXTENSION cl_khr_fp64: enable 3 #elif defined(cl_amd_fp64) 4 #pragma OPENCL EXTENSION cl_amd_fp64: enable 5 #endif 6 7 void philox_uint_4_10 8 ( 9 uint * ctr, 10 uint * key 11 ) 12 { 13 uint m[4]; 14 m[0] = mul_hi(0xD2511F53, ctr[0]); 15 m[1] = 0xD2511F53 * ctr[0]; 16 m[2] = mul_hi(0xCD9E8D57, ctr[2]); 17 m[3] = 0xCD9E8D57 * ctr[2]; 18 ctr[0] = m[2] ^ ctr[1] ^ key[0]; 19 ctr[1] = m[3]; 20 ctr[2] = m[0] ^ ctr[3] ^ key[1]; 21 ctr[3] = m[1]; 22 key[0] += 0x9E3779B9; 23 key[1] += 0xBB67AE85; 24 m[0] = mul_hi(0xD2511F53, ctr[0]); 25 m[1] = 0xD2511F53 * ctr[0]; 26 m[2] = mul_hi(0xCD9E8D57, ctr[2]); 27 m[3] = 0xCD9E8D57 * ctr[2]; 28 ctr[0] = m[2] ^ ctr[1] ^ key[0]; 29 ctr[1] = m[3]; 30 ctr[2] = m[0] ^ ctr[3] ^ key[1]; 31 ctr[3] = m[1]; 32 key[0] += 0x9E3779B9; 33 key[1] += 0xBB67AE85; 34 m[0] = mul_hi(0xD2511F53, ctr[0]); 35 m[1] = 0xD2511F53 * ctr[0]; 36 m[2] = mul_hi(0xCD9E8D57, ctr[2]); 37 m[3] = 0xCD9E8D57 * ctr[2]; 38 ctr[0] = m[2] ^ ctr[1] ^ key[0]; 39 ctr[1] = m[3]; 40 ctr[2] = m[0] ^ ctr[3] ^ key[1]; 41 ctr[3] = m[1]; 42 key[0] += 0x9E3779B9; 43 key[1] += 0xBB67AE85; 44 m[0] = mul_hi(0xD2511F53, ctr[0]); 45 m[1] = 0xD2511F53 * ctr[0]; 46 m[2] = mul_hi(0xCD9E8D57, ctr[2]); 47 m[3] = 0xCD9E8D57 * ctr[2]; 48 ctr[0] = m[2] ^ ctr[1] ^ key[0]; 49 ctr[1] = m[3]; 50 ctr[2] = m[0] ^ ctr[3] ^ key[1]; 51 ctr[3] = m[1]; 52 key[0] += 0x9E3779B9; 53 key[1] += 0xBB67AE85; 54 m[0] = mul_hi(0xD2511F53, ctr[0]); 55 m[1] = 0xD2511F53 * ctr[0]; 56 m[2] = mul_hi(0xCD9E8D57, ctr[2]); 57 m[3] = 0xCD9E8D57 * ctr[2]; 58 ctr[0] = m[2] ^ ctr[1] ^ key[0]; 59 ctr[1] = m[3]; 60 ctr[2] = m[0] ^ ctr[3] ^ key[1]; 61 ctr[3] = m[1]; 62 key[0] += 0x9E3779B9; 63 key[1] += 0xBB67AE85; 64 m[0] = mul_hi(0xD2511F53, ctr[0]); 65 m[1] = 0xD2511F53 * ctr[0]; 66 m[2] = mul_hi(0xCD9E8D57, ctr[2]); 67 m[3] = 0xCD9E8D57 * ctr[2]; 68 ctr[0] = m[2] ^ ctr[1] ^ key[0]; 69 ctr[1] = m[3]; 70 ctr[2] = m[0] ^ ctr[3] ^ key[1]; 71 ctr[3] =
m[1]; 72 key[0] += 0x9E3779B9; 73 key[1] += 0xBB67AE85; 74 m[0] = mul_hi(0xD2511F53, ctr[0]); 75 m[1] = 0xD2511F53 * ctr[0]; 76 m[2] = mul_hi(0xCD9E8D57, ctr[2]); 77 m[3] = 0xCD9E8D57 * ctr[2]; 78 ctr[0] = m[2] ^ ctr[1] ^ key[0]; 79 ctr[1] = m[3]; 80 ctr[2] = m[0] ^ ctr[3] ^ key[1]; 81 ctr[3] = m[1]; 82 key[0] += 0x9E3779B9; 83 key[1] += 0xBB67AE85; 84 m[0] = mul_hi(0xD2511F53, ctr[0]); 85 m[1] = 0xD2511F53 * ctr[0]; 86 m[2] = mul_hi(0xCD9E8D57, ctr[2]); 87 m[3] = 0xCD9E8D57 * ctr[2]; 88 ctr[0] = m[2] ^ ctr[1] ^ key[0]; 89 ctr[1] = m[3]; 90 ctr[2] = m[0] ^ ctr[3] ^ key[1]; 91 ctr[3] = m[1]; 92 key[0] += 0x9E3779B9; 93 key[1] += 0xBB67AE85; 94 m[0] = mul_hi(0xD2511F53, ctr[0]); 95 m[1] = 0xD2511F53 * ctr[0]; 96 m[2] = mul_hi(0xCD9E8D57, ctr[2]); 97 m[3] = 0xCD9E8D57 * ctr[2]; 98 ctr[0] = m[2] ^ ctr[1] ^ key[0]; 99 ctr[1] = m[3]; 100 ctr[2] = m[0] ^ ctr[3] ^ key[1]; 101 ctr[3] = m[1]; 102 key[0] += 0x9E3779B9; 103 key[1] += 0xBB67AE85; 104 m[0] = mul_hi(0xD2511F53, ctr[0]); 105 m[1] = 0xD2511F53 * ctr[0]; 106 m[2] = mul_hi(0xCD9E8D57, ctr[2]); 107 m[3] = 0xCD9E8D57 * ctr[2]; 108 ctr[0] = m[2] ^ ctr[1] ^ key[0]; 109 ctr[1] = m[3]; 110 ctr[2] = m[0] ^ ctr[3] ^ key[1]; 111 ctr[3] = m[1]; 112 } 113 double2 random_double2_philox 114 ( 115 ulong prm1, 116 ulong prm2 117 ) 118 { 119 union 120 { 121 uint ctr[4]; 122 ulong res_i[2]; 123 double res_f[2]; 124 double2 res; 125 } ctr; 126 uint key[2]; 127 ctr.ctr[0] = prm1; ctr.ctr[1] = prm2; 128 ctr.ctr[2] = prm1; ctr.ctr[3] = prm2; 129 key[0] = 0x12345678; 130 key[1] = 0x12345678; 131 philox_uint_4_10(ctr.ctr, key); 132 ctr.res_f[0] = ctr.res_i[0] / 18446744073709551615.0; 133 ctr.res_f[1] = ctr.res_i[1] / 18446744073709551615.0; 134 return ctr.res; 135 } 136 ulong SUM_ulong 137 ( 138 ulong prm1, 139 ulong prm2 140 ) 141 { 142 return prm1 + prm2; 143 } 144 kernel void vexcl_reductor_kernel 145 ( 146 ulong n, 147 ulong prm_1, 148 ulong prm_2, 149 double prm_3, 150 global ulong * g_odata, 151 local ulong * smem 152 ) 153 { 154
ulong mySum = (ulong)0; 155 for(ulong idx = get_global_id(0); idx < n; idx += get_global_size(0)) 156 { 157 mySum = SUM_ulong(mySum, ( length( random_double2_philox( (prm_1 + idx), prm_2 ) ) < prm_3 )); 158 } 159 local ulong * sdata = smem; 160 size_t tid = get_local_id(0); 161 size_t block_size = get_local_size(0); 162 sdata[tid] = mySum; 163 barrier(CLK_LOCAL_MEM_FENCE); 164 if (block_size >= 1024) 165 { 166 if (tid < 512) { sdata[tid] = mySum = SUM_ulong(mySum, sdata[tid + 512]); } 167 barrier(CLK_LOCAL_MEM_FENCE); 168 } 169 if (block_size >= 512) 170 { 171 if (tid < 256) { sdata[tid] = mySum = SUM_ulong(mySum, sdata[tid + 256]); } 172 barrier(CLK_LOCAL_MEM_FENCE); 173 } 174 if (block_size >= 256) 175 { 176 if (tid < 128) { sdata[tid] = mySum = SUM_ulong(mySum, sdata[tid + 128]); } 177 barrier(CLK_LOCAL_MEM_FENCE); 178 } 179 if (block_size >= 128) 180 { 181 if (tid < 64) { sdata[tid] = mySum = SUM_ulong(mySum, sdata[tid + 64]); } 182 barrier(CLK_LOCAL_MEM_FENCE); 183 } 184 if (tid < 32) 185 { 186 volatile local ulong * smem = sdata; 187 if (block_size >= 64) { smem[tid] = mySum = SUM_ulong(mySum, smem[tid + 32]); } 188 if (block_size >= 32) { smem[tid] = mySum = SUM_ulong(mySum, smem[tid + 16]); } 189 if (block_size >= 16) { smem[tid] = mySum = SUM_ulong(mySum, smem[tid + 8]); } 190 if (block_size >= 8) { smem[tid] = mySum = SUM_ulong(mySum, smem[tid + 4]); } 191 if (block_size >= 4) { smem[tid] = mySum = SUM_ulong(mySum, smem[tid + 2]); } 192 if (block_size >= 2) { smem[tid] = mySum = SUM_ulong(mySum, smem[tid + 1]); } 193 } 194 if (tid == 0) g_odata[get_group_id(0)] = sdata[0]; 195 }

Slide 48

Slide 48 text

Sparse matrix – vector products Additive expressions Class vex::SpMat holds the representation of a sparse matrix on the compute devices. The constructor accepts a matrix in the common CRS format: row indices, columns, and values of the nonzero entries. Construct matrix 1 vex::SpMat<double> A(ctx, n, n, row.data(), col.data(), val.data()); Compute residual value 2 // vex::vector<double> u, f, r; 3 r = f - A * u; 4 double res = max( fabs(r) );

Slide 49

Slide 49 text

Inlining sparse matrix – vector products Single device only SpMV may only be used in additive expressions: it needs data exchange between compute devices and is impossible to implement with a single kernel. This restriction may be lifted for single-device contexts: r = f - vex::make_inline(A * u); double res = max( fabs(r) ); // Do not store intermediate results: res = max( fabs( f - vex::make_inline(A * u) ) );

Slide 50

Slide 50 text

Raw pointer arithmetic Single device only The raw_pointer(const vector&) function returns a pointer to the vector's data inside a vector expression. It may be used to implement complex access patterns. 1D Laplace operator: 1 auto ptr = vex::raw_pointer(x); 2 auto idx = vex::element_index(); 3 4 auto left = if_else(idx > 0, idx - 1, idx); 5 auto right = if_else(idx + 1 < n, idx + 1, idx); 6 7 y = 2 * ptr[idx] - ptr[left] - ptr[right];

Slide 51

Slide 51 text

N-body problem with raw pointers y_i = Σ_{j≠i} e^(−|x_i − x_j|) 1 VEX_FUNCTION(double, nbody, (size_t, n)(size_t, i)(double*, x), 2 double sum = 0, myval = x[i]; 3 for(size_t j = 0; j < n; ++j) 4 if (j != i) sum += exp(-fabs(x[j] - myval)); 5 return sum; 6 ); 7 8 y = nbody(x.size(), vex::element_index(), raw_pointer(x));

Slide 52

Slide 52 text

Fast Fourier Transform Single device only VexCL provides an FFT implementation³: Arbitrary vector expressions as input Multidimensional transforms Arbitrary sizes 2-3 times slower than NVIDIA's cuFFT, about the same speed as AMD's clFFT Solve Poisson equation with FFT: 1 vex::FFT fft(ctx, n); 2 vex::FFT ifft(ctx, n, vex::inverse); 3 4 vex::vector rhs(ctx, n), u(ctx, n), K(ctx, n); 5 // ... initialize vectors ... 6 7 u = ifft( K * fft(rhs) ); ³Contributed by Pascal Germroth [email protected]

Slide 53

Slide 53 text

Summary VexCL allows one to write compact and readable code. It is great for fast prototyping of scientific GPGPU applications. The performance is often on par with manually written kernels. [1] D. Demidov, K. Ahnert, K. Rupp, and P. Gottschling. Programming CUDA and OpenCL: A Case Study Using Modern C++ Libraries. SIAM J. Sci. Comput., 35(5):C453 – C472, 2013. doi:10.1137/120903683 [2] K. Ahnert, D. Demidov, and M. Mulansky. Solving ordinary differential equations on GPUs. In Numerical Computations with GPUs (pp. 125-157). Springer, 2014. doi:10.1007/978-3-319-06548-9_7

Slide 54

Slide 54 text

How does it work?

Slide 55

Slide 55 text

Expression templates

Slide 56

Slide 56 text

Expression templates How to implement a DSL in C++ effectively? The idea is quite old: 1995: Todd Veldhuizen, Expression templates, C++ Report. 1996: Blitz++ is a C++ class library for scientific computing which provides performance on par with Fortran 77/90. Today: Armadillo, Blaze, Boost.uBLAS, Eigen, MTL, etc. But how does it work?

Slide 57

Slide 57 text

Simple example: Vector arithmetic We want to be able to write: 1 x = y + z; and it has to be as effective as: 1 for(size_t i = 0; i < n; ++i) 2 x[i] = y[i] + z[i];

Slide 58

Slide 58 text

C++ allows us to overload operators! 1 vector operator+(const vector &a, const vector &b) { 2 vector tmp( a.size() ); 3 for(size_t i = 0; i < a.size(); ++i) 4 tmp[i] = a[i] + b[i]; 5 return tmp; 6 } Any problems?

Slide 59

Slide 59 text

C++ allows us to overload operators! 1 vector operator+(const vector &a, const vector &b) { 2 vector tmp( a.size() ); 3 for(size_t i = 0; i < a.size(); ++i) 4 tmp[i] = a[i] + b[i]; 5 return tmp; 6 } Any problems? Extra memory allocation Extra memory I/O 1 a = x + y + z; 2 temporary vectors 8 × n memory reads/writes 1 for(size_t i = 0; i < n; ++i) 2 a[i] = x[i] + y[i] + z[i]; no temporaries 4 × n memory reads/writes

Slide 60

Slide 60 text

Lazy evaluation v0.1 The idea: postpone the actual evaluation until assignment.

Slide 61

Slide 61 text

Lazy evaluation v0.1 The idea: postpone the actual evaluation until assignment. 1 struct vsum { 2 const vector &lhs; 3 const vector &rhs; 4 };

Slide 62

Slide 62 text

Lazy evaluation v0.1 The idea: postpone the actual evaluation until assignment. 1 struct vsum { 2 const vector &lhs; 3 const vector &rhs; 4 }; 5 6 vsum operator+(const vector &a, const vector &b) { 7 return vsum{a, b}; 8 }

Slide 63

Slide 63 text

Lazy evaluation v0.1 The idea: postpone the actual evaluation until assignment. 1 struct vsum { 2 const vector &lhs; 3 const vector &rhs; 4 }; 5 6 vsum operator+(const vector &a, const vector &b) { 7 return vsum{a, b}; 8 } 9 10 const vector& vector::operator=(const vsum &s) { 11 for(size_t i = 0; i < n; ++i) 12 data[i] = s.lhs[i] + s.rhs[i]; 13 return *this; 14 }

Slide 64

Slide 64 text

Problem: it's not composable What will happen here? 1 a = x + y + z;

Slide 65

Slide 65 text

Problem: it's not composable What will happen here? 1 a = x + y + z; lazy_v1.cpp:38:15: error: invalid operands to binary expression ( vsum and vector ) a = x + y + z; ~~~~~ ^ ~ lazy_v1.cpp:12:12: note: candidate function not viable: no known conversion from vsum to const vector for 1st argument vsum operator+(const vector &a, const vector &b) { ^ 1 error generated.

Slide 66

Slide 66 text

Lazy evaluation v0.2 1 template <class LHS, class RHS> 2 struct vsum { 3 const LHS &lhs; 4 const RHS &rhs; 5 6 double operator[](size_t i) const { return lhs[i] + rhs[i]; } 7 };

Slide 67

Slide 67 text

Lazy evaluation v0.2 1 template <class LHS, class RHS> 2 struct vsum { 3 const LHS &lhs; 4 const RHS &rhs; 5 6 double operator[](size_t i) const { return lhs[i] + rhs[i]; } 7 }; 8 9 template <class LHS, class RHS> 10 vsum<LHS, RHS> operator+(const LHS &a, const RHS &b) { 11 return vsum<LHS, RHS>{a, b}; 12 }

Slide 68

Slide 68 text

Lazy evaluation v0.2 1 template <class LHS, class RHS> 2 struct vsum { 3 const LHS &lhs; 4 const RHS &rhs; 5 6 double operator[](size_t i) const { return lhs[i] + rhs[i]; } 7 }; 8 9 template <class LHS, class RHS> 10 vsum<LHS, RHS> operator+(const LHS &a, const RHS &b) { 11 return vsum<LHS, RHS>{a, b}; 12 } 13 14 template <class Expr> 15 const vector& vector::operator=(const Expr &expr) { 16 for(int i = 0; i < n; ++i) data[i] = expr[i]; 17 return *this; 18 }

Slide 69

Slide 69 text

Problem: its not general enough There are times in life when addition alone is not enough

Slide 70

Slide 70 text

Lazy evaluation v0.3 1 struct plus { 2 static double apply(double a, double b) { return a + b; } 3 };

Slide 71

Slide 71 text

Lazy evaluation v0.3 1 struct plus { 2 static double apply(double a, double b) { return a + b; } 3 }; 4 5 template <class LHS, class OP, class RHS> 6 struct binary_op { 7 const LHS &lhs; 8 const RHS &rhs; 9 10 double operator[](size_t i) const { return OP::apply(lhs[i], rhs[i]); } 11 };

Slide 72

Slide 72 text

Lazy evaluation v0.3 1 struct plus { 2 static double apply(double a, double b) { return a + b; } 3 }; 4 5 template <class LHS, class OP, class RHS> 6 struct binary_op { 7 const LHS &lhs; 8 const RHS &rhs; 9 10 double operator[](size_t i) const { return OP::apply(lhs[i], rhs[i]); } 11 }; 12 13 template <class LHS, class RHS> 14 binary_op<LHS, plus, RHS> operator+(const LHS &a, const RHS &b) { 15 return binary_op<LHS, plus, RHS>{a, b}; 16 }

Slide 73

Slide 73 text

Expression templates are trees The expression in the RHS of: 1 a = x + y; ... is of type: binary_op< vector, plus, vector > + x y

Slide 74

Slide 74 text

Expression templates are trees The expression in the RHS of: 1 a = x + y - z; ... is of type: binary_op< binary_op< vector, plus, vector > , minus , vector > − z + x y

Slide 75

Slide 75 text

Expression templates are trees The expression in the RHS of: 1 a = x + y - z; ... is of type: binary_op< binary_op< vector, plus, vector > , minus , vector > − z + x y for(size_t i = 0; i < n; ++i) a[i] = [i]

Slide 76

Slide 76 text

Expression templates are trees The expression in the RHS of: 1 a = x + y - z; ... is of type: binary_op< binary_op< vector, plus, vector > , minus , vector > − z + x y for(size_t i = 0; i < n; ++i) a[i] = [i] [i]

Slide 77

Slide 77 text

Expression templates are trees The expression in the RHS of: 1 a = x + y - z; ... is of type: binary_op< binary_op< vector, plus, vector > , minus , vector > − z + x y for(size_t i = 0; i < n; ++i) a[i] = [i] [i] [i]

Slide 78

Slide 78 text

Expression templates are trees The expression in the RHS of: 1 a = x + y - z; ... is of type: binary_op< binary_op< vector, plus, vector > , minus , vector > − z + x y #pragma omp parallel for for(size_t i = 0; i < n; ++i) a[i] = [i] [i] [i]

Slide 79

Slide 79 text

Expression templates are trees The expression in the RHS of: 1 a = x + y - z; ... is of type: binary_op< binary_op< vector, plus, vector > , minus , vector > − z + x y #pragma omp parallel for for(size_t i = 0; i < n; ++i) a[i] = [i] [i] [i] The C++ compiler walks the tree. binary_op::operator[] does the actual work.

Slide 80

Slide 80 text

So far, so good It is now possible to: 1 v = a * x + b * y; 2 3 double c = (x + y)[42]; ... and it's as effective as: 1 for(size_t i = 0; i < n; ++i) 2 v[i] = a[i] * x[i] + b[i] * y[i]; 3 4 double c = x[42] + y[42]; No temporaries involved. An optimizing compiler is able to inline everything.

Slide 81

Slide 81 text

So far, so good It is now possible to: 1 v = a * x + b * y; 2 3 double c = (x + y)[42]; ... and it's as effective as: 1 for(size_t i = 0; i < n; ++i) 2 v[i] = a[i] * x[i] + b[i] * y[i]; 3 4 double c = x[42] + y[42]; No temporaries involved. An optimizing compiler is able to inline everything. But how is that related to OpenCL code generation?

Slide 82

Slide 82 text

OpenCL code generation

Slide 83

Slide 83 text

How does OpenCL work? 1 A compute kernel is compiled at runtime from C99 source. 2 Kernel parameters are set through API calls. 3 Kernel is launched on a compute device.

Slide 84

Slide 84 text

How does OpenCL work? 1 A compute kernel is compiled at runtime from C99 source. 2 Kernel parameters are set through API calls. 3 Kernel is launched on a compute device. Source may be read from a file, or stored in a static string, or generated.

Slide 85

Slide 85 text

Generating kernel source from an expression We want this expression: 1 a = x + y - z; . . . to result in this kernel: 1 kernel void vexcl_vector_kernel( 2 ulong n, 3 global double * res, 4 global double * prm1, 5 global double * prm2, 6 global double * prm3 7 ) 8 { 9 for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0)) { 10 res[idx] = ( ( prm1[idx] + prm2[idx] ) - prm3[idx] ); 11 } 12 } − z + x y

Slide 86

Slide 86 text

Declaring parameters Each terminal knows what parameters it needs: 1 /*static*/ void vector::declare_params(std::ostream &src, unsigned &pos) { 2 src << ",\n global double * prm" << ++pos; 3 } An expression just asks its terminals to do the work: 4 template <class LHS, class OP, class RHS> 5 /*static*/ void binary_op<LHS, OP, RHS>::declare_params( 6 std::ostream &src, unsigned &pos) 7 { 8 LHS::declare_params(src, pos); 9 RHS::declare_params(src, pos); 10 }

Slide 87

Slide 87 text

Building string representation for expression 1 struct plus { 2 static void to_string(std::ostream &src) { src << " + "; } 3 };

Slide 88

Slide 88 text

Building string representation for expression 1 struct plus { 2 static void to_string(std::ostream &src) { src << " + "; } 3 }; 4 5 /*static*/ void vector::to_string(std::ostream &src, unsigned &pos) { 6 src << "prm" << ++pos << "[idx]"; 7 }

Slide 89

Slide 89 text

Building string representation for expression 1 struct plus { 2 static void to_string(std::ostream &src) { src << " + "; } 3 }; 4 5 /*static*/ void vector::to_string(std::ostream &src, unsigned &pos) { 6 src << "prm" << ++pos << "[idx]"; 7 } 8 9 template <class LHS, class OP, class RHS> 10 /*static*/ void binary_op<LHS, OP, RHS>::to_string(std::ostream &src, unsigned &pos) { 11 src << "( "; 12 LHS::to_string(src, pos); 13 OP::to_string(src); 14 RHS::to_string(src, pos); 15 src << " )"; 16 }

Slide 90

Slide 90 text

Generating kernel source 1 template <class LHS, class RHS> 2 std::string kernel_source() { 3 std::ostringstream src; 4 5 src << "kernel void vexcl_vector_kernel(\n ulong n"; 6 unsigned pos = 0; 7 LHS::declare_params(src, pos); 8 RHS::declare_params(src, pos); 9 src << ")\n{\n" 10 " for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0)) {\n" 11 " "; 12 pos = 0; 13 LHS::to_string(src, pos); src << " = "; 14 RHS::to_string(src, pos); src << ";\n"; 15 src << " }\n}\n"; 16 17 return src.str(); 18 }

Slide 93

Slide 93 text

Setting kernel arguments 1 void vector::set_args(cl::Kernel &krn, unsigned &pos) { 2 krn.setArg(pos++, buffer); 3 } 4 5 template <class LHS, class OP, class RHS> 6 void binary_op<LHS, OP, RHS>::set_args(cl::Kernel &krn, unsigned &pos) { 7 lhs.set_args(krn, pos); 8 rhs.set_args(krn, pos); 9 } The methods are not static anymore!

Slide 94

Slide 94 text

Gluing it all together 1 template <class Expr> 2 const vector& vector::operator=(const Expr &expr) { 3 static cl::Kernel krn = build_kernel(device, kernel_source<vector, Expr>()); 4 5 unsigned pos = 0; 6 7 krn.setArg(pos++, size); // n 8 krn.setArg(pos++, buffer); // result 9 expr.set_args(krn, pos); // other parameters 10 11 queue.enqueueNDRangeKernel(krn, cl::NullRange, buffer.size(), cl::NullRange); 12 13 return *this; 14 } Each kernel is uniquely identified by its expression type. Hence, we may use a local static variable to cache the kernel. The kernel is generated and compiled once, applied many times.

Slide 96

Slide 96 text

What you saw is not what you get The actual implementation is a bit more complicated: There are unary, binary, n-ary expressions. There are special terminals requiring either global or local preambles. There are builtin and user-defined functions. . . . Boost.Proto is used to keep the code mess contained.

Slide 97

Slide 97 text

Projects already using VexCL AMGCL — implementation of several variants of algebraic multigrid: https://github.com/ddemidov/amgcl Antioch — A New Templated Implementation Of Chemistry for Hydrodynamics: https://github.com/libantioch/antioch Boost.odeint — numerical solution of ordinary differential equations: https://github.com/boostorg/odeint All of these libraries use the same generic code for CPU and GPU paths.

Slide 98

Slide 98 text

AMGCL: an Algebraic MultiGrid implementation Solves large sparse linear systems Constructs an AMG hierarchy from Lego-like blocks Supports several backends (OpenMP, VexCL, ViennaCL, CUDA, HPX, ...) 1 // Create solver for the given matrix (once) 2 amgcl::make_solver< 3 amgcl::backend::builtin<double>, // OpenMP-based builtin backend 4 amgcl::coarsening::smoothed_aggregation, 5 amgcl::relaxation::spai0, 6 amgcl::solver::bicgstab 7 > solve( boost::tie(n, ptr, col, val) ); 8 9 // Solve the system for the given RHS (multiple times) 10 solve(rhs, x); // std::vector<double> rhs, x;

Slide 99

Slide 99 text

AMGCL: an Algebraic MultiGrid implementation Solves large sparse linear systems Constructs an AMG hierarchy from Lego-like blocks Supports several backends (OpenMP, VexCL, ViennaCL, CUDA, HPX, ...) 1 // Create solver for the given matrix (once) 2 amgcl::make_solver< 3 amgcl::backend::vexcl<double>, // VexCL backend 4 amgcl::coarsening::smoothed_aggregation, 5 amgcl::relaxation::spai0, 6 amgcl::solver::bicgstab 7 > solve( boost::tie(n, ptr, col, val) ); 8 9 // Solve the system for the given RHS (multiple times) 10 solve(rhs, x); // vex::vector<double> rhs, x;

Slide 100

Slide 100 text

Gràcies! Source code: github.com/ddemidov/vexcl The slides: speakerdeck.com/ddemidov