Parallel Computing with Kepler and CUDA5

TN Chan1, Compucon New Zealand

Abstract: This paper takes a system integration engineering perspective for industry technology transfer purposes. The context is high performance computing with a focus on heterogeneous architectures and CUDA, approached from hardware architecture rather than software applications. Themes discussed include the heterogeneous versus homogeneous approach, high level versus low level compilation, the latest advances of the CUDA eco-system, and the differences between digital content creation and concept creation.

Acknowledgement: Thanks to Manuel Ujaldon of the University of Malaga, Spain, for providing illustrations from his presentations as Nvidia CUDA Fellow, and to Michael Dinneen of the University of Auckland, New Zealand, for reviewing the paper.

1.0 Heterogeneous Parallel Computing

All computing tasks demand high performance, as time is precious. High performance computing applies to both parallel and serial computing, although it refers more to parallel processing of a single task than to serial processing of a multitude of applications. Serial computing performance advances over time as a direct result of hardware technology developments. Parallel computing based on homogeneous architecture progresses along the same path and relies on software algorithms to synchronise hardware processes for a common task. Homogeneous parallel computing remained the only, or at least the primary, approach until late 2010. Heterogeneous parallel computing changed the picture. It arose from the realisation that some processes can be arranged to take place in a massively parallel manner, as rendering for graphical display has demonstrated. Applying graphical display techniques to general purpose computing is a natural step forward.
The Tianhe-1A supercomputer at the National Supercomputing Centre in China rose to the #1 position of the Top 500 supercomputer list in November 2010 with the help of Tesla general purpose GPU (graphics processing unit) cards supplied by Nvidia. Similarly, the Titan supercomputer at Oak Ridge National Laboratory in the USA rose to the #1 position in November 2012 with the help of Tesla Kepler GPUs. On both occasions, the supercomputer used a smaller count of CPUs (central processing units) than other, lower ranking supercomputers. This observation points to the steep rise of heterogeneous computing with GPUs as co-processors, and of the Tesla GPU in particular, which has a different architecture from the x86 CPU. Figure 1 shows the top 10 supercomputers in the world as of November 2012, as published by the Top500 Organisation (www.top500.org). Implementation of heterogeneous parallel computing is extremely simple in concept: application programmers identify which parts of the source code are looped and allocate them to the GPU for computation, leaving the rest of the source code, which is sequential, for CPU processing.
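The division of labour described above can be sketched with a minimal, hypothetical example (not from the paper): a vector-addition loop is moved from serial CPU code into a CUDA kernel, while the surrounding sequential logic stays on the CPU.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// GPU kernel: each thread performs one iteration of the original loop.
__global__ void addKernel(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard against surplus threads
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Sequential part: setup on the CPU.
    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes),
          *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { ha[i] = (float)i; hb[i] = 2.0f * i; }

    float *a, *b, *c;
    cudaMalloc(&a, bytes); cudaMalloc(&b, bytes); cudaMalloc(&c, bytes);
    cudaMemcpy(a, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b, hb, bytes, cudaMemcpyHostToDevice);

    // Parallel part: the loop is offloaded to the GPU as a grid of blocks.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    addKernel<<<blocks, threads>>>(a, b, c, n);

    cudaMemcpy(hc, c, bytes, cudaMemcpyDeviceToHost);
    printf("c[123] = %f\n", hc[123]);    // expect 3 * 123

    cudaFree(a); cudaFree(b); cudaFree(c);
    free(ha); free(hb); free(hc);
    return 0;
}
```

The only conceptual change from the serial version is that the loop body becomes a kernel and the loop index becomes a thread index.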
Source: Manuel Ujaldon. However simple the concept may be, optimising the performance of parallel applications is always a challenge, and this will remain the situation for the foreseeable future. This paper attempts to substantiate the above views and to provide several observations that may give the academic and research communities a realistic expectation of the heterogeneous parallel computing industry.

2.0 CUDA Hardware Architecture

Heterogeneous parallel computing has redefined the role of the CPU as a serial computation processor and established the role of the general purpose GPU (GPGPU) for parallel computation. It is important to understand the hardware architecture of the GPGPU in order to maximise its utilisation. In this paper, we base our discussion on the Tesla range of GPGPUs from Nvidia. The hardware architecture of a GPGPU can be described in terms of processing units and memory hierarchy. The architecture design attempts to address 4 desires which can be in conflict. The desires are:
o Maximum computation performance
o Minimum power consumption
o Lowest cost
o Friendliest for programming
CUDA is Nvidia's brand of heterogeneous parallel computing architecture. The hardware has gone through 3 generations, named G80, Fermi and Kepler respectively. The Tesla brand separates high performance computing hardware from the GeForce brand used for gaming and the Quadro brand used for professional graphics. The three brands address three different user profiles, but they share the same GPU architecture. CUDA Compute Capability (CCC) is an index that indicates the progression of hardware features; it has progressed from version 1 in 2006 to version 3.5 in 2012, corresponding to the 3 generations of hardware architecture mentioned above. Figure 3 shows 6 steps of progression of CCC over the last 7 years. Source: Manuel Ujaldon.
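The CCC of an installed card can be queried at run time through the CUDA runtime API. The following sketch (not from the paper) reads the capability and multiprocessor count of device 0:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);      // query device 0

    // prop.major and prop.minor give the CUDA Compute Capability,
    // e.g. 3.5 for Kepler GK110, 2.0 for Fermi, 1.x for G80.
    printf("%s: CCC %d.%d, %d multiprocessors\n",
           prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    return 0;
}
```

Applications commonly use this query to select kernels that match the features of the hardware actually present.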
Memory has been the weak link of every computing system based on Von Neumann's stored program scheme, so any improvement in memory performance improves overall computing performance. The Kepler GK110 GPU is the latest release of Tesla hardware by Nvidia as of February 2013, and we will find out what the improvement is. It has 3 levels of memory hierarchy: L1 cache, L2 cache and Global Memory. L1 is closest to the individual GPU processors, whereas Global Memory is linked to the main memory of the CPU via the PCI Express bus. The L1 and L2 caches are made of high speed and expensive Static Random Access Memory (SRAM). Global Memory is made of Dynamic Random Access Memory (DRAM), specifically GDDR5 (Graphics Double Data Rate version 5). GDDR5 is derived from DDR3 but improved in bandwidth and voltage among other respects. It is faster than the DDR3 normally used for CPU main memory, but a lot slower than the L1 and L2 caches of the GPU. Figure 4 shows the memory hierarchy of the Kepler GPU. Source: Manuel Ujaldon.

A K20 card of the GK110 generation with PCI Express interface consists of 13 multiprocessors, each consisting of 192 cores. It consumes 225W of electricity at full load, and this TDP (Thermal Design Power) limit is the same as for the last 2 generations of hardware. This implies that "performance per watt" has increased at the same rate as maximum performance over the 3 generations of hardware technology progression. Applications compiled for Fermi will gain in performance on Kepler without recompilation, and will gain further if the source code is revised to take advantage of new Kepler architectural features.

A loop of the application source code constitutes one parcel of work, called a kernel, identified for parallel computation by the GPGPU. A kernel is assigned to a set of hardware multiprocessors via a grid, which consists of symmetric blocks of threads. Blocks are mapped to multiprocessors, and threads are mapped to cores using a scheduling unit called the warp. Each block of threads runs the same code on different data in synchronisation, implementing a SIMD (Single Instruction Multiple Data) approach. Figure 5 shows the mapping of a kernel to hardware. Source: Manuel Ujaldon.

3.0 CUDA Software Programming

Popular compilers for C, C++ and Fortran support CUDA GPUs by incorporating appropriate library calls. CUDA libraries include CUFFT (Fast Fourier Transform), CUBLAS (Basic Linear Algebra Subprograms), CURAND (random number generation), etc. The full list can be found at https://developer.nvidia.com/. Figure 6 is a simplified block diagram of the CUDA programming eco-system of libraries and compilers. Source: Nvidia CUDA C Programming Guide, http://docs.nvidia.com/cuda/index.html. CUDA code is compiled with the NVCC compiler. NVCC separates CPU code from GPU code, the latter called PTX (Parallel Thread eXecution).
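The CUDA libraries mentioned above are called directly from host code; the GPU work happens inside the library. A minimal sketch (not from the paper, assuming the CUBLAS v2 handle-based API) computing y = alpha*x + y on the GPU:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const int n = 1024;
    float hx[1024], hy[1024];
    for (int i = 0; i < n; i++) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemcpy(x, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(y, hy, n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // y = alpha * x + y, computed by the library on the GPU:
    // no kernel is written by the programmer.
    float alpha = 3.0f;
    cublasSaxpy(handle, n, &alpha, x, 1, y, 1);

    cudaMemcpy(hy, y, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);    // 3*1 + 2 = 5

    cublasDestroy(handle);
    cudaFree(x); cudaFree(y);
    return 0;
}
```

This is the easiest entry point into CUDA: the application keeps its ordinary structure and delegates the parallel work to a tuned library routine.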
PTX is further compiled to map the GPU code onto the target GPU hardware. Nvidia Tesla also complies with the OpenCL (Open Computing Language) cross-vendor standard, which is maintained by the Khronos Group and supported by Intel, AMD and ARM among others. OpenCL is not expected to produce binary code as efficient as NVCC's for Nvidia GPUs, due to the lack of the CUDA libraries and PTX instructions that are available through NVCC.
Source: Manuel Ujaldon. In a study of graph algorithms for high performance computation2, Dinneen compared the running times of different implementations of the same algorithm via OpenCL and CUDA and noticed no remarkable difference in computation time between them. He further pointed out that OpenCL has the additional advantage of being portable to more devices (CPUs and GPUs). The OpenACC programming standard (http://www.openacc-standard.org/) applies OpenMP (Open Multi-Processing) style high level directives to generate CUDA code for existing or legacy application software. It is supported by PGI, is applicable to Fortran and C compilers, and issued V1 in 2012. Programmers simply add hints known as "directives" to the original source to identify which areas of code to accelerate, and the compiler takes care of the rest. By exposing parallelism to the compiler, directives allow the compiler to do the detailed work of mapping the computation onto the accelerator. Again we would not expect the final code to be as efficient as native CUDA, but the incentive here is to re-use legacy software on new CUDA GPUs. Figure 8 compares OpenMP and OpenACC code. Source: https://developer.nvidia.com/openacc. OpenMP and CUDA based parallel programs have their own merits, and this paper does not suggest replacing one with the other. The research communities have been vocal that the amount of human effort required by CUDA is expensive, so CUDA does not appeal to everyone. This is at least the view of Dinneen, who further expressed the hope of using OpenMP for CUDA GPUs (in lieu of CUDA or OpenCL)3. Whilst PTX is designed for CUDA hardware, a group at the Georgia Institute of Technology has developed a framework called Ocelot to convert PTX code to run on 4 different non-CUDA hardware targets. Ocelot is a dynamic compilation environment for PTX code on heterogeneous systems, which allows extensive analysis of the PTX code and its migration to other platforms.
There is also an attempt, called Swan, to port CUDA code to OpenCL; see http://multiscalelab.org/swan. As of January 2013, the last version of Swan noted on the website dates from December 2010, so presumably there is not enough interest or incentive in this direction of code porting. Figure 9 shows an Ocelot block diagram from the Georgia Institute of Technology website, http://gpuocelot.gatech.edu/.
4.0 New Features of Kepler

Kepler provides several levels of parallelism by the arrangement of multiprocessors and the cores within each. Some parallelism is controlled by hardware, but some is left to software to optimise. The K20 adds a new parallel dimension called Hyper-Q: it is capable of executing up to 32 kernels launched from different CPU processes simultaneously, which increases the percentage of temporal occupancy of the GPU. Previous generations of CUDA hardware, such as Fermi, had only one connection to the CPU. The multi-connection feature is hardware based. It improves the level of utilisation of CPU and GPU depending on the individual scenario, and the key point is that it eliminates the CPU-GPU connection as a potential bottleneck. Figure 10 shows the CPU-GPU relationships with and without Hyper-Q respectively. Source: Manuel Ujaldon.

Another new architectural feature of the K20 is Dynamic Parallelism, the ability to launch new grids from the GPU:
o Dynamically: based on run-time data.
o Simultaneously: from multiple threads at once.
o Independently: each thread can launch a different grid.
This reduces coordination with the CPU via the PCI Express bus and shifts it to within the GPU. Internal GPGPU memory transfers are more than 10 times faster than global memory transfers over PCI Express lanes. Figure 11 shows the CPU-GPU relationships without and with Dynamic Parallelism. Source: Manuel Ujaldon.

CUDA SDK (software development kit) version 5.0 supports the above mentioned new K20 features. The Titan supercomputer team at Oak Ridge National Laboratory published some early experience with the K20X and CUDA5 at SC12 in November 2012 on 5 applications. Gain is defined as the application performance with an Opteron CPU plus K20X GPU relative to the Opteron CPU alone; gains ranged from 1.8 to 7.8. Figure 12 shows the performance gains for the 5 applications.
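Dynamic Parallelism can be sketched as a parent kernel that launches child kernels from the GPU, with no round trip to the CPU. The example below is a hypothetical illustration (not from the paper); it requires CCC 3.5 hardware and would be built with something like nvcc -arch=sm_35 -rdc=true -lcudadevrt.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Child kernel: processes one tile of the data.
__global__ void childKernel(float *data, int offset)
{
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

// Parent kernel: decides at run time, ON THE GPU, how much further
// work to launch -- the CPU and PCI Express bus are not involved.
__global__ void parentKernel(float *data, int n)
{
    if (threadIdx.x == 0) {
        int tiles = n / 256;                       // run-time decision
        for (int t = 0; t < tiles; t++)
            childKernel<<<1, 256>>>(data, t * 256);  // launch from GPU
    }
}

int main(void)
{
    const int n = 1024;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    parentKernel<<<1, 1>>>(d, n);   // one launch from the CPU side
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}
```

On Fermi, the loop in parentKernel would instead have to run on the CPU, with each child launch crossing the PCI Express bus.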
Source: http://www.gputechconf.com/gtcnew/on-demand-gtc.php

5.0 High Level Compilation

Current programming efforts require a good understanding of the GPU memory hierarchy and the GPU programming execution model in order to fully exploit the GPU capacity and capability for maximum application performance. The scope of the challenge to programmers consists of the following areas:
o GPU memory management
o Kernel allocation
o CPU and GPU coordination
Optimisation is the best trade-off between utilisation of resources and the limitations imposed by the architecture. When hardware of a new architecture is released, manual programming effort is initially required to exploit its potential. Over time, compilers would incorporate the new architectural improvements as extensions. This process appears to be irreplaceable as long as human brains outperform computers in thinking.

There are many attempts to free software programmers from the complexity of hardware so that they can focus on algorithms for solving problems. If we look beyond parallel computing we find similar attempts everywhere; the best known are the TCP/IP stack and the OSI 7 layer model. Those attempts were visionary, and the interfaces between layers are defined for compliance by product vendors. Commercial engineering design applications such as Mathworks Simulink and National Instruments Labview provide graphical user interfaces and tools to relieve design engineers of some programming effort.

CUDA-CHILL by Rudy4 is a 2010 effort to automate the optimisation process of parallel programming. It uses a complete scripting language to describe composable compiler transformations that can be written, shared and reused by non-expert application and library developers. It consolidates contributions from 66 previous research efforts of the academic communities. It compared performance against the CUBLAS 2.0 library for a range of square matrix sizes up to 8192 elements and claimed to match or outperform CUBLAS. Figure 13 shows the idea in a simple block diagram and what the optimiser needs to achieve.

OpenCL is supposed to be a step towards standardising developments for GPGPU computing. However, there is nothing to stop innovative vendors from developing vertically self-serving technology eco-systems, and CUDA is one example. Similarly, Microsoft DirectCompute is an attempt at standardisation, but it requires hardware to comply with DirectX GPU criteria within a Microsoft eco-system.
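The "GPU memory management" challenge noted above is largely about exploiting the memory hierarchy. The following sketch (not from the paper) shows the standard pattern of staging data in fast on-chip shared memory instead of repeatedly reading slow global (GDDR5) memory, here for a per-block sum:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void blockSum(const float *in, float *out, int n)
{
    // Stage data in on-chip shared memory (same silicon level as L1)
    // so that the reduction does not re-read global memory.
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];   // one partial sum per block
}

int main(void)
{
    const int n = 1024, threads = 256, blocks = n / threads;
    float h[1024];
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, blocks * sizeof(float));
    cudaMemcpy(in, h, n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, threads>>>(in, out, n);

    float partial[4];
    cudaMemcpy(partial, out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    printf("partial sums: %.0f %.0f %.0f %.0f\n",
           partial[0], partial[1], partial[2], partial[3]);  // 256 each

    cudaFree(in); cudaFree(out);
    return 0;
}
```

Choosing when to use shared memory, how to size blocks, and how to keep the multiprocessors occupied is exactly the manual effort that high level compilation attempts to automate.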
An exercise (by Wolfe, quoted by Rudy in section 2.3) took several days of work to improve a simple 7 line single precision matrix multiplication kernel from 1.7 to 208 GFLOPS on a GTX 280 in 2008. As Nicolescu summarised the situation in his review of Microsoft Research Accelerator v25, the more effort we put into programming optimisation, the more performance gain we obtain; less effort, such as through high level optimisation, leads to less gain.

6.0 Digital Content Creation

High performance computing is required for content creation as well as for concept creation. The main difference between the two is the need for visualisation in addition to computation in the design process, as in the case of content creation. There are thousands of such commercial off-the-shelf (COTS) applications, and they have to be compiled for CUDA Maximus to optimise their performance6. NAMD is a good example for explaining the difference: the computation package takes Tesla cards to do computation with minimal display sophistication. If display of detailed simulations is needed, a Quadro card will be needed in a separate workstation linked to the computation server. Figure 14 shows two molecular patterns 10 seconds apart. Source: http://www.ks.uiuc.edu/Training/Tutorials/namd/namd-tutorial-unix-html/index.html
For applications such as Adobe Premiere Pro, having Tesla and Quadro in the same workstation under the Nvidia Maximus scheme will achieve the best of both worlds: the parallel operation of Tesla and Quadro reduces the total computing time. Figure 15 shows the visualisation and computing processes in series.

7.0 Summary

Heterogeneous computing with GPUs has given high performance computing a quantum jump over the last few years. It is not a proprietary concept, and it will surely breed more varieties of approach in the near future. No doubt scientists and researchers would like to use the newest industry releases with the simplest programming effort. This ideal will only be achievable for mainstream uses, such as through instituting an abstraction layer between hardware and software. Extra programming effort will always be required for bleeding edge performance, which keeps moving ahead over time.

8.0 References

1. TN Chan graduated from the University of Hong Kong with a Bachelor of Science in Mechanical Engineering honours degree. He is a Chartered Electrical Engineer, member of the Institution of Engineering & Technology, and member of the Institution of Professional Engineers of New Zealand. He has been the managing director and system architect of Compucon New Zealand since 1992 and an Industry Supervisor of the University of Auckland since 2002. firstname.lastname@example.org

2. Michael J. Dinneen, Masoud Khosravani and Andrew Probert, University of Auckland, Using OpenCL for Implementing Simple Parallel Graph Algorithms, Proceedings of the 17th Annual Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'11), part of WORLDCOMP'11, pages 268-273, Las Vegas, Nevada, July 18-21, 2011

3. Michael J.
Dinneen, Masoud Khosravani and Kuai Wei, University of Auckland, A Comparative Study of Parallel Algorithms for the Girth Problem, Proceedings of the Tenth Australasian Symposium on Parallel and Distributed Computing (AusPDC 2012), Melbourne, Australia, September 2012

4. Gabe Rudy, CUDA-CHILL: A Programming Language Interface for GPGPU Optimisations and Code Generations, University of Utah, MSc thesis, August 2010

5. Radu Nicolescu, Many Cores - Parallel Programming on GPU, University of Auckland, 13 March 2012

6. http://www.anandtech.com/show/5094/nvidias-maximus-technology-quadro-tesla-launching-today

END