References
- Maurice Steinman. IEEE Micro '12.
- On the Efficacy of an APU for Parallel Computing. Daga et al. SAAHPC '11.
- The Tradeoffs of Fused Memory Hierarchies in Heterogeneous Computing Architectures. Spafford et al. CF '12.
- Characterizing the Impact of Memory Access Patterns on AMD Fusion. Lee et al. SC '12.
Cache Coherency vs. Scalability
- GPUs: flat, simple, incoherent caches
  - Scale well with the number of processor tiles
  - Relaxed consistency for groups of cores
  - Hard to program for
- CPUs: multi-level, high-capacity, coherent caches
  - Much less scalable
- Key tradeoff: couple the CPU and GPU caches to enforce coherence, while preserving scalability to a large number of cores
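A toy model (my own illustration, not from the cited papers) of why broadcast-based coherence limits scalability: in a snoopy protocol every write miss must be announced to every other cache, so total coherence traffic grows with the square of the core count, while incoherent GPU-style caches keep traffic linear in the core count.

```python
def snoopy_messages(n_cores: int, write_misses_per_core: int) -> int:
    """Snoopy coherence: each write miss is broadcast to all other caches."""
    return n_cores * write_misses_per_core * (n_cores - 1)


def incoherent_messages(n_cores: int, write_misses_per_core: int) -> int:
    """Incoherent (GPU-style) caches: a miss only talks to memory."""
    return n_cores * write_misses_per_core


# Traffic per design as core counts grow (1000 write misses per core).
for n in (4, 16, 64):
    print(n, snoopy_messages(n, 1000), incoherent_messages(n, 1000))
```

The quadratic term is why scalable coherent designs move to directories rather than broadcasts; the numbers here are purely illustrative.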
Latency vs. Throughput
Allocation of transistors and die space
- CPUs: optimize latency of single-threaded programs
  - Caches, instruction-level parallelism, branch predictors
- GPUs: many simple floating-point units
  - Schedule thousands of threads onto these cores
- Question: how much of the resources should be dedicated to serial vs. parallel processing units?
Capacity vs. Bandwidth
Type of physical memory used
- CPUs: optimize latency to high-capacity DDR3 memory
- GPUs: concerned with repeatedly streaming a fixed-size buffer
  - Maximize bandwidth by using GDDR3: wider memory bus, higher clock speed
  - Lower capacity within the same power budget
- Fused: the GPU and CPU cores must use the same type of physical memory
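Whether a workload actually feels the DDR3/GDDR3 difference comes down to its arithmetic intensity. A back-of-the-envelope roofline check (the peak numbers are hypothetical round figures, not measurements of any APU):

```python
def attainable_gflops(peak_gflops: float, peak_gbs: float,
                      flops_per_byte: float) -> float:
    """Roofline model: performance is capped by compute or by memory bandwidth."""
    return min(peak_gflops, peak_gbs * flops_per_byte)


# Illustrative peaks: a DDR3-like 25 GB/s vs. a GDDR3-like 100 GB/s,
# both feeding a hypothetical 500 GFLOP/s engine.
# STREAM-style triad (a[i] = b[i] + s * c[i]): 2 flops per 24 bytes moved.
intensity = 2 / 24
ddr = attainable_gflops(500, 25, intensity)
gddr = attainable_gflops(500, 100, intensity)
```

At such low intensity both configurations are bandwidth-bound, so the fused chip's memory choice directly sets streaming performance; compute-bound kernels (high flops per byte) hit the 500 GFLOP/s cap on either memory.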
Summary
- GPUs benefit HPC performance for data-parallel problems, but are limited by PCIe bandwidth
- APUs replace the PCIe link with a unified northbridge
- Tradeoffs:
  - Cache Coherency vs. Scalability
  - Capacity vs. Bandwidth
  - Power vs. Performance
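The PCIe limitation in the first bullet can be quantified with a simple break-even model: offloading to a discrete GPU pays only when the compute savings exceed the host-to-device transfer cost, a term the APU's unified northbridge removes. A sketch with illustrative numbers (~8 GB/s effective PCIe bandwidth is an assumption, not a measured figure):

```python
def offload_time(bytes_moved: float, pcie_gbs: float,
                 gpu_seconds: float) -> float:
    """Total offload time: host<->device transfer plus GPU compute."""
    return bytes_moved / (pcie_gbs * 1e9) + gpu_seconds


# Discrete GPU: move 1 GB each way over ~8 GB/s PCIe, then compute for 0.1 s.
discrete = offload_time(2e9, 8.0, 0.1)
# APU: a slower integrated GPU (0.2 s compute) but no copy at all.
apu = offload_time(0, 8.0, 0.2)
```

With these numbers the weaker integrated GPU still wins (0.2 s vs. 0.35 s) because the transfer dominates; for long-running kernels the discrete GPU's higher throughput eventually amortizes the copy.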