References
- Maurice Steinman. IEEE Micro '12.
- On the Efficacy of an APU for Parallel Computing. Daga et al. SAAHPC '11.
- The Tradeoffs of Fused Memory Hierarchies in Heterogeneous Computing Architectures. Spafford et al. CF '12.
- Characterizing the Impact of Memory Access Patterns on AMD Fusion. Lee et al. SC '12.
Cache Coherency vs. Scalability
- GPUs: flat, simple, incoherent caches
  - Scale well with the number of processor tiles
  - Relaxed consistency for groups of cores
  - Hard to program for
- CPUs: multi-level, high-capacity, coherent caches
  - Much less scalable
- Key tradeoff: couple the CPU and GPU caches to enforce coherence, while preserving scalability to a large number of cores
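A toy model (my own illustration, not from the cited papers) of why broadcast-based coherence limits scalability: in a snoopy protocol every write miss must be announced to every other cache, so total coherence traffic grows with the square of the core count, while incoherent GPU-style caches keep traffic linear in the core count.

```python
def snoopy_messages(n_cores: int, write_misses_per_core: int) -> int:
    """Snoopy coherence: each write miss is broadcast to all other caches."""
    return n_cores * write_misses_per_core * (n_cores - 1)


def incoherent_messages(n_cores: int, write_misses_per_core: int) -> int:
    """Incoherent (GPU-style) caches: a miss only talks to memory."""
    return n_cores * write_misses_per_core


# Traffic per design as core counts grow (1000 write misses per core).
for n in (4, 16, 64):
    print(n, snoopy_messages(n, 1000), incoherent_messages(n, 1000))
```

The quadratic term is why scalable coherent designs move to directories rather than broadcasts; the numbers here are purely illustrative.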
Latency vs. Throughput
Allocation of transistors and die space
- CPUs: optimize latency of single-threaded programs
  - Caches, instruction-level parallelism, branch predictors
- GPUs: many simple floating-point units
  - Schedule thousands of threads onto these cores
- Question: how much of the resources should be dedicated to serial vs. parallel processing units?
Capacity vs. Bandwidth
Type of physical memory used
- CPUs: optimize latency to high-capacity DDR3 memory
- GPUs: concerned with repeatedly streaming a fixed-size buffer
  - Maximize bandwidth by using GDDR3: wider memory bus, higher clock speed
  - Lower capacity within the same power budget
- Fused: the GPU and CPU cores must use the same type of physical memory
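Whether a workload actually feels the DDR3/GDDR3 difference comes down to its arithmetic intensity. A back-of-the-envelope roofline check (the peak numbers are hypothetical round figures, not measurements of any APU):

```python
def attainable_gflops(peak_gflops: float, peak_gbs: float,
                      flops_per_byte: float) -> float:
    """Roofline model: performance is capped by compute or by memory bandwidth."""
    return min(peak_gflops, peak_gbs * flops_per_byte)


# Illustrative peaks: a DDR3-like 25 GB/s vs. a GDDR3-like 100 GB/s,
# both feeding a hypothetical 500 GFLOP/s engine.
# STREAM-style triad (a[i] = b[i] + s * c[i]): 2 flops per 24 bytes moved.
intensity = 2 / 24
ddr = attainable_gflops(500, 25, intensity)
gddr = attainable_gflops(500, 100, intensity)
```

At such low intensity both configurations are bandwidth-bound, so the fused chip's memory choice directly sets streaming performance; compute-bound kernels (high flops per byte) hit the 500 GFLOP/s cap on either memory.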
Summary
- GPUs benefit HPC performance for data-parallel problems, but are limited by PCIe bandwidth
- APUs replace the PCIe link with a unified northbridge
- Tradeoffs:
  - Cache Coherency vs. Scalability
  - Capacity vs. Bandwidth
  - Power vs. Performance
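The PCIe limitation in the first bullet can be quantified with a simple break-even model: offloading to a discrete GPU pays only when the compute savings exceed the host-to-device transfer cost, a term the APU's unified northbridge removes. A sketch with illustrative numbers (~8 GB/s effective PCIe bandwidth is an assumption, not a measured figure):

```python
def offload_time(bytes_moved: float, pcie_gbs: float,
                 gpu_seconds: float) -> float:
    """Total offload time: host<->device transfer plus GPU compute."""
    return bytes_moved / (pcie_gbs * 1e9) + gpu_seconds


# Discrete GPU: move 1 GB each way over ~8 GB/s PCIe, then compute for 0.1 s.
discrete = offload_time(2e9, 8.0, 0.1)
# APU: a slower integrated GPU (0.2 s compute) but no copy at all.
apu = offload_time(0, 8.0, 0.2)
```

With these numbers the weaker integrated GPU still wins (0.2 s vs. 0.35 s) because the transfer dominates; for long-running kernels the discrete GPU's higher throughput eventually amortizes the copy.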