
An Analysis of Accelerator Coupling in Heterogeneous Architectures

Existing research on accelerators has emphasized the performance and energy efficiency improvements they can provide, devoting little attention to practical issues such as accelerator invocation and interaction with other on-chip components (e.g. cores, caches). In this paper we present a quantitative study that considers these aspects by implementing seven high-throughput accelerators following three design models: tight coupling behind a CPU, loose out-of-core coupling with Direct Memory Access (DMA) to the LLC, and loose out-of-core coupling with DMA to DRAM. A salient conclusion of our study is that working sets of non-trivial size are best served by loosely-coupled accelerators that integrate private memory blocks tailored to their needs.

Full paper: http://www.cs.columbia.edu/~cota/pubs/cota_dac15.pdf

Emilio G. Cota

June 12, 2015


Transcript

  1. An Analysis of Accelerator Coupling in Heterogeneous Architectures
     Emilio G. Cota, Paolo Mantovani, Giuseppe Di Guglielmo, Luca P. Carloni
     Columbia University
     DAC'15, San Francisco, CA, USA
  2. Post-Dennard scaling and fixed power budgets are driving designs toward specialization. Accelerators have become essential for high-efficiency systems, e.g. SoCs.
     [Figure: Generality vs. Energy Efficiency. CPUs, CMPs, many-cores, DSPs/GPGPUs, and accelerators (ASICs) plotted along a specialization axis; energy efficiency spans roughly 1x to 1000x as generality decreases.]
  3. Analysis of Accelerator Couplings
     Our goal: to draw observations about the performance, efficiency, and programmability of accelerators with different couplings.
     Two main options w.r.t. CPUs:
     • Tightly-Coupled Accelerators (TCAs)
     • Loosely-Coupled Accelerators (LCAs)
     This is a major trade-off in accelerator design, since it determines how memory is accessed.
  4. Tightly-Coupled Accelerators (TCAs), a.k.a. the "coprocessor model"
     ✔ Nil invocation overhead (via ISA extensions)
     ✔ No internal storage: direct access to the L1 cache
     ✗ Limited portability: the design is heavily tied to the CPU
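
To make "nil invocation overhead via ISA extensions" concrete, here is a minimal sketch of how a TCA call site could look. The opcode bytes and register convention are placeholders invented for illustration (the slides do not give the actual encoding); real hardware would define its own extension.

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical TCA invocation in the coprocessor model: a custom
// instruction added to the ISA starts the accelerator directly on data
// that is already in the L1 cache -- no driver call, no DMA setup.
// The bytes below are placeholders (0x0f 0x0b is x86 UD2, which traps),
// NOT a real extension.
static inline void tca_sort(uint32_t* data, size_t len) {
    asm volatile(".byte 0x0f, 0x0b"       // stand-in for the custom opcode
                 :                        // results come back through the L1
                 : "a"(data), "c"(len)    // operands pinned to eax/ecx
                 : "memory");
}

int main() {
    uint32_t buf[16] = {5, 3, 8, 1};
    tca_sort(buf, 16);  // would trap on real hardware without the extension
    return 0;
}
```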
  5-6. Loosely-Coupled Accelerators (LCAs), a.k.a. the "SoC-like model"
     ✔ Good design reuse: no CPU-specific knowledge
     ✗ Fixed set-up costs due to driver invocation and DMA
     ✔ Freedom to tailor private memories (scratchpads), e.g. providing different banks, ports, and bit widths
     ✗ Scratchpads require large area expenses
     Two flavors: • LLC-DMA • DRAM-DMA
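
To show what "driver invocation" means in practice for an LCA, here is a minimal user-space sketch against a hypothetical Linux character device. The device path /dev/fft_acc, the accel_job struct, and the ACCEL_RUN ioctl number are made up for illustration; the paper's actual driver interface is not shown in the slides.

```cpp
#include <fcntl.h>      // open
#include <sys/ioctl.h>  // ioctl
#include <unistd.h>     // close
#include <cstdint>
#include <cstdio>

// Hypothetical job descriptor handed to the driver: where to DMA
// from/to and how much to process. Both the struct and the ioctl
// number are illustrative only.
struct accel_job {
    uint64_t src;   // source buffer for the DMA engine
    uint64_t dst;   // destination buffer
    uint32_t len;   // bytes to process
};
#define ACCEL_RUN _IOW('a', 1, struct accel_job)

int main() {
    // Fixed set-up cost: every invocation pays for the syscall path
    // and for programming the DMA engine, unlike a TCA.
    int fd = open("/dev/fft_acc", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    accel_job job{/*src=*/0x1000, /*dst=*/0x2000, /*len=*/4096};
    if (ioctl(fd, ACCEL_RUN, &job) < 0)  // driver DMAs in, runs, DMAs out
        perror("ioctl");

    close(fd);
    return 0;
}
```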
  7. Accelerator Design
     • Used High-Level Synthesis for productivity
     • Most effort is on the memory subsystem to exploit parallelism, i.e. a large number of operations per clock cycle
       – Most accelerator area is therefore memory
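
Since most of the design effort (and area) goes into the memory subsystem, a small sketch may help. The kernel below is illustrative HLS-style C++, not taken from the paper: it banks the scratchpad so that, once the inner loop is unrolled, the synthesized datapath can issue several memory operations per cycle. The pragma spelling in the comment is tool-specific.

```cpp
#include <cstdint>

// Sketch of an HLS-style accelerator kernel. Splitting the scratchpad
// into banks lets the synthesized datapath read several elements per
// clock cycle; a single-ported memory would serialize these accesses.
constexpr int BANKS = 4;
constexpr int N = 1024;

void acc_kernel(const int32_t in[N], int32_t out[N]) {
    int32_t spad[BANKS][N / BANKS];  // private memory, one block per bank

    // Load phase: element i lands in bank i % BANKS at offset i / BANKS.
    for (int i = 0; i < N; ++i)
        spad[i % BANKS][i / BANKS] = in[i];

    // Compute phase: with the inner loop unrolled by BANKS, the tool can
    // schedule BANKS operations per cycle, since each access hits a
    // different physical memory (e.g. "#pragma HLS unroll factor=4" in
    // Vivado HLS -- the exact spelling varies across tools).
    for (int i = 0; i < N / BANKS; ++i)
        for (int b = 0; b < BANKS; ++b)
            out[i * BANKS + b] = spad[b][i] * 3 + 1;
}

int main() {
    static int32_t in[N], out[N];
    for (int i = 0; i < N; ++i) in[i] = i;
    acc_kernel(in, out);
    return 0;
}
```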
  8. Experimental Methodology
     • Full-system simulation running Linux
     • In-order, embedded-like i386 cores
     • Detailed Level-1 and Level-2 cache models
     • Accurate DRAM simulation with DRAMSim2
  9. Experimental Methodology: Heterogeneous System Simulation
     • Latencies from RTL are back-annotated into the simulator (for TCAs) and into SystemC (for LCAs)
     • LCAs: the SystemC accelerator simulation runs in parallel with the simulator, synchronizing every 100 cycles
     [Figure: simulation flow. Input C code is fed to a High-Level Synthesis tool, producing SystemC and RTL; adding special instructions yields the TCA simulation, while writing an OS driver and adding driver invocations yields the LCA simulation.]
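
As a rough illustration of the synchronization scheme, here is a minimal lock-step quantum loop in SystemC. The exchange_state_with_cpu_simulator() and simulation_done() hooks, the placeholder accelerator module, the 1 ns clock assumption, and the stop condition are all hypothetical; the slides do not show the actual harness.

```cpp
#include <systemc.h>

// Hypothetical stand-ins for the full-system-simulator interface: in a
// real harness these would push/pull DMA traffic across the boundary.
static int quanta_run = 0;
void exchange_state_with_cpu_simulator() { /* reconcile DMA traffic */ }
bool simulation_done() { return ++quanta_run >= 10; }  // stop after 10 quanta

// Placeholder accelerator model: just consumes simulated time.
SC_MODULE(AccelModel) {
    SC_CTOR(AccelModel) { SC_THREAD(run); }
    void run() {
        for (;;) wait(1, SC_NS);  // one "cycle" of work per ns, illustratively
    }
};

int sc_main(int, char*[]) {
    AccelModel acc("acc");
    const int QUANTUM_CYCLES = 100;  // sync interval from the talk

    // Lock-step co-simulation: advance the SystemC accelerator model by
    // one quantum, then reconcile with the CPU-side simulator. Assuming
    // a 1 ns clock, 100 cycles == 100 ns of SystemC time.
    while (!simulation_done()) {
        sc_start(QUANTUM_CYCLES, SC_NS);
        exchange_state_with_cpu_simulator();
    }
    return 0;
}
```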
  10. Speedup over Software
     • LLC-DMA LCA > DRAM-DMA LCA > TCA
     • The ratio of scratchpad size to input size matters, e.g. for FFT
     • DRAM bandwidth is the bottleneck for accelerators where communication >> computation, e.g. sort
  11. Performance & Energy
     • LLC-DMA LCA > DRAM-DMA LCA > TCA
     • The efficiency gap between the two LCA flavors is due to the difference in off-chip accesses
     • LLC pollution study results are in the paper/poster
  12. Concluding Observations
     • Why LCAs > TCAs: tailored, many-ported scratchpads are key to performance
       – L1s cannot provide this parallelism (at most 2 ports!)
     • LCAs are best positioned to deliver high throughput given non-trivial inputs amenable to computation in bursts
       – DRAM bandwidth can limit this potential
     • Programming LCAs is not conceptually complex
       – Operating systems have simple, well-defined interfaces for this