
Velox: Offloading work to accelerators

Velox pipeline performance is limited by the available compute. The talk gives an overview of common bottlenecks observed in benchmarks such as TPC-H, describes the kinds of parallelism accelerators are capable of, and surveys ways an accelerator can be applied to help, as well as some of the challenges in this space.

Sergei Lewis
Principal Member of Technical Staff at Rivos

April 05, 2024

Transcript

  1. Overview
     • Common CPU bottlenecks
     • SIMT introduction
     • SIMT considerations
     • Design space exploration
     (figure: NVIDIA B100 HBM3e, Graviton 5 DDR5-5600, AMD Genoa DDR5-4800, AMD MI300X HBM3)

  2. (image-only slide)

  3. Key Bottlenecks
     Data movement (shuffle, exchange, scan)
     • Decode/encode (parquet, orc, …)
     • Compress/decompress
     Partitioning/hashing (Join, Aggregation)
     Calculated columns (Project, Aggregation)

  4. SIMT Accelerator Overview
     • Warps of multiple (32) threads
       ◦ Executing in lockstep per core
     • Warp schedulers and large register banks
       ◦ For rapid context switching
     • Shared memory, dedicated I/O
     • Many cores
     • High bandwidth memory
     • Access: device / host / unified memory
     (diagram: CORE / ACCELERATOR)

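To make the hierarchy concrete, here is a minimal CUDA sketch (not from the talk): each thread block maps onto a core, threads execute in warps of 32, and cudaMallocManaged illustrates the unified-memory access mode. Names and sizes are illustrative.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each thread block is scheduled onto one core (SM); its threads
    // execute in warps of 32, in lockstep.
    __global__ void scale(float* data, float factor, int n) {
        // Global index: blockIdx selects the block, threadIdx the lane.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;  // all 32 lanes of a warp issue this together
    }

    int main() {
        const int n = 1 << 20;
        float* data = nullptr;
        cudaMallocManaged(&data, n * sizeof(float));  // unified memory: visible to host and device
        for (int i = 0; i < n; ++i) data[i] = 1.0f;

        int threadsPerBlock = 256;  // 8 warps per block (illustrative)
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scale<<<blocks, threadsPerBlock>>>(data, 2.0f, n);
        cudaDeviceSynchronize();

        printf("data[0] = %.1f\n", data[0]);  // prints 2.0
        cudaFree(data);
        return 0;
    }
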
  5. Challenges with SIMT Accelerators
     Data Movement and Communication
     • Within warps
     • Between warps
     • Cross-block / cross-grid (e.g. atomics)
     Thread Scheduling
     • Control flow divergence
     Host Control
     • Kernel scheduling
     • Data flow between host and device

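The control flow divergence point is worth an example. The following CUDA sketch is illustrative, not from the talk: in divergent, lanes within one warp branch differently and the two paths execute serially; in uniform, the condition is constant across each warp, so nothing is serialized.

    #include <cuda_runtime.h>

    // Divergent version: lanes within one warp take different branches,
    // so the hardware serializes the two paths while the inactive
    // half-warp idles.
    __global__ void divergent(float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if ((i % 32) < 16)
            out[i] = sinf((float)i);  // first half-warp runs this...
        else
            out[i] = cosf((float)i);  // ...then the second half-warp runs this
    }

    // Divergence-free variant: the condition is uniform across each warp
    // (all 32 lanes share one warp index), so neither path is serialized.
    __global__ void uniform(float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if ((i / 32) % 2 == 0)
            out[i] = sinf((float)i);
        else
            out[i] = cosf((float)i);
    }

    int main() {
        const int n = 1024;
        float* out = nullptr;
        cudaMallocManaged(&out, n * sizeof(float));
        divergent<<<n / 256, 256>>>(out, n);  // serialized half-warps
        uniform<<<n / 256, 256>>>(out, n);    // uniform branching, no serialization
        cudaDeviceSynchronize();
        cudaFree(out);
        return 0;
    }
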
  6. Data Compression - Design Space
     LZ77/LZSS algorithm family
     • Check dictionary for next input bytes
     • Encode copy or literal instruction
     • Update dictionary
     Performance vs compression ratio
     • Snappy, LZ4, …
     • Deflate, zstd, …
     Example: ABCDABC encodes as (literal, 4, ABCD) (copy, 4, 3)

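As a concrete illustration of that loop, here is a minimal host-side greedy LZSS encoder. It is a sketch, not the talk's implementation; the token format and the minimum match length of 3 are assumptions. On the slide's example it prints exactly (literal, 4, ABCD) (copy, 4, 3).

    #include <cstdio>

    // Greedy LZSS-style encoding sketch: scan the input, check the
    // dictionary (the bytes consumed so far) for the longest match,
    // and emit either a copy instruction or a run of literals.
    void lzss_encode(const char* in, int n) {
        const int MIN_MATCH = 3;  // assumed threshold for emitting a copy
        int pos = 0, litStart = 0;
        while (pos < n) {
            // Find the longest dictionary match starting at pos.
            int bestLen = 0, bestDist = 0;
            for (int start = 0; start < pos; ++start) {
                int len = 0;
                while (pos + len < n && in[start + len] == in[pos + len]) ++len;
                if (len > bestLen) { bestLen = len; bestDist = pos - start; }
            }
            if (bestLen >= MIN_MATCH) {
                // Flush pending literals, then encode a copy (distance, length).
                if (litStart < pos)
                    printf("(literal, %d, %.*s) ", pos - litStart, pos - litStart, in + litStart);
                printf("(copy, %d, %d) ", bestDist, bestLen);
                pos += bestLen;
                litStart = pos;
            } else {
                ++pos;  // extend the pending literal run
            }
        }
        if (litStart < pos)
            printf("(literal, %d, %.*s)", pos - litStart, pos - litStart, in + litStart);
        printf("\n");
    }

    int main() {
        lzss_encode("ABCDABC", 7);  // prints: (literal, 4, ABCD) (copy, 4, 3)
        return 0;
    }
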
  7. Data Compression - Parallelization
     Data-dependent output - how to parallelize?
     • Break input into blocks
     • Compress independently
     • Gather, Stream
     (diagram: independently compressed blocks meeting at a SYNC point)

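A host-side sketch of that framing follows. The compress_block stand-in is hypothetical (a real implementation would run LZSS, Snappy, etc. per block); the point is that each block is independent, so the loop body can be farmed out to accelerator cores and the variable-size outputs gathered in order afterwards.

    #include <cstdio>
    #include <vector>

    struct Chunk { std::vector<unsigned char> bytes; };

    // Hypothetical per-block compressor. Here it just stores the block
    // uncompressed so the sketch stays self-contained.
    Chunk compress_block(const unsigned char* data, size_t n) {
        return Chunk{std::vector<unsigned char>(data, data + n)};
    }

    std::vector<Chunk> compress_blocks(const unsigned char* in, size_t n,
                                       size_t blockSize) {
        std::vector<Chunk> out;
        // Every iteration is independent: blocks can be compressed in
        // parallel and synchronized only at the final gather step.
        for (size_t off = 0; off < n; off += blockSize) {
            size_t len = (n - off < blockSize) ? n - off : blockSize;
            out.push_back(compress_block(in + off, len));
        }
        return out;
    }

    int main() {
        unsigned char data[10] = {1,2,3,4,5,6,7,8,9,10};
        auto chunks = compress_blocks(data, sizeof data, 4);  // 3 blocks: 4+4+2
        printf("%zu chunks\n", chunks.size());
        return 0;
    }
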
  8. Data Compression - One Block / Core
     One block per core - trade-offs:
     • Trivial implementation
     • Decent performance
     • High resource consumption, low utilization

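An illustrative kernel for this layout (not from the talk; run-length encoding stands in for a real LZSS kernel): each CUDA thread block owns one compression block, but only thread 0 does any work. The idle lanes are exactly the low utilization the slide calls out.

    #include <cstdio>
    #include <cstring>
    #include <cuda_runtime.h>

    // One compression block per core: trivial to write, reasonably fast,
    // but 31 of every 32 lanes sit idle.
    __global__ void rle_one_block_per_core(const unsigned char* in,
                                           const int* offsets,  // numBlocks + 1 entries
                                           unsigned char* out,  // 2x input, worst case
                                           int* outLen) {
        if (threadIdx.x != 0) return;  // single worker thread per core
        int begin = offsets[blockIdx.x];
        int end   = offsets[blockIdx.x + 1];
        int w     = 2 * begin;         // this block's private output region
        for (int i = begin; i < end; ) {
            unsigned char b = in[i];
            int run = 1;
            while (i + run < end && in[i + run] == b && run < 255) ++run;
            out[w++] = (unsigned char)run;  // emit (count, byte) pairs
            out[w++] = b;
            i += run;
        }
        outLen[blockIdx.x] = w - 2 * begin;  // gathered by the host afterwards
    }

    int main() {
        const int n = 8, numBlocks = 2;
        unsigned char h_in[n] = {7,7,7,7,5,5,9,9};
        int h_off[numBlocks + 1] = {0, 4, 8};
        unsigned char *in, *out; int *off, *outLen;
        cudaMallocManaged(&in, n); cudaMallocManaged(&out, 2 * n);
        cudaMallocManaged(&off, sizeof h_off);
        cudaMallocManaged(&outLen, numBlocks * sizeof(int));
        memcpy(in, h_in, n); memcpy(off, h_off, sizeof h_off);
        rle_one_block_per_core<<<numBlocks, 32>>>(in, off, out, outLen);
        cudaDeviceSynchronize();
        printf("block 0: %d bytes, block 1: %d bytes\n", outLen[0], outLen[1]);  // 2, 4
        return 0;
    }
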
  9. Data Compression - One Block / Thread
     One block per thread, state machine, reconvergence
     • High complexity
     • Modest resource requirements
     • Modest synchronization requirements
     • Modest performance gains
     (flowchart: Check for match → Match? → Encode instruction → Done?, with SYNC points between steps)

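An illustrative version of the state-machine pattern (again with RLE standing in for LZSS): every thread compresses its own block, but all threads of a warp step through the same SCAN/EMIT states and reconverge each iteration instead of diverging for the whole loop. The buffer layout and launch shape are assumptions.

    #include <cstdio>
    #include <cstring>
    #include <cuda_runtime.h>

    enum State { SCAN, EMIT, DONE };

    // Assumes full warps: launch a multiple of 32 threads, with offsets
    // sized to match (empty blocks go straight to DONE).
    __global__ void rle_one_block_per_thread(const unsigned char* in,
                                             const int* offsets,  // numThreads + 1 entries
                                             unsigned char* out,
                                             int* outLen) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int pos = offsets[tid], end = offsets[tid + 1];
        int w   = 2 * offsets[tid];
        int run = 0;
        State s = (pos < end) ? SCAN : DONE;

        // Finished threads keep voting until the whole warp is done, so
        // the step structure stays uniform across the warp.
        while (__any_sync(0xffffffffu, s != DONE)) {
            switch (s) {
            case SCAN:  // measure the run of identical bytes at pos
                run = 1;
                while (pos + run < end && in[pos + run] == in[pos] && run < 255) ++run;
                s = EMIT;
                break;
            case EMIT:  // emit a (count, byte) pair and advance
                out[w++] = (unsigned char)run;
                out[w++] = in[pos];
                pos += run;
                s = (pos < end) ? SCAN : DONE;
                break;
            case DONE:
                break;
            }
        }
        outLen[tid] = w - 2 * offsets[tid];
    }

    int main() {
        const int nThreads = 32, n = 8;
        unsigned char h_in[n] = {7,7,5,5,5,9,9,9};
        unsigned char *in, *out; int *off, *outLen;
        cudaMallocManaged(&in, n); cudaMallocManaged(&out, 2 * n);
        cudaMallocManaged(&off, (nThreads + 1) * sizeof(int));
        cudaMallocManaged(&outLen, nThreads * sizeof(int));
        memcpy(in, h_in, n);
        for (int t = 0; t <= nThreads; ++t) off[t] = (t < 4) ? 2 * t : n;  // 4 blocks, rest empty
        rle_one_block_per_thread<<<1, nThreads>>>(in, off, out, outLen);
        cudaDeviceSynchronize();
        printf("encoded sizes: %d %d %d %d\n", outLen[0], outLen[1], outLen[2], outLen[3]);
        return 0;
    }
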
  10. Data Compression - Cooperative SIMT
      • One block per core + cooperative operation
        ◦ Highest performance
        ◦ Heavy communication within warp

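A sketch of what cooperative operation inside a warp can look like; this is an assumption about the approach, not the talk's code. All 32 lanes probe candidate dictionary positions in parallel, then exchange results through warp shuffles, which is the heavy intra-warp communication the slide mentions: the serial match search becomes a cooperative one.

    #include <cstdio>
    #include <cstring>
    #include <cuda_runtime.h>

    // Warp-cooperative longest-match search. For simplicity, matches are
    // confined to the dictionary (no overlapping copies).
    __device__ int warp_best_match(const unsigned char* dict, int dictLen,
                                   const unsigned char* next, int maxLen,
                                   int* bestStart) {
        int lane = threadIdx.x & 31;
        int myLen = 0, myStart = 0;
        // Stride candidate start positions across the 32 lanes.
        for (int start = lane; start < dictLen; start += 32) {
            int len = 0;
            while (len < maxLen && start + len < dictLen &&
                   dict[start + len] == next[len]) ++len;
            if (len > myLen) { myLen = len; myStart = start; }
        }
        // Shuffle-based reduction: 5 steps funnel the longest match to lane 0.
        for (int d = 16; d > 0; d >>= 1) {
            int oLen   = __shfl_down_sync(0xffffffffu, myLen, d);
            int oStart = __shfl_down_sync(0xffffffffu, myStart, d);
            if (oLen > myLen) { myLen = oLen; myStart = oStart; }
        }
        // Broadcast lane 0's winner so every lane agrees on the result.
        myLen   = __shfl_sync(0xffffffffu, myLen, 0);
        myStart = __shfl_sync(0xffffffffu, myStart, 0);
        *bestStart = myStart;
        return myLen;
    }

    __global__ void demo(const unsigned char* buf, int pos, int* result) {
        int start;
        // Dictionary = everything before pos; lookahead = bytes at pos.
        int len = warp_best_match(buf, pos, buf + pos, 3, &start);
        if (threadIdx.x == 0) { result[0] = start; result[1] = len; }
    }

    int main() {
        unsigned char h[7] = {'A','B','C','D','A','B','C'};
        unsigned char* buf; int* res;
        cudaMallocManaged(&buf, 7); cudaMallocManaged(&res, 2 * sizeof(int));
        memcpy(buf, h, 7);
        demo<<<1, 32>>>(buf, 4, res);  // match "ABC" against dictionary "ABCD"
        cudaDeviceSynchronize();
        printf("best match: start %d, len %d\n", res[0], res[1]);  // start 0, len 3
        return 0;
    }
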
  11. Summary
      • Accelerators are great for embarrassingly parallel workloads
      • Accelerators can significantly improve performance even for workloads that might not initially look amenable to parallelisation
      • Going forward, Wave will provide a common framework for collaboration
      • Keep an eye on GitHub for future developments