
Velox: Offloading work to accelerators

Velox pipeline performance is limited by the available compute. The talk gives an overview of common bottlenecks observed in benchmarks such as TPC-H, describes the kinds of parallelism accelerators are capable of, and surveys ways an accelerator can be applied to help, as well as some of the challenges in this space.

Sergei Lewis
Principal Member of Technical Staff at Rivos

April 05, 2024

Transcript

  1. Overview
     • Common CPU bottlenecks
     • SIMT introduction
     • SIMT considerations
     • Design space exploration
     (figure: NVIDIA B100 HBM3e, Graviton 5 DDR5-5600, AMD Genoa DDR5-4800, AMD MI300X HBM3)

  2. (image-only slide)

  3. Key Bottlenecks
     Data movement (shuffle, exchange, scan)
     • Decode/encode (parquet, orc, …)
     • Compress/decompress
     Partitioning/hashing (Join, Aggregation)
     Calculated columns (Project, Aggregation)

  4. SIMT Accelerator Overview
     • Warps of multiple (32) threads
       ◦ Executing in lockstep per core
     • Warp schedulers and large register banks
       ◦ For rapid context switching
     • Shared memory, dedicated I/O
     • Many cores
     • High bandwidth memory
     • Access: device / host / unified memory
     (diagram: CORE / ACCELERATOR)

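To make the hierarchy concrete, here is a minimal CUDA sketch (not from the talk): each thread block maps onto a core, threads execute in warps of 32, and cudaMallocManaged illustrates the unified-memory access mode. Names and sizes are illustrative.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each thread block is scheduled onto one core (SM); its threads
    // execute in warps of 32, in lockstep.
    __global__ void scale(float* data, float factor, int n) {
        // Global index: blockIdx selects the block, threadIdx the lane.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;  // all 32 lanes of a warp issue this together
    }

    int main() {
        const int n = 1 << 20;
        float* data = nullptr;
        cudaMallocManaged(&data, n * sizeof(float));  // unified memory: visible to host and device
        for (int i = 0; i < n; ++i) data[i] = 1.0f;

        int threadsPerBlock = 256;  // 8 warps per block (illustrative)
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scale<<<blocks, threadsPerBlock>>>(data, 2.0f, n);
        cudaDeviceSynchronize();

        printf("data[0] = %.1f\n", data[0]);  // prints 2.0
        cudaFree(data);
        return 0;
    }
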
  5. Challenges with SIMT Accelerators
     Data Movement and Communication
     • Within warps
     • Between warps
     • Cross-block / cross-grid (e.g. atomics)
     Thread Scheduling
     • Control flow divergence
     Host Control
     • Kernel scheduling
     • Data flow between host and device

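The control flow divergence point is worth an example. The following CUDA sketch is illustrative, not from the talk: in divergent, lanes within one warp branch differently and the two paths execute serially; in uniform, the condition is constant across each warp, so nothing is serialized.

    #include <cuda_runtime.h>

    // Divergent version: lanes within one warp take different branches,
    // so the hardware serializes the two paths while the inactive
    // half-warp idles.
    __global__ void divergent(float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if ((i % 32) < 16)
            out[i] = sinf((float)i);  // first half-warp runs this...
        else
            out[i] = cosf((float)i);  // ...then the second half-warp runs this
    }

    // Divergence-free variant: the condition is uniform across each warp
    // (all 32 lanes share one warp index), so neither path is serialized.
    __global__ void uniform(float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if ((i / 32) % 2 == 0)
            out[i] = sinf((float)i);
        else
            out[i] = cosf((float)i);
    }

    int main() {
        const int n = 1024;
        float* out = nullptr;
        cudaMallocManaged(&out, n * sizeof(float));
        divergent<<<n / 256, 256>>>(out, n);  // serialized half-warps
        uniform<<<n / 256, 256>>>(out, n);    // uniform branching, no serialization
        cudaDeviceSynchronize();
        cudaFree(out);
        return 0;
    }
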
  6. Data Compression - Design Space
     LZ77/LZSS algorithm family
     • Check dictionary for next input bytes
     • Encode copy or literal instruction
     • Update dictionary
     Performance vs compression ratio
     • Snappy, LZ4, …
     • Deflate, zstd, …
     Example: ABCDABC encodes as (literal, 4, ABCD) (copy, 4, 3)

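As a concrete illustration of that loop, here is a minimal host-side greedy LZSS encoder. It is a sketch, not the talk's implementation; the token format and the minimum match length of 3 are assumptions. On the slide's example it prints exactly (literal, 4, ABCD) (copy, 4, 3).

    #include <cstdio>

    // Greedy LZSS-style encoding sketch: scan the input, check the
    // dictionary (the bytes consumed so far) for the longest match,
    // and emit either a copy instruction or a run of literals.
    void lzss_encode(const char* in, int n) {
        const int MIN_MATCH = 3;  // assumed threshold for emitting a copy
        int pos = 0, litStart = 0;
        while (pos < n) {
            // Find the longest dictionary match starting at pos.
            int bestLen = 0, bestDist = 0;
            for (int start = 0; start < pos; ++start) {
                int len = 0;
                while (pos + len < n && in[start + len] == in[pos + len]) ++len;
                if (len > bestLen) { bestLen = len; bestDist = pos - start; }
            }
            if (bestLen >= MIN_MATCH) {
                // Flush pending literals, then encode a copy (distance, length).
                if (litStart < pos)
                    printf("(literal, %d, %.*s) ", pos - litStart, pos - litStart, in + litStart);
                printf("(copy, %d, %d) ", bestDist, bestLen);
                pos += bestLen;
                litStart = pos;
            } else {
                ++pos;  // extend the pending literal run
            }
        }
        if (litStart < pos)
            printf("(literal, %d, %.*s)", pos - litStart, pos - litStart, in + litStart);
        printf("\n");
    }

    int main() {
        lzss_encode("ABCDABC", 7);  // prints: (literal, 4, ABCD) (copy, 4, 3)
        return 0;
    }
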
  7. Data Compression - Parallelization
     Data-dependent output - how to parallelize?
     • Break input into blocks
     • Compress independently
     • Gather, Stream
     (diagram: independently compressed blocks meeting at a SYNC point)

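A host-side sketch of that framing follows. The compress_block stand-in is hypothetical (a real implementation would run LZSS, Snappy, etc. per block); the point is that each block is independent, so the loop body can be farmed out to accelerator cores and the variable-size outputs gathered in order afterwards.

    #include <cstdio>
    #include <vector>

    struct Chunk { std::vector<unsigned char> bytes; };

    // Hypothetical per-block compressor. Here it just stores the block
    // uncompressed so the sketch stays self-contained.
    Chunk compress_block(const unsigned char* data, size_t n) {
        return Chunk{std::vector<unsigned char>(data, data + n)};
    }

    std::vector<Chunk> compress_blocks(const unsigned char* in, size_t n,
                                       size_t blockSize) {
        std::vector<Chunk> out;
        // Every iteration is independent: blocks can be compressed in
        // parallel and synchronized only at the final gather step.
        for (size_t off = 0; off < n; off += blockSize) {
            size_t len = (n - off < blockSize) ? n - off : blockSize;
            out.push_back(compress_block(in + off, len));
        }
        return out;
    }

    int main() {
        unsigned char data[10] = {1,2,3,4,5,6,7,8,9,10};
        auto chunks = compress_blocks(data, sizeof data, 4);  // 3 blocks: 4+4+2
        printf("%zu chunks\n", chunks.size());
        return 0;
    }
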
  8. Data Compression - One Block / Core
     One block per core - trade-offs:
     • Trivial implementation
     • Decent performance
     • High resource consumption, low utilization

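An illustrative kernel for this layout (not from the talk; run-length encoding stands in for a real LZSS kernel): each CUDA thread block owns one compression block, but only thread 0 does any work. The idle lanes are exactly the low utilization the slide calls out.

    #include <cstdio>
    #include <cstring>
    #include <cuda_runtime.h>

    // One compression block per core: trivial to write, reasonably fast,
    // but 31 of every 32 lanes sit idle.
    __global__ void rle_one_block_per_core(const unsigned char* in,
                                           const int* offsets,  // numBlocks + 1 entries
                                           unsigned char* out,  // 2x input, worst case
                                           int* outLen) {
        if (threadIdx.x != 0) return;  // single worker thread per core
        int begin = offsets[blockIdx.x];
        int end   = offsets[blockIdx.x + 1];
        int w     = 2 * begin;         // this block's private output region
        for (int i = begin; i < end; ) {
            unsigned char b = in[i];
            int run = 1;
            while (i + run < end && in[i + run] == b && run < 255) ++run;
            out[w++] = (unsigned char)run;  // emit (count, byte) pairs
            out[w++] = b;
            i += run;
        }
        outLen[blockIdx.x] = w - 2 * begin;  // gathered by the host afterwards
    }

    int main() {
        const int n = 8, numBlocks = 2;
        unsigned char h_in[n] = {7,7,7,7,5,5,9,9};
        int h_off[numBlocks + 1] = {0, 4, 8};
        unsigned char *in, *out; int *off, *outLen;
        cudaMallocManaged(&in, n); cudaMallocManaged(&out, 2 * n);
        cudaMallocManaged(&off, sizeof h_off);
        cudaMallocManaged(&outLen, numBlocks * sizeof(int));
        memcpy(in, h_in, n); memcpy(off, h_off, sizeof h_off);
        rle_one_block_per_core<<<numBlocks, 32>>>(in, off, out, outLen);
        cudaDeviceSynchronize();
        printf("block 0: %d bytes, block 1: %d bytes\n", outLen[0], outLen[1]);  // 2, 4
        return 0;
    }
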
  9. Data Compression - One Block / Thread
     One block per thread, state machine, reconvergence
     • High complexity
     • Modest resource requirements
     • Modest synchronization requirements
     • Modest performance gains
     (flowchart: Check for match → Match? → Encode instruction → Done?, with SYNC points between steps)

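An illustrative version of the state-machine pattern (again with RLE standing in for LZSS): every thread compresses its own block, but all threads of a warp step through the same SCAN/EMIT states and reconverge each iteration instead of diverging for the whole loop. The buffer layout and launch shape are assumptions.

    #include <cstdio>
    #include <cstring>
    #include <cuda_runtime.h>

    enum State { SCAN, EMIT, DONE };

    // Assumes full warps: launch a multiple of 32 threads, with offsets
    // sized to match (empty blocks go straight to DONE).
    __global__ void rle_one_block_per_thread(const unsigned char* in,
                                             const int* offsets,  // numThreads + 1 entries
                                             unsigned char* out,
                                             int* outLen) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int pos = offsets[tid], end = offsets[tid + 1];
        int w   = 2 * offsets[tid];
        int run = 0;
        State s = (pos < end) ? SCAN : DONE;

        // Finished threads keep voting until the whole warp is done, so
        // the step structure stays uniform across the warp.
        while (__any_sync(0xffffffffu, s != DONE)) {
            switch (s) {
            case SCAN:  // measure the run of identical bytes at pos
                run = 1;
                while (pos + run < end && in[pos + run] == in[pos] && run < 255) ++run;
                s = EMIT;
                break;
            case EMIT:  // emit a (count, byte) pair and advance
                out[w++] = (unsigned char)run;
                out[w++] = in[pos];
                pos += run;
                s = (pos < end) ? SCAN : DONE;
                break;
            case DONE:
                break;
            }
        }
        outLen[tid] = w - 2 * offsets[tid];
    }

    int main() {
        const int nThreads = 32, n = 8;
        unsigned char h_in[n] = {7,7,5,5,5,9,9,9};
        unsigned char *in, *out; int *off, *outLen;
        cudaMallocManaged(&in, n); cudaMallocManaged(&out, 2 * n);
        cudaMallocManaged(&off, (nThreads + 1) * sizeof(int));
        cudaMallocManaged(&outLen, nThreads * sizeof(int));
        memcpy(in, h_in, n);
        for (int t = 0; t <= nThreads; ++t) off[t] = (t < 4) ? 2 * t : n;  // 4 blocks, rest empty
        rle_one_block_per_thread<<<1, nThreads>>>(in, off, out, outLen);
        cudaDeviceSynchronize();
        printf("encoded sizes: %d %d %d %d\n", outLen[0], outLen[1], outLen[2], outLen[3]);
        return 0;
    }
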
  10. Data Compression - Cooperative SIMT
      • One block per core + cooperative operation
        ◦ Highest performance
        ◦ Heavy communication within warp

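A sketch of what cooperative operation inside a warp can look like; this is an assumption about the approach, not the talk's code. All 32 lanes probe candidate dictionary positions in parallel, then exchange results through warp shuffles, which is the heavy intra-warp communication the slide mentions: the serial match search becomes a cooperative one.

    #include <cstdio>
    #include <cstring>
    #include <cuda_runtime.h>

    // Warp-cooperative longest-match search. For simplicity, matches are
    // confined to the dictionary (no overlapping copies).
    __device__ int warp_best_match(const unsigned char* dict, int dictLen,
                                   const unsigned char* next, int maxLen,
                                   int* bestStart) {
        int lane = threadIdx.x & 31;
        int myLen = 0, myStart = 0;
        // Stride candidate start positions across the 32 lanes.
        for (int start = lane; start < dictLen; start += 32) {
            int len = 0;
            while (len < maxLen && start + len < dictLen &&
                   dict[start + len] == next[len]) ++len;
            if (len > myLen) { myLen = len; myStart = start; }
        }
        // Shuffle-based reduction: 5 steps funnel the longest match to lane 0.
        for (int d = 16; d > 0; d >>= 1) {
            int oLen   = __shfl_down_sync(0xffffffffu, myLen, d);
            int oStart = __shfl_down_sync(0xffffffffu, myStart, d);
            if (oLen > myLen) { myLen = oLen; myStart = oStart; }
        }
        // Broadcast lane 0's winner so every lane agrees on the result.
        myLen   = __shfl_sync(0xffffffffu, myLen, 0);
        myStart = __shfl_sync(0xffffffffu, myStart, 0);
        *bestStart = myStart;
        return myLen;
    }

    __global__ void demo(const unsigned char* buf, int pos, int* result) {
        int start;
        // Dictionary = everything before pos; lookahead = bytes at pos.
        int len = warp_best_match(buf, pos, buf + pos, 3, &start);
        if (threadIdx.x == 0) { result[0] = start; result[1] = len; }
    }

    int main() {
        unsigned char h[7] = {'A','B','C','D','A','B','C'};
        unsigned char* buf; int* res;
        cudaMallocManaged(&buf, 7); cudaMallocManaged(&res, 2 * sizeof(int));
        memcpy(buf, h, 7);
        demo<<<1, 32>>>(buf, 4, res);  // match "ABC" against dictionary "ABCD"
        cudaDeviceSynchronize();
        printf("best match: start %d, len %d\n", res[0], res[1]);  // start 0, len 3
        return 0;
    }
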
  11. Summary
      • Accelerators are great for embarrassingly parallel workloads
      • Accelerators can significantly improve performance even for workloads that might not initially look amenable to parallelisation
      • Going forward, Wave will provide a common framework for collaboration
      • Keep an eye on GitHub for future developments