August 17, 2025

Speeding up Intel Gaudi deep-learning accelerators using an MLIR-based compiler

Middle-end optimizations play a critical role in generating high-performance code for deep learning accelerators. In this talk, we will present an MLIR-based fusing compiler that generates optimized LLVM IR from high-level graph IR, which is then compiled by an LLVM backend for execution on the tensor processing cores of the Intel Gaudi deep learning (DL) accelerator. This compiler has been in use for the past three generations of Gaudi products and provides around 54% average performance improvement at the model level. The talk will cover the lowering pipeline, how we leverage upstream MLIR dialects, and some key optimizations and learnings from compiling deep learning workloads to Gaudi.

Authors: Dafna Mordechai, Omer Paparo Bivas, Jayaram Bobba, Sergei Grechanik, Tzachi Cohen, Dibyendu Das


Transcript

  1. Core C++ 2024: Speeding Up Intel Gaudi Deep-Learning Accelerators Using an MLIR-Based Compiler. Dafna Mordechai, Omer Paparo Bivas
  2. Agenda
    - Deep Learning Compilers: Transforming a Large Computational Graph into an Optimized Execution Plan
    - The TPC Fuser: A JIT Compiler for Deep Learning Kernels That Delivers Significant Performance Improvements
    - Case Study: Adjusting the TPC Fuser to Recent LLM Challenges
  3. Intel Gaudi 3 AI Accelerator: Spec and Block Diagram
    Intel® Gaudi® 3 Accelerator:
    - BF16 Matrix TFLOPs: 1835
    - FP8 Matrix TFLOPs: 1835
    - BF16 Vector TFLOPs: 28.7
    - MME Units: 8
    - TPC Units: 64
    - HBM Capacity: 128 GB
    - HBM Bandwidth: 3.67 TB/s
    - On-die SRAM Capacity: 96 MB
    - On-die SRAM Bandwidth RD+WR (L2 Cache): 19.2 TB/s
    - Networking: 1200 GB/s bidirectional
    - Host Interface: PCIe Gen5 x16
    - Host Interface Peak BW: 128 GB/s bidirectional
    - Media Engine: Rotator + 14 Decoders (HEVC, H.264, JPEG, VP9)
    [Block diagram: 8 MME units, 64 TPCs in four clusters of 16, two 48 MB SRAM blocks, 8 HBM PHYs, 2x 12x 200 GbE, two x8 PCIe Gen5 interfaces, Media Engine]
  4. MME: Matrix Multiplication Engine
    - Configurable, not programmable
    - Each MME is a large output-stationary systolic array:
      - 256x256 MAC structure with FP32 accumulators
      - 64K MACs/cycle for BF16 and FP8
    - The large systolic array reduces intra-chip data movement, increasing efficiency
    - Internal pipeline to maximize compute throughput
    [Diagram: 256 x 256 MME fed by 256 input elements on each side]
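    To make "output stationary" concrete, here is a minimal scalar C++ sketch of the dataflow (an illustration only, not the MME microarchitecture): each output element owns an FP32 accumulator that stays in place while the inputs stream past it. The bf16 inputs are modeled as float for readability.

```cpp
#include <vector>

using bf16 = float; // stand-in type; the hardware uses 16-bit brain floats

// Output-stationary dataflow: each C[i][j] owns an FP32 accumulator that
// stays in place while the A and B inputs stream through it.
void matmul_output_stationary(const std::vector<bf16>& A,  // M x K
                              const std::vector<bf16>& B,  // K x N
                              std::vector<float>& C,       // M x N accumulators
                              int M, int K, int N) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;                 // the "stationary" accumulator
            for (int k = 0; k < K; ++k)       // inputs stream past it
                acc += float(A[i * K + k]) * float(B[k * N + j]);
            C[i * N + j] = acc;
        }
}
```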
  5. Tensor Processor Core (TPC): 256B-Wide SIMD Vector Processor
    - Fully programmable using TPC-C: C enhanced with TPC intrinsics
    - VLIW with 4 separate pipeline slots: Vector, Scalar, Load and Store
    - Integrated Address Generation Unit (AGU) for HW-accelerated address generation
    - Supports the main 1/2/4-byte datatypes: floating point and integer
    [Block diagram: 256B SIMD vector unit, scalar register file, vector local memory (VLM), load/store units with AGUs, connected to HBM/L3$/L2$]
  6. Layered View of the Intel® Gaudi® Software Suite
    [Stack diagram; components are marked as either proprietary or ecosystem integration plugins:]
    - DeepSpeed Integration, LLM Serving Integration, Quantization Integration, Quantization Toolkit, PyTorch Integration
    - Graph Compiler, Collective Communication Library (CCL), TPC Fuser, Custom User TPC Kernels, Optimized TPC Kernel Library
    - User-Mode Driver / Run-Time Environment
    - Compute Driver, Network Driver
  7. Compilers 101
    - Compiler(Source Code) --> Machine Code
    - Compiler == (semantics-preserving) Translations + Transformations
    - LLVM: Low Level Virtual Machine
    - Clang: LLVM's C/C++ front end
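    As a concrete instance of a semantics-preserving transformation, Clang/LLVM at -O2 will typically rewrite the loop below into a closed form; the flags shown are standard Clang options for inspecting the IR that the front end hands to the LLVM middle end.

```cpp
// sum.cpp -- the observable behavior is preserved while the loop disappears:
// at -O2, LLVM's scalar-evolution analysis typically folds this loop into
// the closed form n * (n - 1) / 2 and deletes it.
long sum_below(long n) {
    long s = 0;
    for (long i = 0; i < n; ++i)
        s += i;
    return s;
}

// Inspect the optimized LLVM IR:
//   clang++ -O2 -S -emit-llvm sum.cpp -o sum.ll
```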
  8. Deep Learning Compilation 101
    [Figure: the original ResNet-18 architecture. "Up to 152 layers were trained in the original publication (as 'ResNet-152')" - Wikipedia. Example kernel label: tanh_f32]
  9. Graph Compilation Flow: Transforming a Deep Learning Computational Graph into an Intel Gaudi Execution Plan
    [Figure: the ResNet-18 graph, as before, mapped onto the hardware. Neural network hardware mapping: use of MME and TPC]
  10. Layered View of the Intel® Gaudi® Software Suite: TPC Kernel Library Providers
    [Same stack diagram as slide 6, highlighting the kernel libraries]
    - Pre-Compiled Library (TPC-C): the Intel Gaudi optimized TPC kernel library and custom user kernels
    - JIT Library: auto-generated fused kernels, compiled during graph compilation using the MLIR-based JIT compiler
    - All kernels are compiled using the Clang-based TPC Compiler
  11. Graph Compilation Flow
    [Figure: the ResNet-18 graph, as before]
    - Processes the deep learning topology to allocate operations across the MME, TPC, and DMA engines:
      - Generates MME configurations
      - Selects kernels from the different kernel library providers
    - Optimizes
    - Schedules operations while accounting for memory constraints and dependencies
    - Configures hardware registers and system settings based on the execution plan (recipe)
  12. Graph Compiler: Slicing + Bundling
    The Graph Compiler partitions data (slicing) and groups operations (bundling) so that execution overlaps between the MME and TPC units. This optimization maximizes the use of SRAM and local caches, improving data-transfer efficiency.
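    A hypothetical sketch of the schedule this enables (the function names are illustrative, not the Graph Compiler's API): slices small enough to live in SRAM are software-pipelined so that the MME produces slice s while the TPC consumes slice s - 1.

```cpp
#include <cstdio>

// Illustrative stubs; on the device these would be command submissions to
// the respective engines.
void mme_matmul_slice(int s)     { std::printf("MME: matmul on slice %d\n", s); }
void tpc_activation_slice(int s) { std::printf("TPC: activation on slice %d\n", s); }

constexpr int kNumSlices = 4;

// Software-pipelined schedule: in step s the MME produces slice s while the
// TPC consumes slice s - 1, so the two engines overlap instead of serializing.
int main() {
    for (int s = 0; s <= kNumSlices; ++s) {
        if (s < kNumSlices) mme_matmul_slice(s);
        if (s > 0)          tpc_activation_slice(s - 1);
    }
}
```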
  13. JIT Compiler: Key Benefits and Advantages
    The TPC Fuser supports element-wise operations, reductions, and normalizations. It:
    - Spares HBM bandwidth
    - Enhances cache and local-memory efficiency
    - Spares kernel-to-kernel invocation latency
    - Applies shape-based optimizations
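    A CPU-side analogy of what fusion buys (hypothetical C++ code, not TPC-C): unfused, the intermediate tensor makes a round trip through memory (HBM on the device) and a second kernel is invoked; fused, the intermediate stays in registers and there is a single invocation.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Unfused: two kernels; the intermediate tensor t round-trips through memory
// and we pay two kernel invocations.
void tanh_then_scale_unfused(std::vector<float>& x, float k) {
    std::vector<float> t(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) t[i] = std::tanh(x[i]); // kernel 1
    for (std::size_t i = 0; i < x.size(); ++i) x[i] = k * t[i];        // kernel 2
}

// Fused: one kernel; the intermediate value never leaves registers.
void tanh_then_scale_fused(std::vector<float>& x, float k) {
    for (std::size_t i = 0; i < x.size(); ++i)
        x[i] = k * std::tanh(x[i]);
}
```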
  14. TPC Fuser Performance Improvements
    [Two bar charts. Left: end-to-end model execution improvements, roughly -20% to +60%, for models including T5-LARGE-HF, Clip-RoBERTa, Falcon-180B, YOLOv5, ViT-HF, LLaMA2-7B-MDS, MIXTRAL, FLAN-T5-XXL, GPTJ-CLM-HF, MPT-1B, ALBERT-XXL-HF, Transformer 8K/16K, Swin-T-HF, BERT-L-NV FT, Bert-Base-HF, DistilBERT-HF, BRIDGETOWER, GPT2-XL-HF, RoBERTa, and BERT-L-HF FT. Right: device execution times across roughly 97 workloads, improvements roughly -20% to +160%.]
  15. Vectorization
    The compiler converts independent loop iterations into SIMD instructions, based on the specific hardware vector width.
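    A minimal C++ sketch of the idea (scalar code standing in for TPC vector intrinsics): with 4-byte floats, the TPC's 256-byte vectors hold 64 lanes, so the assumed width below is W = 64.

```cpp
constexpr int W = 64; // assumed lanes: 256-byte vector / 4-byte float

void add_scalar(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

void add_vectorized(const float* a, const float* b, float* c, int n) {
    int i = 0;
    for (; i + W <= n; i += W)
        for (int lane = 0; lane < W; ++lane)   // one SIMD add in hardware
            c[i + lane] = a[i + lane] + b[i + lane];
    for (; i < n; ++i) c[i] = a[i] + b[i];     // scalar remainder loop
}
```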
  16. Loop Unroll
    The compiler combines operations from consecutive iterations and merges them into a single iteration (here, unroll_factor = 2).
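    A minimal sketch, assuming a simple scaling loop: with unroll_factor = 2, two consecutive iterations are merged into one body, halving loop-control overhead and exposing more independent work to the instruction scheduler.

```cpp
void scale(float* x, int n, float k) {
    for (int i = 0; i < n; ++i) x[i] *= k;
}

void scale_unrolled(float* x, int n, float k) {
    int i = 0;
    for (; i + 2 <= n; i += 2) {   // body of two original iterations
        x[i]     *= k;
        x[i + 1] *= k;
    }
    if (i < n) x[i] *= k;          // remainder iteration when n is odd
}
```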
  17. Parallelization
    The compiler transforms the constant bounds of a loop into variables, enabling scalable parallel execution across multiple processing units.
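    A minimal sketch using std::thread as a stand-in for the accelerator's processing units: the constant bounds (0, n) become per-worker variables (begin, end), so the same kernel body can be instantiated on any number of units.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Same kernel body, but with variable bounds instead of constants.
void kernel_range(float* x, int begin, int end, float k) {
    for (int i = begin; i < end; ++i) x[i] *= k;
}

// std::thread stands in for the accelerator's processing units here.
void scale_parallel(float* x, int n, float k, int num_units) {
    std::vector<std::thread> workers;
    const int chunk = (n + num_units - 1) / num_units;  // ceil(n / num_units)
    for (int u = 0; u < num_units; ++u) {
        const int begin = u * chunk;
        const int end   = std::min(n, begin + chunk);
        if (begin < end) workers.emplace_back(kernel_range, x, begin, end, k);
    }
    for (auto& w : workers) w.join();
}
```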
  18. Triangular Data Access
    The introduction of this new data-access pattern brought new challenges:
    - Correctness challenges
    - Performance challenges
    - Operation-fusion challenges
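    The slide does not name the operation, but the canonical triangular pattern in LLM workloads is the causal attention mask; a hypothetical sketch follows. The inner trip count depends on the outer index, which breaks the rectangular-iteration-space assumptions behind vectorization, unrolling, and fusion, hence the challenges above.

```cpp
// Triangular access: only the upper triangle (j > i) is touched, masking
// out "future" positions in a causal attention score matrix.
void apply_causal_mask(float* scores, int n) {
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)   // trip count depends on i
            scores[i * n + j] = -1e30f;   // effectively -infinity pre-softmax
}
```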
  19. Key Takeaways
    - The TPC Fuser is a JIT compiler for deep learning kernels
    - It is deployed as part of the Gaudi Synapse software stack
    - It delivers significant performance improvements
    - It works in tandem with the Graph Compiler to optimize execution across the entire accelerator