An Overview of the MLIR-Based Compilers TT-Forge/TT-MLIR

Tenstorrent Tech Talk #4
An Overview of the MLIR-Based Compilers TT-Forge/TT-MLIR
Oct 24 2025
Kohei Yamaguchi / Tenstorrent Japan @sott0n

Tenstorrent OSS Stack
• Low level library/Kernel
• DNN Library / Python API
• DNN Compiler

Tenstorrent OSS Stack
• Low level library/Kernel
• DNN Library / Python API
• DNN Compiler (the theme of this session)

TT-Forge: MLIR based compiler
[Diagram labels: ANY AI MODEL; OPEN SOURCE; AI/ML + HPC Developers; Model Developers; Direct to metal; OPTIMIZED TTNN C++; BUILD ANYTHING; LIBRARY OF OPS; TTNN C++ Code; HPC workloads]
TT-Forge is an MLIR-based compiler developed by Tenstorrent that compiles computation graphs from various ML frameworks into TT-Metal kernels.
TT-Forge is an umbrella name for several OSS projects, which fall into the following components:
• Front-end: tt-forge-fe, tt-xla, tt-torch
• Back-end: tt-mlir

MLIR Overview
• A compiler framework maintained in the LLVM Project
• A compiler stack is built by implementing domain-specific intermediate representations (called Dialects) and performing lowering and transforms between Dialects
• MLIR itself ships Dialects for many domains, and these can be reused
  • E.g. Vector/Tensor dialect, SCF dialect, Arith dialect, Affine dialect, Linalg dialect, Sparse Tensor dialect, etc.
• Dialect implementation and common facilities
  • Each Dialect has its own IR definitions: Operation, Type, Attribute
  • Optimization passes implemented as Transforms
  • Conversion between Dialects
  • Interfaces and Traits control properties and behavior shared across IRs
  • etc.
• Many MLIR-based compilers exist, and integration between Dialects is relatively easy
  • E.g. OpenXLA, IREE, Mojo, Triton compiler, ONNX-MLIR, etc.
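
To make the idea of "conversion between Dialects" concrete, here is a minimal toy sketch in Python. It is not MLIR's API: the `Op` class, the `high`/`low` dialect names, and the `CONVERSIONS` table are all hypothetical and only illustrate how a lowering rewrites ops that have a pattern and leaves the rest alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """A toy operation named 'dialect.op' with a tuple of operand names."""
    name: str
    operands: tuple = ()

    @property
    def dialect(self) -> str:
        return self.name.split(".", 1)[0]


# Hypothetical conversion patterns: source op name -> rewrite function.
CONVERSIONS = {
    "high.add": lambda op: Op("low.add", op.operands),
    # relu(x) is expressed in the lower dialect as max(x, zero).
    "high.relu": lambda op: Op("low.max", op.operands + ("zero",)),
}


def lower(ops):
    """Rewrite every op that has a pattern; keep all other ops unchanged."""
    return [CONVERSIONS.get(op.name, lambda o: o)(op) for op in ops]


graph = [
    Op("high.add", ("a", "b")),
    Op("high.relu", ("t0",)),
    Op("func.return", ("t1",)),
]
lowered = lower(graph)
print([op.name for op in lowered])
```

Ops from a dialect with no registered pattern (here `func.return`) pass through untouched, which is roughly why dialects from different projects can coexist in one module.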

Sample MLIR Graph

MLIR: seamless integration with open-source frameworks
Compilation from the various frameworks to TT-MLIR runs seamlessly thanks to the MLIR ecosystem (Torch-MLIR, OpenXLA, etc.).

From TT-BUDA to TT-Forge (from 2024)
• Extensive model support
• Out of the box performance
• Multiple framework support
• Everything from TT-Buda +
• Designed for extensibility and integrations
• Native integration with Tenstorrent's entire AI stack (TT-NN, TT-Metalium, etc.)
• More advanced tooling, visualizers, profilers, code generation, and more!

What is TT-Forge
• Generality: TT-Forge supports multiple ML frameworks and MLIR dialects, ensuring broad compatibility and flexibility across diverse AI workloads, with the ability to expand to future frameworks.
• Performance: TT-Forge provides optimized compilation and execution and, with our custom dialects, maximizes inference and training performance on Tenstorrent's hardware. We also enable further manual tuning with our GUI tools to squeeze the last few drops of performance out of your workloads.
• Tooling: TT-Forge's toolchain streamlines ML model compilation, optimization, and execution. It includes MLIR-based compilation and runtime inspection, memory and compute profilers, and a graph visualizer; these tools enable efficient development, debugging, and performance tuning on Tenstorrent hardware.

Frontend compiler: Support frameworks
• TT-XLA
  • Support: JAX, TensorFlow, PyTorch
• TT-Torch
  • Support: PyTorch, ONNX
• TT-Forge-FE
  • Support: PyTorch, ONNX, JAX, PaddlePaddle, TensorFlow, TF Lite

Backend compiler: TT-MLIR
• A compiler implemented in MLIR that targets Tenstorrent devices
  • https://github.com/tenstorrent/tt-mlir
• Implements optimization passes targeting TT devices; the IR is ultimately converted to TT-Metal/TTNN operations, and the compiler emits a binary that runs on the TT-Metal/TTNN runtime
• Also uses upstream MLIR Dialects: SCF, Arith, Affine, Linalg, Tensor, etc.
[Diagram labels: Frontend (tt-torch, tt-xla, tt-forge-fe); Backend (StableHLO, TTIR, TTNN; shlo2ttir pipeline, ttir2ttnn pipeline, ttnn2flatbuffer pipeline); Runtime]
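
The three pipeline names in the diagram suggest a chain of IR levels. The sketch below models that chain as plain data and works out which pipelines are needed between two levels. The pipeline names come from the slide; the input and output level of each one is inferred from its name, and the `plan` helper is hypothetical rather than a tt-mlir API.

```python
# (pipeline name, IR level it consumes, IR level it produces).
# The levels are inferred from the pipeline names and are an assumption.
PIPELINES = [
    ("shlo2ttir", "StableHLO", "TTIR"),
    ("ttir2ttnn", "TTIR", "TTNN"),
    ("ttnn2flatbuffer", "TTNN", "flatbuffer"),
]


def plan(source: str, target: str) -> list:
    """Return the pipeline names needed to get from `source` to `target`."""
    steps, level = [], source
    for name, consumes, produces in PIPELINES:
        if level == target:
            break
        if consumes == level:
            steps.append(name)
            level = produces
    if level != target:
        raise ValueError(f"no pipeline path from {source} to {target}")
    return steps


print(plan("StableHLO", "flatbuffer"))
print(plan("TTIR", "TTNN"))
```

A front end that already emits TTIR would simply start the plan one level further down than one that emits StableHLO.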

Backend compiler: TT-MLIR dialects
• TT-Core dialect
  • The base Dialect for all TT Dialects; defines the shared Types and Attributes
• TT-IR dialect
  • A high-level IR targeting TT devices
  • Front ends convert into this TTIR dialect
• TT-NN dialect
  • The IR that corresponds to TTNN. DNN operations are converted into TTNN dialect operations and executed by the runtime
• TT-Metal dialect
  • Operation dispatch to the device through the TT-Metal runtime
  • There is a lowering pipeline to TT-Metal that lowers from TT-D2M and emits a Flatbuffer
• TT-D2M dialect
  • The "Dialect to Metal" IR. Lowers from MLIR dialects and TT-Core to TT-Metal
• TT-Kernel dialect
  • The IR that defines TT-Metal kernel operations
  • Covers TT-Metal operations from tile and CB operations down to SFPU/FPU instructions
• TT-SFPI dialect
  • The IR for the SFPU Programming Interface
  • 1:1 support for SFPU (vector engine) instructions
• TT-EmitPy dialect
  • The IR and passes for generating Python code from MLIR
  • By using MLIR's EmitC dialect together with EmitPy, tt-mlir can emit both C++ and Python code
[Group labels on the slide: Common dialects; DNN Operations dialect; Generation Code dialect; Kernel/SFPI dialects]

TT-Forge Pipelines Overview
1. ML Framework: JAX, ONNX, PyTorch, TensorFlow
2. Front Ends: TT-XLA, TT-Torch, TT-Forge-FE
3. TT-MLIR Compiler: StableHLO → TTIR → Graph passes → TT-Kernel IR or TT-NN IR or TT-Metal IR
4. TT-Metalium: TTNN/TT-Metal Runtime
5. TT-LLK
6. System Software (UMD/KMD)
7. Hardware (Wormhole, Blackhole)

TT-MLIR Optimizer
The mechanism in TT-MLIR that mainly performs the following optimizations:
• Op Fusing
• Mixed Precision
• Optimizing Tensor Memory Layout
• Selecting Optimal Operations Configurations
Optimizer Options
• `enable-optimizer`: turns the Optimizer pass on or off
• `memory-layout-analysis-enabled`: makes maximal use of L1 memory
• `max-legal-layouts`: the maximum number of ops whose layout is changed during analysis
• `memory-layout-analysis-policy`: the memory layout analysis policy
  - Options:
    - DFSharding
    - GreedyL1Interleaved
    - BFInterleaved
• `tensor-l1-usage-cap`: upper bound on L1 memory space usage (0.0 to 1.0)
• Etc.
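
As a rough illustration of how these options fit together, the following sketch validates a few of them and formats them as space-separated `key=value` pairs, the general style MLIR uses for pass-pipeline options. The option names and policy values are from the slide; the `optimizer_options` helper and its defaults are hypothetical, and which tt-mlir pipeline accepts the resulting string is not stated here.

```python
# Policy names as listed on the slide.
POLICIES = {"DFSharding", "GreedyL1Interleaved", "BFInterleaved"}


def optimizer_options(enable=True, layout_analysis=True,
                      policy="DFSharding", l1_cap=1.0):
    """Format optimizer options as space-separated 'key=value' pairs."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    if not 0.0 <= l1_cap <= 1.0:
        raise ValueError("tensor-l1-usage-cap must be within 0.0 to 1.0")
    opts = {
        "enable-optimizer": str(enable).lower(),
        "memory-layout-analysis-enabled": str(layout_analysis).lower(),
        "memory-layout-analysis-policy": policy,
        "tensor-l1-usage-cap": l1_cap,
    }
    return " ".join(f"{key}={value}" for key, value in opts.items())


print(optimizer_options(l1_cap=0.8))
```

Checking the policy name and the cap range up front keeps a typo from silently reaching the compiler invocation.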

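Example: DFSharding Shared layout
https://github.com/tenstorrent/tt-mlir/blob/main/lib/Dialect/TTNN/Analysis/DFShardingPolicy.cpp
The Optimizer's DFSharding policy performs the analysis and optimization needed to place ops efficiently in L1 memory.
ShardSolver reconciles constraints across the whole subgraph and finds Chain paths between ops. In the figure, each node is a different op in the subgraph, and each box shows whether a configuration is legal or not.
It derives the legal paths that satisfy the optimization policy and treats them as Chains to be optimized.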

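Example: DFShardingPolicy
• Scheduling strategy
  • DFS order: builds Chains while preserving dependencies
  • ToLayoutOp first: handles memory layout conversions first so the state is clean before a Chain starts
• Memory hierarchy optimization
  • L1 Chain: places a run of consecutive ops in L1 to reduce I/O cost
  • DRAM Spill: spills to DRAM when data does not fit in L1, or at the final output
• Constraints and limitations
  • No forks
  • First-operand constraint: a Chain continues only if the current op's output is the first operand of the next op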

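Example: DFShardingPolicy
Phase 1: Building L1 Chains
Walk from the FuncOp and build L1 Chains in DFS order.
1. Take op A and the next op B from the scheduler, where B is a direct consumer of A.
2. Check whether A's output tensor fits in L1 together with B's output and the memory B needs to execute.
3. If the memory requirement is met, build a Chain starting at A.
4. If it is not met, compute the tensor_splitting_factor needed to meet it. Check that the value is at most max_splitting_factor and that the split is legal; if so, add B to the Chain together with its tensor_splitting_factor.
5. If the conditions are not met, mark B as processed, discard it, and finish the Chain.
6. Repeat steps 1 to 5.
> Conditions for building a Chain
- The op is an eligible kind such as Conv2d, Add, Multiply, ReLU, ReLU6, Typecast, or Minimum
- The next op uses the current op as its first operand
- At least one legal configuration exists
> Conditions that break a Chain
- The op has multiple users (Fork)
- The next op uses the current op as something other than its first operand
- The op does not support sharding
…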
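
The chain-building rules above can be sketched in a few lines of Python. This is a simplified model, not the code in DFShardingPolicy.cpp: the `ChainOp` class, the byte sizes, and the memory check (sum of the two output sizes) are assumptions, and the tensor_splitting_factor fallback of step 4 is left out.

```python
from dataclasses import dataclass

# Op kinds treated as chain-eligible, taken from the list on the slide.
SHARDABLE = {"conv2d", "add", "multiply", "relu", "relu6", "typecast", "minimum"}


@dataclass
class ChainOp:
    name: str
    kind: str
    operands: list   # names of producer ops, in operand order
    out_bytes: int   # size of the output tensor (hypothetical units)


def build_chains(schedule, l1_bytes):
    """Group a scheduled op list into L1 chains using the break rules."""
    users = {}
    for op in schedule:
        for src in op.operands:
            users.setdefault(src, []).append(op.name)

    chains, current = [], []
    for op in schedule:
        if op.kind not in SHARDABLE:            # no sharding support: break
            if current:
                chains.append(current)
            current = []
            continue
        if current:
            prev = current[-1]
            continues = (
                len(users.get(prev.name, [])) == 1             # no fork
                and op.operands[:1] == [prev.name]             # first operand
                and prev.out_bytes + op.out_bytes <= l1_bytes  # fits in L1
            )
            if not continues:
                chains.append(current)
                current = []
        current.append(op)
    if current:
        chains.append(current)
    return [[op.name for op in chain] for chain in chains]


schedule = [
    ChainOp("conv", "conv2d", ["x"], 400),
    ChainOp("relu", "relu", ["conv"], 400),
    ChainOp("add", "add", ["relu", "bias"], 400),
    ChainOp("softmax", "softmax", ["add"], 400),
    ChainOp("mul", "multiply", ["scale", "softmax"], 400),
    ChainOp("cast", "typecast", ["mul"], 400),
]
print(build_chains(schedule, l1_bytes=1000))
```

In this example `softmax` is not in the eligible set, so it ends the first chain and a new one starts at `mul`. Shrinking `l1_bytes` below 800 makes every op its own chain, which is the case the splitting factor in step 4 is meant to rescue.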

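Example: DFShardingPolicy
Phase 2: Post-processing the generated Chains
Use ShardSolver to compute the best placement for each Chain.
For each L1 Chain:
1. Resolve with the constraint solver: takes Tensor Layout and memory constraints into account
2. DRAM Spill Check
Merge adjacent Chains on a cost basis:
• Trade-off in reducing Split/Combine operations
• Trade-off in the increased number of execution loops (e.g. DRAM Spill vs. additional execution)
Phase 3: Selecting the best configuration
pickOpShardConfigs selects the optimized configuration based on the Chains.
For ops that remain DRAM Interleaved, the best compute configuration is selected from the compute cost and the DRAM read estimate.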

Frontend compiler: TT-Torch
TT-Torch is a front-end compiler based on PT2 (torch.compile) and torch-mlir. It compiles along the path Fx Graph → StableHLO Graph → TT-MLIR.
[Diagram: compile flow (user code with a PyTorch model, user calls torch.compile(), PT2.x compile, fx passes, torch-mlir, tt-mlir, executor returned to user; then Executor Invoked, Inputs Pushed, Binary Executed, Outputs Popped) and op-by-op flow (fx partitioner, torch-mlir and tt-mlir for each op, results logged to a database / Excel sheet). Intermediate artifacts: fx graph, optimized fx graph, ONNX model, stableHLO, flatbuffer]

How to run TT-Torch
https://docs.tenstorrent.com/tt-torch/getting_started.html
Pass tt-torch's CompilerConfig, the Device, and the Backend as arguments to PT2's torch.compile().
Note: an ordinary torch.compile() uses TorchInductor (Triton compiler), CUDA Graphs, IPEX, and so on.
By specifying the Tenstorrent backend "tt-legacy" (the other choice is tt-experiment) as the TorchDynamo backend in PyTorch, the computation graph is converted to an Fx Graph via TorchDynamo and then, via Torch-MLIR, to TT-MLIR's IR.

Frontend compiler: TT-XLA
TT-XLA is an interface that integrates TT-MLIR into PJRT, which is developed by OpenXLA. By going through PJRT, JAX programs can be lowered along VHLO → SHLO → TT-MLIR.
With TT-XLA, a JAX model can be compiled with jax.jit and run on a TT device.
[Diagram: compilation: compiled_module = jax.jit(module) → PJRT PJRT_client_compile(…) → versionedHLO (VHLO) → stableHLO (SHLO) → TTIR → TTNN → flatbuffer_binary. runtime: outputs = compiled_module(inputs) → PJRT PJRT_LoadedExecutable_execute(…) → tt::runtime::submit(flatbuffer_binary) → outputs. Each stage is labeled TT-XLA or TT-MLIR]
Reference:
• OpenXLA: https://openxla.org/
• PJRT: https://openxla.org/xla/pjrt

How to run TT-XLA
https://docs.tenstorrent.com/tt-xla/getting_started.html
[Code samples on the slide: Jax/Flax; Pytorch]
JAX/Flax code can be compiled with TT-XLA without any code changes.
PyTorch goes through torch-xla, so Device/Backend settings are required.

Frontend compiler: TT-Forge-FE
TT-Forge-FE supports the most frameworks of all the front ends and compiles to TT-MLIR using TVM's Relay.
Unlike TT-Torch and TT-XLA, many optimizations such as Graph Optimization are implemented at this layer.
→ This is because it reuses much of the TT-BUDA code
→ However, the plan is to consolidate optimization passes in TT-MLIR, so they will eventually be removed from Forge-FE
[Diagram labels: user level interface; initialize compile; initialize runtime; compile time; runtime; TVM compile; forge-fe compile; MLIR compile; MLIR runtime; forge-fe runtime; silicon; binary]

TT-Explorer: How to optimize and visualize your models
TT-Explorer is a visualization tool for tuning models.
It can emit TTNN C++ code, so the C++ code generated by Forge can also be tuned by hand.
[Figure labels: ttnn-visualizer; C++ Generation]

Example: Multi-device TT-MLIR

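Multi-device Computation
Goal: connect multiple TT Devices and compute across them.
1. Distribute the input data across the devices
2. Run the computation independently on each device
3. Aggregate the results using CCL (Collective Communication Library)
4. Combine the results if needed
Multi-device Matmul on n300 (1x2 mesh)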
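
The four steps can be mimicked on the host with plain Python lists. This sketch assumes the left matrix is split by rows across two "devices" and the right matrix is replicated; the function names are hypothetical, and `all_gather` merely stands in for a real CCL collective.

```python
def matmul(a, b):
    """Plain list-of-lists matrix multiply."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def shard_rows(a, num_devices):
    """Step 1: split the rows of `a` evenly across the devices."""
    if len(a) % num_devices:
        raise ValueError("row count must divide evenly across devices")
    step = len(a) // num_devices
    return [a[i * step:(i + 1) * step] for i in range(num_devices)]


def all_gather(shards):
    """Steps 3 and 4: stand-in for a CCL all-gather over row shards."""
    return [row for shard in shards for row in shard]


def multi_device_matmul(a, b, num_devices=2):
    shards = shard_rows(a, num_devices)          # distribute the input
    partials = [matmul(s, b) for s in shards]    # compute on each device
    return all_gather(partials)                  # aggregate and combine


a = [[1, 2], [3, 4], [5, 6], [7, 8]]
b = [[1, 2], [3, 4]]
print(multi_device_matmul(a, b))
```

Row sharding needs only a concatenating collective at the end. Splitting along the shared inner dimension instead would require a summing collective (an all-reduce) over the partial products.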

JAX/OpenXLA MLIR output (with GSPMD partitioner)
• Target 1x2 for n300
• GSPMD: General and Scalable Parallelization for ML Computation Graphs
  • https://arxiv.org/abs/2105.04663

Conversion from StableHLO to TTIR
• Converts the StableHLO SPMD ops to TTIR
• `mesh_shard` attributes
  • shard_direction: full_to_shard, shard_to_full
  • shard_type: replicated or devices
  • shard_shape: determines the sharding shape if shard_type is "devices"
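
One way to read these three attributes is as a shape transformation. The sketch below assumes that shard_shape gives a per-dimension split factor, so full_to_shard divides each dimension by it and shard_to_full multiplies it back. That interpretation and the `mesh_shard_shape` helper are assumptions for illustration, not the TTIR definition.

```python
def mesh_shard_shape(shape, shard_direction, shard_type, shard_shape):
    """Return the tensor shape after a shard step (assumed semantics)."""
    if shard_type == "replicated":
        return tuple(shape)                      # every device holds a copy
    if shard_type != "devices":
        raise ValueError(f"unknown shard_type: {shard_type}")
    if len(shape) != len(shard_shape):
        raise ValueError("shard_shape must have one entry per dimension")
    if shard_direction == "full_to_shard":
        if any(d % s for d, s in zip(shape, shard_shape)):
            raise ValueError("shape is not divisible by shard_shape")
        return tuple(d // s for d, s in zip(shape, shard_shape))
    if shard_direction == "shard_to_full":
        return tuple(d * s for d, s in zip(shape, shard_shape))
    raise ValueError(f"unknown shard_direction: {shard_direction}")


# A 1x2 mesh splitting the second dimension in half.
local = mesh_shard_shape((8192, 784), "full_to_shard", "devices", (1, 2))
print(local)
print(mesh_shard_shape(local, "shard_to_full", "devices", (1, 2)))
```

Under this reading, full_to_shard followed by shard_to_full with the same shard_shape restores the original shape.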

Lowering from TTIR to TTNN/Flatbuffer
• Converts the TTIR shard ops to the corresponding TTNN/Flatbuffer shard ops

Forge & Shardy: Multi-device Architecture
Shardy is supported starting in 2025.
• What is Shardy
  • An MLIR-based tensor partitioning system
  • Released in July 2024
  • OpenXLA Project
  • Supported by JAX, torch_xla, and StableHLO
• Features
  • Shard hints
  • Automatic shard solver + sharding propagation
  • Automatically insert CCLs for sharding mismatches
  • Maximum flexibility in sharding representation

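Conclusion
• TT-Forge is an MLIR-based DNN compiler developed by Tenstorrent
• It uses upstream MLIR Dialects while implementing its own domain-specific Dialects that target TT Devices
• Integration with the MLIR ecosystem is relatively easy thanks to MLIR's own mechanisms
• OpenXLA (PJRT, StableHLO, IREE), Torch-MLIR, etc.
• As future work, MLIR-based collaboration with the HPC domain is also planned
• Contributions welcome!
• TT-MLIR, TT-Torch, TT-XLA, and TT-Forge-FE are all published as OSS, so anyone can join development
• If you are interested in contributing or in research collaboration, feel free to get in touch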

Q & A