TT-Transformers and Cursor AI Agent Integration

Tenstorrent Tech Talk #3: TT-Transformers and Cursor AI Agent Integration
Sep 11, 2025
Kohei Yamaguchi / Tenstorrent Japan @sott0n

Tenstorrent OSS Stack
• Low level library/Kernel
• DNN Library / Python API
• DNN Compiler
Callout on the slide: the theme of this session

Goal of this session

What is TTNN?
• A neural network library for TT-Metalium
  • Implements PyTorch-like NN operations, usable from both C++ and Python
  • These operations let you implement DNN models for TT hardware
• Targeting TT hardware from an ML framework
  • To run computation on TT hardware from a framework such as PyTorch or JAX, a backend must convert the framework's operations into TTNN operations
  • Tenstorrent provides the DNN compiler and backend implementations for this
    • Forge Compiler: Supports PyTorch/JAX/ONNX/TF/PaddlePaddle
    • PyTorch 2.0 TT-NN Backend

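Key Features of TTNN
• High-level neural network operations
  • Optimized implementations of the major NN operations: matrix multiplication, convolution, attention, data movement, collective communication (CCL), element-wise ops, reduction, loss functions, pooling, and more
• Flexible tensor library
  • The Tensor API lets you control the data layout on TT accelerators
• Native multi-device support
  • Supports device mesh configurations, making it easy to scale across an entire device cluster
• Computation graph visualization and profiler tools
  • https://github.com/tenstorrent/ttnn-visualizer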

How to use TTNN
• Install TTNN library from PyPI
• Simple TTNN Example

TTNN Example: BERT Encoder
… … (excerpt)

How to build a model from PyTorch to TTNN?
https://docs.tenstorrent.com/tt-metal/latest/ttnn/ttnn/converting_torch_model_to_ttnn.html
• Re-Writing Functional torch API
• Converting To TTNN operations
• Optimizing TTNN model

Example: Converting a BERT from Torch to TTNN
• BERT with PyTorch
• BERT with TTNN

Experimental: Forge compiler can generate TTNN C++ Code
[Diagram: Forge compiler stack. Labels: Forge frontend, Forge lowering, TT-MLIR, EmitC, TTNN C++, TTNN/TT-Metalium Runtime]

How to optimize TTNN Model
• Optimizing the performance of a TTNN model requires knowledge of the TT accelerator
  • Configuration: Tensor Layout, Memory Layout, Data Format, Math Fidelity, etc.
  • Handling multiple Tensix cores (multicast, unicast, data reuse, dataflow, etc.)
• Analyze with profiling tools
  • TTNN-Visualizer: https://github.com/tenstorrent/ttnn-visualizer
[Screenshots: Performance analysis; L1 Summary with Tensor]

Configuration: Tensor Layout
• Row-Major Layout
  • Each row of a 2D tensor is stored as one page
  • A 64x64 tensor occupies 64 pages in the buffer
  • Page 0 = [0,0 to 0,63], Page 1 = [1,0 to 1,63], Page 2 = [2,0 to 2,63], …, Page 62 = [62,0 to 62,63], Page 63 = [63,0 to 63,63]
• Tiled Layout
  • Each tile is stored as one page
  • A 64x64 tensor becomes four 32x32 tiles (4 pages): Page 0, Page 1, Page 2, Page 3
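The page arithmetic on this slide can be checked with a small sketch. This is plain Python, not the TTNN API; the function names are hypothetical, and the left-to-right, top-to-bottom numbering of tiles is an assumption made for illustration.

```python
def row_major_page(row, col):
    """Row-major layout: each tensor row is one page, so the page index is the row."""
    return row


def tiled_page(row, col, width, tile=32):
    """Tiled layout: each tile x tile block is one page.

    Assumption: tiles are numbered left to right, then top to bottom.
    """
    tiles_per_row = width // tile
    return (row // tile) * tiles_per_row + (col // tile)


def page_counts(height, width, tile=32):
    """Return (pages in row-major layout, pages in tiled layout)."""
    return height, (height // tile) * (width // tile)


print(page_counts(64, 64))           # (64, 4)
print(tiled_page(40, 10, width=64))  # 2
```

For the 64x64 tensor on the slide this gives 64 row-major pages and 4 tiled pages.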

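Configuration: Memory Layout
• Interleaved layout
  • Tensor pages are stored evenly across the DRAM banks in round-robin order
  • The structure is simple, so access is easy, but it can require access to distant DRAM and can leave unused regions
• Sharded layout
  • Tensors can be placed in DRAM close to the core, so access is fast
  • On the other hand, control is more complex than with the interleaved layout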
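The difference between the two placements can be sketched as a page-to-location mapping. This is a simplified model in plain Python, not TTNN code: the function names are hypothetical, and the sharded variant assumes equal contiguous chunks, which is only one of the possible sharding schemes.

```python
def interleaved_placement(num_pages, num_banks):
    """Round-robin: page i is stored in bank i % num_banks."""
    return [page % num_banks for page in range(num_pages)]


def sharded_placement(num_pages, num_shards):
    """Contiguous shards: consecutive pages stay together in one location.

    Assumption: pages are split into equal contiguous chunks.
    """
    pages_per_shard = -(-num_pages // num_shards)  # ceiling division
    return [page // pages_per_shard for page in range(num_pages)]


print(interleaved_placement(8, 4))  # [0, 1, 2, 3, 0, 1, 2, 3]
print(sharded_placement(8, 4))      # [0, 0, 1, 1, 2, 2, 3, 3]
```

With interleaving, neighbouring pages land in different banks; with sharding, a consumer that needs pages 0 and 1 finds both in the same place.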

Configuration: Block FP and Math Fidelity
• Block Floating Point
  • A compressed data format in which values are grouped and share a common exponent
• Math Fidelity
  • The 5-bit x 7-bit multiplier completes a multiplication in one cycle
  • Larger operands are computed in up to 4 steps to preserve precision
  • Stopping after 1 to 2 steps makes the computation faster
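The shared-exponent idea can be illustrated with a generic block floating point quantizer. This is a sketch of the general technique only: it does not reproduce the bit layout or block size of the BFP formats used on Tenstorrent hardware, and the function names and `mantissa_bits` parameter are hypothetical.

```python
import math


def bfp_quantize(block, mantissa_bits):
    """Quantize a block of floats to one shared exponent plus integer mantissas."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1; e is shared by the block.
    _, shared_exp = math.frexp(max_abs)
    scale = 2.0 ** (shared_exp - mantissa_bits)
    limit = 2 ** mantissa_bits - 1
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in block]
    return shared_exp, mantissas


def bfp_dequantize(shared_exp, mantissas, mantissa_bits):
    """Reconstruct approximate floats from the shared exponent and mantissas."""
    scale = 2.0 ** (shared_exp - mantissa_bits)
    return [m * scale for m in mantissas]


exp, mants = bfp_quantize([1.0, 0.5, -0.25, 0.0], mantissa_bits=3)
print(exp, mants)                                      # 1 [4, 2, -1, 0]
print(bfp_dequantize(exp, mants, mantissa_bits=3))     # [1.0, 0.5, -0.25, 0.0]
```

Only one exponent is stored per block, so the per-value cost is the sign and mantissa; small values in a block with a large maximum lose precision, which is the trade-off behind choosing fewer mantissa bits.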

What is TT-Transformers?
• The Transformers implementation built on TTNN
  • tt-metal/models/tt_transformers
• Provides LLM implementations optimized for TT devices
  • Actively committed to by the Model Bringup team
• LLMs supported as of 2025-09-11
  • Llama, Qwen, Falcon, Mistral, DeepSeek Distill
  • Note: Gemma3 support was also merged recently

Optimized Configuration of TT-Transformers
tt-metal/models/tt_transformers/tt/model_config.py
Performance-optimized config:
• For FF1_FF3 of the MLP, the optimized setting is
  • BFP4
  • MathFidelity LoFi
• Everything else uses BFP8 x HiFi2
Note: only Qwen2.5-7B has its own settings, which are omitted here
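The rule on this slide amounts to a per-operation lookup with one exception. A minimal sketch, assuming a hypothetical helper `precision_for`; the actual structure of model_config.py is not reproduced here, and the op name "FF2" is only an illustrative placeholder for "any other op".

```python
# Hypothetical illustration of the slide's rule, not code from model_config.py.
PERFORMANCE_OVERRIDES = {
    "FF1_FF3": ("BFP4", "LoFi"),  # MLP FF1/FF3: lowest precision, fastest math
}
DEFAULT_PRECISION = ("BFP8", "HiFi2")  # everything else


def precision_for(op_name):
    """Return (data_format, math_fidelity) for an op under the performance config."""
    return PERFORMANCE_OVERRIDES.get(op_name, DEFAULT_PRECISION)


print(precision_for("FF1_FF3"))  # ('BFP4', 'LoFi')
print(precision_for("FF2"))      # ('BFP8', 'HiFi2')
```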

How to use TT-Transformers?
• Step1: Install TT-Metal and TTNN
• Step2: Install python dependencies of TT-Transformers
• Step3: Download weights of LLM
• Step4: Set `HF_MODEL` or `LLAMA_DIR` environment variable
• Step5: Run the tt-transformers
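Step4 is easy to get wrong when switching between models, so a pre-flight check can help. The sketch below is a hypothetical helper, not part of tt-transformers; in particular, checking `HF_MODEL` before `LLAMA_DIR` is an assumption, and the slide does not say which takes precedence.

```python
import os


def resolve_model_source(environ=None):
    """Return (variable_name, value) for the first model variable that is set.

    Assumption: HF_MODEL is checked before LLAMA_DIR.
    """
    environ = os.environ if environ is None else environ
    for name in ("HF_MODEL", "LLAMA_DIR"):
        value = environ.get(name)
        if value:
            return name, value
    raise RuntimeError("Set HF_MODEL or LLAMA_DIR before running tt-transformers")


# Example with an explicit mapping instead of the real environment:
print(resolve_model_source({"HF_MODEL": "meta-llama/Llama-3.1-70B-Instruct"}))
```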

Running an LLM on TT-Transformers
• Run meta-llama/Llama-3.1-70B-Instruct on QuietBox
• Output of decoder

What is TT-vLLM?
• A fork of vLLM with Tenstorrent hardware support
  • https://github.com/tenstorrent/vllm
  • Note: developed on the dev branch
• Adds tt-metal to vLLM as a backend
  • Internally it invokes tt-transformers as the backend and executes on the tt-metal runtime
• The examples include scripts with Tenstorrent-target settings
  • examples/offline_inference_tt.py
  • examples/server_example_tt.py
  • These make it easy to launch an OpenAI API inference server targeting Tenstorrent
  • vLLM Benchmark and similar tools can also be used

How to use TT-vLLM?
• https://github.com/tenstorrent/vllm/tree/dev/tt_metal
• Step1: Build & Install tt-metal/ttnn
• Step2: Run setup script
• Step3: Install tt-vLLM
• Step4: Run an offline inference server
  • Default model: meta-llama/Llama-3.1-70B
  • MESH_DEVICE=T3K: selects a mesh of Wormhole N300 x4, as found in LoudBox/QuietBox
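Step4 comes down to running the example script with `MESH_DEVICE` set. The sketch below only assembles the command and environment and does not launch anything; the helper name is hypothetical, the script path is the one listed among the TT-vLLM examples, and any further flags the script may need are not shown.

```python
import os


def build_offline_inference_command(mesh_device="T3K",
                                    script="examples/offline_inference_tt.py"):
    """Return (argv, env) for running the offline inference example."""
    env = dict(os.environ)            # copy so the current process is untouched
    env["MESH_DEVICE"] = mesh_device  # T3K: Wormhole N300 x4 mesh
    return ["python", script], env


cmd, env = build_offline_inference_command()
print("MESH_DEVICE=" + env["MESH_DEVICE"], " ".join(cmd))

# To actually launch (requires an installed tt-metal / tt-vLLM environment):
# import subprocess
# subprocess.run(cmd, env=env, check=True)
```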

How to use TT-vLLM?
• Measure Performance
  • E.g. Llama70B on QuietBox, single batch with 32 prompts
• Run Llama-3.1 70B on Galaxy (Wormhole x32)
  • TT_LLAMA_TEXT_VER: selects the Galaxy-specific implementation in tt-metal
  • --override_tt_config: Tenstorrent-specific settings
• Run a Llama3.2 11B Vision model

Demo: Cursor X Tenstorrent

Tenstorrent QuietBox
• Blackhole QB (P150 x4)
• Wormhole QB (N300 x4)

Demo: Overview
  curl -sS http://<endpoint>/v1/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: …" \
    -d '{ "model": "Qwen/Qwen2.5-Coder-32B-Instruct", "prompt": "Write *** Python", … }'
[Diagram labels: Prompt, To OpenAI API Server, Qwen, Generate Code]
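The same request can be assembled from Python with only the standard library. This is a sketch that builds the request without sending it; the helper name, the `max_tokens` value, the bearer-token form of the header, and the `http://localhost:8000` endpoint are assumptions for illustration.

```python
import json
import urllib.request


def build_completion_request(endpoint, api_key, prompt,
                             model="Qwen/Qwen2.5-Coder-32B-Instruct",
                             max_tokens=256):
    """Build a POST request for an OpenAI-compatible /v1/completions endpoint."""
    url = endpoint.rstrip("/") + "/v1/completions"
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,  # assumed bearer-token scheme
    }
    return urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"),
        headers=headers, method="POST",
    )


# Placeholder endpoint and key; replace with the real server address.
req = build_completion_request("http://localhost:8000", "dummy-key",
                               "Write a hello-world program in Python")
print(req.get_method(), req.full_url)

# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```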

Cursor AI Agent on TT-QuietBox
• Step1: Pull the tt-inference-server Docker image
  • https://github.com/tenstorrent/tt-inference-server/pkgs/container/tt-inference-server%2Fvllm-tt-metal-src-dev-ubuntu-22.04-amd64
• Step2: Start the container
  • Binding: `--device /dev/tenstorrent`

Cursor AI Agent on TT-QuietBox
• Step3: Start the TT-vLLM server
  • Model: Qwen/Qwen2.5-Coder-32B-Instruct
  • Apply a patch to TT-vLLM (needed to use the tool calling feature)
  • python vllm/examples/server_example_tt.py
• Step4: Open a tunnel with ngrok
  • Endpoint in this example: https://f8481a01c9c0.ngrok-free.app

Cursor AI Agent on TT-QuietBox
• Step5: Point Cursor at the TT-vLLM server endpoint in its settings
  • Navigation: → Cursor Settings → Models
  • Register the model:
    • Qwen2.5-Coder-32B-Instruct
    • Note: turn the other models off
  • Set a dummy value for OpenAI API Keys
  • Override OpenAI Base URL
    • Set the ngrok endpoint
    • https://f8481a01c9c0.ngrok-free.app/v1

Demo: Cursor AI Agent on TT-QuietBox

Demo

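Conclusion
• With TTNN → TT-Transformers → the TT-vLLM fork, getting an LLM running is easy
  • TT-Transformers is already tuned for TT devices
  • TT-vLLM can launch an OpenAI API server to serve inference
• Implementing and optimizing models with TTNN
  • A PyTorch implementation must be ported to a TTNN implementation
  • Optimization requires an understanding of the TT hardware and analysis with tools such as the profiler
• At Tenstorrent, software engineers add support for and tune a variety of models every day
  • https://github.com/tenstorrent/tt-metal/tree/main/models
• Contributions are welcome!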

Thank you
