Tenstorrent 手軽にAIアプリケーションを立ち上げる

Tenstorrent TechTalk #3 手軽にAIアプリケーションを立ち上げる Tenstorrent Japan KK

#１〜#２のおさらい: • TenstorrentのTensix AI アクセラレータの戦略 • IP〜PCIeカード〜ワークステーション〜高集積サーバ〜クラウドまで提供 • スケールアウトに強み, 組込みから,
エンタープライズAIまで同じアーキテクチャ • Tensix AI アクセラレータの概要 • Tensix Coreの構成 • 計算コアのクラスタリング, NOCの動きかた • オープンかつ柔軟なSDK • Metalium: Tensix Coreのプログラミングモデルを直に制御 2 ANY AI MO DE L OPEN SOURCE OPEN SOURCE O P T I M I ZE D M L R E S U L T S CUSTOM OPS BUILD ANYTHING OPEN SOURCE OPEN SOURCE Compiler HWの成り立ちから, SWの基礎部分を解説してきた

今日のトピック: LLM, AIアプリをどれだけ手軽にデプロイできるか Partners TT-Forge vLLM Python kernels TT-NN TT-Metalium
TT-LLK (low-level-kernels) PyTorch models TT-Fabric (unified scale-up and scale-out) Manually optimized models TT-Train TT- Transformer LLM training LLM inference models LLM, t2s, s2t models Jax PyTorch TF ONNX Models AI Workloads Open Source Partners Tenstorrent Open Source Software 第4回予定?? 第6回予定第5回予定第?回予定 AI アプリ

Session1はこの辺の裏側にいるSWを見ていく 4 (第1回発表資料より)

モデルデプロイSWを内製する意義 • デモ用途 • 手離れの良いユーザインターフェイス • とりあえずモデルを実用環境に持っていきたい, SDKの違いを意識したくないユーザ向け •
言語だけでなく, 画像/音声入出力も統一したやり方でサポート • メンテナンスするソフトは一つ • 内部的にはCIのための自動化システムの一部: tps, e2elの性能追跡 • モデルデプロイ, テスト実行, 性能評価の手順をパターン化 5

AIアプリの立ち上げを補助するソフトウェアスタック 6 TT-inference-server モデルをAPIサーバとして扱うためのWrapper 言語だけでなく, (少々古い)画像系モデルもサポート構成は TT-Metalium / TT-NN
+ vLLM (LLM), FastAPI+独自のイベントループ (LLM以外) モデルごとにDockerイメージ配布しているのでそのまま使うのをお勧め TT-Studio Web UIもついたAI Playgroundの実装 TT-inference-serverによってビルドされたDocker containerたちを前提としている現状, Llama3.3-70B + RAG, StableDiffusion, YOLO等色々なモデルを切り替えて使うことができるコンテナ立ち上げ, 切り替え用のWeb UIもあり

ゲストスピーカーのご紹介: 1 •Benjamin Goel • Staff Engineer, Customer Team •
TT-Inference-server, TT-Studioの開発主導 • その他LLMモデルのBringup 7

tt-inference-server Overview September 10, 2025 Ben Goel - [email protected]

We will review the tt-inference-server project to explain its purpose,
structure, and capabilities Outline

tt-inference-server Project Overview

Why does tt-inference-server exist? Our aim is to solve these
problems with tt-inference-server: 1. Tenstorrent SW stack is complicated to build and is not packaged for easy consumption 2. We need a method of serving models for: a. Powering applications b. Simplifying model execution c. Integrating with industry standard deployment technologies d. Facilitating model performance benchmarking and accuracy evaluation 3. Need a way to encapsulate model configuration, a lot of setup is required: a. Environment variables b. Software version dependencies c. Model-specific runtime arguments d. Device configuration and topology 4. Enable local model serving 5. Need a standardized codebase for measuring model performance and accuracy 6. Need a tool to aid in application-level model regression testing (CI)

How tt-inference-server solves these problems Tenstorrent SW stack is complicated
to build and is not packaged for easy consumption - Uses Docker to containerize the SW build process

How tt-inference-server solves these problems We need a method of
serving models - Composes inference servers for LLM and non-LLM models: a. LLM -> TT vLLM fork b. non-LLM -> Media Inference Server

Need a way to encapsulate model configuration, a lot of
setup is required - ModelSpec captures all model configuration in serializable artefact - ModelSpec is consumed by Docker images How tt-inference-server solves these problems

We need to a way to enable local model serving
- Provides a CLI that can: a. Run local inference servers in Docker containers b. Setup host dependencies (model weights, ModelSpec, persistent volumes, etc) c. Run performance benchmarks against running inference servers d. Run accuracy evaluations against running inference servers How tt-inference-server solves these problems

Need a standardized codebase for measuring model performance and accuracy
- Encapsulates measuring model performance and accuracy into parameterizable Python scripts that are executed by the CLI How tt-inference-server solves these problems

Need a tool to aid in application-level model regression testing
(CI) - CLI is used by CI workflows to measure model performance and accuracy with the newest versions of SW How tt-inference-server solves these problems

Design of tt-inference-server At the heart of tt-inference-server are Docker
images which can be used to deploy our models with standard orchestration technologies (kubernetes). The CLI serves to template Docker run commands and abstract model configuration details from the user. The challenge with using the Docker images comes with knowing all the arguments and variables to set.

ModelSpec The Docker images require mounting a configuration file called
the ModelSpec. The ModelSpec is a simple object that contains all model configuration Here is an example for meta-llama/Llama-3.3-70B-Instruct on Galaxy - https://gist.github.com/bgoelTT/37bd2af4ea3a9cd8ee57e85513a65227

ModelSpec continued The combination of the Docker image + ModelSpec
is the core design pattern of tt-inference-server

Questions

tt-inference-serverの使い方 • mainよりもdev branchがお勧め • https://github.com/tenstorrent/tt-inference-server/tree/dev • 万能運用スクリプトrun.pyでの利用を推奨 • k8sなどに展開しようと思うと,
run.pyの面倒みてくれてる部分を分解する必要がありそう. • 例) dockerイメージビルド, コンテナへの環境変数設定, コンテナ立ち上げ • 立ち上げたAPIサーバの性能ベンチマーク機能もあり〼 • ご参考: docs/workflows_user_guide.md 22

run.py CLI - Required Arguments Reference - https://github.com/tenstorrent/tt-inference- server/blob/dev/workflows/README.md#runpy-cli-usage Command-line
Arguments Required Arguments: --model (required): 実行したいモデル, 選択できるモデルは後述の MODEL_SPECS で列挙されている --workflow (required): やりたい作業を指定する(ベンチマーク, 評価, サーバ立ち上げ etc.): benchmarks, evals, server release, reports, tests --device (required): 実行対象のハードウェアを指定: cpu, gpu - CPU/GPU execution n150, n300, p100, p150 - Tenstorrent Siingle PCIe card p150x4 - Tenstorrent 4xP150 card (TT-QuietBox Blackhole) t3k - Wormhole TT-QuietBox /TT-LoudBox systems galaxy - Tenstorrent Galaxy system

Reference - https://github.com/tenstorrent/tt-inference-server/blob/dev/workflows/README.md#runpy-cli-usage Optional Arguments(抜粋): --device-id (optional): 実行に供するTenstorrent deviceの IDを指定,
複数デバイスももちろん可能 (e.g. '0' for /dev/tenstorrent/0). --local-server (optional), --docker-server (optional): Inference serveをベアメタルで実行するか, コンテナで実行するか選択 --override-docker-image (optional): 前項のモデル指定を無視して, 指定のInference Serverコンテナを実行してから, 指定のworkfloを実行 --workflow-args (optional): Additional workflow arguments (e.g.,'param1=value1 param2=value2'). --override-tt-config (optional): TT-Metalに渡したい設定をJSONで指定 (e.g.,'{"data_parallel": 16}’). モデルのオプションや, SDKの動作モード等 --vllm-override-args (optional): vLLMに渡したい設定をJSONで指定. (e.g.,'{"max_model_len": 4096, "enable_chunked_prefill": true}’). 最大シーケンス長さやPrefixの動作などのオプション等 run.py CLI - Optional Arguments

Reference - https://github.com/tenstorrent/tt-inference-server/blob/dev/workflows/README.md#runpy-cli-usage Run a workflow with a Docker server:
python3 run.py --model Llama-3.3-70B-Instruct --workflow evals --device T3K --docker- server Run benchmarks workflow: python3 run.py --model Llama-3.3-70B-Instruct --workflow benchmarks --device T3K -- docker-server Run the accuracy evaluations workflow locally: python3 run.py --model Qwen2.5-72B-Instruct --workflow evals --device N150 Run with custom service port and additional workflow arguments: python3 run.py --model Qwen2.5-72B-Instruct --workflow evals --device N150 --service- port 9000 --workflow-args "batch_size=4 max_tokens=512" run.py CLI – 具体的なコマンドたち

Docker imagesを直接扱う場合 run.pyを--workflow server を指定して実行で, 直接コンテナ叩く場合のコマンドが得られる python3 run.py --model Llama-3.3-70B-Instruct
--device galaxy --workflow server --docker-server に対応する docker runのoptionは docker run --rm --name tt-inference-server-7a5e565c --env-file /home/ubuntu/tt-inference-server/.env --cap-add ALL --device /dev/tenstorrent:/dev/tenstorrent --mount type=bind,src=/dev/hugepages-1G,dst=/dev/hugepages-1G --mount type=bind,src=/home/ubuntu/tt- inference-server/persistent_volume/volume_id_llama3_70b_galaxy-Llama-3.3-70B-Instruct-v0.0.5,dst=/home/container_app_user/cache_root --mount type=bind,src=/home/ubuntu/.cache/huggingface/hub/models--meta-llama--Llama-3.3-70B- Instruct,dst=/home/container_app_user/readonly_weights_mount/Llama-3.3-70B-Instruct,readonly --mount type=bind,src=/home/ubuntu/tt-inference- server/workflow_logs/run_specs/tt_model_spec_2025-09-08_11-27-35_id_llama3-70b-galaxy_Llama-3.3-70B- Instruct_galaxy_server.json,dst=/home/container_app_user/model_spec/tt_model_spec_2025-09-08_11-27-35_id_llama3-70b-galaxy_Llama-3.3-70B- Instruct_galaxy_server.json,readonly --shm-size 32G --publish 8000:8000 -e CACHE_ROOT=/home/container_app_user/cache_root -e TT_CACHE_PATH=/home/container_app_user/cache_root/tt_metal_cache/cache_Llama-3.3-70B-Instruct/TG -e MODEL_WEIGHTS_PATH=/home/container_app_user/readonly_weights_mount/Llama-3.3-70B- Instruct/snapshots/6f6073b423013f6a7d4d9f39144961bfbfbc386b/original -e TT_LLAMA_TEXT_VER=llama3_70b_galaxy -e TT_MODEL_SPEC_JSON_PATH=/home/container_app_user/model_spec/tt_model_spec_2025-09-08_11-27-35_id_llama3-70b-galaxy_Llama-3.3-70B- Instruct_galaxy_server.json ghcr.io/tenstorrent/tt-inference-server/vllm-tt-metal-src-dev-ubuntu-22.04-amd64:0.0.5-370f7ce-005baf4 ほとんどはコンテナ内部に渡す環境変数半年前はモデルごとに envファイルを手書きで作って docker runに渡していた

中でも大事なDocker runオプション • Devices - /dev/tenstorrent • Hugepages - /dev/hugepages-1G
以下はコンテナ内部の環境変数として • TT-MetaliumのTensor Cacheディレクトリの指定 • And setting TT_CACHE_PATH env var • 初回起動時にweightのキャッシュが保存される → でかい領域が必要 • HF等から落としてきたModel Weights ディレクトリの指定 • And setting MODEL_WEIGHTS_PATH env var • ModelSpec JSON • And setting TT_MODEL_SPEC_JSON_PATH env var 上記をうまい感じに書き換え, オリジナルllamaやQwenのフリをさせると Swallow, Sarashina等も動くようになる (もちろんWeightだけが変わっているfinetunedモデルだけサポート)

Tt-inference-serverの構成 (繰り返し) 28 構成は TT-Metalium / TT-NN + vLLM (LLM)
または FastAPI+独自のイベントループ (LLM以外) - 内部的にはmedia-inference-serverと呼ばれるらしい

Our vLLM deep dive • ForkされたvLLM,(将来的にはTT用Backendとして, upstreamにマージされる予定) • https://github.com/tenstorrent/vllm/tree/dev
• tt-metalの中で実装されている, vLLM用のクラスを引っ張ってきて使っている. • vllm/examples/offline_inference_tt.pyのregister_tt_modelsが参考になります • 現状 LLama3系, Qwen2.5系, Gemma3系がサポート • 何か新しいモデルを作ったら, 上記の関数内で追加する • [大事なポイント] GPUユーザがvLLMから受ける恩恵に相当するものは基本モデル自体(つまりTT-NNのコードとして)に実装される • Paged Attention, ContinuousBatching etc. • 基本的にはAPIサーバ, イベントループ, 複数リクエストをまとめる部分などに vLLMの実装を借りているだけと考える方がいい 29

Media-inference-server • non-LLMモデル向け • スケーラブルな設計 • 複数デバイス上にモデルを控えさせておいて, loadbalancingする機能 • 現在サポートしているモデル
◦ SDXL ◦ SD-3.5 ◦ Whisper with WhisperX ◦ Experimental modeles • YoloV4 • Resnet • … • デバイスの死活監視も構築中

Media-inference-serverの概要 API アダプタベースサービス画像/音声/ビデオ前・後処理タスクスケジューラワーカープロセス
対象モデル対象モデル対象モデル

Inference Serverソースを見たい人へのTips • vLLMを使うケース • main branchのvllm-tt-metal-llama3を覗くのがおすすめ • media-inference-serverの実装 •
dev branchのtt-metal-sdxlを覗くのがおすすめ • tt-metal-yolov4はさらに簡単というかべた書きに近い実装なので理解? しやすい • yolo -> SDXL で徐々にmedia-inference-serverの形ができてきて, まだまだ発展途上? • Run.pyの中身, workflowsの中を見る. • workflows/model_spec.py にサポートされているモデル, HW種別, それらに依存して生成される設定のリストが見て取れる • Benchmarkコード • LLM向け性能測定SUITE、色々な入出力トークン長, batch_sizeでTTFT,TPSなどを計測してます 32

ゲストスピーカーのご紹介: 2 •Sunku Ranganath • Staff Cloud Product Manager, Customer
Team • TenstorrentのクラウドであるTT-Cloudの開発者 33

Free Access of Tenstorrent-Cloud for Developers • Experienced in building
& operating internal AI cloud with over 350 systems. • Enable Developers experience the hardware for Free • Developers can build and experiment AI based applications • Build applications through VSCode Instances • Inference as a Service (Open API compatible APIs) • Reach out to us to be on the waiting list! Example Flow for Inference as a Service

評価にご興味がある方 - Webから直販 - 代理店(マクニカ様, Networld様)経由, サポート付き - クラウド(Koyeb, UnsungFields,
TT-Cloud(最近満員御礼のためお時間必要)) - お問い合わせは [email protected]まで

Media情報 Youtube: - 開発者が顔出しでライブラリの使い所などをしゃべる動画を公開中 - https://www.youtube.com/@tenstorrentinc Github: https://github.com/orgs/tenstorrent Discord: https://discord.gg/tenstorrent

Tenstorrent 手軽にAIアプリケーションを立ち上げる

Tenstorrent 手軽にAIアプリケーションを立ち上げる

More Decks by Tenstorrent Japan

Featured

Transcript