Blackhole Architecture

Tenstorrent Japan
October 27, 2025
Tenstorrent TechTalk #4, Session 1

Transcript

  1. Blackhole – A Standalone AI Computer

    CONFIDENTIAL - CONTAINS TRADE SECRETS

    Feature: Spec
    Tensix Compute*: 745 TFLOPS (8-bit), 372 TFLOPS (16-bit)
    SRAM: 210 MB (Tensix Cores), 32 MB (SiFive x280 RISC-V Cores)
    Ethernet: 12x 400G
    NoC: 2 NoC per core, 2D Torus, 64B per clock
    DRAM: 32 GB GDDR6
    DRAM Bandwidth: 512 GB/s
    “Baby RISC-V” Cores: 700 (Tensix Cores), 52 (Controllers)
    SiFive x280 “Big RISC-V” Cores: 16
    PCI Express: Gen 5.0 x16, 128 GB/s
    *Planned/projected performance

    Diagram legend: T = Tensix Cores, D = DRAM Cores, E = ETH Cores, P = PCIe Core, A = ARC Core, CPU = RISC-V CPUs
  2. Blackhole – A Standalone AI Computer

    Core count / SRAM capacity: 1.75x
    Clock frequency: 1.35x
    Compute: 2.65x
    DRAM capacity: 2.7x, bandwidth: 1.7x
    Ethernet bandwidth: 3x
    Power consumption: 2 to 3x!!
    The placement of the Tensix cores changes → tile-level optimization has to be redone
    *Planned/projected performance

    Diagram legend: T = Tensix Cores, D = DRAM Cores, E = ETH Cores, P = PCIe Core, A = ARC Core, CPU = RISC-V CPUs
  3. Performance Per Core

    • MMUL ops/cycle: 4096
    • SIMD ops/cycle: 64
    • Tensix Core – FPU uses ~85% of power

    Multiple Data Formats Supported
    • FP8/16/32, BFLOAT16, BLOCKFP2/4/8, INT8/32, TF32

    General Purpose SIMD Engine (SFPU)
    • Fast transcendental instructions
    • Gelu, exponential, softmax
    • SFPU C++ compiler
    • No need to off-load to CPU

    Tensix-BH Features: basically the same as WH
    Silicon-Proven with Tenstorrent Galaxy, Wormhole, and Blackhole
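
A quick consistency check ties the per-core figure on this slide to the chip-level numbers. The sketch below back-computes the clock implied by the 745 TFLOPS (slide 1) and 774 TOPS (slide 9) figures from 4096 MMUL ops/cycle per core. It assumes five “Baby RISC-V” cores per Tensix core, which is not stated on the slides and which turns the 700 Tensix-side Baby RISC-V cores into 140 Tensix cores. It also assumes the 4096 ops/cycle figure applies to 8-bit formats. The function and constant names are illustrative.

```python
# Figures taken from the slides
MMUL_OPS_PER_CYCLE = 4096      # per Tensix core (slide 3)
BABY_RISCV_IN_TENSIX = 700     # slide 1
PEAK_8BIT_TFLOPS = 745         # slide 1, planned/projected
P150_8BIT_TOPS = 774           # slide 9

# Assumption (not on the slides): 5 "Baby RISC-V" cores per Tensix core
BABY_RISCV_PER_TENSIX = 5


def implied_clock_ghz(peak_tera_ops, tensix_cores, ops_per_cycle=MMUL_OPS_PER_CYCLE):
    """Clock (GHz) at which tensix_cores * ops_per_cycle reaches peak_tera_ops."""
    return peak_tera_ops * 1e12 / (tensix_cores * ops_per_cycle) / 1e9


tensix_cores = BABY_RISCV_IN_TENSIX // BABY_RISCV_PER_TENSIX
clock_from_slide1 = implied_clock_ghz(PEAK_8BIT_TFLOPS, tensix_cores)
clock_from_p150 = implied_clock_ghz(P150_8BIT_TOPS, tensix_cores)

print(tensix_cores)
print(round(clock_from_slide1, 2), round(clock_from_p150, 2))
```

Under these assumptions the two peak figures correspond to roughly 1.30 GHz and 1.35 GHz, which suggests they differ only in the assumed clock.
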
  4. Blackhole – “Big RISC-V” and “Baby RISC-V”

    x280 “Big RISC-V” (16 cores)
    Feature: Spec
    Total Cores: 16 (4 clusters of 4)
    Compute: 64-bit dual-issue, in-order
    L3 Cache: 2 MB / Core
    L2 Cache: 128 KB / Core
    L1 I-Cache: 32 KB / Core (2-way associative)
    L1 D-Cache: 32 KB / Core (4-way associative)

    “Baby RISC-V” (752 cores)
    Feature: Spec
    Total Cores: 752
    Compute: 32-bit INT multiplier / divider; Floating point (FP32/BFLOAT16); 128-bit vector (1 per Tensix)
    I-Cache: 4 KB
    D-Scratch: 8 KB

    Planned:
    • Runs Linux
    • On-device host for the AI accelerator

    Diagram legend: T = Tensix Cores, D = DRAM Cores, E = ETH Cores, CPU = RISC-V CPUs
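
The core and cache counts on this slide can be cross-checked against the spec table on slide 1. The sketch below simply redoes that arithmetic. Reading the 32 MB of x280 SRAM on slide 1 as the sum of the per-core L3 caches is an assumption, and the variable names are illustrative.

```python
# "Baby RISC-V" cores, split as on slide 1
baby_in_tensix = 700
baby_in_controllers = 52
baby_total = baby_in_tensix + baby_in_controllers

# x280 "Big RISC-V" cores: 4 clusters of 4, 2 MB L3 per core
big_cores = 4 * 4
l3_mb_per_core = 2

# Assumption: slide 1's "32 MB (SiFive x280 RISC-V Cores)" is the summed L3
big_l3_total_mb = big_cores * l3_mb_per_core

print(baby_total, big_cores, big_l3_total_mb)
```

Both totals agree with the slides: 752 Baby RISC-V cores, and 32 MB across the 16 x280 cores.
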
  5. Programming RISC-V Cores: Within the DRAM Cores

    • Kernels for asynchronous pre-load / spill to DRAM

    Diagram labels: RISC-V 1, Router 0, L1 Memory (repeated three times); Crossbar; DRAM Bank Controller; Off-Chip DRAM 4GB
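  6. Programming RISC-V Cores: Within the Ethernet Core

    Why “12x 400GbE” does not match the number of tiles
    • Of the 14 Ethernet tiles, 2 are completely dead (no RISC-V and no L1 memory)
    • On the P150 and similar cards, 4 of the remaining 12 have nothing connected on the far side of the ETH
    • On Galaxy, 10 ETH are connected

    Diagram labels: RISC-V 2, RISC-V 1, Router 1, Router 0, L1 Memory, ETH, Off-Chip Ethernet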
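
The tile counts on this slide line up with the bandwidth figures elsewhere in the deck. The sketch below redoes the arithmetic: usable tiles, the one-way aggregate for a p150-class card compared with its 4x QSFP-DD 800G ports (slide 9), and the Galaxy case compared with the 1 TB/s on slide 7. Treating that 1 TB/s as a bidirectional total is an assumption, and the names are illustrative.

```python
LINK_GBIT_S = 400          # per Ethernet tile ("12x 400G" on slide 1)

TOTAL_ETH_TILES = 14
DEAD_TILES = 2             # no RISC-V, no L1 memory
P150_UNCONNECTED = 4       # usable tiles with nothing attached on p150-class cards
GALAXY_CONNECTED = 10

usable_tiles = TOTAL_ETH_TILES - DEAD_TILES
p150_connected = usable_tiles - P150_UNCONNECTED


def aggregate_gbit_s(tiles):
    """One-way aggregate bandwidth in Gbit/s for the given number of tiles."""
    return tiles * LINK_GBIT_S


p150_gbit_s = aggregate_gbit_s(p150_connected)
qsfp_gbit_s = 4 * 800      # 4x QSFP-DD 800G ports (slide 9)

# Assumption: the 1 TB/s figure counts both directions of every link
galaxy_gbyte_s_bidir = aggregate_gbit_s(GALAXY_CONNECTED) * 2 / 8

print(usable_tiles, p150_connected)
print(p150_gbit_s, qsfp_gbit_s, galaxy_gbyte_s_bidir)
```

With these inputs, 8 connected tiles give 3200 Gbit/s, the same as four 800G ports, and 10 connected tiles give 1000 GB/s when both directions are counted.
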
  7. Blackhole: Ethernet-Based Scale-Out

    • 1 TB/s of Blackhole Ethernet
    • Can be connected into any topology
    • Mesh topology is great for AI
    • Locality and regularity of data movement
    • Sharded data
    • 200 GB/s in N / S / W / E / Z
    • 2D / 3D torus

    Diagram: four Blackhole chips, each drawn as a grid of T, C, and E tiles with attached DRAM.
    Diagram labels: Tile Math Engine, Vector Math Engine, RISC-V, Router, DRAM Bank Controller, ETH Controller; Compute, Data Movement, Storage; user kernel
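
A mesh and a torus differ only in whether the edge links wrap around. The sketch below is a minimal illustration of that difference for the N / S / W / E directions, plus the per-chip sum over the five directions named on this slide. The coordinate convention (N means y - 1) and the function name are assumptions for illustration, not Tenstorrent's API.

```python
GB_S_PER_DIRECTION = 200
DIRECTIONS = ["N", "S", "W", "E", "Z"]

# 5 directions x 200 GB/s, to compare with "1 TB/s of Blackhole Ethernet"
per_chip_gb_s = len(DIRECTIONS) * GB_S_PER_DIRECTION


def neighbors(x, y, width, height, wrap=False):
    """Return {direction: (x, y)} for the in-plane neighbours of chip (x, y).

    wrap=False models a mesh (edge chips have fewer neighbours);
    wrap=True models a 2D torus (edge links wrap to the opposite side).
    """
    steps = {"N": (0, -1), "S": (0, 1), "W": (-1, 0), "E": (1, 0)}
    result = {}
    for name, (dx, dy) in steps.items():
        nx, ny = x + dx, y + dy
        if wrap:
            nx, ny = nx % width, ny % height
        elif not (0 <= nx < width and 0 <= ny < height):
            continue  # no link off the edge of a plain mesh
        result[name] = (nx, ny)
    return result


print(per_chip_gb_s)
print(neighbors(0, 0, width=8, height=4))
print(neighbors(0, 0, width=8, height=4, wrap=True))
```

On an 8-wide, 4-high grid the corner chip has two in-plane neighbours as a mesh and four as a torus. The Z direction is left out of `neighbors` because the slide does not say what it connects to.
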
  8. Tenstorrent Galaxy Blackhole Server: 32 Chips in a 4x8 Mesh

    Diagram: 32 BH chips arranged in a 4x8 mesh.

    Z dim I/O: 32 x 200 GB/s
    Y dim I/O: 16 x 200 GB/s
    X dim I/O: 8 x 200 GB/s
    11.2 TB/s Galaxy I/O

    Now that Z-dim links have been added, it is tempting to imagine a 3D mesh, or all-to-all connections between smaller 2D mesh partitions.
    It feels like training is probably becoming a realistic target.
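
The 11.2 TB/s total is the sum of the three per-dimension figures on this slide. The sketch below reproduces it and also shows one way the 8 / 16 / 32 link counts could follow from a 4x8 grid: two open ends per row in X, two open ends per column in Y, and one Z link per chip. That derivation is an assumption that happens to match the slide, and the function name is illustrative.

```python
def galaxy_io(cols=8, rows=4, gb_s_per_link=200):
    """Return (x_links, y_links, z_links, total_gb_s) for a cols x rows mesh."""
    x_links = 2 * rows       # assumption: both ends of every row are exposed
    y_links = 2 * cols       # assumption: both ends of every column are exposed
    z_links = cols * rows    # assumption: one Z link per chip
    total_gb_s = (x_links + y_links + z_links) * gb_s_per_link
    return x_links, y_links, z_links, total_gb_s


x_links, y_links, z_links, total_gb_s = galaxy_io()
print(x_links, y_links, z_links)
print(total_gb_s / 1000, "TB/s")
```

With the defaults this gives 8, 16, and 32 links and 11200 GB/s in total, matching the slide.
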
  9. Blackhole Products

    6nm AI Accelerator on PCIe Gen 5

    Blackhole p150
    P150b: passive-cooled (no fan)
    P150a: active-cooled (with fan)
    • 4x QSFP-DD 800G ports: Ethernet for linking accelerators (sold separately)
    • 32 GB DDR
    • 774 TOPS (FP8), 194 TFLOPS (FP16)
    • Dual-slot, 300 W TBP

    Diagram labels: Single ASIC, Scale-Out
  10. Blackhole Products

    6nm AI Accelerator on PCIe Gen 5

    Blackhole p300
    P300c: liquid-cooled, two chips on board
    • 2x Warp400 bridge ports
    • 32x2 GB DDR
    • 570x2?? TOPS (FP8)
    • Single-slot, 600 W TBP

    Diagram labels: Single ASIC, Scale-Out (no official image)
  11. TT-LoudBox/TT-QuietBox (Blackhole)

    TT-LoudBox
    CPU: 2x AMD EPYC 9124P
    Memory: 768 GB (12x64 GB) DDR5
    Storage: 4 TB NVMe PCIe 4.0 x4
    Ethernet: 10Gb
    Tensix Processors: 8 x p150b (8x BH Chips)

    TT-QuietBox
    CPU: AMD EPYC 8124P
    Memory: 512 GB (8x64 GB) DDR5-4800
    Storage: 4 TB NVMe PCIe 4.0 x4
    Ethernet: 10Gb
    Tensix Processors: 2 x p300c (4x BH Chips)

    The QuietBox also runs on 100 V, so it can be used like a desktop machine in an office. The LoudBox requires 200 V, so installing it in a server room is recommended.
    The LoudBox can run two main LLM pipelines, or smaller models such as image generation and speech recognition, at the same time → well suited to a self-contained AI appliance.

    *Performance as of 4/16/2025. DP/TP refer to parallelization; DP is “Data Parallel”, TP is “Tensor Parallel”