regularity of data movement
• Sharded data
• 200 GB/s in N / S / W / E / Z
• 2D / 3D torus
[Figure: chip floorplan drawn as a grid of blocks labeled T, C, E and DRAM. Legend: Tile Math Engine, Vector Math Engine, RISC-V, Router, DRAM Bank Controller, ETH Controller. Categories: Compute, Data Movement, Storage. Additional labels: RISC-V, user kernel.]
[Figure: grids of blocks labeled E and W illustrating TT-Fabric Unicast and TT-Fabric Multi-Cast (Chip and Core Level).]
• 2D Mesh
• 2D Torus
Enabling Reliability mode additionally supports a Hybrid Topology that combines these.

Why not 2D everywhere?
• TT-Fabric is built on a "Pay for what you use" design philosophy.
• Not every workload needs high connectivity; in some cases a simple Ring is enough. Choosing the simplest topology that fits can therefore run more efficiently.

x Multi-Mesh
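To make the difference between the two topologies concrete, the sketch below counts the minimal hops between two chips on a grid with and without wraparound links. It is a toy model only: `hop_count` and the 4 x 4 grid are illustrative assumptions and are not part of TT-Fabric.

```python
def hop_count(src, dst, shape, torus=False):
    """Minimal hop count between two chips on a grid (hypothetical helper)."""
    total = 0
    for s, d, n in zip(src, dst, shape):
        delta = abs(s - d)
        # A torus adds wraparound links, so the shorter way around may win.
        total += min(delta, n - delta) if torus else delta
    return total


# Corner-to-corner traffic on an assumed 4 x 4 grid of chips.
mesh_hops = hop_count((0, 0), (3, 3), (4, 4))
torus_hops = hop_count((0, 0), (3, 3), (4, 4), torus=True)
```

The extra links shorten some paths, but they are links that every deployment would have to provide, which is the trade-off behind picking the simplest topology a workload can live with.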
• Only 2D meshes are supported at present; topologies with three or more dimensions are planned.
[Figure: Mesh Cluster. Labels: chip, core(s), line fabric[0], line fabric[1], line fabric[2], line fabric[3], line fabric[4], line fabric[5].]
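[...]
• Within a single Source, packets always arrive in order
• No ordering is guaranteed across Routing Planes
In other words, global ordering is not guaranteed; per-Source ordering is the baseline.

Illustration of the behavior:
• CI does not take Traffic from CI0 into account
• It only checks whether its own Endpoint is able to receive
• No ordering control is performed between RP0 and RP1
[Figure labels: C0, C1, C2; X, Y, Z; x, y, z; RP 0, RP 1; i, j, k.]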
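The ordering rule above can be illustrated with a small simulation. This is a sketch under stated assumptions, not TT-Fabric code: each routing plane is modeled as an independent FIFO, each source is assumed to stay on one plane, and `deliver` and `in_order_per_source` are hypothetical names.

```python
import random
from collections import deque


def deliver(planes, rng):
    """Drain per-plane FIFOs in an arbitrary interleaving."""
    queues = [deque(p) for p in planes]
    arrived = []
    while any(queues):
        # Any non-empty plane may deliver next: no cross-plane ordering.
        queue = rng.choice([q for q in queues if q])
        arrived.append(queue.popleft())
    return arrived


def in_order_per_source(arrived):
    """True if sequence numbers increase for every (plane, source) pair."""
    last_seen = {}
    for plane, source, seq in arrived:
        key = (plane, source)
        if seq <= last_seen.get(key, -1):
            return False
        last_seen[key] = seq
    return True


# Packets are (plane, source, sequence number) tuples.
rp0 = [(0, "src_a", 0), (0, "src_a", 1), (0, "src_a", 2)]
rp1 = [(1, "src_b", 0), (1, "src_b", 1), (1, "src_b", 2)]
arrival_order = deliver([rp0, rp1], random.Random(0))
```

Whatever interleaving the random choice produces, each source's packets keep their relative order, while the order between `src_a` and `src_b` packets is unspecified.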
on Chip 0 does a fabric multicast to all other devices
• NoC level has multiple destinations; packet is broken into two chunks and scattered
• e.g. to cores (3,0) and (1,2) for this packet
• Improves efficiency; common for interleaved tensors
[Figure: chips C0, C1, C2, C3 at the Fabric Level; at the NoC Level, dest: W(3,0), W(1,2) receiving chunk0 and chunk1.]
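The NoC-level scatter described above can be sketched as follows. The equal-size chunking and the `scatter` helper are assumptions made for illustration; the slide only states that the packet is broken into two chunks sent to two cores.

```python
def scatter(payload, dest_cores):
    """Split a payload into one contiguous chunk per destination core."""
    count = len(dest_cores)
    chunk_size = -(-len(payload) // count)  # ceiling division
    return {
        core: payload[i * chunk_size:(i + 1) * chunk_size]
        for i, core in enumerate(dest_cores)
    }


# One packet, two NoC destinations, as in the slide's example.
writes = scatter(b"ABCDEFGH", [(3, 0), (1, 2)])
```

Here `writes` maps core (3,0) to the first half of the payload and core (1,2) to the second half, so one fabric packet serves two NoC writes.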
Chip 0 does a fabric multicast to all other devices
• The NoC level sends the payload to (3,3) and a semaphore increment to (1,0)
• Note: there is a race condition if (1,0) eventually needs to read from (3,3)
• Solved with a pipeline flush (the pipeline is flushed unless the destination is the same)
[Figure: chips C0, C1 at the Fabric Level; at the NoC Level, one Packet carrying a payload and a seminc (+1).]
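The race and its fix can be shown with a toy model. Everything here is an assumption for illustration: `NocModel` treats payload writes as posted until drained, lets a semaphore increment land immediately, and assumes traffic to a single destination stays in order.

```python
class NocModel:
    """Toy NoC: payload writes are posted and land only when drained."""

    def __init__(self):
        self.data = {}
        self.sems = {}
        self.pending = []

    def post_write(self, core, value):
        self.pending.append((core, value))

    def flush(self):
        # Wait for every posted write to land.
        for core, value in self.pending:
            self.data[core] = value
        self.pending = []

    def sem_inc(self, core):
        # Assumption: earlier writes to this same core land first.
        still_pending = []
        for dst, value in self.pending:
            if dst == core:
                self.data[dst] = value
            else:
                still_pending.append((dst, value))
        self.pending = still_pending
        self.sems[core] = self.sems.get(core, 0) + 1


def deliver(noc, payload_core, sem_core, payload, flush_first):
    """Return what a reader sees at payload_core once the semaphore fires."""
    noc.post_write(payload_core, payload)
    if flush_first and payload_core != sem_core:
        noc.flush()
    noc.sem_inc(sem_core)
    visible = noc.data.get(payload_core)
    noc.flush()
    return visible


racy = deliver(NocModel(), (3, 3), (1, 0), b"data", flush_first=False)
safe = deliver(NocModel(), (3, 3), (1, 0), b"data", flush_first=True)
same = deliver(NocModel(), (3, 3), (3, 3), b"data", flush_first=False)
```

Without the flush, the reader woken by the semaphore can observe (3,3) before the payload has landed; with the flush, or when both go to the same core, the payload is visible first.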
                   Single Mesh   Inter-Mesh   Spans Multi-Mesh
Point-To-Point     ALLOWED       ALLOWED      N/A
Fabric Multicast   ALLOWED       ALLOWED      UNSUPPORTED
[Figure: example placements with nodes labeled s and d for each case.]
must:
• Establish a connection with local fabric endpoint
• Check/Wait for buffering space in local fabric endpoint
• Send packet(s)
• Close connection
[Figure: chips C0, C1, C2, C3.]
Note: the Connection mechanism is scheduled for removal.
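The four steps above can be sketched as a small loop. `LocalFabricEndpoint` and its methods are hypothetical stand-ins invented for this example, not the TT-Fabric API, and `drain_one` merely models the router freeing a buffer slot.

```python
from collections import deque


class LocalFabricEndpoint:
    """Hypothetical stand-in for a chip's local fabric endpoint."""

    def __init__(self, slots):
        self.slots = slots
        self.buffer = deque()
        self.forwarded = []
        self.connected = False

    def open(self):
        self.connected = True

    def has_space(self):
        return len(self.buffer) < self.slots

    def drain_one(self):
        # Models the router forwarding one buffered packet.
        if self.buffer:
            self.forwarded.append(self.buffer.popleft())

    def push(self, packet):
        assert self.connected and self.has_space()
        self.buffer.append(packet)

    def close(self):
        self.connected = False


def send_packets(endpoint, packets):
    endpoint.open()                          # 1. establish a connection
    try:
        for packet in packets:
            while not endpoint.has_space():  # 2. wait for buffering space
                endpoint.drain_one()
            endpoint.push(packet)            # 3. send the packet
    finally:
        endpoint.close()                     # 4. close the connection


endpoint = LocalFabricEndpoint(slots=2)
send_packets(endpoint, ["p0", "p1", "p2"])
```

With two slots and three packets, the third send has to wait until the first packet has been forwarded, and the connection is closed even if a send fails.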
completion, the worker must terminate the connection
• Worker shares information such as last written buffer index for the next connection to read
[Figure labels: disconnect, ack.]
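One way to picture the handoff above is a shared record that the departing worker updates and the next worker reads. This is a sketch only: `SharedConnectionState`, the ring of buffer slots, and `run_worker` are assumptions for illustration.

```python
class SharedConnectionState:
    """Hypothetical state left behind for the next connection."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.last_written = -1  # nothing written yet


def run_worker(state, packet_count):
    """Write packet_count packets, resuming after the previous worker."""
    slot = state.last_written
    used = []
    for _ in range(packet_count):
        slot = (slot + 1) % state.num_slots
        used.append(slot)
    # On disconnect, publish the last written index for the next worker.
    state.last_written = slot
    return used


state = SharedConnectionState(num_slots=4)
first_worker_slots = run_worker(state, 3)
second_worker_slots = run_worker(state, 3)
```

The second worker continues at slot 3 and wraps around, rather than overwriting the slots the first worker just filled.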
credits for the packet (2), it processes it
• The packet is forwarded over Ethernet as soon as the destination has enough buffering available (3)
[Figure: on Chip 0, a worker writes a packet (1) into outbound buffer 0 / outbound buffer 1, the packet is processed (2), and it travels over Ethernet (3) to the inbound buffer on Chip 1.]
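The forwarding rule above is a form of credit-based flow control, sketched below. `EthernetLink` and its methods are hypothetical names, and one credit per remote buffer slot is an assumption made for this example.

```python
from collections import deque


class EthernetLink:
    """Toy link: one credit per free slot in the remote inbound buffer."""

    def __init__(self, remote_slots):
        self.credits = remote_slots
        self.remote_inbound = deque()

    def try_forward(self, outbound):
        # Forward only when the destination has buffering available.
        if not outbound or self.credits == 0:
            return False
        self.remote_inbound.append(outbound.popleft())
        self.credits -= 1
        return True

    def remote_consume(self):
        # The remote side frees a slot, which returns a credit.
        packet = self.remote_inbound.popleft()
        self.credits += 1
        return packet


link = EthernetLink(remote_slots=1)
outbound = deque(["pkt_a", "pkt_b"])
first = link.try_forward(outbound)    # a slot is free
second = link.try_forward(outbound)   # the remote buffer is full
consumed = link.remote_consume()
third = link.try_forward(outbound)    # the credit has come back
```

The second attempt stalls instead of dropping or overwriting data, and it succeeds once the remote side has consumed a packet.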
router gets a new packet (3)
• It inspects the packet to see whether:
• it is a destination of the packet, in which case it must write out to the local NoC (4)
• it must forward the packet further in the fabric (5)
• The packet is only forwarded when safe (space available)
[Figure: on Chip 1, a packet (3) arrives in the inbound buffer; the payload (4) is written to workers (0,0), (0,1), (1,0), (1,1); the packet (5) moves to outbound buffer 0 / outbound buffer 1.]
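The router's decision above can be sketched as a pure function. This is an illustrative model, not the router firmware: the packet is a plain dict with a hypothetical `dest_chips` field, and chips are assumed to sit on a line with increasing ids so that "further in the fabric" means a higher id.

```python
def inspect_packet(packet, chip_id, outbound_free_slots):
    """Return (write_local, needs_forward, can_forward_now) for a packet."""
    dest_chips = packet["dest_chips"]
    # (4) This chip is a destination: write the payload to the local NoC.
    write_local = chip_id in dest_chips
    # (5) Some destination lies further along the line.
    needs_forward = any(chip > chip_id for chip in dest_chips)
    # Forward only when it is safe, i.e. outbound space is available.
    can_forward_now = needs_forward and outbound_free_slots > 0
    return write_local, needs_forward, can_forward_now


multicast = {"dest_chips": {1, 2, 3}}
at_chip_1 = inspect_packet(multicast, chip_id=1, outbound_free_slots=2)
at_chip_3 = inspect_packet(multicast, chip_id=3, outbound_free_slots=2)
stalled = inspect_packet(multicast, chip_id=1, outbound_free_slots=0)
```

A chip in the middle of a multicast both writes locally and forwards, the last chip only writes locally, and a chip with no outbound space holds the packet until a slot frees up.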