Slide 1

Interconnect Technologies
Adrian Cockcroft, HPTC Chief Architect, Sun Microsystems Inc.
4/21/04
www.sun.com/hptc

Slide 2

Agenda
– Architectures
– Interconnects
– Technical Comparison

Slide 3

Architectures

Slide 4

Capability and Capacity Computing
[Diagram: an SMP of processors and memory around a memory switch, versus a cluster of Proc/Mem/I/O nodes joined by a network switch]
Scale Vertically (Capability) – single OS instance
– Cache-coherent shared-memory multi-processors (SMP)
– Tightly coupled: highest bandwidth, lowest latency
– Large workloads: ad-hoc transaction processing, data warehousing
– Shared pool to over 100 processors
– Single terabyte-scale memory
Scale Horizontally (Capacity) – multiple OS instances, cluster management
– Cluster multi-processor
– Loosely coupled
– Standard H/W & S/W
– Highly parallel (web, HPTC)

Slide 5

Workload Performance Factors
– Processor speed, capacity and throughput
– Memory capacity
– System interconnect latency & bandwidth
– Network and storage I/O
– Operating system scalability
– Visualization performance and quality
– Optimized applications
– Network service availability
#1 issue for real-world cluster performance and scaling

Slide 6

Interconnect Components: Mapping Out Bandwidth and Latency
[Chart plots bandwidth (logarithmic scale, GBytes/s) against latency (inverted log scale, 10ns-100us); access methods range from system call to library call to load/store instruction and on-chip proximity]
– Ethernet: 0.1 GB/s, 100 - 10 µs, $xxx
– Myri/IB/Q/FL: 0.4 – 4.8 GB/s, 10 - 1 µs, $x,xxx
– Memory: 9.6 - 57 GB/s, 1 - 0.1 µs, $xx,xxx
– CMT on-chip: 100 – x00 GB/s, 0.1 - 0.01 µs, $xxx?

Slide 7

Interconnects

Slide 8

Ethernet
Bandwidth
– 1GigE 90-120 MB/s, big Solaris 10 improvements
– Solaris now (finally) does Jumbo frames!
– 10GigE bandwidth is I/O-bus limited by PCI-X
Latency improvements on the way
– 100us typical Solaris MPI over TCP/IP
– 40-60us MPI over TCP/IP for simpler Linux stack
– 10us MPI over TCP/IP with user-mode stack
– 5us MPI over raw 1Gbit Ethernet (no switch)
– Buffered switch latency 1-25us, 3-6us typical
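The MPI latency figures above are the kind of numbers a simple ping-pong microbenchmark reports. A minimal sketch using the standard MPI C API (message size and iteration count are arbitrary illustrative choices, not values from the slide):

/* Minimal MPI ping-pong latency sketch: rank 0 and rank 1 exchange a
 * small message many times; half the average round-trip time
 * approximates the one-way MPI latency quoted on this slide. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i, iters = 10000;
    char buf[8];                        /* small message to measure latency */
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency: %.2f us\n",
               (t1 - t0) / iters / 2.0 * 1e6);

    MPI_Finalize();
    return 0;
}

Run with two ranks (for example, mpirun -np 2) across two nodes; the result reflects the full TCP/IP or user-mode stack the slide compares.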

Slide 9

Ethernet Scalable NFS
1Gbit common, 10Gbit starting to emerge
[Diagram: cluster rack connected over 1Gbit Ethernet (NFS over TCP/IP) to an Ethernet switch; NFS/QFS servers running the QFS cluster filesystem reach the disks through a SAN switch over a 2Gbit SAN]

Slide 10

Myrinet
www.myri.com
Latency improving
– 11us typical Solaris/SPARC MPI (rev D)
– 7us typical Opteron/Linux MPI (rev D)
– 5us with latest rev E interface and GM software
– 3.5us with new MX software (Oct 03 announce)
– Non-buffered, low-latency 128-way switch
Bandwidth limited by 2Gbit fiber
– Dual-port rev E card supports 4Gbit each way
– Full duplex reaches PCI-X limit of 900 MB/s

Slide 11

Myrinet Scalable NFS
Very efficient Ethernet gateway
[Diagram: cluster rack connected over 2Gbit Myrinet (NFS over TCP/IP) to a Myrinet switch with Ethernet ports; NFS/QFS servers running the QFS cluster filesystem attach via 1Gbit Ethernet (NFS over TCP/IP) and reach the disks through a SAN switch over a 2Gbit SAN]

Slide 12

Infiniband
www.topspin.com, www.infinicon.com, www.voltaire.com, www.mellanox.com, www.infinibandta.org
IB support is in Solaris Express (S10 beta)
Latency
– 5.5us Opteron/Linux MPI
– Non-buffered 24-port switch latency 200ns
– Larger switches still under 1us
Bandwidth limited by PCI-X
– IB x4 carries 8Gbits of data on a 10Gbit wire
– Current limit about 825 MBytes/s
– Dual IB x4 over PCI-Express x8 chipset announced

Slide 13

Infiniband Protocol Options
NFS or iSCSI over IP over IB
– Emulate an 8Gbit Ethernet with NFS/TCP/IP/IB
– Emulate an 8Gbit Ethernet with iSCSI/TCP/IP/IB
NFS over RDMA
– Reduce overhead with direct NFS/IB
SRP – SCSI over RDMA Protocol
– Reduce overhead with direct SCSI/IB
SDP – Sockets Direct Protocol
– Reduce overhead with socket library/IB
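SDP keeps the familiar sockets API while moving the data path off TCP/IP and onto IB RDMA. A hedged sketch of a client opening an SDP connection on Linux, assuming the installed IB stack exposes an AF_INET_SDP address family (the constant's value is stack-specific, and many deployments instead preload an SDP interposer library over unmodified AF_INET programs; the address and port below are illustrative only):

/* Sketch: open a client stream over SDP instead of TCP.
 * Assumption: the IB software stack defines AF_INET_SDP (27 is a
 * common value, but check the local SDP headers). If SDP is not
 * installed, the socket() call simply fails at runtime. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

#ifndef AF_INET_SDP
#define AF_INET_SDP 27      /* assumed value, stack-specific */
#endif

int main(void)
{
    struct sockaddr_in srv;
    int fd = socket(AF_INET_SDP, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket(AF_INET_SDP)");
        return 1;
    }

    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;            /* IPv4-style addressing; some stacks
                                            expect AF_INET_SDP here instead */
    srv.sin_port = htons(2049);          /* hypothetical example port */
    inet_pton(AF_INET, "192.168.0.10", &srv.sin_addr);  /* hypothetical server */

    if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
        perror("connect");
        close(fd);
        return 1;
    }

    /* From here on, read()/write() behave like a normal stream socket,
     * but the transfers ride on InfiniBand rather than TCP/IP. */
    close(fd);
    return 0;
}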

Slide 14

Infiniband Scalable NFS
1Gbit Ethernet may be a bottleneck
[Diagram: cluster rack connected over 8Gbit IB (NFS over TCP/IP over IB) to an Infiniband switch with Ethernet ports; NFS/QFS servers running the QFS cluster filesystem attach via 1Gbit Ethernet (NFS over TCP/IP) and reach the disks through a SAN switch over a 2Gbit SAN]

Slide 15

Infiniband Scalable NFS
Infiniband directly connected to NFS servers
– Higher bandwidth
[Diagram: cluster rack connected over 8Gbit IB (NFS over RDMA) to an Infiniband switch; NFS/QFS servers running the QFS cluster filesystem reach the disks through a SAN switch over a 2Gbit SAN]

Slide 16

Infiniband Storage
Efficient direct access to disk
– Direct-mount disk over SRP/IB
– FC ports added to IB switch
– Use QFS for shared filesystem
[Diagram: cluster rack connected over 8Gbit IB to an Infiniband switch with FC ports, then through a SAN switch over a 2Gbit SAN to the disks]

Slide 17

SunFire Link
www.sun.com/servers/cluster_interconnects/sun_fire_link
Solaris/SPARC, large-server specific
– Cambridge/Aachen Univ. proofs of concept
– E25K 144 cores x 8 = 1152 processing threads
Latency
– 1.5us within SMP, 3.7us over SunFire Link
– Non-buffered 8-port switch
Bandwidth
– Each link carries 850 MB/s (1.2 GBytes/s raw)
– Stripe x4 to get 2900 MB/s (4.8 GBytes/s raw)

Slide 18

Quadrics
Moving downmarket
– 1024 or more Linux nodes today
– Adding 8/64/128-way options
Latency
– 1.8us Opteron/Linux MPI (Feb 2004 data)
– Large non-buffered, very low latency switch
– Low-contention fat tree with dynamic routing
Bandwidth limited by PCI-X
– Current limit about 850 MBytes/s

Slide 19

SMP Backplane
Shared memory OpenMP model
Latency
– Starts at 56ns for a single Opteron
– 100-200ns at 2-4 CPU Opteron or USIIIi
– 270-550ns for 16-144 core SMP US IV systems
Bandwidth limited by coherency
– Global coherency 9.6-57 GB/s with USIII-USIV
– Distributed coherency adds bandwidth
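The shared-memory OpenMP model mentioned above exchanges data through coherent loads and stores at the latencies listed, rather than over a message-passing interconnect. A minimal sketch using standard OpenMP pragmas (the array size and reduction workload are illustrative only):

/* Minimal OpenMP sketch: threads on a shared-memory SMP each sum part
 * of one array that lives in a single coherent address space, so
 * communication happens through the memory system rather than over an
 * external interconnect. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N];
    double sum = 0.0;
    int i;

    for (i = 0; i < N; i++)
        a[i] = 1.0;

    /* Iterations are divided across the SMP's processors; the partial
     * sums are combined through cache-coherent shared memory. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++)
        sum += a[i];

    printf("threads=%d sum=%f\n", omp_get_max_threads(), sum);
    return 0;
}

The same source scales from a 2-way Opteron to a 144-core SMP or a CMT chip simply by raising the thread count at run time.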

Slide 20

Chip Multi-Processing
Shared memory OpenMP model
– 32 threads per chip
Latency
– 10-50ns thread-to-thread via on-chip L2 cache
Bandwidth
– 100s of GB/s L2 cache bandwidth

Slide 21

Technical Comparison

Slide 22

PCI-Bus Limitations
PCI: 64 bits wide, 66MHz
– Most SPARC platforms, runs up to 400 MBytes/s
– Older generation Myrinet and Quadrics
PCI-X: 64 bits wide, 133MHz
– Current V60x (Xeon) and V20z (Opteron)
– Current generation Myrinet and Quadrics
– All Infiniband adaptors
– Runs up to about 850 MBytes/s

Slide 23

Next Generation PCI-Express x8
Implementations expected during 2005
Similar physical layer to Infiniband
– Each wire at 2.5GHz carries 2Gbits/s of data
– Common usage expected is 8 wires each way
– Bandwidth is 16 Gbits/s, 2 GBytes/s each way
Interconnect limitations
– Enough capacity for full speed 10Gbit Ethernet
– Enough capacity for full speed Infiniband x4
– Limits Infiniband x12 to 66% of capacity
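The lane arithmetic above can be checked directly: with the 8b/10b encoding shared with Infiniband, each 2.5GHz lane carries 2Gbit/s of data, so eight lanes give 16Gbit/s (2GBytes/s) per direction. A small sketch of that calculation (lane counts and link rates come from these slides; the rest is plain arithmetic):

/* Back-of-the-envelope check of the PCI-Express x8 numbers on this
 * slide: 8b/10b encoding turns a 2.5Gbit/s lane into 2Gbit/s of data;
 * eight lanes give 16Gbit/s = 2GByte/s each way, enough for 10Gbit
 * Ethernet or IB x4 (8Gbit/s of data) but only about 2/3 of an
 * IB x12 link (24Gbit/s of data). */
#include <stdio.h>

int main(void)
{
    double lane_signal_gbit = 2.5;                          /* per-lane signalling rate */
    double lane_data_gbit = lane_signal_gbit * 8.0 / 10.0;  /* after 8b/10b overhead   */
    double pcie_x8_gbit  = 8 * lane_data_gbit;              /* 16 Gbit/s each way      */
    double pcie_x8_gbyte = pcie_x8_gbit / 8.0;              /* 2 GByte/s each way      */

    double ib_x4_data_gbit  = 4  * lane_data_gbit;          /* 8 Gbit/s of data        */
    double ib_x12_data_gbit = 12 * lane_data_gbit;          /* 24 Gbit/s of data       */

    printf("PCIe x8: %.0f Gbit/s = %.0f GByte/s each way\n",
           pcie_x8_gbit, pcie_x8_gbyte);
    printf("IB x4 uses %.0f%% of PCIe x8\n",
           100.0 * ib_x4_data_gbit / pcie_x8_gbit);
    printf("IB x12 limited to %.0f%% of its capacity\n",
           100.0 * pcie_x8_gbit / ib_x12_data_gbit);
    return 0;
}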

Slide 24

Interconnect Components: Mapping Out Bandwidth and Latency
Estimated/approximate performance numbers
[Chart plots bandwidth (logarithmic scale, GBytes/s) against latency (inverted log scale, 10ns-100us) for: GBE Solaris 9, GBE Linux, Usermode GBE Linux, 10GBE Linux AMD, 10GBE PCI-E, Quadrics Linux AMD, Myrinet, IB x4 Linux AMD, IB x12 Linux AMD, SMP V440, SMP V20z, SMP E6900, SMP E25K, and CMT Future; annotated with the PCI-X 133 speed limit, the PCI-Express x8 speed limit, and the MPI call latency limit]

Slide 25

Summary
Lots of choice! Prices dropping!
Understand what your workload needs
– High capacity storage?
– A single global filespace?
– Low latency MPI?
– Large scale SMP with threaded OpenMP?
Help is on its way from HPTC@Sun
– More partnering, testing, support...
– Reference architecture solutions
– Professional Service practice guides

Slide 26

www.sun.com/hptc
www.sun.com/grid
Adrian Cockcroft
[email protected]