[Diagram: SMP node (processors, memory, I/O on a shared interconnect) contrasted with cluster nodes (each with processor, memory, I/O) joined by a network switch]
Cache-coherent shared-memory multiprocessors (SMP):
– Tightly coupled: highest bandwidth, lowest latency
– Large workloads: ad-hoc transaction processing, data warehousing
– Shared pool scaling to over 100 processors
– Single terabyte-scale memory
– Single OS instance
– Scales vertically (capability)
Cluster multiprocessor:
– Loosely coupled
– Standard H/W & S/W
– Highly parallel (web, HPTC)
– Multiple OS instances with cluster management
– Scales horizontally (capacity)
– System interconnect latency & bandwidth (the #1 issue for real-world cluster performance and scaling)
– Network and storage I/O
– Operating system scalability
– Visualization performance and quality
– Optimized applications
– Network service availability
– Solaris now (finally) does Jumbo frames!
– 10GigE bandwidth is I/O-bus limited by PCI-X
Latency improvements on the way:
– 100us typical Solaris MPI over TCP/IP
– 40-60us MPI over TCP/IP with the simpler Linux stack
– 10us MPI over TCP/IP with a user-mode stack
– 5us MPI over raw 1Gbit Ethernet (no switch)
– Buffered switch latency 1-25us, 3-6us typical
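The MPI latency figures above are the sort of number a ping-pong microbenchmark reports: rank 0 bounces a small message off rank 1, and half the averaged round-trip time is quoted as one-way latency. A minimal illustrative sketch (not from the original deck):

/* Minimal MPI ping-pong latency sketch (illustrative, not from the deck).
 * Rank 0 sends a small message to rank 1 and waits for the reply;
 * half the averaged round-trip time approximates the one-way MPI latency. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int iters = 10000;
    char buf[8] = {0};                      /* small message, latency-bound */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency: %.1f us\n", (t1 - t0) / iters / 2 * 1e6);

    MPI_Finalize();
    return 0;
}

Run with two ranks (for example, mpirun -np 2), placing one rank on each node so the measurement crosses the interconnect rather than shared memory.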
Latency:
– 7us typical Opteron/Linux MPI (rev D)
– 5us with latest rev E interface and GM software
– 3.5us with new MX software (Oct 03 announcement)
– Non-buffered, low-latency 128-way switch
Bandwidth limited by 2Gbit fiber:
– Dual-port rev E card supports 4Gbit each way
– Full duplex reaches the PCI-X limit of 900 MB/s
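A quick check of the arithmetic behind that last bullet (my reading, not stated on the slide): each 2Gbit fiber port moves roughly 250 MBytes/s per direction, so

\[
2 \times 2\,\mathrm{Gbit/s} \approx 500\,\mathrm{MB/s\ per\ direction},\qquad
500\,\mathrm{MB/s} \times 2\ (\mathrm{full\ duplex}) \approx 1000\,\mathrm{MB/s},
\]

which is just above the ~900 MBytes/s that PCI-X delivers in practice, so the host bus rather than the fiber becomes the limit.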
[Diagram: cluster rack connected over 2Gbit Myrinet to a Myrinet switch with Ethernet ports; NFS over TCP/IP runs on Myrinet and on 1Gbit Ethernet; the QFS cluster filesystem is served through a SAN switch on a 2Gbit SAN]
– Solaris Express (S10 beta)
Latency:
– 5.5us Opteron/Linux MPI
– Non-buffered 24-port switch latency 200ns
– Larger switches still under 1us
Bandwidth limited by PCI-X:
– IBx4 carries 8Gbits of data on a 10Gbit wire
– Current limit about 825 MBytes/s
– Dual IBx4 over a PCI-Express x8 chipset announced
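The 8Gbit-on-a-10Gbit-wire relationship is IB's 8b/10b line coding; checking the numbers (my arithmetic, not from the slide):

\[
4\ \mathrm{lanes} \times 2.5\,\mathrm{Gbit/s} = 10\,\mathrm{Gbit/s\ signalling},\qquad
10\,\mathrm{Gbit/s} \times \tfrac{8}{10} = 8\,\mathrm{Gbit/s\ data} = 1\,\mathrm{GB/s},
\]

so the ~825 MBytes/s seen today is the PCI-X host bus limiting, not the IB link itself.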
– Emulate an 8Gbit Ethernet with NFS/TCP/IP/IB
– Emulate an 8Gbit Ethernet with iSCSI/TCP/IP/IB
NFS over RDMA:
– Reduce overhead with direct NFS/IB
SRP – SCSI over RDMA Protocol:
– Reduce overhead with direct SCSI/IB
SDP – Sockets Direct Protocol:
– Reduce overhead with socket library/IB
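The attraction of SDP is that ordinary sockets code needs no source changes: the socket library maps the same calls onto IB underneath. A hypothetical minimal TCP client to make the point (address and port are made up for illustration):

/* Ordinary TCP client; with SDP the socket library maps this same
 * AF_INET/SOCK_STREAM path onto InfiniBand, no source changes needed.
 * Host address and port are made up for illustration. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in srv;
    memset(&srv, 0, sizeof srv);
    srv.sin_family = AF_INET;
    srv.sin_port = htons(5000);                          /* illustrative port */
    inet_pton(AF_INET, "192.168.1.10", &srv.sin_addr);   /* illustrative address */

    if (connect(fd, (struct sockaddr *)&srv, sizeof srv) < 0) {
        perror("connect");
        close(fd);
        return 1;
    }

    const char msg[] = "hello over sockets\n";
    write(fd, msg, sizeof msg - 1);   /* same call path whether TCP/IP or SDP carries it */
    close(fd);
    return 0;
}

The same source runs over plain Ethernet TCP/IP, TCP/IP over IB, or SDP, depending only on which socket library and transport sit underneath it.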
[Diagram: cluster rack and NFS/QFS servers connected over 8Gbit IB to an Infiniband switch with Ethernet ports; NFS over TCP/IP runs over IB and on 1Gbit Ethernet; the QFS cluster filesystem is served through a SAN switch on a 2Gbit SAN]
– Adding 8/64/128-way options
Latency:
– 1.8us Opteron/Linux MPI (Feb 2004 data)
– Large non-buffered, very low latency switch
– Low-contention fat tree with dynamic routing
Bandwidth limited by PCI-X:
– Current limit about 850 MBytes/s
Latency:
– 56ns for a single Opteron
– 100-200ns for 2-4 CPU Opteron or USIIIi
– 270-550ns for 16-144 core SMP US IV systems
Bandwidth limited by coherency:
– Global coherency 9.6-57 GB/s with USIII-USIV
– Distributed coherency adds bandwidth
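Local memory latencies in this range are typically measured with a dependent-load (pointer-chasing) loop, where each load address comes from the previous load so the memory system cannot overlap them. An illustrative sketch (not from the deck; buffer size and iteration count are arbitrary):

/* Pointer-chasing latency sketch: each load depends on the previous one,
 * so the time per iteration approximates memory access latency.
 * Buffer size and iteration count are arbitrary illustrative choices. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N ((size_t)1 << 22)     /* 4M pointers, ~32MB: well beyond cache */

int main(void)
{
    size_t *chain = malloc(N * sizeof *chain);
    size_t *order = malloc(N * sizeof *order);
    if (!chain || !order) return 1;

    /* Build a random cyclic permutation so hardware prefetch can't help. */
    for (size_t i = 0; i < N; i++) order[i] = i;
    srand(1);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < N; i++)
        chain[order[i]] = order[(i + 1) % N];

    /* Chase the chain and time it. */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t i = 0; i < N; i++)
        p = chain[p];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg load-to-use latency: %.1f ns (dummy=%zu)\n", ns / N, p);

    free(order);
    free(chain);
    return 0;
}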
Older I/O bus:
– Up to 400 MBytes/s
– Older generation Myrinet and Quadrics
PCI-X, 64 bits wide, 133MHz:
– Current V60x (Xeon) and V20z (Opteron)
– Current generation Myrinet and Quadrics
– All Infiniband adaptors
– Runs up to about 850 MBytes/s
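For context, the theoretical peak behind that ~850 MBytes/s figure (my arithmetic, not on the slide):

\[
64\,\mathrm{bits} \times 133\,\mathrm{MHz} = 8.5\,\mathrm{Gbit/s} \approx 1064\,\mathrm{MB/s\ peak},
\]

of which roughly 850 MBytes/s is left once bus protocol overhead is paid.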
– Physical layer similar to Infiniband
– Each wire at 2.5GHz carries 2Gbits/s of data
– Common usage expected is 8 wires each way
– Bandwidth is 16 Gbits/s, 2 GBytes/s each way
Interconnect limitations:
– Enough capacity for full-speed 10Gbit Ethernet
– Enough capacity for full-speed Infiniband x4
– Limits Infiniband x12 to 66% of capacity
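Checking those capacity claims (my arithmetic, not from the slide): with 8b/10b coding each 2.5GHz lane carries 2Gbits/s of data, so

\[
8 \times 2\,\mathrm{Gbit/s} = 16\,\mathrm{Gbit/s} = 2\,\mathrm{GB/s\ each\ way},\qquad
\frac{16\,\mathrm{Gbit/s}}{12 \times 2\,\mathrm{Gbit/s}} = \frac{16}{24} \approx 66\%,
\]

enough for 10Gbit Ethernet or IB x4 (8Gbits/s of data) with headroom, but only about two thirds of what IB x12 can carry.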
What are your needs?
– High capacity storage?
– A single global filespace?
– Low latency MPI?
– Large scale SMP with threaded OpenMP?
Help is on its way from HPTC@Sun:
– More partnering, testing, support...
– Reference architecture solutions
– Professional Services practice guides