HPC Interconnect Technologies in 2004

Presented at the Sun SuperG event in 2004, summarizing the different flavors of HPC interconnects at the time.

Adrian Cockcroft

November 19, 2022
Transcript

1. Capability and Capacity Computing
   [Diagram: a cache-coherent SMP (processors and memory on a shared switch) contrasted with a cluster of nodes, each with its own processor, memory and I/O, joined by a network switch]
   Scale Vertically (Capability) – single OS instance:
   – Cache-coherent shared-memory multi-processors (SMP)
   – Tightly coupled: highest bandwidth, lowest latency
   – Large workloads: ad-hoc transaction processing, data warehousing
   – Shared pool of over 100 processors
   – Single terabyte-scale memory
   Scale Horizontally (Capacity) – multiple OS instances:
   – Cluster multi-processor with cluster management
   – Loosely coupled
   – Standard H/W & S/W
   – Highly parallel (web, HPTC)
2. Workload Performance Factors
   – Processor speed, capacity and throughput
   – Memory capacity
   – System interconnect latency & bandwidth
   – Network and storage I/O
   – Operating system scalability
   – Visualization performance and quality
   – Optimized applications
   – Network service availability
   #1 issue for real-world cluster performance and scaling
3. Interconnect Components – Mapping Out Bandwidth and Latency
   – Ethernet:     0.1 GB/s,      100–10 µs,    $xxx
   – Myri/IB/Q/FL: 0.4–4.8 GB/s,  10–1 µs,      $x,xxx
   – Memory:       9.6–57 GB/s,   1–0.1 µs,     $xx,xxx
   – CMT on-chip:  100–x00 GB/s,  0.1–0.01 µs,  $xxx?
   [Chart: bandwidth (log scale, 0.1–1000 GB/s) vs. latency (inverted log scale, 10 ns–100 µs); annotations: Proximity, System Call, Library Call, Load/Store Instruction]
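A convenient way to read this chart is the simple latency-plus-bandwidth cost model, where moving an n-byte message costs roughly T(n) = latency + n / bandwidth. The sketch below applies that model to rough midpoints of the ranges quoted above; the figures are illustrative assumptions, not measurements.

```c
/* Rough latency + bandwidth transfer-time model: T(n) = latency + n / bandwidth.
 * The figures below are approximate midpoints of the ranges on the chart. */
#include <stdio.h>

struct link { const char *name; double latency_s; double bw_bytes_per_s; };

int main(void) {
    struct link links[] = {
        { "Ethernet (1GigE)",     50e-6,  0.1e9 },
        { "Myrinet/IB/Quadrics",   5e-6,  1.0e9 },
        { "Memory (SMP)",        0.5e-6,   20e9 },
    };
    double msg = 64.0 * 1024;   /* 64 KB message */
    for (int i = 0; i < 3; i++) {
        double t = links[i].latency_s + msg / links[i].bw_bytes_per_s;
        printf("%-22s %8.1f us for a 64 KB message\n", links[i].name, t * 1e6);
    }
    return 0;
}
```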
4. Ethernet
   Bandwidth:
   – 1GigE 90–120 MB/s, big Solaris 10 improvements
   – Solaris now (finally) does jumbo frames!
   – 10GigE bandwidth is I/O-bus limited by PCI-X
   Latency improvements on the way:
   – 100 µs typical Solaris MPI over TCP/IP
   – 40–60 µs MPI over TCP/IP for the simpler Linux stack
   – 10 µs MPI over TCP/IP with a user-mode stack
   – 5 µs MPI over raw 1Gbit Ethernet (no switch)
   – Buffered switch latency 1–25 µs, 3–6 µs typical
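The MPI latencies quoted on this slide and the following ones are typically reported as half the round-trip time of a small-message ping-pong between two ranks. A minimal sketch of such a micro-benchmark follows; it is illustrative only, not the benchmark behind these specific figures.

```c
/* Minimal MPI ping-pong latency sketch: ranks 0 and 1 bounce a small
 * message back and forth; half the average round-trip time is the quoted
 * one-way MPI latency. Compile with mpicc and run with 2 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 10000;
    char buf[8] = {0};              /* small message: latency-dominated */
    MPI_Status st;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency ~ %.2f us\n", (t1 - t0) / iters / 2.0 * 1e6);
    MPI_Finalize();
    return 0;
}
```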
5. Ethernet Scalable NFS
   – 1Gbit common, 10Gbit starting to emerge
   [Diagram: cluster rack on 1Gbit Ethernet to an Ethernet switch; NFS/QFS servers serve NFS over TCP/IP on Ethernet and front disks on a 2Gbit SAN through a SAN switch, using the QFS cluster filesystem]
6. Myrinet (www.myri.com)
   Latency improving:
   – 11 µs typical Solaris/SPARC MPI (rev D)
   – 7 µs typical Opteron/Linux MPI (rev D)
   – 5 µs with the latest rev E interface and GM software
   – 3.5 µs with the new MX software (announced Oct 2003)
   – Non-buffered, low-latency 128-way switch
   Bandwidth limited by 2Gbit fiber:
   – Dual-port rev E card supports 4Gbit each way
   – Full duplex reaches the PCI-X limit of 900 MB/s
7. Myrinet Scalable NFS
   – Very efficient Ethernet gateway
   [Diagram: cluster rack on 2Gbit Myrinet; a Myrinet switch with Ethernet ports bridges NFS over TCP/IP on Myrinet to NFS over TCP/IP on 1Gbit Ethernet; NFS/QFS servers with the QFS cluster filesystem front disks on a 2Gbit SAN through a SAN switch]
8. Infiniband
   www.topspin.com, www.infinicon.com, www.voltaire.com, www.mellanox.com, www.infinibandta.org
   – IB support is in Solaris Express (S10 beta)
   Latency:
   – 5.5 µs Opteron/Linux MPI
   – Non-buffered 24-port switch latency 200 ns
   – Larger switches still under 1 µs
   Bandwidth limited by PCI-X:
   – IB x4 carries 8 Gbits of data on a 10 Gbit wire
   – Current limit about 825 MBytes/s
   – Dual IB x4 over a PCI-Express x8 chipset announced
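The "8 Gbits of data on a 10 Gbit wire" comes from InfiniBand's 8b/10b encoding; what is actually deliverable is the lower of that data rate and the PCI-X host-bus ceiling. A small sketch of the arithmetic, using the figures from the slide:

```c
/* InfiniBand x4 bandwidth arithmetic from the slide:
 * 4 lanes x 2.5 Gbit/s signalling = 10 Gbit/s on the wire,
 * 8b/10b encoding leaves 8 Gbit/s of data (~1000 MB/s),
 * but the PCI-X host bus caps delivery at roughly 825 MB/s. */
#include <stdio.h>

int main(void) {
    double lanes = 4, signal_gbit = 2.5;            /* IB x4 link */
    double wire_gbit  = lanes * signal_gbit;        /* 10 Gbit/s signalling   */
    double data_gbit  = wire_gbit * 8.0 / 10.0;     /* 8b/10b -> 8 Gbit/s data */
    double data_mbyte = data_gbit * 1000.0 / 8.0;   /* ~1000 MB/s              */
    double pcix_mbyte = 825.0;                      /* quoted PCI-X ceiling    */

    printf("IB x4 data rate : %.0f MB/s\n", data_mbyte);
    printf("deliverable     : %.0f MB/s (PCI-X limited)\n",
           data_mbyte < pcix_mbyte ? data_mbyte : pcix_mbyte);
    return 0;
}
```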
9. Infiniband Protocol Options
   NFS or iSCSI over IP over IB:
   – Emulate an 8Gbit Ethernet with NFS/TCP/IP/IB
   – Emulate an 8Gbit Ethernet with iSCSI/TCP/IP/IB
   NFS over RDMA:
   – Reduce overhead with direct NFS/IB
   SRP – SCSI over RDMA Protocol:
   – Reduce overhead with direct SCSI/IB
   SDP – Sockets Direct Protocol:
   – Reduce overhead with the socket library over IB
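The appeal of SDP is that ordinary sockets code needs no changes: a conventional TCP client like the sketch below is the kind of program a Sockets Direct layer can redirect onto IB underneath the socket API. The address and port here are placeholders, and the exact redirection mechanism is platform specific.

```c
/* Ordinary TCP sockets client: the kind of unmodified code a Sockets
 * Direct Protocol (SDP) layer can carry over InfiniBand transparently.
 * The host address and port below are placeholders for illustration. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(9000);                       /* placeholder port */
    inet_pton(AF_INET, "192.168.0.10", &addr.sin_addr);  /* placeholder host */

    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) == 0) {
        const char msg[] = "hello over sockets";
        write(fd, msg, sizeof msg);      /* same call whether TCP or SDP/IB */
    }
    close(fd);
    return 0;
}
```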
10. Infiniband Scalable NFS
    – 1Gbit Ethernet may be a bottleneck
    [Diagram: cluster rack on 8Gbit IB; an Infiniband switch with Ethernet ports carries NFS over TCP/IP over IB to the cluster and NFS over TCP/IP on 1Gbit Ethernet to the NFS/QFS servers, which front disks on a 2Gbit SAN through a SAN switch, using the QFS cluster filesystem]
11. Infiniband Scalable NFS
    – Infiniband directly connected to the NFS servers for higher bandwidth
    [Diagram: cluster rack on 8Gbit IB; the Infiniband switch connects directly to the NFS/QFS servers, which serve NFS over RDMA and front disks on a 2Gbit SAN through a SAN switch, using the QFS cluster filesystem]
12. Infiniband Storage
    – Efficient direct access to disk
    [Diagram: cluster rack on 8Gbit IB; FC ports added to the IB switch connect through a SAN switch to disks on a 2Gbit SAN; disks are direct-mounted over SRP/IB, with QFS used for the shared filesystem]
13. SunFire Link
    www.sun.com/servers/cluster_interconnects/sun_fire_link
    – Solaris/SPARC, large-server specific
    – Cambridge/Aachen University proofs of concept
    – E25K: 144 cores x 8 = 1152 processing threads
    Latency:
    – 1.5 µs within the SMP, 3.7 µs over SunFire Link
    – Non-buffered 8-port switch
    Bandwidth:
    – Each link carries 850 MB/s (1.2 GBytes/s raw)
    – Stripe x4 to get 2900 MB/s (4.8 GBytes/s raw)
14. Quadrics
    Moving downmarket:
    – 1024 or more Linux nodes today
    – Adding 8/64/128-way options
    Latency:
    – 1.8 µs Opteron/Linux MPI (Feb 2004 data)
    – Large non-buffered, very low latency switch
    – Low-contention fat tree with dynamic routing
    Bandwidth limited by PCI-X:
    – Current limit about 850 MBytes/s
15. SMP Backplane
    Shared-memory OpenMP model
    Latency:
    – Starts at 56 ns for a single Opteron
    – 100–200 ns for 2–4 CPU Opteron or USIIIi
    – 270–550 ns for 16–144 core SMP UltraSPARC IV systems
    Bandwidth limited by coherency:
    – Global coherency 9.6–57 GB/s with USIII–USIV
    – Distributed coherency adds bandwidth
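"Shared-memory OpenMP model" here means threads communicate through ordinary cache-coherent loads and stores at the latencies quoted above, rather than by message passing. A minimal OpenMP sketch, illustrative only (compile with an OpenMP-capable compiler, e.g. Sun Studio -xopenmp or gcc -fopenmp):

```c
/* Minimal OpenMP shared-memory sketch: threads on one SMP share the array
 * directly, so "communication" is just cache-coherent loads and stores. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    const int n = 1 << 20;
    static double a[1 << 20];     /* shared by all threads */
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        a[i] = i * 0.5;           /* each thread writes its slice of shared memory */
        sum += a[i];
    }
    printf("threads=%d sum=%g\n", omp_get_max_threads(), sum);
    return 0;
}
```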
16. Chip Multi-Processing
    Shared-memory OpenMP model:
    – 32 threads per chip
    Latency:
    – 10–50 ns thread-to-thread via on-chip L2 cache
    Bandwidth:
    – 100s of GB/s of L2 cache bandwidth
17. PCI Bus Limitations
    PCI, 64 bits wide, 66 MHz:
    – Most SPARC platforms; runs up to 400 MBytes/s
    – Older-generation Myrinet and Quadrics
    PCI-X, 64 bits wide, 133 MHz:
    – Current V60x (Xeon) and V20z (Opteron)
    – Current-generation Myrinet and Quadrics
    – All Infiniband adaptors
    – Runs up to about 850 MBytes/s
18. Next Generation: PCI-Express x8
    Implementations expected during 2005
    Similar physical layer to Infiniband:
    – Each wire at 2.5 GHz carries 2 Gbits/s of data
    – Common usage expected is 8 wires each way
    – Bandwidth is 16 Gbits/s, i.e. 2 GBytes/s each way
    Interconnect limitations:
    – Enough capacity for full-speed 10Gbit Ethernet
    – Enough capacity for full-speed Infiniband x4
    – Limits Infiniband x12 to 66% of capacity
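The bus ceilings on this slide and the previous one follow from simple arithmetic: bytes of width times clock rate for PCI and PCI-X, and lanes times 2.5 Gbit/s signalling with 8b/10b encoding for PCI-Express. The sketch below reproduces those peak numbers; the practical limits quoted on the slides are somewhat lower.

```c
/* Peak bus bandwidth arithmetic for the buses discussed above.
 * Parallel buses: bytes of width x clock. PCI-Express: lanes x 2.5 Gbit/s
 * signalling, with 8b/10b encoding leaving 2 Gbit/s of data per lane. */
#include <stdio.h>

int main(void) {
    double pci   = 8 * 66e6;            /* PCI 64/66: ~528 MB/s peak (~400 practical)   */
    double pcix  = 8 * 133e6;           /* PCI-X 64/133: ~1064 MB/s peak (~850 practical) */
    double pcie8 = 8 * 2.5e9 * 0.8 / 8; /* PCIe x8: 2 GB/s of data each direction        */

    printf("PCI 64/66       : %4.0f MB/s peak\n", pci   / 1e6);
    printf("PCI-X 64/133    : %4.0f MB/s peak\n", pcix  / 1e6);
    printf("PCIe x8 per dir : %4.0f MB/s data\n", pcie8 / 1e6);
    return 0;
}
```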
19. Interconnect Components – Mapping Out Bandwidth and Latency
    Estimated/approximate performance numbers
    [Chart: bandwidth (log scale, 0.1–1000 GB/s) vs. latency (inverted log scale, 10 ns–100 µs) for GBE Solaris 9, GBE Linux, 10GBE Linux AMD, usermode GBE Linux, Quadrics Linux AMD, Myrinet, IB x4 Linux AMD, IB x12 Linux AMD, SMP V20z, SMP E25K, SMP E6900, SMP V440 and future CMT, with the PCI-X 133 speed limit, the 10GBE/PCI-E PCI-Express x8 speed limit and the MPI call latency limit marked]
20. Summary
    Lots of choice! Prices dropping!
    Understand what your workload needs:
    – High-capacity storage?
    – A single global filespace?
    – Low-latency MPI?
    – Large-scale SMP with threaded OpenMP?
    Help is on its way from HPTC@Sun:
    – More partnering, testing, support...
    – Reference architecture solutions
    – Professional Services practice guides