HPC Interconnect Technologies in 2004

Presented at the Sun SuperG event in 2004, summarizing the different flavors of HPC interconnects at the time.

Adrian Cockcroft

November 19, 2022

Transcript

  1. Adrian Cockcroft
    HPTC Chief Architect
    Sun Microsystems Inc.
    4/21/04
    Interconnect Technologies
    www.sun.com/hptc

  2. Agenda
    Architectures
    Interconnects
    Technical Comparison

  3. Architectures

  4. Capability and Capacity Computing
    [Diagram: processors, memory and I/O joined by a memory switch (one SMP
     system) versus separate nodes joined by a network switch (a cluster)]
    Scale Vertically (Capability): single OS instance
    Cache-coherent shared-memory multi-processors (SMP)
    Tightly coupled: highest bandwidth, lowest latency
    Large workloads: ad-hoc transaction processing, data warehousing
    Shared pool to over 100 processors
    Single terabyte-scale memory
    Scale Horizontally (Capacity): multiple OS instances
    Cluster multi-processor with cluster management
    Loosely coupled
    Standard H/W & S/W
    Highly parallel (web, HPTC)

  5. Workload Performance Factors
    Processor speed, capacity and throughput
    Memory capacity
    System interconnect latency & bandwidth
    (the #1 issue for real-world cluster performance and scaling)
    Network and storage I/O
    Operating system scalability
    Visualization performance and quality
    Optimized applications
    Network service availability

  6. Interconnect Components
    Mapping Out Bandwidth and Latency
    [Chart: bandwidth in GB/s (log scale) against latency (inverted log
     scale), with access-mechanism labels: Proximity, System Call,
     Library Call, Load/Store Instruction]
    Ethernet:      0.1 GB/s,        100 - 10 µs,    $xxx
    Myri/IB/Q/FL:  0.4 - 4.8 GB/s,  10 - 1 µs,      $x,xxx
    Memory:        9.6 - 57 GB/s,   1 - 0.1 µs,     $xx,xxx
    CMT On-Chip:   100 - x00 GB/s,  0.1 - 0.01 µs,  $xxx?

  7. Interconnects

  8. Ethernet
    Bandwidth

    1GigE 90-120 MB/s, big Solaris 10 improvements

    Solaris now (finally) does Jumbo frames!

    10GigE bandwidth is limited by the PCI-X I/O bus
    Latency improvements on the way

    100us typical Solaris MPI over TCP/IP

    40-60us MPI over TCP/IP for simpler Linux stack

    10us MPI over TCP/IP with user-mode stack

    5us MPI over raw 1Gbit Ethernet (no switch)

    Buffered switch latency 1-25us, 3-6us typical
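
    A rough way to reproduce latency numbers like these is a small MPI
    ping-pong test. The sketch below is illustrative only, not the benchmark
    behind the figures above: two ranks bounce an 8-byte message and report
    half the average round-trip time.

      /* mpicc pingpong.c -o pingpong ; mpirun -np 2 ./pingpong */
      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          int rank, i, iters = 10000;
          char buf[8] = {0};
          double t0, t1;
          MPI_Status st;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Barrier(MPI_COMM_WORLD);
          t0 = MPI_Wtime();
          for (i = 0; i < iters; i++) {
              if (rank == 0) {            /* send, then wait for the echo */
                  MPI_Send(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                  MPI_Recv(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
              } else if (rank == 1) {     /* echo the message back */
                  MPI_Recv(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
                  MPI_Send(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
              }
          }
          t1 = MPI_Wtime();
          if (rank == 0)
              printf("one-way latency %.2f us\n",
                     (t1 - t0) / iters / 2.0 * 1e6);
          MPI_Finalize();
          return 0;
      }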

  9. Ethernet Scalable NFS
    1Gbit common, 10Gbit starting to emerge
    [Diagram: cluster rack connected by 1Gbit Ethernet through an Ethernet
     switch to the NFS/QFS servers, which reach the disks over a 2Gbit SAN
     switch; clients use NFS over TCP/IP on Ethernet and the servers share
     a QFS cluster filesystem]

  10. Myrinet www.myri.com
    Latency improving

    11us typical Solaris/SPARC MPI (rev D)

    7us typical Opteron/Linux MPI (rev D)

    5us with latest rev E interface and GM software

    3.5us with new MX software (Oct 03 announce)

    Non-buffered, low-latency 128-way switch
    Bandwidth limited by 2Gbit fiber

    Dual port rev E card supports 4Gbit each way

    Full duplex reaches PCI-X limit of 900 MB/s
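
    Bandwidth ceilings like the 900 MB/s PCI-X figure are usually measured
    with a streaming test. The sketch below is illustrative, not the tool
    behind these numbers: rank 0 pushes 1 MB messages at rank 1 and divides
    bytes moved by elapsed time.

      /* mpicc stream.c -o stream ; mpirun -np 2 ./stream */
      #include <mpi.h>
      #include <stdio.h>
      #include <stdlib.h>

      int main(int argc, char **argv)
      {
          int rank, i, iters = 200;
          int msg = 1 << 20;                   /* 1 MB per message */
          char *buf = calloc(msg, 1);
          double t0, t1;
          MPI_Status st;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Barrier(MPI_COMM_WORLD);
          t0 = MPI_Wtime();
          for (i = 0; i < iters; i++) {
              if (rank == 0)
                  MPI_Send(buf, msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
              else if (rank == 1)
                  MPI_Recv(buf, msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
          }
          /* short ack so rank 0 stops its clock after the data lands */
          if (rank == 1) MPI_Send(buf, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
          if (rank == 0) MPI_Recv(buf, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, &st);
          t1 = MPI_Wtime();
          if (rank == 0)
              printf("bandwidth %.0f MB/s\n",
                     (double)msg * iters / (t1 - t0) / 1e6);
          free(buf);
          MPI_Finalize();
          return 0;
      }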

  11. Myrinet Scalable NFS
    Very efficient Ethernet gateway
    [Diagram: cluster rack connected by 2Gbit Myrinet to a Myrinet switch
     with Ethernet ports; NFS over TCP/IP runs over Myrinet inside the rack
     and over 1Gbit Ethernet to the NFS/QFS servers, which reach the disks
     over a 2Gbit SAN switch and share a QFS cluster filesystem]

  12. Infiniband www.topspin.com, www.infinicon.com,
    www.voltaire.com, www.mellanox.com, www.infinibandta.org
    IB support is in Solaris Express (S10beta)
    Latency

    5.5us Opteron/Linux MPI

    Non-buffered 24-port switch latency 200ns

    Larger switches still under 1us
    Bandwidth limited by PCI-X

    IB x4 carries 8Gbit/s of data on a 10Gbit wire

    Current limit is about 825 MBytes/s

    Dual IBx4 over PCI-Express x8 chipset announced
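
    The 8-on-10 relationship above comes from the link coding. A quick
    back-of-envelope check (assumed arithmetic, not from the original slide)
    shows why the ~825 MB/s observed is a bus limit, not a link limit.

      /* IB x4 link arithmetic (assumed 8b/10b coding) */
      #include <stdio.h>

      int main(void)
      {
          double signal_gbit = 10.0;                      /* x4 signalling rate */
          double data_gbit   = signal_gbit * 8.0 / 10.0;  /* 8b/10b -> 8 Gbit/s data */
          double data_mbytes = data_gbit * 1000.0 / 8.0;  /* ~1000 MB/s of data */
          printf("IB x4: %.0f Gbit/s data = %.0f MB/s; PCI-X delivers ~825 MB/s\n",
                 data_gbit, data_mbytes);
          return 0;
      }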

  13. Infiniband Protocol Options
    NFS or iSCSI over IP over IB

    Emulate an 8Gbit Ethernet with NFS/TCP/IP/IB

    Emulate an 8Gbit Ethernet with iSCSI/TCP/IP/IB
    NFS over RDMA

    Reduce overhead with direct NFS/IB
    SRP - SCSI over RDMA Protocol

    Reduce overhead with direct SCSI/IB
    SDP – Sockets Direct Protocol

    Reduce overhead with socket library/IB
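
    SDP is intended to be transparent to applications: ordinary stream-socket
    code keeps the same calls whether the bytes ride TCP/IP or SDP over IB.
    The sketch below is a generic sockets sender with a placeholder peer
    address and port, not code from the deck.

      /* Unchanged BSD-sockets sender; SDP aims to carry exactly this
         traffic over IB without modifying the application. */
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>
      #include <netinet/in.h>
      #include <arpa/inet.h>
      #include <sys/socket.h>

      int main(void)
      {
          int fd = socket(AF_INET, SOCK_STREAM, 0);   /* ordinary stream socket */
          struct sockaddr_in peer;

          memset(&peer, 0, sizeof(peer));
          peer.sin_family = AF_INET;
          peer.sin_port = htons(5000);                        /* placeholder port */
          inet_pton(AF_INET, "192.168.0.10", &peer.sin_addr); /* placeholder peer */

          if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) == 0) {
              const char msg[] = "same socket calls over TCP/IP or SDP";
              write(fd, msg, sizeof(msg));
          }
          close(fd);
          return 0;
      }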

  14. Infiniband Scalable NFS
    1Gbit Ethernet may be a bottleneck
    [Diagram: cluster rack connected by 8Gbit IB to an Infiniband switch
     with Ethernet ports; NFS over TCP/IP runs over IB inside the rack and
     over 1Gbit Ethernet to the NFS/QFS servers, which reach the disks over
     a 2Gbit SAN switch and share a QFS cluster filesystem]

  15. Infiniband Scalable NFS
    Infiniband directly connected to NFS servers for higher bandwidth
    [Diagram: cluster rack connected by 8Gbit IB through an Infiniband
     switch directly to the NFS/QFS servers using NFS over RDMA; the servers
     reach the disks over a 2Gbit SAN switch and share a QFS cluster
     filesystem]

  16. Infiniband Storage
    Efficient direct access to disk
    Direct mount of disks over SRP/IB
    FC ports added to the IB switch
    Use QFS for the shared filesystem
    [Diagram: cluster rack connected by 8Gbit IB to an Infiniband switch
     with FC ports, which reaches the disks over a 2Gbit SAN switch]

  17. SunFire Link
    www.sun.com/servers/cluster_interconnects/sun_fire_link
    Solaris/SPARC Large server specific

    Cambridge and Aachen Univ. proofs of concept

    E25K: 144 cores x 8 systems = 1152 processing threads
    Latency

    1.5us within SMP, 3.7us over SunFire Link

    Non-buffered 8-port switch
    Bandwidth

    Each link carries 850 MB/s (1.2 Gbytes/s raw)

    Stripe x4 to get 2900 MB/s (4.8 Gbytes/s raw)

  18. Quadrics
    Moving downmarket

    1024 or more Linux nodes today

    Adding 8/64/128 way options
    Latency

    1.8us Opteron/Linux MPI (Feb 2004 data)

    Large non-buffered, very low latency switch

    Low contention fat tree with dynamic routing
    Bandwidth limited by PCI-X

    Current limit about 850 MBytes/s

  19. SMP Backplane
    Shared memory OpenMP model
    Latency

    Latency starts at 56ns for single Opteron

    100-200ns at 2-4 CPU Opteron or USIIIi

    270-550ns for 16-144 core SMP US IV systems
    Bandwidth limited by coherency

    Global coherency 9.6-57 GB/s with USIII-USIV

    Distributed coherency adds bandwidth

  20. Chip Multi-Processing
    Shared memory OpenMP model

    32 threads per chip
    Latency

    10-50ns thread-thread via L2 on-chip cache
    Bandwidth

    100s of GB/s L2 cache bandwidth
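
    Both the SMP and CMT slides assume the same shared-memory OpenMP
    programming model. The sketch below is a generic OpenMP example, not
    code from the deck: threads on one system share an array directly
    instead of passing messages.

      /* cc -xopenmp omp.c   (Sun Studio's OpenMP flag) */
      #include <omp.h>
      #include <stdio.h>

      #define N 1000000

      static double a[N];        /* lives in memory shared by all threads */

      int main(void)
      {
          double sum = 0.0;
          int i;

          /* the loop is split across threads; the reduction combines
             per-thread partial sums */
          #pragma omp parallel for reduction(+:sum)
          for (i = 0; i < N; i++) {
              a[i] = 0.5 * i;
              sum += a[i];
          }

          printf("max threads %d, sum = %g\n", omp_get_max_threads(), sum);
          return 0;
      }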

  21. Technical Comparison

  22. PCI-Bus Limitations
    PCI: 64 bits wide, 66MHz

    Most SPARC platforms, runs up to 400MBytes/s

    Older generation Myrinet and Quadrics
    PCI-X: 64 bits wide, 133MHz

    Current V60x (Xeon) and V20z (Opteron)

    Current generation Myrinet and Quadrics

    All Infiniband adaptors

    Runs up to about 850 MBytes/s

  23. Next Generation PCI-Express x8
    Implementations expected during 2005
    Similar physical layer to Infiniband

    Each wire at 2.5GHz, carries 2Gbits/s of data

    Common usage expected is 8 wires each way

    Bandwidth is 16 Gbits/s, 2 Gbytes/s each way
    Interconnect limitations

    Enough capacity for full speed 10Gbit Ethernet

    Enough capacity for full speed Infiniband x4

    Limits Infiniband x12 to 66% of capacity
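
    The bullets above follow from simple lane arithmetic (assuming 8b/10b
    coding, as for Infiniband); the sketch below just spells it out.

      /* PCI-Express x8 lane arithmetic (assumed 8b/10b coding) */
      #include <stdio.h>

      int main(void)
      {
          double lane_data = 2.5 * 8.0 / 10.0;  /* 2.5 GHz signalling -> 2 Gbit/s data per lane */
          double x8_gbit   = 8.0 * lane_data;   /* 8 lanes -> 16 Gbit/s = 2 GB/s each way */
          double ibx12     = 3.0 * 8.0;         /* IB x12 data rate: 24 Gbit/s */
          printf("PCI-E x8: %.0f Gbit/s (%.0f GB/s) each way\n",
                 x8_gbit, x8_gbit / 8.0);
          printf("IB x12 needs %.0f Gbit/s, so x8 covers %.1f%% of it\n",
                 ibx12, 100.0 * x8_gbit / ibx12);
          return 0;
      }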

  24. Interconnect Components
    Mapping Out Bandwidth and Latency
    Estimated/approximate performance numbers
    [Chart: bandwidth in GB/s (log scale) against latency (inverted log
     scale), plotting GBE on Solaris 9, GBE on Linux, 10GBE on Linux/AMD,
     user-mode GBE on Linux, Quadrics on Linux/AMD, Myrinet, IB x4 on
     Linux/AMD, IB x12 on Linux/AMD, SMP V20z, SMP V440, SMP E6900,
     SMP E25K and a future CMT part, with reference lines for the PCI-X 133
     speed limit, the PCI-Express x8 speed limit, 10GBE over PCI-E and the
     MPI call latency limit]

  25. Summary
    Lots of choice! Prices dropping!
    Understand what your workload needs

    High capacity storage?

    A single global filespace?

    Low latency MPI?

    Large scale SMP with threaded OpenMP?
    Help is on its way from [email protected]

    More partnering, testing, support...

    Reference architecture solutions

    Professional Service practice guides

  26. www.sun.com/hptc
    www.sun.com/grid
    Adrian Cockcroft
    [email protected]
