An MPI Framework for HPC Clusters Deployed with Software-Defined Networking

This talk presents our ongoing work on SDN-enhanced MPI, an MPI framework for HPC clusters equipped with Software-Defined Networking (SDN). The goal of this framework is to improve inter-process communication performance and network utilization by dynamically reconfiguring the interconnect based on the communication patterns of applications. We present an overview of SDN-enhanced MPI and summarize recent achievements, including several accelerated MPI collectives and a network profiling tool for use with our framework.

Keichi Takahashi

March 22, 2018
Transcript

  1. The 27th Workshop on Sustained Simulation Performance (WSSP27)
    An MPI Framework for HPC Clusters Deployed with Software-Defined Networking
    Keichi Takahashi, Khureltulga Dashdavaa, Susumu Date, Yoshiyuki Kido, Shinji Shimojo
    Cybermedia Center, Osaka University

  2. Scale-out of Interconnects
    Interconnects are becoming increasingly large and complex
    ‣ since the number of nodes they have to accommodate is increasing
    ‣ can consume up to 50% of total power [1] and 1/3 of total budget [2] of a cluster
    [Figure: Number of Cores of Top500 Systems, June 2002 to June 2016, core counts on a log scale from 1E+02 to 1E+08; series: 1st, 10th and 100th ranked systems.]
    [1] J. Kim et al., “Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks,” ISCA, vol. 35, no. 2, pp. 126–137, 2007.
    [2] D. Abts et al., “Energy proportional datacenter networks,” ACM SIGARCH Comput. Archit. News, vol. 38, no. 3, p. 338, 2010.

  3. Mainstream Design of Interconnects
    Network resources are statically allocated
    ‣ Routing, bandwidth allocation, topology etc. are fixed
    ‣ Simple and easy to implement
    ‣ Consequently, unaware of the communication pattern of applications
    Over-provisioned
    ‣ Redundant links and bandwidth are provisioned
    ‣ To assure the communication performance of diverse applications with
    different communication patterns
    ‣ However, over-provisioning is becoming more and more expensive due
    to the rapid scale-out of clusters

  4. Drawbacks of Static Interconnects
    A mismatch between the inter-process communication pattern of applications and the interconnect leads to the following drawbacks:
    1. Load imbalance among links
    2. Low utilization of network resources
    3. Low inter-node communication performance

  5. Basic Concept of the Framework
    [Figure: The communication pattern of the processes running on the cluster is obtained through profiling, tracing, static analysis, etc. An interconnect configuration is then planned and optimized from that pattern and applied to the switches as flow tables (Source, Dest, …, Instructions) with entries such as "Flood", "Output to Port X" and "Output to Port Y". Legend: switch, server, process.]

  6. Temporal Granularity of Reconfiguration
    Per-job
    ‣ Pros: Relatively simple to implement
    ‣ Cons: Limited effect on applications with time-varying communication patterns
    Per-primitive
    ‣ Pros: Fine-grained control, can support time-varying communication patterns
    ‣ Cons: Potentially high overhead, needs an intricate mechanism to synchronize application execution and interconnect control
    Per-packet (a.k.a. adaptive routing)
    ‣ Pros: Works without prior knowledge of the application
    ‣ Cons: Unable to utilize a global view of the network

  7. Overview
    [Figure: overview diagram of the framework, repeated from slide 5: cluster, communication pattern (profile, trace, static analysis, etc.), interconnect configuration (plan & optimize), apply.]

  8. PFProf | MPI Profiler
    An MPI profiler focused on network activity monitoring
    ‣ outputs traffic matrices and other network statistics as a JSON file
    ‣ can detect the underlying communication of collective MPI functions
    ‣ implemented based on PMPI and PERUSE
    [Figure: two diagrams over ranks 0 to 7. One shows the behavior of MPI_Bcast as seen from applications; the other shows the actual communication performed (when using binomial tree algorithm).]
    Keichi Takahashi et al. "PFAnalyzer: A Toolset for Analyzing Application-aware Dynamic Interconnects", HPCMASPA 2017
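
To illustrate why the underlying communication of a collective differs from what the application sees, the following minimal sketch enumerates the point-to-point sends of a binomial-tree broadcast and accumulates them into a traffic matrix. It is not PFProf code: the function names are hypothetical, and the exact tree shape used by a real MPI library depends on its implementation.

```python
def binomial_bcast_sends(num_ranks, root=0):
    """Return the (src, dst) point-to-point sends of a binomial-tree broadcast."""
    sends = []
    mask = 1
    while mask < num_ranks:
        # Ranks whose offset from the root is below `mask` already hold the data;
        # each of them forwards it to the rank `mask` positions further on.
        for rel in range(mask):
            peer = rel + mask
            if peer < num_ranks:
                sends.append(((rel + root) % num_ranks, (peer + root) % num_ranks))
        mask *= 2
    return sends


def traffic_matrix(num_ranks, sends, message_bytes):
    """Accumulate bytes sent per (src, dst) pair into a square matrix."""
    matrix = [[0] * num_ranks for _ in range(num_ranks)]
    for src, dst in sends:
        matrix[src][dst] += message_bytes
    return matrix


if __name__ == "__main__":
    sends = binomial_bcast_sends(8)
    for row in traffic_matrix(8, sends, message_bytes=1024):
        print(row)
```

With 8 ranks, the root in this sketch sends only three messages, and the other four transfers are relayed by non-root ranks. A profiler that only sees the MPI_Bcast call would miss those relayed transfers.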

  9. Overview
    [Figure: overview diagram of the framework, repeated from slide 5: cluster, communication pattern (profile, trace, static analysis, etc.), interconnect configuration (plan & optimize), apply.]

  10. Path Allocation Algorithm
    Objective: Maximize load balance and minimize congestion on links
    ‣ Currently, a simple greedy heuristic is adopted
    ‣ Finding the optimal load balancing of multiple flows is an NP-complete
    problem (variation of multi-commodity flow problem)
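
A greedy heuristic of this kind can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the slides state only that flows are ordered by traffic volume and then allocated paths, so the cost function (load of the busiest link on a path), the function name, and the link names are all hypothetical.

```python
from collections import defaultdict


def allocate_paths(flows, candidate_paths):
    """Greedy sketch: heaviest flows first, each on the least-loaded candidate path.

    flows: dict mapping (src, dst) -> traffic volume
    candidate_paths: dict mapping (src, dst) -> list of paths (tuples of link names)
    """
    link_load = defaultdict(int)
    allocation = {}
    # Order by traffic volume, largest first.
    for pair, volume in sorted(flows.items(), key=lambda kv: kv[1], reverse=True):
        # Assumed cost of a path: the current load of its busiest link.
        best = min(candidate_paths[pair],
                   key=lambda path: max(link_load[link] for link in path))
        for link in best:
            link_load[link] += volume
        allocation[pair] = best
    return allocation, dict(link_load)


if __name__ == "__main__":
    # Hypothetical topology: leaf switches A and B joined through spines S1 and S2.
    via_s1 = ("A-S1", "S1-B")
    via_s2 = ("A-S2", "S2-B")
    flows = {(0, 2): 3, (1, 3): 2, (0, 3): 1}
    paths = {pair: [via_s1, via_s2] for pair in flows}
    allocation, load = allocate_paths(flows, paths)
    print(allocation)
    print(load)
```

In this small example the heaviest flow takes the first spine and the two lighter flows share the second, so every link ends up carrying the same volume. A greedy pass gives no optimality guarantee in general.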
    [Figure: worked example in two steps, "Order by traffic volume" and "Allocate path", for flows among nodes 0 to 3.]

  11. Overview
    [Figure: overview diagram of the framework, repeated from slide 5: cluster, communication pattern (profile, trace, static analysis, etc.), interconnect configuration (plan & optimize), apply.]

  12. SDN | Software-Defined Networking
    [Figure: Conventional Networking bundles features, control plane and data plane together. Software Defined Networking disaggregates them: apps provide features through a Northbound API on top of the control plane, and the control plane drives the data plane through a Southbound API (e.g. OpenFlow).]

  13. OpenFlow | Standard Implementation of SDN
    Flow Table (Collection of flow entries), held by each switch in the data plane:
    Src MAC     Dst MAC     …  Instructions
    aa:aa:aa:…  ff:ff:ff:…     Flood
    bb:bb:bb:…  aa:aa:aa:…     Output to Port X
    aa:aa:aa:…  bb:bb:bb:…     Set Dst IP to Y,…
    OpenFlow Messages exchanged between the OpenFlow Controller (control plane) and the switches (data plane):
    ‣ Add/Modify/Delete flow entries
    ‣ Inject packets into data plane
    ‣ Notify flow entry misses
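
The match-then-act behavior of a flow table can be sketched in a few lines. This is a simplified model for illustration only, not an OpenFlow implementation: real flow entries match on many more header fields and carry priorities, whereas this sketch matches only on source and destination address and takes the first matching entry. The class names, the address placeholders ("mac-a", "mac-b", "broadcast") and the instruction strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FlowEntry:
    src_mac: Optional[str]  # None acts as a wildcard
    dst_mac: Optional[str]  # None acts as a wildcard
    instructions: str

    def matches(self, src, dst):
        return ((self.src_mac is None or self.src_mac == src) and
                (self.dst_mac is None or self.dst_mac == dst))


class FlowTable:
    def __init__(self):
        self.entries = []

    def add(self, entry):
        self.entries.append(entry)

    def lookup(self, src, dst):
        """Return the instructions of the first matching entry, or None on a miss."""
        for entry in self.entries:
            if entry.matches(src, dst):
                return entry.instructions
        return None


def handle_packet(table, src, dst, notify_controller):
    """Apply the matching entry, or report a flow entry miss to the controller."""
    instructions = table.lookup(src, dst)
    if instructions is None:
        notify_controller(src, dst)
        return "sent to controller"
    return instructions


if __name__ == "__main__":
    table = FlowTable()
    table.add(FlowEntry("mac-a", "broadcast", "flood"))
    table.add(FlowEntry("mac-b", "mac-a", "output:X"))
    print(handle_packet(table, "mac-b", "mac-a", print))
    print(handle_packet(table, "mac-c", "mac-a", print))
```

The second call matches no entry, so the switch notifies the controller, which could then respond by adding a flow entry for that pair.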

  14. Overview
    [Figure: overview diagram of the framework, repeated from slide 5, with an added "OpenFlow" label.]

  15. Putting Them All Together
    [Figure: Integration of the OpenFlow Controller, which oversees communication and reconfigures the interconnect, with the Job Scheduler, which oversees computation and starts, monitors and terminates jobs.]

  16. Overall Architecture
    [Figure: architecture diagram spanning a Head Node, a Compute Node and an Interconnect Manager Node. Labels in the diagram: User; sbatch, srun, …; Launch; Job; slurmctld; slurmdbd; slurmd; app; PFProf; Plugin (twice); Job Info; Process Placement; Interconnect Manager; Communication Pattern DB; Flow Entries; OpenFlow Controller; Interconnect. The legend distinguishes Developed Components from Existing Components.]

  17. Preliminary Evaluation
    Evaluated the execution time of a communication-intensive benchmark with and without using our framework
    ‣ Cluster of 20 compute nodes, each equipped with 1 CPU (8 cores)
    ‣ Interconnect is a 2-level fat-tree with a 2.5:1 oversubscription ratio
    ‣ NAS CG benchmark with 128 processes was used as the workload
    ‣ Different node selection and process placement strategies
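
As a rough illustration of what a process placement strategy decides, the sketch below maps 128 ranks onto 20 nodes with 8 cores each using generic block and cyclic placement, as these terms are commonly used by job schedulers. The slides do not spell out how the individual block variants in the evaluation differ, so the function names and the exact mappings here are assumptions.

```python
from collections import Counter


def block_placement(num_procs, num_nodes, cores_per_node):
    """Fill all cores of a node before moving on to the next node."""
    if num_procs > num_nodes * cores_per_node:
        raise ValueError("not enough cores for the requested processes")
    return [rank // cores_per_node for rank in range(num_procs)]


def cyclic_placement(num_procs, num_nodes, cores_per_node):
    """Deal ranks out to the nodes in round-robin order."""
    if num_procs > num_nodes * cores_per_node:
        raise ValueError("not enough cores for the requested processes")
    return [rank % num_nodes for rank in range(num_procs)]


if __name__ == "__main__":
    block = Counter(block_placement(128, 20, 8))
    cyclic = Counter(cyclic_placement(128, 20, 8))
    print("block: ", len(block), "nodes used, up to", max(block.values()), "ranks per node")
    print("cyclic:", len(cyclic), "nodes used, up to", max(cyclic.values()), "ranks per node")
```

Under these assumptions, block placement packs the job onto 16 nodes and keeps neighboring ranks on the same node, while cyclic placement spreads it over all 20 nodes, so the same application produces different inter-node traffic.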

  18. Evaluation Results
    ‣ SDN achieves consistently better communication performance than static routing (D-mod-K)
    ‣ Process placement and node selection have a significant effect on the performance
    [Figure: bar chart of Communication Time [s] (axis from 0 to 700) by Process Allocation (Block A, Block B, Block C, Block D, Block E, Cyclic) for SDN and D-mod-K, annotated "26% Reduction".]
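
For context on the static baseline, D-mod-K routing is commonly described as choosing the upward port from the destination ID modulo the number of up-ports. The sketch below is a simplified two-level version under that assumption, with hypothetical function names and made-up flows. It shows how a pattern-unaware rule can pile several flows onto one uplink.

```python
from collections import Counter


def dmodk_uplink(dst_node, num_uplinks):
    """Pick the uplink from the destination ID alone (simplified 2-level D-mod-K)."""
    return dst_node % num_uplinks


def uplink_loads(flows, num_uplinks):
    """Sum traffic per (source leaf switch, uplink) for flows leaving their leaf.

    flows: iterable of (src_leaf, dst_node, volume) tuples
    """
    loads = Counter()
    for src_leaf, dst_node, volume in flows:
        loads[(src_leaf, dmodk_uplink(dst_node, num_uplinks))] += volume
    return loads


if __name__ == "__main__":
    # Three flows from leaf switch 0 to even-numbered destinations.
    flows = [(0, 10, 1), (0, 12, 1), (0, 14, 1)]
    loads = uplink_loads(flows, num_uplinks=2)
    print("uplink 0:", loads[(0, 0)], "uplink 1:", loads[(0, 1)])
```

All three flows land on uplink 0 while uplink 1 stays idle. This is the kind of load imbalance that allocating paths based on the communication pattern is meant to avoid.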

  19. Conclusion
    Summary
    ‣ A framework that dynamically reconfigures the interconnect to match the communication pattern of MPI applications is proposed
    ‣ The proposed framework integrates the interconnect controller into the job scheduler
    ‣ Evaluation indicates improvement in communication performance
    Future Directions
    ‣ Extensive benchmark evaluation using diverse applications
    ‣ Combine application-aware process placement and node selection
    ‣ Adopt sophisticated path allocation algorithms
