Slide 1

Slide 1 text

The 27th Workshop on Sustained Simulation Performance (WSSP27)
An MPI Framework for HPC Clusters Deployed with Software-Defined Networking
Keichi Takahashi, Khureltulga Dashdavaa, Susumu Date, Yoshiyuki Kido, Shinji Shimojo
Cybermedia Center, Osaka University

Slide 2

Slide 2 text

Scale-out of Interconnects

Interconnects are becoming increasingly large and complex
‣ since the number of nodes they have to accommodate is increasing
‣ can consume up to 50% of total power [1] and 1/3 of total budget [2] of a cluster

[Figure: Number of cores of Top500 systems (1st, 10th and 100th ranked), 2002/06 to 2016/06, log scale from 1E+02 to 1E+08]

[1] J. Kim et al., “Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks,” ISCA, vol. 35, no. 2, pp. 126–137, 2007.
[2] D. Abts et al., “Energy proportional datacenter networks,” ACM SIGARCH Comput. Archit. News, vol. 38, no. 3, p. 338, 2010.

Slide 3

Slide 3 text

Mainstream Design of Interconnects

Network resources are statically allocated
‣ Routing, bandwidth allocation, topology, etc. are fixed
‣ Simple and easy to implement
‣ Consequently, unaware of the communication pattern of applications

Over-provisioned
‣ Redundant links and bandwidth are provisioned
‣ To assure the communication performance of diverse applications with different communication patterns
‣ However, over-provisioning is becoming more and more expensive due to the rapid scale-out of clusters

Slide 4

Slide 4 text

Drawbacks of Static Interconnects

[Figure: Mismatch between the inter-process communication pattern of applications and the interconnect]

Drawbacks
1. Load imbalance among links
2. Low utilization of network resources
3. Low inter-node communication performance

Slide 5

Slide 5 text

Basic Concept of the Framework

[Figure: A cluster of switches and servers running processes 0 to 7, with a cycle of three steps]
Cluster -> (Profile, trace, static analysis, etc.) -> Communication Pattern -> (Plan & optimize) -> Interconnect Configuration -> (Apply) -> Cluster

The interconnect configuration is shown as one flow table per switch:

Source       Dest         …   Instructions
aa:aa:aa:…   ff:ff:ff:…       Flood
bb:bb:bb:…   aa:aa:aa:…       Output to Port X
aa:aa:aa:…   bb:bb:bb:…       Output to Port Y

Slide 6

Slide 6 text

Temporal Granularity of Reconfiguration

Per-job
‣ Pros: Relatively simple to implement
‣ Cons: Limited effect on applications with time-varying communication patterns

Per-primitive
‣ Pros: Fine-grained control, can support time-varying communication patterns
‣ Cons: Potentially high overhead, needs an intricate mechanism to synchronize application execution and interconnect control

Per-packet (a.k.a. adaptive routing)
‣ Pros: Works without prior knowledge of the application
‣ Cons: Unable to utilize a global view of the network

Slide 7

Slide 7 text

Overview

[Figure: The framework cycle repeated from the Basic Concept slide: Cluster -> (Profile, trace, static analysis, etc.) -> Communication Pattern -> (Plan & optimize) -> Interconnect Configuration -> (Apply) -> Cluster]

Slide 8

Slide 8 text

PFProf | MPI Profiler

An MPI profiler focused on network activity monitoring
‣ outputs traffic matrices and other network statistics as a JSON file
‣ can detect the underlying point-to-point communication of collective MPI functions
‣ implemented based on PMPI and PERUSE

[Figure: Two views of MPI_Bcast among ranks 0 to 7. Left: behavior of MPI_Bcast as seen from applications. Right: actual communication performed (when using the binomial tree algorithm)]

Keichi Takahashi et al., "PFAnalyzer: A Toolset for Analyzing Application-aware Dynamic Interconnects", HPCMASPA 2017
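The gap between the two views of MPI_Bcast can be illustrated with a small sketch. It is not PFProf code: it assumes one common binomial-tree construction (each non-root rank receives from the rank obtained by clearing its lowest set bit, relative to the root), and the JSON layout and the names `binomial_bcast_edges` and `traffic_matrix` are hypothetical.

```python
import json


def binomial_bcast_edges(n_procs, root=0):
    """Return the (sender, receiver) pairs of a binomial-tree broadcast.

    Assumed construction: a rank's parent is its root-relative rank
    with the lowest set bit cleared.
    """
    edges = []
    for rank in range(n_procs):
        rel = (rank - root) % n_procs
        if rel == 0:
            continue  # the root receives from nobody
        parent_rel = rel & (rel - 1)  # clear the lowest set bit
        edges.append(((parent_rel + root) % n_procs, rank))
    return edges


def traffic_matrix(n_procs, edges, msg_bytes):
    """Accumulate bytes sent from each rank (row) to each rank (column)."""
    matrix = [[0] * n_procs for _ in range(n_procs)]
    for src, dst in edges:
        matrix[src][dst] += msg_bytes
    return matrix


edges = binomial_bcast_edges(8)
matrix = traffic_matrix(8, edges, msg_bytes=1024)
# Hypothetical output layout, not PFProf's actual JSON schema.
print(json.dumps({"n_procs": 8, "traffic_matrix": matrix}))
```

With 8 ranks, the application sees one call rooted at rank 0, while the matrix records seven point-to-point transfers, three of which leave rank 0 and the rest of which are forwarded by ranks 2, 4 and 6.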

Slide 9

Slide 9 text

Overview

[Figure: The framework cycle repeated from the Basic Concept slide: Cluster -> (Profile, trace, static analysis, etc.) -> Communication Pattern -> (Plan & optimize) -> Interconnect Configuration -> (Apply) -> Cluster]

Slide 10

Slide 10 text

Path Allocation Algorithm

Objective: Maximize load balance and minimize congestion on links
‣ Currently, a simple greedy heuristic is adopted
‣ Finding the optimal load balancing of multiple flows is an NP-complete problem (a variation of the multi-commodity flow problem)

[Figure: Flows among nodes 0 to 3 annotated with traffic volumes. Step 1: order by traffic volume. Step 2: allocate path]
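The two steps named on this slide (order by traffic volume, then allocate a path) can be sketched as follows. The slides do not state the cost function, so this sketch assumes each flow takes the candidate path whose busiest link would end up least loaded; the function name, link names and example flows are all hypothetical.

```python
from collections import defaultdict


def allocate_paths(flows, candidate_paths):
    """Greedy path allocation.

    flows: {(src, dst): traffic volume}
    candidate_paths: {(src, dst): [path, ...]}, a path being a tuple of link names
    Returns ({flow: chosen path}, {link: accumulated load}).
    """
    link_load = defaultdict(int)
    allocation = {}
    # Step 1: order by traffic volume, heaviest flow first.
    ordered = sorted(flows.items(), key=lambda kv: kv[1], reverse=True)
    for flow, volume in ordered:
        # Step 2: allocate the path that minimizes the resulting bottleneck load
        # (assumed cost function; ties go to the first candidate).
        def bottleneck(path):
            return max(link_load[link] + volume for link in path)

        best = min(candidate_paths[flow], key=bottleneck)
        for link in best:
            link_load[link] += volume
        allocation[flow] = best
    return allocation, dict(link_load)


# Two leaf switches (A, B) joined through two spine switches (s0, s1).
via_s0 = ("A-s0", "s0-B")
via_s1 = ("A-s1", "s1-B")
flows = {(0, 2): 3, (1, 3): 2, (0, 3): 1}
paths = {flow: [via_s0, via_s1] for flow in flows}
allocation, load = allocate_paths(flows, paths)
print(allocation)
print(load)
```

In this example the heaviest flow takes the first spine and the two lighter flows share the second, so both spines carry a load of 3. Being greedy, the heuristic gives no optimality guarantee.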

Slide 11

Slide 11 text

Overview

[Figure: The framework cycle repeated from the Basic Concept slide: Cluster -> (Profile, trace, static analysis, etc.) -> Communication Pattern -> (Plan & optimize) -> Interconnect Configuration -> (Apply) -> Cluster]

Slide 12

Slide 12 text

SDN | Software-Defined Networking

[Figure: Conventional Networking compared with Software-Defined Networking, related by "Disaggregation"]
‣ Conventional Networking labels: Feature, Control Plane, Data Plane
‣ Software-Defined Networking labels: App (three instances), Northbound API, Feature, Control Plane, Southbound API (e.g. OpenFlow), Data Plane

Slide 13

Slide 13 text

OpenFlow | Standard Implementation of SDN

Control Plane: OpenFlow Controller

OpenFlow Messages
‣ Add/Modify/Delete flow entries
‣ Inject packets into data plane
‣ Notify flow entry misses

Data Plane: each switch holds a Flow Table (collection of flow entries)

Src MAC      Dst MAC      …   Instructions
aa:aa:aa:…   ff:ff:ff:…       Flood
bb:bb:bb:…   aa:aa:aa:…       Output to Port X
aa:aa:aa:…   bb:bb:bb:…       Set Dst IP to Y,…
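The flow-table behavior on this slide can be modeled in a few lines. This is a toy model rather than an OpenFlow implementation: the `FlowEntry` and `Switch` classes, the field names and the instruction strings are hypothetical, and a table miss is simply recorded as a notification to the controller.

```python
from dataclasses import dataclass, field


@dataclass
class FlowEntry:
    match: dict         # header fields to match; absent fields act as wildcards
    instructions: list  # e.g. ["flood"] or ["output:X"]
    priority: int = 0


@dataclass
class Switch:
    flow_table: list = field(default_factory=list)
    misses: list = field(default_factory=list)  # packets reported to the controller

    def add_flow(self, entry):
        """Controller -> switch: add a flow entry."""
        self.flow_table.append(entry)
        # Keep higher-priority entries first; the sort is stable for ties.
        self.flow_table.sort(key=lambda e: e.priority, reverse=True)

    def delete_flow(self, match):
        """Controller -> switch: delete entries with exactly this match."""
        self.flow_table = [e for e in self.flow_table if e.match != match]

    def process(self, packet):
        """Return the instructions of the first matching entry."""
        for entry in self.flow_table:
            if all(packet.get(k) == v for k, v in entry.match.items()):
                return entry.instructions
        self.misses.append(packet)  # switch -> controller: flow entry miss
        return []


sw = Switch()
sw.add_flow(FlowEntry({"dst_mac": "ff:ff:ff:ff:ff:ff"}, ["flood"]))
sw.add_flow(FlowEntry({"src_mac": "bb", "dst_mac": "aa"}, ["output:X"], priority=10))

print(sw.process({"src_mac": "bb", "dst_mac": "aa"}))  # matches the second entry
print(sw.process({"src_mac": "cc", "dst_mac": "dd"}))  # no match: reported as a miss
```

Because all forwarding decisions live in the table, a controller can reroute traffic purely by adding, modifying or deleting entries, which is what the framework relies on.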

Slide 14

Slide 14 text

Overview

[Figure: The framework cycle repeated from the Basic Concept slide, with the label "OpenFlow" added: Cluster -> (Profile, trace, static analysis, etc.) -> Communication Pattern -> (Plan & optimize) -> Interconnect Configuration -> (Apply) -> Cluster]

Slide 15

Slide 15 text

Putting Them All Together

[Figure: Integration of the OpenFlow Controller and the Job Scheduler over a cluster running processes 0 to 7]
‣ OpenFlow Controller: oversees communication, reconfigures the interconnect
‣ Job Scheduler: oversees computation, starts, monitors and terminates jobs

Slide 16

Slide 16 text

Overall Architecture

[Figure: Component diagram of the framework]
‣ Nodes: Head Node, Compute Node, Interconnect Manager Node
‣ Components: slurmctld, slurmdbd, slurmd, Job app, PFProf, Job Info Plugin, Process Placement Plugin, Interconnect Manager, Communication Pattern DB, OpenFlow Controller
‣ Other labels: Launch (sbatch, srun, …), Flow Entries, Interconnect
‣ Legend: User Developed Components, Existing Components
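One way to picture per-job reconfiguration with these components is the sketch below. It is an illustration only: the slides do not describe the interfaces between the scheduler plugins, the Interconnect Manager, the Communication Pattern DB and the OpenFlow Controller, so every class, method and parameter name here is hypothetical.

```python
class RecordingController:
    """Stand-in for an OpenFlow controller; it only records requests."""

    def __init__(self):
        self.installed = {}

    def install(self, job_id, entries):
        self.installed[job_id] = list(entries)

    def remove(self, job_id):
        self.installed.pop(job_id, None)


class InterconnectManager:
    def __init__(self, pattern_db, planner, controller):
        self.pattern_db = pattern_db  # {application name: traffic matrix}
        self.planner = planner        # callable(pattern, nodes) -> flow entries
        self.controller = controller

    def on_job_start(self, job_id, app_name, nodes):
        """Called with job information when the scheduler launches a job."""
        pattern = self.pattern_db.get(app_name)
        if pattern is None:
            return False  # no profile yet: leave default routing in place
        self.controller.install(job_id, self.planner(pattern, nodes))
        return True

    def on_job_end(self, job_id):
        """Called when the scheduler reports that the job has terminated."""
        self.controller.remove(job_id)


def trivial_planner(pattern, nodes):
    """Placeholder planner: one (source node, destination node) pair per
    communicating rank pair, assuming rank i runs on nodes[i]."""
    return [
        (nodes[src], nodes[dst])
        for src, row in enumerate(pattern)
        for dst, volume in enumerate(row)
        if volume > 0
    ]


controller = RecordingController()
manager = InterconnectManager({"cg": [[0, 5], [0, 0]]}, trivial_planner, controller)
started = manager.on_job_start(1, "cg", ["n0", "n1"])
print(started, controller.installed)
```

Keeping the planner as a plain callable means a different path allocation algorithm could be swapped in without touching the job lifecycle hooks.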

Slide 17

Slide 17 text

Preliminary Evaluation

Evaluated the execution time of a communication-intensive benchmark with and without using our framework
‣ Cluster of 20 compute nodes, each equipped with 1 CPU (8 cores)
‣ Interconnect is a 2-level fat-tree with a 2.5:1 oversubscription ratio
‣ NAS CG benchmark with 128 processes was used as the workload
‣ Different node selection and process placement strategies
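For readers unfamiliar with the term, the oversubscription ratio of a leaf switch is its total downlink bandwidth divided by its total uplink bandwidth. The slides give only the 2.5:1 ratio, so the port counts in this sketch are hypothetical values chosen to reproduce it.

```python
from fractions import Fraction


def oversubscription(down_ports, up_ports, down_gbps=1, up_gbps=1):
    """Ratio of total downlink bandwidth to total uplink bandwidth."""
    return Fraction(down_ports * down_gbps, up_ports * up_gbps)


# Hypothetical leaf switch: 10 node-facing ports, 4 uplinks, equal link speeds.
ratio = oversubscription(down_ports=10, up_ports=4)
print(f"{float(ratio)}:1")
```

A ratio above 1:1 means that uplinks can become a bottleneck when many nodes under the same leaf switch communicate across switches at once, which is the situation that path allocation tries to ease.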

Slide 18

Slide 18 text

Evaluation Results
‣ SDN achieves consistently better communication performance than static routing (D-mod-K)
‣ Process placement and node selection have a significant effect on the performance

[Figure: Bar chart of Communication Time [s] (axis 0 to 700) by Process Allocation (Block A, Block B, Block C, Block D, Block E, Cyclic) for SDN and D-mod-K, annotated "26% Reduction"]
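The static baseline can be sketched as follows. D-mod-K is commonly described as choosing the upward port from the destination identifier modulo the number of uplinks; the slides do not give the evaluated cluster's exact configuration, so the node numbering, uplink count and flows here are hypothetical.

```python
def dmodk_uplink(dst_node, n_uplinks):
    """Static D-mod-K choice at a leaf switch of a 2-level fat-tree."""
    return dst_node % n_uplinks


def uplink_loads(flows, n_uplinks):
    """flows: iterable of (destination node, volume) leaving one leaf switch."""
    loads = {uplink: 0 for uplink in range(n_uplinks)}
    for dst, volume in flows:
        loads[dmodk_uplink(dst, n_uplinks)] += volume
    return loads


# Four equal flows whose destinations happen to share residues modulo 4.
loads = uplink_loads([(10, 1), (12, 1), (14, 1), (16, 1)], n_uplinks=4)
print(loads)
```

Here two uplinks carry all of the traffic and two stay idle, because the choice depends only on the destination and not on the traffic the application actually generates. An allocation driven by the communication pattern could spread the same four flows over all four uplinks.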

Slide 19

Slide 19 text

Conclusion

Summary
‣ A framework that dynamically reconfigures the interconnect to match the communication pattern of MPI applications is proposed
‣ The proposed framework integrates the interconnect controller into the job scheduler
‣ Evaluation indicates improvement in communication performance

Future Directions
‣ Extensive benchmark evaluation using diverse applications
‣ Combine application-aware process placement and node selection
‣ Adopt sophisticated path allocation algorithms