Slide 1

Slide 1 text

The 26th Workshop on Sustained Simulation Performance (WSSP26)
Towards Realizing a Dynamic and MPI Application-aware Interconnect with SDN
Keichi Takahashi
Cybermedia Center, Osaka University

Slide 2

Slide 2 text

Agenda
Introduction
‣ Why do we need an MPI Application-aware Interconnect?
SDN-accelerated MPI Primitives
‣ Is our idea feasible at all?
A Coordination Mechanism of Computation and Communication
‣ How do we reconfigure the interconnect in accordance with the execution of applications?
A Toolset for Analyzing Application-aware Dynamic Interconnects
‣ How will the proposed architecture perform with various types of applications and clusters?

Slide 4

Slide 4 text

Challenges in Future Interconnects
Over-provisioned designs might not scale well
‣ Interconnects can consume up to 50% of total power [1] and 1/3 of the total budget of a cluster [2]
‣ Properties such as full bisection bandwidth and non-blocking may become increasingly difficult to achieve
‣ Need to improve the utilization of the interconnect
Our proposal is to adopt:
‣ Dynamic (adaptive) routing
‣ Application-aware network control
[1] J. Kim et al., “Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks,” ISCA, vol. 35, no. 2, pp. 126–137, 2007.
[2] D. Abts et al., “Energy proportional datacenter networks,” ACM SIGARCH Comput. Archit. News, vol. 38, no. 3, p. 338, 2010.

Slide 10

Slide 10 text

Source of Inefficiency in Current Interconnect
‣ Mismatch between the communication pattern of applications and the topology of the interconnect
‣ Lower utilization, higher congestion, lower communication performance [3]
[3] S. Kamil et al., “Communication Requirements and Interconnect Optimization for High-End Scientific Applications,” IEEE Trans. Parallel Distrib. Syst., 2010.

Slide 11

Slide 11 text

SDN-enhanced MPI Framework
Our prototype framework that integrates SDN into MPI
‣ Dynamically controls the interconnect based on the communication pattern of MPI applications
‣ Uses Software-Defined Networking (SDN) as a key technology to realize dynamic interconnect control (e.g. dynamic routing)
‣ Successfully accelerated several MPI primitives (e.g. MPI_Bcast and MPI_Allreduce)

Slide 14

Slide 14 text

SDN | Software-Defined Networking
[Figure: Conventional Networking (Feature, Control Plane, Data Plane) is disaggregated into Software Defined Networking: Apps on top of a Northbound API, a Control Plane carrying the Feature, a Southbound API (e.g. OpenFlow), and the Data Plane]

Slide 15

Slide 15 text

OpenFlow | Standard implementation of SDN
OpenFlow Messages between the OpenFlow Controller (Control Plane) and the switches (Data Plane):
‣ Add/Modify/Delete flow entries
‣ Inject packets into data plane
‣ Notify flow entry misses
Flow Table (collection of flow entries), one per switch:
Src MAC | Dst MAC | … | Instructions
aa:aa:aa:… | ff:ff:ff:… | … | Flood
bb:bb:bb:… | aa:aa:aa:… | … | Output to Port X
aa:aa:aa:… | bb:bb:bb:… | … | Set Dst IP to Y,…
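The flow-table lookup on this slide can be sketched in a few lines of Python. This is an illustration only, not an OpenFlow implementation: the `FlowEntry` and `FlowTable` names, the full MAC addresses and the instruction strings are hypothetical, and real OpenFlow tables also carry priorities, counters and many more match fields. Here the first matching entry wins, and a miss returns `None`, which stands in for the "notify flow entry miss" message to the controller.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FlowEntry:
    # None acts as a wildcard for that match field (hypothetical simplification).
    src_mac: Optional[str]
    dst_mac: Optional[str]
    instructions: List[str]

    def matches(self, src: str, dst: str) -> bool:
        return self.src_mac in (None, src) and self.dst_mac in (None, dst)


class FlowTable:
    def __init__(self) -> None:
        self.entries: List[FlowEntry] = []

    def add(self, entry: FlowEntry) -> None:
        self.entries.append(entry)

    def lookup(self, src: str, dst: str) -> Optional[List[str]]:
        # First matching entry wins; None models a table miss.
        for entry in self.entries:
            if entry.matches(src, dst):
                return entry.instructions
        return None


# Hypothetical addresses standing in for the elided ones on the slide.
A = "aa:aa:aa:00:00:01"
B = "bb:bb:bb:00:00:01"
BROADCAST = "ff:ff:ff:ff:ff:ff"

table = FlowTable()
table.add(FlowEntry(A, BROADCAST, ["flood"]))
table.add(FlowEntry(B, A, ["output:X"]))
table.add(FlowEntry(A, B, ["set_dst_ip:Y", "output:Z"]))

hit = table.lookup(B, A)
miss = table.lookup(B, BROADCAST)
```

In this sketch the controller would react to `miss` by installing a new entry with `table.add`, which corresponds to the "Add/Modify/Delete flow entries" message.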

Slide 19

Slide 19 text

Basic Idea of SDN-enhanced MPI
‣ Communication Pattern: extract via tracer/profiler or static analysis
‣ Interconnect Control Seq.: resource (path, bandwidth, etc.) allocation
‣ Interconnect: proactive reconfiguration using OpenFlow
[Figure: computing nodes 0–3 attached to the interconnect]

Slide 20

Slide 20 text

Related Work
SDN-enhanced InfiniBand (Lee et al. SC16)
‣ Enhancement to InfiniBand that allows dynamic and per-flow level network control
Conditional OpenFlow (Benito et al. HiPC 2015)
‣ Enhanced OpenFlow that allows users to add flow entries that are activated when an Ethernet Pause (IEEE 802.3x) occurs
‣ Primary goal is to implement non-minimal adaptive routing on Ethernet
Quantized Congestion Notification Switch (Benito et al. HiPINEB 2017)
‣ Another enhancement to OpenFlow that uses received QCNs (IEEE 802.1Qau Quantized Congestion Notification) to probabilistically determine which path to select

Slide 21

Slide 21 text

Agenda
Introduction
‣ Why do we need an MPI Application-aware Interconnect?
SDN-accelerated MPI Primitives
‣ Can we accelerate MPI primitives based on our idea?
A Coordination Mechanism of Computation and Communication
‣ How do we reconfigure the interconnect in accordance with the execution of applications?
A Toolset for Analyzing Application-aware Dynamic Interconnects
‣ How will our idea perform on various types of applications and clusters?

Slide 22

Slide 22 text

SDN-accelerated MPI_Bcast
MPI broadcast leveraging hardware multicast
‣ Multicast rules are dynamically installed using OpenFlow
‣ Considers background traffic from other jobs to construct an optimal delivery tree
[Figure: an OpenFlow controller and switches SW1–SW6 connecting processes P0–P3, together with the resulting delivery tree]
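One way to take background traffic into account when building a delivery tree is sketched below. The slides do not state which algorithm the framework uses, so everything here is an assumption: link cost is taken as one hop plus the current background load, Dijkstra's algorithm finds the cheapest path to every receiver, and the union of those paths forms the tree. This is a shortest-path tree, which is not guaranteed to be an optimal multicast (Steiner) tree. The topology and load values are hypothetical.

```python
import heapq


def delivery_tree(links, load, root, receivers):
    """Return the set of directed edges (parent, child) of a shortest-path tree.

    links:     dict mapping each switch to a list of neighbouring switches
    load:      dict mapping frozenset({u, v}) to the background load on that link
    root:      switch attached to the broadcast source
    receivers: switches attached to the receiving processes
    """
    dist = {root: 0.0}
    parent = {}
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v in links[u]:
            # Assumed cost model: one hop plus the background load on the link.
            cost = d + 1.0 + load.get(frozenset((u, v)), 0.0)
            if cost < dist.get(v, float("inf")):
                dist[v] = cost
                parent[v] = u
                heapq.heappush(heap, (cost, v))

    tree = set()
    for receiver in receivers:
        node = receiver
        while node != root:
            tree.add((parent[node], node))
            node = parent[node]
    return tree


# Hypothetical four-switch topology; the SW1-SW2 link is busy with another job.
links = {
    "SW1": ["SW2", "SW3"],
    "SW2": ["SW1", "SW4"],
    "SW3": ["SW1", "SW4"],
    "SW4": ["SW2", "SW3"],
}
load = {frozenset(("SW1", "SW2")): 5.0}
tree = delivery_tree(links, load, "SW1", ["SW3", "SW4"])
```

Each edge of `tree` would then translate into one multicast output action installed on the parent switch.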

Slide 23

Slide 23 text

Agenda
Introduction
‣ Why do we need an MPI Application-aware Interconnect?
SDN-accelerated MPI Primitives
‣ Is our idea feasible at all?
A Coordination Mechanism of Computation and Communication
‣ How do we reconfigure the interconnect in accordance with the execution of applications?
A Toolset for Analyzing Application-aware Dynamic Interconnects
‣ How will our idea perform on various types of applications and clusters?

Slide 25

Slide 25 text

Synchronizing Computation and Communication
Execution of the application produces a time-varying communication pattern:
    #include <mpi.h>
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        MPI_Bcast(buf, count, …);
        /* … */
        MPI_Allreduce(sendbuf, recvbuf, count, …);
        MPI_Finalize();
    }
Reconfiguration of the interconnect updates flow tables such as:
Src MAC | Dst MAC | … | Instructions
aa:aa:aa:… | ff:ff:ff:… | … | Flood
bb:bb:bb:… | aa:aa:aa:… | … | Output to Port X
aa:aa:aa:… | bb:bb:bb:… | … | Set Dst IP to Y,…
How do we synchronize these two?

Slide 26

Slide 26 text

UnisonFlow
Our Idea: Embed encoded MPI envelope into each packet
‣ MPI Envelope: Rank, Primitive Type, Communicator, etc.
‣ Current implementation uses virtual MAC addresses to represent tags
‣ Packet flow is controlled based on the tag value (e.g. tag A: output to port X, tag B: output to port Y)
[Figure: the MPI application and MPI library in user space reach a custom kernel module in kernel space via ioctl; the tag is embedded into each MPI packet]
[5] Keichi Takahashi, "Concept and Design of SDN-enhanced MPI Framework", EWSDN 2015
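To make the idea of a tag carried in a virtual MAC address concrete, here is a minimal sketch. The bit layout is purely hypothetical and is not taken from UnisonFlow: the first octet is fixed to 0x02 (a locally administered unicast address), followed by an 8-bit primitive type, a 16-bit communicator ID and a 16-bit rank. The function names are also made up for this example.

```python
def encode_envelope(primitive: int, communicator: int, rank: int) -> str:
    """Pack an MPI envelope into a 48-bit virtual MAC address (hypothetical layout)."""
    if not (0 <= primitive < 2**8 and 0 <= communicator < 2**16 and 0 <= rank < 2**16):
        raise ValueError("envelope field out of range")
    octets = [
        0x02,                 # locally administered, unicast
        primitive,            # e.g. 1 could stand for MPI_Bcast
        communicator >> 8,
        communicator & 0xFF,
        rank >> 8,
        rank & 0xFF,
    ]
    return ":".join(f"{octet:02x}" for octet in octets)


def decode_envelope(mac: str):
    """Inverse of encode_envelope: return (primitive, communicator, rank)."""
    o = [int(part, 16) for part in mac.split(":")]
    return o[1], (o[2] << 8) | o[3], (o[4] << 8) | o[5]


tag = encode_envelope(primitive=1, communicator=0, rank=3)
fields = decode_envelope(tag)
```

With such an encoding, a switch can match on the destination MAC field alone, so an ordinary flow entry is enough to steer packets by primitive type, communicator or rank.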

Slide 27

Slide 27 text

Agenda
Introduction
‣ Why do we need an MPI Application-aware Interconnect?
SDN-accelerated MPI Primitives
‣ Is our idea feasible at all?
A Coordination Mechanism of Computation and Communication
‣ How do we reconfigure the interconnect in accordance with the execution of applications?
A Toolset for Analyzing Application-aware Dynamic Interconnects
‣ How will our idea perform on various types of applications and clusters?

Slide 42

Slide 42 text

Need for a Holistic Analysis in SDN-enhanced MPI
[Figure: jobs j1–j4 wait in a job queue and pass through Job Scheduling, Node Selection, Process Mapping and Routing; the communication pattern among ranks 0–5 is mapped onto processing elements (PEs)]

Slide 43

Slide 43 text

Q1: Impact of the Communication Pattern
How does the traffic load in the interconnect change for diverse applications?
‣ What kind of application benefits most from SDN-enhanced MPI?
‣ What happens if the number of processes scales out?

Slide 44

Slide 44 text

Q2: Impact of the Cluster Configuration
How does the traffic load in the interconnect change under diverse clusters with different configurations?
‣ How do job scheduling, node selection and process mapping affect the performance of applications?
‣ How does the topology of the interconnect impact the performance?
‣ What happens if the size of the cluster scales out?
Job Scheduling (i.e. which job should be executed next?)
Node Selection (i.e. which nodes should be allocated to a given job?)
Process Placement (i.e. which node should execute a process?)

Slide 45

Slide 45 text

Requirements for the Interconnect Analysis Toolset
We aim to develop a toolset to help answer these questions
‣ How does the traffic load in the interconnect change for diverse applications?
‣ How does the traffic load in the interconnect change under diverse clusters with different configurations?
A simulator-based approach is taken to allow rapid assessment
‣ Requirements for the toolset are summarized as:
1. Support for application-aware dynamic routing
2. Support for communication patterns of real-world applications
3. Support for diverse cluster configurations

Slide 46

Slide 46 text

Overview of PFAnalyzer
PFProf (profiler) and PFSim (simulator) constitute PFAnalyzer [6]
‣ PFProf - Fine-grained MPI profiler for observing network activity caused by MPI function calls (Requirement 2)
‣ PFSim - Lightweight simulator to simulate traffic load in the interconnect targeting application-aware dynamic interconnects (Requirements 1, 2, 3)
[Figure: an application is profiled by PFProf, and the resulting profile is fed to PFSim, which produces the result]
[6] Keichi Takahashi et al., "A Toolset for Analyzing Application-aware Dynamic Interconnects", HPCMASPA 2017

Slide 47

Slide 47 text

PFProf: Motivation
Existing profilers do not capture the underlying pt2pt communication of collective communication
‣ They are designed to support code tuning and optimization, not network traffic analysis.
‣ MPI Profiling Interface (PMPI) only captures individual MPI function calls.
[Figure: behavior of MPI_Bcast as seen from applications versus the actual communication performed among ranks 0–7]
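The gap between what an application sees and what travels over the network can be illustrated with a binomial-tree broadcast, a common way to implement MPI_Bcast. This is an assumption made for illustration: the algorithm an MPI library actually selects depends on the library, the message size and the number of ranks, and the figure on the slide may show a different tree. The function name is hypothetical.

```python
def binomial_bcast_messages(n_ranks: int, root: int = 0):
    """List the (sender, receiver) pt2pt messages of a binomial-tree broadcast."""
    messages = []
    mask = 1
    while mask < n_ranks:
        # In each round, every rank that already holds the data sends it once.
        for rel in range(mask):
            peer = rel + mask
            if peer < n_ranks:
                sender = (rel + root) % n_ranks
                receiver = (peer + root) % n_ranks
                messages.append((sender, receiver))
        mask <<= 1
    return messages


messages = binomial_bcast_messages(8)
```

A PMPI-based profiler would record a single MPI_Bcast call on each of the 8 ranks, whereas the network carries the 7 point-to-point transfers listed in `messages`.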

Slide 52

Slide 52 text

PFProf: Implementation
MPI Performance Revealing Extension Interface (PERUSE) is utilized
‣ PERUSE exposes internal information of the MPI library
‣ Notifies you when a request is posted/completed, a transfer begins/ends, etc.
[Figure: the MPI application calls MPI functions in the MPI library; PFProf subscribes to PERUSE events, is notified of them by the library, and hooks MPI functions with PMPI. Functions listed: MPI_Init, MPI_Finalize, MPI_Comm_create, MPI_Comm_dup, MPI_Comm_free]

Slide 53

Slide 53 text

Representation of Communication Pattern
The communication pattern of an application is represented using its traffic matrix
‣ Defined as a matrix T whose element T_ij is equal to the volume of traffic sent from rank i to rank j
‣ Assumes that the volume of traffic between processes is constant during the execution of a job
[Figure: an example obtained from running the NERSC MILC benchmark with 128 processes; axes: Sender Rank, Receiver Rank; color scale: Sent Bytes (×10^8)]
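The traffic matrix defined above can be accumulated from a list of observed messages as in the following sketch. The record format `(sender, receiver, bytes)` and the function name are assumptions for this example and do not reflect PFProf's actual output format.

```python
def traffic_matrix(n_ranks, messages):
    """Build T where T[i][j] is the total number of bytes sent from rank i to rank j."""
    T = [[0] * n_ranks for _ in range(n_ranks)]
    for sender, receiver, nbytes in messages:
        T[sender][receiver] += nbytes
    return T


# Hypothetical message records: (sender rank, receiver rank, bytes).
events = [(0, 1, 1024), (0, 1, 2048), (2, 0, 512)]
T = traffic_matrix(4, events)
```

Because the matrix sums over the whole run, the order and timing of the messages are discarded, which is exactly the constant-traffic assumption stated on the slide.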

Slide 54

Slide 54 text

PFProf: Overhead Evaluation
Measured throughput and latency of pt2pt communication with and without PFProf using the OSU Micro-Benchmarks
[Figure: two panels, Throughput (osu_bw) and Latency (osu_latency); x-axis: Message Size [B]; y-axes: Throughput [MB/s] with Relative Throughput, and Latency [µs] with Relative Latency; series: w/o profiler, w/ profiler]

Slide 58

Slide 58 text

PFSim: Overview
‣ Input: Simulation Scenario, Cluster Configuration, Cluster Topology, Communication Patterns (from PFProf)
‣ Plugins: Scheduling, Node Selection, Process Placement, Routing
‣ Output: Interconnect Usage, Performance Metric Plot, Simulation Log
‣ Parts of the figure are annotated "For Requirement 1", "For Requirement 2" and "For Requirement 3"

Slide 60

Slide 60 text

PFSim: Architecture
‣ Events are taken from the event queue and dispatched to event handlers (Job Submitted, Job Started, Job Finished, …)
‣ Event handlers are customized via plugins
‣ Event handlers update the simulator state: job queue (j1–j4), interconnect, computing nodes
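The event-driven structure described above can be sketched as a small discrete-event loop. This is not PFSim's code: the `Simulator` class, the event names and the two handlers are hypothetical stand-ins, and the fixed 10-unit job duration is an arbitrary choice for the example.

```python
import heapq
import itertools


class Simulator:
    def __init__(self):
        self.queue = []                    # event queue ordered by time
        self.counter = itertools.count()   # tie-breaker for simultaneous events
        self.handlers = {}                 # event kind -> list of handlers
        self.now = 0.0
        self.log = []

    def on(self, kind, handler):
        """Register a handler (the plugin hook in this sketch)."""
        self.handlers.setdefault(kind, []).append(handler)

    def schedule(self, time, kind, job):
        heapq.heappush(self.queue, (time, next(self.counter), kind, job))

    def run(self):
        while self.queue:
            self.now, _, kind, job = heapq.heappop(self.queue)
            self.log.append((self.now, kind, job))
            for handler in self.handlers.get(kind, []):
                handler(self, job)


def start_immediately(sim, job):
    # Stand-in for a scheduler plugin: start every job as soon as it arrives.
    sim.schedule(sim.now, "job_started", job)


def finish_after_fixed_time(sim, job):
    # Stand-in for the traffic model: every job runs for 10 time units.
    sim.schedule(sim.now + 10.0, "job_finished", job)


sim = Simulator()
sim.on("job_submitted", start_immediately)
sim.on("job_started", finish_after_fixed_time)
sim.schedule(0.0, "job_submitted", "j1")
sim.schedule(5.0, "job_submitted", "j2")
sim.run()
```

Swapping `start_immediately` for a different function changes the scheduling policy without touching the event loop, which is the kind of customization the plugin mechanism on the slide refers to.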

Slide 63

Slide 63 text

PFSim: Example Input & Output
Cluster Configuration (YAML):
    topology: topologies/milk.graphml
    output: output/milk-cg-dmodk
    algorithms:
      scheduler:
        - pfsim.scheduler.FCFSScheduler
      node_selector:
        - pfsim.node_selector.LinearNodeSelector
        - pfsim.node_selector.RandomNodeSelector
      process_mapper:
        - pfsim.process_mapper.LinearProcessMapper
        - pfsim.process_mapper.CyclicProcessMapper
      router:
        - pfsim.router.DmodKRouter
        - pfsim.router.GreedyRouter
        - pfsim.router.GreedyRouter2
    jobs:
      - submit:
          distribution: pfsim.math.ExponentialDistribution
          params:
            lambd: 0.1
        trace: traces/cg-c-128.tar.gz
Interconnect Utilization (output GraphML visualized with Cytoscape), annotated with High Traffic Load and Less Traffic Load
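Since the configuration lists several candidates per algorithm slot, one plausible reading is that every combination is simulated. Whether PFSim expands the lists this way is an assumption; the sketch below only shows how such a Cartesian product can be enumerated, using the class paths from the configuration as plain strings. The `expand` helper is hypothetical.

```python
from itertools import product

# Class paths copied from the configuration above; treated as opaque strings here.
algorithms = {
    "scheduler": ["pfsim.scheduler.FCFSScheduler"],
    "node_selector": [
        "pfsim.node_selector.LinearNodeSelector",
        "pfsim.node_selector.RandomNodeSelector",
    ],
    "process_mapper": [
        "pfsim.process_mapper.LinearProcessMapper",
        "pfsim.process_mapper.CyclicProcessMapper",
    ],
    "router": [
        "pfsim.router.DmodKRouter",
        "pfsim.router.GreedyRouter",
        "pfsim.router.GreedyRouter2",
    ],
}


def expand(algorithms):
    """Return one dict per combination of candidate algorithms."""
    keys = list(algorithms)
    return [dict(zip(keys, combo)) for combo in product(*(algorithms[k] for k in keys))]


combos = expand(algorithms)
```

For this configuration the product yields 1 × 2 × 2 × 3 = 12 scenarios.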

Slide 64

Slide 64 text

Simulated Configurations
‣ Node Selection: Linear, Random
‣ Process Placement: Linear (0 1 2 3 4 5), Cyclic (0 3 1 4 2 5)
‣ Routing: D-mod-K (path selected solely based on the destination of the flow), Dynamic (path allocated based on the communication pattern, heavy pairs first)
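The two process placement policies can be sketched as follows, assuming 6 ranks spread over 3 nodes as the rank sequences on the slide suggest. The function names are hypothetical and the node count is an assumption read off the figure residue.

```python
def linear_placement(n_ranks, n_nodes):
    """Fill each node with consecutive ranks before moving to the next node."""
    per_node = -(-n_ranks // n_nodes)  # ceiling division
    return [rank // per_node for rank in range(n_ranks)]


def cyclic_placement(n_ranks, n_nodes):
    """Deal ranks to nodes round-robin."""
    return [rank % n_nodes for rank in range(n_ranks)]


def ranks_by_node(placement, n_nodes):
    """Group rank numbers by the node they were placed on."""
    groups = [[] for _ in range(n_nodes)]
    for rank, node in enumerate(placement):
        groups[node].append(rank)
    return groups


linear = ranks_by_node(linear_placement(6, 3), 3)
cyclic = ranks_by_node(cyclic_placement(6, 3), 3)
```

Reading the groups node by node gives 0 1 2 3 4 5 for the linear policy and 0 3 1 4 2 5 for the cyclic one.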

Slide 65

Slide 65 text

Simulation Results
Maximum traffic load on all links is plotted as a performance indicator
[Figure: two bar charts, NAS CG Benchmark (128 ranks) and NERSC MILC Benchmark (128 ranks); y-axis: Maximum Traffic (Normalized); one bar per node selection/process placement/routing combination: Linear/Block/DmodK, Linear/Block/Dynamic, Linear/Cyclic/DmodK, Linear/Cyclic/Dynamic, Random/Block/DmodK, Random/Block/Dynamic, Random/Cyclic/DmodK, Random/Cyclic/Dynamic; legend: D-mod-K, Dynamic]
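The performance indicator, the maximum traffic load over all links, can be computed from a traffic matrix and a set of routes as sketched below. The data structures are assumptions for this example: traffic is a dict keyed by (sender, receiver), and each route is a list of node names. The slide does not say which baseline the plotted values are normalized against, so normalization is left out.

```python
from collections import defaultdict


def link_loads(traffic, routes):
    """Sum the bytes carried by each directed link.

    traffic: dict mapping (src, dst) to bytes sent
    routes:  dict mapping (src, dst) to the path taken, as a list of node names
    """
    loads = defaultdict(int)
    for pair, nbytes in traffic.items():
        path = routes[pair]
        for link in zip(path, path[1:]):
            loads[link] += nbytes
    return dict(loads)


# Hypothetical example: two flows converge on the link from switch s0 to node n1.
traffic = {(0, 1): 100, (2, 1): 50}
routes = {
    (0, 1): ["n0", "s0", "n1"],
    (2, 1): ["n2", "s0", "n1"],
}
loads = link_loads(traffic, routes)
max_load = max(loads.values())
```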

Slide 66

Slide 66 text

Further Challenges
Simulation-based study of large-scale clusters with different topologies
‣ Currently, our institution owns only a small-scale experimental cluster equipped with SDN
Integrate the interconnect controller with the scheduler and MPI runtime
‣ To support multiple jobs running in parallel
‣ To investigate the effect of node allocation and process placement
Better application-aware routing algorithms
‣ Currently, a simple greedy-like algorithm is used
‣ How about optimization or machine learning?
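A greedy, heavy-pairs-first path allocation of the kind mentioned above might look like the following sketch. It is not the GreedyRouter from PFSim: the candidate paths, the tie-breaking rule (the first candidate wins) and the objective (minimize the most loaded link on the chosen path) are all assumptions.

```python
def greedy_route(traffic, candidates):
    """Assign a path to each pair, heaviest pair first.

    traffic:    dict mapping (src, dst) to bytes sent
    candidates: dict mapping (src, dst) to a list of candidate paths (lists of nodes)
    Returns (chosen paths, resulting link loads).
    """
    loads = {}
    chosen = {}
    heavy_first = sorted(traffic.items(), key=lambda item: item[1], reverse=True)
    for pair, nbytes in heavy_first:
        def worst_link(path):
            # Load of the busiest link on this path if the pair were routed over it.
            return max(loads.get(link, 0) + nbytes for link in zip(path, path[1:]))

        best = min(candidates[pair], key=worst_link)
        chosen[pair] = best
        for link in zip(best, best[1:]):
            loads[link] = loads.get(link, 0) + nbytes
    return chosen, loads


# Hypothetical two-leaf, two-spine network: both pairs can go via spine "a" or "b".
paths = [["l0", "a", "l1"], ["l0", "b", "l1"]]
traffic = {(0, 1): 100, (2, 3): 80}
candidates = {(0, 1): paths, (2, 3): paths}
chosen, loads = greedy_route(traffic, candidates)
```

Here the heavier pair takes spine "a" and the lighter pair is steered to spine "b", so the maximum link load is 100 rather than 180.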

Slide 67

Slide 67 text

Summary
Current static and over-provisioned interconnects might not scale well
‣ SDN allows us to build more dynamic and application-aware interconnects
‣ Such an architecture could improve the utilization of the interconnect and communication performance
Our achievements so far include:
‣ SDN-accelerated MPI primitives such as Bcast and Allreduce
‣ UnisonFlow, a coordination mechanism of computation and communication
‣ PFAnalyzer, a toolset for analyzing application-aware dynamic interconnects