Slide 1

Slide 1 text

PFAnalyzer: A Toolset for Analyzing Application-aware Dynamic Interconnects
Keichi Takahashi, Susumu Date, Dashdavaa Khureltulga, Yoshiyuki Kido, Shinji Shimojo
Cybermedia Center, Osaka University

Slide 3

Slide 3 text

Challenges in Future Interconnects
Over-provisioned designs might not scale well
‣ Interconnects can consume up to 50% of the total power [1] and one third of the total budget of a cluster [2]
‣ Properties such as full bisection bandwidth and non-blocking operation may become increasingly difficult to achieve
‣ Need to improve the utilization of the interconnect
Our proposal is to adopt:
‣ Dynamic (adaptive) routing
‣ Application-aware network control
[1] J. Kim et al., “Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks,” ISCA, vol. 35, no. 2, pp. 126–137, 2007.
[2] D. Abts et al., “Energy proportional datacenter networks,” ACM SIGARCH Comput. Archit. News, vol. 38, no. 3, p. 338, 2010.
[3] S. Kamil, L. Oliker, A. Pinar, and J. Shalf, “Communication Requirements and Interconnect Optimization for High-End Scientific Applications,” IEEE Trans. Parallel Distrib. Syst., vol. 21, no. 2, pp. 188–202, 2010.

Slide 4

Slide 4 text

SDN-enhanced MPI Framework
Our prototype framework that integrates SDN into MPI [4,5,6]
‣ Dynamically controls the interconnect based on the communication pattern of MPI applications
‣ Uses Software-Defined Networking (SDN) as a key technology to realize dynamic interconnect control (e.g. dynamic routing)
‣ Successfully accelerated several MPI primitives (e.g. MPI_Bcast, MPI_Allreduce)
[4] K. Takahashi et al., “Performance Evaluation of SDN-enhanced MPI_Allreduce on a Cluster System with Fat-tree Interconnect,” HPCS 2014.
[5] B. Munkhdorj et al., “Design and Implementation of Control Sequence Generator for SDN-enhanced MPI,” NDM’15.
[6] S. Date et al., “SDN-accelerated HPC Infrastructure for Scientific Research,” IJIT, vol. 22, no. 01, 2016.

Slide 5

Slide 5 text

Software-Defined Networking (SDN)
[Figure: Conventional Networking, with the labels Feature, Control Plane, Data Plane]

Slide 6

Slide 6 text

Software-Defined Networking (SDN)
[Figure: Conventional Networking (Feature, Control Plane, Data Plane), followed by Disaggregation]

Slide 7

Slide 7 text

Software-Defined Networking (SDN)
[Figure: Disaggregation from Conventional Networking (Feature, Control Plane, Data Plane) to Software Defined Networking (Apps, Northbound API, Feature, Control Plane, Southbound API (e.g. OpenFlow), Data Plane)]

Slide 8

Slide 8 text

Basic Idea of SDN-enhanced MPI
[Figure: Computing Nodes 0 to 3 attached to the Interconnect; a Communication Pattern among ranks 0 to 3; an Interconnect Control Seq.]

Slide 9

Slide 9 text

Basic Idea of SDN-enhanced MPI
[Figure: Computing Nodes 0 to 3 attached to the Interconnect; a Communication Pattern among ranks 0 to 3; an Interconnect Control Seq.]
Step labels: Extract via Tracer/Profiler, Static Analysis

Slide 10

Slide 10 text

Basic Idea of SDN-enhanced MPI
[Figure: Computing Nodes 0 to 3 attached to the Interconnect; a Communication Pattern among ranks 0 to 3; an Interconnect Control Seq.]
Step labels: Extract via Tracer/Profiler, Static Analysis; Resource (Path, Bandwidth, etc.) Allocation

Slide 11

Slide 11 text

Basic Idea of SDN-enhanced MPI
[Figure: Computing Nodes 0 to 3 attached to the Interconnect; a Communication Pattern among ranks 0 to 3; an Interconnect Control Seq.]
Step labels: Extract via Tracer/Profiler, Static Analysis; Resource (Path, Bandwidth, etc.) Allocation; Apply using OpenFlow

Slide 12

Slide 12 text

Need for a Holistic Analysis in SDN-enhanced MPI
[Figure: Job Queue holding jobs j1 to j4]

Slide 14

Slide 14 text

Need for a Holistic Analysis in SDN-enhanced MPI
[Figure: Job Queue holding jobs j1 to j4]
Stage labels: Job Scheduling

Slide 15

Slide 15 text

Need for a Holistic Analysis in SDN-enhanced MPI
[Figure: Job Queue holding jobs j1 to j4; a graph with vertices 0 to 5]
Stage labels: Job Scheduling

Slide 16

Slide 16 text

Need for a Holistic Analysis in SDN-enhanced MPI
[Figure: Job Queue holding jobs j1 to j4; Communication Pattern drawn as a graph with vertices 0 to 5]
Stage labels: Job Scheduling

Slide 18

Slide 18 text

Need for a Holistic Analysis in SDN-enhanced MPI
[Figure: Job Queue holding jobs j1 to j4; Communication Pattern drawn as a graph with vertices 0 to 5]
Stage labels: Job Scheduling, Node Selection

Slide 19

Slide 19 text

Need for a Holistic Analysis in SDN-enhanced MPI
[Figure: Job Queue holding jobs j1 to j4; Communication Pattern drawn as a graph with vertices 0 to 5; three groups of PEs]
Stage labels: Job Scheduling, Node Selection

Slide 20

Slide 20 text

Need for a Holistic Analysis in SDN-enhanced MPI
[Figure: Job Queue holding jobs j1 to j4; Communication Pattern drawn as a graph with vertices 0 to 5; three groups of PEs labeled 0 to 5]
Stage labels: Job Scheduling, Node Selection

Slide 21

Slide 21 text

Need for a Holistic Analysis in SDN-enhanced MPI
[Figure: Job Queue holding jobs j1 to j4; Communication Pattern drawn as a graph with vertices 0 to 5; three groups of PEs labeled 0 to 5]
Stage labels: Job Scheduling, Node Selection, Process Mapping

Slide 25

Slide 25 text

Need for a Holistic Analysis in SDN-enhanced MPI
[Figure: Job Queue holding jobs j1 to j4; Communication Pattern drawn as a graph with vertices 0 to 5; three groups of PEs labeled 0 to 5]
Stage labels: Job Scheduling, Node Selection, Process Mapping, Routing

Slide 26

Slide 26 text

Q1: Impact of the Communication Pattern
How does the traffic load in the interconnect change for diverse applications?
‣ What kinds of applications benefit most from SDN-enhanced MPI?
‣ What happens if the number of processes scales out?

Slide 27

Slide 27 text

Q2: Impact of the Cluster Configuration
How does the traffic load in the interconnect change under diverse clusters with different configurations?
‣ How do job scheduling, node selection and process mapping affect the performance of applications?
‣ How does the topology of the interconnect impact the performance?
‣ What happens if the size of the cluster scales out?
[Figure: job queue (j1 to j4) and nodes 0 to 3, annotated with the three decisions below]
Job Scheduling (i.e. which job should be executed next?)
Node Selection (i.e. which node should be allocated to a given job?)
Process Placement (i.e. which node should execute a process?)

Slide 28

Slide 28 text

Requirements for the Interconnect Analysis Toolset
We aim to develop a toolset to help answer these questions
‣ How does the traffic load in the interconnect change for diverse applications?
‣ How does the traffic load in the interconnect change under diverse clusters with different configurations?
A simulator-based approach is taken to allow rapid assessment
‣ Requirements for the toolset are summarized as:
1. Support for application-aware dynamic routing
2. Support for communication patterns of real-world applications
3. Support for diverse cluster configurations

Slide 29

Slide 29 text

Related Work
ORCS [7]
‣ Simulates the traffic load of each link in the interconnect for a given topology, communication pattern and routing algorithm
INAM2 [8]
‣ Comprehensive tool to monitor and analyze network activities in an InfiniBand network
PSINS [9]
‣ Trace-driven simulator for predicting the performance of applications on a variety of HPC clusters with different configurations
[7] T. Schneider et al., “ORCS: An Oblivious Routing Congestion Simulator,” Indiana University, Computer, no. 675, 2009.
[8] H. Subramoni et al., “INAM2: InfiniBand Network Analysis and Monitoring with MPI,” ISC 2016, pp. 300–320.
[9] M. M. Tikir et al., “PSINS: An Open Source Event Tracer and Execution Simulator,” HPCMP-UGC 2009, pp. 444–449.

Slide 30

Slide 30 text

Overview of PFAnalyzer
PFProf (profiler) and PFSim (simulator) constitute PFAnalyzer
‣ PFProf - Fine-grained MPI profiler for observing network activity caused by MPI function calls (Requirement 2)
‣ PFSim - Lightweight simulator to simulate traffic load in the interconnect targeting application-aware dynamic interconnects (Requirements 1, 2, 3)
[Figure: diagram with the labels Application, PFProf, Profile, PFSim, Result]

Slide 31

Slide 31 text

PFProf: Motivation
Existing profilers do not capture the underlying point-to-point (pt2pt) communication of collective communication
‣ They are designed to support code tuning and optimization, not network traffic analysis.
‣ The MPI Profiling Interface (PMPI) only captures individual MPI function calls.
[Figure: two diagrams over ranks 0 to 7: behavior of MPI_Bcast as seen from applications, and the actual communication performed]

Slide 32

Slide 32 text

PFProf: Implementation
The MPI Performance Revealing Extension Interface (PERUSE) is utilized
‣ PERUSE exposes internal information of the MPI library
‣ Notifies you when a request is posted/completed, a transfer begins/ends, etc.
[Figure: MPI Application, PFProf and MPI Library, with the function list MPI_Init, MPI_Finalize, MPI_Comm_create, MPI_Comm_dup, MPI_Comm_free]

Slide 33

Slide 33 text

PFProf: Implementation
The MPI Performance Revealing Extension Interface (PERUSE) is utilized
‣ PERUSE exposes internal information of the MPI library
‣ Notifies you when a request is posted/completed, a transfer begins/ends, etc.
[Figure: MPI Application, PFProf and MPI Library, with the function list MPI_Init, MPI_Finalize, MPI_Comm_create, MPI_Comm_dup, MPI_Comm_free]
Interaction labels: Subscribe to PERUSE Events

Slide 34

Slide 34 text

PFProf: Implementation
The MPI Performance Revealing Extension Interface (PERUSE) is utilized
‣ PERUSE exposes internal information of the MPI library
‣ Notifies you when a request is posted/completed, a transfer begins/ends, etc.
[Figure: MPI Application, PFProf and MPI Library, with the function list MPI_Init, MPI_Finalize, MPI_Comm_create, MPI_Comm_dup, MPI_Comm_free]
Interaction labels: Subscribe to PERUSE Events; Call MPI Functions

Slide 35

Slide 35 text

PFProf: Implementation
The MPI Performance Revealing Extension Interface (PERUSE) is utilized
‣ PERUSE exposes internal information of the MPI library
‣ Notifies you when a request is posted/completed, a transfer begins/ends, etc.
[Figure: MPI Application, PFProf and MPI Library, with the function list MPI_Init, MPI_Finalize, MPI_Comm_create, MPI_Comm_dup, MPI_Comm_free]
Interaction labels: Subscribe to PERUSE Events; Call MPI Functions; Notify PERUSE Events

Slide 36

Slide 36 text

PFProf: Implementation
The MPI Performance Revealing Extension Interface (PERUSE) is utilized
‣ PERUSE exposes internal information of the MPI library
‣ Notifies you when a request is posted/completed, a transfer begins/ends, etc.
[Figure: MPI Application, PFProf and MPI Library, with the function list MPI_Init, MPI_Finalize, MPI_Comm_create, MPI_Comm_dup, MPI_Comm_free]
Interaction labels: Subscribe to PERUSE Events; Call MPI Functions; Notify PERUSE Events; Hook MPI Functions with PMPI

Slide 37

Slide 37 text

Representation of Communication Pattern
The communication pattern of an application is represented using its traffic matrix
‣ Defined as a matrix T whose element T_ij is equal to the volume of traffic sent from rank i to rank j
‣ Assumes that the volume of traffic between processes is constant during the execution of a job
[Figure: heat map of Sent Bytes (×10^8) over Sender Rank and Receiver Rank. An example obtained from running the NERSC MILC benchmark with 128 processes]
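As a minimal sketch of this representation, the Python below accumulates a traffic matrix T from a list of point-to-point transfer records. The record layout (sender rank, receiver rank, bytes) and the name build_traffic_matrix are assumptions made for illustration; they are not PFProf's actual output format.

```python
def build_traffic_matrix(num_ranks, transfers):
    """Return T where T[i][j] is the total number of bytes sent from rank i to rank j."""
    matrix = [[0] * num_ranks for _ in range(num_ranks)]
    for sender, receiver, nbytes in transfers:
        matrix[sender][receiver] += nbytes
    return matrix


# Hypothetical transfer records: (sender rank, receiver rank, bytes).
transfers = [(0, 1, 1024), (1, 0, 512), (0, 1, 2048), (2, 3, 4096)]
T = build_traffic_matrix(4, transfers)
```

Because the records are summed over the whole run, any ordering or timing of the transfers is discarded, which is exactly the constant-traffic assumption stated above.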

Slide 38

Slide 38 text

PFProf: Overhead Evaluation
Measured throughput and latency of pt2pt communication with and without PFProf using the OSU Microbenchmark
[Figure: two plots against Message Size [B], each comparing w/o profiler and w/ profiler. Throughput (osu_bw): Throughput [MB/s] and Relative Throughput. Latency (osu_latency): Latency [µs] and Relative Latency]

Slide 39

Slide 39 text

PFSim: Overview
[Figure: PFSim block diagram]
Input: Simulation Scenario, Cluster Configuration, Communication Patterns (from PFProf), Cluster
Plugin: Scheduling, Node Selection, Process Placement, Routing
Output: Interconnect Usage, Performance Metric Plot, Simulation Log

Slide 40

Slide 40 text

PFSim: Overview
[Figure: PFSim block diagram]
Input: Simulation Scenario, Cluster Configuration, Communication Patterns (from PFProf), Cluster Topology
Plugin: Scheduling, Node Selection, Process Placement, Routing
Output: Interconnect Usage, Performance Metric Plot, Simulation Log
Annotations: For Requirement 2

Slide 41

Slide 41 text

PFSim: Overview
[Figure: PFSim block diagram]
Input: Simulation Scenario, Cluster Configuration, Communication Patterns (from PFProf), Cluster Topology
Plugin: Scheduling, Node Selection, Process Placement, Routing
Output: Interconnect Usage, Performance Metric Plot, Simulation Log
Annotations: For Requirement 2; For Requirement 3 (twice)

Slide 42

Slide 42 text

PFSim: Overview
[Figure: PFSim block diagram]
Input: Simulation Scenario, Cluster Configuration, Communication Patterns (from PFProf), Cluster Topology
Plugin: Scheduling, Node Selection, Process Placement, Routing
Output: Interconnect Usage, Performance Metric Plot, Simulation Log
Annotations: For Requirement 1; For Requirement 2; For Requirement 3 (twice)

Slide 43

Slide 43 text

PFSim: Architecture
[Figure: Events in the Event Queue go through Event Dispatch to the Event Handlers (Job Submitted, Job Started, Job Finished, …), which Update the Simulator State (Job Queue j1 to j4, Interconnect, Computing Nodes)]

Slide 44

Slide 44 text

PFSim: Architecture
[Figure: Events in the Event Queue go through Event Dispatch to the Event Handlers (Job Submitted, Job Started, Job Finished, …), which Update the Simulator State (Job Queue j1 to j4, Interconnect, Computing Nodes)]
Annotation: Customized via Plugins
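The event-driven structure in this diagram can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not PFSim's code: the Simulator class, its on/schedule/run methods and the job dictionaries are hypothetical names, and every job starts as soon as it is submitted (no scheduler, unlimited nodes).

```python
import heapq
import itertools


class Simulator:
    """Toy discrete-event loop: an event queue plus per-event-type handlers."""

    def __init__(self):
        self.now = 0.0
        self.log = []                       # stands in for the simulator state
        self._queue = []                    # min-heap ordered by event time
        self._counter = itertools.count()   # tie-breaker for equal timestamps
        self._handlers = {}

    def on(self, kind, handler):
        self._handlers.setdefault(kind, []).append(handler)

    def schedule(self, time, kind, payload):
        heapq.heappush(self._queue, (time, next(self._counter), kind, payload))

    def run(self):
        while self._queue:
            self.now, _, kind, payload = heapq.heappop(self._queue)
            for handler in self._handlers.get(kind, []):
                handler(self, payload)


def job_submitted(sim, job):
    sim.log.append((sim.now, "submitted", job["name"]))
    sim.schedule(sim.now, "job_started", job)  # assumption: start immediately


def job_started(sim, job):
    sim.log.append((sim.now, "started", job["name"]))
    sim.schedule(sim.now + job["duration"], "job_finished", job)


def job_finished(sim, job):
    sim.log.append((sim.now, "finished", job["name"]))


sim = Simulator()
sim.on("job_submitted", job_submitted)
sim.on("job_started", job_started)
sim.on("job_finished", job_finished)
sim.schedule(0.0, "job_submitted", {"name": "j1", "duration": 5.0})
sim.schedule(1.0, "job_submitted", {"name": "j2", "duration": 2.0})
sim.run()
```

Handlers are registered per event type, so swapping one handler for another changes the behavior without touching the loop; this is one way to read "customized via plugins".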

Slide 45

Slide 45 text

PFSim: Example Input & Output
Cluster Configuration (YAML):

    topology: topologies/milk.graphml
    output: output/milk-cg-dmodk
    algorithms:
      scheduler:
        - pfsim.scheduler.FCFSScheduler
      node_selector:
        - pfsim.node_selector.LinearNodeSelector
        - pfsim.node_selector.RandomNodeSelector
      process_mapper:
        - pfsim.process_mapper.LinearProcessMapper
        - pfsim.process_mapper.CyclicProcessMapper
      router:
        - pfsim.router.DmodKRouter
        - pfsim.router.GreedyRouter
        - pfsim.router.GreedyRouter2
    jobs:
      - submit:
          distribution: pfsim.math.ExponentialDistribution
          params:
            lambd: 0.1
        trace: traces/cg-c-128.tar.gz

[Figure: Interconnect Utilization (Output GraphML visualized with Cytoscape)]

Slide 46

Slide 46 text

PFSim: Example Input & Output
Cluster Configuration (YAML):

    topology: topologies/milk.graphml
    output: output/milk-cg-dmodk
    algorithms:
      scheduler:
        - pfsim.scheduler.FCFSScheduler
      node_selector:
        - pfsim.node_selector.LinearNodeSelector
        - pfsim.node_selector.RandomNodeSelector
      process_mapper:
        - pfsim.process_mapper.LinearProcessMapper
        - pfsim.process_mapper.CyclicProcessMapper
      router:
        - pfsim.router.DmodKRouter
        - pfsim.router.GreedyRouter
        - pfsim.router.GreedyRouter2
    jobs:
      - submit:
          distribution: pfsim.math.ExponentialDistribution
          params:
            lambd: 0.1
        trace: traces/cg-c-128.tar.gz

[Figure: Interconnect Utilization (Output GraphML visualized with Cytoscape), with a callout marking High Traffic Load]

Slide 47

Slide 47 text

PFSim: Example Input & Output
Cluster Configuration (YAML):

    topology: topologies/milk.graphml
    output: output/milk-cg-dmodk
    algorithms:
      scheduler:
        - pfsim.scheduler.FCFSScheduler
      node_selector:
        - pfsim.node_selector.LinearNodeSelector
        - pfsim.node_selector.RandomNodeSelector
      process_mapper:
        - pfsim.process_mapper.LinearProcessMapper
        - pfsim.process_mapper.CyclicProcessMapper
      router:
        - pfsim.router.DmodKRouter
        - pfsim.router.GreedyRouter
        - pfsim.router.GreedyRouter2
    jobs:
      - submit:
          distribution: pfsim.math.ExponentialDistribution
          params:
            lambd: 0.1
        trace: traces/cg-c-128.tar.gz

[Figure: Interconnect Utilization (Output GraphML visualized with Cytoscape), with callouts marking High Traffic Load and Less Traffic Load]
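The algorithms section of the configuration names each plugin by a dotted path. One plausible way to resolve such names in Python is importlib, sketched below; whether PFSim resolves them this way is an assumption, and load_class is a hypothetical helper. Since the pfsim package is not available here, standard-library classes stand in for entries such as pfsim.router.DmodKRouter.

```python
import importlib


def load_class(dotted_path):
    """Resolve a 'package.module.ClassName' string to the class object."""
    module_name, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Stand-in configuration: standard-library classes replace the pfsim plugins.
config = {"algorithms": {"router": ["collections.OrderedDict", "collections.Counter"]}}
routers = [load_class(path) for path in config["algorithms"]["router"]]
```

Listing several classes under one key, as the configuration above does for node_selector, process_mapper and router, would then let a driver loop over every combination of plugins.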

Slide 48

Slide 48 text

Simulated Cluster
‣ Modeled after a cluster installed at our institution
‣ 20 computing nodes (160 cores)
‣ 2-level fat-tree topology (oversubscription ratio = 2.5)
‣ Switch (NEC PF5240) supports OpenFlow 1.0 (and 1.3)
‣ NAS CG and NERSC MILC are used as workloads
[Figure: fat-tree with Spine Switches, Leaf Switches and Computing Nodes]

Slide 49

Slide 49 text

Simulated Configurations
Node Selection: Linear, Random
Process Placement: Linear (ranks laid out as 0 1 2 3 4 5), Cyclic (ranks laid out as 0 3 1 4 2 5)
Routing:
‣ D-mod-K: Path selected solely based on the destination of flow
‣ Dynamic: Path allocated based on communication pattern (heavy pairs first)
[Figure: traffic matrix heat map of Sent Bytes (×10^8) over Sender Rank and Receiver Rank]
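The two process placement policies can be illustrated with a short Python sketch. It assumes the slide's example shows 6 ranks on 3 nodes with 2 slots each, which is an interpretation of the figure; the function names are hypothetical and not taken from PFSim.

```python
def linear_placement(num_ranks, num_nodes, slots_per_node):
    """Fill each node completely before moving on to the next one."""
    if num_ranks > num_nodes * slots_per_node:
        raise ValueError("not enough slots for all ranks")
    nodes = [[] for _ in range(num_nodes)]
    for rank in range(num_ranks):
        nodes[rank // slots_per_node].append(rank)
    return nodes


def cyclic_placement(num_ranks, num_nodes, slots_per_node):
    """Deal ranks to nodes in round-robin order."""
    if num_ranks > num_nodes * slots_per_node:
        raise ValueError("not enough slots for all ranks")
    nodes = [[] for _ in range(num_nodes)]
    for rank in range(num_ranks):
        nodes[rank % num_nodes].append(rank)
    return nodes


linear = linear_placement(6, 3, 2)   # neighbouring ranks share a node
cyclic = cyclic_placement(6, 3, 2)   # neighbouring ranks land on different nodes
```

Reading the node lists left to right gives 0 1 2 3 4 5 for the linear policy and 0 3 1 4 2 5 for the cyclic one.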

Slide 50

Slide 50 text

Simulation Results
Maximum traffic load on all links is plotted as a performance indicator
[Figure: two bar charts of Maximum Traffic (Normalized), one for the NAS CG Benchmark (128 ranks) and one for the NERSC MILC Benchmark (128 ranks). Each compares D-mod-K and Dynamic routing for the configurations Linear/Block, Linear/Cyclic, Random/Block and Random/Cyclic]
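To make the indicator concrete, the Python sketch below computes the maximum link load for destination-based (D-mod-K) routing and for a greedy heavy-pairs-first assignment. It rests on several assumptions: a simplified two-level fat-tree in which a flow between two leaf switches loads one uplink and one downlink through a chosen spine, hypothetical flow tuples, and a least-loaded-spine rule standing in for the Dynamic policy, whose exact algorithm the slides do not specify.

```python
def route_and_measure(flows, num_spines, choose_spine):
    """Assign each flow to a spine switch and return the maximum link load."""
    load = {}
    for src_leaf, dst_leaf, dst_node, volume in flows:
        spine = choose_spine(load, num_spines, src_leaf, dst_leaf, dst_node)
        for link in (("up", src_leaf, spine), ("down", spine, dst_leaf)):
            load[link] = load.get(link, 0) + volume
    return max(load.values())


def dmodk(load, num_spines, src_leaf, dst_leaf, dst_node):
    """Pick the spine from the destination alone, ignoring current load."""
    return dst_node % num_spines


def least_loaded(load, num_spines, src_leaf, dst_leaf, dst_node):
    """Pick the spine whose busier link on this path carries the least traffic."""
    def bottleneck(spine):
        return max(load.get(("up", src_leaf, spine), 0),
                   load.get(("down", spine, dst_leaf), 0))
    return min(range(num_spines), key=bottleneck)


# Hypothetical flows: (source leaf, destination leaf, destination node, volume).
flows = [(0, 1, 4, 100), (0, 1, 6, 80), (0, 1, 5, 30), (0, 1, 7, 20)]

dmodk_max = route_and_measure(flows, 2, dmodk)

# Heavy pairs first: route the largest flows while all links are still empty.
heavy_first = sorted(flows, key=lambda flow: flow[3], reverse=True)
dynamic_max = route_and_measure(heavy_first, 2, least_loaded)
```

In this toy input the two heaviest flows have even-numbered destinations, so D-mod-K sends both through the same spine, while the greedy assignment separates them and lowers the maximum load.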

Slide 51

Slide 51 text

Comparison with Benchmark Results
[Figure: for the NAS CG Benchmark (128 ranks) and the NERSC MILC Benchmark (128 ranks), two bar charts each comparing DmodK and Dynamic: Simulated Maximum Traffic Load (Normalized) and Execution Time [s] on Actual Cluster. Annotations on the charts: 50%, 18%, 23%, 8%]

Slide 52

Slide 52 text

Future Work
Integration into SDN-enhanced MPI framework
‣ To realize online optimization of the interconnect
‣ Can be used for application-aware scheduling and process allocation
Improved fidelity
‣ Currently, false-positive hot-spots may be identified due to rough approximation (loss of time-axis information)
‣ Segment the profile into multiple distinct communication phases and simulate each phase separately
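A toy Python example shows how dropping time-axis information can produce a false-positive hot-spot. The link names, volumes and the two phases are invented for illustration, and the phases are assumed to have equal duration so that volumes can be compared directly.

```python
from collections import Counter


def link_loads(flows):
    """Sum the traffic volume placed on each link."""
    loads = Counter()
    for link, volume in flows:
        loads[link] += volume
    return loads


# Hypothetical (link, volume) records for two non-overlapping phases.
phase1 = [("X", 100), ("Y", 150)]
phase2 = [("X", 100)]

# Whole-run view: link X looks like the hot-spot.
aggregated = link_loads(phase1 + phase2)
hottest_aggregated = max(aggregated, key=aggregated.get)

# Per-phase view: the busiest link at any one time is Y, in the first phase.
per_phase_peaks = [max(link_loads(phase).items(), key=lambda item: item[1])
                   for phase in (phase1, phase2)]
```

Link X carries the most traffic over the whole run, but it never carries more than 100 units in a single phase, whereas link Y carries 150 in the first phase.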

Slide 53

Slide 53 text

Conclusion
‣ SDN-enhanced MPI is an embodiment of future application-aware dynamic interconnects
‣ A tool to rapidly test different interconnect control algorithms is required for the research on SDN-enhanced MPI
‣ Our proposal: PFAnalyzer
- PFProf: Collects the communication pattern from applications using the MPI PERUSE interface
- PFSim: Simulates the interconnect in a holistic manner using the communication pattern acquired with PFProf
‣ Preliminary simulation results agree with benchmark results on actual clusters
‣ Future plan: integrate into SDN-enhanced MPI for online simulation