
PFAnalyzer: A Toolset for Analyzing Application-aware Dynamic Interconnects

Keichi Takahashi
September 05, 2017

My talk at HPCMASPA 2017 (held in conjunction with IEEE Cluster 2017)

Transcript

  1. PFAnalyzer: A Toolset for Analyzing Application-aware Dynamic Interconnects

     Keichi Takahashi, Susumu Date, Dashdavaa Khureltulga, Yoshiyuki Kido, Shinji Shimojo (Cybermedia Center, Osaka University)
  3. Challenges in Future Interconnects

     Over-provisioned designs might not scale well:
     ‣ Interconnects can consume up to 50% of the total power [1] and one third of the total budget of a cluster [2]
     ‣ Properties such as full bisection bandwidth and non-blocking operation may become increasingly difficult to achieve
     ‣ The utilization of the interconnect needs to be improved
     Our proposal is to adopt:
     ‣ Dynamic (adaptive) routing
     ‣ Application-aware network control

     [1] J. Kim et al., "Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks," ISCA, vol. 35, no. 2, pp. 126–137, 2007.
     [2] D. Abts et al., "Energy Proportional Datacenter Networks," ACM SIGARCH Comput. Archit. News, vol. 38, no. 3, p. 338, 2010.
     [3] S. Kamil, L. Oliker, A. Pinar, and J. Shalf, "Communication Requirements and Interconnect Optimization for High-End Scientific Applications," IEEE Trans. Parallel Distrib. Syst., vol. 21, no. 2, pp. 188–202, 2010.
  4. SDN-enhanced MPI Framework

     Our prototype framework integrates SDN into MPI [4,5,6]:
     ‣ Dynamically controls the interconnect based on the communication pattern of MPI applications
     ‣ Uses Software-Defined Networking (SDN) as the key technology to realize dynamic interconnect control (e.g. dynamic routing)
     ‣ Has successfully accelerated several MPI primitives (e.g. MPI_Bcast, MPI_Allreduce)

     [4] K. Takahashi et al., "Performance Evaluation of SDN-enhanced MPI_Allreduce on a Cluster System with Fat-tree Interconnect," HPCS 2014.
     [5] B. Munkhdorj et al., "Design and Implementation of Control Sequence Generator for SDN-enhanced MPI," NDM '15.
     [6] S. Date et al., "SDN-accelerated HPC Infrastructure for Scientific Research," IJIT, vol. 22, no. 01, 2016.
  5. Software-Defined Networking (SDN)

     [Figure: comparison of conventional networking and SDN. In conventional networking, features, the control plane and the data plane are bundled inside each network device. SDN disaggregates them: applications talk to a logically centralized control plane through a northbound API, and the control plane programs the data plane through a southbound API (e.g. OpenFlow).]
  6–9. Basic Idea of SDN-enhanced MPI

     [Figure, built up over four slides: the communication pattern of the MPI processes (ranks 0–3) running on the computing nodes is extracted with a tracer/profiler or through static analysis; resources (paths, bandwidth, etc.) are allocated based on that pattern; and the resulting interconnect control sequence is applied to the interconnect using OpenFlow.]
  10–20. Need for a Holistic Analysis in SDN-enhanced MPI

     [Figure, built up over eleven slides: jobs j1–j4 wait in the job queue; job scheduling decides which job runs next, node selection assigns computing nodes to the job, process mapping places the MPI processes onto the PEs of the selected nodes, and routing chooses paths through the interconnect. All of these decisions interact with the communication pattern of the application, which is why the interconnect has to be analyzed together with the rest of the cluster.]
  21. Q1: Impact of the Communication Pattern

     How does the traffic load in the interconnect change across diverse applications?
     ‣ What kind of application benefits most from SDN-enhanced MPI?
     ‣ What happens when the number of processes scales out?
  22. Q2: Impact of the Cluster Configuration

     How does the traffic load in the interconnect change across clusters with different configurations?
     ‣ How do job scheduling (which job should be executed next?), node selection (which nodes should be allocated to a given job?) and process placement (which node should execute each process?) affect the performance of applications?
     ‣ How does the topology of the interconnect impact the performance?
     ‣ What happens when the size of the cluster scales out?
  23. Requirements for the Interconnect Analysis Toolset

     We aim to develop a toolset that helps answer these questions:
     ‣ How does the traffic load in the interconnect change across diverse applications?
     ‣ How does the traffic load in the interconnect change across clusters with different configurations?
     A simulator-based approach is taken to allow rapid assessment. The requirements for the toolset are summarized as:
     1. Support for application-aware dynamic routing
     2. Support for communication patterns of real-world applications
     3. Support for diverse cluster configurations
  24. Related Work

     ORCS [7]: simulates the traffic load of each link in the interconnect for a given topology, communication pattern and routing algorithm
     INAM² [8]: comprehensive tool to monitor and analyze network activities in an InfiniBand network
     PSINS [9]: trace-driven simulator for predicting the performance of applications on a variety of HPC clusters with different configurations

     [7] T. Schneider et al., "ORCS: An Oblivious Routing Congestion Simulator," Indiana University, Tech. Rep. 675, 2009.
     [8] H. Subramoni et al., "INAM²: InfiniBand Network Analysis and Monitoring with MPI," ISC 2016, pp. 300–320.
     [9] M. M. Tikir et al., "PSINS: An Open Source Event Tracer and Execution Simulator," HPCMP-UGC 2009, pp. 444–449.
  25. Overview of PFAnalyzer

     PFProf (profiler) and PFSim (simulator) constitute PFAnalyzer:
     ‣ PFProf: a fine-grained MPI profiler for observing the network activity caused by MPI function calls (Requirement 2)
     ‣ PFSim: a lightweight simulator of the traffic load in the interconnect, targeting application-aware dynamic interconnects (Requirements 1, 2, 3)

     [Figure: the application is profiled with PFProf, and the resulting profile is fed into PFSim, which produces the simulation result.]
  26. PFProf: Motivation

     Existing profilers do not capture the point-to-point communication underlying collective communication:
     ‣ They are designed to support code tuning and optimization, not network traffic analysis
     ‣ The MPI Profiling Interface (PMPI) only captures individual MPI function calls

     [Figure: as seen from the application, MPI_Bcast looks like rank 0 sending to ranks 1–7 directly, whereas the communication actually performed is a set of point-to-point transfers relayed between the ranks.]
  27–31. PFProf: Implementation

     The MPI Performance Revealing Extension Interface (PERUSE) is utilized:
     ‣ PERUSE exposes internal information of the MPI library
     ‣ It notifies the profiler when a request is posted/completed, a transfer begins/ends, etc.

     [Figure, built up over five slides: PFProf hooks MPI_Init, MPI_Finalize, MPI_Comm_create, MPI_Comm_dup and MPI_Comm_free with PMPI and subscribes to PERUSE events. The application calls MPI functions as usual, and the MPI library notifies PFProf of the corresponding PERUSE events.]
  32. Representation of the Communication Pattern

     The communication pattern of an application is represented by its traffic matrix:
     ‣ Defined as a matrix T whose element T_ij equals the volume of traffic sent from rank i to rank j
     ‣ This representation assumes that the volume of traffic between processes is constant during the execution of a job

     [Figure: traffic matrix (sender rank vs. receiver rank, sent bytes ×10^8) obtained from running the NERSC MILC benchmark with 128 processes.]
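
     As an illustration of how such a matrix can be assembled, the following minimal Python sketch aggregates per-transfer records into T. The record format (src, dst, bytes) and the function name are assumptions made for illustration, not PFProf's actual output format.

```python
import numpy as np

def build_traffic_matrix(transfers, n_ranks):
    """Aggregate point-to-point transfers into a traffic matrix T,
    where T[i, j] is the total number of bytes sent from rank i to rank j."""
    T = np.zeros((n_ranks, n_ranks), dtype=np.int64)
    for src, dst, n_bytes in transfers:
        T[src, dst] += n_bytes
    return T

# Toy usage: rank 0 broadcasts 1 MB to ranks 1-3, relayed along a tree.
transfers = [(0, 1, 1 << 20), (0, 2, 1 << 20), (1, 3, 1 << 20)]
print(build_traffic_matrix(transfers, 4))
```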
  33. PFProf: Overhead Evaluation

     Throughput (osu_bw) and latency (osu_latency) of point-to-point communication were measured with and without PFProf using the OSU Micro-Benchmarks.

     [Figure: absolute and relative throughput and latency plotted against message size (10^1 B to 10^7 B), with and without the profiler.]
  34–37. PFSim: Overview

     [Figure, built up over four slides: PFSim takes as input a simulation scenario, a cluster configuration including the cluster topology, and communication patterns collected with PFProf. Job scheduling, node selection, process placement and routing are provided as plugins. The outputs are the interconnect usage, a performance metric plot and a simulation log. The communication patterns address Requirement 2; the cluster configuration and the scheduling, node selection and process placement plugins address Requirement 3; the routing plugin addresses Requirement 1.]
  38–39. PFSim: Architecture

     [Figure, built up over two slides: PFSim is an event-driven simulator. Events such as Job Submitted, Job Started and Job Finished are taken from the event queue and dispatched to event handlers, which update the simulator state (the job queue, the computing nodes and the interconnect). The behavior of the handlers is customized via plugins.]
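
     To make this event-driven design concrete, here is a minimal Python sketch of such a loop with pluggable handlers. It is not PFSim's actual code; the event names follow the figure, while the class and method names are illustrative.

```python
import heapq
from collections import defaultdict

class Simulator:
    """Tiny discrete-event core: a time-ordered queue plus pluggable handlers."""

    def __init__(self):
        self._queue = []                  # heap of (time, seq, event_name, payload)
        self._handlers = defaultdict(list)
        self._seq = 0                     # tie-breaker for events at the same time

    def subscribe(self, event_name, handler):
        """Plugins register callbacks for the event types they care about."""
        self._handlers[event_name].append(handler)

    def schedule(self, time, event_name, payload=None):
        heapq.heappush(self._queue, (time, self._seq, event_name, payload))
        self._seq += 1

    def run(self):
        while self._queue:
            time, _, event_name, payload = heapq.heappop(self._queue)
            for handler in self._handlers[event_name]:
                handler(self, time, payload)   # handlers may schedule new events

# Usage: a handler that "starts" a job as soon as it is submitted.
sim = Simulator()
sim.subscribe("job_submitted",
              lambda s, t, job: s.schedule(t, "job_started", job))
sim.subscribe("job_started",
              lambda s, t, job: print(f"t={t}: job {job} started"))
sim.schedule(0.0, "job_submitted", "j1")
sim.run()
```

     In this picture, a scheduler, node selector, process mapper or router plugin would simply be an object that subscribes its methods to the relevant events.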
  40–42. PFSim: Example Input & Output

     Cluster configuration (YAML):

       topology: topologies/milk.graphml
       output: output/milk-cg-dmodk
       algorithms:
         scheduler:
           - pfsim.scheduler.FCFSScheduler
         node_selector:
           - pfsim.node_selector.LinearNodeSelector
           - pfsim.node_selector.RandomNodeSelector
         process_mapper:
           - pfsim.process_mapper.LinearProcessMapper
           - pfsim.process_mapper.CyclicProcessMapper
         router:
           - pfsim.router.DmodKRouter
           - pfsim.router.GreedyRouter
           - pfsim.router.GreedyRouter2
       jobs:
         - submit:
             distribution: pfsim.math.ExponentialDistribution
             params:
               lambd: 0.1
           trace: traces/cg-c-128.tar.gz

     [Figure, built up over three slides: the interconnect utilization, written as GraphML and visualized with Cytoscape; links with high traffic load and links with less traffic load are highlighted.]
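
     As a sketch of how such a configuration could be consumed, the snippet below loads the YAML with PyYAML and resolves the dotted plugin paths with importlib, then enumerates the algorithm combinations to simulate. This is illustrative rather than PFSim's actual loader, and the file name is hypothetical.

```python
import importlib
from itertools import product

import yaml  # PyYAML

def load_class(dotted_path):
    """Resolve a dotted path such as 'pfsim.router.DmodKRouter' to a class object."""
    module_name, _, class_name = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_name), class_name)

with open("milk-cg-dmodk.yml") as f:          # hypothetical file name
    config = yaml.safe_load(f)

algos = config["algorithms"]
# One simulation run per combination of node selector, process mapper and router.
for selector, mapper, router in product(algos["node_selector"],
                                        algos["process_mapper"],
                                        algos["router"]):
    print("would simulate:", selector, mapper, router)
    # e.g. router_cls = load_class(router), then router_cls(...) to instantiate it,
    # provided the corresponding pfsim modules are importable.
```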
  43. Simulated Cluster

     ‣ Modeled after a cluster installed at our institution
     ‣ 20 computing nodes (160 cores)
     ‣ 2-level fat-tree topology (oversubscription ratio = 2.5)
     ‣ Switches (NEC PF5240) support OpenFlow 1.0 (and 1.3)
     ‣ NAS CG and NERSC MILC are used as workloads

     [Figure: the 2-level fat-tree, consisting of spine switches, leaf switches and computing nodes.]
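
     To illustrate how a topology like this could be described as a GraphML input, here is a small sketch using networkx. The attribute names and the exact switch counts are assumptions for illustration, not PFSim's actual input schema; 5 downlinks per 2 uplinks reproduce the 2.5 oversubscription ratio mentioned above.

```python
import networkx as nx

def two_level_fat_tree(n_spines=2, n_leaves=4, hosts_per_leaf=5):
    """Toy 2-level fat-tree: every leaf switch connects to every spine switch."""
    g = nx.Graph()
    for s in range(n_spines):
        g.add_node(f"spine{s}", kind="switch")
    for i in range(n_leaves):
        leaf = f"leaf{i}"
        g.add_node(leaf, kind="switch")
        for s in range(n_spines):
            g.add_edge(leaf, f"spine{s}", capacity=10e9)   # uplinks
        for h in range(hosts_per_leaf):
            host = f"node{i * hosts_per_leaf + h}"
            g.add_node(host, kind="host")
            g.add_edge(leaf, host, capacity=10e9)          # downlinks
    return g

g = two_level_fat_tree()                 # 20 hosts behind 4 leaves and 2 spines
nx.write_graphml(g, "fat_tree.graphml")
```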
  44. Simulated Configurations

     ‣ Node selection: Linear or Random
     ‣ Process placement: Linear (block) or Cyclic; e.g. with three nodes of two PEs each, linear places ranks as (0,1) (2,3) (4,5), while cyclic places them as (0,3) (1,4) (2,5)
     ‣ Routing: D-mod-K (path selected solely based on the destination of the flow) or Dynamic (paths allocated based on the communication pattern, heaviest pairs first)
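
     The two routing policies can be sketched as follows in Python. This is a simplified illustration under assumed data structures (a traffic matrix as a dict and precomputed candidate paths per rank pair), not the code of PFSim's DmodKRouter or GreedyRouter.

```python
from collections import defaultdict

def dmodk_route(candidate_paths, dst_rank):
    """D-mod-K: the path depends only on the destination, e.g. the uplink
    (spine) index is chosen as dst_rank mod (number of candidate paths)."""
    return candidate_paths[dst_rank % len(candidate_paths)]

def dynamic_route(traffic_matrix, candidates):
    """Dynamic routing: visit rank pairs from heaviest to lightest traffic and
    give each pair the candidate path whose busiest link is least loaded so far."""
    link_load = defaultdict(float)   # accumulated bytes per link
    chosen = {}
    pairs = sorted(((vol, src, dst)
                    for (src, dst), vol in traffic_matrix.items() if vol > 0),
                   reverse=True)
    for vol, src, dst in pairs:
        # candidates[(src, dst)] is a list of paths; a path is a list of links.
        best = min(candidates[(src, dst)],
                   key=lambda path: max(link_load[link] for link in path))
        for link in best:
            link_load[link] += vol
        chosen[(src, dst)] = best
    return chosen, link_load
```

     The maximum of link_load over all links corresponds to the performance indicator plotted on the next slide (the maximum traffic load over all links).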
  45. Simulation Results

     The maximum traffic load over all links is plotted as a performance indicator.

     [Figure: normalized maximum traffic for the eight configurations (Linear/Random node selection × Block/Cyclic process placement × D-mod-K/Dynamic routing), for the NAS CG benchmark (128 ranks) and the NERSC MILC benchmark (128 ranks).]
  46. Comparison with Benchmark Results

     [Figure: simulated maximum traffic load (normalized) and execution time measured on the actual cluster, for D-mod-K vs. Dynamic routing, for the NAS CG benchmark (128 ranks) and the NERSC MILC benchmark (128 ranks); the annotated reductions are 50%, 18%, 23% and 8%.]
  47. Future Work

     Integration into the SDN-enhanced MPI framework:
     ‣ To realize online optimization of the interconnect
     ‣ Can also be used for application-aware scheduling and process allocation
     Improved fidelity:
     ‣ Currently, false-positive hot spots may be identified because of a coarse approximation (information along the time axis is dropped)
     ‣ Segment the profile into multiple distinct communication phases and simulate each phase separately
  48. Conclusion

     ‣ SDN-enhanced MPI is an embodiment of future application-aware dynamic interconnects
     ‣ Research on SDN-enhanced MPI requires a tool for rapidly testing different interconnect control algorithms
     ‣ Our proposal, PFAnalyzer:
       - PFProf: collects the communication pattern of an application using the MPI PERUSE interface
       - PFSim: simulates the interconnect in a holistic manner using the communication pattern acquired with PFProf
     ‣ Preliminary simulation results are consistent with benchmark results on an actual cluster
     ‣ Future plan: integrate PFAnalyzer into SDN-enhanced MPI for online simulation