Towards Realizing a Dynamic and MPI Application-aware Interconnect with SDN

My talk at the 26th Workshop on Sustained Simulation Performance (WSSP26)

Keichi Takahashi

October 11, 2017

Transcript

  1. The 26th Workshop on Sustained Simulation Performance (WSSP26)
    Towards Realizing a Dynamic and
    MPI Application-aware Interconnect with SDN
    Keichi Takahashi
    Cybermedia Center, Osaka University

  2. Agenda
    Introduction
    ‣ Why do we need an MPI Application-aware Interconnect?
    SDN-accelerated MPI Primitives
    ‣ Is our idea feasible at all?
    A Coordination Mechanism of Computation and Communication
    ‣ How do we reconfigure the interconnect in accordance with the execution of applications?
    A Toolset for Analyzing Application-aware Dynamic Interconnects
    ‣ How will the proposed architecture perform with various types of applications and clusters?

  4. Challenges in Future Interconnects
    Over-provisioned designs might not scale well
    ‣ Interconnects can consume up to 50% of the total power [1] and 1/3 of the total budget of a cluster [2]
    ‣ Properties such as full bisection bandwidth and non-blocking may become increasingly difficult to achieve
    ‣ Need to improve the utilization of the interconnect
    Our proposal is to adopt:
    ‣ Dynamic (adaptive) routing
    ‣ Application-aware network control
    [1] J. Kim et al., “Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks,” ISCA, vol. 35, no. 2, pp. 126–137, 2007.
    [2] D. Abts et al., “Energy Proportional Datacenter Networks,” ACM SIGARCH Comput. Archit. News, vol. 38, no. 3, p. 338, 2010.

  10. Source of Inefficiency in Current Interconnect
    Mismatch between the communication pattern of applications and the topology of the interconnect [3]
    ‣ Lower utilization
    ‣ Higher congestion
    ‣ Lower communication performance
    [3] S. Kamil et al., “Communication Requirements and Interconnect Optimization for High-End Scientific Applications,” IEEE Trans. Parallel Distrib. Syst., 2010.

  11. SDN-enhanced MPI Framework
    Our prototype framework that integrates SDN into MPI
    ‣ Dynamically controls the interconnect based on the communication pattern of MPI applications
    ‣ Uses Software-Defined Networking (SDN) as a key technology to realize dynamic interconnect control (e.g. dynamic routing)
    ‣ Successfully accelerated several MPI primitives (e.g. MPI_Bcast and MPI_Allreduce)

  14. SDN | Software-Defined Networking
    Conventional Networking
    ‣ Feature, control plane and data plane are bundled together
    Software-Defined Networking
    ‣ Control plane and data plane are disaggregated
    ‣ Apps (features) use the control plane through a northbound API
    ‣ The control plane drives the data plane through a southbound API (e.g. OpenFlow)

  15. OpenFlow | Standard implementation of SDN
    Flow Table (collection of flow entries), held by each switch in the data plane
    Src MAC       Dst MAC       …    Instructions
    aa:aa:aa:…    ff:ff:ff:…         Flood
    bb:bb:bb:…    aa:aa:aa:…         Output to Port X
    aa:aa:aa:…    bb:bb:bb:…         Set Dst IP to Y,…
    OpenFlow messages exchanged between the OpenFlow controller (control plane) and the data plane
    ‣ Add/Modify/Delete flow entries
    ‣ Inject packets into data plane
    ‣ Notify flow entry misses
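The flow table above can be pictured as a list of match/instruction pairs. The following is a minimal sketch of a lookup, not OpenFlow's actual data structures: the `FlowEntry` and `FlowTable` names, the `priority` field and the full MAC addresses are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FlowEntry:
    match: dict          # header field -> required value; absent fields act as wildcards
    instructions: str    # what the switch does with a matching packet
    priority: int = 0    # higher priority wins when several entries match


class FlowTable:
    def __init__(self):
        self.entries = []

    def add(self, entry):
        self.entries.append(entry)

    def lookup(self, packet):
        """Return the instructions of the best matching entry, or None on a miss."""
        hits = [e for e in self.entries
                if all(packet.get(k) == v for k, v in e.match.items())]
        if not hits:
            return None  # a miss is what would be reported to the controller
        return max(hits, key=lambda e: e.priority).instructions


# Hypothetical entries modeled on the rows shown on the slide.
table = FlowTable()
table.add(FlowEntry({"dst_mac": "ff:ff:ff:ff:ff:ff"}, "Flood"))
table.add(FlowEntry({"src_mac": "bb:bb:bb:bb:bb:bb",
                     "dst_mac": "aa:aa:aa:aa:aa:aa"}, "Output to Port X", priority=10))
```

A controller in this picture would call `add` when it installs a flow entry, and would be consulted whenever `lookup` returns `None`.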

  19. Basic Idea of SDN-enhanced MPI
    [Figure: computing nodes 0 to 3 attached to the interconnect, next to the communication pattern among ranks 0 to 3]
    ‣ Communication Pattern: extract via tracer/profiler or static analysis
    ‣ Interconnect Control Seq.: resource (path, bandwidth, etc.) allocation
    ‣ Interconnect: proactive reconfiguration using OpenFlow

  20. Related Work
    SDN-enhanced InfiniBand (Lee et al., SC16)
    ‣ Enhancement to InfiniBand that allows dynamic, per-flow network control
    Conditional OpenFlow (Benito et al., HiPC 2015)
    ‣ Enhanced OpenFlow that allows users to add flow entries that are activated when an Ethernet Pause (IEEE 802.3x) occurs
    ‣ Primary goal is to implement non-minimal adaptive routing on Ethernet
    Quantized Congestion Notification Switch (Benito et al., HiPINEB 2017)
    ‣ Another enhancement to OpenFlow that uses received QCNs (IEEE 802.1Qau Quantized Congestion Notification) to probabilistically determine which path to select

  21. Agenda
    Introduction
    ‣ Why do we need an MPI Application-aware Interconnect?
    SDN-accelerated MPI Primitives
    ‣ Can we accelerate MPI primitives based on our idea?
    A Coordination Mechanism of Computation and Communication
    ‣ How do we reconfigure the interconnect in accordance with the execution of applications?
    A Toolset for Analyzing Application-aware Dynamic Interconnects
    ‣ How will our idea perform on various types of applications and clusters?

  22. SDN-accelerated MPI_Bcast
    MPI broadcast leveraging hardware multicast
    ‣ Multicast rules are dynamically installed using OpenFlow
    ‣ Considers background traffic from other jobs to construct an optimal delivery tree
    [Figure: the OpenFlow controller installs a delivery tree for a broadcast from P0 over switches SW1 to SW6, which connect processes P0 to P3]
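To make the idea of a load-aware delivery tree concrete, here is a rough sketch that unions least-cost paths from the broadcast root to each receiver, where the cost of a link is one hop plus its background load. This is a simple heuristic chosen for illustration, not the tree construction used in the talk; the function name, the cost model and the example wiring are all assumptions.

```python
import heapq


def delivery_tree(links, root, receivers):
    """links: {(u, v): background_load} for undirected links.
    Returns the set of directed tree edges (parent, child) reaching all receivers."""
    adj = {}
    for (u, v), load in links.items():
        adj.setdefault(u, []).append((v, load))
        adj.setdefault(v, []).append((u, load))

    # Dijkstra from the root; each link costs one hop plus its background load.
    dist, parent = {root: 0.0}, {}
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, load in adj.get(u, []):
            nd = d + 1.0 + load
            if nd < dist.get(v, float("inf")):
                dist[v], parent[v] = nd, u
                heapq.heappush(heap, (nd, v))

    # Walk back from every receiver to the root and collect the edges used.
    tree = set()
    for receiver in receivers:
        node = receiver
        while node != root:
            tree.add((parent[node], node))
            node = parent[node]
    return tree


# Hypothetical wiring: SW3 and SW4 are leaf switches, SW1 and SW2 are spines,
# and another job is loading the SW3-SW2 link.
example_links = {
    ("P0", "SW3"): 0.0, ("SW4", "P1"): 0.0,
    ("SW3", "SW1"): 0.0, ("SW3", "SW2"): 5.0,
    ("SW4", "SW1"): 0.0, ("SW4", "SW2"): 0.0,
}
example_tree = delivery_tree(example_links, "P0", ["P1"])
```

Each switch that appears as a parent in the returned edges would need one multicast rule that outputs the packet toward all of its children.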

  23. Agenda
    Introduction
    ‣ Why do we need an MPI Application-aware Interconnect?
    SDN-accelerated MPI Primitives
    ‣ Is our idea feasible at all?
    A Coordination Mechanism of Computation and Communication
    ‣ How do we reconfigure the interconnect in accordance with the execution of applications?
    A Toolset for Analyzing Application-aware Dynamic Interconnects
    ‣ How will our idea perform on various types of applications and clusters?

  25. Synchronizing Computation and Communication
    Execution of the application produces a time-varying communication pattern
        #include <mpi.h>
        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            MPI_Bcast(buf, count, …);
            /* … */
            MPI_Allreduce(buf, count, …);
            MPI_Finalize();
        }
    Reconfiguration of the interconnect updates the flow table
        Src MAC       Dst MAC       …    Instructions
        aa:aa:aa:…    ff:ff:ff:…         Flood
        bb:bb:bb:…    aa:aa:aa:…         Output to Port X
        aa:aa:aa:…    bb:bb:bb:…         Set Dst IP to Y,…
    How do we synchronize these two?

  26. UnisonFlow
    Our Idea: Embed encoded MPI envelope into each packet
    ‣ MPI Envelope: Rank, Primitive Type, Communicator, etc.
    ‣ Current implementation uses virtual MAC addresses to represent tags
    ‣ The MPI library (user space) hands the tag to a custom kernel module (kernel space) via ioctl, and the tag is embedded into each MPI packet
    ‣ Packet flow is controlled based on the tag value
        Tag    Instructions
        A      Output to port X
        B      Output to port Y
        …      …
    [5] Keichi Takahashi, "Concept and Design of SDN-enhanced MPI Framework", EWSDN 2015
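As an illustration of how an MPI envelope could be packed into a virtual MAC address, the sketch below uses a made-up layout: a fixed first octet of 0x02 (locally administered, unicast), 16 bits of rank, 8 bits of primitive type and 16 bits of communicator ID. The field widths, the numeric codes and the function names are assumptions, not UnisonFlow's actual encoding.

```python
def encode_tag(rank, primitive, communicator):
    """Pack an MPI envelope into a 48-bit MAC address string (hypothetical layout)."""
    if not (0 <= rank < 2**16 and 0 <= primitive < 2**8 and 0 <= communicator < 2**16):
        raise ValueError("field out of range for the assumed layout")
    octets = [
        0x02,                          # locally administered, unicast
        rank >> 8, rank & 0xFF,        # 16-bit rank
        primitive,                     # 8-bit primitive type code
        communicator >> 8, communicator & 0xFF,  # 16-bit communicator ID
    ]
    return ":".join(f"{octet:02x}" for octet in octets)


def decode_tag(mac):
    """Recover (rank, primitive, communicator) from an encoded MAC address."""
    octets = [int(part, 16) for part in mac.split(":")]
    return (octets[1] << 8) | octets[2], octets[3], (octets[4] << 8) | octets[5]
```

With such an encoding, a switch can match on the destination MAC field alone, so ordinary OpenFlow match rules are enough to steer packets by tag.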

  27. Agenda
    Introduction
    ‣ Why do we need an MPI Application-aware Interconnect?
    SDN-accelerated MPI Primitives
    ‣ Is our idea feasible at all?
    A Coordination Mechanism of Computation and Communication
    ‣ How do we reconfigure the interconnect in accordance with the execution of applications?
    A Toolset for Analyzing Application-aware Dynamic Interconnects
    ‣ How will our idea perform on various types of applications and clusters?

  42. Need for a Holistic Analysis in SDN-enhanced MPI
    [Figure: jobs j1 to j4 in the job queue; the communication pattern among ranks 0 to 5 of a job; groups of PEs attached to the interconnect]
    Stages shown in the figure
    ‣ Job Scheduling
    ‣ Node Selection
    ‣ Process Mapping
    ‣ Routing

  43. Q1: Impact of the Communication Pattern
    How does the traffic load in the interconnect change for diverse applications?
    ‣ What kind of application benefits most from SDN-enhanced MPI?
    ‣ What happens if the number of processes scales out?

  44. Q2: Impact of the Cluster Configuration
    How does the traffic load in the interconnect change under diverse clusters with different configurations?
    ‣ How do job scheduling, node selection and process mapping affect the performance of applications?
    ‣ How does the topology of the interconnect impact the performance?
    ‣ What happens if the size of the cluster scales out?
    Job Scheduling (i.e. which job should be executed next?)
    Node Selection (i.e. which nodes should be allocated to a given job?)
    Process Placement (i.e. which node should execute a process?)

  45. Requirements for the Interconnect Analysis Toolset
    We aim to develop a toolset to help answer these questions
    ‣ How does the traffic load in the interconnect change for diverse applications?
    ‣ How does the traffic load in the interconnect change under diverse clusters with different configurations?
    A simulator-based approach is taken to allow rapid assessment
    ‣ Requirements for the toolset are summarized as:
    1. Support for application-aware dynamic routing
    2. Support for communication patterns of real-world applications
    3. Support for diverse cluster configurations

  46. Overview of PFAnalyzer
    PFProf (profiler) and PFSim (simulator) constitute PFAnalyzer [6]
    ‣ PFProf
    - Fine-grained MPI profiler for observing network activity caused by MPI function calls (Requirement 2)
    ‣ PFSim
    - Lightweight simulator to simulate traffic load in the interconnect, targeting application-aware dynamic interconnects (Requirements 1, 2 and 3)
    [Figure: Application → PFProf → Profile → PFSim → Result]
    [6] Keichi Takahashi et al., "A Toolset for Analyzing Application-aware Dynamic Interconnects", HPCMASPA 2017

  47. PFProf: Motivation
    Existing profilers do not capture the underlying pt2pt communication of collective communication
    ‣ They are designed to support code tuning and optimization, not network traffic analysis.
    ‣ MPI Profiling Interface (PMPI) only captures individual MPI function calls.
    [Figure, left: behavior of MPI_Bcast as seen from applications, with rank 0 reaching ranks 1 to 7]
    [Figure, right: actual communication performed, with pt2pt messages relayed among ranks 0 to 7]
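The gap between the two views can be made concrete with a small sketch. It assumes the broadcast is implemented as a binomial tree, which is one common choice in MPI libraries; the algorithm a given library actually picks depends on message size, communicator size and tuning, so this is an illustration rather than a description of what PFProf observed.

```python
def binomial_bcast_edges(size, root=0):
    """Return the (sender, receiver) pt2pt messages of a binomial-tree broadcast."""
    edges = []
    for rank in range(size):
        rel = (rank - root) % size  # rank relative to the root
        # Find the lowest set bit of rel; the root (rel == 0) runs past size.
        mask = 1
        while mask < size:
            if rel & mask:
                break
            mask <<= 1
        # Forward to children at distances mask/2, mask/4, ..., 1.
        mask >>= 1
        while mask > 0:
            if rel + mask < size:
                edges.append((rank, (rel + mask + root) % size))
            mask >>= 1
    return edges


# What the application sees: one MPI_Bcast call rooted at rank 0.
# What the network carries for 8 ranks under this assumption:
edges_for_8_ranks = binomial_bcast_edges(8)
```

A PMPI wrapper records only the single `MPI_Bcast` call per rank, whereas a traffic analysis needs the individual sender/receiver pairs listed in `edges_for_8_ranks`.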

  52. PFProf: Implementation
    MPI Performance Revealing Extension Interface (PERUSE) is utilized
    ‣ PERUSE exposes internal information of the MPI library
    ‣ Notifies you when a request is posted/completed, a transfer begins/ends, etc.
    Interaction between the MPI application, PFProf and the MPI library
    ‣ The MPI application calls MPI functions
    ‣ PFProf hooks MPI functions with PMPI: MPI_Init, MPI_Finalize, MPI_Comm_create, MPI_Comm_dup, MPI_Comm_free
    ‣ PFProf subscribes to PERUSE events
    ‣ The MPI library notifies PERUSE events

  53. Representation of Communication Pattern
    The communication pattern of an application is represented using its traffic matrix
    ‣ Defined as a matrix T whose element T_ij is equal to the volume of traffic sent from rank i to rank j
    ‣ Implicitly assumes that the volume of traffic between processes is constant during the execution of a job
    [Figure: heat map of sent bytes (scale ×10^8) by sender rank and receiver rank; an example obtained from running the NERSC MILC benchmark with 128 processes]
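The definition of the traffic matrix translates directly into code. The sketch below accumulates a list of observed messages into T; the `(sender, receiver, bytes)` record format and the function name are assumptions for illustration, not PFProf's actual output format.

```python
def traffic_matrix(n_ranks, messages):
    """Build T where T[i][j] is the total number of bytes sent from rank i to rank j.

    messages: iterable of (sender_rank, receiver_rank, n_bytes) records.
    """
    T = [[0] * n_ranks for _ in range(n_ranks)]
    for sender, receiver, n_bytes in messages:
        T[sender][receiver] += n_bytes
    return T


# Hypothetical trace of four messages among three ranks.
example_T = traffic_matrix(3, [(0, 1, 100), (0, 1, 50), (1, 2, 10), (2, 0, 7)])
```

Because messages are summed over the whole run, any variation of the pattern over time is lost, which is the constancy assumption noted on the slide.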

  54. PFProf: Overhead Evaluation
    Measured throughput and latency of pt2pt communication with and without PFProf using the OSU Microbenchmark
    [Figure, left: Throughput (osu_bw); throughput [MB/s] and relative throughput vs. message size [B], w/o profiler and w/ profiler]
    [Figure, right: Latency (osu_latency); latency [µs] and relative latency vs. message size [B], w/o profiler and w/ profiler]

  58. PFSim: Overview
    Input
    ‣ Simulation Scenario
    ‣ Cluster Configuration
    ‣ Cluster Topology
    ‣ Communication Patterns (from PFProf)
    Plugin
    ‣ Scheduling
    ‣ Node Selection
    ‣ Process Placement
    ‣ Routing
    Output
    ‣ Interconnect Usage
    ‣ Performance Metric Plot
    ‣ Simulation Log
    [Figure annotations: parts of the diagram are labeled "For Requirement 1", "For Requirement 2" and "For Requirement 3" (twice)]

  60. PFSim: Architecture
    ‣ Events are kept in an event queue and dispatched to event handlers
    ‣ Event handlers: Job Submitted, Job Started, Job Finished
    ‣ Event handlers update the simulator state (job queue with jobs j1 to j4, computing nodes, interconnect)
    ‣ Event handlers are customized via plugins
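The event-queue architecture can be sketched as a small discrete-event loop. This is a generic illustration under simplifying assumptions, not PFSim's code: the `Simulator` class, the event names and the handlers are hypothetical, and jobs start the moment they are submitted instead of waiting for free nodes.

```python
import heapq
import itertools


class Simulator:
    def __init__(self):
        self.queue = []                   # event queue ordered by time
        self.counter = itertools.count()  # tie-breaker for events at the same time
        self.handlers = {}                # event kind -> list of handlers
        self.now = 0.0
        self.log = []                     # stands in for the simulator state

    def on(self, kind, handler):
        self.handlers.setdefault(kind, []).append(handler)

    def schedule(self, time, kind, payload):
        heapq.heappush(self.queue, (time, next(self.counter), kind, payload))

    def run(self):
        while self.queue:
            self.now, _, kind, payload = heapq.heappop(self.queue)
            for handler in self.handlers.get(kind, []):
                handler(self, payload)    # dispatch the event


def job_submitted(sim, job):
    sim.log.append((sim.now, "submitted", job["name"]))
    sim.schedule(sim.now, "job_started", job)  # simplification: start at once


def job_started(sim, job):
    sim.log.append((sim.now, "started", job["name"]))
    sim.schedule(sim.now + job["duration"], "job_finished", job)


def job_finished(sim, job):
    sim.log.append((sim.now, "finished", job["name"]))


sim = Simulator()
sim.on("job_submitted", job_submitted)
sim.on("job_started", job_started)
sim.on("job_finished", job_finished)
sim.schedule(0.0, "job_submitted", {"name": "j1", "duration": 5.0})
sim.schedule(1.0, "job_submitted", {"name": "j2", "duration": 2.0})
sim.run()
```

Swapping in different handlers plays the role that plugins play on the slide: the loop stays the same while scheduling, placement or routing policies change.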

  63. PFSim: Example Input & Output
    Cluster Configuration (YAML)
        topology: topologies/milk.graphml
        output: output/milk-cg-dmodk
        algorithms:
          scheduler:
            - pfsim.scheduler.FCFSScheduler
          node_selector:
            - pfsim.node_selector.LinearNodeSelector
            - pfsim.node_selector.RandomNodeSelector
          process_mapper:
            - pfsim.process_mapper.LinearProcessMapper
            - pfsim.process_mapper.CyclicProcessMapper
          router:
            - pfsim.router.DmodKRouter
            - pfsim.router.GreedyRouter
            - pfsim.router.GreedyRouter2
        jobs:
          - submit:
              distribution: pfsim.math.ExponentialDistribution
              params:
                lambd: 0.1
            trace: traces/cg-c-128.tar.gz
    Interconnect Utilization (Output GraphML visualized with Cytoscape)
    [Figure: interconnect graph annotated with regions of high traffic load and less traffic load]
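The `submit` section of the configuration names an exponential distribution with `lambd: 0.1`. One plausible reading, assumed here and not confirmed by the slides, is that the time between consecutive job submissions is drawn from that distribution. The sketch below generates submission times under that assumption with Python's standard `random.expovariate`, whose parameter is also called `lambd`; the `submit_times` helper is hypothetical.

```python
import random


def submit_times(n_jobs, lambd, seed=0):
    """Draw n_jobs submission times with exponential inter-arrival times.

    lambd is the rate, so the mean gap between submissions is 1 / lambd.
    """
    rng = random.Random(seed)  # seeded for reproducible scenarios
    now, times = 0.0, []
    for _ in range(n_jobs):
        now += rng.expovariate(lambd)
        times.append(now)
    return times


example_times = submit_times(n_jobs=5, lambd=0.1)
```

With `lambd = 0.1` the mean gap is 10 time units; the unit itself is not stated on the slide.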

  64. Simulated Configurations
    Node Selection
    ‣ Linear
    ‣ Random
    Process Placement
    ‣ Linear (0 1 2 3 4 5)
    ‣ Cyclic (0 3 1 4 2 5)
    Routing
    ‣ D-mod-K: path selected solely based on the destination of flow
    ‣ Dynamic: path allocated based on communication pattern (heavy pairs first)
    [Figure: traffic matrix heat map of sent bytes (scale ×10^8) by sender rank and receiver rank]
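The placement and routing options listed above can be sketched in a few lines. Everything here is an illustrative assumption: the two-level leaf/spine topology, the function names, and the reading of "0 3 1 4 2 5" as ranks listed node by node. The greedy routine follows the "heavy pairs first" description but is not the GreedyRouter from PFSim.

```python
from collections import defaultdict


def linear_placement(n_ranks, n_nodes):
    """Block placement: consecutive ranks fill one node before the next."""
    per_node = -(-n_ranks // n_nodes)  # ceiling division
    return [rank // per_node for rank in range(n_ranks)]


def cyclic_placement(n_ranks, n_nodes):
    """Round-robin placement: rank r runs on node r mod n_nodes."""
    return [rank % n_nodes for rank in range(n_ranks)]


def ranks_in_node_order(placement):
    """List ranks node by node (assumed reading of the sequences on the slide)."""
    return sorted(range(len(placement)), key=lambda rank: (placement[rank], rank))


def max_link_load(traffic, leaf_of, n_spines, dynamic):
    """Route node-to-node traffic over a leaf/spine network; return the maximum link load.

    traffic: {(src_node, dst_node): bytes}; leaf_of[node] is the node's leaf switch.
    """
    load = defaultdict(float)
    # Heavy pairs first; the order only matters for the dynamic router.
    for (src, dst), volume in sorted(traffic.items(), key=lambda kv: -kv[1]):
        src_leaf, dst_leaf = leaf_of[src], leaf_of[dst]
        if src_leaf == dst_leaf:
            continue  # traffic stays inside one leaf switch

        def links(spine):
            return (("up", src_leaf, spine), ("down", spine, dst_leaf))

        if dynamic:
            # Pick the spine whose busiest link on this path is least loaded.
            spine = min(range(n_spines), key=lambda s: max(load[l] for l in links(s)))
        else:
            spine = dst % n_spines  # D-mod-K: depends on the destination only
        for link in links(spine):
            load[link] += volume
    return max(load.values(), default=0.0)


# Hypothetical cluster: 8 nodes, two leaf switches (nodes 0-3 and 4-7), two spines.
leaf_of = [0, 0, 0, 0, 1, 1, 1, 1]
traffic = {(0, 4): 10.0, (1, 6): 10.0}
dmodk_load = max_link_load(traffic, leaf_of, n_spines=2, dynamic=False)
dynamic_load = max_link_load(traffic, leaf_of, n_spines=2, dynamic=True)
```

In this toy case both destinations are even, so D-mod-K sends both flows through the same spine, while the greedy router spreads them across the two spines.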

  65. Simulation Results
    Maximum traffic load on all links is plotted as a performance indicator
    [Figure: two bar charts of maximum traffic (normalized), one for the NAS CG Benchmark (128 ranks) and one for the NERSC MILC Benchmark (128 ranks); each compares D-mod-K and Dynamic routing for every combination of node selection (Linear, Random) and process placement (Block, Cyclic)]

  66. Further Challenges
    Simulation-based study of large-scale clusters with different topologies
    ‣ Currently, our institution owns only a small-scale experimental cluster equipped with SDN
    Integrate interconnect controller with scheduler and MPI runtime
    ‣ To support multiple jobs running in parallel
    ‣ To investigate the effect of node allocation and process placement
    Better application-aware routing algorithms
    ‣ Currently, a simple greedy-like algorithm is used
    ‣ How about optimization or machine learning?

  67. Summary
    Current static and over-provisioned interconnects might not scale well
    ‣ SDN allows us to build more dynamic and application-aware interconnects
    ‣ Such an architecture could improve the utilization of the interconnect and communication performance
    Our achievements so far include:
    ‣ SDN-accelerated MPI primitives such as Bcast and Allreduce
    ‣ UnisonFlow, a coordination mechanism of computation and communication
    ‣ PFAnalyzer, a toolset for analyzing application-aware dynamic interconnects