Running Accurate, Scalable, and Reproducible Simulations of Distributed Systems with WRENCH

WRENCH
November 18, 2018

Scientific workflows are used routinely in numerous scientific domains, and Workflow Management Systems (WMSs) have been developed to orchestrate and optimize workflow executions on distributed platforms. WMSs are complex software systems that interact with complex software infrastructures. Most WMS research and development activities rely on empirical experiments conducted with full-fledged software stacks on actual hardware platforms. Such experiments, however, are limited to the hardware and software infrastructures at hand and can be labor- and/or time-intensive. As a result, relying solely on real-world experiments impedes WMS research and development. An alternative is to conduct experiments in simulation.
In this work we present WRENCH, a WMS simulation framework, whose objectives are (i) accurate and scalable simulations; and (ii) easy simulation software development. WRENCH achieves its first objective by building on the SimGrid framework. While SimGrid is recognized for the accuracy and scalability of its simulation models, it only provides low-level simulation abstractions, and thus large software development efforts are required when implementing simulators of complex systems. WRENCH achieves its second objective by providing high-level and directly re-usable simulation abstractions on top of SimGrid. After describing and giving rationales for WRENCH’s software architecture and APIs, we present a case study in which we apply WRENCH to simulate the Pegasus production WMS. We report on ease of implementation, simulation accuracy, and simulation scalability so as to determine to what extent WRENCH achieves its two objectives. We also draw both qualitative and quantitative comparisons with a previously proposed workflow simulator.

Transcript

  1. Our Team

    WRENCH is funded by the National Science Foundation (NSF) under grants number 1642369 and 1642335, and by the National Center for Scientific Research (CNRS) under grant number PICS07239.
    http://wrench-project.org
  2. Workflow + WMS + CI = Complex Systems

    Scientific workflows are complex applications that execute on complex software stacks on complex hardware platforms.
    To advance the state of the art, one must thus try to study and understand complex systems.
    Theory only gets you so far, as its assumptions break down quickly.
    So most research in this field is experiment-driven.
    Typical approach: take a real workflow, take a real software infrastructure installed on a real hardware infrastructure, run the workflow, get results.
  3. Real-World Experiments are Expensive

    The “setup and configuration” work can be non-trivial, because you need full-fledged software stacks.
    Experiments are time-consuming: long-running, faulty executions, weird non-representative outliers due to misconfigurations, many required executions due to repeatability problems, etc.
    Experiments cost “money”, directly or indirectly (energy, carbon footprint).
    As a result, research papers are often deemed to “not have enough experimental results”.
    [Personal Opinion] Most authors feel that they have spent way too much time setting up and running experiments.
  4. Real-World Experiments are Limited

    One is limited to particular platform configurations (and sub-configurations). How can “what if?” scenarios be explored? How can generality be claimed?
    One is limited by specifics of the software infrastructure that impose constraints on workflow executions. Modifying complex software stacks (often written by others) just to test out ideas is not feasible.
    In the end, the scope of real-world experiments is limited, which impedes progress / discovery.
  5. Simulation

    When one works in an experimental field in which experiments are problematic, one resorts to simulation. Physicists understood this decades ago :)
    In some fields of Computer Science, simulation is a standard research and development methodology, e.g., Networking, Computer Architecture.
    Several simulators and simulation frameworks have been developed for parallel and distributed computing, some of them explicitly for workflows.
  6. The SimGrid Framework

    SimGrid is a research project: development of simulation models of hardware/software stacks. Models are accurate (validated/invalidated) and scalable (low computational complexity, low memory footprint).
    SimGrid is open-source, usable software. It provides different APIs for a range of simulation needs, e.g.:
    S4U: General simulation of Concurrent Sequential Processes
    SMPI: Fine-grained simulation of MPI applications
    SimGrid is a versatile scientific instrument, used for (combinations of) Grid, HPC, Peer-to-Peer, Cloud, Fog simulation projects.
    First developed in 2000; latest release: v3.21 (October 2018).
    http://simgrid.org
  7. Software Sustainability

    SimGrid is well-funded, active as a research project, and widely used as a simulation tool for research, development, and education: 44 journal articles, 146 conference articles, 17 PhD theses.
  8. SimGrid’s Philosophy

    SimGrid’s philosophy: provide low-level abstractions.
    Advantage: you can do anything with it.
    Drawback: implementing a simulation of a complex system is a lot of work.
    Critical analysis: [Kecskemeti et al.’14] pinpoints exactly the above trade-off: "SimGrid is more scalable and validated than competing frameworks, but just too much work when wanting to simulate a WMS that interacts with CI components"
  9. The WRENCH Project

    Objective #1: Make it easy to develop simulators of complex workflow executions. Done by providing high-level, reusable simulation abstractions.
    Objective #2: Produce accurate and scalable simulations. Done by building on SimGrid.
    Let’s look at an example system one can simulate with WRENCH…
  10. System to Simulate

    [Figure: diagram of the system to simulate. Labeled components: WAN topology; batch-scheduled compute service (SLURM batch end-point, batch job); cloud compute service (EC2 cloud end-point, virtual machines); storage services (FTP servers, HTTP servers, local storage) holding File X and File Y; file registry service (replica catalog); network proximity service (Vivaldi latency estimates); workflow submission interface through which a scientist submits a workflow description (DAX, JSON); Workflow Management System (decisions / actions).]
    Workflow Management Systems: decision-making for optimizing various objectives (static and dynamic); pilot jobs.
    Compute Services: bare-metal servers, cloud platforms, virtualized cluster platforms, batch-scheduled clusters.
    Storage Services: including scratch spaces.
    File Registry Services: replica catalog (key-value pairs).
    Network Proximity Services: database of host-to-host network distances (Vivaldi).
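Slide 10 mentions a network proximity service that keeps host-to-host distance estimates using Vivaldi. As a rough illustration of the idea, the sketch below implements the simplest form of a Vivaldi coordinate update with a fixed step size: each host holds a synthetic coordinate and, after measuring a round-trip time to a peer, nudges its coordinate so that the coordinate distance moves toward the measurement. This is plain Python and not WRENCH or SimGrid code; the names `vivaldi_update`, `distance`, and `delta` are assumptions made for this example, and the adaptive step size of the full algorithm is deliberately left out.

```python
import math


def distance(a, b):
    """Euclidean distance between two coordinate vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def vivaldi_update(xi, xj, rtt, delta=0.25):
    """Return a new coordinate for host i after one latency sample.

    xi, xj: current coordinates of the local and the remote host
    rtt:    measured latency between the two hosts
    delta:  fixed step size (hypothetical default; the full algorithm adapts it)
    """
    diff = [a - b for a, b in zip(xi, xj)]
    dist = math.sqrt(sum(d * d for d in diff))
    if dist == 0.0:
        # Coincident coordinates: pick an arbitrary direction to separate them.
        unit = [1.0] + [0.0] * (len(xi) - 1)
    else:
        unit = [d / dist for d in diff]
    # Positive error: the hosts are too close in coordinate space, so move away.
    error = rtt - dist
    return [a + delta * error * u for a, u in zip(xi, unit)]


if __name__ == "__main__":
    # Two hosts repeatedly sample the same latency and update in turn.
    a, b = [0.0, 0.0], [1.0, 0.0]
    for _ in range(50):
        a = vivaldi_update(a, b, rtt=10.0)
        b = vivaldi_update(b, a, rtt=10.0)
    print("estimated latency:", distance(a, b))
```

With two hosts and a constant measurement, each update shrinks the gap between the estimated and measured latency by a factor of `1 - delta`, so the estimate converges to the measurement. With many hosts the coordinates settle into a compromise, which is what lets a proximity service answer distance queries for pairs it never measured directly.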
  19. Two Kinds of WRENCH Users

    Developers who develop simulated WMS implementations:
    WMS developers who want an in-simulation implementation of their WMS for easy/broad experimental studies of their WMS.
    WMS researchers who want to quickly prototype ideas in simulation.
    Users who implement (and run) simulators:
    Users who want to see how fast a given workflow would run with a given WMS on a given platform, which can include the previous users.
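To make the question "how fast would a given workflow run on a given platform" concrete, here is a toy discrete-event sketch, assuming a workflow given as task runtimes plus dependencies and a platform reduced to a number of identical cores. It is not WRENCH code and uses none of its APIs: the function `simulate_makespan` and its parameters are hypothetical, and everything a real simulator models (data transfers, network contention, storage, batch queues, WMS overheads) is ignored.

```python
import heapq


def simulate_makespan(runtimes, parents, num_cores):
    """Simulate greedy execution of a task graph and return its makespan.

    runtimes:  dict mapping task name to runtime in seconds
    parents:   dict mapping task name to the list of tasks it depends on
    num_cores: number of identical cores; each task occupies one core
    """
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    remaining = {t: len(parents.get(t, ())) for t in runtimes}
    children = {t: [] for t in runtimes}
    for task, deps in parents.items():
        for dep in deps:
            children[dep].append(task)

    ready = sorted(t for t, n in remaining.items() if n == 0)
    running = []  # min-heap of (finish_time, task)
    now, free, done = 0.0, num_cores, 0

    while done < len(runtimes):
        # Start as many ready tasks as there are free cores.
        while ready and free > 0:
            task = ready.pop(0)
            heapq.heappush(running, (now + runtimes[task], task))
            free -= 1
        if not running:
            raise ValueError("dependency cycle: no task can make progress")
        # Advance simulated time to the next task completion.
        now, finished = heapq.heappop(running)
        free += 1
        done += 1
        for child in children[finished]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)
    return now


if __name__ == "__main__":
    # Diamond-shaped workflow: a -> {b, c} -> d
    runtimes = {"a": 2, "b": 3, "c": 5, "d": 1}
    parents = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
    for cores in (1, 2):
        print(cores, "core(s):", simulate_makespan(runtimes, parents, cores))
```

For the diamond workflow in the example, one core gives a makespan of 11 seconds (the sum of all runtimes) and two cores give 8 seconds (the critical path a, c, d). Exploring such "what if" variations by changing a single argument is the kind of experiment that is expensive on real platforms and cheap in simulation.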
  20. The WRENCH Software Stack

    [Figure: layered diagram of the WRENCH software stack, from top to bottom, with several copies of the "System to Simulate" diagram from slide 10 at the top layer.]
    Simulators of workflow executions
    WRENCH User API (C++)
    Simulated production and prototype WMSs: Makeflow, Moteur, Pegasus, Research Prototype
    WRENCH Developer API (C++)
    Simulated core CI services: Computation (Cloud, Batch, Rack), Storage (FTP, HTTP, P2P), Network Monitoring (Vivaldi, perfSONAR), Data Location (Replica Catalog)
    SimGrid::S4U API (C++)
    Simulated core software / hardware stacks

    WRENCH enables novel avenues for scientific workflow use, research, development, and education in the context of large-scale scientific computations and data analyses.
    WRENCH is an open-source library for developing simulators.
    WRENCH exposes several high-level simulation abstractions to provide high-level building blocks for developing custom simulators.
  21. The WRENCH Software Stack !22 SimGrid::S4U API (C++) WRENCH Developer

    API (C++) WRENCH User API (C++) Simulated core software / hardware stacks Computation Storage Network Monitoring Data Location Cloud Batch Rack FTP HTTP P2P Vivaldi perf SONAR Replica Catalog Simulated core CI services Simulated production and prototype WMSs Makeflow Moteur Pegasus Research Prototype WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point . . 
Simulators or workflow executions WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WRENCH enables novel avenues for scientific workflow use, research, development, and education in the context of large-scale scientific computations and data analyses WRENCH is an open-source library for developing simulators WRENCH exposes several high-level simulation abstractions to provide high-level building blocks for developing custom simulators
  22. The WRENCH Software Stack !23 SimGrid::S4U API (C++) WRENCH Developer

    API (C++) WRENCH User API (C++) Simulated core software / hardware stacks Computation Storage Network Monitoring Data Location Cloud Batch Rack FTP HTTP P2P Vivaldi perf SONAR Replica Catalog Simulated core CI services Simulated production and prototype WMSs Makeflow Moteur Pegasus Research Prototype WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point . . 
Simulators or workflow executions WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WRENCH enables novel avenues for scientific workflow use, research, development, and education in the context of large-scale scientific computations and data analyses WRENCH is an open-source library for developing simulators WRENCH exposes several high-level simulation abstractions to provide high-level building blocks for developing custom simulators
  23. The WRENCH Software Stack !24 SimGrid::S4U API (C++) WRENCH Developer

    API (C++) WRENCH User API (C++) Simulated core software / hardware stacks Computation Storage Network Monitoring Data Location Cloud Batch Rack FTP HTTP P2P Vivaldi perf SONAR Replica Catalog Simulated core CI services Simulated production and prototype WMSs Makeflow Moteur Pegasus Research Prototype WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point . . 
Simulators or workflow executions WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WRENCH enables novel avenues for scientific workflow use, research, development, and education in the context of large-scale scientific computations and data analyses WRENCH is an open-source library for developing simulators WRENCH exposes several high-level simulation abstractions to provide high-level building blocks for developing custom simulators
  24. The WRENCH Software Stack !25 SimGrid::S4U API (C++) WRENCH Developer

    API (C++) WRENCH User API (C++) Simulated core software / hardware stacks Computation Storage Network Monitoring Data Location Cloud Batch Rack FTP HTTP P2P Vivaldi perf SONAR Replica Catalog Simulated core CI services Simulated production and prototype WMSs Makeflow Moteur Pegasus Research Prototype WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point . . 
Simulators or workflow executions WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WRENCH enables novel avenues for scientific workflow use, research, development, and education in the context of large-scale scientific computations and data analyses WRENCH is an open-source library for developing simulators WRENCH exposes several high-level simulation abstractions to provide high-level building blocks for developing custom simulators
  25. The WRENCH Software Stack !26 SimGrid::S4U API (C++) WRENCH Developer

    API (C++) WRENCH User API (C++) Simulated core software / hardware stacks Computation Storage Network Monitoring Data Location Cloud Batch Rack FTP HTTP P2P Vivaldi perf SONAR Replica Catalog Simulated core CI services Simulated production and prototype WMSs Makeflow Moteur Pegasus Research Prototype WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point . . 
  27. WRENCH Developer API

    Objective: make it easy to implement a complex WMS.
    The WMS is implemented as a single thread of control.
    No explicit message-based interactions.
    Simulated CI services provide high-level APIs.
    The WMS registers callbacks for a small set of events.
    Simple asynchronous interactions with services.
    Built-in "manager threads" help with managing synchronicity.
    Straightforward failure handling and failure-cause inspection.
    Bottom line: removes many well-known difficulties when implementing a system that interacts asynchronously with distributed services on failure-prone platforms.
    The example WMS distributed with WRENCH fits in 200 lines of C++.
    http://wrench-project.org
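The single-thread-of-control style described on this slide can be illustrated with a toy sketch in plain Python. It does not use the WRENCH API: `ToyComputeService`, `run_wms`, and the workflow data are hypothetical names and values chosen for illustration. The point it shows is that the WMS submits work asynchronously and then blocks at a single place, waiting for the next event.

```python
import heapq


class ToyComputeService:
    """Hypothetical stand-in for a simulated compute service (not a WRENCH class)."""

    def __init__(self, num_cores):
        self.free_cores = num_cores
        self.events = []  # min-heap of (completion_time, task_name)

    def submit(self, task, now, runtime):
        # Asynchronous submission: returns immediately; completion arrives later as an event.
        self.free_cores -= 1
        heapq.heappush(self.events, (now + runtime, task))

    def wait_for_next_event(self):
        # Advances simulated time to the next task completion.
        time, task = heapq.heappop(self.events)
        self.free_cores += 1
        return time, task


def run_wms(runtimes, parents, num_cores):
    """Single-threaded WMS loop: submit ready tasks, then react to one event at a time."""
    service = ToyComputeService(num_cores)
    done, submitted, now = set(), set(), 0.0
    completion_order = []
    while len(done) < len(runtimes):
        ready = [t for t in sorted(runtimes)
                 if t not in submitted and all(p in done for p in parents.get(t, []))]
        for task in ready:
            if service.free_cores == 0:
                break
            service.submit(task, now, runtimes[task])
            submitted.add(task)
        now, finished = service.wait_for_next_event()  # the only blocking point
        done.add(finished)
        completion_order.append(finished)
    return now, completion_order


# Hypothetical diamond-shaped workflow: a -> (b, c) -> d, runtimes in seconds.
runtimes = {"a": 10.0, "b": 20.0, "c": 5.0, "d": 10.0}
parents = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
makespan, order = run_wms(runtimes, parents, num_cores=2)
print(makespan, order)
```

The sketch has no failure handling and assumes an acyclic workflow. A real WMS would also react to failure events at that same blocking point.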
  28. WRENCH User API

    Makes it possible to implement a simulator in a few lines of code:
      Instantiate a platform (XML description file)
      Create a workflow
      Create CI services and a WMS to run on the platform
      Launch the simulation
      Process/analyze/visualize simulation output
    A useful WRENCH simulator can consist of just a few dozen lines of code.
  29. Building a Simulator

    Blueprint for a WRENCH-based simulator:
      Create and initialize a simulation
      Instantiate a simulated platform
      Instantiate services on the platform
      Create at least one workflow
      Instantiate at least one WMS per workflow
      Launch the simulation
      Process simulation output
    WRENCH + SimGrid internals:
      Agent: some code and some private data, running on a given host
      Task: an amount of work to do and of data to exchange
      Host: a location on which agents execute
      Mailbox: a rendezvous point between agents
      You can send 'data' to a mailbox; you receive 'data' from a mailbox
      Communication time between sender and receiver is accounted for (payload) and depends on the network traffic
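The mailbox idea above can be sketched in a few lines of Python. This is a toy model, not the SimGrid or WRENCH mailbox: `ToyMailbox` and its parameters are hypothetical, and communication time is assumed to be latency plus payload divided by bandwidth, with no contention from other traffic (the slide notes that the real cost does depend on network traffic).

```python
from collections import deque


class ToyMailbox:
    """Hypothetical rendezvous point; not the SimGrid or WRENCH mailbox."""

    def __init__(self, latency_s, bandwidth_Bps):
        self.latency_s = latency_s
        self.bandwidth_Bps = bandwidth_Bps
        self.pending = deque()  # messages in flight: (arrival_time, data)

    def put(self, data, payload_bytes, send_time):
        # Assumed cost model: arrival = send time + latency + payload / bandwidth.
        arrival = send_time + self.latency_s + payload_bytes / self.bandwidth_Bps
        self.pending.append((arrival, data))

    def get(self, receiver_time):
        # The receiver obtains the data no earlier than the arrival time
        # and no earlier than its own clock.
        arrival, data = self.pending.popleft()
        return max(arrival, receiver_time), data


# A 250 MB payload over a 125 MB/s link with 100 us latency (illustrative values).
mailbox = ToyMailbox(latency_s=100e-6, bandwidth_Bps=125e6)
mailbox.put("File X", payload_bytes=250e6, send_time=0.0)
received_at, data = mailbox.get(receiver_time=1.0)
print(received_at, data)
```

In this toy model a receiver that shows up late simply receives at its own clock time; it does not model the sender blocking until the rendezvous.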
  30. Case Study

    We built a WRENCH implementation of the Pegasus/DAGMan WMS (with the Developer API).
    WRENCH provides a simulated HTCondor implementation.
    We built a simulator of Pegasus-driven executions of arbitrary workflows on arbitrary platforms (with the User API).
    The simulator simulates:
      Application task executions and data transfers
      Auxiliary tasks (directory creations, cleanups, file registrations, job startup/teardown overheads, etc.)
      Delays and message exchanges within DAGMan and HTCondor
    Goal: evaluate WRENCH's ease of use, accuracy, and scalability.
  31. Ease of Use

    Pegasus/DAGMan WRENCH implementation:
      127 lines of code to read/parse config files
      539 lines of code for WMS logic
    Compared to the production implementation: tens of thousands of lines of code.
    [Figure: the WRENCH Pegasus simulator, shown with a Pegasus workflow trace, a SimGrid platform file, a simulation configuration file, and a simulation output file. Simulated components: Submit Node (DAGMan workflow engine, HTCondor schedd), HTCondor Central Manager (negotiator), Data Node (HTTP server, data storage), Worker Nodes (HTCondor startd, scratch space).]
    https://github.com/wrench-project/pegasus
  32. Accuracy Methodology

    Execute a real-world workflow with Pegasus/DAGMan on a real-world platform.
    Run a few simple benchmarks on the platform to measure network bandwidths, latencies, etc.
    Simulate that same execution with the WRENCH simulator.
    Compare the real-world and simulated executions.
  33. Experimental Scenarios

    Scientific Workflow Applications
    1000 Genome Sequencing Analysis Workflow: 22 Individual tasks, 7 Population tasks, 22 Sifting tasks, 154 Pair Overlap Mutations tasks, and 154 Frequency Overlap Mutations tasks (359 tasks in total).
    [Figure: structure of the 1000 Genome workflow, from Input Data to Output Data, with Data Preparation (Individuals, Populations, Sifting) followed by Analysis (Pair Overlap Mutations, Frequency Overlap Mutations).]
    Montage: two sets of tasks (573 tasks and 1,240 tasks).
  34. Experimental Scenarios

    Simulated Platform
    SimGrid platform description file (modeled as a bare-metal system):

    <?xml version='1.0'?>
    <!DOCTYPE platform SYSTEM "http://simgrid.gforge.inria.fr/simgrid/simgrid.dtd">
    <platform version="4.1">
      <zone id="AS0" routing="Full">
        <host id="master" speed="1f" core="4"/>
        <host id="data" speed="1f" core="1"/>
        <host id="workers1-2" speed="1f" core="4"/>
        <host id="workers1-0" speed="1f" core="4"/>
        <host id="workers1-3" speed="1f" core="4"/>
        <host id="workers1-1" speed="1f" core="4"/>
        <host id="workers1-4" speed="1f" core="4"/>
        <link id="1" bandwidth="125MBps" latency="100us"/>
        <link id="2" bandwidth="55MBps" latency="100us"/>
        <route src="master" dst="workers1-2"> <link_ctn id="1"/> </route>
        <route src="master" dst="workers1-0"> <link_ctn id="1"/> </route>
        <route src="master" dst="workers1-3"> <link_ctn id="1"/> </route>
        <route src="master" dst="workers1-1"> <link_ctn id="1"/> </route>
        <route src="master" dst="workers1-4"> <link_ctn id="1"/> </route>
        <route src="data" dst="master"> <link_ctn id="2"/> </route>
      </zone>
    </platform>

    Hosts: master server (submit host), data node, and worker nodes (4 cores each).
  35. Experimental Scenarios

    Simulated Platform
    Amazon's "cloud" platform:
      t2.xlarge instances (4 vCPU each, 16 GiB), with a WMS node and a data node; link bandwidths of 0.74 Gbps and 0.44 Gbps
      m5.xlarge instances (4 vCPU each, 54 GiB), with a WMS node and a data node; link bandwidths of 1.24 Gbps and 0.44 Gbps
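The cloud bandwidths above are given in Gbps, whereas the SimGrid platform file on the previous slide states link bandwidths in MBps. The sketch below reads the link definitions from a trimmed copy of that file (hosts and routes omitted) with Python's standard `xml.etree.ElementTree` and converts the Gbps figures for comparison. The helper names are hypothetical, and the conversion assumes decimal units (1 Gbps = 1000 Mbit/s, 8 bits per byte).

```python
import xml.etree.ElementTree as ET

# Link definitions copied from the platform file shown earlier (hosts and routes omitted).
PLATFORM_XML = """<?xml version='1.0'?>
<!DOCTYPE platform SYSTEM "http://simgrid.gforge.inria.fr/simgrid/simgrid.dtd">
<platform version="4.1">
  <zone id="AS0" routing="Full">
    <link id="1" bandwidth="125MBps" latency="100us"/>
    <link id="2" bandwidth="55MBps" latency="100us"/>
  </zone>
</platform>
"""


def link_bandwidths_MBps(platform_xml):
    """Return {link id: bandwidth in MBps} for links declared with an 'MBps' suffix."""
    root = ET.fromstring(platform_xml)
    result = {}
    for link in root.iter("link"):
        value = link.get("bandwidth")
        if value.endswith("MBps"):
            result[link.get("id")] = float(value[:-len("MBps")])
    return result


def gbps_to_MBps(gbps):
    """Assumes decimal units: 1 Gbps = 1000 Mbit/s and 8 bits per byte."""
    return gbps * 1000 / 8


bandwidths = link_bandwidths_MBps(PLATFORM_XML)
print(bandwidths)
for gbps in (0.44, 0.74, 1.24):
    print(f"{gbps} Gbps = {gbps_to_MBps(gbps):.1f} MBps")
```

Under these assumptions, 0.44 Gbps works out to 55 MBps, the bandwidth of link 2 in the platform file, and the 125 MBps of link 1 corresponds to 1 Gbps.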
  36. Makespan Accuracy Results

    Simulation Results and Accuracy
    Simulation of compute and data transfer tasks includes auxiliary tasks (e.g., create_dir, cleanup, and registration) as well as PRE and POST script jobs.
    Simulates delays in both the DAGMan and HTCondor daemons.
    Simulation with WorkflowSim [Chen et al. 2012]: errors are 12.09 ± 2.84, 26.87 ± 6.26, and 13.32 ± 1.12 (in spite of our calibration efforts).
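The slide reports errors as a mean plus or minus a spread but does not define the metric. The sketch below assumes a relative makespan error in percent, |simulated - real| / real x 100, summarized by its mean and sample standard deviation over repeated runs. Both the metric and the makespan values are assumptions for illustration, not data from the study.

```python
from statistics import mean, stdev


def relative_errors(real, simulated):
    """Percentage error of each simulated makespan relative to the real one (assumed metric)."""
    return [abs(s - r) * 100.0 / r for r, s in zip(real, simulated)]


# Hypothetical makespans in seconds for three repeated executions.
real = [1000.0, 1100.0, 900.0]
simulated = [950.0, 1210.0, 900.0]

errors = relative_errors(real, simulated)
print(f"error: {mean(errors):.2f} ± {stdev(errors):.2f}")
```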
  37. Visual Inspection of CDFs

    Simulation Results and Accuracy
    Kolmogorov-Smirnov goodness-of-fit test (K-S test) with a two-sided alternative hypothesis.
    The null hypothesis H0 is that the ECDF of the simulated workflow execution fits the ECDF of the real workflow execution.
    The null hypothesis is not rejected (p-value > 0.05).
    [Figure: ECDFs for Montage (1240 tasks) on AWS-m5.xlarge, comparing pegasus, wrench, and workflowsim. Panel A: task submission times, F(Submitted Tasks) vs. Workflow Makespan (s). Panel B: task completion times, F(Completed Tasks) vs. Workflow Makespan (s).]
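For readers unfamiliar with the test, the sketch below computes the textbook two-sample K-S statistic, the largest vertical gap between two ECDFs, together with the usual large-sample critical value at a given significance level. The slides do not say which implementation was used, so this is only an illustration; the task completion times are hypothetical.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: largest gap between the two ECDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:  # the ECDFs only change at observed values
        f_a = bisect_right(a, x) / len(a)
        f_b = bisect_right(b, x) / len(b)
        d = max(d, abs(f_a - f_b))
    return d


def ks_critical_value(n, m, alpha=0.05):
    """Large-sample critical value for the two-sided two-sample test."""
    c = math.sqrt(-math.log(alpha / 2.0) / 2.0)
    return c * math.sqrt((n + m) / (n * m))


# Hypothetical task completion times (s) from a real and a simulated execution.
real_times = [10.0, 20.0, 30.0, 40.0]
simulated_times = [12.0, 22.0, 28.0, 41.0]

d = ks_statistic(real_times, simulated_times)
threshold = ks_critical_value(len(real_times), len(simulated_times))
print(d, threshold, "reject H0" if d > threshold else "do not reject H0")
```

The critical value is an asymptotic approximation and is crude for samples this small; with hundreds of tasks, as in the Montage runs, it is more appropriate.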
  38. Gantt Charts

    [Figure: Gantt charts of executions of Montage-2.0 on AWS-m5.xlarge. Panel A: pegasus. Panel B: wrench. Axes: Tasks vs. Makespan (s).]
  39. Gantt Charts

    [Figure: Gantt charts of executions of Montage-2.0 on AWS-m5.xlarge. Panel A: pegasus. Panel B: wrench. Axes: Tasks vs. Makespan (s).]
    The different shapes are due to different orders of task submissions (caused by the different data structures used by the simulated and real-world implementations).
  40. Scalability

    [Figure: simulation time (s) and memory usage (MB) vs. number of workflow tasks (1,000 to 10,000), for workflowsim and wrench-1.2.]
  41. Supporting Research

    Ongoing WRENCH-enabled Projects

    Simulating VIP (Virtual Imaging Platform)
      VIP targets the execution of medical imaging workflow applications on the BioMed Virtual Organization resources provided on the EGI (European Grid Initiative) platforms.
      The objective is to optimize workflow executions via better decision-making strategies.
      WRENCH is used to simulate novel:
        Data-replication strategies
        Pilot job submission strategies for batch-scheduled clusters
        Cluster selection strategies

    Efficient workflow executions on batch-scheduled clusters
      Batch-scheduled clusters are not ideally suited to workflow applications, and yet they represent the majority of HPC execution platforms.
      A key question is: how should workflow tasks be aggregated into batch jobs?
      One approach is to design task aggregation strategies that try to account for the dynamics of the batch queues.
      WRENCH simulations are used to drive the design of such strategies.
  42. Software Availability

    Code Repository, Releases, Software Engineering Process
    Open-source repository: https://github.com/wrench-project/wrench
    Continuous integration: https://travis-ci.org/wrench-project/wrench
    Test coverage: https://coveralls.io/github/wrench-project/wrench
    Code review: https://sonarcloud.io/dashboard?id=wrench
    Releases:
      1.2 (November 7, 2018)
      1.1 (August 26, 2018)
      1.0.1 (August 14, 2018)
      1.0 (June 16, 2018)
    Upcoming releases (estimated):
      1.3 (January 2019)
  43. Education

    WRENCH Stand-alone Pedagogical Module: http://wrench-project.org/wrench-pedagogic-modules
    It is crucial to teach undergraduate students parallel and distributed computing, but it is not easy:
      giving students access to sufficiently diverse and realistic software/hardware platforms
      dealing with platform downtimes and instabilities
      dealing with time-consuming and possibly costly executions
    Simulation resolves these difficulties, and WRENCH provides the foundation for pedagogic modules on parallel and distributed computing that use workflows as a motivating context.
  44. Simulation Building Blocks

    Prototype implementations of Workflow Management System (WMS) components and underlying algorithms.
    Simulation Accuracy: captures the behavior of a real-world system with as little bias as possible via validated simulation models.
    Scalability: low ratio of simulation time to simulated time; ability to run large simulations on a single computer with low compute, memory, and energy footprints.
    Reproducible Results: enable the reproduction or repetition of published results by a party working independently using the same or different simulation models.
    Research Paper