Slide 1

Slide 1 text

Workflow Management System Simulation Workbench
Accurate, scalable, and reproducible simulations
http://wrench-project.org

Slide 2

Slide 2 text

Our Team
WRENCH is funded by the National Science Foundation (NSF) under grant numbers 1642369 and 1642335, and by the National Center for Scientific Research (CNRS) under grant number PICS07239.

Slide 3

Slide 3 text

Workflow + WMS + CI = Complex Systems
Scientific workflows are complex applications that execute on complex software stacks on complex hardware platforms
To advance the state of the art, one must thus try to study and understand complex systems
Theory only gets you so far, as its assumptions break down quickly
So most research in this field is experiment-driven
Typical approach: take a real workflow, take a real software infrastructure installed on a real hardware infrastructure, run the workflow, get results

Slide 4

Slide 4 text

Real-World Experiments are Expensive
The “setup and configuration” work can be non-trivial
  Because you need full-fledged software stacks
Experiments are time-consuming
  Long-running, faulty executions, weird non-representative outliers due to misconfigurations, many required executions due to repeatability problems, etc.
Experiments cost “money”
  Directly or indirectly (energy, carbon footprint)
As a result, research papers are often deemed to “not have enough experimental results”
[Personal Opinion] Most authors feel that they have spent way too much time setting up and running experiments

Slide 5

Slide 5 text

Real-World Experiments are Limited
One is limited to particular platform configurations (and sub-configurations)
  How can “what if?” scenarios be explored?
  How can generality be claimed?
One is limited by specifics of the software infrastructure that impose constraints on workflow executions
  Modifying complex software stacks (often written by others) just to test out ideas is not feasible
In the end, the scope of real-world experiments is limited, which impedes progress / discovery

Slide 6

Slide 6 text

Simulation
When one works in an experimental field in which experiments are problematic, one resorts to simulation
  Physicists understood this decades ago :)
In some fields of Computer Science, simulation is a standard research and development methodology
  e.g., Networking, Computer Architecture
Several simulators and simulation frameworks have been developed for parallel and distributed computing
  Some of them were developed explicitly for workflows

Slide 7

Slide 7 text

The SimGrid Framework
SimGrid is a research project
  Development of simulation models of hardware/software stacks
  Models are accurate (validated/invalidated) and scalable (low computational complexity, low memory footprint)
SimGrid is open-source, usable software
  Provides different APIs for a range of simulation needs, e.g.:
    S4U: General simulation of Concurrent Sequential Processes
    SMPI: Fine-grained simulation of MPI applications
SimGrid is a versatile scientific instrument
  Used for (combinations of) Grid, HPC, Peer-to-Peer, Cloud, Fog simulation projects
First developed in 2000, latest release: v3.21 (October 2018)
http://simgrid.org

Slide 8

Slide 8 text

Software Sustainability
SimGrid is well-funded, active as a research project, and widely used as a simulation tool for research, development, and education
  44 journal articles
  146 conference articles
  17 PhD theses

Slide 9

Slide 9 text

SimGrid’s Philosophy
SimGrid’s philosophy: provide low-level abstractions
  Advantage: you can do anything with it
  Drawback: implementing a simulation of a complex system is a lot of work
Critical analysis: [Kecskemeti et al.’14] pinpoints exactly the above trade-off:
  "SimGrid is more scalable and validated than competing frameworks, but just too much work when wanting to simulate a WMS that interacts with CI components"

Slide 10

Slide 10 text

The WRENCH Project
Objective #1: Make it easy to develop simulators of complex workflow executions
  Done by providing high-level, reusable simulation abstractions
Objective #2: Produce accurate and scalable simulations
  Done by building on SimGrid
Let’s look at an example system one can simulate with WRENCH…

Slide 11

Slide 11 text

System to Simulate
[Figure: example system. A scientist submits a workflow description (DAX, JSON) through a workflow submission interface to a Workflow Management System, whose decisions / actions drive, over a WAN topology, a batch-scheduled compute service (SLURM batch end-point, batch jobs), a cloud compute service (EC2 cloud end-point, virtual machines), storage services (FTP and HTTP servers and local storage holding files X and Y), a file registry service (replica catalog), and a network proximity service (Vivaldi latency estimates).]
Workflow Management Systems
  Decision-making for optimizing various objectives (static and dynamic)
  Pilot Jobs
Compute Services
  Bare-metal servers
  Cloud platforms
  Virtualized Cluster platforms
  Batch-scheduled clusters
Storage Services
  Including scratch spaces
File Registry Services
  Replica catalog (key-value pairs)
Network Proximity Services
  Database of host-to-host network distances (Vivaldi)
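
To make the last two service types concrete, here is a minimal sketch of how a replica catalog (key-value pairs) and a database of host-to-host distances could be combined to pick the nearest copy of a file. This is not the WRENCH API: the class names, function names, host names, and distance values are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class ReplicaCatalog:
    """Toy file registry: maps a file name to the hosts that store a copy of it."""

    def __init__(self):
        self._entries = defaultdict(set)

    def add_entry(self, file_name, storage_host):
        self._entries[file_name].add(storage_host)

    def lookup(self, file_name):
        # Return a copy so callers cannot mutate the catalog by accident.
        return set(self._entries.get(file_name, set()))


class ProximityDatabase:
    """Toy network proximity service: symmetric host-to-host distance estimates."""

    def __init__(self):
        self._distances = {}

    def record(self, host_a, host_b, distance):
        self._distances[frozenset((host_a, host_b))] = distance

    def distance(self, host_a, host_b):
        if host_a == host_b:
            return 0.0
        # Pairs with no recorded estimate are treated as infinitely far apart.
        return self._distances.get(frozenset((host_a, host_b)), float("inf"))


def closest_replica(catalog, proximity, file_name, from_host):
    """Return the host storing file_name that is nearest to from_host, or None."""
    replicas = catalog.lookup(file_name)
    if not replicas:
        return None
    # Sorting first makes the choice deterministic when distances tie.
    return min(sorted(replicas), key=lambda host: proximity.distance(from_host, host))


# Hypothetical hosts and distances, for illustration only.
catalog = ReplicaCatalog()
catalog.add_entry("file_x", "ftp_server")
catalog.add_entry("file_x", "http_server")
catalog.add_entry("file_y", "ftp_server")

proximity = ProximityDatabase()
proximity.record("batch_cluster", "ftp_server", 350.0)
proximity.record("batch_cluster", "http_server", 170.0)

print(closest_replica(catalog, proximity, "file_x", "batch_cluster"))  # http_server
```

A simulated WMS could use this kind of lookup when deciding where to read an input file from; a real proximity service would estimate the distances (e.g., with Vivaldi coordinates) rather than have them recorded by hand.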

Slide 20

Slide 20 text

Two Kinds of WRENCH Users
Developers who develop simulated WMS implementations
  WMS developers who want an in-simulation implementation of their WMS for easy/broad experimental studies of their WMS
  WMS researchers who want to quickly prototype ideas in simulation
Users who implement (and run) simulators
  Users who want to see how fast a given workflow would run with a given WMS on a given platform
  Which can include the previous users
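
The question "how fast would a given workflow run with a given WMS on a given platform" can be illustrated with a toy estimator. The sketch below is not WRENCH and makes strong simplifying assumptions: tasks are described only by an amount of work, hosts only by a speed, data transfers are ignored, and the "WMS" is a greedy policy that places each ready task on the host where it would finish earliest. All names and numbers are hypothetical.

```python
def simulate_makespan(work, parents, host_speeds):
    """Estimate the makespan of a task graph under a greedy earliest-finish policy.

    work:        dict mapping task name -> amount of work (flop)
    parents:     dict mapping task name -> list of parent task names
    host_speeds: dict mapping host name -> compute speed (flop/s)
    """
    if not host_speeds:
        raise ValueError("at least one host is required")
    finish = {}  # task name -> simulated completion time
    host_free_at = {host: 0.0 for host in sorted(host_speeds)}
    pending = set(work)
    while pending:
        # A task is ready once all of its parents have completed.
        ready = sorted(t for t in pending
                       if all(p in finish for p in parents.get(t, ())))
        if not ready:
            raise ValueError("dependency cycle or unknown parent task")
        task = ready[0]
        inputs_ready = max((finish[p] for p in parents.get(task, ())), default=0.0)
        best_host, best_finish = None, None
        for host in host_free_at:
            start = max(host_free_at[host], inputs_ready)
            end = start + work[task] / host_speeds[host]
            if best_finish is None or end < best_finish:
                best_host, best_finish = host, end
        finish[task] = best_finish
        host_free_at[best_host] = best_finish
        pending.remove(task)
    return max(finish.values(), default=0.0)


# A hypothetical diamond-shaped workflow: a -> (b, c) -> d.
work = {"a": 100.0, "b": 200.0, "c": 200.0, "d": 100.0}
parents = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}

print(simulate_makespan(work, parents, {"host1": 100.0}))                  # 6.0
print(simulate_makespan(work, parents, {"host1": 100.0, "host2": 100.0}))  # 4.0
```

Changing the platform description and re-running is the kind of "what if?" exploration that is hard to do with real-world experiments; an actual simulator would additionally model storage, network, and service behavior.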

Slide 21

Slide 21 text

The WRENCH Software Stack
[Figure: layered software stack, from bottom to top.
  SimGrid::S4U API (C++): simulated core software / hardware stacks
  WRENCH Developer API (C++): simulated core CI services
    Computation (Cloud, Batch, Rack), Storage (FTP, HTTP, P2P), Network Monitoring (Vivaldi, perfSONAR), Data Location (Replica Catalog)
  WRENCH User API (C++): simulated production and prototype WMSs
    Makeflow, Moteur, Pegasus, Research Prototype
  Simulators of workflow executions, each instantiating a "System to Simulate" such as the example system shown earlier]
WRENCH enables novel avenues for scientific workflow use, research, development, and education in the context of large-scale scientific computations and data analyses
WRENCH is an open-source library for developing simulators
WRENCH exposes several high-level simulation abstractions to provide high-level building blocks for developing custom simulators

Slide 22

Slide 22 text

The WRENCH Software Stack !22 SimGrid::S4U API (C++) WRENCH Developer API (C++) WRENCH User API (C++) Simulated core software / hardware stacks Computation Storage Network Monitoring Data Location Cloud Batch Rack FTP HTTP P2P Vivaldi perf SONAR Replica Catalog Simulated core CI services Simulated production and prototype WMSs Makeflow Moteur Pegasus Research Prototype WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point . . 
Simulators or workflow executions WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WRENCH enables novel avenues for scientific workflow use, research, development, and education in the context of large-scale scientific computations and data analyses WRENCH is an open-source library for developing simulators WRENCH exposes several high-level simulation abstractions to provide high-level building blocks for developing custom simulators

Slide 23

Slide 23 text

The WRENCH Software Stack !23 SimGrid::S4U API (C++) WRENCH Developer API (C++) WRENCH User API (C++) Simulated core software / hardware stacks Computation Storage Network Monitoring Data Location Cloud Batch Rack FTP HTTP P2P Vivaldi perf SONAR Replica Catalog Simulated core CI services Simulated production and prototype WMSs Makeflow Moteur Pegasus Research Prototype WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point . . 
Simulators or workflow executions WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WRENCH enables novel avenues for scientific workflow use, research, development, and education in the context of large-scale scientific computations and data analyses WRENCH is an open-source library for developing simulators WRENCH exposes several high-level simulation abstractions to provide high-level building blocks for developing custom simulators

Slide 24

Slide 24 text

The WRENCH Software Stack !24 SimGrid::S4U API (C++) WRENCH Developer API (C++) WRENCH User API (C++) Simulated core software / hardware stacks Computation Storage Network Monitoring Data Location Cloud Batch Rack FTP HTTP P2P Vivaldi perf SONAR Replica Catalog Simulated core CI services Simulated production and prototype WMSs Makeflow Moteur Pegasus Research Prototype WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point . . 
Simulators or workflow executions WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WRENCH enables novel avenues for scientific workflow use, research, development, and education in the context of large-scale scientific computations and data analyses WRENCH is an open-source library for developing simulators WRENCH exposes several high-level simulation abstractions to provide high-level building blocks for developing custom simulators

Slide 25

Slide 25 text

The WRENCH Software Stack !25 SimGrid::S4U API (C++) WRENCH Developer API (C++) WRENCH User API (C++) Simulated core software / hardware stacks Computation Storage Network Monitoring Data Location Cloud Batch Rack FTP HTTP P2P Vivaldi perf SONAR Replica Catalog Simulated core CI services Simulated production and prototype WMSs Makeflow Moteur Pegasus Research Prototype WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point . . 
Simulators or workflow executions WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WRENCH enables novel avenues for scientific workflow use, research, development, and education in the context of large-scale scientific computations and data analyses WRENCH is an open-source library for developing simulators WRENCH exposes several high-level simulation abstractions to provide high-level building blocks for developing custom simulators

Slide 26

Slide 26 text

The WRENCH Software Stack !26 SimGrid::S4U API (C++) WRENCH Developer API (C++) WRENCH User API (C++) Simulated core software / hardware stacks Computation Storage Network Monitoring Data Location Cloud Batch Rack FTP HTTP P2P Vivaldi perf SONAR Replica Catalog Simulated core CI services Simulated production and prototype WMSs Makeflow Moteur Pegasus Research Prototype WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point . . 
Simulators or workflow executions WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WAN Topology WRENCH Batch-Scheduled Compute Service batch job FTP Server Storage Service File X File Y HTTP Server Storage Service File X File X local storage File Registry Service Cloud Compute Service System to Simulate Replica Catalog HTTP Server FTP Server FTP Server Network Proximity Service Vivaldi latency estimates File X local storage File Y Workflow Submission Interface Scientist virtual machines virtual machines workflow description (DAX, JSON) Workflow Management System Decisions / Actions virtual machine virtual machines virtual machines EC2 cloud end-point SLURM batch end-point WRENCH enables novel avenues for scientific workflow use, research, development, and education in the context of large-scale scientific computations and data analyses WRENCH is an open-source library for developing simulators WRENCH exposes several high-level simulation abstractions to provide high-level building blocks for developing custom simulators

Slide 27

Slide 27 text

The WRENCH Software Stack

Slide 28

Slide 28 text

WRENCH Developer API
Objective: Make it easy to implement a complex WMS
WMS is implemented as a single thread of control
No explicit message-based interactions
Simulated CI services provide high-level APIs
WMS registers callbacks for a small set of events
Simple asynchronous interactions with services
Built-in “manager threads” help with managing synchronicity
Straightforward failure handling and failure cause inspection
Bottom line: removes many well-known difficulties when implementing a system that interacts asynchronously with distributed services on failure-prone platforms
Example WMS distributed with WRENCH fits in 200 lines of C++

Slide 29

Slide 29 text

WRENCH User API
Makes it possible to implement a simulator in a few lines of code
  Instantiate a platform (XML description file)
  Create a workflow
  Create CI services and a WMS to run on the platform
  Launch the simulation
  Process/analyze/visualize simulation output
A useful WRENCH simulator can consist of just a few dozen lines of code

Slide 30

Slide 30 text

Building a Simulator
Blueprint for a WRENCH-based simulator
  Create and initialize a simulation
  Instantiate a simulated platform
  Instantiate services on the platform
  Create at least one workflow
  Instantiate at least one WMS per workflow
  Launch the simulation
  Process simulation output
WRENCH + SimGrid internals
  Agent: some code and some private data, running on a given host
  Task: an amount of work to do and of data to exchange
  Host: location on which agents execute
  Mailbox: rendezvous point between agents
    You can send data to a mailbox and receive data from a mailbox
    Communication time between sender and receiver is accounted for (based on the payload) and depends on the network traffic
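
The mailbox abstraction above can be illustrated with a minimal Python sketch. It is not SimGrid's or WRENCH's API: the `Mailbox` class, its `put`/`get` methods, and the cost model (latency plus payload size over a fixed bandwidth, with no contention from other traffic) are assumptions made for illustration only.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Mailbox:
    """Hypothetical rendezvous point between two simulated agents."""
    name: str
    queue: deque = field(default_factory=deque)

    def put(self, payload_bytes, data, send_time, latency_s, bandwidth_bps):
        # Assumed cost model: link latency plus payload size over bandwidth.
        # A real simulator would also account for competing network traffic.
        arrival = send_time + latency_s + payload_bytes * 8 / bandwidth_bps
        self.queue.append((arrival, data))

    def get(self, recv_time):
        # The receiver cannot obtain the data before it has arrived, so its
        # simulated clock advances to the arrival time if it asked earlier.
        # Raises IndexError if nothing was sent (kept simple for the sketch).
        arrival, data = self.queue.popleft()
        return max(recv_time, arrival), data
```

For example, sending a 125 MB payload at simulated time 0 over a 1 Gbps link with 0.25 s latency and calling `get(recv_time=0.5)` returns the data together with a receiver clock advanced past the transfer time.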

Slide 31

Slide 31 text

Case-Study
We built a WRENCH implementation of the Pegasus/DAGMan WMS (with the Developer API)
  WRENCH provides a simulated HTCondor implementation
We built a simulator of Pegasus-driven executions of arbitrary workflows on arbitrary platforms (with the User API)
The simulator simulates
  Application task executions and data transfers
  Auxiliary tasks (directory creations, cleanups, file registrations, job startup/teardown overheads, etc.)
  Delays and message exchanges within DAGMan and HTCondor
Goal: evaluate WRENCH’s ease-of-use, accuracy, and scalability

Slide 32

Slide 32 text

Ease of Use
Pegasus/DAGMan WRENCH implementation
  127 lines of code to read/parse config files
  539 lines of code for WMS logic
Compared to production implementation
  Tens of thousands of lines of code
https://github.com/wrench-project/pegasus
[Figure: the WRENCH Pegasus simulator takes a Pegasus workflow trace, a SimGrid platform file, and a simulation configuration file as input and produces a simulation output file. Simulated components: Submit Node (DAGMan workflow engine, HTCondor schedd), HTCondor Central Manager (negotiator), Data Node (HTTP server, data storage), Worker Nodes (HTCondor startd, scratch space).]

Slide 33

Slide 33 text

Accuracy Methodology
Execute a real-world workflow with Pegasus/DAGMan on a real-world platform
  And run a few simple benchmarks on the platform to measure network bandwidths, latencies, etc.
Simulate that same execution with the WRENCH simulator
Compare real-world and simulated executions
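
One common way to carry out the final comparison step is the relative makespan error, summarized as mean ± standard deviation over repeated runs. The sketch below assumes that metric; the slides do not state which error definition was used, and the function names and example values are hypothetical.

```python
from statistics import mean, stdev


def relative_error_pct(real_makespan_s, simulated_makespan_s):
    """Absolute relative error of the simulated makespan, in percent."""
    return abs(simulated_makespan_s - real_makespan_s) / real_makespan_s * 100.0


def summarize_errors(pairs):
    """Return (mean, sample standard deviation) of the percent errors.

    `pairs` is a sequence of (real_makespan_s, simulated_makespan_s) tuples,
    one per experiment; at least two pairs are needed for the deviation.
    """
    errors = [relative_error_pct(real, sim) for real, sim in pairs]
    return mean(errors), stdev(errors)
```

For instance, `summarize_errors([(1000.0, 1100.0), (2000.0, 1900.0)])` averages a 10% overestimate and a 5% underestimate into a single mean error with its spread.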

Slide 34

Slide 34 text

Experimental Scenarios
Scientific Workflow Application
1000 Genome Sequencing Analysis Workflow
  22 Individual tasks, 7 Population tasks, 22 Sifting tasks, 154 Pair Overlap Mutations tasks, and 154 Frequency Overlap Mutations tasks (Total 359 tasks)
  [Figure: 1000 Genome workflow structure. Data Preparation stage (Individuals, Populations, Sifting) followed by Analysis stage (Pair Overlap Mutations, Frequency Overlap Mutations), with input and output data files.]
Montage
  Two sets of tasks: 573 tasks and 1,240 tasks

Slide 35

Slide 35 text

Experimental Scenarios
Simulated Platform
  SimGrid platform description file
  Data node
  Master server (submit host)
  Worker nodes (4 cores each)
  Modeled as a bare metal system

Slide 36

Slide 36 text

Experimental Scenarios
Simulated Platform: Amazon’s “cloud” platform
  t2.xlarge instances (4 vCPU each, 16 GiB), with WMS and Data node; link bandwidths shown: 0.74 Gbps and 0.44 Gbps
  m5.xlarge instances (4 vCPU each, 54 GiB), with WMS and Data node; link bandwidths shown: 1.24 Gbps and 0.44 Gbps

Slide 37

Slide 37 text

Makespan Accuracy Results
Simulation Results and Accuracy
The simulation of compute and data transfer tasks includes auxiliary tasks (e.g., create_dir, cleanup, and registration) as well as PRE and POST script jobs
It also simulates delays in both the DAGMan and HTCondor daemons
Simulation with WorkflowSim [Chen et al. 2012]
  Errors are 12.09 ± 2.84, 26.87 ± 6.26, 13.32 ± 1.12 (in spite of our calibration efforts)

Slide 38

Slide 38 text

Visual Inspection of CDFs
Simulation Results and Accuracy
Two-sided Kolmogorov-Smirnov goodness-of-fit test (K-S test)
  Null hypothesis H0: the ECDF of the simulated workflow execution fits the ECDF of the real workflow execution
  The null hypothesis is not rejected (p-value > 0.05)
[Figure: Montage (1240 tasks) on AWS-m5.xlarge. Panel A: task submission times, F(Submitted Tasks) vs. Workflow Makespan (s). Panel B: task completion times, F(Completed Tasks) vs. Workflow Makespan (s). Series: pegasus, wrench, workflowsim.]
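
As a rough illustration of the test named above, the following sketch computes the two-sample K-S statistic (the largest gap between two ECDFs) and compares it with the standard large-sample critical value. It is a simplified stand-in, not the analysis code behind the slides: the function names are hypothetical, and the decision uses the asymptotic threshold rather than an exact p-value.

```python
from bisect import bisect_right
from math import log, sqrt


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The ECDFs only change at observed values, so checking those suffices.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def ks_rejects_h0(sample_a, sample_b, alpha=0.05):
    """True if H0 (same distribution) is rejected at level alpha.

    Uses the asymptotic critical value c(alpha) * sqrt((n + m) / (n * m)),
    with c(alpha) = sqrt(-ln(alpha / 2) / 2); accurate for large samples only.
    """
    n, m = len(sample_a), len(sample_b)
    c_alpha = sqrt(-log(alpha / 2) / 2)
    return ks_statistic(sample_a, sample_b) > c_alpha * sqrt((n + m) / (n * m))
```

Here the two samples would be, for example, the task completion times of the real and simulated executions; not rejecting H0 corresponds to the "p-value > 0.05" outcome reported on the slide.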

Slide 39

Slide 39 text

Gantt Charts
[Figure: Gantt charts of executions of Montage-2.0 on AWS-m5.xlarge. Panel A: pegasus. Panel B: wrench. Axes: Tasks vs. Makespan (s).]

Slide 40

Slide 40 text

Gantt Charts
[Figure: Gantt charts of executions of Montage-2.0 on AWS-m5.xlarge. Panel A: pegasus. Panel B: wrench. Axes: Tasks vs. Makespan (s).]
The charts have different shapes because tasks are submitted in different orders, a consequence of the different data structures used by the simulated and real-world implementations

Slide 41

Slide 41 text

Scalability
[Figure: simulation time (s) and memory usage (MB) vs. number of workflow tasks (1,000 to 10,000), for workflowsim and wrench-1.2.]

Slide 42

Slide 42 text

Supporting Research
Ongoing WRENCH-enabled Projects
Simulating VIP (Virtual Imaging Platform)
  VIP targets the execution of medical imaging workflow applications on the BioMed Virtual Organization resources provided on the EGI (European Grid Initiative) platforms
  The objective is to optimize workflow executions via better decision-making strategies
  WRENCH is used to simulate novel
    Data-replication strategies
    Pilot job submission strategies for batch-scheduled clusters
    Cluster selection strategies
Efficient workflow executions on batch-scheduled clusters
  Batch-scheduled clusters are not ideally suited to workflow applications, and yet they represent the majority of HPC execution platforms
  A key question is: how should workflow tasks be aggregated into batch jobs?
  One approach is to design task aggregation strategies that try to account for the dynamics of the batch queues
  WRENCH simulations are used to drive the design of such strategies
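
To make the aggregation question concrete, here is a deliberately naive baseline: tasks at the same workflow level are packed into batch jobs of a fixed maximum size. This is a hypothetical illustration only; it ignores batch queue dynamics entirely and is not one of the strategies studied in the project.

```python
def aggregate_by_level(levels, max_tasks_per_job):
    """Group tasks into batch jobs, never mixing workflow levels.

    `levels` is a list of lists: levels[i] holds the task names at depth i
    of the workflow DAG, so tasks within one level are mutually independent.
    """
    if max_tasks_per_job < 1:
        raise ValueError("max_tasks_per_job must be at least 1")
    jobs = []
    for level in levels:
        # Slice each level into consecutive chunks of the allowed size.
        for start in range(0, len(level), max_tasks_per_job):
            jobs.append(level[start:start + max_tasks_per_job])
    return jobs
```

A simulation-driven study would replace the fixed chunk size with a choice informed by predicted queue waiting times, which is the kind of design question the slide describes.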

Slide 43

Slide 43 text

Software Availability
Code Repository, Releases, Software Engineering Process
Open-source repository: https://github.com/wrench-project/wrench
Continuous Integration: https://travis-ci.org/wrench-project/wrench
Tests Coverage: https://coveralls.io/github/wrench-project/wrench
Code Review: https://sonarcloud.io/dashboard?id=wrench
Releases
  1.2 (November 7, 2018)
  1.1 (August 26, 2018)
  1.0.1 (August 14, 2018)
  1.0 (June 16, 2018)
Upcoming releases (estimated)
  1.3 (January 2019)

Slide 44

Slide 44 text

Education
WRENCH Stand-alone Pedagogical Module
http://wrench-project.org/wrench-pedagogic-modules
It is crucial to teach undergraduate students parallel and distributed computing
But it is not easy
  giving students access to sufficiently diverse and realistic software/hardware platforms
  dealing with platform down-times and instabilities
  dealing with time-consuming and possibly costly executions
Simulation resolves these difficulties, and WRENCH provides the foundation for pedagogic modules on parallel and distributed computing that use workflows as a motivating context

Slide 45

Slide 45 text

Simulation Building Blocks: Prototype implementations of Workflow Management System (WMS) components and underlying algorithms
Simulation Accuracy: Captures the behavior of a real-world system with as little bias as possible via validated simulation models
Scalability: Low ratio of simulation time to simulated time, ability to run large simulations on a single computer with low compute, memory, and energy footprints
Reproducible Results: Enable the reproduction or repetition of published results by a party working independently using the same or different simulation models

Slide 46

Slide 46 text

Thank You
Get Started: http://wrench-project.org