SimGrid, Versatile Simulation of Distributed Systems

SciTech
January 23, 2015

In most scientific domains, results are now obtained thanks to computational science, which relies heavily on numerical simulations. This in turn leads to a tremendous increase in the size and complexity of the required computing infrastructures. Assessing the performance of such distributed systems and of the applications they run is a complex task for which various approaches can be considered. This talk will give a general overview of these approaches to the performance assessment of distributed systems and applications: experimentation, emulation, and simulation. It will specifically focus on the main features and strengths of the SimGrid toolkit. SimGrid is a 15-year-old research project whose scope has been broadened over the years from the simulation of computing grids to P2P systems, clouds, and HPC. The talk will also introduce the different programming APIs and the underlying validated network models, and give some scalability results in various application domains.

Transcript

  1. Versatile Simulation of Distributed Systems

    Frédéric Suter, on behalf of the SimGrid Team
    January 23, 2015
  2. What is Science?

    Doing Science = Acquiring Knowledge
    Experimental Science: thousands of years ago; observation-based; can describe
    Theoretical Science: last few centuries; equation-based; can understand
    Computational Science: nowadays; compute-intensive; can simulate
    SimGrid Team – Versatile Simulation of Distributed Systems 2/28
  3. Use of Computers in Modern Science

    High Throughput Computing (HTC)
      Huge amount of data + millions of jobs on “cheap” computers = Higgs boson discovery
    High Performance Computing (HPC)
      Huge parallel simulation + finely tuned expensive machine = comparison with observations
  4. What is Computer Science (in this context)?

    Deal with huge amount of data
      Data placement, distribution, replication
      Provenance, mining
    Heterogeneous distributed infrastructures
      Job and workflow scheduling
      Scientific gateways, Cloud
    Tightly coupled parallel applications
      Performance analysis
      Optimization
      Resilience, algorithms
    Massively parallel machines
      High Performance Network interconnects
      GPUs, Many-Cores, low-power nodes
  5. SimGrid Scope: Distributed Systems

    Intensive Computing: High Performance Computing / Computational Grids
      Computational science infrastructure: massive / federated systems
      Main issues: be TOP500 #1 / compatibility, trust, accountability
    Cloud Computing
      Large infrastructures underlying commercial Internet (eBay, Amazon, Google)
      Main issues: optimize costs; keep up with the load (flash crowds)
    P2P Systems
      Exploit resources at network edges (storage, CPU, human presence)
      Main issues: volatility (churn); network locality; anonymity
    Production systems, with hard-to-assess characteristics
      Correctness: absence of crashes, race conditions, deadlocks, and other defects
      Performance: makespan, economics, energy, ... ← main focus of SimGrid
  9. Studying Distributed Applications

    Correctness Study: Formal Methods
      Tests: do not provide definitive answers
      Model-Checking: exhaustive and automated exploration of the state space
    Performance Study: Experimentation
      Maths: often not sufficient to fully understand systems
      Experimental Facilities: real applications on real platforms (in vivo)
      Emulation: real applications on synthetic platforms (in vitro)
      Simulation: prototypes of applications on models of the system (in silico)
    Claim: Simulation is both sound and convenient
      Less simplistic than proposed theoretical models
      Easier and faster than experimental platforms
  10. What is Simulation Good for?

    Fastest Path from Idea to Results
      [Figure: Idea or MPI code + Experimental Setup + Models ⇝ Simulation ⇝ Scientific Results]
      Get results from partial implementations
      Run thousands of experiments within a week
      Test the scientific idea without bothering with details
    Easiest way to study distributed applications/systems
      Centralization: Distribution is only simulated
      High Reproducibility: Everything is controlled
      Clairvoyance: Observe the hidden behaviors
      What-if analysis: Measure the impact of condition changes
      Eco-Friendly: No resource waste for debug and tests
  11. Computational Science of Distributed Systems?

    Requirements for a Scientific Approach
      Reproducible results: read a paper, reproduce the results and improve
      Standard tools that grad students can learn quickly
    Current practice in the field is quite different
      Experimental settings not detailed enough in literature
      Many short-lived simulators; few sound and established tools
        Grid/Cloud: OptorSim, GridSim, GroudSim, CloudSim, iCanCloud, ...
        Volunteer Computing: SimBA, EmBOINC, SimBOINC, ...
        P2P: PeerSim, P2PSim, OverSim, ...
        HPC: PSINS, LogGOPSim, BigSim, MPI-SIM, ...
        ...
  12. SimGrid: a Versatile Simulation Toolkit

    Scientific Instrument and Scientific Object
      Developed for 15 years (personal contribution since 2009)
      Simulate real and abstract programs
      Validated, Scalable, Usable, Modular, Portable
      Comparison of network and middleware performance models
    [Figure: toolkit overview. Inputs: applications (scheduling algorithms, DAG generator, application workload, MPI applications, Cloud, P2P), input parameters, platform topology, application deployment, availability changes. Simulation kernel and APIs: SURF, SimIX, SMPI, MSG, SimDAG, MPI replay. Models: TCP, disk, CPU. Outcomes: logs, timed traces, time-independent traces, Jedule.]
  13. SimGrid History

    1998-2001 Baby steps: Factorize some code between PhD students in scheduling
    2001-2003 Infancy: CSP and improved models
    2003-2008 Teenage: Performance, validity, multi-APIs
    2008-2011 Maturation: Scope increase to P2P; visualization
    2012-: Taking the world over :)
      Further scope increase to HPC and Cloud
      Added methodologies: emulation, verification
      Mature ecosystem and community
    [Figure: timeline from 1998 to 2015 showing releases (SG 1.0, SG 2, SG 3, then 2.99 and 3.0 through 3.11), contributing sites (UCSD; UCSD+Lyon; UCSD, Grenoble, Lyon, Nancy, ...), and project labels (ADT SG, USS SimGrid, ANR, ODL SG, SONGS, ANR INFRA, HEMERA WG)]
  14. Quick Overview of Internals Organization

    User-visible SimGrid Components
      MSG: heuristics as Concurrent Sequential Processes (Java/Ruby/Lua bindings)
      SimDag: heuristics as DAG of (parallel) tasks
      SMPI: simulate real applications written using MPI
      XBT: grounding features (logging, etc.), data structures (lists, etc.), portability
      [Figure: component boxes. SimDag: framework for DAGs of parallel tasks. MSG: simple application-level simulator. SMPI: library to run MPI applications on top of a virtual environment.]
    SimGrid is strictly layered internally
      MSG: User-friendly syntactic sugar
      Simix: Processes, synchro (SimPosix)
      SURF: Resource usage interface
      Models: Action completion computation
      [Figure: layer stack MSG / SIMIX / SURF / LMM. User-code processes and conditions map to actions with their remaining work (372, 435, 245, ...), which map to LMM variables x1 ... xn under constraints of the form x_i + ... ≤ C_P, C_L1, C_L2, C_L3, C_L4.]
  15. Simulation Validity

    State of the art: models in most simulators are either simplistic, wrong, or not assessed
      PeerSim: discrete time, application as automaton
      GridSim: naive packet level or buggy flow sharing
      OptorSim, GroudSim: documented as wrong on heterogeneous platforms
    SimGrid provides several network models
      Fast flow-based model, toward realism and speed (by default)
        Accounts for contention, slow-start, TCP congestion, cross-traffic effects
      Constant time: a bit faster, but no hope of realism
      Coordinate-based: easier to instantiate in P2P scenarios
      Packet-level: NS3 bindings
    Controlled by command-line switches (exact comparison on a given application)
  16. Max-Min Fairness between Network Flows

    [Figure: x_1 uses CPU1; x_2 and x_3 use CPU2; flows ρ_1 and ρ_2 cross link1; flows ρ_1 and ρ_3 cross link2]
      $x_1 \le \mathrm{Power}_{CPU_1}$  (1a)
      $x_2 + x_3 \le \mathrm{Power}_{CPU_2}$  (1b)
      $\rho_1 + \rho_2 \le \mathrm{Capacity}_{link_1}$  (1c)
      $\rho_1 + \rho_3 \le \mathrm{Capacity}_{link_2}$  (1d)
    Computing the sharing between flows
      Objective function: maximize $\min_{f \in F}(\rho_f)$ [Massoulié & Roberts 2003]
      Equilibrium: increasing any $\rho_f$ decreases a $\rho_{f'}$ (with $\rho_f > \rho_{f'}$)
      (actually, that's a simplification of our real objective function)
    Efficient Algorithm
      1. Search for the bottleneck link $l$ such that $C_l / n_l = \min \{ C_k / n_k,\ k \in L \}$
      2. This sets the share of any flow $f$ on this link: $\rho_f = C_l / n_l$
      3. Update all $n_k$ and $C_k$ to remove these flows; loop until all $\rho_f$ are fixed
  17. Max-Min Fairness Example: Homogeneous Linear Network

    [Figure: three flows (flow 0, flow 1, flow 2) over two links (link 1, link 2)]
    All links have the same capacity C
      Initial state: $C_1 = C$, $n_1 = 2$; $C_2 = C$, $n_2 = 2$
    Each of them is limiting. Let's choose link 1
      ⇒ $\rho_0 = C/2$ and $\rho_1 = C/2$
    Remove flows 0 and 1; update the links' capacities
      ⇒ $C_1 = 0$, $n_1 = 0$; $C_2 = C/2$, $n_2 = 1$
    Link 2 sets $\rho_2 = C/2$ (leaving $C_2 = 0$, $n_2 = 0$)
    We are done computing the bandwidths $\rho_f$
    SimGrid implementation is efficient
      Lazy updates, trace integration, preserving cache locality
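
The bottleneck-first procedure from the two preceding slides can be sketched in a few lines of Python. This is only an illustration, not SimGrid's implementation (which the slides describe as using lazy updates and as optimizing a more elaborate objective). The function name `max_min_share`, the dictionary-based inputs, and the assumption that flow 0 is the one crossing both links are all hypothetical choices made for the example.

```python
from fractions import Fraction


def max_min_share(capacity, routes):
    """Max-min fair rates by repeatedly saturating the bottleneck link.

    capacity: dict mapping link name -> capacity
    routes:   dict mapping flow id -> iterable of links the flow crosses
    Assumes every flow crosses at least one link.
    """
    remaining = dict(capacity)
    active = {flow: set(links) for flow, links in routes.items()}
    if any(not links for links in active.values()):
        raise ValueError("every flow must cross at least one link")
    share = {}
    while active:
        # n_k: number of still-unfixed flows on each link
        load = {}
        for links in active.values():
            for link in links:
                load[link] = load.get(link, 0) + 1
        # Step 1: bottleneck link minimizes C_k / n_k (ties broken by name)
        bottleneck = min(load, key=lambda k: (remaining[k] / load[k], str(k)))
        rate = remaining[bottleneck] / load[bottleneck]
        # Step 2: every flow on the bottleneck gets that share
        fixed = [flow for flow, links in active.items() if bottleneck in links]
        # Step 3: remove these flows and update the capacities they used
        for flow in fixed:
            share[flow] = rate
            for link in active[flow]:
                remaining[link] -= rate
            del active[flow]
    return share


# Linear network with two links of capacity C = 1 (assumed topology:
# flow 0 crosses both links, flow 1 only link 1, flow 2 only link 2).
C = Fraction(1)
example = max_min_share(
    {"link1": C, "link2": C},
    {0: ["link1", "link2"], 1: ["link1"], 2: ["link2"]},
)
```

With these inputs every flow ends up with C/2, matching the slide. Exact fractions are used so that capacities reach exactly zero instead of accumulating floating-point error.
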
  21. Back on the internals

    [Figure: layered architecture]
      User's APIs (MSG, SMPI, SIMDAG): application specified as concurrent code or as a task graph
      SIMIX: handling of concurrent user's actions (concurrent processes, condition variables)
      SURF: computation of resource sharing and actions' progress (activities with their remaining work; variables x1 ... xn; constraints ≤ CP1, ≤ CL1, ≤ CL2, ..., ≤ CLm derived from resource capacities and the interconnections specification)
  26. What do others do in grid or cloud simulation?

    Naive flow models documented as wrong
      [Figure: three settings, each shown with its expected output and the simulator's output; bandwidth labels B = 100, B = 100, B = 20]
      Known issue in Narses (2002), OptorSim (2003), GroudSim (2011).
    Validation by general agreement
      “Since SimJava and GridSim have been extensively utilized in conducting cutting edge research in Grid resource management by several researchers, bugs that may compromise the validity of the simulation have been already detected and fixed.” – CloudSim, ICPP’09
      [Figure: setting, expected output, and output, with bandwidth labels B]
      Buggy flow model (GS 5.2, 11/2010). Similar issues with naive packet-level models
  27. SimGrid Network Model for MPI

    Measurements
      [Figure: duration (seconds) versus message size (bytes) for MPI_Send and MPI_Recv, logarithmic axes, points grouped as Small, Medium1, Medium2, Detached, Large]
    Hybrid Model
      Asynchronous ($k \le S_a$)
      Detached ($S_a < k \le S_d$)
      Synchronous ($k > S_d$)
      [Figure: timing diagrams between sender P_s and receiver P_r, with intervals T_1 to T_4]
    Fluid model: account for contention and network topology
      [Figure: cluster topology with node groups 1-39, 40-74, 75-104, 105-144, 1G and 10G up/down links, and limiters (13G, 1.5G)]
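
The three regimes of the hybrid model amount to a piecewise choice on the message size k. A minimal sketch follows; the function name `mpi_send_mode` and the threshold values used in the usage lines are hypothetical placeholders, since the slide gives the thresholds only symbolically as S_a and S_d (they would be calibrated from measurements such as those plotted above).

```python
def mpi_send_mode(message_size, s_a, s_d):
    """Return the regime of the hybrid model for a message of `message_size` bytes.

    s_a: upper bound of the asynchronous regime (S_a on the slide)
    s_d: upper bound of the detached regime (S_d on the slide)
    """
    if not 0 <= s_a <= s_d:
        raise ValueError("thresholds must satisfy 0 <= S_a <= S_d")
    if message_size <= s_a:
        return "asynchronous"   # k <= S_a
    if message_size <= s_d:
        return "detached"       # S_a < k <= S_d
    return "synchronous"        # k > S_d


# Placeholder thresholds, chosen only to exercise the three branches.
S_A, S_D = 1024, 65536
modes = [mpi_send_mode(k, S_A, S_D) for k in (512, 1024, 4096, 65536, 1 << 20)]
```
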
  28. And this actually works!

    Sweep3D: simple application (but not trivial), predicted in all details
    Graphene (16 procs), OpenMPI, TCP, Gigabit Ethernet
    Achieved without overfitting :)
  29. Reality is often ... surprising

    TCP collapse: NAS CG on Graphene, with 128 processes
      Highly congested TCP reduces the emissions
      When speed reaches 0, it times out after 200 ms, resets, and starts over
      (TCP RTO should help alleviate this bug, but doesn't)
    We could model these effects but actually, you want to fix reality
    We wanted to understand the systems with models, what a success!
  30. SimGrid Scalability

    Simulation versatility should not hinder scalability
      Two aspects: big enough (large platforms) ⊕ fast enough (large workload)
    Versatile yet scalable platform descriptions
      Hierarchical organization in ASes: cuts down complexity; recursive routing
      Efficient on each classical structure: Flat, Floyd, Star, Coordinate-based
      Allow bypass at any level
    Grid’5000 platform in 22 KiB (10 sites, 40 clusters, 1,500 nodes)
    King’s dataset in 290 KiB (2,500 nodes, coordinate-based)
    [Figure: hierarchy of ASes (AS1, AS2, AS4, AS5 containing AS5-1 to AS5-4, AS6, AS7), each labeled with its routing mode: Empty + coords, Full, Dijkstra, Floyd, Rule-based]
  31. How big and how fast? (1/3 – Grid and VC)

    Comparison to GridSim
      A master distributes 500,000 fixed-size jobs to 2,000 workers (round robin)

                       GridSim              SimGrid
      Network model    delay-based model    flow model
      Topology         none                 Grid5000
      Time             1h                   14s
      Memory           4.4GB                165MB

    Volunteer Computing settings
      Loosely coupled scenario as in BOINC
      SimGrid: full modeling (clients and servers), precise network model
      SimBA: servers only, decisions based on simplistic Markov modeling
      SimGrid shown to be 25 times faster
  32. How big and how fast? (2/3 – P2P)

    Scenario: Initialize Chord, and simulate 1,000 seconds of protocol
    Arbitrary time limit: 12 hours (kill simulation afterward)
    [Figure: running time in seconds (0 to 40,000) versus number of nodes (0 to 2,000,000) for OverSim (OMNeT++ underlay), OverSim (simple underlay), PeerSim, SimGrid (flow-based), SimGrid (delay-based)]

    Largest simulated scenario
      Simulator            size    time
      OverSim (OMNeT++)    10k     1h40
      OverSim (simple)     300k    10h
      PeerSim              100k    4h36
      SG (flow-based)      10k     130s
                           300k    32mn
                           2M*     6h23
      SG (delay-based)     2M      5h30
      * 36GB = 18kB/process (16kB for the stack)

    Orders of magnitude more scalable than state-of-the-art P2P simulators
    Precise model incurs a ≈ 20% slowdown, but accuracy is not comparable
    Also, parallel simulation (faster simulation at scale); distributed simulation ongoing
  33. How big and how fast? (3/3 – HPC)

    Simulating a binomial broadcast
      [Figure: simulation time (s, logarithmic scale from 0.01 to 10,000) versus log2 of the number of processes (10 to 24), SimGrid versus LogGOPSim]
    Model
      SimGrid: contention + cabinets hierarchy
      LogGOPSim: simple delay-based model
    Results
      SimGrid is roughly 75% slower
      SimGrid uses about 20% more memory (15 GB required for $2^{23}$ processors)
    Genericity of SimGrid data structures ⇒ slight overhead
    BUT Scalability loss of realism
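
For readers unfamiliar with the benchmark, the communication pattern of a binomial-tree broadcast can be sketched as below. This is the generic textbook pattern rooted at rank 0, which is an assumption: the slide does not say which exact variant was simulated, and the function name `binomial_broadcast_schedule` is hypothetical.

```python
def binomial_broadcast_schedule(num_procs):
    """Return the rounds of a binomial-tree broadcast rooted at rank 0.

    Each round is a list of (sender, receiver) pairs. In the round with
    distance `step`, every rank below `step` already holds the data and
    forwards it to rank + step, so the set of informed ranks doubles.
    """
    if num_procs < 1:
        raise ValueError("need at least one process")
    rounds = []
    step = 1
    while step < num_procs:
        sends = [(src, src + step) for src in range(step) if src + step < num_procs]
        rounds.append(sends)
        step *= 2
    return rounds


schedule_8 = binomial_broadcast_schedule(8)
schedule_5 = binomial_broadcast_schedule(5)
```

The number of rounds grows as the base-2 logarithm of the process count, which is why the plot uses log2 of the number of processes on its horizontal axis.
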
  36. What about workflows?

    X workflows x Y platforms x Z heuristics
      [Figure: example workflow DAG with tasks Root, 1 to 6, End]
      For each task do
        Select resource
        Schedule task
      end do
    SimDag is the API of choice (together these form a workflow scheduling simulator)
      1. Describe DAGs (possibly in the DAX format)
      2. Describe resources
      3. Write scheduling heuristics
    DAX format support
      Actual DAXes from the workflow archive
        Jobs include a runtime attribute
        Do not reflect the executable workflow
      Lacks support for the auxiliary info
      Easy to use: SD_daxload("my_dax_file.xml")
        Returns a dynamic array of SimDag tasks
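
The "for each task: select resource, schedule task" loop can be illustrated without SimGrid at all. The sketch below is a plain-Python stand-in, not SimDag code: it walks a DAG in topological order and greedily places each task on the host that finishes it earliest. The names `list_schedule`, `work`, `predecessors`, and `speeds` are hypothetical, and communication costs are ignored, which a real SimDag study would not do.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def list_schedule(work, predecessors, speeds):
    """Greedy earliest-finish-time scheduling of a task DAG.

    work:         dict task -> amount of computation (flops)
    predecessors: dict task -> iterable of tasks that must finish first
                  (all of them must also appear in `work`)
    speeds:       dict host -> speed (flops per second)
    Returns (placement, finish) dicts keyed by task.
    """
    graph = {task: set(predecessors.get(task, ())) for task in work}
    host_free_at = {host: 0.0 for host in speeds}
    placement, finish = {}, {}
    for task in TopologicalSorter(graph).static_order():
        # A task becomes ready once all of its predecessors have finished.
        ready = max((finish[p] for p in graph[task]), default=0.0)

        def finish_on(host):
            return max(ready, host_free_at[host]) + work[task] / speeds[host]

        # Select resource: host with the earliest finish time (ties by name).
        best = min(sorted(speeds), key=finish_on)
        # Schedule task on that host.
        finish[task] = finish_on(best)
        host_free_at[best] = finish[task]
        placement[task] = best
    return placement, finish


# Diamond-shaped workflow on two identical hosts.
work = {"root": 10, "a": 20, "b": 20, "end": 10}
predecessors = {"a": ["root"], "b": ["root"], "end": ["a", "b"]}
placement, finish = list_schedule(work, predecessors, {"h1": 10.0, "h2": 10.0})
```

Swapping the `min(...)` selection rule for another policy is the "Z heuristics" dimension of the experiment space mentioned on the slide.
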
  37. Visualizing SimGrid Simulations

    Visualization scriptable: easy but powerful configuration; scalable tools
    Right information: both platform and applicative visualizations
    Right representation: Gantt charts, spatial representations, tree-graphs
    Easy navigation in space and time: selection, aggregation, animation
    Easy trace comparison: trace diffing (still partial at the moment)
    [Figure: time slices aggregated in space (1st and 2nd space aggregation) over groups GroupA and GroupB]
  38. Dynamic Verification with SimGrid

    Verifying safety and liveness properties
      Works on real C code, using DWARF to introspect state
      Explicitly explores the execution graph
      DPOR-based reduction techniques (safety only) or state equality reduction
      Mostly suited for bug finding (no certification)
    [Figure: explored execution graph, with numbered states and transitions labeled iSend, iRecv, Wait, Test TRUE, Test FALSE, MC_RANDOM (0), MC_RANDOM (1)]
    Current state
      Usable in MSG and SMPI (in C)
      Found wild bugs in medium-sized programs (Chord protocol)
      Verified collectives of MPICH (in minutes)
    Ongoing work
      Verify larger applications
      Ensure send determinism (for checkpointing)
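
Explicit exploration of an execution graph can be shown on a toy example. The sketch below is not SimGrid's model checker: it is a generic breadth-first search over interleavings of two hypothetical processes that increment a shared counter, with a plain visited set standing in (very loosely) for state equality reduction. DPOR is not implemented, and all names are invented for the illustration.

```python
from collections import deque


def explore(initial, successors, is_bad):
    """Exhaustively explore reachable states; return a bad state or None."""
    seen = {initial}            # states already visited are never re-expanded
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if is_bad(state):
            return state
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None


# A state is (counter, program counters, local copies), one entry per process.
def racy_successors(state):
    """Each process reads the counter, then writes back its copy plus one."""
    counter, pcs, local = state
    for i, pc in enumerate(pcs):
        if pc == 0:    # read step
            yield (counter, pcs[:i] + (1,) + pcs[i + 1:],
                   local[:i] + (counter,) + local[i + 1:])
        elif pc == 1:  # write step
            yield (local[i] + 1, pcs[:i] + (2,) + pcs[i + 1:], local)


def atomic_successors(state):
    """Each process increments the counter in a single indivisible step."""
    counter, pcs, local = state
    for i, pc in enumerate(pcs):
        if pc == 0:
            yield (counter + 1, pcs[:i] + (2,) + pcs[i + 1:], local)


def lost_update(state):
    """Safety violation: everyone is done but an increment went missing."""
    counter, pcs, _ = state
    return all(pc == 2 for pc in pcs) and counter != len(pcs)


INITIAL = (0, (0, 0), (0, 0))
racy_bug = explore(INITIAL, racy_successors, lost_update)
atomic_bug = explore(INITIAL, atomic_successors, lost_update)
```

The racy version yields a counterexample state in which both processes have finished but the counter is 1; the atomic version has no reachable violation, so `explore` returns `None`. As on the slide, this style of search is suited to finding bugs rather than certifying their absence in general.
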
  39. Practical Trust in SimGrid?

    Internal code base rather complex because of hacks for versatile efficiency
    Continuous Integration
      Current version tested every night
      450 integration tests; 10,000 unit tests; 70% coverage
      2 SimGrid configurations on 10 Linux versions
      Performance regression testing soon operational
    Release tests
      Windows and Mac considered as additional release goals
      Actually works on all Debian arch.: hurd, kfreebsd, mips, arm, ppc, s390
    This is free software anyway
      The code base is currently LGPL (probably soon GPL)
      Come, check it out and participate! (5 of 25 committers not affiliated to us)
  40. Take Away Messages

    SimGrid will prove helpful to your research
      Versatile: Used in several communities (scheduling, Grids, HPC, P2P, Clouds)
      Accurate: Model limits known thanks to validation studies
      Sound: Easy to use, extensible, fast to execute, scalable to death, well tested
      Open: LGPL; user community much larger than contributors group
      Around for 15 years, and ready for at least 15 more years
    Welcome to the Age of (Sound) Computational Science
      Discover: http://simgrid.org/
      Learn: several 101 tutorials, user manual, and examples
      Join: user mailing list, #simgrid on irc.debian.org
      We even have some open positions ;)