
Model Discovery and Graph Simulation: A Lightweight Gateway to Chaos Engineering

Presentation for ICSE-NIER 2026 on discovering lightweight service models from distributed traces and using graph simulation to estimate availability before live chaos experiments. The deck covers the core method, matched live-vs-model evidence, current limitations, and the roadmap toward practical chaos-engineering workflows.


Anatoly A. Krasnovsky

April 15, 2026


Transcript

  1. Title: Model Discovery and Graph Simulation: A Lightweight Gateway to Chaos Engineering

    ICSE-NIER 2026 • Research Talk
    Author: Anatoly A. Krasnovsky (Innopolis University, Russia; MB3R Lab, Russia)
    Talk thesis: discover the model from existing artifacts, not by hand. The discovered topology gives a cheap first-pass availability estimate before live chaos.
    1 / 10
  2. Problem: Broad live chaos cannot cover the full risk space.

    Scenario space: services × endpoints × joint failures; the combinatorics outrun safe live coverage.
    Risk budget: each live fault spends blast radius plus operator time; every probe consumes real resilience budget.
    Continuity: topology drifts faster than broad campaigns rerun; full-campaign refresh lags behind change.
    A useful first pass has to be cheap enough for CI and low-risk enough to run continuously. Even the controlled proof later in this talk already needs a large matched run matrix.
    2 / 10
  3. Novelty: Handcrafted models break under microservice change.

    Architecture recovery: blocking dependencies are still reconstructed by hand; manual path recovery returns every release.
    Consistency upkeep: model drift (edges, replicas, semantics) reopens validation work after every deploy.
    Repair after change: each platform move forces model reconciliation; every stack shift reopens validation work.
    So the contribution has to be discovery, not model upkeep.
    3 / 10
  4. Pipeline: From passive artifacts to a typed blocking graph.

    Input sources: traces, service mesh, manifests, API contracts, SLO-as-Code. Discovery is source-agnostic; fusion is optional when multiple signals exist.
    01 Ingest sources: lift service calls from emitted artifacts; a single source works, fusion is optional.
    02 Normalize identities: stable service names, replicas, and entrypoints; collapse deployment-specific aliases.
    03 Type blocking edges: keep required sync paths; flag async/optional edges when detectable; prune what the estimator should ignore.
    04 Make it analyzable: retain provenance; contract SCCs into a DAG; emit an explicit graph artifact.
    05 Simulate outages: sample fail-stop replica loss and reachability; produce CI-speed availability estimates.
    Artifact: typed blocking DAG + tagged entrypoints. Checks: provenance stays attached for inspection. Estimate: Monte Carlo availability in minutes.
    4 / 10
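    The SCC-contraction step (04) can be sketched in a few lines. The service names and the retry cycle below are hypothetical illustrations, not from the evaluated system; the point is only that contracting each strongly connected component yields a DAG for downstream analysis.

    ```python
    from collections import defaultdict

    # Hypothetical call graph with a cycle between "orders" and "billing".
    CALLS = {
        "gateway": ["orders"],
        "orders": ["billing", "stock"],
        "billing": ["orders"],  # cycle: orders <-> billing
        "stock": [],
    }

    def tarjan_scc(graph):
        """Return the strongly connected components (Tarjan's algorithm)."""
        index, low, on_stack, stack, sccs = {}, {}, set(), [], []
        counter = [0]

        def strongconnect(v):
            index[v] = low[v] = counter[0]
            counter[0] += 1
            stack.append(v)
            on_stack.add(v)
            for w in graph[v]:
                if w not in index:
                    strongconnect(w)
                    low[v] = min(low[v], low[w])
                elif w in on_stack:
                    low[v] = min(low[v], index[w])
            if low[v] == index[v]:  # v is the root of an SCC
                comp = []
                while True:
                    w = stack.pop()
                    on_stack.discard(w)
                    comp.append(w)
                    if w == v:
                        break
                sccs.append(frozenset(comp))

        for v in graph:
            if v not in index:
                strongconnect(v)
        return sccs

    def contract_to_dag(graph):
        """Contract each SCC to a single node so the result is a DAG."""
        owner = {v: comp for comp in tarjan_scc(graph) for v in comp}
        dag = defaultdict(set)
        for v, deps in graph.items():
            dag.setdefault(owner[v], set())
            for w in deps:
                if owner[v] != owner[w]:  # drop intra-SCC edges
                    dag[owner[v]].add(owner[w])
        return dict(dag)
    ```

    On this toy input the orders/billing cycle collapses into one node, leaving three DAG nodes in total; provenance tagging (which source contributed each edge) would ride along as edge attributes in a fuller version.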
  5. Minimal model: The minimal model needs only three rules.

    Assumption: independent fail-stop faults.
    Blocking path: only required synchronous calls count.
    Service alive: a service is alive if any replica survives; replica count stands in for redundancy.
    Endpoint success: an endpoint succeeds if a required reachable path remains; reachability is computed on the alive subgraph.
    [Diagram: minimal example with an entry endpoint, a blocking edge from service A to service B, and r = 2 replicas on B; with one replica failed B survives and the endpoint succeeds, with both failed the endpoint fails. Non-blocking edges are ignored.]
    5 / 10
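    The three rules fit in a short Monte Carlo sketch. The toy topology, service names, and replica counts below are hypothetical, assuming independent per-replica fail-stop faults as stated on the slide; with only blocking edges, every service reachable from the entrypoint must be alive for the endpoint to succeed.

    ```python
    import random

    # Hypothetical blocking graph: caller -> required synchronous callees.
    EDGES = {"frontend": ["auth", "catalog"], "auth": [], "catalog": ["db"], "db": []}
    REPLICAS = {"frontend": 2, "auth": 2, "catalog": 2, "db": 1}
    ENTRYPOINT = "frontend"

    def endpoint_ok(alive):
        """Rule 3: the endpoint succeeds iff every service on its required
        blocking closure is alive (reachability on the alive subgraph)."""
        if not alive[ENTRYPOINT]:
            return False
        stack, seen = [ENTRYPOINT], set()
        while stack:
            svc = stack.pop()
            if svc in seen:
                continue
            seen.add(svc)
            for dep in EDGES[svc]:  # rule 1: blocking edges only
                if not alive[dep]:
                    return False
                stack.append(dep)
        return True

    def estimate_availability(p_fail, trials=100_000, seed=0):
        rng = random.Random(seed)
        ok = 0
        for _ in range(trials):
            # Rule 2: a service is alive if any replica survives
            # independent fail-stop loss with probability p_fail.
            alive = {s: any(rng.random() >= p_fail for _ in range(r))
                     for s, r in REPLICAS.items()}
            ok += endpoint_ok(alive)
        return ok / trials
    ```

    For this small DAG the estimate can be cross-checked analytically: every service is required, so availability is the product over services of (1 - p_fail^r), which the sampled estimate approaches as trials grow.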
  6. Proof by instance: Traces alone are enough for a matched proof by instance.

    01 System under test: DeathStarBench Social Network.
    02 Trace-mined graph: Jaeger dependencies only.
    03 Evaluation frame: observed live vs. simulated graph.
    04 Evidence base: matched workload-fault matrix.
    Source: traces only. Modes: 2 deployment modes. Faults: 5 failure fractions. Jobs: 250 CI jobs. Simulations/job: 500,000 graph sims. Live windows/job: 450.
    Why the comparison is fair: same workload profile, same deployment mode, same fault fraction, same observation window.
    Live availability metric: 5xx + socket errors + timeouts.
    The idea is source-agnostic overall; this instance is traces-only.
    6 / 10
  7. Result: Discovered model closely tracks live outcomes.

    [Scatter plot: model availability R_model vs. live availability R_live for the norepl and repl modes, plotted against the ideal line x = y.]
    p_fail | Model (repl / norepl) | Live (repl / norepl)
    0.1    | 0.6281 / 0.4182       | 0.6969 / 0.5533
    0.3    | 0.3054 / 0.1613       | 0.3054 / 0.1775
    0.5    | 0.1145 / 0.0454       | 0.0958 / 0.0376
    0.7    | 0.0132 / 0.0014       | 0.0155 / 0.0067
    0.9    | 0.0000 / 0.0000       | 0.0000 / 0.0000
    Agreement: Pearson correlation r ≈ 0.992 across all 10 matched conditions.
    7 / 10
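    The r ≈ 0.992 agreement can be reproduced directly from the table's ten matched (model, live) pairs:

    ```python
    from math import sqrt

    # The ten matched conditions from the table: repl and norepl
    # at each failure fraction in {0.1, 0.3, 0.5, 0.7, 0.9}.
    model = [0.6281, 0.4182, 0.3054, 0.1613, 0.1145,
             0.0454, 0.0132, 0.0014, 0.0, 0.0]
    live = [0.6969, 0.5533, 0.3054, 0.1775, 0.0958,
            0.0376, 0.0155, 0.0067, 0.0, 0.0]

    def pearson(xs, ys):
        """Sample Pearson correlation coefficient."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sqrt(sum((x - mx) ** 2 for x in xs))
        sy = sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    r = pearson(model, live)  # ≈ 0.992
    ```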
  8. Limitations: Signed bias suggests which mechanisms the simple model omits.

    Inference from signed bias Δ% by failure fraction: Δ% < 0 means recovery is likely missing; Δ% > 0 means stress is likely missing. p = 0.9 is omitted: both modes are 0.0.
    [Bar chart: signed bias Δ% by failure rate p_fail for norepl and repl.]
    p_fail | norepl Δ% | repl Δ%
    0.1    | -24.4     | -9.9
    0.3    | -9.1      | 0.0
    0.5    | 20.7      | 19.5
    0.7    | -79.1     | -14.8
    Negative bias (model < live): likely missing recovery mechanisms (retries, fallbacks, optional paths).
    Positive bias (model > live): likely missing stress mechanisms (timing, cascades, gray failures, correlation).
    8 / 10
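    The slide does not spell out the Δ% formula; relative bias, (model - live) / live × 100, is an assumed definition here, but it does reproduce the plotted bars from the slide-7 availability table:

    ```python
    def signed_bias(model, live):
        """Assumed definition of signed bias: relative difference of the
        model's availability estimate against the live measurement, in %."""
        return (model - live) / live * 100

    # norepl at p_fail = 0.1: model 0.4182 vs. live 0.5533 -> about -24.4%
    delta = round(signed_bias(0.4182, 0.5533), 1)
    ```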
  9. Research agenda: Residual error gives a concrete roadmap.

    01 Failure realism: close the blind spots behind the signed bias. Add gray, correlated, timeout-driven, and retry-coupled faults where fail-stop is too optimistic or too coarse. (Gray faults, correlation, timeouts, retry/load coupling.) Success: median |Δ%| ≤ 5% in replicated mid-range cases.
    02 Source coverage: broaden discovery sources and keep provenance explicit. Show the same graph can be recovered from multiple artifacts and explain where each inferred edge came from. (Traces, mesh, IaC, API contracts, SLO-as-Code; fusion optional.) Success: coverage ≥ 90% of exercised edges across source variants.
    03 External validity: test whether the signal survives beyond one benchmark. Replicate on additional applications and stacks after DeathStarBench and the OpenTelemetry follow-up. (More apps, more deployment styles, same success metric.) Success: Pearson r ≥ 0.97 across added applications.
    04 Operational workflow: make discovery SLO-driven, with a target runtime under 2 minutes. Discover topology continuously, estimate posture on each change, and escalate only risky cases to live chaos. Success: discover-to-estimate < 2 min with continuous posture signals.
    9 / 10
  10. Questions and follow-up

    "Minimal discovered models are useful before live chaos."
    Papers, code, docs: mb3r-lab.github.io
    AINA 2026 follow-up: second system, same thesis, same conclusion.
    Questions?
    10 / 10