Would your test catch this bug? A mutation testing story

A 20-minute variant of the talk, given at the Paris.rb meetup on 2026-05-05.

Szymon Fiedler

May 05, 2026

Transcript

  1. Would Your Tests Catch This Bug? A Mutation Testing Story

    Paris.rb · 2026-05-05 · Szymon Fiedler · @szymonfiedler
  2. Who got burned by the poor quality of those tests?
  3. "It's a skill issue" (Random from the Internet)
  4. "You're right, I was wrong. Let me reconsider." (Jean Claude van Code)
  5. Non-determinism

    Same prompt tomorrow → different tests. You're getting some tests, not the tests.
  6. Weak Pattern Matching

        it "does not raise" do
          expect { subject.call }.not_to raise_error
        end

    The model learned the shape of tests, not the purpose. SKILLS.md will be ignored when inconvenient.
  7. Can't you just prompt better?

    These help at the margins. None of them give you proof.

    - Explicit instructions → ignored when inconvenient
    - Chain-of-thought → still sycophantic reasoning
    - Contradiction → "You're right, let me reconsider"
    - Lower temperature → less random, still wrong
  8. Every Degradation Amplifies

    AI models use your existing code as guidance for future code. Weak tests breed more weak tests modeled on them.
  9. Three tools, three different questions

    1. RuboCop: Does it follow conventions? ✓
  10. Three tools, three different questions

    1. RuboCop: Does it follow conventions? ✓
    2. SimpleCov: Did this line execute? ✓
  11. Three tools, three different questions

    1. RuboCop: Does it follow conventions? ✓
    2. SimpleCov: Did this line execute? ✓
    3. ???: If this line were wrong, would tests catch it? ✗
  12. Three tools, three different questions

    Two tools, two checkmarks, one unanswered question. This is the gap.

    1. RuboCop: Does it follow conventions? ✓
    2. SimpleCov: Did this line execute? ✓
    3. ???: If this line were wrong, would tests catch it? ✗
  13. What Is Mutation Testing?

    - Deliberately break your code
    - See if tests notice
    - Killed: test failed on the mutation ✓
    - Alive: test still passed ✗
    - If you change a line and tests still pass, what are those tests actually testing?
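
The killed/alive distinction can be sketched by hand. This is a minimal illustration, not how mutant works internally: the lambdas, the two "suites", and the `verdict` helper are all hypothetical names invented here, and a real tool derives its mutants automatically.

```ruby
# Original implementation and a hand-written "mutant" of it.
# Both are hypothetical; a real tool generates mutants for you.
ORIGINAL = ->(age) { age >= 18 }
MUTANT   = ->(age) { age > 18 } # boundary operator changed

# A weak suite: it never probes the boundary value 18.
WEAK_SUITE = lambda do |adult|
  adult.call(30) == true && adult.call(5) == false
end

# A stronger suite that also pins the boundary.
STRONG_SUITE = lambda do |adult|
  WEAK_SUITE.call(adult) && adult.call(18) == true
end

# A mutant is "killed" when the suite fails against it, "alive" otherwise.
def verdict(suite, mutant)
  suite.call(mutant) ? :alive : :killed
end

puts verdict(WEAK_SUITE, MUTANT)   # the weak suite cannot tell the two apart
puts verdict(STRONG_SUITE, MUTANT) # the boundary assertion notices the change
```

Both suites pass against `ORIGINAL`; only the second one fails against `MUTANT`.
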
  14. Napoleon passed through my town

    So technically I'm returning the visit.

    (Footnote: Florida Center for Instructional Technology)
  15. I've been working effectively with legacy code there for 12 of those 20 years
  16. We've Been Here Since 2015

    - The Ruby Event Sourcing gem
    - Millions of downloads
    - Very few real reported issues
    - Mutation testing is one reason why
  17. Why Coverage Lied to Us

        def add_subscriber(subscriber, event_types)
          raise SubscriberNotExist if subscriber.nil?
          raise MethodNotDefined unless subscriber.methods.include? :handle_event
          subscribe(subscriber, [*event_types])
        end

    - Coverage: ✓
  18. Why Coverage Lied to Us

        def add_subscriber(subscriber, event_types)
          raise SubscriberNotExist if subscriber.nil?
          raise MethodNotDefined unless subscriber.methods.include? :handle_event
          subscribe(subscriber, [*event_types])
        end

    - Coverage: ✓
    - Mutation: replace condition with true, false → alive
  19. Why Coverage Lied to Us

        def add_subscriber(subscriber, event_types)
          raise SubscriberNotExist if subscriber.nil?
          raise MethodNotDefined unless subscriber.methods.include? :handle_event
          subscribe(subscriber, [*event_types])
        end

    - Coverage: ✓
    - Mutation: replace condition with true, false → alive
    - Tests for MethodNotDefined path: zero
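
A sketch of checks that would exercise both guard clauses, so that replacing either condition with a constant changes an observable outcome. Only `add_subscriber` mirrors the slide; the `Broker` class, the stand-in `subscribe`, and the `error_class` helper are assumptions made up for this illustration.

```ruby
# Hypothetical error classes and host class; only add_subscriber mirrors the slide.
SubscriberNotExist = Class.new(StandardError)
MethodNotDefined   = Class.new(StandardError)

class Broker
  attr_reader :subscriptions

  def initialize
    @subscriptions = []
  end

  def add_subscriber(subscriber, event_types)
    raise SubscriberNotExist if subscriber.nil?
    raise MethodNotDefined unless subscriber.methods.include? :handle_event
    subscribe(subscriber, [*event_types])
  end

  private

  # Stand-in for the real subscription logic, which the slide does not show.
  def subscribe(subscriber, event_types)
    @subscriptions << [subscriber, event_types]
  end
end

# Returns the class of the error raised by the block, or nil if none was raised.
def error_class
  yield
  nil
rescue StandardError => e
  e.class
end

handler = Object.new
def handler.handle_event(_event); end

broker = Broker.new
puts error_class { broker.add_subscriber(nil, :order_placed) }        # nil subscriber path
puts error_class { broker.add_subscriber(Object.new, :order_placed) } # missing handle_event path
puts error_class { broker.add_subscriber(handler, :order_placed) }.inspect
```

With a check on the `MethodNotDefined` path in place, a mutant that forces that condition to `true` or `false` no longer behaves like the original.
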
  20. Operators

        - <  + !"
        - !#  + !$
        - !%  + eql?
        - !%  + !&

    Method calls

        - hash[:key]  + hash.fetch(:key)
        - obj.public_send(:m)  + obj.__send__(:m)

    Values & flow

        - value  + nil
        - return value  + return
        - if condition  + if true / if false
        - statement  + # removed entirely

    Enumerable reductions

        - each_with_object  + each
        - filter_map  + map

    Orthogonal swaps

        - select  + reject
        - all?  + none?
        - keys  + values

    Bang, non-bang

        - map!  + map
        - sort!  + sort
        - compact!  + compact
  21. Your code

        class Scheduler
          def conflict?(booking_a, booking_b, buffer: 300)
            return false if booking_a[:cancelled] || booking_b[:cancelled]
            return false if booking_a[:room_id] != booking_b[:room_id]
            earlier, later = [booking_a, booking_b].sort_by { |h| h[:start_at] }
            later[:start_at] < (earlier[:end_at] + buffer)
          end
        end
  22. class SchedulerTest < Minitest::Test
        cover "Scheduler"

        def test_conflicts
          assert Scheduler.new.conflict?(
            booking(1, '09:00', '10:00'),
            booking(1, '09:30', '10:30')
          )
          refute Scheduler.new.conflict?(
            booking(1, '09:00', '10:00'),
            booking(2, '09:30', '10:30')
          )
          refute Scheduler.new.conflict?(
            booking(1, '09:00', '10:00'),
            booking(1, '10:10', '11:00')
          )
          refute Scheduler.new.conflict?(
            booking(1, '09:00', '10:00', cancelled: true),
            booking(1, '09:30', '10:30')
          )
        end
      end
  23. A Mutation

        - def conflict?(booking_a, booking_b, buffer: 300)
        + def conflict?(booking_a, booking_b, buffer: 299)

    - Tests pass.
    - Mutation is alive.
    - If you change a line and tests still pass, what are those tests actually testing?
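
To see why this mutant survives, the four assertions from the original test can be replayed against different buffer values. A rough sketch: the `booking` helper is not shown on the slides, so the version here (a hash with `Time` values on a fixed date) is an assumption, and `buffer` is made a required keyword only so the values can be compared side by side.

```ruby
require "time"

# Scheduler logic from the slides, with the buffer passed explicitly.
def conflict?(booking_a, booking_b, buffer:)
  return false if booking_a[:cancelled] || booking_b[:cancelled]
  return false if booking_a[:room_id] != booking_b[:room_id]
  earlier, later = [booking_a, booking_b].sort_by { |h| h[:start_at] }
  later[:start_at] < (earlier[:end_at] + buffer)
end

# Hypothetical helper: the slides call booking(...) but do not show it.
def booking(room_id, from, to, cancelled: false)
  { room_id: room_id,
    start_at: Time.parse("2026-05-05 #{from} UTC"),
    end_at: Time.parse("2026-05-05 #{to} UTC"),
    cancelled: cancelled }
end

# The four assertions from the original test, folded into one boolean.
def original_suite_passes?(buffer)
  conflict?(booking(1, "09:00", "10:00"), booking(1, "09:30", "10:30"), buffer: buffer) &&
    !conflict?(booking(1, "09:00", "10:00"), booking(2, "09:30", "10:30"), buffer: buffer) &&
    !conflict?(booking(1, "09:00", "10:00"), booking(1, "10:10", "11:00"), buffer: buffer) &&
    !conflict?(booking(1, "09:00", "10:00", cancelled: true), booking(1, "09:30", "10:30"), buffer: buffer)
end

[300, 299, 0].each do |buffer|
  puts "buffer #{buffer}: suite passes? #{original_suite_passes?(buffer)}"
end
```

The only assertion that comes near the buffer uses a 10-minute gap, which sits outside the 5-minute window whichever of these values is used.
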
  24. Let's Walk Through It

    SimpleCov says:

        Line Coverage: 100.0% (8/8)
        Branch Coverage: 100.0% (4/4)

    Mutant says:

        Mutations: 123
        Kills: 102
        Alive: 21
        Coverage: 82.92%
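
The two "coverage" figures measure different ratios. As a sketch, assuming mutant's figure is simply kills divided by mutations (which is what the slide's numbers work out to, give or take rounding), with `mutation_score` a hypothetical helper name:

```ruby
# Share of mutations killed by the test suite, as a percentage.
def mutation_score(kills, mutations)
  kills.to_f / mutations * 100
end

# Numbers from the slide: every line executed, yet 21 of 123 mutations survive.
line_coverage = 8.0 / 8 * 100
score = mutation_score(102, 123)

puts format("line coverage:  %.1f%%", line_coverage)
puts format("mutation score: %.1f%%", score)
```
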
  25. Reading the Output

        Mutant environment:
        Usage:           opensource
        Matcher:         #<Mutant::Matcher::Config subjects: [Scheduler]>
        Integration:     minitest
        Jobs:            10
        Includes:        []
        Requires:        ["./scheduler"]
        Operators:       light
        MutationTimeout: 5
        Subjects:        1
        All-Tests:       5
        Available-Tests: 5
        Selected-Tests:  1
        Tests/Subject:   1.00 avg
        Mutations:       123
        RUNNING 123/123 (100.0%) ████████████████████████████████████████ alive: 21 0.8s 163.95/s
  26. What mutant generates

        - if booking_a[:cancelled] || booking_b[:cancelled]
        + if booking_a.fetch(:cancelled) || booking_b[:cancelled]

        - if booking_a[:room_id] != booking_b[:room_id]
        + if !booking_a[:room_id].eql?(booking_b[:room_id])

        - if booking_a[:room_id] != booking_b[:room_id]
        + if !booking_a[:room_id].equal?(booking_b[:room_id])

        - (earlier, later) = [booking_a, booking_b].sort_by { |h| h[:start_at] }
        + (earlier, later) = [booking_a, booking_b]

        - h[:start_at]
        + nil

    - Operates on the AST, not text
    - Every mutation is semantically valid
    - Every mutation is a question
  27. Alive Group 1: Buffer Boundary

        - def conflict?(booking_a, booking_b, buffer: 300)
        + def conflict?(booking_a, booking_b, buffer: 0)

        - later[:start_at] < (earlier[:end_at] + buffer)
        + later[:start_at] < earlier[:end_at]

    - Buffer replaced with 0, 1, 299, 301. Tests pass every time.
    - The buffer plays no role in any test.
    - 6 mutations alive.
  28. Alive Group 1: Buffer Boundary

        refute Scheduler.new.conflict?(
          booking(1, '09:00', '10:00'),
          booking(1, '10:10', '11:00')
        )
  29. Fix: Pin the Boundary

        def test_buffer_boundary
          # 299 seconds apart: within 300s buffer → conflict
          assert Scheduler.new.conflict?(
            booking(1, "09:00:00", "10:00:00"),
            booking(1, "10:04:59", "11:00:00")
          )

          # exactly 300 seconds apart: at boundary → no conflict
          refute Scheduler.new.conflict?(
            booking(1, "09:00:00", "10:00:00"),
            booking(1, "10:05:00", "11:00:00")
          )
        end

    82.92% → 87.80%
  30. Alive Group 2: Sort Order

        - earlier, later = [booking_a, booking_b].sort_by { |h| h[:start_at] }
        + earlier, later = [booking_a, booking_b]

    - Sort removed. Tests still pass.
    - Every test passed bookings in chronological order.
    - 7 mutations alive.
  31. Fix: Reverse the Arguments

        def test_argument_order_does_not_matter
          # booking_a starts LATER than booking_b
          refute Scheduler.new.conflict?(
            booking(1, "10:00", "11:00"),
            booking(1, "09:00", "09:30")
          )
        end

    87.80% → 92.68%
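
Why the reversed arguments kill the sort mutants can be checked directly. In this sketch the `sort:` flag switches between the original line and the mutated one; both that flag and the `booking` helper are assumptions introduced for illustration, and the cancelled/room checks are left out to keep the focus on ordering.

```ruby
require "time"

# Hypothetical helper standing in for the booking(...) call on the slides.
def booking(room_id, from, to)
  { room_id: room_id,
    start_at: Time.parse("2026-05-05 #{from} UTC"),
    end_at: Time.parse("2026-05-05 #{to} UTC") }
end

# The overlap check with (original) and without (mutant) the sort_by call.
def overlap?(booking_a, booking_b, sort:, buffer: 300)
  pair = [booking_a, booking_b]
  pair = pair.sort_by { |h| h[:start_at] } if sort
  earlier, later = pair
  later[:start_at] < (earlier[:end_at] + buffer)
end

late  = booking(1, "10:00", "11:00")
early = booking(1, "09:00", "09:30")

puts overlap?(late, early, sort: true)  # original code
puts overlap?(late, early, sort: false) # mutant with the sort removed
```

With arguments in chronological order the two variants agree, which is why the earlier tests could not tell them apart.
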
  32. Alive Group 3: [] → .fetch()

        - booking_a[:cancelled]
        + booking_a.fetch(:cancelled)

    - Both return the same value when the key exists.
    - No test can distinguish them.
    - 6 mutations alive.
    - This is not a test problem. This is a code problem.
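
The difference between the two forms only shows up when the key is missing, which a short plain-Ruby sketch makes visible (the hashes are made-up sample data):

```ruby
booking = { room_id: 1, cancelled: false }

# With the key present, both forms return the same value.
puts booking[:cancelled] == booking.fetch(:cancelled)

# With the key missing, [] silently returns nil while fetch raises KeyError.
incomplete = { room_id: 1 }
puts incomplete[:cancelled].inspect

begin
  incomplete.fetch(:cancelled)
rescue KeyError => e
  puts "KeyError: #{e.message}"
end
```
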
  33. Fix: Accept the Mutation

        - return false if booking_a[:cancelled] || booking_b[:cancelled]
        + return false if booking_a.fetch(:cancelled) || booking_b.fetch(:cancelled)

        - return false if booking_a[:room_id] != booking_b[:room_id]
        + return false if booking_a.fetch(:room_id) != booking_b.fetch(:room_id)

        - earlier, later = [booking_a, booking_b].sort_by { |h| h[:start_at] }
        + earlier, later = [booking_a, booking_b].sort_by { |h| h.fetch(:start_at) }

    92.68% → 98.27%

    No new tests. The code changed.
  34. The Scheduler's Second Act

        # Before: raw hashes, 116 mutations
        booking_a.fetch(:cancelled)
        booking_a.fetch(:room_id)
        booking_a.fetch(:start_at)

    - 100% mutation score. We kept going.
    - Mutant told us the design could be better.

        # After: Data.define value object, 78 mutations
        Booking = Data.define(:room_id, :start_at, :end_at, :cancelled)
        booking_a.cancelled
        booking_a.room_id
        booking_a.start_at # no fetch mutations possible

    - 38 mutations simply ceased to exist.
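
A sketch of how the conflict check might read against the value object. This is an assumption about the refactored shape, not the speaker's actual code; it needs Ruby 3.2 or newer for `Data.define`, and it uses plain integers (seconds) for the times to stay short.

```ruby
# Requires Ruby 3.2+ for Data.define.
Booking = Data.define(:room_id, :start_at, :end_at, :cancelled)

# Conflict check rewritten against attribute readers instead of hash lookups.
# start_at and end_at are plain integers (seconds) here for brevity.
def conflict?(booking_a, booking_b, buffer: 300)
  return false if booking_a.cancelled || booking_b.cancelled
  return false if booking_a.room_id != booking_b.room_id
  earlier, later = [booking_a, booking_b].sort_by(&:start_at)
  later.start_at < (earlier.end_at + buffer)
end

a = Booking.new(room_id: 1, start_at: 0, end_at: 3600, cancelled: false)
b = Booking.new(room_id: 1, start_at: 3600 + 299, end_at: 7200, cancelled: false)

puts conflict?(a, b)

# A missing attribute fails at construction time, not deep inside the logic.
begin
  Booking.new(room_id: 1, start_at: 0, end_at: 3600)
rescue ArgumentError => e
  puts "ArgumentError: #{e.message}"
end
```

Because the readers are ordinary method calls on a fixed set of attributes, there is no `[]` versus `fetch` choice left to mutate.
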
  35. Equivalent Mutants

    THE PRINCIPLED RESPONSE

        # Integer() coercion at construction time
        Booking = Data.define(:room_id, :start_at, :end_at, :cancelled) do
          def initialize(room_id:, start_at:, end_at:, cancelled: false)
            super(room_id: Integer(room_id), start_at:, end_at:, cancelled:)
          end
        end

        # mutant:disable - documented, extracted, auditable
        def same_room?(booking_a, booking_b)
          booking_a.room_id !" booking_b.room_id
        end

    - For integers: !=, !eql?, and !equal? are identical
    - Ruby caches small integers as singleton objects
    - mutant:disable is for equivalent mutants only: documented, auditable
    - Mutations: 116 → 78 → 73. Tests: 5 → 3.
    - Each removal justified by a design improvement.
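
The equivalence claim can be checked in plain Ruby. A small sketch with made-up values; the last two lines show why the `Integer()` coercion matters, since the predicates stop agreeing once a non-integer slips in.

```ruby
a = Integer("7") # coerced, as in the Booking constructor on the slide
b = 7

puts a != b       # value comparison
puts !a.eql?(b)   # value and type comparison
puts !a.equal?(b) # object identity comparison

# Without coercion the predicates can disagree, e.g. Integer vs Float:
puts 7 == 7.0
puts 7.eql?(7.0)
```
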
  36. Resolving a Survived Mutation

    Every alive mutation has exactly three valid responses:

    - A: Write a test. The behavior matters, pin it.
    - B: Fix the bug. The mutation found a real defect.
    - C: Remove the code. The code was dead weight.
  37. What powers your Quality dashboard?

    - One gem.
    - One metric.
    - Entire industry built on top.
  38. Your Quality Gate Is a Vanity Metric

    Coverage must be ≥ 90%

    Translation: 90% of code must be executed during tests
    Not: 90% of code must be verified by tests
  39. The Method Nobody Dared Touch

        def roof_age_info(quote)
          return nil unless quote

          quote_uw_data = get_roof_age_from_uw_data(quote.█████████)
          user_roof_age = user_roof_age_input(
            quote.█████████,
            quote.███████,
            quote.████████████████████████████████████(:roof_updated_year),
            quote.█████████?(:risk_roof_age_█████████),
          )
          quote_uw_vendor_roof_age = quote_████████████████████████
          quote_uw_vendor_source = (quote_██████████████████████████████ || Source::NONE)

          if user_roof_age.present?
            user_roof_age_int = transform_roof_age_from_answers_to_int(
              roof_age_from_answers: user_roof_age,
              quote_public_id: quote.██████████,
            )
            user_answer_same_as_vendor_data = (quote_uw_vendor_roof_age == user_roof_age_int)
            source = user_answer_same_as_vendor_data ? quote_uw_vendor_source : Source::USER
            return RoofDataModel.new(age: user_roof_age_int, source: source)
          end

          return nil unless quote_uw_vendor_roof_age.present?

          RoofDataModel.new(
            age: quote_uw_vendor_roof_age,
            source: quote_uw_vendor_source
          )
        end
  40. One Test for 156 Mutations

        Selected-Tests: 1
        Mutations: 156
        Kills: 95
        Alive: 61
        Coverage: 60.90%

    One test. Sixty-one blind spots.
  41. What Survived

    THE CONDITION

        # Original
        if user_roof_age.present?
          # ... priority logic, source assignment, return
        end

        # Mutant: replaced with
        if true          # alive
        if user_roof_age # alive
        if self.present? # alive

    - Every mutation on this condition is alive.
    - The condition was irrelevant to the only test.
  42. What Survived

    THE UW DATA PATH

        # Original
        quote_uw_data = get_roof_age_from_uw_data(quote.██████████)

        # Mutant: replaced with
        quote_uw_data = nil                              # alive
        quote_uw_data = get_roof_age_from_uw_data(nil)   # alive
        quote_uw_data = get_roof_age_from_uw_data(quote) # alive

    The entire vendor data path: zero test coverage.
  43. What we did

    Mutant forced us to answer: what are we actually testing? Answer: almost nothing.
  44. Spec diff

    +48 lines / -239 lines

    239 lines of copy-paste fixtures to 48 lines of composable helpers. More coverage. Half the code.
  45. The code simplification

        - if user_roof_age.present?
        + if user_roof_age

        - return nil unless quote_uw_vendor_roof_age.present?
        + return nil unless quote_uw_vendor_roof_age
  46. We didn't guess. We asked mutant.

    That's not a test quality tool. That's a design buddy.
  47. But the fox has a weakness

    "Test-Driven Development is a superpower when working with AI agents" (Kent Beck)
  48. Ruby's Unfair Advantage

    - Best mutation testing tool in any language: over a decade of development
    - Largest set of mutation operators
    - Dynamic language = more token-efficient context window
    - Running mutant via CLI costs zero tokens.
  49. Mutant is agent ready

    Alive mutations require one of two actions:

    A) Keep the mutated code: Your tests specify the correct semantics, and the original code is redundant. Accept the mutation.
    B) Add a missing test: The original code is correct, but the tests do not verify the behavior the mutation removed.

    Agent writes a real test. Not pattern noise.

    But you have to wire it up; the LLM won't do it on its own.
  50. Guardrails in Production

        # dev_workflow.rb: wired to /verify and pre-commit hook
        require_relative 'dev_workflow/steps/rubocop_step'
        require_relative 'dev_workflow/steps/rspec_step'
        require_relative 'dev_workflow/steps/mutant_step'

        def run_mutant(subjects)
          run_command(
            "bundle exec mutant run --since HEAD~1 #{subject_args}"
          )
        end

    - Each step returns .skipped, .success, or .failure.
    - The agent reads the output and fixes what broke.
    - The hook makes running it non-negotiable.
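
One way a step with three outcomes could be wired is sketched below. None of this is the speaker's `dev_workflow` code: `StepResult`, the `runner:` parameter, and the skip-on-empty rule are assumptions made up for illustration, and the command line simply mirrors the one on the slide.

```ruby
require "open3"
require "rbconfig"

# Hypothetical result type with the three states named on the slide.
StepResult = Struct.new(:status, :output) do
  def self.skipped
    new(:skipped, "")
  end
end

# Runs a command and maps its exit status to a StepResult.
def run_command(*command)
  output, status = Open3.capture2e(*command)
  StepResult.new(status.success? ? :success : :failure, output)
end

# Skips when there is nothing to mutate; otherwise delegates to the runner.
# The runner is injectable so the sketch can run without mutant installed.
def run_mutant(subjects, runner: method(:run_command))
  return StepResult.skipped if subjects.empty?

  runner.call("bundle", "exec", "mutant", "run", "--since", "HEAD~1", *subjects)
end

# Demonstration with a harmless command instead of mutant itself.
puts run_mutant([]).status
puts run_command(RbConfig.ruby, "-e", "exit 1").status
```

Returning the captured output alongside the status is what lets an agent, or a human, read what broke.
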
  51. Setup & Practical Adoption

        # Gemfile: choose your ~poison~ integration
        gem "mutant-rspec", group: :test
        # or
        gem "mutant-minitest", group: :test

        # .mutant.yml
        integration: rspec
        includes:
          - lib
        requires:
          - app
        operators: light # default, skips == → eql?

        bundle exec mutant run -- "Scheduler"
        bundle exec mutant run -- "Scheduler#conflict?"
        bundle exec mutant run -- "Booking!%*"

    One module/class/method at a time: don't boil the ocean

    --since HEAD~1 for incremental runs on large codebases
  52. Mutant-unfriendly patterns

    - Anonymous classes → no constant to address
    - define_method / class_eval → invisible to AST
    - Just give your code a name
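
The "no constant to address" point is easy to see in plain Ruby. A tiny sketch; `NamedHandler` is a made-up constant name:

```ruby
# An anonymous class has no constant name that a tool could refer to.
anonymous = Class.new do
  def call
    :ok
  end
end
puts anonymous.name.inspect

# Assigning the class to a constant gives it a name.
NamedHandler = anonymous
puts NamedHandler.name
```
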
  53. Pain Points (Honest)

    - Slow tests × many mutations = slow feedback, but that's your tests, not mutant
    - Setup friction with unusual require chains
    - Commercial: $30/dev/month (was $90), $250 annual (was $900), free for OSS
    - Not a silver bullet: another element of the safety net
  54. The Safety Net Stack

    | Tool | Question answered |
    |:--|:--|
    | SimpleCov | Did this code execute? |
    | RuboCop | Does it follow conventions? |
    | **Mutant** | **Would tests catch a bug here?** |

    Remember the ??? from earlier? Mutant is the answer.
  55. Would Your Tests Catch This Bug?

    Mutant doesn't care how good your prompts are.
  56. Would Your Tests Catch This Bug?

    - github.com/mbj/mutant
    - arkency.com
    - fiedler.pro/hello