Szymon Fiedler
March 27, 2026

Would Your Tests Catch This Bug? A mutation testing story

Your tests pass, but do they actually test anything? With AI agents now writing tests, mutation testing becomes critical. Learn how Mutant exposes gaps in your test coverage by introducing bugs. If tests still pass, you’ve found a blind spot. Real code. Real mutations. Real insights.

RBQConf — Austin, TX

March 26-27, 2026

Transcript

  1. "You're right, I was wrong. Let me reconsider."

    Jean Claude van Code
    Szymon Fiedler @szymonfiedler
  2. Non-determinism

    Same prompt tomorrow → different tests. You're getting some tests, not the tests.
  3. Weak Pattern Matching

        it "does not raise" do
          expect { subject.call }.not_to raise_error
        end

    The model learned the shape of tests, not the purpose. SKILLS.md will be ignored when inconvenient.
  8. Can't you just prompt better?

    – Explicit instructions → ignored when inconvenient
    – Chain-of-thought → still sycophantic reasoning
    – Contradiction → "You're right, let me reconsider"
    – Lower temperature → less random, still wrong

    These help at the margins. None of them give you proof.
  9. Every Degradation Amplifies

    AI models use your existing code as guidance for future code. Weak tests breed more weak tests modeled on them.
  13. Three tools, three different questions

    1. RuboCop: Does it follow conventions? ✓
    2. SimpleCov: Did this line execute? ✓
    3. ???: If this line were wrong, would tests catch it? ✗

    Two tools, two checkmarks, one unanswered question. This is the gap.
  17. What Is Mutation Testing?

    – Deliberately break your code
    – See if tests notice
    – Killed: test failed on the mutation ✓
    – Alive: test still passed ✗

    If you change a line and tests still pass, what are those tests actually testing?
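The killed/alive distinction can be sketched by hand. This is only an illustration of the idea, not how Mutant works internally: the age rule, the two mutants, and both "suites" are hypothetical names invented for this example.

```ruby
# Original implementation plus two hand-written mutants of it.
# All names are hypothetical; Mutant derives its mutants from the AST instead.
ORIGINAL = ->(age) { age >= 18 }
MUTANTS = {
  ">= replaced with >" => ->(age) { age > 18 },
  "condition replaced with true" => ->(_age) { true }
}.freeze

# A "suite" here is a lambda that returns true when every check holds.
WEAK_SUITE = ->(adult) { adult.call(30) == true }
STRONG_SUITE = lambda do |adult|
  adult.call(30) == true && adult.call(18) == true && adult.call(17) == false
end

# A mutant is killed when the suite fails against it,
# and alive when the suite still passes.
def mutation_report(suite)
  MUTANTS.transform_values { |mutant| suite.call(mutant) ? :alive : :killed }
end

pp mutation_report(WEAK_SUITE)
pp mutation_report(STRONG_SUITE)
```

Both suites pass against the original; only the one that pins the boundary values can tell the mutants apart.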
  18. I'm Szymon Fiedler from Poland

    Poland, OH; Poland, ME; Poland, IN; Poland, NY; The REAL one.
    Photos: Poznań, High Tatras, Białowieża Forest, Cracow (© Polish Tourism Organization)
  19. I've been working effectively with legacy code there for 12 of those 20 years
  20. We've Been Here Since 2015

    The Ruby Event Sourcing gem
    – Millions of downloads
    – Very few real reported issues
    – Mutation testing is one reason why
  23. Why Coverage Lied to Us

        def add_subscriber(subscriber, event_types)
          raise SubscriberNotExist if subscriber.nil?
          raise MethodNotDefined unless subscriber.methods.include? :handle_event
          subscribe(subscriber, [*event_types])
        end

    – Coverage: ✓
    – Mutation: replace condition with true, false → alive
    – Tests for MethodNotDefined path: zero
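A runnable sketch of why the guards can be neutralized without any test noticing. Only `add_subscriber` comes from the slide; the `Broker` host class, the `subscribe` body, `Handler`, and the two check methods are assumptions made so the example is self-contained. `MutatedBroker` is a hand-written stand-in for the mutations, not Mutant output.

```ruby
class SubscriberNotExist < StandardError; end
class MethodNotDefined < StandardError; end

# Hypothetical host class; only add_subscriber is taken from the slide.
class Broker
  attr_reader :subscriptions

  def initialize
    @subscriptions = []
  end

  def add_subscriber(subscriber, event_types)
    raise SubscriberNotExist if subscriber.nil?
    raise MethodNotDefined unless subscriber.methods.include? :handle_event
    subscribe(subscriber, [*event_types])
  end

  private

  def subscribe(subscriber, event_types)
    event_types.each { |type| @subscriptions << [type, subscriber] }
  end
end

# Hand-written stand-in for the mutations: both guard conditions neutralized.
class MutatedBroker < Broker
  def add_subscriber(subscriber, event_types)
    raise SubscriberNotExist if false
    raise MethodNotDefined unless true
    subscribe(subscriber, [*event_types])
  end
end

class Handler
  def handle_event(_event); end
end

# Happy path: evaluates every line of add_subscriber, verifies neither guard.
def happy_path_passes?(broker_class)
  broker = broker_class.new
  broker.add_subscriber(Handler.new, :order_placed)
  broker.subscriptions.size == 1
end

# Guard path: a subscriber without handle_event must be rejected.
def guard_path_passes?(broker_class)
  broker_class.new.add_subscriber(Object.new, :order_placed)
  false
rescue MethodNotDefined
  true
end

puts happy_path_passes?(Broker)        # passes on the original
puts happy_path_passes?(MutatedBroker) # still passes: the mutation is alive
puts guard_path_passes?(MutatedBroker) # fails: this check kills it
```

Because the guards are modifier conditions, the happy path evaluates both lines, so line coverage marks them as executed even though nothing checks that they ever raise.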
  24. Operators - < + !" - !# + !$ - !% - eql? - !% + !&

    Method calls
        - hash[:key]
        + hash.fetch(:key)
        - obj.public_send(:m)
        + obj.__send__(:m)
    Values & flow
        - value
        + nil
        - return value
        + return
        - if condition
        + if true / if false
        - statement
        + # removed entirely
    Enumerable reductions
        - each_with_object
        + each
        - filter_map
        + map
    Orthogonal swaps
        - select
        + reject
        - all?
        + none?
        - keys
        + values
    Bang, non-bang
        - map!
        + map
        - sort!
        + sort
        - compact!
        + compact
  25. Standing on solid foundations

        # parser turns this:
        later.start_at < (earlier.end_at + buffer)

    – parser gem → AST (Abstract Syntax Tree)
    – The same parser that powers RuboCop.
  26. Standing on solid foundations

        # into this:
        s(:send,
          s(:send, s(:lvar, :later), :start_at),
          :<, # mutant targets this node
          s(:send,
            s(:send, s(:lvar, :earlier), :end_at),
            :+,
            s(:lvar, :buffer)))

    – mutant replaces the node.
    – unparser regenerates valid Ruby.
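The replace-then-regenerate step can be imitated with plain nested arrays. This is a toy model only: it does not use the parser or unparser gems, `s`, `replace_operator` and `to_ruby` are hypothetical helpers written for this sketch, and swapping `<` for `<=` is just one illustrative mutation.

```ruby
# Toy stand-in for the parser gem's s(...) nodes: plain nested arrays.
def s(type, *children)
  [type, *children]
end

TREE = s(:send,
         s(:send, s(:lvar, :later), :start_at),
         :<,
         s(:send,
           s(:send, s(:lvar, :earlier), :end_at),
           :+,
           s(:lvar, :buffer)))

# Return a copy of the tree with every `from` method name replaced by `to`.
def replace_operator(node, from, to)
  return node unless node.is_a?(Array)

  type, *children = node
  children[1] = to if type == :send && children[1] == from
  [type, *children.map { |child| replace_operator(child, from, to) }]
end

# Very small "unparser" that covers only the node types used above.
def to_ruby(node)
  type, *children = node
  case type
  when :lvar
    children.first.to_s
  when :send
    receiver, name, *args = children
    return "#{to_ruby(receiver)}.#{name}" if args.empty?

    "(#{to_ruby(receiver)} #{name} #{to_ruby(args.first)})"
  end
end

puts to_ruby(TREE)
puts to_ruby(replace_operator(TREE, :<, :<=))
```

Working on the tree rather than on source text is what keeps every generated variant syntactically valid: only a node is swapped, and the source is regenerated from the result.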
  27. Your code

        class Scheduler
          def conflict?(booking_a, booking_b, buffer: 300)
            return false if booking_a[:cancelled] || booking_b[:cancelled]
            return false if booking_a[:room_id] != booking_b[:room_id]

            earlier, later = [booking_a, booking_b].sort_by { |h| h[:start_at] }
            later[:start_at] < (earlier[:end_at] + buffer)
          end
        end
  28.

        class SchedulerTest < Minitest::Test
          cover "Scheduler"

          def test_conflicts
            assert Scheduler.new.conflict?(
              booking(1, '09:00', '10:00'),
              booking(1, '09:30', '10:30')
            )
            refute Scheduler.new.conflict?(
              booking(1, '09:00', '10:00'),
              booking(2, '09:30', '10:30')
            )
            refute Scheduler.new.conflict?(
              booking(1, '09:00', '10:00'),
              booking(1, '10:10', '11:00')
            )
            refute Scheduler.new.conflict?(
              booking(1, '09:00', '10:00', cancelled: true),
              booking(1, '09:30', '10:30')
            )
          end
        end
  29. A Mutation

        - def conflict?(booking_a, booking_b, buffer: 300)
        + def conflict?(booking_a, booking_b, buffer: 299)

    – Tests pass.
    – Mutation is alive.
    – If you change a line and tests still pass, what are those tests actually testing?
  30. Let's Walk Through It

    SimpleCov says:
        Line Coverage: 100.0% (8/8)
        Branch Coverage: 100.0% (4/4)
    Mutant says:
        Mutations: 123
        Kills: 102
        Alive: 21
        Coverage: 82.92%
  31. Reading the Output

        Mutant environment:
        Usage: opensource
        Matcher: #<Mutant::Matcher::Config subjects: [Scheduler]>
        Integration: minitest
        Jobs: 10
        Includes: []
        Requires: ["./scheduler"]
        Operators: light
        MutationTimeout: 5
        Subjects: 1
        All-Tests: 5
        Available-Tests: 5
        Selected-Tests: 1
        Tests/Subject: 1.00 avg
        Mutations: 123
        RUNNING 123/123 (100.0%) ████████████████████████████████████████ alive: 21 0.8s 163.95/s
  32. Reading the Output

        Scheduler#conflict?:/Users/fidel/code/mutant/scheduler.rb:5
        - minitest:SchedulerTest#test_conflicts
        evil:Scheduler#conflict?:/Users/fidel/code/mutant/scheduler.rb:5:4a4cd
        -----------------------
        Killfork: #<Process::Status: pid 24868 exit 0>
        Log messages (combined stderr and stdout):
        [killfork] 1 runs, 5 assertions, 0 failures, 0 errors, 0 skips
        @@ -1,12 +1,12 @@
        -def conflict?(booking_a, booking_b, buffer: 300)
        +def conflict?(booking_a, booking_b, buffer: 299)
           if booking_a[:cancelled] || booking_b[:cancelled]
             return false
           end
  33. What mutant generates

        - if booking_a[:cancelled] || booking_b[:cancelled]
        + if booking_a.fetch(:cancelled) || booking_b[:cancelled]

        - if booking_a[:room_id] != booking_b[:room_id]
        + if !booking_a[:room_id].eql?(booking_b[:room_id])

        - if booking_a[:room_id] != booking_b[:room_id]
        + if !booking_a[:room_id].equal?(booking_b[:room_id])

        - (earlier, later) = [booking_a, booking_b].sort_by { |h| h[:start_at] }
        + (earlier, later) = [booking_a, booking_b]

        - h[:start_at]
        + nil

    – Operates on the AST, not text
    – Every mutation is semantically valid
    – Every mutation is a question
  34. Alive Group 1: Buffer Boundary

        - def conflict?(booking_a, booking_b, buffer: 300)
        + def conflict?(booking_a, booking_b, buffer: 0)

        - later[:start_at] < (earlier[:end_at] + buffer)
        + later[:start_at] < earlier[:end_at]

    – Buffer replaced with 0, 1, 299, 301. Tests pass every time.
    – The buffer plays no role in any test.
    – 6 mutations alive.
  35. Alive Group 1: Buffer Boundary

        refute Scheduler.new.conflict?(
          booking(1, '09:00', '10:00'),
          booking(1, '10:10', '11:00')
        )
  36. Fix: Pin the Boundary

        def test_buffer_boundary
          # 299 seconds apart: within 300s buffer → conflict
          assert Scheduler.new.conflict?(
            booking(1, "09:00:00", "10:00:00"),
            booking(1, "10:04:59", "11:00:00")
          )
          # exactly 300 seconds apart: at boundary → no conflict
          refute Scheduler.new.conflict?(
            booking(1, "09:00:00", "10:00:00"),
            booking(1, "10:05:00", "11:00:00")
          )
        end

    82.92% → 87.80%
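A self-contained sketch of why the original assertions cannot see the buffer while the boundary assertions can. The `Scheduler` body follows the slides. The `booking` and `seconds` helpers are not shown in the talk, so their shape here (times as seconds since midnight) is an assumption, and passing `buffer:` explicitly is a stand-in for mutating the default value.

```ruby
# Same logic as the Scheduler shown on the slides.
class Scheduler
  def conflict?(booking_a, booking_b, buffer: 300)
    return false if booking_a[:cancelled] || booking_b[:cancelled]
    return false if booking_a[:room_id] != booking_b[:room_id]

    earlier, later = [booking_a, booking_b].sort_by { |h| h[:start_at] }
    later[:start_at] < (earlier[:end_at] + buffer)
  end
end

# Hypothetical helpers: "HH:MM" or "HH:MM:SS" becomes seconds since midnight.
def seconds(clock)
  hours, minutes, secs = clock.split(":").map(&:to_i)
  hours * 3600 + minutes * 60 + secs.to_i
end

def booking(room_id, from, to, cancelled: false)
  { room_id: room_id, start_at: seconds(from), end_at: seconds(to), cancelled: cancelled }
end

# The four original assertions, run with a given (possibly mutated) buffer.
def original_suite_passes?(buffer)
  s = Scheduler.new
  s.conflict?(booking(1, "09:00", "10:00"), booking(1, "09:30", "10:30"), buffer: buffer) &&
    !s.conflict?(booking(1, "09:00", "10:00"), booking(2, "09:30", "10:30"), buffer: buffer) &&
    !s.conflict?(booking(1, "09:00", "10:00"), booking(1, "10:10", "11:00"), buffer: buffer) &&
    !s.conflict?(booking(1, "09:00", "10:00", cancelled: true), booking(1, "09:30", "10:30"), buffer: buffer)
end

# The two boundary assertions from the fix.
def boundary_suite_passes?(buffer)
  s = Scheduler.new
  s.conflict?(booking(1, "09:00:00", "10:00:00"), booking(1, "10:04:59", "11:00:00"), buffer: buffer) &&
    !s.conflict?(booking(1, "09:00:00", "10:00:00"), booking(1, "10:05:00", "11:00:00"), buffer: buffer)
end

[0, 1, 299, 300, 301].each do |buffer|
  puts format("buffer %3d: original suite %-5s boundary suite %s",
              buffer, original_suite_passes?(buffer), boundary_suite_passes?(buffer))
end
```

The original suite passes for every buffer value tried, which is exactly what "alive" means; the boundary suite passes only for 300.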
  37. Alive Group 2: Sort Order

        - earlier, later = [booking_a, booking_b].sort_by { |h| h[:start_at] }
        + earlier, later = [booking_a, booking_b]

    – Sort removed. Tests still pass.
    – Every test passed bookings in chronological order.
    – 7 mutations alive.
  38. Fix: Reverse the Arguments

        def test_argument_order_does_not_matter
          # booking_a starts LATER than booking_b
          refute Scheduler.new.conflict?(
            booking(1, "10:00", "11:00"),
            booking(1, "09:00", "09:30")
          )
        end

    87.80% → 92.68%
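A compact sketch of why argument order is the only thing that exposes the missing sort. The overlap check mirrors the slides, but the ordering step is made injectable so the mutation can be simulated by hand. `WITH_SORT`, `WITHOUT_SORT`, `MORNING` and `LATE` are hypothetical names, and times are seconds since midnight by assumption.

```ruby
# Ordering strategies: the original sort and what the alive mutation leaves behind.
WITH_SORT = ->(a, b) { [a, b].sort_by { |h| h[:start_at] } }
WITHOUT_SORT = ->(a, b) { [a, b] }

# Overlap check from the slides, with the ordering step passed in.
def conflict?(booking_a, booking_b, order:, buffer: 300)
  earlier, later = order.call(booking_a, booking_b)
  later[:start_at] < (earlier[:end_at] + buffer)
end

MORNING = { start_at: 9 * 3600, end_at: 9 * 3600 + 1800 }.freeze # 09:00-09:30
LATE = { start_at: 10 * 3600, end_at: 11 * 3600 }.freeze         # 10:00-11:00

# Chronological arguments: both variants agree, so the mutation stays alive.
puts conflict?(MORNING, LATE, order: WITH_SORT)
puts conflict?(MORNING, LATE, order: WITHOUT_SORT)

# Reversed arguments: the variants disagree, so a test here kills the mutation.
puts conflict?(LATE, MORNING, order: WITH_SORT)
puts conflict?(LATE, MORNING, order: WITHOUT_SORT)
```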
  39. Alive Group 3: [] → .fetch()

        - booking_a[:cancelled]
        + booking_a.fetch(:cancelled)

    – Both return the same value when the key exists.
    – No test can distinguish them.
    – 6 mutations alive.
    – This is not a test problem. This is a code problem.
  40. Fix: Accept the Mutation

        - return false if booking_a[:cancelled] || booking_b[:cancelled]
        + return false if booking_a.fetch(:cancelled) || booking_b.fetch(:cancelled)

        - return false if booking_a[:room_id] != booking_b[:room_id]
        + return false if booking_a.fetch(:room_id) != booking_b.fetch(:room_id)

        - earlier, later = [booking_a, booking_b].sort_by { |h| h[:start_at] }
        + earlier, later = [booking_a, booking_b].sort_by { |h| h.fetch(:start_at) }

    92.68% → 98.27%
    No new tests. The code changed.
  41. Alive Group 4: Ruby's Equality Semantics

        - if booking_a.fetch(:room_id) != booking_b.fetch(:room_id)
        + if !booking_a.fetch(:room_id).eql?(booking_b.fetch(:room_id))
        + if !booking_a.fetch(:room_id).equal?(booking_b.fetch(:room_id))

    – !=: value comparison (coerces types)
    – eql?: value + type must match
    – equal?: object identity (same pointer)
    – 2 mutations alive.
  42. Fix: Exploit the Type Gap

        def test_same_room_with_string_ids
          # String "room_1": same value, different objects
          # Kills != → !equal? (equal? checks identity)
          assert Scheduler.new.conflict?(
            booking("room_1", "09:00", "10:00"),
            booking("room_1", "09:30", "10:30")
          )
        end

        def test_different_room_with_comparable_types
          # Integer 1 vs Float 1.0: != says equal, eql? says not
          # Kills != → !eql?
          assert Scheduler.new.conflict?(
            booking(1, "09:00", "10:00"),
            booking(1.0, "09:30", "10:30")
          )
        end

    98.27% → 100.00%: 116/116 killed, 0 alive.
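The three equality checks these tests rely on can be tried in isolation. `equality_triple` is a hypothetical helper for this sketch, and the strings are built with `dup` so that they are guaranteed to be distinct objects regardless of frozen-string-literal settings.

```ruby
# Returns the results of ==, eql? and equal? for a pair of objects.
def equality_triple(a, b)
  [a == b, a.eql?(b), a.equal?(b)]
end

# Integer vs Float: == compares by value across numeric types,
# eql? additionally requires the same class.
p equality_triple(1, 1.0)

# Two distinct String objects with the same content: only equal? tells them apart.
p equality_triple("room_1".dup, "room_1".dup)

# Small Integers: all three checks agree.
p equality_triple(1, 1)
```

Each mutation survives only as long as no test input makes its check disagree with `!=`, which is why one test uses strings and the other mixes Integer and Float.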
  43. The Scheduler's Second Act

        # Before: raw hashes, 116 mutations
        booking_a.fetch(:cancelled)
        booking_a.fetch(:room_id)
        booking_a.fetch(:start_at)

    – 100% mutation score. We kept going.
    – Mutant told us the design could be better.

        # After: Data.define value object, 78 mutations
        Booking = Data.define(:room_id, :start_at, :end_at, :cancelled)
        booking_a.cancelled
        booking_a.room_id
        booking_a.start_at
        # no fetch mutations possible

    – 38 mutations simply ceased to exist.
    – test_same_room_with_string_ids removed. Never tested domain behavior.
  44. Equivalent Mutants: THE PRINCIPLED RESPONSE

        # Integer() coercion at construction time
        Booking = Data.define(:room_id, :start_at, :end_at, :cancelled) do
          def initialize(room_id:, start_at:, end_at:, cancelled: false)
            super(room_id: Integer(room_id), start_at:, end_at:, cancelled:)
          end
        end

    – For integers: !=, !eql?, and !equal? are identical.
    – Ruby caches small integers as singleton objects.
    – No test can kill these mutations: no behavior differs.
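A runnable version of the value object, assuming Ruby 3.2 or newer for `Data.define`. `Booking` is taken from the slide; `room_mismatch_variants` is a hypothetical helper added only to compare the original check with its two mutated forms.

```ruby
# Value object from the slide: room_id is coerced to Integer at construction time.
Booking = Data.define(:room_id, :start_at, :end_at, :cancelled) do
  def initialize(room_id:, start_at:, end_at:, cancelled: false)
    super(room_id: Integer(room_id), start_at:, end_at:, cancelled:)
  end
end

# Hypothetical helper: the original check and its two mutated forms side by side.
def room_mismatch_variants(booking_a, booking_b)
  [
    booking_a.room_id != booking_b.room_id,
    !booking_a.room_id.eql?(booking_b.room_id),
    !booking_a.room_id.equal?(booking_b.room_id)
  ]
end

from_form = Booking.new(room_id: "7", start_at: 0, end_at: 3600) # e.g. a String from user input
from_db = Booking.new(room_id: 7, start_at: 1800, end_at: 5400)
other = Booking.new(room_id: 8, start_at: 0, end_at: 3600)

p from_form.room_id
p room_mismatch_variants(from_form, from_db)
p room_mismatch_variants(from_form, other)
```

Once every `room_id` is a small Integer, the three variants cannot disagree, so no input exists that a test could use to tell them apart.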
  45. Equivalent Mutants: THE PRINCIPLED RESPONSE

        # mutant:disable — documented, extracted, auditable
        def same_room?(booking_a, booking_b)
          booking_a.room_id == booking_b.room_id
        end

    – Mutations: 116 → 78 → 73.
    – Tests: 5 → 3.
    – Each removal justified by a design improvement.
  48. Resolving a Survived Mutation

    Every alive mutation has exactly ~two~ three valid responses:
    – A: Write a test. The behavior matters, pin it.
    – B: Fix the bug. The mutation found a real defect.
    – C: Remove the code. The code was dead weight.
  49. What powers your Quality dashboard?

    – One gem.
    – One metric.
    – Entire industry built on top.
  50. Your Quality Gate Is a Vanity Metric

    Coverage must be ≥ 90%
    Translation: 90% of code must be executed during tests
    Not: 90% of code must be verified by tests
  51. The Method Nobody Dared Touch

        def roof_age_info(quote)
          return nil unless quote

          quote_uw_data = get_roof_age_from_uw_data(quote.█████████)
          user_roof_age = user_roof_age_input(
            quote.█████████,
            quote.███████,
            quote.████████████████████████████████████(:roof_updated_year),
            quote.█████████?(:risk_roof_age_█████████),
          )
          quote_uw_vendor_roof_age = quote_████████████████████████
          quote_uw_vendor_source = (quote_██████████████████████████████ || Source::NONE)

          if user_roof_age.present?
            user_roof_age_int = transform_roof_age_from_answers_to_int(
              roof_age_from_answers: user_roof_age,
              quote_public_id: quote.██████████,
            )
            user_answer_same_as_vendor_data = (quote_uw_vendor_roof_age == user_roof_age_int)
            source = user_answer_same_as_vendor_data ? quote_uw_vendor_source : Source::USER
            return RoofDataModel.new(age: user_roof_age_int, source: source)
          end

          return nil unless quote_uw_vendor_roof_age.present?

          RoofDataModel.new(
            age: quote_uw_vendor_roof_age,
            source: quote_uw_vendor_source
          )
        end
  52. One Test for 156 Mutations

        Selected-Tests: 1
        Mutations: 156
        Kills: 95
        Alive: 61
        Coverage: 60.90%

    One test. Sixty-one blind spots.
  53. What Survived: THE CONDITION

        # Original
        if user_roof_age.present?
          # ... priority logic, source assignment, return
        end

        # Mutant: replaced with
        if true # alive
        if user_roof_age # alive
        if self.present? # alive

    – Every mutation on this condition is alive.
    – The condition was irrelevant to the only test.
  54. What Survived: THE UW DATA PATH

        # Original
        quote_uw_data = get_roof_age_from_uw_data(quote.██████████)

        # Mutant: replaced with
        quote_uw_data = nil # alive
        quote_uw_data = get_roof_age_from_uw_data(nil) # alive
        quote_uw_data = get_roof_age_from_uw_data(quote) # alive

    The entire vendor data path: zero test coverage.
  56. What we did

    Mutant forced us to answer: what are we actually testing?
    Answer: almost nothing.
  57. Spec diff

    +48 lines / -239 lines
    239 lines of copy-paste fixtures to 48 lines of composable helpers. More coverage. Half the code.
  58. The code simplification

        - if user_roof_age.present?
        + if user_roof_age

        - return nil unless quote_uw_vendor_roof_age.present?
        + return nil unless quote_uw_vendor_roof_age
  59. We didn't guess. We asked mutant.

    That's not a test quality tool. That's a design buddy.
  60. But the fox has a weakness

    "Test-Driven Development is a superpower when working with AI agents" (Kent Beck)
  61. Ruby's Unfair Advantage

    – Best mutation testing tool in any language, over a decade of development
    – Largest set of mutation operators
    – Dynamic language = more token-efficient context window
    – Running mutant via CLI costs zero tokens.
  62. Mutant is agent ready

    Alive mutations require one of two actions:
    A) Keep the mutated code: Your tests specify the correct semantics, and the original code is redundant. Accept the mutation.
    B) Add a missing test: The original code is correct, but the tests do not verify the behavior the mutation removed.

    Agent writes a real test. Not pattern noise. But you have to wire it up; the LLM won't do it on its own.
  63. Guardrails in Production

        # dev_workflow.rb — wired to /verify and pre-commit hook
        require_relative 'dev_workflow/steps/rubocop_step'
        require_relative 'dev_workflow/steps/rspec_step'
        require_relative 'dev_workflow/steps/mutant_step'

        def run_mutant(subjects)
          run_command(
            "bundle exec mutant run --since HEAD~1 #{subject_args}"
          )
        end

    – Each step returns .skipped, .success, or .failure.
    – The agent reads the output and fixes what broke.
    – The hook makes running it non-negotiable.
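One possible shape for such a step, as a sketch under stated assumptions. The talk shows only the command string and the three result names, so `StepResult`, `MutantStep`, the injectable `runner`, and the skip condition are all hypothetical. It also assumes that a non-zero exit status signals surviving mutations.

```ruby
require "open3"

# Hypothetical result object; the slide only names .skipped, .success and .failure.
class StepResult
  attr_reader :status, :output

  def initialize(status, output)
    @status = status
    @output = output
  end

  def self.skipped(output = "")
    new(:skipped, output)
  end

  def self.success(output = "")
    new(:success, output)
  end

  def self.failure(output = "")
    new(:failure, output)
  end
end

# Hypothetical step. The runner is injectable so the step can be exercised
# without invoking mutant; by default it shells out via Open3.capture2e.
class MutantStep
  def initialize(runner: Open3.method(:capture2e))
    @runner = runner
  end

  def call(subjects)
    return StepResult.skipped("no subjects changed") if subjects.empty?

    output, status = @runner.call(
      "bundle", "exec", "mutant", "run", "--since", "HEAD~1", *subjects
    )
    status.success? ? StepResult.success(output) : StepResult.failure(output)
  end
end
```

Returning the captured output along with the status matters for the agent loop: the alive-mutation diffs in that output are what the agent reads to decide what to fix.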
  64. Setup

        # Gemfile — choose your ~poison~ integration
        gem "mutant-rspec", group: :test
        # or
        gem "mutant-minitest", group: :test

        # .mutant.yml
        integration: rspec
        includes:
          - lib
        requires:
          - app
        operators: light # default — skips == → eql?
  65. Practical Adoption

        bundle exec mutant run -- "Scheduler"
        bundle exec mutant run -- "Scheduler#conflict?"
        bundle exec mutant run -- "Booking!#*"

    One module/class/method at a time. Don't boil the ocean.
  66. Mutant-unfriendly patterns

    – Anonymous classes → no constant to address
    – define_method / class_eval → invisible to AST
    – Just give your code a name
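A small sketch of the two shapes being contrasted. `build_anonymous_matcher` and `RoomMatcher` are hypothetical names; the claim about what Mutant can and cannot address is the slide's, and the code only shows the Ruby-level difference behind it.

```ruby
# Anonymous class with a dynamically defined method:
# no constant name, and no `def` node for the method.
def build_anonymous_matcher
  Class.new do
    define_method(:same_room?) { |a, b| a == b }
  end
end

# Same behavior, but addressable as RoomMatcher#same_room?
class RoomMatcher
  def same_room?(a, b)
    a == b
  end
end

p build_anonymous_matcher.name # nil: nothing to address by name
p RoomMatcher.name
p build_anonymous_matcher.new.same_room?(1, 1) == RoomMatcher.new.same_room?(1, 1)
```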
  70. Pain Points (Honest)

    – Slow tests × many mutations = slow feedback, but that's your tests, not mutant
    – Setup friction with unusual require chains
    – Commercial: ~$90~ $30/dev/month, ~$300~ $250 annual, OSS FREE
    – Not a silver bullet, another element of the safety net
  71. The Safety Net Stack

    Tool        Question answered
    SimpleCov   Did this code execute?
    RuboCop     Does it follow conventions?
    Mutant      Would tests catch a bug here?

    Remember the ??? from earlier? Mutant is the answer.
  72. def not_a_scam = puts "calendly"

    – Free office hours sponsored by Arkency
    – I'm in Austin until the end of March
    – Scan this code and mark your spot in calendly
    – Not limited to mutant, it can be anything Ruby related
  74. Would Your Tests Catch This Bug?

    Mutant doesn't care how good your prompts are.
    github.com/mbj/mutant
    blog.arkency.com
    fiedler.pro/hello