Would your test catch this bug? A mutation testing story

Would Your Tests Catch This Bug? A Mutation Testing Story
Paris.rb · 2026-05-05 · Szymon Fiedler · @szymonfiedler

Who tests their software?

Who measures code coverage?

Who codes with an LLM?

Who lets the LLM write tests?

Who got burned by the poor quality of those tests?

"It's a skill issue" (Random from the Internet)

"You're right, I was wrong. Let me reconsider." (Jean Claude van Code)

Sycophancy (la flagornerie)

Non-determinism
Same prompt tomorrow → different tests. You're getting some tests, not the tests.

Weak Pattern Matching
    it "does not raise" do
      expect { subject.call }.not_to raise_error
    end
The model learned the shape of tests, not the purpose. SKILLS.md will be ignored when inconvenient.

Can't you just prompt better?
These help at the margins. None of them give you proof.
- Explicit instructions → ignored when inconvenient
- Chain-of-thought → still sycophantic reasoning
- Contradiction → "You're right, let me reconsider"
- Lower temperature → less random, still wrong

Every Degradation Amplifies
AI models use your existing code as guidance for future code. Weak tests breed more weak tests modeled on them.

Three tools, three different questions
1. RuboCop: Does it follow conventions? ✓
2. SimpleCov: Did this line execute? ✓
3. ???: If this line were wrong, would tests catch it? ✗
Two tools, two checkmarks, one unanswered question. This is the gap.

What Is Mutation Testing?
- Deliberately break your code
- See if tests notice
- Killed: test failed on the mutation ✓
- Alive: test still passed ✗
- If you change a line and tests still pass, what are those tests actually testing?
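
The killed/alive distinction can be tried by hand before reaching for a tool. The sketch below is not from the talk: the `ADULT` predicate, the three hand-written mutants, and both "suites" are hypothetical, chosen only to show that a suite can pass on the original code and still let a mutant survive.

```ruby
# Hypothetical code under test (not from the talk).
ADULT = ->(age) { age >= 18 }

# Hand-written mutants, imitating the kind of change a mutation tool makes.
ADULT_MUTANTS = {
  ">= replaced with >"      => ->(age) { age > 18 },
  ">= replaced with =="     => ->(age) { age == 18 },
  "body replaced with true" => ->(_age) { true }
}

# Weak suite: only checks values far away from the boundary.
WEAK_SUITE = ->(adult) { adult.call(30) == true && adult.call(5) == false }

# Stronger suite: additionally pins both sides of the boundary.
STRONG_SUITE = lambda do |adult|
  WEAK_SUITE.call(adult) && adult.call(18) == true && adult.call(17) == false
end

# A mutant is alive when the suite still passes against it.
def alive_mutants(suite)
  ADULT_MUTANTS.select { |_name, mutant| suite.call(mutant) }.keys
end

puts "weak suite, alive:   #{alive_mutants(WEAK_SUITE).inspect}"
puts "strong suite, alive: #{alive_mutants(STRONG_SUITE).inspect}"
```

Both suites pass on the original. Only the stronger one notices when `>=` becomes `>`.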

I'm Szymon Fiedler

Napoleon passed through my town
So technically I'm returning the visit.
Image credit: Florida Center for Instructional Technology

Arkency has been doing Ruby since 2006

I've been working effectively with legacy code there for 12 of those 20 years

We've Been Here Since 2015
- The Ruby Event Sourcing gem
- Millions of downloads
- Very few real reported issues
- Mutation testing is one reason why

Why Coverage Lied to Us
    def add_subscriber(subscriber, event_types)
      raise SubscriberNotExist if subscriber.nil?
      raise MethodNotDefined unless subscriber.methods.include? :handle_event
      subscribe(subscriber, [*event_types])
    end
- Coverage: ✓
- Mutation: replace condition with true, false → alive
- Tests for MethodNotDefined path: zero
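
A rough way to see why the guard counts as covered but unverified: the sketch below wraps the method in a hypothetical `Broker` class, extracts the guard condition into a `handler?` helper so a subclass can play the mutant, and stubs `subscribe`. None of those names come from the talk. Only the two `raise` guards mirror the slide.

```ruby
SubscriberNotExist = Class.new(StandardError)
MethodNotDefined   = Class.new(StandardError)

# Hypothetical host class; only the two guards mirror the slide.
class Broker
  attr_reader :subscriptions

  def initialize
    @subscriptions = []
  end

  def add_subscriber(subscriber, event_types)
    raise SubscriberNotExist if subscriber.nil?
    raise MethodNotDefined unless handler?(subscriber)
    subscribe(subscriber, [*event_types])
  end

  private

  def handler?(subscriber)
    subscriber.methods.include?(:handle_event)
  end

  def subscribe(subscriber, event_types)
    @subscriptions << [subscriber, event_types]
  end
end

# Hand-written mutant: the guard condition is replaced with true.
class MutatedBroker < Broker
  private

  def handler?(_subscriber)
    true
  end
end

VALID_SUBSCRIBER = Object.new
def VALID_SUBSCRIBER.handle_event(_event); end

# Happy-path check: executes every line, so line coverage is satisfied.
HAPPY_PATH = lambda do |klass|
  broker = klass.new
  broker.add_subscriber(VALID_SUBSCRIBER, :order_placed)
  broker.subscriptions == [[VALID_SUBSCRIBER, [:order_placed]]]
end

# Guard check: the missing test for the MethodNotDefined path.
GUARD_PATH = lambda do |klass|
  klass.new.add_subscriber(Object.new, :order_placed)
  false
rescue MethodNotDefined
  true
end
```

The happy-path check passes for both classes, so on its own it leaves the mutant alive. The guard check passes for `Broker` and fails for `MutatedBroker`, which is what killing the mutant means.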

Operators
    - <                  + <=
    - !#                 + !$
    - ==                 + eql?
    - ==                 + !&
Method calls
    - hash[:key]         + hash.fetch(:key)
    - obj.public_send(:m)  + obj.__send__(:m)
Values & flow
    - value              + nil
    - return value       + return
    - if condition       + if true / if false
    - statement          + # removed entirely
Enumerable reductions
    - each_with_object   + each
    - filter_map         + map
Orthogonal swaps
    - select             + reject
    - all?               + none?
    - keys               + values
Bang, non-bang
    - map!               + map
    - sort!              + sort
    - compact!           + compact

Your code
    class Scheduler
      def conflict?(booking_a, booking_b, buffer: 300)
        return false if booking_a[:cancelled] || booking_b[:cancelled]
        return false if booking_a[:room_id] != booking_b[:room_id]

        earlier, later = [booking_a, booking_b].sort_by { |h| h[:start_at] }
        later[:start_at] < (earlier[:end_at] + buffer)
      end
    end

    class SchedulerTest < Minitest::Test
      cover "Scheduler"

      def test_conflicts
        assert Scheduler.new.conflict?(
          booking(1, '09:00', '10:00'),
          booking(1, '09:30', '10:30')
        )
        refute Scheduler.new.conflict?(
          booking(1, '09:00', '10:00'),
          booking(2, '09:30', '10:30')
        )
        refute Scheduler.new.conflict?(
          booking(1, '09:00', '10:00'),
          booking(1, '10:10', '11:00')
        )
        refute Scheduler.new.conflict?(
          booking(1, '09:00', '10:00', cancelled: true),
          booking(1, '09:30', '10:30')
        )
      end
    end
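
The test calls a `booking(...)` helper that the slides never show. One plausible shape, offered purely as an assumption, builds the hash that `Scheduler#conflict?` reads. The `clock` helper and the fixed date are invented for this sketch.

```ruby
# Hypothetical reconstruction: the deck uses booking(...) but does not define it.
def booking(room_id, starts, ends, cancelled: false)
  {
    room_id: room_id,
    start_at: clock(starts),
    end_at: clock(ends),
    cancelled: cancelled
  }
end

# Parses "HH:MM" or "HH:MM:SS" onto one arbitrary, fixed UTC day so that
# subtracting two times yields a difference in seconds.
def clock(text)
  hour, minute, second = text.split(":").map(&:to_i)
  Time.utc(2026, 5, 5, hour, minute, second || 0)
end

first  = booking(1, "09:00", "10:00")
second = booking(1, "10:10", "11:00")
puts second[:start_at] - first[:end_at] # gap between the two bookings, in seconds
```

Using `Time` values keeps `earlier[:end_at] + buffer` meaningful, because adding an integer to a `Time` advances it by that many seconds.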

A Mutation
    - def conflict?(booking_a, booking_b, buffer: 300)
    + def conflict?(booking_a, booking_b, buffer: 299)
- Tests pass.
- Mutation is alive.
- If you change a line and tests still pass, what are those tests actually testing?

Let's Walk Through It
SimpleCov says:
    Line Coverage: 100.0% (8/8)
    Branch Coverage: 100.0% (4/4)
Mutant says:
    Mutations: 123
    Kills: 102
    Alive: 21
    Coverage: 82.92%
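
The figure mutant labels "Coverage" is the share of mutations that were killed. A small sketch of the arithmetic follows. The helper name is made up, and truncating to two decimals is an assumption inferred from the slide's figure, not documented mutant behavior.

```ruby
# Mutation score as a percentage, truncated (not rounded) to two decimals.
# Integer arithmetic avoids floating-point surprises before the final division.
def mutation_score(kills, mutations)
  (kills * 10_000 / mutations) / 100.0
end

mutations = 123
kills     = 102
alive     = mutations - kills

puts "alive: #{alive}"
puts "score: #{mutation_score(kills, mutations)}%"
```

Line coverage asks how many lines ran. This ratio asks how many deliberate breakages the tests noticed.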

Reading the Output
    Mutant environment:
    Usage:           opensource
    Matcher:         #<Mutant::Matcher::Config subjects: [Scheduler]>
    Integration:     minitest
    Jobs:            10
    Includes:        []
    Requires:        ["./scheduler"]
    Operators:       light
    MutationTimeout: 5
    Subjects:        1
    All-Tests:       5
    Available-Tests: 5
    Selected-Tests:  1
    Tests/Subject:   1.00 avg
    Mutations:       123
    RUNNING 123/123 (100.0%) ████████████████████████████████████████ alive: 21 0.8s 163.95/s

What mutant generates
    - if booking_a[:cancelled] || booking_b[:cancelled]
    + if booking_a.fetch(:cancelled) || booking_b[:cancelled]

    - if booking_a[:room_id] != booking_b[:room_id]
    + if !booking_a[:room_id].eql?(booking_b[:room_id])

    - if booking_a[:room_id] != booking_b[:room_id]
    + if !booking_a[:room_id].equal?(booking_b[:room_id])

    - (earlier, later) = [booking_a, booking_b].sort_by { |h| h[:start_at] }
    + (earlier, later) = [booking_a, booking_b]

    - h[:start_at]
    + nil
- Operates on the AST, not text
- Every mutation is semantically valid
- Every mutation is a question

Alive Group 1: Buffer Boundary
    - def conflict?(booking_a, booking_b, buffer: 300)
    + def conflict?(booking_a, booking_b, buffer: 0)

    - later[:start_at] < (earlier[:end_at] + buffer)
    + later[:start_at] < earlier[:end_at]
- Buffer replaced with 0, 1, 299, 301. Tests pass every time.
- The buffer plays no role in any test.
- 6 mutations alive.

Alive Group 1: Buffer Boundary
    refute Scheduler.new.conflict?(
      booking(1, '09:00', '10:00'),
      booking(1, '10:10', '11:00')
    )
The closest existing test leaves a 600-second gap between bookings, so it passes for any buffer up to 600 seconds.

Fix: Pin the Boundary
    def test_buffer_boundary
      # 299 seconds apart: within 300s buffer → conflict
      assert Scheduler.new.conflict?(
        booking(1, "09:00:00", "10:00:00"),
        booking(1, "10:04:59", "11:00:00")
      )

      # exactly 300 seconds apart: at boundary → no conflict
      refute Scheduler.new.conflict?(
        booking(1, "09:00:00", "10:00:00"),
        booking(1, "10:05:00", "11:00:00")
      )
    end
82.92% → 87.80%
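
To see why these two assertions pin the default, one can replay them against the mutated buffer values listed earlier. The sketch below is a simplified stand-in, not the talk's code: times are plain integer seconds, `overlap?` and `slot` are hypothetical names, and the buffer is passed explicitly to imitate a mutated default.

```ruby
# Stand-in for Scheduler#conflict? with times as integer seconds (assumption).
def overlap?(a, b, buffer:)
  return false if a[:cancelled] || b[:cancelled]
  return false if a[:room_id] != b[:room_id]

  earlier, later = [a, b].sort_by { |h| h[:start_at] }
  later[:start_at] < (earlier[:end_at] + buffer)
end

def slot(room_id, start_at, end_at)
  { room_id: room_id, start_at: start_at, end_at: end_at, cancelled: false }
end

FIRST = slot(1, 0, 3600)

# The original gap test: second booking starts 600 seconds after the first ends.
def old_test_passes?(buffer)
  !overlap?(FIRST, slot(1, 3600 + 600, 7200), buffer: buffer)
end

# The two boundary assertions: 299 seconds apart conflicts, 300 does not.
def boundary_test_passes?(buffer)
  overlap?(FIRST, slot(1, 3600 + 299, 7200), buffer: buffer) &&
    !overlap?(FIRST, slot(1, 3600 + 300, 7200), buffer: buffer)
end

BUFFERS = [0, 1, 299, 300, 301]
OLD_RESULTS      = BUFFERS.to_h { |b| [b, old_test_passes?(b)] }
BOUNDARY_RESULTS = BUFFERS.to_h { |b| [b, boundary_test_passes?(b)] }

puts "old test:      #{OLD_RESULTS}"
puts "boundary test: #{BOUNDARY_RESULTS}"
```

The old test passes for every buffer in the list. The boundary test passes only for 300, so each mutated value now fails it.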

Alive Group 2: Sort Order
    - earlier, later = [booking_a, booking_b].sort_by { |h| h[:start_at] }
    + earlier, later = [booking_a, booking_b]
- Sort removed. Tests still pass.
- Every test passed bookings in chronological order.
- 7 mutations alive.

Fix: Reverse the Arguments
    def test_argument_order_does_not_matter
      # booking_a starts LATER than booking_b
      refute Scheduler.new.conflict?(
        booking(1, "10:00", "11:00"),
        booking(1, "09:00", "09:30")
      )
    end
87.80% → 92.68%
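
The effect of the reversed arguments can be checked in isolation. The sketch below compares the sorted comparison with a hand-written copy of the mutant. Times are integer seconds since midnight, and the lambda and helper names are hypothetical.

```ruby
BUFFER_SECONDS = 300

# Original logic: order the two bookings before comparing.
SORTED_CONFLICT = lambda do |a, b|
  earlier, later = [a, b].sort_by { |h| h[:start_at] }
  later[:start_at] < earlier[:end_at] + BUFFER_SECONDS
end

# Mutant: sort removed, argument order is trusted as-is.
UNSORTED_CONFLICT = lambda do |a, b|
  earlier, later = [a, b]
  later[:start_at] < earlier[:end_at] + BUFFER_SECONDS
end

# Builds a booking from hours and minutes, stored as seconds since midnight.
def seconds_slot(start_h, start_m, end_h, end_m)
  { start_at: start_h * 3600 + start_m * 60, end_at: end_h * 3600 + end_m * 60 }
end

MORNING = seconds_slot(9, 0, 9, 30)   # 09:00 to 09:30
LATE    = seconds_slot(10, 0, 11, 0)  # 10:00 to 11:00

puts "chronological: #{SORTED_CONFLICT.call(MORNING, LATE)} vs #{UNSORTED_CONFLICT.call(MORNING, LATE)}"
puts "reversed:      #{SORTED_CONFLICT.call(LATE, MORNING)} vs #{UNSORTED_CONFLICT.call(LATE, MORNING)}"
```

In chronological order the two versions agree, which is why the mutant survived. With the arguments reversed, the mutant reports a conflict where the original does not.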

Alive Group 3: [] → .fetch()
    - booking_a[:cancelled]
    + booking_a.fetch(:cancelled)
- Both return the same value when the key exists.
- No test can distinguish them.
- 6 mutations alive.
- This is not a test problem. This is a code problem.

Fix: Accept the Mutation
    - return false if booking_a[:cancelled] || booking_b[:cancelled]
    + return false if booking_a.fetch(:cancelled) || booking_b.fetch(:cancelled)

    - return false if booking_a[:room_id] != booking_b[:room_id]
    + return false if booking_a.fetch(:room_id) != booking_b.fetch(:room_id)

    - earlier, later = [booking_a, booking_b].sort_by { |h| h[:start_at] }
    + earlier, later = [booking_a, booking_b].sort_by { |h| h.fetch(:start_at) }
92.68% → 98.27%
No new tests. The code changed.
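
The two lookups only diverge when a key is missing, which is the case the stricter `fetch` version refuses to ignore. A minimal illustration with made-up hashes:

```ruby
COMPLETE  = { room_id: 1, cancelled: false }
MALFORMED = { room_id: 1 } # the :cancelled key is missing

# With the key present, both lookups return the same value.
SAME_WHEN_PRESENT = COMPLETE[:cancelled] == COMPLETE.fetch(:cancelled)

# With the key missing, [] quietly returns nil ...
LENIENT_RESULT = MALFORMED[:cancelled]

# ... while fetch raises KeyError.
STRICT_RESULT =
  begin
    MALFORMED.fetch(:cancelled)
  rescue KeyError => e
    e.class
  end

puts "same when present: #{SAME_WHEN_PRESENT}"
puts "[] on missing key: #{LENIENT_RESULT.inspect}"
puts "fetch on missing:  #{STRICT_RESULT}"
```

With `[]`, a booking that lacks `:cancelled` is treated as not cancelled. With `fetch`, the malformed input fails loudly instead.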

The Scheduler's Second Act
    # Before: raw hashes, 116 mutations
    booking_a.fetch(:cancelled)
    booking_a.fetch(:room_id)
    booking_a.fetch(:start_at)
- 100% mutation score. We kept going.
- Mutant told us the design could be better.
    # After: Data.define value object, 78 mutations
    Booking = Data.define(:room_id, :start_at, :end_at, :cancelled)
    booking_a.cancelled
    booking_a.room_id
    booking_a.start_at # no fetch mutations possible
- 38 mutations simply ceased to exist.

Equivalent Mutants
THE PRINCIPLED RESPONSE
    # Integer() coercion at construction time
    Booking = Data.define(:room_id, :start_at, :end_at, :cancelled) do
      def initialize(room_id:, start_at:, end_at:, cancelled: false)
        super(room_id: Integer(room_id), start_at:, end_at:, cancelled:)
      end
    end

    # mutant:disable - documented, extracted, auditable
    def same_room?(booking_a, booking_b)
      booking_a.room_id == booking_b.room_id
    end
- For integers: !=, !eql?, and !equal? are identical
- Ruby caches small integers as singleton objects
- mutant:disable is for equivalent mutants only: documented, auditable
- Mutations: 116 → 78 → 73. Tests: 5 → 3.
- Each removal justified by a design improvement.
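
Why the coercion matters for the equivalence argument can be checked directly. The sketch below reuses the `Booking` definition from the slide (it needs Ruby 3.2 or newer for `Data.define`). The sample values and constant names are invented for illustration.

```ruby
Booking = Data.define(:room_id, :start_at, :end_at, :cancelled) do
  def initialize(room_id:, start_at:, end_at:, cancelled: false)
    super(room_id: Integer(room_id), start_at:, end_at:, cancelled:)
  end
end

# Hypothetical inputs: one room id arrives as a string, one as an integer.
FROM_FORM = Booking.new(room_id: "7", start_at: 0, end_at: 3600)
FROM_DB   = Booking.new(room_id: 7, start_at: 3600, end_at: 7200)

# After coercion both ids are the small integer 7, so all three agree.
COERCED = {
  "=="     => FROM_FORM.room_id == FROM_DB.room_id,
  "eql?"   => FROM_FORM.room_id.eql?(FROM_DB.room_id),
  "equal?" => FROM_FORM.room_id.equal?(FROM_DB.room_id)
}

# Without a guaranteed type the three comparisons can disagree,
# for example an Integer compared with a Float.
MIXED = {
  "=="     => 7 == 7.0,
  "eql?"   => 7.eql?(7.0),
  "equal?" => 7.equal?(7.0)
}

puts "coerced ids: #{COERCED}"
puts "mixed types: #{MIXED}"
```

Once every `room_id` is known to be a small integer, swapping `==` for `eql?` or `equal?` cannot change the outcome. That is the justification for disabling those mutations on `same_room?`.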

Resolving a Survived Mutation
Every alive mutation has exactly ~two~ three valid responses:
- A: Write a test. The behavior matters, pin it.
- B: Fix the bug. The mutation found a real defect.
- C: Remove the code. The code was dead weight.

What powers your Quality dashboard?
- One gem.
- One metric.
- Entire industry built on top.

Your Quality Gate Is a Vanity Metric
Coverage must be ≥ 90%
Translation: 90% of code must be executed during tests
Not: 90% of code must be verified by tests

The Method Nobody Dared Touch
    def roof_age_info(quote)
      return nil unless quote

      quote_uw_data = get_roof_age_from_uw_data(quote.█████████)
      user_roof_age = user_roof_age_input(
        quote.█████████,
        quote.███████,
        quote.████████████████████████████████████(:roof_updated_year),
        quote.█████████?(:risk_roof_age_█████████),
      )
      quote_uw_vendor_roof_age = quote_████████████████████████
      quote_uw_vendor_source = (quote_██████████████████████████████ || Source::NONE)

      if user_roof_age.present?
        user_roof_age_int = transform_roof_age_from_answers_to_int(
          roof_age_from_answers: user_roof_age,
          quote_public_id: quote.██████████,
        )
        user_answer_same_as_vendor_data = (quote_uw_vendor_roof_age == user_roof_age_int)
        source = user_answer_same_as_vendor_data ? quote_uw_vendor_source : Source::USER
        return RoofDataModel.new(age: user_roof_age_int, source: source)
      end

      return nil unless quote_uw_vendor_roof_age.present?

      RoofDataModel.new(
        age: quote_uw_vendor_roof_age,
        source: quote_uw_vendor_source
      )
    end

One Test for 156 Mutations
    Selected-Tests: 1
    Mutations: 156
    Kills: 95
    Alive: 61
    Coverage: 60.90%
One test. Sixty-one blind spots.

What Survived
THE CONDITION
    # Original
    if user_roof_age.present?
      # ... priority logic, source assignment, return
    end

    # Mutant: replaced with
    if true            # alive
    if user_roof_age   # alive
    if self.present?   # alive
- Every mutation on this condition is alive.
- The condition was irrelevant to the only test.

What Survived
THE UW DATA PATH
    # Original
    quote_uw_data = get_roof_age_from_uw_data(quote.██████████)

    # Mutant: replaced with
    quote_uw_data = nil                               # alive
    quote_uw_data = get_roof_age_from_uw_data(nil)    # alive
    quote_uw_data = get_roof_age_from_uw_data(quote)  # alive
The entire vendor data path: zero test coverage.

What we did
Mutant forced us to answer: what are we actually testing?
Answer: almost nothing.

Spec diff
+48 lines / -239 lines
239 lines of copy-paste fixtures to 48 lines of composable helpers. More coverage. Half the code.

The code simplification
    - if user_roof_age.present?
    + if user_roof_age

    - return nil unless quote_uw_vendor_roof_age.present?
    + return nil unless quote_uw_vendor_roof_age
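
Dropping `.present?` preserves behavior only when the value can never be blank yet truthy. The sketch below uses a simplified, hypothetical stand-in for ActiveSupport's `present?` (so it runs without Rails) and lists the sample values where the two checks disagree. Whether such values can reach the real method is something only the surrounding code and its tests can answer.

```ruby
# Simplified stand-in for ActiveSupport's Object#present? (an assumption:
# the real implementation is spread across several core extensions).
def present_like?(value)
  return false if value.nil? || value == false
  return !value.strip.empty? if value.is_a?(String)
  return !value.empty? if value.respond_to?(:empty?)

  true
end

SAMPLES = [nil, false, 0, 12, "", "  ", "12", []]

# Values for which `if value.present?` and `if value` take different branches.
DIFFERS = SAMPLES.reject { |value| present_like?(value) == !!value }

puts "checks disagree for: #{DIFFERS.inspect}"
```

For `nil` and for integers the two checks agree. They disagree for empty or whitespace-only strings and for empty collections.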

We didn't guess. We asked mutant.
That's not a test quality tool. That's a design buddy.

But the fox has a weakness
"Test-Driven Development is a superpower when working with AI agents" (Kent Beck)

Red → Green → Refactor → Mutate

Ruby's Unfair Advantage
- Best mutation testing tool in any language, with over a decade of development
- Largest set of mutation operators
- Dynamic language = more token-efficient context window
- Running mutant via CLI costs zero tokens.

Mutant is agent ready
    Alive mutations require one of two actions:
    A) Keep the mutated code: Your tests specify the correct semantics,
       and the original code is redundant. Accept the mutation.
    B) Add a missing test: The original code is correct, but the tests
       do not verify the behavior the mutation removed.
Agent writes a real test. Not pattern noise.
But you have to wire it up; the LLM won't do it on its own.

Guardrails in Production
    # dev_workflow.rb, wired to /verify and pre-commit hook
    require_relative 'dev_workflow/steps/rubocop_step'
    require_relative 'dev_workflow/steps/rspec_step'
    require_relative 'dev_workflow/steps/mutant_step'

    def run_mutant(subjects)
      run_command(
        "bundle exec mutant run --since HEAD~1 #{subject_args}"
      )
    end
- Each step returns .skipped, .success, or .failure.
- The agent reads the output and fixes what broke.
- The hook makes running it non-negotiable.
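
The slide does not show what a step looks like inside. One possible shape is sketched below under stated assumptions: `StepResult`, `mutant_step`, and the injected `runner` are hypothetical names, and the runner is passed in so the sketch can be exercised without actually invoking mutant.

```ruby
# Hypothetical result type. The slide only says that each step returns
# .skipped, .success, or .failure; everything else here is assumed.
StepResult = Struct.new(:status, :output) do
  def self.skipped(reason)
    new(:skipped, reason)
  end

  def self.success(output)
    new(:success, output)
  end

  def self.failure(output)
    new(:failure, output)
  end
end

# runner is any callable that takes the argv array and returns
# [output, passed]. A real one might wrap Open3.capture2e.
def mutant_step(subjects, runner:)
  return StepResult.skipped("no subjects to mutate") if subjects.empty?

  command = ["bundle", "exec", "mutant", "run", "--since", "HEAD~1", *subjects]
  output, passed = runner.call(command)
  passed ? StepResult.success(output) : StepResult.failure(output)
end

# Usage with a fake runner that pretends two mutations stayed alive.
fake_runner = ->(_command) { ["alive: 2", false] }
result = mutant_step(["Scheduler#conflict?"], runner: fake_runner)
puts "#{result.status}: #{result.output}"
```

Returning a small result object instead of raising lets a workflow script, or an agent reading its output, decide what to do with each step.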

Setup & Practical Adoption
    # Gemfile: choose your ~poison~ integration
    gem "mutant-rspec", group: :test
    # or
    gem "mutant-minitest", group: :test

    # .mutant.yml
    integration: rspec
    includes:
      - lib
    requires:
      - app
    operators: light # default, skips == → eql?

    bundle exec mutant run -- "Scheduler"
    bundle exec mutant run -- "Scheduler#conflict?"
    bundle exec mutant run -- "Booking!%*"
- One module/class/method at a time: don't boil the ocean
- --since HEAD~1 for incremental runs on large codebases

Mutant-unfriendly patterns
- Anonymous classes → no constant to address
- define_method / class_eval → invisible to AST
- Just give your code a name

Pain Points (Honest)
- Slow tests × many mutations = slow feedback, but that's your tests, not mutant
- Setup friction with unusual require chains
- Commercial: ~$90~ $30/dev/month, ~$900~ $250 annual, OSS FREE
- Not a silver bullet: another element of the safety net

But what did your last production bug cost you?

Would Your Tests Catch This Bug?
Mutant doesn't care how good your prompts are.
- github.com/mbj/mutant
- arkency.com
- fiedler.pro/hello
