Would Your Tests Catch This Bug? A Mutation Testing Story
Szymon Fiedler @szymonfiedler

Who tests their software?

Who measures code coverage?

Who codes with an LLM?

Who lets the LLM write tests?

Who got burned by the poor quality of those tests?

"It's a skill issue."
Random from the Internet

"You're right, I was wrong. Let me reconsider."
Jean Claude van Code

Sycophancy

Non-determinism
Same prompt tomorrow → different tests.
You're getting some tests, not the tests.

Weak Pattern Matching

    it "does not raise" do
      expect { subject.call }.not_to raise_error
    end

The model learned the shape of tests, not the purpose.
SKILLS.md will be ignored when inconvenient.

Can't you just prompt better?
- Explicit instructions → ignored when inconvenient
- Chain-of-thought → still sycophantic reasoning
- Contradiction → "You're right, let me reconsider"
- Lower temperature → less random, still wrong
These help at the margins. None of them give you proof.

Every Degradation Amplifies
AI models use your existing code as guidance for future code.
Weak tests breed more weak tests modeled on them.

Three tools, three different questions
1. RuboCop: Does it follow conventions? ✓
2. SimpleCov: Did this line execute? ✓
3. ???: If this line were wrong, would tests catch it? ✗
Two tools, two checkmarks, one unanswered question. This is the gap.

What Is Mutation Testing?
- Deliberately break your code
- See if tests notice
- Killed: test failed on the mutation ✓
- Alive: test still passed ✗
If you change a line and tests still pass, what are those tests actually testing?
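The killed/alive distinction can be sketched by hand, without any tooling. The two lambdas and the two toy "suites" below are hypothetical and exist only for illustration; mutant automates this loop over real code and real tests.

```ruby
# Original implementation and one hand-made mutant (>= changed to >).
original = ->(total) { total >= 100 ? 10 : 0 }
mutant   = ->(total) { total >  100 ? 10 : 0 }

# A weak suite: executes both branches but never probes the boundary.
weak_suite = ->(impl) { impl.call(500) == 10 && impl.call(5) == 0 }

# A stronger suite: also pins the behavior exactly at the boundary.
strong_suite = ->(impl) { weak_suite.call(impl) && impl.call(100) == 10 }

# A mutant is killed when the suite fails on it, alive when it still passes.
def verdict(suite, impl)
  suite.call(impl) ? :alive : :killed
end

puts verdict(weak_suite, mutant)    # the weak suite cannot tell the difference
puts verdict(strong_suite, mutant)  # the boundary assertion notices the change
```

Both suites pass against the original and both execute every branch, so a coverage report would not separate them. Only the mutant does.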

I'm Szymon Fiedler from Poland
Poland, OH / Poland, ME / Poland, IN / Poland, NY / The REAL one
Photos: Poznań, High Tatras, Białowieża Forest, Cracow (© Polish Tourism Organization)

Just back from Big Bend
Rare experience, well done!

Arkency has been doing Ruby since 2006

I've been working effectively with legacy code there for 12 of those 20 years

We've Been Here Since 2015
- The Ruby Event Sourcing gem
- Millions of downloads
- Very few real reported issues
- Mutation testing is one reason why

Why Coverage Lied to Us

    def add_subscriber(subscriber, event_types)
      raise SubscriberNotExist if subscriber.nil?
      raise MethodNotDefined unless subscriber.methods.include? :handle_event
      subscribe(subscriber, [*event_types])
    end

- Coverage: ✓
- Mutation: replace condition with true, false → alive
- Tests for MethodNotDefined path: zero
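A minimal sketch of what pinning the untested path could look like. The `Broker` host class, its `subscriptions` reader, the `error_raised` helper, and the `:order_placed` event type are all hypothetical; only `add_subscriber` itself comes from the slide.

```ruby
SubscriberNotExist = Class.new(StandardError)
MethodNotDefined   = Class.new(StandardError)

# Hypothetical host class for the method shown on the slide.
class Broker
  attr_reader :subscriptions

  def initialize
    @subscriptions = []
  end

  def add_subscriber(subscriber, event_types)
    raise SubscriberNotExist if subscriber.nil?
    raise MethodNotDefined unless subscriber.methods.include? :handle_event
    subscribe(subscriber, [*event_types])
  end

  private

  def subscribe(subscriber, event_types)
    @subscriptions << [subscriber, event_types]
  end
end

# Hypothetical helper: returns the class of the error raised by the block, or nil.
def error_raised
  yield
  nil
rescue StandardError => e
  e.class
end

handler = Object.new
def handler.handle_event(_event); end

broker = Broker.new
# Happy path: every line of add_subscriber executes, so line coverage is satisfied.
broker.add_subscriber(handler, :order_placed)

# The path nobody tested: an object without #handle_event must be rejected.
rejected = error_raised { broker.add_subscriber(Object.new, :order_placed) }
puts rejected.inspect
```

The happy-path call alone touches both guard lines, which is why coverage looks complete. The second call is the one that fails if the `unless` condition is replaced with a constant.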

Operators
    - < + !" - !# + !$ - !% - eql? - !% + !&

Method calls
    - hash[:key]
    + hash.fetch(:key)

    - obj.public_send(:m)
    + obj.__send__(:m)

Values & flow
    - value
    + nil

    - return value
    + return

    - if condition
    + if true / if false

    - statement
    + # removed entirely

Enumerable reductions
    - each_with_object
    + each

    - filter_map
    + map

Orthogonal swaps
    - select
    + reject

    - all?
    + none?

    - keys
    + values

Bang, non-bang
    - map!
    + map

    - sort!
    + sort

    - compact!
    + compact

Standing on solid foundations

    # parser turns this:
    later.start_at < (earlier.end_at + buffer)

    # into this:
    s(:send,
      s(:send, s(:lvar, :later), :start_at),
      :<,                                   # mutant targets this node
      s(:send,
        s(:send, s(:lvar, :earlier), :end_at),
        :+,
        s(:lvar, :buffer)))

- parser gem → AST (Abstract Syntax Tree)
- The same parser that powers RuboCop.
- mutant replaces the node.
- unparser regenerates valid Ruby.

Your code

    class Scheduler
      def conflict?(booking_a, booking_b, buffer: 300)
        return false if booking_a[:cancelled] || booking_b[:cancelled]
        return false if booking_a[:room_id] != booking_b[:room_id]

        earlier, later = [booking_a, booking_b].sort_by { |h| h[:start_at] }
        later[:start_at] < (earlier[:end_at] + buffer)
      end
    end

    class SchedulerTest < Minitest::Test
      cover "Scheduler"

      def test_conflicts
        assert Scheduler.new.conflict?(
          booking(1, '09:00', '10:00'),
          booking(1, '09:30', '10:30')
        )
        refute Scheduler.new.conflict?(
          booking(1, '09:00', '10:00'),
          booking(2, '09:30', '10:30')
        )
        refute Scheduler.new.conflict?(
          booking(1, '09:00', '10:00'),
          booking(1, '10:10', '11:00')
        )
        refute Scheduler.new.conflict?(
          booking(1, '09:00', '10:00', cancelled: true),
          booking(1, '09:30', '10:30')
        )
      end
    end

A Mutation

    - def conflict?(booking_a, booking_b, buffer: 300)
    + def conflict?(booking_a, booking_b, buffer: 299)

- Tests pass.
- Mutation is alive.
- If you change a line and tests still pass, what are those tests actually testing?

Let's Walk Through It

SimpleCov says:
    Line Coverage: 100.0% (8/8)
    Branch Coverage: 100.0% (4/4)

Mutant says:
    Mutations: 123
    Kills: 102
    Alive: 21
    Coverage: 82.92%
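The mutation coverage figure is simply kills divided by total mutations. The helper below is a hypothetical sketch, not mutant's own code; truncating to two decimals instead of rounding is an assumption inferred from the numbers in this talk.

```ruby
# Hypothetical helper: percentage of mutations killed, truncated to 2 decimals.
def mutation_coverage(kills:, mutations:)
  (kills * 100.0 / mutations).floor(2)
end

puts mutation_coverage(kills: 102, mutations: 123)
```

With 102 kills out of 123 mutations this reproduces the 82.92% reported above, where ordinary rounding would have given 82.93%.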

Reading the Output

    Mutant environment:
    Usage:           opensource
    Matcher:         #<Mutant::Matcher::Config subjects: [Scheduler]>
    Integration:     minitest
    Jobs:            10
    Includes:        []
    Requires:        ["./scheduler"]
    Operators:       light
    MutationTimeout: 5
    Subjects:        1
    All-Tests:       5
    Available-Tests: 5
    Selected-Tests:  1
    Tests/Subject:   1.00 avg
    Mutations:       123
    RUNNING 123/123 (100.0%) ████████████████████████████████████████ alive: 21 0.8s 163.95/s

Reading the Output

    Scheduler#conflict?:/Users/fidel/code/mutant/scheduler.rb:5
    - minitest:SchedulerTest#test_conflicts
    evil:Scheduler#conflict?:/Users/fidel/code/mutant/scheduler.rb:5:4a4cd
    -----------------------
    Killfork: #<Process::Status: pid 24868 exit 0>
    Log messages (combined stderr and stdout):
    [killfork] 1 runs, 5 assertions, 0 failures, 0 errors, 0 skips
    @@ -1,12 +1,12 @@
    -def conflict?(booking_a, booking_b, buffer: 300)
    +def conflict?(booking_a, booking_b, buffer: 299)
       if booking_a[:cancelled] || booking_b[:cancelled]
         return false
       end

What mutant generates

    - if booking_a[:cancelled] || booking_b[:cancelled]
    + if booking_a.fetch(:cancelled) || booking_b[:cancelled]

    - if booking_a[:room_id] != booking_b[:room_id]
    + if !booking_a[:room_id].eql?(booking_b[:room_id])

    - if booking_a[:room_id] != booking_b[:room_id]
    + if !booking_a[:room_id].equal?(booking_b[:room_id])

    - (earlier, later) = [booking_a, booking_b].sort_by { |h| h[:start_at] }
    + (earlier, later) = [booking_a, booking_b]

    - h[:start_at]
    + nil

- Operates on the AST, not text
- Every mutation is semantically valid
- Every mutation is a question

Alive Group 1: Buffer Boundary

    - def conflict?(booking_a, booking_b, buffer: 300)
    + def conflict?(booking_a, booking_b, buffer: 0)

    - later[:start_at] < (earlier[:end_at] + buffer)
    + later[:start_at] < earlier[:end_at]

- Buffer replaced with 0, 1, 299, 301. Tests pass every time.
- The buffer plays no role in any test.
- 6 mutations alive.

The only test near the buffer sits ten minutes past the end, far from the 300-second boundary:

    refute Scheduler.new.conflict?(
      booking(1, '09:00', '10:00'),
      booking(1, '10:10', '11:00')
    )

Fix: Pin the Boundary

    def test_buffer_boundary
      # 299 seconds apart: within 300s buffer → conflict
      assert Scheduler.new.conflict?(
        booking(1, "09:00:00", "10:00:00"),
        booking(1, "10:04:59", "11:00:00")
      )
      # exactly 300 seconds apart: at boundary → no conflict
      refute Scheduler.new.conflict?(
        booking(1, "09:00:00", "10:00:00"),
        booking(1, "10:05:00", "11:00:00")
      )
    end

82.92% → 87.80%
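The effect of the boundary test can be checked by hand. In the sketch below, the `booking` helper is hypothetical (the talk uses it but never shows it, so the fixed date and the use of `Time` are assumptions), and passing `buffer:` explicitly stands in for mutating the default value.

```ruby
require "time"

class Scheduler
  def conflict?(booking_a, booking_b, buffer: 300)
    return false if booking_a[:cancelled] || booking_b[:cancelled]
    return false if booking_a[:room_id] != booking_b[:room_id]

    earlier, later = [booking_a, booking_b].sort_by { |h| h[:start_at] }
    later[:start_at] < (earlier[:end_at] + buffer)
  end
end

# Hypothetical helper: builds a booking hash on a fixed, arbitrary date.
def booking(room_id, start_at, end_at, cancelled: false)
  {
    room_id: room_id,
    start_at: Time.parse("2025-01-01 #{start_at} UTC"),
    end_at: Time.parse("2025-01-01 #{end_at} UTC"),
    cancelled: cancelled
  }
end

# The four assertions from the original test_conflicts.
def original_tests_pass?(buffer)
  s = Scheduler.new
  s.conflict?(booking(1, "09:00", "10:00"), booking(1, "09:30", "10:30"), buffer: buffer) &&
    !s.conflict?(booking(1, "09:00", "10:00"), booking(2, "09:30", "10:30"), buffer: buffer) &&
    !s.conflict?(booking(1, "09:00", "10:00"), booking(1, "10:10", "11:00"), buffer: buffer) &&
    !s.conflict?(booking(1, "09:00", "10:00", cancelled: true), booking(1, "09:30", "10:30"), buffer: buffer)
end

# The two assertions from test_buffer_boundary.
def boundary_tests_pass?(buffer)
  s = Scheduler.new
  s.conflict?(booking(1, "09:00:00", "10:00:00"), booking(1, "10:04:59", "11:00:00"), buffer: buffer) &&
    !s.conflict?(booking(1, "09:00:00", "10:00:00"), booking(1, "10:05:00", "11:00:00"), buffer: buffer)
end

[0, 1, 299, 300, 301].each do |buffer|
  puts "buffer=#{buffer} original=#{original_tests_pass?(buffer)} boundary=#{boundary_tests_pass?(buffer)}"
end
```

The original assertions pass for every buffer value, which is what "alive" means here. The boundary assertions pass only for 300: the 10:04:59 case fails for anything smaller, and the 10:05:00 case fails for anything larger.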

Alive Group 2: Sort Order

    - earlier, later = [booking_a, booking_b].sort_by { |h| h[:start_at] }
    + earlier, later = [booking_a, booking_b]

- Sort removed. Tests still pass.
- Every test passed bookings in chronological order.
- 7 mutations alive.

Fix: Reverse the Arguments

    def test_argument_order_does_not_matter
      # booking_a starts LATER than booking_b
      refute Scheduler.new.conflict?(
        booking(1, "10:00", "11:00"),
        booking(1, "09:00", "09:30")
      )
    end

87.80% → 92.68%
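Why reversed arguments kill the sort mutation can be seen by isolating the two lines involved. The lambdas and the `booking` helper below are hypothetical stand-ins for illustration, not the talk's code.

```ruby
require "time"

# Hypothetical helper: builds a booking hash on a fixed, arbitrary date.
def booking(room_id, start_at, end_at)
  { room_id: room_id, cancelled: false,
    start_at: Time.parse("2025-01-01 #{start_at} UTC"),
    end_at: Time.parse("2025-01-01 #{end_at} UTC") }
end

# The overlap check as written, reduced to the two lines that matter.
with_sort = lambda do |a, b, buffer = 300|
  earlier, later = [a, b].sort_by { |h| h[:start_at] }
  later[:start_at] < (earlier[:end_at] + buffer)
end

# The mutant: sort_by removed, argument order decides which booking is "earlier".
without_sort = lambda do |a, b, buffer = 300|
  earlier, later = [a, b]
  later[:start_at] < (earlier[:end_at] + buffer)
end

early = booking(1, "09:00", "09:30")
late  = booking(1, "10:00", "11:00")

puts with_sort.call(late, early)     # reversed arguments, original code
puts without_sort.call(late, early)  # reversed arguments, mutant
```

With chronological arguments both versions agree, so no earlier test could separate them. With reversed arguments the mutant compares 09:00 against 11:05 and reports a conflict that does not exist.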

Alive Group 3: [] → .fetch()

    - booking_a[:cancelled]
    + booking_a.fetch(:cancelled)

- Both return the same value when the key exists.
- No test can distinguish them.
- 6 mutations alive.
- This is not a test problem. This is a code problem.

Fix: Accept the Mutation

    - return false if booking_a[:cancelled] || booking_b[:cancelled]
    + return false if booking_a.fetch(:cancelled) || booking_b.fetch(:cancelled)

    - return false if booking_a[:room_id] != booking_b[:room_id]
    + return false if booking_a.fetch(:room_id) != booking_b.fetch(:room_id)

    - earlier, later = [booking_a, booking_b].sort_by { |h| h[:start_at] }
    + earlier, later = [booking_a, booking_b].sort_by { |h| h.fetch(:start_at) }

92.68% → 98.27%
No new tests. The code changed.

Alive Group 4: Ruby's Equality Semantics

    - if booking_a.fetch(:room_id) != booking_b.fetch(:room_id)
    + if !booking_a.fetch(:room_id).eql?(booking_b.fetch(:room_id))
    + if !booking_a.fetch(:room_id).equal?(booking_b.fetch(:room_id))

- != : value comparison (coerces types)
- eql? : value + type must match
- equal? : object identity (same pointer)
- 2 mutations alive.
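The three equality methods can be compared directly in plain Ruby. The variable names are illustrative; `String.new` is used so that the two strings are guaranteed to be separate objects regardless of frozen-string-literal settings.

```ruby
int_id   = 1
float_id = 1.0
str_a    = String.new("room_1")
str_b    = String.new("room_1")

puts int_id == float_id     # == compares values across numeric types
puts int_id.eql?(float_id)  # eql? also requires the same class
puts str_a == str_b         # same content
puts str_a.eql?(str_b)      # same class and same content
puts str_a.equal?(str_b)    # but two distinct objects
puts int_id.equal?(1)       # small integers are always the same object
```

These are exactly the gaps a test can exploit: Integer versus Float separates `==` from `eql?`, and two equal strings separate `==` from `equal?`.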

Fix: Exploit the Type Gap

    def test_same_room_with_string_ids
      # String "room_1": same value, different objects
      # Kills != → !equal? (equal? checks identity)
      assert Scheduler.new.conflict?(
        booking("room_1", "09:00", "10:00"),
        booking("room_1", "09:30", "10:30")
      )
    end

    def test_different_room_with_comparable_types
      # Integer 1 vs Float 1.0: != says equal, eql? says not
      # Kills != → !eql?
      assert Scheduler.new.conflict?(
        booking(1, "09:00", "10:00"),
        booking(1.0, "09:30", "10:30")
      )
    end

98.27% → 100.00%: 116/116 killed, 0 alive.

The Scheduler's Second Act

    # Before: raw hashes, 116 mutations
    booking_a.fetch(:cancelled)
    booking_a.fetch(:room_id)
    booking_a.fetch(:start_at)

- 100% mutation score. We kept going.
- Mutant told us the design could be better.

    # After: Data.define value object, 78 mutations
    Booking = Data.define(:room_id, :start_at, :end_at, :cancelled)
    booking_a.cancelled
    booking_a.room_id
    booking_a.start_at  # no fetch mutations possible

- 38 mutations simply ceased to exist.
- test_same_room_with_string_ids removed. Never tested domain behavior.

Equivalent Mutants
THE PRINCIPLED RESPONSE

    # Integer() coercion at construction time
    Booking = Data.define(:room_id, :start_at, :end_at, :cancelled) do
      def initialize(room_id:, start_at:, end_at:, cancelled: false)
        super(room_id: Integer(room_id), start_at:, end_at:, cancelled:)
      end
    end

- For integers: !=, !eql?, and !equal? are identical.
- Ruby caches small integers as singleton objects.
- No test can kill these mutations: no behavior differs.
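A short usage sketch of the coercing constructor, assuming Ruby 3.2 or newer for `Data.define`. The integer start and end values are placeholders chosen for brevity, and the variable names are hypothetical.

```ruby
# Requires Ruby 3.2+ for Data.define.
Booking = Data.define(:room_id, :start_at, :end_at, :cancelled) do
  def initialize(room_id:, start_at:, end_at:, cancelled: false)
    super(room_id: Integer(room_id), start_at:, end_at:, cancelled:)
  end
end

# The same room arriving as a String and as an Integer.
from_form = Booking.new(room_id: "7", start_at: 900, end_at: 1000)
from_db   = Booking.new(room_id: 7, start_at: 930, end_at: 1030)

puts from_form.room_id.class
puts from_form.room_id.equal?(from_db.room_id)

# Input that cannot be read as an integer is rejected at construction time.
begin
  Booking.new(room_id: "room_1", start_at: 900, end_at: 1000)
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

Once every `room_id` is a small Integer, the three comparison variants cannot disagree, which is what makes the remaining equality mutants equivalent rather than untested.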

Equivalent Mutants
THE PRINCIPLED RESPONSE

    # mutant:disable - documented, extracted, auditable
    def same_room?(booking_a, booking_b)
      booking_a.room_id == booking_b.room_id
    end

- Mutations: 116 → 78 → 73.
- Tests: 5 → 3.
- Each removal justified by a design improvement.

Resolving a Survived Mutation
Every alive mutation has exactly three valid responses:
- A: Write a test. The behavior matters, pin it.
- B: Fix the bug. The mutation found a real defect.
- C: Remove the code. The code was dead weight.

What powers your Quality dashboard?
- One gem.
- One metric.
- Entire industry built on top.

Your Quality Gate Is a Vanity Metric
Coverage must be ≥ 90%
Translation: 90% of code must be executed during tests
Not: 90% of code must be verified by tests

The Method Nobody Dared Touch

    def roof_age_info(quote)
      return nil unless quote

      quote_uw_data = get_roof_age_from_uw_data(quote.█████████)
      user_roof_age = user_roof_age_input(
        quote.█████████,
        quote.███████,
        quote.████████████████████████████████████(:roof_updated_year),
        quote.█████████?(:risk_roof_age_█████████),
      )
      quote_uw_vendor_roof_age = quote_████████████████████████
      quote_uw_vendor_source = (quote_██████████████████████████████ || Source::NONE)

      if user_roof_age.present?
        user_roof_age_int = transform_roof_age_from_answers_to_int(
          roof_age_from_answers: user_roof_age,
          quote_public_id: quote.██████████,
        )
        user_answer_same_as_vendor_data = (quote_uw_vendor_roof_age == user_roof_age_int)
        source = user_answer_same_as_vendor_data ? quote_uw_vendor_source : Source::USER
        return RoofDataModel.new(age: user_roof_age_int, source: source)
      end

      return nil unless quote_uw_vendor_roof_age.present?

      RoofDataModel.new(
        age: quote_uw_vendor_roof_age,
        source: quote_uw_vendor_source
      )
    end

One Test for 156 Mutations

    Selected-Tests: 1
    Mutations: 156
    Kills: 95
    Alive: 61
    Coverage: 60.90%

One test. Sixty-one blind spots.

What Survived
THE CONDITION

    # Original
    if user_roof_age.present?
      # ... priority logic, source assignment, return
    end

    # Mutant: replaced with
    if true            # alive
    if user_roof_age   # alive
    if self.present?   # alive

- Every mutation on this condition is alive.
- The condition was irrelevant to the only test.

What Survived
THE UW DATA PATH

    # Original
    quote_uw_data = get_roof_age_from_uw_data(quote.██████████)

    # Mutant: replaced with
    quote_uw_data = nil                                # alive
    quote_uw_data = get_roof_age_from_uw_data(nil)     # alive
    quote_uw_data = get_roof_age_from_uw_data(quote)   # alive

The entire vendor data path: zero test coverage.

What we did
Mutant forced us to answer: what are we actually testing?
Answer: almost nothing.

Spec diff
+48 lines / -239 lines
239 lines of copy-paste fixtures to 48 lines of composable helpers.
More coverage. Half the code.

The code simplification

    - if user_roof_age.present?
    + if user_roof_age

    - return nil unless quote_uw_vendor_roof_age.present?
    + return nil unless quote_uw_vendor_roof_age

We didn't guess. We asked mutant.
That's not a test quality tool. That's a design buddy.

But the fox has a weakness
"Test-Driven Development is a superpower when working with AI agents"
Kent Beck

Red → Green → Refactor → Mutate

Ruby's Unfair Advantage
- Best mutation testing tool in any language, with over a decade of development
- Largest set of mutation operators
- Dynamic language = more token-efficient context window
- Running mutant via CLI costs zero tokens.

Mutant is agent ready

    Alive mutations require one of two actions:
    A) Keep the mutated code: Your tests specify the correct semantics,
       and the original code is redundant. Accept the mutation.
    B) Add a missing test: The original code is correct, but the tests
       do not verify the behavior the mutation removed.

Agent writes a real test. Not pattern noise.
But you have to wire it up; the LLM won't do it on its own.

Guardrails in Production

    # dev_workflow.rb, wired to /verify and pre-commit hook
    require_relative 'dev_workflow/steps/rubocop_step'
    require_relative 'dev_workflow/steps/rspec_step'
    require_relative 'dev_workflow/steps/mutant_step'

    def run_mutant(subjects)
      run_command(
        "bundle exec mutant run --since HEAD~1 #{subject_args}"
      )
    end

- Each step returns .skipped, .success, or .failure.
- The agent reads the output and fixes what broke.
- The hook makes running it non-negotiable.
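One way a step with three outcomes could be shaped is sketched below. `StepResult` and `run_step` are hypothetical names, not taken from the talk's `dev_workflow.rb`, and the commands are stand-ins so the sketch runs without mutant installed.

```ruby
require "open3"
require "rbconfig"

# Hypothetical result object; the talk only says that steps return
# .skipped, .success, or .failure.
StepResult = Struct.new(:status, :output) do
  def self.skipped
    new(:skipped, "")
  end

  def self.success(output)
    new(:success, output)
  end

  def self.failure(output)
    new(:failure, output)
  end
end

# Runs a command and maps its exit status to a StepResult.
# An empty command means there is nothing to check (e.g. no changed subjects).
def run_step(command)
  return StepResult.skipped if command.empty?

  output, status = Open3.capture2e(*command)
  status.success? ? StepResult.success(output) : StepResult.failure(output)
end

# Stand-in commands that mimic a passing and a failing tool run.
passing = [RbConfig.ruby, "-e", "puts 'ok'"]
failing = [RbConfig.ruby, "-e", "warn 'alive: 3'; exit 1"]

[[], passing, failing].each do |command|
  result = run_step(command)
  puts "#{result.status}: #{result.output.strip}"
end
```

Capturing stdout and stderr together keeps the full tool output available to whoever reads the result, whether that is a person or an agent.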

Setup

    # Gemfile: choose your ~poison~ integration
    gem "mutant-rspec", group: :test
    # or
    gem "mutant-minitest", group: :test

    # .mutant.yml
    integration: rspec
    includes:
      - lib
    requires:
      - app
    operators: light # default, skips == → eql?

Practical Adoption

    bundle exec mutant run -- "Scheduler"
    bundle exec mutant run -- "Scheduler#conflict?"
    bundle exec mutant run -- "Booking!#*"

- One module/class/method at a time: don't boil the ocean
- Run in CI after merge to main
- --since HEAD~whatever for incremental runs on large codebases
- assert Scheduler.new.conflict?(a, b) beats refute_nil result

Mutant-unfriendly patterns
- Anonymous classes → no constant to address
- define_method / class_eval → invisible to AST
- Just give your code a name

Pain Points (Honest)
- Slow tests × many mutations = slow feedback (but that's your tests, not mutant)
- Setup friction with unusual require chains
- Commercial: ~$90~ $30/dev/month, ~$300~ $250 annual, OSS FREE
- Not a silver bullet, just another element of the safety net

But what did your last production bug cost you?

The Safety Net Stack

    Tool        Question answered
    SimpleCov   Did this code execute?
    RuboCop     Does it follow conventions?
    Mutant      Would tests catch a bug here?

Remember the ??? from earlier? Mutant is the answer.

def not_a_scam = puts "calendly"
- Free office hours sponsored by Arkency
- I'm in Austin until the end of March
- Scan this code and mark your spot in calendly
- Not limited to mutant, it can be anything Ruby related

Would Your Tests Catch This Bug?
Mutant doesn't care how good your prompts are.
github.com/mbj/mutant
blog.arkency.com
fiedler.pro/hello
