Code smarter, not harder | DWX 2026

Code smarter, not harder How AI coding tools boost productivity
— and where they don't. Daniel Sogl @sogldaniel Consultant @ Thinktecture

About me
Daniel Sogl
Consultant @ Thinktecture AG
MVP: Developer & Web Technologies
Focus: Developer Productivity & Generative AI
Socials: linktr.ee/daniel_sogl

6 Acts in 60 Minutes
Act 1: The Problem. We got faster. Did we get better?
Act 2: Teach the Agent. Rules, guidelines & skills: give it your team's judgment.
Act 3: Think Before You Code. Specs over vibes, and who reviews the plan.
Act 4: Make "Done" Machine-Checkable. Tests, hooks & CI as the guardrail.
Act 5: Scale It. Multi-agent teams, and paying down the debt they'd otherwise multiply.
Act 6: The New Bottleneck. Shipping was never the hard part; stability is.

Act 1 · Question 1 of 6: does it even work, and could you tell?
The Productivity Question: can you measure it?

ADOPTION IS SOLVED → THE TRUST GAP
84% use or plan to use AI
33% trust their accuracy (down from 43%)
We use it. We don't trust it. We use it anyway.
Source: Stack Overflow Developer Survey 2025

YOU CAN'T TRUST THE FEELING · METR RCT, JULY 2025
RCT · experienced OSS devs · mature repos
−19%: they were slower with AI, while they felt 20% faster.
If you can't feel it, you have to measure it.
Stanford: self-rated productivity is “almost as good as flipping a coin.”
Sources: METR, arXiv:2507.09089 · Stanford, Sep 2025 · 2026 retest: signal too noisy to call

THE QUESTION · WHAT EVEN IS PRODUCTIVITY?
How do you measure developer productivity?
The textbook definition: Productivity = Output / Input
Clean on a factory floor. Countable widgets, countable hours.
…but which output? Lines? Commits? PRs? None of them equal value delivered.
…and whose input? Thinking, reviewing, understanding: the real work isn't typing.
Source: Peter Drucker on knowledge-worker productivity (1999)

THE THREE FRAMEWORKS BEHIND IT · PICK THE METRIC THAT GUARDS THE RISK
DORA · Delivery outcomes · 2018 / 2025
  Covers: Deployment Frequency · Lead Time · Change-Failure Rate · Failed-Deploy Recovery
  AI guard metric: Change-Failure Rate. More PRs, same broken-deploy share? Then it's volume, not speed.
SPACE · The human system · 2021
  Covers: Satisfaction · Performance · Activity · Communication · Efficiency & Flow
  AI guard metric: Satisfaction + Flow. Never one number alone: pair any output metric with a human one.
DevEx · Daily friction · 2023
  Covers: Flow State · Feedback Loops · Cognitive Load
  AI guard metric: Cognitive Load. AI moves effort from writing to understanding; review is the new cost.
DX Core 4 rolls all three into four tensions: Speed · Quality · Effectiveness · Impact. No single AI-productivity number to game.
Sources: DORA (Forsgren/Humble/Kim; dora.dev, 2025) · SPACE (Forsgren et al., 2021) · DevEx (Noda, Storey, Forsgren, 2023) · DX Core 4 (Tacho & Noda, 2024)
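A minimal sketch of the change-failure-rate guard named above: count deployments before and after AI adoption and compare the share that failed. The `Deployment` record and the sample numbers are invented for illustration; real data would come from your deploy pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """One production deployment; `failed` means it needed a rollback or hotfix."""
    id: str
    failed: bool


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Share of deployments that failed, as a fraction between 0.0 and 1.0."""
    if not deployments:
        return 0.0
    return sum(d.failed for d in deployments) / len(deployments)


# Invented sample data: twice as many deployments after adopting AI tooling.
before = [Deployment(f"b{i}", failed=(i < 2)) for i in range(10)]  # 2 of 10 failed
after = [Deployment(f"a{i}", failed=(i < 4)) for i in range(20)]   # 4 of 20 failed

print(f"before: {len(before)} deploys, CFR {change_failure_rate(before):.0%}")
print(f"after:  {len(after)} deploys, CFR {change_failure_rate(after):.0%}")
```

In this made-up case the deploy count doubles while the failure share stays at 20%, which is the "volume, not speed" reading from the slide: twice as many broken deploys in absolute terms.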

THE AI PRODUCTIVITY PARADOX · FAROS AI · 10,000+ DEVS
Looks like a huge win: +98% pull requests · +21% tasks completed
This is what a PR-count dashboard shows your VP.
What the same data shows downstream: +91% review time · +154% PR size · +9% bugs / PR · flat org-level DORA
Twice the PRs. Same delivery. More bugs.
The work didn't disappear; it moved downstream. DORA 2026 names it the "instability tax." The fix isn't "more AI"; it's getting ready for it.
Sources: Faros AI engineering telemetry, Jul 2025 (10,000+ devs · 1,255 teams) · Apr 2026 "Acceleration Whiplash" update (22,000 devs · 4,000+ teams) · correlational · "instability tax" term: DORA 2026, ROI of AI-Assisted Software Development

DORA’s One-Sentence Diagnosis
"AI's primary role is as an amplifier, magnifying an organization's existing strengths and weaknesses."
Strong teams get stronger. Struggling teams get worse, faster.
So before "more AI": fix what gets amplified.
↑ Throughput · ↓ Stability (verified · DORA 2025)
Source: DORA 2025, State of AI-Assisted Software Development

IF AI AMPLIFIES, FIX WHAT GETS AMPLIFIED FIRST
AI doesn't repair a shaky foundation. It pours concrete over it, at speed.
Team: Misalignment now ships in hours, not sprints. Fix first: shared definition-of-done, small batches, review habits.
Code: The agent mirrors your patterns: legacy debt & wrong boundaries, multiplied. Fix first: clean version control, tests as guardrails, anti-patterns named in instructions.
Product: Vague requirements in → confidently-wrong out. Fix first: a domain glossary + the "what & why" written down.
Not an AI problem: the homework AI makes urgent. Team, Code, Product, in that order.
Source: DORA 2025, foundational capabilities (small batches, clean version control, internal platform) gate AI's payoff · field experience

ARE YOU AI-READY? · A 60-SECOND SELF-CHECK
Team · ✓ Ready when: reviews & definition-of-done are shared. ✗ AI pours concrete if: "works on my machine" is the culture.
Code · ✓ Ready when: tests + CI gate every merge. ✗ AI pours concrete if: no tests; boundaries are tangled.
Product · ✓ Ready when: the problem fits in one sentence. ✗ AI pours concrete if: requirements live in someone's head.
Every ✗ is where AI pours concrete fastest. Fix those first, starting with how you teach the agent what "good" looks like → Act 2.

Act 2 · Question 2 of 6: how do you give it your team's judgment?
Teach the Agent

SOLUTION 1: CUSTOM RULES & GUIDELINES
AGENTS.MD · OPEN STANDARD
60,000+ repos already run one.
Plain Markdown, nested per package. Released by OpenAI in Aug 2025; read today by Cursor, Devin, Factory, Codex, Copilot, Gemini CLI, VS Code and more. Dec 2025: donated to the Linux Foundation's new Agentic AI Foundation, alongside MCP.
RULE OF THUMB: Context ≠ enforcement.
Guide it in CLAUDE.md / AGENTS.md: style, architecture, "how we do things".
Block it in CI, or a PreToolUse hook: anything that must never ship.
A CLAUDE.md the agent can ignore under pressure is a suggestion. A CI gate it can't merge past is a rule.
Sources: Linux Foundation, "Agentic AI Foundation" press release (9 Dec 2025) · agents.md · TechCrunch (9 Dec 2025)
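To make "block it in a PreToolUse hook" concrete, here is a rough sketch of the decision logic such a hook script might contain. It assumes the hook receives a JSON event with `tool_name` and `tool_input.command` fields and that exit code 2 signals "block"; that is my reading of the Claude Code hooks documentation and should be checked against the current docs. The forbidden-command list is a hypothetical team policy.

```python
import json
from typing import Optional

# Hypothetical team policy: command fragments the agent must never run.
FORBIDDEN_SNIPPETS = ("git push --force", "--no-verify", "DROP TABLE")


def violation(event: dict) -> Optional[str]:
    """Return a reason to block the tool call, or None to allow it."""
    if event.get("tool_name") != "Bash":
        return None  # this sketch only inspects shell commands
    command = event.get("tool_input", {}).get("command", "")
    for snippet in FORBIDDEN_SNIPPETS:
        if snippet in command:
            return f"blocked by team policy: command contains {snippet!r}"
    return None


def run_hook(stdin_text: str) -> tuple[int, str]:
    """Map one JSON event to (exit_code, message).

    Assumption: exit code 2 blocks the call and the message is shown to the agent.
    """
    reason = violation(json.loads(stdin_text))
    return (2, reason) if reason else (0, "")
```

Installed as a real hook, a small wrapper would pass `sys.stdin.read()` to `run_hook`, print the message to stderr, and exit with the returned code. The point of the sketch is the split from the slide: this check runs no matter what the instructions file says.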

SOLUTION 2: SKILLS + EVALUATION
1 · DISCOVER: ~80 tokens. Name + description preloaded at startup; dozens of skills for less than one activation.
2 · ACTIVATE: Full SKILL.md loads only when the task actually matches.
3 · EXECUTE: Bundled scripts & refs load on demand; the skill can outgrow the prompt window.
WRITE SKILLS WITH SKILLS: Anthropic's own skill-creator scaffolds new skills interactively, and the direction is agents that create, edit and evaluate their own skills, not just consume the ones you write.
Skills package the domain rules; the constitution/AGENTS.md carries the house style. Evaluate skills like code: reused across projects, they need the same regression bar.
Sources: Anthropic Engineering, "Equipping agents for the real world with Agent Skills" · SwirlAI independent token analysis
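The discover and activate stages above can be pictured with a toy registry. This is only an illustration of the loading pattern, not how any real agent implements it: the `Skill` and `SkillRegistry` names are hypothetical, keyword matching stands in for the model's own decision to activate a skill, and stage 3 (bundled scripts on demand) is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    description: str                 # always in context (stage 1, discover)
    body: str                        # stands in for SKILL.md (stage 2, activate)
    keywords: tuple[str, ...] = ()   # toy stand-in for the model's matching


@dataclass
class SkillRegistry:
    skills: list[Skill]
    loaded: dict[str, str] = field(default_factory=dict)

    def startup_context(self) -> str:
        """Stage 1: only names and descriptions are preloaded."""
        return "\n".join(f"{s.name}: {s.description}" for s in self.skills)

    def activate_for(self, task: str) -> list[str]:
        """Stage 2: load the full body only for skills that match the task."""
        task_lower = task.lower()
        for s in self.skills:
            if any(k in task_lower for k in s.keywords):
                self.loaded[s.name] = s.body
        return sorted(self.loaded)


registry = SkillRegistry([
    Skill("db-migrations", "Write safe schema migrations",
          "STEP 1: add column as nullable ...", ("migration", "schema")),
    Skill("pdf-reports", "Render PDF reports",
          "STEP 1: build the layout ...", ("pdf", "report")),
])

print(registry.startup_context())
print(registry.activate_for("Write a migration for the orders table"))
```

The startup context holds two short lines regardless of how long each skill body is, which is why many skills can sit idle cheaply.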

SOLUTION 2: HOW DO YOU KNOW A SKILL ACTUALLY WORKS?
Seeing a skill trigger tells you Claude found it. Not that it did what you intended.
1 · TEST CASES: evals.json. Prompts + expected behaviour, next to the skill.
2 · ISOLATED RUNS: 1 subagent / case. Clean context each time; no bleed between runs.
3 · GRADE IT: grading.json. Pass / fail per assertion, with evidence.
4 · BENCHMARK: with vs. without. Pass rate against the token & time overhead.
"Testing turns a skill that seems to work into one you know works."
And if the base model starts passing without the skill loaded: it's not broken, it's just been absorbed into the model.
Sources: Claude Code docs, "Extend Claude with skills" (Evaluate and iterate on a skill) · Claude Blog, "Improving skill-creator: Test, measure, and refine Agent Skills"
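Step 4, the with-vs-without benchmark, comes down to a small comparison. The sketch below assumes each graded case has been reduced to a pass flag and a token count; the `CaseResult` shape and the sample numbers are hypothetical and do not reflect the actual evals.json or grading.json schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    passed: bool
    tokens: int  # total tokens the isolated run consumed


def pass_rate(results: list[CaseResult]) -> float:
    """Fraction of cases that passed; 0.0 for an empty run."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


def mean_tokens(results: list[CaseResult]) -> float:
    """Average token usage per case; 0.0 for an empty run."""
    return sum(r.tokens for r in results) / len(results) if results else 0.0


def compare(with_skill: list[CaseResult], without_skill: list[CaseResult]) -> dict:
    """Pass-rate uplift from the skill against its relative token overhead."""
    baseline_tokens = mean_tokens(without_skill)
    return {
        "pass_rate_with": pass_rate(with_skill),
        "pass_rate_without": pass_rate(without_skill),
        "uplift": pass_rate(with_skill) - pass_rate(without_skill),
        "token_overhead": mean_tokens(with_skill) / baseline_tokens if baseline_tokens else 0.0,
    }


# Invented results for four eval cases, run once with and once without the skill.
with_skill = [CaseResult(f"c{i}", passed=(i != 3), tokens=1200) for i in range(4)]
without_skill = [CaseResult(f"c{i}", passed=(i == 0), tokens=1000) for i in range(4)]

print(compare(with_skill, without_skill))
```

An uplift near zero is the "absorbed into the model" signal from the slide: the base model passes on its own, so the skill is only adding token overhead.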

Act 3 · Question 3 of 6: who does the thinking before the agent starts typing?
Think Before You Code

SOLUTION 5: SPEC-DRIVEN DEVELOPMENT
AI doesn't fix vague requirements. It ships them, at machine speed.
Ambiguous brief (incomplete · contradictory) → Agent guesses (no clarifying question) → Confidently wrong (compiles, looks done)
"Exceptional at pattern completion, not at mind reading." (Den Delimarsky · Principal Product Engineer, GitHub Spec Kit)
A spec is Step Zero, written down: who is this for, what problem does it solve. Spec Kit, OpenSpec & co. automate everything after that, never that.
Source: Den Delimarsky (GitHub) · field experience

CHOOSING YOUR SPEC TOOL: GREENFIELD VS BROWNFIELD
GitHub Spec Kit · 117k★ · 30+ agents supported
  /speckit.specify → .plan → .tasks → .implement
  Plus a constitution.md of immutable project rules every phase must respect.
  Best for greenfield: new projects, new services.
OpenSpec · 58k★ · 25+ tools · MIT, no infra
  propose → apply → archive
  Delta specs: describe what's changing, not the whole system. Built for codebases that already exist.
  Best for brownfield: the 90% of our real work.
Both are plain Markdown in your repo, so no lock-in. Pick one, keep the workflow consistent. Star counts move fast; check before you cite them.
Sources: GitHub Spec Kit · OpenSpec (Fission-AI) · Jun 2026 star counts

SOLUTION 6: RUBBER-DUCKING THE PLAN, NOT JUST THE CODE
The model that wrote the plan is the worst reviewer of it. It's invested in its own reasoning. A second model, or a second person, catches what the first one rationalized away.
Frontier planner, cheap executor. Spend the expensive model on the plan review; let a cheaper one grind through the implementation steps.
WHY THIS WON'T STAY FREE: AI coding costs > dev salary by 2028 (Gartner forecast, Jun 2026).
Agentic tasks trigger 5-30 model calls, each resending the whole context. Model choice isn't just quality; it's the budget line.
Source: Gartner press release, "Gartner Predicts AI Coding Costs Will Surpass Average Developer's Salary by 2028" (24 Jun 2026); analyst forecast, not a measured outcome
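A back-of-the-envelope sketch of why "each call resends the whole context" matters for the budget. Every number here (context size, growth per call, price per million tokens) is invented for illustration; only the 5 to 30 call range comes from the slide.

```python
def input_tokens_for_task(calls: int, base_context: int, growth_per_call: int) -> int:
    """Total input tokens when every call resends the full, growing context."""
    return sum(base_context + i * growth_per_call for i in range(calls))


def input_cost_usd(calls: int, base_context: int, growth_per_call: int,
                   usd_per_million_input: float) -> float:
    """Input-side cost of one agentic task at a given (hypothetical) price."""
    tokens = input_tokens_for_task(calls, base_context, growth_per_call)
    return tokens / 1_000_000 * usd_per_million_input


# Assumed: 20k tokens of starting context, 2k added per call, 3 USD per million input tokens.
for calls in (5, 30):
    tokens = input_tokens_for_task(calls, base_context=20_000, growth_per_call=2_000)
    cost = input_cost_usd(calls, 20_000, 2_000, usd_per_million_input=3.0)
    print(f"{calls:>2} calls: {tokens:>9,} input tokens, about ${cost:.2f}")
```

Under these assumptions, 6× the calls costs roughly 12× the input tokens (120,000 vs. 1,470,000), because the resent context keeps growing. That non-linear growth is the argument for routing the long implementation grind to a cheaper model.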

Act 4 · Question 4 of 6: how does the agent know it's actually finished?
Make "Done" Machine-Checkable

SOLUTION 3: SELF-HEALING LOOPS
Give it a machine-checkable definition of done, then let it loop until green.
tests + lint + typecheck → green
A PostToolUse hook feeds every failure straight back as context: the agent sees its own red build without you pasting the log.
THE TRAP: Don't let the agent grade its own homework. Models skew positive when they evaluate their own output. Separate generation from evaluation: a different pass, a different prompt, ideally a different model, checks the work.
"Definition of done" stops being a Jira checkbox and becomes an exit condition the agent can test itself.
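The loop-until-green idea can be sketched as a small driver: run the deterministic checks, hand any failures back as context, repeat up to a limit. `loop_until_green` and `FakeAgent` are hypothetical names; in practice each check would shell out to your test runner, linter, or type checker, and `fix` would be a call into the agent.

```python
from typing import Callable

Check = Callable[[], tuple[bool, str]]  # returns (passed, log output)


def failing_logs(checks: dict[str, Check]) -> list[str]:
    """Run every check once and collect the logs of the ones that failed."""
    logs = []
    for name, check in checks.items():
        passed, log = check()
        if not passed:
            logs.append(f"[{name}] {log}")
    return logs


def loop_until_green(checks: dict[str, Check], fix: Callable[[str], None],
                     max_rounds: int = 5) -> bool:
    """Feed failures back to `fix` until all checks pass or rounds run out."""
    for _ in range(max_rounds):
        failures = failing_logs(checks)
        if not failures:
            return True
        fix("\n".join(failures))  # the red build becomes the agent's context
    return not failing_logs(checks)  # final verdict comes from the checks


class FakeAgent:
    """Stand-in for a coding agent: each fix() call repairs one failing test."""

    def __init__(self, failing: int) -> None:
        self.failing = failing
        self.feedback: list[str] = []

    def check_tests(self) -> tuple[bool, str]:
        return self.failing == 0, f"{self.failing} tests failing"

    def fix(self, feedback: str) -> None:
        self.feedback.append(feedback)
        self.failing = max(0, self.failing - 1)


agent = FakeAgent(failing=2)
print(loop_until_green({"tests": agent.check_tests}, agent.fix))
```

Note who decides "done": the return value always comes from the checks, never from the agent reporting on itself, which is the trap the slide warns about. The round limit keeps a stuck agent from looping forever.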

SOLUTION 8: TEST-FIRST, NOT TEST-AFTER
Write the failing test first. Commit it before the agent implements anything.
An agent can quietly weaken a red test it's allowed to edit. It can't quietly weaken one that's already in the commit history, because the change shows up in the diff.
BDD pushes this further: the test asserts the requirement, not the implementation that happens to exist today.
FIELD NOTE: "We hit 99% coverage with AI". Mutation testing showed the tests just pinned what the code does, not what it should do. Coverage confirms the implementation. It never asked whether the implementation was right.
Same discipline as code review: the test has to exist before the code that satisfies it.
Source: eferro, "Mutation Testing" (Nov 2025) · field experience
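A tiny hand-rolled illustration of the field note: a test can execute every line (full coverage) and still let a wrong implementation through. The `is_adult` rule and both tests are invented; mutation-testing tools generate mutants like `is_adult_mutant` automatically instead of by hand.

```python
from typing import Callable


def is_adult(age: int) -> bool:
    """Requirement (invented for this example): 18 and older counts as adult."""
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    """Mutant: the boundary operator was changed from >= to >."""
    return age > 18


def weak_test(fn: Callable[[int], bool]) -> bool:
    """Covers every line of fn but only probes values far from the boundary."""
    return fn(30) is True and fn(5) is False


def requirement_test(fn: Callable[[int], bool]) -> bool:
    """Asserts the requirement itself: 18 is adult, 17 is not."""
    return fn(18) is True and fn(17) is False


print("weak test passes on mutant:       ", weak_test(is_adult_mutant))
print("requirement test passes on mutant:", requirement_test(is_adult_mutant))
```

The weak test reports the mutant as fine, so it would have reported a real off-by-one bug as fine too. The requirement test fails on the mutant, which is what "killing" it means.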

SOLUTION 9: NOT EVERYTHING NEEDS AI
A linter doesn't hallucinate. Deterministic checks are still the cheapest guardrail you own.
Static analysis, type checkers, strict CI: none of it needs a model, all of it catches what an agent produces at volume.
PURPOSE-BUILT FOR AI OUTPUT: SonarQube · "Sonar way for AI Code". A built-in quality gate for code flagged as AI-generated; GA since SonarQube Server 2025.1 LTA.
Gate conditions: 0 new issues · 80% new coverage · ≤3% duplication · Security rating A
Reserve the model for judgment calls. Let boring, deterministic tooling catch everything it already can.
Source: Sonar, "Quality gates for AI code" · SonarQube Server 2025.1 LTA docs
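The four gate conditions on the slide are simple enough to express as a deterministic check. This sketch mirrors the thresholds in plain Python to show how little judgment they need; it is not SonarQube's API, and the `NewCodeMetrics` fields are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NewCodeMetrics:
    new_issues: int
    coverage_pct: float      # coverage on new code
    duplication_pct: float   # duplicated lines on new code
    security_rating: str     # "A" (best) through "E"


def gate_failures(m: NewCodeMetrics) -> list[str]:
    """Return the violated conditions; an empty list means the gate passes."""
    failures = []
    if m.new_issues > 0:
        failures.append(f"{m.new_issues} new issues (must be 0)")
    if m.coverage_pct < 80.0:
        failures.append(f"coverage {m.coverage_pct}% (must be at least 80%)")
    if m.duplication_pct > 3.0:
        failures.append(f"duplication {m.duplication_pct}% (must be at most 3%)")
    if m.security_rating != "A":
        failures.append(f"security rating {m.security_rating} (must be A)")
    return failures


# Invented metrics for one agent-authored pull request.
pr = NewCodeMetrics(new_issues=0, coverage_pct=72.5, duplication_pct=4.1, security_rating="A")
for failure in gate_failures(pr):
    print("FAIL:", failure)
```

The same input always produces the same verdict, and the agent cannot argue with it. That property is what the slide means by reserving the model for judgment calls.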

Act 5 · Question 5 of 6: how do you go from one agent to a team of them?
Scale It

SOLUTION 4: MULTI-AGENT TEAMS
ANTHROPIC'S INTERNAL RESEARCH EVAL: +90.2% for lead agent + subagents vs. single agent.
The cost: ~15× the tokens of a single chat. Multi-agent wins on breadth: parallel, independent subtasks. It's the wrong tool for one tightly-coupled feature.
ORCHESTRATOR → WORKERS
Subagents: quick, focused, report back to one lead. Default choice.
Agent Teams: teammates message each other directly, coordinate on their own. More overhead, needed when work isn't easily split.
Same-file edits or tightly sequential work: stay with one session. Parallelism helps only when the subtasks are actually independent.
Sources: Anthropic Engineering, "How we built our multi-agent research system" (Jun 2025) · Claude Code docs, "Agent Teams"
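The orchestrator → workers shape, reduced to its skeleton: a lead fans out independent subtasks in parallel and collects one report per worker. `run_subagent` is a placeholder that just returns a string; a real worker would call a model with its own clean context.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(subtask: str) -> str:
    """Placeholder worker: a real subagent would call a model here."""
    return f"summary of {subtask}"


def orchestrate(subtasks: list[str], max_workers: int = 4) -> dict[str, str]:
    """Lead agent: run independent subtasks in parallel, gather the reports."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order, so reports line up with subtasks.
        reports = list(pool.map(run_subagent, subtasks))
    return dict(zip(subtasks, reports))


print(orchestrate(["auth module", "billing module", "search module"]))
```

The sketch only works because no worker needs another worker's output. Once subtasks share a file or depend on each other's results, this fan-out no longer fits, matching the slide's advice to stay with one session.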

SUBAGENTS VS. AGENT TEAMS: ANTHROPIC'S OWN PICTURE
[Figure not included in the transcript]
Source: Anthropic · Claude Code docs (Agent Teams)

SOLUTION 7: PAY DOWN THE DEBT THAT'S ALREADY THERE
×8 code-clone blocks YoY (≥ 5-line duplicates)
25 → <10% refactored-code share of all changes
3.1 → 5.7% code churn within 2 weeks
FIELD NOTE · WPF → WEB MIGRATIONS: Shit in → shit out. Blind on a legacy migration, the agent translates the old debt into the new stack: same anti-patterns, new syntax. (Exactly why greenfield sees 35–40% gains and legacy sees ≤10%, per Stanford SEP.)
The "no time to refactor" excuse just died: spin up a git worktree, let a background agent chip at legacy debt while you keep shipping in the main one. Gotcha: deleting the session drops uncommitted work; each worktree needs its own node_modules.
Sources: GitClear AI Copilot Code Quality Report 2025 (211M lines) · Stanford SEP (AIEWF 2025) · Claude Code docs, worktree support · field experience

Act 6 · Question 6 of 6: if we can ship more, what's actually still slowing us down?
The New Bottleneck

THE PIVOT: SHIPPING WAS NEVER THE BOTTLENECK
We can ship more code now. We were never short on code.
↑ Throughput · ↓ Stability (DORA 2025 · the same amplifier from Act 1)
Remember Faros: +98% PRs, org-level delivery flat. Speed without stability isn't progress; it's accelerated chaos.

THE PRACTICES THAT NOW MATTER MORE
Code review: AI as pre-reviewer, human decides what ships. #1 by F1 (CodeRabbit) but AI code is 1.7× more to review.
Feature flags: More PRs, smaller blast radius. Ship the agent's change dark, roll out gradually. Decouple deploy from release.
Observability: If throughput outpaces your ability to notice regressions, throughput is a liability. Catch it before the customer does.
None of these are new practices. They're just the ones that now decide whether "faster" means anything.
Sources: CodeRabbit Martian Code Review Bench (Mar 2026) · CodeRabbit "State of AI vs Human Code" (Dec 2025) · field experience
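"Ship dark, roll out gradually" usually means a percentage rollout behind a flag. The sketch below shows one common way to do it, hashing the flag name and user id into a stable bucket; the function names and the flag name are hypothetical, and a real setup would typically use a feature-flag service rather than hand-rolled code.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str, rollout_pct: int) -> bool:
    """True if this user falls inside the current rollout percentage."""
    return bucket(flag, user_id) < rollout_pct


# The code is deployed for everyone, but released to nobody at 0%.
users = [f"user-{i}" for i in range(1000)]
for pct in (0, 10, 50, 100):
    enabled = sum(is_enabled("new-checkout", u, pct) for u in users)
    print(f"rollout {pct:>3}%: enabled for {enabled} of {len(users)} users")
```

Because the bucket is derived from a hash and not from a random draw, a user who has the feature at 10% keeps it at 50%, and setting the percentage back to 0 switches the change off without a redeploy.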

Two Rules Worth Stealing
Simon Willison · creator of Datasette · co-creator of Django: "I won't commit any code to my repository if I couldn't explain exactly what it does to somebody else."
→ Forces understanding. Kills hallucinated dependencies. Catches silent bugs.
Addy Osmani · Google: Beware "house of cards code".
→ Fragile AI output that collapses under scrutiny. Specs in workflows prevent it.
Multi-agent teams and background agents mean you'll increasingly review a PR you never watched get written. These two rules are what's left when you do.

So what do you do tomorrow?

Three Concrete Things, Starting Tomorrow
1. Write an AGENTS.md / CLAUDE.md tonight. For your most active repo. Treat it like onboarding for a new hire, and remember: it guides, it doesn't enforce.
2. Add one machine-checkable gate. A test-first task, or a Sonar/CI gate on your definition-of-done. One gate the agent can't talk its way past.
3. Spin up a background agent on a worktree. Point it at one piece of legacy debt. Let it chip away while you keep shipping in the main one.

Smart isn't "more AI". Smart is AI in the right place, at the right time, and knowing when not to use it at all.

Sources & Further Reading
ACT 1: THE PROBLEM
Stack Overflow Dev Survey 2025 · survey.stackoverflow.co/2025
DORA 2025, State of AI-assisted Dev · dora.dev/dora-report-2025
DORA 2026, ROI of AI-assisted Dev · dora.dev/ai/roi/report
METR, AI productivity RCT · metr.org · arXiv:2507.09089
DX Core 4, AI Measurement · getdx.com
Faros AI, engineering telemetry · faros.ai
Stanford SEP (Denisov-Blanch), AIEWF 2025 · greenfield/brownfield gains
ACT 2–3: TEACH & THINK
agents.md, open standard · agents.md
Linux Foundation, Agentic AI Foundation · linuxfoundation.org
Anthropic, Agent Skills · anthropic.com/engineering
GitHub Spec Kit · github.com/github/spec-kit
OpenSpec (Fission-AI) · github.com/Fission-AI/OpenSpec
Gartner, AI coding cost forecast · gartner.com · 24 Jun 2026
ACT 4–6: SHIP & SCALE
eferro, mutation testing · eferro.net
Sonar, AI Code Assurance · docs.sonarsource.com
Anthropic, multi-agent research system · anthropic.com/engineering
Claude Code docs (Agent Teams, hooks, worktrees) · code.claude.com/docs
GitClear, Code Quality 2025 · gitclear.com
CodeRabbit, review benchmarks · coderabbit.ai

Voices: Simon Willison simonwillison.net · Addy Osmani addyosmani.com · Laura Tacho getdx.com · Den Delimarsky den.dev · All links: linktr.ee/daniel_sogl
Thank you! Questions?
Slides & socials: linktr.ee/daniel_sogl · thinktecture.com
