

Adrian Cole
September 11, 2025

Testing GenAI Applications: Patterns That Actually Work

Presentation given at Web Directions Engineering AI Sydney

The agenda elaborated on my colorful headlines:

Flaky CI & YOLO Clouds: Stabilize your tests with VCR (Envoy AI Gateway)
Rage/FOMO Choices: Model agility via LLM evals (Arize Phoenix)
Complex Agentic Scenarios: Goose recipes & terminal-bench
Takeaways: What you can do with all this

https://webdirections.org/eng-ai/schedule.php


Transcript

  1. Open Source history includes:
     Observability: OpenZipkin, OpenTelemetry, OpenInference (GenAI observability)
     Usability: wazero (WebAssembly runtime for Go), func-e (easy start for Envoy)
     Portability: Netflix Denominator (DNS clouds), jclouds (compute + storage)
     github.com/codefromthecrypt • @adrianfcole • linkedin.com/in/adrianfcole
     Principal engineer focused on the GenAI dev -> prod transition
  2. Quick recap on Agentic
     • LLM: Typically accessed via a web service; completes text, image, audio
     • MCP: Primarily a client+server protocol for tools, though it does a bit more
     • Agent: An LLM loop that auto-completes actions (with tools), not just text (minimal loop sketch below)
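To make the "LLM loop" concrete, here is a minimal sketch of an agent loop using OpenAI-style tool calling; the model name and the single list_files tool are illustrative assumptions, not anything from the talk.

```python
# Minimal agent loop: the LLM keeps choosing tools until it replies in plain text.
import json
import os
from openai import OpenAI

client = OpenAI()

def list_files(path: str = ".") -> str:
    # Hypothetical example tool the agent can call.
    return "\n".join(os.listdir(path))

TOOLS = [{
    "type": "function",
    "function": {
        "name": "list_files",
        "description": "List files in a directory",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
        },
    },
}]

def run_agent(task: str) -> str:
    messages = [{"role": "user", "content": task}]
    while True:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model
            messages=messages,
            tools=TOOLS,
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content  # plain text: the loop is done
        messages.append(msg)  # keep the assistant's tool-call turn in context
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments or "{}")
            result = list_files(**args)  # only one tool in this sketch
            messages.append(
                {"role": "tool", "tool_call_id": call.id, "content": result}
            )
```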
  3. Today's GenAI testing struggles
     • Flaky CI: LLMs are non-deterministic and tricky to test
     • YOLO clouds: Major LLM providers make undocumented changes
     • Rage or FOMO choices: Priced out, or want a backend released an hour ago
     • Complex agentic scenarios: Multi-step or context-related glitches
     AI tests break your CI today: non-determinism and token costs are sabotaging releases.
  4. Agenda
     • Flaky CI & YOLO Clouds: Stabilize your tests with VCR (Envoy AI Gateway)
     • Rage/FOMO Choices: Model agility via LLM evals (Arize Phoenix)
     • Complex Agentic Scenarios: Goose recipes & terminal-bench
     • Takeaways: What you can do with all this
  5. Why VCR matters for AI testing
     • Eliminates live API flakiness: No more random CI failures
     • Catches undocumented API changes: Like fields not in schemas
     • Ensures repeatability: Same inputs, same outputs every time
     • Cuts test costs to zero: No live API calls in CI
     Caution: everything on the wire is recorded, not just your LLM calls! Audit your cassettes for endpoints you may have missed! (A minimal recording sketch follows.)
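A minimal sketch of the record/replay pattern with VCR.py; the cassette path, model name, and prompt are illustrative, and the OpenAI call assumes a VCR.py version that supports the httpx transport used by the official client.

```python
# Record once, then replay deterministically in CI (no live API calls).
import vcr
from openai import OpenAI

# filter_headers keeps secrets like Authorization out of the cassette.
my_vcr = vcr.VCR(
    cassette_library_dir="tests/cassettes",
    record_mode="once",  # record on first run, replay afterwards
    filter_headers=["authorization", "cookie"],
)

def test_chat_completion():
    with my_vcr.use_cassette("chat_completion.yaml"):
        client = OpenAI()  # API key only needed for the first (recording) run
        reply = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model name
            messages=[{"role": "user", "content": "Say hello in one word."}],
        )
        assert reply.choices[0].message.content  # replayed byte-for-byte in CI
```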
  6. VCR Tips
     • Choose interesting patterns relevant for your app
     • Be flexible on serialization and JSON comparisons to avoid uninteresting fuzz
     • Strip headers from recordings, or mask sensitive ones (cookie, authorization); a config sketch follows
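A hedged configuration sketch for these tips: a looser request matcher to avoid uninteresting fuzz, plus header and Set-Cookie scrubbing before anything is written to disk.

```python
# Match on the parts of the request that matter; scrub secrets before recording.
import vcr

def scrub_response(response):
    # Drop Set-Cookie values so session tokens never land in the repo.
    response["headers"] = {
        k: v for k, v in response["headers"].items() if k.lower() != "set-cookie"
    }
    return response

relaxed_vcr = vcr.VCR(
    cassette_library_dir="tests/cassettes",
    match_on=["method", "scheme", "host", "path"],  # ignore volatile query/body noise
    filter_headers=["authorization", "cookie"],
    before_record_response=scrub_response,
    decode_compressed_response=True,  # store readable JSON, not gzip bytes
)
```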
  7. True story: the OpenAI obfuscation field
     • If you run gpt-5-nano with streaming, you will get an obfuscation field
     • This is not documented by OpenAI or in their OpenAPI spec
     • We only noticed it because it showed up in VCR recordings
     • To this day, the GitHub issue to the OpenAI org remains unanswered!
     Docs lie; responses are the only truth. Record them if your business code calls LLM APIs. (An audit sketch follows.)
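One way to turn that lesson into a check: diff recorded response bodies against the fields you expect. A sketch, assuming VCR.py's default YAML cassette layout; the allow-list of expected fields is illustrative.

```python
# Scan a recorded cassette for response fields we don't expect to exist.
import json
import yaml

EXPECTED_TOP_LEVEL = {"id", "object", "created", "model", "choices", "usage"}  # illustrative

def undocumented_fields(cassette_path: str) -> set[str]:
    with open(cassette_path) as f:
        cassette = yaml.safe_load(f)
    surprises = set()
    for interaction in cassette.get("interactions", []):
        body = interaction["response"]["body"].get("string") or "{}"
        try:
            payload = json.loads(body)
        except json.JSONDecodeError:
            continue  # streamed SSE bodies need chunk-by-chunk parsing instead
        surprises |= set(payload) - EXPECTED_TOP_LEVEL
    return surprises

# e.g. undocumented_fields("tests/cassettes/chat_completion.yaml")
```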
  8. Agentic protip: if it is a large feature, write a test plan doc
     Writing code in a large project means thinking about a lot of things (large context).
     Agents can help you test, but there are limits: when agents "compact" or a session crashes, they forget what they were doing.
     If you are testing a large feature, put your most important test plan into a file so it can be re-read.
  9. People change AI models and tools often this year!
     • Model upgrades (qwen3 hybrid thinking mode in Apr)
     • MCP goes mainstream (GitHub remote MCP in Apr, with leagues to follow)
     • Pricing rage (Claude Code: $20→$200/month, Apr→Aug)
     • Leaderboard races (glm-4.5 to compete with Claude Sonnet in Jul)
     • YOLO products (gpt-5 deletes gpt-4o, then quickly restored in Aug)
     • Price war (DeepSeek V3.1 nearly 48x cheaper than OpenAI o3-Pro in Sep)
  10. LLM evaluation cron job example
      • Use built-in Phoenix evaluators for correctness and factuality
      • Define our own domain-specific evaluator for common errors
      • Evaluation can take a while to complete!
      https://github.com/elastic/testing-genai-applications (an eval sketch follows)
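For a feel of what such an eval job can look like, here is a hedged sketch using Phoenix's built-in LLM-as-judge hallucination evaluator; the dataframe rows and judge model are illustrative, and exact argument names can vary by Phoenix version.

```python
# Run a built-in Phoenix hallucination evaluator over a small batch of outputs.
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Each row: the user input, the retrieved reference text, and the model's answer.
df = pd.DataFrame(
    {
        "input": ["What port does Envoy listen on by default?"],
        "reference": ["The example config binds the listener to port 10000."],
        "output": ["Envoy listens on port 10000 in the example config."],
    }
)

judge = OpenAIModel(model="gpt-4o-mini")  # illustrative judge model
results = llm_classify(
    dataframe=df,
    model=judge,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,  # evals can take a while; explanations help triage
)
print(results[["label", "explanation"]])
```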
  11. 2025 is the year of agents, and evals are changing
      • Agents complete actions, not just text, audio, or video
      • Sessions are long-running and multi-turn
      • Tool calls are important, as their responses impact the whole context
      • Token efficiency and isolation matter
  12. Goose is your local coding agent, born in Sydney
      goose is your on-machine AI agent, capable of automating complex development tasks from start to finish. More than just code suggestions, goose can build entire projects from scratch, write and execute code, debug failures, orchestrate workflows, and interact with external APIs - autonomously.
      Blog on Tetrate Agent Router Service + Goose with free $10
  13. Goose is open all the way, and this is tricky
      • Consumers, not just coders (cannot assume users are technical)
      • Public and local LLMs (capability hype vs. reality plays out here)
      • 100% MCP (first to practice many parts of tool orchestration)
      • Open source project (must be efficient in problem solving)
  14. Model change impact in agents
      Common failures:
      - Feature support mismatch: local model lacks tool calling
      - Version drift: different model versions behave differently
      - Schema differences: tool definitions don't match
      - Performance characteristics: timeout behavior varies
      Real examples:
      - Python inline recipes work on GPT-4 but fail on local Qwen
      - Excel tool transposes data differently across model versions
      - Function calling syntax varies between providers
      How do we evaluate this?
  15. Goose eval 3 ways
      • Goose Recipe (ad-hoc tests of an agentic task)
      • Goose Bench (model + configs over common tasks)
      • Terminal Bench (normalized agent tests: goose vs. others)
  16. Goose Recipes: portable YAML files that standardize agent behavior
      • Reproducibility: Same task, same tools, different models
      • Shareability: Team uses identical prompts/configs
      • Parameterization: Template variables for reuse
      goose run --recipe code-review.yaml --params pr_number=4587 --params repo=block/goose
  17. Ad-hoc testing with Goose Recipes
      Test the same recipe against different providers (see the sketch below)
      https://block.github.io/goose/blog/2025/08/12/mcp-testing
      Want advanced? →
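A sketch of that ad-hoc pattern as a parameterized pytest run; the provider/model pairs, the GOOSE_PROVIDER and GOOSE_MODEL environment variables, and the timeout are assumptions to check against your Goose version's configuration docs.

```python
# Run the same recipe against several providers and fail fast on any non-zero exit.
import os
import subprocess
import pytest

PROVIDERS = [  # illustrative provider/model pairs
    ("openai", "gpt-4o-mini"),
    ("ollama", "qwen3:8b"),
]

@pytest.mark.parametrize("provider,model", PROVIDERS)
def test_code_review_recipe(provider, model):
    # Assumed env vars for provider selection; verify against your Goose docs.
    env = {**os.environ, "GOOSE_PROVIDER": provider, "GOOSE_MODEL": model}
    result = subprocess.run(
        [
            "goose", "run",
            "--recipe", "code-review.yaml",
            "--params", "pr_number=4587",
            "--params", "repo=block/goose",
        ],
        env=env,
        capture_output=True,
        text=True,
        timeout=600,
    )
    assert result.returncode == 0, result.stderr
```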
  18. Goose Bench: measure GenAI model tool-calling and task-completion ability
      Execution flow:
      1. Run evaluations: each task executed 3 times per model
      2. Post-processing: optional LLM judge scores subjective tasks
      3. Score calculation: combines judge score + tool usage + format validity
      4. Aggregation: Python scripts average scores across runs and generate CSVs (toy sketch below)
      Key metrics:
      - Task completion success (0-1)
      - Tool call accuracy and usage
      - Token efficiency
      - Execution time
      https://block.github.io/goose/blog/2025/03/31/goose-benchmark
      Want more? →
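To illustrate the aggregation step only, a toy sketch that averages per-run scores into a CSV; the column names and scores are invented for the example and are not Goose Bench's actual output format.

```python
# Average per-run scores per (model, task) and write a summary CSV.
import pandas as pd

runs = pd.DataFrame(  # in practice, loaded from the per-run result files
    [
        {"model": "gpt-4o-mini", "task": "fix_failing_test", "run": 1, "score": 0.9},
        {"model": "gpt-4o-mini", "task": "fix_failing_test", "run": 2, "score": 0.7},
        {"model": "qwen3:8b", "task": "fix_failing_test", "run": 1, "score": 0.4},
    ]
)

summary = runs.groupby(["model", "task"], as_index=False)["score"].mean()
summary.to_csv("bench_summary.csv", index=False)
print(summary)
```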
  19. Terminal Bench (https://www.tbench.ai/)
      Independent benchmark for testing AI agents in real terminal environments
      • Scope: 229 production tasks, from compiling code to training models
      • Approach: Agents solve real-world CLI tasks autonomously
      • Evaluation: Binary pass/fail on practical outcomes
      • Public leaderboard: Compare models at tbench.ai
  20. Benchmark hazards
      • Model clustering: Top agents rely on similar models (e.g., Claude-4-Opus), limiting true innovation.
      • Public tasks: Open tasks enable memorization/tuning, inflating scores without real gains.
      • Hybrid obfuscation: OB-1 blends models (GPT-5, Claude): are wins from agent design or from the underlying models?
      • Quick obsolescence: Improvements invalidate rankings in a month; re-validate often.
      Shelf life: a brief window before tuning, advances, and contamination lower a leaderboard's relevance.
  21. Coming up for air
      OK, wow, that was a lot. What does this mean for me?!
  22. Key takeaways for any developer building with GenAI
      • Treat LLMs like flaky services: Use recording tools (VCR) for deterministic tests
      • Evaluate outputs rigorously: LLM-as-judge evals for correctness and domain checks
      • Design for model agility: Easy provider switching without breaking CI
      • Monitor AI interactions: OpenTelemetry traces for debugging and usage (tracing sketch below)
      • Test beyond units: Parameterized recipes for end-to-end behavior
      Blog on Tetrate Agent Router Service + Goose with free $10
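For the monitoring takeaway, a minimal sketch of OpenInference auto-instrumentation exporting traces to a locally running Phoenix instance; the project name and model are illustrative, and it assumes the arize-phoenix-otel and openinference-instrumentation-openai packages are installed.

```python
# Trace OpenAI calls with OpenInference and send spans to a local Phoenix collector.
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

# Assumes Phoenix is already running locally (e.g. via `phoenix serve`).
tracer_provider = register(project_name="genai-tests")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# From here on, every OpenAI client call emits a span with prompts, tokens, and latency.
from openai import OpenAI

client = OpenAI()
client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative
    messages=[{"role": "user", "content": "ping"}],
)
```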
  23. Thank you very much! linkedin.com/in/adrianfcole
      Start Today:
      • Add VCR tests this afternoon
      • Spin up local Ollama tomorrow
      • Write your first MCP recipe by Friday
      Resources:
      • Envoy AI Gateway - proxy patterns
      • Goose - agentic testing insights
      • VCR.py - reliable recordings
      • OpenInference & Arize Phoenix - DIY evals
      • Terminal-Bench - agentic benchmark with leaderboard
  24. Test with local models!
      • Eliminate external dependencies
      • No rate limits
      • Control the exact model
      • Air-gapped, compliance friendly (a local-client sketch follows)
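A minimal sketch of pointing the standard OpenAI client at a local Ollama server via its OpenAI-compatible endpoint; the model name is illustrative (use whatever you have pulled locally).

```python
# Run the same client code against a local Ollama model: no rate limits, no egress.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # any non-empty string; Ollama ignores it
)

reply = client.chat.completions.create(
    model="qwen3:8b",  # illustrative; use a model you've pulled with `ollama pull`
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(reply.choices[0].message.content)
```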
  25. Tools are needed by agents, but your mileage may vary
      Error: Request failed: registry.ollama.ai/library/qwen3-coder:latest does not support tools (type: api_error) (status 400)
  26. Tool glitches are very routine
      • Unresolved tool issues (the qwen3-coder issue has been open since 1 Aug)
      • Slow resolution is normal (it can take 6 months to fully support a new model)
      • Support may be different elsewhere (LM Studio or llama.cpp may work)
      • Community workarounds abound (blogs, alternate repos, etc.)
      https://github.com/ollama/ollama/issues/11621
      https://hf.co/unsloth
  27. Goose Toolshim
      • Main model generates text with JSON tool calls
      • Interpreter Ollama model extracts/structures the tool calls
      • Overhead, but enables tools for models lacking support
      (A sketch of the pattern follows.)
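To show the shape of that pattern (not Goose's actual implementation), a sketch in which a small local interpreter model turns the main model's free-form text into a structured tool call; the prompt, endpoint, and model names are assumptions.

```python
# Toolshim-style sketch: an interpreter model structures tool calls the main model wrote as text.
import json
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # local Ollama endpoint

def extract_tool_call(main_model_text: str) -> dict:
    """Ask a small interpreter model to emit the tool call as strict JSON."""
    prompt = (
        "Extract the single tool call from the text below as JSON with keys "
        '"name" and "arguments". Reply with JSON only.\n\n' + main_model_text
    )
    reply = local.chat.completions.create(
        model="llama3.2:3b",  # illustrative small interpreter model
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(reply.choices[0].message.content)

# e.g. extract_tool_call('I will run the shell tool: {"name": "shell", "arguments": {"cmd": "ls"}}')
```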