

Adrian Cole
September 11, 2025

Testing GenAI Applications: Patterns That Actually Work

Presentation given at Web Directions Engineering AI Sydney

The agenda elaborated on my colorful headlines:

Flaky CI & YOLO Clouds: Stabilize your tests with VCR (Envoy AI Gateway)
Rage/FOMO Choices: Model agility via LLM evals (Arize Phoenix)
Complex Agentic Scenarios: Goose recipes & terminal-bench
Takeaways: What you can do with all this

https://webdirections.org/eng-ai/schedule.php


Transcript

  1. Open Source history includes:
     Observability: OpenZipkin, OpenTelemetry, OpenInference (GenAI observability)
     Usability: wazero (WebAssembly runtime for Go), func-e (easy start for Envoy)
     Portability: Netflix Denominator (DNS clouds), jclouds (compute + storage)
     github.com/codefromthecrypt • @adrianfcole • linkedin.com/in/adrianfcole
     Principal engineer focused on the GenAI dev -> prod transition
  2. Quick recap on Agentic
     • LLM: Typically accessed via a web service; completes text, image, audio
     • MCP: Primarily a client+server protocol for tools, though it does a bit more
     • Agent: An LLM loop that auto-completes actions (with tools), not just text (minimal loop sketch below)
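To make the "LLM loop" concrete, here is a minimal sketch of an agent loop using OpenAI-style tool calling; the model name and the single list_files tool are illustrative assumptions, not anything from the talk.

```python
# Minimal agent loop: the LLM keeps choosing tools until it replies in plain text.
import json
import os
from openai import OpenAI

client = OpenAI()

def list_files(path: str = ".") -> str:
    # Hypothetical example tool the agent can call.
    return "\n".join(os.listdir(path))

TOOLS = [{
    "type": "function",
    "function": {
        "name": "list_files",
        "description": "List files in a directory",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
        },
    },
}]

def run_agent(task: str) -> str:
    messages = [{"role": "user", "content": task}]
    while True:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model
            messages=messages,
            tools=TOOLS,
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content  # plain text: the loop is done
        messages.append(msg)  # keep the assistant's tool-call turn in context
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments or "{}")
            result = list_files(**args)  # only one tool in this sketch
            messages.append(
                {"role": "tool", "tool_call_id": call.id, "content": result}
            )
```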
  3. Today's GenAI testing struggles
     • Flaky CI: LLMs are non-deterministic and tricky to test
     • YOLO clouds: Major LLM providers make undocumented changes
     • Rage or FOMO choices: Priced out, or want a backend released an hour ago
     • Complex agentic scenarios: Multi-step or context-related glitches
     AI tests break your CI today: non-determinism and token costs are sabotaging releases.
  4. Agenda
     • Flaky CI & YOLO Clouds: Stabilize your tests with VCR (Envoy AI Gateway)
     • Rage/FOMO Choices: Model agility via LLM evals (Arize Phoenix)
     • Complex Agentic Scenarios: Goose recipes & terminal-bench
     • Takeaways: What you can do with all this
  5. Why VCR matters for AI testing
     • Eliminates live API flakiness: No more random CI failures
     • Catches undocumented API changes: Like fields not in schemas
     • Ensures repeatability: Same inputs, same outputs every time
     • Cuts test costs to zero: No live API calls in CI
     Caution: everything on the wire is recorded, not just your LLM calls! Audit your cassettes for endpoints you may have missed! (A minimal recording sketch follows.)
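A minimal sketch of the record/replay pattern with VCR.py; the cassette path, model name, and prompt are illustrative, and the OpenAI call assumes a VCR.py version that supports the httpx transport used by the official client.

```python
# Record once, then replay deterministically in CI (no live API calls).
import vcr
from openai import OpenAI

# filter_headers keeps secrets like Authorization out of the cassette.
my_vcr = vcr.VCR(
    cassette_library_dir="tests/cassettes",
    record_mode="once",  # record on first run, replay afterwards
    filter_headers=["authorization", "cookie"],
)

def test_chat_completion():
    with my_vcr.use_cassette("chat_completion.yaml"):
        client = OpenAI()  # API key only needed for the first (recording) run
        reply = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model name
            messages=[{"role": "user", "content": "Say hello in one word."}],
        )
        assert reply.choices[0].message.content  # replayed byte-for-byte in CI
```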
  6. VCR Tips
     • Choose interesting patterns relevant for your app
     • Be flexible on serialization and JSON comparisons to avoid uninteresting fuzz
     • Strip headers from recordings, or mask sensitive ones (cookie, authorization); a config sketch follows
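A hedged configuration sketch for these tips: a looser request matcher to avoid uninteresting fuzz, plus header and Set-Cookie scrubbing before anything is written to disk.

```python
# Match on the parts of the request that matter; scrub secrets before recording.
import vcr

def scrub_response(response):
    # Drop Set-Cookie values so session tokens never land in the repo.
    response["headers"] = {
        k: v for k, v in response["headers"].items() if k.lower() != "set-cookie"
    }
    return response

relaxed_vcr = vcr.VCR(
    cassette_library_dir="tests/cassettes",
    match_on=["method", "scheme", "host", "path"],  # ignore volatile query/body noise
    filter_headers=["authorization", "cookie"],
    before_record_response=scrub_response,
    decode_compressed_response=True,  # store readable JSON, not gzip bytes
)
```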
  7. True story: the OpenAI obfuscation field
     • If you run gpt-5-nano with streaming, you will get an obfuscation field
     • This is not documented by OpenAI or in their OpenAPI spec
     • We only noticed it because it showed up in VCR recordings
     • To this day, the GitHub issue to the OpenAI org remains unanswered!
     Docs lie; responses are the only truth. Record them if your business code calls LLM APIs. (An audit sketch follows.)
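One way to turn that lesson into a check: diff recorded response bodies against the fields you expect. A sketch, assuming VCR.py's default YAML cassette layout; the allow-list of expected fields is illustrative.

```python
# Scan a recorded cassette for response fields we don't expect to exist.
import json
import yaml

EXPECTED_TOP_LEVEL = {"id", "object", "created", "model", "choices", "usage"}  # illustrative

def undocumented_fields(cassette_path: str) -> set[str]:
    with open(cassette_path) as f:
        cassette = yaml.safe_load(f)
    surprises = set()
    for interaction in cassette.get("interactions", []):
        body = interaction["response"]["body"].get("string") or "{}"
        try:
            payload = json.loads(body)
        except json.JSONDecodeError:
            continue  # streamed SSE bodies need chunk-by-chunk parsing instead
        surprises |= set(payload) - EXPECTED_TOP_LEVEL
    return surprises

# e.g. undocumented_fields("tests/cassettes/chat_completion.yaml")
```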
  8. Agentic protip: if it is a large feature, write a test plan doc
     Writing code in a large project means thinking about a lot of things (large context).
     Agents can help you test, but there are limits: when agents "compact" or a session crashes, they forget what they were doing.
     If you are testing a large feature, put your most important test plan into a file so it can be re-read.
  9. People change AI models and tools often this year!
     • Model upgrades (qwen3 hybrid thinking mode in Apr)
     • MCP goes mainstream (GitHub remote MCP in Apr, with leagues to follow)
     • Pricing rage (Claude Code: $20→$200/month, Apr→Aug)
     • Leaderboard races (glm-4.5 to compete with Claude Sonnet in Jul)
     • YOLO products (gpt-5 deletes gpt-4o, then quickly restored in Aug)
     • Price war (DeepSeek V3.1 nearly 48x cheaper than OpenAI o3-Pro in Sep)
  10. LLM evaluation cron job example
      • Use built-in Phoenix evaluators for correctness and factuality
      • Define our own domain-specific evaluator for common errors
      • Evaluation can take a while to complete!
      https://github.com/elastic/testing-genai-applications (an eval sketch follows)
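For a feel of what such an eval job can look like, here is a hedged sketch using Phoenix's built-in LLM-as-judge hallucination evaluator; the dataframe rows and judge model are illustrative, and exact argument names can vary by Phoenix version.

```python
# Run a built-in Phoenix hallucination evaluator over a small batch of outputs.
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Each row: the user input, the retrieved reference text, and the model's answer.
df = pd.DataFrame(
    {
        "input": ["What port does Envoy listen on by default?"],
        "reference": ["The example config binds the listener to port 10000."],
        "output": ["Envoy listens on port 10000 in the example config."],
    }
)

judge = OpenAIModel(model="gpt-4o-mini")  # illustrative judge model
results = llm_classify(
    dataframe=df,
    model=judge,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,  # evals can take a while; explanations help triage
)
print(results[["label", "explanation"]])
```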
  11. 2025 is the year of agents, and evals are changing
      • Agents complete actions, not just text, audio, or video
      • Sessions are long-running and multi-turn
      • Tool calls are important, as their responses impact the whole context
      • Token efficiency and isolation matter
  12. Goose is your local coding agent, born in Sydney
      goose is your on-machine AI agent, capable of automating complex development tasks from start to finish. More than just code suggestions, goose can build entire projects from scratch, write and execute code, debug failures, orchestrate workflows, and interact with external APIs - autonomously.
      Blog on Tetrate Agent Router Service + Goose with free $10
  13. Goose is open all the way, and this is tricky
      • Consumers, not just coders (cannot assume users are technical)
      • Public and local LLMs (capability hype vs. reality plays out here)
      • 100% MCP (first to practice many parts of tool orchestration)
      • Open source project (must be efficient in problem solving)
  14. Model change impact in agents
      Common failures:
      - Feature support mismatch: local model lacks tool calling
      - Version drift: different model versions behave differently
      - Schema differences: tool definitions don't match
      - Performance characteristics: timeout behavior varies
      Real examples:
      - Python inline recipes work on GPT-4 but fail on local Qwen
      - Excel tool transposes data differently across model versions
      - Function calling syntax varies between providers
      How do we evaluate this?
  15. Goose eval 3 ways
      • Goose Recipe (ad-hoc tests of an agentic task)
      • Goose Bench (model + configs over common tasks)
      • Terminal Bench (normalized agent tests: goose vs. others)
  16. Goose Recipes: portable YAML files that standardize agent behavior
      • Reproducibility: Same task, same tools, different models
      • Shareability: Team uses identical prompts/configs
      • Parameterization: Template variables for reuse
      goose run --recipe code-review.yaml --params pr_number=4587 --params repo=block/goose
  17. Ad-hoc testing with Goose Recipes
      Test the same recipe against different providers (see the sketch below)
      https://block.github.io/goose/blog/2025/08/12/mcp-testing
      Want advanced? →
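A sketch of that ad-hoc pattern as a parameterized pytest run; the provider/model pairs, the GOOSE_PROVIDER and GOOSE_MODEL environment variables, and the timeout are assumptions to check against your Goose version's configuration docs.

```python
# Run the same recipe against several providers and fail fast on any non-zero exit.
import os
import subprocess
import pytest

PROVIDERS = [  # illustrative provider/model pairs
    ("openai", "gpt-4o-mini"),
    ("ollama", "qwen3:8b"),
]

@pytest.mark.parametrize("provider,model", PROVIDERS)
def test_code_review_recipe(provider, model):
    # Assumed env vars for provider selection; verify against your Goose docs.
    env = {**os.environ, "GOOSE_PROVIDER": provider, "GOOSE_MODEL": model}
    result = subprocess.run(
        [
            "goose", "run",
            "--recipe", "code-review.yaml",
            "--params", "pr_number=4587",
            "--params", "repo=block/goose",
        ],
        env=env,
        capture_output=True,
        text=True,
        timeout=600,
    )
    assert result.returncode == 0, result.stderr
```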
  18. Goose Bench: measure GenAI model tool-calling and task-completion ability
      Execution flow:
      1. Run evaluations: each task executed 3 times per model
      2. Post-processing: optional LLM judge scores subjective tasks
      3. Score calculation: combines judge score + tool usage + format validity
      4. Aggregation: Python scripts average scores across runs and generate CSVs (toy sketch below)
      Key metrics:
      - Task completion success (0-1)
      - Tool call accuracy and usage
      - Token efficiency
      - Execution time
      https://block.github.io/goose/blog/2025/03/31/goose-benchmark
      Want more? →
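To illustrate the aggregation step only, a toy sketch that averages per-run scores into a CSV; the column names and scores are invented for the example and are not Goose Bench's actual output format.

```python
# Average per-run scores per (model, task) and write a summary CSV.
import pandas as pd

runs = pd.DataFrame(  # in practice, loaded from the per-run result files
    [
        {"model": "gpt-4o-mini", "task": "fix_failing_test", "run": 1, "score": 0.9},
        {"model": "gpt-4o-mini", "task": "fix_failing_test", "run": 2, "score": 0.7},
        {"model": "qwen3:8b", "task": "fix_failing_test", "run": 1, "score": 0.4},
    ]
)

summary = runs.groupby(["model", "task"], as_index=False)["score"].mean()
summary.to_csv("bench_summary.csv", index=False)
print(summary)
```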
  19. Terminal Bench (https://www.tbench.ai/)
      Independent benchmark for testing AI agents in real terminal environments
      • Scope: 229 production tasks, from compiling code to training models
      • Approach: Agents solve real-world CLI tasks autonomously
      • Evaluation: Binary pass/fail on practical outcomes
      • Public leaderboard: Compare models at tbench.ai
  20. Benchmark hazards
      • Model clustering: Top agents rely on similar models (e.g., Claude-4-Opus), limiting true innovation.
      • Public tasks: Open tasks enable memorization/tuning, inflating scores without real gains.
      • Hybrid obfuscation: OB-1 blends models (GPT-5, Claude): are wins from agent design or from the underlying models?
      • Quick obsolescence: Improvements invalidate rankings in a month; re-validate often.
      Shelf life: a brief window before tuning, advances, and contamination lower a leaderboard's relevance.
  21. Coming up for air
      OK, wow, that was a lot. What does this mean for me?!
  22. Key takeaways for any developer building with GenAI
      • Treat LLMs like flaky services: Use recording tools (VCR) for deterministic tests
      • Evaluate outputs rigorously: LLM-as-judge evals for correctness and domain checks
      • Design for model agility: Easy provider switching without breaking CI
      • Monitor AI interactions: OpenTelemetry traces for debugging and usage (tracing sketch below)
      • Test beyond units: Parameterized recipes for end-to-end behavior
      Blog on Tetrate Agent Router Service + Goose with free $10
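For the monitoring takeaway, a minimal sketch of OpenInference auto-instrumentation exporting traces to a locally running Phoenix instance; the project name and model are illustrative, and it assumes the arize-phoenix-otel and openinference-instrumentation-openai packages are installed.

```python
# Trace OpenAI calls with OpenInference and send spans to a local Phoenix collector.
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

# Assumes Phoenix is already running locally (e.g. via `phoenix serve`).
tracer_provider = register(project_name="genai-tests")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# From here on, every OpenAI client call emits a span with prompts, tokens, and latency.
from openai import OpenAI

client = OpenAI()
client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative
    messages=[{"role": "user", "content": "ping"}],
)
```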
  23. Thank you very much! linkedin.com/in/adrianfcole
      Start Today:
      • Add VCR tests this afternoon
      • Spin up local Ollama tomorrow
      • Write your first MCP recipe by Friday
      Resources:
      • Envoy AI Gateway - proxy patterns
      • Goose - agentic testing insights
      • VCR.py - reliable recordings
      • OpenInference & Arize Phoenix - DIY evals
      • Terminal-Bench - agentic benchmark with leaderboard
  24. Test with local models!
      • Eliminate external dependencies
      • No rate limits
      • Control the exact model
      • Air-gapped, compliance friendly (a local-client sketch follows)
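A minimal sketch of pointing the standard OpenAI client at a local Ollama server via its OpenAI-compatible endpoint; the model name is illustrative (use whatever you have pulled locally).

```python
# Run the same client code against a local Ollama model: no rate limits, no egress.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # any non-empty string; Ollama ignores it
)

reply = client.chat.completions.create(
    model="qwen3:8b",  # illustrative; use a model you've pulled with `ollama pull`
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(reply.choices[0].message.content)
```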
  25. Tools are needed by agents, but your mileage may vary
      Error: Request failed: registry.ollama.ai/library/qwen3-coder:latest does not support tools (type: api_error) (status 400)
  26. Tool glitches are very routine
      • Unresolved tool issues (the qwen3-coder issue has been open since 1 Aug)
      • Slow resolution is normal (it can take 6 months to fully support a new model)
      • Support may be different elsewhere (LM Studio or llama.cpp may work)
      • Community workarounds abound (blogs, alternate repos, etc.)
      https://github.com/ollama/ollama/issues/11621
      https://hf.co/unsloth
  27. Goose Toolshim
      • Main model generates text with JSON tool calls
      • Interpreter Ollama model extracts/structures the tool calls
      • Overhead, but enables tools for models lacking support
      (A sketch of the pattern follows.)
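To show the shape of that pattern (not Goose's actual implementation), a sketch in which a small local interpreter model turns the main model's free-form text into a structured tool call; the prompt, endpoint, and model names are assumptions.

```python
# Toolshim-style sketch: an interpreter model structures tool calls the main model wrote as text.
import json
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # local Ollama endpoint

def extract_tool_call(main_model_text: str) -> dict:
    """Ask a small interpreter model to emit the tool call as strict JSON."""
    prompt = (
        "Extract the single tool call from the text below as JSON with keys "
        '"name" and "arguments". Reply with JSON only.\n\n' + main_model_text
    )
    reply = local.chat.completions.create(
        model="llama3.2:3b",  # illustrative small interpreter model
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(reply.choices[0].message.content)

# e.g. extract_tool_call('I will run the shell tool: {"name": "shell", "arguments": {"cmd": "ls"}}')
```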