
"Testing Challenges in the Age of AI" - Devoxx Belgium 2025

The presentation "Testing Challenges in the Age of AI" discusses the evolving landscape of software engineering with the rise of AI.
It highlights the challenges of testing AI-generated code, including non-deterministic responses and slow LLMs.
The talk introduces Mokksy, a tool for fast and deterministic mocking of LLM calls.
It also covers prompt testing with promptfoo and end-to-end testing with Langfuse.
The presentation concludes with challenges in security and preventing abuse in AI-infused systems.

Konstantin Pavlov

October 08, 2025


Transcript

  1. @YourTwitterHandle #Devoxx #YourTag Testing Challenges in the Age of AI. Konstantin Pavlov, Technical Lead / Kotlin AI, JetBrains. #Devoxx #Mokksy #Koog #Kotlin. in/kpavlov | kpavlov.me
  2. The power of AI is rising. And the shadow of vibe-coding is spreading… The world of software engineering has changed.
  3. AGENTS.md: a simple, open format for guiding coding agents.

     ## Testing instructions
     - Write comprehensive tests for new features
     - Update existing tests when refactoring
     - **Prioritize test readability**
     - Use readable test names with backticks, e.g. "fun `should return 200 OK`()"
     - Avoid writing KDocs for tests; keep code self-documenting
     - Write tests in Kotlin for the JUnit5 test runner
     ...
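The naming guidance above can be illustrated with a tiny, self-contained Kotlin sketch. `httpStatusFor` is a hypothetical function invented here for the example, and `check` stands in for a JUnit5 assertion; in a real suite the backtick-named function would carry `@Test`.

```kotlin
// Hypothetical function under test (not from the talk)
fun httpStatusFor(path: String): Int =
    if (path == "/health") 200 else 404

// Backtick names let the test read like a sentence,
// as the AGENTS.md instructions above recommend.
fun `should return 200 OK`() {
    check(httpStatusFor("/health") == 200)
}

fun main() {
    `should return 200 OK`()
    println("passed")
}
```

Kotlin allows spaces in backtick-escaped identifiers, which is why this style is popular for JUnit5 test names.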
  4. Post-generation review: do an extra step to improve.

     ## Review instructions
     - You are a software architect reviewing this code.
     - Ensure the code adheres to SOLID principles and best practices.
     - Check for readability, maintainability, and scalability issues.
     - Identify potential bugs or logical errors in the implementation.
     - Suggest improvements with clear, concise reasoning.
     - Prioritize high-quality, professional recommendations in every response.
     ...
  5. What is Koog? An open-source framework for AI agents (https://koog.ai).
     • Combines LLMs and tools in a graph, unlocking complex agents
     • Offers multiplatform development and fault-tolerant reliability
     • Optimizes token usage with intelligent history compression
     • Runs with Spring Boot
  6. Challenges
     🤯 Non-deterministic responses
     💸 Tokens cost money!
     🏺 CI/CD is fragile. Rate limits
     😓 Hard to simulate edge cases
  7. How to run tests faster?
     • Split the test suite; run tests in parallel
     • Avoid duplicating test scenarios
     • Run a single end-to-end scenario testing the whole workflow from start to finish
     • Run smoke tests often; run the full suite regularly
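The parallel-execution advice can be enabled in JUnit5 itself. A minimal sketch of a `junit-platform.properties` file (conventionally placed under `src/test/resources`):

```properties
# Let JUnit5 schedule test classes and methods concurrently
junit.jupiter.execution.parallel.enabled = true
junit.jupiter.execution.parallel.mode.default = concurrent
```

Tests that share mutable state would need `@Execution(SAME_THREAD)` or resource locks to stay deterministic.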
  8. Mokksy (https://mokksy.dev)
     📦 Black-box
     🏎 Fast & deterministic
     ⏫ Streaming / Server-Sent Events (SSE)
     🤖 OpenAI, Anthropic, Gemini, Ollama, A2A Protocol
     🆓 Zero token costs 💰
     ✈ Works offline/on CI, even on a plane
     💥 Negative scenarios
  9. Why yet another library?

     | Feature              | Mokksy | Wiremock |
     |----------------------|--------|----------|
     | REST API             | ✅     | ✅       |
     | HTTP Streaming / SSE | ✅     | ❌       |
     | Admin API            | ❌     | ✅       |
     | LLM API              | ✅     | ❌       |
  10. Mocking an LLM call: ⚡ 🆓 ✈ fast + free + offline (https://mokksy.dev)

      val mockOpenAi = MockOpenai()

      mockOpenAi.completion {
          userMessageContains("Tell me a joke about LLM")
      } responds {
          assistantContent = "Why did the LLM cross the road? Hallucination."
      }

      val model = OpenAiChatModel.builder()
          .baseUrl(mockOpenAi.baseUrl())
          // other settings
          .build()
  11. Challenge: Security
      • Personal data is sent cross-border to a non-compliant AI model
      • Logging raw LLM requests with personal data
      • Training/tuning models on customer's data
  12. Measure performance
      • Collect data
      • Beware of PII; use differential privacy to anonymize data
      • Verify prompts with an anonymized dataset
      • Keep prompts separate from code
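The PII warning can be made concrete with a minimal Kotlin sketch of scrubbing obvious identifiers (emails here) from an LLM request before it is logged. `redact` and the regex are illustrative inventions; a production system would use a dedicated anonymization or differential-privacy pipeline rather than ad-hoc regexes.

```kotlin
// Naive email matcher; real PII detection is far broader
// (names, phone numbers, addresses, IDs, ...).
val EMAIL = Regex("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}")

// Scrub emails from a prompt before it reaches the logs.
fun redact(prompt: String): String =
    EMAIL.replace(prompt, "<EMAIL>")

fun main() {
    println(redact("Contact john.doe@example.com about the invoice"))
    // prints: Contact <EMAIL> about the invoice
}
```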
  13. Preventing abuse
      • Non-relevant questions: "Write an essay", "Solve a math problem for me"
      • Policy violations: offensive language, harassment, threats
      • Jailbreaking: finding ways to bypass safety guardrails and content filters
  14. Preventing abuse
      • Prompt engineering: craft better prompts
      • Use moderation models: reject bad questions before the AI starts working on them
      • Re-evaluate AI responses: "Does it answer the question?", "Is it relevant to the domain?"
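The "reject before the AI starts working" idea can be sketched as a trivial pre-moderation gate in Kotlin. The keyword list and `isAllowed` function are invented for illustration; the talk's actual recommendation is to call a dedicated moderation model, not to match keywords.

```kotlin
// Naive illustration of a pre-moderation gate: turn away clearly
// off-topic prompts before spending tokens on the LLM.
val blockedPatterns = listOf("write an essay", "solve a math problem")

fun isAllowed(prompt: String): Boolean {
    val normalized = prompt.lowercase()
    return blockedPatterns.none { normalized.contains(it) }
}

fun main() {
    println(isAllowed("Write an essay about cats"))    // prints: false
    println(isAllowed("What is my account balance?"))  // prints: true
}
```

A real gate would also re-check the model's *response* for relevance, as the slide suggests.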