
"Testing Challenges in the Age of AI" - Devoxx Belgium 2025

The presentation "Testing Challenges in the Age of AI" discusses the evolving landscape of software engineering with the rise of AI.
It highlights the challenges of testing AI-generated code, including non-deterministic responses and slow LLMs.
The talk introduces Mokksy, a tool for fast and deterministic mocking of LLM calls.
It also covers prompt testing with promptfoo and end-to-end testing with Langfuse.
The presentation concludes with challenges in security and preventing abuse in AI-infused systems.

Konstantin Pavlov

October 08, 2025


Transcript

  1. @YourTwitterHandle #Devoxx #YourTag Testing Challenges in the Age of AI. Konstantin Pavlov, Technical Lead / Kotlin AI, JetBrains. #Devoxx #Mokksy #Koog #Kotlin. in/kpavlov | kpavlov.me
  2. The power of AI is rising. And the shadow of vibe-coding is spreading… The world of software engineering has changed.
  3. AGENTS.md: a simple, open format for guiding coding agents.

     ## Testing instructions
     - Write comprehensive tests for new features
     - Update existing tests when refactoring
     - **Prioritize test readability**
     - Use readable test names with backticks, e.g. "fun `should return 200 OK`()"
     - Avoid writing KDocs for tests; keep code self-documenting
     - Write tests in Kotlin for the JUnit5 test runner
     ...
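The naming guidance above can be illustrated with a tiny, self-contained Kotlin sketch. `httpStatusFor` is a hypothetical function invented here for the example, and `check` stands in for a JUnit5 assertion; in a real suite the backtick-named function would carry `@Test`.

```kotlin
// Hypothetical function under test (not from the talk)
fun httpStatusFor(path: String): Int =
    if (path == "/health") 200 else 404

// Backtick names let the test read like a sentence,
// as the AGENTS.md instructions above recommend.
fun `should return 200 OK`() {
    check(httpStatusFor("/health") == 200)
}

fun main() {
    `should return 200 OK`()
    println("passed")
}
```

Kotlin allows spaces in backtick-escaped identifiers, which is why this style is popular for JUnit5 test names.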
  4. Post-generation review: do an extra step to improve.

     ## Review instructions
     - You are a software architect reviewing this code.
     - Ensure the code adheres to SOLID principles and best practices.
     - Check for readability, maintainability, and scalability issues.
     - Identify potential bugs or logical errors in the implementation.
     - Suggest improvements with clear, concise reasoning.
     - Prioritize high-quality, professional recommendations in every response.
     ...
  5. What is Koog? An open-source framework for AI agents (https://koog.ai).
     • Combines LLMs and tools in a graph, unlocking complex agents
     • Offers multiplatform development and fault-tolerant reliability
     • Optimizes token usage with intelligent history compression
     • Runs with Spring Boot
  6. Challenges
     🤯 Non-deterministic responses
     💸 Tokens cost money!
     🏺 CI/CD is fragile. Rate limits
     😓 Hard to simulate edge cases
  7. How to run tests faster?
     • Split the test suite; run tests in parallel
     • Avoid duplicating test scenarios
     • Run a single end-to-end scenario testing the whole workflow from start to finish
     • Run smoke tests often; run the full suite regularly
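The parallel-execution advice can be enabled in JUnit5 itself. A minimal sketch of a `junit-platform.properties` file (conventionally placed under `src/test/resources`):

```properties
# Let JUnit5 schedule test classes and methods concurrently
junit.jupiter.execution.parallel.enabled = true
junit.jupiter.execution.parallel.mode.default = concurrent
```

Tests that share mutable state would need `@Execution(SAME_THREAD)` or resource locks to stay deterministic.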
  8. Mokksy (https://mokksy.dev)
     📦 Black-box
     🏎 Fast & deterministic
     ⏫ Streaming / Server-Sent Events (SSE)
     🤖 OpenAI, Anthropic, Gemini, Ollama, A2A Protocol
     🆓 Zero token costs 💰
     ✈ Works offline/on CI, even on a plane
     💥 Negative scenarios
  9. Why yet another library?

     | Feature              | Mokksy | Wiremock |
     |----------------------|--------|----------|
     | REST API             | ✅     | ✅       |
     | HTTP Streaming / SSE | ✅     | ❌       |
     | Admin API            | ❌     | ✅       |
     | LLM API              | ✅     | ❌       |
  10. Mocking an LLM call: ⚡ 🆓 ✈ fast + free + offline (https://mokksy.dev)

      val mockOpenAi = MockOpenai()

      mockOpenAi.completion {
          userMessageContains("Tell me a joke about LLM")
      } responds {
          assistantContent = "Why did the LLM cross the road? Hallucination."
      }

      val model = OpenAiChatModel.builder()
          .baseUrl(mockOpenAi.baseUrl())
          // other settings
          .build()
  11. Challenge: Security
      • Personal data is sent cross-border to a non-compliant AI model
      • Logging raw LLM requests with personal data
      • Training/tuning models on customer's data
  12. Measure performance
      • Collect data
      • Beware of PII; use differential privacy to anonymize data
      • Verify prompts with an anonymized dataset
      • Keep prompts separate from code
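The PII warning can be made concrete with a minimal Kotlin sketch of scrubbing obvious identifiers (emails here) from an LLM request before it is logged. `redact` and the regex are illustrative inventions; a production system would use a dedicated anonymization or differential-privacy pipeline rather than ad-hoc regexes.

```kotlin
// Naive email matcher; real PII detection is far broader
// (names, phone numbers, addresses, IDs, ...).
val EMAIL = Regex("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}")

// Scrub emails from a prompt before it reaches the logs.
fun redact(prompt: String): String =
    EMAIL.replace(prompt, "<EMAIL>")

fun main() {
    println(redact("Contact john.doe@example.com about the invoice"))
    // prints: Contact <EMAIL> about the invoice
}
```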
  13. Preventing abuse
      • Non-relevant questions: "Write an essay", "Solve a math problem for me"
      • Policy violations: offensive language, harassment, threats
      • Jailbreaking: finding ways to bypass safety guardrails and content filters
  14. Preventing abuse
      • Prompt engineering: craft better prompts
      • Use moderation models: reject bad questions before the AI starts working on them
      • Re-evaluate AI responses: "Does it answer the question?", "Is it relevant to the domain?"
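The "reject before the AI starts working" idea can be sketched as a trivial pre-moderation gate in Kotlin. The keyword list and `isAllowed` function are invented for illustration; the talk's actual recommendation is to call a dedicated moderation model, not to match keywords.

```kotlin
// Naive illustration of a pre-moderation gate: turn away clearly
// off-topic prompts before spending tokens on the LLM.
val blockedPatterns = listOf("write an essay", "solve a math problem")

fun isAllowed(prompt: String): Boolean {
    val normalized = prompt.lowercase()
    return blockedPatterns.none { normalized.contains(it) }
}

fun main() {
    println(isAllowed("Write an essay about cats"))    // prints: false
    println(isAllowed("What is my account balance?"))  // prints: true
}
```

A real gate would also re-check the model's *response* for relevance, as the slide suggests.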