
Did you really get better?

From https://www.jfokus.se/talks.html?showid=2701

Testing is hard, which is why developers tend to avoid it. Testing non-deterministic things is even harder, which is unfortunate, since we're all writing AI-infused applications, and AI models are notoriously non-deterministic. What happens when the applications start using advanced features, such as RAG, tools, and agents? How do you test these applications? There must be some tools, technologies, and practices out there that can help, while not costing your organization lots of money!

Join Java Champions Oleg & Eric in this session as they revisit a topic they debuted at JFokus last year. The AI landscape changes at a breathtaking pace, so what new capabilities and strategies have come along in the last year?

Hopefully, by the end of the presentation you will be able to answer the question "If I change my model/prompt/application, did I get better or worse?"

Eric Deandrea

February 02, 2026

Transcript

  1. @shelajev @edeandrea About Us
     • Java Champion
     • 27+ years software development experience
     • Works on Open Source projects: Quarkus, LangChain4j (& Quarkus LangChain4j), Docking Java (project lead), Spring Boot, Spring Framework, Spring Security, Wiremock, Testcontainers
     • Boston Java Users ACM Chapter Vice Chair
     • Published author
     • Black belt in martial arts
     • Cat lover
  2. @shelajev @edeandrea
     • Showcase & explain Quarkus, how it enables modern Java development & the Kubernetes-native experience
     • Introduce familiar Spring concepts, constructs, & conventions and how they map to Quarkus
     • Equivalent code examples between Quarkus and Spring, as well as emphasis on testing patterns & practices
     https://red.ht/quarkus-spring-devs
  3. @shelajev @edeandrea About Us
     • Also a Java Champion
     • Java developer
     • ~12 years as a Developer Advocate
     • Loves to stare at Open Source projects
     • Allergic to cats
  4. @shelajev @edeandrea What are you hoping to learn here? What are you going to leave with?
  5. @shelajev @edeandrea Did we get better or worse with this release? (& can we figure it out before we release?)
  6. @shelajev @edeandrea What’s changed in the last year?
     • Standardization
       ◦ Or lack thereof (lots of competing standards)?
     • Distributed
     • Orchestrated
     • Agentic
     • Agents
     • Agentic Agents
     • Autonomous Agents
     • Autonomous Agentic Agents
     Smells like microservices?
  7. @shelajev @edeandrea DevOps Evolution (diagram): the DevOps loop (Plan, Code, Build, Test, Release, Deploy, Operate, Monitor) extended with ML/data stages (Collect, Curate, Analyze Data, Train, Evaluate, Deploy).
  8. @shelajev @edeandrea Demo architecture (diagram): a chat bot communicates over a web socket with a Claim AI Assistant that uses RAG retrieval and input guardrails; tool invocation feeds a claim status notification and a Generate Email AI Assistant protected by a politeness output guardrail. A legend separates “Code I write” from “Voodoo magic”, AI replacing humans from AI replacing software, and flags which pieces could be an agent. https://github.com/edeandrea/non-deterministic-no-problem
  9. @shelajev @edeandrea What’s the difference between these? (diagram)
     • CRUD application: Application → Database
     • Microservice: Application → Service
     • AI-Infused application: Application → Model
  10. @shelajev @edeandrea What’s the difference between these? (diagram)
      • CRUD application: Application → Database
      • Microservice: Application → Service
      • AI-Infused application: Application → Model
      The arrows are the integration points.
  11. @shelajev @edeandrea
      Signal from tests:
      - stuff needs fixing
      - confident to release
      Purpose of tests:
      ❌ prevent breaking prod
      ✅ continuously improve your app
  12. @shelajev @edeandrea What’s the difference between these? (diagram)
      • CRUD application: Application → Database
      • Microservice: Application → Service
      • AI-Infused application: Application → Model
      Each integration point needs Observability (metrics, tracing, logs, auditing) and Fault Tolerance (timeout, bulkhead, circuit breaker, rate limiting, fallbacks, …).
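
A minimal sketch of what the fault-tolerance side can look like around an LLM call in Quarkus, using the standard MicroProfile Fault Tolerance annotations. The ClaimSummaryService class, method names, and the specific timeout/retry values are assumptions for illustration, not from the talk.

```java
import java.time.temporal.ChronoUnit;
import jakarta.enterprise.context.ApplicationScoped;
import org.eclipse.microprofile.faulttolerance.Fallback;
import org.eclipse.microprofile.faulttolerance.Retry;
import org.eclipse.microprofile.faulttolerance.Timeout;

@ApplicationScoped
public class ClaimSummaryService {

    @Timeout(value = 10, unit = ChronoUnit.SECONDS) // don't wait forever on the model
    @Retry(maxRetries = 2)                          // transient failures happen
    @Fallback(fallbackMethod = "summaryUnavailable") // degrade gracefully
    public String summarize(String claimText) {
        return callModel(claimText); // the actual LLM call, e.g. an AI service
    }

    String summaryUnavailable(String claimText) {
        return "Summary is currently unavailable. Please try again later.";
    }

    private String callModel(String claimText) {
        // Placeholder: wire up your AI service / model client here
        throw new UnsupportedOperationException("model invocation goes here");
    }
}
```
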
  13. @shelajev @edeandrea Test pyramid (diagram), from low effort to high realism: unit tests, integration tests, tests with the application server, tests of REST endpoints, tests using AI, end-to-end tests.
  14. @shelajev @edeandrea Stupidity
      Prompt: Please return a JSON document in the following format: { "name": "String", "countryOfOrigin": "String" }
      Response: Sure, I’d love to give you some JSON! Here it is: ```json { "name": "Eric", "countryOfOrigin": "USA" } ```
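
One low-tech way to cope with this kind of chatty, fence-wrapped response is to strip the Markdown fence and surrounding prose before parsing. The helper below is a hedged sketch using Jackson; the class name and heuristics are made up for illustration and are not from the talk or any library.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public final class LenientJson {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    private LenientJson() {
    }

    public static JsonNode parse(String llmResponse) throws Exception {
        String text = llmResponse.trim();
        // Strip a leading ```json (or ```) fence if present
        if (text.startsWith("```")) {
            int firstNewline = text.indexOf('\n');
            text = firstNewline >= 0 ? text.substring(firstNewline + 1) : "";
        }
        // Strip a trailing ``` fence if present
        if (text.endsWith("```")) {
            text = text.substring(0, text.length() - 3);
        }
        // If the model added chatty prose, fall back to the first '{' ... last '}'
        int start = text.indexOf('{');
        int end = text.lastIndexOf('}');
        if (start >= 0 && end > start) {
            text = text.substring(start, end + 1);
        }
        return MAPPER.readTree(text);
    }
}
```
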
  15. @shelajev @edeandrea Guardrails
      - Out of the box in LangChain4j & Quarkus!
      - Functions used to validate the input and output of the model
      - Detect invalid input or output
      - Detect prompt injection
      - Detect hallucination
      - Chain of guardrails
        - Sequential
        - Stop at first failure
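
As a sketch of the input side, here is a naive prompt-injection guardrail. The package and method names follow the quarkus-langchain4j guardrail API as documented at the time of writing; the guardrail API has been moving upstream into LangChain4j, so verify the exact types against your version. The keyword check is a toy heuristic, not a real detector.

```java
import dev.langchain4j.data.message.UserMessage;
import io.quarkiverse.langchain4j.guardrails.InputGuardrail;
import io.quarkiverse.langchain4j.guardrails.InputGuardrailResult;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class PromptInjectionGuardrail implements InputGuardrail {

    @Override
    public InputGuardrailResult validate(UserMessage userMessage) {
        String text = userMessage.singleText().toLowerCase();
        // Naive heuristic for illustration only; real detection typically uses
        // a classifier or a dedicated model.
        if (text.contains("ignore all previous instructions")) {
            return failure("Possible prompt injection detected");
        }
        return success();
    }
}
```

The guardrail is then attached to the AI service method, typically via an annotation such as @InputGuardrails(PromptInjectionGuardrail.class) (again, check your version's docs for the exact annotation).
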
  16. @shelajev @edeandrea Retry and Reprompt
      Output guardrails can have 4 different outcomes:
      - Success - Response is passed to the caller or next guardrail
      - Fatal - Stop and throw an exception
      - Retry - Call the model again with the same context (we never know ;-))
      - Reprompt - Call the model again with an additional message indicating how to fix the response
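
A hedged sketch of an output guardrail using the reprompt outcome, following the same quarkus-langchain4j API as above (type and helper-method names may differ in your version). The politeness check itself is deliberately trivial.

```java
import dev.langchain4j.data.message.AiMessage;
import io.quarkiverse.langchain4j.guardrails.OutputGuardrail;
import io.quarkiverse.langchain4j.guardrails.OutputGuardrailResult;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class PolitenessGuardrail implements OutputGuardrail {

    @Override
    public OutputGuardrailResult validate(AiMessage responseFromLLM) {
        String text = responseFromLLM.text();
        if (text != null && text.toLowerCase().contains("stupid")) {
            // Reprompt: call the model again and tell it how to fix its answer
            return reprompt("Response was impolite",
                    "Please rephrase the answer politely.");
        }
        // Success: pass the response to the caller or to the next guardrail
        return success();
    }
}
```
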
  17. @shelajev @edeandrea Observability
      Collect metrics
      - Exposed as Prometheus
      - Track token usage & cost
      OpenTelemetry Tracing
      - Trace interactions with the LLM
      Auditing
      - Keep track of interactions with the LLM
      - Ability to replay & re-score interactions
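
To make the token-usage point concrete, here is a hedged sketch (not the built-in metrics support) that records token counts as Micrometer counters, which a Prometheus endpoint can then scrape. TokenUsage is LangChain4j's class; how you obtain it (a model listener, response metadata, ...) depends on your LangChain4j/Quarkus version, and the metric names are assumptions.

```java
import dev.langchain4j.model.output.TokenUsage;
import io.micrometer.core.instrument.MeterRegistry;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.inject.Inject;

@ApplicationScoped
public class TokenUsageMetrics {

    @Inject
    MeterRegistry registry;

    public void record(String model, TokenUsage usage) {
        if (usage == null) {
            return;
        }
        if (usage.inputTokenCount() != null) {
            registry.counter("llm.tokens", "model", model, "type", "input")
                    .increment(usage.inputTokenCount());
        }
        if (usage.outputTokenCount() != null) {
            registry.counter("llm.tokens", "model", model, "type", "output")
                    .increment(usage.outputTokenCount());
        }
    }
}
```
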
  18. @shelajev @edeandrea Rescoring - Evaluation
      https://docs.quarkiverse.io/quarkus-langchain4j/dev/testing.html#_evaluation
      1. Sample
         ◦ The test case containing input parameters & expected output.
      2. Function under test
         ◦ The function being evaluated. Receives input parameters & produces an actual output.
      3. Evaluation Strategy
         ◦ Logic that determines if the actual output is acceptable based on the expected output.
      4. Evaluation Result
         ◦ Outcome (pass/fail), score, explanation, and metadata from the evaluation.
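
To illustrate how the four concepts fit together, here is a hand-rolled sketch. It is not the quarkus-langchain4j evaluation API (see the linked docs for the real classes); all names below are made up, and the "contains" strategy is a stand-in for semantic-similarity or LLM-as-judge strategies.

```java
import java.util.List;
import java.util.function.Function;

public class MiniEvaluation {

    // 1. Sample: input plus the output we expect
    record Sample(String input, String expectedOutput) {}

    // 4. Evaluation Result: outcome, score, and an explanation
    record EvaluationResult(boolean passed, double score, String explanation) {}

    // 3. Evaluation Strategy: decides whether the actual output is acceptable
    interface EvaluationStrategy {
        EvaluationResult evaluate(Sample sample, String actualOutput);
    }

    public static void main(String[] args) {
        List<Sample> samples = List.of(
                new Sample("What is the capital of Sweden?", "Stockholm"));

        // 2. Function under test: here a stand-in for a call to the AI service
        Function<String, String> functionUnderTest =
                question -> "The capital of Sweden is Stockholm.";

        // A trivial "contains" strategy, for illustration only
        EvaluationStrategy strategy = (sample, actual) -> {
            boolean ok = actual.toLowerCase().contains(sample.expectedOutput().toLowerCase());
            return new EvaluationResult(ok, ok ? 1.0 : 0.0,
                    ok ? "Expected answer found" : "Expected answer missing");
        };

        for (Sample sample : samples) {
            String actual = functionUnderTest.apply(sample.input());
            EvaluationResult result = strategy.evaluate(sample, actual);
            System.out.printf("%s -> passed=%b score=%.1f (%s)%n",
                    sample.input(), result.passed(), result.score(), result.explanation());
        }
    }
}
```
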
  21. @shelajev @edeandrea RAG Evaluation: Two Surfaces
      • RAG introduces new surfaces that need evaluation
      • Evaluate the retrieval of relevant context documents
      • Evaluate the generation based on the retrieved context
      • Semantic similarity checks are crucial for RAG evaluation
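
A hedged sketch of a semantic-similarity check on the generation side, using LangChain4j's EmbeddingModel and CosineSimilarity. The class name, the choice of embedding model you plug in (for example a local all-MiniLM-L6-v2 from the langchain4j-embeddings artifacts), and the threshold value are assumptions for illustration.

```java
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.CosineSimilarity;

public class SemanticSimilarityCheck {

    private final EmbeddingModel embeddingModel;
    private final double threshold;

    public SemanticSimilarityCheck(EmbeddingModel embeddingModel, double threshold) {
        this.embeddingModel = embeddingModel;
        this.threshold = threshold;
    }

    /** Returns true if the actual answer is semantically close enough to the expected one. */
    public boolean isAcceptable(String expectedAnswer, String actualAnswer) {
        Embedding expected = embeddingModel.embed(expectedAnswer).content();
        Embedding actual = embeddingModel.embed(actualAnswer).content();
        double similarity = CosineSimilarity.between(expected, actual);
        return similarity >= threshold;
    }
}
```

The same idea applies to the retrieval surface: compare the retrieved chunks against the reference context for a sample and score how much of it was actually recalled.
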
  22. @shelajev @edeandrea Actual takeaways
      • Naming things is still the hardest thing in computer science
      • LangChain4j & Quarkus are awesome! They provide foundational building blocks!
      • Don’t build observability into your apps - build it around your apps
      • Don’t forget your craft: the DevOps process is there to help
      • Write tests, expect change and failure, deploy often
      • AI is just an API call