Slide 1

Slide 1 text

A look inside the LLM closed box: test, observe and evaluate your RAG assisted chatbot by Mario Fusco & Dimitris Kafetzis

Slide 2

Slide 2 text

What Generative AI can do is amazing …

Slide 3

Slide 3 text

… well … almost always :) …

Slide 4

Slide 4 text

… but … there’s an elephant in the room: how can you write reliable and deterministic tests for something that works on a statistical basis???

Slide 5

Slide 5 text

Where does the problem come from?

Slide 6

Slide 6 text

What hallucinations are

A hallucination is an inconsistency, and it can happen at different levels:
- Inconsistency within output sentences: “Mario is tall, thus he is the shortest person”
- Inconsistency between input and output: “Generate formal text to announce to colleagues …” → “Yo boyz!”
- Factually wrong: “First man on the Moon in 2024”

Slide 7

Slide 7 text

Why hallucinations happen

An LLM is a black box able to hallucinate. This happens for multiple reasons:
- Partial/inconsistent training data: LLMs learn how to generalize from training data, assuming it is comprehensive (but we don’t train them!)
- The generation configuration can be “hallucination prone”: sampling settings like temperature, top_k and top_p guide creativity (but we often want the LLM to be creative)
- Context/input quality: the more specific it is, the better (we can control this!)

Slide 8

Slide 8 text

An unprecedented definition of malfunctioning

Given their statistical nature, GenAI algorithms can hallucinate or make other spectacular mistakes even when they work very well. These mistakes cannot be fixed in the same way as a software bug.

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

Why not fix an LLM’s hallucinations by using another LLM to validate the outcome of the first?

Slide 11

Slide 11 text

Why not fix an LLM’s hallucinations by using another LLM to validate the outcome of the first? … this reminds me of a 20-year-old joke …

Slide 12

Slide 12 text

LLM as a judge

Using an LLM to judge another (or even the same) LLM sounds wrong, but together with humans it is a useful tool. Asking the right questions is critical, and can detect many issues/hallucinations (though usually not factual errors):
- Style/Tone: “Is the response using formal language?”
- Bias: “Does the response show prejudice against a group?”
- Sentiment: “Does the text have a positive sentiment?”
- Consistency: “Does the response stick to the provided context?”
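A minimal sketch of one such check in LangChain4j (the StyleJudge class and the prompt wording are illustrative, not part of any framework; it assumes a configured ChatLanguageModel used as the judge):

import dev.langchain4j.model.chat.ChatLanguageModel;

// Hypothetical helper: asks a judge model a yes/no style question
// about the response under test (no factual checking involved).
class StyleJudge {

    private final ChatLanguageModel judgeModel;

    StyleJudge(ChatLanguageModel judgeModel) {
        this.judgeModel = judgeModel;
    }

    boolean isFormal(String response) {
        String verdict = judgeModel.generate("""
            Is the following response using formal language?
            Answer with a single word: true or false.

            Response: %s
            """.formatted(response));
        return verdict.trim().equalsIgnoreCase("true");
    }
}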

Slide 13

Slide 13 text

Why don’t normal tests work? What do we need to do differently?

Slide 14

Slide 14 text

Signal from tests:
❌ stuff needs fixing
✅ confident to release

Slide 15

Slide 15 text

Signal from tests:
❌ stuff needs fixing
✅ confident to release

Purpose of tests:
❌ prevent breaking prod
✅ continuously improve your app

Slide 16

Slide 16 text

Testing Applications

Whenever we build a software system, we need to ensure that it behaves in a way that matches our intentions. With traditional systems, we do this primarily through testing. We provide a thoughtfully selected sample of input, and verify that the system responds in the way we expect.

Slide 17

Slide 17 text

Testing AI-Infused Applications

With LLM-based systems, we encounter a system that no longer behaves deterministically. Such a system will provide different outputs to the same inputs on repeated requests. This doesn't mean we cannot examine its behavior to ensure it matches our intentions, but it does mean we have to think about it differently.
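One practical consequence: instead of a single all-or-nothing assertion, we can run the same check repeatedly and assert a pass rate. A minimal sketch (the run count, the threshold and the askModel() call are illustrative assumptions):

import java.util.function.Supplier;

// Runs a non-deterministic check several times and reports the pass rate,
// so a test can assert "usually correct" rather than "always identical".
class PassRate {

    static double of(int runs, Supplier<Boolean> check) {
        int passed = 0;
        for (int i = 0; i < runs; i++) {
            if (check.get()) {
                passed++;
            }
        }
        return (double) passed / runs;
    }
}

// Usage sketch: require at least 8 out of 10 runs to mention "cat".
// assertThat(PassRate.of(10, () -> askModel("What do you see?").contains("cat")))
//         .isGreaterThanOrEqualTo(0.8);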

Slide 18

Slide 18 text

Come on … how hard can it be? :)

@ParameterizedTest
@MethodSource("modelsSupportingImageInputs")
@EnabledIf("supportsSingleImageInputAsPublicURL")
protected void should_accept_single_image_as_public_URL(M model) {
    // given
    UserMessage userMessage = UserMessage.from(
            TextContent.from("What do you see?"),
            ImageContent.from(CAT_IMAGE_URL)
    );
    ChatRequest chatRequest = ChatRequest.builder()
            .messages(userMessage)
            .build();

    // when
    ChatResponse chatResponse = chat(model, chatRequest).chatResponse();

    // then
    AiMessage aiMessage = chatResponse.aiMessage();
    assertThat(aiMessage.text()).containsIgnoringCase("cat");
    assertThat(aiMessage.toolExecutionRequests()).isNull();
}

https://upload.wikimedia.org/wikipedia/commons/e/e9/Felis_silvestris_silvestris_small_gradual_decrease_of_quality.png

Slide 19

Slide 19 text

Come on … how hard can it be? :)

@ParameterizedTest
@MethodSource("modelsSupportingImageInputs")
@EnabledIf("supportsSingleImageInputAsPublicURL")
protected void should_accept_single_image_as_public_URL(M model) {
    // given
    UserMessage userMessage = UserMessage.from(
            TextContent.from("What do you see?"),
            ImageContent.from(CAT_IMAGE_URL)
    );
    ChatRequest chatRequest = ChatRequest.builder()
            .messages(userMessage)
            .build();

    // when
    ChatResponse chatResponse = chat(model, chatRequest).chatResponse();

    // then
    AiMessage aiMessage = chatResponse.aiMessage();
    assertThat(aiMessage.text()).containsIgnoringCase("cat");
    assertThat(aiMessage.toolExecutionRequests()).isNull();
}

https://upload.wikimedia.org/wikipedia/commons/e/e9/Felis_silvestris_silvestris_small_gradual_decrease_of_quality.png

[ERROR] Failures:
[ERROR]   OpenAiChatModelIT>AbstractBaseChatModelIT.should_accept_single_image_as_public_URL:1131
Expecting actual:
  "I see an animal with a feline appearance. It has a distinctive striped coat and green eyes. The background appears blurred or out of focus."
to contain:
  "cat"
(ignoring case)

Slide 20

Slide 20 text

A simple AI-infused app

@RegisterAiService
public interface CategoryRouter {

    @UserMessage("""
        Analyze the user request and categorize it as 'legal', 'medical' or 'technical'.
        Reply with only one of those words. The user request is: '{request}'.
        """)
    RequestCategory classify(String request);
}

public enum RequestCategory { LEGAL, MEDICAL, TECHNICAL, UNKNOWN }

@GET
@Path("request/{request}")
@Produces(MediaType.TEXT_PLAIN)
public String assist(String request) {
    return routerService.findExpert(request).apply(request);
}

public UnaryOperator<String> findExpert(String request) {
    return switch (categoryRouter.classify(request)) {
        case LEGAL -> legalExpert::chat;
        case MEDICAL -> medicalExpert::chat;
        case TECHNICAL -> technicalExpert::chat;
        default -> ignore -> "I cannot find an appropriate category.";
    };
}

Slide 21

Slide 21 text

… and how to test it

@QuarkusTest
@AiScorer
public class CategoryRouterTest {

    @Inject
    CategoryRouter categoryRouter;

    @Test
    void testCategoryRouter(Scorer scorer) {
        Samples<RequestCategory> samples = new Samples<>(
                EvaluationSample.builder().withName("Medical")
                        .withParameter("I broke my leg what should I do")
                        .withExpectedOutput(RequestCategory.MEDICAL).build(),
                EvaluationSample.builder().withName("Legal")
                        .withParameter("I have been sued for tax evasion")
                        .withExpectedOutput(RequestCategory.LEGAL).build());

        EvaluationReport report = scorer.evaluate(samples,
                i -> categoryRouter.classify(i.get(0)),
                (sample, output) -> sample.expectedOutput() == output);

        assertThat(report.score()).isEqualTo(100); // Expect full success
    }
}

<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-testing-scorer-junit5</artifactId>
    <version>${quarkus.langchain4j.version}</version>
    <scope>test</scope>
</dependency>

Slide 22

Slide 22 text

… and how to test it

@QuarkusTest
@AiScorer
public class CategoryRouterTest {

    @Inject
    CategoryRouter categoryRouter;

    @Test
    void testCategoryRouter(Scorer scorer) {
        Samples<RequestCategory> samples = new Samples<>(
                EvaluationSample.builder().withName("Medical")
                        .withParameter("I broke my leg what should I do")
                        .withExpectedOutput(RequestCategory.MEDICAL).build(),
                EvaluationSample.builder().withName("Legal")
                        .withParameter("I have been sued for tax evasion")
                        .withExpectedOutput(RequestCategory.LEGAL).build());

        EvaluationReport report = scorer.evaluate(samples,
                i -> categoryRouter.classify(i.get(0)),
                (sample, output) -> sample.expectedOutput() == output);

        assertThat(report.score()).isEqualTo(100); // Expect full success
    }
}

<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-testing-scorer-junit5</artifactId>
    <version>${quarkus.langchain4j.version}</version>
    <scope>test</scope>
</dependency>

The Scorer is a utility

Slide 23

Slide 23 text

… and how to test it

@QuarkusTest
@AiScorer
public class CategoryRouterTest {

    @Inject
    CategoryRouter categoryRouter;

    @Test
    void testCategoryRouter(Scorer scorer) {
        Samples<RequestCategory> samples = new Samples<>(
                EvaluationSample.builder().withName("Medical")
                        .withParameter("I broke my leg what should I do")
                        .withExpectedOutput(RequestCategory.MEDICAL).build(),
                EvaluationSample.builder().withName("Legal")
                        .withParameter("I have been sued for tax evasion")
                        .withExpectedOutput(RequestCategory.LEGAL).build());

        EvaluationReport report = scorer.evaluate(samples,
                i -> categoryRouter.classify(i.get(0)),
                (sample, output) -> sample.expectedOutput() == output);

        assertThat(report.score()).isEqualTo(100); // Expect full success
    }
}

<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-testing-scorer-junit5</artifactId>
    <version>${quarkus.langchain4j.version}</version>
    <scope>test</scope>
</dependency>

The Scorer is a utility that evaluates a set of samples

Slide 24

Slide 24 text

… and how to test it

@QuarkusTest
@AiScorer
public class CategoryRouterTest {

    @Inject
    CategoryRouter categoryRouter;

    @Test
    void testCategoryRouter(Scorer scorer) {
        Samples<RequestCategory> samples = new Samples<>(
                EvaluationSample.builder().withName("Medical")
                        .withParameter("I broke my leg what should I do")
                        .withExpectedOutput(RequestCategory.MEDICAL).build(),
                EvaluationSample.builder().withName("Legal")
                        .withParameter("I have been sued for tax evasion")
                        .withExpectedOutput(RequestCategory.LEGAL).build());

        EvaluationReport report = scorer.evaluate(samples,
                i -> categoryRouter.classify(i.get(0)),
                (sample, output) -> sample.expectedOutput() == output);

        assertThat(report.score()).isEqualTo(100); // Expect full success
    }
}

<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-testing-scorer-junit5</artifactId>
    <version>${quarkus.langchain4j.version}</version>
    <scope>test</scope>
</dependency>

The Scorer is a utility that evaluates a set of samples against a function

Slide 25

Slide 25 text

… and how to test it

@QuarkusTest
@AiScorer
public class CategoryRouterTest {

    @Inject
    CategoryRouter categoryRouter;

    @Test
    void testCategoryRouter(Scorer scorer) {
        Samples<RequestCategory> samples = new Samples<>(
                EvaluationSample.builder().withName("Medical")
                        .withParameter("I broke my leg what should I do")
                        .withExpectedOutput(RequestCategory.MEDICAL).build(),
                EvaluationSample.builder().withName("Legal")
                        .withParameter("I have been sued for tax evasion")
                        .withExpectedOutput(RequestCategory.LEGAL).build());

        EvaluationReport report = scorer.evaluate(samples,
                i -> categoryRouter.classify(i.get(0)),
                (sample, output) -> sample.expectedOutput() == output);

        assertThat(report.score()).isEqualTo(100); // Expect full success
    }
}

<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-testing-scorer-junit5</artifactId>
    <version>${quarkus.langchain4j.version}</version>
    <scope>test</scope>
</dependency>

The Scorer is a utility that evaluates a set of samples against a function and a set of evaluation strategies

public interface EvaluationStrategy<T> {
    boolean evaluate(EvaluationSample<T> sample, T output);
}
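Since the strategy has a single abstract method, a custom one is just a lambda, as in the test above. A hedged sketch for string outputs, checking that the answer contains the expected text regardless of case (illustrative, not a built-in strategy):

// Hypothetical custom strategy: case-insensitive containment check.
EvaluationStrategy<String> containsExpected = (sample, output) ->
        output != null
                && output.toLowerCase().contains(sample.expectedOutput().toLowerCase());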

Slide 26

Slide 26 text

Built-in Evaluation Strategies

new SemanticSimilarityStrategy(embeddingModel, 0.85);

<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-testing-scorer-semantic-similarity</artifactId>
    <version>${quarkus.langchain4j.version}</version>
    <scope>test</scope>
</dependency>

The SemanticSimilarityStrategy evaluates the similarity between the actual output and the expected output using cosine similarity. It requires an embedding model and a minimum similarity threshold.

Slide 27

Slide 27 text

Built-in Evaluation Strategies

new SemanticSimilarityStrategy(embeddingModel, 0.85);

<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-testing-scorer-semantic-similarity</artifactId>
    <version>${quarkus.langchain4j.version}</version>
    <scope>test</scope>
</dependency>

The SemanticSimilarityStrategy evaluates the similarity between the actual output and the expected output using cosine similarity. It requires an embedding model and a minimum similarity threshold.

EvaluationStrategy<String> strategy = new AiJudgeStrategy(myModel, """
    You are an AI evaluating a response and the expected output.
    You need to evaluate if the model response is correct or not.
    Return true if the response is correct, false otherwise.

    Response to evaluate: {response}
    Expected output: {expected_output}
    """);

<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-testing-scorer-ai-judge</artifactId>
    <version>${quarkus.langchain4j.version}</version>
    <scope>test</scope>
</dependency>

The AiJudgeStrategy (LLM-as-a-judge) leverages an AI model to determine if the actual output matches the expected output.
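A hedged usage sketch of plugging such strategies into the scorer.evaluate(...) call shown earlier (chatbot.answer(...) and judgePrompt are hypothetical placeholders, and exact signatures may vary between versions):

// Evaluate free-text answers with two strategies at once: embedding-based
// similarity plus an LLM judge, then assert an acceptable overall score.
EvaluationReport report = scorer.evaluate(samples,
        i -> chatbot.answer(i.get(0)),                      // function under test (hypothetical)
        new SemanticSimilarityStrategy(embeddingModel, 0.85),
        new AiJudgeStrategy(myModel, judgePrompt));
assertThat(report.score()).isGreaterThanOrEqualTo(80);      // tolerate occasional misses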

Slide 28

Slide 28 text

LangChain4j evals
Toward a comprehensive LLM evaluation framework

Slide 29

Slide 29 text

What is RAG?
❖ Retrieve information from a database
❖ Augment user query with retrieved data
❖ Generate content from the augmented context

Slide 30

Slide 30 text

What is RAG and what does it mean to evaluate it?
❖ Retrieve information from a database
❖ Augment user query with retrieved data
❖ Generate content from the augmented context
❖ Retrieval evaluation (see the sketch below)
  ➢ Are the retrieved documents relevant? (Precision)
  ➢ How many of the relevant documents are retrieved? (Recall)
❖ Generation evaluation
  ➢ What is the quality of the final generated output?
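Concretely, with Ret = the set of retrieved chunks and Rel = the set of relevant (ground-truth) chunks: Precision = |Ret ∩ Rel| / |Ret| and Recall = |Ret ∩ Rel| / |Rel|. A minimal sketch over chunk identifiers (a hypothetical helper, not from any framework):

import java.util.HashSet;
import java.util.Set;

// Set-based precision/recall over retrieved vs. relevant chunk IDs.
class RetrievalMetrics {

    static double precision(Set<String> retrieved, Set<String> relevant) {
        return retrieved.isEmpty() ? 0.0
                : (double) intersection(retrieved, relevant).size() / retrieved.size();
    }

    static double recall(Set<String> retrieved, Set<String> relevant) {
        return relevant.isEmpty() ? 0.0
                : (double) intersection(retrieved, relevant).size() / relevant.size();
    }

    private static Set<String> intersection(Set<String> a, Set<String> b) {
        Set<String> result = new HashSet<>(a);
        result.retainAll(b);
        return result;
    }
}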

Slide 31

Slide 31 text

A practical example

Slide 32

Slide 32 text

Parts of an evaluation
❖ Scope of evaluation and dataset creation
❖ Scorers
❖ Evaluator

Slide 33

Slide 33 text

What is a dataset?
❖ A curated set of data entries used to evaluate aspects of the application
❖ Usually the more we have the better, but quality is important

Question: What is the capital of Germany?
Retrieved Contexts: "Berlin is the capital and largest city of Germany."
Ground Truth Answer: Berlin

Slide 34

Slide 34 text

How to make a dataset
❖ Define
  ➢ aspects of the application you want to evaluate
  ➢ possible scenarios you want to test
  ➢ what data is needed to test them
❖ Collect samples of the data that best represent the scenarios
  ➢ from the real world
  ➢ through synthetic generation
❖ Aim for a dataset that is
  ➢ large enough to provide meaningful insights
  ➢ not so large that it becomes unwieldy
❖ Ensure that the data
  ➢ is of high quality
  ➢ accurately reflects the scenarios under evaluation

Slide 35

Slide 35 text

How to make a dataset (Example)

Slide 36

Slide 36 text

How to make a dataset (Example)

Slide 37

Slide 37 text

How to make a dataset (Example)

Slide 38

Slide 38 text

How to make a dataset (Example)

Slide 39

Slide 39 text

How to make a dataset (Example)

Slide 40

Slide 40 text

How to make a dataset (initial log entry)

{
  "insertId": "rroswqf9rk00w",
  "jsonPayload": {
    "message": "1a3efd1b-6aa2-41e2-92c2-56f58471acf3;;;a9a840e2-521c-4b3b-82a0-80bee1ca8122;;;question;;;If I get Input is too long for requested model when sending data to sonnet 3.5, how can I manage this? split the text and do more than call or how ?"
  },
  …
  "timestamp": "2024-10-14T07:40:44.646Z",
  "severity": "INFO",
  "labels": {
    "loggerName": "com.vaadin.demo.views.ChatView",
    "levelValue": "20000",
    "levelName": "INFO"
  },
  "logName": "projects/langchain4j/logs/application-log",
  "receiveTimestamp": "2024-10-14T07:40:44.833940511Z"
}

…

{
  "insertId": "8d5lksf9xmvie",
  "jsonPayload": {
    "message": "1a3efd1b-6aa2-41e2-92c2-56f58471acf3;;;retrieved;;;tutorials\\1-chat-and-language-models.md;7;;integrations\\language-models\\jlama.md;7;;tutorials\\6-tools.md;6"
  },
  …
}

{
  "insertId": "3bya7df19n47j",
  "jsonPayload": {
    "message": "1a3efd1b-6aa2-41e2-92c2-56f58471acf3;;;a9a840e2-521c-4b3b-82a0-80bee1ca8122;;;answer;;;If you encounter the error \"Input is too long for requested model\" when sending data to the Sonnet 3.5 model, you can manage this by splitting the text into smaller chunks and making multiple calls to the model. Since LLMs do not maintain the state of the conversation, you'll need to keep track of the context yourself.\n\nHere’s how you can handle it:\n\n1. **Split the Text:** Break your input into smaller, manageable pieces that fit within the model's input limits.\n\n2. **Maintain Context:** When making multiple calls, ensure you keep the relevant context from previous interactions. This may involve including previous user and AI messages in each new request.\n\n3. **Use `UserMessage`:** Create a `UserMessage` for each chunk of text. You can use `new UserMessage(\"Your text here\")` or `UserMessage.from(\"Your text here\")` to do this.\n\n4. **Call the Model:** For each chunk, generate a response using the model, like so:\n ```java\n UserMessage userMessage = UserMessage.from(\"Your chunk of text here\");\n Response response = model.generate(singletonList(userMessage), toolSpecifications);\n AiMessage aiMessage = response.content();\n ```\n\n5. **Combine Responses:** After processing all chunks, combine the responses to form a complete output.\n\nBy following these steps, you can effectively manage long inputs and maintain a coherent conversation with the model."
  },
  …
}

Slide 41

Slide 41 text

How to make a dataset (cleaned-up dataset entry)

{
  "query": "What is langchain4j",
  "ground truth chunks": [
    {
      "chunk": "## LangChain4j Library StructureLangChain4j features a modular design, comprising: - The `langchain4j-core` module, which defines core abstractions (such as `ChatLanguageModel` and `EmbeddingStore`) and their APIs … You can use the `langchain4j-{integration}` modules independently. For additional features, simply import the main `langchain4j` dependency."
    },
    {
      "chunk": "LangChain4j began development in early 2023 amid the ChatGPT hype. \n We noticed a lack of Java counterparts to the numerous Python and JavaScript LLM libraries and frameworks,\n and we had to fix that! \n Although \"LangChain\" is in our name … For easier integration, LangChain4j also includes integration with\n [Quarkus](/tutorials/quarkus-integration) and [Spring Boot](/tutorials/spring-boot-integration)."
    }
  ],
  "answer": "LangChain4j is a Java library that aims to simplify integrating Language Model Models (LLMs) into Java applications. It features a modular design with different modules such as `langchain4j-core` for core abstractions, the main `langchain4j` module for useful tools, and various `langchain4j-{integration}` … The library is actively developed, with both low-level and high-level abstractions for interacting with LLMs. Additionally, LangChain4j includes integration with Quarkus and Spring Boot for easier integration."
}
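In Java, an entry like this can be modeled as a small record. A hedged sketch (the demo project's actual type may differ; the accessor names mirror the ones used in the evaluation code later on):

import java.util.List;

// One dataset entry: the user query, the ground-truth chunks the retriever
// should return, and the reference answer the generator should approximate.
record DatasetEntry(String query,
                    List<String> expectedContextResults,
                    String answer) {}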

Slide 42

Slide 42 text

Hand-picked vs. synthetic?

Hand-picked
Pros:
+ Better reflects real-world situations the application might face, when based on real-world data
+ Allows mapping edge cases
+ Ensures that every possible subject is covered
Cons:
- Time consuming
- Expensive

Synthetic
Pros:
+ Low cost
+ Quick to generate
+ Better than starting from nothing
Cons:
- Might not cover everything
- Possible hallucinations, depending on the model used
- Mostly low-complexity entries

It is possible to mix both strategies.
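Synthetic entries can be bootstrapped from the corpus itself. A hedged sketch (it assumes a LangChain4j ChatLanguageModel and reuses the DatasetEntry record sketched earlier; the prompts are illustrative):

import java.util.List;
import dev.langchain4j.model.chat.ChatLanguageModel;

// Hypothetical generator: derive a question/answer pair from one
// ground-truth chunk of documentation and keep it as a dataset entry.
class SyntheticDatasetGenerator {

    static DatasetEntry synthesize(ChatLanguageModel model, String chunk) {
        String question = model.generate("""
            Read the following documentation excerpt and write one realistic
            user question that it answers. Reply with the question only.

            Excerpt: %s
            """.formatted(chunk));
        String answer = model.generate("""
            Answer the question using only the excerpt below.

            Question: %s
            Excerpt: %s
            """.formatted(question, chunk));
        return new DatasetEntry(question, List.of(chunk), answer);
    }
}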

Slide 43

Slide 43 text

Picking the scorers for your application

Ground Truth text
---
Here's how:

1. **Unified APIs:** LLM providers (like OpenAI or Google Vertex AI) and embedding (vector) stores (such as Pinecone or Milvus) use proprietary APIs. LangChain4j offers a unified API to avoid the need for learning and implementing specific APIs for each of them. To experiment with different LLMs or embedding stores, you can easily switch between them without the need to rewrite your code. LangChain4j currently supports [15+ popular LLM providers](/integrations/language-models/) and [20+ embedding stores](/integrations/embedding-stores/).

2. **Comprehensive Toolbox:** Since early 2023, the community has been building numerous LLM-powered applications, identifying common abstractions, patterns, and techniques. LangChain4j has refined these into a ready to use package. Our toolbox includes tools ranging from low-level prompt templating, chat memory management, and function calling to high-level patterns like AI Services and RAG.
---

Retrieved texts
---
1. **Unified APIs:** LLM providers (like OpenAI or Google Vertex AI) and embedding (vector) stores (such as Pinecone or Milvus) use proprietary APIs. LangChain4j offers a unified API to avoid the need for learning and implementing specific APIs for each of them. To experiment with different LLMs or embedding stores, you can easily switch between them without the need to rewrite your code. LangChain4j currently supports [15+ popular LLM providers](/integrations/language-models/) and [20+ embedding stores](/integrations/embedding-stores/).
---
---
2. **Comprehensive Toolbox:** Since early 2023, the community has been building numerous LLM-powered applications, identifying common abstractions, patterns, and techniques. LangChain4j has refined these into a ready to use package. Our toolbox includes tools ranging from low-level prompt templating, chat memory management, and function calling to high-level patterns like AI Services and RAG.
---

Similar content, but split into paragraphs

Slide 44

Slide 44 text

Picking the scorers for your application

Ground Truth text
---
Here's how:

1. **Unified APIs:** LLM providers (like OpenAI or Google Vertex AI) and embedding (vector) stores (such as Pinecone or Milvus) use proprietary APIs. LangChain4j offers a unified API to avoid the need for learning and implementing specific APIs for each of them. To experiment with different LLMs or embedding stores, you can easily switch between them without the need to rewrite your code. LangChain4j currently supports [15+ popular LLM providers](/integrations/language-models/) and [20+ embedding stores](/integrations/embedding-stores/).

2. **Comprehensive Toolbox:** Since early 2023, the community has been building numerous LLM-powered applications, identifying common abstractions, patterns, and techniques. LangChain4j has refined these into a ready to use package. Our toolbox includes tools ranging from low-level prompt templating, chat memory management, and function calling to high-level patterns like AI Services and RAG.
---

Retrieved texts
---
1. **Unified APIs:** LLM providers (like OpenAI or Google Vertex AI) and embedding (vector) stores (such as Pinecone or Milvus) use proprietary APIs. LangChain4j offers a unified API to avoid the need for learning and implementing specific APIs for each of them. To experiment with different LLMs or embedding stores, you can easily switch between them without the need to rewrite your code. LangChain4j currently supports [15+ popular LLM providers](/integrations/language-models/) and [20+ embedding stores](/integrations/embedding-stores/).
---

Missing content

Slide 45

Slide 45 text

Picking the scorers for your application

Ground Truth text
---
Here's how:

1. **Unified APIs:** LLM providers (like OpenAI or Google Vertex AI) and embedding (vector) stores (such as Pinecone or Milvus) use proprietary APIs. LangChain4j offers a unified API to avoid the need for learning and implementing specific APIs for each of them. To experiment with different LLMs or embedding stores, you can easily switch between them without the need to rewrite your code. LangChain4j currently supports [15+ popular LLM providers](/integrations/language-models/) and [20+ embedding stores](/integrations/embedding-stores/).

2. **Comprehensive Toolbox:** Since early 2023, the community has been building numerous LLM-powered applications, identifying common abstractions, patterns, and techniques. LangChain4j has refined these into a ready to use package. Our toolbox includes tools ranging from low-level prompt templating, chat memory management, and function calling to high-level patterns like AI Services and RAG.
---

Retrieved texts
---
LLM providers (like OpenAI or Google Vertex AI) and embedding (vector) stores (such as Pinecone or Milvus) use proprietary APIs.
---
---
LangChain4j offers a unified API to avoid the need for learning and implementing specific APIs for each of them.
---
---
To experiment with different LLMs or embedding stores, you can easily switch between them without the need to rewrite your code.
---
---
LangChain4j currently supports [15+ popular LLM providers](/integrations/language-models/) and [20+ embedding stores](/integrations/embedding-stores/).
---

Split even further, into sentences

Slide 46

Slide 46 text

Picking the scorers for your application

Expected Answer
“LangChain4j is a Java library that aims to simplify integrating Language Model Models (LLMs) into Java applications. It features a modular design with different modules such as `langchain4j-core` for core abstractions, the main `langchain4j` module for useful tools, and various `langchain4j-{integration}` modules for integration with different LLM providers and embedding stores. The project started in early 2023 and is a fusion of ideas from various sources like LangChain, Haystack, LlamaIndex, and the broader community. The library is actively developed, with both low-level and high-level abstractions for interacting with LLMs. Additionally, LangChain4j includes integration with Quarkus and Spring Boot for easier integration.”

Returned Answer
“LangChain4j is a Java library designed to streamline the integration of Language Model Models (LLMs) into Java applications. It boasts a modular architecture, featuring components like langchain4j-core for core abstractions, the primary langchain4j module with useful tools, and various langchain4j-{integration} modules for seamless integration with different LLM providers and embedding stores. Launched in early 2023, the project combines ideas from sources such as LangChain, Haystack, LlamaIndex, and the broader community. The library is under active development, offering both low-level and high-level abstractions for interacting with LLMs. Moreover, LangChain4j provides integration with frameworks like Quarkus and Spring Boot, making the setup process even easier.”

Semantically similar, but not word-for-word the same

Slide 47

Slide 47 text

Picking the scorers for your application

Document Retrieval
❖ Sentence Level Precision Recall
  ➢ Extract all sentences from all text chunks to calculate Precision and Recall
❖ RougeL based Recall
  ➢ Use RougeL between text chunks to calculate Recall (sketched below)
❖ FuzzyText Precision Recall
  ➢ Use weighted fuzzy text search between text chunks to calculate Precision and Recall
❖ Token Level Precision Recall
  ➢ Extract tokens from all text chunks to calculate Precision and Recall

Generation 🔨
❖ RougeL Answer 🔨
  ➢ Calculate RougeL Precision, Recall and F1 between answers
❖ Bert Answer 🔨
  ➢ Use a BERT-based model to calculate semantic similarity between answers
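The RougeL-based scorers rely on the longest common subsequence (LCS) between two token sequences. A hedged sketch of a RougeL-style recall (illustrative, not the demo project's actual implementation):

// RougeL-style recall: length of the longest common subsequence of word
// tokens, divided by the length of the ground-truth token sequence.
class RougeL {

    static double recall(String groundTruth, String retrieved) {
        String[] a = groundTruth.toLowerCase().split("\\s+");
        String[] b = retrieved.toLowerCase().split("\\s+");
        int[][] lcs = new int[a.length + 1][b.length + 1];
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                lcs[i][j] = a[i - 1].equals(b[j - 1])
                        ? lcs[i - 1][j - 1] + 1
                        : Math.max(lcs[i - 1][j], lcs[i][j - 1]);
            }
        }
        return a.length == 0 ? 0.0 : (double) lcs[a.length][b.length] / a.length;
    }
}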

Slide 48

Slide 48 text

Evaluation Procedure
1. Set up the application
2. Determine the parameters we want to track, e.g.
   a. Size of chunks
   b. Chunk overlap
   c. Number of returned entries
3. Run through each entry in the dataset and score the application
4. Average the results
5. Change the parameters
6. Repeat from 3 (see the sweep sketch below)
7. …
8. Profit?
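Steps 3-6 amount to a parameter sweep. A hedged sketch of the outer loop (the parameter values are illustrative, and runEvaluation is a hypothetical method wrapping the scoring loop shown on the next slides):

// Re-run the whole evaluation for each parameter combination and print the
// averaged scores per configuration, so configurations can be compared.
for (int chunkSize : new int[]{500, 1000, 2000}) {
    for (int overlap : new int[]{0, 100}) {
        for (int maxResults : new int[]{3, 5, 10}) {
            Map<String, Double> averages = runEvaluation(chunkSize, overlap, maxResults);
            System.out.printf("chunkSize=%d overlap=%d maxResults=%d -> %s%n",
                    chunkSize, overlap, maxResults, averages);
        }
    }
}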

Slide 49

Slide 49 text

Example evaluation
Set up the retrieval part of our application

public class PrecisionRecallEvaluationDemo {

    public static void main(String[] args) {
        // Load all documents from the LangChain4j documentation.
        List<Document> l4jDocuments = FileSystemDocumentLoader
                .loadDocumentsRecursively("../langchain4j/docs/docs");

        // Split the documents.
        DocumentSplitter splitter = DocumentSplitters.recursive(1000, 0);
        List<TextSegment> segments = splitter.splitAll(l4jDocuments);

        EmbeddingModel embeddingModel = new BgeSmallEnV15QuantizedEmbeddingModel();
        List<Embedding> embeddings = embeddingModel.embedAll(segments).content();

        // Add them to the embedding store.
        EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
        embeddingStore.addAll(embeddings, segments);

Slide 50

Slide 50 text

Example evaluation
Set up the retrieval part of our application
● Load the documents
● Split them into chunks
● Calculate embeddings
● Store them

public class PrecisionRecallEvaluationDemo {

    public static void main(String[] args) {
        // Load all documents from the LangChain4j documentation.
        List<Document> l4jDocuments = FileSystemDocumentLoader
                .loadDocumentsRecursively("../langchain4j/docs/docs");

        // Split the documents.
        DocumentSplitter splitter = DocumentSplitters.recursive(1000, 0);
        List<TextSegment> segments = splitter.splitAll(l4jDocuments);

        EmbeddingModel embeddingModel = new BgeSmallEnV15QuantizedEmbeddingModel();
        List<Embedding> embeddings = embeddingModel.embedAll(segments).content();

        // Add them to the embedding store.
        EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
        embeddingStore.addAll(embeddings, segments);

Slide 51

Slide 51 text

Example Evaluation
Add the evaluator and somewhere to store the scores

var fuzzyMatchingEvaluator = new FuzzyMatchingChunkEvaluator();
HashMap<String, Double> averageFuzzyResults = new HashMap<>();

Slide 52

Slide 52 text

Example Evaluation
Use the embedding model, and pass the retrieved chunks to the scorer to evaluate
● Get the embedding for the query
● Get text chunks based on it
● Score against the ground-truth chunks of text
● Add results to calculate averages later

for (DatasetEntry entry : Dataset.get()) {
    var queryEmbedding = embeddingModel.embed(entry.query()).content();
    var searchRequest = EmbeddingSearchRequest.builder()
            .queryEmbedding(queryEmbedding).maxResults(5).build();
    var searchResult = embeddingStore.search(searchRequest);
    var fuzzyResults = fuzzyMatchingEvaluator.evaluate(
            entry.expectedContextResults(),
            searchResult.matches().stream()
                    .map(EmbeddingMatch::embedded).toList());
    for (String key : fuzzyResults.keySet()) {
        averageFuzzyResults.put(key,
                averageFuzzyResults.getOrDefault(key, 0.0) + fuzzyResults.get(key));
    }
}

Slide 53

Slide 53 text

Example Evaluation
Calculate the average score from the results above

System.out.println("Average fuzzy results:");
for (String key : averageFuzzyResults.keySet()) {
    System.out.println(key + ": " + averageFuzzyResults.get(key) / Dataset.get().size());
}

Slide 54

Slide 54 text

DEMO TIME !!!

Slide 55

Slide 55 text

Example Evaluation
Running the application

Slide 56

Slide 56 text

Example Evaluation
Running the application
● Precision: fraction of retrieved items that are relevant
● Recall: fraction of relevant items that are actually retrieved

Higher recall → more of the context we want is actually retrieved from the store (so it should be prioritized at first)

Slide 57

Slide 57 text

Results of the evaluators

Slide 58

Slide 58 text

Key takeaways
❖ Evaluations are only as good as the data they use (garbage in / garbage out)
❖ 100% on every metric might not be possible
❖ Pick the aspects you want to improve and focus on them
❖ For some pairs of metrics, maximising one can negatively influence the other
❖ The dataset will change as the application evolves (e.g. changes to the text corpus, new scenarios to add)

Slide 59

Slide 59 text

References
❖ Evals demo project - https://github.com/Dkafetzis/evals-demo
❖ Quarkus: Testing AI-Infused Applications - https://docs.quarkiverse.io/quarkus-langchain4j/dev/testing.html
❖ Creating a LLM-as-a-Judge That Drives Business Results - https://hamel.dev/blog/posts/llm-judge/
❖ Your AI Product Needs Evals - https://hamel.dev/blog/posts/evals/
❖ A Field Guide to Rapidly Improving AI Products - https://hamel.dev/blog/posts/field-guide/
❖ Ragas Evaluation Dataset - https://docs.ragas.io/en/stable/concepts/components/eval_dataset/
❖ Emerging Patterns in Building GenAI Products - https://martinfowler.com/articles/gen-ai-patterns/