Slide 1

Slide 1 text

qaware.de AI’s Secret Weapon: Turning Documents into Knowledge Alexander Eimer [email protected] @aeimer Cloud Native Night 05.09.2024

Slide 2

Slide 2 text

About me ● Senior Software Engineer @QAware GmbH ● Currently working in an AI / RAG project at a big German corporation ● Platform engineer who shifted to AI ● Passionate about delivering quality in software and platforms QAware | 2

Slide 3

Slide 3 text

QAware | 3 This is Bernd

Slide 4

Slide 4 text

QAware | 4 Bernd is moving out of his flat and needs to know what his rental contract obliges him to do when he leaves the old one. 🏚 ➡ 🏡

Slide 5

Slide 5 text

QAware | 5 Bernd has an idea! AI to the rescue!

Slide 6

Slide 6 text

Is it that easy? QAware | 6

Slide 7

Slide 7 text

QAware | 7 🤔 ❓

Slide 8

Slide 8 text

What is “Chat with your documents”? QAware | 8 Using an LLM to get answers and findings from documents (like txt, pdf, docx or any other format with prose text)

Slide 9

Slide 9 text

Why do we want CWYD? QAware | 9 Find, summarize and polish information. It will fundamentally change the way we interact with information, as gathering it becomes faster and easier than ever before.

Slide 10

Slide 10 text

Bigger picture – RAG QAware | 10

Slide 11

Slide 11 text

What is RAG - Retrieval Augmented Generation? ■ Retrieval: Using some kind of search to find the information relevant to a given question. ■ Augmented: Injecting the retrieved information into the prompt alongside the question. ■ Generation: The language model generates the answer, so it hopefully has facts to refer to. QAware | 11

Slide 12

Slide 12 text

Asking Wikipedia 🙈 Prompts often contain a few examples. Examples can be automatically retrieved from a database with document retrieval, sometimes using a vector database. Given a query, a document retriever is called to retrieve the most relevant documents. Wikipedia (en) - Prompt Engineering QAware | 12

Slide 13

Slide 13 text

RAG – information sources QAware | 13

Slide 14

Slide 14 text

Chatbots and AI Assistants: Variants and stages (ordered from low to high effort and benefit)
1. ChatGPT or other chatbots with common knowledge: easy and cheap, but needs guidelines on data protection & compliance
2. ChatGPT with organization-specific knowledge: Retrieval Augmented Generation, transfer learning
3. Specialized AI assistant: self-trained models, process automation

Slide 15

Slide 15 text

Why shouldn’t I train an LLM instead? QAware | 15 ● Bad retrieval of detailed information ● Inflexible, as training takes time ● Access rights can’t be guaranteed ● Expensive to train

Slide 16

Slide 16 text

RAG example: rental agreement QAware | 16

USER: What are my obligations when I want to move out of my flat?

ai_model.complete("
{{ system_prompt }}
{{ chat_history }}
{{ document }}
{{ user_input }}
")

BOT: You need to paint the walls in white color. Also, you need to make sure that all stains on the floor are removed and everything is swept clean before you leave.
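
To make the template concrete, here is a minimal Python sketch of how the placeholders might be filled; build_prompt and the ai_model client are hypothetical stand-ins, not a real library API:

def build_prompt(system_prompt, chat_history, document, user_input):
    # Join the prompt parts with the same "---" separators the slides use
    return "\n---\n".join([system_prompt, chat_history, document, user_input])

prompt = build_prompt(
    system_prompt="You are a friendly ChatBot called QAbot. You never get rude or disrespectful.",
    chat_history="Your chat history so far was:\nN/A",
    document="The user has the following document as further input:\nRental agreement [...]",
    user_input="The user input is:\nWhat are my obligations when I want to move out of my flat?",
)
answer = ai_model.complete(prompt)  # hypothetical LLM client call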

Slide 17

Slide 17 text

RAG example: Actual prompt QAware | 17

ai_model.complete("
You are a friendly ChatBot called QAbot. You never get rude or disrespectful.
---
Your chat history so far was:
N/A
---
The user has the following document as further input:
Rental agreement
This is a rental agreement between [...]
---
The user input is:
What are my obligations when I want to move out of my flat?
")

Slide 18

Slide 18 text

RAG example: rental agreement QAware | 18

USER: What are my obligations when I want to move out of my flat?

BOT: You need to paint the walls in white color. Also, you need to make sure that all stains on the floor are removed and everything is swept clean before you leave.

USER: Thanks, but do I also need to return the key or can I just throw it away?

BOT: According to the rental agreement, paragraph 8, you need to return all keys in person to the landlord.

Slide 19

Slide 19 text

RAG example: Second request QAware | 19

ai_model.complete("
You are a friendly ChatBot called QAbot. You never get rude or disrespectful.
---
Your chat history so far was:
USER: What are my obligations when I want to move out of my flat?
BOT: You need to paint the walls in white color. Also, you need to make sure that all stains on the floor are removed and everything is swept clean before you leave.
---
The user has the following document as further input:
Rental agreement
This is a rental agreement between [...]
---
The user input is:
Thanks, but do I also need to return the key or can I just throw it away?
")

Slide 20

Slide 20 text

CWYD variants

Slide 21

Slide 21 text

There are two base types QAware | 21

Full context: the LLM has the full document in its context
● LLM gets the whole document as input
● Only feasible from GPT-4-turbo on, as recall was too bad before
● Allows “real” research, as all information is present
● ⇒ Allows summarization
● Possibly expensive

Chunks / parts: the LLM has only parts of the document
● This is usually what people mean when they talk about RAG
● Relevant parts of the document must be found by an external component
● No “real” research possible
● ⇒ Can’t summarize the document
● Challenge: finding the correct chunks
● Possibly cheaper

Slide 22

Slide 22 text

CWYD variants QAware | 22 ‼ Retrieval?

Slide 23

Slide 23 text

Full-document context is only feasible since recall got good QAware | 23 https://www.databricks.com/blog/long-context-rag-performance-llms

Slide 24

Slide 24 text

Retrieval methods QAware | 24 ● Retrieval Augmented Generation (RAG) ● Retrieval methods ○ Keyword search ○ Semantic search ⬅ most used ○ APIs ■ Confluence search ■ Google ■ Notion ■ … ○ Knowledge graphs ○ HyDE ○ RAPTOR ○ …

Slide 25

Slide 25 text

Semantic search

Slide 26

Slide 26 text

RAG with semantic search QAware | 26 Phase 1: Ingestion Phase 2: Retrieval

Slide 27

Slide 27 text

Chunking ● Cut input text into (useful) chunks ○ Recursive chunking ○ Semantic chunking ● Each chunk should carry one “semantic” ⇒ no mixing of multiple topics ● Chunks are given to the prompt ⇒ RAG QAware | 27
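
A simplified sketch of recursive chunking in plain Python (real projects often use a library splitter instead): split on the coarsest separator first, then recurse into pieces that are still too long.

def recursive_chunk(text, max_len=500, separators=("\n\n", "\n", ". ", " ")):
    # Base case: short enough, or no finer separator left to try
    if len(text) <= max_len or not separators:
        return [text]
    sep, finer = separators[0], separators[1:]
    chunks = []
    for part in text.split(sep):
        if len(part) <= max_len:
            chunks.append(part)
        else:
            # Still too long: retry with the next, finer separator
            chunks.extend(recursive_chunk(part, max_len, finer))
    return chunks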

Slide 28

Slide 28 text

Chunking (simplified example) QAware unites people who share a passion for coding. We have a deep expertise in software engineering, and a keen eye for perfectly tailored solutions. Since 2005, we’ve been boosting corporations, medium-sized companies, and startups with exceptional quality and productivity in agile software engineering. Our journey as cloud pioneers began in 2010. We are asked when legacy applications need to be analyzed, stabilized or remediated. When they need to go lightweight in the cloud. And we are the ones who build new systems with craft pride, inventiveness and passion. You benefit from excellent code. From outstanding quality. From forward-thinking innovation. And from insights into the real levers for more impact. QAware | 28

Slide 29

Slide 29 text

Chunking (simplified example) QAware unites people who share a passion for coding. We have a deep expertise in software engineering, and a keen eye for perfectly tailored solutions. Since 2005, we’ve been boosting corporations, medium-sized companies, and startups with exceptional quality and productivity in agile software engineering. Our journey as cloud pioneers began in 2010. We are asked when legacy applications need to be analyzed, stabilized or remediated. When they need to go lightweight in the cloud. And we are the ones who build new systems with craft pride, inventiveness and passion. You benefit from excellent code. From outstanding quality. From forward-thinking innovation. And from insights into the real levers for more impact. QAware | 29

Slide 30

Slide 30 text

Typical chunking problem QAware | 30

Slide 31

Slide 31 text

Chunking: Key Learning Chunk sizing is hard! QAware | 31

Slide 32

Slide 32 text

QAware | 32 ✅ ⬅ ⬆ ⬇

Slide 33

Slide 33 text

What is semantic “similarity”? QAware | 33 Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity. Wikipedia (en) - semantic similarity

Slide 34

Slide 34 text

It’s all about vectors [...] ● Semantics can be represented as a big vector ● Embedding models deterministically calculate a vector for a text input ○ text-embedding-ada-002 is very often used (vector length: 1536) ○ text-embedding-3-large is newer (vector length: 3072) ● Cosine similarity as the search function ○ Vectors “close together” have similar semantics ○ Quite fast to compute QAware | 34
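
A small numpy sketch of cosine similarity; the vectors below are made-up toy values, in practice they would come from an embedding model such as text-embedding-3-large:

import numpy as np

def cosine_similarity(a, b):
    # 1.0 = same direction (similar meaning), ~0 = unrelated, -1.0 = opposite
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings, not real model output
v_flat  = np.array([0.8, 0.1, -0.3])
v_house = np.array([0.7, 0.2, -0.2])
v_car   = np.array([-0.5, 0.9, 0.4])

print(cosine_similarity(v_flat, v_house))  # high: similar semantics
print(cosine_similarity(v_flat, v_car))    # low: different semantics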

Slide 35

Slide 35 text

What is a vector? (simplified) ● It’s very hard to explain ● Naively speaking: one component of the vector corresponds to one category ● The input can be more than one word QAware | 35

Slide 36

Slide 36 text

What is a vector? (simplified) QAware | 36

Input            Color   Yellow   Car    BMW    House
Apple            -0.7     0.3    -1.0   -1.0   -1.0
BMW x5            0.2     0.1     0.9    1.0   -1.0
Renault           0.2     0.3     0.9   -0.8   -1.0
Traffic light     0.5     0.8     0.4   -0.2   -1.0
Heating system   -0.9    -1.0    -0.9   -0.5    0.8
Garage           -0.8    -0.9     0.8    0.2    0.8

Slide 37

Slide 37 text

Vector DBs ● Vector databases store vectors together with (meta)data ● They provide search functions on vectors ● Examples ○ Weaviate ○ Chroma ○ pg_vector (Postgres) ○ Cassandra QAware | 37
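
As one example, ingestion and search with Chroma could look roughly like this (a sketch based on the chromadb Python client; by default Chroma embeds documents with a built-in embedding function):

import chromadb

client = chromadb.Client()  # in-memory instance
collection = client.create_collection(name="rental_docs")

# Ingestion: store the chunks, Chroma embeds them automatically
collection.add(
    ids=["chunk-1", "chunk-2"],
    documents=[
        "All keys must be returned in person to the landlord.",
        "The walls must be painted white before moving out.",
    ],
)

# Retrieval: embed the question and return the closest chunks
results = collection.query(query_texts=["Do I have to return the keys?"], n_results=1)
print(results["documents"])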

Slide 38

Slide 38 text

Chunking + Semantics + Vectors + DB Bring it all together to get RAG: ● Chunk the text to get usable parts for the prompt ● Derive the semantics with an embedding model ● Represent the semantic value as a vector ● Store and find the vectors with their chunks in a DB QAware | 38
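
Put together, the retrieval phase fits in a few lines; embed, vector_db and ai_model below are hypothetical placeholders for the components listed above:

def answer(question, top_k=3):
    # Retrieval: semantic search for the chunks closest to the question
    question_vector = embed(question)                        # hypothetical embedding call
    chunks = vector_db.search(question_vector, limit=top_k)  # hypothetical DB search
    # Augmentation: inject the retrieved chunks into the prompt
    context = "\n---\n".join(chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # Generation: the LLM produces the answer
    return ai_model.complete(prompt)                         # hypothetical LLM call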

Slide 39

Slide 39 text

SaaS Example: Azure AI Search QAware | 39 Phase 1: Ingestion Phase 2: Retrieval https://learn.microsoft.com/de-de/azure/search/vector-search-overview

Slide 40

Slide 40 text

What to choose?

Slide 41

Slide 41 text

Full text vs Chunks QAware | 41

Which variant to choose? ⇒ It depends… It’s a trade-off between:
● Money
○ Full context probably uses more input tokens
○ Semantic search needs an embedding model and a database
● Time
○ Shorter input requests get answered faster ⇒ let’s see if context caching becomes a thing soon?!
○ Running a semantic search can also take some time
● Effort
○ Full context can be used by everybody ⇒ only needs the LLM
○ Storing and finding chunks is more complex
● Document size
○ Full context only works if the document fits into the LLM’s maximum input context
○ Documents of any size can be chunked
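
To put a rough number on the money dimension, input tokens can be counted locally before sending a request; a sketch with the tiktoken library (the encoding name is an assumption, pick the one matching your model):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by many OpenAI models

document = open("rental_agreement.txt").read()
question = "What are my obligations when I want to move out of my flat?"

full_context_tokens = len(enc.encode(document)) + len(enc.encode(question))
chunked_tokens = len(enc.encode(question)) + 3 * 500  # assume 3 retrieved chunks of ~500 tokens

print(f"Full context: ~{full_context_tokens} input tokens, chunked: ~{chunked_tokens}")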

Slide 42

Slide 42 text

Research has not been idle: RAG Survey Paper Dec 2023 QAware | 42

Slide 43

Slide 43 text

Research has not been idle: RAG Survey Paper Dec 2023 QAware | 43 What is RAG? Naive RAG

Slide 44

Slide 44 text

Data-driven development

Slide 45

Slide 45 text

Why test AI / RAG systems? QAware | 45 How do you know ● …your system prompt is good? ● …that a change to the prompt improves the answers? ● …that the correct information is retrieved? ● …that the answers are friendly and not aggressive? ● …that answer quality does not regress over time?

Slide 46

Slide 46 text

AI is not deterministic, so classic unit tests don’t work QAware | 46 ➡ LLM as a judge
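
The judge idea, reduced to its core in Python; judge_model is a hypothetical second LLM client and the 1-5 rubric is made up for illustration:

def judge(question, answer, ground_truth):
    prompt = (
        "Rate how well the ANSWER matches the GROUND TRUTH for the QUESTION, "
        "on a scale from 1 (wrong) to 5 (fully correct). Reply with the number only.\n"
        f"QUESTION: {question}\nANSWER: {answer}\nGROUND TRUTH: {ground_truth}"
    )
    return int(judge_model.complete(prompt).strip())  # hypothetical LLM client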

Slide 47

Slide 47 text

How to test AI/RAG systems? QAware | 47 ● Choose the metrics to check ● Create a set of test questions and ground truths ● Set up a(nother) LLM to rate the answers ● Enable the app to return its chunks ● Run the tests (regularly)

Slide 48

Slide 48 text

Which metrics can be used? QAware | 48 https://freeplay.ai/blog/using-llms-to-automatically-evaluate-rag-prompts-pipelines ● Faithfulness: how much hallucination? ● Context relevance: how good was the retrieval? ● Answer relevance: how good was the answer in relation to the question? ● … there are many more

Slide 49

Slide 49 text

Create a test set QAware | 49 Test set content ● Question: a question for the app to answer ● Answer: the answer the app’s LLM returned for the question ● Contexts: the chunks given to the app’s LLM ● Ground truth: the correct answer to the question

Slide 50

Slide 50 text

Example test data QAware | 50

data_samples = {
    'question': [
        'When was the first super bowl?',
        'Who won the most super bowls?'],
    'answer': [
        'The first superbowl was held on January 15, 1967',
        'The most super bowls have been won by The New England Patriots'],
    'contexts': [
        ['The Super Bowl....season since 1966,', 'replacing the NFL...in February.'],
        ['The Green Bay Packers...Green Bay, Wisconsin.', 'The Packers compete...Football Conference']],
    'ground_truth': [
        'The first superbowl was held on January 15, 1967',
        'The New England Patriots have won the Super Bowl a record six times']
}

https://docs.ragas.io/en/stable/howtos/applications/data_preparation.html
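
With data in this shape, an evaluation run with RAGAS (see slide 52) might look roughly like this; the exact imports depend on the ragas version, and the judge LLM needs an API key configured:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

dataset = Dataset.from_dict(data_samples)

# Each metric is scored by a judge LLM behind the scenes
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)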

Slide 51

Slide 51 text

Flow when testing QAware | 51

Slide 52

Slide 52 text

Frameworks QAware | 52 ● RAGAS https://docs.ragas.io Framework to build your own tests ● promptfoo https://www.promptfoo.dev UI-driven all-in-one application

Slide 53

Slide 53 text

The perfect search

Slide 54

Slide 54 text

It does exist (... or not)! QAware | 54 Well… only if there is enough time and money. So it is an academically solved issue! Based on big-context LLMs and advanced RAG technology, the perfect search exists.

Slide 55

Slide 55 text

Wrap up

Slide 56

Slide 56 text

QAware | 56 Bernd chose to use the full document approach and is happy now. It was easy to use and it worked ad-hoc.

Slide 57

Slide 57 text

Key learnings ● In the long term, RAG or similar technologies will revolutionise the way we search and research information. ● Retrieval is a difficult topic. Especially semantic search is difficult to implement in practice. ● Research is very active. With RAG, the solution space has exploded, but most of the advanced techniques won’t survive. ● Following and evaluating the innovations is a full-time job.

Slide 58

Slide 58 text

Q&A

Slide 59

Slide 59 text

qaware.de QAware GmbH Mainz Rheinstraße 4 C 55116 Mainz Tel. +49 6131 21569-0 [email protected] twitter.com/qaware linkedin.com/company/qaware-gmbh xing.com/companies/qawaregmbh slideshare.net/qaware github.com/qaware