From naive to advanced RAG: the complete guide

Slide 1

Slide 1 text

FROM NAIVE TO ADVANCED RAG, THE COMPLETE GUIDE
Google Cloud
From Naive to Advanced RAG / The Definitive Guide
DEVOXX BELGIUM 2024, DEEP DIVE
GUILLAUME LAFORGE, GOOGLE
CÉDRICK LUNVEN, DATASTAX

Slide 2

Slide 2 text

It’s easy to get started with Retrieval Augmented Generation, but you’ll quickly be disappointed with the generated answers: inaccurate or incomplete answers, missing context or outdated information, a bad text chunking strategy, suboptimal documents returned by your vector database, and the list goes on. After meeting thousands of developers across Europe, we’ve explored those pain points and will share how to overcome them. As part of the team building a vector database, we know the different flavors of search (semantic, metadata, full-text, multimodal) and the choices of embedding models. We have been implementing RAG pipelines across different projects and frameworks, and we contribute to LangChain4j. In this deep dive, we will examine various techniques using LangChain4j to bring your RAG to the next level: semantic chunking, query expansion and compression, metadata filtering, document reranking, data lifecycle processes, and how best to evaluate and present the results to your users.

Slide 3

Slide 3 text

Guillaume Laforge, Developer Advocate @ Google Cloud
Stuff I do:
❏ GCP Developer Advocate, focused on Generative AI, serverless solutions & service orchestration
❏ Apache Groovy founder
❏ Java Champion
❏ Cast Codeurs podcast
AI:
❏ LangChain4j committer
❏ Google Cloud Machine Learning APIs

Slide 4

Slide 4 text

Cédrick Lunven, Software Engineer @ DataStax
Stuff I do:
❏ Dev Ecosystem @ DataStax: tools (SDK, CLI, plugins)
❏ Developer Advocate
❏ Creator of ff4j (ff4j.org)
AI:
❏ CTO of GoodBards
❏ DataStax AI products
❏ Contributor to LangChain4j / Spring AI

Slide 5

Slide 5 text

1. Introduction
- Generative AI and LLM
- Prompt Engineering
- Limitations and Why RAG
- LangChain4j Overview
2. Naive RAG
- Ingestion Principles
- Query Principles
3. Advanced RAG: Ingestion
- Loading and Parsing
- Vectors, Embedding and Similarity
- Introducing Vector Databases
- Chunking
- Embedding
Break! 15 min.

Slide 6

Slide 6 text

4. Advanced RAG: Query
- Query Preprocessing
- Query Transformations
- Vector Searches
- Filtering and Metadata
- Projections and Sorting
- Question Post-Processing
- Reranking
- Recursive Algorithms
- Consolidation
5. Quality and Data Governance
- RAG Evaluation
- Security
- Data Lifecycle

Slide 7

Slide 7 text

All the code and the slides are available online: github.com/datastaxdevs/conference-2024-devoxx

Slide 8

Slide 8 text

1. Introduction
- Generative AI and LLM
- Prompt Engineering
- Limitations and Why RAG
- LangChain4j Overview

Slide 9

Slide 9 text

Large Language Models
https://lifearchitect.ai/models/

Slide 10

Slide 10 text

So what are Large Language Models?
• A Transformer-based neural network architecture that can recognize, predict, and generate human language
• Trained on huge corpora of text, in various languages and domains
• Example: PaLM 2 reportedly has 340 billion parameters and was trained on 3.6 trillion tokens
• Learn the statistical relationships between words and phrases, as well as the patterns of human language
• Can be fine-tuned for specific tasks or domain knowledge

Slide 11

Slide 11 text

Generative AI use cases (2024)
● Language: writing, summarization, ideation, classification, sentiment analysis, extraction, chat, search
● Code: code generation, code completion, code chat, code conversion
● Speech: speech to text, text to speech, audio transcription, live voice streaming assistant
● Vision: image generation, image editing, captioning, image Q&A, image search, video descriptions

Slide 12

Slide 12 text

Slide 13

Slide 13 text

How to build effective prompts?
SAMPLE CONTEXT
[Question] (inputs): question, task, entity, completions
[Roles, Persona, Audience]: "You are an assistant targeting Java developers"
[Objectives]: "Your mission is to provide helpful answers"
[Constraints]: format, style, must-haves, boundaries
[Techniques]: one-shot prompts, few-shot prompts, check questions

Slide 14

Slide 14 text

Configure LLM generation parameters
TEMPERATURE: tunes the degree of randomness (scale shown from 0 to 1).
- Toward 1: more creative tasks (content generation); can hallucinate more
- Toward 0: more accurate tasks (summarization, Q&A)
TOP P: keep the smallest set of most probable words whose cumulative probability is >= P.
- Example with p = .8: java .51, ia .23, langchain .11, spring .08, … → keeps java, ia, langchain (.51 + .23 + .11 = .85)
TOP K: keep the first K words, ordered by decreasing probability.
- Example with K = 2: java .51, ia .23, langchain .11, spring .08, … → keeps java, ia
TOKENS: maximum size of the generated response.
- Pros: detailed/in-depth, comprehensive completion
- Cons: lower precision, processing time, cost, repetitions
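The TOP K, TOP P and TEMPERATURE settings can be illustrated on the slide's toy distribution. The following is a minimal Java sketch, assuming only the four listed word probabilities (the "…" tail is left out); the class and method names are hypothetical, and real model servers implement sampling internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Toy versions of the TOP K, TOP P and TEMPERATURE knobs (hypothetical helper, not a library API). */
final class SamplingFilters {

    /** TOP K: keep the K most probable words. */
    static List<String> topK(Map<String, Double> probs, int k) {
        return probs.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    /** TOP P: keep the smallest set of most probable words whose cumulative probability is >= p. */
    static List<String> topP(Map<String, Double> probs, double p) {
        List<Map.Entry<String, Double>> sorted = probs.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .collect(Collectors.toList());
        List<String> kept = new ArrayList<>();
        double cumulative = 0.0;
        for (Map.Entry<String, Double> e : sorted) {
            kept.add(e.getKey());
            cumulative += e.getValue();
            if (cumulative >= p) break; // threshold reached: stop adding words
        }
        return kept;
    }

    /** TEMPERATURE: softmax(logits / T); small T sharpens the distribution, large T flattens it (T > 0). */
    static double[] softmax(double[] logits, double temperature) {
        double max = Arrays.stream(logits).max().orElse(0.0); // subtract the max for numerical stability
        double[] out = new double[logits.length];
        double sum = 0.0;
        for (int i = 0; i < logits.length; i++) {
            out[i] = Math.exp((logits[i] - max) / temperature);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) out[i] /= sum;
        return out;
    }
}
```

With the slide's values, `topK(probs, 2)` keeps java and ia, and `topP(probs, 0.8)` keeps java, ia and langchain, since .51 + .23 = .74 is still below .8.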

Slide 15

Slide 15 text

Can we do better? Limitations of prompt engineering
LLMs…
- …can be outdated (training cut-off date)
- …don’t know about your data
- …aren’t tuned for your domain (hard to steer)
- …hallucinate if not properly prompted
- …work with limited input context windows (you can’t feed them all your docs)

Slide 16

Slide 16 text

Speaking of (large) context windows… Gemini 1.5 Pro supports input context windows of up to 2 million tokens.

Slide 17

Slide 17 text


Slide 18

Slide 18 text

SAMPLE CONTEXT
[Question] (inputs): question, task, entity, completions
[Roles, Persona, Audience]: "You are an assistant targeting Java developers"
[Objectives]: "Your mission is to provide helpful answers"
[Constraints]: format, style, must-haves, boundaries
[Techniques]: one-shot prompts, few-shot prompts, check questions
[Document sources]: your relevant document extracts (Retrieval Augmented Generation)

Slide 19

Slide 19 text

LangChain4j: build GenAI applications with Java
Model abstractions: ChatLanguageModel, LanguageModel, ImageModel, ModerationModel, ScoringModel, EmbeddingModel

Slide 20

Slide 20 text

2. Naive RAG
- Ingestion Principles
- Query Principles

Slide 21

Slide 21 text

Architecture
❶ INGESTION: split the DOCS into chunks, calculate their vector embeddings (LLM), store vector + chunk in the Vector DB

Slide 22

Slide 22 text

Architecture
❶ INGESTION: split the DOCS into chunks, calculate their vector embeddings (LLM), store vector + chunk in the Vector DB
❷ QUERYING: the chatbot app calculates the question's vector embedding, finds similar vectors in the Vector DB, sends a prompt (context, question, relevant docs) to the LLM, and returns the answer
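The two phases of the diagram can be sketched end to end in plain Java. This is a toy under loud assumptions: the hypothetical `embed` method (a hashed bag of words) stands in for a real embedding model, an in-memory list stands in for the vector DB, and `buildPrompt` returns the prompt instead of calling an LLM. It is not LangChain4j code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy, in-memory version of the two naive RAG phases: (1) ingestion and (2) querying. */
final class NaiveRag {
    static final int DIM = 256;

    /** One row of the toy "vector DB": the vector and the chunk it was computed from. */
    record Entry(float[] vector, String chunk) {}

    private final List<Entry> store = new ArrayList<>();

    /** Hypothetical stand-in for an embedding model: hashed bag of words, L2-normalized. */
    static float[] embed(String text) {
        float[] v = new float[DIM];
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) v[Math.floorMod(word.hashCode(), DIM)] += 1f;
        }
        double norm = 0;
        for (float x : v) norm += x * x;
        norm = Math.sqrt(norm);
        if (norm > 0) {
            for (int i = 0; i < DIM; i++) v[i] /= (float) norm;
        }
        return v;
    }

    static double dot(float[] a, float[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    /** (1) INGESTION: split into sentence chunks, calculate one vector per chunk, store vector + chunk. */
    void ingest(String document) {
        for (String chunk : document.split("(?<=[.!?])\\s+")) {
            store.add(new Entry(embed(chunk), chunk));
        }
    }

    /** (2) QUERYING: embed the question, find the k most similar chunks, build the prompt for the LLM. */
    String buildPrompt(String question, int k) {
        float[] q = embed(question);
        List<Entry> hits = new ArrayList<>(store);
        hits.sort(Comparator.comparingDouble((Entry e) -> dot(q, e.vector())).reversed());
        StringBuilder prompt = new StringBuilder("Answer the question using only these documents:\n");
        for (Entry e : hits.subList(0, Math.min(k, hits.size()))) {
            prompt.append("- ").append(e.chunk()).append('\n');
        }
        return prompt.append("Question: ").append(question).toString();
    }
}
```

For example, after `ingest("Berlin is the capital of Germany. Cassandra 5 adds a vector type.")`, calling `buildPrompt("Which city is the capital of Germany?", 1)` puts only the Berlin sentence in the context.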

Slide 23

Slide 23 text

Architecture (same diagram as Slide 22), highlighting: docs loading & parsing

Slide 24

Slide 24 text

Architecture (same diagram as Slide 22), highlighting the chunk size: too big? too small?

Slide 25

Slide 25 text

Architecture (same diagram as Slide 22), highlighting the retrieved context: is the context relevant?

Slide 26

Slide 26 text

Slide 27

Slide 27 text

Slide 28

Slide 28 text

Slide 29

Slide 29 text

DEMO _20_naive_rag_astra

Slide 30

Slide 30 text

(Naive) Retrieval-Augmented Generation: Ingestion (Index Process)
DOCUMENTS (multiple sources, multiple formats) → CHUNKING (chunks, sentences) → EMBEDDING (LLM, vector embeddings) → PERSIST (vector database, vector space, metadata)
Pitfalls along the way: complex PDFs, structured data, QA, multimodal, dense information, formats, encodings, missing context, max tokens, memory, accuracy, duplicates?, compute delta?, scalability?, data security?

Slide 31

Slide 31 text

The 10 pitfalls of RAG: we should normalize the steps

Slide 32

Slide 32 text

3. Advanced RAG: Ingestion
- Loading and Parsing
- Vectors, Embedding and Similarity
- Introducing Vector Databases
- Chunking
- Embedding

Slide 33

Slide 33 text

Ingestion Process
Document Loader → Document Parser → Document → Document Transformer

Slide 34

Slide 34 text

Document
- text: String
- metadata: Map

Slide 35

Slide 35 text

Ingestion Process
Document Loader → Document Parser → Document → Document Transformer → Document Splitter → Segments → Embedding Model → Embeddings → EmbeddingStore

Slide 36

Slide 36 text

Document Ingestion (part 1)
Document LOADERS (DocumentSource): FileSystem, URL, Amazon S3, Google Cloud Storage, Azure Blob Storage, GitHub, Tencent, Selenium
Document PARSERS (decoding): Apache POI, Apache Tika, PDFBox, core JVM
Document TRANSFORMERS: JSoup
Result: Document (text, metadata)

Slide 37

Slide 37 text

DEMO _30_loader_and_parsers

Slide 38

Slide 38 text

3. Advanced RAG: Ingestion
- Loading and Parsing
- Vectors, Spaces, Similarity & Search
- Chunking
- Embedding

Slide 39

Slide 39 text

Vector Overview
● Represents something with a direction and a length
● Formulated as a list of numbers (components)
● The list length is the dimensionality (d)
● All vectors with the same d form a vector space
● Direction = "where the arrow points"
● The "length" (or norm), regardless of direction
● Some meaningful notion of "rotation"

Slide 40

Slide 40 text

FROM NAIVE TO ADVANCED RAG, THE COMPLETE GUIDE Google Cloud embeddings vector dimensionality Vector Vectorization LLM Question Sentence Transformer Embeddings API

Slide 41

Slide 41 text

Vector Space and Dimensions
v = [v_1, v_2, v_3, …, v_(d-1), v_d]
"Actual" physical spaces:
● d = 1: (no need to involve "vectors", no?)
● d = 2: a flat plane
● d = 3: the space around you
Higher dimensions, d > 3: not easily visualized. So what? Such spaces exist… and are pretty useful across disciplines. Vector spaces are "flat".
Figure 1: a 768-dimensional sphere.

Slide 42

Slide 42 text

Question: in a 3-dimensional space, |v| = 1 denotes a unit vector. What do you call the set of all unit vectors in a 3-dimensional space?
Answer: the UNIT SPHERE (|v| = 1).

Slide 43

Slide 43 text

Vector Similarities
A numeric way to quantify how close two vectors v and u are, computed with some formula S(v, u).
● Angular distance, cosine similarity
● Euclidean distance (L2)
* Distances decrease when similarity increases.

Slide 44

Slide 44 text

Question: what is closest to A (among points B and C)?
Answer: according to Euclidean distance, it is C. According to cosine similarity, it is B.
DISTANCE ≠ SIMILARITY
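The A/B/C question can be reproduced with made-up coordinates, since the slide's figure doesn't give any. The sketch below assumes A = (1, 0), B = (10, 1) and C = (0.5, 0.5); the class name is hypothetical.

```java
/** Euclidean distance (lower = closer) vs cosine similarity (higher = closer). */
final class Similarity {
    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    /** Cosine similarity: depends only on the angle between the vectors, not their length. */
    static double cosine(double[] a, double[] b) {
        return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }

    /** Euclidean (L2) distance: depends on both direction and length. */
    static double euclidean(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[] a = {1, 0}, b = {10, 1}, c = {0.5, 0.5}; // hypothetical coordinates for A, B, C
        System.out.printf("Euclidean: A-B = %.3f, A-C = %.3f%n", euclidean(a, b), euclidean(a, c));
        System.out.printf("Cosine:    A-B = %.3f, A-C = %.3f%n", cosine(a, b), cosine(a, c));
    }
}
```

With these coordinates, C is closer by Euclidean distance (about 0.71 vs 9.06), while B points in almost the same direction as A (cosine about 0.995 vs 0.707).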

Slide 45

Slide 45 text

Vector Similarities

Slide 46

Slide 46 text

Vector Similarities: which measure to use?
On the unit sphere (all vectors unit-norm):
● Euclidean: switch to dot product (same sorting, faster)
● Cosine: switch to dot product (identical, faster)
● Dot product: OK
On the whole space (arbitrary norms):
● Euclidean: use if the norm itself carries information
● Cosine: normalize on save, then switch to dot product on the sphere
● Dot product: are you sure?
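The "switch to Dot" advice follows from two identities for unit vectors u and v: cos(u, v) = u·v / (|u| |v|) = u·v, and |u − v|² = |u|² + |v|² − 2 u·v = 2 − 2 u·v. So once vectors are normalized on save, sorting by dot product (descending) gives the same order as cosine similarity and as Euclidean distance (ascending). A minimal Java check, with hypothetical names:

```java
/** Normalize-on-save: on the unit sphere, dot product, cosine and Euclidean distance rank alike. */
final class UnitSphere {
    /** Scales a non-zero vector to unit length. */
    static double[] normalize(double[] v) {
        double n = Math.sqrt(dot(v, v));
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) out[i] = v[i] / n;
        return out;
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    static double squaredEuclidean(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }
}
```

For example, normalizing (3, 4) and (1, 0) gives (0.6, 0.8) and (1, 0): their dot product 0.6 equals the cosine of the original vectors, and their squared distance is 2 − 2 × 0.6 = 0.8.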

Slide 47

Slide 47 text

Vector Similarity != Relevance
Vectors can be similar: a query vector is similar to a passage containing the query’s answer… but similar vectors may be irrelevant (i.e., they don’t contain the answer)!
⇒ Importance of scoring, with (re)ranking APIs

Slide 48

Slide 48 text

Vector Search
The query vector and its 1st, 2nd, 3rd, 4th and 5th hits.
* You'll soon see there's more to say…

Slide 49

Slide 49 text

Vector Search
Question → Sentence Transformer / Embeddings API → question embeddings → Vector DATABASE similarity search (plus other sources) → results: chunks + similarities → LLM

Slide 50

Slide 50 text

DEMO _31_vectors, _32_vectors_similarity

Slide 51

Slide 51 text

3. Advanced RAG: Ingestion
- Loading and Parsing
- Vectors, Spaces, Similarity & Search
- Chunking
- Embedding
- Vector Database

Slide 52

Slide 52 text

Ingestion Process
Document Loader → Document Parser → Document → Document Transformer → Document Splitter → Text Segments → Embedding Model → one Embedding per segment

Slide 53

Slide 53 text

Document Ingestion (part 2)
Document (text, metadata) → DocumentSplitter (chunking: hierarchical, sentence-based, character-based, recursive) → TextSegment (text: String, metadata: Map) → EmbeddingModel (embedding) → Embedding (floats)
The embedding call returns a Response with metadata (Map), tokenUsage (TokenUsage) and finishReason (FinishReason).

Slide 54

Slide 54 text

Chunking techniques
● Hierarchical chunking
● Context expansion: parent/child splitting, sliding window
● Hypothetical questions: generate relevant questions
● Contextual retrieval: recent article from Anthropic
● Semantic chunking

Slide 55

Slide 55 text

Illustration with a Wikipedia article about Berlin

Slide 56

Slide 56 text

Raw text
Berlin is the capital and largest city of Germany, both by area and by population. Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits. The city is also one of the states of Germany, and is the third smallest state in the country in terms of area. Berlin is surrounded by the state of Brandenburg, and Brandenburg's capital Potsdam is nearby. The urban area of Berlin has a population of over 4.5 million and is therefore the most populous urban area in Germany. The Berlin-Brandenburg capital region has around 6.2 million inhabitants and is Germany's second-largest metropolitan region after the Rhine-Ruhr region, and the sixth-biggest metropolitan region by GDP in the European Union.

Slide 57

Slide 57 text

Slide 58

Slide 58 text

Slide 59

Slide 59 text

Chunking by sentence
[Berlin paragraph from Slide 56, one chunk per sentence]
Stored: VECTOR = embed(sentence), TEXT = sentence
Demo: _34_chunking_defaults
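Sentence chunking can be sketched with the JDK's `java.text.BreakIterator` sentence instance. This is an assumption for illustration: the demo may rely on LangChain4j's own document splitters instead, and BreakIterator's rules are simpler than a trained sentence model. The class name is hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Sentence-based chunking sketch: one chunk per sentence. */
final class SentenceChunker {
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> chunks = new ArrayList<>();
        int start = it.first();
        // Walk consecutive sentence boundaries; each [start, end) span is one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim(); // drop trailing whitespace
            if (!sentence.isEmpty()) chunks.add(sentence);
        }
        return chunks;
    }
}
```

Each returned sentence would then be embedded and stored as its own segment, as in the VECTOR/TEXT columns above. Decimal numbers such as "3.85" are not treated as sentence ends.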

Slide 60

Slide 60 text

Parent (context) / child (embedding) chunking
[Berlin paragraph from Slide 56]
Embed the sentence, but return its context: VECTOR = embed(sentence), TEXT = sentence, META = paragraph

Slide 61

Slide 61 text

Sentence sliding window chunking
[Berlin paragraph from Slide 56]
Embed the sentence, but return its context: VECTOR = embed(sentence), TEXT = sentence, META = window (the surrounding sentences)
Demo: _36_chunks_with_wider_context
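The sliding-window variant keeps two texts per chunk: the single sentence that gets embedded and the wider window returned to the LLM. A minimal sketch, assuming the sentences are already split, with a hypothetical `radius` parameter for the number of neighbouring sentences kept on each side:

```java
import java.util.ArrayList;
import java.util.List;

/** Sliding-window chunking sketch: embed one sentence, return it with its neighbours. */
final class SlidingWindowChunker {
    /** embeddedSentence goes to the embedding model; window is what the LLM receives. */
    record Chunk(String embeddedSentence, String window) {}

    static List<Chunk> chunk(List<String> sentences, int radius) {
        List<Chunk> chunks = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i++) {
            int from = Math.max(0, i - radius);                   // clamp at the start of the text
            int to = Math.min(sentences.size(), i + radius + 1);  // clamp at the end (exclusive)
            chunks.add(new Chunk(sentences.get(i), String.join(" ", sentences.subList(from, to))));
        }
        return chunks;
    }
}
```

The parent/child variant on the previous slide is the same idea with the whole paragraph as the window.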

Slide 62

Slide 62 text

Hypothetical Questions
[Berlin paragraph from Slide 56]
Embedding questions:
● What is the capital and largest city of Germany?
● What is the population of Berlin?
● Which state is Berlin located in?
● What is the name of the state surrounding Berlin?
● What is the name of the capital of the state surrounding Berlin?
Stored: VECTOR = embed(question), TEXT = question, META = paragraph
Demo: _37_hypothetical_questions_embedding
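The hypothetical-questions index stores one entry per generated question, each pointing back to its paragraph: retrieval matches the user's question against the stored questions, then returns the paragraph. In this sketch, the LLM question generator and the similarity function are passed in as plain `java.util.function` interfaces (assumed stand-ins for a chat model and for embedding similarity); the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical-questions index sketch: question -> paragraph entries. */
final class HypotheticalQuestionIndex {
    record Entry(String question, String paragraph) {}

    /** Ingestion: ask the (hypothetical) generator for questions, index one entry per question. */
    static List<Entry> build(List<String> paragraphs, Function<String, List<String>> questionGenerator) {
        List<Entry> entries = new ArrayList<>();
        for (String paragraph : paragraphs) {
            for (String q : questionGenerator.apply(paragraph)) entries.add(new Entry(q, paragraph));
        }
        return entries;
    }

    /** Query: compare the user question with stored questions, but return the paragraph. */
    static String retrieve(List<Entry> entries, String userQuestion,
                           ToDoubleBiFunction<String, String> similarity) {
        return entries.stream()
                .max(Comparator.comparingDouble((Entry e) -> similarity.applyAsDouble(userQuestion, e.question())))
                .map(Entry::paragraph)
                .orElseThrow();
    }
}
```

In a real pipeline, `similarity` would be the cosine similarity of the two questions' embeddings, computed by the vector store.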

Slide 63

Slide 63 text

Contextual Retrieval
[Berlin paragraph from Slide 56]
Embed the chunk "in context", e.g. "Berlin's population within its city limits is the largest in the European Union."
Stored: VECTOR = embed(context), TEXT = context, META = paragraph
Demo: _38_contextual_retrieval
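Contextual retrieval, as described in Anthropic's article, asks an LLM for a short context that situates each chunk within its whole document, and prepends that context to the chunk before embedding. The slide shows the resulting contextualized text; this minimal sketch follows the prepend variant. The prompt wording is illustrative rather than copied from the article, and `llm` is a hypothetical stand-in for a chat model.

```java
import java.util.function.Function;

/** Contextual retrieval sketch: embed "generated context + chunk" instead of the bare chunk. */
final class ContextualRetrieval {
    /** Prompt asking an LLM to situate one chunk within the whole document (illustrative wording). */
    static String contextPrompt(String wholeDocument, String chunk) {
        return "<document>\n" + wholeDocument + "\n</document>\n"
                + "Here is the chunk we want to situate within the whole document:\n"
                + "<chunk>\n" + chunk + "\n</chunk>\n"
                + "Give a short, succinct context to situate this chunk within the overall document, "
                + "to improve search retrieval of the chunk. Answer only with the context.";
    }

    /** Text that gets embedded: the generated context prepended to the original chunk. */
    static String textToEmbed(String wholeDocument, String chunk, Function<String, String> llm) {
        String context = llm.apply(contextPrompt(wholeDocument, chunk)).trim();
        return context + "\n" + chunk;
    }
}
```

Since the LLM sees the whole document once per chunk, this step costs one LLM call per chunk at ingestion time.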

Slide 64

Slide 64 text

Slide 65

Slide 65 text

3. Advanced RAG: Ingestion
- Loading and Parsing
- Vectors, Spaces, Similarity & Search
- Chunking
- Embedding
- Vector Database

Slide 66

Slide 66 text

Embedding: from a CHUNK to a vector (LLM embedding model)
1. Tokenization (tokenizer): ["devoxx", "conference", "developers"]
2. Index mapping (indices in the vocabulary): [12, 1235, 246]
3. Model inference (embedding matrix: vocabulary × embedding dimension): one vector per token, e.g. [0.084, 0.04, …]
4. Pooling: one vector for the chunk, e.g. [0.084, 0.04, …]
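The pipeline can be sketched in a few lines of Java. Assumptions: a tiny made-up vocabulary and embedding matrix (so token IDs differ from the slide's [12, 1235, 246]), a toy tokenizer, and, most importantly, no transformer layers: a real model contextualizes each token vector before pooling, while this sketch only looks rows up in the embedding matrix. Mean pooling is one common pooling strategy; others exist (e.g. using the vector of a special [CLS] token). The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy embedding pipeline: tokenize, map to ids, look up token vectors, mean-pool. */
final class ToyEmbeddingPipeline {
    private final Map<String, Integer> vocabulary; // token -> index in the vocabulary
    private final float[][] embeddingMatrix;       // [vocabulary size][embedding dimension]

    ToyEmbeddingPipeline(Map<String, Integer> vocabulary, float[][] embeddingMatrix) {
        this.vocabulary = vocabulary;
        this.embeddingMatrix = embeddingMatrix;
    }

    /** 1. Tokenization (toy tokenizer: lowercase, split on non-word characters). */
    static List<String> tokenize(String chunk) {
        List<String> tokens = new ArrayList<>();
        for (String t : chunk.toLowerCase().split("\\W+")) if (!t.isEmpty()) tokens.add(t);
        return tokens;
    }

    /** 2. Index mapping: tokens -> vocabulary indices (unknown tokens are skipped in this sketch). */
    List<Integer> toIds(List<String> tokens) {
        List<Integer> ids = new ArrayList<>();
        for (String t : tokens) {
            Integer id = vocabulary.get(t);
            if (id != null) ids.add(id);
        }
        return ids;
    }

    /** 3. One vector per token (plain row lookup, no transformer), then 4. mean pooling. */
    float[] embed(String chunk) {
        List<Integer> ids = toIds(tokenize(chunk));
        float[] pooled = new float[embeddingMatrix[0].length];
        if (ids.isEmpty()) return pooled;
        for (int id : ids) {
            for (int d = 0; d < pooled.length; d++) pooled[d] += embeddingMatrix[id][d];
        }
        for (int d = 0; d < pooled.length; d++) pooled[d] /= ids.size();
        return pooled;
    }
}
```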

Slide 67

Slide 67 text

How to choose your embedding models
● Task requirements (use case; retrieval works great)
● Language (second tab)
● Domain specificity (~fine-tuning)
● Model size (resources, latency)
● Cost (self-hosted vs API)
● Dimensionality
● Multimodal requirements
● Your existing stack/provider
https://huggingface.co/spaces/mteb/leaderboard
https://huggingface.co/spaces/mteb/arena

Slide 68

Slide 68 text

Matryoshka Embeddings
Matryoshka Representation Learning (MRL): TRUNCATE the embedding, e.g. dim = 1024 → dim = 512 → dim = 256
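With an MRL-trained model, shrinking an embedding means keeping its first dimensions and re-normalizing. A minimal sketch with a hypothetical class name; it only makes sense for models trained with Matryoshka Representation Learning, since truncating other embeddings discards information unpredictably.

```java
import java.util.Arrays;

/** Matryoshka truncation sketch: keep the leading dimensions, then re-normalize. */
final class Matryoshka {
    /** Keeps the first `dims` components and rescales them to unit length for cosine/dot search. */
    static float[] truncate(float[] embedding, int dims) {
        float[] out = Arrays.copyOf(embedding, dims);
        double norm = 0;
        for (float x : out) norm += x * x;
        norm = Math.sqrt(norm);
        if (norm > 0) {
            for (int i = 0; i < out.length; i++) out[i] /= (float) norm;
        }
        return out;
    }
}
```

For instance, truncating a 1024-dimensional vector to 256 dimensions divides storage and similarity computation cost by four.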

Slide 69

Slide 69 text

ColBERT Embeddings: high-density datasets
Pre-processing:
● (Optional) Fine-tune BERT on your document corpus
● Or use the generic checkpoint provided by the ColBERT project
● Generate contextualized embedding vectors E[d] for each of your documents (one per token)
Query time:
● Generate embedding vectors for your query (one per token)
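At query time, ColBERT compares the query and document token vectors with a late-interaction "MaxSim" score: for each query token, take its highest similarity with any document token, then sum these maxima. A minimal sketch, assuming L2-normalized token vectors so that the dot product equals the cosine similarity; the class name is hypothetical.

```java
/** ColBERT-style late interaction (MaxSim) over per-token embeddings. */
final class ColbertMaxSim {
    /** Sum over query tokens of the best dot product with any document token. */
    static double score(float[][] queryTokens, float[][] docTokens) {
        double total = 0;
        for (float[] q : queryTokens) {
            double best = Double.NEGATIVE_INFINITY;
            for (float[] d : docTokens) {
                double s = 0;
                for (int i = 0; i < q.length; i++) s += q[i] * d[i];
                best = Math.max(best, s); // keep the best-matching document token
            }
            total += best;
        }
        return total;
    }
}
```

Documents are then ranked by this score; storing one vector per token is what makes the dataset "high density".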

Slide 70

Slide 70 text

Embedding (scalar) quantization
● Instead of using float32, use smaller types (float16/bfloat16, int8, binary…)
● The value distribution is ignored (product quantization, covered later, takes it into account)
float32 → int8 → binary / ternary
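A minimal int8 scalar quantizer maps a fixed [min, max] range onto 256 levels, whatever the actual distribution of the values. The range and the class name are assumptions for illustration; real quantizers typically derive the range from the data.

```java
/** int8 scalar quantization over a fixed [min, max] range (the value distribution is ignored). */
final class ScalarQuantizer {
    private final float min;
    private final float max;

    ScalarQuantizer(float min, float max) {
        this.min = min;
        this.max = max;
    }

    private float step() {
        return (max - min) / 255f; // 256 evenly spaced levels: 0..255
    }

    /** float32 -> int8: clamp to the range, map to a level 0..255, shift to -128..127. */
    byte[] quantize(float[] v) {
        byte[] out = new byte[v.length];
        for (int i = 0; i < v.length; i++) {
            float clamped = Math.min(max, Math.max(min, v[i]));
            int level = Math.round((clamped - min) / step());
            out[i] = (byte) (level - 128);
        }
        return out;
    }

    /** int8 -> float32 approximation; the error is at most about half a step per component. */
    float[] dequantize(byte[] q) {
        float[] out = new float[q.length];
        for (int i = 0; i < q.length; i++) out[i] = min + (q[i] + 128) * step();
        return out;
    }
}
```

Each component drops from 4 bytes to 1, a 4× saving, at the cost of a bounded rounding error.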

Slide 71

Slide 71 text

Slide 72

Slide 72 text

DEMO _33_1_embeddings_tokenizers, _33_2_embedding_multilingual, _33_3_embedding_retrieval_task, _33_4_embedding_matryoshka

Slide 73

Slide 73 text

Document Ingestion (part 3)
TextSegment (text: String, metadata: Map) + Embedding (floats) → EmbeddingStore (PERSIST)

Slide 74

Slide 74 text

Vector Databases Overview
Question → Sentence Transformer / Embeddings API → question embeddings → Vector DATABASE (vector space) similarity search (plus other sources) → LLM

Slide 75

Slide 75 text

Apache Cassandra™: Undisputed Leader for Scale and Reliability

Slide 76

Slide 76 text

Apache Cassandra™: NoSQL distributed database
1 installation = 1 NODE
✔ Capacity: ~2-4 TB
✔ Throughput: LOTS of Tx/sec/core
Communication:
✔ Gossiping
✔ No master (peer-to-peer)
DataCenter (DC) | Ring

Slide 77

Slide 77 text

Apache Cassandra™: NoSQL distributed database
● High availability, always on: every second of downtime translates into lost revenue
● Linear scalability, hyper scalability: millions of operations per day, hour, or second
● Global distribution, data everywhere: on-premises, hybrid, multi-cloud, centralized, or edge
● Low latency, faster pace: every millisecond of latency has consequences

Slide 78

Slide 78 text

Cassandra 5 as a Vector database
New type: a VECTOR type was introduced in Cassandra 5
CREATE TABLE IF NOT EXISTS vsearch.products (
  id int PRIMARY KEY,
  name TEXT,
  description TEXT,
  item_vector VECTOR<FLOAT, 5> // 5-dimensional embedding
);

Slide 79

Slide 79 text

Cassandra 5 as a Vector database
SAI (Storage-Attached Index) secondary indexes:
CREATE CUSTOM INDEX IF NOT EXISTS ann_index
  ON vsearch.products(item_vector) USING 'StorageAttachedIndex';

Slide 80

Slide 80 text

Cassandra 5 as a Vector database
Sample neighbour search:
SELECT * FROM vsearch.products
  ORDER BY item_vector ANN OF [0.15, 0.1, 0.1, 0.35, 0.55]
  LIMIT 1;

Slide 81

Slide 81 text

4. Advanced RAG: Query
- Query Preprocessing
- Query Transformations
- Vector Searches
- Filtering and Metadata
- Projections and Sorting
- Question Post-Processing
- Reranking
- Recursive Algorithms
- Consolidation

Slide 82

Slide 82 text

Query Preprocessing Overview
Question → Sentence Transformer / Embeddings API → question embeddings → Vector DATABASE similarity search (plus other sources) → results: chunks + similarities → Prompt (Question + RagContext) → LLM chat completion API → answer (REST API)

Slide 83

Slide 83 text

Query Preprocessing: Extending the Content Retriever
Question → AI Service → Content Retriever (Embedding Model + DB) → ChatLanguageModel → Reply

Slide 84

Slide 84 text

Query Preprocessing: Retrieval Augmentor
Question → AI Service → Retrieval Augmentor (several Content Retrievers, each with its own Embedding Model and DB) → STUFF → ChatLanguageModel → Reply

Slide 85

Slide 85 text

Slide 86

Slide 86 text

Slide 87

Slide 87 text

HyDE: Hypothetical Document Embedding
[Berlin paragraph from Slide 56]
User query: What is the population of Berlin?
Hypothetical answer (provided by the LLM): There are 3 million inhabitants in Berlin
Stored: VECTOR = embed(paragraph), TEXT = paragraph, META = paragraph
Demo: _47_hypothetical_document_embedding
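HyDE can be sketched with the model calls abstracted away: the hypothetical `llm` and `embedder` functions stand in for a chat model and an embedding model, and an in-memory map stands in for the vector store. Note that the hypothetical answer doesn't need to be correct (the slide's "3 million" is not); it only needs to look like the paragraphs being searched. The class name is hypothetical.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

/** HyDE sketch: search with the embedding of an LLM-written answer, not of the question. */
final class Hyde {
    static List<String> retrieve(String userQuery,
                                 Function<String, String> llm,       // hypothetical chat-model call
                                 Function<String, float[]> embedder, // hypothetical embedding-model call
                                 Map<String, float[]> index,         // paragraph -> stored embedding
                                 int k) {
        // 1. Ask the LLM for a plausible answer passage (it may be factually wrong).
        String hypotheticalAnswer = llm.apply("Write a short passage answering: " + userQuery);
        // 2. Embed the answer: it resembles the stored paragraphs more than the question does.
        float[] q = embedder.apply(hypotheticalAnswer);
        // 3. Regular similarity search with that vector (highest dot product first).
        return index.entrySet().stream()
                .sorted(Comparator.comparingDouble((Map.Entry<String, float[]> e) -> -dot(q, e.getValue())))
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    static double dot(float[] a, float[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }
}
```

With a toy embedder that maps any text mentioning "inhabitants" to (1, 0) and everything else to (0, 1), the raw question "What is the population of Berlin?" would land next to a "Potsdam is nearby." entry, whereas the hypothetical answer retrieves the Berlin population paragraph.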

Slide 88

Slide 88 text

Vector Search Overview (same pipeline diagram as Slide 82)

Slide 89

Slide 89 text

Vector Databases: what you think…
Question → Sentence Transformer / Embeddings API → question embeddings → Vector DATABASE similarity search (plus other sources) → results → LLM

Slide 90

Slide 90 text

Vector Databases: but… in reality
Besides similarity search: vectorization, metadata filtering, projections, sorts, full-text and facet search, hybrid search

Slide 91

Slide 91 text

Vector Databases: Vector Database Features
https://superlinked.com/vector-db-comparison

Slide 92

Slide 92 text

Vector Search: KNN
● Exhaustive
● Linear complexity
● 1000+ dimensions
● Millions of vectors
● Good luck!
Inverted File Index (IVF): Voronoi diagram (FAISS), one cell per centroid
https://machinelearningknowledge.ai/k-nearest-neighbor-classification-simple-explanation-beginners/
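An IVF index avoids the exhaustive scan by bucketing vectors into the Voronoi cell of their nearest centroid, then searching only a few cells. A minimal sketch, assuming the centroids are already known (they would usually come from k-means); names are hypothetical. With `nProbe` equal to the number of cells, the search becomes the exhaustive KNN again.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Minimal Inverted File Index (IVF): vectors are bucketed by their nearest centroid. */
final class IvfIndex {
    private final float[][] centroids;                          // e.g. produced by k-means beforehand
    private final List<List<float[]>> cells = new ArrayList<>(); // one inverted list per centroid

    IvfIndex(float[][] centroids) {
        this.centroids = centroids;
        for (int i = 0; i < centroids.length; i++) cells.add(new ArrayList<>());
    }

    static double squaredDistance(float[] a, float[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    /** Indices of the n centroids closest to v, closest first. */
    int[] closestCells(float[] v, int n) {
        Integer[] order = new Integer[centroids.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble((Integer idx) -> squaredDistance(v, centroids[idx])));
        int[] out = new int[Math.min(n, order.length)];
        for (int i = 0; i < out.length; i++) out[i] = order[i];
        return out;
    }

    /** Indexing: each vector is appended to the list of its nearest centroid (its Voronoi cell). */
    void add(float[] v) {
        cells.get(closestCells(v, 1)[0]).add(v);
    }

    /** Search: scan only the nProbe closest cells (approximate); returns null if they are empty. */
    float[] nearest(float[] query, int nProbe) {
        float[] best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int cell : closestCells(query, nProbe)) {
            for (float[] v : cells.get(cell)) {
                double d = squaredDistance(query, v);
                if (d < bestDistance) {
                    bestDistance = d;
                    best = v;
                }
            }
        }
        return best;
    }
}
```

The trade-off: a true nearest neighbour sitting just across a cell boundary is missed when its cell isn't probed, which is why `nProbe` is usually larger than 1.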

Slide 93

Slide 93 text

Vector Search: ANN (Approximate Nearest Neighbours)
Benchmarks

Slide 94

Slide 94 text

Vector Search: ANN Vector Index Families
● Hash-based indexing
○ Locality-sensitive hashing
● Tree-based indexing
○ ANNOY
● Cluster-based indexing
○ Product quantization
● Graph-based indexing
○ Hierarchical navigable small world

Slide 95

Slide 95 text

Vector Search ANN: Hierarchical Navigable Small World (HNSW)
A "vector index" made of layers with decreasing characteristic radius:
● Layer 2: .99 to .999, low degree
● Layer 1: .90 to .99
● Layer 0: .0 to .90, high degree
As seen in:
• Lucene (Elastic, Solr, OpenSearch, MongoDB)
• Weaviate
• Qdrant
• PGVector (August 2023)

Slide 96

Slide 96 text

HNSW on Cassandra (Wikipedia dataset)

Slide 97

Slide 97 text

JVector under the hood
DiskANN = Vamana + Product Quantization + Oversampling

Slide 98

Slide 98 text

JVector: Building the Index with Vamana
Vamana: SINGLE LAYER, DENSER CONNECTIONS
● Data compressed in memory
● Full vectors on disk, fewer lookups
● Much better at high recall
recall = (number of matching docs retrieved) / (number of matching docs available)
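Recall, as defined on the slide, can be computed directly; in ANN benchmarks, the "available" set is typically the true top K returned by an exact search. A minimal sketch with a hypothetical helper:

```java
import java.util.Collection;
import java.util.Set;

/** recall = matching docs retrieved / matching docs available. */
final class Recall {
    static double recall(Collection<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) return 1.0; // convention in this sketch: nothing to find
        long hits = retrieved.stream().distinct().filter(relevant::contains).count();
        return (double) hits / relevant.size();
    }
}
```

For example, retrieving [a, b, x] when the true neighbours are {a, b, c, d} gives a recall of 2 / 4 = 0.5.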

Slide 99

Slide 99 text

JVector: Product Quantization (PQ), lossy compression for vectors
1. Split the original vector (dim = 1024) into m subvectors, or "subspaces" (here m = 4)
2. Run k-means in each subspace
3. Replace each subvector with the ID of its closest centroid
4. Output the compressed vector
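The PQ steps can be sketched in plain Java. This is a simplified illustration, not JVector's implementation: k-means uses a naive initialization (the first k training vectors) and a fixed number of iterations, the dimension is assumed divisible by m, and codes are stored as bytes, so k must be at most 256. The class name is hypothetical.

```java
import java.util.Arrays;

/** Product Quantization sketch: m subspaces, one k-means codebook each, one byte per subvector. */
final class ProductQuantizer {
    final int m;                 // number of subspaces
    final int subDim;            // dimensions per subspace (dimension must be divisible by m)
    final float[][][] codebooks; // [m][k][subDim]: k centroids per subspace

    /** Trains one k-means codebook per subspace; naive init uses the first k training vectors. */
    ProductQuantizer(float[][] training, int m, int k, int iterations) {
        this.m = m;
        this.subDim = training[0].length / m;
        this.codebooks = new float[m][k][];
        for (int s = 0; s < m; s++) {
            float[][] codebook = codebooks[s];
            for (int c = 0; c < k; c++) codebook[c] = sub(training[c % training.length], s);
            for (int it = 0; it < iterations; it++) {
                float[][] sums = new float[k][subDim];
                int[] counts = new int[k];
                for (float[] v : training) {                  // assignment step
                    float[] x = sub(v, s);
                    int c = nearest(codebook, x);
                    counts[c]++;
                    for (int d = 0; d < subDim; d++) sums[c][d] += x[d];
                }
                for (int c = 0; c < k; c++) {                 // update step
                    if (counts[c] == 0) continue;             // keep empty centroids in place
                    for (int d = 0; d < subDim; d++) codebook[c][d] = sums[c][d] / counts[c];
                }
            }
        }
    }

    /** Subvector s of v (a copy). */
    float[] sub(float[] v, int s) {
        return Arrays.copyOfRange(v, s * subDim, (s + 1) * subDim);
    }

    static int nearest(float[][] centroids, float[] x) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0;
            for (int i = 0; i < x.length; i++) d += (x[i] - centroids[c][i]) * (x[i] - centroids[c][i]);
            if (d < bestDistance) {
                bestDistance = d;
                best = c;
            }
        }
        return best;
    }

    /** Encoding: m bytes (centroid IDs) instead of m * subDim floats. */
    byte[] encode(float[] v) {
        byte[] code = new byte[m];
        for (int s = 0; s < m; s++) code[s] = (byte) nearest(codebooks[s], sub(v, s));
        return code;
    }

    /** Decoding: concatenate the chosen centroids, an approximation of the original (lossy). */
    float[] decode(byte[] code) {
        float[] out = new float[m * subDim];
        for (int s = 0; s < m; s++) {
            System.arraycopy(codebooks[s][Byte.toUnsignedInt(code[s])], 0, out, s * subDim, subDim);
        }
        return out;
    }
}
```

With the slide's numbers (dim = 1024, m = 4), each vector would shrink from 4096 bytes of float32 to 4 bytes of codes, plus the shared codebooks.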

Slide 100

Slide 100 text

JVector: Product Quantization (PQ), lossy compression for vectors
● Efficiency: significantly reduces storage requirements compared to storing full vectors
● Speed: enables fast approximate nearest neighbor search
● Scalability: facilitates handling very large datasets by reducing the dimensionality and data redundancy

Slide 101

Slide 101 text

JVector: Oversampling
● Instead of searching for the closest K, search for the closest 2K (using compressed comparisons)
● Read uncompressed vectors from disk during search whenever a candidate is added to the result set
● Reorder the result set (of 2K) using the uncompressed vectors, and return the top K
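The oversampling idea is a two-pass search: rank candidates with a cheap approximate score, keep a shortlist of overquery × K, then rescore only that shortlist exactly. A minimal sketch in which both scores are supplied as functions (assumed stand-ins for PQ-compressed comparisons and for full-vector comparisons read from disk); names are hypothetical, and unlike JVector it scans every candidate instead of walking a graph.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Two-pass search: approximate ranking, then exact reranking of an oversampled shortlist. */
final class OversampledSearch {
    static List<Integer> search(int candidateCount, int k, int overquery,
                                ToDoubleFunction<Integer> approxScore,
                                ToDoubleFunction<Integer> exactScore) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < candidateCount; i++) ids.add(i);
        // Pass 1: cheap approximate ranking, keep overquery * k candidates (2K on the slide).
        ids.sort(Comparator.comparingDouble(approxScore).reversed());
        List<Integer> shortlist = new ArrayList<>(ids.subList(0, Math.min(overquery * k, ids.size())));
        // Pass 2: exact rescoring of the shortlist only, return the top k.
        shortlist.sort(Comparator.comparingDouble(exactScore).reversed());
        return new ArrayList<>(shortlist.subList(0, Math.min(k, shortlist.size())));
    }
}
```

With approximate scores (0.9, 0.8, 0.7, 0.1) and exact scores (0.5, 0.95, 0.6, 0.99) for ids 0 to 3, K = 1 and overquery = 2 return id 1. Id 3 has the best exact score but never reaches the shortlist, which is why a larger oversampling factor trades speed for recall.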

Slide 102

Slide 102 text

JVector: some results

Slide 103

Slide 103 text

JVector: some results

Slide 104

Slide 104 text

JVector: 5 hard problems addressed
● Scale-out capabilities: no upper limits
● Garbage collection: pruning obsolete index information
● Effective use of disk: enabling high throughput
● Composability: predicates and term-based searches, a.k.a. hybrid search
● Concurrency: non-blocking, multi-threaded index construction
https://thenewstack.io/5-hard-problems-in-vector-search-and-how-cassandra-solves-them/
https://github.com/jbellis/jvector

Slide 105

Slide 105 text

Vector Databases Forrester Wave

Slide 106

Slide 106 text

Query Post-Processing Overview
[Diagram: question → question embeddings (Sentence Transformer / Embeddings API) → similarity search in the vector database → results (chunks, similarities) plus other sources → RagContext + question → prompt → LLM (chat completion API) → answer via REST API]

Slide 107

Slide 107 text

ReRanking: General Principle
[Diagram: question embeddings (Sentence Transformer / Embeddings API) → similarity search in the vector database → results (chunks, similarities) plus other sources → RE-RANKING with a scoring function → LLM (chat completion API) → answer via REST API]
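The re-ranking step itself is small: score every first-stage result against the question with a (usually more expensive) scoring function, drop weak matches, keep the best ones. A plain-Java sketch: `wordOverlap` is a toy placeholder for a real scoring model such as a cross-encoder, and `minScore` and `topN` are illustrative parameters. In LangChain4j this role is played by a `ReRankingContentAggregator` backed by a `ScoringModel`.

```java
import java.util.*;
import java.util.function.ToDoubleBiFunction;

final class Reranker {
    record Scored(String text, double score) {}

    // Re-score each candidate against the query, drop those below minScore, keep the topN.
    static List<Scored> rerank(String query, List<String> candidates,
                               ToDoubleBiFunction<String, String> scoringFunction,
                               double minScore, int topN) {
        List<Scored> scored = new ArrayList<>();
        for (String candidate : candidates) {
            double score = scoringFunction.applyAsDouble(query, candidate);
            if (score >= minScore) scored.add(new Scored(candidate, score));
        }
        scored.sort(Comparator.comparingDouble(Scored::score).reversed());
        return scored.subList(0, Math.min(topN, scored.size()));
    }

    // Toy scoring function: number of distinct query words that appear in the text.
    static double wordOverlap(String query, String text) {
        Set<String> words = new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
        return Arrays.stream(query.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty() && words.contains(w))
                .distinct()
                .count();
    }
}
```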

Slide 108

Slide 108 text

Slide 109

Slide 109 text

ReRanking: LangChain4j “Content Aggregator”

Slide 110

Slide 110 text

ReRanking: BM25 (Best Matching 25)
[Diagram: all documents; documents containing the query term; relevant documents; their overlap is the relevant documents with the query term]
Parameters:
● How often the query terms appear in the document (term frequency, TF)
● Inverse document frequency (IDF)
● The length of the document (DL)
● The average length of all documents in the collection (AVDL)
Pros:
● Dynamic rankings
● Good for long queries
Cons:
● No semantics
● No personalization
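The four BM25 parameters combine into one score per document. A sketch of the classic formula, score = Σ IDF(t) · tf · (k1 + 1) / (tf + k1 · (1 − b + b · DL / AVDL)), with the Lucene-style IDF log(1 + (N − n + 0.5) / (n + 0.5)). The values k1 = 1.2 and b = 0.75 are common defaults assumed here, and tokenization is a naive lowercase split.

```java
import java.util.*;

final class Bm25 {
    final List<List<String>> docs;
    final Map<String, Integer> docFreq = new HashMap<>(); // n: number of docs containing a term
    final double avgDocLength;                             // AVDL
    final double k1, b;

    Bm25(List<String> corpus, double k1, double b) {
        this.k1 = k1;
        this.b = b;
        docs = new ArrayList<>();
        long totalLength = 0;
        for (String text : corpus) {
            List<String> tokens = tokenize(text);
            docs.add(tokens);
            totalLength += tokens.size();
            for (String t : new HashSet<>(tokens)) docFreq.merge(t, 1, Integer::sum);
        }
        avgDocLength = (double) totalLength / docs.size();
    }

    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\W+")) if (!t.isEmpty()) out.add(t);
        return out;
    }

    double idf(String term) {
        int n = docFreq.getOrDefault(term, 0);
        return Math.log(1.0 + (docs.size() - n + 0.5) / (n + 0.5));
    }

    double score(String query, int docIndex) {
        List<String> doc = docs.get(docIndex);
        double score = 0;
        for (String term : tokenize(query)) {
            int tf = Collections.frequency(doc, term);                      // TF
            if (tf == 0) continue;
            double lengthNorm = k1 * (1 - b + b * doc.size() / avgDocLength); // DL vs AVDL
            score += idf(term) * tf * (k1 + 1) / (tf + lengthNorm);
        }
        return score;
    }
}
```

Term frequency saturates through k1 and long documents are penalized through b, but an exact-token match is still required: "cats" does not match "cat", which is the "no semantics" limitation listed above.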

Slide 111

Slide 111 text

ReRanking: Reciprocal Rank Fusion (RRF)
Multiple search queries → rank search results → assign reciprocal rank scores → combine scores → rank documents based on combined scores → generate fused ranking
● Different similarity searches with different filters
● Similarity search + BM25
$$\mathrm{RRF}(d) = \sum_{i=1}^{N} \frac{1}{k + \mathrm{rank}_i(d)}$$
● k: a constant
● rank_i(d): rank position of document d in each of the N result lists
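RRF needs only rank positions, not raw scores, which is what makes it convenient for mixing a similarity search with BM25 whose scores live on different scales. A sketch with 1-based ranks; k = 60 is the value commonly used since the original RRF paper, an assumption here rather than a value from the slide.

```java
import java.util.*;

final class ReciprocalRankFusion {
    // score(d) = sum over result lists of 1 / (k + rank_i(d)), ranks starting at 1.
    static List<String> fuse(List<List<String>> rankings, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : rankings)
            for (int i = 0; i < ranking.size(); i++)
                scores.merge(ranking.get(i), 1.0 / (k + i + 1), Double::sum);
        List<String> fused = new ArrayList<>(scores.keySet());
        // Highest fused score first; ties broken alphabetically for a deterministic order.
        fused.sort(Comparator.comparingDouble((String d) -> scores.get(d)).reversed()
                .thenComparing(Comparator.naturalOrder()));
        return fused;
    }
}
```

A document that appears in several lists accumulates several reciprocal-rank terms, so `b`, ranked second by one search and first by the other, ends up ahead of `a`, which only one search found.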

Slide 112

Slide 112 text

Graph RAG Overview
[Diagram: a structured knowledge graph of (Subject, Predicate, Object) triplets, where subjects and objects are nodes and predicates are relations (edges), shown next to vector indices (vector databases) in which chunks (vectors) are linked by semantic similarities]
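A knowledge graph of (Subject, Predicate, Object) triplets can be held as an adjacency map and expanded hop by hop; the triples collected around an entity could then be added to the prompt as context. An in-memory sketch with hypothetical example data, not a graph database.

```java
import java.util.*;

final class TripleGraph {
    record Triple(String subject, String predicate, String object) {}

    private final Map<String, List<Triple>> outgoing = new HashMap<>();

    void add(String subject, String predicate, String object) {
        outgoing.computeIfAbsent(subject, x -> new ArrayList<>()).add(new Triple(subject, predicate, object));
    }

    // Breadth-first: every triple reachable from 'start' within maxHops edges.
    List<Triple> neighborhood(String start, int maxHops) {
        List<Triple> result = new ArrayList<>();
        Set<String> visited = new HashSet<>(Set.of(start));
        List<String> frontier = List.of(start);
        for (int hop = 0; hop < maxHops && !frontier.isEmpty(); hop++) {
            List<String> next = new ArrayList<>();
            for (String node : frontier)
                for (Triple t : outgoing.getOrDefault(node, List.of())) {
                    result.add(t);
                    if (visited.add(t.object())) next.add(t.object()); // expand each node only once
                }
            frontier = next;
        }
        return result;
    }
}
```

With one hop from "Berlin" the sketch returns (Berlin, capitalOf, Germany); with two hops it also reaches (Germany, memberOf, European Union), the kind of multi-hop fact that plain similarity search over chunks can miss.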

Slide 113

Slide 113 text

Graph RAG Overview
Building the graph (knowledge extraction):
● Link content based on hyperlinks in HTML
● How to: link content based on common keywords (using KeyBERT)
● How to: link content based on named entities (using GLiNER)
● How to: link content based on document hierarchy
Contextual embeddings:
● Embed the triplets
● Context-aware embeddings
[Diagram: BUILD GRAPH (triplets, embeddings); QUERY WITH LLM (Gremlin / Cypher graph traversal)]

Slide 114

Slide 114 text

Graph RAG: CassandraGraphStore
CREATE TABLE astra_docs_nodes (
    content_id text PRIMARY KEY,
    kind text,
    link_to_tags set<tuple<text, text>>,
    links_blob text,
    metadata_blob text,
    text_content text,
    text_embedding vector<float, N>  -- N: embedding dimension
);
CREATE TABLE graph_targets (
    kind text,
    tag text,
    target_content_id text,
    target_text_embedding vector<float, N>,
    PRIMARY KEY ((kind, tag), target_content_id)
);
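How the two tables might work together, mirrored in memory. This is an assumption about the schema's intent, not the CassandraGraphStore code: `link_to_tags` is taken to hold the (kind, tag) pairs a node links to, and `graph_targets` to index, per (kind, tag), the nodes reachable through it. Using keywords as tags (the kind "kw" is a hypothetical label) gives the "link content based on common keywords" strategy.

```java
import java.util.*;

final class TagLinkedGraph {
    record Tag(String kind, String tag) {}

    // Mirrors graph_targets: (kind, tag) -> ids of the nodes reachable through that tag.
    private final Map<Tag, Set<String>> targets = new HashMap<>();
    // Mirrors link_to_tags: node id -> the (kind, tag) pairs the node links to.
    private final Map<String, Set<Tag>> linkToTags = new HashMap<>();

    void addNode(String contentId, Set<Tag> linksTo, Set<Tag> reachableVia) {
        linkToTags.put(contentId, linksTo);
        for (Tag t : reachableVia) targets.computeIfAbsent(t, x -> new TreeSet<>()).add(contentId);
    }

    // One traversal step: follow every outgoing tag of a node to the nodes registered under it.
    Set<String> adjacent(String contentId) {
        Set<String> out = new TreeSet<>();
        for (Tag t : linkToTags.getOrDefault(contentId, Set.of()))
            out.addAll(targets.getOrDefault(t, Set.of()));
        out.remove(contentId); // no self-links
        return out;
    }
}
```

Edges are resolved at query time through the tag index, so adding a chunk never requires rewriting the chunks it links to.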

Slide 115

Slide 115 text

5. Advanced Concepts - Function Calling - Agents, Agentic RAG

Slide 116

Slide 116 text

Function calling with Gemini
1. User → chatbot app: “What’s the weather like in Paris?”
2. Chatbot app → Gemini: user prompt + getWeather(String) function contract
3. Gemini → chatbot app: call getWeather(“Paris”) for me please
4. Chatbot app → external API or service: getWeather(“Paris”) returns {“forecast”:”sunny”}
5. Chatbot app → Gemini: function response is {“forecast”:”sunny”}
6. Gemini → chatbot app: answer “It’s sunny in Paris!”
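The app's side of this exchange (receive a function call, run it, send the JSON result back) is essentially a dispatch table. The `FunctionCall` record and the registry below are hypothetical, not Gemini's or LangChain4j's API; in LangChain4j, methods annotated with `@Tool` play the role of the registered functions.

```java
import java.util.*;
import java.util.function.Function;

final class FunctionCallingLoop {
    // What the model returns instead of text when it wants a function executed (hypothetical shape).
    record FunctionCall(String name, Map<String, String> args) {}

    private final Map<String, Function<Map<String, String>, String>> registry = new HashMap<>();

    void register(String name, Function<Map<String, String>, String> impl) {
        registry.put(name, impl);
    }

    // Step 4: run the requested function; its JSON result is what step 5 sends back to the model.
    String execute(FunctionCall call) {
        Function<Map<String, String>, String> impl = registry.get(call.name());
        if (impl == null) throw new IllegalArgumentException("Unknown function: " + call.name());
        return impl.apply(call.args());
    }
}
```

Usage: register a stub such as `app.register("getWeather", args -> "{\"forecast\":\"sunny\"}")`, then `app.execute(new FunctionCallingLoop.FunctionCall("getWeather", Map.of("city", "Paris")))` returns `{"forecast":"sunny"}`, which is passed back to the model so it can phrase the final answer.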

Slide 117

Slide 117 text

AI Agents
● Different types and capabilities
○ Reflection
■ Chain-of-thought, self-reflection & correction, self-grading
○ Planning
■ Create a multi-step plan of action
○ Tool use
■ Multiple function calls
○ Multi-agent collaboration
■ Chain several LLMs and/or RAG searches

Slide 118

Slide 118 text

Agentic RAG (_49_agentic_RAG)
Request: “Berlin’s origins, population, geographic situation”
🧠 Agentic Assistant 🧠: 1) Identify topics 2) Create questions 3) RAG search 4) Collect answers & generate final report
🛠 History/Geography Tool 🛠: 1) Execute RAG search 2) Call topic assistant to summarize topic
🧠 Topic Assistant 🧠: 1) Study topic answers 2) Create a report summary on the topic
[Diagram: the tool queries the vector database; the topic assistant returns TOPICAL REPORTS; the agentic assistant produces the FINAL REPORT]
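The control flow of this agentic RAG example, with every LLM call and vector search replaced by a hypothetical function parameter so that the orchestration is visible on its own. This is a sketch of the workflow on the slide, not the demo's code.

```java
import java.util.*;
import java.util.function.Function;

final class AgenticRagSketch {
    static String run(String request,
                      Function<String, List<String>> identifyTopics,   // agentic assistant, step 1
                      Function<String, List<String>> createQuestions,  // agentic assistant, step 2 (per topic)
                      Function<String, String> ragSearch,              // tool: RAG search on the vector database
                      Function<List<String>, String> summarizeTopic,   // topic assistant: topical report
                      Function<List<String>, String> writeFinalReport) { // agentic assistant, step 4
        List<String> topicalReports = new ArrayList<>();
        for (String topic : identifyTopics.apply(request)) {
            List<String> answers = new ArrayList<>();
            for (String question : createQuestions.apply(topic)) answers.add(ragSearch.apply(question));
            topicalReports.add(summarizeTopic.apply(answers));
        }
        return writeFinalReport.apply(topicalReports);
    }
}
```

In a real system each parameter would wrap an LLM prompt or a vector store query; keeping them as functions makes the workflow testable with stubs.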

Slide 119

Slide 119 text

6. Quality and Data Governance - RAG Evaluation - Security - Data Lifecycle

Slide 120

Slide 120 text

RAGAS evaluation metrics
[Figure: metrics grouped under GENERATION and RETRIEVAL]

Slide 121

Slide 121 text

DeepEval evaluation metrics

Slide 122

Slide 122 text

Evaluation: MRR and NDCG
It is essential to evaluate each of the final answers generated in response to user queries.
● MRR (Mean Reciprocal Rank):
○ Measures how quickly the first relevant result appears, by averaging the reciprocal of its rank across queries.
● NDCG (Normalized Discounted Cumulative Gain):
○ Assesses overall ranking quality by giving more weight to relevant items in the top positions, taking both relevance and position into account.
● Other techniques
○ Bilingual Evaluation Understudy (BLEU): text translation
○ Recall-Oriented Understudy for Gisting Evaluation (ROUGE): text summarization
○ BERTScore
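Both ranking metrics in code, following the usual definitions: MRR = (1/|Q|) Σ_q 1/rank_q, and NDCG@k = DCG@k / IDCG@k with DCG@k = Σ_{i=1..k} rel_i / log2(i + 1). A sketch assuming graded relevance scores supplied per document; documents without a score count as irrelevant.

```java
import java.util.*;

final class RankingMetrics {
    // MRR: average over queries of 1 / (rank of the first relevant result); 0 when none is relevant.
    static double meanReciprocalRank(List<List<String>> results, List<Set<String>> relevant) {
        double sum = 0;
        for (int q = 0; q < results.size(); q++) {
            List<String> ranked = results.get(q);
            for (int i = 0; i < ranked.size(); i++) {
                if (relevant.get(q).contains(ranked.get(i))) { sum += 1.0 / (i + 1); break; }
            }
        }
        return sum / results.size();
    }

    // NDCG@k: DCG of the actual ranking divided by the DCG of the ideal ranking.
    static double ndcgAtK(List<String> ranked, Map<String, Double> relevance, int k) {
        double dcg = 0;
        for (int i = 0; i < Math.min(k, ranked.size()); i++)          // 0-based i, so position is i + 1
            dcg += relevance.getOrDefault(ranked.get(i), 0.0) / log2(i + 2);
        List<Double> ideal = new ArrayList<>(relevance.values());
        ideal.sort(Comparator.reverseOrder());
        double idcg = 0;
        for (int i = 0; i < Math.min(k, ideal.size()); i++) idcg += ideal.get(i) / log2(i + 2);
        return idcg == 0 ? 0 : dcg / idcg;
    }

    static double log2(double x) { return Math.log(x) / Math.log(2); }
}
```

For two queries whose first relevant results sit at ranks 2 and 1, MRR is (1/2 + 1/1) / 2 = 0.75.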

Slide 123

Slide 123 text

LLM as Judge
● Prepare a dataset of questions and golden responses
● Use your RAG pipeline to answer those questions
● Use an LLM as a judge to gauge the quality of your RAG results against a set of metrics
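The three steps as an evaluation loop. The RAG pipeline and the judge are hypothetical stand-ins: in practice the judge would be an LLM prompted with the question, the golden answer and the RAG answer, and asked to return a grade (a 1 to 5 scale is assumed here).

```java
import java.util.*;
import java.util.function.Function;

final class LlmAsJudge {
    record Example(String question, String goldenAnswer) {}

    // Hypothetical judge interface: wraps an LLM prompt that grades an answer against the golden one.
    interface Judge { double grade(String question, String goldenAnswer, String ragAnswer); }

    static double averageScore(List<Example> dataset, Function<String, String> ragPipeline, Judge judge) {
        double total = 0;
        for (Example e : dataset) {
            String answer = ragPipeline.apply(e.question());               // step 2: answer with the RAG pipeline
            total += judge.grade(e.question(), e.goldenAnswer(), answer);  // step 3: grade with the judge
        }
        return dataset.isEmpty() ? 0 : total / dataset.size();
    }
}
```

Re-running the same dataset after each change to chunking, retrieval or re-ranking turns the average grade into a regression signal.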

Slide 124

Slide 124 text

OWASP Top 10 for LLM Applications

Slide 125

Slide 125 text

Security & Data Privacy
● Anonymize data (e.g. with Google Cloud Data Loss Prevention)
● Don’t log PII details
● Use local models when possible
● Separate tenants for compliance with data protection laws
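A local illustration of "don't log PII": mask obvious identifiers before a string reaches the logs. The two regular expressions are simplistic assumptions that will miss many cases; a managed service such as the one named on the slide detects far more kinds of sensitive data.

```java
import java.util.regex.Pattern;

final class PiiMasker {
    // Deliberately simple patterns (assumptions): e-mail addresses and long digit runs that look like phone numbers.
    private static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");
    private static final Pattern PHONE = Pattern.compile("\\+?\\d[\\d ]{7,}\\d");

    static String mask(String text) {
        String noEmails = EMAIL.matcher(text).replaceAll("[EMAIL]");
        return PHONE.matcher(noEmails).replaceAll("[PHONE]");
    }
}
```

For example, `PiiMasker.mask("Call +33 6 12 34 56 78 or mail jane@corp.com")` yields `Call [PHONE] or mail [EMAIL]`.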

Slide 126

Slide 126 text

Data Lifecycle
● Your data isn’t stale, it’s alive
● When a document is updated:
○ its chunks change
○ old chunks need to be retired
● Chunk metadata should track the document of origin, last-update timestamps or document versions
● Prepare an update schedule
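A minimal sketch of version-aware re-ingestion: chunks carry their document id and version, an update retires every older chunk of that document, and out-of-date updates are ignored. The in-memory map is a hypothetical stand-in for a vector store, where the same step would typically be a delete filtered on the document-id metadata followed by an insert.

```java
import java.util.*;

final class ChunkStore {
    // Chunk metadata tracks the document of origin and its version.
    record Chunk(String docId, int docVersion, int position, String text) {}

    private final Map<String, List<Chunk>> chunksByDoc = new HashMap<>();
    private final Map<String, Integer> versions = new HashMap<>();

    // Returns false (and changes nothing) when the update is not newer than what is stored.
    boolean upsertDocument(String docId, int version, List<String> chunkTexts) {
        Integer known = versions.get(docId);
        if (known != null && known >= version) return false;
        List<Chunk> fresh = new ArrayList<>();
        for (int i = 0; i < chunkTexts.size(); i++) fresh.add(new Chunk(docId, version, i, chunkTexts.get(i)));
        chunksByDoc.put(docId, fresh); // old chunks of this document are retired in the same step
        versions.put(docId, version);
        return true;
    }

    List<Chunk> allChunks() {
        List<Chunk> all = new ArrayList<>();
        chunksByDoc.values().forEach(all::addAll);
        return all;
    }
}
```

Replacing all chunks of a document at once matters because re-chunking usually shifts chunk boundaries, so old and new chunks cannot be matched one to one.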

Slide 127

Slide 127 text

Conclusion
- It’s hard… there is no “one size fits all” solution
- The different types of questions (multi-hop & reasoning tasks)
- LLMs with large context windows are great at reasoning!

Slide 128

Slide 128 text

Lots of techniques, which one to pick?

Slide 129

Slide 129 text

There are easy questions… and hard ones! Mintaka: A complex, natural, and multilingual dataset for end-to-end question answering. arXiv preprint arXiv:2210.01613

Slide 130

Slide 130 text

LLM w/ large context + advanced database + agentic: combine the best of both worlds!
● Implement Retrieval Augmented Generation with a capable vector database
● Use multistep agentic reasoning with LLMs with large context windows

Slide 131

Slide 131 text

Table of Contents (220 pages): • First Look at LangChain4j • Understanding LangChain for Java • Getting Started • Accessing Models • Invoking Models • Extending Models • Processing Documents • Handling Embeddings • Retrieval-Augmented Generation • AI Services • Putting It All Together • Summary https://agoncal.teachable.com https://amazon.com/author/agoncal

Slide 132

Slide 132 text

Thanks for your attention! (is all you need?) github.com/datastaxdevs/conference-2024-devoxx