Slide 1

A Case Study on Retrieval-Augmented Generation for Document Q&A: Experiences and Future Perspectives

Jan Hauffa

Slide 2

Use Case: Document Q&A

Ask questions about the content of a document in natural language, receive answers in natural language.

➔ How to „teach“ an LLM new factual knowledge?

Example: “Who invented the Transformer architecture?” → LLM → “Ashish Vaswani et al.”

Slide 3

AGENDA

01 Teaching Factual Knowledge to an LLM
02 Retrieval-Augmented Generation
03 Measure, Debug, Improve

Slide 4

01 Teaching Factual Knowledge to an LLM

Slide 5

First Attempt: Fine Tuning

• Taking a similar approach to “Databricks Dolly-2” (implementation: Zaid Ur Rehman)
• Base model Pythia-12b, 8-bit quantization, LoRA (see the sketch below)
• Mix of supervised and self-supervised training:
  • German (machine) translation of the dolly-15k instruction dataset: https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual
  • German legal texts (“Bürgerliches Gesetzbuch”, BGB – the German Civil Code)
• 1x A100 (80 GB VRAM) on Azure Databricks, overall training time ~48 h
• Results:
  • Quality of the generated German text substantially lower than that of English text
  • Unable to answer even simple questions correctly:
    “Was ist ein Unternehmer?” (“What is an entrepreneur?”) – “Unternehmer sind Personen, die eine wirtschaftliche Tätigkeit durchführen, die nicht der Erwerbsgeschäftsfelder entspricht, also etwa durch staatliche Beihilfen oder staatliche Subventionen profitieren.” (roughly: “Entrepreneurs are persons who carry out an economic activity that does not correspond to the fields of commercial business, i.e. who profit from state aid or state subsidies.” – an incoherent, incorrect answer)
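
A rough sketch of the setup described above, assuming the Hugging Face transformers + peft stack. The concrete LoRA hyperparameters (rank, alpha, dropout) are illustrative assumptions, not the values used in this experiment; only the base model, 8-bit loading, and LoRA itself come from the slide.

    # Sketch only: Pythia-12b in 8-bit, LoRA adapters via peft.
    # Hyperparameters are assumptions for illustration, not the experiment's actual values.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    base = "EleutherAI/pythia-12b"
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(
        base,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)

    lora = LoraConfig(
        r=16,                                # rank of the low-rank update matrices (assumption)
        lora_alpha=32,                       # scaling factor (assumption)
        lora_dropout=0.05,
        target_modules=["query_key_value"],  # the fused attention projection in Pythia / GPT-NeoX
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()       # only the small adapter matrices are trained
    # ... then train as usual, e.g. with transformers.Trainer on the instruction dataset.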

Slide 6

Lessons Learned

• Need a base model that has been trained on a sufficient amount of German-language text.
• Fine tuning: what is it good for?
  • Suitable for controlling the format and style of the expected output (“Instruction Fine Tuning”, Chung et al., 2022).
  • Not suitable for acquiring factual knowledge from arbitrary, “natural” text.
• Hypothesis: Knowledge acquisition during training (and fine tuning) requires repeated exposure to the new facts in different contexts, phrased differently.
  • “Textbooks Are All You Need” (Gunasekar et al., 2023): Teach a model to write Python code using a training dataset of synthetic textbooks.
  • “Reversal curse”: “[an LLM] does not increase the probability P(b = a) after training on a = b” – data augmentation by paraphrasing helps (Berglund et al., 2023).

Slide 7

Zero Shot, Few Shot, and In-Context Learning

• Zero Shot Learning: Prompt contains the task description.
  Translate English to German.
  English: How are you?
  German:
• Few Shot Learning: Prompt contains the task description and a small number of examples.
  Translate English to German.
  English: Good morning!
  German: Guten Morgen!
  English: How are you?
  German:
• Why does this work?
  “During unsupervised pre-training, a language model develops a broad set of skills and pattern recognition abilities. It then uses these abilities at inference time to rapidly adapt to or recognize the desired task.” (Brown et al., 2020) → In-Context Learning

Slide 8

Document Q&A as Zero Shot Learning?

• Idea: Construct a prompt from the content of the document and the question to be answered.
• Advantages:
  • The model no longer has to provide factual knowledge, just language skills and basic reasoning.
  • No expensive fine tuning for each new document.
• One important caveat:
  • Limited context length of Transformer models → this only works for short documents!
  • Typical “open” LLMs: 2K tokens, i.e. 1-4 pages of text.
  • The attention mechanism has quadratic complexity (time and space!) in the number of tokens – see the back-of-the-envelope calculation below.
  • Diverse attempts to “fix” attention: efficient attention (Tay et al., 2020), RoPE (Su et al., 2021), YaRN (Peng et al., 2023) – but no definitive solution yet.

➔ What about Q&A with 2 documents? 10 documents? An entire library?
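
To make the quadratic growth concrete, a back-of-the-envelope calculation of the size of a single attention-score matrix (one head, one layer, fp16). Real implementations (e.g. FlashAttention) avoid materializing this matrix, so the numbers are illustrative only.

    # Back-of-the-envelope: bytes needed for one n x n attention-score matrix in fp16.
    # Illustrative only; ignores heads, layers, and memory-saving attention kernels.
    def attn_matrix_mib(n_tokens: int, bytes_per_value: int = 2) -> float:
        return n_tokens * n_tokens * bytes_per_value / 2**20

    for n in (2_048, 8_192, 32_768):
        print(f"{n:>6} tokens -> {attn_matrix_mib(n):>8.1f} MiB")
    # 2048 tokens ->    8 MiB
    # 8192 tokens ->  128 MiB
    # 32768 tokens -> 2048 MiB  (16x more tokens, 256x more memory)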

Slide 9

02 Retrieval-Augmented Generation

Slide 10

Retrieval-Augmented Generation for Document Q&A

• Two steps: Indexing, Question Answering
• Indexing:
  1. Split an arbitrary number of documents into small chunks.
  2. For each chunk, compute a semantic embedding.
  3. Store the chunks and their embeddings in a vector database.
• Question Answering:
  1. Compute an embedding of the given question. Use asymmetric embeddings to ensure that the question is close to potential answers in the embedding space (Wang et al., 2022).
  2. Perform an approximate nearest-neighbor search (ANN, e.g. Malkov and Yashunin, 2016) with the embedding of the question as the query.
  3. Construct a prompt for an LLM from the top-k result chunks (the “context”) and the question.

A minimal sketch of both steps is shown below.
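
A minimal sketch of both steps, assuming the Multilingual E5 model used in the demo (asymmetric “query:” / “passage:” prefixes) and a brute-force numpy search standing in for the vector database / ANN index. The chunking heuristic and the prompt template are simplified assumptions.

    # Sketch: indexing + question answering with asymmetric embeddings.
    # Brute-force cosine search stands in for the vector database / ANN index.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("intfloat/multilingual-e5-large")

    def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
        # Naive fixed-size split with overlap; real systems use smarter boundaries.
        return [text[i:i + size] for i in range(0, len(text), size - overlap)]

    # --- Indexing ---
    documents = ["Dog goes 'woof'. Cat goes 'meow'. And the seal goes 'ow ow ow'."]
    chunks = [c for doc in documents for c in chunk(doc)]
    # E5 expects a "passage: " prefix for documents and "query: " for questions (asymmetric embeddings).
    index = embedder.encode([f"passage: {c}" for c in chunks], normalize_embeddings=True)

    # --- Question Answering ---
    question = "What sound does a cat make?"
    q = embedder.encode(f"query: {question}", normalize_embeddings=True)
    top_k = np.argsort(index @ q)[::-1][:3]             # cosine similarity on normalized vectors
    context = "\n".join(chunks[i] for i in top_k)
    prompt = (f"Read the following information:\n{context}\n\n"
              f"Now use that information to answer the question: {question}")
    # `prompt` is then passed to the LLM (e.g. Llama-2 13b in the demo).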

Slide 11

Semantic Similarity Search using Embeddings

Slide 12

The Document Q&A Pipeline

[Pipeline diagram]
• Indexing: documents → overlapping chunks (e.g. „The quick brown fox jumps“, „brown fox jumps over“, „jumps over the lazy dog.“) → embeddings → vector database.
• Question Answering: question („What sound does a cat make?“) → query embedding → vector database query → top-k most similar chunks („Dog goes ‘woof‘.“, „Cat goes ‘meow‘.“, „And the seal goes ‘ow ow ow’.“) → prompt („Read the following information: Dog goes ‘woof‘. Cat goes ‘meow‘. And the seal goes ‘ow ow ow’. Now use that information to answer the question: What sound does a cat make?“) → LLM → answer („Cat goes ‘meow‘.“).

Slide 13

Demo

Implementation: Hoai-Nam Tran

Slide 14

Demo – Technical Specs

• Models
  • Embedding: Multilingual E5 large (Wang et al., 2022) – currently holds rank 6 on the MTEB leaderboard (Retrieval): https://huggingface.co/spaces/mteb/leaderboard
  • LLM: Llama-2 13b (https://ai.meta.com/llama/)
• Hardware
  • Runs on an NVIDIA Titan X (12 GB VRAM) with 4-bit quantization.
  • CPU-only and mixed CPU/GPU inference: llama.cpp (https://github.com/ggerganov/llama.cpp) – see the sketch below.
• Software
  • Vector database: Milvus
  • Document management: DaSense (text extraction, OCR, ACLs, browsing/tagging/searching)

➔ Experimentation is possible, even with consumer hardware! (But you’ll want more VRAM for larger context, less quantization, larger models, …)
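
For illustration, a rough sketch of mixed CPU/GPU inference with a 4-bit-quantized Llama-2 13b via the llama-cpp-python bindings. The GGUF file name, context size, layer split, and sampling parameters are assumptions, not the demo's actual configuration.

    # Sketch using the llama-cpp-python bindings (pip install llama-cpp-python).
    # Model path and parameter values below are illustrative assumptions.
    from llama_cpp import Llama

    llm = Llama(
        model_path="llama-2-13b-chat.Q4_K_M.gguf",  # hypothetical 4-bit quantized GGUF file
        n_ctx=2048,                                  # context window
        n_gpu_layers=20,                             # offload part of the layers to the 12 GB GPU, rest on CPU
    )
    prompt = ("Read the following information:\nCat goes 'meow'.\n\n"
              "Now use that information to answer the question: What sound does a cat make?")
    out = llm(prompt, max_tokens=256, temperature=0.2)
    print(out["choices"][0]["text"])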

Slide 15

Retrieval-Augmented Generation for Everything Else (1/2)

The Document Q&A recipe from slide 10, generalized (the replaced terms from the original recipe are noted in parentheses):

• Indexing:
  1. Split an arbitrary amount of data (not just documents) into small chunks.
  2. For each chunk, compute a semantic embedding.
  3. Store the chunks and their embeddings in a vector database.
• Text Generation (instead of Question Answering):
  1. Compute a query embedding somehow (not necessarily from a question).
  2. Perform an approximate nearest-neighbor search (ANN) with the query embedding.
  3. Construct a prompt for an LLM from the top-k result chunks and the task description (instead of the question).

Slide 16

Retrieval-Augmented Generation for Everything Else (2/2)

• Chat:
  • Store individual lines of the chat history, use symmetric embeddings
    → search for the most similar parts of past conversations.
• Translation:
  • Store bilingual sentence pairs, compute embeddings for the sentences in the source language only
    → search for the most relevant examples for few-shot learning.
• Summarization:
  • Perform clustering on the chunk embeddings, find the chunks most similar to each centroid
    → topical summary (see the sketch below).
• etc.
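
A sketch of the summarization variant: cluster the chunk embeddings and pick the chunk closest to each centroid as a topical summary. The number of clusters, the placeholder chunks, and the choice of embedding model are assumptions.

    # Sketch: topical summary by clustering chunk embeddings.
    # k and the embedding model are illustrative assumptions.
    import numpy as np
    from sklearn.cluster import KMeans
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("intfloat/multilingual-e5-large")
    chunks = ["...chunk 1...", "...chunk 2...", "...chunk 3...", "...chunk 4..."]  # placeholder chunks
    emb = embedder.encode([f"passage: {c}" for c in chunks], normalize_embeddings=True)

    k = 2
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(emb)
    for centroid in km.cluster_centers_:
        closest = int(np.argmin(np.linalg.norm(emb - centroid, axis=1)))
        print(chunks[closest])   # one representative chunk per topic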

Slide 17

03 Measure, Debug, Improve

Slide 18

Evaluation

• Given a “ground truth” dataset of source documents and question/answer pairs, how do we evaluate a document Q&A system?
• Human evaluation:
  • Absolute rating: “Is the content of the generated text correct / equivalent to the reference answer?” → accuracy
  • Relative rating: “Which of the two answers is more accurate / more similar to the reference?” → ranking, Elo
  • Issues: does not scale, annotators need domain knowledge, inter-annotator agreement, …
• Evaluation by LLM:
  • Same questions as before, but an LLM replaces the human.
  • A stronger model assesses a weaker model; GPT-4 is generally acknowledged as the “strongest”.
  • Issues: unknown biases of GPT-4, token costs per test run, LLM drift (Chen et al., 2023)

Slide 19

Can an LLM Evaluate Itself?

Decompose the evaluation into multiple related tasks:

• Faithfulness: Can all claims that are made in the answer be inferred from the context?
• Answer Relevance: Does the answer directly and appropriately address the question?
• Context Recall: Proportion of ground-truth sentences that are consistent with the context (here: context == top-k chunks!).
• Context Precision: Proportion of the context that is relevant for answering the question.

+ Direct assessment of correctness via an LLM (prompting) or an embedding model (semantic similarity).

https://github.com/explodinggradients/ragas
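
A sketch of how such an evaluation could be run with the ragas library linked above. The metric names and dataset columns follow ragas' documented usage at the time of writing, but exact names vary between versions, so treat the details as assumptions.

    # Sketch of an evaluation run with ragas (https://github.com/explodinggradients/ragas).
    # Column names and imports may differ between ragas versions.
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

    data = Dataset.from_dict({
        "question":     ["What sound does a cat make?"],
        "answer":       ["Cat goes 'meow'."],                        # generated by the RAG pipeline
        "contexts":     [["Dog goes 'woof'.", "Cat goes 'meow'."]],  # top-k retrieved chunks
        "ground_truth": ["A cat meows."],                            # reference answer
    })
    result = evaluate(data, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
    print(result)   # per-metric scores; an LLM is used as the judge under the hood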

Slide 20

Debugging

[Pipeline diagram from slide 12 (Indexing / Question Answering), annotated: “This is not a black box!”]

Slide 21

Debugging

• Chunking issues?
  • Better heuristics for choosing chunk boundaries, e.g. via document segmentation (Li et al., 2022) → logical units of text
• Retrieval issues?
  • More/different embeddings: sparse embeddings (SPLADE; Formal et al., 2021), Hypothetical Document Embeddings (HyDE; Gao et al., 2022)
  • Combine ANN and traditional retrieval (e.g. BM25) – see the sketch below
  • Re-ranking (Glass et al., 2022)
  • ANN parameter tuning
• Generation issues?
  • Larger model, less quantization
  • Sampling algorithm and parameters (temperature, repetition penalty, …)
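
As an example of combining ANN and traditional retrieval, a sketch of fusing dense similarity scores with BM25 (via the rank_bm25 package) using reciprocal rank fusion. The dense scores, the toy chunks, and the RRF constant k=60 are illustrative assumptions.

    # Sketch: hybrid retrieval – fuse dense (ANN) and lexical (BM25) rankings with reciprocal rank fusion.
    import numpy as np
    from rank_bm25 import BM25Okapi

    def rrf_scores(ranking, k=60):
        # ranking: chunk indices ordered from best to worst
        return {int(doc): 1.0 / (k + rank) for rank, doc in enumerate(ranking, start=1)}

    chunks = ["Dog goes 'woof'.", "Cat goes 'meow'.", "And the seal goes 'ow ow ow'."]
    query = "what sound does a cat make"

    dense_scores = np.array([0.31, 0.78, 0.12])        # stand-in for cosine similarities from the ANN search
    bm25 = BM25Okapi([c.lower().split() for c in chunks])
    bm25_scores = bm25.get_scores(query.split())

    dense = rrf_scores(np.argsort(dense_scores)[::-1])
    lexical = rrf_scores(np.argsort(bm25_scores)[::-1])
    fused = {i: dense.get(i, 0.0) + lexical.get(i, 0.0) for i in range(len(chunks))}
    for i in sorted(fused, key=fused.get, reverse=True):
        print(f"{fused[i]:.4f}  {chunks[i]}")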

Slide 22

What Kind of Questions can RAG (not) Answer? (1/2)

• Factual questions about the content of the document
  • “What’s a transformer model?”
• Questions that reference earlier questions
  • “Who invented it?”
  • → Store past questions/answers in the prompt or in the vector DB.
• Questions about meta-data
  • “Which section contains the definition of the Transformer architecture?”
  • “How long is the paper?”
  • → Store meta-data in the vector DB, per document or inside the chunks.
• Questions that require background knowledge
  • “How was the Transformer architecture received by the scientific community?”
  • → Store supplementary documents in the vector DB.

Slide 23

What Kind of Questions can RAG (not) Answer? (2/2)

• Questions about tables, photos, illustrations
  • “Which model achieves the best average BLEU score?”
  • “What happens to the product of Q and K?”
  • → Would need a multimodal LLM like LLaVA (Liu et al., 2023), but also multimodal embeddings!
  • Hack: Generate a textual description and embed that.
• Questions that cannot be answered based on single chunks
  • “Give me a short summary of the paper!”
  • “Can you translate the text into French?”
• Asking for opinions, unrelated questions, insults, random nonsense, …

➔ What kind of questions are my users asking?

Slide 24

Thank you!

Contact us: Jan Hauffa (jah), Zaid Ur Rehman (zur), Hoai-Nam Tran (hnt) @norcom.de

NorCom Information Technology GmbH & Co. KGaA
Gabelsbergerstraße 4
80333 München
T +49 (0) 89 939 48 0
F +49 (0) 89 939 48 111
E [email protected]

Slide 25

References (1/2)

• Gunasekar et al., 2023: Textbooks Are All You Need. https://arxiv.org/abs/2306.11644
• Berglund et al., 2023: Taken out of context: On measuring situational awareness in LLMs. https://arxiv.org/abs/2309.00667
• Chung et al., 2022: Scaling Instruction-Finetuned Language Models. https://arxiv.org/abs/2210.11416
• Brown et al., 2020: Language Models are Few-Shot Learners. https://arxiv.org/abs/2005.14165
• Tay et al., 2020: Efficient Transformers: A Survey. https://arxiv.org/abs/2009.06732
• Su et al., 2021: RoFormer: Enhanced Transformer with Rotary Position Embedding. https://arxiv.org/abs/2104.09864
• Peng et al., 2023: YaRN: Efficient Context Window Extension of Large Language Models. https://arxiv.org/abs/2309.00071
• Wang et al., 2022: Text Embeddings by Weakly-Supervised Contrastive Pre-training. https://arxiv.org/abs/2212.03533

Slide 26

References (2/2)

• Malkov and Yashunin, 2016: Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. https://arxiv.org/abs/1603.09320
• Chen et al., 2023: How is ChatGPT's behavior changing over time? https://arxiv.org/abs/2307.09009
• Li et al., 2022: DiT: Self-supervised Pre-training for Document Image Transformer. https://arxiv.org/abs/2203.02378
• Formal et al., 2021: SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. https://arxiv.org/abs/2107.05720
• Gao et al., 2022: Precise Zero-Shot Dense Retrieval without Relevance Labels. https://arxiv.org/abs/2212.10496
• Glass et al., 2022: Re2G: Retrieve, Rerank, Generate. Proceedings of NAACL-HLT. https://aclanthology.org/2022.naacl-main.194
• Liu et al., 2023: Visual Instruction Tuning. https://arxiv.org/abs/2304.08485