Improving RAG solutions based on real-world experiences ("Talk to your data")

[Diagram: the basic RAG pipeline. Indexing/embedding path: text -> embedding model -> embedding -> save to vector DB. QA path: question -> embedding model -> query against the vector DB -> relevant text, passed together with the question to the LLM.]
Important: the embedding model matters. In our case, e.g.:
▪ intfloat/multilingual-e5-large-instruct: ~50% hit rate
▪ T-Systems-onsite/german-roberta-sentence-transformer-v2: <70% hit rate
▪ danielheinz/e5-base-sts-en-de: >80% hit rate
▪ Fine-tuning the embedding model might be an option
▪ As of now: treat embedding models as exchangeable commodities!
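Since the best-scoring model can change with the data, it helps to keep the model behind a single dispatch point. A minimal sketch of such an "exchangeable commodity" setup (the lambdas are toy stand-ins, not real embedders):

```python
# Sketch: hide embedding models behind one dispatch function so they stay
# exchangeable. The lambdas are toy stand-ins, not real models.
from typing import Callable, Dict, List

EmbedFn = Callable[[str], List[float]]

EMBEDDERS: Dict[str, EmbedFn] = {
    # In real code these would call e.g. sentence-transformers models.
    "intfloat/multilingual-e5-large-instruct": lambda text: [float(len(text)), 1.0],
    "danielheinz/e5-base-sts-en-de": lambda text: [float(len(text)), 2.0],
}

def embed(model_name: str, text: str) -> List[float]:
    """Swapping the model becomes a one-line configuration change."""
    return EMBEDDERS[model_name](text)
```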
Loading
▪ LangChain has very strong support for loading data
▪ Support for cleanup
▪ Support for splitting
https://python.langchain.com/docs/integrations/document_loaders
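For illustration, a stdlib-only stand-in for the load step: it reads files into simple records with `page_content` plus `metadata`, the same shape LangChain's document loaders produce. The `load_directory` helper is hypothetical, not a LangChain API.

```python
# Stdlib stand-in for the loading step: read files into Document records
# (page_content + metadata), mirroring the shape LangChain loaders return.
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

def load_directory(path: str, pattern: str = "*.txt") -> list[Document]:
    docs = []
    for file in sorted(Path(path).glob(pattern)):
        text = file.read_text(encoding="utf-8").strip()  # trivial cleanup step
        docs.append(Document(page_content=text, metadata={"source": file.name}))
    return docs
```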
Splitting (Text Segmentation)
▪ by size (text length)
▪ by character (\n\n)
▪ by paragraph, sentence, words (until small enough)
▪ by size (tokens)
▪ overlapping chunks (token-wise)
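The last bullet, token-wise overlapping chunks, can be sketched as follows; whitespace tokens stand in for a real tokenizer (e.g. tiktoken), and the default sizes are arbitrary assumptions:

```python
# Sketch: size-based splitting with token-wise overlap. Whitespace tokens
# stand in for a real tokenizer; chunk_size/overlap defaults are arbitrary.
def split_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    assert 0 <= overlap < chunk_size
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap  # consecutive windows share `overlap` tokens
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):  # last window reached the end
            break
    return chunks
```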
[Diagram: query flow. The query "What is the name of the teacher?" is embedded and matched against the vector database; the ranked hits (Doc. 1: 0.86, Doc. 2: 0.84, Doc. 3: 0.79) form a weighted result that feeds answer generation.]
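The ranked hits in the diagram are typically cosine-similarity scores; a minimal sketch of that scoring step (a plain dict plays the vector database):

```python
# Sketch: rank stored chunks by cosine similarity to the query embedding,
# producing scored hits like the Doc. 1: 0.86 / Doc. 2: 0.84 list above.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 3):
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```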
Hypothetical Document Embeddings (HyDE), https://arxiv.org/abs/2212.10496

[Diagram: the query "What should I do, if I missed the last train?" is sent to an LLM, e.g. GPT-3.5-turbo, with the prompt "Write a company policy that contains all information which will answer the given question: {QUERY}". The resulting hypothetical document is run through the embedding model, and its embedding queries the vector database (Doc. 3: 0.86, Doc. 2: 0.81, Doc. 1: 0.81) to form the weighted result.]
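The flow in the diagram boils down to a few lines: the LLM writes a hypothetical answer document, and the search runs on its embedding instead of the raw query's. In this sketch `llm`, `embed` and `vector_db` are placeholder dependencies, not concrete APIs:

```python
# Sketch of the HyDE flow: embed a generated hypothetical document,
# not the user query. llm/embed/vector_db are injected placeholders.
HYDE_PROMPT = (
    "Write a company policy that contains all information "
    "which will answer the given question: {query}"
)

def hyde_search(query: str, llm, embed, vector_db, k: int = 3):
    hypothetical_doc = llm(HYDE_PROMPT.format(query=query))
    return vector_db.search(embed(hypothetical_doc), k=k)
```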
What else?
▪ Each query has to be transformed through an LLM (slow & expensive)
▪ A lot of requests will probably be very similar to each other
▪ Each time a different hypothetical document is generated, even for an extremely similar request
▪ Leads to very different results each time
▪ Idea: alternative indexing: transform the document, not the query
[Diagram: alternative indexing. A chunk of the document goes to an LLM with the prompt "Write 3 questions, which are answered by the following document." The transformed document (the generated questions) is run through the embedding model and stored in the vector database, with the content of the original chunk attached as metadata.]
[Diagram: query time. The query "What should I do, if I missed the last train?" is embedded and matched in the vector database (Doc. 3: 0.89, Doc. 1: 0.86, Doc. 2: 0.76); the weighted result returns the original document from the metadata.]
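Both sides of this variant fit in a short sketch: at indexing time the generated questions are embedded while the original chunk rides along as metadata, and at query time the best-matching question hands back that chunk. `llm`, `embed` and `similarity` are placeholder dependencies, and a plain list plays the vector store:

```python
# Sketch of question-based indexing: embed generated QUESTIONS, keep the
# original chunk as metadata, return the chunk at query time.
def index_chunk(chunk: str, llm, embed, store: list, n_questions: int = 3) -> None:
    prompt = (f"Write {n_questions} questions, which are answered "
              f"by the following document.\n{chunk}")
    for question in llm(prompt):
        store.append({"vector": embed(question), "metadata": {"chunk": chunk}})

def retrieve(user_question: str, embed, store: list, similarity) -> str:
    best = max(store, key=lambda entry: similarity(embed(user_question), entry["vector"]))
    return best["metadata"]["chunk"]  # answer generation sees the ORIGINAL chunk
```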
[Diagram: recap of the basic RAG pipeline (indexing/embedding and QA paths).]
Recap: Not good enough?
▪ HyQE or alternative indexing
  ▪ How many questions?
  ▪ With or without a summary?
▪ Other approaches
  ▪ Only generate a summary
  ▪ Extract the "intent" from the user input and search by that
  ▪ Transform document and query into a common search embedding
  ▪ HyKSS: Hybrid Keyword and Semantic Search, https://www.deg.byu.edu/papers/HyKSS.pdf
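A hybrid keyword-plus-semantic score in the spirit of HyKSS might be sketched like this; the keyword measure and the 50/50 weighting are illustrative assumptions, not taken from the paper:

```python
# Sketch: blend a keyword-overlap score with a semantic similarity score.
# The overlap measure and the default alpha = 0.5 are arbitrary choices.
def hybrid_score(query: str, doc: str, semantic_sim: float, alpha: float = 0.5) -> float:
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    keyword = len(q_words & d_words) / len(q_words) if q_words else 0.0
    return alpha * keyword + (1.0 - alpha) * semantic_sim
```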
Conclusion
▪ `Talk to your data` is a relevant AI business use-case
▪ Quality of results depends heavily on data quality and the preparation pipeline
▪ Always evaluate approaches with your own data & queries
▪ The actual/final approach is more involved than it seems at first glance
▪ The RAG pattern can produce breathtakingly good results