Semantic vs keyword search as context for GPT

Slide 1

Slide 1 text

Semantic vs keyword search as context for GPT
Tudor Golubenco
June 2023

Slide 2

Slide 2 text

Tudor Golubenco, CTO @ xata

Serverless data platform for PostgreSQL
● Free-text search
● Vector search
● Rich types
● Branches
● Admin UI
● SDKs

Slide 3

Slide 3 text

ChatGPT on your data

Slide 4

Slide 4 text

How does it work?
● Run a text search against the documentation to find the content that is most relevant to the question asked by the user.
● Create a prompt for the ChatGPT API with that content.
● Stream back the result as is.

Slide 5

Slide 5 text

Creating the prompt

With these rules: {rules}
And this text: {context}
Given the above text, answer the question: {question}
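A minimal sketch of how the template above could be filled in. The function name `build_prompt` and the choice to join the rules with spaces are assumptions made for illustration; the slides do not show the actual implementation.

```python
# Template text taken from the slide; the placeholders are filled per question.
PROMPT_TEMPLATE = (
    "With these rules: {rules}\n"
    "And this text: {context}\n"
    "Given the above text, answer the question: {question}"
)


def build_prompt(rules, context, question):
    """Fill the template. `rules` is a list of rule strings (hypothetical shape)."""
    return PROMPT_TEMPLATE.format(
        rules=" ".join(rules),
        context=context,
        question=question,
    )


prompt = build_prompt(
    rules=["Format the answer in Markdown."],
    context="Xata is a serverless data platform for PostgreSQL.",
    question="What is Xata?",
)
print(prompt)
```

The resulting string would then be sent as the message content of a chat completion request.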

Slide 6

Slide 6 text

Example rules
● “Your name is DanGPT”
● “Answer with the personality of a pirate”
● “Answer in German”
● “Format the answer in Markdown”
● “If the answer to the question is not found in the provided context, do not answer it. It’s very important that you only answer questions from the provided context. Tell the user that you don’t have this information”

Slide 7

Slide 7 text

Context - a search problem
● It can only work if the correct context is found
● The bigger the context, the better the chances

Slide 8

Slide 8 text

Context sizes for the OpenAI models

Model              Max context (tokens)  Max context (pages)  Input price          Final cost per question
gpt-3.5-turbo      4,096                 6 pages              $0.0015 / 1K tokens  ~ $0.01 / question
gpt-3.5-turbo-16k  16,384                24 pages             $0.003 / 1K tokens   ~ $0.05 / question
gpt-4              8,192                 12 pages             $0.03 / 1K tokens    ~ $0.25 / question
gpt-4-32k          32,768                48 pages             $0.06 / 1K tokens    ~ $2 / question
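The per-question figures for the models listed above are close to what you get by filling the whole context window at the input price. The sketch below reproduces that arithmetic. It assumes a completely full context and ignores output tokens, which is a guess at how the estimates were derived and not something the slide states.

```python
# (max context tokens, USD per 1K input tokens), copied from the slide.
MODELS = {
    "gpt-3.5-turbo": (4096, 0.0015),
    "gpt-3.5-turbo-16k": (16384, 0.003),
    "gpt-4": (8192, 0.03),
    "gpt-4-32k": (32768, 0.06),
}


def max_input_cost(model):
    """Input cost in USD when the entire context window is used."""
    max_tokens, price_per_1k = MODELS[model]
    return max_tokens / 1000 * price_per_1k


for name in MODELS:
    print(f"{name}: ~${max_input_cost(name):.3f} per question")
```

Under these assumptions the results are roughly $0.006, $0.049, $0.246 and $1.97, in line with the rounded values on the slide.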

Slide 9

Slide 9 text

The search step

Keyword search (classic) vs vector search (semantic)

Slide 10

Slide 10 text

Keyword search algorithm
● Transform the question into keywords. We use ChatGPT for it, with this prompt:
  “Extract keywords for a search query from the text provided. Add synonyms for words where a more common one exists.”
● Use the provided keywords to run a free-text search
● Take the top 3 results
● If the results are bigger than the context, use the search “highlights” to select parts of the pages
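The last two steps (take the top 3 results, fall back to highlights when they do not fit) can be sketched as below. Everything here is illustrative: the function name `build_keyword_context`, the result shape (dicts with a "content" string and a "highlights" list) and the character-based budget are assumptions, not the API of any particular search engine. A real implementation would count tokens, not characters.

```python
def build_keyword_context(results, max_chars, top_n=3):
    """Concatenate the top results; use only highlights if they are too long.

    `results` is a list of dicts ordered by relevance, each with a
    "content" string and a "highlights" list of strings (hypothetical shape).
    """
    top = results[:top_n]
    full = "\n\n".join(r["content"] for r in top)
    if len(full) <= max_chars:
        return full

    # Too big for the context: keep highlighted snippets only, in rank order.
    parts, used = [], 0
    for r in top:
        for snippet in r["highlights"]:
            if used + len(snippet) > max_chars:
                return "\n".join(parts)
            parts.append(snippet)
            used += len(snippet) + 1  # +1 accounts for the joining newline
    return "\n".join(parts)


# Made-up search results, standing in for what a free-text search returns.
results = [
    {"content": "a" * 50, "highlights": ["install the CLI", "npm install"]},
    {"content": "b" * 50, "highlights": ["xata init"]},
]
print(build_keyword_context(results, max_chars=30))
```

With a budget of 30 characters the full pages do not fit, so only the first two highlights are returned.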

Slide 11

Slide 11 text

Vector search algorithm
● Preparation: Split up the docs into paragraphs, and compute embeddings for each
● Compute embeddings for the question
● Use cosine similarity search
● Add up top results until the context is filled
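The steps above can be sketched end to end in a few lines. To keep the example self-contained, `embed` is a toy bag-of-words stand-in for a real embedding model, and the context budget is measured in characters instead of tokens. Both are assumptions made for illustration, as are all function names.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_vector_context(docs, question, max_chars):
    # Preparation: split the docs into paragraphs and embed each one.
    paragraphs = [p.strip() for d in docs for p in d.split("\n\n") if p.strip()]
    index = [(p, embed(p)) for p in paragraphs]

    # Query time: embed the question and rank paragraphs by similarity.
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine_similarity(q, item[1]), reverse=True)

    # Add top results until the context budget is used up.
    context, used = [], 0
    for text, _ in ranked:
        if used + len(text) > max_chars:
            break
        context.append(text)
        used += len(text)
    return context


docs = [
    "Install the CLI with npm.\n\nBranches work like git branches.",
    "Use filters to query a table.",
]
print(build_vector_context(docs, "how do I install the CLI", max_chars=30))
```

In practice the paragraph embeddings would be computed once and stored, so that only the question is embedded at query time.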

Slide 12

Slide 12 text

Accuracy

Question                                                   Keyword search result  Vector search result  Winner
How do I install the Xata CLI?                             ✅                     ✅                    Vector (more complete)
How do you use Xata with Deno?                             ✅                     ❌                    Keyword
How can I import a CSV file with custom column types?      ✅                     ✅                    Keyword (more complete)
How can I filter a table named Users by the email column?  ❌                     ✅                    Vector
What is Xata?                                              ✅                     ✅                    Draw

Result: draw
More details: https://xata.io/blog/keyword-vs-semantic-search-chatgpt

Slide 13

Slide 13 text

Convenience
✋ With vector search you need to split your docs into paragraphs and calculate embeddings for them, then maintain the embeddings on updates
✋ Vector search doesn’t need synonym dictionaries, analyzers, etc.
Result: draw

Slide 14

Slide 14 text

Tunability
✅ Keyword search has the usual controls: boosters, column weights, relevancy functions
🟧 Vector search has cosine similarity, filters, negative matches
Winner: keyword search