Unlocking the potential of vector embeddings


Frank van Dijk

May 22, 2026

Transcript

  1. How I was taught SEO works

    Find the right keywords (keyword research)
    Add them to your page (follow a checklist)
    Rank #1 and receive clicks (proudly report 😎)
  2. But the clicks never came

    Ranked on page 2 or 3 for the target keywords
    Followed every best practice
    Still no meaningful traffic
    The checklist was green, but the results were red
    [Chart: expectations vs. reality, clicks over time]
  3. Looked at my competitors

    Expectations: "Competitors ranking #1 must be using the keyword everywhere"
    My page: "Best running shoes for beginners 2026"
    Reality: They ranked without even using the exact keyword
    Competition: "How to pick your first pair of running shoes"
    My page was "optimised" and theirs wasn't, yet they still won
  4. What are vector embeddings?

    Piece of content (“running shoes”) → Embedding model (OpenAI / Google / …) → Vector [0.24, -0.87, 0.41, ...]
    Text embeddings are abstract, high-dimensional vectors that capture the semantic meaning of text, converted into numbers so that a computer can process them.
  5. We can plot this vector in multi-dimensional space

    We’ve got a vector (8, 13): x = 8, y = 13
    LLMs do this too, but with thousands of dimensions, which is impossible for us humans to visualise
  6. The Netflix blindspot

    Netflix Research paper (2024): Cosine similarity of learned embeddings can yield arbitrary results. The underlying reason is that the learned embeddings have a degree of freedom that can render arbitrary cosine-similarities.
    Cosine similarity measures the angle between two vectors, not the actual distance between them
    This can cause false similarities when comparing content within the same topic
  7. Euclidean Distance

    The direct “line-of-sight” distance between two points in the vector space
    The smaller the distance, the more similar the vectors are
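The difference between the two measures is easiest to see on a tiny example. The sketch below is illustrative only (the two vectors are made up): both point in exactly the same direction, so cosine similarity is 1, yet they are far apart in Euclidean terms.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def euclidean_distance(a, b):
    """Straight-line distance between the points a and b."""
    return math.dist(a, b)


# Made-up vectors: same direction, very different length.
short_vec = [1.0, 2.0]
long_vec = [10.0, 20.0]

print(round(cosine_similarity(short_vec, long_vec), 4))   # 1.0
print(round(euclidean_distance(short_vec, long_vec), 4))  # 20.1246
```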
  8. Combining both rules out most of the risk

    Search query vs document | Cosine | Euclidean | Match?
    Restaurant in Athens ↔ Travel guide Athens | 0.92 | 8.4 | False positive
    Restaurant in Athens ↔ Restaurants in Athens | 0.92 | 0.3 | Match
    Restaurant in Athens ↔ Apple pie recipe | 0.05 | 11.2 | No match
    Use cosine as the primary metric for semantic similarity.
    Use Euclidean distance as a validation check.
    If they don't match, investigate manually.
    But that double-check is expensive at scale
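The double-check on this slide can be written as a small decision rule. This is a sketch only: the two thresholds (`min_cosine`, `max_distance`) are hypothetical values chosen to reproduce the three rows of the slide, not numbers from the talk. Note that the check only adds information when the vectors are not length-normalised; for unit-length vectors the Euclidean distance is fully determined by the cosine.

```python
def classify_pair(cosine, euclidean, min_cosine=0.8, max_distance=1.0):
    """Label a query/document pair using cosine first, Euclidean as validation.

    The thresholds are illustrative assumptions, not values from the slide.
    """
    if cosine < min_cosine:
        return "No match"
    if euclidean > max_distance:
        return "False positive"  # similar angle, but far apart: check manually
    return "Match"


# Values taken from the slide; the query is "Restaurant in Athens".
rows = [
    ("Travel guide Athens", 0.92, 8.4),
    ("Restaurants in Athens", 0.92, 0.3),
    ("Apple pie recipe", 0.05, 11.2),
]

labels = {doc: classify_pair(cos, dist) for doc, cos, dist in rows}
for doc, label in labels.items():
    print(f"{doc}: {label}")
```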
  9. Most models use L2 normalisation

    Vector length is often noise that reflects word frequency or document size
    L2 normalisation forces each vector to have an exact length of 1.
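A minimal sketch of L2 normalisation, using made-up two-dimensional vectors. It also checks a standard identity: for unit-length vectors the squared Euclidean distance equals 2 − 2 × cosine, so after normalisation both measures rank results in the same order.

```python
import math


def l2_normalise(vec):
    """Scale vec so that its Euclidean length is exactly 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


a = l2_normalise([8.0, 13.0])
b = l2_normalise([3.0, 4.0])

# For unit vectors the dot product is the cosine similarity.
cosine = sum(x * y for x, y in zip(a, b))
distance = math.dist(a, b)

print(round(math.hypot(*a), 6))  # 1.0
# Identity for unit vectors: distance^2 == 2 - 2 * cosine
print(abs(distance ** 2 - (2 - 2 * cosine)) < 1e-12)  # True
```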
  10. How Google does this

    Matryoshka Representation Learning
    Google's embedding models use Matryoshka Representation Learning, a technique in which the most important information is contained in the first few dimensions, much like Russian nesting dolls.
  11. Just like those Russian nesting dolls

    The outer doll contains the main subject, the next ones add more details, and the inner doll captures the subtlest nuances
    From broad to detailed:
    dim 1-64 (64 dimensions): Food
    dim 65-256 (256 dimensions): Greece restaurant
    dim 257-768 (768 dimensions): Restaurant in Athens
  12. We can cut the embeddings

    Dimensions | 64 | 256 | 768
    Pace | Fast | Decent | Slow
    Quality | Rough | Good | Precise
    Use | First filter | Balance | Final result
    Up to 30 times less storage space for 64 dimensions compared to 768 dimensions
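Cutting a Matryoshka embedding usually means keeping the first N values and re-normalising the result. The sketch below assumes float32 storage (4 bytes per value) and uses a synthetic stand-in vector rather than a real model output. Counting dimensions alone, going from 768 to 64 is a 12x reduction; a higher factor would presumably also require storing the values at lower precision, which this sketch does not do.

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first `dims` values and re-normalise to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


def storage_bytes(pages, dims, bytes_per_value=4):
    """Raw vector storage, assuming float32 values (an assumption)."""
    return pages * dims * bytes_per_value


# Synthetic 768-dimensional stand-in for a real embedding.
full = [math.sin(i + 1) for i in range(768)]
small = truncate_embedding(full, 64)

ratio = storage_bytes(1_000_000, 768) / storage_bytes(1_000_000, 64)
print(len(small), ratio)  # 64 12.0
```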
  13. Why this matters

    Traditional:
      Always requires all 768 dimensions
      Time-consuming with large datasets
      Does not scale well with thousands of pages
    Matryoshka:
      Fast pre-filter with 64–256 dimensions
      Precision with all dimensions as needed
      Scalable for thousands of pages
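The pre-filter idea can be sketched as a two-stage search: rank everything on the first few dimensions, then re-rank only the shortlist on the full vector. The function name, parameters and the four-dimensional toy vectors below are all made up for illustration; real embeddings would use 64 to 256 dimensions for the first pass.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def two_stage_search(query, docs, coarse_dims, shortlist, top_k):
    """Cheap pass on the leading dimensions, precise pass on the shortlist."""
    coarse = sorted(
        docs,
        key=lambda url: cosine(query[:coarse_dims], docs[url][:coarse_dims]),
        reverse=True,
    )[:shortlist]
    return sorted(coarse, key=lambda url: cosine(query, docs[url]), reverse=True)[:top_k]


# Toy vectors: the first two values carry the broad topic, the rest the detail.
docs = {
    "/restaurants-athens": [0.9, 0.1, 0.8, 0.1],
    "/travel-guide-athens": [0.9, 0.1, 0.1, 0.8],
    "/apple-pie": [0.1, 0.9, 0.5, 0.5],
}
query = [1.0, 0.0, 0.9, 0.0]

result = two_stage_search(query, docs, coarse_dims=2, shortlist=2, top_k=1)
print(result)  # ['/restaurants-athens']
```

The coarse pass cannot tell the two Athens pages apart; the full-dimension pass on the shortlist does.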
  14. From query to meaning

    Keyword: keyword matching, BM25 inverted index
    RankEmbed: embeddings, semantic match
    DeepRank: deep understanding
    NavBoost: user signals and click data
    Source: DOJ lawsuit against Google (2023–2024), testimony of Pandu Nayak
    RankEmbed evaluates the semantic similarity between a search query and a document, exactly what embeddings do
  15. The same seems to be going on for AIO (AI Overviews)

    Before: 33%, after: 54%
    This hasn’t been officially confirmed by Google
    We saw a 64% increase in citations by improving content based on cosine similarity
  16. The major models

    Model | Provider | Dim | Max tokens | MTEB | Price/1M | Matryoshka
    gemini-embedding-001 | Google | 3,072 | 2,048 | 68.3 | $0.15 | Yes
    text-embedding-005 | Google | 768 | 2,048 | 63.8 | $0.006 | Yes
    text-embedding-3-large | OpenAI | 3,072 | 8,191 | 64.6 | $0.13 | Yes
    Embed v4 | Cohere | 1,536 | 128k | 65.2 | $0.12 | Yes
    Voyage-3-large | Voyage AI | 1,024 | 32k | 67.1 | $0.06 | No
    MTEB = Massive Text Embedding Benchmark, the standard benchmark. Scores are not always directly comparable across versions.
    All major models now support Matryoshka
  17. More is not always better

    [Chart comparing gemini-embedding-001, text-embedding-3-large, Voyage-3-large, text-embedding-005 and Cohere Embed v4]
    More dimensions = more nuance, but also more storage and compute
    Google's text-embedding-005 scores competitively with just 768 dimensions, comparable to models with 4x larger vectors
    1M pages × 768 dim × 4 bytes = ~3 GB
    1M pages × 3,072 dim × 4 bytes = ~12 GB
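The storage figures on this slide are plain arithmetic. As a quick check, assuming one vector per page, 4 bytes per value (float32) and decimal gigabytes:

```python
def index_size_gb(pages, dims, bytes_per_value=4):
    """Raw size of the vectors alone, in decimal GB (1 GB = 10^9 bytes)."""
    return pages * dims * bytes_per_value / 1e9


small = index_size_gb(1_000_000, 768)
large = index_size_gb(1_000_000, 3_072)
print(small, large)  # 3.072 12.288
```

Index overhead and metadata columns come on top of this, so treat the numbers as a lower bound.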
  18. Text only vs. text + images

    Text only:
      Product description → embedding
      Search query → embedding
      Compare: cosine similarity
      Product images are ignored
    Text + images (recent models support multimodal):
      Product description + photo → embedding
      Search query → embedding
      Compare: cosine similarity
      A more complete picture for vector search
    Models that support multimodal: gemini-embedding-2-preview, Cohere Embed v4, Voyage-4
  19. Choosing your model

    Budget: text-embedding-005 (768 dimensions, $0.006/1M tokens)
    Quality: gemini-embedding-001 (3,072 / 1,024 dimensions, top MTEB scores)
    Multimodal: gemini-embedding-2-preview (text + images, single vector space)
    Start cheap, scale up when needed. Matryoshka makes it flexible
  20. This unlocks many possibilities

    But first you need somewhere to store your embeddings
    Cannibalization detection
    Topical authority auditing
    Content gap analysis
    Internal link suggestions
    Clustering
    Redirect mapping
    Duplicate content detection
    Hreflang tag mapping
  21. You need a vector database

    Content → Embedding model → Vector embeddings → Vector database → Script
    You've embedded thousands of pages
    You can't work with it in a spreadsheet
    You need something that can quickly find the nearest vectors to any query
  22. What is a vector database?

    A database optimised for storing embeddings and performing similarity search. You put vectors in, and ask: "give me the 10 vectors closest to this query."
    Dedicated: Pinecone, Weaviate, ChromaDB
    Existing tools + vectors: BigQuery, PostgreSQL + pgvector, Elasticsearch
    For most cases BigQuery is the best option
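What "give me the 10 vectors closest to this query" means can be shown with a brute-force search in plain Python. This is only a conceptual sketch with made-up three-dimensional vectors; dedicated vector databases typically rely on approximate nearest-neighbour indexes instead of comparing against every stored vector.

```python
import heapq
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def nearest(query, index, k=10):
    """Return the k (score, url) pairs with the highest cosine similarity."""
    return heapq.nlargest(k, ((cosine(query, vec), url) for url, vec in index.items()))


# Toy index; real embeddings have hundreds or thousands of dimensions.
index = {
    "/shoes/trail-running": [0.9, 0.1, 0.2],
    "/guides/trail-running-guide": [0.8, 0.2, 0.3],
    "/blog/summer-outfits": [0.1, 0.9, 0.1],
}

top = nearest([1.0, 0.1, 0.2], index, k=2)
for score, url in top:
    print(f"{score:.3f}  {url}")
```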
  23. How to build one

    Crawl your site: with Screaming Frog, Sitebulb or Python
    Chunk your content: per page, per H2 section or per paragraph
    Generate embeddings: Gemini, OpenAI, Cohere or Voyage AI API
    Store in database: BigQuery, Pinecone or ChromaDB
    Example BigQuery setup with 3 tables: full page embeddings, H2-based section chunks, internal links
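The "per H2 section" chunking step could look roughly like this with Python's standard-library HTML parser. It is a simplified sketch: the class and function names are made up, everything before the first H2 lands in an "(intro)" chunk, and a real crawler would also need to skip navigation, scripts and footers.

```python
from html.parser import HTMLParser


class H2Chunker(HTMLParser):
    """Collect page text into one chunk per <h2> section."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._heading = "(intro)"
        self._text = []
        self._in_h2 = False

    def flush(self):
        """Close the current section and store it if it has any text."""
        text = " ".join(" ".join(self._text).split())
        if text:
            self.chunks.append({"heading": self._heading, "text": text})
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.flush()
            self._heading = ""
            self._in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False
            self._heading = self._heading.strip()

    def handle_data(self, data):
        if self._in_h2:
            self._heading += data
        else:
            self._text.append(data)


def chunk_by_h2(html):
    parser = H2Chunker()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.chunks


page = (
    "<h1>Trail running guide</h1><p>Intro text.</p>"
    "<h2>Choosing the right shoes</h2><p>Grip matters.</p>"
    "<h2>Training schedule</h2><p>Start slow.</p>"
)
chunks = chunk_by_h2(page)
for chunk in chunks:
    print(chunk["heading"], "->", chunk["text"])
```

Each chunk would then be sent to the embedding API and stored as one row in the section-chunks table.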
  24. Build RAG on top of it

    Retrieval Augmented Generation: instead of letting an LLM guess, you first retrieve the most relevant content from your vector database, then feed it to the model as context
    Prompt → Vector DB search → Relevant content → LLM call → Grounded answer
    Without RAG: the LLM relies on search and training data, which carries a risk of hallucinations
    With RAG: the LLM answers based on your actual input, which gives better grounded output
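The retrieval half of RAG can be sketched without any external service. In the example below the store, its texts and its vectors are all made up, and the actual LLM call is deliberately left out: the sketch stops at the prompt you would send to the model.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def retrieve(query_vec, store, k=2):
    """Return the k stored passages most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item["vector"]), reverse=True)
    return ranked[:k]


def build_prompt(question, passages):
    """Put the retrieved passages in front of the question as context."""
    context = "\n".join(f"- {p['text']}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Made-up passages with toy embeddings.
store = [
    {"text": "The Trail X shoe has a 4 mm lug outsole.", "vector": [0.9, 0.1, 0.0]},
    {"text": "Trail X is available in sizes 36 to 48.", "vector": [0.8, 0.3, 0.1]},
    {"text": "Our newsletter is sent every Friday.", "vector": [0.0, 0.1, 0.9]},
]

question = "Which sizes does the Trail X come in?"
query_vec = [1.0, 0.2, 0.0]  # stand-in for the embedded question

prompt = build_prompt(question, retrieve(query_vec, store, k=2))
print(prompt)
```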
  25. In addition to being incredibly useful on its own, a

    vector database can also be used for SEO purposes
  26. It helps identify content opportunities

    Prompt → Vector DB search → Relevant content → LLM call → Grounded answer
    Sources: product brochures/manuals, product information that wasn’t published
    This allowed us to quickly identify content opportunities on existing blogs, product categories, and product pages using RAG.
    We used cosine similarity to identify relevant product information
  27. Adding sales & support intelligence

    Support tickets, sales notes, call transcripts → Vector database
    Embedding clusters reveal hidden patterns: feature requests, installation problems, process questions, specific complaints
    The client was also monitoring sales and support activities like tickets, notes and transcripts
    This showed a pattern of information demand that keyword tools did not reveal
  28. Keeping your vector database fresh

    A vector database is only useful if it reflects your current site. Google Cloud lets you automate this with a simple Python script.
    1. Fetch sitemap: parse XML sitemap for all URLs + lastmod dates
    2. Compare lastmod: check against BigQuery which pages are new or updated
    3. Embed pages: only fetch + embed pages with a newer lastmod, skip the rest
    4. Update BigQuery: upsert new embeddings into your vector table
    Scheduled: daily / weekly via Cloud Run or Cloud Functions
    Only re-embed changes: 90%+ cheaper than full re-crawl
    Lastmod as trigger: sitemap XML or HTML meta tag
    Runs inside Google Cloud: Cloud Run / Cloud Functions + Scheduler
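Steps 1 to 3 can be sketched with the standard library alone. The sitemap content and the `stored` dictionary below are made up; in the setup described here, `stored` would be the result of a BigQuery query returning each URL's last embedded date. Pages without a lastmod are simply skipped in this sketch.

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Map each URL in the sitemap to its lastmod date."""
    root = ET.fromstring(xml_text)
    pages = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if loc and lastmod:
            # lastmod may carry a time part; the first 10 characters are the date.
            pages[loc.strip()] = date.fromisoformat(lastmod.strip()[:10])
    return pages


def pages_to_embed(sitemap, stored):
    """URLs that are new, or whose lastmod is newer than the stored one."""
    return sorted(
        url for url, lastmod in sitemap.items()
        if url not in stored or lastmod > stored[url]
    )


SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2026-05-01</lastmod></url>
  <url><loc>https://example.com/b</loc><lastmod>2026-05-20T10:00:00+00:00</lastmod></url>
  <url><loc>https://example.com/c</loc><lastmod>2026-04-11</lastmod></url>
</urlset>"""

# Stand-in for what the vector table already holds.
stored = {
    "https://example.com/a": date(2026, 5, 1),
    "https://example.com/b": date(2026, 5, 2),
}

todo = pages_to_embed(parse_sitemap(SITEMAP), stored)
print(todo)  # ['https://example.com/b', 'https://example.com/c']
```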
  29. Take action fast on deleted pages

    Your vector database knows every page on your site. When a page disappears from the sitemap, you can act before users hit a 404.
    1. Deleted page: when deleted pages are found during the refresh, a new flow starts
    2. Backlink/traffic: check for incoming backlinks (Ahrefs) and traffic (GSC) generated for that URL
    3. Find similar page: check whether a very similar page exists that could serve as a redirect target
    4. Recommendation: based on the incoming backlinks and traffic, give a recommendation ⇒ Redirect ⇒ New content
    Manual check
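One possible shape for this flow is sketched below. The decision rule is an assumption, since the slide does not spell it out: pages with no backlinks or traffic go to a manual check, pages with value and a close match get a redirect suggestion, and pages with value but no close match are flagged for new content. The `min_similarity` threshold, the URLs, vectors and backlink/click numbers are all made up; in practice the last two would come from Ahrefs and GSC exports.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def deleted_urls(stored_urls, sitemap_urls):
    """URLs that are in the vector table but no longer in the sitemap."""
    return sorted(set(stored_urls) - set(sitemap_urls))


def recommend(url, vectors, live_urls, backlinks, clicks, min_similarity=0.85):
    """Hypothetical rule: returns (action, redirect_target_or_None)."""
    if backlinks.get(url, 0) == 0 and clicks.get(url, 0) == 0:
        return ("Manual check", None)
    best = max(live_urls, key=lambda u: cosine(vectors[url], vectors[u]))
    if cosine(vectors[url], vectors[best]) >= min_similarity:
        return ("Redirect", best)
    return ("New content", None)


vectors = {
    "/shoes/old-trail": [0.9, 0.1],
    "/shoes/new-trail": [0.85, 0.15],
    "/blog/summer-outfits": [0.1, 0.9],
    "/shoes/retired-sandal": [0.5, 0.5],
    "/guides/nutrition": [-0.9, 0.2],
}
live = ["/shoes/new-trail", "/blog/summer-outfits"]
backlinks = {"/shoes/old-trail": 12}
clicks = {"/guides/nutrition": 40}

gone = deleted_urls(vectors, live)
advice = {url: recommend(url, vectors, live, backlinks, clicks) for url in gone}
for url, (action, target) in advice.items():
    print(url, action, target)
```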
  30. What does it cost?

    Storage: $0.02/GB/month. After 90 days untouched: $0.01/GB (automatic). First 10 GB free
    Queries: $6.25/TB scanned. First 1 TB/month free. Use partitioning + clustering to reduce scans by 60–90%
    Embedding API: $0.025/1M tokens (text-embedding-005). One-time cost per page. Only re-embed on changes
    Example: 5,000 page webshop, < $1/month & < $1 setup
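A back-of-the-envelope check of the 5,000-page example, using the prices from this slide. Two inputs are assumptions not given in the talk: roughly 1,500 tokens per page, and 768-dimensional float32 vectors.

```python
def embedding_cost(pages, tokens_per_page, price_per_million=0.025):
    """One-time embedding cost in dollars."""
    return pages * tokens_per_page / 1_000_000 * price_per_million


def monthly_storage_cost(gb, price_per_gb=0.02, free_gb=10):
    """Monthly storage cost in dollars after the free tier."""
    return max(0.0, gb - free_gb) * price_per_gb


pages = 5_000
tokens_per_page = 1_500                # assumption, not from the slide
vector_gb = pages * 768 * 4 / 1e9      # 768 float32 values per page (assumption)

setup = embedding_cost(pages, tokens_per_page)
monthly = monthly_storage_cost(vector_gb)
print(f"setup ${setup:.4f}, storage ${monthly:.2f}/month")
```

Under these assumptions the one-time embedding cost is about $0.19 and the vectors fit well inside the free storage tier, which is consistent with the "< $1" figures on the slide.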
  31. Also works without a vector database

    Vector database:
      Embeddings stored permanently
      Query anytime without re-embedding
      Supports RAG, clustering, similarity search
      Scales to thousands of pages
      Best for ongoing analysis, RAG pipelines, large sites
    Script based:
      Generate embeddings on the fly
      No database to maintain
      Re-embed every time you run analysis
      Costs more API calls over time
      Best for quick experiments and one-off analyses
  32. SF has embedding functions built in

    Since v20, Screaming Frog generates embeddings during crawl via OpenAI, Gemini, or Ollama. No code needed, just an API key and a paid licence. The results can also be analysed inside the tool
    Setup in Screaming Frog:
      1. Config > Spider > Extraction > enable Store HTML
      2. Config > Spider > Rendering > enable JavaScript
      3. Config > API Access > AI and connect OpenAI, Gemini or Ollama
      4. Add from Library > "Extract Semantic Embeddings from Page" and add API
    Built-in functions in Screaming Frog:
      Semantic similarity: automatically finds pages with similar content
      Low relevance detection: flags pages that deviate from your site's average
      Redirect mapping: match old URLs to closest new URLs during migrations
      Embedding export: CSV with URL + embedding vector for further analysis
  33. Visualising your website

    1. Embed every page
    2. Reduce dimensions to 2D
    3. Plot as a scatter plot
    4. Similar pages appear closer
    One bubble chart that reveals your site's semantic structure
    [Bubble chart: Cluster 1 to Cluster 6, plus possible outliers]
  34. These bubble charts reveal a lot

    Gaps = content opportunities
    Tight clusters = strong topical authority
    Pages far from any cluster may be off-topic or outdated
    Combining competitors' sites can show content gaps
    Outliers check
    Topical authority check
  35. How to build it

    Python: data pipeline. Crawling, chunking and embedding
    Plotly: interactive data charts. Create interactive charts to analyse, zoom, filter and more
    Streamlit: web app without frontend code. Share with your team via a URL and add filters, search, sidebar controls
    ~20 lines of Python: from embeddings to a shareable web app
  36. Sheet outputs also work

    Page a | Page b | Cosine | Action
    /shoes/nike-air-max | /shoes/nike-air-max-review | 0.97 | Duplicate check
    /shoes/nike-air-max | /shoes/adidas-ultraboost | 0.82 | Internal link
    /shoes/trail-running | /guides/trail-running-guide | 0.78 | Internal link
    /shoes/nike-air-max | /blog/summer-outfits | 0.31 | No action
    /category/running | /category/running-shoes | 0.96 | Duplicate check
    /guides/sizing | /faq/shoe-sizes | 0.88 | Internal link
    > 0.95 = Duplicate check
    0.70 - 0.95 = Related content
    < 0.70 = Not related content
    The scatter plot is great for understanding your site, but for finding to-dos, you need a spreadsheet
  37. The action framework

    Cosine similarity scale from 0 to 1, with thresholds at 0.70 and 0.95
    < 0.70: Different topics. No semantic relationship. Expected for most pairs. No action needed
    0.70 - 0.95: Potential internal link
    > 0.95: Duplicate check
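The framework maps directly onto a small function. The thresholds come from the slide and the sample pairs from the spreadsheet example in the previous slide; only the function name is made up.

```python
def action_for(cosine):
    """Map a cosine similarity score to the action from the framework."""
    if cosine > 0.95:
        return "Duplicate check"
    if cosine >= 0.70:
        return "Internal link"
    return "No action"


pairs = [
    ("/shoes/nike-air-max", "/shoes/nike-air-max-review", 0.97),
    ("/shoes/nike-air-max", "/shoes/adidas-ultraboost", 0.82),
    ("/shoes/trail-running", "/guides/trail-running-guide", 0.78),
    ("/shoes/nike-air-max", "/blog/summer-outfits", 0.31),
    ("/category/running", "/category/running-shoes", 0.96),
    ("/guides/sizing", "/faq/shoe-sizes", 0.88),
]

actions = [action_for(score) for _, _, score in pairs]
for (page_a, page_b, score), action in zip(pairs, actions):
    print(f"{page_a}\t{page_b}\t{score}\t{action}")
```

The tab-separated output can be pasted straight into a spreadsheet.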
  38. Topical authority scoring

    Embed all pages in a topic cluster. Calculate the centroid, the average vector. Then measure how close each page is to that centroid.
    Tight cluster = strong authority: all pages reinforce the same topic
    Scattered pages = weak authority: pages drift away from the core topic
    Topic cluster | Pages | Avg distance to centroid | Authority score
    Running shoes | 24 | 0.12 | Strong
    Outdoor gear | 18 | 0.18 | Good
    Nutrition | 6 | 0.35 | Developing
    Lifestyle | 3 | 0.52 | Weak
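A minimal sketch of the centroid calculation. The slide does not say which distance is used, so this example assumes cosine distance (1 minus cosine similarity); the two toy clusters are made up.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def centroid(vectors):
    """Element-wise average of all vectors in the cluster."""
    count = len(vectors)
    return [sum(vec[i] for vec in vectors) / count for i in range(len(vectors[0]))]


def avg_distance_to_centroid(vectors):
    centre = centroid(vectors)
    return sum(cosine_distance(vec, centre) for vec in vectors) / len(vectors)


# Toy clusters: one tightly focused, one scattered.
tight = [[1.0, 0.10], [1.0, 0.20], [1.0, 0.15]]
scattered = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

tight_score = avg_distance_to_centroid(tight)
scattered_score = avg_distance_to_centroid(scattered)
print(f"tight: {tight_score:.4f}, scattered: {scattered_score:.4f}")
```

A lower average distance means the pages sit closer to the core topic.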
  39. Passage level scoring

    Don't just embed pages, embed sections. This reveals which parts of a page are relevant and which are filler
    Example: /guides/trail-running
      H1: Trail running guide | 0.92
      H2: Choosing the right shoes | 0.88
      H2: Training schedule | 0.85
      H2: Our store locations | 0.31
      H2: Newsletter signup | 0.15
    What this reveals:
      High-scoring sections are on-topic and strengthen the page
      Low-scoring sections are off-topic filler that dilutes relevance
    Action: remove or move low-scoring sections to keep the page focused
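Section scoring could be sketched as follows. The slide does not state what each section is compared against, so this example assumes a single topic vector (for instance the embedding of the target query or the page title). The vectors and the 0.5 cut-off are made up.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def score_sections(topic_vec, sections, min_score=0.5):
    """Score each section against the topic and flag likely filler."""
    report = []
    for heading, vec in sections.items():
        score = cosine(topic_vec, vec)
        report.append((heading, round(score, 2), score < min_score))
    return report


topic = [1.0, 0.0]  # stand-in for the embedded page topic
sections = {
    "Choosing the right shoes": [0.9, 0.2],
    "Training schedule": [0.8, 0.4],
    "Our store locations": [0.3, 0.9],
    "Newsletter signup": [0.1, 1.0],
}

report = score_sections(topic, sections)
filler = [heading for heading, _, is_filler in report if is_filler]
print(filler)  # ['Our store locations', 'Newsletter signup']
```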
  40. Add semantics to authority building

    Traditional, DR: domain authority score. High = good, but says nothing about relevance
    Traditional, Traffic: how many visitors. More = better, but also no relevance signal
    New, Similarity: embed the domains. Closer vectors = more relevant link
    A DR 30 link from a semantically close site can outperform a DR 70 link from an unrelated domain
  41. Authority building assistant

    Embed your site → Embed prospect domains → Calculate similarity → Combine with DR + traffic
    Prospect | DR | Traffic | Semantic sim. | Priority
    running-magazine.com | 62 | 180K | 0.89 | High
    general-lifestyle-blog.com | 71 | 350K | 0.42 | Low
    local-sports-shop.nl | 28 | 12K | 0.91 | High
    tech-review-site.com | 55 | 200K | 0.35 | Low
    outdoor-gear-blog.com | 45 | 85K | 0.82 | Medium
    Closer vectors = more topically relevant link
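The slide does not say how similarity, DR and traffic are combined. The sketch below uses a hypothetical rule that happens to reproduce the priorities in the table: similarity sets the tier (0.85 and above is High, 0.70 and above is Medium, otherwise Low), and DR only orders prospects within a tier. The thresholds are assumptions.

```python
def priority(similarity, high=0.85, medium=0.70):
    """Tier a prospect by semantic similarity (thresholds are assumptions)."""
    if similarity >= high:
        return "High"
    if similarity >= medium:
        return "Medium"
    return "Low"


# (domain, DR, monthly traffic, semantic similarity) as shown on the slide.
prospects = [
    ("running-magazine.com", 62, 180_000, 0.89),
    ("general-lifestyle-blog.com", 71, 350_000, 0.42),
    ("local-sports-shop.nl", 28, 12_000, 0.91),
    ("tech-review-site.com", 55, 200_000, 0.35),
    ("outdoor-gear-blog.com", 45, 85_000, 0.82),
]

tier_rank = {"High": 0, "Medium": 1, "Low": 2}
labels = {domain: priority(sim) for domain, _, _, sim in prospects}

# Sort by tier first, then by DR (highest first) within the tier.
ranked = sorted(prospects, key=lambda p: (tier_rank[labels[p[0]]], -p[1]))
for domain, dr, traffic, sim in ranked:
    print(f"{labels[domain]:<6} {domain} (DR {dr}, sim {sim})")
```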
  42. Detect differences in statements

    To appear credible to Google and LLMs, it’s important to be consistent in your statements. With embeddings, you can find similar text to ensure consistency
    /product/shoe-a: Unique content, Statement about product, Statement about product
    /blog/shoe-a: Unique content, Statement about product, Statement about product
    /product-category/shoes-a: Unique content, Statement about product, Statement about product
    Using cosine similarity, statements can be quickly linked when you check them at the H2 level along with the underlying paragraph
  43. "Don't get lost in tools and technology. Focus on your

    customer, not on algorithms." You've probably heard this And they're right!
  44. But at scale, you need a starting point

    50 pages:
      You can read every page
      Manually decide what to change
      Manual review works. And with this number of pages, you should
    50,000 pages:
      You can't read every page
      You can't manually compare 1.2 billion pairs
      You need something to tell you where to look
    These techniques don't replace your judgment, they give you a starting point. The human decision still matters. But now you're making it with data instead of guessing where to begin.
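The pair count follows from the formula n × (n − 1) / 2 for comparing every page with every other page once:

```python
import math

# Number of unique page pairs for a small and a large site.
small_site = math.comb(50, 2)
large_site = math.comb(50_000, 2)

print(small_site)  # 1225
print(large_site)  # 1249975000
```

So 50 pages give 1,225 pairs, while 50,000 pages give roughly 1.25 billion.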
  45. Start here

    Basic: Using Screaming Frog + embeddings. Start crawling websites with embeddings and use standard reports
    Moderate: Start using code. Code yourself or use AI to create Python scripts to analyse further
    Advanced: Build and maintain a vector database. Advance by creating a vector database and analyse with that data
    Advanced: Implement further in the company. Create workflows for different projects and let non-technical SEOs use it
  46. What we covered

    1. Embeddings capture meaning, not keywords
    2. Google uses them in Search and AI Overviews
    3. You can use the same models yourself
    4. Vector databases make it scalable
    5. RAG grounds AI in your own data
    6. Tools like Screaming Frog make it accessible