Unlocking the potential of vector embeddings

Athens SEO 2026 welcomes on stage
Frank van Dijk, SEO specialist, DEPT®

How I was taught SEO works
- Find the right keywords (keyword research)
- Add them to your page (follow a checklist)
- Rank #1 and receive clicks (proudly report 😎)

But the clicks never came
- Ranked on page 2 or 3 for the target keywords
- Followed every best practice
- Still no meaningful traffic
[Chart: clicks over time, expectations vs. reality]
The checklist was green, but the results were red

Looked at my competitors
Expectations: "Competitors ranking #1 must be using the keyword everywhere." My page: "Best running shoes for beginners 2026"
Reality: they ranked without even using the exact keyword. Competition: "How to pick your first pair of running shoes"
My page was "optimised" and theirs wasn't, yet they still won

I was optimizing for keywords, while Google was reading the
meaning of the webpages

This is how I discovered embeddings, the technology behind Google's
ability to understand meaning

What are vector embeddings?
Text embeddings are abstract, high-dimensional vectors that capture the semantic meaning of text, converted into numbers so that a computer can process them.
Piece of content ("running shoes") → Embedding model (OpenAI / Google / …) → Vector [0.24, -0.87, 0.41, ...]

We can plot this vector in multi-dimensional space
We've got a vector (8, 13): x = 8, y = 13
LLMs do this too, but with thousands of dimensions, which is impossible for us humans to visualize

Cosine similarity
The angle between vectors
1.0 = Identical
0.6 = Similar in a way
0.0 = Unrelated
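
To make the scale above concrete, here is a minimal sketch of cosine similarity in plain Python. The 2-D vectors are toy values chosen for illustration, not real embeddings, and the function name is an assumption.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors: same direction but different length scores close to 1.0,
# perpendicular vectors score 0.0.
same_direction = cosine_similarity([8, 13], [16, 26])
perpendicular = cosine_similarity([1, 0], [0, 1])
print(same_direction, perpendicular)
```

Note that the first pair scores as near-identical even though one vector is twice as long: cosine similarity only looks at direction.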

The Netflix blindspot
Netflix Research paper (2024): cosine similarity of learned embeddings can yield arbitrary results. The underlying reason is that the learned embeddings have a degree of freedom that can render arbitrary cosine-similarities.
- Cosine similarity calculates the angle, not the actual difference
- This can cause false similarities when comparing content within the same topic

Euclidean distance
The direct “line-of-sight” distance between two points in the vector space. The smaller the distance, the more similar the vectors are.

Combining both rules out most of the risk
Search query vs document | Cosine | Euclidean | Match?
Restaurant in Athens ↔ Travel guide Athens | 0.92 | 8.4 | False positive
Restaurant in Athens ↔ Restaurants in Athens | 0.92 | 0.3 | Match
Restaurant in Athens ↔ Apple pie recipe | 0.05 | 11.2 | No match
Use cosine as the primary metric for semantic similarity. Use Euclidean distance as a validation check. If they don't match, investigate manually.
But that double-check is expensive at scale
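
The double-check described on this slide could be sketched as follows. The thresholds `min_cosine` and `max_distance` are hypothetical and would need tuning per model; the vectors are toy values, and the check is only meaningful for vectors that have not been length-normalised.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(query_vec, doc_vec, min_cosine=0.8, max_distance=1.0):
    """Cosine is the primary metric, Euclidean distance the validation check.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    if cosine_similarity(query_vec, doc_vec) < min_cosine:
        return "no match"
    if euclidean_distance(query_vec, doc_vec) > max_distance:
        return "investigate"  # similar angle but far apart: possible false positive
    return "match"

print(classify([1, 0], [1.1, 0.1]))  # close in angle and in distance
print(classify([1, 0], [10, 1]))     # close in angle, far in distance
print(classify([1, 0], [0, 1]))      # different direction
```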

Most models use L2 normalisation
Length is often noise: word frequency or document size.
L2 normalisation forces each vector to have a length of exactly 1.
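
As an illustration of what L2 normalisation does, the sketch below divides every component by the vector's length. The function name and the toy vector are assumptions made for this example.

```python
import math

def l2_normalize(vec):
    """Scale a vector so that its Euclidean length becomes 1."""
    length = math.sqrt(sum(x * x for x in vec))
    if length == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / length for x in vec]

unit = l2_normalize([8, 13])
unit_length = math.sqrt(sum(x * x for x in unit))
print(unit, unit_length)  # direction is kept, length is (approximately) 1
```

Once every vector has length 1, the dot product of two vectors equals their cosine similarity, which is one reason normalised embeddings are convenient to work with.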

How Google does this: Matryoshka Representation Learning
Google's embedding models use Matryoshka Representation Learning, a technique in which the most important information is contained in the first few dimensions, much like Russian nesting dolls.

Just like those Russian nesting dolls: from broad to detailed
The outer doll contains the main subject, the next ones add more detail, and the innermost doll captures the subtlest nuances.
dim 1-64: Food
dim 65-256: Greece restaurant
dim 257-768: Restaurant in Athens

We can cut the embeddings
Dimensions | Pace | Quality | Use
64 | Fast | Rough | First filter
256 | Decent | Good | Balance
768 | Slow | Precise | Final result
12 times less storage space for 64 dimensions compared to 768 dimensions (768 / 64 = 12)
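
A minimal sketch of how truncated embeddings could be used as a first filter, assuming the vectors come from a model trained with Matryoshka Representation Learning (truncating other embeddings this way generally loses too much). The 4-dimensional vectors, page names and function names are made up for illustration; real vectors would have 768 or more dimensions.

```python
import math

def truncate_and_normalize(vec, dims):
    """Keep the first `dims` values and rescale the result to length 1."""
    head = vec[:dims]
    length = math.sqrt(sum(x * x for x in head))
    return [x / length for x in head] if length else head

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def two_stage_search(query, docs, coarse_dims, shortlist_size, top_k):
    """Rank on a short prefix first, then re-rank the shortlist on all dimensions."""
    q_small = truncate_and_normalize(query, coarse_dims)
    shortlist = sorted(
        docs,
        key=lambda name: dot(q_small, truncate_and_normalize(docs[name], coarse_dims)),
        reverse=True,
    )[:shortlist_size]
    q_full = truncate_and_normalize(query, len(query))
    reranked = sorted(
        shortlist,
        key=lambda name: dot(q_full, truncate_and_normalize(docs[name], len(query))),
        reverse=True,
    )
    return reranked[:top_k]

# Toy data (hypothetical values).
docs = {
    "athens-restaurants": [0.9, 0.1, 0.4, 0.1],
    "athens-travel-guide": [0.8, 0.2, -0.4, 0.3],
    "apple-pie-recipe": [-0.1, 0.9, 0.2, 0.1],
}
query = [0.9, 0.1, 0.5, 0.0]
best = two_stage_search(query, docs, coarse_dims=2, shortlist_size=2, top_k=1)
print(best)
```

The cheap pass discards clearly unrelated pages, so the full-precision comparison only runs on the shortlist.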

Why this matters
Traditional:
- Always requires all 768 dimensions
- Time-consuming with large datasets
- Does not scale well with thousands of pages
Matryoshka:
- Fast pre-filter with 64–256 dimensions
- Precision with all dimensions as needed
- Scalable for thousands of pages

You can use Google's embedding models Google uses embeddings in
search

From query to meaning
- Keyword matching: keywords, BM25, inverted index
- RankEmbed: embeddings, semantic match
- DeepRank: deep understanding
- NavBoost: user signals and click data
RankEmbed evaluates the semantic similarity between a search query and a document, exactly what embeddings do
Source: DOJ lawsuit against Google (2023–2024), testimony of Pandu Nayak

The same seems to be going on for AI Overviews (AIO)
This hasn't been officially confirmed by Google
Before: 33%. After: 54%
We saw a 64% increase in citations by improving content based on cosine similarity

The major models
Model | Provider | Dim | Max tokens | MTEB | Price/1M tokens | Matryoshka
gemini-embedding-001 | Google | 3,072 | 2,048 | 68.3 | $0.15 | Yes
text-embedding-005 | Google | 768 | 2,048 | 63.8 | $0.006 | Yes
text-embedding-3-large | OpenAI | 3,072 | 8,191 | 64.6 | $0.13 | Yes
Embed v4 | Cohere | 1,536 | 128k | 65.2 | $0.12 | Yes
Voyage-3-large | Voyage AI | 1,024 | 32k | 67.1 | $0.06 | No
MTEB = Massive Text Embedding Benchmark, the standard benchmark. Scores are not always directly comparable across versions.
Most of the major models now support Matryoshka

More is not always better
[Chart comparing gemini-embedding-001, text-embedding-3-large, Voyage-3-large, text-embedding-005 and Cohere Embed v4]
More dimensions = more nuance, but also more storage and compute
Google's text-embedding-005 scores competitively with just 768 dimensions, comparable to models with 4x larger vectors
1M pages × 768 dim × 4 bytes = ~3 GB
1M pages × 3,072 dim × 4 bytes = ~12 GB
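
The storage figures on this slide can be reproduced with a few lines of Python. The sketch assumes one vector per page, 4 bytes per value (32-bit floats) and decimal gigabytes, and ignores any database overhead.

```python
def storage_gb(pages, dims, bytes_per_value=4):
    """Raw vector storage in GB (1 GB = 1e9 bytes), one vector per page."""
    return pages * dims * bytes_per_value / 1e9

print(storage_gb(1_000_000, 768))    # 3.072
print(storage_gb(1_000_000, 3_072))  # 12.288
```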

Text only
- Product description → embedding
- Search query → embedding
- Compare: cosine similarity
- Product images are ignored
Text + images
- Recent models support multimodal input
- Product description + photo → embedding
- Search query → embedding
- Compare: cosine similarity
- A more complete picture for vector search
Models that support multimodal input: gemini-embedding-2-preview, Cohere Embed v4, Voyage-4

Choosing your model
Budget: text-embedding-005. 768 dimensions, $0.006/1M tokens
Quality: gemini-embedding-001. 3,072 / 1,024 dimensions, top MTEB scores
Multimodal: gemini-embedding-2-preview. Text + images, single vector space
Start cheap and scale up when needed. Matryoshka makes it flexible.

This unlocks many possibilities
- Cannibalization detection
- Topical authority auditing
- Content gap analysis
- Internal link suggestions
- Clustering
- Redirect mapping
- Duplicate content detection
- Hreflang tag mapping
But first you need somewhere to store your embeddings

You need a vector database
Content → Embedding model → Vector embeddings → Vector database → Script
- You've embedded thousands of pages
- You can't work with them in a spreadsheet
- You need something that can quickly find the nearest vectors to any query

What is a vector database?
A database optimised for storing embeddings and performing similarity search. You put vectors in, and ask: "give me the 10 vectors closest to this query."
Dedicated: Pinecone, Weaviate, ChromaDB
Existing tools + vectors: BigQuery, PostgreSQL + pgvector, Elasticsearch
For most cases BigQuery is the best option
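
The core query a vector database answers ("give me the k vectors closest to this query") can be sketched as a brute-force search over an in-memory dictionary. This is only an illustration of the idea: real vector databases typically rely on approximate nearest-neighbour indexes to stay fast at scale. The URLs and vectors are invented.

```python
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest(query_vec, index, k=10):
    """Return the k (score, doc_id) pairs most similar to the query, best first."""
    scored = ((cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in index.items())
    return heapq.nlargest(k, scored)

# Hypothetical index: URL -> embedding (toy 3-D vectors).
index = {
    "/shoes/trail-running": [0.9, 0.2, 0.1],
    "/guides/trail-running-guide": [0.8, 0.3, 0.2],
    "/blog/summer-outfits": [0.1, 0.9, 0.4],
}
results = nearest([1.0, 0.2, 0.1], index, k=2)
for score, url in results:
    print(f"{score:.2f}  {url}")
```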

How to build one
1. Crawl your site: with Screaming Frog, Sitebulb or Python
2. Chunk your content: per page, per H2 section or per paragraph
3. Generate embeddings: Gemini, OpenAI, Cohere or Voyage AI API
4. Store in database: BigQuery, Pinecone or ChromaDB
Example BigQuery setup with 3 tables: full-page embeddings, H2-based section chunks, internal links

Build RAG on top of it
Retrieval-Augmented Generation: instead of letting an LLM guess, you first retrieve the most relevant content from your vector database, then feed it to the model as context.
Prompt → Vector DB search → Relevant content → LLM call → Grounded answer
Without RAG: the LLM relies on search and training data, which carries a risk of hallucinations
With RAG: the LLM answers based on your actual input, which gives better-grounded output
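
The RAG flow can be sketched without committing to a specific database or model. In the example below, `search` and `llm` are hypothetical callables that stand in for your vector database query and your LLM API call; the prompt wording is likewise just an assumption.

```python
def build_grounded_prompt(question, retrieved_chunks):
    """Put the retrieved content in front of the question as numbered context."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

def answer_with_rag(question, search, llm, top_k=3):
    """search(question, top_k) -> list of text chunks; llm(prompt) -> answer text.

    Both callables are placeholders for a real vector search and a real LLM call.
    """
    chunks = search(question, top_k)
    return llm(build_grounded_prompt(question, chunks))

# Stand-ins so the sketch runs without any external service.
def fake_search(question, top_k):
    return ["Shoe A has a 6 mm drop.", "Shoe A weighs 240 g."][:top_k]

def fake_llm(prompt):
    return f"(model answer based on a prompt of {len(prompt)} characters)"

print(answer_with_rag("What is the drop of Shoe A?", fake_search, fake_llm))
```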

In addition to being incredibly useful on its own, a
vector database can also be used for SEO purposes

It helps identify content opportunities
Sources: product brochures/manuals and product information that wasn't published
Prompt → Vector DB search → Relevant content → LLM call → Grounded answer
Using RAG, this allowed us to quickly identify content opportunities on existing blogs, product categories, and product pages. We used cosine similarity to identify relevant product information.

Adding sales & support intelligence
Support tickets, sales notes, call transcripts → Vector database → Feature requests, installation problems, process questions, specific complaints
Embedding clusters reveal hidden patterns
The client was also monitoring sales and support activities such as tickets, notes and transcripts. This showed a pattern of information demand that keyword tools did not reveal.

Keeping your vector database fresh
A vector database is only useful if it reflects your current site. Google Cloud lets you automate this with a simple Python script.
1. Fetch sitemap: parse the XML sitemap for all URLs + lastmod dates
2. Compare lastmod: check against BigQuery which pages are new or updated
3. Embed pages: only fetch + embed pages with a newer lastmod, skip the rest
4. Update BigQuery: upsert new embeddings into your vector table
Scheduled: daily / weekly via Cloud Run or Cloud Functions
- Only re-embed changes: 90%+ cheaper than a full re-crawl
- Lastmod as trigger: sitemap XML or HTML meta tag
- Runs inside Google Cloud: Cloud Run / Cloud Functions + Scheduler
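
Steps 1 and 2 above could be sketched like this. The sample sitemap is inline so the example runs on its own, and `stored_lastmod` is a plain dictionary standing in for the result of a BigQuery lookup (an assumption; fetching the sitemap and querying BigQuery are left out). The comparison assumes all lastmod values use the same `YYYY-MM-DD` format, so that string comparison matches date order.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_text):
    """Return {url: lastmod} for every <url> entry in a sitemap."""
    root = ET.fromstring(xml_text)
    pages = {}
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS)
        if loc:
            pages[loc.strip()] = lastmod.strip()
    return pages

def pages_to_embed(sitemap_pages, stored_lastmod):
    """URLs that are new, or whose lastmod is newer than the stored one."""
    return sorted(
        url
        for url, lastmod in sitemap_pages.items()
        if url not in stored_lastmod or lastmod > stored_lastmod[url]
    )

# Hypothetical sitemap and stored state.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2026-01-10</lastmod></url>
  <url><loc>https://example.com/b</loc><lastmod>2026-02-01</lastmod></url>
  <url><loc>https://example.com/c</loc><lastmod>2026-02-03</lastmod></url>
</urlset>"""
stored_lastmod = {
    "https://example.com/a": "2026-01-10",  # unchanged, skipped
    "https://example.com/b": "2026-01-15",  # updated since last run
}
todo = pages_to_embed(parse_sitemap(SAMPLE), stored_lastmod)
print(todo)
```

Only the URLs in `todo` would then be fetched, embedded and upserted.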

Take action fast on deleted pages
Your vector database knows every page on your site. When a page disappears from the sitemap, you can act before users hit a 404.
1. Deleted page: when the refresh finds deleted pages, it starts a new flow
2. Backlink/traffic: check for incoming backlinks (Ahrefs) and traffic (GSC) generated for that URL
3. Find similar page: check whether there is a very similar page that could serve as a redirect target
4. Recommendation: based on the incoming backlinks and traffic, give a recommendation
Possible outcomes: redirect, new content, manual check
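
One possible way to sketch this flow in Python is shown below. The slide does not spell out the decision rules, so the mapping from signals to outcomes and the `min_similarity` threshold are assumptions made for illustration. Backlink counts, clicks and the best-match similarity are passed in as plain numbers instead of being fetched from Ahrefs, GSC or the vector database.

```python
def find_deleted(stored_urls, sitemap_urls):
    """URLs that are in the vector database but no longer in the sitemap."""
    return sorted(set(stored_urls) - set(sitemap_urls))

def recommend(backlinks, clicks, best_similarity, min_similarity=0.85):
    """Suggest an action for a deleted URL (illustrative rules only).

    - no backlinks and no traffic: leave it to a manual check
    - value to preserve and a close match exists: redirect to that match
    - value to preserve but no close match: consider new content
    """
    if backlinks == 0 and clicks == 0:
        return "manual check"
    if best_similarity >= min_similarity:
        return "redirect"
    return "new content"

deleted = find_deleted(
    stored_urls=["/shoes/a", "/shoes/b", "/blog/old-post"],
    sitemap_urls=["/shoes/a", "/shoes/b"],
)
print(deleted)
print(recommend(backlinks=12, clicks=340, best_similarity=0.93))
print(recommend(backlinks=12, clicks=340, best_similarity=0.40))
print(recommend(backlinks=0, clicks=0, best_similarity=0.93))
```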

What does it cost?
Storage: $0.02/GB/month. After 90 days untouched: $0.01/GB (automatic). First 10 GB free.
Queries: $6.25/TB scanned. First 1 TB/month free. Use partitioning + clustering to reduce scans by 60–90%.
Embedding API (text-embedding-005): $0.025/1M tokens. One-time cost per page. Only re-embed on changes.
Example: 5,000 page webshop costs < $1/month and < $1 setup

Also works without a vector database
Vector database:
- Embeddings stored permanently
- Query anytime without re-embedding
- Supports RAG, clustering, similarity search
- Scales to thousands of pages
- Best for ongoing analysis, RAG pipelines, large sites
Script based:
- Generate embeddings on the fly
- No database to maintain
- Re-embed every time you run analysis
- Costs more API calls over time
- Best for quick experiments and one-off analyses

Screaming Frog (SF) has embedding functions built in
Since v20, Screaming Frog generates embeddings during the crawl via OpenAI, Gemini, or Ollama. No code needed, just an API key and a paid licence. You can also analyse the results inside the tool.
Setup in Screaming Frog
1. Config > Spider > Extraction > enable Store HTML
2. Config > Spider > Rendering > enable JavaScript
3. Config > API Access > AI and connect OpenAI, Gemini or Ollama
4. Add from Library > "Extract Semantic Embeddings from Page" and add API
Built-in functions in Screaming Frog
- Semantic similarity: automatically finds pages with similar content
- Low relevance detection: flags pages that deviate from your site's average
- Redirect mapping: matches old URLs to the closest new URLs during migrations
- Embedding export: CSV with URL + embedding vector for further analysis

Visualising your website
1. Embed every page
2. Reduce dimensions to 2D
3. Plot as a scatter plot
4. Similar pages appear closer
One bubble chart that reveals your site's semantic structure
[Bubble chart: clusters 1 to 6, with possible outliers]

These bubble charts reveal a lot
- Gaps = content opportunities
- Tight clusters = strong topical authority
- Pages far from any cluster may be off-topic or outdated
- Combining competitors' sites can show content gaps
Outliers check, topical authority check

How to build it
- Python: data pipeline. Crawling, chunking and embedding
- Plotly: interactive data charts. Create interactive charts to analyse, zoom, filter and more
- Streamlit: web app without frontend code. Share with your team via a URL and add filters, search, sidebar controls
~20 lines of Python: from embeddings to a shareable web app

Sheet outputs also work
Page a | Page b | Cosine | Action
/shoes/nike-air-max | /shoes/nike-air-max-review | 0.97 | Duplicate check
/shoes/nike-air-max | /shoes/adidas-ultraboost | 0.82 | Internal link
/shoes/trail-running | /guides/trail-running-guide | 0.78 | Internal link
/shoes/nike-air-max | /blog/summer-outfits | 0.31 | No action
/category/running | /category/running-shoes | 0.96 | Duplicate check
/guides/sizing | /faq/shoe-sizes | 0.88 | Internal link
> 0.95 = Duplicate check
0.70 - 0.95 = Related content
< 0.70 = Not related content
The scatter plot is great for understanding your site, but for finding to-dos, you need a spreadsheet
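
The thresholds on this slide translate directly into a small helper that fills the Action column. The function name is an assumption, and treating exactly 0.70 and 0.95 as "related" is a choice made here because the slide does not say which side the boundaries fall on.

```python
def action_for(cosine):
    """Map a cosine similarity score to the action used in the sheet."""
    if cosine > 0.95:
        return "Duplicate check"
    if cosine >= 0.70:
        return "Internal link"
    return "No action"

# Page pairs and scores taken from the sheet above.
pairs = [
    ("/shoes/nike-air-max", "/shoes/nike-air-max-review", 0.97),
    ("/shoes/nike-air-max", "/shoes/adidas-ultraboost", 0.82),
    ("/shoes/trail-running", "/guides/trail-running-guide", 0.78),
    ("/shoes/nike-air-max", "/blog/summer-outfits", 0.31),
    ("/category/running", "/category/running-shoes", 0.96),
    ("/guides/sizing", "/faq/shoe-sizes", 0.88),
]
actions = [action_for(score) for _, _, score in pairs]
for (page_a, page_b, score), action in zip(pairs, actions):
    print(f"{page_a}\t{page_b}\t{score}\t{action}")
```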

The action framework
< 0.70: Different topics. No semantic relationship. Expected for most pairs. No action needed.
0.70 - 0.95: Potential internal link
> 0.95: Duplicate check

Topical authority scoring
Embed all pages in a topic cluster. Calculate the centroid, the average vector. Then measure how close each page is to that centroid.
- Tight cluster = strong authority: all pages reinforce the same topic
- Scattered pages = weak authority: pages drift away from the core topic
Topic cluster | Pages | Avg distance to centroid | Authority score
Running shoes | 24 | 0.12 | Strong
Outdoor gear | 18 | 0.18 | Good
Nutrition | 6 | 0.35 | Developing
Lifestyle | 3 | 0.52 | Weak
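
The centroid calculation can be sketched as follows. The slide does not say which distance is used, so this example assumes cosine distance (1 minus cosine similarity); the 2-D vectors are toy values.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def centroid(vectors):
    """Component-wise average of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def average_distance_to_centroid(vectors):
    """Mean cosine distance (1 - cosine similarity) of each vector to the centroid."""
    center = centroid(vectors)
    return sum(1 - cosine_similarity(v, center) for v in vectors) / len(vectors)

# Toy clusters: pages pointing in nearly the same direction vs. all over the place.
tight_cluster = [[1.0, 0.1], [1.0, 0.0], [1.0, -0.1]]
scattered_cluster = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
tight_score = average_distance_to_centroid(tight_cluster)
scattered_score = average_distance_to_centroid(scattered_cluster)
print(f"tight: {tight_score:.3f}, scattered: {scattered_score:.3f}")
```

A lower average distance means the pages in the cluster reinforce the same topic more consistently.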

Passage level scoring
Don't just embed pages, embed sections. This reveals which parts of a page are relevant and which are filler.
Example: /guides/trail-running
H1: Trail running guide | 0.92
H2: Choosing the right shoes | 0.88
H2: Training schedule | 0.85
H2: Our store locations | 0.31
H2: Newsletter signup | 0.15
What this reveals:
- High-scoring sections are on-topic and strengthen the page
- Low-scoring sections are off-topic filler that dilutes relevance
Action: remove or move low-scoring sections to keep the page focused

Add semantics to authority building
DR (traditional): domain authority score. High = good, but says nothing about relevance
Traffic (traditional): how many visitors. More = better, but also no relevance signal
Similarity (new): embed the domains. Closer vectors = more relevant link
A DR 30 link from a semantically close site can outperform a DR 70 link from an unrelated domain

Authority building assistant
1. Embed your site
2. Embed prospect domains
3. Calculate similarity
4. Combine with DR + traffic
Prospect | DR | Traffic | Semantic sim. | Priority
running-magazine.com | 62 | 180K | 0.89 | High
general-lifestyle-blog.com | 71 | 350K | 0.42 | Low
local-sports-shop.nl | 28 | 12K | 0.91 | High
tech-review-site.com | 55 | 200K | 0.35 | Low
outdoor-gear-blog.com | 45 | 85K | 0.82 | Medium
Closer vectors = more topically relevant link

Detect differences in statements
To appear credible to Google and LLMs, it's important to be consistent in your statements. With embeddings, you can find similar text to ensure consistency.
[Diagram: /product/shoe-a, /blog/shoe-a and /product-category/shoes-a each contain unique content plus statements about the product]
Using cosine similarity, statements can be quickly linked when you check them at the H2 level along with the underlying paragraph

"Don't get lost in tools and technology. Focus on your
customer, not on algorithms."
You've probably heard this. And they're right!

50 pages
- You can read every page
- Manually decide what to change
Manual review works, and with this number of pages you should do it.
50,000 pages
- You can't read every page
- You can't manually compare 1.2 billion pairs
- You need something to tell you where to look
But at scale, you need a starting point
These techniques don't replace your judgment, they give you a starting point. The human decision still matters. But now you're making it with data instead of guessing where to begin.
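
The pair count on this slide follows from the number of unordered page pairs, n × (n − 1) / 2. A quick check in Python, using `math.comb` (available from Python 3.8):

```python
import math

def page_pairs(pages):
    """Number of unordered page pairs that could be compared."""
    return math.comb(pages, 2)

print(page_pairs(50))      # 1225
print(page_pairs(50_000))  # 1249975000, roughly 1.2 billion
```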

Start here
- Basic: using Screaming Frog + embeddings. Start crawling websites with embeddings and use the standard reports
- Moderate: start using code. Code yourself or use AI to create Python scripts to analyse further
- Advanced: build and maintain a vector database. Advance by creating a vector database and analysing with that data
- Advanced: implement further in the company. Create workflows for different projects and let non-technical SEOs use them

What we covered
1. Embeddings capture meaning, not keywords
2. Google uses them in Search and AI Overviews
3. You can use the same models yourself
4. Vector databases make it scalable
5. RAG grounds AI in your own data
6. Tools like Screaming Frog make it accessible
