

Snapshots from 8 Years as an SEO Information Retrieval Interloper


Dawn Anderson

December 05, 2025


Transcript

1. Who is Dawn Anderson?
   Professional life:
   • SEO since 2007
   • SEO consultant since 2012
   • Pracademic, 'doing SEO every day'
   • Information retrieval interloper since 2017
   • MSc Digital Marketing, 2017
   • Currently studying for a second MSc in Computer Science and Data Science
   Personal life:
   • Mum, step-mum, step-grandma, wife
   • Runner (road, marathon, fells, cross-country)
   • Baker
   • Pomeranian lover

2. Early learnings...
   • About... duplicate content handling and web crawl scheduling
   • About... shards and shingles and clusters and canonicalisation

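The 'shingles' on that slide are worth making concrete. Below is a minimal sketch of w-shingling plus Jaccard similarity for near-duplicate detection; it is a toy illustration under my own naming, not any engine's real pipeline:

```python
# Toy w-shingling sketch for near-duplicate detection.
# All function names here are illustrative.

def shingles(text: str, w: int = 4) -> set:
    """Return the set of w-word shingles (overlapping word windows)."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"

# Identical docs score 1.0, unrelated docs near 0.0; one changed
# word still leaves some shared shingles (0.20 here).
print(f"shingle similarity: {jaccard(shingles(doc1), shingles(doc2)):.2f}")
```

Documents whose shingle sets overlap heavily can be clustered, with one URL picked as the canonical version.
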
3. Learnings around 'importance' and scale in search engines
   • About... 'cascading' and 'tiered' crawling, indexing, ranking and caching / storage systems
   • About... 'breadth first' and 'depth first' crawling

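To make the breadth-first v depth-first contrast concrete, here is a minimal sketch over a toy link graph (illustrative names only; real crawlers add politeness, revisit scheduling and importance-based prioritisation):

```python
from collections import deque

# Toy link graph: page -> outlinks.
link_graph = {
    "home": ["about", "blog"],
    "about": ["team"],
    "blog": ["post1", "post2"],
    "team": [], "post1": [], "post2": [],
}

def crawl(seed: str, breadth_first: bool = True) -> list:
    frontier = deque([seed])
    seen, order = {seed}, []
    while frontier:
        # BFS pops the oldest URL (queue); DFS pops the newest (stack).
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for link in link_graph[url]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(crawl("home", breadth_first=True))   # ['home', 'about', 'blog', 'team', 'post1', 'post2']
print(crawl("home", breadth_first=False))  # ['home', 'blog', 'post2', 'post1', 'about', 'team']
```

Breadth-first crawls sweep wide across a site level by level; depth-first chases one branch of links down before backtracking.
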
4. About lexical models like... TF-IDF, BM25 & BM25F
   • A mainstay of search ranking
   • Used for ranking across all of Elasticsearch
   • BM25 has limitations
   • BM25F deals with long documents and multi-field weights within documents (title, body, etc.)

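A minimal BM25 scorer, to show what the lexical model actually computes. k1 and b below are the common textbook defaults; this is a sketch of the classic formula, not Elasticsearch's implementation. BM25F extends the same idea with per-field term frequencies and weights:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenised doc against query terms over a tokenised corpus."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        tf = doc.count(term)                      # term frequency in this doc
        df = sum(1 for d in corpus if term in d)  # docs containing the term
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # rarer term -> higher weight
        # tf saturates via k1; b normalises for document length
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [["cheap", "flights", "paris"], ["paris", "hotels"], ["cheap", "hotels", "rome"]]
print(bm25_score(["cheap", "paris"], corpus[0], corpus))
```
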
5. About... Ranking Evaluation Metrics (Offline)
   • MAP (Mean Average Precision) - order aware
   • R@K (Recall @ K) - TP / (TP + FN) - order unaware
   • P@K (Precision @ K) - TP / (TP + FP) - order unaware
   • MRR (Mean Reciprocal Rank) - order aware
   • Accuracy (Accuracy @ K) - (TP + TN) / (TP + TN + FP + FN) - order unaware
   • F1 (harmonic mean of P & R) - order unaware
   • DCG & NDCG (Discounted & Normalised Discounted Cumulative Gain) - order aware

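Toy implementations of a few of these metrics, assuming binary relevance labels; the variable names are mine:

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant docs found in the top k."""
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant hit; MRR averages this over queries."""
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1 / i
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """DCG with binary gains, normalised by the ideal ordering's DCG."""
    dcg = sum(1 / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1 / math.log2(i + 1) for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

ranked, relevant = ["d3", "d1", "d9", "d2"], {"d1", "d2"}
print(precision_at_k(ranked, relevant, 3))   # 0.333...
print(recall_at_k(ranked, relevant, 3))      # 0.5
print(reciprocal_rank(ranked, relevant))     # 0.5 (first hit at rank 2)
print(round(ndcg_at_k(ranked, relevant, 4), 3))
```
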
6. 3 Types of Learning to Rank
   Pairwise
   • Trains on pairs of documents
   • Predicts which one should rank higher (loss sketched below)
   • Examples: RankNet, LambdaRank
   Pointwise
   • Treats ranking as a classification or regression problem
   • Predicts a relevance score for each document independently
   Listwise
   • Looks at the entire list of results at once
   • Optimises ranking-specific metrics (MAP, NDCG)
   • Examples: LambdaMART, ListNet

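A sketch of the pairwise idea behind RankNet: the model only has to learn which of two documents should score higher, and the loss is the logistic loss on the score difference. This omits RankNet's full gradient machinery:

```python
import math

def pairwise_logistic_loss(s_higher: float, s_lower: float) -> float:
    """Small when the should-rank-higher doc really does score higher."""
    return math.log(1 + math.exp(-(s_higher - s_lower)))

print(pairwise_logistic_loss(2.0, 0.5))  # correctly ordered pair -> low loss
print(pairwise_logistic_loss(0.5, 2.0))  # inverted pair -> high loss
```
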
7. About... likely core updates process cycles
   • Click data / past performance determine current 'relevance'
   • Gold-standard query / data labelling
   • HITL quality raters / LLM-as-judge feedback loop
   • NDCG / MAP adjustment / error-level assessment and reiteration
   • Core update rollout
   • Learning to rank / neural re-ranking

8. About... Neural ranking models
   • Early models: DSSM, DRMM, KNRM, DUET
   • Transformer models: BERT, ColBERT, MonoT5
   • Dense retrieval: DPR, ANCE, RocketQA
   • Modern hybrids: SPLADE, ColBERTv2, LLM rerankers

9. A re-ranker refines the ranks of preliminary candidate documents and provides more accurate relevancy scores using machine learning.

10. Cross-encoders and rerankers
    These score query-document pairs jointly:
    • MonoT5 (ranking with T5 as a cross-encoder)
    • ELECTRA rankers
    • MiniLM rankers (lightweight Transformer rerankers)

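A hedged sketch of cross-encoder reranking using the sentence-transformers library; 'cross-encoder/ms-marco-MiniLM-L-6-v2' is one public MS MARCO checkpoint chosen for illustration, not what any search engine actually runs:

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "when is easter this year"
candidates = [
    "Easter 2025 falls on Sunday 20 April.",
    "Easter eggs are a popular chocolate treat.",
    "Good Friday events and activities near you.",
]

# A cross-encoder scores each (query, document) pair jointly, so it sees
# full query-document interaction: accurate, but too slow for first-stage
# retrieval, which is why it reranks a small candidate set.
scores = model.predict([(query, doc) for doc in candidates])
for doc, score in sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.2f}  {doc}")
```
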
11. Temporal query intent shift - very nuanced, without qualifiers
    EASTER:
    • Most of the year - WHEN IS IT? (date)
    • Two weeks before - THINGS TO DO? (events & activities on Good Friday and Easter Sunday)
    • WHAT IS THE MEANING OF EASTER? (completely informational)

12. Search engines / IR engineers / researchers only care about precision of the top K (likely <= 20 results).

13. BERT-based and transformer models
    When BERT (2018) arrived, it transformed neural ranking:
    • BERT for ranking (fine-tuned BERT directly for passage/document ranking)
    • MonoBERT (pointwise ranking with BERT)
    • TwinBERT (bi-encoder style, for efficiency)
    • ColBERT (Contextualized Late Interaction over BERT)
    • DistilBERT-based rankers (smaller, faster variants)
    • ELECTRA, RoBERTa, ALBERT adaptations for ranking tasks

14. Dense v sparse retrieval
    • Sparse retrieval is like an old-school librarian who flips through a card catalog to find the exact terms you wrote.
    • Dense retrieval is like a smart assistant who understands what you meant, even if you phrased it differently.
    • Modern search engines (like Google and Bing) actually combine both (see the fusion sketch below):
      • Sparse retrieval for exact matches (important for things like names, dates, product codes).
      • Dense retrieval for semantic understanding (important for natural questions, conversational queries, and AI search).

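One common way hybrids combine the two lists is Reciprocal Rank Fusion (RRF); a toy sketch with made-up doc ids. RRF is one fusion method among several, not necessarily what Google or Bing use:

```python
def rrf(rankings, k=60):
    """Fuse rankings (lists of doc ids, best first) by summed reciprocal ranks."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse_ranking = ["d1", "d2", "d4"]  # exact-match strengths (names, codes)
dense_ranking = ["d2", "d3", "d1"]   # semantic-match strengths (paraphrases)

# d2 wins: it appears high in BOTH lists, so its reciprocal ranks add up.
print(rrf([sparse_ranking, dense_ranking]))  # ['d2', 'd1', 'd3', 'd4']
```
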
15. Dual-encoder / bi-encoder ranking models
    Used heavily in dense passage retrieval (DPR) and semantic search:
    • DPR (Dense Passage Retrieval, Facebook AI, 2020)
    • ANCE (Approximate Nearest Neighbor Negative Contrastive Estimation, MSR, 2020)
    • RocketQA (Baidu, improved training for dense retrievers, 2020)
    • Contriever (Meta AI, unsupervised dense retriever, 2022)

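A hedged sketch of bi-encoder (DPR-style) retrieval with sentence-transformers; 'all-MiniLM-L6-v2' is one public checkpoint chosen for illustration. The key design point, in contrast to the cross-encoder above: query and documents are encoded independently, so document vectors can be precomputed and searched with fast nearest-neighbour lookup:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Events and activities on Good Friday and Easter Sunday",
    "How to bake hot cross buns",
    "Marathon training plans for spring",
]
doc_emb = model.encode(docs)  # documents are embedded once, offline

# At query time only the query needs encoding; the query never "sees"
# the documents directly, unlike a cross-encoder.
query_emb = model.encode("what's on over the easter weekend")
scores = util.cos_sim(query_emb, doc_emb)[0]  # cosine similarity to each doc

best = int(scores.argmax())
print(docs[best], float(scores[best]))
```
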
16. Information Gain - Entropy - How much classification impurity is present?
    • Used in decision trees / machine learning, for classification
    • Designed to identify where best to split a dataset so that the result is most 'pure' with respect to the class sought
    • Entropy is a measure of information impurity
    • In decision trees, information gain (the reduction in entropy after a split) helps decide which features best separate the data

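A toy sketch of entropy and information gain as used in decision-tree splits; the labels and the candidate split below are made up:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy: 0 for a pure set, higher for mixed classes."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child splits."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

parent = ["spam", "spam", "ham", "ham"]     # mixed classes: entropy = 1.0
split = [["spam", "spam"], ["ham", "ham"]]  # pure children: entropy = 0.0
print(information_gain(parent, split))      # 1.0 -> a perfect split
```
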
17. About machine learning being part of Google Search for years, beyond the obvious top-K and late-stage re-ranking
    • Crawl scheduling via importance prediction and spam swerving
    • Feature engineering optimisation (understanding which factors should work together)
    • Pairwise, pointwise and listwise ranking comparisons
    • Learning to rank
    • Dynamic index pruning
    • Generative information retrieval

18. Rethinking search with AI foundations
    Delphic costs (Broder)
    • Reducing the 'search' cost burden on searchers
    • Particularly Gen Z and Millennials, tempted to defect to ChatGPT et al.
    Rethinking search
    • Take the knowledge to the searcher rather than make them search

19. Concerns in the information retrieval field around generative IR
    • 'Involution not evolution' (Ricardo Baeza-Yates)
    • Google are "rushing ahead" (Bender & Shah)
    • Searchers still want to learn and search for themselves (Bender & Shah)
    • Generative IR (e.g. AI Overviews) takes away the agency (control) of searchers
    • Generative IR doesn't meet Belkin's 16 information seeking strategies
    • The 'Rethinking Search' paper by a Google team was rebuked by many reviewers

20. A new paradigm
    "DSI is a new paradigm that learns and answers queries directly using a text-to-text model mapping string queries directly to relevant docids using only its parameters." (Google, 2022)

21. What is a differentiable search index?
    A Differentiable Search Index (DSI) is a self-learning search system where the index itself is part of the AI's training process. Instead of just storing documents, it adapts how queries and documents are matched, improving continuously from user interactions.

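A heavily hedged sketch of the DSI interface described above, using Hugging Face transformers with a base T5 checkpoint. As written it will emit nonsense: the DSI paper first fine-tunes the model on (query, docid) pairs so that the 'index' lives in the weights. All names here are illustrative:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("when is easter this year", return_tensors="pt")

# After DSI-style fine-tuning, generate() would decode a docid string
# (e.g. "doc_4821"): retrieval becomes text-to-text generation, with no
# inverted-index lookup at all.
output_ids = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
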
22. Today versus a differentiable search index future?
    • Today: a static list of links + fixed snippets.
    • Future: a living, adaptive SERP that reorders, expands, or hides elements in real time as the system learns from you and other users.

23. How can I help?
    A DSI-powered SERP would feel more alive and personalised - less like a static ranking list and more like an adaptive assistant that reshapes itself as people use it.

24. DSI anticipated impacts
    • Less reliance on keywords - stronger semantic and intent-based search.
    • Faster adaptation - rankings shift continuously instead of through core updates.
    • Personalisation - SERPs evolve in real time based on individual user signals.
    • Richer results - entity panels, interactive tools, and dynamic answers.

25. Modern hybrids & large-scale models
    • ColBERTv2 (efficient late interaction with BERT-style models)
    • SPLADE (sparse neural retriever using expansion & reweighting)
    • uniCOIL (contextualized bag-of-words retriever)
    • DocT5Query (query expansion with T5 for better retrieval)
    • OpenAI embedding models (used in vector search, RAG pipelines)
    • LLM rerankers (GPT-4, Claude, etc., now being tested as "zero-shot rankers")

26. Hybrid approaches == DSI + RAG + kNN-LM + Seq2Seq
    • Use DSI when your dataset is fixed and small-to-medium scale.
    • Use RAG when you need scalability + easy updates.
    • Use kNN-LM when you need dynamic adaptation without retraining.
    • Use Seq2Seq generative retrievers when you want sequence modeling flexibility with some retrieval grounding.

27. Future SEO with DSI hybrid checklist??
    1. Optimise for Meaning, Not Just Keywords
    2. Feed the Feedback Loop with satisfying UX
    3. Build Stickier Authoritativeness with expertise and freshness
    4. Strengthen Contextual Relevance with schema and entity links
    5. Drive Engagement with interactive content
    6. Stay Adaptable with continuous refreshes and testing

28. References & Sources
    • Bar-Yossef, Z., Keidar, I. and Schonfeld, U., 2009. Do not crawl in the DUST: Different URLs with similar text. ACM Transactions on the Web (TWEB), 3(1), pp. 1-31.
    • Manasse, M., 2022. On the efficient determination of most near neighbors: Horseshoes, hand grenades, web search and other situations when close is close enough.
    • Olston, C. and Najork, M., 2010. Web crawling. Foundations and Trends in Information Retrieval.
    • Najork, M., 2002. High performance web crawling.
    • Baeza-Yates, R. and Ribeiro-Neto, B., 1999. Modern information retrieval.