
How Google's Search Index Works

How Google's Search index works, based on monitoring indexing data at scale across millions of pages.

In this deck, you will gain a better understanding of how to act on the data in the Page Indexing report and the URL Inspection Tool in Google Search Console.

Adam Gent

October 23, 2025

Transcript

  1. Hi. I’m Adam Gent. Technical SEO Consultant.

    • 15 years working in the SEO industry. • In-house and agency experience. • Ex-Product Manager at DeepCrawl. • Founder of The SEO Sprint Newsletter. • Co-founder of Indexing Insight.
  2. We monitor A LOT of pages (LIKE A LOT).

    Indexing Insight inspects 500,00 pages a day. [Chart legend: Indexed, Not Indexed.] Source: Indexing Insight
  3. How Google Search Index Works

    1. Page Indexing Report Explained 2. Three-Tier Exclusion Engine 3. The Core Signal System 4. Quality Management 5. Crawl Management
  4. How Google Search Index Works

    1. Page Indexing Report Explained 2. Three-Tier Exclusion Engine 3. The Core Signal System 4. Quality Management 5. Crawl Management
  5. I believe the Page Indexing report is misunderstood by business/SEO teams.

    Source: Google Search Console > Indexing > Page Index Report
  6. Google’s index is just a database.

    “Google's index technically is just a large database sitting on thousands of computers.” Gary Illyes, ‘The Grumpy One’, Analyst on Google Search team. Source: How Google Search indexes pages
  7. Myth: Deindexed pages are not part of the index.

    [Diagram: Tinternet, Googlebot, indexing pipeline.] Indexed (included in Google’s Index). Not Indexed (not included in Google’s Index).
  8. Reality: ALL processed information is stored.

    [Diagram: Tinternet, Googlebot, indexing pipeline.] Google stores ALL processed data and information from Indexed and Not Indexed documents.
  9. For example, you can pull stored information from Google’s index for Not Indexed pages in the URL Inspection Tool and API.

    Source: Google Search Console > Indexing > URL Inspection Tool
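
As a rough illustration of slide 9, the sketch below builds a request body for the Search Console URL Inspection API and reads the stored index status out of a response. It makes no network call: `SAMPLE_RESPONSE` is a hand-written, hypothetical payload (made-up URL and values), and the helper names `build_inspect_request` and `summarise_index_status` are assumptions for this example. A real call also needs OAuth credentials for a verified property.

```python
# Endpoint of the Search Console URL Inspection API (index:inspect method).
INSPECT_ENDPOINT = "https://searchconsole.googleapis.com/v1/urlInspection/index:inspect"


def build_inspect_request(page_url: str, property_url: str) -> dict:
    """Build the JSON body for an index:inspect call (hypothetical helper)."""
    return {"inspectionUrl": page_url, "siteUrl": property_url}


def summarise_index_status(response: dict) -> dict:
    """Pull the stored indexing fields out of an inspection response."""
    status = response.get("inspectionResult", {}).get("indexStatusResult", {})
    return {
        "verdict": status.get("verdict"),
        "coverage_state": status.get("coverageState"),
        "last_crawl_time": status.get("lastCrawlTime"),
        "google_canonical": status.get("googleCanonical"),
    }


# Illustrative response only: hypothetical URL and values, not real data.
SAMPLE_RESPONSE = {
    "inspectionResult": {
        "indexStatusResult": {
            "verdict": "NEUTRAL",
            "coverageState": "Crawled - currently not indexed",
            "lastCrawlTime": "2025-06-01T08:15:00Z",
            "googleCanonical": "https://example.com/page",
        }
    }
}

if __name__ == "__main__":
    print(build_inspect_request("https://example.com/page", "https://example.com/"))
    print(summarise_index_status(SAMPLE_RESPONSE))
```

Even for a Not Indexed page, the response still carries a last crawl time and a Google-selected canonical, which is the point of the slide: the data is stored, the page is simply not served.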
  10. Page Indexing report: All stored info and pages.

    What you are looking at in the Page Indexing report is every page stored in Google’s web index.
  11. Eligible to appear in Google organic search results.

    “Once a page is indexed, it becomes eligible to show up in Google Search results...” Source: Search Console Help, Google Search Indexing and Ranking FAQ
  12. Eligible to appear in Google AI Overviews and AI Mode.

    “To be eligible to be shown as a supporting link in AI Overviews or AI Mode, a page must be indexed …” Source: Search Console Help, AI features and your website
  13. How Google Search Index Works

    1. Page Indexing Report Explained 2. Three-Tier Exclusion Engine 3. The Core Signal System 4. Quality Management 5. Crawl Management
  14. Google only wants to index useful parts of the web.

    Source: USA vs. Google antitrust trial
  15. There is A LOT of duplicate and low-quality content.

    • 60% of the web is duplicate. • 40 billion pages of spam every day. • 96.55% of content gets NO traffic from Google. Source: 60% Of The Internet Is Duplicate, Google Web Spam Report 2020, Ahrefs Study
  16. Three-tier exclusion engine.

    [Diagram: Crawler, indexing pipeline, Web Index.] Filter 1: Technical. Crawl errors, blocked by robots.txt, noindex tags. Filter 2: Duplication. Deduplication and soft 404 errors. Filter 3: Quality. Page quality based on signals collected over time.
  17. Filter 1: Excluding technical errors.

    These error types include: • Noindex tags • Blocked by robots.txt • 4xx status codes • 5xx status codes • 3xx status codes
  18. Filter 2: Similar and Duplicate Content Removal.

    These error types include: • Duplicate content • Similar content • Soft 404 errors
  19. Filter 3: Page Quality.

    These error types include: • Crawled – currently not indexed • Discovered – currently not indexed • URL is unknown to Google
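
The three filters on slides 16 to 19 can be sketched as a lookup from a Page Indexing report reason to the filter that most likely excluded the page. The mapping below is an assumption that follows the deck's grouping (for instance, placing "Alternate page with proper canonical tag" under duplication is my reading, not something the slides state), and `exclusion_filter` is a hypothetical helper name.

```python
from typing import Optional

TECHNICAL = "Filter 1: Technical"
DUPLICATION = "Filter 2: Duplication"
QUALITY = "Filter 3: Quality"

# Assumed mapping from report reasons to the deck's three filters.
FILTER_BY_REASON = {
    "excluded by 'noindex' tag": TECHNICAL,
    "blocked by robots.txt": TECHNICAL,
    "not found (404)": TECHNICAL,
    "server error (5xx)": TECHNICAL,
    "page with redirect": TECHNICAL,
    "duplicate without user-selected canonical": DUPLICATION,
    "duplicate, google chose different canonical than user": DUPLICATION,
    "alternate page with proper canonical tag": DUPLICATION,
    "soft 404": DUPLICATION,
    "crawled - currently not indexed": QUALITY,
    "discovered - currently not indexed": QUALITY,
    "url is unknown to google": QUALITY,
}


def normalise(reason: str) -> str:
    """Lower-case the reason and flatten typographic dashes and quotes."""
    cleaned = reason.strip().lower()
    for fancy, plain in (("\u2013", "-"), ("\u2018", "'"), ("\u2019", "'")):
        cleaned = cleaned.replace(fancy, plain)
    return cleaned


def exclusion_filter(reason: str) -> Optional[str]:
    """Return the assumed filter for a reason, or None if it is not an exclusion."""
    return FILTER_BY_REASON.get(normalise(reason))


if __name__ == "__main__":
    for reason in ("Soft 404", "Crawled \u2013 currently not indexed", "Submitted and indexed"):
        print(reason, "->", exclusion_filter(reason))
```

Grouping an export of the report this way shows at a glance how much of a site's Not Indexed total sits behind each filter.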
  20. Googlers talk about quality in relation to deindexing.

    “And in general, also the general quality of the site, that can matter a lot of how many of these ‘crawled - currently not indexed’, you see in Search Console. If the number of these URLs is very high that could hint at a general quality issues. And I've seen that a lot uh since February, where suddenly we just decided that we are de-indexing a vast amount of URLs on a site just because the perception, or our perception of the site has changed.” Gary Illyes, ‘The Grumpy One’, Grumpy Analyst on Google Search team. Source: Google Search Confirms Deindexing Vast Amounts Of URLs In February 2024
  21. Quality is mentioned in official Google Videos.

    “The other far more common reason for pages staying in "Discovered-- currently not indexed" is quality, though. When Google Search notices a pattern of low-quality or thin content on pages, they might be removed from the index and might stay in Discovered.” Martin Splitt, ‘The Happy One’, Googlebot Whisperer on the Search Relations team. Source: Search Engine Journal, Help! Google Search isn’t indexing my pages
  22. Quality is No. 1 Reason Why Pages are Not Indexed.

    Our study of 1.4 million pages found that 86% of not-indexed pages are attributable to “quality” issues on the website. Source: New Study: The Biggest Reason Why Your Pages are Not Indexed in Google, Indexing Insight
  23. Page Quality No. 1 Issue for Big or Small Brands.

    Source: New Study: The Biggest Reason Why Your Pages are Not Indexed in Google, Indexing Insight
  24. The definition makes it seem like a crawling problem.

    Crawled - currently not indexed: “The page was crawled by Google but not indexed. It may or may not be indexed in the future; no need to resubmit this URL for crawling.” - Google Search Console Documents. Source: What is ‘Crawled - Currently Not Indexed’? (And Why The Definition Must Change), Indexing Insight
  25. But our data shows that pages have been indexed.

    Crawled - currently not indexed (Current): “The page was crawled by Google but not indexed. It may or may not be indexed in the future; no need to resubmit this URL for crawling.” - Google Search Console Documents. Crawled - currently not indexed (New): “Pages with the new ‘crawled - previously indexed’ status have been crawled AND historically indexed by Google. However, over time, Google has decided that these pages should not be served to users and removes them from being served in search results.” - Indexing Insight. Source: What is ‘Crawled - Currently Not Indexed’? (And Why The Definition Must Change), Indexing Insight
  26. We found that pages moved from indexed…

    [Chart: Submitted and Indexed.] Source: What is ‘Crawled - Currently Not Indexed’? (And Why The Definition Must Change), Indexing Insight
  27. …to ‘crawled – currently not indexed’.

    [Chart: Submitted and Indexed; Crawled - currently not indexed.] Source: What is ‘Crawled - Currently Not Indexed’? (And Why The Definition Must Change), Indexing Insight
  28. New index state: ‘Crawled – previously indexed’.

    We had to create a new indexing state called ‘Crawled – previously indexed’ to track this behaviour. Source: ‘Crawled – previously indexed’ report from Indexing Insight
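
One way to picture the ‘Crawled – previously indexed’ state from slides 25 to 28 is as a label derived from a page's coverage history. This is a minimal sketch under my own assumptions, not Indexing Insight's actual implementation: the history is a plain list of coverage states (oldest first), and `label_current_state` is a hypothetical function name.

```python
from typing import List

# Coverage states treated as "indexed" for this sketch.
INDEXED_STATES = {"Submitted and indexed", "Indexed, not submitted in sitemap"}
CRAWLED_NOT_INDEXED = "Crawled - currently not indexed"
PREVIOUSLY_INDEXED = "Crawled - previously indexed"


def label_current_state(history: List[str]) -> str:
    """Relabel the latest state when a once-indexed page drops out of the index.

    `history` holds coverage states from repeated inspections, oldest first.
    """
    if not history:
        raise ValueError("history must contain at least one coverage state")
    current = history[-1]
    was_indexed = any(state in INDEXED_STATES for state in history[:-1])
    if current == CRAWLED_NOT_INDEXED and was_indexed:
        return PREVIOUSLY_INDEXED
    return current


if __name__ == "__main__":
    print(label_current_state(["Submitted and indexed", CRAWLED_NOT_INDEXED]))
    print(label_current_state(["Discovered - currently not indexed", CRAWLED_NOT_INDEXED]))
```

The second call keeps the standard label because that page was never seen as indexed, which is the distinction the new state is meant to capture.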
  29. Example 1: ATM Location Listing Page

    An awfully built pSEO website (I built it to test Indexing Insight) that only has 6% of its pages indexed.
  30. Example 2: The SEO Sprint Article

    A unique article that I spent hours writing, providing first-hand experience of working in agile.
  31. Example 3: Player Description and Stats

    A set of pSEO listing pages that provide useful information about cricket players.
  32. Example 5: Indeed Job Listing Pages

    95% of explorer listing pages are indexed except for a tiny number of pages.
  33. Thanks to public DOJ documents we can uncover more about how Google’s system works.

    Source: DOJ Trial Documents
  34. How Google Search Index Works

    1. Page Indexing Report Explained 2. Three-Tier Exclusion Engine 3. The Core Signal System 4. Quality Management 5. Crawl Management
  35. A signal is any measurable data point or computed value that Google's algorithms use to assess relevance, quality, or user satisfaction of a piece of content for ranking.

    Source: Summarised from How Search Works, Google
  36. Stored Signals: Raw vs Computed

    Raw Signals: raw data pulled from crawled content. For example: • Links • Body content • HTML Canonical Tag. Computed Signals: signals calculated from raw data points. For example: • PageRank (uses link graph) • Vector Embeddings (uses words) • Google selected canonical URL
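
To make the raw versus computed distinction on slide 36 concrete, the sketch below takes a raw signal (a tiny, made-up link graph) and derives a computed signal from it using the textbook PageRank power iteration. This is the classic published formulation with a damping factor of 0.85, not Google's production system, and the page names in `RAW_LINKS` are hypothetical.

```python
from typing import Dict, List


def pagerank(
    links: Dict[str, List[str]], damping: float = 0.85, iterations: int = 50
) -> Dict[str, float]:
    """Compute textbook PageRank scores from a raw link graph."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    if not pages:
        return {}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {page: (1.0 - damping) / n + damping * dangling / n for page in pages}
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Raw signal: which page links to which (hypothetical site).
RAW_LINKS = {
    "home": ["guide", "pricing"],
    "guide": ["home"],
    "pricing": ["home"],
    "orphan": ["home"],
}

if __name__ == "__main__":
    # Computed signal: one score per page, derived only from the raw links.
    for page, score in sorted(pagerank(RAW_LINKS).items(), key=lambda kv: -kv[1]):
        print(f"{page}: {score:.3f}")
```

The page that nothing links to ends up with the lowest score, even though its raw data (its own outlinks and content) is stored just like every other page's.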
  37. 3 Top-Level Signals in Google

    Popularity (P*): A signal that measures how popular a page is based on Chrome browsing data and the number of authoritative links. Topicality (T*): A signal that is a measure of how relevant a document / web page is to a set of queries. Quality (Q*): A signal that measures a page’s quality by looking at authority and trustworthiness based on a multitude of signals. Source: DOJ Trial Document 1436
  38. VERY interesting is the mention of Quality.

    Popularity (P*): A signal that measures how popular a page is based on Chrome browsing data and the number of authoritative links. Topicality (T*): A signal that is a measure of how relevant a document / web page is to a set of queries. Quality (Q*): A signal that measures a page’s quality by looking at authority and trustworthiness based on a multitude of signals.
  39. What is Page Quality (Q*)?

    1) Content: The page’s content (words) is turned into vector embeddings and compared against the topic map. 2) Links: Nearest Seed PageRank is a key signal in understanding page quality. Google measures the distance from “trusted” pages. 3) Clicks: A key signal of page quality is the measurement of user satisfaction over time in Google SERPs using session logs. Source: DOJ Trial Document 1436 & Interview with Google Engineer
  40. Pattern 1: Historic Engagement Pages

    Historically, some pages have driven SEO traffic, but over time their performance has dropped. These pages also have few backlinks or internal links.
  41. Pattern 2: Low Engagement Pages

    The pages have never driven a lot of SEO clicks or links (external and internal).
  42. Pattern 3: No Engagement Pages

    The pages didn’t get any SEO clicks and have few or zero links (external and internal).
  43. These reports in GSC are full of pages with low page quality that are actively removed.
  44. How Google Search Index Works

    1. Page Indexing Report Explained 2. Three-Tier Exclusion Engine 3. The Core Signal System 4. Quality Management 5. Crawl Management
  45. Relationship between quality and index selection.

    “Index selection, while it's largely about (RAM/flash/disk) space, it's tightly tied to quality of content. If we have tons of free space available, we're more likely to index crappier content. If we don't, we might deindex stuff to make space for higher quality docs.” Gary Illyes, Grumpy Analyst on Google Search team. Source: Google Search Confirms Deindexing Vast Amounts Of URLs In February 2024
  46. Page eligibility goes up and down as the quality threshold changes.

    The ‘Crawled’ and ‘Discovered’ reports see total pages go up and down as pages get further from the Quality (Q*) threshold benchmark.
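
A toy model can show why the report totals on slide 46 move even when a site changes nothing. Everything here is an assumption for illustration: Q* values are not public, so the scores in `SCORES` are invented, a single hard cut-off is a simplification, and `eligible_pages` is a hypothetical helper.

```python
from typing import Dict, List


def eligible_pages(quality_scores: Dict[str, float], threshold: float) -> List[str]:
    """Return the URLs whose (made-up) quality score meets the threshold."""
    return sorted(url for url, score in quality_scores.items() if score >= threshold)


# Invented scores for three hypothetical pages; real Q* values are not exposed.
SCORES = {"/guide": 0.82, "/category": 0.55, "/thin-listing": 0.31}

if __name__ == "__main__":
    # The pages stay the same; only the threshold moves.
    for threshold in (0.3, 0.5, 0.6):
        print(threshold, eligible_pages(SCORES, threshold))
```

As the threshold rises, pages near the cut-off fall out first, which matches the deck's description of totals in the ‘Crawled’ and ‘Discovered’ reports rising and falling over time.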
  47. How Google Search Index Works

    1. Page Indexing Report Explained 2. Three-Tier Exclusion Engine 3. The Core Signal System 4. Quality Management 5. Crawl Management
  48. Google uses Quality (Q*) signals to manage crawling.

    “Signals developed on user-interaction data play an important role in search index development. Quality and popularity signals, for instance, help Google determine how frequently to crawl web pages to ensure the index contains the freshest web content.” Source: Document 1436, Pg. 138, Google Antitrust Trial
  49. If a page is not crawled in 130 days, it gets actively removed.

    Source: New Study: The 130 Day Indexing Rule, Indexing Insight
  50. Less Crawl Priority, Zero Crawl Priority

    Less Crawl Priority: If a page is not crawled in 130 days, it gets actively removed. Zero Crawl Priority: If a page is not crawled in 190 days, it gets actively forgotten. Source: New Study: After 190 Days Since Last Crawl Googlebot Forgets, Indexing Insight
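
The 130-day and 190-day observations on slides 49 and 50 can be turned into a simple check against a page's last crawl date (for example, the `lastCrawlTime` returned by the URL Inspection API). The two thresholds come from the Indexing Insight studies cited above; treating the boundary day as already at risk, the bucket names, and the function name `crawl_age_bucket` are my assumptions.

```python
from datetime import date, timedelta

# Thresholds reported in the Indexing Insight studies cited in the deck.
REMOVAL_THRESHOLD_DAYS = 130
FORGOTTEN_THRESHOLD_DAYS = 190


def crawl_age_bucket(last_crawl: date, today: date) -> str:
    """Bucket a page by the number of days since Googlebot last crawled it."""
    days_since_crawl = (today - last_crawl).days
    if days_since_crawl < 0:
        raise ValueError("last_crawl cannot be after today")
    if days_since_crawl >= FORGOTTEN_THRESHOLD_DAYS:
        return "at risk of being forgotten"
    if days_since_crawl >= REMOVAL_THRESHOLD_DAYS:
        return "at risk of removal"
    return "recently crawled"


if __name__ == "__main__":
    today = date(2025, 10, 23)
    for age in (30, 150, 200):
        print(age, crawl_age_bucket(today - timedelta(days=age), today))
```

Running a check like this across a site's inspected URLs gives an early list of pages that are drifting towards removal before they change state in the report.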
  51. Summary & Takeaways

    1. Page Indexing report shows ALL processed content. 2. Indexed means you are eligible to appear in search. 3. Page quality is a BIG reason why pages are removed. 4. Page quality is used by Google to manage its index. 5. Page quality is used by Google to prioritise crawling for pages.