Slide 1

Managing Googlebot’s Greed
Optimise for Efficient Crawling and Indexing
Barry Adams
15 May 2025

Slide 2

Barry Adams
➢ Active in SEO since 1998
➢ Specialist in SEO for News Publishers
➢ Newsletter: SEOforGoogleNews.com
➢ Co-founder of the News & Editorial SEO Summit

Slide 3

Google’s Three Main Processes:
Crawling → Indexing → Serving

Slide 4

Technical SEO:
Crawling & Indexing

Slide 5

Today’s Talk: Crawling

Slide 6

Google’s Model
[Diagram: a URL enters the Crawl Queue → Crawling fetches the HTML → Processing → Index → Serving; the HTML is also sent to a Render Queue → Rendering, which returns the Rendered HTML to Processing]

Slide 7

Three Layers of Index Storage
[Diagram: the same crawl and render pipeline, with the single Index replaced by three index layers feeding Serving]

Slide 8

Three Layers of Index Storage
1. RAM storage;
➢ Pages that need to be served quickly and frequently
➢ Includes news articles, popular content, high-traffic URLs
2. SSD storage;
➢ Pages that are regularly served in SERPs but aren’t super popular
3. HDD storage;
➢ Pages that are rarely (if ever) served in SERPs

Slide 9

Three ‘layers’ of Googlebot?
[Diagram: the same pipeline, with three Crawl Queues and three Crawling processes feeding the three index layers]

Slide 10

Three Indices… Three Crawl Queues?
Priority crawl queue → RAM storage
Regular crawl queue → SSD storage
Legacy crawl queue → HDD storage

Slide 11

Priority Crawl Queue
• Crawls VIPs;
➢ Very Important Pages: webpages that have a high change frequency and/or are seen as highly authoritative
- News website homepages & key section pages
- Highly volatile classified portals (jobs, properties)
- Large volume ecommerce (Amazon, eBay, Etsy)
• Main purpose = discovery of valuable new content;
➢ i.e., news articles, new product pages, new classified listings
• Rarely re-crawls newly discovered URLs;
➢ New URLs can become VIPs over time

Slide 12

Regular Crawl Queue
• Google’s main crawling;
➢ Does most of the heavy lifting
• Less frantic;
➢ More time for crawl selection, de-duplication, sanitisation of the crawl queue
• Recrawls URLs first crawled by the Priority crawl queue;
➢ Checks for changes
➢ Updates relevant signals for next crawl prioritisation

Slide 13

Legacy Crawl Queue
• Crawls VUPs;
➢ Very Unimportant Pages: URLs that have very little link value and/or are very rarely updated
➢ Recrawls URLs that serve 4XX errors;
- Likely also occasionally checks old redirects
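To make the three-queue model on these slides concrete, here is a minimal Python sketch of the routing idea. It is purely illustrative: Google's real systems are not public, and every threshold and field name here is invented.

from dataclasses import dataclass

@dataclass
class UrlSignals:
    url: str
    authority: float          # e.g. normalised link equity, 0..1 (invented scale)
    change_frequency: float   # expected content changes per day (invented scale)
    last_status: int          # HTTP status seen on the previous crawl

def pick_queue(s: UrlSignals) -> str:
    # VUPs and stale 4xx URLs go to the legacy queue for rare re-checks
    if s.last_status >= 400 or (s.authority < 0.1 and s.change_frequency < 0.01):
        return "legacy"
    # VIPs: highly authoritative and/or fast-changing pages
    if s.authority > 0.8 or s.change_frequency > 1.0:
        return "priority"
    # Everything else: the regular queue does the heavy lifting
    return "regular"

print(pick_queue(UrlSignals("https://example.com/news/", 0.9, 5.0, 200)))      # priority
print(pick_queue(UrlSignals("https://example.com/old-page", 0.05, 0.0, 404)))  # legacy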

Slide 14

It’s probably more complicated
[Diagram: the Priority, Regular, and Legacy crawl queues interlinked with the RAM, SSD, and HDD storage layers]

Slide 15

Crawl Sources
• Site crawl
• Feeds & XML sitemaps
• Inbound links
• DNS records
• Domain registrations
• Browsing data?

Slide 16

URLs are Sacred
• Search engines don’t crawl, index, or rank pages or content…
• They crawl, index, and rank URLs.
• One piece of content = one URL
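Since one piece of content should live at exactly one URL, collapsing URL variants is a practical corollary. A minimal Python sketch, assuming a hypothetical set of tracking parameters worth stripping:

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Example tracking parameters; adjust for your own site
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid"}

def normalise(url: str) -> str:
    scheme, netloc, path, query, _ = urlsplit(url)
    path = path.rstrip("/") or "/"  # treat /page and /page/ as the same URL
    kept = [(k, v) for k, v in parse_qsl(query) if k not in TRACKING_PARAMS]
    return urlunsplit((scheme, netloc.lower(), path, urlencode(sorted(kept)), ""))

print(normalise("HTTPS://Example.com/page/?utm_source=x&id=2"))
# https://example.com/page?id=2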

Slide 17

Crawlable Links
https://developers.google.com/search/docs/crawling-indexing/links-crawlable
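The core rule in the linked doc is that Google can only reliably follow <a> tags with a resolvable href. A small stdlib-only sketch (the markup examples are invented) showing what a link extractor can and cannot see:

from html.parser import HTMLParser

HTML = '''
<a href="/good-page">Crawlable</a>
<span onclick="location='/bad-page'">Not crawlable</span>
<a onclick="goTo('/also-bad')">No href, so nothing to follow</a>
'''

class LinkExtractor(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                print("found link:", href)

LinkExtractor().feed(HTML)  # prints only: found link: /good-page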

Slide 18

Crawl Management
• Robots.txt Disallow;
➢ Strongest crawl management signal
➢ Evaporates crawl budget
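As a sketch of how a Disallow rule is evaluated, Python's stdlib robots.txt parser can be used; the paths below are invented examples:

from urllib.robotparser import RobotFileParser

ROBOTS_TXT = '''\
User-agent: *
Disallow: /internal-search/
Disallow: /basket/
'''

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Blocked: the strongest crawl management signal
print(rp.can_fetch("Googlebot", "https://www.example.com/internal-search/?q=seo"))  # False
# Not blocked: crawling proceeds as normal
print(rp.can_fetch("Googlebot", "https://www.example.com/products/widget"))  # True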

Slide 19

https://developers.google.com/search/docs/crawling-indexing/large-site-managing-crawl-budget

Slide 20

Don't use robots.txt to temporarily reallocate crawl budget for other pages; use robots.txt to block pages or resources that you don't want Google to crawl at all. Google won't shift this newly available crawl budget to other pages unless Google is already hitting your site's serving limit.
https://developers.google.com/search/docs/crawling-indexing/large-site-managing-crawl-budget

Slide 21

No content

Slide 22

Robots.txt prevents crawling…
… but not indexing!
• Links to blocked URLs are still crawled
• Their anchor texts carry relevancy for indexing

Slide 23

No content

Slide 24

Crawl Management vs Index Management
• Canonicals & noindex are NOT crawl management;
➢ Google needs to see meta tags before it can act on them
➢ That means Googlebot still crawls those URLs
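The point of this slide, in code form: noindex and canonical live inside the fetched HTML, so a crawl request has to be spent before Google can act on them. A stdlib sketch with invented markup:

from html.parser import HTMLParser

HTML = '''
<head>
  <meta name="robots" content="noindex">
  <link rel="canonical" href="https://www.example.com/original-page">
</head>
'''

class IndexDirectives(HTMLParser):
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name") == "robots":
            print("robots meta:", a.get("content"))
        if tag == "link" and a.get("rel") == "canonical":
            print("canonical:", a.get("href"))

# Neither directive is visible until the HTML has been fetched, i.e. crawled
IndexDirectives().feed(HTML)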

Slide 25

What about ‘rel=nofollow’?
https://developers.google.com/search/blog/2019/09/evolving-nofollow-new-ways-to-identify

Slide 26

What about ‘rel=nofollow’?
All the link attributes—sponsored, ugc, and nofollow—are treated as hints about which links to consider or exclude within Search.
https://developers.google.com/search/blog/2019/09/evolving-nofollow-new-ways-to-identify

Slide 27

Optimise Crawling
• Server Response Time

Slide 28

Optimise Crawling
• Server Response Time
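A rough way to spot-check time to first byte, which is one input into how fast Googlebot is willing to crawl a host. Stdlib-only sketch; the URL is a placeholder, and a real audit would sample many URLs:

import time
from urllib.request import urlopen

url = "https://www.example.com/"
start = time.perf_counter()
with urlopen(url) as resp:
    resp.read(1)  # wait for the first byte of the body
    ttfb = time.perf_counter() - start
    print(f"{resp.status} in {ttfb * 1000:.0f} ms")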

Slide 29

GSC Crawl Stats

Slide 30

Page Resource Load

Slide 31

Page Resources

Slide 32

Multiple Hostnames

Slide 33

No content

Slide 34

Crawl Budget Per Hostname
• Every hostname has its own allocation of crawl budget;
➢ ‘www.website.com’ is crawled independently from ‘cdn.website.com’
• Offload page resources to a subdomain;
➢ Frees up crawl budget on your main domain

Slide 35

Googlebot & AdsBot

Slide 36

AdsBot does not obey ‘User-Agent: *’
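Google documents that its AdsBot crawlers ignore the wildcard group in robots.txt, so they must be addressed by name. A minimal robots.txt sketch, with invented paths:

User-agent: *
Disallow: /internal-search/

# AdsBot skips the group above and needs its own rules
User-agent: AdsBot-Google
Disallow: /internal-search/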

Slide 37

Sitewide Changes

Slide 38

Sitewide Changes
• Googlebot detects large-scale changes on a website;
➢ Crawl rate will temporarily be increased
➢ Ensures the changes are rapidly reflected in the index
• Ensure your hosting can handle the increased crawl rate;
➢ Temporarily increase server capacity after a sitewide change
➢ Sitewide changes include:
- Redesign
- Site migrations
- Large numbers of new URLs

Slide 39

https://developers.google.com/search/docs/crawling-indexing/http-network-errors

Slide 40

Optimise Crawling; 3xx

Slide 41

Optimise Crawling; 3xx
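One way to surface redirect chains, since every extra 3xx hop costs Googlebot a fetch before it reaches the content. A stdlib sketch; the starting URL is hypothetical:

from urllib.request import build_opener, HTTPRedirectHandler

class ChainLogger(HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Log each hop in the chain before following it
        print(f"{code}: {req.full_url} -> {newurl}")
        return super().redirect_request(req, fp, code, msg, headers, newurl)

opener = build_opener(ChainLogger())
resp = opener.open("http://www.example.com/old-path")  # hypothetical URL
print("final:", resp.status, resp.url)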

Slide 42

GSC hack: Inspect URL follows redirects

Slide 43

Optimise Crawling; 4xx

Slide 44

Optimise Crawling; 5xx

Slide 45

Summarised

Slide 46

Manage Googlebot’s Greed
• ALL web resources are crawled by Googlebot;
➢ Not just HTML pages
➢ Reduce the number of HTTP requests per page
• Link equity (PageRank) impacts crawling;
➢ More link value = more crawling
➢ Elevate key pages to VIPs
• Each hostname has its own crawl budget;
➢ Offload resources to subdomains
• AdsBot can use up crawl requests;
➢ Double-check your Google Ads campaigns
• Serve correct HTTP status codes;
➢ Googlebot will adapt accordingly

Slide 47

Thank You
[email protected]
https://www.linkedin.com/in/barryadams/
https://www.SEOforGoogleNews.com/