Crawling and Indexing (and Ranking) in Google in 2024

#Pubcon Crawling and Indexing (and Ranking) in Google
Barry Adams, March 2024

#Pubcon Barry Adams
➢ Doing SEO since 1998
➢ Specialist in SEO for News Publishers
➢ Newsletter: SEOforGoogleNews.com
➢ Co-founder of the News & Editorial SEO Summit

#Pubcon I’ve Worked With…

#Pubcon How does Google work?

#Pubcon Google’s Three Main Processes
• Crawling
• Indexing
• Ranking

#Pubcon Technical SEO
• Crawling
• Indexing

#Pubcon 1. Crawling

#Pubcon Three ‘layers’ of Googlebot?

#Pubcon Three ‘layers’ of Googlebot
1. Priority crawl queue
2. Regular crawl queue
3. Legacy content crawl queue

#Pubcon Priority Crawl Queue
• Crawls VIPs
  ➢ Very Important Pages: webpages that have a high change frequency and/or are seen as highly authoritative
    - News website homepages & key section pages
    - Highly volatile classified portals (jobs, properties)
    - Large volume ecommerce (Amazon, eBay, Etsy)
• Main purpose = discovery of valuable new content
  ➢ i.e., news articles
• Rarely re-crawls newly discovered URLs
  ➢ New URLs can become VIPs over time

#Pubcon Regular Crawl Queue
• Google’s main crawler
  ➢ Does most of the hard work
  ➢ Less frantic
    - More time for crawl selection, de-duplication, sanitisation of the queue

#Pubcon Legacy Content Crawl Queue
• Crawls VUPs
  ➢ Very Unimportant Pages: URLs that have very little link value and/or are very rarely updated
  ➢ Recrawls URLs that serve 4XX errors
    - Likely also occasionally checks old redirects

#Pubcon Crawlable Links
https://developers.google.com/search/docs/crawling-indexing/links-crawlable
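
The linked Google documentation describes a crawlable link as an <a> element with an href attribute. A minimal sketch of that rule in Python follows; the class name, the sample markup, and the choice to skip javascript: URLs are illustrative assumptions, not Google's implementation.

```python
from html.parser import HTMLParser


class CrawlableLinkParser(HTMLParser):
    """Collects href values from <a> elements only (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return  # <span href=...>, <div onclick=...> etc. are not links
        href = dict(attrs).get("href")
        # Assumption: javascript: pseudo-URLs are treated as not crawlable.
        if href and not href.lower().startswith("javascript:"):
            self.links.append(href)


def crawlable_links(html):
    parser = CrawlableLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


sample = (
    '<a href="/news/">News</a>'
    '<a onclick="goTo(\'sport\')">Sport</a>'
    '<span href="/jobs/">Jobs</span>'
    '<a href="javascript:goTo(\'tech\')">Tech</a>'
)
print(crawlable_links(sample))
```

Only the first link in the sample has a plain href on an <a> element, so it is the only one this sketch reports.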

#Pubcon Crawl Management
• Robots.txt Disallow
  ➢ Strongest crawl management signal
  ➢ Evaporates crawl budget

#Pubcon https://developers.google.com/search/docs/crawling-indexing/large-site-managing-crawl-budget

#Pubcon Don't use robots.txt to temporarily reallocate crawl budget for other pages; use robots.txt to block pages or resources that you don't want Google to crawl at all. Google won't shift this newly available crawl budget to other pages unless Google is already hitting your site's serving limit.
https://developers.google.com/search/docs/crawling-indexing/large-site-managing-crawl-budget

#Pubcon Robots.txt prevents crawling… but not indexing!
• Links to blocked URLs are still crawled
• Their anchor texts carry relevancy for indexing
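
A robots.txt Disallow rule only answers the question "may this URL be fetched?". A small sketch using Python's standard urllib.robotparser shows that check; the rules and example.com URLs are made up for illustration, and the standard-library parser is not Google's own matcher.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
RULES = """\
User-agent: *
Disallow: /internal-search/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def may_crawl(url, agent="Googlebot"):
    """Return True if the parsed rules allow `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


print(may_crawl("https://example.com/internal-search/shoes"))
print(may_crawl("https://example.com/news/"))
```

The check says nothing about indexing: a blocked URL can still be discovered through links pointing at it, which is the point of the slide.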

#Pubcon Crawl Management vs Index Management
• Canonicals & noindex are NOT crawl management
  ➢ Google needs to see meta tags before it can act on them
  ➢ That means Googlebot still crawls those URLs
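
To illustrate why canonicals and noindex cannot save crawl requests: the directives live inside the HTML, so the page has to be fetched and parsed before they can be read. A minimal sketch, assuming the HTML has already been downloaded (the class and function names are hypothetical):

```python
from html.parser import HTMLParser


class IndexDirectiveParser(HTMLParser):
    """Reads meta robots and rel=canonical from already-fetched HTML."""

    def __init__(self):
        super().__init__()
        self.noindex = False
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = (attrs.get("content") or "").lower()
            self.noindex = self.noindex or "noindex" in content
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")


def read_directives(html):
    """Return (noindex, canonical_url) found in the given HTML string."""
    parser = IndexDirectiveParser()
    parser.feed(html)
    parser.close()
    return parser.noindex, parser.canonical


page = (
    '<html><head><meta name="robots" content="noindex, follow">'
    '<link rel="canonical" href="https://example.com/a"></head>'
    "<body></body></html>"
)
print(read_directives(page))
```

The function needs the full HTML as input, which is exactly the crawl request that a robots.txt Disallow would have avoided.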

#Pubcon Optimise Crawling
• Server Response Time

#Pubcon GSC Crawl Stats

#Pubcon Page Resource Load

#Pubcon Page Resources

#Pubcon Googlebot & AdsBot

#Pubcon AdsBot does not obey ‘User-Agent: *’
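
Because AdsBot ignores the wildcard group, a robots.txt file has to name it explicitly to restrict it. The sketch below checks a hypothetical robots.txt for an explicit AdsBot-Google group. Note that Python's urllib.robotparser applies the * group to every user agent, so it does not model AdsBot's behaviour; it is used here only to confirm the explicit group blocks the path.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with an explicit AdsBot-Google group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/

User-agent: AdsBot-Google
Disallow: /search/
"""


def names_agent_explicitly(robots_txt, agent):
    """Return True if a User-agent line names `agent` itself, not just '*'."""
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "user-agent":
            if value.strip().lower() == agent.lower():
                return True
    return False


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())
adsbot_blocked = not parser.can_fetch(
    "AdsBot-Google", "https://example.com/search/shoes"
)

print(names_agent_explicitly(ROBOTS_TXT, "AdsBot-Google"), adsbot_blocked)
```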

#Pubcon Optimise Crawling
• Serve correct HTTP status codes
  ➢ 200 OK
  ➢ 3xx Redirects
  ➢ 4xx Errors
    - 429 Too Many Requests
  ➢ 5xx Errors

#Pubcon https://developers.google.com/search/docs/crawling-indexing/http-network-errors
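
As a rough summary of the linked documentation, the status code classes can be mapped to how a crawler is likely to react. This is a simplified sketch; the function name and the return labels are my own shorthand, not terms from Google.

```python
def crawl_signal(status):
    """Map an HTTP status code to a simplified crawler reaction (sketch)."""
    # 429 is the exception among 4xx codes: it is treated like a server
    # error, i.e. a sign of overload, so it is checked before the 4xx range.
    if status == 429 or 500 <= status <= 599:
        return "slow crawling"
    if 200 <= status <= 299:
        return "process content"
    if 300 <= status <= 399:
        return "follow redirect"
    if 400 <= status <= 499:
        return "drop from index"
    return "unknown"


for code in (200, 301, 404, 429, 503):
    print(code, crawl_signal(code))
```

The ordering of the checks matters: testing for 429 first keeps it out of the generic 4xx branch.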

#Pubcon Optimise Crawling; 3xx

#Pubcon GSC hack: Inspect URL follows redirects
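
Redirect chains cost one crawl request per hop, so it can help to audit them offline. The sketch below walks a redirect map held in a plain dictionary (for example, exported from a crawl) instead of making live requests. The map, the function name, and the default limit of 10 hops are assumptions for illustration.

```python
def resolve(url, redirects, max_hops=10):
    """Follow `redirects` from `url` and return the full list of URLs visited.

    Raises ValueError on a redirect loop or when the chain exceeds max_hops.
    """
    chain = [url]
    while url in redirects:
        url = redirects[url]
        if url in chain:
            raise ValueError("redirect loop at " + url)
        chain.append(url)
        if len(chain) - 1 > max_hops:
            raise ValueError("more than %d hops" % max_hops)
    return chain


# Hypothetical redirect map: source URL -> target URL.
redirect_map = {"/old": "/interim", "/interim": "/new"}
print(resolve("/old", redirect_map))
```

A chain with more than two entries means an internal link or redirect rule could point straight at the final URL.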

#Pubcon Optimise Crawling
• ALL web resources are crawled by Googlebot
  ➢ Not just HTML pages
  ➢ Reduce the number of HTTP requests per page
• AdsBot can use up crawl requests
  ➢ Double-check your Google Ads campaigns
• Link equity (PageRank) impacts crawling
  ➢ More link value = more crawling
  ➢ Elevate key pages to VIPs
• Serve correct HTTP status codes
  ➢ Googlebot will adapt accordingly
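
Since every page resource is a separate request, a quick count of the resources referenced in a page's HTML gives a first impression of its request load. This is a minimal sketch; the set of tags it checks is an assumption and ignores resources loaded later by JavaScript or CSS.

```python
from html.parser import HTMLParser


class ResourceCounter(HTMLParser):
    """Collects URLs of scripts, images, iframes and stylesheets (sketch)."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "img", "iframe") and attrs.get("src"):
            self.resources.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            if "stylesheet" in (attrs.get("rel") or "").lower():
                self.resources.append(attrs["href"])


def page_resources(html):
    counter = ResourceCounter()
    counter.feed(html)
    counter.close()
    return counter.resources


html_doc = (
    '<head><link rel="stylesheet" href="/a.css">'
    '<script src="/a.js"></script></head>'
    '<body><img src="/hero.jpg"><script>inline()</script></body>'
)
print(len(page_resources(html_doc)), page_resources(html_doc))
```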

#Pubcon 2. Indexing

#Pubcon Indexing
• HTML lexer
  ➢ Cleaning & tokenising the HTML
• Index selection
  ➢ De-duplication prior to indexing
• Indexing
  ➢ First-pass based on HTML
  ➢ Potential rendering (not guaranteed)
• Index integrity
  ➢ Canonicalisation & de-duplication

#Pubcon Two Stages* of Indexing
[Diagram: Crawler, Indexer (stages 1 and 2), Ranker]
* There are MANY more; indexing is a collection of interconnected processes

#Pubcon Rendering

#Pubcon Evergreen Googlebot

#Pubcon Possible Rendering Issues in GSC

#Pubcon Rendering Issues
• Inaccessible Resources
  ➢ Make sure all page resources can be crawled

#Pubcon Rendering Issues
• JavaScript inserts invalid HTML in the <head>
  ➢ <body> tags in the <head> break Google’s processing of meta tags

#Pubcon https://developers.google.com/search/docs/crawling-indexing/valid-page-metadata
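
The linked documentation lists the elements that are valid inside <head> and notes that Google stops reading the <head> once it meets an invalid one. A sketch of a checker built on that idea follows; the class name is hypothetical, and Python's tokenising HTMLParser is only a stand-in for a real browser parser.

```python
from html.parser import HTMLParser

# Elements the Google documentation lists as valid inside <head>.
VALID_HEAD_TAGS = {
    "title", "meta", "link", "script", "style", "base", "noscript", "template",
}


class HeadChecker(HTMLParser):
    """Finds the first invalid tag in <head> and the tags that follow it."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.first_invalid = None
        self.tags_after_invalid = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif self.in_head:
            if self.first_invalid is not None:
                # These tags risk being ignored as head metadata.
                self.tags_after_invalid.append(tag)
            elif tag not in VALID_HEAD_TAGS:
                self.first_invalid = tag

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def check_head(html):
    checker = HeadChecker()
    checker.feed(html)
    checker.close()
    return checker.first_invalid, checker.tags_after_invalid


broken = (
    '<head><title>t</title><div class="banner"></div>'
    '<meta name="robots" content="noindex"></head><body></body>'
)
print(check_head(broken))
```

In the sample, the injected <div> comes before the meta robots tag, so that tag is at risk of not being processed. Run the check against the rendered HTML, since the slide is about tags inserted by JavaScript.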

#Pubcon Rendering Issues
• HTML vs Render mismatch
  ➢ Different content in raw HTML vs fully rendered page

#Pubcon https://chrome.google.com/webstore/detail/view-rendered-source/ejgngohbdedoabanmclafpkoogegdpob
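
The same comparison can be scripted once both versions of a page are saved, for example the raw server response and the rendered DOM copied from a browser. The sketch below lists text that appears only after rendering; the helper names and sample markup are illustrative assumptions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text chunks, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html):
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.chunks


def only_in_render(raw_html, rendered_html):
    """Return text chunks present in the rendered page but not in raw HTML."""
    raw = set(visible_text(raw_html))
    return [chunk for chunk in visible_text(rendered_html) if chunk not in raw]


raw_page = '<body><h1>Title</h1><div id="app"></div><script>render()</script></body>'
rendered_page = '<body><h1>Title</h1><div id="app"><p>Article body</p></div></body>'
print(only_in_render(raw_page, rendered_page))
```

Anything this reports depends on rendering to be seen, which is why the talk recommends not relying on it.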

#Pubcon Google Tools *ALWAYS* Render

#Pubcon Optimise Indexing
• Don’t rely on Google’s rendering
  ➢ Use Server Side Rendering & CDN caching
• Reduce page resources
  ➢ Fewer page resources =
    - more efficient crawling
    - faster load speed & CWV
    - less chance of rendering issues
• Improve internal linking
  ➢ More PageRank = higher chance of indexing

#Pubcon What about the Index itself?

#Pubcon Three Crawlers… Three Indices?
Crawlers: Priority crawler, Regular crawler, Legacy content crawler
Storage: RAM storage, SSD storage, HDD storage

#Pubcon Three Layers of Index Storage
1. RAM storage
  ➢ Pages that need to be served quickly and frequently
    - Includes news articles but also popular content
2. SSD storage
  ➢ Pages that are regularly served in SERPs but aren’t super popular
3. HDD storage
  ➢ Pages that are rarely (if ever) served in SERPs

#Pubcon It’s probably more complicated
Crawlers: Priority crawler, Regular crawler, Legacy content crawler
Storage: RAM storage, SSD storage, HDD storage

#Pubcon 3. Ranking

#Pubcon Search Intent – first BERT, then MUM

#Pubcon https://searchengineland.com/how-google-search-ranking-works-pandu-nayak-435395

#Pubcon Turns out…
(Slide from Trial Exhibit UPX0203)

#Pubcon It’s clicks after all!
(Slide from Trial Exhibit UPX0228)

#Pubcon Google is all about Keywords, Links, and Clicks
1. Keywords
  ➢ Allow Google to understand what your content should rank for
2. Links
  ➢ Get you onto the 1st page of Google
3. Clicks
  ➢ Determine whether you stay there (and rise) or drop off

#Pubcon Can You Cheat This? Yes. For a while.

#Pubcon Building Long-Term Value
1. Create content optimised for ranking
  ➢ Use keywords in the right places
2. Make this content link-worthy
  ➢ And keep making more of it
3. Have a website people like engaging with
  ➢ Good UX and all that jazz

#Pubcon Building Long-Term Value
SEO as usual

#Pubcon Thank You
[email protected]
https://www.linkedin.com/in/barryadams/
https://www.SEOforGoogleNews.com/
