
Crawling and Indexing (and Ranking) in Google in 2024


Slides from my presentation at Pubcon 2024 where I look into the processes involved in crawling, indexing, and ranking pages in Google search.

Barry Adams

March 06, 2024



  1. #Pubcon Barry Adams
     ➢ Doing SEO since 1998
     ➢ Specialist in SEO for News Publishers
     ➢ Newsletter: SEOforGoogleNews.com
     ➢ Co-founder of the News & Editorial SEO Summit
  2. Three ‘layers’ of Googlebot
     1. Priority crawl queue
     2. Regular crawl queue
     3. Legacy content crawl queue
  3. Priority Crawl Queue
     • Crawls VIPs
       ➢ Very Important Pages: webpages that have a high change frequency and/or are seen as highly authoritative
         - News website homepages & key section pages
         - Highly volatile classified portals (jobs, properties)
         - Large-volume ecommerce (Amazon, eBay, Etsy)
     • Main purpose = discovery of valuable new content
       ➢ i.e. news articles
     • Rarely re-crawls newly discovered URLs
       ➢ New URLs can become VIPs over time
  4. Regular Crawl Queue
     • Google’s main crawler
       ➢ Does most of the hard work
       ➢ Less frantic: more time for crawl selection, de-duplication, and sanitisation of the queue
  5. Legacy Content Crawl Queue
     • Crawls VUPs
       ➢ Very Unimportant Pages: URLs that have very little link value and/or are very rarely updated
       ➢ Recrawls URLs that serve 4XX errors
         - Likely also occasionally checks old redirects
  6. Don't use robots.txt to temporarily reallocate crawl budget for other pages; use robots.txt to block pages or resources that you don't want Google to crawl at all. Google won't shift this newly available crawl budget to other pages unless Google is already hitting your site's serving limit.
     https://developers.google.com/search/docs/crawling-indexing/large-site-managing-crawl-budget
  7. Robots.txt prevents crawling… but not indexing!
     • Links to blocked URLs are still crawled
     • Their anchor texts carry relevancy for indexing
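As an illustration of the intended use described above, a robots.txt that permanently blocks crawling of low-value URL patterns (the site paths here are hypothetical examples, not from the deck):

```
# Hypothetical example: block internal search and parameter URLs
# that should never be crawled.
# Note: this prevents crawling, not indexing — a blocked URL can
# still be indexed based on anchor text of links pointing at it.
User-agent: Googlebot
Disallow: /search/
Disallow: /*?sort=
```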
  8. Crawl Management vs Index Management
     • Canonicals & noindex are NOT crawl management
       ➢ Google needs to see meta tags before it can act on them
       ➢ That means Googlebot still crawls those URLs
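To make the distinction concrete: both of these index-management signals live inside the page's HTML, so Googlebot must fetch (crawl) the URL before it can act on either (the URLs below are illustrative):

```html
<head>
  <!-- Canonical and noindex only take effect AFTER the page is crawled -->
  <link rel="canonical" href="https://example.com/preferred-url/">
  <meta name="robots" content="noindex">
</head>
```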
  9. Optimise Crawling
     • Serve correct HTTP status codes
       ➢ 200 OK
       ➢ 3xx Redirects
       ➢ 4xx Errors
         - 429 Too Many Requests
       ➢ 5xx Errors
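The 429 called out above is the status code for rate limiting: when Googlebot crawls faster than your servers can handle, a response like this (a sketch of a raw HTTP response, with an optional Retry-After header) signals it to back off:

```
HTTP/1.1 429 Too Many Requests
Retry-After: 3600
Content-Type: text/html

<html><body>Too many requests. Please retry later.</body></html>
```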
  10. Optimise Crawling
      • ALL web resources are crawled by Googlebot
        ➢ Not just HTML pages
        ➢ Reduce the number of HTTP requests per page
      • AdsBot can use up crawl requests
        ➢ Double-check your Google Ads campaigns
      • Link equity (PageRank) impacts crawling
        ➢ More link value = more crawling
        ➢ Elevate key pages to VIPs
      • Serve correct HTTP status codes
        ➢ Googlebot will adapt accordingly
  11. Indexing
      • HTML lexer
        ➢ Cleaning & tokenising the HTML
      • Index selection
        ➢ De-duplication prior to indexing
      • Indexing
        ➢ First pass based on HTML
        ➢ Potential rendering (not guaranteed)
      • Index integrity
        ➢ Canonicalisation & de-duplication
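A toy sketch of the "cleaning & tokenising" step, using Python's stdlib `HTMLParser` — purely an illustration of the concept, not Google's actual lexer:

```python
from html.parser import HTMLParser

class TextTokeniser(HTMLParser):
    """Toy lexer: strips markup and collects lowercase word tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_data(self, data):
        # Keep only alphanumeric word tokens, lowercased.
        self.tokens.extend(w.lower() for w in data.split() if w.isalnum())

t = TextTokeniser()
t.feed("<html><head><title>Hello</title></head>"
       "<body><p>Hello World</p></body></html>")
print(t.tokens)  # → ['hello', 'hello', 'world']
```

A real indexer would also handle entities, scripts, positions, and language detection; this only shows why tag soup must be cleaned before terms can be indexed.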
  12. Two Stages* of Indexing
      [Diagram: Crawler →(1) Indexer →(2) Ranker]
      * There are MANY more; indexing is a collection of interconnected processes
  13. Rendering Issues
      • JavaScript inserts invalid HTML in the <head>
        ➢ <body> tags in the <head> break Google’s processing of meta tags
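A hypothetical example of the failure mode above: when a script injects a body-only element into the `<head>`, the HTML parser implicitly closes the `<head>` at that point, so meta tags appearing after the injection are treated as body content and may be ignored:

```html
<head>
  <title>Example</title>
  <!-- Injected by a third-party script: an element only valid in <body> -->
  <img src="tracking-pixel.gif">
  <!-- The parser has now implicitly closed <head>; these tags end up
       in <body> and may not be processed as metadata -->
  <meta name="robots" content="max-image-preview:large">
  <link rel="canonical" href="https://example.com/page/">
</head>
```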
  14. Rendering Issues
      • HTML vs render mismatch
        ➢ Different content in the raw HTML vs the fully rendered page
  15. Optimise Indexing
      • Don’t rely on Google’s rendering
        ➢ Use Server Side Rendering & CDN caching
      • Reduce page resources
        ➢ Fewer page resources =
          - more efficient crawling
          - faster load speed & CWV
          - less chance of rendering issues
      • Improve internal linking
        ➢ More PageRank = higher chance of indexing
  16. Three Crawlers… Three Indices?
      [Diagram: Priority crawler → RAM storage; Regular crawler → SSD storage; Legacy content crawler → HDD storage]
  17. Three Layers of Index Storage
      1. RAM storage
         ➢ Pages that need to be served quickly and frequently
           - Includes news articles but also popular content
      2. SSD storage
         ➢ Pages that are regularly served in SERPs but aren’t super popular
      3. HDD storage
         ➢ Pages that are rarely (if ever) served in SERPs
  18. It’s probably more complicated
      [Diagram: Priority, Regular, and Legacy content crawlers each feeding into RAM, SSD, and HDD storage]
  19. Google is all about Keywords, Links, and Clicks
      1. Keywords
         ➢ Allow Google to understand what your content should rank for
      2. Links
         ➢ Get you onto the 1st page of Google
      3. Clicks
         ➢ Determine whether you stay there (and rise) or drop off
  20. Building Long-Term Value
      1. Create content optimised for ranking
         ➢ Use keywords in the right places
      2. Make this content link-worthy
         ➢ And keep making more of it
      3. Have a website people like engaging with
         ➢ Good UX and all that jazz