Slide 1

Slide 1 text

#Pubcon Crawling and Indexing (and Ranking) in Google
Barry Adams
March 2024

Slide 2

Slide 2 text

#Pubcon Barry Adams
➢ Doing SEO since 1998
➢ Specialist in SEO for News Publishers
➢ Newsletter: SEOforGoogleNews.com
➢ Co-founder of the News & Editorial SEO Summit

Slide 3

Slide 3 text

#Pubcon I’ve Worked With…

Slide 4

Slide 4 text

#Pubcon How does Google work?

Slide 5

Slide 5 text

#Pubcon Google’s Three Main Processes: Crawling, Indexing, Ranking

Slide 6

Slide 6 text

#Pubcon Technical SEO: Crawling, Indexing

Slide 7

Slide 7 text

#Pubcon

Slide 8

Slide 8 text

#Pubcon 1. Crawling

Slide 9

Slide 9 text

#Pubcon Three ‘layers’ of Googlebot?

Slide 10

Slide 10 text

#Pubcon Three ‘layers’ of Googlebot
1. Priority crawl queue
2. Regular crawl queue
3. Legacy content crawl queue

Slide 11

Slide 11 text

#Pubcon Priority Crawl Queue
• Crawls VIPs
  ➢ Very Important Pages; webpages that have a high change frequency and/or are seen as highly authoritative
    - News website homepages & key section pages
    - Highly volatile classified portals (jobs, properties)
    - Large-volume ecommerce (Amazon, eBay, Etsy)
• Main purpose = discovery of valuable new content;
  ➢ i.e., news articles
• Rarely re-crawls newly discovered URLs;
  ➢ New URLs can become VIPs over time

Slide 12

Slide 12 text

#Pubcon Regular Crawl Queue
• Google’s main crawler;
  ➢ Does most of the hard work
  ➢ Less frantic;
    - More time for crawl selection, de-duplication, sanitisation of the queue

Slide 13

Slide 13 text

#Pubcon Legacy Content Crawl Queue
• Crawls VUPs;
  ➢ Very Unimportant Pages; URLs that have very little link value and/or are very rarely updated
  ➢ Recrawls URLs that serve 4XX errors
    - Likely also occasionally checks old redirects

Slide 14

Slide 14 text

#Pubcon Crawlable Links
https://developers.google.com/search/docs/crawling-indexing/links-crawlable

Slide 15

Slide 15 text

#Pubcon Crawl Management
• Robots.txt Disallow;
  ➢ Strongest crawl management signal
  ➢ Evaporates crawl budget

Slide 16

Slide 16 text

#Pubcon https://developers.google.com/search/docs/crawling-indexing/large-site-managing-crawl-budget

Slide 17

Slide 17 text

#Pubcon “Don't use robots.txt to temporarily reallocate crawl budget for other pages; use robots.txt to block pages or resources that you don't want Google to crawl at all. Google won't shift this newly available crawl budget to other pages unless Google is already hitting your site's serving limit.”
https://developers.google.com/search/docs/crawling-indexing/large-site-managing-crawl-budget

Slide 18

Slide 18 text

#Pubcon Robots.txt prevents crawling… but not indexing!
• Links to blocked URLs are still crawled
• Their anchor texts carry relevancy for indexing
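To make the crawl side of this concrete, here is a minimal sketch that checks whether a URL is blocked by a robots.txt file, using Python's standard `urllib.robotparser`. The robots.txt content and the example.com URLs are made up for illustration, and Python's parser is not Google's parser, so treat the result as an approximation. The check only answers "may this URL be fetched?"; as the slide says, a blocked URL can still end up indexed through links pointing at it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative site (example.com is a placeholder).
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Disallow: /cart/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/news/latest", "/search/?q=shoes", "/cart/"):
        url = "https://example.com" + path
        print(path, "crawlable:", is_crawl_allowed(ROBOTS_TXT, "Googlebot", url))
```

A disallowed result here says nothing about index status; that has to be checked separately, for example in Search Console.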

Slide 19

Slide 19 text

#Pubcon

Slide 20

Slide 20 text

#Pubcon Crawl Management vs Index Management
• Canonicals & noindex are NOT crawl management;
  ➢ Google needs to see meta tags before it can act on them
  ➢ That means Googlebot still crawls those URLs
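The point that these signals only exist after a fetch can be illustrated with a small sketch. It reads the robots meta tag and the canonical link out of an HTML string with Python's standard `html.parser`. The class name, function name, and sample markup are hypothetical, and the parsing is simplified compared with whatever Google actually does.

```python
from html.parser import HTMLParser


class IndexSignalParser(HTMLParser):
    """Collects index-management signals that live inside the HTML itself."""

    def __init__(self):
        super().__init__()
        self.noindex = False
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attr.get("name", "").lower() in ("robots", "googlebot"):
            directives = [d.strip().lower() for d in attr.get("content", "").split(",")]
            if "noindex" in directives or "none" in directives:
                self.noindex = True
        elif tag == "link" and "canonical" in attr.get("rel", "").lower().split():
            self.canonical = attr.get("href") or None


def index_signals(html: str) -> dict:
    """Return the noindex flag and canonical URL found in already-fetched HTML."""
    parser = IndexSignalParser()
    parser.feed(html)
    parser.close()
    return {"noindex": parser.noindex, "canonical": parser.canonical}


# Illustrative markup only; the URL is a placeholder.
SAMPLE_HTML = """<html><head>
<meta name="robots" content="noindex, follow">
<link rel="canonical" href="https://example.com/widgets/">
</head><body>...</body></html>"""

if __name__ == "__main__":
    print(index_signals(SAMPLE_HTML))
```

The function needs the HTML as input, which is exactly why neither signal saves a crawl request: the URL has to be fetched before the tags can be read.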

Slide 21

Slide 21 text

#Pubcon

Slide 22

Slide 22 text

#Pubcon Optimise Crawling • Server Response Time

Slide 23

Slide 23 text

#Pubcon Optimise Crawling • Server Response Time

Slide 24

Slide 24 text

#Pubcon GSC Crawl Stats

Slide 25

Slide 25 text

#Pubcon Page Resource Load

Slide 26

Slide 26 text

#Pubcon Page Resources

Slide 27

Slide 27 text

#Pubcon Googlebot & AdsBot

Slide 28

Slide 28 text

#Pubcon AdsBot does not obey ‘User-Agent: *’
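As a rough illustration of what this means for a robots.txt file, the sketch below models a crawler that only honours groups naming it explicitly. The parser is deliberately simplified (it reads only `User-agent` and `Disallow` lines and matches names exactly), the function names are hypothetical, and the file content is made up; it is not Google's matching logic.

```python
def parse_groups(robots_txt):
    """Return a list of (user_agents, disallow_paths) groups."""
    groups = []
    agents, rules, seen_rule = [], [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a user-agent line after rules starts a new group
                groups.append((agents, rules))
                agents, rules, seen_rule = [], [], False
            agents.append(value.lower())
        elif field == "disallow":
            seen_rule = True
            if value:
                rules.append(value)
    if agents:
        groups.append((agents, rules))
    return groups


def disallowed_paths(robots_txt, crawler, obeys_wildcard=True):
    """Paths the crawler is told not to fetch, under this simplified model."""
    crawler = crawler.lower()
    groups = parse_groups(robots_txt)
    named = [rules for agents, rules in groups if crawler in agents]
    if named:
        return [path for rules in named for path in rules]
    if obeys_wildcard:
        return [path for agents, rules in groups if "*" in agents for path in rules]
    return []


# Hypothetical robots.txt that only has a wildcard group.
WILDCARD_ONLY = """\
User-agent: *
Disallow: /internal-search/
"""

# Same rules, plus an explicit group for AdsBot.
WITH_ADSBOT_GROUP = WILDCARD_ONLY + """
User-agent: AdsBot-Google
Disallow: /internal-search/
"""

if __name__ == "__main__":
    print(disallowed_paths(WILDCARD_ONLY, "Googlebot"))
    print(disallowed_paths(WILDCARD_ONLY, "AdsBot-Google", obeys_wildcard=False))
    print(disallowed_paths(WITH_ADSBOT_GROUP, "AdsBot-Google", obeys_wildcard=False))
```

Under this model the wildcard-only file leaves AdsBot unrestricted, and the rule only takes effect once a group names AdsBot explicitly.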

Slide 29

Slide 29 text

#Pubcon Optimise Crawling
• Serve correct HTTP status codes;
  ➢ 200 OK
  ➢ 3xx Redirects
  ➢ 4xx Errors
    - 429 Too Many Requests
  ➢ 5xx Errors
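A compact way to think about these status classes is as a lookup from response code to the signal a crawler is likely to take from it. The sketch below is an assumption-laden simplification: the function name and labels are invented, and the grouping of 429 with server errors follows Google's HTTP status code documentation rather than anything stated on the slide.

```python
def crawl_signal(status: int) -> str:
    """Map an HTTP status code to a coarse, illustrative crawl signal."""
    if 200 <= status < 300:
        return "index_candidate"   # content can be passed on to indexing
    if 300 <= status < 400:
        return "follow_redirect"   # crawler moves on to the redirect target
    if status == 429 or 500 <= status < 600:
        return "slow_down"         # server looks overloaded; crawl rate drops
    if 400 <= status < 500:
        return "drop_from_index"   # content is treated as not existing
    return "unknown"


if __name__ == "__main__":
    for code in (200, 301, 404, 410, 429, 503):
        print(code, crawl_signal(code))
```

The practical consequence is that a 429 or 5xx used for routine situations (for example a soft "not found") tells the crawler to back off the whole site, not just that URL.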

Slide 30

Slide 30 text

#Pubcon https://developers.google.com/search/docs/crawling-indexing/http-network-errors

Slide 31

Slide 31 text

#Pubcon Optimise Crawling; 3xx

Slide 32

Slide 32 text

#Pubcon Optimise Crawling; 3xx

Slide 33

Slide 33 text

#Pubcon GSC hack: Inspect URL follows redirects

Slide 34

Slide 34 text

#Pubcon Optimise Crawling; 4xx

Slide 35

Slide 35 text

#Pubcon Optimise Crawling; 5xx

Slide 36

Slide 36 text

#Pubcon Optimise Crawling
• ALL web resources are crawled by Googlebot;
  ➢ Not just HTML pages
  ➢ Reduce the number of HTTP requests per page
• AdsBot can use up crawl requests;
  ➢ Double-check your Google Ads campaigns
• Link equity (PageRank) impacts crawling;
  ➢ More link value = more crawling
  ➢ Elevate key pages to VIPs
• Serve correct HTTP status codes;
  ➢ Googlebot will adapt accordingly
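To get a feel for how many extra requests a single page generates, a rough count of the resources referenced in its HTML can help. This sketch uses Python's standard `html.parser`; the class name, function name, and sample markup are hypothetical, and it only counts scripts, stylesheets, images, and iframes declared in the HTML, not resources loaded later by JavaScript or CSS.

```python
from collections import Counter
from html.parser import HTMLParser


class ResourceCounter(HTMLParser):
    """Counts external resources that a page's HTML asks a client to fetch."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        if tag == "script" and attr.get("src"):
            self.counts["script"] += 1
        elif (tag == "link" and attr.get("href")
              and "stylesheet" in attr.get("rel", "").lower().split()):
            self.counts["stylesheet"] += 1
        elif tag in ("img", "iframe") and attr.get("src"):
            self.counts[tag] += 1


def count_page_resources(html: str) -> dict:
    """Return a {resource_type: count} mapping for the given HTML string."""
    parser = ResourceCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.counts)


# Illustrative markup only.
SAMPLE_PAGE = """<html><head>
<link rel="stylesheet" href="/css/site.css">
<script src="/js/app.js"></script>
<script>console.log("inline, no extra request")</script>
</head><body>
<img src="/img/hero.jpg" alt="">
<img src="/img/logo.png" alt="">
</body></html>"""

if __name__ == "__main__":
    counts = count_page_resources(SAMPLE_PAGE)
    print(counts, "total extra requests:", sum(counts.values()))
```

Comparing the total across page templates is a quick way to spot which ones are the most expensive to crawl.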

Slide 37

Slide 37 text

#Pubcon 2. Indexing

Slide 38

Slide 38 text

#Pubcon Indexing
• HTML lexer;
  ➢ Cleaning & tokenising the HTML
• Index selection;
  ➢ De-duplication prior to indexing
• Indexing;
  ➢ First-pass based on HTML
  ➢ Potential rendering (not guaranteed)
• Index integrity;
  ➢ Canonicalisation & de-duplication

Slide 39

Slide 39 text

#Pubcon Two Stages* of Indexing
[Diagram: Crawler → Indexer → Ranker, with the two stages marked 1 and 2]
* There are MANY more; indexing is a collection of interconnected processes

Slide 40

Slide 40 text

#Pubcon Rendering

Slide 41

Slide 41 text

#Pubcon Evergreen Googlebot

Slide 42

Slide 42 text

#Pubcon Possible Rendering Issues in GSC

Slide 43

Slide 43 text

#Pubcon Rendering Issues
• Inaccessible Resources;
  ➢ Make sure all page resources can be crawled

Slide 44

Slide 44 text

#Pubcon Rendering Issues
• JavaScript inserts invalid HTML in the <head>;
  ➢ Invalid tags in the <head> break Google’s processing of meta tags

Slide 45

Slide 45 text

#Pubcon https://developers.google.com/search/docs/crawling-indexing/valid-page-metadata
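One way to check for this problem is to scan the rendered HTML for elements that do not belong in the head. The sketch below is a simplified check using Python's standard `html.parser`. The allowed-tag set reflects the elements listed as valid in the page-metadata documentation linked above, the class and function names are hypothetical, and the input is assumed to be the rendered DOM serialised to a string, since tags inserted by JavaScript will not appear in the raw source.

```python
from html.parser import HTMLParser

# Elements treated as valid inside <head> for this sketch.
VALID_HEAD_TAGS = {"title", "meta", "link", "script", "style",
                   "base", "noscript", "template"}


class HeadChecker(HTMLParser):
    """Records tags that appear between <head> and </head> but should not."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.invalid = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif self.in_head and tag not in VALID_HEAD_TAGS:
            self.invalid.append(tag)

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def invalid_head_tags(html: str) -> list:
    """Return the invalid tag names found inside the head, in document order."""
    checker = HeadChecker()
    checker.feed(html)
    checker.close()
    return checker.invalid


# Illustrative rendered markup: a script has injected a <div> before the meta tags.
RENDERED_HTML = """<html><head>
<title>Example</title>
<div id="consent-banner"></div>
<meta name="robots" content="noindex">
<link rel="canonical" href="https://example.com/page">
</head><body><p>Content</p></body></html>"""

if __name__ == "__main__":
    print(invalid_head_tags(RENDERED_HTML))
```

Any tag reported here is worth investigating, because metadata that follows it in the head may not be processed.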

Slide 46

Slide 46 text

#Pubcon Rendering Issues
• HTML vs Render mismatch;
  ➢ Different content in raw HTML vs fully rendered page
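A simple way to surface such a mismatch is to compare the visible text of the raw HTML with that of the rendered page. This sketch assumes both versions have already been captured as strings (for example from the server response and from a headless browser or the extension on the next slide); the names and sample markup are hypothetical, and the comparison is a plain text-chunk difference, not a full DOM diff.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text chunks, skipping script and style contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html: str) -> list:
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.chunks


def render_only_text(raw_html: str, rendered_html: str) -> list:
    """Text chunks present in the rendered page but missing from the raw HTML."""
    raw_chunks = set(visible_text(raw_html))
    return [chunk for chunk in visible_text(rendered_html) if chunk not in raw_chunks]


# Illustrative client-side rendered page: the article body only exists after rendering.
RAW = "<html><body><h1>Title</h1><div id='app'></div><script>render()</script></body></html>"
RENDERED = "<html><body><h1>Title</h1><div id='app'><p>Article body</p></div></body></html>"

if __name__ == "__main__":
    print(render_only_text(RAW, RENDERED))
```

Anything this returns is content that depends on rendering, which the first pass of indexing based on the raw HTML would not see.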

Slide 47

Slide 47 text

#Pubcon https://chrome.google.com/webstore/detail/view-rendered-source/ejgngohbdedoabanmclafpkoogegdpob

Slide 48

Slide 48 text

#Pubcon Google Tools *ALWAYS* Render

Slide 49

Slide 49 text

#Pubcon Optimise Indexing
• Don’t rely on Google’s rendering;
  ➢ Use Server Side Rendering & CDN caching
• Reduce page resources;
  ➢ Fewer page resources =
    - more efficient crawling
    - faster load speed & CWV
    - less chance of rendering issues
• Improve internal linking;
  ➢ More PageRank = higher chance of indexing

Slide 50

Slide 50 text

#Pubcon What about the Index itself?

Slide 51

Slide 51 text

#Pubcon Three Crawlers… Three Indices?
Crawlers: Priority crawler, Regular crawler, Legacy content crawler
Index storage: RAM storage, SSD storage, HDD storage

Slide 52

Slide 52 text

#Pubcon Three Layers of Index Storage
1. RAM storage;
  ➢ Pages that need to be served quickly and frequently
    - Includes news articles but also popular content
2. SSD storage;
  ➢ Pages that are regularly served in SERPs but aren’t super popular
3. HDD storage;
  ➢ Pages that are rarely (if ever) served in SERPs

Slide 53

Slide 53 text

#Pubcon It’s probably more complicated
Crawlers: Priority crawler, Regular crawler, Legacy content crawler
Index storage: RAM storage, SSD storage, HDD storage

Slide 54

Slide 54 text

#Pubcon 3. Ranking

Slide 55

Slide 55 text

#Pubcon Search Intent – first BERT, then MUM

Slide 56

Slide 56 text

#Pubcon

Slide 57

Slide 57 text

#Pubcon https://searchengineland.com/how-google-search-ranking-works-pandu-nayak-435395

Slide 58

Slide 58 text

#Pubcon Turns out… Slide from Trial Exhibit UPX0203

Slide 59

Slide 59 text

#Pubcon It’s clicks after all! Slide from Trial Exhibit UPX0228

Slide 60

Slide 60 text

#Pubcon Google is all about Keywords, Links, and Clicks
1. Keywords:
  ➢ Allow Google to understand what your content should rank for
2. Links:
  ➢ Get you onto the 1st page of Google
3. Clicks:
  ➢ Determine whether you stay there (and rise) or drop off

Slide 61

Slide 61 text

#Pubcon Can You Cheat This? Yes. For a while.

Slide 62

Slide 62 text

#Pubcon Building Long-Term Value
1. Create content optimised for ranking;
  ➢ Use keywords in the right places
2. Make this content link-worthy;
  ➢ And keep making more of it
3. Have a website people like engaging with;
  ➢ Good UX and all that jazz

Slide 63

Slide 63 text

#Pubcon Building Long-Term Value = SEO as usual

Slide 64

Slide 64 text

#Pubcon Thank You
[email protected]
https://www.linkedin.com/in/barryadams/
https://www.SEOforGoogleNews.com/