Slide 1

Slide 1 text

#NESS23 Technical SEO for Publishing Sites in 2023 Barry Adams Polemic Digital & SEOforGoogleNews.com

Slide 2

Slide 2 text

#NESS23 Advance Warning “The whole problem with the world is that fools and fanatics are always so certain of themselves and wiser people so full of doubts.” - Bertrand Russell

Slide 3

Slide 3 text

#NESS23

Slide 4

Slide 4 text

#NESS23 How does Google work?

Slide 5

Slide 5 text

#NESS23 Google’s Three Main Processes: Crawling, Indexing, Ranking

Slide 6

Slide 6 text

#NESS23 Technical SEO: Crawling, Indexing

Slide 7

Slide 7 text

#NESS23

Slide 8

Slide 8 text

#NESS23 Google’s model
[Diagram stages: Crawl Queue, Crawling, Processing, Render Queue, Rendering, Index; data labels: URL, HTML, Rendered HTML]

Slide 9

Slide 9 text

#NESS23 1. Crawling

Slide 10

Slide 10 text

#NESS23

Slide 11

Slide 11 text

#NESS23 Googlebot since 2011 https://developers.google.com/search/blog/2011/08/google-news-now-crawling-with-googlebot

Slide 12

Slide 12 text

#NESS23 Three ‘layers’ of Googlebot?
[Diagram: Crawl Queue and Crawling each shown three times, alongside Processing, Render Queue, Rendering, Index]

Slide 13

Slide 13 text

#NESS23 Three ‘layers’ of Googlebot
1. Realtime crawler
2. Regular crawler
3. Legacy content crawler

Slide 14

Slide 14 text

#NESS23 Realtime Crawler
• Crawls VIPs
  ➢ Very Important Pages: webpages that have a high change frequency and/or are seen as highly authoritative
    - News website homepages & key section pages
• Main purpose = discovery of valuable new content
  ➢ i.e., news articles
• Rarely re-crawls newly discovered URLs
  ➢ New URLs can become VIPs over time

Slide 15

Slide 15 text

#NESS23 Regular Crawler
• Google’s main crawler
  ➢ Does most of the hard work
  ➢ Probably the crawler that also fetches page resources

Slide 16

Slide 16 text

#NESS23 Legacy Content Crawler
• Crawls VUPs
  ➢ Very Unimportant Pages: URLs that have very little link value and/or are very rarely updated
  ➢ Recrawls URLs that serve 4XX errors
    - Likely also occasionally checks old redirects

Slide 17

Slide 17 text

#NESS23 Key Take-Away:
• Realtime Crawler crawls your article once
  ➢ It is then passed on to Regular Crawler
  ➢ Regular Crawler will visit the URL several hours later
  ➢ Any changes made after the first crawl are unlikely to be seen until then
  ➢ By then the story is not news anymore – the news cycle has moved on
• Consequence:
  ➢ You usually get one chance to rank in Google’s news ecosystem
  ➢ Get your SEO right before you click ‘Publish’
• Possible Exception: LiveBlogPosting articles

Slide 18

Slide 18 text

#NESS23 2. Indexing

Slide 19

Slide 19 text

#NESS23 Indexing and Rendering
[Diagram stages: Crawl Queue, Crawling, Processing, Render Queue, Rendering, Index; data labels: URL, HTML, Rendered HTML]

Slide 20

Slide 20 text

#NESS23 Indexing and Rendering
[Diagram stages: Crawl Queue, Crawling, Processing, Render Queue, Rendering, Index; data labels: URL, HTML]

Slide 21

Slide 21 text

#NESS23 Indexing and Rendering
Rendering takes time, and news doesn’t have time. Indexing is initially with raw HTML only.
[Diagram stages: Crawl Queue, Crawling, Processing, Render Queue, Rendering, Index; data labels: URL, HTML]

Slide 22

Slide 22 text

#NESS23

Slide 23

Slide 23 text

#NESS23 Rendering isn’t the only shortcut… Google wants publishers to noindex syndicated content. Because Google sucks at identifying duplicate content. At least, it can’t de-duplicate quickly.

Slide 24

Slide 24 text

#NESS23 Indexing is a multi-layered set of processes
[Diagram stages: Crawl Queue, Crawling, Processing (shown four times), Render Queue, Rendering, Index]

Slide 25

Slide 25 text

#NESS23 Known Indexing Processes
• HTML Lexer
  ➢ Tokenises HTML
• Parser
  ➢ Extracts content from HTML for indexing
• Canonicaliser
  ➢ Determines a URL’s canonical version
• Deduplicator
  ➢ Reduces the amount of identical content in the index
• Pageranker
  ➢ Calculates link value (FMA PageRank) for each URL
• Many, many more…

Slide 26

Slide 26 text

#NESS23 What about the Index itself?
[Diagram stages: Crawl Queue, Crawling, Processing, Render Queue, Rendering, Index (shown three times)]

Slide 27

Slide 27 text

#NESS23 Three Crawlers… Three Indices?
Crawlers: Realtime crawler, Regular crawler, Legacy content crawler
Storage: RAM storage, SSD storage, HDD storage

Slide 28

Slide 28 text

#NESS23 Three Layers of Index Storage
1. RAM storage
  ➢ Pages that need to be served quickly and frequently
    - Includes news articles but also popular content
2. SSD storage
  ➢ Pages that are regularly served in SERPs but aren’t super popular
3. HDD storage
  ➢ Pages that are rarely (if ever) served in SERPs

Slide 29

Slide 29 text

#NESS23 It’s probably more complicated
Crawlers: Realtime crawler, Regular crawler, Legacy content crawler
Storage: RAM storage, SSD storage, HDD storage

Slide 30

Slide 30 text

#NESS23 Key Take-Aways:
1. Make indexing easy for Googlebot
  ➢ Put all your critical content in the HTML source
  ➢ Don’t rely on rendering to load valuable content
2. There’s no such thing as a duplicate content penalty
  ➢ However, duplicate content on a single site means the site is competing with itself… and that’s stupid.
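The first take-away can be spot-checked without a browser. The sketch below is an illustration only: it takes the raw HTML of a page (as a crawler would receive it before rendering) and reports whether given phrases appear in the visible text. The function name `content_in_raw_html`, the class `RawTextExtractor`, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class RawTextExtractor(HTMLParser):
    """Collects text from raw HTML, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not part of a script or stylesheet.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def content_in_raw_html(raw_html, critical_phrases):
    """Map each phrase to True if it appears in the un-rendered HTML text."""
    parser = RawTextExtractor()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    return {phrase: phrase.lower() in text for phrase in critical_phrases}


# Hypothetical pages: one ships the article in the HTML, one loads it with JS.
server_rendered = (
    "<html><body><article><h1>Budget vote passes</h1>"
    "<p>Parliament approved the budget.</p></article></body></html>"
)
client_rendered = (
    "<html><body><div id='app'></div>"
    "<script>render('Budget vote passes')</script></body></html>"
)
```

Running `content_in_raw_html(client_rendered, ["Budget vote passes"])` reports the headline as missing, because it only exists inside a script until the page is rendered.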

Slide 31

Slide 31 text

#NESS23 Technical SEO in 2023

Slide 32

Slide 32 text

#NESS23 The Basics - Crawling
1. Efficient Crawling
  ➢ Server Response Time
    - Aim for 600ms or faster
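A rough way to check the 600ms target from the outside, sketched with the Python standard library only. This is an approximation: the timing includes DNS, connection, and TLS setup as well as the server's own processing, so it is an upper bound on server response time. The function names and the default timeout are assumptions made for this example.

```python
import time
import urllib.request

TARGET_SECONDS = 0.6  # the 600ms target from the slide


def time_to_first_byte(url, timeout=10.0):
    """Seconds from sending the request until the first body byte is read.

    Includes network setup time, so treat it as an upper bound on the
    server's own response time.
    """
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read(1)
    return time.perf_counter() - start


def meets_target(seconds, target=TARGET_SECONDS):
    """True if a measured response time is at or under the target."""
    return seconds <= target
```

Usage would look like `meets_target(time_to_first_byte(url))` for a URL of your own; a single sample is noisy, so repeating the measurement a few times gives a fairer picture.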

Slide 33

Slide 33 text

#NESS23 The Basics - Crawling
1. Efficient Crawling
  ➢ Server Response Time
  ➢ Clean URLs
    - Never use tracking parameters on internal links
      Bad: https://www.website.com/news/article-123?recommended=1
      Good: https://www.website.com/news/article-123
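One way to enforce clean internal links is to strip tracking parameters before the link is written into the page. A minimal sketch follows; the parameter list is hypothetical (only `recommended` comes from the slide's example) and would need to match whatever your own CMS or widgets append.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of tracking parameters; "recommended" is from the slide.
TRACKING_PARAMS = {"recommended", "utm_source", "utm_medium", "utm_campaign"}


def clean_internal_link(url, tracking_params=TRACKING_PARAMS):
    """Return the URL with tracking parameters removed and all else unchanged."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in tracking_params
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )
```

Parameters that change the content, such as a page number, are kept, so only the tracking noise is removed from the crawlable URL.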

Slide 34

Slide 34 text

#NESS23 The Basics - Crawling
1. Efficient Crawling
  ➢ Server Response Time
  ➢ Clean URLs
  ➢ Lightweight pages
    - Page resources consume crawl budget

Slide 35

Slide 35 text

#NESS23 The Basics - Crawling
1. Efficient Crawling
  ➢ Server Response Time
  ➢ Clean URLs
  ➢ Lightweight pages
  ➢ Pagination
    - Balance between paginated URLs and crawl waste

Slide 36

Slide 36 text

#NESS23 The Basics - Crawling
1. Efficient Crawling
  ➢ Server Response Time
  ➢ Clean URLs
  ➢ Lightweight pages
  ➢ Pagination
  ➢ Correct HTTP status codes
https://developers.google.com/search/docs/crawling-indexing/http-network-errors

Slide 37

Slide 37 text

#NESS23 The Basics - Crawling
1. Efficient Crawling
  ➢ Server Response Time
  ➢ Clean URLs
  ➢ Lightweight pages
  ➢ Pagination
  ➢ Correct HTTP status codes
  ➢ AdsBot can be unruly

Slide 38

Slide 38 text

#NESS23 The Basics - Indexing
2. Effortless indexing
  ➢ Semantic HTML
    - No client-side JS to load content

Slide 39

Slide 39 text

#NESS23 The Basics - Indexing
2. Effortless indexing
  ➢ Semantic HTML
  ➢ headlines
This is a bad way to code an article headline
This is a properly coded article headline
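The markup for the two headline examples did not survive in this transcript. As an assumption, the sketch below treats a "properly coded" headline as one wrapped in an `<h1>` element and a badly coded one as a styled `<div>`; both sample snippets and the names `H1Collector` and `headline_is_h1` are hypothetical.

```python
from html.parser import HTMLParser


class H1Collector(HTMLParser):
    """Collects the text content of every <h1> element in a document."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self._buffer = []
        self.h1_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self.h1_texts.append("".join(self._buffer).strip())
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self._buffer.append(data)


def headline_is_h1(html, headline):
    """True if the exact headline text is the content of an <h1> element."""
    collector = H1Collector()
    collector.feed(html)
    collector.close()
    return headline in collector.h1_texts


# Assumed markup for illustration; the slide's real markup is not available.
bad_markup = '<div class="headline">This is a bad way to code an article headline</div>'
good_markup = "<h1>This is a properly coded article headline</h1>"
```

A check like this could run in a CMS template test so that a redesign does not quietly swap the heading element for a generic container.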

Slide 40

Slide 40 text

#NESS23 The Basics - Indexing
2. Effortless indexing
  ➢ Semantic HTML
  ➢ headlines
  ➢ Clean HTML in

Slide 41

Slide 41 text

#NESS23 The Basics - Indexing
2. Effortless indexing
  ➢ Semantic HTML
  ➢ headlines
  ➢ Clean HTML in
  ➢ Uninterrupted HTML in article

Slide 42

Slide 42 text

#NESS23 The Basics - Indexing
2. Effortless indexing
  ➢ Semantic HTML
  ➢ headlines
  ➢ Clean HTML in
  ➢ Uninterrupted HTML in article
  ➢ Good structured data
    - NewsArticle for articles
    - Person for author pages
    - Keep it lean, don’t over-annotate
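A lean structured-data payload might look like the sketch below, which builds schema.org `NewsArticle` and `Person` objects as JSON-LD. The selection of properties is an assumption about what "lean" means here, not a list taken from the slides, and the function names and sample values are hypothetical.

```python
import json


def news_article_data(headline, url, published, modified, author_name, author_url):
    """Build a small schema.org NewsArticle object for an article page."""
    return {
        "@context": "https://schema.org",
        "@type": "NewsArticle",
        "headline": headline,
        "mainEntityOfPage": url,
        "datePublished": published,
        "dateModified": modified,
        "author": {"@type": "Person", "name": author_name, "url": author_url},
    }


def person_data(name, url):
    """Build a small schema.org Person object for an author page."""
    return {"@context": "https://schema.org", "@type": "Person", "name": name, "url": url}


def to_script_tag(data):
    """Serialise the object into a JSON-LD script tag for the HTML source."""
    # Escape "<" so a value containing "</script>" cannot close the tag early.
    payload = json.dumps(data, indent=2).replace("<", "\\u003c")
    return '<script type="application/ld+json">' + payload + "</script>"
```

Putting the resulting tag in the server-side HTML keeps it in line with the earlier point about not relying on rendering.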

Slide 43

Slide 43 text

#NESS23 Google used to be Deterministic
• Action A leads to ranking B
  ➢ Relatively simple crawling, indexing, and ranking systems
  ➢ Few websites, low competition
  ➢ Fairly predictable

Slide 44

Slide 44 text

#NESS23 Google today is Probabilistic
• Action A increases the probability of ranking B
  ➢ Massively complicated systems
  ➢ Intensely competitive web
  ➢ All SEO is geared towards maximising probabilities
    - But… 99% probability still means 1% chance of it not happening
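The 1% matters more than it looks once it is repeated across many articles. The sketch below is a simplification that assumes each article is an independent attempt with the same success probability, which real rankings are not; the function name is hypothetical.

```python
def chance_of_at_least_one_miss(success_probability, attempts):
    """Probability that at least one of several independent attempts fails."""
    return 1.0 - success_probability ** attempts
```

Under that assumption, a 99% success rate over 100 articles leaves roughly a 63% chance that at least one of them misses.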

Slide 45

Slide 45 text

#NESS23 What’s Changed in Tech SEO?

Slide 46

Slide 46 text

#NESS23

Slide 47

Slide 47 text

#NESS23 Stop AI!

Slide 48

Slide 48 text

#NESS23 Blocking LLMs
• Robots.txt Disallow Rules:
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /
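Before deploying rules like these, it helps to confirm that a standards-following parser reads them the way you intend. The sketch below feeds the rules to Python's built-in `urllib.robotparser`. It only shows how a compliant parser interprets the file, not whether any given crawler obeys it; the helper name `blocked_agents` and the test URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /
"""


def blocked_agents(robots_txt, agents, url):
    """Map each user agent to True if the rules disallow it from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, url) for agent in agents}
```

Checking a normal search crawler alongside the AI user agents is a useful sanity test: `blocked_agents(ROBOTS_TXT, ["GPTBot", "Googlebot"], url)` should block the first and leave the second alone.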

Slide 49

Slide 49 text

#NESS23 However…

Slide 50

Slide 50 text

#NESS23 https://www.gpp.io/news/how-to-block-genai-crawlers-such-as-googles-bard-or-openais-chatgpt-from-your-website-aD5S85s7E2CA

Slide 51

Slide 51 text

#NESS23 CWV: out with FID, in with INP

Slide 52

Slide 52 text

#NESS23 What is INP?

Slide 53

Slide 53 text

#NESS23 What is INP? https://web.dev/inp/
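For orientation, the sketch below applies the thresholds published on web.dev for Interaction to Next Paint: 200ms or less is rated good, above 500ms is rated poor, and the assessment uses the 75th percentile of field data. The function names, the nearest-rank percentile method, and the sample values are assumptions made for this example.

```python
import math


def percentile_75(samples_ms):
    """75th percentile of INP samples, using the nearest-rank method."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.75 * len(ordered))
    return ordered[rank - 1]


def rate_inp(inp_ms):
    """Rate an INP value in milliseconds against the web.dev thresholds."""
    if inp_ms <= 200:
        return "good"
    if inp_ms <= 500:
        return "needs improvement"
    return "poor"


# Hypothetical field samples, in milliseconds, one per page view.
field_samples = [120, 180, 240, 650]
```

With these samples, `rate_inp(percentile_75(field_samples))` rates the page as needing improvement even though half of the page views were comfortably fast.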

Slide 54

Slide 54 text

#NESS23 Sitemaps Ping is dead

Slide 55

Slide 55 text

#NESS23 Alternative: www.PubIndexAPI.com

Slide 56

Slide 56 text

#NESS23 Unambiguous Timestamps https://developers.google.com/search/docs/appearance/publication-dates

Slide 57

Slide 57 text

#NESS23 Unambiguous Timestamps https://developers.google.com/search/docs/appearance/publication-dates
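One reading of "unambiguous" is an ISO 8601 timestamp that always carries a time zone offset, which is what Google's publication-dates guidance asks for. The sketch below refuses to format a datetime that has no time zone attached; the function name and the example date are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def unambiguous_timestamp(moment):
    """Format a datetime as ISO 8601 with an explicit UTC offset.

    Raises ValueError for naive datetimes, since a timestamp without a
    time zone can be read as several different moments.
    """
    if moment.tzinfo is None or moment.utcoffset() is None:
        raise ValueError("naive datetime: attach a time zone before publishing")
    return moment.isoformat(timespec="seconds")


# Hypothetical publication time in a UTC+2 newsroom.
published = datetime(2023, 10, 19, 9, 30, tzinfo=timezone(timedelta(hours=2)))
```

Using the same string in the visible byline and in the `datePublished` structured data keeps the two sources from contradicting each other.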

Slide 58

Slide 58 text

#NESS23 Content Pruning

Slide 59

Slide 59 text

#NESS23 Content Pruning?

Slide 60

Slide 60 text

#NESS23 Topic Authority https://developers.google.com/search/blog/2023/05/understanding-news-topic-authority

Slide 61

Slide 61 text

#NESS23 Content Pruning v Topic Authority
• Your volume of (good) articles on a topic determines your topic authority
• Topic authority = good visibility for your stories
• Deleting old content could undermine your topic authority
• Only delete bad content
  ➢ Age and low traffic are not enough

Slide 62

Slide 62 text

#NESS23 Wrapping Up…

Slide 63

Slide 63 text

#NESS23 https://techpolicy.press/the-value-of-news-content-to-google-is-way-more-than-you-think/

Slide 64

Slide 64 text

#NESS23 www.SEOforGoogleNews.com

Slide 65

Slide 65 text

#NESS23 Thank You Q&A