Technical SEO for Publishing Sites in 2023

#NESS23 Technical SEO for Publishing Sites in 2023
Barry Adams, Polemic Digital & SEOforGoogleNews.com

Advance Warning
“The whole problem with the world is that fools and fanatics are always so certain of themselves and wiser people so full of doubts.” - Bertrand Russell

How does Google work?

Google’s Three Main Processes
• Crawling
• Indexing
• Ranking

Technical SEO
• Crawling
• Indexing

Google’s model
[Diagram: Crawl Queue, Crawling, Processing, Index, Render Queue, Rendering; labelled with URL, HTML, Rendered HTML]

1. Crawling

Googlebot since 2011
https://developers.google.com/search/blog/2011/08/google-news-now-crawling-with-googlebot

Three ‘layers’ of Googlebot?
[Diagram: Crawl Queue and Crawling each shown three times, alongside Processing, Render Queue, Rendering, Index]

Three ‘layers’ of Googlebot
1. Realtime crawler
2. Regular crawler
3. Legacy content crawler

Realtime Crawler
• Crawls VIPs
➢ Very Important Pages; Webpages that have a high change frequency and/or are seen as highly authoritative
  - News website homepages & key section pages
• Main purpose = discovery of valuable new content;
➢ i.e., news articles
• Rarely re-crawls newly discovered URLs;
➢ New URLs can become VIPs over time

Regular Crawler
• Google’s main crawler;
➢ Does most of the hard work
➢ Probably the crawler that also fetches page resources

Legacy Content Crawler
• Crawls VUPs
➢ Very Unimportant Pages; URLs that have very little link value and/or are very rarely updated
➢ Recrawls URLs that serve 4XX errors
  - Likely also occasionally checks old redirects

Key Take-Away:
• Realtime Crawler crawls your article once;
➢ It is then passed on to Regular Crawler
➢ Regular Crawler will visit the URL several hours later
➢ Any changes made after the first crawl are unlikely to be seen until then
➢ By then the story is not news anymore - the news cycle has moved on
• Consequence:
➢ You usually get one chance to rank in Google’s news ecosystem
➢ Get your SEO right before you click ‘Publish’
• Possible Exception: LiveBlogPosting articles

2. Indexing

Indexing and Rendering
[Diagram: Crawl Queue, Crawling, Processing, Index, Render Queue, Rendering; labelled with URL, HTML, Rendered HTML]

Indexing and Rendering
Rendering takes time, and news doesn’t have time.
Indexing is initially with raw HTML only.

Rendering isn’t the only shortcut…
Google wants publishers to noindex syndicated content.
Because Google sucks at identifying duplicate content.
At least, it can’t de-duplicate quickly.

Indexing is a multi-layered set of processes
[Diagram: Processing shown four times, alongside Crawl Queue, Crawling, Render Queue, Rendering, Index]

Known Indexing Processes
• HTML Lexer;
➢ Tokenises HTML
• Parser;
➢ Extracts content from HTML for indexing
• Canonicaliser;
➢ Determines a URL’s canonical version
• Deduplicator;
➢ Reduces the amount of identical content in the index
• Pageranker;
➢ Calculates link value (FMA PageRank) for each URL
• Many, many more…

What about the Index itself?
[Diagram: Index shown three times, alongside Crawl Queue, Crawling, Processing, Render Queue, Rendering]

Three Crawlers… Three Indices?
[Diagram: Realtime crawler, Regular crawler, Legacy content crawler; RAM storage, SSD storage, HDD storage]

Three Layers of Index Storage
1. RAM storage;
➢ Pages that need to be served quickly and frequently
  - Includes news articles but also popular content
2. SSD storage;
➢ Pages that are regularly served in SERPs but aren’t super popular
3. HDD storage;
➢ Pages that are rarely (if ever) served in SERPs

It’s probably more complicated
[Diagram: Realtime crawler, Regular crawler, Legacy content crawler; RAM storage, SSD storage, HDD storage]

Key Take-Aways:
1. Make indexing easy for Googlebot;
➢ Put all your critical content in the HTML source
➢ Don’t rely on rendering to load valuable content
2. There’s no such thing as a duplicate content penalty;
➢ However, duplicate content on a single site means the site is competing with itself… and that’s stupid.
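One way to sanity-check the first take-away is to look at the unrendered HTML and confirm that the headline and body text are already there. The sketch below is only an illustration, not a description of how Googlebot processes pages: the helper name `missing_from_source` and the sample markup are hypothetical, and it assumes you already have the raw HTML response as a string.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text from raw HTML, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def missing_from_source(raw_html, required_snippets):
    """Return the snippets that do not appear in the unrendered HTML text."""
    extractor = VisibleTextExtractor()
    extractor.feed(raw_html)
    extractor.close()
    # Normalise whitespace so line breaks in the markup do not cause misses.
    text = " ".join(" ".join(extractor.chunks).split())
    return [s for s in required_snippets if " ".join(s.split()) not in text]


# Hypothetical pages: one ships its content in the HTML, one loads it via JS.
server_rendered = "<html><body><h1>Budget vote passes</h1><p>MPs voted late on Tuesday.</p></body></html>"
js_only = '<html><body><div id="app"></div><script>render("Budget vote passes")</script></body></html>'

print(missing_from_source(server_rendered, ["Budget vote passes"]))
print(missing_from_source(js_only, ["Budget vote passes"]))
```

Anything the function returns is content that would only exist after rendering, which is exactly what the take-away warns against relying on.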

Technical SEO in 2023

The Basics - Crawling
1. Efficient Crawling;
➢ Server Response Time
  - Aim for 600ms or faster
➢ Clean URLs
  - Never use tracking parameters on internal links
  - Avoid: https://www.website.com/news/article-123?recommended=1
  - Use: https://www.website.com/news/article-123
➢ Lightweight pages;
  - Page resources consume crawl budget
➢ Pagination;
  - Balance between paginated URLs and crawl waste
➢ Correct HTTP status codes
  - https://developers.google.com/search/docs/crawling-indexing/http-network-errors
➢ AdsBot can be unruly
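The clean-URL rule is easy to enforce in templates or in a link audit. Here is a minimal sketch using only the Python standard library. The function name `clean_internal_url` and the `TRACKING_PARAMS` set are assumptions for illustration: `recommended` comes from the slide's example, and the `utm_*` names are common analytics parameters added here as placeholders, so adjust the set to whatever your site actually appends.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of tracking parameters; replace with your site's own.
TRACKING_PARAMS = {"recommended", "utm_source", "utm_medium", "utm_campaign"}


def clean_internal_url(url, tracking_params=TRACKING_PARAMS):
    """Drop tracking parameters from a URL and keep everything else intact."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in tracking_params
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


print(clean_internal_url("https://www.website.com/news/article-123?recommended=1"))
# https://www.website.com/news/article-123
```

Functional parameters such as a page number are left alone, so the same helper can be run over paginated section URLs.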

The Basics - Indexing
2. Effortless indexing;
➢ Semantic HTML
  - No client-side JS to load content
➢ <h1> headlines
  - <div class="headline">This is a bad way to code an article headline</div>
  - <h1>This is a properly coded article headline</h1>
➢ Clean HTML in <head>
➢ Uninterrupted HTML in article <body>
➢ Good structured data;
  - NewsArticle for articles
  - Person for author pages
  - Keep it lean, don’t over-annotate
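To make "keep it lean" concrete, here is a sketch of generating a small NewsArticle block as JSON-LD. The property set (headline, datePublished, dateModified, author as a Person) is a deliberately minimal selection from schema.org, not a list taken from the talk, and the function name `news_article_jsonld` and all sample values are hypothetical.

```python
import json


def news_article_jsonld(headline, date_published, date_modified, author_name, author_url):
    """Build a lean NewsArticle JSON-LD <script> tag for the page <head>."""
    data = {
        "@context": "https://schema.org",
        "@type": "NewsArticle",
        "headline": headline,
        "datePublished": date_published,
        "dateModified": date_modified,
        "author": [{"@type": "Person", "name": author_name, "url": author_url}],
    }
    # Escape "</" so a headline can never close the <script> element early.
    payload = json.dumps(data, ensure_ascii=False).replace("</", "<\\/")
    return '<script type="application/ld+json">' + payload + "</script>"


tag = news_article_jsonld(
    headline="This is a properly coded article headline",
    date_published="2023-06-01T09:30:00+01:00",
    date_modified="2023-06-01T11:05:00+01:00",
    author_name="Jane Example",
    author_url="https://www.website.com/authors/jane-example",
)
print(tag)
```

Emitting this server-side keeps the markup in the HTML source, in line with the indexing advice above.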

Google used to be Deterministic
• Action A leads to ranking B;
➢ Relatively simple crawling, indexing, and ranking systems
➢ Few websites, low competition
➢ Fairly predictable

Google today is Probabilistic
• Action A increases the probability of ranking B;
➢ Massively complicated systems
➢ Intensely competitive web
➢ All SEO is geared towards maximising probabilities;
  - But… 99% probability still means 1% chance of it not happening

What’s Changed in Tech SEO?

Stop AI!

Blocking LLMs
• Robots.txt Disallow Rules:
    User-agent: CCBot
    Disallow: /
    User-agent: GPTBot
    Disallow: /
    User-agent: ChatGPT-User
    Disallow: /
    User-agent: Google-Extended
    Disallow: /
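Before deploying rules like these, it can be worth checking that they block what you intend and nothing else. The sketch below feeds the slide's rules to Python's standard `urllib.robotparser`. The helper name `blocked_agents` and the test URL are hypothetical, and the parser only tells you how a compliant client would read the file, not whether a given crawler actually obeys it.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /
"""


def blocked_agents(robots_txt, agents, url="https://www.website.com/news/article-123"):
    """Map each user-agent token to True if the rules disallow the URL for it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, url) for agent in agents}


result = blocked_agents(
    ROBOTS_TXT,
    ["CCBot", "GPTBot", "ChatGPT-User", "Google-Extended", "Googlebot"],
)
for agent, is_blocked in result.items():
    print(agent, "blocked" if is_blocked else "allowed")
```

Including Googlebot in the check is a quick guard against accidentally blocking regular search crawling while trying to block LLM crawlers.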

However…
https://www.gpp.io/news/how-to-block-genai-crawlers-such-as-googles-bard-or-openais-chatgpt-from-your-website-aD5S85s7E2CA

CWV: out with FID, in with INP

What is INP?
https://web.dev/inp/

Sitemaps Ping is dead

Alternative: www.PubIndexAPI.com

Unambiguous Timestamps
https://developers.google.com/search/docs/appearance/publication-dates
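A timestamp without a time zone can be read as several different moments, which is one common source of ambiguity. The sketch below assumes that "unambiguous" here means an ISO 8601 date-time with an explicit UTC offset. The helper name `unambiguous_timestamp` and the sample date are hypothetical, and the fixed +01:00 offset stands in for whatever zone your newsroom publishes in.

```python
from datetime import datetime, timedelta, timezone


def unambiguous_timestamp(dt):
    """Format a datetime as ISO 8601 with an explicit UTC offset.

    Naive datetimes (no time zone attached) are rejected rather than guessed at.
    """
    if dt.tzinfo is None or dt.utcoffset() is None:
        raise ValueError("naive datetime: attach a time zone before publishing")
    return dt.isoformat(timespec="seconds")


# Hypothetical publication time in a UTC+1 newsroom.
published = datetime(2023, 6, 1, 9, 30, tzinfo=timezone(timedelta(hours=1)))
print(unambiguous_timestamp(published))
# 2023-06-01T09:30:00+01:00
```

Using the same formatted value for the visible date, the `<time>` element, and the structured data keeps all three in agreement.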

Content Pruning

Content Pruning?

Topic Authority
https://developers.google.com/search/blog/2023/05/understanding-news-topic-authority

Content Pruning v Topic Authority
• Your volume of (good) articles on a topic determines your topic authority
• Topic authority = good visibility for your stories
• Deleting old content could undermine your topic authority
• Only delete bad content;
➢ Age and low traffic are not enough

Wrapping Up…

https://techpolicy.press/the-value-of-news-content-to-google-is-way-more-than-you-think/

www.SEOforGoogleNews.com

Thank You
Q&A
