#NESS23
Technical SEO for Publishing Sites
in 2023
Barry Adams
Polemic Digital & SEOforGoogleNews.com
Advance Warning
“The whole problem with the world is
that fools and fanatics are always so
certain of themselves and wiser
people so full of doubts.”
- Bertrand Russell
How does Google work?
Google’s Three Main Processes
Crawling Indexing Ranking
Technical SEO
Crawling Indexing
Google’s model
[Diagram: Crawl Queue → Crawling → Processing → Index, with a Render Queue and Rendering branch; data labels: URL, HTML, Rendered HTML]
1. Crawling
Googlebot since 2011
https://developers.google.com/search/blog/2011/08/google-news-now-crawling-with-googlebot
Three ‘layers’ of Googlebot?
[Diagram: the same model with three parallel Crawl Queue and Crawling tracks, alongside Processing, Render Queue, Rendering and Index]
Three ‘layers’ of Googlebot
1. Realtime crawler
2. Regular crawler
3. Legacy content crawler
Realtime Crawler
• Crawls VIPs
➢ Very Important Pages;
Webpages that have a high change frequency and/or are
seen as highly authoritative
News website homepages & key section pages
• Main purpose = discovery of valuable new content;
➢ i.e., news articles
• Rarely re-crawls newly discovered URLs;
➢ New URLs can become VIPs over time
Regular Crawler
• Google’s main crawler;
➢ Does most of the hard work
➢ Probably the crawler that
also fetches page resources
Legacy Content Crawler
• Crawls VUPs
➢ Very Unimportant Pages;
URLs that have very little link value and/or are
very rarely updated
➢ Recrawls URLs that serve 4XX errors
- Likely also occasionally checks old redirects
Key Take-Away:
• Realtime Crawler crawls your article once;
➢ It is then passed on to Regular Crawler
➢ Regular Crawler will visit the URL several hours later
➢ Any changes made after the first crawl are unlikely to be seen until then
➢ By then the story is not news anymore – the news cycle has moved on
• Consequence:
➢ You usually get one chance to rank in Google’s news ecosystem
➢ Get your SEO right before you click ‘Publish’
• Possible Exception: LiveBlogPosting articles
2. Indexing
Indexing and Rendering
[Diagram: Crawl Queue → Crawling → Processing → Index, with a Render Queue and Rendering branch; data labels: URL, HTML, Rendered HTML]
Indexing and Rendering
[Diagram: Crawl Queue → Crawling → Processing → Index, with Render Queue and Rendering set apart from the main flow; data labels: URL, HTML]
Indexing and Rendering
Rendering takes time, and news doesn’t have time.
Indexing is initially with raw HTML only.
[Diagram: Crawl Queue → Crawling → Processing → Index, with Render Queue and Rendering set apart from the main flow; data labels: URL, HTML]
Rendering isn’t the only shortcut…
Google wants publishers to noindex syndicated content.
Because Google sucks at identifying duplicate content.
At least, it can’t de-duplicate quickly.
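One way to verify that a syndicated copy actually carries the noindex directive is to look for a robots meta tag in the raw HTML. The following is a minimal sketch using only Python's standard library; the names `RobotsMetaParser` and `has_noindex` are illustrative assumptions, and the check does not cover the `X-Robots-Tag` HTTP header.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so normalise to "".
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() in {"robots", "googlebot"}:
            for part in attr.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the raw HTML contains a robots meta tag with 'noindex'."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return "noindex" in parser.directives
```

Running this against the raw HTML source (not the rendered DOM) matches the point made earlier that initial indexing works from raw HTML.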
Indexing is a multi-layered set of processes
[Diagram: Crawl Queue → Crawling → Processing → Index, with Render Queue and Rendering; Processing is shown as several stacked Processing steps]
Known Indexing Processes
• HTML Lexer;
➢ Tokenises HTML
• Parser;
➢ Extracts content from HTML for indexing
• Canonicaliser;
➢ Determines a URL’s canonical version
• Deduplicator;
➢ Reduces the amount of identical content in the index
• Pageranker;
➢ Calculates link value (FMA PageRank) for each URL
• Many, many more…
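As a toy illustration of the parsing and de-duplication steps listed above, and explicitly not a description of how Google implements them, the sketch below extracts text from HTML, normalises it, and groups URLs that share identical content. All names (`content_fingerprint`, `deduplicate`) and the exact-match hashing approach are assumptions made for the example; real systems would also need near-duplicate detection.

```python
import hashlib
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects every text node in the document (a stand-in for a real parser)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def content_fingerprint(html: str) -> str:
    """Hash the whitespace-normalised, lowercased text of a page."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = re.sub(r"\s+", " ", " ".join(extractor.chunks)).strip().lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def deduplicate(pages: dict[str, str]) -> dict[str, str]:
    """Map each URL to the first URL seen with identical extracted text."""
    first_seen: dict[str, str] = {}
    chosen: dict[str, str] = {}
    for url, html in pages.items():
        fingerprint = content_fingerprint(html)
        chosen[url] = first_seen.setdefault(fingerprint, url)
    return chosen
```

Picking the first URL seen is an arbitrary choice for the example; a real canonicaliser would weigh many more signals.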
What about the Index itself?
[Diagram: Crawl Queue → Crawling → Processing → Index, with Render Queue and Rendering; the Index is shown three times]
Three Crawlers… Three Indices?
Realtime crawler → RAM storage
Regular crawler → SSD storage
Legacy content crawler → HDD storage
Three Layers of Index Storage
1. RAM storage;
➢ Pages that need to be served quickly and frequently
Includes news articles but also popular content
2. SSD storage;
➢ Pages that are regularly served in SERPs but aren’t super popular
3. HDD storage;
➢ Pages that are rarely (if ever) served in SERPs
Key Take-Aways:
1. Make indexing easy for Googlebot;
Put all your critical content in the HTML source
Don’t rely on rendering to load valuable content
2. There’s no such thing as a duplicate content penalty;
However, duplicate content on a single site means the site
is competing with itself… and that’s stupid.
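The first take-away can be spot-checked by testing whether a key phrase from the article is present in the raw HTML source, before any JavaScript runs. This is a minimal sketch under that assumption; `VisibleText` and `in_raw_html` are hypothetical helper names, and the sample markup in the usage note is invented for illustration.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def in_raw_html(html: str, phrase: str) -> bool:
    """Return True if the phrase appears in the text of the unrendered HTML."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.parts).split())
    return phrase in text
```

For example, a page that ships `<p>The vote was close.</p>` in its source passes the check, while a page that only injects the same sentence from a script does not.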
Technical SEO in 2023
The Basics - Crawling
1. Efficient Crawling;
➢ Server Response Time
- Aim for 600ms or faster
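A rough way to compare a page against the 600ms target is to time how long the first byte of the response takes to arrive. The sketch below uses only the standard library; `time_to_first_byte_ms` and `is_fast_enough` are illustrative names, and the measurement includes DNS lookup and connection setup from wherever the script runs, so it only approximates what Googlebot experiences.

```python
import time
import urllib.request

TARGET_MS = 600  # target taken from the slide


def is_fast_enough(elapsed_ms: float, target_ms: float = TARGET_MS) -> bool:
    """Return True if the response time meets the target ('600ms or faster')."""
    return elapsed_ms <= target_ms


def time_to_first_byte_ms(url: str, timeout: float = 10.0) -> float:
    """Approximate time to first byte in milliseconds for a single GET request."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read(1)  # wait until the first byte of the body is available
    return (time.perf_counter() - start) * 1000.0
```

A single request is noisy; averaging several measurements per URL gives a more useful figure.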
The Basics - Crawling
1. Efficient Crawling;
➢ Server Response Time
➢ Clean URLs
- Never use tracking parameters on internal links
https://www.website.com/news/article-123?recommended=1
https://www.website.com/news/article-123
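The clean-up from the first URL to the second can be sketched as a small function that drops tracking parameters before an internal link is emitted. The parameter list is an assumption: `recommended` comes from the example above, and the `utm_` prefix is added as a common convention.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; adjust to whatever your own site appends.
TRACKING_PARAMS = {"recommended"}
TRACKING_PREFIXES = ("utm_",)


def strip_tracking_params(url: str) -> str:
    """Return the URL with known tracking parameters removed from the query string."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS and not key.startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Parameters that change the content, such as a page number, are left untouched.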
The Basics - Crawling
1. Efficient Crawling;
➢ Server Response Time
➢ Clean URLs
➢ Lightweight pages;
- Page resources consume crawl budget
The Basics - Crawling
1. Efficient Crawling;
➢ Server Response Time
➢ Clean URLs
➢ Lightweight pages
➢ Pagination;
- Balance between paginated URLs and crawl waste
The Basics - Crawling
1. Efficient Crawling;
➢ Server Response Time
➢ Clean URLs
➢ Lightweight pages
➢ Pagination
➢ Correct HTTP status codes
https://developers.google.com/search/docs/crawling-indexing/http-network-errors
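As a simplified summary of the documentation linked above, status codes fall into a few broad groups, with 429 treated like a server error rather than a client error. The function name `crawl_outcome` and the wording of the labels are assumptions for illustration; the linked page remains the reference for the details.

```python
def crawl_outcome(status: int) -> str:
    """Return a coarse label for how a crawler is likely to treat an HTTP status."""
    # 429 (Too Many Requests) is grouped with 5xx: a signal to slow down crawling.
    if status == 429 or 500 <= status <= 599:
        return "server-error: crawl rate is reduced"
    if 200 <= status <= 299:
        return "success: content is passed on for indexing"
    if 300 <= status <= 399:
        return "redirect: the target URL is followed"
    if 400 <= status <= 499:
        return "client-error: content is not indexed"
    return "unknown"
```

A check like this is useful in log analysis, for example to flag deleted articles that still return 200 or error pages that return a redirect.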
The Basics - Crawling
1. Efficient Crawling;
➢ Server Response Time
➢ Clean URLs
➢ Lightweight pages
➢ Pagination
➢ Correct HTTP status codes
➢ AdsBot can be unruly
The Basics - Indexing
2. Effortless indexing;
➢ Semantic HTML
- No client-side JS to load content
The Basics - Indexing
2. Effortless indexing;
➢ Semantic HTML
➢ <h1> headlines
This is a bad way to code an article headline
This is a properly coded article headline
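The markup for the two headline examples did not survive in the slide text. Assuming the properly coded headline is wrapped in an `<h1>` element (the conventional element for a page's main heading) and the bad one in a generic container, a quick check of the raw HTML might look like this sketch; `h1_headlines` is a hypothetical helper name and the `<div>` markup in the usage note is invented.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collects the text content of every <h1> element."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.headlines[-1] += data


def h1_headlines(html: str) -> list[str]:
    """Return the whitespace-normalised text of each <h1> in the raw HTML."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return [" ".join(text.split()) for text in parser.headlines]
```

A headline in `<div class="headline">…</div>` yields an empty list, while `<h1>…</h1>` yields the headline text; an article page would normally be expected to return exactly one entry.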
The Basics - Indexing
2. Effortless indexing;
➢ Semantic HTML
➢ <h1> headlines
➢ Clean HTML in
The Basics - Indexing
2. Effortless indexing;
➢ Semantic HTML
➢ <h1> headlines
➢ Clean HTML in
➢ Uninterrupted HTML in article
The Basics - Indexing
2. Effortless indexing;
➢ Semantic HTML
➢ <h1> headlines
➢ Clean HTML in
➢ Uninterrupted HTML in article
➢ Good structured data;
- NewsArticle for articles
- Person for author pages
- Keep it lean, don’t over-annotate
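A lean NewsArticle annotation can be generated as JSON-LD with a handful of schema.org properties. The sketch below is one possible shape, not a required set: the function name, the chosen properties, and the sample values in the usage note are assumptions, and the output would be embedded in a `<script type="application/ld+json">` element in the page's HTML source.

```python
import json


def news_article_jsonld(
    headline: str,
    published: str,
    modified: str,
    author_name: str,
    author_url: str,
    image_url: str,
) -> str:
    """Build a minimal schema.org NewsArticle object as a JSON-LD string."""
    data = {
        "@context": "https://schema.org",
        "@type": "NewsArticle",
        "headline": headline,
        "datePublished": published,  # ISO 8601, e.g. 2023-10-19T08:00:00+00:00
        "dateModified": modified,
        "image": [image_url],
        # The author is a Person; the URL would point at the author page.
        "author": [{"@type": "Person", "name": author_name, "url": author_url}],
    }
    return json.dumps(data, indent=2)
```

For instance, `news_article_jsonld("Budget passes", "2023-10-19T08:00:00+00:00", "2023-10-19T09:30:00+00:00", "Jane Doe", "https://www.website.com/authors/jane-doe", "https://www.website.com/img/budget.jpg")` returns a string ready to place in the article template.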
Google used to be Deterministic
• Action A leads to ranking B;
➢ Relatively simple crawling, indexing, and ranking systems
➢ Few websites, low competition
➢ Fairly predictable
Google today is Probabilistic
• Action A increases the probability of ranking B;
➢ Massively complicated systems
➢ Intensely competitive web
➢ All SEO is geared towards maximising probabilities;
- But… 99% probability still means 1% chance of it not happening
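The arithmetic behind that last point can be made concrete. Assuming each article is an independent attempt with the same probability of success (a simplification, since real outcomes are unlikely to be independent), the chance of at least one miss grows quickly with volume; the function name is hypothetical.

```python
def chance_of_at_least_one_miss(p_success: float, attempts: int) -> float:
    """Probability that at least one of several independent attempts fails."""
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # P(all succeed) = p^n, so P(at least one miss) = 1 - p^n.
    return 1.0 - p_success ** attempts
```

With a 99% success probability, one article has a 1% chance of missing, while 100 articles have roughly a 63% chance of containing at least one miss.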
Content Pruning v Topic Authority
• Your volume of (good) articles on a topic determines your
topic authority
• Topic authority = good visibility for your stories
• Deleting old content could undermine your topic authority
• Only delete bad content;
➢ Age and low traffic are not enough