Slide 1

Slide 1 text

The New Content SEO - FLOQ - Amanda King - Sydney SEO Conference, 14 April 2023

Slide 2

Slide 2 text

The New Content SEO - What we’ll talk about: 1. A quick refresher 2. Have keywords ever actually been a thing Google used? 3. How Google reads content may not be what you think 4. So what do we do about all this? 5. Who tf am I?

Slide 3

Slide 3 text

DEAR READERS: I’m still learning. So are you. If I’ve royally butchered a concept, come talk to me after. I like learning.

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

A quick refresher

Slide 6

Slide 6 text

A brief refresher on how Google crawls the Internet. There are three separate stages: crawl, index, serve, with sub-processes for scoring and ranking. Content analysis sits in the indexing engine; content relevancy sits in the serving engine. While this is an old patent (2011), the fundamentals still apply for this reminder. Source: https://patents.google.com/patent/US8572075B1/, retrieved 22 Mar 2023 https://developers.google.com/search/docs/fundamentals/how-search-works

Slide 7

Slide 7 text

Google’s ranking engine works in systems ● Query Deserves Freshness is a system ● Helpful Content is a system ● MUM & BERT are systems ○ “Bidirectional Encoder Representations from Transformers (BERT) is an AI system Google uses that allows us to understand how combinations of words express different meanings and intent.” https://developers.google.com/search/docs/appearance/ranking-systems-guide

Slide 8

Slide 8 text

Have keywords ever actually been a thing Google used?

Slide 9

Slide 9 text

While Google is a machine, it’s moved fundamentally beyond keywords…and has since at least 2015.

Slide 10

Slide 10 text

Why hasn’t SEO?

Slide 11

Slide 11 text

Queries very quickly become entities “[...] identifying queries in query data; determining, in each of the queries, (i) an entity-descriptive portion that refers to an entity and (ii) a suffix; determining a count of a number of times the one or more queries were submitted” - patent granted in 2015, submitted in 2012 Source: https://patents.google.com/patent/US9047278B1/en ; https://patents.google.com/patent/US20150161127A1/

Slide 12

Slide 12 text

Google acknowledges query-only based matching is pretty terrible. “Direct “Boolean” matching of query terms has well known limitations, and in particular does not identify documents that do not have the query terms, but have related words [...]The problem here is that conventional systems index documents based on individual terms, rather than on concepts. Concepts are often expressed in phrases [...] Accordingly, there is a need for an information retrieval system and methodology that can comprehensively identify phrases in a large scale corpus, index documents according to phrases, search and rank documents in accordance with their phrases, and provide additional clustering and descriptive information about the documents. [...]” - Information retrieval system for archiving multiple document versions, granted 2017 (link)

Slide 13

Slide 13 text

So it decided to make its search engine concept- and phrase-based. “The system is adapted to identify phrases that have sufficiently frequent and/or distinguished usage in the document collection to indicate that they are “valid” or “good” phrases [...]The system is further adapted to identify phrases that are related to each other, based on a phrase's ability to predict the presence of other phrases in a document.” - Information retrieval system for archiving multiple document versions, granted 2017 (link)

Slide 14

Slide 14 text

“Rather than simply searching for content that matches individual words, BERT comprehends how a combination of words expresses a complex idea.” Source: https://blog.google/products/search/how-ai-powers-great-search-results/
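As an illustration of that claim (not Google’s production setup), here is a minimal sketch using the open-source sentence-transformers library, which wraps BERT-style models. The model name is just a small public checkpoint, and the query and passages are invented for the demo.

```python
# Minimal sketch: score meaning rather than keyword overlap with a BERT-style
# sentence encoder. Model and example text are illustrative only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small public checkpoint, not Google's

query = "can you pick up medicine for someone else at the pharmacy"
passage_a = "rules for collecting a prescription on another person's behalf"
passage_b = "pharmacy opening hours on public holidays"

q_vec, a_vec, b_vec = model.encode([query, passage_a, passage_b])

# passage_a shares almost no keywords with the query, yet scores higher
# because the combination of words expresses the same idea.
print(util.cos_sim(q_vec, a_vec))
print(util.cos_sim(q_vec, b_vec))
```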

Slide 15

Slide 15 text

MUM takes this a step further ● About 1,000 times more powerful than BERT ● Trained across 75 languages for greater context ● Recognises this across different types of media (video, text, etc) https://blog.google/products/search/introducing-mum/

Slide 16

Slide 16 text

How Google reads content may not be what you think

Slide 17

Slide 17 text

Step 1: Indexing. Indexing is the stage where content is analysed, so how does Google do it?

Slide 18

Slide 18 text

BERT is a technique for pre-training natural language processing models. So how does natural language processing work, once it has a corpus of data? Source: https://blog.google/products/search/search-language-understanding-bert/

Slide 19

Slide 19 text

Is there anything in this process that even looks like “keywords”?

Slide 20

Slide 20 text

So the broad-strokes steps in the indexation process are: 1. Parsing: tokenisation, parts of speech, stemming (for Google, lemmatisation) 2. Topic modelling: entity detection, relation detection 3. Understanding 4. Onto the next engine, ranking
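As a rough public stand-in for steps 1 and 2 (not Google’s actual pipeline), a minimal sketch with spaCy showing tokens, parts of speech, lemmas and entities:

```python
# Minimal sketch of parsing + entity detection using spaCy's small English
# model as a public stand-in for Google's (unpublished) indexing pipeline.
import spacy

nlp = spacy.load("en_core_web_sm")  # python -m spacy download en_core_web_sm
doc = nlp("Google moved beyond keywords to entities back in 2015.")

# 1. Parsing: tokenisation, parts of speech, lemmas (lemmatisation, not stemming)
for token in doc:
    print(token.text, token.pos_, token.lemma_)

# 2. Topic modelling (partial): entity detection; relation detection would
#    need a further model on top of this.
for ent in doc.ents:
    print(ent.text, ent.label_)
```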

Slide 21

Slide 21 text

Parsing is intrinsically categorisation ● Semantic distance ● Keyword-seed affinity ● Category-seed affinity ● Category-seed affinity compared to a threshold https://patents.google.com/patent/US11106712B2; https://www.seobythesea.com/2021/09/semantic-relevance-of-keywords/

Slide 22

Slide 22 text

How natural language processing usually works: tokenization and subwords Source: https://ai.googleblog.com/2021/12/a-fast-wordpiece-tokenization-system.html
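A minimal sketch of WordPiece subword tokenisation, assuming the open-source BERT tokenizer from Hugging Face as an approximation of the system the blog post describes:

```python
# Minimal sketch: WordPiece splits rare or unseen words into known subword
# pieces (marked with "##"), so nothing is truly "out of vocabulary".
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("search"))        # common word: kept whole
print(tokenizer.tokenize("snorkelling"))   # rarer word: split into subword pieces
# Exact splits depend on the model's learned vocabulary.
```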

Slide 23

Slide 23 text

This gets broken down even further ● N-grams: important for finding the primary concepts of the sentence by identifying and excluding stop words ● “Running”, “runs”, “ran” = same base: “run” https://patents.google.com/patent/US8423350B1/
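A minimal sketch of those two bullets with NLTK: drop stop words, build n-grams from what remains, and reduce word forms to a common base. The Porter stemmer here is the classic public example; as the earlier slide notes, Google lemmatises rather than stems.

```python
# Minimal sketch: stop-word removal, n-grams, and stemming with NLTK.
from nltk import ngrams
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# One-time downloads: nltk.download("punkt"); nltk.download("stopwords")
text = "Running shoes that run well in the rain"
tokens = [t.lower() for t in word_tokenize(text)]

stops = set(stopwords.words("english"))
content = [t for t in tokens if t not in stops]      # primary concepts only
print(list(ngrams(content, 2)))                      # bigrams of those concepts

stemmer = PorterStemmer()
print({w: stemmer.stem(w) for w in ["running", "runs", "ran"]})
# 'running' and 'runs' reduce to 'run'; 'ran' needs lemmatisation to get there.
```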

Slide 24

Slide 24 text

Google does a lot of things when detecting entities and relationships ● Identifying aspects to define entities based on popularity and diversity, granted in 2011 (link) ● Finding the entity associated with a query before returning a result, using input from human quality raters to confirm objective fact associated with an entity, granted in 2015 (link) ● Understanding the context of the query, entity and related answer you’re searching for, granted in 2019 (link) ● Aims to understand user generated content signals in relation to a webpage, granted in 2022 (link)

Slide 25

Slide 25 text

Google does a lot of things when detecting entities and relationships ● Understanding the best way to present an entity in a results page, granted in 2016 (link) ● Managing and identifying disambiguation in entities, granted in 2016 (link) ● Building entities through a co-occurring “methodology based on phrases” and storing lower-information-gain documents in a secondary index, granted in 2020 (link) ● Understanding context from previous query results and behaviour, granted in 2016 (link)

Slide 26

Slide 26 text

Step 2: Scoring. In its own description of its ranking & scoring engine, Google offers 5 buckets: ● Meaning ● Relevance ● Quality ● Usability ● Context

Slide 27

Slide 27 text

Scoring is all those 200+ factors we talk about… Google has cited everything from internal links, external links, pogo-sticking and “user behaviour” to proximity of the query terms to each other, context, attributes, and more. Just a few of the patents related to scoring: ● Evaluating quality based on neighbor features (link) ● Entity confidence (link) ● Search operation adjustment and re-scoring (link) ● Evaluating website properties by partitioning user feedback (link) ● Providing result-based query suggestions (link) ● Multi-process scoring (link) ● Blocking spam blog posts with a “low link-based score” (link)

Slide 28

Slide 28 text

It actually looks like they have a classification engine for entities as well. This patent was filed in 2010 and granted in 2014; likely a basis for the Knowledge Graph. (US8838587B1) https://patents.google.com/patent/US8838587B1/en

Slide 29

Slide 29 text

“...link structure may be unavailable, unreliable, or limited in scope, thus, limiting the value of using PageRank in ascertaining the relative quality of some documents.” (circa 2005) https://patents.google.com/patent/US7962462B1/en

Slide 30

Slide 30 text

There’s more than one document scoring function; the functions are weighted, and that has been the case since the beginning.

Slide 31

Slide 31 text

How Google ranks content ● Based on historical behaviour from similar searches in aggregate (application) ● Based on external links (link) ● Based on your own previous searches (link) ● Based on whether or not it should directly provide the answer via the Knowledge Graph (link) ● Phrase- and entity-based co-occurrence threshold scores (link) ● Understanding intent based on contextual information (link)

Slide 32

Slide 32 text

Helpful Content Update & Information Gain Score (granted Jun 2022) ● The information gain score might be personal to you and the results you’ve already seen ● Featured snippets may be different from one search to another based on the information gain score of your second search ● Pre-training an ML model on a first set of data shown to users in aggregate, getting an information gain score, and using that to generate new results in SERPs. https://patents.google.com/patent/US20200349181A1/en

Slide 33

Slide 33 text

What is “information gain”? “Information gain, as the ratio of actual co-occurrence rate to expected co-occurrence rate, is one such prediction measure. Two phrases are related where the prediction measure exceeds a predetermined threshold. In that case, the second phrase has significant information gain with respect to the first phrase.” - Phrase-based searching in an information retrieval system, granted 2009 (link)
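A tiny worked example of that definition, with made-up document counts, where “related” simply means the ratio exceeds a hypothetical threshold:

```python
# Made-up counts to illustrate "information gain" as the ratio of actual to
# expected co-occurrence of two phrases across a document collection.
docs_total = 10_000
docs_with_phrase_a = 800     # e.g. pages mentioning "espresso machine"
docs_with_phrase_b = 500     # e.g. pages mentioning "burr grinder"
docs_with_both = 120

# If the phrases were unrelated, we'd expect them to co-occur by chance:
expected = docs_with_phrase_a * docs_with_phrase_b / docs_total   # 40.0
actual = docs_with_both                                           # 120

information_gain = actual / expected
print(information_gain)               # 3.0

THRESHOLD = 1.5                       # hypothetical "predetermined threshold"
print(information_gain > THRESHOLD)   # True: treat phrase B as related to phrase A
```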

Slide 34

Slide 34 text

So, basically, it’s quantifying to what degree you talk about all the topics Google sees as related to your main subject.

Slide 35

Slide 35 text

If information gain is such a strong concept in how Google chooses which content to show, why do so few folks talk about it? https://patents.google.com/patent/US7962462B1/en

Slide 36

Slide 36 text

So what do we do about all this?

Slide 37

Slide 37 text

When was the last time you did a full content inventory?

Slide 38

Slide 38 text

What I mean when I say content inventory https://www.portent.com/onetrick

Slide 39

Slide 39 text

Redo keyword research and overlay entities ● Pull content for at least the top 10 search results ranking for your target keyword ● Dump them into Diffbot (https://demo.nl.diffbot.com/) or the Natural Language AI demo (https://cloud.google.com/natural-language) ● Note the entities and salience ● Run your target page ● Understand the differences ● Update your content accordingly
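If pasting pages into the demos gets tedious, the same entity and salience data is available programmatically. A minimal sketch with the Google Cloud Natural Language client library (assumes the google-cloud-language package is installed and credentials are configured; page_text is a placeholder):

```python
# Minimal sketch: pull entities and salience for one page's text via the
# Cloud Natural Language API, the service behind the demo linked above.
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
page_text = "Paste the text of a ranking page (or your target page) here."

document = language_v1.Document(
    content=page_text,
    type_=language_v1.Document.Type.PLAIN_TEXT,
)
response = client.analyze_entities(document=document)

# Note the entities and their salience, then repeat for your own page
# and compare the differences.
for entity in sorted(response.entities, key=lambda e: e.salience, reverse=True):
    print(f"{entity.name:40} {language_v1.Entity.Type(entity.type_).name:12} {entity.salience:.3f}")
```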

Slide 40

Slide 40 text

Start with keyword research, find co-occurring terms ● Pull content for at least the top 10 search results ranking for your target keyword ● Look at TF-IDF calculators to reverse-engineer the topic correlation (Ryte has a paid one) ● Note the terms included ● Run your target page ● Understand the differences ● Update your content accordingly
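A minimal sketch of that comparison using scikit-learn’s TfidfVectorizer rather than a paid tool; the page texts are placeholders you would replace with the scraped copy of the ranking pages and your own page:

```python
# Minimal sketch: find terms that score highly across the ranking pages
# (by average TF-IDF) but are missing from your target page.
from sklearn.feature_extraction.text import TfidfVectorizer

competitor_pages = [
    "full text of ranking page 1 ...",
    "full text of ranking page 2 ...",
    # ... the rest of the top 10
]
my_page = "full text of the page you want to improve ..."

vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(competitor_pages)

avg_scores = dict(zip(vectorizer.get_feature_names_out(), tfidf.mean(axis=0).A1))
top_terms = sorted(avg_scores, key=avg_scores.get, reverse=True)[:25]

# Candidate gaps: strong competitor terms your page never mentions.
gaps = [term for term in top_terms if term not in my_page.lower()]
print(gaps)
```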

Slide 41

Slide 41 text

Break old content habits ● FAQs on product pages ● Consolidate super-granularly targeted blog articles ● Think outside of the blog folder: the semantic relationship can carry through to the directory structure of the website as well ● Internal linking can be a secret weapon ● Fit content to purpose: not everything needs a 3,000-word in-depth article

Slide 42

Slide 42 text

Measure what really matters to the business — traffic and revenue from organic.

Slide 43

Slide 43 text

Who tf am I?

Slide 44

Slide 44 text

Amanda King is a human ● Over a decade in the SEO industry ● Travelled to 40+ countries ● Business- and product-focussed ● Knows CRO, data, UX ● Always open to learning something new ● Slightly obsessed with tea

Slide 45

Slide 45 text

Thank you Amanda King t. @amandaecking i. @floq.co / @amandaecking w. floq.co

Slide 46

Slide 46 text

How Google reads content ● BERT is open source // BERT Q&A demo ● Latent Dirichlet Allocation ● BERT or PaLM? (PaLM = LLM) Or LaMDA? Or CALM? ● Recent deep learning with BERT ● MUM

Slide 47

Slide 47 text

And, like an elephant, Google doesn’t forget: Google is vector-based.
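“Vector-based” here means queries and documents are compared as points in an embedding space rather than as bags of keywords. A minimal sketch with NumPy and cosine similarity, using invented three-dimensional vectors in place of real model embeddings:

```python
# Made-up illustration of vector-based comparison: cosine similarity between
# a query embedding and two document embeddings. Real embeddings come from a
# model (e.g. BERT); these tiny vectors are invented for the demo.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.9, 0.1, 0.3])
doc_about_same_concept = np.array([0.8, 0.2, 0.35])
doc_about_something_else = np.array([0.1, 0.9, 0.7])

print(cosine_similarity(query, doc_about_same_concept))    # high -> likely relevant
print(cosine_similarity(query, doc_about_something_else))  # low  -> likely not
```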