Humans vs AI Quality Raters for Search Engines

Humans vs LLMs As Quality Raters for Search Engines
Are major changes coming?
Dawn Anderson - March 2024

Dawn Anderson
• UK-based SEO consultant
• 17 years in SEO
• Occasional SEO conference speaker
• EU, UK, US, Global Search Awards judge
• Previous digital marketing lecturer & trainer
• Industry publication contributor
• Now predominantly consulting all of the time
• Stalker of information retrieval threads and IR conference hashtags since 2017

A sea-change is coming for a fundamental part of search
On the other side of the ‘front door’

The important algorithmic ranking evaluation stage
[Diagram: search pipeline stages. Stage labels: Crawling, Indexing, Ranking (& Re-ranking), Serving. Annotations: Discovery & refresh; If importance thresholds reached; In response to a query; Dynamic build at runtime]

The process of search results evaluation (Ranking System)
Determine how well a ‘system’ (ranking system) fares either currently (continuous evaluation), or when compared to proposed changes

Why relevance evaluation?

Whilst some change is temporally predictable
Seasonality, temporality, data-driven probability of ‘precision-to-relevance’ intent shift

Often, what ‘relevance’ means is changing unpredictably
Search CANNOT be static in a changing world

The process of evaluation…
• Is both continuous - On existing ‘systems’
• And intermittent - On proposed ‘system improvements’

A ‘system’ is simply a recipe or multiple recipes
SYSTEM == ALGORITHMIC BLEND

‘System’ evaluation… is simply taste-testing
Did the recipe turn out as we hoped???

Before… roll out

AKA Google Algorithm Updates
Jagger, Big Daddy, Florida, Fritz, Everflux, Austin, Bourbon, PageRank, Dewey, Vince, Caffeine, Exact-match, Penguin, BERT, RankBrain, Pigeon, Panda, Fred

‘Human in the loop (HITL)’ is the mainstay of ‘system’
evaluation

Predominantly two types of ‘HITL’ evaluation
• Implicit
• Explicit

Implicit (‘Human’ in the loop has no awareness) evaluation
• Tests on real searcher segments
• Anonymous scroll and click behaviour
• UX testing on any site (heatmaps / recordings all fall into this category)

Explicit (human knows they are actively evaluating) evaluation
E.g.
• Searchers asked to provide feedback
• Netflix users asked to thumbs up a film
• Spotify favouriting or playlist building - leads to further recommendations
• User groups / user panels
• Sites asking for feedback
• Professional expert relevance annotators
• Paid human contractor evaluators

But it all mostly comes down to labels & labelling anyways
IMPORTANT… Labels are training data for machine learning

Labels are all around us
In vast numbers they are converted into mathematical form for machine learning training data

We are ALL data labellers… every single day
Every day

A cohort of similar data labellers help with recommender systems
Birds of a feather flock together… they like the same things

Data labels teach machines to know the difference between cats and dogs (supervised learning)
Cat, dog, dog, cat, cat, dog, cat, dog, dog, dog, cat

Search engines have used ‘The Crowd’ for HITL evaluation for
more than two decades

In search… ‘The Crowd’ ‘labels’ sample comparative search result sets
‘Relevant’ or ‘not relevant’

Pair-wise, side-by-side SERP comparisons are used in the majority of relevance evaluation exercises
PAIR-WISE COMPARISON OR
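
A side-by-side exercise of this kind could be tallied roughly as follows. This is a minimal sketch, not any search engine's actual pipeline: the 'A' / 'B' / 'tie' judgment values, the function names and the sample data are all assumptions made for illustration.

```python
from collections import Counter

def side_by_side_winner(judgments):
    """Majority vote over rater judgments ('A', 'B' or 'tie') for one query."""
    counts = Counter(judgments)
    if counts["A"] == counts["B"]:
        return "tie"
    return "A" if counts["A"] > counts["B"] else "B"

def win_rate(per_query_judgments, system="B"):
    """Share of decided (non-tied) queries won by the given result set."""
    winners = [side_by_side_winner(j) for j in per_query_judgments.values()]
    decided = [w for w in winners if w != "tie"]
    return sum(w == system for w in decided) / len(decided) if decided else 0.0

# Hypothetical judgments: 'A' is the current system, 'B' the proposed change.
sample = {
    "q1": ["A", "B", "B"],
    "q2": ["A", "A", "tie"],
    "q3": ["B", "B", "B"],
    "q4": ["A", "B"],
}
print(win_rate(sample, system="B"))
```

Here the proposed system wins two of the three decided queries; the fourth is a tie and is left out of the rate.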

Instead of “yum” or “barf” labels, it’s “relevant” or “not-relevant”
labels

But it’s mostly aggregated binary data
Binary labels rolled up into overall relevance scores
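
One simple way to picture the roll-up, assuming each rater gives a 1 ('relevant') or 0 ('not relevant') per result, is the share of raters who said 'relevant'. The document names and ratings below are hypothetical, and real systems may weight raters or use finer-grained scales.

```python
def relevance_score(labels):
    """Fraction of raters who labelled a result 'relevant' (1) rather than 'not relevant' (0)."""
    if not labels:
        raise ValueError("at least one label is required")
    return sum(labels) / len(labels)

# Hypothetical binary labels from four raters per result.
ratings = {
    "doc_a": [1, 1, 0, 1],
    "doc_b": [0, 0, 1, 0],
}
scores = {doc: relevance_score(labels) for doc, labels in ratings.items()}
print(scores)
```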

Effectively a measurement of NDCG (Normalised Discounted Cumulative Gain) and / or DCG (Discounted Cumulative Gain)
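
For readers who have not met these metrics, here is a minimal sketch of the common log2-discount formulation of DCG and NDCG. The input is a list of relevance labels in ranked order; the function names are illustrative and other gain or discount variants exist.

```python
import math

def dcg(relevances):
    """Discounted Cumulative Gain: each label is discounted by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (best-first) ordering of the same labels."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Relevant results at ranks 1 and 3, not relevant at rank 2.
print(dcg([1, 0, 1]))
print(ndcg([1, 0, 1]))
```

A relevant result lower down the page contributes less than one at the top, which is why the same set of labels can score differently depending on the order.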

The ‘recipe’ ingredients or quantities are then adjusted accordingly
And the cycle begins again

Until… Acceptable Precision at K is achieved (P@k)
E.g. the top 10 (k) or 20 (k) (whatever) results in enough samples are deemed relevant
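
Precision at K is simply the share of the top k results labelled relevant. A minimal sketch, assuming binary labels in ranked order (the sample labels are made up):

```python
def precision_at_k(relevances, k):
    """Fraction of the top-k results labelled relevant (1); missing results count as not relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevances[:k]) / k

# Hypothetical labels for one ranked result list of ten results.
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
print(precision_at_k(labels, 5))
print(precision_at_k(labels, 10))
```

What counts as 'acceptable' is a threshold chosen by whoever runs the evaluation; the slides do not state one.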

And… roll it out

But… Not all labels are created equally
• Gold labels - High quality / lower availability
• Silver labels - Lower quality / higher availability

Real search engine users in experiments
Create ‘gold labels’

Before ‘The Crowd’… professional subject matter expert annotators
Created gold labels too

But… Scale…

‘The Crowd’ came for the scale
Professional expert annotators were not scalable

Universal Human Relevance System

16,000+ Google human quality raters alone

But to the detriment of quality
“Such annotation tasks were delegated to crowd workers, with a substantial decrease in terms of quality of the annotation, compensated by a huge increase in annotated data.” (Clarke et al, 2022)

‘The Crowd’ likely produce silver labels

There are issues with the data labelling industry overall

Data labelling industry crisis… Demand outstrips supply
• There is a bottleneck (and it’s going to get worse)
• Not enough labels produced to deal with the size of machine learning models

“The global data collection and labeling market size was valued
at $2.22 billion in 2022 and it is expected to expand at a compound annual growth rate of 28.9% from 2023 to 2030, with the market then expected to be worth $13.7 billion.” Source: Grand View Research, 2021

Data labellers work across many industries, many companies
• Maps
• Assistant
• AI content detection
• Search quality evaluation
• Image detection labelling
• AI content detection training
• Any other ML driven application

High risk of under-trained ML models due to scaling without
label volume increase

DeepMind researchers - “We find current large language models are significantly undertrained, a consequence of the recent focus on scaling language models while keeping the amount of training data constant. …we find for compute-optimal training …for every doubling of model size the number of training tokens should also be doubled.” – “Training Compute-Optimal Large Language Models” (Hoffmann et al, 2022)

There’s a dark side to the data labelling industry too

Low paid ‘ghost workers’ in emerging economies

Pay protests at Google

There’s also a problem with humans as relevance labellers too
We ain’t all that… it seems

In addition to the very subjective 170 page ‘Quality Rater
Guidelines’

Notorious subjectivity of ‘relevance’

‘The crowd is made of people - Observations from large
scale crowd labelling’ (Thomas et al, 2022)

Bing Researchers - ‘The Crowd is Made of People: Observations from large scale crowd-labelling’ (Thomas et al, 2022)
Findings:
• Fatigue
• Time of day & day of week
• Anchoring
• Task-switching
• Left-side bias
• General disagreement on relevance

• Human challenges
• Ethical concerns
• Undertrained models
• Bottlenecks / over-demand
• More scale needed
A perfect storm

Helloooooo… ChatGPT & Bing

‘Large language models can accurately predict searcher preferences’ (Thomas et al, 2023)
Bing’s LLM & GPT4 research

• GPT4 prompt engineering (role playing prompt)
• (Up to 5) LLM agents to emulate the behaviour of search relevance evaluators
• Produce enough gold and silver labels to build relevance training data for much larger data sets
• Train the agents initially on gold labels
‘Large language models can accurately predict searcher preferences’ (Thomas et al, 2023)
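
To make the idea of a role-playing rater prompt more concrete, here is a rough sketch of how one might be assembled and how a reply could be mapped back to a binary label. The template wording, the function names and the label format are all assumptions for illustration; this is not the prompt used in the Bing research, and no model is actually called.

```python
# Hypothetical template: the wording is illustrative only, not Bing's actual prompt.
PROMPT_TEMPLATE = (
    "You are {role}.\n"
    "Query: {query}\n"
    "Result: {result}\n"
    "Answer with exactly one label: 'relevant' or 'not relevant'."
)

def build_rater_prompt(query, result, role="a search quality rater"):
    """Fill the role-playing template for one query / result pair."""
    return PROMPT_TEMPLATE.format(role=role, query=query, result=result)

def parse_label(response):
    """Map a model reply to 1 (relevant), 0 (not relevant) or None (unparseable)."""
    text = response.strip().lower()
    if text.startswith("not relevant"):
        return 0
    if text.startswith("relevant"):
        return 1
    return None

prompt = build_rater_prompt("best running shoes", "A 2024 guide to trail running shoes")
print(prompt)
print(parse_label(" Relevant."))
```

Replies that cannot be parsed return None so they can be reviewed or re-requested rather than silently counted as either label.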

Bing’s research and implementation
Has caused quite a stir in the information retrieval community

“To measure agreement with real searchers needs high-quality “gold” labels, but with these we find that models produce better labels than third-party workers, for a fraction of the cost, and these labels let us train notably better rankers.” (Thomas et al, 2023)
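
Agreement between a set of labels (from an LLM or from crowd workers) and gold labels can be quantified in several ways. One common choice is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is an assumption about how such a check might look, with made-up label lists; it is not taken from the slides.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two label lists, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in categories)
    if expected == 1.0:
        # Both lists use a single identical category, so agreement is total.
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical gold labels versus labels produced by a model.
gold = [1, 1, 0, 0, 1, 0, 1, 0]
model = [1, 1, 0, 0, 1, 0, 0, 1]
print(cohens_kappa(gold, model))
```

A kappa of 1 means perfect agreement and 0 means no better than chance, so the same calculation run against crowd labels gives a like-for-like comparison.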

Bing’s LLM evaluators - “A fraction of the cost and
better rankers” (Thomas et al, 2023)

Bing’s LLM evaluators are monitored
Using several methods

A spectrum of LLM & human rater collaborative approaches?
‘Frontiers of Information Access Experimentation for Research and Education’ (Clarke et al, 2022)

Will human quality raters go?

“It is yet to be understood what the risks associated
with such technology are: it is likely that in the next few years, we will assist in a substantial increase in the usage of LLMs to replace human annotators.” (Clarke et al, 2022)

But… Concerns about reduced quality in exchange for scale
“It is a concern that machine-annotated assessments might degrade the quality, while dramatically increasing the number of annotations available.” (Clarke et al, 2022)

Will Google follow suit?

Bing test and release updates seamlessly and quickly
Potential for Google to go this way with better evaluation pipelines

Google cancels their contract with Appen

Some surmised a switch to AI evaluators was part of
the reason

Algorithms - bigger, broader, multi-modal / multi-aspected
Aspected algorithms quickly go into core or simultaneous
• Product reviews
• Helpful content classifier
• Panda historically
• Spam updates

Machine learning classifiers
Google is learning quickly:
• What ‘unhelpful content’ looks like
• What AI generated content looks like
• What paid links look like

AI content detection is where it’s at next
For the data labelling industry

Fun times ahead Enjoy!!!

Thank you
Twitter - @dawnieando
Website: https://bertey.com
LinkedIn - https://www.linkedin.com/in/msdawnanderson/
