
Humans vs AI Quality Raters for Search Engines

Dawn Anderson

November 21, 2024

Transcript

1. Humans vs LLMs as Quality Raters for Search Engines: Are major changes coming? Dawn Anderson, March 2024
2. Dawn Anderson • UK-based SEO consultant • 17 years in SEO • Occasional SEO conference speaker • EU, UK, US, Global Search Awards judge • Previous digital marketing lecturer & trainer • Industry publication contributor • Now predominantly consulting • Stalker of information retrieval threads and IR conference hashtags since 2017
3. A sea-change is coming for a fundamental part of search, on the other side of the ‘front door’
4. The important algorithmic ranking evaluation stage: Crawling (discovery & refresh) → Indexing (if importance thresholds are reached) → Serving (dynamic build at runtime, in response to a query) → Ranking (& re-ranking)
5. The process of search results evaluation: determine how well a ‘system’ (ranking system) fares, either currently (continuous evaluation) or when compared to proposed changes
6. The process of evaluation is both continuous, on existing ‘systems’, and intermittent, on proposed ‘system improvements’
7. AKA Google Algorithm Updates: Jagger, Big Daddy, Florida, Fritz, Everflux, Austin, Bourbon, PageRank, Dewey, Vince, Caffeine, Exact-Match, Penguin, BERT, RankBrain, Pigeon, Panda, Fred
8. Implicit evaluation (the ‘human’ in the loop has no awareness): • Tests on real searcher segments • Anonymous scroll and click behaviour • UX testing on any site (heatmaps / recordings all fall into this category)
9. Explicit evaluation (the human knows they are actively evaluating): • Searchers asked to provide feedback • Netflix users asked to thumbs-up a film • Spotify favouriting or playlist building, which leads to further recommendations • User groups / user panels • Sites asking for feedback • Professional expert relevance annotators • Paid human contractor evaluators
10. But it all mostly comes down to labels & labelling anyway. IMPORTANT: labels are training data for machine learning
11. Labels are all around us. In vast numbers they are converted into mathematical form as machine learning training data
12. A cohort of similar data labellers helps with recommender systems. Birds of a feather flock together… they like the same things (a toy similarity sketch follows)
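To make the “birds of a feather” idea concrete, here is a minimal, purely illustrative sketch of user-based similarity in a recommender setting. The users, items, and ratings are invented, and real recommender systems are far more involved.

```python
# Illustrative only: users who rate items similarly form a cohort, and the
# cohort's other likes become candidate recommendations. All data is made up.
from math import sqrt

ratings = {
    "alice": {"item_a": 5, "item_b": 4, "item_c": 1},
    "bob":   {"item_a": 5, "item_b": 5, "item_d": 4},
    "carol": {"item_a": 1, "item_c": 5, "item_d": 1},
}

def cosine_similarity(u: dict, v: dict) -> float:
    """Similarity over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

# Alice and Bob rate their shared items alike, so Bob's other like (item_d)
# is a candidate recommendation for Alice; Carol's tastes diverge.
print(cosine_similarity(ratings["alice"], ratings["bob"]))    # ~0.99
print(cosine_similarity(ratings["alice"], ratings["carol"]))  # ~0.38
```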
13. Data labels teach machines to know the difference between cats and dogs (supervised learning): cat, dog, dog, cat, cat, dog, cat, dog, dog, dog, cat (a toy sketch follows)
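As a toy illustration of how labels become training data, the sketch below fits a classifier on a handful of hand-labelled examples. The features and labels are made up, and nothing here reflects any search engine’s actual pipeline.

```python
# Illustrative only: a tiny supervised-learning example driven by labels.
from sklearn.linear_model import LogisticRegression

# Hypothetical hand-labelled examples: each row is a feature vector
# (e.g. [ear_pointiness, snout_length]) and each label is "cat" or "dog".
features = [[0.9, 0.2], [0.1, 0.8], [0.8, 0.3], [0.2, 0.9], [0.85, 0.25], [0.15, 0.7]]
labels   = ["cat", "dog", "cat", "dog", "cat", "dog"]

# The string labels are converted into mathematical form internally, and the
# model learns a decision boundary separating the two classes.
model = LogisticRegression()
model.fit(features, labels)

print(model.predict([[0.88, 0.22]]))  # expected: ['cat']
```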
14. Pair-wise, side-by-side comparisons of SERP results make up the majority of relevance evaluation exercises
15. Until… acceptable Precision at k (P@k) is achieved, e.g. enough of the top 10 or top 20 results (whatever k is chosen) are deemed relevant across enough samples (a small worked example follows)
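Precision at k is simple enough to show directly. The sketch below assumes binary relevance labels for a single ranked list; graded labels and averaging over many queries are left out.

```python
# Minimal sketch of Precision at k (P@k) for one ranked result list,
# assuming binary relevance labels (1 = relevant, 0 = not relevant).
def precision_at_k(relevance_labels, k):
    """Fraction of the top-k results judged relevant."""
    top_k = relevance_labels[:k]
    return sum(top_k) / k

# Hypothetical labels for a ranked SERP, best-ranked result first.
labels = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]
print(precision_at_k(labels, 10))  # 0.7 -> 7 of the top 10 judged relevant
```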
16. But… not all labels are created equally: • Gold labels - high quality / lower availability • Silver labels - lower quality / higher availability
17. But to the detriment of quality: “Such annotation tasks were delegated to crowd workers, with a substantial decrease in terms of quality of the annotation, compensated by a huge increase in annotated data.” (Clarke et al, 2022)
18. Data labelling industry crisis… demand outstrips supply: • There is a bottleneck (and it’s going to get worse) • Not enough labels are produced to keep pace with the size of machine learning models
19. “The global data collection and labeling market size was valued at $2.22 billion in 2022 and it is expected to expand at a compound annual growth rate of 28.9% from 2023 to 2030, with the market then expected to be worth $13.7 billion.” Source: Grand View Research, 2021
20. Data labellers work across many industries and many companies: • Maps • Assistant • AI content detection • Search quality evaluation • Image detection labelling • AI content detection training • Any other ML-driven application
21. DeepMind researchers: “We find current large language models are significantly undertrained, a consequence of the recent focus on scaling language models while keeping the amount of training data constant. …we find for compute-optimal training …for every doubling of model size the number of training tokens should also be doubled.” – “Training Compute-Optimal Large Language Models” (Hoffmann et al, 2022)
22. ‘The crowd is made of people - Observations from large scale crowd labelling’ (Thomas et al, 2022)
23. Bing researchers - ‘The Crowd is Made of People: Observations from large scale crowd-labelling’ (Thomas et al, 2022). Findings: • Fatigue • Time of day & day of week • Anchoring • Task-switching • Left-side bias • General disagreement on relevance
24. ‘Large language models can accurately predict searcher preferences’ (Thomas et al, 2023): • GPT-4 prompt engineering (role-playing prompt) • (Up to 5) LLM agents to emulate the behaviour of search relevance evaluators • Produce enough gold and silver labels to build relevance training data for much larger data sets • Train the agents initially on gold labels (an illustrative prompt sketch follows)
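The sketch below gives a flavour of a role-playing relevance-labelling prompt in the spirit of Thomas et al. (2023). The prompt wording, the 0–2 scale, and the call_llm() helper are all hypothetical stand-ins, not the actual prompt or pipeline used by Bing.

```python
# Illustrative sketch only: an LLM agent asked to play the role of a search
# relevance evaluator. The prompt and scale here are invented for illustration.
PROMPT_TEMPLATE = """You are a search quality rater evaluating search results.
Given a query and a result, rate how well the result answers the query on a
scale of 0 (irrelevant) to 2 (highly relevant). Answer with the number only.

Query: {query}
Result title: {title}
Result snippet: {snippet}
Rating:"""

def call_llm(prompt: str) -> str:
    """Placeholder for a call to whichever LLM API you use; returns a canned answer here."""
    return "2"

def label_result(query: str, title: str, snippet: str) -> int:
    """Build the role-playing prompt and parse the model's numeric rating."""
    prompt = PROMPT_TEMPLATE.format(query=query, title=title, snippet=snippet)
    return int(call_llm(prompt).strip())

print(label_result(
    "best waterproof hiking boots",
    "Top 10 Waterproof Hiking Boots, Tested",
    "We tested 25 pairs of boots in wet conditions over three months...",
))
```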
25. “To measure agreement with real searchers needs high-quality “gold” labels, but with these we find that models produce better labels than third-party workers, for a fraction of the cost, and these labels let us train notably better rankers.” (Thomas et al, 2023) (a small agreement-measurement sketch follows)
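One way to check a labeller against gold labels is a chance-corrected agreement measure such as Cohen’s kappa. The sketch below uses invented label arrays and is not the evaluation reported in the paper; it only shows the mechanics of comparing two labellers to a gold standard.

```python
# Illustrative only: comparing labellers against "gold" labels with Cohen's kappa,
# a common chance-corrected agreement measure. All label arrays are made up.
from sklearn.metrics import cohen_kappa_score

gold_labels  = [2, 0, 1, 2, 0, 1, 2, 1, 0, 2]  # expert / gold annotations
llm_labels   = [2, 0, 1, 2, 1, 1, 2, 1, 0, 2]  # labels produced by an LLM rater
crowd_labels = [2, 1, 1, 0, 1, 1, 2, 0, 0, 1]  # labels from third-party crowd workers

print("LLM vs gold:  ", cohen_kappa_score(gold_labels, llm_labels))
print("Crowd vs gold:", cohen_kappa_score(gold_labels, crowd_labels))
```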
26. Bing’s LLM evaluators - “A fraction of the cost and better rankers” (Thomas et al, 2023)
27. A spectrum of LLM & human rater collaborative approaches? ‘Frontiers of Information Access Experimentation for Research and Education’ (Clarke et al, 2022)
28. “It is yet to be understood what the risks associated with such technology are: it is likely that in the next few years, we will assist in a substantial increase in the usage of LLMs to replace human annotators.” (Clarke et al, 2022)
29. But… concerns about reduced quality in exchange for scale: “It is a concern that machine-annotated assessments might degrade the quality, while dramatically increasing the number of annotations available.” (Clarke et al, 2022)
30. Bing tests and releases updates seamlessly and quickly; there is potential for Google to go this way with better evaluation pipelines
31. Algorithms are getting bigger, broader, multi-modal / multi-aspect. Aspect-specific algorithms quickly move into the core algorithm or run simultaneously: • Product reviews • Helpful content classifier • Panda (historically) • Spam updates
32. Machine learning classifiers: Google is learning quickly • what ‘unhelpful content’ looks like • what AI-generated content looks like • what paid links look like