Slide 1

Slide 1 text

IMPROVING SEARCH @scale with efficient query experimentation.

from relevance guesswork to granular search experimentation

Slide 2

Slide 2 text

I’m ANDY

Founder & CTO @searchhub.io

Deep passion for improving Product Discovery (Search & Recs) for over 2 decades.

Slide 3

Slide 3 text

WHAT IS searchhub? IT’S THE AUTOPILOT FOR SEARCH: it consistently and incrementally improves your search, driven by data and experimentation, and works with any search engine. We currently monitor & enhance approximately 60 billion searches, generating around €3.5 billion in annual revenue for our clients.

visit searchhub.io

Slide 4

Slide 4 text

WARNING: today, I’m not going to talk about Gen-AI, Vector-Search or LLMs.

Slide 5

Slide 5 text

INSTEAD I’M GOING TO TALK ABOUT:
01 Improving SEARCH over time — framing the problem
02 SEARCH Experimentation / Testing
03 Metrics to choose
04 Sketching out a potential solution
05 Obstacles along the way
06 Is it worth it?

Slide 6

Slide 6 text

WHAT DOES IT MEAN TO IMPROVE SEARCH @scale 01

Slide 7

Slide 7 text

WHAT PAPERS TELL YOU

By plugging in technology/approach (X) you’ll see an improvement of (~ Y)%

It’s pretty easy to improve search in a snapshot scenario (fixed parameters) for a single point in time. 01

Slide 8

Slide 8 text

WHAT REALITY TEACHES YOU
❏ Indexed documents change over time
❏ Incoming queries change over time
❏ Configurations change over time (Syns, Boostings, Rewrites)
❏ Features change over time
You face different VERSIONS of your SEARCH over time.
[Diagram: repeated CONFIG / INDEX / QUERIES snapshots along a TIME axis] 01

Slide 9

Slide 9 text

DO CHANGES IMPROVE OR DEGRADE SEARCH QUALITY OVER TIME? [Chart: local optimum vs. global optimum] 01

Slide 10

Slide 10 text

WHEN REALITY HITS YOU
T0 (syn: laptop ~ notebook): "laptops" returns results of product-type: laptop
T1 (new assortment): "laptops" returns results of product-type: laptop OR product-type: sketchbook
01

Slide 11

Slide 11 text

WHEN REALITY HITS YOU
T0 (redirect: sneakers): "sneakers" finds …/shoes/
T1 (improved retrieval): "sneakers"
01

Slide 12

Slide 12 text

WHEN REALITY HITS YOU
T0 (keyword search): "dark chair with velvet cover without rotation function"
T1 (vector search): "dark chair with velvet cover without rotation function"
01

Slide 13

Slide 13 text

THOSE WHO CAN NOT LEARN FROM HISTORY ARE DOOMED TO REPEAT IT. ― GEORGE SANTAYANA

Slide 14

Slide 14 text

Measuring the impact of changes made to the retrieval system over time.

SEARCH EXPERIMENTATION 02

Slide 15

Slide 15 text

WHAT THE MARKET TELLS YOU

You’ll be able to incrementally improve your search by plugging in an Experimentation / Multivariate Testing Stack (X)

Really? Let’s think about this for a bit. 02

Slide 16

Slide 16 text

WHAT REALITY TEACHES YOU
What is our randomization unit? USERS OR SESSIONS?
[Diagram: 50% / 50% split]
Almost any experimentation system out there tries to randomize on users with a session fallback. But how can we judge query-based changes on a user or session level?
02

Slide 17

Slide 17 text

WHAT REALITY TEACHES YOU
USERS/SESSIONS vs. QUERIES: are the underlying distributions normally distributed?
Users and Sessions are, if the randomization function is properly designed.
Queries are not, as their likelihood follows a Zipfian distribution.
02

Slide 18

Slide 18 text

WHAT’S THE PROBLEM?
If we split our USERS/SESSIONS perfectly into 2 groups, how many queries could we experiment with?
We lose at least 60% of the initial opportunity.
02
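
A minimal sketch of the effect described above, assuming a Zipf-shaped query distribution and illustrative volumes and thresholds (these numbers are assumptions, not production figures): even with a perfect 50/50 user split, only a small share of queries gets enough traffic in both arms.

```python
# Sketch: why a perfect 50/50 USER split still wastes most query-level
# experiment opportunity. Query frequencies are Zipf-distributed, so
# long-tail queries rarely collect enough samples in BOTH arms.
# All parameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(42)

n_queries = 50_000          # distinct queries (assumption)
total_searches = 1_000_000  # total search volume (assumption)
min_per_arm = 50            # samples needed per arm to evaluate a query (assumption)

# Zipf-like query popularity: frequency of rank r proportional to 1/r
ranks = np.arange(1, n_queries + 1)
probs = (1.0 / ranks) / np.sum(1.0 / ranks)
searches_per_query = rng.multinomial(total_searches, probs)

# A perfect 50/50 split on users splits each query's traffic roughly binomially
arm_a = rng.binomial(searches_per_query, 0.5)
arm_b = searches_per_query - arm_a

testable = (arm_a >= min_per_arm) & (arm_b >= min_per_arm)
print(f"queries testable in both arms: {testable.sum()} / {n_queries} "
      f"({testable.mean():.1%})")
print(f"search volume covered by testable queries: "
      f"{searches_per_query[testable].sum() / total_searches:.1%}")
```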

Slide 19

Slide 19 text

WHAT WE LEARN
Can we guarantee variant equality (no SRM, Sample Ratio Mismatch)? For Users and Sessions: YES. For Queries: NO.
For changes with near-global impact, classic User/Session-based splitting works quite well. For query-dependent changes, it does not.
One way around this is query-based splitting: same randomization method but a different unit.
02
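
One way to read "same randomization method but a different unit" is to randomize each individual search request of a query under test instead of the user, so both variants collect traffic for that query. The sketch below illustrates that idea with deterministic hashing; the function names, salt and split share are illustrative assumptions, not searchhub's API.

```python
# Sketch: query-level splitting via per-request hashing. Each search request
# for a query under test is assigned independently, so every query's traffic
# ends up roughly 50/50 across variants regardless of which users issue it.
import hashlib

def assign_variant(query: str, request_id: str, experiment_salt: str,
                   treatment_share: float = 0.5) -> str:
    """Assign one search request for `query` to 'control' or 'treatment'."""
    key = f"{experiment_salt}:{query.lower().strip()}:{request_id}"
    digest = hashlib.sha256(key.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # deterministic, ~uniform in [0, 1]
    return "treatment" if bucket < treatment_share else "control"

# Two requests for the same query can land in different variants,
# so both arms collect observations for that query:
print(assign_variant("notebook 15 inch", "req-001", "exp-042"))
print(assign_variant("notebook 15 inch", "req-002", "exp-042"))
```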

Slide 20

Slide 20 text

DEPENDING ON THE TYPE OF CHANGE, WE NEED TO CHOOSE THE RIGHT RANDOMIZATION UNIT

Slide 21

Slide 21 text

Which metrics could significantly improve search over time

METRICS TO USE 03

Slide 22

Slide 22 text

WHAT MANAGERS WANT
Show me significant uplifts in North Star Metrics like CR AND/OR ARPU.
Problems with North Star Metrics:
1) low sensitivity of the north star metric
2) differences between the short-term and long-term impact on the north star metric
Pareto Optimal Proxy Metrics
03

Slide 23

Slide 23 text

WHAT IS LEAST FRAGILE
Showing significant uplift in positive direct proxy metrics is a lot easier and often less error-prone.
SEARCH-PATH / MICRO-SESSIONS to the rescue: assign events (clicks, carts, ...) to triggers (query) probabilistically.
KPIs with SEARCH-PATH scope:
❏ CTR: Click-through-rate
❏ ATBR: Add-to-Basket-rate
❏ ABRPQ: Avg-added-Basket-Revenue-per-Query
03
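
A minimal sketch of probabilistic event-to-trigger assignment inside a micro-session: each downstream event is distributed over the preceding queries with a time decay. The half-life and the exponential weighting are illustrative assumptions, not searchhub's actual attribution model.

```python
# Sketch: attribute events (clicks, add-to-basket, ...) to triggering queries
# probabilistically, using exponential time decay within a search path.
from dataclasses import dataclass
import math

@dataclass
class Query:
    text: str
    ts: float   # seconds since session start

@dataclass
class Event:
    kind: str   # "click", "add_to_basket", ...
    ts: float

def attribute(event: Event, queries: list[Query], half_life_s: float = 120.0) -> dict[str, float]:
    """Distribute one event over all preceding queries with exponential time decay."""
    candidates = [q for q in queries if q.ts <= event.ts]
    if not candidates:
        return {}
    weights = {q.text: math.exp(-math.log(2) * (event.ts - q.ts) / half_life_s)
               for q in candidates}
    total = sum(weights.values())
    return {text: w / total for text, w in weights.items()}

queries = [Query("laptop", 0.0), Query("laptop 15 inch", 90.0)]
event = Event("add_to_basket", 130.0)
print(attribute(event, queries))   # most credit goes to the more recent query
```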

Slide 24

Slide 24 text

IT’S BETTER TO HAVE A CLEAN DIRECT PROXY-METRIC THAN A FUZZY NORTH STAR METRIC

Slide 25

Slide 25 text

Measuring the impact of changes made to the retrieval system over time.

SKETCHING OUT A POTENTIAL SOLUTION 04

Slide 26

Slide 26 text

BUILDING-BLOCKS for Search-Experimentation
❏ Identifying Experiments: Which changes and metrics will be selected for an experiment?
❏ Experiment Design: How will the design of the experiment setup look?
❏ Experimentation Platform: How will the experiments be managed and evaluated?
04

Slide 27

Slide 27 text

BUILDING-BLOCKS: Identifying Experiments
Which changes and metrics will be selected for an experiment?
When dealing with millions of queries and thousands of alterations, a (semi-)automatic system for identifying experiments is essential.
Identification needs to be supported by observations of how results and user behavior evolve. Additionally, these observations help distinguish between global shifts and query-specific variations.
For efficiency it is also crucial to filter the identified experiments.
04

Slide 28

Slide 28 text

BUILDING-BLOCKS: Experiment Design
How will the design of the experiment setup look?
Choose the right type of randomization unit based on the classification of global changes vs. query-dependent changes.
Use expected data and prior data to initialize your experiments.
Reduce external influences and maintain minimal sparseness.
We use direct proxy metrics like CTR, ATBR, ABRPQ.
04
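
One common way to "use prior data to initialize your experiments" is to seed a Beta prior for a rate metric from pre-experiment observations; the sketch below assumes illustrative counts and a made-up prior-strength parameter and is not searchhub's initialization scheme.

```python
# Sketch: initialize a CTR estimate for an experiment arm with a Beta prior
# built from downweighted pre-experiment data, so early reads are less noisy.
from scipy import stats

prior_clicks, prior_searches = 180, 1_000   # pre-experiment data (assumption)
prior_strength = 0.2                        # downweight history vs. fresh data (assumption)

alpha0 = 1 + prior_clicks * prior_strength
beta0 = 1 + (prior_searches - prior_clicks) * prior_strength

# After some in-experiment traffic:
clicks, searches = 22, 100
posterior = stats.beta(alpha0 + clicks, beta0 + searches - clicks)
print(f"posterior mean CTR: {posterior.mean():.3f}, "
      f"95% interval: {posterior.ppf(0.025):.3f}-{posterior.ppf(0.975):.3f}")
```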

Slide 29

Slide 29 text

BUILDING-BLOCKS: Experimentation Platform
How will the experiments be managed and evaluated?
We need a system that automatically stores, starts, evaluates and ends experiments.
Ideally, we would employ various evaluation methods tailored to different types of metric distributions (such as binomial or non-binomial) and generate experiment results accordingly.
04
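
As an illustration of "evaluation methods tailored to different metric distributions", the sketch below pairs a two-proportion z-test with Welch's t-test; the data, thresholds and routing are assumptions, not searchhub's evaluation pipeline.

```python
# Sketch: evaluate binomial metrics (CTR, ATBR) with a two-proportion z-test
# and non-binomial metrics (e.g. revenue per query) with Welch's t-test.
import numpy as np
from scipy import stats

def eval_binomial(success_a, n_a, success_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (success_b / n_b - success_a / n_a) / se
    return 2 * stats.norm.sf(abs(z))

def eval_non_binomial(samples_a, samples_b):
    """Welch's t-test for continuous, unequal-variance metrics."""
    return stats.ttest_ind(samples_b, samples_a, equal_var=False).pvalue

print(eval_binomial(success_a=480, n_a=10_000, success_b=540, n_b=10_000))

rng = np.random.default_rng(0)
rev_a = rng.exponential(scale=3.0, size=5_000)   # skewed, revenue-like data
rev_b = rng.exponential(scale=3.2, size=5_000)
print(eval_non_binomial(rev_a, rev_b))
```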

Slide 30

Slide 30 text

so we just have to PUT IT ALL TOGETHER, throw PROD-DATA at it and CELEBRATE. WELL…

Slide 31

Slide 31 text

Prepare for trouble.

OBSTACLES ALONG THE WAY 05

Slide 32

Slide 32 text

SEARCH DATA IS SPARSE
Problem: It should be obvious, but it is shocking how sparse even direct KPIs like Searches, CTR and ATBR are.
Solutions:
❏ Aggregate by search-intent
❏ Find methods that provide valid experiment evaluations with less data.
05

Slide 33

Slide 33 text

IMPROVING SPARSENESS — 1
Solution: Aggregate by search-intent. Depending on the coverage/quality, this can reduce the sparseness by around 10-33%.
[Chart: share of total unique queries (%) vs. share of total query frequency (%), without vs. with aggregation]
05
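
A toy sketch of intent aggregation: real intent clustering is much richer (spelling correction, synonyms, embeddings), but even trivial normalization shows how query variants collapse onto one experiment unit. The normalization rules here are assumptions for illustration only.

```python
# Sketch: collapse raw query variants onto a shared search-intent key
# before counting traffic per experiment unit.
import re
from collections import Counter

def intent_key(query: str) -> str:
    """Case-insensitive, whitespace-insensitive, token-order-insensitive key."""
    tokens = re.findall(r"\w+", query.lower())
    return " ".join(sorted(tokens))

raw_queries = ["Laptop 15 inch", "laptop 15 inch", "15 inch laptop",
               "15  INCH  LAPTOP", "notebook"]
aggregated = Counter(intent_key(q) for q in raw_queries)
print(aggregated)   # four of the five raw queries collapse into one intent
```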

Slide 34

Slide 34 text

IMPROVING SPARSENESS — 2
Solution: Find statistical methods that provide valid experiment evaluations with less data (Group Sequential Testing).
We have specifically tuned the Lan-DeMets spending functions with ML to maximize sample-size efficiency in terms of early experiment stopping.
05
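
A rough sketch of plain group sequential testing with O'Brien-Fleming-style boundaries (the shape the Lan-DeMets spending approach approximates). It uses the classic z / sqrt(information fraction) approximation; exact boundaries require numerical integration, and this is not the ML-tuned spending functions mentioned on the slide.

```python
# Sketch: interim looks with approximate O'Brien-Fleming boundaries.
# Stop early only if the z statistic clears the (strict) early boundary.
import numpy as np
from scipy import stats

alpha = 0.05
z_final = stats.norm.ppf(1 - alpha / 2)
looks = np.array([0.25, 0.5, 0.75, 1.0])       # planned information fractions
boundaries = z_final / np.sqrt(looks)           # approximate OBF boundaries

def check_look(successes_a, n_a, successes_b, n_b, info_fraction):
    """Return (z statistic, stop?) at one interim look for a rate metric."""
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (successes_b / n_b - successes_a / n_a) / se
    boundary = z_final / np.sqrt(info_fraction)
    return z, abs(z) >= boundary

print(list(zip(looks, boundaries.round(2))))
print(check_look(120, 2_500, 165, 2_500, info_fraction=0.5))
```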

Slide 35

Slide 35 text

BE AWARE OF STATISTICAL POWER
Problem: With small sample sizes, controlling statistical power (Type II error) gets more important.
Solution: Do not use post-experiment power; instead, model the error by simulation and adjust the p-value accordingly.
See: A/B Testing Intuition Busters
05
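
A minimal sketch of estimating power by simulation rather than relying on post-hoc power: simulate the experiment many times under the effect size you care about and measure how often it reaches significance. All parameters are illustrative assumptions.

```python
# Sketch: Monte-Carlo power estimate for a binomial metric at a given
# sample size, base rate and relative uplift.
import numpy as np
from scipy import stats

def simulated_power(base_rate=0.05, uplift=0.05, n_per_arm=20_000,
                    alpha=0.05, n_sims=2_000, seed=0):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        a = rng.binomial(n_per_arm, base_rate)
        b = rng.binomial(n_per_arm, base_rate * (1 + uplift))
        p_pool = (a + b) / (2 * n_per_arm)
        se = np.sqrt(p_pool * (1 - p_pool) * 2 / n_per_arm)
        z = (b - a) / n_per_arm / se
        hits += 2 * stats.norm.sf(abs(z)) < alpha
    return hits / n_sims

print(f"power ~ {simulated_power():.2f}")   # far below 0.8 means the test is underpowered
```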

Slide 36

Slide 36 text

SEARCH DATA IS UNSTABLE
Problem: Search KPIs are very sensitive to trends and seasonality, which makes the data unstable (high variance).
Solutions: Cap experiment runtime (max 28 days) AND reduce variance via CUPED or other similar methods.
CUPED: Improving Sensitivity Of Controlled Experiments by Utilizing Pre-Experiment Data

KPI  | Samples needed to detect 5% uplift without CUPED | with CUPED | SAVINGS
CTR  | ~25k                                              | ~18k       | ~30%
ATBR | ~73k                                              | ~23k       | ~69%
05
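
A minimal sketch of the CUPED adjustment from the cited paper: use a pre-experiment covariate X (e.g. the same unit's metric from before the experiment) to remove variance from the in-experiment metric Y. The data here is synthetic and the correlation is assumed for illustration.

```python
# Sketch: CUPED variance reduction. The adjusted metric keeps the same mean
# but has lower variance, so fewer samples are needed for the same uplift.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(0.05, 0.02, n)               # pre-experiment metric per unit
y = 0.8 * x + rng.normal(0.01, 0.01, n)     # in-experiment metric, correlated with x

theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)   # optimal adjustment coefficient
y_cuped = y - theta * (x - x.mean())             # same mean, lower variance

print(f"variance reduction: {1 - y_cuped.var() / y.var():.1%}")
```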

Slide 37

Slide 37 text

SEARCH DATA IS IMBALANCED
Problem: Sample sizes are often very imbalanced, which increases minimum sample size and variance.

Query    | Unique Searches
laptop   | 2475
notebook | 4159

Solutions: Use the whole Search Funnel to guide and influence the user query distribution. Query Suggestions and Query Recommendations can balance sample sizes when we actively promote them.
05

Slide 38

Slide 38 text

SEARCH DATA IS IMBALANCED
❏ Query Suggestions: Promote the variant with too little traffic in your auto-suggestions to reduce the imbalance.
❏ Query Recommendations: Promote the variant with too little traffic in your query recommendations ("others searched for") to reduce the imbalance.
05

Slide 39

Slide 39 text

NOT EVERY EXPERIMENT MAKES SENSE
Problem: Even if data shows that an experiment has great potential, this could dramatically change over time.
Solution: Implement FAST-EXITS for cases like:
❏ Zero-Result Queries
❏ Marketing Campaigns
❏ Seasonal Changes
05

Slide 40

Slide 40 text

IS SIGNIFICANCE ALL WE CARE ABOUT?
Experimentation also involves managing risk. If an observed effect isn't statistically significant, but the likelihood of it being better is substantial and the risk of it being much worse is minimal, why not give it a try?
05
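
A minimal sketch of that risk-management view for a rate metric: estimate the probability that treatment is better and the expected loss of shipping it from Beta posteriors. The counts and any decision threshold are illustrative assumptions.

```python
# Sketch: "probability of being better" and "expected loss" instead of a
# pure significance decision, using Monte-Carlo draws from Beta posteriors.
import numpy as np

rng = np.random.default_rng(7)
draws = 100_000

# control: 480 conversions / 10_000, treatment: 530 / 10_000 (assumed counts)
control = rng.beta(1 + 480, 1 + 10_000 - 480, draws)
treatment = rng.beta(1 + 530, 1 + 10_000 - 530, draws)

p_better = (treatment > control).mean()
expected_loss = np.maximum(control - treatment, 0).mean()   # cost if treatment is worse

print(f"P(treatment better) = {p_better:.2f}")
print(f"expected loss if shipped = {expected_loss:.5f}")
# Ship if P(better) is high and the expected loss is below an acceptable threshold.
```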

Slide 41

Slide 41 text

Is incremental automated experimentation in SEARCH really worth it?

IS IT WORTH IT? 06

Slide 42

Slide 42 text

SOME FACTS from PROD
Last week we finished ~1900 Experiments for our customers, with a success-rate of 8.51% and an avg. Treatment Effect of +16.1%.
06

Slide 43

Slide 43 text

WHY YOU SHOULD CARE: Leveraging the power of tiny gains
In the past five months since its launch, Search Experimentation (Query Testing) has consistently boosted the overall weekly search KPIs by approximately 0.49% compared to the hold-out variant. While this may seem limited, it accumulates to an impressive overall improvement of 9.87% over the entire period, without any signs of decline (or of a seasonal effect).

Slide 44

Slide 44 text

WORK IN PROGRESS: Communicating Experiment Results
Together with our customers, we are still figuring out the best way to communicate experiment results to maximize impact. Unfortunately the interpretation of these results is not always straightforward:
❏ Contradicting KPIs (CTR improves, ATBR does not)
❏ Effect not significant but close, with very high potential (Risk Management)

Slide 45

Slide 45 text

WORK IN PROGRESS

Slide 46

Slide 46 text

NOW GO OUT AND MAKE SEARCH EXPERIMENTATION WORK FOR YOUR ORGANIZATION AS WELL

Slide 47

Slide 47 text

THANK YOU! Questions? Talk to me and stalk me here.
CREDITS: Presentation Template: SlidesMania, Icons: Flaticon