Robust AI Search Ranking for Radical C2C Marketplace Growth

1 Robust AI Search Ranking for Radical C2C Marketplace Growth
Search & Discovery, Ranking Team Teo Narboneta Zosa, Chingis Oinar

2 Intro Chingis Oinar Teo Narboneta Zosa

3 Content
• Search at Mercari
• Dataset Construction & Offline Evaluation
• Model Building
• De-biasing through Historical Data
• Takeaways & What’s Next

4 Search at Mercari

5 Introduction: About Mercari
• A global marketplace where anyone can buy & sell

6 Introduction: Japan’s largest C2C e-commerce marketplace
• Monthly active users (MAU): >23 million (1 in 4 adults) *98% of all e-commerce users in Japan
• Total number of listings to date: >3 billion
• Gross Merchandise Value (GMV): ~¥985 billion (~$6.5 billion USD)

7 Search at Mercari
• Primary discovery mechanism in Mercari’s marketplace
• Performance critical for user experience
• Retrieval powered by Elasticsearch
• Post-retrieval ML re-ranking, decoupled from Elasticsearch

8 Search at Mercari
Source: Building MLOps Infrastructure at Japan's Largest C2C E-Commerce Site (Berlin Buzzwords 2023)


10 Learning to Rank (LtR) for Search
• ML to learn the “ideal” ranking function
• ES SERP* order (~1000 items), e.g. Listing 1, Listing 2, Listing 3, Listing 4
• The ranking function reorders listings by relevancy, from most relevant listing to least relevant listing
• Client SERP* order (top K), e.g. Listing 2, Listing 3, Listing 1, Listing 4
*SERP: Search Engine Results Page

11 Examples
• Query: “rjレインボー” (“rj rainbow”)
[Figure: SERP pages 1 and 2 for the query, with the purchased item and the liked item marked]

13 Dataset Construction & Offline Evaluation

14 Dataset Construction v0: Implicit Judgments
• SERPs with at least 1 label
• Training: 5 days
• Validation: 1 day
• Relevance labels (high to low*): Purchase > Checkout (“Purchase Started”) > Comment > Like > Clicks
*If an item has multiple engagement labels, the most relevant label takes precedence
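The precedence rule above can be sketched in a few lines. This is a minimal illustration, not the team's pipeline: the event names and the numeric grades are hypothetical, and only the ordering of the labels comes from the slide.

```python
# Hypothetical event names and grades; only their ordering is taken from the slide.
LABEL_GRADE = {"purchase": 5, "checkout": 4, "comment": 3, "like": 2, "click": 1}


def relevance_label(events):
    """Return the grade of the most relevant engagement, or 0 if there is none."""
    return max((LABEL_GRADE.get(event, 0) for event in events), default=0)


def label_serp(item_events):
    """Map {item_id: [events]} to {item_id: grade} for one SERP."""
    return {item: relevance_label(events) for item, events in item_events.items()}


def has_label(labels):
    """A SERP is kept only if at least one of its items carries a label."""
    return any(grade > 0 for grade in labels.values())


serp = {"a": ["click", "like"], "b": ["click", "checkout", "purchase"], "c": []}
labels = label_serp(serp)
```

Here item "b" ends up with the purchase grade even though it was also clicked and checked out, and the SERP passes the "at least 1 label" filter.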

15 Offline Evaluation v0: Quantitative & Qualitative
[Figure: offline dataset variants 1 … n, each with an nDCG score (e.g. 0.69, 0.58)]
Why?
• Quick, low-overhead way to test and validate possible improvements
• Increase odds and magnitude of A/B test success
How?
• nDCG (normalized Discounted Cumulative Gain): measures ranking quality; the closer to 1, the better
• Qualitative “spot checks”: human-in-the-loop search quality validation (“vibe check”)
This needs to be really effective, reproducible, and fast
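As a reference point for the nDCG scores above, here is a minimal sketch of the metric. It assumes the common exponential gain (2^label - 1) and log2 position discount; the deck does not say which gain or discount is used in production, and the function names are illustrative.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with exponential gain and log2 discount."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(labels_in_ranked_order, k=None):
    """DCG of the model's order divided by the DCG of the ideal order."""
    ranked = labels_in_ranked_order[:k] if k else labels_in_ranked_order
    ideal = sorted(labels_in_ranked_order, reverse=True)
    ideal = ideal[:k] if k else ideal
    best = dcg(ideal)
    return dcg(ranked) / best if best > 0 else 0.0


perfect = ndcg([3, 2, 1, 0])   # labels already sorted by relevance
swapped = ndcg([0, 1])         # the only relevant item sits at rank 2
```

A SERP whose labels are already in descending order scores 1, and pushing the relevant item down the list lowers the score.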

22 Qualitative Spot Checks

Dataset Construction v1: High-Signal Labels
• Training: … days
• Validation: 1 day
• Relevance labels (high to low*): Purchase > Checkout (“Purchase Started”) > Comment > Like > Clicks
*If an item has multiple engagement labels, the most relevant label takes precedence

25 Offline Evaluation v1: Custom Metrics
At Mercari we rely heavily on:
• Purchase nDCG: non-purchase labels are masked
Key to repeatedly delivering strong & enduring business impact

29 Custom Metrics: Purchase nDCG
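A minimal sketch of what masking non-purchase labels could look like. The grade value for a purchase and the binary gain after masking are assumptions for illustration; the deck only states that non-purchase labels are masked.

```python
import math

PURCHASE = 5  # hypothetical grade assigned to the purchase label


def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def purchase_ndcg(labels_in_ranked_order):
    """nDCG computed after every non-purchase label is masked to 0."""
    masked = [1 if g == PURCHASE else 0 for g in labels_in_ranked_order]
    best = dcg(sorted(masked, reverse=True))
    return dcg(masked) / best if best > 0 else 0.0


top = purchase_ndcg([5, 1, 2])     # purchased item ranked first
second = purchase_ndcg([2, 5, 1])  # a liked item outranks the purchased one
```

With this masking, likes and clicks no longer earn any credit, so the metric only rewards ranking the purchased item high.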


Offline Evaluation v1: Custom Metrics
At Mercari we rely heavily on:
• Purchase nDCG: non-purchase labels are masked
• Counterfactual nDCG: weighted nDCG where some training samples (SERPs) might get higher weights to counterbalance engagement position bias
Key to repeatedly delivering strong & enduring business impact
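One way to read "weighted nDCG" is a weighted average of per-SERP nDCG. The sketch below takes that reading as an assumption; how the per-SERP weights are derived (for example from position propensities) is not specified here, so the weights are passed in as plain numbers.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with exponential gain and log2 discount."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(labels_in_ranked_order):
    best = dcg(sorted(labels_in_ranked_order, reverse=True))
    return dcg(labels_in_ranked_order) / best if best > 0 else 0.0


def weighted_ndcg(serps, weights):
    """Weighted mean of per-SERP nDCG; larger weights count a SERP more."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * ndcg(s) for s, w in zip(serps, weights)) / total


serps = [[1, 0], [0, 1]]  # engagement at rank 1 vs. engagement at rank 2
uniform = weighted_ndcg(serps, [1.0, 1.0])
reweighted = weighted_ndcg(serps, [1.0, 3.0])  # up-weight the lower-position engagement
```

Up-weighting the SERP whose engagement happened at a lower position makes a ranker's failure on that SERP count more, which lowers the aggregate score in this toy case.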

Dataset Construction v2: High-Signal Filtering
• Training: … days
• Validation: 1 day
• Relevance labels (high to low*): Purchase > Checkout (“Purchase Started”) > Comment > Like > Clicks
*If an item has multiple engagement labels, the most relevant label takes precedence

33 Dataset Construction v2: High-Signal Filtering
• SERPs with at least 1 future purchase label
• Training: 28 days
• Validation: 7 days
• Relevance labels (high to low*): Purchase > Checkout (“Purchase Started”) > Comment > Like > Clicks
*If an item has multiple engagement labels, the most relevant label takes precedence

34 Unique Items: Sparse Labels

Dataset Construction v3: High-Signal Dense Labels
• Training: 28 days
• Validation: 7 days
• Relevance labels (high to low*): Purchase > Checkout (“Purchase Started”) > Comment > Like > [Other User] Purchase > [Other User] Checkout (“Purchase Started”) > Clicks
*If an item has multiple engagement labels, the most relevant label takes precedence

37 Unique Items: Dense Labels

Dataset Construction v4: “Good Listing” Labels
• Training: 28 days
• Validation: 7 days
• Relevance labels (high to low*): Purchase > Checkout (“Purchase Started”) > Comment > Like > [Other User] Purchase > [Other User] Checkout (“Purchase Started”) > Clicks on “Sold” Items > Clicks
*If an item has multiple engagement labels, the most relevant label takes precedence

40 Unique Items: “Good Listing” Labels

41 Dataset Construction v5: Explicit Judgements via LLM
• SERPs with at least 1 ??? label
• Training: 28 days
• Validation: 7 days
• Possible relevance labels (high to low):
  ◦ “Likely to be purchased”
  ◦ “Likely to inspire users to list their items”
  ◦ “An exemplar listing likely to reinforce trust in our marketplace and its reputability”

42 Model Building

43 Introduction to Learning to Rank
• Pointwise: considers items independently (Regression, Classification)
• Pairwise: considers item pairs (RankNet, RankSVM)
• Listwise: considers item lists (ListNet, ApproxNDCG, GumbelApproxNDCG)
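To make the three families concrete, here is a small sketch with one textbook loss per family: squared error (pointwise), a RankNet-style logistic loss over ordered pairs (pairwise), and a ListNet-style top-one softmax cross-entropy (listwise). These are generic formulations chosen for illustration, not the losses used in the production model.

```python
import math


def pointwise_mse(scores, labels):
    """Pointwise: each item is scored against its own label."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


def pairwise_logistic(scores, labels):
    """Pairwise (RankNet-style): penalize pairs where the better item scores lower."""
    losses = [
        math.log1p(math.exp(-(si - sj)))
        for si, yi in zip(scores, labels)
        for sj, yj in zip(scores, labels)
        if yi > yj
    ]
    return sum(losses) / len(losses) if losses else 0.0


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def listwise_softmax_ce(scores, labels):
    """Listwise (ListNet top-one): cross-entropy between label and score distributions."""
    target = softmax(labels)
    predicted = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))


labels = [2, 1, 0]
good, bad = [3.0, 2.0, 1.0], [1.0, 2.0, 3.0]
```

For the same labels, the correctly ordered scores `good` yield a lower pairwise and listwise loss than the reversed scores `bad`.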

44 Feature Engineering - Nomenclature
• In our model, there are two types of features:
  ◦ Context features contain information about the SERP
  ◦ Document features contain information about each listing
• Context: 1. Query → tokens 2. User behaviour 3. SERP statistics
• Document: 1. Listing a: price, title, freshness... 2. Listing b: price, title, freshness... 3. Listing c: price, title, freshness...

45 Feature Transformation
Features in LTR datasets are diverse and can be of different scales.
• MeCab for Japanese text tokenization
• Z-score normalization for SERP statistics
• “log1p” transformation for numeric features (Qin et al., 2021)
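The two numeric transformations can be sketched as follows. The sign-preserving form of log1p is an assumption about how the transformation is applied, and the function names are illustrative; tokenization with MeCab is left out because it needs an external dependency.

```python
import math
import statistics


def z_score(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def log1p_transform(values):
    """Sign-preserving log1p: sign(x) * log(1 + |x|), which compresses large magnitudes."""
    return [math.copysign(math.log1p(abs(v)), v) for v in values]


serp_stat = z_score([1.0, 2.0, 3.0])
prices = log1p_transform([0.0, 300.0, 98000.0])
```

After the transformation, a price of 98000 is only about twice as large as a price of 300, which keeps heavy-tailed features from dominating the input.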

46 Model Overview

47 Model De-biasing

48
• Historically, we started with ES as the sole ranking function, producing item_rank.
• Today, we feed the ES item_rank to our reranker models as a feature, giving them information from the retrieval stage.
• However, because engagement concentrates at the top of SERPs for highly optimized queries, models can become over-reliant on item_rank.
  ◦ Overall, retrieval is a noisy task.
  ◦ Potential spurious correlations.
• Lazy solution: De-biasing Through Augmentations

49 Learning from Implicit Feedback: Difficulties
Implicit feedback brings its own difficulties:
• Noise:
  ◦ Users click for unexpected reasons.
  ◦ Many clicks happen for reasons other than relevancy.
  ◦ Many clicks do not occur despite relevancy.
• Bias:
  ◦ Position bias: higher-ranked documents get more attention.
  ◦ Item selection bias: interactions are limited to the presented documents.
  ◦ Presentation bias: results that are presented differently will be treated differently.

50 Position Bias
Most commonly adopted probabilistic model:
• The click probability depends on the relevance probability and observation probability.
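The slide's formula did not survive in the transcript. Assuming the standard position-based model (the examination hypothesis), the dependency can be written as a product, where C, R and O denote click, relevance and observation, q is the query, d the document and k the displayed position (symbols chosen here for illustration):

```latex
P(C = 1 \mid q, d, k) = P(R = 1 \mid q, d) \cdot P(O = 1 \mid k)
```

Under this assumption the observation term depends only on the position, which is why it can be estimated separately and used as a propensity.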

51 Estimating Position Bias: Random Shuffling
How can this propensity score be estimated for position bias?
• If you display the products in random order, only the rank affects clicks.
• Randomly shuffle the top n items
• Record clicks -> aggregate clicks per rank
• Normalize to obtain propensities
However, this hurts users’ search experience.
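The aggregate-and-normalize steps can be sketched as below. The sketch assumes every rank received the same number of impressions during shuffling and normalizes relative to rank 1; the function name and input format are hypothetical.

```python
from collections import Counter


def estimate_propensities(click_ranks, num_ranks):
    """Estimate relative position propensities from clicks logged under random shuffling.

    click_ranks: 1-based ranks at which clicks were recorded.
    Returns one value per rank, scaled so that rank 1 has propensity 1.0.
    """
    counts = Counter(click_ranks)
    top = counts.get(1, 0)
    if top == 0:
        raise ValueError("no clicks at rank 1; cannot normalize")
    return [counts.get(rank, 0) / top for rank in range(1, num_ranks + 1)]


# Toy click log: 4 clicks at rank 1, 2 at rank 2, 1 at rank 3.
propensities = estimate_propensities([1, 1, 1, 1, 2, 2, 3], num_ranks=3)
```

Because items are shuffled, relevance is the same in expectation at every rank, so the differences in click counts can be attributed to position alone.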

52 Estimating Position Bias: Intervention Harvesting
• Main idea: in real-world production systems many (randomized) interventions take place, such as A/B tests. Can we use these interventions instead?
• This approach is called intervention harvesting (Agarwal et al. (2017); Fang et al. (2019); Agarwal et al. (2019b))
Image source: https://notesonai.com/Counterfactual+Evaluation+and+LTR

53 Estimating Position Bias: High-level • We only need propensities
proportional to the true observation probability for learning.

54 Position Bias @ Mercari
Position bias on Mercari:
• Rows positioned higher typically get greater engagement
• After a few scrolls, users tend to engage more with the central positions than with the right and left positions.
When biased data is consumed by a machine learning model used in production:
• it may create a feedback loop that reinforces the initial bias.

55 Model De-biasing: In Practice
How IPW is Used to Update the Model
• In tensorflow_ranking, you pass document weights as a feature and it will magically handle computations for you.
• However, we apply weights on labels instead of the loss. Thus,
  ◦ the higher the weight -> the larger the update -> example is prioritized more
  ◦ the lower the weight -> the smaller the update -> example is prioritized less
Where weight is [formula shown as an image on the slide]
Image source: https://github.com/tensorflow/ranking/blob/v0.5.2/tensorflow_ranking/python/losses_impl.py#L991-L1002
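The difference between weighting the loss and weighting the labels can be illustrated with a toy pointwise example. Everything here is an assumption made for illustration: the weight is taken to be the inverse of the position propensity, and squared error stands in for the ranking loss; neither is confirmed by the slide, and this is not the tensorflow_ranking implementation.

```python
def ipw_weights(propensities):
    """Inverse propensity weights: rarely observed positions get larger weights."""
    return [1.0 / p for p in propensities]


def loss_weighted_se(scores, labels, weights):
    """IPW on the loss: each item's squared error is scaled by its weight."""
    return sum(w * (y - s) ** 2 for s, y, w in zip(scores, labels, weights))


def label_weighted_se(scores, labels, weights):
    """IPW on the labels: the target itself is scaled before the error is computed."""
    return sum((w * y - s) ** 2 for s, y, w in zip(scores, labels, weights))


scores = [0.5, 0.5]
labels = [1.0, 1.0]                 # both items were engaged with
weights = ipw_weights([1.0, 0.5])   # the second item sat at a less visible position

on_loss = loss_weighted_se(scores, labels, weights)
on_labels = label_weighted_se(scores, labels, weights)
```

In both variants the item with the larger weight contributes more to the update; with label weighting, the model is additionally pushed toward a higher target score for that item.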

56 Model De-biasing: Alternative Approach
Using the estimated biases for each position, we can de-bias our training data and therefore our machine learning model.
• Instead of weighting sample losses, we apply the weights to the labels
• We suspect that we over-penalized our model
Offline Purchase nDCG
Method        Full Data   Tail Queries
Baseline      48.84       52.45
IPW-Loss      48.74       52.66
IPW-Labels    49.10       52.92

57 Takeaways & What’s Next?

58 Takeaways & What’s Next?
• Foundation to get to and stay in production (Berlin Buzzwords 2023)
• Systems for tight eval & data enrichment feedback loops (Berlin Buzzwords 2024; this talk)

60 The ROI of AI
[Chart: GMV¹ (Billion JPY) and MAU² (Million users), annotated YoY +9% and YoY +10%]
1. GMV: aggregate transaction value after adjusting for cancellations; aggregates C2C and B2C figures
2. MAU: quarterly average number of users who browsed our service (app or web) at least once during a given month

Takeaways & What’s Next?
• Foundation to get to and stay in production (Berlin Buzzwords 2023)
• Systems for tight eval & data enrichment feedback loops (Berlin Buzzwords 2024; this talk)
• Meta-foundation & system to unify AI applications across Mercari (COMING SOON!)

62 Acknowledgements Kaiyi Liu Kentaro Takiguchi Yui Takeuchi Takuma Kinoshita
Asir Saeed Daniele Hohol Antoine Lecubin Ryan Ginstrom Pathompong Yupensuk Tomohiro Furusawa

63 Questions? Thank You!
