Robust AI Search Ranking for Radical C2C Marketplace Growth

In today's competitive e-commerce landscape, effective search ranking systems are no longer a luxury, but a necessity. At Mercari, Japan’s largest C2C e-commerce marketplace, AI search ranking is the flagship feature in our commitment to continuously integrate AI into search, driving significant engagement and GMV uplift in our pursuit to provide the best search experience for our millions of users.

In this talk, we'll delve into key components of Mercari's search ranking system, specifically:

Dataset Construction: Demonstrating how we build a rich and diverse dataset incorporating user behavior, item attributes, and marketplace dynamics. We will show how we deal with implicit feedback specific to two-sided marketplaces with unique items.

Model Tracking & Monitoring: Showing how we approach tracking and monitoring our ML ranking models, including key custom metrics we’ve developed for robust evaluation. We will show how our custom metrics make our model debugging simpler and ensure our features have a measurable business impact in online testing.

Model Building: Sharing the technical details of our model building process, exploring what learning-to-rank is as well as our goals for optimal performance.

De-biasing through Historical Data: Outlining the potential problems inherent in implicit feedback and how we deal with potential biases in our ranking models through counterfactual learning. We will show how biased behavior affects the model learning process, what factors can trigger specific user behaviors, and how we mitigate these biases in a way that maintains and enhances model performance over time.

bigcowana

June 17, 2024

Transcript

  1. 1 Robust AI Search Ranking for Radical C2C Marketplace Growth

    Search & Discovery, Ranking Team Teo Narboneta Zosa, Chingis Oinar
  2. 3 Content

    • Search at Mercari
    • Dataset Construction & Offline Evaluation
    • Model Building
    • De-biasing through Historical Data
    • Takeaways & What’s Next
  3. 6 Introduction: Japan’s largest C2C e-commerce marketplace

    • Monthly active users (MAU): >23 million (1 in 4 adults) *98% of all e-commerce users in Japan
    • Total number of listings to date: >3 billion
    • Gross Merchandise Value (GMV): ~¥985 billion (~$6.5 billion USD)
  4. 7 Search at Mercari

    • Primary discovery mechanism in Mercari’s marketplace
    • Performance critical for user experience
    • Retrieval powered by Elasticsearch
    • Post-retrieval ML re-ranking, decoupled from Elasticsearch
  5. 8 Search at Mercari

    source: Building MLOps Infrastructure at Japan's Largest C2C E-Commerce Site (Berlin Buzzwords 2023)
  7. 10 Learning to Rank (LtR) for Search

    • ML to learn the “ideal” ranking function
    • ES SERP* order (~1000 items): Listing 1, Listing 2, Listing 3, Listing 4
    • Client SERP* order (top K), from most relevant to least relevant listing: Listing 2, Listing 3, Listing 1, Listing 4
    *SERP: Search Engine Results Pages
  8. 11 Examples

    Query: “rjレインボー” (“rj rainbow”)
    Figure: result pages 1 and 2, with a purchased item and a liked item highlighted
  9. 14 Dataset Construction v0: Implicit Judgments

    SERPs with at least 1 label
    • Training: 5 days
    • Validation: 1 day
    Relevance Labels (high to low*):
    • Purchase
    • Checkout (“Purchase Started”)
    • Comment
    • Like
    • Clicks
    *If an item has multiple engagement labels, the most relevant label takes precedence
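
The precedence rule for implicit labels can be sketched in a few lines of Python. This is a minimal illustration, not the production pipeline: the event names, the numeric grades in `LABEL_GRADES`, and the helper names `relevance_grade` and `label_serp` are all assumptions, since the slide only gives the ordering of the labels.

```python
# Hypothetical numeric grades: the slide gives only the ordering, not the values.
LABEL_GRADES = {
    "click": 1,
    "like": 2,
    "comment": 3,
    "checkout": 4,  # "Purchase Started"
    "purchase": 5,
}


def relevance_grade(events):
    """Return the grade of the most relevant event an item received (0 if none)."""
    return max((LABEL_GRADES[e] for e in events if e in LABEL_GRADES), default=0)


def label_serp(item_events):
    """Grade every item on one SERP; return None when the SERP has no label at all."""
    grades = {item_id: relevance_grade(events) for item_id, events in item_events.items()}
    return grades if any(g > 0 for g in grades.values()) else None


# One SERP: item_b was both clicked and purchased, so "purchase" takes precedence.
serp = {"item_a": ["click", "like"], "item_b": ["click", "purchase"], "item_c": []}
print(label_serp(serp))
```

Returning `None` for SERPs without any engagement mirrors the "SERPs with at least 1 label" filter: such SERPs would simply be dropped from the dataset.
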
  14. 19 Offline Evaluation v0: Quantitative & Qualitative

    Offline dataset variant 1 (nDCG score: 0.69) … Offline dataset variant n (nDCG score: 0.58)
    Why?
    • Quick, low-overhead way to test and validate possible improvements
    • Increase odds and magnitude of A/B test success
    How?
    • Normalized Discounted Cumulative Gain (nDCG): measures ranking quality; the closer to 1, the better
    • Qualitative “spot checks”: human-in-the-loop search quality validation (“vibe check”)
    This needs to be really effective, reproducible, and fast
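
For readers unfamiliar with nDCG, here is a minimal pure-Python sketch of the metric for a single SERP. The exponential gain `2**g - 1` and the `log2` discount are the common textbook choices and are an assumption here; the slides do not say which gain function is used.

```python
import math


def dcg(gains, k=None):
    """Discounted cumulative gain of graded labels listed in ranked order."""
    gains = gains[:k] if k is not None else gains
    # Assumed gain: 2**g - 1; discount: log2(rank + 1) with 1-based ranks.
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(gains, k=None):
    """DCG divided by the DCG of the ideal (label-sorted) ordering; 0.0 if no labels."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0


# Labels of the items in the order the ranker returned them.
print(ndcg([0, 5, 1, 0]))  # below 1 because the purchased item is not ranked first
print(ndcg([5, 1, 0, 0]))  # ideal ordering
```

Averaging `ndcg` over all SERPs in a validation set gives a single score per dataset variant or model, which is what makes quick offline comparisons possible.
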
  16. 23 Dataset Construction v1: High-Signal Labels

    SERPs with at least 1 label
    • Training: 5 days
    • Validation: 1 day
    Relevance Labels (high to low*):
    • Purchase
    • Checkout (“Purchase Started”)
    • Comment
    • Like
    • Clicks
    *If an item has multiple engagement labels, the most relevant label takes precedence
  22. 31 Offline Evaluation v1: Custom Metrics ✨ ✨

    At Mercari we rely heavily on:
    • Purchase nDCG: non-purchase labels are masked
    • Counterfactual nDCG: weighted nDCG where some training samples (SERPs) might get higher weights to counterbalance engagement position-bias
    Key to repeatedly delivering strong & enduring business impact
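
The two custom metrics can be sketched as thin wrappers around plain nDCG. This is an illustrative reading of the slide, not the team's implementation: `purchase_grade`, the binary masking, and the per-SERP weighted average in `weighted_ndcg` are assumptions about how "masked" and "weighted" are realised.

```python
import math


def ndcg(gains):
    """nDCG for one SERP, with labels given in ranked order (assumed 2**g - 1 gain)."""
    def dcg(gs):
        return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(gs))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def purchase_ndcg(labels, purchase_grade=5):
    """Mask every non-purchase label to 0, keep purchases as 1, then score."""
    return ndcg([1 if g == purchase_grade else 0 for g in labels])


def weighted_ndcg(serps, weights):
    """Weighted mean of per-SERP nDCG; weights could come from inverse propensities."""
    total = sum(weights)
    return sum(w * ndcg(labels) for labels, w in zip(serps, weights)) / total


# A liked item (grade 2) ranked above the purchased item (grade 5) hurts purchase nDCG.
print(purchase_ndcg([2, 5, 0]))
# The second SERP counts three times as much as the first.
print(weighted_ndcg([[1, 0], [0, 1]], [1.0, 3.0]))
```

Masking makes the metric answer one narrow question (where did the purchased item end up?), which is easier to relate to business outcomes than a score mixed across all engagement types.
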
  24. 33 Dataset Construction v2: High-Signal Filtering

    SERPs with at least 1 future purchase label
    • Training: 28 days
    • Validation: 7 days
    Relevance Labels (high to low*):
    • Purchase
    • Checkout (“Purchase Started”)
    • Comment
    • Like
    • Clicks
    *If an item has multiple engagement labels, the most relevant label takes precedence
  26. 36 Dataset Construction v3: High-Signal Dense Labels

    SERPs with at least 1 future purchase label
    • Training: 28 days
    • Validation: 7 days
    Relevance Labels (high to low*):
    • Purchase
    • Checkout (“Purchase Started”)
    • Comment
    • Like
    • [Other User] Purchase
    • [Other User] Checkout (“Purchase Started”)
    • Clicks
    *If an item has multiple engagement labels, the most relevant label takes precedence
  28. 39 Dataset Construction v4: “Good Listing” Labels

    SERPs with at least 1 future purchase label
    • Training: 28 days
    • Validation: 7 days
    Relevance Labels (high to low*):
    • Purchase
    • Checkout (“Purchase Started”)
    • Comment
    • Like
    • [Other User] Purchase
    • [Other User] Checkout (“Purchase Started”)
    • Clicks on “Sold” Items
    • Clicks
    *If an item has multiple engagement labels, the most relevant label takes precedence
  29. 41 Dataset Construction v5: Explicit Judgements via LLM

    SERPs with at least 1 ??? label
    • Training: 28 days
    • Validation: 7 days
    Possible Relevance Labels (high to low):
    • “Likely to be purchased”
    • “Likely to inspire users to list their items”
    • “An exemplar listing likely to reinforce trust in our marketplace and its reputability”
  30. 43 Introduction to Learning to Rank

    • Pointwise: considers items independently (Regression, Classification)
    • Pairwise: considers item pairs (RankNet, RankSVM)
    • Listwise: considers item lists (ListNet, ApproxNDCG, GumbelApproxNDCG)
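
To make the three families concrete, here is a small pure-Python sketch of one representative loss from each, computed for a single SERP. The specific choices (mean squared error, a RankNet-style logistic loss, and a ListNet-style top-one softmax cross-entropy) are illustrative assumptions; the slides do not state which loss the production model uses.

```python
import math


def pointwise_mse(labels, scores):
    """Pointwise: each item is scored against its own label, independently."""
    return sum((s - y) ** 2 for y, s in zip(labels, scores)) / len(labels)


def pairwise_logistic(labels, scores):
    """Pairwise (RankNet-style): penalise pairs where the better item is not scored higher."""
    losses = [
        math.log1p(math.exp(-(s_i - s_j)))
        for y_i, s_i in zip(labels, scores)
        for y_j, s_j in zip(labels, scores)
        if y_i > y_j
    ]
    return sum(losses) / len(losses) if losses else 0.0


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_softmax_ce(labels, scores):
    """Listwise (ListNet-style): cross-entropy between label and score distributions."""
    target, predicted = softmax(labels), softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))


labels = [2, 1, 0]
good, bad = [3.0, 2.0, 1.0], [1.0, 2.0, 3.0]
for loss in (pointwise_mse, pairwise_logistic, listwise_softmax_ce):
    print(loss.__name__, loss(labels, good), loss(labels, bad))
```

All three losses are lower for the correctly ordered scores, but only the pairwise and listwise variants depend purely on relative scores within the SERP, which is closer to what a ranking metric such as nDCG measures.
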
  31. 44 Feature Engineering - Nomenclature

    In our model, there are two types of features:
    • Context features contain information about the SERP: 1. Query → tokens 2. User behaviour 3. SERP statistics
    • Document features contain information about each listing: 1. Listing a (price, title, freshness...) 2. Listing b (price, title, freshness...) 3. Listing c (price, title, freshness...)
  32. 45 Feature Transformation

    Features in LTR datasets are diverse and can be of different scales.
    • MeCab for Japanese text tokenization
    • Z-score normalization for SERP statistics
    • “log1p” transformation for numeric features (Qin et al., 2021)
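
The two numeric transformations can be sketched as follows. This is a minimal illustration under assumptions: the sign-preserving form of log1p is one common way to handle negative inputs, and normalising within a single list of values is a simplification (in practice the mean and standard deviation would more likely be computed over the training set). The function names are hypothetical.

```python
import math


def zscore(values):
    """Standardise values to zero mean and unit variance (population std)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0] * n  # constant feature: nothing to scale
    return [(v - mean) / std for v in values]


def signed_log1p(x):
    """Compress large magnitudes: sign(x) * log(1 + |x|)."""
    return math.copysign(math.log1p(abs(x)), x)


prices = [300.0, 1500.0, 98000.0]  # made-up listing prices in JPY
print([round(signed_log1p(p), 3) for p in prices])
print([round(z, 3) for z in zscore([12.0, 15.0, 30.0])])
```

The log transform keeps a few very expensive listings from dominating the feature's range, while z-scoring puts SERP-level statistics on a comparable scale.
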
  33. 48 • Historically, we started with ES as the sole ranking function producing item_rank.

    • As of now, we rely on the ES item_rank feature to give our reranker models the information from the retrieval stage.
    • However, because engagement concentrates at the top of SERPs for highly optimized queries, models could become over-focused on item_rank.
      ◦ Overall, retrieval is a noisy task.
      ◦ Potential spurious correlations.
    • Lazy solution: De-biasing Through Augmentations
  34. 49 Learning from Implicit Feedback: Difficulties

    Implicit feedback brings its own difficulties:
    • Noise:
      ◦ Users click for unexpected reasons.
      ◦ Many clicks happen for reasons other than relevancy.
      ◦ Many clicks do not occur despite relevancy.
    • Bias:
      ◦ Position bias: Higher ranked documents get more attention.
      ◦ Item selection bias: Interactions are limited to the presented documents.
      ◦ Presentation bias: Results that are presented differently will be treated differently.
  35. 50 Position Bias

    Most commonly adopted probabilistic model:
    • The click probability depends on the relevance probability and observation probability.
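
The slide's formula did not survive in the transcript. A common way to write this kind of model is the position-based (examination) model shown below; the symbols are assumptions introduced here, with $C$ for a click, $O$ for the item being observed at rank $k$, and $R$ for the item $d$ being relevant to query $q$.

```latex
P(C = 1 \mid q, d, k) = P(O = 1 \mid k) \cdot P(R = 1 \mid q, d)
```

Under this factorisation, $P(O = 1 \mid k)$ plays the role of the propensity: it depends only on the rank, so dividing observed clicks by it gives an estimate of relevance that is corrected for position.
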
  36. 51 Estimating Position Bias: Random Shuffling

    How can this propensity score be estimated for position bias?
    • If you randomly display the products, it's only the rank that matters.
    • Randomly shuffle the top n items
    • Record clicks -> Aggregate clicks per rank
    • Normalize to obtain propensities
    However, this hurts users’ search experience.
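
The aggregate-and-normalize steps can be sketched as below. This is a toy illustration under assumptions: every rank is taken to be shown equally often during the shuffling experiment, and propensities are normalized so that the top rank equals 1, which is one of several valid choices. The function name and the click log are hypothetical.

```python
from collections import Counter


def estimate_propensities(click_ranks, num_ranks):
    """Estimate relative observation propensities from clicks on shuffled results.

    click_ranks: 0-based ranks at which clicks occurred while the top items
    were displayed in random order (so relevance is independent of rank).
    """
    counts = Counter(click_ranks)
    top = counts.get(0, 0)
    if top == 0:
        raise ValueError("no clicks at the top rank; cannot normalize")
    # Normalize by the top rank so that its propensity is 1.0.
    return [counts.get(rank, 0) / top for rank in range(num_ranks)]


# Made-up click log: 4 clicks at rank 0, 2 at rank 1, 1 at rank 2.
print(estimate_propensities([0, 0, 0, 0, 1, 1, 2], num_ranks=3))
```

Because relevance is spread evenly across ranks by the shuffle, any remaining difference in click counts between ranks can be attributed to position alone.
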
  37. 52 Estimating Position Bias: Intervention Harvesting

    • Main idea: In real-world production systems many (randomized) interventions take place, such as A/B tests. Can we use these interventions instead?
    • This approach is called intervention harvesting (Agarwal et al. (2017); Fang et al. (2019); Agarwal et al. (2019b))
    Image source: https://notesonai.com/Counterfactual+Evaluation+and+LTR
  38. 53 Estimating Position Bias: High-level • We only need propensities

    proportional to the true observation probability for learning.
  39. 54 Position Bias @ Mercari

    Position Bias on Mercari:
    • Rows positioned higher typically get greater engagement
    • After a few scrolls, users tend to engage more with the central positions than with the right and left positions.
    If biased data is consumed by a machine learning model used in production, it may create a feedback loop that reinforces the initial bias.
  40. 55 Model De-biasing: In Practice

    How IPW (inverse propensity weighting) is used to update the model
    • In tensorflow_ranking, you pass document weights as a feature and it will magically handle computations for you.
    • However, we apply weights on labels instead of the loss. Thus,
      • the higher the weight -> the larger the update -> example is prioritized more
      • the lower the weight -> the smaller the update -> example is prioritized less
    Where weight is (formula shown as an image on the slide)
    Image source: https://github.com/tensorflow/ranking/blob/v0.5.2/tensorflow_ranking/python/losses_impl.py#L991-L1002
  41. 56 Model De-biasing: Alternative Approach

    Using the estimated biases for each position, we can de-bias our training data and therefore our machine learning model.
    • Instead of weighting sample losses, we apply the weights to the labels
    • We suspect that we over-penalized our model

    Offline Purchase nDCG:
    Method       Full Data   Tail Queries
    Baseline     48.84       52.45
    IPW-Loss     48.74       52.66
    IPW-Labels   49.10       52.92
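
The difference between weighting losses and weighting labels can be sketched with a listwise softmax cross-entropy in pure Python. This is a hedged illustration, not the team's code or the tensorflow_ranking implementation: the clipping threshold `max_weight`, the choice of softmax loss, and the way a single list-level weight is derived in `ipw_list_loss` are all assumptions.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_ce(labels, scores):
    """Listwise softmax cross-entropy: -sum_i y_i * log softmax(s)_i."""
    return -sum(y * math.log(p) for y, p in zip(labels, softmax(scores)))


def ipw_weights(positions, propensity, max_weight=10.0):
    """Inverse propensity weight per item, clipped to limit variance (assumed cap)."""
    return [min(1.0 / propensity[k], max_weight) for k in positions]


def ipw_label_loss(labels, scores, weights):
    """Weights on labels: each engaged item's label is scaled by its own weight."""
    return softmax_ce([w * y for w, y in zip(weights, labels)], scores)


def ipw_list_loss(labels, scores, weights):
    """Weights on the loss: one weight per SERP (label-weighted mean, an assumption)."""
    label_sum = sum(labels)
    list_weight = sum(w * y for w, y in zip(weights, labels)) / label_sum if label_sum else 0.0
    return list_weight * softmax_ce(labels, scores)


propensity = [1.0, 0.5, 0.25]            # hypothetical observation propensity per rank
weights = ipw_weights([0, 1, 2], propensity)
labels, scores = [1, 0, 1], [0.2, 1.0, 0.1]
print(ipw_label_loss(labels, scores, weights), ipw_list_loss(labels, scores, weights))
```

With a single engaged item per SERP the two variants coincide in this sketch; they diverge once several items on the same SERP are engaged at positions with different propensities, because label weighting changes the relative emphasis between items while list weighting only rescales the whole SERP.
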
  44. 60 The ROI of AI

    Chart: GMV¹ (Billion JPY) and MAU² (Million users), annotated with YoY +9% and YoY +10%
    1. GMV: aggregate transaction value after adjusting for cancellations; aggregates C2C and B2C figures
    2. MAU: quarterly average number of users who browsed our service (app or web) at least once during a given month
  45. 61 Takeaways & What’s Next?

    • Foundation to get to and stay in production (Berlin Buzzwords 2023)
    • Systems for tight eval & data enrichment feedback loops (Berlin Buzzwords 2024; this talk)
    • Meta-foundation & system to unify AI applications across Mercari (COMING SOON!)
  46. 62 Acknowledgements Kaiyi Liu Kentaro Takiguchi Yui Takeuchi Takuma Kinoshita

    Asir Saeed Daniele Hohol Antoine Lecubin Ryan Ginstrom Pathompong Yupensuk Tomohiro Furusawa