August 20, 2021

Mercari CS/CRE Tech Talk: Content Moderation Using Machine Learning

sugan

Transcript

  1. Suganprabu (@sugan)
    Tech Lead, TnS Team / CRE Division. Joined in 2018/10.
  2. Agenda
    01 Content Moderation in Mercari
    02 Moderation of items
    03 ML based moderation
    04 Architecture
  3. What is content moderation
    • Trust and Safety (TnS): Provide a safe marketplace for customers and gain their trust
    • Customer Reliability Engineering (CRE): Reduce customer anxiety about the marketplace
    • Content Moderation: Moderate items to detect items violating our policies
  4. Why do we need content moderation
    • More than 2B items listed
    • More than 10M MAU
    • Proactively detecting violating items using technology is necessary
  5. Overview of content moderation (diagram; labels only)
    1. Before Listing: New Listing, Edit; Client-side rule based; Rule based warning / stop
    2. After Listing: Rule based system, ML system (Rule based + ML Model); Automatic deletion; Customer Support
  6. Rule-based moderation
    • Advantages of rule-based moderation
      ◦ Easy to add/modify rules
      ◦ Reproducible performance
      ◦ High precision
    • Simplicity of rule-based system
      ◦ これはガンです ("This is a gun") -> search using regexp “ガン” ("gun")
      ◦ Overpriced items -> compare item price with item category
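
The two rules on this slide can be sketched as follows. This is a minimal illustration, not Mercari's implementation: the `Item` type, the rule names, and the per-category price ceilings are all hypothetical, and only the regexp “ガン” comes from the slide.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Keyword rule from the slide: search the description for "ガン" (gun).
GUN_PATTERN = re.compile("ガン")

# Hypothetical per-category price ceilings in yen; the values are made up.
CATEGORY_PRICE_CEILING = {"books": 20_000, "toys": 50_000}


@dataclass
class Item:
    description: str
    category: str
    price: int


def violated_rules(item: Item) -> list[str]:
    """Return the names of the (hypothetical) rules that the item triggers."""
    hits = []
    if GUN_PATTERN.search(item.description):
        hits.append("weapon_keyword")
    ceiling = CATEGORY_PRICE_CEILING.get(item.category)
    # Categories without a configured ceiling are never flagged as overpriced.
    if ceiling is not None and item.price > ceiling:
        hits.append("overpriced")
    return hits
```

Adding or changing a rule here means editing one pattern or one table entry, which is the "easy to add/modify" property the slide lists.
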
  7. ML based moderation
    • Advantages of ML-based moderation
      ◦ Can detect novel violations
      ◦ Can detect violations from images of items
    • A bad ML system is worse than a rule-based system
      ◦ To add value, the ML system should detect items that the rule-based system did not detect
        ▪ Should be worth the cost
    • Data ETL is the most important part of violating-item detection
      ◦ System for easy annotation
      ◦ Massive labeled data
      ◦ Feedback loop
  8. Violation topics and models
    • 20+ ML models for various topics (weapons, drugs, counterfeit goods, etc.)
      ◦ Easy to re-train and deploy independently
      ◦ One-vs-all models handle class imbalance better
      ◦ Abstracted code and training pipelines make maintenance easy
    • Millions of requests handled daily
    • Combination of tree-based and deep learning models
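
The one-model-per-topic layout can be pictured as a registry of independent binary scorers. The sketch below is an assumption-laden illustration: the keyword scorers are hypothetical stand-ins for trained models, and the topic names, threshold, and function names are invented for the example.

```python
from typing import Callable, Dict, List

# Each topic has its own binary scorer: text -> estimated probability of violation.
Scorer = Callable[[str], float]


def keyword_scorer(keyword: str) -> Scorer:
    """Hypothetical stand-in for a trained one-vs-all model."""
    return lambda text: 0.9 if keyword in text.lower() else 0.1


# One entry per topic; each entry can be replaced without touching the others.
MODELS: Dict[str, Scorer] = {
    "weapons": keyword_scorer("gun"),
    "drugs": keyword_scorer("pills"),
}


def flagged_topics(text: str, threshold: float = 0.5) -> List[str]:
    """Run every topic model on the text and return topics at or above the threshold."""
    return [topic for topic, score in MODELS.items() if score(text) >= threshold]
```

Re-training a single topic then amounts to swapping one registry entry, for example `MODELS["drugs"] = new_scorer`, while the other topics keep serving unchanged.
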
  9. Model training
    • PyTorch for deep learning and LightGBM (LGBM) + Optuna for tree-based models
    • Kubeflow for model training and offline evaluation
      ◦ Kubeflow pipeline created using components for dataset fetching, training and offline evaluation
      ◦ Same pipeline for experiments and production model training
    • Model evaluation metric: Precision@K
      ◦ K = bound on the number of alerts from the model
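
Precision@K, as named on this slide, is the fraction of true violations among the K highest-scored items. The function below is a minimal sketch of the standard definition; the function name and the convention for inputs shorter than K are assumptions, not taken from the talk.

```python
from typing import Sequence


def precision_at_k(scores: Sequence[float], labels: Sequence[int], k: int) -> float:
    """Fraction of positives (label 1) among the k highest-scored items.

    If fewer than k items are given, the fraction is taken over the items
    available (an assumption; dividing by k is the other common convention).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    # Rank items by model score, highest first, and keep the top k.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(label for _, label in top) / len(top)


# Top 3 by score are 0.9 (violation), 0.8 (clean), 0.7 (violation): 2 of 3.
example = precision_at_k([0.9, 0.8, 0.3, 0.7, 0.1], [1, 0, 1, 1, 0], k=3)
```

Setting K to the number of alerts the review team can handle means the metric measures quality on exactly the items a moderator would see.
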
  10. Architecture of Content Moderation System (diagram; labels only)
    • Components: Seller, Gateway, Listing service, GCP Pub/Sub, Rule-based system, ML system, Worker, Mod Tool
    • Edge labels: List item, Report item, Publish events, Publish, Subscribe, RPC
  11. Architecture of ML system (diagram; labels only)
    • Serving: Message queue (input), Proxy, Inference for topic A (Treelite) ... Inference for topic N (Onnxruntime), Message queue (output)
    • Training: Circle CI, Kubeflow, Model Training, GCS
    • Edge labels: Subscribe, Publish, Submit training job
  12. Backlog queue for error handling
    • Issue: Prevent downtime of one model from affecting others
    • Approach: Use a backlog queue to hold messages that couldn’t be acknowledged
    • For more details, refer to the tech blog
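
One way to picture the backlog idea is with in-memory queues. The sketch below is an assumption, not the system described in the tech blog: the `Dispatcher` class, its method names, and the per-topic handlers are hypothetical, and a real deployment would use a message queue such as Pub/Sub instead of a `deque`.

```python
from collections import deque
from typing import Callable, Deque, Dict, Tuple

# A handler runs inference for one topic and raises if that model is unavailable.
Handler = Callable[[str], None]


class Dispatcher:
    """Hypothetical fan-out of one item to every topic model, with a backlog."""

    def __init__(self, handlers: Dict[str, Handler]) -> None:
        self.handlers = handlers
        self.backlog: Deque[Tuple[str, str]] = deque()

    def dispatch(self, item_id: str) -> None:
        for topic, handler in self.handlers.items():
            try:
                handler(item_id)
            except Exception:
                # Park only the failed (topic, item) pair so the other
                # topics are not blocked by one model being down.
                self.backlog.append((topic, item_id))

    def retry_backlog(self) -> None:
        # Try each parked message once; failures go back to the end of the backlog.
        for _ in range(len(self.backlog)):
            topic, item_id = self.backlog.popleft()
            try:
                self.handlers[topic](item_id)
            except Exception:
                self.backlog.append((topic, item_id))
```

With this shape, a model outage only grows the backlog for that topic, and calling `retry_backlog()` after recovery drains it without re-running the healthy models.
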
  13. Thank you! Feel free to follow me.
    Suganprabu Nagarajan: linkedin.com/in/suganprabu-n-a98731b6/