Mercari CS/CRE Tech Talk: Content Moderation Using Machine Learning

1 Content Moderation using Machine Learning @suganprabu

2 Suganprabu (@sugan)
• Tech Lead, TnS Team / CRE Division
• Joined in 2018/10

3 Agenda
01 Content Moderation in Mercari
02 Moderation of items
03 ML based moderation
04 Architecture

4 Content Moderation in Mercari

5 What is content moderation
• Trust and Safety (TnS): Provide a safe marketplace for customers and gain their trust
• Customer Reliability Engineering (CRE): Reduce customer anxiety about the marketplace
• Content Moderation: Moderate items to detect items violating our policies

6 Why do we need content moderation
• More than 2B items listed
• More than 10M MAU
• Proactively detecting violation items using technology is necessary

7 Moderation of items

8 Overview of content moderation
[Diagram: moderation flow for a New Listing or Edit]
1. Before Listing: Client-side rule based; Rule based warning / stop
2. After Listing: Rule based system and ML system (Rule based + ML Model); Automatic deletion; Customer Support

9 Rule-based moderation
• Advantages of rule-based moderation
◦ Easy to add/modify rules
◦ Reproducible performance
◦ High precision
• Simplicity of rule-based system
◦ "これはガンです" ("This is a gun") -> search using regexp “ガン”
◦ Overpriced items -> compare item price with item category
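The two rules on this slide can be sketched as below. This is a minimal illustration, not Mercari's implementation: the `Item` type, the `violated_rules` function, the category names, and the price ceilings are all hypothetical. Only the regexp “ガン” and the idea of comparing price against category come from the slide.

```python
import re
from dataclasses import dataclass

# Keyword rule from the slide: search the listing text for "ガン" (gun).
KEYWORD_PATTERN = re.compile("ガン")

# Hypothetical per-category price ceilings; the talk does not give real thresholds.
MAX_PRICE_BY_CATEGORY = {"books": 10_000, "toys": 30_000}


@dataclass
class Item:
    description: str
    category: str
    price: int


def violated_rules(item: Item) -> list[str]:
    """Return the names of the rules this item trips (empty list if none)."""
    hits = []
    if KEYWORD_PATTERN.search(item.description):
        hits.append("keyword:ガン")
    ceiling = MAX_PRICE_BY_CATEGORY.get(item.category)
    if ceiling is not None and item.price > ceiling:
        hits.append("overpriced")
    return hits
```

Adding or modifying a rule here means editing one pattern or one table entry, and the same input always produces the same result, which matches the advantages listed on the slide.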

10 ML based moderation
• Advantages of ML-based moderation
◦ Can detect novel violations
◦ Can detect violations from images of items
• Bad ML system is worse than rule based system
◦ ML system should detect items that were not detected by rule-based system to add value
▪ Should be worth the cost
• Data ETL is the most important for violated item detection
◦ System for easy annotation
◦ Massive labeled data
◦ Feedback loop

11 Feedback loop

12 ML based moderation

13 Violation topics and models
• 20+ ML models for various topics (weapons, drugs, counterfeit goods etc.)
◦ Easy to re-train and deploy independently
◦ One-vs-all models handle class imbalance better
◦ Abstracted code and training pipelines make maintenance easy
• Millions of requests handled daily
• Combination of tree-based and deep learning models
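The one-vs-all setup mentioned above can be illustrated by how training labels might be derived: each topic gets its own binary label set, so each model can be trained and re-trained independently. The function name and the sample data are hypothetical; the talk does not describe how labels are actually built.

```python
def one_vs_all_labels(item_topics, topics):
    """Build one binary label list per topic.

    item_topics holds each item's violation topic, or None for a
    non-violating item (an assumed encoding for this sketch).
    """
    return {
        topic: [1 if item_topic == topic else 0 for item_topic in item_topics]
        for topic in topics
    }


# Hypothetical sample: four items, two of them violating.
labels = one_vs_all_labels(["weapons", None, "drugs", None], ["weapons", "drugs"])
```

Each entry of `labels` would feed a separate binary model, for example one tree-based model for one topic and one deep learning model for another.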

14 Model training
• PyTorch for deep learning and LGBM + Optuna for tree-based models
• Kubeflow for model training and offline evaluation
◦ Kubeflow pipeline created using components for dataset fetching, training and offline evaluation
◦ Same pipeline for experiments and production model training
• Model evaluation metric: Precision@K
◦ K = bound on the number of alerts from model
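Precision@K, as described above, is the fraction of true violations among the K highest-scored items, where K caps how many alerts the model may raise. The sketch below uses the standard definition; the function name and the sample scores are illustrative and not taken from the talk.

```python
def precision_at_k(scores, labels, k):
    """Fraction of true violations among the k highest-scored items.

    scores: model scores, higher means more likely a violation.
    labels: 1 for a true violation, 0 otherwise, aligned with scores.
    If fewer than k items exist, the fraction is taken over what is available.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(label for _, label in top) / len(top)


# With an alert budget of 2, the top items score 0.9 (violation) and 0.8 (clean).
example = precision_at_k([0.9, 0.2, 0.8, 0.4], [1, 0, 0, 1], k=2)  # 0.5
```

Because K is tied to the alert budget, the metric reflects how much of the reviewers' limited capacity is spent on real violations.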

15 Architecture

16 Architecture of Content Moderation System
[Diagram] Components: Seller, Gateway, Listing service, GCP Pub/Sub, Rule-based system, ML system, Worker, Mod Tool. Edge labels: List item, Report item, Publish events, Publish, Subscribe, RPC.

17 Architecture of ML system
[Diagram] Serving: Message queue, Proxy, Inference for topic A (Treelite) ... Inference for topic N (Onnxruntime), Message queue. Edge labels: Subscribe, Publish. Training: Circle CI, Submit training job, Kubeflow, Model Training, GCS.

18 Backlog queue for error handling
• Issue: Prevent downtime of one model from affecting others
• Approach: Use a backlog queue to hold messages that couldn’t be acknowledged
• For more details, refer to the tech blog
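The backlog idea can be sketched with in-memory queues. This is an assumption-laden illustration, not the production design, which the slide defers to the tech blog: `ModelUnavailable`, `dispatch`, and `drain_backlog` are hypothetical names, and a plain `deque` stands in for the real message queue.

```python
from collections import deque


class ModelUnavailable(Exception):
    """Raised by a topic handler when its model cannot serve a request."""


def dispatch(messages, handlers, backlog):
    """Send each message to every topic handler; park failures in the backlog.

    A failing topic does not stop the other topics from processing the message.
    """
    results = []
    for message in messages:
        for topic, handler in handlers.items():
            try:
                results.append((topic, message, handler(message)))
            except ModelUnavailable:
                backlog.append((topic, message))
    return results


def drain_backlog(backlog, handlers):
    """Retry each parked message once; re-park it if the model is still down."""
    results = []
    for _ in range(len(backlog)):
        topic, message = backlog.popleft()
        try:
            results.append((topic, message, handlers[topic](message)))
        except ModelUnavailable:
            backlog.append((topic, message))
    return results
```

Calling `dispatch` while one topic's handler raises `ModelUnavailable` still returns results for the healthy topics, and a later `drain_backlog` call picks up the parked messages once the model is back.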

19 Thank you! Feel free to follow me
Suganprabu Nagarajan: linkedin.com/in/suganprabu-n-a98731b6/
