
Multilingual models for toxicity detection at Bumble


Words are powerful. They can be used to lift people up, but sometimes they are intended to cause harm. At Bumble Inc., the parent company that operates Bumble, Badoo and Fruitz, we believe it's unacceptable to be abusive in any form, online or offline. That's why we've worked hard on a fully multilingual, machine learning-based toxicity detector within our platform, designed to protect our community from harmful messaging. During the talk we will cover some of the technical aspects and challenges of the year-long project that allowed us to build a reliable and scalable engine (that also powers the Rude Message Detector!), together with some interesting learnings from opening up the deep learning black box we trained.


Massimo Belloni

March 11, 2022

  1. Multilingual models for toxicity detection, 10th March 2022, MIT Analytics Spring Speaker Series
  2. None
  3. Contents: Intro and context • Transformers and XLM-R • Architecture and ML Project • Bonus: Rude Message Detector • Deep Dive #1: Deep Learning • Deep Dive #2: MLOps • Conclusion and next steps
  4. Massimo Belloni, Machine Learning Engineer @Bumble. MLOps, NLP and miscellaneous. Interests in philosophy of science, consciousness, Strong vs Weak AI ◦ Coding consciousness ◦ Interpretability and trust. MSc Computer Science and Engineering (Politecnico di Milano)
  5. We are a global business. As we all know, words are powerful. They can be used to lift people up, but sometimes they are intended to cause harm. We needed to design a system that was fully and natively multilingual, with comparable performance in all of the supported languages and without necessarily knowing the language of each message beforehand. Our scale requires scalable deployments, ready to deal with hundreds of millions of requests per day! 150 countries • +150M messages/day
  6. (Simplified) Context, Integrity Services (ML) • Message: users exchange messages (or other forms of text) on the platform • Proactive Monitoring: our internal integrity systems monitor this flow for toxic behavior or harmful content • Warning/Block: automatic sanctions (for severe and obvious cases) or a manual moderation flow are imposed on suspicious cases
  7. Build: leverage internal talent and datasets to design and deploy a solution for a problem • Full control over the solution • Expensive at first • Possibly leveraging a unique internal dataset • Decent risk of failure. Buy: integrate off-the-shelf 3rd party solutions for solving specific business needs • Appealing prices at smaller scales • No business risk, completely outsourcing the problem* • Solution isn't tailored for the use-case • Continuous improvements at fixed cost
  8. Read the paper

  9. Learn more Diagrams from https://jalammar.github.io/ Encoder only!

  10. Learn more • Diagrams from https://jalammar.github.io/ • Self-Attention: Transformer embeddings are context-aware! Self-attention is the mechanism by which we look at other positions in the input sequence for clues that can help lead to a better encoding for the word (token) at hand.
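The mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product self-attention, not Bumble's code; the matrix names (`Wq`, `Wk`, `Wv`) and sizes are illustrative assumptions.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token embeddings.

    X:  (N, d_model) token embeddings, one row per position
    Wq, Wk, Wv: (d_model, d_head) learned projection matrices
    Returns (N, d_head) context-aware embeddings: each output row is a
    weighted mix of *all* positions, the weights coming from query-key scores.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # (N, N) token-to-token scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the sequence
    return weights @ V

rng = np.random.default_rng(0)
N, d_model, d_head = 5, 16, 8
X = rng.normal(size=(N, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                   # shape (5, 8)
```

Because every output row depends on the whole sequence, the resulting embeddings are context-aware, unlike the input lookup embeddings.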
  11. Diagrams from https://jalammar.github.io/ • Multi-headed Attention: the results of each attention head are concatenated and multiplied by a weight matrix learned along the way
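The concatenate-then-project step on this slide can be sketched as follows; again a toy NumPy illustration with assumed dimensions, not the real model.

```python
import numpy as np

def multi_head_attention(X, heads, Wo):
    """heads: list of (Wq, Wk, Wv) triples, one per attention head.
    Wo: (n_heads * d_head, d_model) output weight matrix learned in training.
    Each head's output is concatenated and multiplied by Wo, as in the
    Transformer, projecting back to the model dimension."""
    per_head = []
    for Wq, Wk, Wv in heads:
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        s = Q @ K.T / np.sqrt(K.shape[-1])
        w = np.exp(s - s.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        per_head.append(w @ V)
    concat = np.concatenate(per_head, axis=-1)        # (N, n_heads * d_head)
    return concat @ Wo                                # back to (N, d_model)

rng = np.random.default_rng(1)
N, d_model, d_head, n_heads = 4, 12, 3, 4
X = rng.normal(size=(N, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
Wo = rng.normal(size=(n_heads * d_head, d_model))
out = multi_head_attention(X, heads, Wo)              # shape (4, 12)
```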
  12. Wrapping up: XLM-RoBERTa. Two different versions: xlm-roberta-base (270M parameters) and xlm-roberta-large (550M). The base version has 768-dimensional embeddings and 12 hidden layers with 12 attention heads each; the large one has 24 layers, 16 heads and 1024-dimensional embeddings. CC-100 and MLM training: a multilingual corpus (100 languages!) is crawled from the web, and the model is trained to predict a masked token in the original sentence given as input. 1.5 million updates on five hundred 32GB Nvidia V100 GPUs with a batch size of 8192!
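The masked-token objective mentioned here can be illustrated with a simplified sketch: randomly replace ~15% of tokens with a mask symbol and keep the originals as prediction targets. (The real RoBERTa recipe also sometimes substitutes a random token or keeps the original; that refinement is omitted here.)

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, mask_prob=0.15, seed=1):
    """Simplified MLM input preparation: replace a random subset of tokens
    with <mask>; the model is then trained to recover the originals."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok          # position -> original token to predict
        else:
            masked.append(tok)
    return masked, targets

sent = "words are powerful they can lift people up".split()
masked, targets = mask_tokens(sent)
```

Training then minimises the cross-entropy of the model's prediction at each masked position against the token stored in `targets`.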
  13. [Diagram: CC-100 (100 languages) → XLM-R (base); Bumble Messages (16 languages) → XLM-R]
  14. ML Project management. Phase 1, Sketching: • Understand the problem and data flows • Read literature, investigate architectures • Mock up the training pipeline with open source data • Draw a tentative baseline • Set up a clear and reproducible process for data labelling, if needed! Phase 2, Validation and scaling: • Invest time in a complete and actionable validation process, crucial when data scales! • Make sure that the training scripts and the general code base are high quality and easy to act on • Document the experiments' lifecycle. Phase 3, Deployment: • Deploy to production as soon as possible! • Understand requirements for continuous deployment and future releases
  15. Some learnings. Data labelling: in real-world NLP problems it is the most complex and time-consuming part! • Define simple and replicable guidelines: if humans cannot agree on a concept, a model never will! • Account for long waiting times, and manage the project accordingly • Label more granularly and possibly merge labels back later. Engineering first: deploy a toy model to production as soon as possible to spot problems, and be mindful of production load! Computing power has to be forecasted and budgeted even in a cloud setting. Incremental releases: don't aim for a perfect model at the first release. Incremental releases with improved performance (or newly supported languages) are the best way to prove value sooner and keep stakeholders excited.
  16. Bonus: Rude Message Detector. Our ML engine was successfully deployed as a user-facing feature on Badoo in August 2021. When a member receives a message that could be harmful or hurtful to the reader, we're able to check in with them in real time through a pop-up message. We give them the control to dismiss it if they're comfortable with the language used. If they aren't, we encourage the receiver to report the conversation directly.
  17. Embeddings space analysis. Are similar sentences in different languages represented close to each other? Are multilingual embeddings a good proxy for language detection?
  18. Hidden layers' embeddings. Each token in the input sentence is first mapped to a context-unaware 768-dimensional embedding (one per token). Each embedding then goes through the hidden layers (H=12), and from each layer we can retrieve the Nx768 context-aware embeddings (H x N x 768 overall), resulting from the self-attention mechanisms.
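Given the H x N x 768 tensor of per-layer token embeddings described here, a common way to get a single sentence vector is to mean-pool one layer's token embeddings; cosine similarity then compares sentences. The sketch below uses mock data with the slide's shapes; in practice the hidden states would come from the model itself (e.g. Hugging Face Transformers' `output_hidden_states=True`), which is an assumption about tooling, not something stated on the slide.

```python
import numpy as np

def sentence_embedding(hidden_states, layer=-1):
    """hidden_states: (H, N, 768) per-layer token embeddings.
    Mean-pool one layer's N token vectors into a single sentence vector."""
    return hidden_states[layer].mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Mock hidden states with the shapes from the slide (H=12, 768-dim).
rng = np.random.default_rng(2)
H, N, D = 12, 7, 768
hs = rng.normal(size=(H, N, D))
vec = sentence_embedding(hs, layer=3)   # shape (768,)
sim = cosine(vec, vec)                  # a vector is maximally similar to itself
```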
  19. We leveraged some general purpose, open-source multilingual datasets available through Hugging Face, an NLP library currently used by some of the biggest AI organisations in the world. All the pairs from the two datasets are built with English as the source language and the other language as the target.
  20. Experiment results. Baseline: ~16%. The results show pretty good evidence of cross-lingual semantic embeddings in all the model's layers, with performance decreasing closer to the final classification head.
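One simple probe for the language-detection question raised earlier is a nearest-centroid classifier over sentence embeddings: average the embeddings per language, then assign each sentence to the closest centroid by cosine similarity. This sketch uses synthetic, well-separated mock vectors; it is a stand-in for the probing idea, not the experiment actually run at Bumble.

```python
import numpy as np

def centroid_classifier(train_vecs, train_langs):
    """Build per-language centroids from labelled sentence embeddings and
    return a predict(v) function choosing the nearest centroid by cosine."""
    langs = sorted(set(train_langs))
    cents = {l: np.mean([v for v, g in zip(train_vecs, train_langs) if g == l],
                        axis=0)
             for l in langs}
    def predict(v):
        return max(cents, key=lambda l: float(v @ cents[l]) /
                   (np.linalg.norm(v) * np.linalg.norm(cents[l])))
    return predict

# Mock embeddings: two well-separated "language" clusters.
rng = np.random.default_rng(3)
en = rng.normal(loc=+1.0, size=(20, 16))
it = rng.normal(loc=-1.0, size=(20, 16))
vecs = np.vstack([en, it])
labels = ["en"] * 20 + ["it"] * 20
predict = centroid_classifier(list(vecs), labels)
acc = np.mean([predict(v) == l for v, l in zip(vecs, labels)])
```

On real multilingual embeddings, the accuracy of such a probe, layer by layer, indicates how much language identity is still linearly recoverable at each depth.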
  21. Infrastructure and MLOps. How do we manage to serve thousands of requests per second?
  22. Our solution: we have an internal suite of services for deploying and monitoring Tensorflow-based deep learning models, guaranteeing both high performance and full observability. • Tokenizer implemented in the TF graph • Several GPU nodes in two different zones • Shared logging system for centralised monitoring and real-time anomaly detection
  23. Next Steps. Machine learning is the only viable solution for complex decisions at our scale. • Add more labels: get rid of some legacy systems with ML, increasing the scope of our NLP models. • Contextual-aware sanctions: beyond message-level decisions; improve the way we take automated decisions on top of message-based triggers, working on users' behavior and conversations' context. • Continuous improvement: automatic periodic retraining; society evolves, and so does the way in which people are abusive on online platforms. • Decrease latency: a faster model means more real-time use cases; making the model smaller and inference faster allows even more real-time use cases and more efficient use of computing resources.
  24. Thank you! team.bumble.com