
#30 Back to the chatbot - for real

It's 2018 and the chatbot craze is finally behind us. We can finally get back to it... seriously this time. Beyond the hype, natural-language interaction (spoken or written) will remain a natural, human way to interact. But how do we go beyond the if/else stack of a basic bot? How do we make a chatbot useful? How do we move toward the famous interactive personal agent promised by every sci-fi novel?

This talk explores the current state of the art, covering not only how to build a bot but also how to manage the underlying ML models and evaluate them against a realistic use case. In particular, we present ParlAI, the open-source project from Facebook that tackles the problem of measuring chatbot performance.

Bio:
- Gérard DUPONT - senior data scientist - AIRBUS research
- Alexandre ARNOLD - senior data scientist - AIRBUS research

Toulouse Data Science

May 15, 2018
Transcript

  1. https://code.garron.us/css/BTTF_logo/ by Alexandre Arnold & Gerard Dupont

  2. Who we are? Gérard Dupont gerard.dupont@airbus.com - more than 10 years on research projects with NLP, IR/search, ML, RL and massive data processing. Alexandre Arnold alexandre.arnold@airbus.com - virtual assistant experience w.r.t. architecture, NLP, S2T & planning/reasoning capabilities via RL. Central Research & Technology
  3. Chatbot?

  4. Chatbot?

  5. Chatbot? https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html

  6. Where are the bots? Chatbots use-cases 1. Customer services/support 2.

    Automating simple transactions 3. User engagement/branding 4. Social media activity 5. User retention 6. Personal assistant
  7. From https://blog.keyreply.com/the-chatbot-landscape-2017-edition-ff2e3d2a0bdb

  8. Chatbot implementation, high-level overview: 1. User speaks 2. ASR transforms speech into text 3. Text is analyzed 4. Discourse state is updated 5. Response is selected 6. Response is formulated as text 7. Text is rendered as speech. From “A Survey of Available Corpora for Building Data-Driven Dialogue Systems” by Iulian Vlad Serban et al.
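The seven steps above can be sketched as a minimal pipeline. Every component here is a hypothetical stub standing in for a real engine (ASR, NLU, dialogue manager, NLG, TTS); only the shape of the loop comes from the slide:

```python
def respond(audio):
    """Toy end-to-end pipeline mirroring the seven steps above.
    All components below are stub placeholders, not real ASR/NLU/TTS engines."""
    text = asr(audio)                  # 2. speech -> text
    meaning = analyze(text)            # 3. text analysis (intent, entities)
    state = update_state(meaning)      # 4. discourse state tracking
    response = select_response(state)  # 5. response selection
    reply_text = realize(response)     # 6. response -> text
    return tts(reply_text)             # 7. text -> speech

# Stub components so the sketch runs end to end (all invented for illustration)
asr = lambda audio: audio.lower()
analyze = lambda text: {"intent": "greet"} if "hello" in text else {"intent": "other"}
update_state = lambda meaning: {"last_intent": meaning["intent"]}
select_response = lambda state: "greet_back" if state["last_intent"] == "greet" else "fallback"
realize = lambda resp: "Hello there!" if resp == "greet_back" else "Sorry?"
tts = lambda text: f"<audio: {text}>"
```

In a real system each stub is a model or service of its own; the point of the sketch is that the dialogue state (step 4) is the only piece that persists across turns.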
  9. Speech != Text inputs. People don’t talk like they write (sometimes for the better…). People don’t talk the same way when they see each other. People don’t talk the same way when they can draw/make signs. People don’t talk to robots the way they talk to other people (or do they?). The dialogue model tries to abstract away the modality of the conversation the agent is in (but really it can’t).
  10. Types of conversations. “Openness” of the discourse topics is key: closed domain + IR based → easy with rule sets; closed domain + ML based → possible with ML; open domain + IR based → not possible (so far); open domain + ML based → super hard (GAI?). To start simple: choose a closed domain, focus on a specific goal/task, limit to 2 people interacting... To make it hard: add chit chat, visual input, context, personalization, personality...
  11. Chatbot implementations/approaches: Discriminative model architectures - dialogue act/intent classification + state tracking. Dialogue retrieval model - re-ranking responses from “learned” conversations. Dialogue generation model - produce utterances by composing text.
  12. Chatbot implementations/approaches: Discriminative model architectures - dialogue act/intent classification + state tracking. Dialogue retrieval model - re-ranking responses from “learned” conversations. Dialogue generation model - produce utterances by composing text. What about end-to-end systems?
  13. One implementation framework: Rasa Core. Architecture: typically a neural network.
  14. One implementation framework: Rasa Core. Key elements: Domain → defines intents, entities, context slots, actions… NLU data → training corpus for intent/entity recognition. Stories (optional) → typical conversation examples to start learning. Custom actions (optional) → code, e.g. for a query or service call.
  15. One implementation framework: Rasa Core. Various ways of learning a dialogue policy: Supervised learning → based on “stories” or online. Interactive learning → correct the bot only when it is wrong. Perspective: reinforcement learning (pros & cons).
  16. Time for a demo? Connecting trained Rasa Core/NLU to a

    custom web front-end...
  17. Evaluating chatbot performances

  18. Chatbot evaluation models Computer-science and AI perspective Information Retrieval (IR)

    approach Linguistic perspective (NLP) User experience (UX) methodology
  19. Chatbot evaluation models. Computer-science and AI perspective: Turing test model - interesting theoretical approach, no clear performance metric. Information Retrieval (IR) approach. Linguistic perspective (NLP). User experience (UX) methodology.
  20. Computer-science and AI perspective Information Retrieval (IR) approach Focus on

    utility/relevance/timeliness of answers + well understood metrics Linguistic perspective (NLP) User experience (UX) methodology Chatbot evaluation models
  21. Computer-science and AI perspective Information Retrieval (IR) approach Linguistic perspective

    (NLP) Quality and coherence of the discourse against topical categorization User experience (UX) methodology Chatbot evaluation models
  22. Computer-science and AI perspective Information Retrieval (IR) approach Linguistic perspective

    (NLP) User experience (UX) methodology Human factors perspective and usability from the user point-of-view Chatbot evaluation models
  23. Chatbot evaluation metrics: BLEU, ROUGE, METEOR, embedding-based metrics. Precision, Recall (for top k; + combined metrics: accuracy, F1-score). Word perplexity (model dependent - limited use). Engagement, coherence, conversational depth.
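The top-k precision/recall and F1 mentioned above can be computed in a few lines for a response-retrieval bot. A minimal sketch (the function name and toy data are made up for illustration):

```python
def precision_recall_f1(retrieved, relevant, k):
    """Top-k precision/recall and F1 for a ranked list of candidate responses."""
    top_k = retrieved[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k                    # fraction of retrieved that are relevant
    recall = hits / len(relevant)           # fraction of relevant that were retrieved
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)   # harmonic mean of the two
    return precision, recall, f1

# Toy example: 3 responses ranked by the model, 2 of which are actually relevant
p, r, f1 = precision_recall_f1(["a", "b", "c"], ["a", "d"], k=2)
# p = 0.5 (1 of the top 2 is relevant), r = 0.5 (1 of 2 relevant found), f1 = 0.5
```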
  24. Chatbot evaluation (NLP) metrics: BLEU, ROUGE + embedding-based metrics: greedy matching, embedding average, vector extrema... See “How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation”.
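The "embedding average" metric above compares a generated response to a reference by averaging word vectors and taking the cosine similarity. A self-contained sketch with tiny made-up 2-d "embeddings" (real usage relies on pretrained vectors, e.g. word2vec):

```python
import math

def embedding_average(tokens, embeddings):
    """Average the word vectors of a sentence (the embedding-average baseline)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy 2-d "embeddings", invented for illustration
emb = {"hello": [1.0, 0.0], "hi": [0.9, 0.1], "bye": [0.0, 1.0]}
score = cosine(embedding_average(["hello"], emb), embedding_average(["hi"], emb))
# "hello" vs "hi" score close to 1; "hello" vs "bye" would score 0
```

As the cited paper argues, such metrics correlate poorly with human judgment on dialogue, which is part of the motivation for frameworks like ParlAI.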
  25. Chatbot evaluation metrics: BLEU, ROUGE, METEOR, embedding-based metrics. Precision, Recall (for top k; + combined metrics: accuracy, F1-score). Word perplexity (model dependent - limited use). Engagement, coherence, conversational depth. … and don’t forget (classic) business metrics: retention, engagement, churn, MRR, results from A/B testing...
  26. Why not go directly to real users?

  27. Why not go directly to real users?

  28. Why not go directly to real users? "Tay", after the acronym "thinking about you": a chatbot released on Twitter - by Microsoft's Technology & Research and Bing divisions - based on Xiaoice, a similar Microsoft project in China - an online-learning dialogue agent - released on 23rd March 2016.
  29. Why not go directly to real users? A chatbot released on Twitter - by Microsoft's Technology & Research and Bing divisions - based on Xiaoice, a similar Microsoft project in China - an online-learning dialogue agent - released on 23rd March 2016 - Microsoft shut down the service only 16 hours after its launch. "Tay", after the acronym "thinking about you".
  30. Why not go directly to real users? "Tay", after the acronym "thinking about you": a chatbot released on Twitter - by Microsoft's Technology & Research and Bing divisions - based on Xiaoice, a similar Microsoft project in China - an online-learning dialogue agent - released on 23rd March 2016.
  31. Why not go directly to real users? Lessons learned: - interactive user experiments are the best - but users are hard… - a well-meaning experiment can go wrong very rapidly - prepare and evaluate before launch! "Tay", after the acronym "thinking about you".
  32. Which task to evaluate your bot? • SQuAD - Stanford

    Question Answering Dataset • bAbI tasks • MS MARCO - Microsoft MAchine Reading COmprehension Dataset • WikiQA • dialog-bAbI • Movie Dialog dataset • Ubuntu dialog corpus • Textbook Question Answering (TQA) dataset And many more...
  33. And do we write all the code to evaluate? Some

    evaluation frameworks: • Alexa prize • bAbI tasks • CLEVR • CommAI-env • ParlAI
  34. And do we write all the code to evaluate? Some

    evaluation frameworks: • Alexa prize (Amazon) • bAbI tasks (Facebook) • CLEVR (Facebook) • CommAI-env (Facebook) • ParlAI (Facebook)
  35. ParlAI

  36. ParlAI. “A unified platform for sharing, training and evaluating dialog models across many tasks.” First release in May 2017. GitHub repo + good documentation + good tutorials. Regularly updated: - more than 15 agent implementations - the number of tasks has doubled since the first release.
  37. ParlAI (from documentation) It’s a python-based platform for enabling dialog

    AI research. • a unified framework for sharing, training and testing dialog models • many popular datasets available all in one place, with the ability to multi-task over them • seamless integration of Amazon Mechanical Turk for data collection and human evaluation • integration with Facebook Messenger to connect agents with humans in a chat interface
  38. Top contributors Most of them are from FAIR

  39. ParlAI tasks. Originally: 20 tasks, from toy problems to benchmarks. Today: almost 50 tasks (with possible subtasks). Various categories: • Q/A • Sentence completion • Dialogue • Visual dialogue • Chit-chat • Negotiation • Machine translation • Any task you want... Auto setup (downloads the data on demand) and standardized format.
  40. ParlAI tasks Many subtasks: already 20 bAbI tasks!

  41. Teacher: { 'text': 'Sam went to the kitchen\nPat gave Sam the milk. Where is the milk?', 'labels': ['kitchen'], 'label_candidates': ['hallway', 'kitchen', 'bathroom'], 'episode_done': False }
      Student: { 'text': 'hallway' }
      Teacher: { 'text': 'Sam went to the hallway\nPat went to the bathroom. Where is the milk?', 'labels': ['hallway'], 'label_candidates': ['hallway', 'kitchen', 'bathroom'], 'episode_done': True }
      Student: { 'text': 'hallway' }
      Teacher: { ... # starts next episode } ...
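Given that message structure, per-turn accuracy over an episode can be scored in a few lines. A minimal sketch (the function name and toy turns are invented; the field names follow the exchange above):

```python
def score_episode(turns):
    """Fraction of (teacher_msg, student_msg) pairs where the student's
    reply exactly matches one of the teacher's gold labels."""
    correct = 0
    for teacher, student in turns:
        if student['text'] in teacher.get('labels', []):
            correct += 1
    return correct / len(turns)

# Toy episode shaped like the teacher/student exchange above
turns = [
    ({'text': 'Where is the milk?', 'labels': ['kitchen'], 'episode_done': False},
     {'text': 'hallway'}),
    ({'text': 'Where is the milk?', 'labels': ['hallway'], 'episode_done': True},
     {'text': 'hallway'}),
]
accuracy = score_episode(turns)  # 0.5: one of the two answers matched a label
```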
  42. Amazon Mechanical Turk integration. ParlAI supports integration with Mechanical Turk for data collection, training, and evaluation. Cheap, easy and safe interactive evaluations!
  43. Coding time: implement & evaluate

  44. ParlAI - implementation basics: a Teacher sends observations to an Agent, which replies with actions, over the train/dev/test splits.
  45. Implementation basics. Observation & action share the same “meta” structure. Key fields: - text - id - reward - labels
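The Agent/Teacher loop from the slides above can be mimicked with a toy class. This is a self-contained sketch of the observe/act pattern and the shared message fields, not the real ParlAI Agent API (the class name and fallback reply are invented):

```python
class EchoAgent:
    """Toy agent following ParlAI's observe/act pattern (sketch only)."""
    def __init__(self):
        self.last_observation = None

    def observe(self, observation):
        # Observations and actions share the same "meta" structure:
        # keys such as 'text', 'id', 'reward', 'labels', 'episode_done'.
        self.last_observation = observation
        return observation

    def act(self):
        # A real agent would run its model here; this one just picks the
        # first candidate as a placeholder policy.
        candidates = self.last_observation.get('label_candidates', [])
        reply = candidates[0] if candidates else 'I do not know'
        return {'id': 'EchoAgent', 'text': reply}

agent = EchoAgent()
agent.observe({'text': 'Where is the milk?', 'id': 'teacher',
               'label_candidates': ['hallway', 'kitchen']})
action = agent.act()  # {'id': 'EchoAgent', 'text': 'hallway'}
```

A ParlAI teacher drives exactly this loop: it calls observe with its message, then act, and scores the returned action against its labels.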
  46. Rasa Core agent implementation. Adapting training data for dialog-bAbI (task 5): Domain → change utterance templates to better fit the original data. Stories → add different dialogue endings + a few bug adaptations/fixes. Limitations: adapted stories still do not perfectly reflect the original data, which hurts the benchmark score (e.g. no “dialog interruptions”).
  47. Code, code, code... (live examples in VS Code). Task selection: dialog-bAbI - task 5, full dialogue. Agent implementation: - implement TDS_MEGA_BOT! (crossing fingers) - Rasa Core (it’s all ready!). Run an agent: - Play: $ python examples/display_model.py -m rasa -t "dialog_babi:Task:5" -dt valid - Evaluate: $ python examples/eval_model.py -m rasa -t "dialog_babi:Task:5" -dt valid
  48. Evaluation results Not so bright… but wait! Context: - IR

    baseline: not adapted to dialog case - Seq2seq training ~10h tuned by FAIR on this task - Rasa-core training ~2min + 4h of personal work to adapt to dialog-bAbI-task5 (room for improvements)
  49. Evaluation results. Fixing some (not all) cases not covered originally by the Rasa training: 3x the performance (and we are still missing some dialogue cases). Context: - IR baseline: not adapted to the dialogue case - Seq2seq: ~10h of training, tuned by FAIR on this task - Rasa Core: ~2min of training + 6h of personal work to adapt to dialog-bAbI-task5 (room for improvement).
  50. Take-away. Chatbots are fun! Lots of room for research & tinkering - not only for the big players. We are far from getting the long-promised sci-fi personalized agent…
  51. Take-away. Chatbots are fun! Lots of room for research & tinkering - not only for the big players. We are far from getting the long-promised sci-fi personalized agent…
  52. Q/A time? Central Research & Technology Gérard Dupont gerard.dupont@airbus.com Alexandre

    Arnold alexandre.arnold@airbus.com
  53. BONUS?

  54. https://aigym.airbus.com Central Research & Technology