
#30 Back to the chatbot - for real


2018: good, the chatbot craze is finally behind us. Now we can finally... get back to it seriously. Beyond the hype, natural-language interaction (spoken or written) will remain a natural, human way to interact. But how do you move beyond the if/else stack of a basic bot? How do you make a chatbot useful? How do you progress toward the famous interactive personal agent promised by every sci-fi novel?

This talk explores the current state of the art, covering not only how to build a bot but also how to manage the underlying ML models and evaluate them against a realistic use case. In particular, we present ParlAI, the open-source project from Facebook that tackles the problem of measuring chatbot performance.

Bio:
- Gérard DUPONT - senior data scientist - AIRBUS research
- Alexandre ARNOLD - senior data scientist - AIRBUS research

Toulouse Data Science

May 15, 2018

Transcript

  1. https://code.garron.us/css/BTTF_logo/
    by Alexandre Arnold & Gerard Dupont


  2. Gérard Dupont
    [email protected]
    More than 10 years of research
    projects involving NLP, IR/search, ML, RL
    and massive data processing.
    Who are we?
    Alexandre Arnold
    [email protected]
    Virtual assistant experience w.r.t.
    architecture, NLP, S2T &
    planning/reasoning capabilities via RL.
    Central Research &
    Technology


  3. Chatbot?


  5. Chatbot?
    https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html


  6. Where are the bots?
    Chatbots use-cases
    1. Customer services/support
    2. Automating simple transactions
    3. User engagement/branding
    4. Social media activity
    5. User retention
    6. Personal assistant


  7. From https://blog.keyreply.com/the-chatbot-landscape-2017-edition-ff2e3d2a0bdb


  8. Chatbot implementation
    high-level overview
    1. User speaks
    2. ASR transforms speech into text
    3. Text is analyzed
    4. Discourse state is analyzed
    5. Response is selected
    6. Response is formulated as text
    7. Text is synthesized into speech
    From “A Survey of Available Corpora for Building Data-Driven Dialogue Systems” by Iulian Vlad Serban et al.
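The seven steps above can be sketched as one loop of stubbed-out stages. A minimal sketch in Python; every function name below (`asr`, `analyze`, `select_response`...) is a hypothetical placeholder, not a real speech or NLU API:

```python
# Hypothetical stubs for the seven pipeline stages; names are illustrative only.

def asr(audio: str) -> str:
    """2. ASR: transform speech into text (stubbed as identity)."""
    return audio

def analyze(text: str) -> dict:
    """3. Analyze the text: extract a toy intent."""
    intent = "greet" if "hello" in text.lower() else "unknown"
    return {"intent": intent, "text": text}

def update_state(state: dict, analysis: dict) -> dict:
    """4. Track the discourse state across turns."""
    state.setdefault("history", []).append(analysis)
    return state

def select_response(state: dict) -> str:
    """5. Select a response given the current state."""
    return "greet_back" if state["history"][-1]["intent"] == "greet" else "fallback"

def formulate(response: str) -> str:
    """6. Formulate the selected response as text."""
    templates = {"greet_back": "Hello there!",
                 "fallback": "Sorry, I did not get that."}
    return templates[response]

def tts(text: str) -> str:
    """7. Synthesize the text into speech (stubbed as identity)."""
    return text

def turn(state: dict, audio: str) -> str:
    """One full dialogue turn through stages 2-7 (stage 1: the user speaks)."""
    return tts(formulate(select_response(update_state(state, analyze(asr(audio))))))

print(turn({}, "Hello bot"))  # → Hello there!
```

Real systems replace each stub with a model; the point here is only the shape of the loop.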


  9. Speech != Text inputs
    People don’t talk like they write
    (sometimes for the better… )
    People don’t talk the same way when
    they see each other
    People don’t talk the same way when
    they can draw/make signs
    People don’t talk to robots the way
    they talk to other people (or do they?)
    The dialogue model tries to
    abstract the modality of the
    conversation the agent is in
    (but really it can’t).


  10. Types of conversations
    [Figure: a grid crossing domain “openness” (closed → open) with the approach (rule sets, IR-based, ML-based): closed-domain rule sets are easy, closed-domain IR/ML approaches are possible, and open domains range from super hard (general AI?) to not possible so far.]
    “Openness” of the discourse topics is key
    To start simple:
    - Choose a closed domain
    - Focus on specific goal/task
    - Limit to 2 people interacting...
    To make it hard:
    - Add chit chat
    - Add visual
    - Add context
    - Add personalization, personality...



  12. Chatbot implementations/approaches
    Discriminative Model Architectures - dialogue act/intent classification + state tracking
    Dialogue retrieval model - re-ranking responses from “learned” conversations
    Dialogue generation model - produce utterances by composing text
    What about end-to-end systems?


  13. One implementation framework: Rasa Core
    Architecture:
    typically a
    Neural Network


  14. One implementation framework: Rasa Core
    Key elements:
    Domain → define intents, entities, context slots, actions…
    NLU data → training corpus for intent/entity recognition
    Stories (optional) → typical conversation examples to start learning
    Custom actions (optional) → code e.g. for a query or service call
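For illustration only: Rasa's actual domain lives in a YAML file, and every intent/slot/action name below is made up. Still, a plain-Python picture of what the domain bundles together, and how it constrains the dialogue policy, can be sketched like this:

```python
# Made-up example of the domain ingredients as plain data (not Rasa's file format).
domain = {
    "intents": ["greet", "request_restaurant"],
    "entities": ["cuisine", "location"],
    "slots": {"cuisine": None, "location": None},  # context carried across turns
    "actions": ["utter_greet", "action_search_restaurants"],
}

def allowed(action: str) -> bool:
    """A learned dialogue policy may only pick actions declared in the domain."""
    return action in domain["actions"]

print(allowed("utter_greet"))    # → True
print(allowed("utter_goodbye"))  # → False
```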


  15. One implementation framework: Rasa Core
    Various ways of learning a dialog policy:
    Supervised learning → based on “stories” or online
    Interactive learning → correct the bot only when wrong
    Perspective: Reinforcement Learning (pros & cons)


  16. Time for a demo?
    Connecting trained Rasa Core/NLU
    to a custom web front-end...


  17. Evaluating chatbot
    performance



  19. Chatbot evaluation models
    Computer-science and AI perspective
    Turing test model - an interesting theoretical approach, but with no clear performance metric
    Information Retrieval (IR) approach
    Linguistic perspective (NLP)
    User experience (UX) methodology


  20. Computer-science and AI perspective
    Information Retrieval (IR) approach
    Focus on utility/relevance/timeliness of answers + well-understood metrics
    Linguistic perspective (NLP)
    User experience (UX) methodology
    Chatbot evaluation models


  21. Computer-science and AI perspective
    Information Retrieval (IR) approach
    Linguistic perspective (NLP)
    Quality and coherence of the discourse against topical categorization
    User experience (UX) methodology
    Chatbot evaluation models


  22. Computer-science and AI perspective
    Information Retrieval (IR) approach
    Linguistic perspective (NLP)
    User experience (UX) methodology
    Human factors perspective and usability from the user’s point of view
    Chatbot evaluation models



  24. Chatbot evaluation (NLP) metrics
    BLEU ROUGE
    + Embedding-based metrics: Greedy matching,
    Embedding average, Vector extrema...
    How NOT To Evaluate Your Dialogue System: An Empirical Study of
    Unsupervised Evaluation Metrics for Dialogue Response Generation
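A minimal sketch of the word-overlap idea behind BLEU-1: modified (clipped) unigram precision, without the brevity penalty of the full metric - an illustration, not the complete BLEU from the paper above:

```python
from collections import Counter

# Clipped unigram precision: the core word-overlap idea behind BLEU-1.
def unigram_precision(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clip each candidate word count by its count in the reference.
    overlap = sum(min(count, ref[word]) for word, count in cand.items())
    return overlap / sum(cand.values())

print(unigram_precision("the table is red", "the table is blue"))  # → 0.75
```

As the paper argues, such overlap scores correlate poorly with human judgment of dialogue responses, since many valid replies share no words with the reference.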


  25. Chatbot evaluation metrics
    BLEU, ROUGE, METEOR, embeddings based metrics
    Precision, Recall (for top k + combined metrics: accuracy, F1-score)
    Word perplexity (model-dependent - limited use)
    Engagement, coherence, conversational depth
    … and don’t forget (classic) business metrics:
    Retention, Engagement, Churn, MRR, stuff from A/B testing...
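The retrieval-style metrics above (precision/recall for top k, combined into F1) take only a few lines to compute; the relevant/retrieved sets below are made-up toy data:

```python
# Toy data: which candidate responses are actually relevant vs. what was retrieved.
def precision_recall_f1(relevant: set, retrieved: list, k: int):
    """Precision/recall over the top-k retrieved responses, plus combined F1."""
    top_k = set(retrieved[:k])
    hits = len(top_k & relevant)
    precision = hits / k
    recall = hits / len(relevant)
    f1 = 0.0 if hits == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

relevant = {"r1", "r3", "r4"}
retrieved = ["r1", "r2", "r3", "r5"]
print(precision_recall_f1(relevant, retrieved, k=4))  # 2 hits → precision 0.5, recall ≈ 0.67, F1 ≈ 0.57
```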


  26. Why not go directly to real users?



  29. Why not go directly to real users?
    Chatbot released on Twitter
    - by Microsoft's Technology & Research and
    Bing divisions
    - based on Xiaoice, a similar Microsoft project
    in China
    - an online-learning dialog agent
    - released on 23rd March 2016
    - Microsoft shut the service down
    only 16 hours after its launch
    "Tay", from the acronym for
    "thinking about you"


  31. Why not go directly to real users?
    Lessons learned:
    - Interactive user experiments are
    the best
    - But users are hard…
    - A well-meaning experiment can go
    wrong very rapidly
    - Prepare and evaluate before
    launch!
    "Tay", from the acronym for
    "thinking about you"


  32. Which task to evaluate your bot on?
    ● SQuAD - Stanford Question Answering Dataset
    ● bAbI tasks
    ● MS MARCO - Microsoft MAchine Reading COmprehension Dataset
    ● WikiQA
    ● dialog-bAbI
    ● Movie Dialog dataset
    ● Ubuntu dialog corpus
    ● Textbook Question Answering (TQA) dataset
    And many more...



  34. And do we write all the evaluation code ourselves?
    Some evaluation frameworks:
    ● Alexa prize (Amazon)
    ● bAbI tasks (Facebook)
    ● CLEVR (Facebook)
    ● CommAI-env (Facebook)
    ● ParlAI (Facebook)


  35. ParlAI


  36. ParlAI
    “A unified platform for sharing, training and
    evaluating dialog models across many tasks.”
    First release in May 2017
    GitHub repo + good documentation + good tutorials
    Regularly updated:
    - more than 15 agent implementations
    - doubled the number of tasks since first release


  37. ParlAI
    (from documentation)
    It’s a Python-based platform for enabling dialog AI research.
    ● a unified framework for sharing, training and testing dialog models
    ● many popular datasets available all in one place, with the ability to
    multi-task over them
    ● seamless integration of Amazon Mechanical Turk for data
    collection and human evaluation
    ● integration with Facebook Messenger to connect agents with
    humans in a chat interface


  38. Top contributors
    Most of them are from FAIR


  39. ParlAI tasks
    Originally: 20 tasks from toy problems to benchmarks
    Today: almost 50 tasks (with possibly subtasks)
    Various categories:
    ● Q/A
    ● Sentence completion
    ● Dialogue
    ● Visual dialogue
    ● Chit-chat
    ● Negotiation
    ● Machine translation
    ● Any task you want...
    Auto setup (download the data on demand) and standardized format


  40. ParlAI tasks
    Many subtasks: already 20 bAbI tasks!


  41. Teacher: {
    'text': 'Sam went to the kitchen\nPat gave Sam the milk. Where is the milk?',
    'labels': ['kitchen'],
    'label_candidates': ['hallway', 'kitchen', 'bathroom'],
    'episode_done': False
    }
    Student: {
    'text': 'hallway'
    }
    Teacher: {
    'text': 'Sam went to the hallway\nPat went to the bathroom. Where is the milk?',
    'labels': ['hallway'],
    'label_candidates': ['hallway', 'kitchen', 'bathroom'],
    'episode_done': True
    }
    Student: {
    'text': 'hallway'
    }
    Teacher: {
    ... # starts next episode
    }
    ...
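An exchange like the one above can be scored with a tiny exact-match loop over such message dicts. Toy data mirroring the slide, not actual ParlAI code:

```python
# Toy episode mirroring the slide: (teacher message, student reply) pairs.
episode = [
    ({"text": "Sam went to the kitchen\nPat gave Sam the milk. Where is the milk?",
      "labels": ["kitchen"], "episode_done": False},
     {"text": "hallway"}),
    ({"text": "Sam went to the hallway\nPat went to the bathroom. Where is the milk?",
      "labels": ["hallway"], "episode_done": True},
     {"text": "hallway"}),
]

def accuracy(turns) -> float:
    """Exact-match accuracy: does the student's text appear in the teacher's labels?"""
    correct = sum(student["text"] in teacher["labels"] for teacher, student in turns)
    return correct / len(turns)

print(accuracy(episode))  # → 0.5 (only the second question was answered correctly)
```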


  42. Amazon Mechanical Turk integration
    ParlAI supports integration with
    Mechanical Turk for data
    collection, training, and
    evaluation.
    Cheap, easy and safe interactive
    evaluations!


  43. Coding time: implement & evaluate


  44. ParlAI - implementation basics
    [Diagram: an Agent and a Teacher exchange messages - each one's action becomes the other's observation - and the Teacher serves the task's train/dev/test splits.]


  45. Implementation basics
    Observation & action share the
    same “meta” structure.
    Key fields:
    - text
    - id
    - reward
    - labels
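A toy agent following this observe/act pattern. ParlAI agents expose `observe()` and `act()`, but the random-candidate "policy" below is a made-up baseline for illustration, not one of ParlAI's shipped agents:

```python
import random

# Toy agent with the observe/act interface described above.
class RandomCandidateAgent:
    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)
        self.observation = None

    def observe(self, observation: dict) -> None:
        """Store the latest teacher message (text, labels, candidates...)."""
        self.observation = observation

    def act(self) -> dict:
        """Reply with a random label candidate, or a fallback text."""
        candidates = self.observation.get("label_candidates", [])
        text = self.rng.choice(candidates) if candidates else "I don't know"
        return {"id": "RandomCandidateAgent", "text": text}

agent = RandomCandidateAgent()
agent.observe({"text": "Where is the milk?",
               "label_candidates": ["hallway", "kitchen", "bathroom"]})
print(agent.act()["text"])  # one of the three candidates
```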


  46. Rasa Core agent implementation
    Adapt training data for dialog-bAbI (task 5):
    Domain → change utterance templates to better fit original data
    Stories → add different dialog endings + a few bug adaptations/fixes
    Limitations: adapted stories still do not perfectly reflect the original
    data, which hurts benchmark score (e.g. no “dialog interruptions”)


  47. Code, code, code...
    (live examples in VS Code)
    Task selection: Dialog-bAbI - task 5, full dialogue
    Agent implementation:
    - Implement TDS_MEGA_BOT! (crossing fingers)
    - Rasa Core (it’s all ready!)
    Run an agent:
    - Play: $ python examples/display_model.py -m rasa -t "dialog_babi:Task:5" -dt valid
    - Evaluate: $ python examples/eval_model.py -m rasa -t "dialog_babi:Task:5" -dt valid


  48. Evaluation results
    Not so bright… but wait!
    Context:
    - IR baseline: not adapted to
    dialog case
    - Seq2seq training ~10h
    tuned by FAIR on this task
    - Rasa-core training ~2min +
    4h of personal work to
    adapt to dialog-bAbI-task5
    (room for improvements)


  49. Evaluation results
    Fixing some (but not all) cases not originally covered by the Rasa training: 3x performance
    (and we are still missing some dialog cases)
    Context:
    - IR baseline: not adapted to
    dialog case
    - Seq2seq training ~10h
    tuned by FAIR on this task
    - Rasa-core training ~2min +
    6h of personal work to
    adapt to dialog-bAbI-task5
    (room for improvements)


  50. Take-away
    Chatbots are fun!
    Lots of room for research &
    tinkering - not only for the big
    players.
    We are far from the long-promised
    sci-fi personal agent…


  52. Q/A time?
    Central Research &
    Technology
    Gérard Dupont
    [email protected]
    Alexandre Arnold
    [email protected]


  53. BONUS?


  54. https://aigym.airbus.com
    Central Research &
    Technology
