
#30 Back to the chatbot - for real

It's 2018 and the chatbot craze is finally behind us. We can finally get back to it... seriously this time. Beyond the hype, natural-language interaction (spoken or written) will remain a natural, human way to interact. But how do we go beyond the if/else stack of a basic bot? How do we make a chatbot useful? How do we move toward the famous interactive personal agent promised by every sci-fi novel?

This talk explores the current state of the art, covering not only how to build a bot but also how to manage the underlying ML models and evaluate them against a realistic use case. In particular, we present ParlAI, the open-source project from Facebook that tackles the problem of measuring chatbot performance.

Bio:
- Gérard DUPONT - senior data scientist - AIRBUS research
- Alexandre ARNOLD - senior data scientist - AIRBUS research

Toulouse Data Science

May 15, 2018
Transcript

  1. https://code.garron.us/css/BTTF_logo/ by Alexandre Arnold & Gerard Dupont

  2. Who we are? Gérard Dupont gerard.dupont@airbus.com - more than 10 years on research projects with NLP, IR/search, ML, RL and massive data processing. Alexandre Arnold alexandre.arnold@airbus.com - virtual assistant experience w.r.t. architecture, NLP, S2T & planning/reasoning capabilities via RL. Central Research & Technology
  3. Chatbot?

  4. Chatbot?

  5. Chatbot? https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html

  6. Where are the bots? Chatbots use-cases 1. Customer services/support 2.

    Automating simple transactions 3. User engagement/branding 4. Social media activity 5. User retention 6. Personal assistant
  7. From https://blog.keyreply.com/the-chatbot-landscape-2017-edition-ff2e3d2a0bdb

  8. Chatbot implementation, high-level overview: 1. User speaks 2. ASR transforms speech into text 3. Text is analyzed 4. Discourse state is updated 5. Response is selected 6. Response is formulated as text 7. Text is rendered as speech. From “A Survey of Available Corpora for Building Data-Driven Dialogue Systems” by Iulian Vlad Serban et al.
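The seven steps above can be sketched as a minimal pipeline. Every component here is a hypothetical stub standing in for a real engine (ASR, NLU, dialogue manager, NLG, TTS); only the shape of the loop comes from the slide:

```python
def respond(audio):
    """Toy end-to-end pipeline mirroring the seven steps above.
    All components below are stub placeholders, not real ASR/NLU/TTS engines."""
    text = asr(audio)                  # 2. speech -> text
    meaning = analyze(text)            # 3. text analysis (intent, entities)
    state = update_state(meaning)      # 4. discourse state tracking
    response = select_response(state)  # 5. response selection
    reply_text = realize(response)     # 6. response -> text
    return tts(reply_text)             # 7. text -> speech

# Stub components so the sketch runs end to end (all invented for illustration)
asr = lambda audio: audio.lower()
analyze = lambda text: {"intent": "greet"} if "hello" in text else {"intent": "other"}
update_state = lambda meaning: {"last_intent": meaning["intent"]}
select_response = lambda state: "greet_back" if state["last_intent"] == "greet" else "fallback"
realize = lambda resp: "Hello there!" if resp == "greet_back" else "Sorry?"
tts = lambda text: f"<audio: {text}>"
```

In a real system each stub is a model or service of its own; the point of the sketch is that the dialogue state (step 4) is the only piece that persists across turns.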
  9. Speech != Text inputs. People don’t talk like they write (sometimes for the better…). People don’t talk the same way when they see each other. People don’t talk the same way when they can draw/make signs. People don’t talk to robots the way they talk to other people (or do they?). The dialogue model tries to abstract away the modality of the conversation the agent is in (but really it can’t).
  10. Types of conversations. “Openness” of the discourse topics is key: closed domain + IR based → easy with rule sets; closed domain + ML based → possible with ML; open domain + IR based → not possible (so far); open domain + ML based → super hard (GAI?). To start simple: choose a closed domain, focus on a specific goal/task, limit to 2 people interacting... To make it hard: add chit chat, visual input, context, personalization, personality...
  11. Chatbot implementations/approaches: Discriminative model architectures - dialogue act/intent classification + state tracking. Dialogue retrieval model - re-ranking responses from “learned” conversations. Dialogue generation model - produce utterances by composing text.
  12. Chatbot implementations/approaches: Discriminative model architectures - dialogue act/intent classification + state tracking. Dialogue retrieval model - re-ranking responses from “learned” conversations. Dialogue generation model - produce utterances by composing text. What about end-to-end systems?
  13. One implementation framework: Rasa Core. Architecture: typically a neural network.
  14. One implementation framework: Rasa Core. Key elements: Domain → defines intents, entities, context slots, actions… NLU data → training corpus for intent/entity recognition. Stories (optional) → typical conversation examples to start learning. Custom actions (optional) → code, e.g. for a query or service call.
  15. One implementation framework: Rasa Core. Various ways of learning a dialogue policy: Supervised learning → based on “stories” or online. Interactive learning → correct the bot only when it is wrong. Perspective: reinforcement learning (pros & cons).
  16. Time for a demo? Connecting trained Rasa Core/NLU to a

    custom web front-end...
  17. Evaluating chatbot performances

  18. Chatbot evaluation models Computer-science and AI perspective Information Retrieval (IR)

    approach Linguistic perspective (NLP) User experience (UX) methodology
  19. Chatbot evaluation models. Computer-science and AI perspective: Turing test model - interesting theoretical approach, no clear performance metric. Information Retrieval (IR) approach. Linguistic perspective (NLP). User experience (UX) methodology.
  20. Computer-science and AI perspective Information Retrieval (IR) approach Focus on

    utility/relevance/timeliness of answers + well understood metrics Linguistic perspective (NLP) User experience (UX) methodology Chatbot evaluation models
  21. Computer-science and AI perspective Information Retrieval (IR) approach Linguistic perspective

    (NLP) Quality and coherence of the discourse against topical categorization User experience (UX) methodology Chatbot evaluation models
  22. Computer-science and AI perspective Information Retrieval (IR) approach Linguistic perspective

    (NLP) User experience (UX) methodology Human factors perspective and usability from the user point-of-view Chatbot evaluation models
  23. Chatbot evaluation metrics: BLEU, ROUGE, METEOR, embedding-based metrics. Precision, Recall (for top k; + combined metrics: accuracy, F1-score). Word perplexity (model dependent - limited use). Engagement, coherence, conversational depth.
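The top-k precision/recall and F1 mentioned above can be computed in a few lines for a response-retrieval bot. A minimal sketch (the function name and toy data are made up for illustration):

```python
def precision_recall_f1(retrieved, relevant, k):
    """Top-k precision/recall and F1 for a ranked list of candidate responses."""
    top_k = retrieved[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k                    # fraction of retrieved that are relevant
    recall = hits / len(relevant)           # fraction of relevant that were retrieved
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)   # harmonic mean of the two
    return precision, recall, f1

# Toy example: 3 responses ranked by the model, 2 of which are actually relevant
p, r, f1 = precision_recall_f1(["a", "b", "c"], ["a", "d"], k=2)
# p = 0.5 (1 of the top 2 is relevant), r = 0.5 (1 of 2 relevant found), f1 = 0.5
```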
  24. Chatbot evaluation (NLP) metrics: BLEU, ROUGE + embedding-based metrics: greedy matching, embedding average, vector extrema... See “How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation”.
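The "embedding average" metric above compares a generated response to a reference by averaging word vectors and taking the cosine similarity. A self-contained sketch with tiny made-up 2-d "embeddings" (real usage relies on pretrained vectors, e.g. word2vec):

```python
import math

def embedding_average(tokens, embeddings):
    """Average the word vectors of a sentence (the embedding-average baseline)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy 2-d "embeddings", invented for illustration
emb = {"hello": [1.0, 0.0], "hi": [0.9, 0.1], "bye": [0.0, 1.0]}
score = cosine(embedding_average(["hello"], emb), embedding_average(["hi"], emb))
# "hello" vs "hi" score close to 1; "hello" vs "bye" would score 0
```

As the cited paper argues, such metrics correlate poorly with human judgment on dialogue, which is part of the motivation for frameworks like ParlAI.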
  25. Chatbot evaluation metrics: BLEU, ROUGE, METEOR, embedding-based metrics. Precision, Recall (for top k; + combined metrics: accuracy, F1-score). Word perplexity (model dependent - limited use). Engagement, coherence, conversational depth. … and don’t forget (classic) business metrics: retention, engagement, churn, MRR, results from A/B testing...
  26. Why not go directly to real users?

  27. Why not go directly to real users?

  28. Why not go directly to real users? "Tay", after the acronym "thinking about you": a chatbot released on Twitter - by Microsoft's Technology & Research and Bing divisions - based on Xiaoice, a similar Microsoft project in China - an online-learning dialogue agent - released on 23rd March 2016.
  29. Why not go directly to real users? A chatbot released on Twitter - by Microsoft's Technology & Research and Bing divisions - based on Xiaoice, a similar Microsoft project in China - an online-learning dialogue agent - released on 23rd March 2016 - Microsoft shut down the service only 16 hours after its launch. "Tay", after the acronym "thinking about you".
  30. Why not go directly to real users? "Tay", after the acronym "thinking about you": a chatbot released on Twitter - by Microsoft's Technology & Research and Bing divisions - based on Xiaoice, a similar Microsoft project in China - an online-learning dialogue agent - released on 23rd March 2016.
  31. Why not go directly to real users? Lessons learned: - interactive user experiments are the best - but users are hard… - a well-meaning experiment can go wrong very rapidly - prepare and evaluate before launch! "Tay", after the acronym "thinking about you".
  32. Which task to evaluate your bot? • SQuAD - Stanford

    Question Answering Dataset • bAbI tasks • MS MARCO - Microsoft MAchine Reading COmprehension Dataset • WikiQA • dialog-bAbI • Movie Dialog dataset • Ubuntu dialog corpus • Textbook Question Answering (TQA) dataset And many more...
  33. And do we write all the code to evaluate? Some

    evaluation frameworks: • Alexa prize • bAbI tasks • CLEVR • CommAI-env • ParlAI
  34. And do we write all the code to evaluate? Some

    evaluation frameworks: • Alexa prize (Amazon) • bAbI tasks (Facebook) • CLEVR (Facebook) • CommAI-env (Facebook) • ParlAI (Facebook)
  35. ParlAI

  36. ParlAI. “A unified platform for sharing, training and evaluating dialog models across many tasks.” First release in May 2017. GitHub repo + good documentation + good tutorials. Regularly updated: - more than 15 agent implementations - the number of tasks has doubled since the first release.
  37. ParlAI (from documentation) It’s a python-based platform for enabling dialog

    AI research. • a unified framework for sharing, training and testing dialog models • many popular datasets available all in one place, with the ability to multi-task over them • seamless integration of Amazon Mechanical Turk for data collection and human evaluation • integration with Facebook Messenger to connect agents with humans in a chat interface
  38. Top contributors Most of them are from FAIR

  39. ParlAI tasks. Originally: 20 tasks, from toy problems to benchmarks. Today: almost 50 tasks (with possible subtasks). Various categories: • Q/A • Sentence completion • Dialogue • Visual dialogue • Chit-chat • Negotiation • Machine translation • Any task you want... Auto setup (downloads the data on demand) and standardized format.
  40. ParlAI tasks Many subtasks: already 20 bAbI tasks!

  41. Teacher: { 'text': 'Sam went to the kitchen\nPat gave Sam the milk. Where is the milk?', 'labels': ['kitchen'], 'label_candidates': ['hallway', 'kitchen', 'bathroom'], 'episode_done': False }
      Student: { 'text': 'hallway' }
      Teacher: { 'text': 'Sam went to the hallway\nPat went to the bathroom. Where is the milk?', 'labels': ['hallway'], 'label_candidates': ['hallway', 'kitchen', 'bathroom'], 'episode_done': True }
      Student: { 'text': 'hallway' }
      Teacher: { ... # starts next episode } ...
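Given that message structure, per-turn accuracy over an episode can be scored in a few lines. A minimal sketch (the function name and toy turns are invented; the field names follow the exchange above):

```python
def score_episode(turns):
    """Fraction of (teacher_msg, student_msg) pairs where the student's
    reply exactly matches one of the teacher's gold labels."""
    correct = 0
    for teacher, student in turns:
        if student['text'] in teacher.get('labels', []):
            correct += 1
    return correct / len(turns)

# Toy episode shaped like the teacher/student exchange above
turns = [
    ({'text': 'Where is the milk?', 'labels': ['kitchen'], 'episode_done': False},
     {'text': 'hallway'}),
    ({'text': 'Where is the milk?', 'labels': ['hallway'], 'episode_done': True},
     {'text': 'hallway'}),
]
accuracy = score_episode(turns)  # 0.5: one of the two answers matched a label
```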
  42. Amazon Mechanical Turk integration. ParlAI supports integration with Mechanical Turk for data collection, training, and evaluation. Cheap, easy and safe interactive evaluations!
  43. Coding time: implement & evaluate

  44. ParlAI - implementation basics: a Teacher sends observations to an Agent, which replies with actions, over the train/dev/test splits.
  45. Implementation basics. Observation & action share the same “meta” structure. Key fields: - text - id - reward - labels
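The Agent/Teacher loop from the slides above can be mimicked with a toy class. This is a self-contained sketch of the observe/act pattern and the shared message fields, not the real ParlAI Agent API (the class name and fallback reply are invented):

```python
class EchoAgent:
    """Toy agent following ParlAI's observe/act pattern (sketch only)."""
    def __init__(self):
        self.last_observation = None

    def observe(self, observation):
        # Observations and actions share the same "meta" structure:
        # keys such as 'text', 'id', 'reward', 'labels', 'episode_done'.
        self.last_observation = observation
        return observation

    def act(self):
        # A real agent would run its model here; this one just picks the
        # first candidate as a placeholder policy.
        candidates = self.last_observation.get('label_candidates', [])
        reply = candidates[0] if candidates else 'I do not know'
        return {'id': 'EchoAgent', 'text': reply}

agent = EchoAgent()
agent.observe({'text': 'Where is the milk?', 'id': 'teacher',
               'label_candidates': ['hallway', 'kitchen']})
action = agent.act()  # {'id': 'EchoAgent', 'text': 'hallway'}
```

A ParlAI teacher drives exactly this loop: it calls observe with its message, then act, and scores the returned action against its labels.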
  46. Rasa Core agent implementation. Adapting training data for dialog-bAbI (task 5): Domain → change utterance templates to better fit the original data. Stories → add different dialogue endings + a few bug adaptations/fixes. Limitations: adapted stories still do not perfectly reflect the original data, which hurts the benchmark score (e.g. no “dialog interruptions”).
  47. Code, code, code... (live examples in VS Code). Task selection: dialog-bAbI - task 5, full dialogue. Agent implementation: - implement TDS_MEGA_BOT! (crossing fingers) - Rasa Core (it’s all ready!). Run an agent: - Play: $ python examples/display_model.py -m rasa -t "dialog_babi:Task:5" -dt valid - Evaluate: $ python examples/eval_model.py -m rasa -t "dialog_babi:Task:5" -dt valid
  48. Evaluation results Not so bright… but wait! Context: - IR

    baseline: not adapted to dialog case - Seq2seq training ~10h tuned by FAIR on this task - Rasa-core training ~2min + 4h of personal work to adapt to dialog-bAbI-task5 (room for improvements)
  49. Evaluation results. Fixing some (not all) cases not covered originally by the Rasa training: 3x the performance (and we are still missing some dialogue cases). Context: - IR baseline: not adapted to the dialogue case - Seq2seq: ~10h of training, tuned by FAIR on this task - Rasa Core: ~2min of training + 6h of personal work to adapt to dialog-bAbI-task5 (room for improvement).
  50. Take-away. Chatbots are fun! Lots of room for research & tinkering - not only for the big players. We are far from getting the long-promised sci-fi personalized agent…
  51. Take-away. Chatbots are fun! Lots of room for research & tinkering - not only for the big players. We are far from getting the long-promised sci-fi personalized agent…
  52. Q/A time? Central Research & Technology Gérard Dupont gerard.dupont@airbus.com Alexandre

    Arnold alexandre.arnold@airbus.com
  53. BONUS?

  54. https://aigym.airbus.com Central Research & Technology