Bring your chatbots to production

Bring your bots to production by using continuous integration pipelines
Lee Boonstra, Sales Engineer, Google Cloud

Introduction
During Google Cloud Next 2019, ING saw the presentation by the credit card company Discover on bringing virtual assistants to production by using continuous integration / development approaches. This approach helps Discover enable DF agents to focus on more complex interactions over multiple channels. Discover showed how they make use of metrics, and afterwards gave a demo of the staging portal they created for their product owners / scrum team. They explained the flow of bringing chatbot model updates from dev to staging to production. ING expressed interest in cracking the same problem.
In the next 20 minutes, I will explain how you can collect metrics and automate the process of bringing chatbot model updates to production by building a continuous integration pipeline for Dialogflow.

Enterprise teams working with Dialogflow (typical enterprise organization)
• IT department
  ◦ Set up the cloud environment, IAM roles, network, rights/roles for usage of Dialogflow, Compute (Kubernetes), ML APIs, Pub/Sub, BigQuery...
• Data Scientists
  ◦ Collect metrics and analytics on frequently asked questions and customer experiences.
  ◦ Test the conversation.
• UX Conversational Designers & Content Writers
  ◦ Write the conversation.
• Engineers
  ◦ Build fulfillments.
  ◦ Integrate with web services & APIs.
  ◦ Configure the chatbot output channels.

Typical flow of building a chatbot
1. IT department: Set up GCP account
2. UX Conversation Designers: Create conversation in Dialogflow UI
3. Engineers: Integrate the channels with Dialogflow SDK
4. Engineers (optional): Build fulfillment
5. UX Conversation Designers / Engineers: Deploy agents
6. Data Scientists / UX Conversation Designers: Test agents in production channel
7. Data Scientists / UX Conversation Designers: Agent training in Dialogflow UI
8. Data Scientists / UX Conversation Designers: Gather metrics (Dialogflow UI / BigQuery)
9. UX Conversation Designers: Optimize conversation in Dialogflow UI

Use Case: Discover
Discover presented their use case live at Google Cloud Next 2019 in San Francisco. See the recording here: https://www.youtube.com/watch?v=L7nbmHPbrEo

Bring your Dialogflow experience to the next level
Learnings Discover shared at Google Cloud Next 2019:
• Dev Environment
  ◦ Creation / edits of conversations in the Dialogflow UI
• Staging Environment
  ◦ Export the dev intent changes and run a diff to see what's newly added.
  ◦ Validation
  ◦ Run unit tests with test user queries: request the intent matching confidence score
  ◦ Regression tests, to ensure previously created intents aren't broken
  ◦ Metrics / confusion matrix
• Production Environment
  ◦ Enable / disable intents
  ◦ Per-intent threshold
  ◦ Overrule intents
  ◦ Live analytics

1. IT department: Set up GCP account
2. UX Conversation Designers: Create conversation in Dialogflow UI in Dev
3. Engineers: Integrate the channels with Dialogflow SDK
4. Engineers (optional): Build fulfillment
5. UX Conversation Designers / Engineers: Deploy to staging
6. Data Scientists / UX Conversation Designers: Test agents in staging environment
7. Data Scientists / UX Conversation Designers: Agent training in Dialogflow UI
8. Data Scientists / UX Conversation Designers: Gather metrics (Dialogflow UI / BigQuery)
9. UX Conversation Designers: Optimize conversation in Dialogflow UI
10. UX Conversation Designers / Engineers: Deploy to production

Example
User: “The mobile app is not working? Is the service down?”
MATCHED INTENT: -service down-
Response: “Thank you for your comment. We will log this issue.”
But when there is a known issue with the app, the business can overrule the response without modifying the Dialogflow model, by turning the intent off:
Response: “Yes. The network is down. We are currently under maintenance.”

Example
General Confidence Threshold set to 20%
“I want to block my card.”
“This is my account, I want to block card IBAN...”
MATCHED INTENT: -blockcard-
Except when asked:
“I want to block my account.”
Confidence Threshold 90%
MATCHED INTENT: -blockaccount-
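The per-intent threshold in this example could be applied in your own chat server, after Dialogflow has returned a match. Below is a minimal sketch under stated assumptions: the names `resolveIntent`, `perIntentThresholds` and the `fallback` label are hypothetical, and scores are assumed to be expressed between 0 and 1.

```javascript
// Hypothetical configuration: a general threshold plus stricter per-intent values.
const GENERAL_THRESHOLD = 0.2;
const perIntentThresholds = new Map([['blockaccount', 0.9]]);

// Returns the intent to act on, or 'fallback' when the score is too low.
function resolveIntent(intentName, confidence, thresholds = perIntentThresholds) {
  const required = thresholds.has(intentName)
    ? thresholds.get(intentName)
    : GENERAL_THRESHOLD;
  return confidence >= required ? intentName : 'fallback';
}

console.log(resolveIntent('blockcard', 0.45));    // blockcard
console.log(resolveIntent('blockaccount', 0.45)); // fallback
console.log(resolveIntent('blockaccount', 0.95)); // blockaccount
```

Keeping the thresholds in configuration rather than in the agent means they can be tuned without retraining the model.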

Dialogflow Features

Dialogflow Intent Detection
Intents: Card Intent, Account Intent, Mortgage Intent
1. User says: “I want to disable my account.”
2. Dialogflow searches for the highest intent “match”. It returns a confidence level.
3. Dialogflow returns the response: “Your account has been disabled.”

Dialogflow Intent Detection
• Programmatically you can retrieve the queryResult
  ◦ projects.agent.sessions.detectIntent
  ◦ Requires: queryInput
    ▪ Can be text, spoken text, or an event trigger
  ◦ Returns: DetectIntentResponse
    ▪ Contains queryResult, which contains the matched intent and intentDetectionConfidence
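As a rough illustration, the request and response shapes of detectIntent could be handled as follows. The helper names (`buildDetectIntentRequest`, `summarizeResponse`), the project ID, the session ID and the sample response values are made up for this example. Sending the request (for instance through the Dialogflow SDK) is left out, so the sketch runs without credentials.

```javascript
// Builds the arguments for a projects.agent.sessions.detectIntent call with text input.
function buildDetectIntentRequest(projectId, sessionId, text, languageCode = 'en') {
  return {
    session: `projects/${projectId}/agent/sessions/${sessionId}`,
    queryInput: { text: { text, languageCode } },
  };
}

// Pulls the matched intent and confidence score out of a detectIntent response.
function summarizeResponse(response) {
  const result = response.queryResult || {};
  return {
    intent: result.intent ? result.intent.displayName : null,
    confidence: result.intentDetectionConfidence ?? 0,
  };
}

// Hypothetical values, for illustration only.
const request = buildDetectIntentRequest('my-project', 'session-123', 'I want to disable my account.');
const sampleResponse = {
  queryResult: {
    intent: { displayName: 'Account Intent' },
    intentDetectionConfidence: 0.87,
  },
};

console.log(request.session);                   // projects/my-project/agent/sessions/session-123
console.log(summarizeResponse(sampleResponse)); // { intent: 'Account Intent', confidence: 0.87 }
```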

Dialogflow Confidence Level
• Intent Detection Confidence Score: percentage of how confident the model was in the intent detection.
• Minimum Confidence Threshold: configurable value in the Dialogflow settings.

Import & Exports
• Dialogflow agents can be exported to a zip file, which contains JSON files for each intent and entity.
• Programmatically you can export, import and train agents by using the API:
  ◦ projects.agent.export
  ◦ projects.agent.import
  ◦ projects.agent.train

Dialogflow Environments
• Dialogflow has a built-in feature to create multiple environments.
• Unique webhooks per environment.
• Switch versions / rollback options.

Demos

Babs the Banking Bot
Channels: Web Chat, Google Assistant
“Hey Google, let me talk to Babs the Banking Bot.”
“Welcome, how can I help you?”
“I want to transfer money.”
“Let’s get Babs the Banking Bot.”
“How much do you want to transfer?”
“100 euro.”

Which customers are unhappy and why? (Analytics)

How can I improve the user experience? (Analytics)

Collect real-time chats from Dialogflow SDK

Mask sensitive Information with DLP API

Understand the text with NLP API

Store all data in a data-warehouse

Optimize your agent

Advanced Chatflow with machine learning bot analytics
(Architecture diagram)
• Customer: user types to custom UI or channel; chatbot replies
• Client: JS Angular 5 web front-end (Kubernetes Engine)
• Chat Server: Dialogflow SDK / socket.io (Kubernetes Engine)
• Dialogflow Enterprise
• Back-end CRM: Python / Django (Kubernetes Engine)
• Container Registry: container images can be stored in the Container Registry
• Messaging Publisher: Pub/Sub
• Webhook Router: Cloud Function
• Webhook
• Container Builder: building dev pipelines

Advanced Chatflow with machine learning and bot analytics
(Architecture diagram)
• Customer: user types to custom UI or channel; chatbot replies
• Client: JS Angular 5 web front-end (Kubernetes Engine)
• Chat Server: Dialogflow SDK / socket.io (Kubernetes Engine)
• Dialogflow Enterprise
• Back-end CRM: Python / Django (Kubernetes Engine)
• Messaging Publisher: Pub/Sub
• Subscription: Cloud Function
• Sensitivity Filter: DLP API
• Sentiment Detector: NLP API
• Data Warehouse: BigQuery
• Webhook Router: Cloud Function
• Webhook

Metrics
Once you have built your chatbot, the most important question that arises is: how good is your ML model?

Test Datasets
• UX / content writers create the validation data set. They use this to train the Dialogflow agent model by entering it as user phrases.
• To create test data sets that aren't biased, use logs from chat / IVR / virtual assistants, separate from the intent user phrases that were created. User PII data can be anonymized / masked. (Note the FutureBank.nl BigQuery / dashboard demo.)
• Create a unit test that passes the anonymized test phrase to the detectIntent API method. The detected intent and the confidence score can then be evaluated against your validation dataset.

Example: True Positive (TP)
A true positive is an outcome where the chatbot correctly detects the right (positive) intent.
• Dialogflow user phrases / data to train the Dialogflow agent model
  ◦ “Did my salary come in yet?”
  ◦ “Have I received my salary?”
  ◦ INTENT: Salary Intent
• Test data:
  ◦ “My salary, when will I receive it?”
  ◦ Expected Intent: Salary Intent
  ◦ Detected Intent: Salary Intent

Example: True Negative (TN) / Unsupported Request
Similarly, a true negative is an outcome where the chatbot correctly mapped the user phrase to a fallback.
• Dialogflow user phrases / data to train the Dialogflow agent model
  ◦ Everything that can’t be mapped.
  ◦ Global Fallback
• Test data:
  ◦ “My salary, when will I receive it?”
  ◦ Expected Intent: Global Fallback
  ◦ Detected Intent: Global Fallback

Example: False Positive (FP) / Misunderstood Request
A false positive is an outcome where the chatbot matches the wrong intent. (It should have been a different intent or, in case that intent didn’t exist, a fallback intent.)
• Dialogflow user phrases / data to train the Dialogflow agent model
  ◦ “I want to block my card.”
  ◦ INTENT: Block Card
  ◦ “I want to renew my card.”
  ◦ INTENT: Renew Card
• Test data:
  ◦ “My account is blocked, can I get a new card?”
  ◦ Should be: Renew Card
  ◦ Instead returned: Block Card

Example: False Negative (FN) / Missed Request
A false negative is an outcome where the intent exists, but the chatbot didn’t detect it and therefore a fallback was triggered.
• Dialogflow user phrases / data to train the Dialogflow agent model
  ◦ “Did my salary come in yet?”
  ◦ INTENT: Salary Intent
• Test data:
  ◦ “My salary, when will I receive it?”
  ◦ Should be: INTENT: Salary Intent
  ◦ Instead returned: Fallback Message
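The four outcomes above can be derived from an expected and a detected intent name. The sketch below follows the definitions on these slides; the function name `classifyOutcome` and the use of the label `Global Fallback` for every fallback response are assumptions made for illustration.

```javascript
// Assumed label for the fallback intent in the test data.
const FALLBACK = 'Global Fallback';

// Classifies one test result as TP, TN, FP or FN.
function classifyOutcome(expected, detected) {
  if (expected === detected) {
    // Correct: either a correctly matched intent or a correct fallback.
    return expected === FALLBACK ? 'TN' : 'TP';
  }
  // Incorrect: a fallback for a supported request is a missed request (FN),
  // any other wrong intent is a misunderstood request (FP).
  return detected === FALLBACK ? 'FN' : 'FP';
}

console.log(classifyOutcome('Salary Intent', 'Salary Intent'));     // TP
console.log(classifyOutcome('Global Fallback', 'Global Fallback')); // TN
console.log(classifyOutcome('Renew Card', 'Block Card'));           // FP
console.log(classifyOutcome('Salary Intent', 'Global Fallback'));   // FN
```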

Calculate Accuracy
The ratio of correctly predicted observations to the total observations. (The ratio of all correctly handled intents.)
total correct = total TP + total TN
total incorrect = total FP + total FN
accuracy = correct / (correct + incorrect)
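A small worked example of the accuracy calculation, using made-up counts (70 TP, 10 TN, 12 FP, 8 FN). The function name `accuracy` is just an illustrative choice.

```javascript
// accuracy = correct / (correct + incorrect)
function accuracy({ tp, tn, fp, fn }) {
  const correct = tp + tn;
  const incorrect = fp + fn;
  const total = correct + incorrect;
  // Avoid dividing by zero when no tests were run.
  return total === 0 ? 0 : correct / total;
}

// Hypothetical counts: 80 correct out of 100 test phrases.
console.log(accuracy({ tp: 70, tn: 10, fp: 12, fn: 8 })); // 0.8
```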

Calculate Precision
The ratio of correct positive predictions to all positive predictions. (To determine if there are problems with false positives / misunderstood requests. The higher the precision, the lower the FP rate.)
precision = total TP / (total TP + total FP)

Calculate Recall
A sensitivity ratio. (To determine if intents are too narrowly defined and requests are missed. When it’s above 0.5, it can be considered good.)
recall = total TP / (total TP + total FN)

Calculate F1 Score
The harmonic mean of precision and recall. (To determine if intents are too narrowly defined and requests are missed. When it’s above 0.5, it can be considered good.)
f1 score = 2 * (recall * precision) / (recall + precision)
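The precision, recall and F1 formulas can be computed together from the TP, FP and FN totals. This is a sketch with made-up counts; the function names are illustrative, and returning 0 for an empty denominator is a choice made here, not something the slides prescribe.

```javascript
// precision = TP / (TP + FP)
function precision(tp, fp) {
  return tp + fp === 0 ? 0 : tp / (tp + fp);
}

// recall = TP / (TP + FN)
function recall(tp, fn) {
  return tp + fn === 0 ? 0 : tp / (tp + fn);
}

// f1 = 2 * (recall * precision) / (recall + precision)
function f1Score(tp, fp, fn) {
  const p = precision(tp, fp);
  const r = recall(tp, fn);
  return p + r === 0 ? 0 : (2 * (r * p)) / (r + p);
}

// Hypothetical totals: 70 TP, 12 FP, 8 FN.
console.log(precision(70, 12).toFixed(3));  // 0.854
console.log(recall(70, 8).toFixed(3));      // 0.897
console.log(f1Score(70, 12, 8).toFixed(3)); // 0.875
```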

Metrics
(Figure: a 2x2 grid of True Positive, False Negative, True Negative and False Positive, with the axes “Detected Intent by Dialogflow” and “Expected Intent by you”.)
True positives and true negatives are the observations that are correctly detected and are therefore shown in green. We want to minimize the false positives and false negatives (shown in red).

Confusion Matrix
A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known.
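One possible way to build such a table from test results is sketched below. The function name `buildConfusionMatrix`, the shape of the result objects and the sample data are assumptions for illustration. Rows are the expected intents and columns the detected intents.

```javascript
// Builds a confusion matrix: matrix[row][col] counts how often the intent
// labels[row] was expected while labels[col] was detected.
function buildConfusionMatrix(results) {
  const labels = [...new Set(results.flatMap(r => [r.expected, r.detected]))].sort();
  const matrix = labels.map(() => labels.map(() => 0));
  for (const { expected, detected } of results) {
    matrix[labels.indexOf(expected)][labels.indexOf(detected)] += 1;
  }
  return { labels, matrix };
}

// Hypothetical test results.
const sampleResults = [
  { expected: 'Block Card', detected: 'Block Card' },
  { expected: 'Renew Card', detected: 'Block Card' },
  { expected: 'Renew Card', detected: 'Renew Card' },
  { expected: 'Salary Intent', detected: 'Global Fallback' },
];

const { labels, matrix } = buildConfusionMatrix(sampleResults);
console.log(labels); // [ 'Block Card', 'Global Fallback', 'Renew Card', 'Salary Intent' ]
console.log(matrix); // [ [ 1, 0, 0, 0 ], [ 0, 0, 0, 0 ], [ 1, 0, 1, 0 ], [ 0, 1, 0, 0 ] ]
```

Counts on the diagonal are correct detections; everything off the diagonal shows which intents get confused with each other.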

AUC - ROC curve
The ROC curve tells us how good the model is at distinguishing the given intents, in terms of the detected probability. The steeper the line, the better. Using this info, you can decide how you want to set the confidence thresholds.
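To plot such a curve, one common approach is to treat each test result as correct or incorrect and sweep over candidate confidence thresholds. The sketch below takes that approach; it is not necessarily how the chart on this slide was produced, and the name `rocPoints`, the result shape and the sample scores are hypothetical.

```javascript
// For each threshold, a result is "accepted" when its confidence reaches the threshold.
// tpr: share of correct detections that are accepted.
// fpr: share of incorrect detections that are accepted.
function rocPoints(results, thresholds) {
  const positives = results.filter(r => r.correct).length;
  const negatives = results.length - positives;
  return thresholds.map(threshold => {
    const accepted = results.filter(r => r.confidence >= threshold);
    const truePos = accepted.filter(r => r.correct).length;
    const falsePos = accepted.length - truePos;
    return {
      threshold,
      tpr: positives === 0 ? 0 : truePos / positives,
      fpr: negatives === 0 ? 0 : falsePos / negatives,
    };
  });
}

// Hypothetical confidence scores from detectIntent test runs.
const scored = [
  { confidence: 0.95, correct: true },
  { confidence: 0.8, correct: true },
  { confidence: 0.6, correct: false },
  { confidence: 0.55, correct: true },
  { confidence: 0.3, correct: false },
];

console.log(rocPoints(scored, [0.2, 0.5, 0.9]));
```

A threshold that keeps the true positive rate high while pushing the false positive rate down is a reasonable candidate for the confidence threshold.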

Build
• You will need to know the intent name, and the user phrases it was trained on.
• Write your own phrases
• Run unit tests on your phrases

    it('TP', async () => {
      // Test phrases written by you, not the phrases the intent was trained on.
      const myUserPhrase = 'Can I block my card?';
      const myUserPhrase2 = 'Please cancel my pass.';
      const intentName = 'BLOCK_CARD';
      // detectIntent is assumed to be your own async wrapper around the detectIntent API.
      expect((await detectIntent(myUserPhrase)).intent).toBe(intentName);
      expect((await detectIntent(myUserPhrase2)).intent).toBe(intentName);
    });

• Count the total TP, TN, FP and FN, and calculate F1, precision and recall
• Based on these, generate a confusion matrix

Solving this programmatically.

Dialogflow Agents
• Development / Training: all manual agent updates happen here.
• Staging: export from dev to perform all acceptance and regression testing. An artifact is created and versioned from the process. API access only.
• Production: only artifacts are deployed to prod. API access only.

Acceptance Dashboard
1. Export the dev intent changes, and run a diff to collect the new intents and their user / training phrases.
2. Upload / create a validation set based on collected metrics, or create your own for the new intents.
3. Run a unit test and test it against intent and confidence score, by making detectIntent API calls.
4. Run a regression test to ensure previously created intents aren’t broken.
5. Plot the results in a confusion matrix.
6. Compare metrics against the previous version.
7. Export a summary report.
8. When tests are approved, push intents to the production environment (import).
9. When tests are disapproved, send a message to the development team?
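Step 1 of the list above, the diff between two exports, could look roughly like this. The sketch assumes each export has already been reduced to a plain object that maps an intent name to its training phrases; that shape, the function name `diffIntents` and the sample data are hypothetical, and reading the exported zip file is left out.

```javascript
const hasOwn = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Compares two simplified exports and reports new intents, new phrases and removed intents.
function diffIntents(previous, current) {
  const added = [];
  const changed = [];
  for (const [name, phrases] of Object.entries(current)) {
    if (!hasOwn(previous, name)) {
      added.push({ name, phrases });
      continue;
    }
    const known = new Set(previous[name]);
    const newPhrases = phrases.filter(phrase => !known.has(phrase));
    if (newPhrases.length > 0) changed.push({ name, newPhrases });
  }
  const removed = Object.keys(previous).filter(name => !hasOwn(current, name));
  return { added, changed, removed };
}

// Hypothetical exports of the previous and the current dev agent.
const previousExport = { 'Block Card': ['I want to block my card.'] };
const currentExport = {
  'Block Card': ['I want to block my card.', 'Please block my card.'],
  'Renew Card': ['I want to renew my card.'],
};

console.log(JSON.stringify(diffIntents(previousExport, currentExport), null, 2));
```

The `added` and `changed` entries are the intents that need new validation phrases; everything else is covered by the regression tests.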

Production flow
Current:
• User types user queries in a chatbot.
• Dialogflow matches an intent and replies to a user session.
Components: Website; Customer; Client (JS Angular 5 web front-end, Kubernetes Engine); Chat Server (Dialogflow SDK, Kubernetes Engine); Dialogflow Enterprise.

Production flow
Advanced flow:
• User types user queries in a chatbot.
• Dialogflow matches an intent and checks the results with the production config.
• The chatbot admin server overrules the response with custom messages, thresholds or fallbacks.
Components: Website; Customer; Client (JS Angular 5 web front-end, Kubernetes Engine); Chat Server (Dialogflow SDK, Kubernetes Engine); Admin Server (Dialogflow SDK, Kubernetes Engine); Dialogflow Enterprise.

Admin Dashboard
1. All controlled intents need to have fulfillment enabled.
2. List all intents, with switches to enable / disable intents.
   a. When enabled, it gets the response from the DF UI.
   b. When disabled, catch and respond with nothing / overrule responses.
3. Live Analytics Dashboard (like the futurebank.nl example)
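The enable / disable switch from step 2 could be applied in the admin server as sketched below. The config shape, the function name `chooseResponse` and the sample messages are assumptions for illustration; the override text is taken from the earlier “service down” example.

```javascript
// Hypothetical production config, managed through the admin dashboard.
const productionConfig = new Map([
  ['service down', {
    enabled: false,
    override: 'Yes. The network is down. We are currently under maintenance.',
  }],
]);

// Returns the text to send to the user, or null to respond with nothing.
function chooseResponse(intentName, dialogflowText, config = productionConfig) {
  const rule = config.get(intentName);
  // Intents without a rule, or with the switch on, keep the Dialogflow response.
  if (!rule || rule.enabled) return dialogflowText;
  // Disabled intents are overruled, or answered with nothing when no override is set.
  return rule.override ?? null;
}

console.log(chooseResponse('service down', 'Thank you for your comment. We will log this issue.'));
// Yes. The network is down. We are currently under maintenance.
console.log(chooseResponse('blockcard', 'Your card is blocked.'));
// Your card is blocked.
```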

(Architecture diagram: Dev, Acceptance, Prod.)
• Dialogflow DEV
• Customer
• Client: JS Angular 5 web front-end
• Chat Server: Dialogflow SDK
• Acceptance Board: JS Angular 5 web front-end
• Admin Panel: JS Angular 5 web front-end
• Dialogflow Acceptance
• Dialogflow Production
• Production Config: Dialogflow SDK
Flow: export dev intents, import to acceptance agent, run unit test / metrics, push to production.

Thanks
