
Build Mandarin AI Conversational Agent with Rasa

circlelychen
November 23, 2019


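Rasa is an open source machine learning framework that gives Python developers a convenient way to build intelligent conversational agents based on text and voice interaction. In Chinese-language scenarios, however, Rasa's built-in components often fail to deliver satisfactory quality.
Starting from the Rasa "restaurant search assistant" example, this talk shows how the assistant learns Traditional Chinese with the help of ckiptagger and custom components.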

Transcript

  1. Community Experiences
    • Speaker @ PyCon Taiwan 2019: 打造面向金融場景的中文自然語言理解引擎 (Building a Chinese natural language understanding engine for financial scenarios)
    • https://speakerdeck.com/circlelychen/da-zao-mian-xiang-jin-rong-chang-jing-de-zhong-wen-zi-ran-yu-yan-li-jie-yin-qing
    • https://www.youtube.com/watch?v=o7DMrWVMZCA
  2. Outline
    • Introduction to Conversational Agent
    • Rasa framework
      • Introduction
      • Simple tutorial
    • Build Rasa custom components based on ckiptagger
      • Motivation and mechanism
      • Introduction to ckiptagger
      • Components implementation
    • Demo
  3. Levels of Conversational Assistants
    (axes: Intelligence vs. Time)
    • Level 1: Notifications
    • Level 2: FAQs
    • Level 3: Contextual Assistants
    • Level 4: Personal Assistants
    • Level 5: Autonomous Organization of Assistants
  4. Notifications
    Example: "Hao Yuan Chen (circlelychen) opened !417 update nlu corpus in drd-ctbc/nlp"; "Hao Yuan Chen pushed to branch master of drd-ctbc/nlp (Compare changes)"
    • Push notification
    • Users passively receive notifications
    • Nothing happens when users reply

    FAQs
    User: I need to renew my renter's insurance. How much will it be?
    Bot: You can calculate your renewal price on our website: https://xxx.bbb/site
    • Users get a response by asking a simple question
    • Simulate basic FAQ pages with a search tool
    • Most common type of assistant right now
  5. Contextual Assistants
    User: I need to renew my renter's insurance. How much will it be?
    Bot: I'd be happy to check for you. Firstly, are you still living in the same apartment?
    User: Yes
    Bot: Great, so just confirming it's 980 sq ft?
    User: Yes
    Bot: Thanks! Your renewal rate from Sept. 1st onwards would be $10/month
    • Let users chat freely and at any length, as they would expect
    • Understand and respond to multiple follow-up questions
    • Context: what the user has said before is expected knowledge
  6. Outline
    • Introduction to Conversational Agent
    • Rasa framework
      • Introduction
      • Simple tutorial
    • Build Rasa custom components based on ckiptagger
      • Motivation and mechanism on customizing Rasa NLU pipeline
      • Introduction to ckiptagger
      • Components implementation
    • Demo
  7. Performance Benchmark on Conversational Agents
    "Benchmarking Natural Language Understanding Services for building Conversational Agents"
    Xingkun Liu, Arash Eshghi, Pawel Swietojanski, Verena Rieser
    Accepted by IWSDS 2019
  8. Rasa Modules
    Rasa: open source machine learning tools for developers, for conversational AI
    • NLU: natural language processing
      • Intent classification
      • Entity extraction
    • Core: contextual dialogue management
      • Decision making
      • Context tracking
    • Integration: knowledge base interaction
      • Knowledge base interaction
      • Language generation
    • Channel: user interface
      • Message
      • Voice
  9. Rasa Dialogue Flow
    User: I am sick. I need GP in 94310.
    (Context)
    Bot: Should we look for the earlier appointment?
  10. Project Structure
    QuickStart:
      $> pip3 install rasa[tensorflow]
      $> rasa init
    • config.yml
      • Pipeline for NLU
      • Policies for Core
    • nlu.md
      • Training data for NLU
    • stories.md
      • Training data for Core
    • domain.yml
      • Chatbot's domain
    • Actions
  11. Steps for Preparing NLU Training Data
    Step 1: Collect dialogue samples
      • Leverage the knowledge of domain experts
      • Check the most common search queries and questions
    Step 2: Define labels
      • Define intents by observing the dialogue set
      • Define entities by checking search queries
    Step 3: Compose data/nlu.md and data/domain.yml
      • Annotate samples with intents and entities in data/nlu.md
      • List intents and entities in data/domain.yml
  12. Step 1: Collect dialogue samples
    Dialogue 1:
      User: Good morning
      Bot: how can I help you?
      User: I want a british restaurant in the east part of town
      Bot: here's what I found: xxx, yyy
      User: Thank you
      Bot: Bye-bye
    Dialogue 2:
      User: hello there
      Bot: how can I help you?
      User: can you book a table in london in a expensive price range with spanish cuisine for two
      Bot: ok let me see what I can find
      …
    Dialogue 3:
      User: hey bot
      Bot: how can I help you?
      User: find me a cheap vietnamese restaurant
      Bot: where?
      User: west part of town
      Bot: for how many people?
      …
  13. Step 2: Define labels
    • im looking for an expensive restaurant in the east town
    • want something in the south side of town thats moderately priced
    • good morning / hello there / hey bot / good evening / good afternoon / hey
    • okay thank you / thank you goodbye / thanks goodbye / thank you good bye / thank you goodbye / you rock
  14. Step 2: Define labels
    • Intent: Inform
      • im looking for an expensive restaurant in the east town
      • want something in the south side of town thats moderately priced
    • Intent: Greet
      • good morning / hello there / hey bot / good evening / good afternoon / hey
    • Intent: Thankyou
      • okay thank you / thank you goodbye / thanks goodbye / thank you good bye / thank you goodbye / you rock
    • Entity: price, location
  15. Step 3: Compose data/nlu.md and data/domain.yml
    data/nlu.md:
      ## intent:greet
      - good morning
      - hello there
      …
      ## intent:thankyou
      - okay thank you
      - thank you bye
      …
      ## intent:inform
      - im looking for an [expensive](price) restaurant in the [east](location) town
      - want something in the [south](location) side of town that's [moderately](price:moderate) priced
      - what about [italian](cuisine)
      …
    data/domain.yml:
      …
      intents:
      - greet
      - thankyou
      - inform
      …
      entities:
      - location
      - price
      - cuisine
      …
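The inline annotation syntax in data/nlu.md, `[text](entity)` with an optional `:value` synonym, can be illustrated with a small parser. This is a minimal sketch for illustration only, not Rasa's own loader; the function name `parse_example` and the returned dictionary keys are assumptions.

```python
import re

# Matches [surface text](entity_type) or [surface text](entity_type:synonym_value).
ENTITY_PATTERN = re.compile(
    r"\[(?P<text>[^\]]+)\]\((?P<type>[^):]+)(?::(?P<value>[^)]+))?\)"
)


def parse_example(annotated):
    """Return (plain_text, entities) for one annotated training example."""
    plain_parts = []
    entities = []
    cursor = 0  # position in the annotated string
    offset = 0  # position in the plain string being built
    for match in ENTITY_PATTERN.finditer(annotated):
        # Copy the unannotated text that precedes this entity.
        before = annotated[cursor:match.start()]
        plain_parts.append(before)
        offset += len(before)
        surface = match.group("text")
        entities.append({
            "start": offset,
            "end": offset + len(surface),
            "entity": match.group("type"),
            # Without an explicit synonym, the value is the surface text.
            "value": match.group("value") or surface,
        })
        plain_parts.append(surface)
        offset += len(surface)
        cursor = match.end()
    plain_parts.append(annotated[cursor:])
    return "".join(plain_parts), entities


text, entities = parse_example(
    "im looking for an [expensive](price) restaurant in the [east](location) town"
)
print(text)
for entity in entities:
    print(entity)
```

The character offsets matter later: an entity extractor has to line these spans up with the tokens produced by the tokenizer.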
  16. Setup Rasa NLU Pipeline
    Step 1: Choose a built-in pipeline or configure it component by component
    Step 2: Compose config.yml

    Built-in pipelines:
      language: "en"
      pipeline: "supervised_embeddings"

      language: "en"
      pipeline: "pretrained_embeddings_spacy"

    Pipeline-embedded components (pretrained_embeddings_spacy):
      language: "en"
      pipeline:
      - name: "SpacyNLP"
      - name: "SpacyTokenizer"
      - name: "SpacyFeaturizer"
      - name: "SklearnIntentClassifier"
      - name: "CRFEntityExtractor"
      - name: "EntitySynonymMapper"

    Pipeline-embedded components (supervised_embeddings):
      language: "en"
      pipeline:
      - name: "WhitespaceTokenizer"
      - name: "RegexFeaturizer"
      - name: "CRFEntityExtractor"
      - name: "EntitySynonymMapper"
      - name: "CountVectorsFeaturizer"
        analyzer: 'word'
        token_pattern: '(?u)\b\w+\b'
      - name: "EmbeddingIntentClassifier"
  17. Pros and Cons of the Two Built-in Pipelines
    supervised_embeddings
      Pros
      • The model picks up domain-specific vocabulary
      • Supports any language that can be tokenized
      Cons
      • Plenty of training data required
      • Longer training time
    pretrained_embeddings_spacy
      Pros
      • Better model performance with less training data
      • Faster training time
      Cons
      • Relies on pre-trained word embeddings
      • No domain-specific vocabulary
  18. Rasa NLU Pipeline Anatomy
    supervised_embeddings (shown twice on the slide, once highlighting the components for entity extraction and once the components for intent recognition):
      language: "en"
      pipeline:
      - name: "WhitespaceTokenizer"
      - name: "RegexFeaturizer"
      - name: "CRFEntityExtractor"
      - name: "EntitySynonymMapper"
      - name: "CountVectorsFeaturizer"
        analyzer: 'word'
        token_pattern: '(?u)\b\w+\b'
      - name: "EmbeddingIntentClassifier"
  19. Rasa NLU Demo Commands
    # NLU model training
    $> rasa train nlu
    # NLU model inference via shell
    $> rasa shell nlu
    # Evaluation based on confusion matrix
    $> rasa test
  20. Steps for Preparing Rasa Core Training Data
    Step 1: Design dialogue flow
    Step 2: Design dialogue flow in terms of intents and entities
    Step 3: Compose data/stories.md and domain.yml

    User: Good morning
    Bot: how can I help you?
    User: I want a british restaurant in the east part of town
    Bot: what kind of cuisine would you like?
    User: afghan food
    Bot: for how many people?
    …
  21. Steps for Preparing Rasa Core Training Data
    Same steps and dialogue as slide 20, now with the story in data/stories.md:
      ## story_1
      * greet
        - utter_ask_howcanhelp
      * inform{"location": "london"}
        - utter_ask_cuisine
      * inform{"cuisine": "spanish"}
        - utter_ask_numpeople
      …
  22. Steps for Preparing Rasa Core Training Data
    Same steps, dialogue, and story as slide 21, now with the response templates in domain.yml:
      templates:
        utter_ask_cuisine:
        - text: "what kind of cuisine would you like?"
        utter_ask_howcanhelp:
        - text: "how can I help you?"
        utter_ask_numpeople:
        - text: "for how many people?"
        …
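The story format alternates user steps (`* intent{entities}`) and bot steps (`- action`). The following minimal sketch reads such a story into a list of steps; it is not Rasa's story reader, and the function name `parse_story` and the tuple layout are assumptions made for illustration.

```python
import json
import re

# A user step: "*", an intent name, and an optional JSON object of entities.
USER_STEP = re.compile(r"^\*\s*(?P<intent>\w+)\s*(?P<entities>\{.*\})?\s*$")


def parse_story(lines):
    """Turn story lines into (speaker, name, entities) tuples."""
    steps = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("##"):
            continue  # skip blank lines and the story title
        user = USER_STEP.match(line)
        if user:
            entities = json.loads(user.group("entities") or "{}")
            steps.append(("user", user.group("intent"), entities))
        elif line.startswith("-"):
            steps.append(("bot", line[1:].strip(), {}))
    return steps


story = [
    "## story_1",
    "* greet",
    "  - utter_ask_howcanhelp",
    '* inform{"location": "london"}',
    "  - utter_ask_cuisine",
    '* inform{"cuisine": "spanish"}',
    "  - utter_ask_numpeople",
]
steps = parse_story(story)
for step in steps:
    print(step)
```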
  23. Steps for Rasa Core Policy
    Step 1: Design dialogue flow
    Step 2: Choose built-in or custom policies
    Step 3: Compose config.yml
      policies:
      - name: "FallbackPolicy"
      - name: "MappingPolicy"
      - name: "KerasPolicy"
      - name: "MemoizationPolicy"
    Priority on rule matching (the higher number wins when confidences tie):
    • Priority 5: FormPolicy
    • Priority 4: FallbackPolicy, TwoStageFallbackPolicy
    • Priority 3: MemoizationPolicy, AugmentedMemoizationPolicy
    • Priority 2: MappingPolicy
    • Priority 1: EmbeddingPolicy, KerasPolicy, SklearnPolicy
    https://rasa.com/docs/rasa/core/policies/
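When several policies each propose a next action, the most confident proposal is used and the policy priority only decides ties. The sketch below is a simplified illustration of that selection rule, not Rasa's ensemble code; the `Prediction` record, the policy names, and the numbers are all hypothetical.

```python
from typing import List, NamedTuple


class Prediction(NamedTuple):
    """Hypothetical record of one policy's proposal for the next action."""
    policy: str
    action: str
    confidence: float
    priority: int


def pick_action(predictions: List[Prediction]) -> Prediction:
    """Pick the most confident prediction; priority only breaks ties."""
    return max(predictions, key=lambda p: (p.confidence, p.priority))


# Hypothetical proposals from three policies for the same dialogue turn.
predictions = [
    Prediction("policy_a", "utter_ask_cuisine", 0.80, priority=1),
    Prediction("policy_b", "utter_ask_numpeople", 1.00, priority=3),
    Prediction("policy_c", "utter_ask_howcanhelp", 1.00, priority=2),
]
winner = pick_action(predictions)
print(winner.policy, winner.action)
```

Here `policy_b` and `policy_c` tie on confidence, so the higher priority of `policy_b` decides the outcome.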
  24. Rasa Demo Commands
    # Model training (NLU and Core)
    $> rasa train
    # Talk to the trained assistant via shell
    $> rasa shell
  25. Outline
    • Introduction to Conversational Agent
    • Rasa framework
      • Introduction
      • Simple tutorial
    • Build Rasa custom components based on ckiptagger
      • Motivation and mechanism
      • Introduction to ckiptagger
      • Components implementation
    • Demo
  26. Motivation
    Challenge: poor performance on Mandarin
    Reasons:
    • Mandarin has no whitespace delimiter between words, so word segmentation is a hard problem
    • Token-based feature extraction for Mandarin requires specialized techniques
    Related talk: 寫個能幹的中文斷詞系統 (Writing a capable Chinese word segmentation system) @ PyCon Taiwan 2019
    https://tw.pycon.org/2019/en-us/events/talk/852751430614778081/
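The segmentation problem is easy to reproduce with plain Python. The sketch below applies whitespace splitting and the `token_pattern` from the pipeline configuration to an English sentence from the sample dialogues and to a Mandarin sentence. The Mandarin sentence is a hypothetical example, and the snippet only mimics what whitespace-based components see; it does not call Rasa.

```python
import re

# The token_pattern used by CountVectorsFeaturizer in the pipeline config.
TOKEN_PATTERN = re.compile(r"(?u)\b\w+\b")

english = "find me a cheap vietnamese restaurant"
mandarin = "幫我找便宜的越南餐廳"  # hypothetical Mandarin request

# Whitespace splitting: fine for English, a single blob for Mandarin.
english_ws = english.split()
mandarin_ws = mandarin.split()

# The regex sees no word boundary between adjacent CJK characters either.
english_re = TOKEN_PATTERN.findall(english)
mandarin_re = TOKEN_PATTERN.findall(mandarin)

print(len(english_ws), len(mandarin_ws))
print(len(english_re), len(mandarin_re))
```

Both approaches treat the whole Mandarin sentence as one token, which leaves the downstream featurizers and classifiers with nothing useful to work on.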
  27. Proposed Solution
    Modify supervised_embeddings with custom components:
      language: "zh"
      pipeline:
      - name: "CKIPTokenizer"
      - name: "CKIPFeaturizer"
      - name: "CRFEntityExtractor"
      - name: "EntitySynonymMapper"
      - name: "CountVectorsFeaturizer"
        analyzer: 'word'
        token_pattern: '(?u)\b\w+\b'
      - name: "EmbeddingIntentClassifier"
    • Create pipeline for zh
    • Implement tokenizer for Mandarin
    • Implement featurizer for Mandarin tokens
    Goal: bring Mandarin capability to Rasa-based chatbots
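A tokenizer for Mandarin has to do more than return a word list: entity annotations are character spans, so each token also needs its start offset in the original text. The sketch below shows that alignment step under stated assumptions: the `Token` type and `words_to_tokens` function are hypothetical, and the word list is written by hand where a real component would obtain it from a segmenter such as ckiptagger.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """Hypothetical token record: the word and its start offset in the text."""
    text: str
    offset: int


def words_to_tokens(text: str, words: List[str]) -> List[Token]:
    """Align segmented words back to character offsets in the original text."""
    tokens = []
    cursor = 0
    for word in words:
        if not word.strip():
            continue  # skip whitespace-only segments
        start = text.find(word, cursor)
        if start < 0:
            raise ValueError(f"segment {word!r} not found after position {cursor}")
        tokens.append(Token(word, start))
        cursor = start + len(word)
    return tokens


text = "我想找便宜的餐廳"
# Hand-written segmentation for illustration; a segmenter would supply this.
words = ["我", "想", "找", "便宜", "的", "餐廳"]
tokens = words_to_tokens(text, words)
for token in tokens:
    print(token.offset, token.text)
```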
  28. Mechanism
    Step 1: Create component skeleton
    Step 2: Define attributes
      • name
      • provides
      • requires
      • defaults
      • language_list
    Step 3: Implement required methods
      • __init__
      • train
      • process
      • persist
      • load

      from … import Component

      class Tokenizer(Component):
          """Build our own custom component"""
          # define attributes
          …
          # implement methods
          …
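To make the three steps concrete without depending on an installed Rasa version, the sketch below uses a local stand-in base class. Everything here is an assumption for illustration: `Component` is not Rasa's class, messages are plain dictionaries, the method signatures are simplified, and `SketchTokenizer` falls back to character-level splitting where the talk's component would call ckiptagger.

```python
from typing import Any, Callable, Dict, List, Optional


class Component:
    """Local stand-in for the framework base class (NOT Rasa's own class)."""

    defaults: Dict[str, Any] = {}

    def __init__(self, component_config: Optional[Dict[str, Any]] = None) -> None:
        # Merge user-supplied config over the class defaults.
        self.component_config = {**self.defaults, **(component_config or {})}


class SketchTokenizer(Component):
    """Hypothetical tokenizer following the attribute and method lists above."""

    # Step 2: attributes
    name = "SketchTokenizer"
    provides = ["tokens"]
    requires: List[str] = []
    defaults = {"model_path": "./data"}
    language_list = ["zh"]

    # Step 3: methods
    def __init__(
        self,
        component_config: Optional[Dict[str, Any]] = None,
        segmenter: Optional[Callable[[str], List[str]]] = None,
    ) -> None:
        super().__init__(component_config)
        # Any callable mapping text to a word list; character split as fallback.
        self.segmenter = segmenter or list

    def train(self, training_examples: List[Dict[str, Any]]) -> None:
        # Tokenize every training example so later components can use them.
        for example in training_examples:
            example["tokens"] = self.segmenter(example["text"])

    def process(self, message: Dict[str, Any]) -> None:
        # Tokenize one incoming message at inference time.
        message["tokens"] = self.segmenter(message["text"])

    def persist(self, file_name: str, model_dir: str) -> Dict[str, Any]:
        # Nothing is learned here, so only the config needs to be stored.
        return {"component_config": self.component_config}

    @classmethod
    def load(cls, meta: Dict[str, Any], segmenter=None) -> "SketchTokenizer":
        # Rebuild the component from what persist() returned.
        return cls(meta.get("component_config"), segmenter)


tokenizer = SketchTokenizer()
message = {"text": "餐廳"}
tokenizer.process(message)
print(message["tokens"])
```

Passing the segmenter in as a callable keeps the skeleton testable without loading a model; a real component would construct its segmenter from the configured model path instead.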
  29. ckiptagger
    • Deep learning based tool for
      • Word segmentation
      • POS tagging
      • Named entity recognition
    • Pure Python package with simple dependencies
      • TensorFlow >= 1.13
    • GPL-3.0 license
    https://github.com/ckiplab/ckiptagger
  30. Demo
    • Intent recognition
      • CKIPTokenizer (customized)
      • EmbeddingIntentClassifier (built-in)
    • Named entity recognition
      • CKIPTokenizer (customized)
      • CKIPFeaturizer (customized)
    Rasa NLU + Rasa Core + rukip + Google Assistant