
Building a Mandarin Natural Language Understanding Engine for Financial Scenarios

circlelychen
September 20, 2019


Natural language understanding (NLU) is the core of building a question-answering system. For an intelligent agent to help people accomplish a wide range of goals through conversation, it needs an NLU engine capable of intent recognition and entity recognition.

In this talk, the speaker, a backend engineer stepping into natural language processing for the first time, introduces the NLP techniques and corresponding machine learning methods needed to implement an NLU module, then shares how the goal was reached on top of the open-source Rasa NLU project and the advantages of adopting it.

Finally, beyond validating the models statistically, the NLU is paired with Rasa Core to build an intelligent conversational agent, which is compared against commercial intelligent customer-service systems used in financial scenarios.

Transcript

1. About me
• Member of AI group, CTBC Data R&D Center
• Past experience
  • Cyber security and defense industry
  • Smartphone industry
• Familiar with
  • Machine learning
  • Natural language processing
  • Software development
  • Cloud-native architecture design
2. Team
• CTBC Data R&D Center AI group was founded in 2018
• The AI group is composed of data scientists and software developers
• Our mission is to realize AI-based solutions in banking scenarios
• We currently focus on
  • Computer Vision (CV)
  • Natural Language Processing (NLP)
Retrieved from https://www.ithome.com.tw/news/131697
3. Achievement
NLP
• Pluto: a deep-learning-based watchdog for anti-money laundering
  • First vertical AI paradigm in the RegTech field in CTBC globally
  • Reduces daily human effort on adverse media screening by 67%
  • Publication: https://www.aclweb.org/anthology/W19-5515
CV
• NIST Face Recognition Vendor Test (FRVT)
  • Ranked 35th globally
  • Ranked 2nd among Taiwanese industry
• X-ATM for fraud avoidance

Rank | Company                    | Country | FRR
10   | SenseTime (商湯)           | China   | 0.0092
18   | Face++ (曠視)              | China   | 0.0145
26   | CyberLink (訊連)           | Taiwan  | 0.0195
29   | Tencent Deepsea (騰訊)     | China   | 0.0215
35   | CTBC BANK (中國信託)       | Taiwan  | 0.0250
39   | Gorilla Technology (大猩猩) | Taiwan  | 0.0291
55   | Kneron Inc. (耐能)         | Taiwan  | 0.0902
4. Eno, your Capital One dialogue assistant
• A task-oriented dialogue system
• Chats in natural language
• Realized on Amazon Alexa
5. Motivation
• Realize a task-oriented dialogue system in Mandarin on heterogeneous conversational platforms to serve customers in banking scenarios
Prerequisite
• A natural language understanding (NLU) module providing
  • Intent recognition (IR)
  • Named entity recognition (NER)
Example (a structured output sketch follows below):
  Input: 美元定存六個月期的利率是多少 ("What is the interest rate on a six-month USD time deposit?")
  • Intent: 查詢利率 (interest-rate inquiry)
  • Entities
    • 幣別 (currency): 美元 (USD)
    • 帳戶類型 (account type): 定存 (time deposit)
    • 期數 (duration): 六個月 (six months)
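To make the prerequisite concrete, here is a minimal sketch of the structured result such an NLU engine could return for the example utterance. The field names are illustrative, loosely modeled on the shape of a Rasa NLU parse result, and are not the exact schema used in the talk.

```python
# Hypothetical NLU output for the utterance above: one intent plus a list of
# extracted entities. Confidence value and field names are illustrative only.
nlu_result = {
    "text": "美元定存六個月期的利率是多少",
    "intent": {"name": "查詢利率", "confidence": 0.97},
    "entities": [
        {"entity": "currency",  "value": "美元"},
        {"entity": "acnt_type", "value": "定存"},
        {"entity": "duration",  "value": "六個月"},
    ],
}
```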
6. Key Components in NLU
Approach
• Intent recognizer: a classification problem
• Named entity extractor: a sequence-labeling problem
Pipeline: preprocessing (tokenizer, POS tagger), then vectorization/embeddings, then modeling with supervised learning methods:
• Deep Neural Networks (DNN)
• Conditional Random Field (CRF)
• Recurrent Neural Network (RNN)
7. Data Preparation
• Intent dataset: 1016 samples over 3 distinct classes
  • 試算匯兌 (exchange calculation), 查詢存款利率 (deposit-rate inquiry), 查詢台外幣餘額 (TWD/foreign-currency balance inquiry)
• Named entity dataset: 977 samples over 6 distinct entities
  • amount, money, duration, currency, acnt_type, timestamp
(Sample shapes are sketched below.)
Great acknowledgment to 數位金融處 and 個金數位營運處
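The talk does not show its annotation format, so the following is a hypothetical shape for one sample from each dataset, using the entity names listed above and BIO tags for the NER data.

```python
# Illustrative annotated samples (not the talk's actual annotation format).
intent_sample = {"text": "現在美金一年期定存利率是多少", "intent": "查詢存款利率"}

ner_sample = {
    "tokens": ["現在", "美金", "一年期", "定存", "利率", "是", "多少"],
    # BIO tags, one per token; B- marks the beginning of an entity span.
    "tags":   ["O", "B-currency", "B-duration", "B-acnt_type", "O", "O", "O"],
}
```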
8. Intent Classification Techniques
• Preprocessing
  • Tokenization (ckiptagger)
• Feature extraction
  • Bag of words (scikit-learn)
• Model
  • Deep Neural Network (DNN) (tensorflow)
Feature engineering and model training, step by step (see the end-to-end sketch below):
  Text: 現在美金一年期定存是多少
  Tokens: 現在 / 美金 / 一年期 / 定存 / 是 / 多少
  Vocabulary: [“現在”, “台幣”, “美金”, “日圓”, “一年期”, “定存”, “是”, “多少”]
  Feature vector (word-count encoding): [1, 0, 1, 0, 1, 1, 1, 1]
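A minimal end-to-end sketch of this pipeline, assuming the libraries named on the slide: ckiptagger for segmentation, scikit-learn's CountVectorizer for the word-count encoding, and a small tf.keras network as the DNN. Layer sizes, epochs, and the toy training data are illustrative, not values from the talk.

```python
# Sketch: ckiptagger word segmentation -> bag-of-words counts -> dense network.
import tensorflow as tf
from ckiptagger import WS
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

ws = WS("./data")  # path to the downloaded CKIP model files (assumed)

texts = ["現在美金一年期定存是多少", "台幣換美金的匯率是多少"]
labels = ["查詢存款利率", "試算匯兌"]

# ckiptagger segments a batch of sentences into token lists.
token_lists = ws(texts)

# Feed pre-tokenized input straight into CountVectorizer (word-count encoding).
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
X = vectorizer.fit_transform(token_lists).toarray()

encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# A small DNN classifier over the bag-of-words features.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(X.shape[1],)),
    tf.keras.layers.Dense(len(encoder.classes_), activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=10)
```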
9. Named Entity Recognition Techniques, Model I: CRF with Word-Level Features
• Preprocessing
  • Tokenization (ckiptagger)
  • POS tagging (ckiptagger)
• Feature extraction
  • Token text and POS tags within a context window of 3 tokens
• Model
  • Conditional Random Field (CRF) (scikit-learn)
Feature engineering, step by step (see the sketch below):
  Text: 現在美金一年期定存是多少
  Tokens (POS): 現在(Nd) 美金(Na) 一年期(Na) 定存(Na) 是(SHI) 多少(Neqa)
  Features for 美金: …, (-1:現在, -1:Nd, 0:美金, 0:Na, +1:一年期, +1:Na), …
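A minimal sketch of the word-level CRF features above. Note that scikit-learn itself ships no CRF; this sketch assumes the commonly paired sklearn-crfsuite package, and the feature-dictionary keys are illustrative.

```python
# Sketch: context-window features (token text + POS) feeding a linear-chain CRF.
import sklearn_crfsuite

def token_features(tokens, pos_tags, i, window=1):
    """Text and POS tag of the current token plus its neighbors."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"{offset}:word"] = tokens[j]
            feats[f"{offset}:pos"] = pos_tags[j]
    return feats

tokens   = ["現在", "美金", "一年期", "定存", "是", "多少"]
pos_tags = ["Nd", "Na", "Na", "Na", "SHI", "Neqa"]
tags     = ["O", "B-currency", "B-duration", "B-acnt_type", "O", "O"]

# X is a list of sentences; each sentence is a list of per-token feature dicts.
X = [[token_features(tokens, pos_tags, i) for i in range(len(tokens))]]
y = [tags]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X, y)
print(crf.predict(X))
```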
10. Named Entity Recognition Techniques, Model II: Bi-LSTM-CRF with Word-Level Embeddings
• Preprocessing
  • Tokenization (ckiptagger)
• Model (embedding learning, then feature learning, then model training; see the sketch below)
  • Embedding layer (keras)
  • Long Short-Term Memory (LSTM) layer (keras)
  • CRF layer (keras)
  Text: 現在美金一年期定存是多少
  Tokens: 現在 / 美金 / 一年期 / 定存 / 是 / 多少
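A minimal sketch of the Bi-LSTM-CRF tagger, assuming the keras-contrib CRF layer (Keras proper had no CRF layer at the time of the talk, so "CRF layer (keras)" most plausibly refers to an add-on like this). Vocabulary size, tag count, sequence length, and layer dimensions are assumptions, not values from the talk.

```python
# Sketch: token ids -> learned embeddings -> Bi-LSTM features -> CRF tags.
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM
from keras_contrib.layers import CRF

VOCAB_SIZE = 5000   # number of distinct tokens (assumed)
N_TAGS     = 13     # 6 entities * B/I tags + O (assumed)
MAX_LEN    = 20     # padded sequence length (assumed)

model = Sequential()
# Learn a word-level embedding for each token id.
model.add(Embedding(VOCAB_SIZE, 128, input_length=MAX_LEN, mask_zero=True))
# A bidirectional LSTM reads the sentence in both directions.
model.add(Bidirectional(LSTM(64, return_sequences=True)))
# The CRF output layer models tag-transition constraints (e.g., I- after B-).
crf = CRF(N_TAGS)
model.add(crf)
model.compile(optimizer="adam", loss=crf.loss_function, metrics=[crf.accuracy])
model.summary()
```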
11. Evaluation Methodology
Metrics: Precision, Recall, F1-Score

Confusion matrix (reference: https://en.wikipedia.org/wiki/Confusion_matrix):

              | Actual Yes          | Actual No
Predicted Yes | True Positive (TP)  | False Positive (FP)
Predicted No  | False Negative (FN) | True Negative (TN)

Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F1-Score  = 2 * Precision * Recall / (Precision + Recall)
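For reference, these three metrics can be computed per class or macro-averaged with scikit-learn; the labels below are illustrative, not the talk's data.

```python
# Macro-averaged precision/recall/F1 over illustrative intent predictions.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["查詢存款利率", "試算匯兌", "查詢台外幣餘額", "試算匯兌"]
y_pred = ["查詢存款利率", "查詢存款利率", "查詢台外幣餘額", "試算匯兌"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```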
12. Evaluation: Intent Classification (Precision, Recall, F1-Score)

Intent         | Precision | Recall | F1-Score
查詢台外幣餘額 | 0.91      | 0.94   | 0.93
查詢存款利率   | 0.98      | 0.95   | 0.96
試算匯兌       | 0.97      | 0.96   | 0.96
13. Evaluation: Named Entity Recognition (Precision)

Entity               | CRF  | BiLSTM+CRF
幣別 (currency)      | 0.79 | 0.98
期數 (duration)      | 0.75 | 0.93
時間點 (timestamp)   | 0.85 | 0.80
帳戶類型 (acnt_type) | 0.74 | 0.89
錢 (money)           | 0.55 | 0.81
金額 (amount)        | 0.90 | 0.96
14. Evaluation: Named Entity Recognition (Recall)

Entity               | CRF  | BiLSTM+CRF
幣別 (currency)      | 0.82 | 0.95
期數 (duration)      | 0.55 | 0.67
時間點 (timestamp)   | 0.78 | 0.79
帳戶類型 (acnt_type) | 0.67 | 0.80
錢 (money)           | 0.52 | 0.89
金額 (amount)        | 0.94 | 0.72
15. Evaluation: Named Entity Recognition (F1-Score)

Entity               | CRF  | BiLSTM+CRF
幣別 (currency)      | 0.81 | 0.97
期數 (duration)      | 0.64 | 0.71
時間點 (timestamp)   | 0.82 | 0.72
帳戶類型 (acnt_type) | 0.68 | 0.84
錢 (money)           | 0.52 | 0.88
金額 (amount)        | 0.92 | 0.82
16. Prototype: Why Rasa?
Rasa characteristics
• Open source
  • Transparency
  • Community support
• Own our data
  • Preserve privacy
  • Do not hand data over to big tech companies
• Extensible architecture
  • Task-oriented dialogue architecture
  • Customizable components
CTBC strategy
• Customize Mandarin-based components
• Ownership of and integration with core technology
• Compliance with security and regulation
• Customized scenarios
17. Prototype
• Intent recognition
  • CKIP Tokenizer (customized)
  • EmbeddingIntentClassifier (built-in)
• Named entity recognition
  • CKIP Tokenizer (customized)
  • Bi-LSTM-CRF with word-level embeddings (customized)
(A sketch of such a customized tokenizer component follows.)
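A minimal sketch of what a customized CKIP tokenizer could look like as a Rasa NLU component, written against the Rasa 1.x component API that was current in 2019 (module paths moved between minor versions). The class name and model path are illustrative; this is not the prototype's actual component.

```python
# Sketch: a custom Rasa 1.x NLU component that segments Mandarin with ckiptagger.
from ckiptagger import WS
from rasa.nlu.components import Component
from rasa.nlu.tokenizers import Token

class CKIPTokenizer(Component):
    """Segments Mandarin text with ckiptagger and attaches tokens to the message."""

    name = "CKIPTokenizer"
    provides = ["tokens"]

    def __init__(self, component_config=None):
        super().__init__(component_config)
        self.ws = WS("./data")  # path to CKIP model files (assumed)

    def train(self, training_data, cfg, **kwargs):
        # Tokenize every training example so downstream components see tokens.
        for example in training_data.training_examples:
            example.set("tokens", self._tokenize(example.text))

    def process(self, message, **kwargs):
        # Tokenize incoming messages at inference time.
        message.set("tokens", self._tokenize(message.text))

    def _tokenize(self, text):
        words = self.ws([text])[0]
        tokens, offset = [], 0
        for word in words:
            offset = text.index(word, offset)
            tokens.append(Token(word, offset))
            offset += len(word)
        return tokens
```

In a Rasa 1.x pipeline config, such a component would be referenced by its module path ahead of the entity extractor and intent classifier, so that both consume the same CKIP segmentation.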
18. Conclusion: Summary
• NLU is a key module in task-oriented dialogue systems
• The intent recognizer and entity extractor are the key components for realizing NLU with machine learning techniques and annotated data
• DNNs generally perform better than traditional methods, but not on every task
• Rasa, powered by open source, offers a framework for building a conversational assistant from scratch
19. Conclusion: What's Next
• Transfer learning based on pre-trained word-embedding initialization
• Word-based embeddings vs. char-based embeddings
• Model engineering
  20. Q&A