Building a Chinese Natural Language Understanding Engine for Financial Scenarios

Slide 1

Slide 1 text

Building a Chinese Natural Language Understanding Engine for Financial Scenarios
Data R&D Center, 陳皓遠

Slide 2

Slide 2 text

About me
• Member of the AI group, CTBC Data R&D Center
• Past experience in
  • Cyber security and defense industry
  • Smartphone industry
• Familiar with
  • Machine learning
  • Natural language processing
  • Software development
  • Cloud native architecture design

Slide 3

Slide 3 text

Team
• The CTBC Data R&D Center AI group was founded in 2018
• The AI group is composed of data scientists and software developers
• Our mission is to realize AI-based solutions in banking scenarios
• We currently focus on
  • Computer Vision (CV)
  • Natural Language Processing (NLP)
Retrieved from https://www.ithome.com.tw/news/131697

Slide 4

Slide 4 text

Achievement
NLP
• Pluto: A Deep Learning based Watchdog for Anti Money Laundering
  • First vertical AI paradigm in the RegTech field at CTBC globally
  • Reduces daily human effort on adverse media screening by 67%
  • Publication: https://www.aclweb.org/anthology/W19-5515
CV
• NIST Face Recognition Vendor Test (FRVT)
  • Ranked 35th globally
  • Ranked 2nd among Taiwanese companies
• X-ATM for fraud avoidance

Rank  Company                     Country  FRR
10    SenseTime (商湯)            China    0.0092
18    Face++ (曠視)               China    0.0145
26    CyberLink (訊連)            Taiwan   0.0195
29    Tencent Deepsea (騰訊)      China    0.0215
35    CTBC BANK (中國信託)        Taiwan   0.0250
39    Gorilla Technology (大猩猩)  Taiwan   0.0291
55    Kneron Inc. (耐能)          Taiwan   0.0902

Slide 5

Slide 5 text

Outline • Background • Proposed Solution • Evaluation • Prototype • Conclusion

Slide 6

Slide 6 text

Digital channels play an important role
Source: 遠見雜誌 (Global Views Monthly), 2018數位金融力調查 (2018 Digital Finance Survey). Retrieved from https://www.gvm.com.tw/article.html?id=54981

Slide 7

Slide 7 text

Abundant Platforms for Conversational Assistants • Messaging platforms • Google Home • Amazon Echo

Slide 8

Slide 8 text

Eno, your Capital One dialogue assistant • A task-oriented dialogue system • Chats in natural language • Available on Amazon Alexa

Slide 9

Slide 9 text

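Motivation
• Realize a task-oriented dialogue system in Mandarin, on heterogeneous conversational platforms, to serve customers in banking scenarios
Prerequisite
• A natural language understanding (NLU) module
  • Intent recognition (IR)
  • Named entity recognition (NER)
Example: 美元定存六個月期的利率是多少 ("What is the interest rate on a six-month US dollar time deposit?")
• Intent
  • 查詢利率 (query interest rate)
• Entity
  • 幣別 (currency): 美元 (US dollar)
  • 帳戶類型 (account type): 定存 (time deposit)
  • 期數 (duration): 六個月 (six months)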
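The NLU output for the example utterance can be pictured as a small data structure. This is a minimal sketch only: the class names `Entity` and `NLUResult` and the helper `slots` are hypothetical and do not come from the talk. The intent and entity values are the ones on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str   # entity type, e.g. "幣別" (currency)
    value: str  # surface text found in the utterance


@dataclass
class NLUResult:
    text: str                      # the raw user utterance
    intent: str                    # output of intent recognition (IR)
    entities: list = field(default_factory=list)  # output of NER


def slots(result):
    """Flatten the entities into a name -> value mapping for the dialogue logic."""
    return {entity.name: entity.value for entity in result.entities}


# The example from the slide, written out by hand (not produced by a model).
result = NLUResult(
    text="美元定存六個月期的利率是多少",
    intent="查詢利率",
    entities=[
        Entity("幣別", "美元"),
        Entity("帳戶類型", "定存"),
        Entity("期數", "六個月"),
    ],
)

print(result.intent, slots(result))
```

A downstream dialogue manager would branch on `result.intent` and read the parameters it needs from `slots(result)`.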

Slide 10

Slide 10 text

Outline • Background • Proposed Solution • Evaluation • Prototype • Conclusion

Slide 11

Slide 11 text

Key Components in NLU
• Intent Recognizer
  • Classification problem
• Named Entity Extractor
  • Sequence labeling problem
Approach
• Preprocessing: tokenizer, POS tagger
• Vectorization: embeddings
• Modeling: supervised learning methods
  • Deep Neural Networks (DNN)
  • Conditional Random Field (CRF)
  • Recurrent Neural Network (RNN)

Slide 12

Slide 12 text

Data Preparation
• Intent dataset
  • 1016 samples over 3 distinct classes
  • 試算匯兌 (currency exchange calculation), 查詢存款利率 (deposit interest rate inquiry), 查詢台外幣餘額 (TWD and foreign-currency balance inquiry)
• Named entity dataset
  • 977 samples over 6 distinct entities
  • amount, money, duration, currency, acnt_type, timestamp
Special thanks to 數位金融處 and 個金數位營運處

Slide 13

Slide 13 text

Intent Classification Techniques
• Preprocessing
  • Tokenization (ckiptagger)
• Feature extraction
  • Bag of Words (scikit-learn)
• Model
  • Deep Neural Network (DNN) (tensorflow)
Pipeline: feature engineering → model training
Example (word count encoding)
• Text: 現在美金一年期定存是多少 ("How much is a one-year US dollar time deposit right now?")
• Tokens: 現在 / 美金 / 一年期 / 定存 / 是 / 多少
• Vocabulary: [ "現在", "台幣", "美金", "日圓", "一年期", "定存", "是", "多少" ]
• Feature vector: [ 1 , 0 , 1 , 0 , 1 , 1 ]
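Word count encoding can be sketched in a few lines of plain Python. The talk uses scikit-learn for this step, so the hand-rolled `count_encode` function is a stand-in for illustration only. The token list is written by hand rather than produced by ckiptagger. With the eight-word vocabulary from the slide the result has eight positions, and its first six agree with the six values shown on the slide.

```python
from collections import Counter

# Vocabulary in the order listed on the slide.
VOCABULARY = ["現在", "台幣", "美金", "日圓", "一年期", "定存", "是", "多少"]


def count_encode(tokens, vocabulary=VOCABULARY):
    """Return one count per vocabulary word; out-of-vocabulary tokens are ignored."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Hand-segmented tokens for 現在美金一年期定存是多少 (assumed tokenizer output).
tokens = ["現在", "美金", "一年期", "定存", "是", "多少"]
vector = count_encode(tokens)
print(vector)  # [1, 0, 1, 0, 1, 1, 1, 1]
```

Vectors of this form are the fixed-length input that a feed-forward classifier consumes, one vector per utterance.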

Slide 14

Slide 14 text

Named Entity Recognition Techniques
Model I: CRF for Word-Level Feature
• Preprocessing
  • Tokenization (ckiptagger)
  • POS tagging (ckiptagger)
• Feature extraction
  • Text and POS tags within context
• Model
  • Conditional Random Field (CRF) (scikit-learn)
Pipeline: feature engineering → model training
Example (context window: 3 tokens)
• Text: 現在美金一年期定存是多少
• Tokens: 現在(Nd) 美金(Na) 一年期(Na) 定存(Na) 是(SHI) 多少(Neqa)
• Feature vector: …, ( -1:現在, -1:Nd, 0:美金, 0:Na, 1:一年期, 1:Na ), …
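The context-window features can be sketched as one dictionary per token. The function names and the `offset:word` / `offset:pos` key format are assumptions made for illustration. Dictionaries of this shape are what CRF toolkits such as sklearn-crfsuite accept as input. The slide names only scikit-learn, so the exact CRF library used is not confirmed here.

```python
def token_features(tagged, i, window=1):
    """Collect word and POS features for token i and its neighbours.

    window=1 looks one token to each side, i.e. a 3-token context window.
    Positions that fall outside the sentence are simply skipped.
    """
    features = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tagged):
            word, pos = tagged[j]
            features[f"{offset}:word"] = word
            features[f"{offset}:pos"] = pos
    return features


def sentence_features(tagged, window=1):
    """One feature dictionary per token, in sentence order."""
    return [token_features(tagged, i, window) for i in range(len(tagged))]


# (word, POS) pairs as shown on the slide.
tagged = [("現在", "Nd"), ("美金", "Na"), ("一年期", "Na"),
          ("定存", "Na"), ("是", "SHI"), ("多少", "Neqa")]
features = sentence_features(tagged)
print(features[1])
```

The entry for 美金 carries the same six values as the slide's feature tuple: the previous, current, and next word together with their POS tags.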

Slide 15

Slide 15 text

Named Entity Recognition Techniques
Model II: Bi-LSTM-CRF for Word-Level Embedding
• Preprocessing
  • Tokenization (ckiptagger)
• Model
  • Embedding layer (keras)
  • Long Short-Term Memory (LSTM) layer (keras)
  • CRF layer (keras)
Pipeline: embedding learning → features learning → model training
Example
• Text: 現在美金一年期定存是多少
• Tokens: 現在 / 美金 / 一年期 / 定存 / 是 / 多少
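A sequence-labeling model emits one tag per token, and those tags still have to be turned back into entities. The sketch below assumes the common BIO tagging scheme, which the slides do not name. The tokenization and the tag sequence are written by hand for illustration and are not output from ckiptagger or from the Bi-LSTM-CRF model. The entity names come from the dataset slide.

```python
def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs.

    "B-x" starts an entity of type x, "I-x" continues it, and "O" is outside
    any entity. An "I-x" tag that does not continue an open entity of the
    same type is treated like "O".
    """
    entities = []
    label, parts = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if label is not None:
                entities.append((label, "".join(parts)))
            label, parts = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            parts.append(token)
        else:
            if label is not None:
                entities.append((label, "".join(parts)))
            label, parts = None, []
    if label is not None:
        entities.append((label, "".join(parts)))
    return entities


# Hypothetical tokenization and tags for 美元定存六個月期的利率是多少.
tokens = ["美元", "定存", "六", "個月", "期", "的", "利率", "是", "多少"]
tags = ["B-currency", "B-acnt_type", "B-duration", "I-duration",
        "O", "O", "O", "O", "O"]
print(decode_bio(tokens, tags))
```

Tokens are joined without spaces because Chinese text carries no word separators.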

Slide 16

Slide 16 text

Outline • Background • Proposed Solution • Evaluation • Prototype • Conclusion

Slide 17

Slide 17 text

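Evaluation Methodology
Confusion Matrix
                 Actual Yes            Actual No
Predicted Yes    True Positive (TP)    False Positive (FP)
Predicted No     False Negative (FN)   True Negative (TN)
Reference: https://en.wikipedia.org/wiki/Confusion_matrix
Metrics
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{F1\text{-}Score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$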
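The three metrics follow directly from the confusion-matrix counts. A minimal sketch, using made-up counts (90 true positives, 10 false positives, 30 false negatives) that are not taken from the talk's experiments:

```python
def precision(tp, fp):
    """Share of predicted positives that are correct: TP / (TP + FP)."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Share of actual positives that were found: TP / (TP + FN)."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical counts for a single class.
tp, fp, fn = 90, 10, 30
p = precision(tp, fp)   # 90 / 100 = 0.90
r = recall(tp, fn)      # 90 / 120 = 0.75
f1 = f1_score(p, r)     # 1.35 / 1.65, about 0.818
print(round(p, 2), round(r, 2), round(f1, 3))
```

Because F1 is a harmonic mean, it sits closer to the lower of the two inputs, which is why a class with high precision but weak recall still scores a modest F1.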

Slide 18

Slide 18 text

Evaluation: Intent classification (Precision, Recall, F1-Score)

Intent                                                 Precision  Recall  F1-Score
查詢台外幣餘額 (TWD and foreign-currency balance inquiry)  0.91       0.94    0.93
查詢存款利率 (deposit interest rate inquiry)             0.98       0.95    0.96
試算匯兌 (currency exchange calculation)                0.97       0.96    0.96

Slide 19

Slide 19 text

Evaluation: Named Entity Recognition (Precision)

Entity                   CRF   BiLSTM+CRF
幣別 (currency)          0.79  0.98
期數 (duration)          0.75  0.93
時間點 (timestamp)       0.85  0.80
帳戶類型 (account type)  0.74  0.89
錢 (money)               0.55  0.81
金額 (amount)            0.90  0.96

Slide 20

Slide 20 text

Evaluation: Named Entity Recognition (Recall)

Entity                   CRF   BiLSTM+CRF
幣別 (currency)          0.82  0.95
期數 (duration)          0.55  0.67
時間點 (timestamp)       0.78  0.79
帳戶類型 (account type)  0.67  0.80
錢 (money)               0.52  0.89
金額 (amount)            0.94  0.72

Slide 21

Slide 21 text

Evaluation: Named Entity Recognition (F1-Score)

Entity                   CRF   BiLSTM+CRF
幣別 (currency)          0.81  0.97
期數 (duration)          0.64  0.71
時間點 (timestamp)       0.82  0.72
帳戶類型 (account type)  0.68  0.84
錢 (money)               0.52  0.88
金額 (amount)            0.92  0.82

Slide 22

Slide 22 text

Outline • Background • Proposed Solution • Evaluation • Prototype • Conclusion

Slide 23

Slide 23 text

Prototype • Conversational AI with the Rasa framework: https://github.com/RasaHQ/rasa • NLU

Slide 24

Slide 24 text

Prototype: Why Rasa?
Rasa characteristics
• Own our data
  • Preserve privacy
  • Do not hand data over to big tech companies
• Open source
  • Transparency
  • Community support
• Extensible architecture
  • Task-oriented dialogue architecture
  • Customizable components
CTBC strategy
• Customize Mandarin-based components
• Integration with core technology
• Compliance with security and regulation
• Customized scenarios
• Ownership of core technology

Slide 25

Slide 25 text

Prototype
• Intent recognition
  • CKIP Tokenizer (customized)
  • EmbeddingIntentClassifier (built-in)
• Named Entity Recognition
  • CKIP Tokenizer (customized)
  • Bi-LSTM-CRF for Word-Level Embedding (customized)

Slide 26

Slide 26 text

Prototype Demo

Slide 27

Slide 27 text

Outline • Background • Proposed Solution • Evaluation • Prototype • Conclusion

Slide 28

Slide 28 text

Conclusion: Summary
• NLU is a key module in task-oriented dialogue systems
• The intent recognizer and the entity extractor are the key components of NLU, built with machine learning techniques and annotated data
• DNN-based models generally perform better than the traditional method, but not on every task
• Rasa, an open-source framework, supports developing a conversational assistant from scratch

Slide 29

Slide 29 text

Conclusion: What's next
• Transfer learning by initializing from pre-trained word embeddings
• Word-based embeddings vs. character-based embeddings
• Model engineering

Slide 30

Slide 30 text

Q&A