Building a Chinese Natural Language Understanding Engine for Financial Scenarios (打造面向金融場景的中文自然語言理解引擎)

Data R&D Center (數據研究發展中心), 陳皓遠

About me
• Member of the AI group, CTBC Data R&D Center
• Past experience in
  • Cyber security and defense industry
  • Smartphone industry
• Familiar with
  • Machine learning
  • Natural language processing
  • Software development
  • Cloud native architecture design

Team
• The CTBC Data R&D Center AI group was founded in 2018
• The AI group is composed of data scientists and software developers
• Our mission is to realize AI-based solutions in banking scenarios
• We currently focus on
  • Computer Vision (CV)
  • Natural Language Processing (NLP)
Retrieved from https://www.ithome.com.tw/news/131697

Achievement
NLP
• Pluto: A Deep Learning based Watchdog for Anti Money Laundering
  • First Vertical AI paradigm in the RegTech field in CTBC globally
  • Reduces daily human effort on adverse media screening by 67%
  • Publication: https://www.aclweb.org/anthology/W19-5515
CV
• NIST Face Recognition Vendor Test (FRVT)
  • Rank 35th globally
  • Rank 2nd in Taiwan industry
• X-ATM for fraud avoidance

Rank  Company                         Country  FRR
10    SenseTime (商湯)                China    0.0092
18    Face++ (曠視)                   China    0.0145
26    CyberLink (訊連)                Taiwan   0.0195
29    Tencent Deepsea (騰訊)          China    0.0215
35    CTBC BANK (中國信託)            Taiwan   0.0250
39    Gorilla Technology (大猩猩)     Taiwan   0.0291
55    Kneron Inc. (耐能)              Taiwan   0.0902

Outline
• Background
• Proposed Solution
• Evaluation
• Prototype
• Conclusion

Digital channels play an important role
Source: 遠見雜誌 (Global Views Monthly), 2018 Digital Finance Capability Survey (2018數位金融力調查). Retrieved from https://www.gvm.com.tw/article.html?id=54981

Abundant platforms for conversational assistants
• Messaging platforms
• Google Home
• Amazon Echo

Eno, your Capital One dialogue assistant
• A task-oriented dialogue system
• Chats in natural language
• Realized on Amazon Alexa

Motivation
• Realize a task-oriented dialogue system in Mandarin on heterogeneous conversational platforms to serve customers in banking scenarios

Prerequisite
• A natural language understanding (NLU) engine
  • Intent recognition (IR)
  • Named entity recognition (NER)

Example: 美元定存六個月期的利率是多少 ("What is the interest rate on a six-month USD time deposit?")
• Intent (IR)
  • 查詢利率 (query interest rate)
• Entity (NER)
  • 幣別 (currency): 美元 (USD)
  • 帳戶類型 (account type): 定存 (time deposit)
  • 期數 (term): 六個月 (six months)
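
To make the expected NLU output concrete, the example utterance can be written down as a small data structure. This is only a sketch: the `NLUResult` class and the `entity_spans` helper are hypothetical names introduced for illustration, not part of the presented system.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class NLUResult:
    """Hypothetical container for one parsed utterance."""
    text: str
    intent: str
    entities: Dict[str, str] = field(default_factory=dict)


def entity_spans(result: NLUResult) -> Dict[str, Tuple[int, int]]:
    """Locate each entity value in the utterance as (start, end) character offsets."""
    spans = {}
    for name, value in result.entities.items():
        start = result.text.find(value)
        if start != -1:
            spans[name] = (start, start + len(value))
    return spans


# The utterance, intent, and entities are taken from the slide.
example = NLUResult(
    text="美元定存六個月期的利率是多少",
    intent="查詢利率",
    entities={"幣別": "美元", "帳戶類型": "定存", "期數": "六個月"},
)
spans = entity_spans(example)
```

Intent recognition fills the single `intent` field, while named entity recognition fills `entities`, which is why the two are treated as separate problems in the rest of the deck.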

Key Components in NLU
• Intent Recognizer: classification problem
• Named Entity Extractor: sequence labeling problem
Preprocessing
• Tokenizer
• POS tagger
Modeling
• Embeddings (vectorization)
• Supervised learning method
Approach
• Deep Neural Networks (DNN)
• Conditional Random Field (CRF)
• Recurrent Neural Network (RNN)

Data Preparation
• Intent dataset
  • 1016 samples over 3 distinct classes
  • 試算匯兌 (currency exchange calculation), 查詢存款利率 (deposit interest rate inquiry), 查詢台外幣餘額 (TWD and foreign currency balance inquiry)
• Named entity dataset
  • 977 samples over 6 distinct entities
  • amount, money, duration, currency, acnt_type, timestamp
Many thanks to 數位金融處 and 個金數位營運處

Intent Classification Techniques
• Preprocessing
  • Tokenization (ckiptagger)
• Feature extraction
  • Bag of Words (scikit-learn)
• Model
  • Deep Neural Network (DNN) (tensorflow)

Pipeline: feature engineering, then model training
• Text: 現在美金一年期定存是多少 ("How much is the one-year USD time deposit now?")
• Tokens: 現在 / 美金 / 一年期 / 定存 / 是 / 多少
• Vocabulary: [ "現在", "台幣", "美金", "日圓", "一年期", "定存", "是", "多少" ]
• Feature vector (word count encoding): [ 1, 0, 1, 0, 1, 1, 1, 1 ]
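
The word count encoding above can be sketched in a few lines. The slide names scikit-learn for this step; the version below is a library-free illustration of the same idea, and the function name `bag_of_words` is an assumption made for this example. It takes already tokenized text (tokenization itself is done by ckiptagger in the deck and is not reproduced here).

```python
from collections import Counter
from typing import List

# Vocabulary as listed on the slide.
VOCABULARY = ["現在", "台幣", "美金", "日圓", "一年期", "定存", "是", "多少"]


def bag_of_words(tokens: List[str], vocabulary: List[str] = VOCABULARY) -> List[int]:
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)
    # Counter returns 0 for words that do not occur in the utterance.
    return [counts[word] for word in vocabulary]


# Tokens of the slide's example sentence 現在美金一年期定存是多少.
tokens = ["現在", "美金", "一年期", "定存", "是", "多少"]
vector = bag_of_words(tokens)
```

The resulting vector has one position per vocabulary word, so its length is fixed regardless of sentence length, which is what lets a feed-forward DNN consume it directly.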

Named Entity Recognition Techniques
Model I: CRF for Word-Level Feature
• Preprocessing
  • Tokenization (ckiptagger)
  • POS tagging (ckiptagger)
• Feature extraction
  • Text and POS tags within context
• Model
  • Conditional Random Field (CRF) (scikit-learn)

Pipeline: feature engineering, then model training
• Text: 現在美金一年期定存是多少
• Tokens: 現在(Nd) 美金(Na) 一年期(Na) 定存(Na) 是(SHI) 多少(Neqa)
• Feature vector (context window: 3 tokens): …, ( -1:現在, -1:Nd, 0:美金, 0:Na, 1:一年期, 1:Na ), …
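
The context-window features can be illustrated with a short function. This is a minimal sketch assuming a 3-token window means one token on each side of the current one; the function name and the `offset:word` / `offset:pos` key format are hypothetical choices for this example, loosely following the `-1:現在, -1:Nd, …` notation on the slide.

```python
from typing import Dict, List, Tuple


def token_features(tagged: List[Tuple[str, str]], i: int, window: int = 1) -> Dict[str, str]:
    """Collect the word and POS tag of token i and of its neighbours within the window."""
    features = {}
    for offset in range(-window, window + 1):
        j = i + offset
        # Neighbours that fall outside the sentence are simply skipped.
        if 0 <= j < len(tagged):
            word, pos = tagged[j]
            features[f"{offset}:word"] = word
            features[f"{offset}:pos"] = pos
    return features


# (token, POS tag) pairs from the slide's example.
tagged = [("現在", "Nd"), ("美金", "Na"), ("一年期", "Na"),
          ("定存", "Na"), ("是", "SHI"), ("多少", "Neqa")]

# One feature dict per token; a CRF toolkit would consume this list
# together with one label per token.
sentence_features = [token_features(tagged, i) for i in range(len(tagged))]
```

Each token is described by its own text and tag plus those of its neighbours, which gives the CRF the local context it needs to decide, for example, that 美金 is a currency.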

Named Entity Recognition Techniques
Model II: Bi-LSTM-CRF for Word-Level Embedding
• Preprocessing
  • Tokenization (ckiptagger)
• Model
  • Embedding layer (keras)
  • Long Short-Term Memory (LSTM) layer (keras)
  • CRF layer (keras)

Pipeline: embedding learning, features learning, model training
• Text: 現在美金一年期定存是多少
• Tokens: 現在 / 美金 / 一年期 / 定存 / 是 / 多少
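
In a Bi-LSTM-CRF, the recurrent layers produce a score for every tag at every token, and the CRF layer picks the best tag sequence as a whole. CRF layers commonly do this with Viterbi decoding. The sketch below illustrates only that decoding step in plain Python; it is not the Keras layer used in the deck, and the tags and scores in the example are made up for illustration.

```python
from typing import Dict, List


def viterbi_decode(emissions: List[Dict[str, float]],
                   transitions: Dict[str, Dict[str, float]],
                   tags: List[str]) -> List[str]:
    """Return the tag sequence with the highest total score.

    emissions[t][tag] is the per-token score (from the Bi-LSTM in the real model);
    transitions[prev][cur] is the score of moving from tag prev to tag cur.
    """
    # best[t][tag]: best score of any path that ends in `tag` at position t.
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []  # back[t - 1][tag]: previous tag on that best path
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical two-tag example in which the transition penalty changes the result.
tags = ["A", "B"]
emissions = [{"A": 1.0, "B": 0.9}, {"A": 1.0, "B": 0.0}]
transitions = {"A": {"A": -5.0, "B": 0.0}, "B": {"A": 0.0, "B": 0.0}}
decoded = viterbi_decode(emissions, transitions, tags)
```

Taking the highest-scoring tag at each token independently would give A, A here; the penalty on the A to A transition makes B, A the best sequence overall, which is the kind of label dependency the CRF layer adds on top of the LSTM.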

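Evaluation Methodology
Confusion matrix (reference: https://en.wikipedia.org/wiki/Confusion_matrix)

               Actual Yes            Actual No
Predicted Yes  True Positive (TP)    False Positive (FP)
Predicted No   False Negative (FN)   True Negative (TN)

Metrics
• Precision: $\mathrm{Precision} = \frac{TP}{TP + FP}$
• Recall: $\mathrm{Recall} = \frac{TP}{TP + FN}$
• F1-Score: $F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$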
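
The three metrics follow directly from the confusion matrix counts. The function below is a minimal sketch of the standard definitions; the function name and the counts in the example are hypothetical and are not taken from the evaluation in this deck.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion matrix counts."""
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts: 9 correct predictions, 1 false alarm, 3 misses.
precision, recall, f1 = precision_recall_f1(tp=9, fp=1, fn=3)
```

True negatives do not appear in any of the three formulas, which is why these metrics stay informative even when most tokens carry no entity at all.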

Evaluation
Precision and Recall: Intent classification

Intent                                                          Precision  Recall  F1-Score
查詢台外幣餘額 (TWD and foreign currency balance inquiry)       0.91       0.94    0.93
查詢存款利率 (deposit interest rate inquiry)                    0.98       0.95    0.96
試算匯兌 (currency exchange calculation)                        0.97       0.96    0.96

Evaluation
Precision, Recall, and F1-Score: Named Entity Recognition

                        Precision          Recall             F1-Score
Entity                  CRF   BiLSTM+CRF   CRF   BiLSTM+CRF   CRF   BiLSTM+CRF
幣別 (currency)         0.79  0.98         0.82  0.95         0.81  0.97
期數 (duration)         0.75  0.93         0.55  0.67         0.64  0.71
時間點 (timestamp)      0.85  0.80         0.78  0.79         0.82  0.72
帳戶類型 (acnt_type)    0.74  0.89         0.67  0.80         0.68  0.84
錢 (money)              0.55  0.81         0.52  0.89         0.52  0.88
金額 (amount)           0.90  0.96         0.94  0.72         0.92  0.82

Prototype
Conversational AI with the Rasa framework (NLU component): https://github.com/RasaHQ/rasa

Prototype: Why Rasa?
Rasa characteristics
• Own Our Data
  • Preserve privacy
  • Do not hand data over to big tech companies
• Open source
  • Transparency
  • Community support
• Extensible Architecture
  • Task-oriented dialogue architecture
  • Customizable components
CTBC strategy
• Customize Mandarin-based components
• Integration with core technology
• Compliance with security and regulation
• Customized scenarios
• Ownership of core technology

Prototype
• Intent recognition
  • CKIP Tokenizer (customized)
  • EmbeddingIntentClassifier (built-in)
• Named Entity Recognition
  • CKIP Tokenizer (customized)
  • Bi-LSTM-CRF for Word-Level Embedding (customized)

Prototype Demo

Conclusion: Summary
• NLU is a key module in task-oriented dialogue systems
• The intent recognizer and the entity extractor are the key components for realizing NLU with machine learning techniques and annotated data
• DNNs generally perform better than traditional methods, but not on every task
• Rasa, an open-source framework, supports developing conversational assistants from scratch

Conclusion: What's next
• Transfer learning by initializing with pre-trained word embeddings
• Word-based embeddings vs. char-based embeddings
• Model engineering
