Expert regular expression (illustration)

Rationale: The target message starts with the verb DEAL or ACT, continues with the phrase IN ACCORDANCE WITH, and ends the sentence with SANCTIONS LAWS. This is the message to capture.

Regular expression:
\b(?:DEAL|ACT) (?:[\w\.\-,()\/]+\s){0,3}IN ACCORDANCE WITH (?:[\w\.\-,()\/]+\s){0,15}SANCTIONS LAWS\.?\b

• Rule-based methods can deliver excellent performance in specific domains
• For natural language text, the most intuitive and simplest kind of rule is the regular expression (though not the only one)

Step
• Convert the rules of thumb described by domain experts into corresponding regular expressions, denoted RE_1, ..., RE_i, ...
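To make the rule concrete, the pattern can be compiled and tried on the sample collecting bank message used in these slides. This is a minimal sketch: the name `RE_1` is only a label for this example, and the literal spaces after the verb and after WITH are an assumption made while reconstructing the pattern.

```python
import re

# Pattern reconstructed from the slide. The literal spaces after the verb
# alternation and after WITH are an assumption; the rest follows the slide.
RE_1 = re.compile(
    r"\b(?:DEAL|ACT) (?:[\w\.\-,()\/]+\s){0,3}"
    r"IN ACCORDANCE WITH (?:[\w\.\-,()\/]+\s){0,15}"
    r"SANCTIONS LAWS\.?\b"
)

message = (
    "XXX WILL ACT IN ACCORDANCE WITH ANY APPLICABLE XXX/YYY OR ZZZZZZZZ, "
    "WWWWW SANCTIONS LAWS. WE WERE IN THE PROCESS OF COMMUNICATING"
)

# search() returns the first match anywhere in the message, or None.
match = RE_1.search(message)
matched_text = match.group(0) if match else None
print(matched_text)
```

The two bounded groups, `{0,3}` and `{0,15}`, limit how many filler tokens may sit between the verb, the phrase, and the closing words, which keeps the rule from matching across unrelated sentences.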
Step (illustration)

Collecting bank message:
XXX WILL ACT IN ACCORDANCE WITH ANY APPLICABLE XXX/YYY OR ZZZZZZZZ, WWWWW SANCTIONS LAWS. WE WERE IN THE PROCESSING OF COMMUNICATING WITH THE REMITTING BANK OF …

Cleaned collecting bank message:
XXX WILL {RE 1} WE WERE IN THE PROCESS OF COMMUNICATING WITH THE REMITTING BANK FOR THE ADDITIONAL DOCUMENTS THAT HAVE BEEN REQUESTED. …

• Clean the collecting bank messages with RE_1, ..., RE_i, ..., replacing each matched string with a normalized placeholder
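The cleaning step can be sketched as a substitution pass over the message, one pattern at a time. The function name `normalize` and the list `PATTERNS` are hypothetical, only the first pattern comes from the slides, and its literal spaces are an assumption.

```python
import re

# Ordered list of expert patterns; position k (1-based) names the placeholder {RE k}.
# Only the first pattern comes from the slides; its literal spaces are an assumption.
PATTERNS = [
    re.compile(
        r"\b(?:DEAL|ACT) (?:[\w\.\-,()\/]+\s){0,3}"
        r"IN ACCORDANCE WITH (?:[\w\.\-,()\/]+\s){0,15}"
        r"SANCTIONS LAWS\.?\b"
    ),
]


def normalize(message, patterns=PATTERNS):
    """Replace every match of pattern k with the placeholder '{RE k}'."""
    for k, pattern in enumerate(patterns, start=1):
        placeholder = "{RE %d}" % k
        # A function replacement avoids any escape processing of the placeholder.
        message = pattern.sub(lambda _m, p=placeholder: p, message)
    return message


raw = (
    "XXX WILL ACT IN ACCORDANCE WITH ANY APPLICABLE XXX/YYY OR ZZZZZZZZ, "
    "WWWWW SANCTIONS LAWS. WE WERE IN THE PROCESS OF COMMUNICATING"
)
cleaned = normalize(raw)
print(cleaned)
```

With the pattern as written, the final `\b` leaves the sentence's closing period outside the match, so the period stays in place after the placeholder.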
Step (illustration)

Collecting bank message:
XXX WILL {RE 1} WE WERE IN THE PROCESS OF COMMUNICATING WITH THE REMITTING BANK FOR THE ADDITIONAL DOCUMENTS THAT HAVE BEEN REQUESTED.

After POS tagging and lemmatization:
XXX WILL {RE 1} WE BE IN THE PROCESS OF COMMUNICATE WITH THE REMITTING BANK FOR THE ADDITIONAL DOCUMENT THAT HAVE BE REQUEST .

Feature term set:
{RE 1} we(NOUN) be(VERB) communicate(VERB) bank(NOUN) document(NOUN) have(VERB) be(VERB) request(VERB)

• Extract feature terms from the cleaned collecting bank messages; the set of feature terms d_1, ..., d_j, ... is denoted D
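The extraction step can be illustrated with a toy example. A real pipeline would use a POS tagger and lemmatizer from an NLP library. Here a small hand-written lookup table (`TOY_LEXICON`, hypothetical, with lemmas and tags copied from the slide) stands in for both so that the sketch runs on its own. The function name `extract_features` is also an assumption.

```python
import re

# Hypothetical lookup standing in for a real POS tagger and lemmatizer:
# surface form -> (lemma, coarse POS tag), with values copied from the slide.
# Words missing from the table are treated as non-features and dropped.
TOY_LEXICON = {
    "WE": ("we", "NOUN"),
    "WERE": ("be", "VERB"),
    "COMMUNICATING": ("communicate", "VERB"),
    "BANK": ("bank", "NOUN"),
    "DOCUMENTS": ("document", "NOUN"),
    "HAVE": ("have", "VERB"),
    "BEEN": ("be", "VERB"),
    "REQUESTED": ("request", "VERB"),
}

# A token is either a regex placeholder such as {RE 1} or a run of letters.
TOKEN = re.compile(r"\{RE \d+\}|[A-Za-z]+")


def extract_features(cleaned):
    """Return placeholders plus lemma(POS) strings, in message order."""
    features = []
    for token in TOKEN.findall(cleaned):
        if token.startswith("{RE"):
            features.append(token)  # keep normalized expert patterns as-is
        elif token.upper() in TOY_LEXICON:
            lemma, pos = TOY_LEXICON[token.upper()]
            features.append("%s(%s)" % (lemma, pos))
    return features


cleaned = (
    "XXX WILL {RE 1} WE WERE IN THE PROCESS OF COMMUNICATING WITH THE "
    "REMITTING BANK FOR THE ADDITIONAL DOCUMENTS THAT HAVE BEEN REQUESTED."
)
features = extract_features(cleaned)
print(features)
```

Keeping the `{RE k}` placeholders as features alongside the lemmas lets expert-defined patterns and data-derived terms share one feature set.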
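Validity and interpretability check
Data: 6xxx messages from 2021-02

Message annotated with risk features:
FAILURE TO PROVIDE THE REQUESTED INFORMATION BY (APRIL X, 20XX) MAY RESULT IN THE FUNDS BEING REJECTED OR BLOCKED BY XXXX. WE RESERVE THE RIGHT TO CLAIM THE DELAY PAYMENT INTEREST. REGARDS
Risk features: RE 2, RESERVE, RIGHT, CLAIM

[Figure labels: 2x messages, 3x messages]

• High-score group: 0.4% of all messages
• Features from known patterns based on expert experience (regular expressions) appear only in the high-score group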
Message annotated with risk features:
CERTAIN MEASURES TO PREVENT MONEY LAUNDERING AND TRANSFERS. THIS ENQUIRY IS MADE AS A PRECAUTION IN THE INTERESTS OF FRAUD PREVENTION AND OUR MUTUAL PROTECTION WE ALSO RESERVE THE RIGHT FOR OTHER DISCREPANCY,IF ANY.
Risk features: RESERVE, RIGHT, LAUNDERING, FRAUD

[Figure labels: mid-score group, high-score group, low-score group; 6xxx messages, 2x messages, 3x messages]

• Mid-score group: 0.73% of all messages
• Features from latent patterns based on data characteristics appear in large numbers in the mid-score group
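Message annotated with risk features:
XXXXXXX CLAIMS FOR CNYYYY.YY BEING THEIR PAYFULL CHGS THEREFORE,PLS FORWARD US FOR CNYXXX.XX QUOTING OUR ABOVE REF PLEASE AVOID ALL POSSIBLE DUPLICATION AND RESPOND TO THE ATTENTION OF THE UNDERSIGNED.
Risk features: CLAIMS

[Figure labels: scores ordered from high to low; mid-score group, high-score group, low-score group; 6xxx messages, 2x messages, 3x messages]

• Low-score group: approximately 99% of all messages
• Features from latent patterns based on data characteristics appear only in small numbers in the low-score group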