Regular expression in Python - From zero to hero

Slide 1

Slide 1 text

游騰林 (TENG-LIN YU) | Mail: [email protected] | NCCU - Data Visualization Workshop
Regular Expression in Python - From zero to hero
游騰林 (TENG-LIN YU), 2023-03-27

Slide 2

Slide 2 text

游騰林 (TENG-LIN YU)
Data Scientist, Cathay United Bank
• Facebook: 游騰林
• Talks: PyCon, MOPCON, COSCUP, etc.
• PyPI: facebook-crawler
• Mail: [email protected]
Today’s tutorial
• SpeakerDeck:
• Github: DataScience/24_RegularExpression
• Colab:

Slide 3

Slide 3 text

Goals
• Understand the tire company's business needs
• Learn the concepts behind regular expressions
• Write regular expressions hands-on to meet the tire company's needs

Slide 4

Slide 4 text

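Context
• In 2021, Company M discovered that its partner dealers were selling tires below Company M's suggested retail price
• For the dealers this wins more business, but it can undermine market prices
• To keep this from happening again, Company M decided to monitor its dealers' sales behavior
  • Internally: detect dealers who undercut prices and disrupt the market
  • Externally: track market prices for tires and assess the gap between list prices and market prices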

Slide 5

Slide 5 text

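How to solve this problem?
• As a data scientist, how would you help Company M meet this need?
  • Data collection: use web crawlers to collect data automatically from the major auction sites
  • Data analysis: parse the unstructured text with regular expressions to extract tire specifications
• Data collection tasks
  • Which websites should be covered?
  • Build a web crawler suited to each site's structure
• Data analysis tasks
  • Product names, specifications, brands, and similar fields vary widely across sites and need further cleanup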

Slide 6

Slide 6 text

Collecting data
• Company M has already built the web crawlers, which periodically scrape the major auction sites
• The data comes from 8 websites and covers 8,125 tire listings in total

Slide 7

Slide 7 text

Parsing unstructured text data
• Extract the key product specifications from the product name
• These include brand, quantity, width, aspect ratio, diameter, unit price, and so on

Slide 8

Slide 8 text

Open Google Colab
• Open Colab
• Load the tire dataset
  • tlyu0419/DataScience: My Data Science Note (github.com)
• Demo

Slide 9

Slide 9 text

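What is Regular Expression?
• Regular expressions are a general-purpose text-processing technique available across programming languages
• They let us replace rigid, rule-by-rule logic with concise code, making it easy and efficient to split, transform, match, parse, and extract the data we need from text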

Slide 10

Slide 10 text

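Applications 1 – Email checking
• Check whether an account address is valid
  • Starts with a combination of letters and digits
  • Contains an @ sign
  • Followed by letter-and-digit combinations, repeated 2-4 times
• Check passwords (those dreaded validation rules)
  • At least 8 characters long
  • Must contain uppercase letters, lowercase letters, digits, and symbols
  • The same character may not repeat more than 3 times
  • Letters or digits may not run consecutively ...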

Slide 11

Slide 11 text

Applications 2 – Feature Extraction
• Job-bank listings dataset
• Extract from the text
  • Company name
  • Mail
  • Phone number
  • Required programming languages
  • …

Slide 12

Slide 12 text

Applications 3 – Web Crawler
• Facebook post/comment data
• Extract from the text
  • Author
  • Time
  • Post content
  • Reactions: Like, Haha, Love, Sad, etc.
  • Share count
  • Comment count
  • …

Slide 13

Slide 13 text

Applications 4 – Parsing SQL
• Check the tables in the project

Slide 14

Slide 14 text

Applications 5 – Market monitoring
• Product names contain the brand, tread pattern, specifications, quantity, and so on
• Parse the data with regular expressions

Slide 15

Slide 15 text

How to use Regular Expression?

Slide 16

Slide 16 text

Basic concepts
• Break an e-mail address down into its components and translate each one into a regular expression
Ref: re — Regular expression operations
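As a rough illustration of breaking an e-mail address into parts, the sketch below follows the simplified rule from the "Email checking" slide: letters and digits, an @ sign, then 2-4 letter-and-digit groups. Reading "repeated 2-4 times" as dot-separated domain labels is an assumption, and the names `EMAIL_PATTERN` and `looks_like_email` are hypothetical. Real-world address validation is considerably looser than this.

```python
import re

# Local part: letters/digits; then "@"; then 2-4 dot-separated labels.
# The dot separator is an assumed reading of the slide's rule.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9]+@[A-Za-z0-9]+(?:\.[A-Za-z0-9]+){1,3}")


def looks_like_email(text):
    """Return True when the whole string fits the simplified pattern."""
    return EMAIL_PATTERN.fullmatch(text) is not None
```

Using `fullmatch` rather than `search` means the entire string has to fit the pattern, which is what a validity check normally wants.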

Slide 17

Slide 17 text

Character Classes
• \b: Matches the empty string, but only at the beginning or end of a word.
• \s: Matches Unicode whitespace characters.
• \w: Matches Unicode word characters.
• \d: Matches any decimal digit.
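A small sketch of the four escapes above, applied to the example product name from the quiz slide. The variable names are illustrative only.

```python
import re

name = "米其林 PRIMACY 4 215-55-17四入組安靜舒適輪胎"

# \d+ : runs of decimal digits (the Chinese numeral 四 is not a decimal digit)
digit_runs = re.findall(r"\d+", name)

# \w+ : runs of word characters; with str patterns this includes Chinese characters
word_runs = re.findall(r"\w+", name)

# \s+ : split on runs of whitespace
fields = re.split(r"\s+", name)

# \b : word boundaries, here around an all-uppercase ASCII word
upper_words = re.findall(r"\b[A-Z]+\b", name)
```

Note that `\w` treats the trailing Chinese text and the digits `17` as one run, which is why digit-specific patterns are usually the safer choice for pulling out sizes.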

Slide 18

Slide 18 text

Basic concepts – Symbols
• ^: Matches the start of the string (or of each line in multiline mode)
• ?: Matches zero or one occurrence of the preceding element
• |: Logical OR (alternation)
• $: Matches the end of the string (or of each line in multiline mode)
• +: Matches one or more occurrences of the preceding element
• \: Escapes the following character, changing how it is normally interpreted
• []: Matches any single character from the set given inside the brackets
• *: Matches zero or more occurrences of the preceding element
• !: Not an operator on its own in Python's re; negation is written as [^...] (negated set) or (?!...) (negative lookahead)
• (): Groups part of a pattern and captures the matched text for later use
• .: Matches any single character except a newline
• {}: Repeats the preceding element a given number of times, e.g. {3} or {2,4}
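The symbols above can be tried out on tire-style strings. This is a minimal sketch; the sample strings other than the quiz product name are made up for illustration.

```python
import re

name = "米其林 PRIMACY 4 215-55-17四入組安靜舒適輪胎"

# ^ and $ anchor the match to the start and end of the string
starts_with_brand = re.search(r"^米其林", name) is not None
ends_with_tire = re.search(r"輪胎$", name) is not None

# [] picks one character from a set, {} repeats the preceding element
size = re.search(r"\d{3}[-/]\d{2}[-/]\d{2}", name).group()

# | is alternation, () groups and captures
pack_numeral = re.search(r"(四|4)入", name).group(1)

# ? makes the preceding element optional
with_or_without_r = re.findall(r"\d{3}R?\d{2}", "185R14 18514")

# [^...] negates a set: anything that is neither a digit nor whitespace
non_digits = re.findall(r"[^\d\s]+", "215/65 R16")
```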

Slide 19

Slide 19 text

Quick quiz
• Exercise 1: Extract the brand from the product name
• Exercise 2: Extract the tread pattern from the product name
• Exercise 3: Extract the quantity from the product name
• Example product name: 米其林 PRIMACY 4 215-55-17四入組安靜舒適輪胎
• regex101: build, test, and debug regex
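One possible answer to the three exercises is sketched below. It assumes the layout of this particular example (brand, space, tread pattern, space, size, then the quantity) and will not generalize to every listing; the group names are hypothetical.

```python
import re

name = "米其林 PRIMACY 4 215-55-17四入組安靜舒適輪胎"

# brand: first run of non-space characters
# pattern: lazily everything up to the whitespace before the size
# qty: one character followed by 入, directly after the size
quiz = re.search(
    r"^(?P<brand>\S+)\s+(?P<pattern>.+?)\s+(?P<size>\d{3}-\d{2}-\d{2})(?P<qty>.入)",
    name,
)

brand = quiz.group("brand")
tread_pattern = quiz.group("pattern")
quantity = quiz.group("qty")
```

The lazy `.+?` matters here: it lets the tread pattern include the inner `4` of "PRIMACY 4" while still stopping before the size.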

Slide 20

Slide 20 text

Regular expression in Python
• re.findall: Return all non-overlapping matches of pattern in string, as a list of strings or tuples.
• re.search: Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object.
• re.split: Split string by the occurrences of pattern.
• re.sub: Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.
Ref: re — Regular expression operations — Python 3.11.2 documentation
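A short sketch of the four functions on tire-size strings. The inputs are examples from the slides; the variable names are illustrative.

```python
import re

name = "米其林 PRIMACY 4 215-55-17四入組安靜舒適輪胎"

# re.findall: every non-overlapping match, as a list of strings
all_numbers = re.findall(r"\d+", name)

# re.search: the first match as a match object, or None when nothing matches
first = re.search(r"\d{3}-\d{2}-\d{2}", name)
size = first.group() if first else None

# re.split: cut the string wherever the pattern matches
parts = re.split(r"[-/]", "215/65-16")

# re.sub: replace every match, here normalizing separators to "/"
normalized = re.sub(r"[-/]", "/", "215-65-16")
```

Because `re.search` returns `None` on failure, checking the result before calling `.group()` avoids an `AttributeError` on names that contain no size.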

Slide 21

Slide 21 text

Regular expression in Practice

Slide 22

Slide 22 text

Practice 1
• The top tire manufacturers are known by the following names:
  • Michelin: 米其林
  • Continental: 馬牌
  • GOODYEAR: 固特異
  • FALKEN: 飛隼
  • DUNLOP: 登祿普, 登陸普
  • PIRELLI: 倍耐力
  • Bridgestone: 普利司通
• Write a function named check_prod_brand that takes a product-name string and returns the tire manufacturer it belongs to; if the product is not from one of the top 7 manufacturers, return Other
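One way to approach Practice 1 is sketched below; it is not necessarily the workshop's reference solution. Matching the English names case-insensitively and returning the labels exactly as spelled on the slide are assumptions, and `BRAND_PATTERNS` is a hypothetical helper.

```python
import re

# One alternation per manufacturer: English name or Chinese name(s).
BRAND_PATTERNS = {
    "Michelin": r"michelin|米其林",
    "Continental": r"continental|馬牌",
    "GOODYEAR": r"goodyear|固特異",
    "FALKEN": r"falken|飛隼",
    "DUNLOP": r"dunlop|登[祿陸]普",  # character set covers both spellings
    "PIRELLI": r"pirelli|倍耐力",
    "Bridgestone": r"bridgestone|普利司通",
}


def check_prod_brand(prod_name):
    """Return the first manufacturer whose pattern occurs in the name, else 'Other'."""
    for brand, pattern in BRAND_PATTERNS.items():
        if re.search(pattern, prod_name, flags=re.IGNORECASE):
            return brand
    return "Other"
```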

Slide 23

Slide 23 text

Practice 2
• Pack sizes are known to appear under the following names:
  • 4入 (pack of 4): 四入, 4入, 4組, ４入
  • 2入 (pack of 2): 二入, 2入, 2組, 兩入
  • 1入 (pack of 1): 一入, 1入
• Write a function named check_prod_cnt that takes a product-name string and returns how many tires the product contains; when nothing matches, return 4入
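A minimal sketch for Practice 2, using one character set per pack size. `CNT_PATTERNS` is a hypothetical helper, and checking the larger packs first is a design choice of this sketch rather than something the slide specifies. Names where a size runs straight into the quantity (for example a diameter ending in 4 followed by 入) would need extra care.

```python
import re

# (label, pattern) pairs, checked in order.
CNT_PATTERNS = [
    ("4入", r"[四4４]入|4組"),
    ("2入", r"[二2兩]入|2組"),
    ("1入", r"[一1]入"),
]


def check_prod_cnt(prod_name):
    """Return the pack-size label found in the name; default to '4入'."""
    for label, pattern in CNT_PATTERNS:
        if re.search(pattern, prod_name):
            return label
    return "4入"
```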

Slide 24

Slide 24 text

Practice 3
• Tire sizes are known to appear in the following formats:
  1. 215-65-16
  2. 215/65/16
  3. 215/65-16
  4. 215/65 R16
  5. 2454520
  6. 165-R13
  7. 185R14
  8. 165/13
  9. 155R/13
  10. 155-13
• Write a function named check_prod_speck that takes a product-name string, looks for the product's size specification, and returns the matched text; when nothing matches, return np.nan
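A first, deliberately verbose sketch of Practice 3 with one pattern per format, tried in the slide's order. To stay dependency-free it returns `math.nan` where the slide asks for `np.nan`; both are the float NaN value, and swapping in `np.nan` is a one-line change. `SPEC_PATTERNS` is a hypothetical name.

```python
import math
import re

# One pattern per format. Order matters: the three-part formats must be
# tried before the two-part ones, or "215-65-16" would match as "215-65".
SPEC_PATTERNS = [
    r"\d{3}-\d{2}-\d{2}",   # 1. 215-65-16
    r"\d{3}/\d{2}/\d{2}",   # 2. 215/65/16
    r"\d{3}/\d{2}-\d{2}",   # 3. 215/65-16
    r"\d{3}/\d{2} R\d{2}",  # 4. 215/65 R16
    r"\d{7}",               # 5. 2454520
    r"\d{3}-R\d{2}",        # 6. 165-R13
    r"\d{3}R\d{2}",         # 7. 185R14
    r"\d{3}/\d{2}",         # 8. 165/13
    r"\d{3}R/\d{2}",        # 9. 155R/13
    r"\d{3}-\d{2}",         # 10. 155-13
]


def check_prod_speck(prod_name):
    """Return the first size-like substring found, or NaN when none is found."""
    for pattern in SPEC_PATTERNS:
        match = re.search(pattern, prod_name)
        if match:
            return match.group()
    return math.nan
```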

Slide 25

Slide 25 text

Practice 4
• Continuing from Practice 3: on inspection, some of the patterns can be written together
  1. 215-65-16
  2. 215/65/16
  3. 215/65-16
  4. 215/65 R16
  5. 2454520
  6. 165-R13
  7. 185R14
  8. 165/13
  9. 155R/13
  10. 155-13
• Optimize the check_prod_speck function so that its checks are condensed into just 3 conditions
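One way the ten formats might be condensed into three conditions is sketched below: three-part sizes with separators (formats 1-4), the compact seven-digit form (format 5), and two-part sizes (formats 6-10). The grouping is this sketch's own reading of the exercise, and `math.nan` again stands in for `np.nan`.

```python
import math
import re

SPEC_PATTERNS = [
    # Formats 1-4: width, separator, aspect ratio, separator or space, optional R, diameter
    r"\d{3}[-/]\d{2}[-/ ]R?\d{2}",
    # Format 5: seven digits written without separators
    r"\d{7}",
    # Formats 6-10: width, then "-", "/", "-R", "R" or "R/", then diameter
    r"\d{3}(?:[-/]R?|R/?)\d{2}",
]


def check_prod_speck(prod_name):
    """Return the first size-like substring found, or NaN when none is found."""
    for pattern in SPEC_PATTERNS:
        match = re.search(pattern, prod_name)
        if match:
            return match.group()
    return math.nan
```

The third pattern requires at least one separator or an `R` between width and diameter, so a bare five-digit number such as a price is not mistaken for a size.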

Slide 26

Slide 26 text

Practice 5
• Continuing from Practice 4: the returned specification contains the width, aspect ratio, and diameter
  1. 215-65-16
  2. 215/65/16
  3. 215/65-16
  4. 215/65 R16
  5. 2454520
  6. 165-R13
  7. 185R14
  8. 165/13
  9. 155R/13
  10. 155-13
• Optimize the check_prod_speck function so that its checks return
  • width
  • aspect_ratio
  • diameter
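Named groups are one way to return the three fields separately. In this sketch the two-part formats have no aspect ratio, so that field comes back as `None`; returning a dict (with all-`None` values when nothing matches) is an assumption about the desired output shape, not something the slide specifies.

```python
import re

FIELDS = ("width", "aspect_ratio", "diameter")

SPEC_PATTERNS = [
    # Formats 1-4: all three parts, with separators
    re.compile(r"(?P<width>\d{3})[-/](?P<aspect_ratio>\d{2})[-/ ]R?(?P<diameter>\d{2})"),
    # Format 5: seven digits, split as 3 + 2 + 2
    re.compile(r"(?P<width>\d{3})(?P<aspect_ratio>\d{2})(?P<diameter>\d{2})"),
    # Formats 6-10: width and diameter only
    re.compile(r"(?P<width>\d{3})(?:[-/]R?|R/?)(?P<diameter>\d{2})"),
]


def check_prod_speck(prod_name):
    """Return a dict with width, aspect_ratio and diameter (int or None)."""
    for pattern in SPEC_PATTERNS:
        match = pattern.search(prod_name)
        if match:
            parts = match.groupdict()
            return {k: int(parts[k]) if parts.get(k) else None for k in FIELDS}
    return {k: None for k in FIELDS}
```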

Slide 27

Slide 27 text

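Practice 6
• Before handing the data over to the client, I wanted to check it first to make sure nothing was wrong
• Count how often each tire diameter occurs
• Domain knowledge tells us that tire diameters should fall between 13 and 23 inches, yet values as high as 65 appear in the data, so something clearly went wrong during processing!
• What happened?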
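The check described above can be sketched with `collections.Counter`. The product names below are made up, and the naive pattern is only one plausible way a value like 65 could end up in the diameter column (a two-part pattern grabbing the aspect ratio of a three-part size); the actual cause in the workshop data may differ.

```python
import re
from collections import Counter

# Hypothetical product names, for illustration only
names = ["米其林 215-65-16", "馬牌 215/65/16", "固特異 155-13", "飛隼 2454520"]

# Naive pattern: width, one separator, then two digits taken as the diameter
naive = re.compile(r"(\d{3})[-/](\d{2})")

diameters = []
for prod_name in names:
    match = naive.search(prod_name)
    if match:
        diameters.append(int(match.group(2)))

# Frequency of each extracted diameter
counts = Counter(diameters)

# Anything outside the 13-23 inch range is suspicious
suspicious = {d: c for d, c in counts.items() if not 13 <= d <= 23}
```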

Slide 28

Slide 28 text

Practice 7
• Look up the records with abnormal tire diameters: why were they extracted incorrectly?
• Try tightening the regular-expression conditions and re-extract the tire sizes (restricting the allowed range)
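One way to restrict the range is to encode 13-23 directly in the diameter group and forbid a trailing digit with a negative lookahead. This is a sketch under the same pattern grouping assumed for Practice 4; `DIAMETER` and `extract_diameter` are hypothetical names.

```python
import re

# 13-19 or 20-23, and not followed by another digit
DIAMETER = r"(?P<diameter>1[3-9]|2[0-3])(?!\d)"

PATTERNS = [
    re.compile(r"\d{3}[-/]\d{2}[-/ ]R?" + DIAMETER),  # three-part sizes
    re.compile(r"\d{3}\d{2}" + DIAMETER),             # seven-digit compact form
    re.compile(r"\d{3}(?:[-/]R?|R/?)" + DIAMETER),    # two-part sizes
]


def extract_diameter(prod_name):
    """Return the diameter in inches (13-23) or None when no valid size is found."""
    for pattern in PATTERNS:
        match = pattern.search(prod_name)
        if match:
            return int(match.group("diameter"))
    return None
```

With the range built into the pattern, a fragment such as "215-65" no longer yields 65: the two-part pattern simply fails to match and the function returns `None`.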

Slide 29

Slide 29 text

Final
• What the final data delivered to the client looks like

Slide 30

Slide 30 text

More links
• regex101: build, test, and debug regex
• re — Regular expression operations — Python 3.11.2 documentation
• Regular Expression (RegEx) in Python : The Basics – Towards AI
• 常见的正则表达式及其应用 | 码农的自留地 (quxiaolei.github.io)
• 一輩子受用的 Regular Expressions -- 兼談另類的電腦學習態度 (cyut.edu.tw)