Learning from Uncurated Regular Expressions for Semantic Type Classification - SiMoD 2023

Slide 1

Slide 1 text

Learning from Uncurated Regular Expressions for Semantic Type Classification
Michael Mior

Slide 2

Slide 2 text

• Semantic Type Classification
• Uncurated Regular Expressions
• Applications
• Future Work

Slide 3

Slide 3 text

• Semantic Type Classification
• Uncurated Regular Expressions
• Applications
• Future Work

Slide 4

Slide 4 text

Semantic Types
▸ Semantic types result in a more detailed schema than syntactic types
▸ Types should capture semantic similarity between sets of values
▸ Semantic types are much more difficult to infer than syntactic types

Slide 5

Slide 5 text

Example
Person: Dorothy Vaughn, Alan Turing, Josef Burg
Settlement: Pittsburgh, Edinburgh, Johannesburg, Salzburg

Slide 6

Slide 6 text

• Semantic Type Classification
• Uncurated Regular Expressions
• Applications
• Future Work

Slide 7

Slide 7 text

Why not regexes?
▸ Regular expressions can’t match many semantic type classes (e.g. cities)
▸ Dirty data could cause a valid regular expression not to match
▸ Even when regexes might work, they can be complicated to write

Slide 8

Slide 8 text

Learning regexes
▸ Regular expressions can be learned from sets of examples
▸ Learning can be slow and expressions can be too precise
Source: Sherlock: A Deep Learning Approach to Semantic Data Type Detection, Hulsebos et al., 2019

Slide 9

Slide 9 text

Learning regexes
Source: Sherlock: A Deep Learning Approach to Semantic Data Type Detection, Hulsebos et al., 2019

Slide 10

Slide 10 text

Regex101
▸ Users can enter and test regular expressions of various flavors
▸ Importantly, there is a “library” of saved regular expressions
▸ These expressions can be anything

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

Why use regexes anyway?
▸ Multiple regular expressions can extract useful information
▸ Each regular expression does not have to be perfect
▸ We don’t necessarily have to write them!

Slide 14

Slide 14 text

Multiple regex matching
https?:\/\/(www\\.)?
(([a-z\-]+)(?:\.com|\.vn|\.co.uk))
\.(jpg|jpeg)
http://example.com/image.jpg
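To make the idea on this slide concrete, here is a minimal Python sketch that applies the three patterns, copied as written, to the example URL and reports which part of the value each one matches. The helper name `matched_spans` and the use of `re.search` (match anywhere in the value) are assumptions for illustration, not something the slides specify.

```python
import re

# The three patterns from the slide, copied verbatim (not cleaned up).
patterns = [
    r"https?:\/\/(www\\.)?",
    r"(([a-z\-]+)(?:\.com|\.vn|\.co.uk))",
    r"\.(jpg|jpeg)",
]

value = "http://example.com/image.jpg"


def matched_spans(patterns, value):
    """Return the first substring each pattern matches in value, or None."""
    result = []
    for pattern in patterns:
        m = re.search(pattern, value)  # assumption: substring matching
        result.append(m.group(0) if m else None)
    return result


spans = matched_spans(patterns, value)
print(spans)  # ['http://', 'example.com', '.jpg']
```

Each pattern captures a different piece of the value (scheme, domain, file extension), which is why several imperfect expressions together can carry more information than any single one.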

Slide 15

Slide 15 text

• Semantic Type Classification
• Uncurated Regular Expressions
• Applications
• Future Work

Slide 16

Slide 16 text

Regexes for learning
▸ Even if we don’t know why each regex is meaningful, we assume it is
▸ Whether a set of values matches a regex gives some semantic information

Slide 17

Slide 17 text

Feature extraction
Input data:
▸ 3-5, K-2, 9-12, …
▸ PK, K-12, PK, …
▸ KG - 05, PK - 05, KG - 05, …
Regexes:
▸ \d+(.*)
▸ [a-zA-Z]([a-zA-Z]|[0-9])+
▸ …
Features:
▸ < 0.8, 0.4, .. >
▸ < 0.84, 0.72, .. >
▸ < 1.0, 1.0, .. >
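One plausible reading of this slide is that each feature is the fraction of a column's values matched by one regex. The sketch below follows that reading. The function name `regex_features` and the choice of `re.search` over a full match are assumptions, and because the slide elides most of the input values, the numbers here are not expected to reproduce the vectors shown.

```python
import re


def regex_features(values, patterns):
    """For each pattern, compute the fraction of values it matches."""
    compiled = [re.compile(p) for p in patterns]
    if not values:
        return [0.0] * len(compiled)
    return [
        sum(1 for v in values if rx.search(v)) / len(values)
        for rx in compiled
    ]


# The two regexes visible on the slide.
patterns = [r"\d+(.*)", r"[a-zA-Z]([a-zA-Z]|[0-9])+"]

# Only the three visible values of the last input column.
column = ["KG - 05", "PK - 05", "KG - 05"]
features = regex_features(column, patterns)
print(features)  # [1.0, 1.0]
```

The resulting vector has one entry per regex, so every column maps to a fixed-length vector regardless of how many values it holds.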

Slide 18

Slide 18 text

Supervised learning
▸ Extract regex features from a labeled set of column values
▸ Train a neural network to classify based on the extracted features
▸ Currently, the weighted F-1 score is 0.75, compared to 0.9 for Sherlock
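For readers unfamiliar with the metric, the sketch below computes a weighted F-1 score under the usual definition: per-class F-1, averaged with weights proportional to each class's number of true examples. Whether the talk used exactly this definition is an assumption, and the labels and predictions are made up for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F-1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Hypothetical labels, not data from the talk.
y_true = ["city", "city", "city", "name"]
y_pred = ["city", "city", "name", "name"]
score = weighted_f1(y_true, y_pred)
print(round(score, 4))  # 0.7667
```

Here "city" has F-1 of 0.8 with weight 3/4 and "name" has F-1 of 2/3 with weight 1/4, so frequent classes dominate the final number.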

Slide 19

Slide 19 text

Supervised learning

Slide 20

Slide 20 text

Supervised learning

Slide 21

Slide 21 text

Unsupervised clustering
▸ Calculate feature vectors from each set of data values
▸ Cluster using DBSCAN
▸ Predefined semantic classes are not required
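The clustering step can be sketched with a small, dependency-free DBSCAN over feature vectors. This is a textbook version of the algorithm, not the implementation used in the talk. The `eps` and `min_pts` values and the toy vectors are assumptions chosen only so that the example has two clear groups and one outlier.

```python
import math


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: join, do not expand
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # core point: keep expanding
                queue.extend(nbrs)
    return labels


# Hypothetical regex feature vectors for seven columns.
vectors = [(1.0, 1.0), (0.95, 1.0), (1.0, 0.9),
           (0.1, 0.0), (0.0, 0.1), (0.05, 0.05),
           (0.5, 0.5)]
labels = dbscan(vectors, eps=0.2, min_pts=2)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

No class labels enter the computation: columns with similar match profiles end up in the same cluster, and a column unlike any other is reported as noise.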

Slide 22

Slide 22 text

Unsupervised clustering

Slide 23

Slide 23 text

Unsupervised clustering
▸ Consider all semantic classes where our F-1 score is above 0.9
▸ Clustering without supervision yields an F-1 score of 0.86
▸ This suggests minimal information is lost by discarding labels

Slide 24

Slide 24 text

• Semantic Type Classification
• Uncurated Regular Expressions
• Applications
• Future Work

Slide 25

Slide 25 text

Future Work
▸ More unsupervised learning
▸ Expanded regular expression corpus
▸ Explaining regular expression matches
▸ Learning expression fragments
▸ Blocking for entity resolution

Slide 26

Slide 26 text

Questions? cs.rit.edu/~dataunitylab

Slide 27

Slide 27 text

Feature Selection
▸ Many regexes are probably redundant
▸ For each class, select the 10 regexes with the highest Shapley values
▸ Retrain the model with fewer regexes
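The selection step might look like the sketch below, which assumes a per-class importance score for every regex has already been computed (for example, a mean absolute Shapley value). Taking the union of each class's top-k regexes is an assumption about how the per-class lists are combined. The function name `select_regexes`, the class names, and the scores are hypothetical.

```python
def select_regexes(shap_by_class, k=10):
    """Return the sorted union of the top-k regex indices for each class.

    shap_by_class maps a class label to a list with one importance
    score per regex (same regex order for every class).
    """
    selected = set()
    for scores in shap_by_class.values():
        ranked = sorted(range(len(scores)),
                        key=lambda i: scores[i], reverse=True)
        selected.update(ranked[:k])
    return sorted(selected)


# Hypothetical scores for four regexes and two classes.
shap_by_class = {
    "city": [0.30, 0.01, 0.20, 0.00],
    "grade": [0.02, 0.25, 0.22, 0.01],
}
kept = select_regexes(shap_by_class, k=2)
print(kept)  # [0, 1, 2]
```

Because classes can share their most informative regexes, the union is usually much smaller than k times the number of classes.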

Slide 28

Slide 28 text

Feature Selection
(\d\D?){10}\b
\d\d.-?.\\w.-?.\d\d
\b[^ aeiouAEIOU][a-zA-Z]*\b

Slide 29

Slide 29 text

Feature Selection
▸ Final model has 264 regexes
▸ Weighted performance is approximately the same
▸ Some classes have worse performance