Learning from Uncurated Regular Expressions for Semantic Type Classification - SiMoD 2023

Learning from Uncurated Regular Expressions for Semantic Type Classification
Michael Mior

• Semantic Type Classification
• Uncurated Regular Expressions
• Applications
• Future Work

Semantic Types
▸ Semantic types result in a more detailed schema than syntactic types
▸ Types should capture semantic similarity between sets of values
▸ Semantic types are much more difficult to infer than syntactic types

Example
▸ Person: Dorothy Vaughn, Alan Turing, Josef Burg
▸ Settlement: Pittsburgh, Edinburgh, Johannesburg, Salzburg


Why not regexes?
▸ Regular expressions can’t match many semantic type classes (e.g. cities)
▸ Dirty data could cause a valid regular expression not to match
▸ Even when regexes might work, they can be complicated to write
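A minimal sketch of the first point, using the Person and Settlement values from the earlier example slide. The pattern `burgh?$` is a hypothetical hand-written attempt at a "settlement" regex, not one taken from the talk. It matches every settlement in the example, but it also matches a person's name, so on its own it cannot separate the two types.

```python
import re

# Values from the example slide.
people = ["Dorothy Vaughn", "Alan Turing", "Josef Burg"]
settlements = ["Pittsburgh", "Edinburgh", "Johannesburg", "Salzburg"]

# Hypothetical hand-written pattern for settlement names (an assumption).
settlement_rx = re.compile(r"burgh?$", re.IGNORECASE)

def matching(values):
    """Return the values in which the settlement pattern is found."""
    return [v for v in values if settlement_rx.search(v)]

matched_settlements = matching(settlements)
matched_people = matching(people)  # false positives for the "settlement" type
```

Tightening the pattern to reject "Josef Burg" would need knowledge the regex does not have, which is the gap that learned classifiers try to fill.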

Learning regexes
▸ Regular expressions can be learned from sets of examples
▸ Learning can be slow and expressions can be too precise
Source: Sherlock: A Deep Learning Approach to Semantic Data Type Detection, Hulsebos et al., 2019

[Figure: Learning regexes. Source: Sherlock: A Deep Learning Approach to Semantic Data Type Detection, Hulsebos et al., 2019]

Regex101
▸ Users can enter and test regular expressions of various flavors
▸ Importantly, there is a “library” of saved regular expressions
▸ These expressions can be anything

Why use regexes anyway?
▸ Multiple regular expressions can extract useful information
▸ Each regular expression does not have to be perfect
▸ We don’t necessarily have to write them!

Multiple regex matching
▸ https?:\/\/(www\\.)?
▸ (([a-z\-]+)(?:\.com|\.vn|\.co.uk))
▸ \.(jpg|jpeg)
Example value: http://example.com/image.jpg
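As a rough illustration of the slide above, the three expressions can each be searched for in the example URL. This is a sketch that assumes Python's `re` module and unanchored matching with `re.search`. The patterns are copied as they appear in the transcript, including the doubled backslash in the first one. The helper name `match_all` is hypothetical.

```python
import re

# Patterns as they appear on the slide (raw strings keep the backslashes intact).
patterns = [
    r"https?:\/\/(www\\.)?",
    r"(([a-z\-]+)(?:\.com|\.vn|\.co.uk))",
    r"\.(jpg|jpeg)",
]

def match_all(value, patterns):
    """Map each pattern to the text it matched in value, or None if no match."""
    result = {}
    for p in patterns:
        m = re.search(p, value)
        result[p] = m.group(0) if m else None
    return result

matches = match_all("http://example.com/image.jpg", patterns)
for pattern, text in matches.items():
    print(f"{pattern!r} -> {text!r}")
```

Each expression picks out a different part of the value (scheme, domain, file extension), even though none of them describes a full URL on its own.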


Regexes for learning
▸ Even if we don’t know why each regex is meaningful, we assume it is
▸ Whether a set of values matches a regex gives some semantic information

Feature extraction
Input data:
▸ 3-5, K-2, 9-12, …
▸ PK, K-12, PK, …
▸ KG - 05, PK - 05, KG - 05, …
Regexes:
▸ \d+(.*)
▸ [a-zA-Z]([a-zA-Z]|[0-9])+
▸ …
Features:
▸ < 0.8, 0.4, .. >
▸ < 0.84, 0.72, .. >
▸ < 1.0, 1.0, .. >
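One plausible reading of the feature extraction slide is that each feature is the fraction of a column's values matched by one regex. The sketch below assumes that definition and unanchored matching with `re.search`. Neither detail is confirmed by the slides, and `regex_features` is a hypothetical helper name.

```python
import re

def regex_features(values, patterns):
    """Return one feature per pattern: the share of values the pattern matches.

    Assumption: a value "matches" if re.search finds the pattern anywhere in it.
    """
    if not values:
        return [0.0] * len(patterns)
    compiled = [re.compile(p) for p in patterns]
    return [
        sum(1 for v in values if rx.search(v)) / len(values)
        for rx in compiled
    ]

# Two of the regexes shown on the slide.
patterns = [r"\d+(.*)", r"[a-zA-Z]([a-zA-Z]|[0-9])+"]

# A few of the visible values from two of the input columns.
column_a = ["3-5", "K-2", "9-12"]
column_b = ["PK", "K-12", "PK"]

features_a = regex_features(column_a, patterns)
features_b = regex_features(column_b, patterns)
```

With the full regex library, each column becomes one fixed-length vector regardless of how many values it holds, which is what lets the later learning steps treat columns uniformly.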

Supervised learning
▸ Extract regex features from a labeled set of column values
▸ Train a neural network to classify based on the extracted features
▸ Currently, weighted F-1 score is 0.75 compared to 0.9 for Sherlock
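For readers unfamiliar with the metric, a weighted F-1 score is commonly computed as the average of per-class F-1 scores, weighted by how many true examples each class has. The sketch below assumes that standard definition. The slides do not spell out the exact computation, and the labels in the usage lines are made up for illustration.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F-1 scores."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score

# Hypothetical labels for four columns.
y_true = ["city", "city", "name", "name"]
y_pred = ["city", "name", "name", "name"]
example_score = weighted_f1(y_true, y_pred)
```

Because of the weighting, frequent classes dominate the score, so a weighted figure can hide poor results on rare classes.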


Unsupervised clustering
▸ Calculate feature vectors from each set of data values
▸ Cluster using DBSCAN
▸ Predefined semantic classes are not required
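To make the clustering step concrete, here is a small from-scratch sketch of DBSCAN over feature vectors. It is written for illustration only. The feature vectors and the `eps` and `min_samples` values are made up, and the talk does not say which implementation or parameters were used. A library implementation would be the usual choice in practice.

```python
import math

def dbscan(points, eps, min_samples):
    """Return one cluster label per point; -1 marks noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_samples:  # core point: keep expanding
                queue.extend(nbrs)
        cluster += 1
    return labels

# Made-up two-dimensional feature vectors for seven columns.
vectors = [
    (1.0, 1.0), (0.98, 0.97), (0.99, 1.0),
    (0.1, 0.2), (0.12, 0.18), (0.11, 0.21),
    (0.5, 0.9),
]
cluster_labels = dbscan(vectors, eps=0.1, min_samples=2)
```

No class labels are passed in. Columns with similar regex-match profiles end up in the same cluster, and columns unlike anything else are marked as noise.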


Unsupervised clustering
▸ Consider all semantic classes where our F-1 score > 0.9
▸ Clustering without supervision yields an F-1 score of 0.86
▸ This suggests minimal information is lost by discarding labels


Future Work
▸ More unsupervised learning
▸ Expanded regular expression corpus
▸ Explaining regular expression matches
▸ Learning expression fragments
▸ Blocking for entity resolution

Questions? cs.rit.edu/~dataunitylab

Feature Selection
▸ Many regexes are probably redundant
▸ Use the 10 regexes with the highest Shapley values for each class to select regexes
▸ Retrain the model with fewer regexes
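The selection step can be sketched as taking, for each class, the indices of the regexes with the highest importance scores and keeping the union across classes. The sketch assumes per-class Shapley values have already been computed and aggregated into one score per regex. How that aggregation was done is not stated in the slides, and the scores and the helper name `select_regexes` below are made up.

```python
def select_regexes(shapley, k=10):
    """Return sorted indices of regexes in the top k for at least one class.

    shapley: dict mapping class name -> list of per-regex importance scores
             (assumed to be precomputed, e.g. aggregated Shapley values).
    """
    selected = set()
    for scores in shapley.values():
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        selected.update(ranked[:k])
    return sorted(selected)

# Made-up scores for four regexes and two classes.
shapley = {
    "city": [0.1, 0.5, 0.0, 0.3],
    "name": [0.4, 0.0, 0.2, 0.1],
}
kept_top1 = select_regexes(shapley, k=1)
kept_top2 = select_regexes(shapley, k=2)
```

Since classes often share useful regexes, the union is usually much smaller than k times the number of classes.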

Feature Selection
▸ (\d\D?){10}\b
▸ \d\d.-?.\\w.-?.\d\d
▸ \b[^ aeiouAEIOU][a-zA-Z]*\b

Feature Selection
▸ Final model has 264 regexes
▸ Weighted performance is approximately the same
▸ Some classes have worse performance
