Learning from Uncurated
Regular Expressions
for Semantic Type Classification
Michael Mior
Slide 2
Slide 2 text
• Semantic Type Classification
• Uncurated Regular Expressions
• Applications
• Future Work
Slide 3
Slide 3 text
• Semantic Type Classification
• Uncurated Regular Expressions
• Applications
• Future Work
Slide 4
Slide 4 text
Semantic
Types
▸ Semantic types result in a more
detailed schema than syntactic types
▸ Types should capture semantic
similarity between sets of values
▸ Semantic types are much more difficult
to infer than syntactic types
4
Slide 5
Slide 5 text
Example
5
Person
Dorothy Vaughan
Alan Turing
Josef Burg
Settlement
Pittsburgh
Edinburgh
Johannesburg
Salzburg
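A minimal sketch of the problem this example illustrates: a purely syntactic pattern such as a "burg(h)" suffix regex (hypothetical, not from the talk) matches every settlement above but also the person Josef Burg, so surface syntax alone cannot separate the two semantic types.

```python
import re

# Hypothetical suffix pattern, not from the talk: values ending in "burg" or "burgh".
SUFFIX = re.compile(r"burgh?$", re.IGNORECASE)

person = ["Dorothy Vaughan", "Alan Turing", "Josef Burg"]
settlement = ["Pittsburgh", "Edinburgh", "Johannesburg", "Salzburg"]

for label, values in [("Person", person), ("Settlement", settlement)]:
    matched = [v for v in values if SUFFIX.search(v)]
    print(label, matched)
# Person ['Josef Burg']
# Settlement ['Pittsburgh', 'Edinburgh', 'Johannesburg', 'Salzburg']
# The suffix also matches a person, so the pattern alone cannot tell the types apart.
```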
Slide 6
Slide 6 text
• Semantic Type Classification
• Uncurated Regular Expressions
• Applications
• Future Work
Slide 7
Slide 7 text
Why not
regexes?
▸ Regular expressions can’t match many
semantic type classes (e.g. cities)
▸ Dirty data could cause a valid regular
expression not to match
▸ Even when regexes might work, they
can be complicated to write
7
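A small illustration of the second point, with an assumed date column and an illustrative ISO-date regex (neither is from the talk): a handful of dirty values keeps an otherwise valid expression from matching the column.

```python
import re

# Illustrative ISO-8601 date pattern and column values.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

column = ["2021-03-14", "2021-04-01", " 2021-04-02", "2021/05/07", "N/A"]

matches = [bool(ISO_DATE.match(v)) for v in column]
print(matches)                     # [True, True, False, False, False]
print(sum(matches) / len(column))  # 0.4: dirty values drag down a valid regex
```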
Slide 8
Slide 8 text
Learning
regexes
▸ Regular expressions can be learned
from sets of examples
▸ Learning can be slow and expressions
can be too precise
8
Source: Sherlock: A Deep Learning Approach to Semantic Data Type
Detection, Hulsebos et al., 2019
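A toy illustration of the "too precise" failure mode (this naive per-character generalization is not the learning approach from Sherlock or the talk): a pattern inferred from one example fixes the exact value length and rejects valid values of a slightly different shape.

```python
import re

def naive_generalize(example: str) -> str:
    """Toy per-character generalization (illustration only): digits become
    \\d, letters become [A-Za-z], everything else is kept as a literal."""
    parts = []
    for ch in example:
        if ch.isdigit():
            parts.append(r"\d")
        elif ch.isalpha():
            parts.append("[A-Za-z]")
        else:
            parts.append(re.escape(ch))
    return "^" + "".join(parts) + "$"

pattern = naive_generalize("AB-1234")
print(pattern)                               # ^[A-Za-z][A-Za-z]\-\d\d\d\d$
print(bool(re.match(pattern, "XY-9876")))    # True
print(bool(re.match(pattern, "XY-98765")))   # False: the learned pattern fixes the length
```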
Slide 9
Slide 9 text
Learning
regexes
9
Source: Sherlock: A Deep Learning Approach to Semantic Data Type
Detection, Hulsebos et al., 2019
Slide 10
Slide 10 text
Regex101
▸ Users can enter and test regular
expressions of various flavors
▸ Importantly, there is a “library” of saved
regular expressions
▸ These expressions can be anything
10
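A sketch of how such a library might be consumed, assuming a hypothetical local JSON dump of scraped entries (the file name and format are assumptions, not from the talk): because the saved expressions "can be anything", some are written for other regex flavors and fail to compile as Python patterns, so they are skipped.

```python
import json
import re

# Hypothetical local dump of library entries scraped from regex101.
with open("regex101_library.json") as f:
    entries = json.load(f)          # e.g. [{"regex": "...", "title": "..."}, ...]

compiled = []
for entry in entries:
    try:
        # Entries come in many flavors (PCRE, JS, ...); keep only those
        # that also compile as Python regexes.
        compiled.append(re.compile(entry["regex"]))
    except re.error:
        continue

print(f"kept {len(compiled)} of {len(entries)} uncurated expressions")
```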
Slide 11
Slide 11 text
No content
Slide 12
Slide 12 text
No content
Slide 13
Slide 13 text
Why use
regexes
anyway?
▸ Multiple regular expressions can extract
useful information
▸ Each regular expression does not have
to be perfect
▸ We don’t necessarily have to write them!
13
• Semantic Type Classification
• Uncurated Regular Expressions
• Applications
• Future Work
Slide 16
Slide 16 text
Regexes
for
learning
▸ Even if we don’t know why each regex is
meaningful, we assume it is
▸ Whether a set of values matches a regex
gives some semantic information
16
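A minimal sketch of the kind of feature this suggests, under the assumption (not spelled out on the slide) that each uncurated regex contributes one feature: the fraction of a column's values it matches.

```python
import re
from typing import List

def regex_features(values: List[str], regexes: List[re.Pattern]) -> List[float]:
    """One feature per regex: the fraction of the column's values it matches.
    (The exact featurization is an assumption; the talk only says that whether
    values match a regex carries semantic information.)"""
    feats = []
    for rx in regexes:
        hits = sum(1 for v in values if rx.search(v))
        feats.append(hits / len(values) if values else 0.0)
    return feats

regexes = [re.compile(p) for p in (r"^\d+$", r"@", r"^[A-Z][a-z]+$")]
print(regex_features(["Alice", "Bob", "Carol"], regexes))   # [0.0, 0.0, 1.0]
print(regex_features(["a@x.com", "b@y.org"], regexes))      # [0.0, 1.0, 0.0]
```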
Slide 18
Slide 18 text
Supervised
learning
▸ Extract regex features from a labeled
set of column values
▸ Train a neural network to classify based
on the extracted features
▸ Currently, weighted F-1 score is 0.75
compared to 0.9 for Sherlock
18
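A hedged sketch of this pipeline with scikit-learn, assuming `X` is the matrix of regex-match features (one row per labeled column, e.g. from the featurization sketch above) and `y` holds the semantic type labels; the hidden-layer sizes are placeholders, not the talk's actual architecture.

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

# X: one row per column of data, one regex-match feature per uncurated regex;
# y: semantic type labels for those columns.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Small feedforward network over the regex features (illustrative shape).
clf = MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=200, random_state=0)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print("weighted F-1:", f1_score(y_test, pred, average="weighted"))
```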
Slide 19
Slide 19 text
Supervised
learning
19
Slide 20
Slide 20 text
Supervised
learning
20
Slide 21
Slide 21 text
Unsupervised
clustering
▸ Calculate feature vectors from
each set of data values
▸ Cluster using DBSCAN
▸ Predefined semantic classes are
not required
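A minimal sketch with scikit-learn's DBSCAN, assuming `X` is the same regex-feature matrix as before; the `eps` and `min_samples` settings are illustrative, not the values used in the talk.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# X: one regex-feature vector per set of column values.
X_scaled = StandardScaler().fit_transform(X)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)

# DBSCAN marks points it cannot place in any dense region with -1 (noise).
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"{n_clusters} clusters, {np.sum(labels == -1)} columns left as noise")
```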
Slide 22
Slide 22 text
Unsupervised
clustering
Slide 23
Slide 23 text
Unsupervised
clustering
▸ Consider all semantic classes
where our F-1 score > 0.9
▸ Clustering without supervision
yields an F-1 score of 0.86
▸ Suggests minimal information is
lost by discarding labels
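One plausible way to score clusters against the labeled classes (the talk does not spell out its exact protocol): map each cluster to the majority gold label of its members, then compute a weighted F-1. Here `labels` is the DBSCAN output from the previous sketch and `y` the gold semantic types.

```python
from collections import Counter, defaultdict
from sklearn.metrics import f1_score

def cluster_f1(labels, y_true):
    """Assign each cluster the majority true class of its members, then score.
    (An assumed evaluation protocol, not necessarily the talk's.)"""
    members = defaultdict(list)
    for cluster, truth in zip(labels, y_true):
        members[cluster].append(truth)
    majority = {c: Counter(vals).most_common(1)[0][0] for c, vals in members.items()}
    y_pred = [majority[c] for c in labels]
    return f1_score(y_true, y_pred, average="weighted")

print("clustering weighted F-1:", cluster_f1(labels, y))
```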
Slide 24
Slide 24 text
• Semantic Type Classification
• Uncurated Regular Expressions
• Applications
• Future Work
Slide 25
Slide 25 text
Future
Work
▸ More unsupervised learning
▸ Expanded regular expression corpus
▸ Explaining regular expression matches
▸ Learning expression fragments
▸ Blocking for entity resolution
25
Slide 26
Slide 26 text
Questions?
cs.rit.edu/~dataunitylab
Slide 27
Slide 27 text
Feature
Selection
▸ Many regexes are probably redundant
▸ Use the top 10 highest Shapley values
for each class to select regexes
▸ Retrain the model with fewer regexes
27
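A hedged sketch of this selection step with the `shap` package, reusing `clf`, `X_train`, `X_test`, and `y_train` from the supervised-learning sketch; the sample sizes and the use of KernelExplainer are assumptions, not the talk's exact Shapley computation.

```python
import numpy as np
import shap  # assumes the shap package is installed
from sklearn.neural_network import MLPClassifier

# Summarize the training data so KernelExplainer stays tractable.
background = shap.kmeans(X_train, 50)
explainer = shap.KernelExplainer(clf.predict_proba, background)
sv = explainer.shap_values(X_test[:200])

# Depending on the shap version this is a list of (samples, features) arrays,
# one per class, or a single (samples, features, classes) array.
per_class = sv if isinstance(sv, list) else [sv[:, :, k] for k in range(sv.shape[2])]

keep = set()
for class_sv in per_class:
    importance = np.abs(class_sv).mean(axis=0)   # mean |Shapley value| per regex
    keep.update(np.argsort(importance)[-10:])    # top 10 regexes for this class

print(f"retraining on {len(keep)} of {X_train.shape[1]} regex features")
clf_small = MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=200, random_state=0)
clf_small.fit(X_train[:, sorted(keep)], y_train)
```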