Learning from Uncurated Regular Expressions for Semantic Type Classification - SiMoD 2023

Significant work has been done on learning regular expressions from a set of data values. Depending on the domain, this approach can be very successful. However, significant time is required to learn these expressions and the resulting expressions can become either very complex or inaccurate in the presence of dirty data. The alternative of manually writing regular expressions becomes unattractive when faced with a large number of values that must be matched.

As an alternative, we propose learning from a large corpus of manually authored but uncurated regular expressions mined from a public repository. The advantage of this approach is that we can extract salient features from a set of strings with little feature-engineering overhead. Since the regular expressions cover a wide range of application domains, we expect the resulting features to be widely applicable.
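The feature extraction idea can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' code: it assumes each feature is the fraction of values in a set that a regex matches, that matching is unanchored (`re.search`), and that expressions which fail to compile in Python's regex flavor contribute a constant 0.0. The names `compile_corpus` and `regex_features` are hypothetical.

```python
import re


def compile_corpus(patterns):
    """Compile each pattern; keep None for ones Python's `re` rejects.

    An uncurated corpus mixes regex flavors, so some patterns will not
    compile. Keeping a placeholder preserves the feature vector length.
    """
    compiled = []
    for pattern in patterns:
        try:
            compiled.append(re.compile(pattern))
        except re.error:
            compiled.append(None)
    return compiled


def regex_features(values, compiled):
    """Return one feature per regex: the fraction of values it matches.

    Assumption: a value counts as matched if the regex matches anywhere
    in the string (re.search). Full-string matching is another option.
    """
    if not values:
        return [0.0] * len(compiled)
    features = []
    for regex in compiled:
        if regex is None:
            features.append(0.0)
            continue
        hits = sum(1 for value in values if regex.search(value))
        features.append(hits / len(values))
    return features


# Two regexes from the slides plus one deliberately invalid pattern.
corpus = compile_corpus([r"\d+(.*)", r"[a-zA-Z]([a-zA-Z]|[0-9])+", r"(unclosed"])
column = ["3-5", "K-2", "9-12", "PK", "K-12"]
print(regex_features(column, corpus))  # [0.8, 0.2, 0.0]
```

Because every set of values is reduced to a fixed-length numeric vector, no single expression needs to be a precise recognizer for any one type.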

To demonstrate the potential effectiveness of our approach, we train a model using the extracted corpus of regular expressions for the task of semantic type classification. While our approach yields results that are overall inferior to the state of the art, our feature extraction code is an order of magnitude smaller, and our model outperforms a popular existing approach on some classes. We also demonstrate the possibility of using uncurated regular expressions for unsupervised learning.
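For the unsupervised case, the slides cluster the feature vectors with DBSCAN. The following is a minimal pure-Python sketch of that algorithm applied to made-up two-dimensional feature vectors. The `eps` and `min_samples` values and the data are illustrative assumptions, and in practice a library implementation would likely be used instead.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # core point: keep expanding
    return labels


# Hypothetical regex feature vectors for seven sets of values.
vectors = [
    (1.0, 0.0), (0.9, 0.1), (0.95, 0.05),  # mostly match the first regex
    (0.0, 1.0), (0.1, 0.9), (0.05, 0.95),  # mostly match the second regex
    (0.5, 0.5),                            # resembles neither group
]
print(dbscan(vectors, eps=0.2, min_samples=2))  # [0, 0, 0, 1, 1, 1, -1]
```

No class labels are supplied at any point: sets of values are grouped only because their regex match profiles are similar.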

Michael Mior

June 23, 2023

  1. Semantic Types ▸ Semantic types result in a more detailed schema than syntactic types ▸ Types should capture semantic similarity between sets of values ▸ Semantic types are much more difficult to infer than syntactic types
  2. Example ▸ Person: Dorothy Vaughn, Alan Turing, Josef Burg ▸ Settlement: Pittsburgh, Edinburgh, Johannesburg, Salzburg
  3. Why not regexes? ▸ Regular expressions can’t match many semantic type classes (e.g. cities) ▸ Dirty data could cause a valid regular expression not to match ▸ Even when regexes might work, they can be complicated to write
  4. Learning regexes ▸ Regular expressions can be learned from sets of examples ▸ Learning can be slow and expressions can be too precise (Source: Sherlock: A Deep Learning Approach to Semantic Data Type Detection, Hulsebos et al., 2019)
  5. Learning regexes (Source: Sherlock: A Deep Learning Approach to Semantic Data Type Detection, Hulsebos et al., 2019)
  6. Regex101 ▸ Users can enter and test regular expressions of various flavors ▸ Importantly, there is a “library” of saved regular expressions ▸ These expressions can be anything
  7. Why use regexes anyway? ▸ Multiple regular expressions can extract useful information ▸ Each regular expression does not have to be perfect ▸ We don’t necessarily have to write them!
  8. Regexes for learning ▸ Even if we don’t know why each regex is meaningful, we assume it is ▸ Whether a set of values matches a regex gives some semantic information
  9. Feature extraction ▸ Input data (sets of values): {3-5, K-2, 9-12, …}; {PK, K-12, PK, …}; {KG - 05, PK - 05, KG - 05, …} ▸ Regexes: \d+(.*), [a-zA-Z]([a-zA-Z]|[0-9])+, … ▸ Features (one vector per set): < 0.8, 0.4, .. >, < 0.84, 0.72, .. >, < 1.0, 1.0, .. >
  10. Supervised learning ▸ Extract regex features from a labeled set of column values ▸ Train a neural network to classify based on the extracted features ▸ Currently, weighted F-1 score is 0.75 compared to 0.9 for Sherlock
  11. Unsupervised clustering ▸ Calculate feature vectors from each set of data values ▸ Cluster using DBSCAN ▸ Predefined semantic classes are not required
  12. Unsupervised clustering ▸ Consider all semantic classes where our F-1 score > 0.9 ▸ Clustering without supervision yields an F-1 score of 0.86 ▸ Suggests minimal information is lost by discarding labels
  13. Future Work ▸ More unsupervised learning ▸ Expanded regular expression corpus ▸ Explaining regular expression matches ▸ Learning expression fragments ▸ Blocking for entity resolution
  14. Feature Selection ▸ Many regexes are probably redundant ▸ For each class, select the 10 regexes with the highest Shapley values ▸ Retrain the model with fewer regexes
  15. Feature Selection ▸ Final model has 264 regexes ▸ Weighted performance is approximately the same ▸ Some classes have worse performance