
Learning from Uncurated Regular Expressions for Semantic Type Classification - SiMoD 2023

Significant work has been done on learning regular expressions from a set of data values. Depending on the domain, this approach can be very successful. However, learning these expressions takes significant time, and the resulting expressions can become either very complex or inaccurate in the presence of dirty data. The alternative of manually writing regular expressions becomes unattractive when faced with a large number of values that must be matched.

As an alternative, we propose learning from a large corpus of manually authored, but uncurated regular expressions mined from a public repository. The advantage of this approach is that we are able to extract salient features from a set of strings with limited overhead to feature engineering. Since the set of regular expressions covers a wide range of application domains, we expect them to be widely applicable.

To demonstrate the potential effectiveness of our approach, we train a model using the extracted corpus of regular expressions for the task of semantic type classification. While our approach yields results that are overall inferior to the state of the art, our feature extraction code is an order of magnitude smaller, and our model outperforms a popular existing approach on some classes. We also demonstrate the possibility of using uncurated regular expressions for unsupervised learning.

Michael Mior

June 23, 2023

Transcript

  1. Learning from Uncurated
    Regular Expressions
    for Semantic Type Classification
    Michael Mior

  2. • Semantic Type Classification
    • Uncurated Regular Expressions
    • Applications
    • Future Work

  3. • Semantic Type Classification
    • Uncurated Regular Expressions
    • Applications
    • Future Work

  4. Semantic
    Types
    ▸ Semantic types result in a more
    detailed schema than syntactic types
    ▸ Types should capture semantic
    similarity between sets of values
    ▸ Semantic types are much more difficult
    to infer than syntactic types
    4

  5. Example
    5
    Person
    Dorothy Vaughan
    Alan Turing
    Josef Burg
    Settlement
    Pittsburgh
    Edinburgh
    Johannesburg
    Salzburg
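The slide's example can be made concrete: a purely syntactic pattern covering every Settlement value here also fires on a person's surname, which is why semantic types need more than surface patterns. The regex below is a hypothetical illustration, not one from the paper.

```python
import re

# Hypothetical surface pattern covering the Settlement values above
pattern = re.compile(r"[A-Za-z]+urgh?$")

settlements = ["Pittsburgh", "Edinburgh", "Johannesburg", "Salzburg"]
people = ["Dorothy Vaughan", "Alan Turing", "Josef Burg"]

# Every settlement matches the pattern...
assert all(pattern.search(s) for s in settlements)
# ...but so does the surname "Burg": a syntactic false positive
assert pattern.search("Josef Burg") is not None
```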

  6. • Semantic Type Classification
    • Uncurated Regular Expressions
    • Applications
    • Future Work

  7. Why not
    regexes?
    ▸ Regular expressions can’t match many
    semantic type classes (e.g. cities)
    ▸ Dirty data could cause a valid regular
    expression not to match
    ▸ Even when regexes might work, they
    can be complicated to write
    7

  8. Learning
    regexes
    ▸ Regular expressions can be learned
    from sets of examples
    ▸ Learning can be slow and expressions
    can be too precise
    8
    Source: Sherlock: A Deep Learning Approach to Semantic Data Type
    Detection, Hulsebos et al., 2019

  9. Learning
    regexes
    9
    Source: Sherlock: A Deep Learning Approach to Semantic Data Type
    Detection, Hulsebos et al., 2019

  10. Regex101
    ▸ Users can enter and test regular
    expressions of various flavors
    ▸ Importantly, there is a “library” of saved
    regular expressions
    ▸ These expressions can be anything
    10

  11. (image-only slide)

  12. (image-only slide)

  13. Why use
    regexes
    anyway?
    ▸ Multiple regular expressions can extract
    useful information
    ▸ Each regular expression does not have
    to be perfect
    ▸ We don’t necessarily have to write them!
    13

  14. 14
    Multiple regex matching
    https?:\/\/(www\.)?
    (([a-z\-]+)(?:\.com|\.vn|\.co\.uk))
    \.(jpg|jpeg)
    http://example.com/image.jpg
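The three library expressions on this slide each pull out a different fragment of the same value; in Python this might look like the following sketch, using `re.search` for partial matching:

```python
import re

url = "http://example.com/image.jpg"

# The three regexes from the slide: protocol, domain, and extension
regexes = {
    "protocol": r"https?:\/\/(www\.)?",
    "domain": r"([a-z\-]+)(?:\.com|\.vn|\.co\.uk)",
    "extension": r"\.(jpg|jpeg)",
}

# Each regex matches a different part of the same URL
for name, rx in regexes.items():
    m = re.search(rx, url)
    print(name, "->", m.group(0))
```

No single expression describes the whole value, but together the matches carry useful information about it.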

  15. • Semantic Type Classification
    • Uncurated Regular Expressions
    • Applications
    • Future Work

  16. Regexes
    for
    learning
    ▸ Even if we don’t know why each regex is
    meaningful, we assume it is
    ▸ Whether a set of values matches a regex
    gives some semantic information

    16

  17. 17
    Feature extraction
    3-5
    K-2
    9-12

    PK
    K-12
    PK

    KG - 05
    PK - 05
    KG - 05

    \d+(.*)
    [a-zA-Z]([a-zA-Z]|[0-9])+

    Input data Regexes
    Features
    < 0.8, 0.4, .. >
    < 0.84, 0.72, .. >
    < 1.0, 1.0, .. >
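The extraction step on this slide can be sketched as follows: each column's feature vector holds, for each regex in the corpus, the fraction of the column's values that match. This sketch assumes whole-value matching via `fullmatch` (the exact matching semantics are an assumption); the two regexes are the ones shown on the slide.

```python
import re

# The two regexes shown on the slide
regexes = [
    re.compile(r"\d+(.*)"),
    re.compile(r"[a-zA-Z]([a-zA-Z]|[0-9])+"),
]

def features(values):
    """Fraction of values fully matched by each regex in the corpus."""
    return [
        sum(1 for v in values if rx.fullmatch(v)) / len(values)
        for rx in regexes
    ]

print(features(["3-5", "K-2", "9-12"]))  # mostly digit-led values
print(features(["PK", "K-12", "PK"]))    # mostly letter-led values
```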

  18. Supervised
    learning
    ▸ Extract regex features from a labeled
    set of column values
    ▸ Train a neural network to classify based
    on the extracted features
    ▸ Currently, the weighted F-1 score is 0.75,
    compared to 0.9 for Sherlock
    18
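The supervised setup described above might be sketched as follows: regex match-fraction vectors in, semantic type labels out. The paper trains a neural network; a scikit-learn linear classifier stands in here for brevity, and the feature vectors and label names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical regex match-fraction vectors for labeled columns
X = np.array([
    [1.0, 0.0],   # columns dominated by digit-led values
    [0.9, 0.1],
    [0.0, 1.0],   # columns dominated by letter-led values
    [0.1, 0.9],
])
y = ["grade_range", "grade_range", "grade_code", "grade_code"]

# Train a classifier on the extracted features
clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.95, 0.05]])[0])  # a digit-heavy column
```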

  19. Supervised
    learning
    19

  20. Supervised
    learning
    20

  21. Unsupervised
    clustering
    ▸ Calculate feature vectors from
    each set of data values
    ▸ Cluster using DBSCAN
    ▸ Predefined semantic classes are
    not required
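The clustering step above can be sketched with scikit-learn's DBSCAN: columns whose regex feature vectors are close end up in the same cluster, with no predefined classes. The vectors and the `eps` setting below are illustrative, not the paper's configuration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical regex feature vectors, one per column
X = np.array([
    [1.0, 0.0], [0.95, 0.05],   # two columns with similar match behavior
    [0.0, 1.0], [0.05, 0.95],   # two columns of a different kind
])

# Columns with similar match behavior share a cluster label
labels = DBSCAN(eps=0.2, min_samples=2).fit_predict(X)
print(labels)
```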

  22. Unsupervised
    clustering

  23. Unsupervised
    clustering
    ▸ Consider all semantic classes
    where our F-1 score > 0.9
    ▸ Clustering without supervision
    yields an F-1 score of 0.86
    ▸ Suggests minimal information is
    lost by discarding labels

  24. • Semantic Type Classification
    • Uncurated Regular Expressions
    • Applications
    • Future Work

  25. Future
    Work
    ▸ More unsupervised learning
    ▸ Expanded regular expression corpus
    ▸ Explaining regular expression matches
    ▸ Learning expression fragments
    ▸ Blocking for entity resolution
    25

  26. Questions?
    cs.rit.edu/~dataunitylab

  27. Feature
    Selection
    ▸ Many regexes are probably redundant
    ▸ Use the top 10 highest Shapley values
    for each class to select regexes
    ▸ Retrain the model with fewer regexes
    27
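The selection step described above can be sketched with plain NumPy, given a matrix of mean absolute Shapley values per (class, regex) pair: keep the union of each class's top 10 regexes and retrain on only those. The matrix below is random stand-in data and its dimensions are hypothetical, not the paper's.

```python
import numpy as np

# Random stand-in for mean |Shapley| values: one row per semantic
# class, one column per regex (dimensions are hypothetical)
rng = np.random.default_rng(0)
shapley = rng.random((78, 1500))

# Top 10 highest-Shapley regexes for each class...
top_per_class = np.argsort(shapley, axis=1)[:, -10:]
# ...then the union across classes is the reduced feature set
selected = np.unique(top_per_class)
print(len(selected))  # at most 78 * 10 distinct regexes survive
```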

  28. Feature
    Selection
    28
    (\d\D?){10}\b
    \d\d.-?.\w.-?.\d\d
    \b[^ aeiouAEIOU][a-zA-Z]*\b

  29. Feature
    Selection
    ▸ Final model has 264 regexes
    ▸ Weighted performance is
    approximately the same
    ▸ Some classes have worse performance
    29
