Semantic Data Understanding with Character Level Learning

Michael Mior

August 12, 2020

Transcript

  1. Semantic Data Understanding
    with Character Level Learning
    Michael Mior, Rochester Institute of Technology
    Ken Q. Pu, Ontario Tech University

  2. ● Motivation
    ● Neural Network Architecture
    ● Similarity Graph Generation
    ● Semantic Analysis of DBpedia
    ● Conclusion

  3. Motivation
    ● Assign semantic types to values in data lakes
    ● Types should capture semantic similarity
    ● Find anomalies in any database-assigned types
    ● Character-level reasoning without prior knowledge

  4. Example
    Person
    Dorothy Vaughan
    Alan Turing
    Josef Burg
    Settlement
    Pittsburgh
    Edinburgh
    Johannesburg
    Salzburg

  5. Example
    Person
    Dorothy Vaughan
    Alan Turing
    Josef Burg
    Settlement
    Pittsburgh
    Edinburgh
    Johannesburg
    Salzburg

  6. DBpedia
    Pattuelli 2013

  7. ● Motivation
    ● Neural Network Architecture
    ● Similarity Graph Generation
    ● Semantic Analysis of DBpedia
    ● Conclusion

  8. Network Architecture
    ● Use a convolutional neural network to identify salient substrings
    in values that can be used for classification
    ● Start with an embedding layer for dimensionality reduction
    ● Use convolutional layers to capture substring patterns
    ● End with fully connected layers
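To illustrate how a convolutional layer can respond to a substring, here is a minimal sketch with a single hand-written filter for the pattern "burg" followed by global max pooling. This is an assumption made for illustration only: a trained network learns its filter weights from data, and the function names `window_scores` and `max_pool` are hypothetical, not taken from the talk.

```python
def window_scores(value, pattern):
    """Slide a window of len(pattern) over value and count matching characters.

    This mimics one hand-written convolutional filter over one-hot encoded
    characters; a trained network would learn its filter weights instead.
    """
    value, width = value.lower(), len(pattern)
    return [
        sum(1 for a, b in zip(value[i:i + width], pattern) if a == b)
        for i in range(len(value) - width + 1)
    ]


def max_pool(scores):
    """Global max pooling: keep the strongest response, wherever it occurs."""
    return max(scores, default=0)


for name in ["Pittsburgh", "Johannesburg", "Josef Burg", "Alan Turing"]:
    print(name, max_pool(window_scores(name, "burg")))
```

The filter responds equally strongly to "Josef Burg" and to "Pittsburgh", so one substring is not enough to separate Person from Settlement; the later layers have to combine the responses of many filters.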

  9. Network Architecture
    [Diagram] The input value "Pittsburgh" is encoded as one index per
    character (7155683942) and passed through the layers:
    Embedding → CNN → Max pool → Flatten → Dense → Softmax → Settlement
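The character-to-index step that feeds the embedding layer can be sketched as follows. The vocabulary construction, the use of 0 for padding and unknown characters, and the names `build_vocab` and `encode` are assumptions for illustration; the talk does not say how indices are assigned.

```python
def build_vocab(values):
    """Assign each distinct character an integer index, starting at 1."""
    vocab = {}
    for value in values:
        for ch in value:
            # len(vocab) is read before insertion, so the first character gets 1.
            vocab.setdefault(ch, len(vocab) + 1)
    return vocab


def encode(value, vocab, length):
    """Map a string to a fixed-length list of indices.

    Assumption: index 0 is used both for padding and for unseen characters.
    """
    ids = [vocab.get(ch, 0) for ch in value[:length]]
    return ids + [0] * (length - len(ids))


vocab = build_vocab(["Pittsburgh"])
print(encode("Pittsburgh", vocab, 12))
```

As in the diagram, the repeated "t" maps to the same index both times, so the embedding layer sees identical characters as identical inputs.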

  10. Applications of Semantic Typing
    ● Semantic similarity between database types
    ● Identify a hierarchy among database types
    ● Detect outliers and dirty data

  11. ● Motivation
    ● Neural Network Architecture
    ● Similarity Graph Generation
    ● Semantic Analysis of DBpedia
    ● Conclusion

  12. Semantic Similarity Between Database Types
    Many database types are semantically similar.
    ChemicalSubstance ChemicalCompound
    Biomolecule Protein
    NaturalPlace BodyOfWater
    Semantic types can be used to detect and measure the
    semantic similarities between database types.

  13. Confusion Based Similarity
    We hypothesize that semantic similarity is the main source of confusion of the
    neural network. Thus, we can infer the degree of semantic similarity between two
    database types t and t’ based on their mutual confusion.
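The transcript does not give the formula for mutual confusion, so the following is only one plausible reading: average the rate at which values of type t are predicted as t’ with the rate in the opposite direction. The function name `mutual_confusion` and all counts in the example are hypothetical.

```python
def mutual_confusion(confusion, t, u):
    """Symmetric similarity between types t and u from a confusion matrix.

    confusion[actual][predicted] holds counts. Assumption: similarity is the
    mean of the two directed confusion rates; the talk may define it differently.
    """
    def rate(src, dst):
        total = sum(confusion[src].values())
        return confusion[src].get(dst, 0) / total if total else 0.0

    return (rate(t, u) + rate(u, t)) / 2


# Illustrative counts only, not results from the talk.
confusion = {
    "ChemicalSubstance": {"ChemicalSubstance": 60, "ChemicalCompound": 35, "Protein": 5},
    "ChemicalCompound": {"ChemicalCompound": 70, "ChemicalSubstance": 25, "Protein": 5},
    "Protein": {"Protein": 95, "ChemicalCompound": 5},
}
print(mutual_confusion(confusion, "ChemicalSubstance", "ChemicalCompound"))
print(mutual_confusion(confusion, "ChemicalSubstance", "Protein"))
```

Averaging both directions makes the measure symmetric, which is convenient when it is later used as an edge weight in an undirected graph.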

  14. Semantic Graph
    ● The pairwise semantic similarity measure allows us to organize all the
    database types into a similarity graph.
    ● A logistic function rescales the similarity measure, so the spatial layout
    of the graph can be controlled through two parameters, a and b.
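A minimal sketch of such a rescaling, assuming the common form 1 / (1 + exp(-a(s - b))), where a sets the steepness and b the midpoint. The transcript names the parameters a and b but does not state the exact form, so both the formula and the default values below are assumptions.

```python
import math


def rescale(similarity, a=10.0, b=0.2):
    """Logistic rescaling of a raw similarity score into the range (0, 1).

    Assumed form: 1 / (1 + exp(-a * (s - b))). A larger a sharpens the
    contrast between weak and strong edges; b is the score mapped to 0.5.
    """
    return 1.0 / (1.0 + math.exp(-a * (similarity - b)))


for s in (0.025, 0.2, 0.3):
    print(s, round(rescale(s), 3))
```

The function is monotonic, so it preserves the ordering of similarities and only changes how strongly edges pull nodes together in the layout.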

  15. Semantic Clustering
    ● Spectral clustering is a well-established technique for analyzing graphs.
    ● We apply spectral clustering to organize the database types into clusters
    which represent semantic topics in the database.
    ● Spectral clustering can be applied recursively so that a hierarchical
    organization can be obtained.
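One step of this process can be sketched as a spectral bisection in plain Python. This is a simplified stand-in rather than the method used in the talk: it uses the unnormalized graph Laplacian, approximates the Fiedler vector by power iteration, and splits the types by sign. The weights are illustrative, and the names `fiedler_vector` and `bisect` are hypothetical.

```python
def fiedler_vector(weights, iterations=500):
    """Approximate the Fiedler vector of the Laplacian L = D - W.

    Uses power iteration on (c*I - L) while projecting out the constant
    vector, which is the trivial eigenvector of L.
    """
    n = len(weights)
    degree = [sum(row) for row in weights]
    c = 2.0 * max(degree) or 1.0          # upper bound on the eigenvalues of L
    x = [float(i) for i in range(n)]      # deterministic start vector
    for _ in range(iterations):
        mean = sum(x) / n
        x = [v - mean for v in x]         # remove the constant component
        lx = [degree[i] * x[i] - sum(weights[i][j] * x[j] for j in range(n))
              for i in range(n)]          # L @ x
        x = [c * x[i] - lx[i] for i in range(n)]
        norm = sum(v * v for v in x) ** 0.5
        if norm == 0:
            break
        x = [v / norm for v in x]
    return x


def bisect(types, weights):
    """Split the types into two clusters by the sign of the Fiedler vector."""
    vec = fiedler_vector(weights)
    left = [t for t, v in zip(types, vec) if v < 0]
    right = [t for t, v in zip(types, vec) if v >= 0]
    return left, right


types = ["ChemicalSubstance", "ChemicalCompound", "Biomolecule", "Protein",
         "NaturalPlace", "BodyOfWater"]
groups = [0, 0, 0, 0, 1, 1]               # used only to build toy weights
weights = [[0.0 if i == j else (0.8 if groups[i] == groups[j] else 0.05)
            for j in range(6)] for i in range(6)]
print(bisect(types, weights))
```

Calling `bisect` again on each resulting cluster, with the corresponding sub-matrix of weights, would give the recursive, hierarchical organization described above.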

  16. Anomaly Detection
    ● A type anomaly is a form of dirty data in which data values are assigned
    incorrect types.
    Example: “Beethoven’s 9th Symphony” being classified as Artist.
    ● We can use the neural network to detect likely candidates for such dirty
    data.
    ● Using substring reasoning, we can also generate a visual explanation of
    why the identified data values are abnormal.
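A minimal sketch of the candidate-detection step, assuming the network outputs a probability per type for each value: a value is flagged when the most probable type differs from the assigned type and its probability exceeds a threshold. The threshold, the function name `find_type_anomalies`, and the probabilities are hypothetical and not taken from the talk.

```python
def find_type_anomalies(records, threshold=0.9):
    """Return values whose confidently inferred type differs from the assigned one.

    records: iterable of (value, assigned_type, {type: probability}).
    Assumption: a fixed probability threshold decides what counts as confident.
    """
    flagged = []
    for value, assigned, probs in records:
        inferred = max(probs, key=probs.get)
        if inferred != assigned and probs[inferred] >= threshold:
            flagged.append((value, assigned, inferred, probs[inferred]))
    return flagged


# Illustrative probabilities only, not model output from the talk.
records = [
    ("Beethoven's 9th Symphony", "Artist", {"Artist": 0.04, "MusicalWork": 0.95, "Settlement": 0.01}),
    ("Alan Turing", "Person", {"Person": 0.97, "Settlement": 0.03}),
    ("Josef Burg", "Person", {"Person": 0.45, "Settlement": 0.55}),
]
print(find_type_anomalies(records))
```

With this threshold, "Josef Burg" is not flagged even though the most probable type disagrees with the assigned one, because the model is not confident enough.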

  17. Anomaly Detection
    [Figure] Labels: substring in the sliding window, semantic typing,
    assigned type, inferred type, substring location, certainty of inference.
    Example substrings: “...phony...” and “...hoven...”
  18. ● Motivation
    ● Neural Network Architecture
    ● Similarity Graph Generation
    ● Semantic Analysis of DBpedia
    ● Conclusion

  19. Semantic Clusters

  20. Semantic Graph
    Biology
    Chemicals
    Events
    Agent

  21. Semantic Clusters
    Subtype structure
    within a single cluster

  22. Semantic Clusters

  23. Semantic Clusters

  24. Anomalies
    model response

  25. Anomalies
    model response

  26. Conclusion
    ● Semantic information is difficult to infer from data lakes
    ● Similarity information derived from neural networks can
    be used to construct a hierarchy of semantic types
    ● Semantic types are useful for analysis and data cleaning

  27. Questions?
