Semantic Data Understanding with Character Level Learning

Michael Mior

August 12, 2020

Transcript

  1. Semantic Data Understanding with Character Level Learning
     Michael Mior, Rochester Institute of Technology
     Ken Q. Pu, Ontario Tech University
  2. • Motivation
     • Neural Network Architecture
     • Similarity Graph Generation
     • Semantic Analysis of DBpedia
     • Conclusion
  3. Motivation
     • Assign semantic types to values in data lakes
     • Types should capture semantic similarity
     • Find anomalies in any database-assigned types
     • Character-level reasoning without prior knowledge
  4. Example
     Person: Dorothy Vaughn, Alan Turing, Josef Burg
     Settlement: Pittsburgh, Edinburgh, Johannesburg, Salzburg
  6. DBpedia Pattuelli 2013

  7. • Motivation
     • Neural Network Architecture
     • Similarity Graph Generation
     • Semantic Analysis of DBpedia
     • Conclusion
  8. Network Architecture
     • Use a convolutional neural network to identify salient substrings in values that can be used for classification
     • Start with an embedding layer for dimensionality reduction
     • Use convolutional layers to capture substring patterns
     • End with fully connected layers
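
The layer sequence on this slide can be sketched as a NumPy forward pass. This is only an illustration with random, untrained weights: the vocabulary size, embedding width, filter count, kernel width, maximum length, and number of types are all hypothetical, and the byte-value character encoding is an assumption, since the slides give none of these details.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the slides do not state any of them.
VOCAB, EMBED, FILTERS, KERNEL, MAX_LEN, N_TYPES = 128, 8, 16, 3, 24, 4

# Random, untrained parameters standing in for learned weights.
E = rng.normal(size=(VOCAB, EMBED))            # embedding table
W = rng.normal(size=(FILTERS, KERNEL, EMBED))  # convolution filters
D = rng.normal(size=(FILTERS, N_TYPES))        # fully connected layer
b = np.zeros(N_TYPES)


def encode(value):
    """Map a string to a fixed-length vector of character indices (0 pads)."""
    # Assumed encoding: code points, clipped to the vocabulary size.
    codes = [min(ord(c), VOCAB - 1) for c in value[:MAX_LEN]]
    return np.array(codes + [0] * (MAX_LEN - len(codes)))


def forward(value):
    """Embedding -> 1D convolution -> max pool -> dense -> softmax."""
    x = E[encode(value)]                                   # (MAX_LEN, EMBED)
    # Each window covers KERNEL consecutive characters, i.e. one substring.
    windows = np.stack([x[i:i + KERNEL] for i in range(MAX_LEN - KERNEL + 1)])
    conv = np.maximum(np.einsum("wke,fke->wf", windows, W), 0.0)  # ReLU
    pooled = conv.max(axis=0)       # global max pool: already flat, (FILTERS,)
    logits = pooled @ D + b
    exp = np.exp(logits - logits.max())                    # stable softmax
    return exp / exp.sum()


probs = forward("Pittsburgh")  # one probability per (hypothetical) type
```

Because the convolution slides over short character windows, each filter responds to a substring pattern (such as "burg"), which is what makes the later substring-based explanations possible.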
  9. Network Architecture
     [Diagram] Pittsburgh → 7 1 5 5 6 8 3 9 4 2 (one index per character) → Embedding → CNN → Max pool → Flatten → Dense → Softmax → Settlement
  10. Applications of Semantic Typing
      • Semantic similarity between database types
      • Identify a hierarchy among database types
      • Detect outliers and dirty data
  11. • Motivation
      • Neural Network Architecture
      • Similarity Graph Generation
      • Semantic Analysis of DBpedia
      • Conclusion
  12. Semantic Similarity Between Database Types
      Many database types are semantically similar.
      ChemicalSubstance, ChemicalCompound, Biomolecule, Protein, NaturalPlace, BodyOfWater
      Semantic types can be used to detect and measure the semantic similarities between database types.
  13. Confusion Based Similarity
      We hypothesize that semantic similarity is the main source of confusion of the neural network. Thus, we can infer the degree of semantic similarity between two database types t and t’ based on their mutual confusion.
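
One way to turn mutual confusion into a number is sketched below. The slide does not give the exact formula, so the definition used here (the average of the two directed confusion rates from a row-normalized confusion matrix) is an assumption, and the counts are made up for illustration.

```python
def mutual_confusion(confusion):
    """Symmetric similarity from a confusion matrix of counts.

    confusion[i][j] = number of values of true type i predicted as type j.
    """
    n = len(confusion)
    # Row-normalize so each row holds prediction rates for one true type.
    rates = []
    for row in confusion:
        total = sum(row)
        rates.append([c / total if total else 0.0 for c in row])
    # Assumed definition: average of the two directed confusion rates.
    # The diagonal is set to 1.0 (a type is maximally similar to itself).
    return [[1.0 if i == j else (rates[i][j] + rates[j][i]) / 2
             for j in range(n)] for i in range(n)]


types = ["ChemicalCompound", "ChemicalSubstance", "Settlement"]
counts = [[80, 18, 2],
          [25, 70, 5],
          [1, 1, 98]]  # made-up counts, not results from the talk
sim = mutual_confusion(counts)
```

With these counts the two chemical types come out far more similar to each other than either is to Settlement, which is the behavior the hypothesis predicts.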
  14. Semantic Graph
      • The pairwise semantic similarity measure allows us to organize all the database types into a similarity graph.
      • A logistic function is used to rescale the similarity measure so we can control the spatial layout of the graph with the parameters a and b.
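
A minimal sketch of the rescaling step follows. The slide names the parameters a and b but not their roles, so the usual logistic form, with a as steepness and b as midpoint, is assumed here; the similarity values and default parameter settings are made up.

```python
import math


def rescale(similarity, a, b):
    """Logistic rescaling; a sets the steepness, b the midpoint (assumed roles)."""
    return 1.0 / (1.0 + math.exp(-a * (similarity - b)))


def weighted_edges(sim, names, a=20.0, b=0.1):
    """Build (type, type, weight) edges for every unordered pair of types."""
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            edges.append((names[i], names[j], rescale(sim[i][j], a, b)))
    return edges


names = ["ChemicalCompound", "ChemicalSubstance", "Settlement"]
sim = [[1.0, 0.215, 0.015],
       [0.215, 1.0, 0.03],
       [0.015, 0.03, 1.0]]  # made-up similarities for illustration
edges = weighted_edges(sim, names)
```

Raising a pushes weights toward 0 or 1, which separates clusters more sharply in a force-directed layout, while b moves the similarity level at which an edge counts as strong.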
  15. Semantic Clustering
      • Spectral clustering is a time-proven technique to analyze graphs.
      • We apply spectral clustering to organize the database types into clusters which represent semantic topics in the database.
      • Spectral clustering can be applied recursively so that a hierarchical organization can be obtained.
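
The recursive idea can be sketched with spectral bisection, one common variant of spectral clustering that splits a graph by the sign of the Fiedler vector of its unnormalized Laplacian. The slides do not say which variant was used, so this choice, the stopping rule `min_size`, and the toy weights are all assumptions.

```python
import numpy as np


def bisect(weights, nodes):
    """Split nodes in two using the sign of the Fiedler vector."""
    sub = weights[np.ix_(nodes, nodes)]
    laplacian = np.diag(sub.sum(axis=1)) - sub     # unnormalized Laplacian
    _, vecs = np.linalg.eigh(laplacian)            # eigenvalues ascending
    fiedler = vecs[:, 1]                           # second-smallest eigenvalue
    left = [n for n, v in zip(nodes, fiedler) if v < 0]
    right = [n for n, v in zip(nodes, fiedler) if v >= 0]
    return left, right


def hierarchy(weights, nodes, min_size=2):
    """Recursively bisect, returning a nested list of node indices."""
    if len(nodes) <= min_size:
        return nodes
    left, right = bisect(weights, nodes)
    if not left or not right:                      # no useful split found
        return nodes
    return [hierarchy(weights, left, min_size),
            hierarchy(weights, right, min_size)]


# Toy similarity graph: nodes 0-1 and 2-3 form two tight pairs.
weights = np.array([[0.00, 0.90, 0.05, 0.05],
                    [0.90, 0.00, 0.05, 0.05],
                    [0.05, 0.05, 0.00, 0.90],
                    [0.05, 0.05, 0.90, 0.00]])
tree = hierarchy(weights, [0, 1, 2, 3])
```

Each level of the nested list corresponds to one level of the hierarchy; the order of the two halves depends on the arbitrary sign of the eigenvector.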
  16. Anomaly Detection
      • A type anomaly is a form of dirty data in which data values are assigned incorrect types. Example: “Beethoven’s 9th Symphony” being classified as Artist.
      • We can use the neural network to detect likely candidates for such dirty data.
      • Using substring reasoning, we can also generate visual explanations of the abnormalities in the identified data values.
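
Candidate detection can be sketched as flagging values whose inferred type confidently disagrees with the assigned type. The `predict` callable stands in for the trained network, whose interface the slides do not describe; the confidence threshold, the toy model, and the type name MusicalWork are hypothetical.

```python
def find_type_anomalies(records, predict, threshold=0.9):
    """Return records whose inferred type confidently disagrees with the assigned one.

    records: iterable of (value, assigned_type) pairs.
    predict: callable mapping a value to {type: probability} (hypothetical
             stand-in for the trained network).
    """
    flagged = []
    for value, assigned in records:
        probs = predict(value)
        inferred = max(probs, key=probs.get)       # most probable type
        if inferred != assigned and probs[inferred] >= threshold:
            flagged.append((value, assigned, inferred, probs[inferred]))
    return flagged


def toy_predict(value):
    """Toy model keyed on a single substring, for illustration only."""
    if "Symphony" in value:
        return {"MusicalWork": 0.95, "Artist": 0.05}
    return {"MusicalWork": 0.1, "Artist": 0.9}


records = [("Beethoven's 9th Symphony", "Artist"),
           ("Ludwig van Beethoven", "Artist")]
anomalies = find_type_anomalies(records, toy_predict)
```

Only the first record is flagged here, since the toy model assigns it a different type with high certainty; lowering the threshold trades precision for more candidates to review.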
  17. Anomaly Detection
      [Figure] Semantic typing of the substring in the sliding window. Labels: assigned type, inferred type, substring location, certainty of inference. Highlighted substrings: “...phony...”, “...hoven...”
  18. • Motivation
      • Neural Network Architecture
      • Similarity Graph Generation
      • Semantic Analysis of DBpedia
      • Conclusion
  19. Semantic Clusters

  20. Semantic Graph
      [Figure] Labeled regions: Biology, Chemicals, Events, Agent

  21. Semantic Clusters Subtype structure within a single cluster

  22. Semantic Clusters

  23. Semantic Clusters

  24. Anomalies model response

  25. Anomalies model response

  26. Conclusion
      • Semantic information is difficult to infer from data lakes
      • Similarity information derived from neural networks can be used to construct a hierarchy of semantic types
      • Semantic types are useful for analysis and data cleaning
  27. Questions?