Semantic Data Understanding with Character Level Learning

Michael Mior

August 12, 2020

Transcript

  1. Semantic Data Understanding with Character Level Learning
    Michael Mior, Rochester Institute of Technology
    Ken Q. Pu, Ontario Tech University
  2. Motivation
    • Assign semantic types to values in data lakes
    • Types should capture semantic similarity
    • Find anomalies in any database-assigned types
    • Character-level reasoning without prior knowledge
  3. Network Architecture
    • Use a convolutional neural network to identify salient substrings in values that can be used for classification
    • Start with an embedding layer for dimensionality reduction
    • Use convolutional layers to capture substring patterns
    • End with fully connected layers
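The layer sequence on this slide can be sketched as a forward pass in plain Python. This is a minimal, untrained illustration and not the authors' model: the class name `CharCNN`, the byte-level vocabulary, the layer sizes, the ReLU activation, the random weights, and the omission of biases are all assumptions made for the example.

```python
import math
import random

def softmax(xs):
    # Subtract the maximum for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

class CharCNN:
    """Hypothetical tiny model: embedding -> 1D conv -> max pool -> dense -> softmax."""

    def __init__(self, n_types, embed_dim=4, n_filters=6, width=3, seed=0):
        rng = random.Random(seed)
        rand = lambda: rng.uniform(-0.5, 0.5)
        self.width = width
        # One small embedding vector per byte value (assumed vocabulary of 256).
        self.embed = [[rand() for _ in range(embed_dim)] for _ in range(256)]
        # Each filter spans `width` consecutive character embeddings.
        self.filters = [[[rand() for _ in range(embed_dim)] for _ in range(width)]
                        for _ in range(n_filters)]
        # Fully connected layer mapping pooled features to one logit per type.
        self.dense = [[rand() for _ in range(n_filters)] for _ in range(n_types)]

    def forward(self, value):
        codes = list(value.encode("utf-8"))
        # Pad short values so at least one convolution window exists.
        codes += [0] * max(0, self.width - len(codes))
        x = [self.embed[c] for c in codes]
        pooled = []
        for f in self.filters:
            acts = []
            for start in range(len(x) - self.width + 1):
                s = sum(f[k][d] * x[start + k][d]
                        for k in range(self.width) for d in range(len(f[k])))
                acts.append(max(0.0, s))  # ReLU (an assumption)
            pooled.append(max(acts))      # max over all substring positions
        logits = [sum(w * p for w, p in zip(row, pooled)) for row in self.dense]
        return softmax(logits)

model = CharCNN(n_types=3)
probs = model.forward("Pittsburgh")  # one probability per (hypothetical) type
```

Pooling over every position already yields a flat feature vector, so no separate flatten step is needed in this sketch. A real model would learn the weights from labeled values.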
  4. Network Architecture
    [Diagram: example input values ("Pittsburgh", "7155683942") pass through Embedding, CNN, Max pool, Flatten, Dense, and Softmax stages to a predicted type such as Settlement]
  5. Applications of Semantic Typing
    • Semantic similarity between database types
    • Identify a hierarchy among database types
    • Detect outliers and dirty data
  6. Semantic Similarity Between Database Types
    Many database types are semantically similar.
    Examples: ChemicalSubstance, ChemicalCompound, Biomolecule, Protein, NaturalPlace, BodyOfWater
    Semantic types can be used to detect and measure the semantic similarities between database types.
  7. Confusion Based Similarity
    We hypothesize that semantic similarity is the main source of confusion of the neural network. Thus, we can infer the degree of semantic similarity between two database types t and t’ based on their mutual confusion.
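The slide does not give a formula for mutual confusion. One plausible reading, shown here as a sketch, averages the rate at which values of type t are predicted as t’ and the rate in the opposite direction. The function names and the toy prediction counts are made up for illustration.

```python
from collections import Counter, defaultdict

def confusion_rates(pairs):
    """Turn (assigned_type, predicted_type) pairs into row-normalized confusion rates."""
    counts = defaultdict(Counter)
    for assigned, predicted in pairs:
        counts[assigned][predicted] += 1
    return {t: {p: n / sum(row.values()) for p, n in row.items()}
            for t, row in counts.items()}

def mutual_confusion(rates, t, t_prime):
    """Symmetric similarity: mean of the two directed confusion rates (an assumed definition)."""
    forward = rates.get(t, {}).get(t_prime, 0.0)
    backward = rates.get(t_prime, {}).get(t, 0.0)
    return (forward + backward) / 2

# Toy predictions, invented for this example only.
pairs = ([("Protein", "Protein")] * 8 + [("Protein", "Biomolecule")] * 2
         + [("Biomolecule", "Biomolecule")] * 6 + [("Biomolecule", "Protein")] * 4
         + [("BodyOfWater", "BodyOfWater")] * 10)
rates = confusion_rates(pairs)
similarity = mutual_confusion(rates, "Protein", "Biomolecule")  # (0.2 + 0.4) / 2
```

Types that the network never confuses with each other get a similarity of zero under this definition.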
  8. Semantic Graph
    • The pairwise semantic similarity measure allows us to organize all the database types into a similarity graph.
    • A logistic function is used to rescale the similarity measure so we can control the spatial layout of the graph with the parameters a and b.
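The exact parameterization of the logistic function is not stated on the slide. A common form, assumed here, treats a as the steepness and b as the midpoint; the function name `logistic_rescale` is hypothetical.

```python
import math

def logistic_rescale(similarity, a, b):
    """Map a raw similarity into (0, 1); a sets the steepness, b the midpoint (assumed roles)."""
    return 1.0 / (1.0 + math.exp(-a * (similarity - b)))

# Raw similarities are spread out around the midpoint b = 0.3.
weights = [logistic_rescale(s, a=10.0, b=0.3) for s in (0.0, 0.3, 0.6)]
```

Under this assumption, a larger a pushes edge weights toward 0 or 1, which separates groups more sharply in a force-directed layout, while b decides which similarities count as strong.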
  9. Semantic Clustering
    • Spectral clustering is a well-established technique for analyzing graphs.
    • We apply spectral clustering to organize the database types into clusters that represent semantic topics in the database.
    • Spectral clustering can be applied recursively to obtain a hierarchical organization.
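The recursive idea can be illustrated with spectral bisection, a simple two-way variant of spectral clustering that splits the graph by the sign of the Fiedler vector (the eigenvector for the second-smallest eigenvalue of the graph Laplacian). The sketch below substitutes a plain power iteration for a library eigensolver so that it runs without dependencies. The function names and the similarity weights are made up for illustration and are not results from the talk.

```python
def fiedler_vector(weights, iterations=500):
    """Approximate the Fiedler vector of L = D - W by power iteration on (shift*I - L)."""
    n = len(weights)
    degree = [sum(row) for row in weights]
    shift = 2 * max(degree)  # upper bound on the Laplacian's eigenvalues
    v = [i - (n - 1) / 2 for i in range(n)]  # deterministic start, sums to zero
    for _ in range(iterations):
        w = [(shift - degree[i]) * v[i] + sum(weights[i][j] * v[j] for j in range(n))
             for i in range(n)]
        mean = sum(w) / n
        w = [x - mean for x in w]  # project out the constant eigenvector
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v

def spectral_hierarchy(weights, labels, min_size=2):
    """Recursively bisect the graph; returns nested lists of labels."""
    if len(labels) <= min_size:
        return list(labels)
    v = fiedler_vector(weights)
    left = [i for i, x in enumerate(v) if x < 0]
    right = [i for i, x in enumerate(v) if x >= 0]
    if not left or not right:
        return list(labels)

    def sub(idx):
        # Recurse on the subgraph induced by the selected nodes.
        return spectral_hierarchy([[weights[i][j] for j in idx] for i in idx],
                                  [labels[i] for i in idx], min_size)

    return [sub(left), sub(right)]

labels = ["ChemicalSubstance", "ChemicalCompound", "Biomolecule",
          "Protein", "NaturalPlace", "BodyOfWater"]
group = [0, 0, 1, 1, 2, 2]

def toy_weight(i, j):
    # Invented similarities: strong within a pair, moderate between the
    # first two pairs, weak between either of them and the last pair.
    if i == j:
        return 0.0
    if group[i] == group[j]:
        return 0.9
    return 0.3 if 2 not in (group[i], group[j]) else 0.02

weights = [[toy_weight(i, j) for j in range(6)] for i in range(6)]
hierarchy = spectral_hierarchy(weights, labels)
```

With these toy weights the first split separates the two place types from the other four, and the second split divides the remaining four into two pairs. A production pipeline would more likely use an established eigensolver and a k-way clustering step.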
  10. Anomaly Detection
    • A type anomaly is a form of dirty data in which data values are assigned incorrect types. Example: “Beethoven’s 9th Symphony” being classified as Artist.
    • We can use the neural network to detect likely candidates for such dirty data.
    • Using substring reasoning, we can also generate visual explanations of the abnormalities in the identified data values.
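A simple way to surface candidates, sketched here under assumptions, is to flag values whose inferred type differs from the assigned type with high certainty. The function name, the 0.9 threshold, the type name `MusicalWork`, and the probabilities are hypothetical; in practice the probabilities would come from the trained network.

```python
def anomaly_candidates(records, threshold=0.9):
    """records: iterable of (value, assigned_type, {type: probability}).

    Returns (value, assigned, inferred, certainty) tuples, most certain first.
    """
    flagged = []
    for value, assigned, probs in records:
        inferred = max(probs, key=probs.get)
        certainty = probs[inferred]
        # Flag only confident disagreements with the database-assigned type.
        if inferred != assigned and certainty >= threshold:
            flagged.append((value, assigned, inferred, certainty))
    return sorted(flagged, key=lambda r: r[3], reverse=True)

# Invented classifier outputs for illustration.
records = [
    ("Beethoven's 9th Symphony", "Artist", {"MusicalWork": 0.95, "Artist": 0.05}),
    ("Ludwig van Beethoven", "Artist", {"Artist": 0.97, "MusicalWork": 0.03}),
    ("Moonlight", "Artist", {"MusicalWork": 0.6, "Artist": 0.4}),
]
candidates = anomaly_candidates(records)
```

The threshold trades recall for precision: lowering it would also flag the uncertain third record.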
  11. Anomaly Detection
    [Figure: semantic typing applied to substrings in a sliding window. Labels: assigned type, inferred type, substring location, certainty of inference. Example substrings: “...phony...”, “...hoven...”]
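The quantities named on this slide (substring location, inferred type, certainty) can be produced by scoring each sliding-window substring separately. The sketch below is one possible approach, not the method behind the figure: the window width of 5, the function names, and the rule-based `toy_classify` stand-in for the trained network are all assumptions.

```python
def window_evidence(value, classify, width=5):
    """Score every sliding-window substring of `value`.

    Returns (location, substring, inferred_type, certainty) rows.
    """
    rows = []
    for start in range(max(1, len(value) - width + 1)):
        substring = value[start:start + width]
        probs = classify(substring)
        inferred = max(probs, key=probs.get)
        rows.append((start, substring, inferred, probs[inferred]))
    return rows

def toy_classify(substring):
    # Hypothetical stand-in for the trained network, with invented probabilities.
    s = substring.lower()
    if "phony" in s:
        return {"MusicalWork": 0.9, "Artist": 0.1}
    if "hoven" in s:
        return {"Artist": 0.8, "MusicalWork": 0.2}
    return {"Artist": 0.5, "MusicalWork": 0.5}

rows = window_evidence("Beethoven's 9th Symphony", toy_classify)
# Windows whose inferred type disagrees with the assigned type "Artist".
conflicting = [r for r in rows if r[2] != "Artist"]
```

Highlighting the conflicting windows in the original value gives a visual cue for why the value looks mistyped.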
  12. Conclusion
    • Semantic information is difficult to infer from data lakes
    • Similarity information derived from neural networks can be used to construct a hierarchy of semantic types
    • Semantic types are useful for analysis and data cleaning