Semantic Data Understanding with Character Level Learning

Michael Mior, Rochester Institute of Technology
Ken Q. Pu, Ontario Tech University

• Motivation
• Neural Network Architecture
• Similarity Graph Generation
• Semantic Analysis of DBpedia
• Conclusion

Motivation
• Assign semantic types to values in data lakes
• Types should capture semantic similarity
• Find anomalies in any database-assigned types
• Character-level reasoning without prior knowledge

Example
Person: Dorothy Vaughn, Alan Turing, Josef Burg
Settlement: Pittsburgh, Edinburgh, Johannesburg, Salzburg

DBpedia Pattuelli 2013

Network Architecture
• Use a convolutional neural network to identify salient substrings in values that can be used for classification
• Start with an embedding layer for dimensionality reduction
• Use convolutional layers to capture substring patterns
• End with fully connected layers
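The layer sequence above can be illustrated with a minimal forward pass in plain Python. This is a sketch under stated assumptions, not the authors' model: the vocabulary, layer sizes, random untrained weights, and the use of a single global max pool are all illustrative choices, so the output probabilities carry no meaning beyond showing the data flow.

```python
import math
import random

def encode(value, vocab):
    """Map each character to an integer index (0 is reserved for unknown characters)."""
    return [vocab.get(ch, 0) for ch in value]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(indices, params):
    """Embedding -> 1D convolution -> ReLU -> max pool over positions -> dense -> softmax."""
    emb, filters, dense_w, dense_b = params
    x = [emb[i] for i in indices]            # one embedding vector per character
    width = len(filters[0])
    pooled = []
    for f in filters:                        # each filter scans substrings of `width` characters
        best = 0.0                           # starting at 0 acts as ReLU before the max pool
        for start in range(len(x) - width + 1):
            act = sum(f[k][d] * x[start + k][d]
                      for k in range(width) for d in range(len(x[0])))
            best = max(best, act)
        pooled.append(best)
    logits = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(dense_w, dense_b)]
    return softmax(logits)

def init_params(vocab_size, emb_dim, n_filters, width, n_types, seed=0):
    """Random, untrained weights (assumption: only used to show shapes)."""
    rng = random.Random(seed)
    r = lambda: rng.uniform(-0.5, 0.5)
    emb = [[r() for _ in range(emb_dim)] for _ in range(vocab_size + 1)]
    filters = [[[r() for _ in range(emb_dim)] for _ in range(width)]
               for _ in range(n_filters)]
    dense_w = [[r() for _ in range(n_filters)] for _ in range(n_types)]
    dense_b = [0.0] * n_types
    return emb, filters, dense_w, dense_b

TYPES = ["Person", "Settlement"]             # hypothetical two-type label set
vocab = {ch: i + 1 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ")}
params = init_params(len(vocab), emb_dim=4, n_filters=6, width=3, n_types=len(TYPES))
probs = forward(encode("pittsburgh", vocab), params)
```

Each convolution filter responds to a window of consecutive characters, which is what lets a trained network key on substrings such as "burg".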

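Network Architecture
[Figure: the input value "Pittsburgh" is encoded as the character indices 7155683942, then passed through Embedding, CNN, Max pool, Flatten, Dense, and Softmax layers to produce the type Settlement.]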

Applications of Semantic Typing
• Semantic similarity between database types
• Identify a hierarchy among database types
• Detect outliers and dirty data

Semantic Similarity Between Database Types
Many database types are semantically similar.
Examples: ChemicalSubstance, ChemicalCompound, Biomolecule, Protein, NaturalPlace, BodyOfWater
Semantic types can be used to detect and measure the semantic similarities between database types.

Confusion-Based Similarity
We hypothesize that semantic similarity is the main source of confusion for the neural network. Thus, we can infer the degree of semantic similarity between two database types t and t’ from their mutual confusion.
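One way to turn this hypothesis into a number is sketched below. The slide does not give the formula, so the definition used here is an assumption: the confusion rate from t to t’ is the fraction of values assigned type t that the network labels t’, and the similarity is the average of the two directions. The sample labels are made up for illustration.

```python
from collections import Counter, defaultdict

def confusion_rates(assigned, predicted):
    """Row-normalized confusion: rates[t][u] = share of type-t values predicted as u."""
    counts = defaultdict(Counter)
    for t, p in zip(assigned, predicted):
        counts[t][p] += 1
    return {t: {p: n / sum(row.values()) for p, n in row.items()}
            for t, row in counts.items()}

def mutual_confusion(rates, t, u):
    """Symmetric similarity (assumed form): mean of the two confusion directions."""
    return 0.5 * (rates.get(t, {}).get(u, 0.0) + rates.get(u, {}).get(t, 0.0))

# Illustrative labels only, not results from the talk.
assigned = ["Protein"] * 4 + ["Biomolecule"] * 4 + ["Settlement"] * 4
predicted = (["Protein", "Protein", "Biomolecule", "Biomolecule"]
             + ["Biomolecule", "Protein", "Biomolecule", "Biomolecule"]
             + ["Settlement"] * 4)

rates = confusion_rates(assigned, predicted)
sim_close = mutual_confusion(rates, "Protein", "Biomolecule")
sim_far = mutual_confusion(rates, "Protein", "Settlement")
```

In this toy data the two biology types confuse each other often and so receive a higher similarity than Protein and Settlement, which are never confused.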

Semantic Graph
• The pairwise semantic similarity measure allows us to organize all the database types into a similarity graph.
• A logistic function is used to rescale the similarity measure so we can control the spatial layout of the graph with the parameters a and b.
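A minimal sketch of the rescaling step follows. The slide names the parameters a and b but not their roles, so the form used here, with a as the steepness and b as the midpoint of a standard logistic curve, is an assumption; the similarity values are illustrative.

```python
import math

def rescale(s, a, b):
    """Logistic rescaling of a similarity s (assumed form: a = steepness, b = midpoint)."""
    return 1.0 / (1.0 + math.exp(-a * (s - b)))

# Illustrative pairwise similarities, not results from the talk.
similarities = {
    ("Protein", "Biomolecule"): 0.375,
    ("Protein", "Settlement"): 0.0,
}

# Edge weights of the similarity graph after rescaling.
edges = {pair: rescale(s, a=10.0, b=0.2) for pair, s in similarities.items()}
```

Under this form, a larger a pushes weights toward 0 or 1, and b sets the similarity at which an edge weight equals 0.5.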

Semantic Clustering
• Spectral clustering is a time-proven technique to analyze graphs.
• We apply spectral clustering to organize the database types into clusters which represent semantic topics in the database.
• Spectral clustering can be applied recursively so that a hierarchical organization can be obtained.
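The recursive idea can be sketched with the simplest spectral method: split a graph in two by the sign of the Fiedler vector (the eigenvector of the graph Laplacian with the second-smallest eigenvalue), then repeat on each half. This is a simplified stand-in, not necessarily the variant used in the talk; the power iteration, the `depth` parameter, and the toy weights are assumptions made to keep the example dependency-free.

```python
import math

def fiedler_split(nodes, weight, iters=500):
    """Split nodes in two using the sign of an approximate Fiedler vector."""
    n = len(nodes)
    W = [[weight(a, b) if a != b else 0.0 for b in nodes] for a in nodes]
    deg = [sum(row) for row in W]
    c = 2.0 * max(deg) + 1.0                   # exceeds the largest Laplacian eigenvalue
    v = [math.sin(i + 1.0) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]              # project out the constant eigenvector
        # Multiply by (c*I - L) with L = D - W, so the smallest nonzero
        # eigenvalue of L becomes the dominant one for power iteration.
        v = [(c - deg[i]) * v[i] + sum(W[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm for x in v]
    left = [nodes[i] for i in range(n) if v[i] < 0]
    right = [nodes[i] for i in range(n) if v[i] >= 0]
    return left, right

def hierarchy(nodes, weight, depth=2):
    """Apply the two-way split recursively to obtain a nested grouping."""
    if depth == 0 or len(nodes) < 2:
        return list(nodes)
    left, right = fiedler_split(nodes, weight)
    if not left or not right:
        return list(nodes)
    return [hierarchy(left, weight, depth - 1), hierarchy(right, weight, depth - 1)]

# Toy similarity graph: two groups with strong internal and weak external edges.
chem = ["ChemicalSubstance", "ChemicalCompound", "Biomolecule"]
place = ["NaturalPlace", "BodyOfWater", "Settlement"]

def weight(a, b):
    return 0.9 if (a in chem) == (b in chem) else 0.05

clusters = hierarchy(chem + place, weight, depth=1)
```

With `depth=1` the result is a single two-way split; larger values keep subdividing each cluster, which yields the hierarchical organization described above.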

Anomaly Detection
• A type anomaly is a form of dirty data in which data values are assigned incorrect types. Example: “Beethoven’s 9th Symphony” being classified as Artist.
• We can utilize the neural network to detect likely candidates of such dirty data.
• Using substring reasoning, we can also generate a visual explanation of the abnormalities of the identified data values.
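Finding likely candidates might look like the sketch below: flag a value when the network's most probable type disagrees with the assigned type and the network is confident. The threshold, the record layout, and the probabilities are assumptions for illustration, not outputs of the authors' model.

```python
def flag_anomalies(records, threshold=0.9):
    """records: iterable of (value, assigned_type, {type: probability}).

    Returns (value, assigned, inferred, certainty) tuples, most certain first.
    """
    flagged = []
    for value, assigned, probs in records:
        inferred = max(probs, key=probs.get)
        if inferred != assigned and probs[inferred] >= threshold:
            flagged.append((value, assigned, inferred, probs[inferred]))
    return sorted(flagged, key=lambda r: -r[3])

# Made-up classifier outputs; "MusicalWork" is a hypothetical label here.
records = [
    ("Beethoven's 9th Symphony", "Artist", {"Artist": 0.04, "MusicalWork": 0.96}),
    ("Alan Turing", "Person", {"Person": 0.97, "Settlement": 0.03}),
    ("Josef Burg", "Person", {"Person": 0.45, "Settlement": 0.55}),
]

anomalies = flag_anomalies(records)
```

The third record disagrees with its assigned type but falls below the threshold, so it is not reported; lowering `threshold` trades precision for recall.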

Anomaly Detection
[Figure: for each substring in the sliding window, semantic typing compares the assigned type with the inferred type and reports the substring location and the certainty of inference. Example substrings: “...phony...”, “...hoven...”]
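The per-window view on this slide can be approximated by classifying every fixed-width substring and recording its location, inferred type, and certainty. In the sketch below, `toy_classify` is a hypothetical stand-in for the trained network, and its substring hints and probabilities are invented for illustration.

```python
def explain(value, classify, width=5):
    """Return (start, substring, inferred_type, certainty) for every window."""
    rows = []
    for start in range(max(1, len(value) - width + 1)):
        window = value[start:start + width]
        probs = classify(window)
        inferred = max(probs, key=probs.get)
        rows.append((start, window, inferred, probs[inferred]))
    return rows

def toy_classify(text):
    """Hypothetical stand-in for the network: keys on two hard-coded fragments."""
    lowered = text.lower()
    if "phony" in lowered:
        return {"Artist": 0.1, "MusicalWork": 0.9}
    if "hoven" in lowered:
        return {"Artist": 0.9, "MusicalWork": 0.1}
    return {"Artist": 0.5, "MusicalWork": 0.5}

rows = explain("Beethoven's 9th Symphony", toy_classify, width=5)

# Windows whose inferred type contradicts the assigned type "Artist".
conflicts = [r for r in rows if r[2] != "Artist"]
```

Plotting certainty against `start` gives a rough picture of which part of the value drives the disagreement with the assigned type.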

Semantic Clusters

Semantic Graph
[Figure: similarity graph with regions labeled Biology, Chemicals, Events, Agent]

Semantic Clusters Subtype structure within a single cluster

Semantic Clusters

Anomalies model response

Conclusion
• Semantic information is difficult to infer from data lakes
• Similarity information derived from neural networks can be used to construct a hierarchy of semantic types
• Semantic types are useful for analysis and data cleaning

Questions?
