Slide 1

Slide 1 text

Semantic Data Understanding with Character Level Learning
Michael Mior, Rochester Institute of Technology
Ken Q. Pu, Ontario Tech University

Slide 2

Slide 2 text

● Motivation
● Neural Network Architecture
● Similarity Graph Generation
● Semantic Analysis of DBpedia
● Conclusion

Slide 3

Slide 3 text

Motivation
● Assign semantic types to values in data lakes
● Types should capture semantic similarity
● Find anomalies in any database-assigned types
● Character-level reasoning without prior knowledge

Slide 4

Slide 4 text

Example
Person: Dorothy Vaughn, Alan Turing, Josef Burg
Settlement: Pittsburgh, Edinburgh, Johannesburg, Salzburg

Slide 6

Slide 6 text

DBpedia Pattuelli 2013

Slide 7

Slide 7 text

● Motivation
● Neural Network Architecture
● Similarity Graph Generation
● Semantic Analysis of DBpedia
● Conclusion

Slide 8

Slide 8 text

Network Architecture
● Use a convolutional neural network to identify salient substrings in values that can be used for classification
● Start with an embedding layer for dimensionality reduction
● Use convolutional layers to capture substring patterns
● End with fully connected layers

Slide 9

Slide 9 text

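Network Architecture
[Diagram: the input "Pittsburgh" is encoded as the character indices 7 1 5 5 6 8 3 9 4 2, then passed through Embedding → CNN → Max pool → Flatten → Dense → Softmax, which outputs the type Settlement]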
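
The layer sequence in this architecture can be sketched as a forward pass in plain Python. This is only an illustration with random, untrained weights: the vocabulary, the layer sizes, the two-type label set, and the use of a single global max pool (which makes the flatten step trivial) are all assumptions, not details from the talk.

```python
import math
import random

random.seed(0)

# Hypothetical settings; the slides do not give the real values.
VOCAB = "abcdefghijklmnopqrstuvwxyz "
TYPES = ["Person", "Settlement"]
EMBED_DIM, KERNEL_WIDTH, NUM_FILTERS = 4, 3, 5


def rand_matrix(rows, cols):
    """Random weights standing in for trained parameters."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Row 0 of the embedding is reserved for characters outside the vocabulary.
embedding = rand_matrix(len(VOCAB) + 1, EMBED_DIM)
filters = [rand_matrix(KERNEL_WIDTH, EMBED_DIM) for _ in range(NUM_FILTERS)]
dense = rand_matrix(len(TYPES), NUM_FILTERS)


def encode(value):
    """Map each character to an integer index (0 for unknown characters)."""
    return [VOCAB.find(ch) + 1 for ch in value.lower()]


def forward(value):
    """Embedding -> 1D convolution -> global max pool -> dense -> softmax.

    Assumes the value has at least KERNEL_WIDTH characters.
    """
    vectors = [embedding[i] for i in encode(value)]
    pooled = []
    for f in filters:
        # Each filter scores every window of KERNEL_WIDTH consecutive characters.
        scores = [
            sum(f[k][d] * vectors[start + k][d]
                for k in range(KERNEL_WIDTH) for d in range(EMBED_DIM))
            for start in range(len(vectors) - KERNEL_WIDTH + 1)
        ]
        # ReLU, then max pooling over all window positions.
        pooled.append(max(0.0, max(scores)))
    logits = [sum(w * x for w, x in zip(row, pooled)) for row in dense]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return {t: e / total for t, e in zip(TYPES, exps)}


probs = forward("Pittsburgh")
```

Because each filter looks at a short window of characters, a trained filter could respond to a substring such as "burg" wherever it occurs in the value. In practice the same layers would be built and trained with a deep learning framework rather than written by hand.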

Slide 10

Slide 10 text

Applications of Semantic Typing
● Semantic similarity between database types
● Identify a hierarchy among database types
● Detect outliers and dirty data

Slide 11

Slide 11 text

● Motivation
● Neural Network Architecture
● Similarity Graph Generation
● Semantic Analysis of DBpedia
● Conclusion

Slide 12

Slide 12 text

Semantic Similarity Between Database Types
Many database types are semantically similar.
Examples: ChemicalSubstance, ChemicalCompound, Biomolecule, Protein, NaturalPlace, BodyOfWater
Semantic types can be used to detect and measure the semantic similarities between database types.

Slide 13

Slide 13 text

Confusion Based Similarity
We hypothesize that semantic similarity is the main source of the neural network's confusion. Thus, we can infer the degree of semantic similarity between two database types t and t’ based on their mutual confusion, that is, how often values of one type are classified as the other.
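
One way to turn mutual confusion into a number is sketched below. The slide text does not state the exact formula, so the symmetric average of the two directed confusion rates is an assumption, and the labels and predictions are made-up toy data.

```python
def confusion_rates(assigned, predicted, types):
    """Row-normalized confusion: rates[t][u] is the fraction of values
    with assigned type t that the network classified as type u."""
    counts = {t: {u: 0 for u in types} for t in types}
    for a, p in zip(assigned, predicted):
        counts[a][p] += 1
    rates = {}
    for t in types:
        total = sum(counts[t].values())
        rates[t] = {u: (counts[t][u] / total if total else 0.0) for u in types}
    return rates


def similarity(rates, t, u):
    """Assumed symmetric measure: the mean of the two directed confusion rates."""
    return (rates[t][u] + rates[u][t]) / 2


# Toy data for illustration only.
types = ["Protein", "Biomolecule", "Settlement"]
assigned = ["Protein", "Protein", "Protein", "Protein",
            "Biomolecule", "Biomolecule", "Settlement", "Settlement"]
predicted = ["Protein", "Protein", "Biomolecule", "Biomolecule",
             "Protein", "Biomolecule", "Settlement", "Settlement"]

rates = confusion_rates(assigned, predicted, types)
sim_close = similarity(rates, "Protein", "Biomolecule")
sim_far = similarity(rates, "Protein", "Settlement")
```

In this toy data the network mixes up Protein and Biomolecule half of the time and never confuses either with Settlement, so the first pair scores higher than the second.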

Slide 14

Slide 14 text

Semantic Graph
● The pairwise semantic similarity measure allows us to organize all the database types into a similarity graph.
● A logistic function is used to rescale the similarity measure, so we can control the spatial layout of the graph with two parameters, a and b.
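
A minimal sketch of the rescaling step follows. The slide names the parameters a and b but not their roles, so treating a as the steepness and b as the midpoint of the logistic curve is an assumption, as are the toy similarity values.

```python
import math


def rescale(similarity, a, b):
    """Logistic rescaling of a raw similarity score.

    Assumed parameterization: a controls the steepness and b is the
    similarity value that maps to an edge weight of 0.5.
    """
    return 1.0 / (1.0 + math.exp(-a * (similarity - b)))


# Toy pairwise similarities (illustrative values only).
similarities = {
    ("Biomolecule", "Protein"): 0.5,
    ("Biomolecule", "Settlement"): 0.0,
    ("Protein", "Settlement"): 0.0,
}

# Edge weights of the similarity graph after rescaling.
edge_weights = {pair: rescale(s, a=10, b=0.2) for pair, s in similarities.items()}
```

With a large a, pairs above the midpoint b are pushed toward weight 1 and pairs below it toward 0, which separates related and unrelated types more sharply when the graph is drawn.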

Slide 15

Slide 15 text

Semantic Clustering
● Spectral clustering is a time-proven technique for analyzing graphs.
● We apply spectral clustering to organize the database types into clusters that represent semantic topics in the database.
● Spectral clustering can be applied recursively to obtain a hierarchical organization.
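
To make the clustering step concrete, here is a sketch of the simplest spectral method: a two-way split by the sign of the Fiedler vector of the graph Laplacian. This is a simplified stand-in, not necessarily the variant used in the talk; general spectral clustering uses several eigenvectors followed by k-means. The edge weights are made-up, and the eigenvector is approximated with power iteration to keep the example dependency-free.

```python
import math


def fiedler_vector(weights, iterations=500):
    """Approximate the Fiedler vector of the graph Laplacian L = D - W.

    Power iteration is run on (shift * I - L) while projecting out the
    constant vector, so it converges to the eigenvector of L with the
    second-smallest eigenvalue.
    """
    n = len(weights)
    degree = [sum(row) for row in weights]
    shift = 2 * max(degree)  # upper bound on the eigenvalues of L
    v = [float(i) for i in range(n)]  # any non-constant start vector
    for _ in range(iterations):
        mean = sum(v) / n
        v = [x - mean for x in v]  # remove the constant component
        w = [
            (shift - degree[i]) * v[i]
            + sum(weights[i][j] * v[j] for j in range(n))
            for i in range(n)
        ]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def bipartition(names, weights):
    """Split the types into two clusters by the sign of the Fiedler vector."""
    v = fiedler_vector(weights)
    left = [name for name, x in zip(names, v) if x < 0]
    right = [name for name, x in zip(names, v) if x >= 0]
    return left, right


# Toy graph: strong edges (0.9) inside two groups, weak edges (0.05) across.
names = ["ChemicalSubstance", "ChemicalCompound", "Protein",
         "NaturalPlace", "BodyOfWater", "Settlement"]
weights = [[0.0 if i == j else (0.9 if (i < 3) == (j < 3) else 0.05)
            for j in range(6)] for i in range(6)]

left, right = bipartition(names, weights)
```

Calling `bipartition` again on each side, with the corresponding sub-matrix of weights, gives the recursive, hierarchical organization described in the last bullet. A library implementation such as scikit-learn's `SpectralClustering` with a precomputed affinity matrix would be the usual choice for real data.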

Slide 16

Slide 16 text

Anomaly Detection
● A type anomaly is a form of dirty data in which data values are assigned incorrect types. Example: “Beethoven’s 9th Symphony” being classified as Artist.
● We can use the neural network to detect likely candidates of such dirty data.
● Using substring reasoning, we can also generate a visual explanation of why the identified data values are abnormal.
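
The detection step can be sketched as a comparison between the assigned type and the type the network infers. The confidence threshold, the record layout, the probabilities, and the type name MusicalWork are all hypothetical choices made for this illustration.

```python
def find_anomalies(records, min_confidence=0.9):
    """Return records whose inferred type contradicts the assigned type.

    Each record is (value, assigned_type, probabilities), where probabilities
    maps type names to the classifier's softmax output for that value.
    """
    flagged = []
    for value, assigned, probs in records:
        inferred = max(probs, key=probs.get)
        if inferred != assigned and probs[inferred] >= min_confidence:
            flagged.append((value, assigned, inferred, probs[inferred]))
    # Most confident disagreements first.
    return sorted(flagged, key=lambda item: item[3], reverse=True)


# Made-up classifier outputs for illustration.
records = [
    ("Beethoven's 9th Symphony", "Artist", {"Artist": 0.04, "MusicalWork": 0.96}),
    ("Alan Turing", "Person", {"Person": 0.97, "Settlement": 0.03}),
    ("Josef Burg", "Person", {"Person": 0.45, "Settlement": 0.55}),
]

candidates = find_anomalies(records)
```

Only the first record is flagged here: the second agrees with its assigned type, and the third disagrees but with too little certainty to count as a likely anomaly.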

Slide 17

Slide 17 text

Anomaly Detection
[Figure: semantic typing of the substring in the sliding window. Labels: assigned type, inferred type, substring location, certainty of inference. Example substrings: “...phony...”, “...hoven...”]
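
The sliding-window idea behind this figure can be sketched as follows: each fixed-width substring is typed on its own, and its location, inferred type, and certainty are recorded. The window width and the `toy_classify` function are hypothetical stand-ins for the trained network, and the type name MusicalWork is invented for the example.

```python
def sliding_windows(value, width):
    """Yield (start, substring) for every window of `width` characters."""
    for start in range(max(1, len(value) - width + 1)):
        yield start, value[start:start + width]


def explain(value, classify, width=5):
    """Type each window; return (start, substring, inferred_type, certainty)."""
    rows = []
    for start, sub in sliding_windows(value, width):
        probs = classify(sub)
        inferred = max(probs, key=probs.get)
        rows.append((start, sub, inferred, probs[inferred]))
    return rows


def toy_classify(substring):
    """Hypothetical stand-in for the trained network, for illustration only."""
    if "phony" in substring.lower():
        return {"Artist": 0.1, "MusicalWork": 0.9}
    return {"Artist": 0.6, "MusicalWork": 0.4}


rows = explain("Beethoven's 9th Symphony", toy_classify)
```

Each row pairs a substring location with the type inferred from that substring alone, which is the information a visual explanation would highlight on the original value.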

Slide 18

Slide 18 text

● Motivation
● Neural Network Architecture
● Similarity Graph Generation
● Semantic Analysis of DBpedia
● Conclusion

Slide 19

Slide 19 text

Semantic Clusters

Slide 20

Slide 20 text

Semantic Graph
[Figure: similarity graph. Labels: Biology, Chemicals, Events, Agent]

Slide 21

Slide 21 text

Semantic Clusters
Subtype structure within a single cluster

Slide 22

Slide 22 text

Semantic Clusters

Slide 23

Slide 23 text

Semantic Clusters

Slide 24

Slide 24 text

Anomalies
[Figure: model response]

Slide 25

Slide 25 text

Anomalies
[Figure: model response]

Slide 26

Slide 26 text

Conclusion
● Semantic information is difficult to infer from data lakes
● Similarity information derived from neural networks can be used to construct a hierarchy of semantic types
● Semantic types are useful for analysis and data cleaning

Slide 27

Slide 27 text

Questions?