Mining Cardinalities from Knowledge Bases

Emir Muñoz and Matthias Nickles
Fujitsu Ireland Ltd. and Insight Centre for Data Analytics, NUI Galway
DEXA 2017, August 28-31, Lyon, France

Motivation (1/4)
▷ The Resource Description Framework (RDF) is …
  Structured data
  Dynamic data
  Schema-less data
▷ Good for the Web (data integration, transfer, etc.)
▷ Bad for users (reusability, trust, understanding, etc.)
▷ Challenges arise due to the Open World Assumption (OWA) and the non-Unique Name Assumption (nUNA) in OWL/RDF

Motivation (2/4)
▷ Open World Assumption: the truth value of an assertion is not necessarily known. If an assertion is not in the knowledge base, we cannot conclude that it is false
▷ No Unique Name Assumption: individuals may have more than one name

Motivation (3/4)
▷ Domains, ranges, and cardinalities are usually not defined
▷ No central schema!
  • Hard to write queries [1]
  • How am I supposed to reuse these data? Cardinalities!
▷ Many different ontologies
[Figure: graph in which :Ireland and :Irlande are the same entity; edges labelled idemo:population and dbpedia-owl:Population point to 6,378,000, and edges labelled igeo:capitale and dbpedia-owl:Capital point to :Dublin, including :Irlande igeo:capitale :Dublin]
[1] Schmidt, M., Meier, M., Lausen, G.: Foundations of SPARQL query optimization. In: ICDT, ACM (2010) 4-33

Motivation (4/4)
▷ Cardinalities tell us the structure of things (concepts)
[Figure: example concepts annotated with properties and their cardinalities: height, width, weight, legs (2), arms (2), head (1), name, address, age, …; capital (1), counties (many), height, weight, rivers (0 to many), mountains, population, languages (1 to many), time zone]

Related work (1/2)
▷ Cardinality constraints/bounds
  Constraint languages for RDF: ShEx [2], RDD [3], SHACL [4], SPIN [5], OSLC [6]
▷ Consistency in RDF KBs
  No work has focused on extracting cardinalities to detect inconsistencies in KBs
  Previous work focused on missing property values, not cardinalities
▷ RDF schema discovery
  Use of rule mining to infer an ontology
  Use of SPARQL queries to mine simple cardinalities (issues)
[2] https://www.w3.org/2013/ShEx/Primer
[3] P. M. Fischer, G. Lausen, A. Schatzle, and M. Schmidt. RDF Constraint Checking. EDBT/ICDT Workshops 2015.
[4] https://www.w3.org/TR/shacl/
[5] http://spinrdf.org/
[6] https://www.w3.org/Submission/2014/SUBM-shapes-20140211/

Related work (2/2)
▷ The cardinality query problem: how many cities?
  :Ireland dbpedia-owl:city :Dublin
  :Irlande dbpedia-owl:city :Galway
  :Ireland owl:sameAs :Irlande
  SELECT (COUNT(?city) AS ?n) WHERE { :Ireland dbpedia-owl:city ?city . }  returns 1
  SELECT (COUNT(?city) AS ?n) WHERE { :Irlande dbpedia-owl:city ?city . }  returns 1
  Expected answer: 2
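
The counting problem on this slide can be reproduced in a few lines of plain Python. This is a minimal sketch, not the authors' implementation: the two city triples are the ones on the slide, the owl:sameAs edge is assumed to link :Ireland and :Irlande, and the names `ALIASES` and `count_cities` are hypothetical.

```python
# Triples from the slide, as (subject, predicate, object) tuples.
TRIPLES = [
    (":Ireland", "dbpedia-owl:city", ":Dublin"),
    (":Irlande", "dbpedia-owl:city", ":Galway"),
]

# Assumption: the slide's owl:sameAs edge links :Ireland and :Irlande,
# so both names are mapped to one representative name.
ALIASES = {":Ireland": ":Ireland", ":Irlande": ":Ireland"}


def count_cities(subject, triples, aliases=None):
    """Count distinct dbpedia-owl:city values of `subject`.

    With `aliases`, every subject is first mapped to its representative
    name, so that equivalent names are counted together.
    """
    def norm(name):
        return aliases.get(name, name) if aliases else name

    target = norm(subject)
    return len({o for s, p, o in triples
                if p == "dbpedia-owl:city" and norm(s) == target})


print(count_cities(":Ireland", TRIPLES))           # per-name count
print(count_cities(":Irlande", TRIPLES))           # per-name count
print(count_cities(":Ireland", TRIPLES, ALIASES))  # count after merging names
```

Counting per name undercounts because each name only sees its own triples; merging the names first gives the count for the entity.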

Preliminaries (1/2)
▷ Knowledge bases can be represented using the RDF model
▷ RDF does not assume unique names → we need UNA 2.0
RDF model:
  Set of resources $\mathcal{R}$, e.g.: ex:JonSnow
  Set of blank nodes $\mathcal{B}$, e.g.: _:bnode
  Set of predicates $\mathcal{P}$, e.g.: rdf:type
  Set of literals $\mathcal{L}$, e.g.: “Francia@es”
  $G \equiv \{(s, p, o) \in (\mathcal{R} \cup \mathcal{B}) \times \mathcal{P} \times (\mathcal{R} \cup \mathcal{B} \cup \mathcal{L})\}$

Preliminaries (2/2)
▷ Knowledge bases can be represented using the RDF model
▷ RDF does not assume unique names → we need UNA 2.0

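Cardinality bounds in RDF
▷ A cardinality bound in RDF data restricts the number of properties P related to a resource in a given context T
▷ Formally, $\varphi \equiv \mathrm{card}(P, T) = (\mathit{min}, \mathit{max})$
▷ Lower bound $\mathit{min} \in \mathbb{N}$, and upper bound $\mathit{max} \in \mathbb{N} \cup \{\infty\}$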
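
A bound of the form card(P, T) = (min, max) maps naturally onto a small data structure. The sketch below is illustrative only: the `Bound` class, the `counts_per_subject` helper, the `ex:` names, and the toy triples are all hypothetical, and `math.inf` stands in for an unbounded upper limit.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Bound:
    """card(P, T) = (lower, upper); upper defaults to infinity."""
    properties: frozenset
    context: str
    lower: int
    upper: float = math.inf

    def allows(self, n):
        return self.lower <= n <= self.upper


def counts_per_subject(triples, bound):
    """Number of triples with a predicate in P, for each subject of type T."""
    typed = {s for s, p, o in triples
             if p == "rdf:type" and o == bound.context}
    counts = {s: 0 for s in typed}  # entities with no matching triple count 0
    for s, p, o in triples:
        if s in typed and p in bound.properties:
            counts[s] += 1
    return counts


# Hypothetical toy graph: one country with a capital, one without.
TRIPLES = [
    ("ex:Ireland", "rdf:type", "ex:Country"),
    ("ex:Ireland", "ex:capital", "ex:Dublin"),
    ("ex:France", "rdf:type", "ex:Country"),
]

bound = Bound(frozenset({"ex:capital"}), "ex:Country", 1, 1)
counts = counts_per_subject(TRIPLES, bound)
violations = sorted(s for s, n in counts.items() if not bound.allows(n))
print(violations)
```

Starting every typed entity at zero matters: an entity with no matching triple still has a cardinality, and it is the one that breaks a lower bound of 1.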

Mining cardinality patterns (1/6)
▷ In practice, a cardinality bound could be validated using SPARQL 1.1
▷ (1) But a normalization on equality is required
  Two implementations: SPARQL rewrite and programmatic rewrite

Mining cardinality patterns (2/6)
▷ owl:sameAs is reflexive, symmetric and transitive
▷ owl:sameAs-cliques and data rewriting
[Figure: :Ireland, :Irlande, :Irlanda and :Irlandia linked by owl:sameAs edges, shown a second time with additional owl:sameAs edges forming a clique]
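
One way to compute owl:sameAs-cliques and rewrite the data is a union-find pass over the asserted pairs. This is a sketch under stated assumptions rather than the method used in the talk: the function names are hypothetical, the representative is arbitrarily chosen as the lexicographically smallest name, the chain of owl:sameAs pairs is read off the slide's figure, and both subjects and objects are rewritten.

```python
def same_as_cliques(same_as_pairs):
    """Map every name to the representative of its owl:sameAs-clique.

    Union-find yields the reflexive, symmetric and transitive closure
    without materialising every implied owl:sameAs pair.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Assumption: keep the lexicographically smallest name as root.
            if rb < ra:
                ra, rb = rb, ra
            parent[rb] = ra

    return {name: find(name) for name in list(parent)}


def rewrite(triples, representative):
    """Replace every name by its representative and drop owl:sameAs triples."""
    def rep(x):
        return representative.get(x, x)

    return {(rep(s), p, rep(o)) for s, p, o in triples if p != "owl:sameAs"}


PAIRS = [(":Ireland", ":Irlande"), (":Irlande", ":Irlanda"),
         (":Irlanda", ":Irlandia")]
REPRESENTATIVE = same_as_cliques(PAIRS)
DATA = [(":Ireland", "dbpedia-owl:city", ":Dublin"),
        (":Irlande", "dbpedia-owl:city", ":Galway")]
print(sorted(rewrite(DATA, REPRESENTATIVE)))
```

After rewriting, both city triples share the subject :Ireland, so a plain count over the rewritten data sees two cities.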

Mining cardinality patterns (3/6)
▷ (2) Afterwards, cardinalities can be extracted
▷ (3) However, data are not always clean
  Outlier detection and filtering is required
[Figure: box plot labelled min, Q1, median, Q3, max, with the outlier example "arms (4)?!?"]
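
The slide shows a box plot but does not name the filtering rule. The sketch below assumes the common Tukey rule (drop values more than 1.5 interquartile ranges outside Q1 and Q3) and then takes the minimum and maximum of what remains as the bound. The function name, the factor `k`, and the sample counts are hypothetical.

```python
import statistics


def robust_bounds(counts, k=1.5):
    """Return (min, max) of `counts` after dropping IQR outliers.

    Assumption: values outside [Q1 - k*IQR, Q3 + k*IQR] are outliers
    (Tukey's rule); the slide itself only shows a box plot.
    """
    q1, _, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    iqr = q3 - q1
    kept = [c for c in counts if q1 - k * iqr <= c <= q3 + k * iqr]
    return min(kept), max(kept)


# Hypothetical number of 'arms' values observed per person.
arms = [2, 2, 2, 2, 2, 2, 2, 2, 2, 4]
print(robust_bounds(arms))
```

When nearly all counts are identical, the interquartile range is zero and every deviating count is dropped, so noisy real-world data would probably need a tolerance on top of this rule.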

Mining cardinality patterns (4/6)

Mining cardinality patterns (5/6)
[Figure: query annotated with "Representative element" and "equivalence type induced by owl:sameAs-cliques"]
▷ Very expensive query!

Mining cardinality patterns (6/6) Parallelism

Evaluation (1/6)
▷ We took different synthetic and real-world knowledge bases
  Good number of owl:sameAs axioms

Evaluation (2/6)
▷ Qualitative evaluation: runtime
▷ UOBM with owl:sameAs axioms: SPARQL 253.908 sec, Spark 15.634 sec (16x faster)
▷ Mondial without owl:sameAs axioms: SPARQL 117.739 sec, Spark 2.948 sec (40x faster)

Evaluation (3/6)
▷ Quantitative evaluation: consistency and completeness
▷ Randomly selected 1 class per dataset and 5 predicates
▷ A property p in the context of a type T is complete given a cardinality constraint if every entity s of type T has the ‘right number’ of triples (s, p, o); and incomplete otherwise
▷ A predicate p in the context of a type T is consistent if the triples with predicate p and subject of type T comply with the cardinality bounds; and inconsistent otherwise

Evaluation (4/6)
▷ For example:
▷ Completeness: all books should have an author property, but not all should have a review property
▷ Consistency: a single book should have between x and y authors

Evaluation (5/6)
▷ Synthetic datasets are more consistent than real-world ones

Evaluation (6/6)
▷ Subset of cardinality bounds from the Mondial dataset
▷ card({mondial:hasCity}, mondial:Country) = (1, 31)
  Not satisfied by China (306), India (99), USA (250), Brazil (210) and Russia (171)
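
The Mondial example amounts to comparing each country's city count with the mined bound. The sketch below uses only the bound and the counts quoted on the slide; the function name `violations` is hypothetical.

```python
# card({mondial:hasCity}, mondial:Country) = (1, 31), as quoted on the slide.
LOWER, UPPER = 1, 31

# City counts quoted on the slide for the countries outside the bound.
CITY_COUNTS = {"China": 306, "India": 99, "USA": 250,
               "Brazil": 210, "Russia": 171}


def violations(counts, lower, upper):
    """Return the entities whose count falls outside [lower, upper]."""
    return {name: n for name, n in counts.items()
            if not lower <= n <= upper}


# List violators from the largest count down.
for name, n in sorted(violations(CITY_COUNTS, LOWER, UPPER).items(),
                      key=lambda item: -item[1]):
    print(f"{name}: {n} cities, {n - UPPER} above the upper bound")
```

Whether such violations point to errors in the data or to a bound that outlier filtering made too tight is a judgement the check itself cannot make.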

Thanks! Any questions?
Emir Muñoz [email protected]
Key points:
- KBs lack descriptions of cardinalities
- Data normalization is required to extract accurate cardinalities
- Outlier filtering is required to extract robust cardinalities
- Cardinality bounds can help us to assess consistency and completeness
