Mining Cardinalities from Knowledge Bases

Paper presented at DEXA 2017, Lyon, France

Emir Muñoz

August 29, 2017

Transcript

  1. Mining Cardinalities from Knowledge Bases Emir Muñoz and Matthias Nickles

    Fujitsu Ireland Ltd.; Insight Centre for Data Analytics, NUI Galway. DEXA 2017, August 28-31, Lyon, France
  2. Motivation (1/4) ▷ Resource Description Framework (RDF) is …

    structured data, dynamic data, schema-less data. Good for the Web (data integration, transfer, etc.). Bad for users (reusability, trust, understanding, etc.). Challenges arise due to the Open World Assumption (OWA) and the non-Unique Name Assumption (nUNA) in OWL/RDF.
  3. Motivation (2/4) ▷ Open World Assumption: the truth value of

    an assertion is not necessarily known. If an assertion is not in the knowledge base, we cannot conclude that it is false. ▷ No Unique Name Assumption: individuals may have more than one name.
  4. Motivation (3/4) ▷ Domains, ranges, and cardinalities are usually not

    defined. No central schema! • Hard to write queries [1] • How am I supposed to reuse these data? Cardinalities! [Figure: RDF graph drawn from many different ontologies. :Ireland has the population 6,378,000 via both idemo:population and dbpedia-owl:Population, and the capital :Dublin via both igeo:capitale and dbpedia-owl:Capital; :Irlande igeo:capitale :Dublin refers to the same entity.] [1] Schmidt, M., Meier, M., Lausen, G.: Foundations of SPARQL query optimization. In: ICDT, ACM (2010) 4-33
  5. Motivation (4/4) ▷ Cardinalities tell us about the structure of things

    (concepts). Examples from the figure: height, width, weight, legs (2), arms (2), head (1), name, address, age, …; capital (1), counties (many), height, weight, rivers (0 to many), mountains, population, languages (1 to many), time zone.
  6. Related work (1/2) ▷ Cardinality constraints/bounds: constraint languages for RDF:

    ShEx [2], RDD [3], SHACL [4], SPIN [5], OSLC [6]. ▷ Consistency in RDF KBs: no work has focused on extracting cardinalities to detect inconsistencies in KBs; previous work focused on missing property values, not cardinalities. ▷ RDF schema discovery: use of rule mining to infer an ontology; use of SPARQL queries to mine simple cardinalities (issues). [2] https://www.w3.org/2013/ShEx/Primer [3] P. M. Fischer, G. Lausen, A. Schatzle, and M. Schmidt. RDF Constraint Checking. EDBT/ICDT Workshops 2015. [4] https://www.w3.org/TR/shacl/ [5] http://spinrdf.org/ [6] https://www.w3.org/Submission/2014/SUBM-shapes-20140211/
  7. Related work (2/2) ▷ The cardinality query problem: how many

    cities? Data: :Ireland dbpedia-owl:city :Dublin . :Irlande dbpedia-owl:city :Galway . :Ireland owl:sameAs :Irlande . Queries: SELECT COUNT(?city) WHERE { :Ireland dbpedia-owl:city ?city . } returns 1; SELECT COUNT(?city) WHERE { :Irlande dbpedia-owl:city ?city . } returns 1; the count across both names is 2.
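
The cardinality query problem on this slide can be reproduced with a few lines of plain Python. This is only a minimal sketch: the triples are the ones from the slide, stored as plain strings, and the helper names `count_objects` and `count_objects_with_aliases` are hypothetical. The second helper follows only direct owl:sameAs links; longer chains would need a transitive closure.

```python
# Toy data from the slide; prefixed names are kept as plain strings.
SAME_AS = "owl:sameAs"

triples = [
    (":Ireland", "dbpedia-owl:city", ":Dublin"),
    (":Irlande", "dbpedia-owl:city", ":Galway"),
    (":Ireland", SAME_AS, ":Irlande"),
]


def count_objects(triples, subject, predicate):
    """Naive count: distinct objects attached to exactly this subject name."""
    return len({o for s, p, o in triples if s == subject and p == predicate})


def count_objects_with_aliases(triples, subject, predicate):
    """Count distinct objects over the subject and its direct owl:sameAs aliases."""
    aliases = {subject}
    for s, p, o in triples:
        if p == SAME_AS:
            # owl:sameAs is symmetric, so check both directions.
            if s == subject:
                aliases.add(o)
            if o == subject:
                aliases.add(s)
    return len({o for s, p, o in triples if s in aliases and p == predicate})


naive_ireland = count_objects(triples, ":Ireland", "dbpedia-owl:city")
naive_irlande = count_objects(triples, ":Irlande", "dbpedia-owl:city")
merged = count_objects_with_aliases(triples, ":Ireland", "dbpedia-owl:city")
print(naive_ireland, naive_irlande, merged)
```

Counting per name gives 1 for each of the two names, while counting over the merged aliases gives 2, which is the mismatch the slide points at.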
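  8. Preliminaries (1/2) ▷ Knowledge bases can be represented using the RDF

    model ▷ RDF does not assume unique names → we need UNA 2.0. RDF model: • Set of resources $\mathcal{R}$, e.g. ex:JonSnow • Set of blank nodes $\mathcal{B}$, e.g. _:bnode • Set of predicates $\mathcal{P}$, e.g. rdf:type • Set of literals $\mathcal{L}$, e.g. "Francia"@es • Triples: $\mathcal{G} \equiv \{(s, p, o) \in (\mathcal{R} \cup \mathcal{B}) \times \mathcal{P} \times (\mathcal{R} \cup \mathcal{B} \cup \mathcal{L})\}$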
  9. Preliminaries (2/2) ▷ Knowledge bases can be represented using the RDF

    model ▷ RDF does not assume unique names → we need UNA 2.0
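  10. Cardinality bounds in RDF ▷ A cardinality bound in RDF

    data restricts the number of values that a set of properties $P$ can take for a resource in a given context. ▷ Formally, $\varphi \equiv \mathrm{card}(P, t) = (\mathit{min}, \mathit{max})$, where $t$ is the context (e.g. a type). ▷ Lower bound $\mathit{min} \in \mathbb{N}$, and upper bound $\mathit{max} \in \mathbb{N} \cup \{\infty\}$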
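
To make the definition concrete, the following sketch derives a (lower, upper) pair for one property in the context of one type by counting triples per instance. It is an illustration under assumptions, not the method from the talk: it handles a single property rather than a set, the `ex:` data is made up, and `mine_bound` and `satisfies` are hypothetical helper names.

```python
from collections import Counter

RDF_TYPE = "rdf:type"


def mine_bound(triples, prop, rdf_class):
    """Return (min, max) of the number of `prop` triples per instance of `rdf_class`.

    Instances without any `prop` triple count as 0. Returns None when the
    class has no instances.
    """
    unique = set(triples)  # RDF graphs are sets, so ignore duplicate triples
    instances = {s for s, p, o in unique if p == RDF_TYPE and o == rdf_class}
    counts = Counter(s for s, p, o in unique if p == prop and s in instances)
    per_instance = [counts[s] for s in instances]
    if not per_instance:
        return None
    return (min(per_instance), max(per_instance))


def satisfies(count, bound):
    """Check a count against a bound; the upper bound may be float('inf')."""
    lower, upper = bound
    return lower <= count <= upper


# Hypothetical example data.
triples = [
    ("ex:A", RDF_TYPE, "ex:Country"),
    ("ex:B", RDF_TYPE, "ex:Country"),
    ("ex:A", "ex:hasCity", "ex:a1"),
    ("ex:A", "ex:hasCity", "ex:a2"),
    ("ex:B", "ex:hasCity", "ex:b1"),
]

bound = mine_bound(triples, "ex:hasCity", "ex:Country")
print(bound)
```

Here `ex:A` has two cities and `ex:B` has one, so the mined bound is (1, 2). An unbounded upper limit can be expressed as `float("inf")`, matching the $\infty$ in the definition.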
  11. Mining cardinality patterns (1/6) ▷ In practice, a cardinality bound

    could be validated using SPARQL 1.1. ▷ (1) But a normalization on equality is required. Two implementations: SPARQL rewrite and programmatic rewrite.
  12. Mining cardinality patterns (2/6) ▷ owl:sameAs is reflexive, symmetric and

    transitive. owl:sameAs-cliques and data rewriting. [Figure: :Ireland, :Irlande, :Irlanda and :Irlandia connected by owl:sameAs links, shown twice, the second time with additional owl:sameAs links.]
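
One way to picture the programmatic rewrite is a union-find pass that groups names into owl:sameAs-cliques and then replaces every name by one representative per clique. The sketch below is an assumption about how such a rewrite could look, not the implementation from the talk; the function names are hypothetical, and choosing the lexicographically smallest name as representative is an arbitrary but deterministic choice.

```python
SAME_AS = "owl:sameAs"


def sameas_representatives(triples):
    """Map every name in an owl:sameAs link to its clique representative."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller name as root, so each root is the
            # lexicographic minimum of its clique.
            if rb < ra:
                ra, rb = rb, ra
            parent[rb] = ra

    for s, p, o in triples:
        if p == SAME_AS:
            union(s, o)
    return {x: find(x) for x in list(parent)}


def rewrite(triples):
    """Drop owl:sameAs triples and replace each name by its representative."""
    rep = sameas_representatives(triples)
    return {
        (rep.get(s, s), p, rep.get(o, o))
        for s, p, o in triples
        if p != SAME_AS
    }


triples = [
    (":Ireland", SAME_AS, ":Irlande"),
    (":Irlande", SAME_AS, ":Irlanda"),
    (":Irlanda", SAME_AS, ":Irlandia"),
    (":Ireland", "dbpedia-owl:city", ":Dublin"),
    (":Irlandia", "dbpedia-owl:city", ":Galway"),
]

normalized = rewrite(triples)
print(sorted(normalized))
```

After rewriting, both city triples hang off the single representative `:Ireland`, so a plain per-subject count sees two cities.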
  13. Mining cardinality patterns (3/6) ▷ (2) Afterwards, cardinalities can be

    extracted. ▷ (3) However, data are not always clean: outlier detection and filtering are required. [Figure: box plot with min, Q1, median, Q3 and max; example outlier: arms (4)?!?]
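
The slide only shows a box plot, so the exact outlier method is not stated here. As a sketch under that caveat, the snippet below assumes the common box-plot rule (Tukey fences at 1.5 times the interquartile range) and applies it to per-entity counts before taking the minimum and maximum. The function name `filter_outliers`, the factor `k`, and the sample counts are all assumptions for illustration.

```python
import statistics


def filter_outliers(counts, k=1.5):
    """Keep only counts inside [Q1 - k*IQR, Q3 + k*IQR] (assumed box-plot rule)."""
    q1, _median, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [c for c in counts if low <= c <= high]


# Hypothetical counts of an "arms" property per person: one noisy record says 4.
arm_counts = [2, 2, 2, 2, 2, 2, 2, 4]

clean = filter_outliers(arm_counts)
raw_bound = (min(arm_counts), max(arm_counts))
robust_bound = (min(clean), max(clean))
print(raw_bound, robust_bound)
```

Without filtering, the single noisy record stretches the bound to (2, 4); after filtering, the bound is (2, 2).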
  14. Mining cardinality patterns (4/6)

  15. Mining cardinality patterns (5/6) Representative element equivalence type induced by

    owl:sameAs-cliques Very expensive query!
  16. Mining cardinality patterns (6/6) Parallelism

  17. Evaluation (1/6) ▷ We took different synthetic and real-world knowledge

    bases Good number of owl:sameAs axioms
  18. Evaluation (2/6) ▷ Qualitative evaluation: runtime ▷ UOBM with owl:sameAs

    axioms: SPARQL 253.908 sec, Spark 15.634 sec (16x faster) ▷ Mondial without owl:sameAs axioms: SPARQL 117.739 sec, Spark 2.948 sec (40x faster)
  19. Evaluation (3/6) ▷ Quantitative evaluation: consistency and completeness ▷ Randomly

    selected 1 class per dataset and 5 predicates ▷ A property $p$ in the context of a type $t$ is complete given a cardinality constraint if every entity $s$ of type $t$ has the 'right number' of triples $(s, p, o)$; and incomplete otherwise ▷ A predicate $p$ in the context of a type $t$ is consistent if the triples with predicate $p$ and subject of type $t$ comply with the cardinality bounds; and inconsistent otherwise
  20. Evaluation (4/6) ▷ For example: ▷ Completeness: all books should

    have a property author, but not all should have a review property ▷ Consistency: a single book should have between x and y authors
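
The book example can be turned into a small check that lists the entities whose triple count falls outside a given bound. This is a hedged sketch: the `ex:` data, the bound (1, 3) standing in for the slide's "between x and y", and the helper name `violations` are all hypothetical.

```python
from collections import Counter

RDF_TYPE = "rdf:type"


def violations(triples, prop, rdf_class, bound):
    """Return {entity: count} for instances of `rdf_class` outside `bound`."""
    lower, upper = bound
    unique = set(triples)
    instances = {s for s, p, o in unique if p == RDF_TYPE and o == rdf_class}
    counts = Counter(s for s, p, o in unique if p == prop and s in instances)
    return {s: counts[s] for s in instances if not lower <= counts[s] <= upper}


# Hypothetical data: b1 has one author, b2 has none, b3 has four.
triples = [
    ("ex:b1", RDF_TYPE, "ex:Book"),
    ("ex:b2", RDF_TYPE, "ex:Book"),
    ("ex:b3", RDF_TYPE, "ex:Book"),
    ("ex:b1", "ex:author", "ex:alice"),
    ("ex:b3", "ex:author", "ex:bob"),
    ("ex:b3", "ex:author", "ex:carol"),
    ("ex:b3", "ex:author", "ex:dave"),
    ("ex:b3", "ex:author", "ex:erin"),
]

bad = violations(triples, "ex:author", "ex:Book", (1, 3))
print(bad)
```

With the assumed bound (1, 3), `ex:b2` (no author) and `ex:b3` (four authors) are reported; an empty result would mean every book complies with the bound.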
  21. Evaluation (5/6) ▷ Synthetic datasets are more consistent than real-world ones

  22. Evaluation (6/6) ▷ Subset of cardinality bounds from the Mondial

    dataset ▷ card({mondial:hasCity}, mondial:Country) = (1, 31). Not satisfied by China (306), India (99), USA (250), Brazil (210) and Russia (171).
  23. Thanks! Any questions? Emir Muñoz emir@emunoz.org Key points: - KBs

    lack descriptions of cardinalities - Data normalization is required to extract accurate cardinalities - Outlier filtering is required to extract robust cardinalities - Cardinality bounds can help us assess consistency and completeness