
Mining Cardinalities from Knowledge Bases

Emir Muñoz
August 29, 2017


Paper presented at DEXA 2017, Lyon, France



Transcript

  1. Mining Cardinalities from Knowledge Bases
    Emir Muñoz and Matthias Nickles
    Fujitsu Ireland Ltd.
    Insight Centre for Data Analytics, NUI Galway
    DEXA 2017, August 28-31, Lyon, France


  2. Motivation (1/4)
    Resource Description Framework (RDF) is …
    Structured data, dynamic data, schema-less data
    Good for the Web (data integration, transfer, etc.)
    Bad for users (reusability, trust, understanding, etc.)
    Challenges arise due to the Open World Assumption (OWA) and
    non-Unique Name Assumption (nUNA) in OWL/RDF


  3. Motivation (2/4)
    ▷ Open World Assumption:
    The truth value of an assertion is not necessarily known
    If an assertion is not in the knowledge base, we cannot
    conclude that it is false
    ▷ No Unique Name Assumption:
    Individuals may have more than one name


  4. Motivation (3/4)
    ▷ Domains, ranges, and cardinalities are usually not defined
    Example (many different ontologies, same entity):
    :Ireland idemo:population 6,378,000
    :Ireland dbpedia-owl:Population 6,378,000
    :Ireland igeo:capitale :Dublin
    :Ireland dbpedia-owl:Capital :Dublin
    :Irlande igeo:capitale :Dublin
    No central schema!
    • Hard to write queries [1]
    • How am I supposed to reuse these data?
    Cardinalities!
    [1] Schmidt, M., Meier, M., Lausen, G.: Foundations of SPARQL query optimization. In: ICDT, ACM (2010) 4-33


  5. Motivation (4/4)
    ▷ Cardinalities tell us the structure of things (concepts)
    Figure: two example concepts annotated with cardinalities
    • height, width, weight, legs (2), arms (2), head (1), name, address, age
    • capital (1), counties (many), height, weight, rivers (0 to many), mountains, population, languages (1 to many), time zone


  6. Related work (1/2)
    ▷ Cardinality constraints/bounds
    Constraint Languages for RDF: ShEx[2], RDD[3], SHACL[4], SPIN[5], OSLC[6]
    ▷ Consistency in RDF KBs
    No work has focused on the extraction of cardinalities to detect
    inconsistencies in KBs.
    Previous work focused on missing property values, not cardinalities
    ▷ RDF schema discovery
    Use of rule mining to infer an ontology
    Use of SPARQL queries to mine simple cardinalities (issues)
    [2] https://www.w3.org/2013/ShEx/Primer
    [3] P. M. Fischer, G. Lausen, A. Schätzle, and M. Schmidt. RDF Constraint Checking. EDBT/ICDT Workshops 2015.
    [4] https://www.w3.org/TR/shacl/
    [5] http://spinrdf.org/
    [6] https://www.w3.org/Submission/2014/SUBM-shapes-20140211/


  7. Related work (2/2)
    ▷ The cardinality query problem: how many cities?
    :Ireland dbpedia-owl:city :Dublin
    :Irlande dbpedia-owl:city :Galway
    :Ireland owl:sameAs :Irlande
    SELECT (COUNT(?city) AS ?cities)
    WHERE {
      :Ireland dbpedia-owl:city ?city .
    }
    Result: 1
    SELECT (COUNT(?city) AS ?cities)
    WHERE {
      :Irlande dbpedia-owl:city ?city .
    }
    Result: 1
    Expected answer: 2
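
The counting problem on this slide can be reproduced in a few lines of Python. This is a minimal sketch, not the authors' implementation: the triples are hard-coded tuples taken from the slide, `count_cities` and `normalise` are hypothetical helper names, and picking `:Ireland` as the representative name is an assumption.

```python
# Triples from the slide, as (subject, predicate, object) tuples.
triples = [
    (":Ireland", "dbpedia-owl:city", ":Dublin"),
    (":Irlande", "dbpedia-owl:city", ":Galway"),
    (":Ireland", "owl:sameAs", ":Irlande"),
]

def count_cities(subject, data):
    """Naive count: mirrors a COUNT query asked for one subject name only."""
    return len({o for s, p, o in data if s == subject and p == "dbpedia-owl:city"})

# Assumption: :Ireland is chosen as the representative name for both aliases.
canonical = {":Ireland": ":Ireland", ":Irlande": ":Ireland"}

def normalise(data):
    """Rewrite subjects to their representative and drop owl:sameAs triples."""
    return [(canonical.get(s, s), p, o) for s, p, o in data if p != "owl:sameAs"]

naive_ireland = count_cities(":Ireland", triples)
naive_irlande = count_cities(":Irlande", triples)
merged = count_cities(":Ireland", normalise(triples))
print(naive_ireland, naive_irlande, merged)
```

Each alias on its own sees one city; only after the two names are merged does the count reflect both triples.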


  8. Preliminaries (1/2)
    ▷ Knowledge bases can be represented using the RDF model
    ▷ RDF does not assume unique names → we need UNA 2.0
    RDF model
    • Set of resources: ℛ, e.g. ex:JonSnow
    • Set of blank nodes: ℬ, e.g. _:bnode
    • Set of predicates: 𝒫, e.g. rdf:type
    • Set of literals: ℒ, e.g. "Francia"@es
    𝒢 ≡ {(s, p, o) ∈ (ℛ ∪ ℬ) × 𝒫 × (ℛ ∪ ℬ ∪ ℒ)}

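
As an illustration of the RDF model above, the following sketch checks whether a tuple is a well-formed triple: the subject is a resource or blank node, the predicate is a predicate, and the object is a resource, blank node or literal. The sets `R`, `B`, `P`, `L` are toy stand-ins for the slide's sets, and `ex:Person`, `ex:label` and `is_rdf_triple` are hypothetical names introduced only for this example.

```python
# Toy vocabularies mirroring the slide's sets; every member besides the
# slide's own examples is a hypothetical placeholder.
R = {"ex:JonSnow", "ex:Person"}   # resources
B = {"_:bnode"}                   # blank nodes
P = {"rdf:type", "ex:label"}      # predicates
L = {'"Francia"@es'}              # literals

def is_rdf_triple(triple):
    """Check that (s, p, o) lies in (R ∪ B) × P × (R ∪ B ∪ L)."""
    s, p, o = triple
    return s in (R | B) and p in P and o in (R | B | L)

ok = is_rdf_triple(("ex:JonSnow", "rdf:type", "ex:Person"))
# A literal in subject position is rejected by the model.
bad = is_rdf_triple(('"Francia"@es', "ex:label", "ex:JonSnow"))
print(ok, bad)
```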

  9. Preliminaries (2/2)
    ▷ Knowledge bases can be represented using the RDF model
    ▷ RDF does not assume unique names → we need UNA 2.0


  10. Cardinality bounds in RDF
    ▷ A cardinality bound in RDF data restricts the number of
    properties P related to a resource in a given context
    ▷ Formally, a bound has the form card(P, T) = (min, max), for a set of properties P and a context type T
    ▷ Lower bound min ∈ ℕ, and upper bound max ∈ ℕ ∪ {∞}
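
One way to hold such a bound in code is a small record with a lower and an upper limit, where `math.inf` stands in for ∞. This is a sketch under assumptions: the class name `Bound`, its field names, the `admits` method and the `ex:` identifiers are all hypothetical, and the two example bounds only echo the "capital (1)" and "rivers (0 to many)" annotations from the motivation slide.

```python
import math
from typing import NamedTuple

class Bound(NamedTuple):
    """A cardinality bound (min, max) for a set of properties in a context."""
    properties: frozenset
    context: str
    lower: int     # min, a natural number
    upper: float   # max, a natural number or math.inf for "unbounded"

    def admits(self, n: int) -> bool:
        """True if an entity with n values satisfies the bound."""
        return self.lower <= n <= self.upper

# Hypothetical identifiers; the intervals echo "capital (1)" and "rivers (0 to many)".
capital = Bound(frozenset({"ex:capital"}), "ex:Country", 1, 1)
rivers = Bound(frozenset({"ex:river"}), "ex:Country", 0, math.inf)

print(capital.admits(1), capital.admits(2), rivers.admits(0), rivers.admits(500))
```

Using a float for the upper limit keeps the comparison uniform: no special case is needed for unbounded properties.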


  11. Mining cardinality patterns (1/6)
    ▷ In practice, a cardinality bound could be validated using SPARQL 1.1
    ▷ (1) But normalization with respect to equality (owl:sameAs) is required first
    Two implementations: SPARQL rewrite and programmatic rewrite


  12. Mining cardinality patterns (2/6)
    ▷ owl:sameAs is reflexive, symmetric and transitive
    owl:sameAs-cliques and data rewriting
    Figure: :Ireland, :Irlande, :Irlanda and :Irlandia linked by owl:sameAs, shown before and after rewriting
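
Because owl:sameAs is an equivalence relation, its cliques can be computed as connected components. The sketch below uses a union-find structure for that and then rewrites every triple to use one representative per clique. It is an illustration, not the authors' SPARQL or programmatic rewrite: the function names are hypothetical, the chain of owl:sameAs links between the four names is assumed, and choosing the lexicographically smallest name as representative is an arbitrary convention.

```python
def sameas_representatives(triples):
    """Map every node of an owl:sameAs-clique to a single representative."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == "owl:sameAs":
            rs, ro = find(s), find(o)
            if rs != ro:
                # Convention (assumption): smallest name becomes the root.
                if ro < rs:
                    rs, ro = ro, rs
                parent[ro] = rs
    return {node: find(node) for node in list(parent)}

def rewrite(triples):
    """Replace aliases by representatives and drop the owl:sameAs triples."""
    rep = sameas_representatives(triples)
    return {(rep.get(s, s), p, rep.get(o, o))
            for s, p, o in triples if p != "owl:sameAs"}

data = [
    (":Ireland", "owl:sameAs", ":Irlande"),    # assumed link structure
    (":Irlande", "owl:sameAs", ":Irlanda"),
    (":Irlanda", "owl:sameAs", ":Irlandia"),
    (":Ireland", "igeo:capitale", ":Dublin"),
    (":Irlande", "igeo:capitale", ":Dublin"),
]
rep = sameas_representatives(data)
rewritten = rewrite(data)
print(rep[":Irlandia"], rewritten)
```

After rewriting, the two capital triples collapse into one, so a later count over `:Ireland` is not inflated or split by aliases.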


  13. Mining cardinality patterns (3/6)
    ▷ (2) Afterwards, cardinalities can be extracted
    ▷ (3) However, data are not always clean
    Outlier detection and filtering are required
    Figure: box plot (min, Q1, median, Q3, max) with an outlier, e.g. arms (4)?!?
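
To make the effect of outlier filtering concrete, here is a sketch that derives a (min, max) bound from observed counts after discarding values outside box-plot fences. The slide only shows a box plot, so the specific rule used here (Tukey's fences at 1.5 times the interquartile range) is an assumption, as are the function name `robust_bound` and the sample data.

```python
import statistics

def robust_bound(counts, k=1.5):
    """Return (min, max) of the counts that fall inside the box-plot fences.

    Assumes at least two observations; k=1.5 is the conventional Tukey factor.
    """
    q1, _, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    kept = [c for c in counts if low_fence <= c <= high_fence]
    return min(kept), max(kept)

# Hypothetical observations: nine entities with 2 arms, one noisy entity with 4.
arm_counts = [2] * 9 + [4]

raw = (min(arm_counts), max(arm_counts))
robust = robust_bound(arm_counts)
print(raw, robust)
```

Without filtering, a single noisy entity stretches the bound to (2, 4); with the fences applied, the bound stays at (2, 2).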


  14. Mining cardinality patterns (4/6)


  15. Mining cardinality patterns (5/6)
    Query annotations: representative element; equivalence type induced by owl:sameAs-cliques; very expensive query!


  16. Mining cardinality patterns (6/6)
    Parallelism


  17. Evaluation (1/6)
    ▷ We took different synthetic and real-world knowledge bases
    Good number of owl:sameAs axioms


  18. Evaluation (2/6)
    ▷ Qualitative evaluation: runtime
    ▷ UOBM with owl:sameAs axioms
    SPARQL 253.908 sec, Spark 15.634 sec (16x faster)
    ▷ Mondial without owl:sameAs axioms
    SPARQL 117.739 sec, Spark 2.948 sec (40x faster)
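
The speed-up factors follow directly from the reported runtimes, pairing each SPARQL time with the Spark time for the same dataset. A quick check, with the dictionary layout being only an illustrative choice:

```python
# Runtimes in seconds, as reported on the slide.
runtimes = {
    "UOBM": {"SPARQL": 253.908, "Spark": 15.634},
    "Mondial": {"SPARQL": 117.739, "Spark": 2.948},
}

# Speed-up = SPARQL runtime divided by Spark runtime.
speedups = {name: t["SPARQL"] / t["Spark"] for name, t in runtimes.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```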


  19. Evaluation (3/6)
    ▷ Quantitative evaluation: consistency and completeness
    ▷ Randomly selected 1 class per dataset and 5 predicates
    ▷ A property p in the context of a type T is complete given a
    cardinality constraint if every entity s of type T has the ‘right
    number’ of triples (s, p, o); and incomplete otherwise
    ▷ A predicate p in the context of a type T is consistent if the
    triples with predicate p and subject of type T comply with
    the cardinality bounds; and inconsistent otherwise


  20. Evaluation (4/6)
    ▷ For example:
    ▷ Completeness:
    All books should have a property author, but not all
    should have a review property
    ▷ Consistency
    A single book should have between x and y authors
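
The book example can be turned into two small checks. This sketch takes one possible reading of the definitions, stated here as an assumption: a property is incomplete for entities that have fewer values than the lower bound, and inconsistent for entities whose count falls outside the (min, max) interval. The data, the `ex:` names, the helper functions and the concrete bound (1, 3) standing in for "between x and y" are all hypothetical.

```python
import math

# Hypothetical book data as (subject, predicate, object) tuples.
triples = [
    ("ex:b1", "ex:author", "ex:alice"),
    ("ex:b1", "ex:author", "ex:bob"),
    ("ex:b2", "ex:review", "ex:r1"),
    ("ex:b3", "ex:author", "ex:carol"),
]
books = ["ex:b1", "ex:b2", "ex:b3"]

def value_counts(data, prop, entities):
    """Number of distinct values of prop for each entity (0 if none)."""
    return {e: len({o for s, p, o in data if s == e and p == prop})
            for e in entities}

def incomplete(counts, bound):
    """Entities with fewer values than the lower bound (assumed reading)."""
    low, _ = bound
    return sorted(e for e, n in counts.items() if n < low)

def inconsistent(counts, bound):
    """Entities whose count lies outside the (min, max) interval."""
    low, high = bound
    return sorted(e for e, n in counts.items() if not low <= n <= high)

author_counts = value_counts(triples, "ex:author", books)
review_counts = value_counts(triples, "ex:review", books)

missing_author = incomplete(author_counts, (1, 3))          # author is mandatory
missing_review = incomplete(review_counts, (0, math.inf))   # review is optional
too_many_or_few = inconsistent(author_counts, (1, 1))
print(missing_author, missing_review, too_many_or_few)
```

Here `ex:b2` lacks an author and is flagged as incomplete, while no book is flagged for a missing review because the lower bound for reviews is 0.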


  21. Evaluation (5/6)
    ▷ Synthetic datasets are more consistent than real-world ones


  22. Evaluation (6/6)
    ▷ Subset of cardinality bounds from the Mondial dataset
    ▷ card({mondial:hasCity}, mondial:Country) = (1, 31)
    Not satisfied by China (306), India (99), USA (250), Brazil (210)
    and Russia (171)
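
As a worked check of this bound, the snippet below compares the city counts listed on the slide with the mined interval (1, 31) and reports how far each country exceeds the upper limit. Only the numbers come from the slide; the variable names are illustrative.

```python
# Mined bound for mondial:hasCity in the context mondial:Country.
lower, upper = 1, 31

# City counts listed on the slide for the violating countries.
city_counts = {"China": 306, "India": 99, "USA": 250, "Brazil": 210, "Russia": 171}

# Countries whose count falls outside [lower, upper], with the excess over upper.
violations = {c: n - upper for c, n in city_counts.items()
              if not lower <= n <= upper}

for country, excess in violations.items():
    print(f"{country}: {excess} cities above the upper bound")
```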


  23. Thanks!
    Any questions?
    Emir Muñoz
    [email protected]
    Key points:
    - KBs lack the description of cardinalities
    - Data normalization is required to extract accurate cardinalities
    - Outlier filtering is required to extract robust cardinalities
    - Cardinality bounds can help us to assess consistency and completeness
