
Learning Content Patterns From Linked Data

Emir Muñoz
October 20, 2014


Full paper LD4IE 2014 Workshop @ ISWC


Transcript

  1. Emir Muñoz
    Fujitsu (Ireland) Limited
    National University of Ireland Galway
    LD4IE 2014 @ ISWC, Riva del Garda, Trentino, Italy. Oct 20th, 2014
    http://bit.ly/1xYTR6Z
    (@emir_munoz)


  2. 2



  3. Domain(predicate)  ??
    Range(predicate)  ??

  4. Let’s run the following SPARQL query over the endpoint:
    select distinct ?obj where
    {?sub ?obj}
    The endpoint response is a table with the values for the isbn property:
    0
    71090
    6176526
    2
    2.7073
    140043853
    1107020697
    2940013968264
    0978-02-02+02:00
    http://dbpedia.org/resource/N/a
    "?"@en
    "ISBN 0-312-85182-0"@en
    "See text"@en
    "various"@en
    "ISBN 978-0-465-02656-2, ISBN 0-14-017997-6"@en
    "ISBN 0-553-07875-5 & ISBN 0-553-56166-9"@en
    "The Claiming of Sleeping Beauty: ISBN 0-452-26656-4"@en
    "-2.0"^^
    "TBA"@en
    "not available"@en
    "[[#Bibliography"@en
    And some more ...
    So, what is the correct range for the isbn property?


  5. • LOV Statistics (as of July 7th, 2014):
      • 446 vocabularies
      • 10 classes and 20 properties on average
    The range of isbn is
    http://schema.org/Text


  6. …but still, is it what I’m looking for? What is the syntax?


  7. • Property: apoapsis
    • Etymology
      apo- + apsis
    • Noun
      apoapsis (plural apoapsides)
      (astronomy) The point of a body's elliptical orbit about the system's centre of mass where the distance between the body and the centre of mass is at its maximum.
      [http://en.wiktionary.org/wiki/apoapsis]
    [Figure labels: Earth, Satellite]
    dbr:17049_Miron dbo:apoapsis 4.01288e+11

  8. https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/ontology/OntologyDatatypes.scala



  9. 1488-07-28+02:00 "September 2012"@en "--08-26+02:00"^^
    1982-05-23+02:00 "August 2012"@en "--01-24+02:00"^^
    2007-04-11+02:00 "July 2009"@en "--06-11+02:00"^^
    Lerman et al. (JAIR 2003)
    First column: [NUM-NUM-NUM+NUM:NUM] (plain literal)
    Second column: [ALPHANUM] (plain literal + lang)
    Third column: [--NUM-NUM+NUM:NUM] (typed literal)


  10. Let be the set of
    content patterns.
    Lerman et al. (JAIR 2003)
    More specific categories
    For the input set:
    That generates the following patterns:
    Values are decomposed into tokens, and each token is represented by a
    syntactic class.
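
The decomposition into tokens and syntactic classes can be sketched as follows. This is only an illustration, not the authors' implementation: the tokenizer regex, the function names, and the digit-count thresholds for the SMALL/MEDIUM/LARGE number classes are assumptions. The class names themselves are the ones that appear in the patterns on later slides.

```python
import re

# One token is a decimal number, a run of word characters, or a single
# punctuation character. Whitespace only separates tokens.
TOKEN_RE = re.compile(r"\d+\.\d+|\w+|[^\w\s]")


def tokenize(value):
    """Split a literal value into tokens."""
    return TOKEN_RE.findall(value)


def token_classes(token):
    """Return the classes of a token, ordered from general to specific.

    The digit-count thresholds for SMALL/MEDIUM/LARGE are assumptions of
    this sketch; the slides do not state them.
    """
    if re.fullmatch(r"\d+\.\d+", token):
        return ["ALPHANUMERIC", "NUMBER", "LARGE/FLOAT_NUMBER"]
    if re.fullmatch(r"\d+", token):
        if len(token) <= 2:
            size = "SMALL_NUMBER"
        elif len(token) <= 4:
            size = "MEDIUM_NUMBER"
        else:
            size = "LARGE/FLOAT_NUMBER"
        return ["ALPHANUMERIC", "NUMBER", size]
    if token.isalpha():
        if token.islower():
            return ["ALPHANUMERIC", "ALPHA", "ALL_LOWERCASE"]
        if token.isupper():
            return ["ALPHANUMERIC", "ALPHA", "ALL_UPPERCASE"]
        return ["ALPHANUMERIC", "ALPHA"]
    if re.fullmatch(r"\w+", token):
        return ["ALPHANUMERIC"]  # mixed letters, digits or underscores
    return ["PUNCTUATION", token]  # most specific form is the symbol itself


def most_specific_pattern(value):
    """Describe a value by the most specific class of each token."""
    return " ".join(token_classes(t)[-1] for t in tokenize(value))


def most_general_pattern(value):
    """Describe a value by the most general class of each token."""
    return " ".join(token_classes(t)[0] for t in tokenize(value))


print(tokenize("1488-07-28+02:00"))
print(most_specific_pattern("2014-10-20"))
print(most_general_pattern("2014-10-20"))
```

Keeping the classes of a token in a general-to-specific list makes it easy to emit one pattern per level of generality for the same value.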

  11. Version 3.9
    • 2.4 billion RDF triples
    • 53,230 properties
    Split Method
    • 19.25% plain literals
    • 18.02% typed literals
    • 62.73% without lang or datatype (xsd:string)

  12. • For the apoapsis example, we extracted one pattern:
      http://dbpedia.org/ontology/apoapsis LARGE/FLOAT_NUMBER 1.0
    • And we also found some other related properties:
      http://dbpedia.org/ontology/Planet/apoapsis LARGE/FLOAT_NUMBER 1.0
      http://dbpedia.org/ontology/Spacecraft/apoapsis LARGE/FLOAT_NUMBER 1.0
      http://dbpedia.org/property/apoapsis NUMBER 0.9230769230769231
      http://dbpedia.org/property/apoapsis LARGE/FLOAT_NUMBER 0.75213675
    • For the date example, we extracted 7 patterns:
      http://dbpedia.org/property/date -- SMALL_NUMBER - SMALL_NUMBER 0.2
      http://dbpedia.org/property/date ALPHANUMERIC MEDIUM_NUMBER 0.166
      http://dbpedia.org/property/date ALPHANUMERIC 2012 0.032
      http://dbpedia.org/property/date ALPHANUMERIC.ALPHANUMERIC 0.012
      And more …
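
One plausible reading of the number attached to each pattern is the fraction of a property's values that the pattern covers; the slides do not define it, so this interpretation is an assumption. Under that assumption, a minimal sketch of the scoring could look like this. The sample values, function names, and number thresholds are hypothetical.

```python
import re
from collections import Counter


def number_class(token):
    """Size class of a numeric token (thresholds are assumptions)."""
    if "." in token or len(token) > 4:
        return "LARGE/FLOAT_NUMBER"
    return "MEDIUM_NUMBER" if len(token) > 2 else "SMALL_NUMBER"


def patterns(value):
    """Return the general and the specific pattern describing a value."""
    general, specific = [], []
    for tok in re.findall(r"\d+\.\d+|\w+|[^\w\s]", value):
        if re.fullmatch(r"\d+(?:\.\d+)?", tok):
            general.append("NUMBER")
            specific.append(number_class(tok))
        elif re.fullmatch(r"\w+", tok):
            general.append("ALPHANUMERIC")
            specific.append("ALPHANUMERIC")
        else:  # punctuation is kept as a literal symbol
            general.append(tok)
            specific.append(tok)
    return {" ".join(general), " ".join(specific)}


def score_patterns(values):
    """Map each pattern to the share of values it covers."""
    if not values:
        return {}
    counts = Counter()
    for value in values:
        counts.update(patterns(value))  # a set: one count per value
    return {p: c / len(values) for p, c in counts.items()}


# Hypothetical values for one property, for illustration only.
sample = ["401288000000", "1.52", "12", "unknown"]
for pattern, score in sorted(score_patterns(sample).items()):
    print(pattern, score)
```

Because a general and a specific pattern can cover the same value, the scores of one property need not sum to 1, which is consistent with the two scores listed for http://dbpedia.org/property/apoapsis.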

  13. • The user has this value: “2014-10-20”.
    • What property can they use?
      dbp:dateCreated, dbp:dateOfProduction, dbp:dateOpened, dbp:dateSigned, dbp:dateOfPremiere, dbp:date, among others.
    • What is the property dbp:admCtrOf used for?
      "town of republic significance of Meleuz"@en (http://dbpedia.org/resource/Meleuz)
      "town of oblast significance of Oktyabrsk"@en (http://dbpedia.org/resource/Oktyabrsk)
      "town of republic significance of Sortavala"@en (http://dbpedia.org/resource/Sortavala)
      → It is used to declare Administrative Control Of
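
Searching for candidate properties for a given value can be sketched as a lookup against a table of (property, pattern, score) entries. The code below is an assumption-laden illustration: the table holds only a few entries copied from other slides of this deck, so its answers differ from the property list above, and the matching rule, function names, and number thresholds are not taken from the talk.

```python
import re

TOKEN_RE = re.compile(r"\d+\.\d+|\w+|[^\w\s]")

# A tiny pattern table with entries shown elsewhere in the slides.
PATTERN_DB = [
    ("http://dbpedia.org/ontology/apoapsis", "LARGE/FLOAT_NUMBER", 1.0),
    ("http://dbpedia.org/property/date", "ALPHANUMERIC MEDIUM_NUMBER", 0.166),
    ("vcard:bday", "NUMBER - SMALL_NUMBER - SMALL_NUMBER", 0.5),
    ("vcard:bday", "MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER", 0.5),
]


def token_classes(token):
    """Names a pattern part may use for this token (thresholds assumed)."""
    classes = {token}  # a pattern part may also be the literal token
    if re.fullmatch(r"\d+(?:\.\d+)?", token):
        classes |= {"ALPHANUMERIC", "NUMBER"}
        if "." in token or len(token) > 4:
            classes.add("LARGE/FLOAT_NUMBER")
        elif len(token) > 2:
            classes.add("MEDIUM_NUMBER")
        else:
            classes.add("SMALL_NUMBER")
    elif re.fullmatch(r"\w+", token):
        classes.add("ALPHANUMERIC")
    return classes


def matches(value, pattern):
    """True if every token of the value fits the aligned pattern part."""
    tokens = TOKEN_RE.findall(value)
    parts = pattern.split()
    return len(tokens) == len(parts) and all(
        part in token_classes(tok) for tok, part in zip(tokens, parts)
    )


def find_properties(value, db=PATTERN_DB):
    """Return (property, best score) pairs, highest score first."""
    best = {}
    for prop, pattern, score in db:
        if matches(value, pattern):
            best[prop] = max(best.get(prop, 0.0), score)
    return sorted(best.items(), key=lambda item: -item[1])


print(find_properties("2014-10-20"))
print(find_properties("September 2012"))
```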

  14. • Check for atypical values (outliers)
    • Take a close look at the most (in)frequent patterns
    • Possible errors during automatic extraction
    • For the dbp:isbn property we can find the following values:
      "summer or autumn 380"@en
      "Late November"@en
      "Fall 1040"@en
      680
      "December, 67 BC"@en
      "April-July 1799"@en
      http://dbpedia.org/resource/New_Year's_Day
      http://dbpedia.org/resource/Second_Intermediate_Period_of_Egypt
      "New moon day of Kartika, celebrations begin two days prior and end two days after that date"@en
    Are they or values?
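
Flagging atypical values can be sketched by grouping values by a coarse pattern and reporting those whose pattern is rare. This is an illustrative sketch only: the coarse NUM/ALPHA shape, the 0.2 threshold, and the mixed sample list (well-formed dates plus a few of the free-text values) are assumptions chosen for the example.

```python
import re
from collections import Counter


def shape(value):
    """Coarse pattern: digit runs become NUM, letter runs become ALPHA."""
    return re.sub(
        r"\d+|[^\W\d_]+",
        lambda m: "NUM" if m.group()[0].isdigit() else "ALPHA",
        value,
    )


def outliers(values, min_share=0.2):
    """Return values whose coarse pattern covers less than min_share
    of all values (the threshold is an arbitrary choice for this sketch)."""
    shapes = [shape(v) for v in values]
    counts = Counter(shapes)
    return [
        v for v, s in zip(values, shapes)
        if counts[s] / len(values) < min_share
    ]


# Illustrative sample mixing regular dates with free-text values.
sample = [
    "1488-07-28+02:00",
    "Fall 1040",
    "1982-05-23+02:00",
    "680",
    "2007-04-11+02:00",
    "Late November",
]
print(shape("1488-07-28+02:00"))
print(outliers(sample))
```

A single substitution pass with a callback is used so that the inserted class names are not themselves rewritten as letter runs.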

  15. LD4IE Challenge 2014
    A vCard may be annotated with the hCard microformat:
    E-mail: [email protected]
    Given name: John
    Surname: Snow
    Birthday: 1986-02-14
    We can use our database to extract and validate the email:
    vcard:email mailto : ALPHA PUNCTUATION ALL_LOWERCASE . ALL_LOWERCASE 0.82
    vcard:email mailto : ALPHA PUNCTUATION ALL_LOWERCASE . com 0.69
    vcard:email mailto : ALPHA @ ALPHANUMERIC . ALL_LOWERCASE 0.54
    vcard:email mailto : ALPHA @ ALPHANUMERIC . com 0.46
    vcard:email mailto : ALL_UPPERCASE ****@ ALL_LOWERCASE . ALL_LOWERCASE 0.36
    …also the birthday:
    vcard:bday NUMBER - SMALL_NUMBER - SMALL_NUMBER 0.5
    vcard:bday MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER 0.5
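
One way to validate a candidate value against such patterns is to translate each pattern into a regular expression. The sketch below assumes a mapping from class names to regexes and digit-count limits for the number classes; neither is taken from the talk, and the e-mail address used is a made-up example because the one on the slide is not recoverable.

```python
import re

# Regular expressions for the classes used below. The digit-count limits
# for SMALL_NUMBER and MEDIUM_NUMBER are assumptions of this sketch.
CLASS_REGEX = {
    "ALPHANUMERIC": r"\w+",
    "ALPHA": r"[^\W\d_]+",
    "ALL_LOWERCASE": r"[a-z]+",
    "ALL_UPPERCASE": r"[A-Z]+",
    "NUMBER": r"\d+(?:\.\d+)?",
    "SMALL_NUMBER": r"\d{1,2}",
    "MEDIUM_NUMBER": r"\d{3,4}",
    "PUNCTUATION": r"[^\w\s]",
}

EMAIL_PATTERNS = [
    "mailto : ALPHA @ ALPHANUMERIC . ALL_LOWERCASE",
    "mailto : ALPHA @ ALPHANUMERIC . com",
]
BDAY_PATTERNS = [
    "NUMBER - SMALL_NUMBER - SMALL_NUMBER",
    "MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER",
]


def compile_pattern(pattern):
    """Turn a space-separated pattern into a compiled regex.

    Known class names map to their regex; anything else is a literal.
    Optional whitespace is allowed between consecutive parts.
    """
    parts = [CLASS_REGEX.get(p, re.escape(p)) for p in pattern.split()]
    return re.compile(r"\s*".join(parts))


def is_valid(value, patterns):
    """True if the whole value matches at least one pattern."""
    return any(compile_pattern(p).fullmatch(value) for p in patterns)


print(is_valid("mailto:john@example.com", EMAIL_PATTERNS))
print(is_valid("1986-02-14", BDAY_PATTERNS))
print(is_valid("14/02/1986", BDAY_PATTERNS))
```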

  16. • Extraction of lexico-syntactic patterns from LD datasets
    • Different use cases:
      • Search for properties
      • Validation of values
      • Information extraction based on patterns
    • Future work:
      • Study of consistency analysis of knowledge bases
      • Extension of patterns to cover other knowledge bases
      • Among others
    500,000 content patterns


  17. http://emunoz.org
    @emir_munoz
    [email protected]
    https://github.com/emir-munoz/ld-patterns/
