Learning Content Patterns From Linked Data

Emir Muñoz
October 20, 2014

Full paper LD4IE 2014 Workshop @ ISWC

Transcript

  1. 1.

    Emir Muñoz
    Fujitsu (Ireland) Limited
    National University of Ireland Galway

    LD4IE 2014 @ ISWC, Riva del Garda, Trentino, Italy. Oct 20th, 2014
    http://bit.ly/1xYTR6Z (@emir_munoz)
  3. 4.

    Let’s run the following SPARQL query over the endpoint:

        select distinct ?obj where {?sub <http://dbpedia.org/property/isbn> ?obj}

    The endpoint response is a table with the values for the isbn property:

        0
        71090
        6176526
        2
        2.7073
        140043853
        1107020697
        2940013968264
        0978-02-02+02:00
        http://dbpedia.org/resource/N/a
        "?"@en
        "ISBN 0-312-85182-0"@en
        "See text"@en
        "various"@en
        "ISBN 978-0-465-02656-2, ISBN 0-14-017997-6"@en
        "ISBN 0-553-07875-5 & ISBN 0-553-56166-9"@en
        "The Claiming of Sleeping Beauty: ISBN 0-452-26656-4"@en
        "-2.0"^^<http://dbpedia.org/datatype/second>
        "TBA"@en
        "not available"@en
        "[[#Bibliography"@en

    And some more ...

    So, what is the correct range for isbn?
  4. 5.

    LOV Statistics (as of July 7th, 2014):
    - 446 vocabularies
    - 10 classes and 20 properties on average

    The range of isbn is http://schema.org/Text
  5. 7.

    Property: apoapsis [http://en.wiktionary.org/wiki/apoapsis]

    Etymology
    - apo- + apsis

    Noun
    - apoapsis (plural apoapsides)
    - (astronomy) The point of a body's elliptical orbit about the system's centre of mass where the distance between the body and the centre of mass is at its maximum.

    Figure labels: Earth, Satellite

    dbr:17049_Miron dbo:apoapsis 4.01288e+11
  6. 9.

    <subject, predicate, object>

    Values of <http://dbpedia.org/property/date>:

    First column        Second column          Third column
    1488-07-28+02:00    "September 2012"@en    "--08-26+02:00"^^<http://www.w3.org/2001/XMLSchema#gMonthDay>
    1982-05-23+02:00    "August 2012"@en       "--01-24+02:00"^^<http://www.w3.org/2001/XMLSchema#gMonthDay>
    2007-04-11+02:00    "July 2009"@en         "--06-11+02:00"^^<http://www.w3.org/2001/XMLSchema#gMonthDay>

    Patterns (Lerman et al., JAIR 2003):
    - First column: [NUM-NUM-NUM+NUM:NUM] (plain literal)
    - Second column: [ALPHA<space>NUM] (plain literal + lang)
    - Third column: [--NUM-NUM+NUM:NUM] (typed literal)
  7. 10.

    Values are decomposed into tokens, and each token is represented by a syntactic class.
    Token classes: Lerman et al. (JAIR 2003), with more specific categories.

    Let […] be the set of content patterns.
    For the input set: […]
    That generates the following patterns: […]
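The decomposition described on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the tokenizer regex, the function names, and the digit-count thresholds for SMALL_NUMBER and MEDIUM_NUMBER are all assumptions, chosen only so that the output agrees with the patterns quoted on the slides. The input is the lexical form of a literal, with any language tag or datatype handled elsewhere.

```python
import re
from typing import List

# Assumption: a token is a run of digits, a run of letters, a run of
# whitespace, a run of punctuation, or any other single character.
TOKEN_RE = re.compile(r"\d+|[^\W\d_]+|\s+|[^\w\s]+|.")


def tokenize(value: str) -> List[str]:
    """Split a literal's lexical form into tokens."""
    return TOKEN_RE.findall(value)


def coarse_class(token: str) -> str:
    """Coarse classes, in the style of [NUM-NUM-NUM+NUM:NUM]."""
    if token.isdigit():
        return "NUM"
    if token.isalpha():
        return "ALPHA"
    if token.isspace():
        return "<space>"
    return token  # punctuation is kept literally


def specific_class(token: str) -> str:
    """More specific classes; the digit-count thresholds are assumed."""
    if token.isdigit():
        if len(token) <= 2:
            return "SMALL_NUMBER"
        if len(token) <= 4:
            return "MEDIUM_NUMBER"
        return "LARGE/FLOAT_NUMBER"
    if token.isalpha():
        if token.islower():
            return "ALL_LOWERCASE"
        if token.isupper():
            return "ALL_UPPERCASE"
        return "ALPHA"
    return token  # punctuation is kept literally


def content_pattern(value: str, specific: bool = False) -> str:
    """Build one content pattern for a value at the chosen level of detail."""
    tokens = tokenize(value)
    if specific:
        # Whitespace tokens are dropped and classes are space-separated.
        return " ".join(specific_class(t) for t in tokens if not t.isspace())
    return "".join(coarse_class(t) for t in tokens)


print(content_pattern("1488-07-28+02:00"))
print(content_pattern("September 2012"))
print(content_pattern("1986-02-14", specific=True))
```

Keeping punctuation literal while abstracting digits and letters is what lets a single pattern describe every value that shares a layout, such as all timezone-qualified dates.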
  8. 11.

    Version 3.9
    - 2.4 billion RDF triples
    - 53,230 properties

    Split Method
    - 19.25% plain literals
    - 18.02% typed literals
    - 62.73% without lang or datatype (xsd:string)
  9. 12.

    For the apoapsis example, we extracted one pattern:

        http://dbpedia.org/ontology/apoapsis  LARGE/FLOAT_NUMBER  1.0

    And we also found some other related properties:

        http://dbpedia.org/ontology/Planet/apoapsis      LARGE/FLOAT_NUMBER  1.0
        http://dbpedia.org/ontology/Spacecraft/apoapsis  LARGE/FLOAT_NUMBER  1.0
        http://dbpedia.org/property/apoapsis             NUMBER              0.9230769230769231
        http://dbpedia.org/property/apoapsis             LARGE/FLOAT_NUMBER  0.75213675

    For the date example, we extracted 7 patterns:

        http://dbpedia.org/property/date  -- SMALL_NUMBER - SMALL_NUMBER  0.2
        http://dbpedia.org/property/date  ALPHANUMERIC MEDIUM_NUMBER      0.166
        http://dbpedia.org/property/date  ALPHANUMERIC 2012               0.032
        http://dbpedia.org/property/date  ALPHANUMERIC.ALPHANUMERIC       0.012
        And more …
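The slides do not define how the number next to each pattern is computed. One plausible reading, assumed here, is the share of a property's values that the pattern covers. The sketch below computes that share using a coarse NUM/ALPHA pattern; `pattern_of` and `pattern_scores` are hypothetical names, and the four sample values are taken from the dbp:date examples shown earlier. Note that the two scores listed for http://dbpedia.org/property/apoapsis add up to more than 1, so the real measure probably counts a value under several patterns of different generality, which this sketch does not do.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

# Assumed tokenizer: runs of digits, letters, whitespace or punctuation.
TOKEN_RE = re.compile(r"\d+|[^\W\d_]+|\s+|[^\w\s]+|.")


def pattern_of(value: str) -> str:
    """Coarse content pattern: digits -> NUM, letters -> ALPHA."""
    out = []
    for tok in TOKEN_RE.findall(value):
        if tok.isdigit():
            out.append("NUM")
        elif tok.isalpha():
            out.append("ALPHA")
        elif tok.isspace():
            out.append("<space>")
        else:
            out.append(tok)
    return "".join(out)


def pattern_scores(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Map each property to {pattern: share of that property's values}."""
    counts = defaultdict(Counter)
    for prop, value in pairs:
        counts[prop][pattern_of(value)] += 1
    scores = {}
    for prop, counter in counts.items():
        total = sum(counter.values())
        scores[prop] = {pat: n / total for pat, n in counter.items()}
    return scores


pairs = [
    ("dbp:date", "1488-07-28+02:00"),
    ("dbp:date", "1982-05-23+02:00"),
    ("dbp:date", "September 2012"),
    ("dbp:date", "--08-26+02:00"),
]
scores = pattern_scores(pairs)
for pattern, share in sorted(scores["dbp:date"].items(), key=lambda kv: -kv[1]):
    print(pattern, share)
```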
  10. 13.

    The user has this value: “2014-10-20”. What property can he use?
    - dbp:dateCreated, dbp:dateOfProduction, dbp:dateOpened, dbp:dateSigned, dbp:dateOfPremiere, dbp:date, among others.

    What is the property dbp:admCtrOf used for?
    - "town of republic significance of Meleuz"@en (http://dbpedia.org/resource/Meleuz)
    - "town of oblast significance of Oktyabrsk"@en (http://dbpedia.org/resource/Oktyabrsk)
    - "town of republic significance of Sortavala"@en (http://dbpedia.org/resource/Sortavala)

    It is used to declare Administrative Control Of.
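The property search in the first question can be sketched as a lookup: compute the pattern of the user's value and return every property whose learned patterns include it, ranked by score. Everything below is a hypothetical illustration. `suggest_properties` is an invented name, and the scores in `toy_db` are made up for the example rather than taken from the paper's pattern database.

```python
import re
from typing import Dict, List, Tuple

# Assumed tokenizer: runs of digits, letters, whitespace or punctuation.
TOKEN_RE = re.compile(r"\d+|[^\W\d_]+|\s+|[^\w\s]+|.")


def pattern_of(value: str) -> str:
    """Coarse content pattern: digits -> NUM, letters -> ALPHA."""
    out = []
    for tok in TOKEN_RE.findall(value):
        if tok.isdigit():
            out.append("NUM")
        elif tok.isalpha():
            out.append("ALPHA")
        elif tok.isspace():
            out.append("<space>")
        else:
            out.append(tok)
    return "".join(out)


def suggest_properties(
    value: str, pattern_db: Dict[str, Dict[str, float]]
) -> List[Tuple[str, float]]:
    """Return (property, score) pairs whose patterns match the value."""
    pattern = pattern_of(value)
    hits = [
        (prop, patterns[pattern])
        for prop, patterns in pattern_db.items()
        if pattern in patterns
    ]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy database: property names come from the slides, scores are invented.
toy_db = {
    "dbp:dateCreated": {"NUM-NUM-NUM": 0.9},
    "dbp:date": {"NUM-NUM-NUM": 0.4, "ALPHA<space>NUM": 0.3},
    "dbp:apoapsis": {"NUM": 1.0},
}

print(suggest_properties("2014-10-20", toy_db))
```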
  11. 14.

    - Check for atypical values (outliers)
    - Close look into the most (in)frequent patterns
    - Possible errors during automatic extraction

    For the dbp:isbn property we can find the following values:

        "summer or autumn 380"@en
        "Late November"@en
        "Fall 1040"@en
        680
        "December, 67 BC"@en
        "April-July 1799"@en
        http://dbpedia.org/resource/New_Year's_Day
        http://dbpedia.org/resource/Second_Intermediate_Period_of_Egypt
        "New moon day of Kartika, celebrations begin two days prior and end two days after that date"@en

    Are they […] or […] values?
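A simple way to surface atypical values is to flag those whose pattern is rare among all values of the same property. The sketch below does this with a coarse NUM/ALPHA pattern. The function name `rare_values` and the `min_share` threshold are assumptions made for illustration, and the sample list reuses ISBN digits and the "See text" value that appear in the earlier isbn query result.

```python
import re
from collections import Counter
from typing import List

# Assumed tokenizer: runs of digits, letters, whitespace or punctuation.
TOKEN_RE = re.compile(r"\d+|[^\W\d_]+|\s+|[^\w\s]+|.")


def pattern_of(value: str) -> str:
    """Coarse content pattern: digits -> NUM, letters -> ALPHA."""
    out = []
    for tok in TOKEN_RE.findall(value):
        if tok.isdigit():
            out.append("NUM")
        elif tok.isalpha():
            out.append("ALPHA")
        elif tok.isspace():
            out.append("<space>")
        else:
            out.append(tok)
    return "".join(out)


def rare_values(values: List[str], min_share: float = 0.1) -> List[str]:
    """Return values whose pattern covers less than min_share of the list."""
    patterns = [pattern_of(v) for v in values]
    counts = Counter(patterns)
    total = len(values)
    return [v for v, p in zip(values, patterns) if counts[p] / total < min_share]


isbn_values = [
    "0-312-85182-0",
    "0-553-07875-5",
    "0-553-56166-9",
    "0-452-26656-4",
    "See text",
]
print(rare_values(isbn_values, min_share=0.25))
```

A flagged value is only a candidate for review: an infrequent pattern can still be legitimate, which is why the slide suggests looking at both the most frequent and the least frequent patterns.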
  12. 15.

    LD4IE Challenge 2014

    A vCard, which may be annotated with the hCard microformat:

        E-mail: user1@domain.com
        Given name: John
        Surname: Snow
        Birthday: 1986-02-14

    We can use our database to extract and validate the email:

        vcard:email  mailto : ALPHA PUNCTUATION ALL_LOWERCASE . ALL_LOWERCASE     0.82
        vcard:email  mailto : ALPHA PUNCTUATION ALL_LOWERCASE . com               0.69
        vcard:email  mailto : ALPHA @ ALPHANUMERIC . ALL_LOWERCASE                0.54
        vcard:email  mailto : ALPHA @ ALPHANUMERIC . com                          0.46
        vcard:email  mailto : ALL_UPPERCASE ****@ ALL_LOWERCASE . ALL_LOWERCASE   0.36

    …also the birthday:

        vcard:bday  NUMBER - SMALL_NUMBER - SMALL_NUMBER         0.5
        vcard:bday  MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER  0.5
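Validating an extracted value against a stored pattern amounts to checking, token by token, that each token either equals the pattern element literally or belongs to the named syntactic class. The sketch below checks the birthday value against the two vcard:bday patterns listed above. The class hierarchy (a number is also an ALPHANUMERIC, and so on), the digit-count thresholds, and the names `classes_of` and `matches` are assumptions, loosely modeled on the class names that appear in the slides.

```python
import re
from typing import List

# Assumed tokenizer: alphanumeric runs, whitespace runs, punctuation runs.
TOKEN_RE = re.compile(r"[^\W_]+|\s+|[^\w\s]+|.")


def classes_of(token: str) -> List[str]:
    """All classes a token belongs to, most specific first (assumed hierarchy)."""
    if token.isdigit():
        if len(token) <= 2:
            size = "SMALL_NUMBER"
        elif len(token) <= 4:
            size = "MEDIUM_NUMBER"
        else:
            size = "LARGE/FLOAT_NUMBER"
        return [size, "NUMBER", "ALPHANUMERIC"]
    if token.isalpha():
        classes = ["ALPHA", "ALPHANUMERIC"]
        if token.islower():
            classes.insert(0, "ALL_LOWERCASE")
        elif token.isupper():
            classes.insert(0, "ALL_UPPERCASE")
        return classes
    if token.isalnum():
        return ["ALPHANUMERIC"]
    return ["PUNCTUATION"]


def matches(value: str, pattern: List[str]) -> bool:
    """True if every token equals or is an instance of its pattern element."""
    tokens = [t for t in TOKEN_RE.findall(value) if not t.isspace()]
    if len(tokens) != len(pattern):
        return False
    return all(p == t or p in classes_of(t) for t, p in zip(tokens, pattern))


bday_general = ["NUMBER", "-", "SMALL_NUMBER", "-", "SMALL_NUMBER"]
bday_specific = ["MEDIUM_NUMBER", "-", "SMALL_NUMBER", "-", "SMALL_NUMBER"]

print(matches("1986-02-14", bday_general))
print(matches("1986-02-14", bday_specific))
print(matches("14 February 1986", bday_general))
```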
  13. 16.

    - Extraction of lexico-syntactic patterns from LD datasets: 500,000 content patterns
    - Different use cases:
        - Search for properties
        - Validation of values
        - Information extraction based on patterns
    - Future work:
        - Study of consistency analysis of knowledge bases
        - Extension of patterns to cover other knowledge bases
        - Among others