Learning Content Patterns From Linked Data

Emir Muñoz
October 20, 2014

Full paper LD4IE 2014 Workshop @ ISWC

Transcript

  1. Emir Muñoz, Fujitsu (Ireland) Limited, National University of Ireland Galway

    LD4IE 2014 @ ISWC, Riva del Garda, Trentino, Italy. Oct 20th, 2014 http://bit.ly/1xYTR6Z (@emir_munoz)
  2. 2

  3. Let’s run the following SPARQL query over the endpoint…

    select distinct ?obj where {?sub <http://dbpedia.org/property/isbn> ?obj}

    The endpoint response is a table with the values for the isbn property:

    0
    71090
    6176526
    2
    2.7073
    140043853
    1107020697
    2940013968264
    0978-02-02+02:00
    http://dbpedia.org/resource/N/a
    "?"@en
    "ISBN 0-312-85182-0"@en
    "See text"@en
    "various"@en
    "ISBN 978-0-465-02656-2, ISBN 0-14-017997-6"@en
    "ISBN 0-553-07875-5 & ISBN 0-553-56166-9"@en
    "The Claiming of Sleeping Beauty: ISBN 0-452-26656-4"@en
    "-2.0"^^<http://dbpedia.org/datatype/second>
    "TBA"@en
    "not available"@en
    "[[#Bibliography"@en

    And some more ...

    So, what is the correct range for isbn?
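The mix of value kinds in this result can be made explicit by sorting each returned term into URI, typed literal, language-tagged literal, or bare value. The following is a minimal sketch, assuming the terms arrive as strings in the notation shown on the slide (a real endpoint returns structured JSON or XML instead). `classify_term` is a hypothetical helper written for this illustration, not part of any SPARQL library.

```python
import re
from collections import Counter

# Regexes for the term notations that appear on the slide.
# Assumption: terms are plain strings written exactly in that notation.
TYPED = re.compile(r'^".*"\^\^<[^>]+>$')   # "value"^^<datatype>
LANG = re.compile(r'^".*"@[A-Za-z-]+$')    # "value"@lang
URI = re.compile(r'^https?://\S+$')        # bare http(s) URI

def classify_term(term: str) -> str:
    """Return a coarse kind for one RDF term written as on the slide."""
    if TYPED.match(term):
        return "typed literal"
    if LANG.match(term):
        return "language-tagged literal"
    if URI.match(term):
        return "URI"
    return "bare value"

# A few of the values listed on the slide.
sample = [
    "0",
    "2940013968264",
    "0978-02-02+02:00",
    "http://dbpedia.org/resource/N/a",
    '"ISBN 0-312-85182-0"@en',
    '"See text"@en',
    '"-2.0"^^<http://dbpedia.org/datatype/second>',
]

kinds = Counter(classify_term(t) for t in sample)
for kind, count in kinds.items():
    print(kind, count)
```

Even this small sample falls into four different kinds, which is the point of the question about the property's range.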
  4. LOV Statistics (by July 7th, 2014):

    - 446 vocabularies
    - 10 classes and 20 properties on average

    Range of isbn is http://schema.org/Text
  5. Etymology: apo- + apsis

    Noun: apoapsis (plural apoapsides)

    (astronomy) The point of a body's elliptical orbit about the system's centre of mass where the distance between the body and the centre of mass is at its maximum.

    Property: apoapsis [http://en.wiktionary.org/wiki/apoapsis]

    [Figure labels: Earth, Satellite]

    dbr:17049_Miron dbo:apoapsis 4.01288e+11
  6. <subject, predicate, object>

    <http://dbpedia.org/property/date>

    1488-07-28+02:00 | "September 2012"@en | "--08-26+02:00"^^<http://www.w3.org/2001/XMLSchema#gMonthDay>
    1982-05-23+02:00 | "August 2012"@en | "--01-24+02:00"^^<http://www.w3.org/2001/XMLSchema#gMonthDay>
    2007-04-11+02:00 | "July 2009"@en | "--06-11+02:00"^^<http://www.w3.org/2001/XMLSchema#gMonthDay>

    First column: [NUM-NUM-NUM+NUM:NUM] (plain literal)
    Second column: [ALPHA<space>NUM] (plain literal + lang)
    Third column: [--NUM-NUM+NUM:NUM] (typed literal)

    Lerman et al. (JAIR 2003)
  7. Let be the set of content patterns. Lerman et al. (JAIR 2003)

    More specific categories

    For the input set:

    That generates the following patterns:

    Values are decomposed into tokens, and each token is represented by a syntactic class.
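The decomposition into tokens and syntactic classes can be sketched in a few lines. This is a simplified illustration that uses only the coarse classes shown for the date columns (NUM, ALPHA, `<space>`, literal punctuation). The more specific categories from the slides (for example SMALL_NUMBER or ALL_LOWERCASE) are not reproduced, and `content_pattern` is a hypothetical name, not the authors' implementation.

```python
import re

# Split a value into runs of digits, runs of letters, or single other characters.
TOKEN = re.compile(r"\d+|[^\W\d_]+|.", re.DOTALL)

def token_class(token: str) -> str:
    """Map one token to a coarse syntactic class."""
    if token.isdigit():
        return "NUM"
    if token.isalpha():
        return "ALPHA"
    if token == " ":
        return "<space>"
    return token  # punctuation is kept literally

def content_pattern(value: str) -> str:
    """Concatenate the classes of all tokens into a bracketed pattern."""
    return "[" + "".join(token_class(t) for t in TOKEN.findall(value)) + "]"

# Lexical forms taken from the dbp:date example.
for value in ["1488-07-28+02:00", "September 2012", "--08-26+02:00"]:
    print(value, "->", content_pattern(value))
```

On these three inputs the sketch yields `[NUM-NUM-NUM+NUM:NUM]`, `[ALPHA<space>NUM]` and `[--NUM-NUM+NUM:NUM]`, the same patterns the slides give for the three columns.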
  8. - 2.4 billion RDF triples
    - 53,230 properties

    Version 3.9

    Split Method

    - 19.25% plain literals
    - 18.02% typed literals
    - 62.73% without lang or datatype (xsd:string)
  9. - For the apoapsis example, we extracted one pattern:

      http://dbpedia.org/ontology/apoapsis | LARGE/FLOAT_NUMBER | 1.0

    - And we also found some other related properties:

      http://dbpedia.org/ontology/Planet/apoapsis | LARGE/FLOAT_NUMBER | 1.0
      http://dbpedia.org/ontology/Spacecraft/apoapsis | LARGE/FLOAT_NUMBER | 1.0
      http://dbpedia.org/property/apoapsis | NUMBER | 0.9230769230769231
      http://dbpedia.org/property/apoapsis | LARGE/FLOAT_NUMBER | 0.75213675

    - For the date example, we extracted 7 patterns:

      http://dbpedia.org/property/date | -- SMALL_NUMBER - SMALL_NUMBER | 0.2
      http://dbpedia.org/property/date | ALPHANUMERIC MEDIUM_NUMBER | 0.166
      http://dbpedia.org/property/date | ALPHANUMERIC 2012 | 0.032
      http://dbpedia.org/property/date | ALPHANUMERIC.ALPHANUMERIC | 0.012

      And more …
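The slides do not say how the number attached to each pattern is computed. One plausible reading, and only an assumption here, is the share of a property's values that a pattern covers. The sketch below computes such shares from (property, pattern) observations. The name `pattern_scores`, the property `ex:prop` and the toy data are all made up for illustration. Note that the two values shown for http://dbpedia.org/property/apoapsis add up to more than 1, which suggests patterns there are counted at several levels of generality, and this sketch does not model that.

```python
from collections import Counter, defaultdict

def pattern_scores(observations):
    """Compute, per property, the share of values that produced each pattern.

    observations: iterable of (property, pattern) pairs, one pair per value.
    Returns {property: {pattern: share}} with patterns in descending frequency.
    """
    counts = defaultdict(Counter)
    for prop, pattern in observations:
        counts[prop][pattern] += 1

    scores = {}
    for prop, counter in counts.items():
        total = sum(counter.values())
        scores[prop] = {p: n / total for p, n in counter.most_common()}
    return scores

# Toy observations (made up): four values of one hypothetical property.
toy = [
    ("ex:prop", "NUM-NUM-NUM"),
    ("ex:prop", "NUM-NUM-NUM"),
    ("ex:prop", "NUM-NUM-NUM"),
    ("ex:prop", "ALPHA NUM"),
]

for pattern, share in pattern_scores(toy)["ex:prop"].items():
    print(pattern, share)
```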
  10. The user has this value: “2014-10-20”. What property can they use?

    - dbp:dateCreated, dbp:dateOfProduction, dbp:dateOpened, dbp:dateSigned, dbp:dateOfPremiere, dbp:date, among others.

    What is the property dbp:admCtrOf used for?

    - "town of republic significance of Meleuz"@en (http://dbpedia.org/resource/Meleuz)
    - "town of oblast significance of Oktyabrsk"@en (http://dbpedia.org/resource/Oktyabrsk)
    - "town of republic significance of Sortavala"@en (http://dbpedia.org/resource/Sortavala)
    - It is used to declare Administrative Control Of
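The property search in the first question can be pictured as a lookup in an inverted index from patterns to properties. The sketch below is a minimal illustration under stated assumptions: `to_pattern` is a coarse stand-in for the real content patterns, and the contents of `PATTERNS` are invented for the example (only the property names come from the slides).

```python
import re
from collections import defaultdict

def to_pattern(value: str) -> str:
    """Coarse stand-in pattern: digit runs become NUM, letter runs become ALPHA."""
    return re.sub(
        r"\d+|[^\W\d_]+",
        lambda m: "NUM" if m.group().isdigit() else "ALPHA",
        value,
    )

# Hypothetical pattern database: property -> patterns seen among its values.
PATTERNS = {
    "dbp:dateCreated": {"NUM-NUM-NUM"},
    "dbp:date": {"NUM-NUM-NUM", "ALPHA NUM"},
    "dbp:isbn": {"ALPHA NUM-NUM-NUM-NUM", "NUM"},
}

def build_index(patterns_by_property):
    """Invert the database: pattern -> set of properties that exhibit it."""
    index = defaultdict(set)
    for prop, patterns in patterns_by_property.items():
        for pattern in patterns:
            index[pattern].add(prop)
    return index

def candidate_properties(value, index):
    """Return the properties whose known patterns include the value's pattern."""
    return sorted(index.get(to_pattern(value), set()))

index = build_index(PATTERNS)
print(candidate_properties("2014-10-20", index))
```

With this toy database the value "2014-10-20" maps to the pattern `NUM-NUM-NUM` and the lookup returns `dbp:date` and `dbp:dateCreated`.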
  11. - Check for atypical values (outliers)
    - Close look into the most (in)frequent patterns
    - Possible errors during automatic extraction

    For the dbp:isbn property we can find the following values:

    "summer or autumn 380"@en
    "Late November"@en
    "Fall 1040"@en
    680
    "December, 67 BC"@en
    "April-July 1799"@en
    http://dbpedia.org/resource/New_Year's_Day
    http://dbpedia.org/resource/Second_Intermediate_Period_of_Egypt
    "New moon day of Kartika, celebrations begin two days prior and end two days after that date"@en

    Are they or values?
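Looking at infrequent patterns to spot atypical values can be sketched as a simple frequency filter. This is an illustration under assumptions, not the authors' procedure: `to_pattern` is a coarse stand-in for content patterns, the 0.2 threshold is arbitrary, and the input list is a toy mix of date-like strings plus two of the odd values quoted on the slide.

```python
import re
from collections import Counter

def to_pattern(value: str) -> str:
    """Coarse stand-in pattern: digit runs become NUM, letter runs become ALPHA."""
    return re.sub(
        r"\d+|[^\W\d_]+",
        lambda m: "NUM" if m.group().isdigit() else "ALPHA",
        value,
    )

def rare_values(values, max_share=0.2):
    """Return the values whose pattern covers at most max_share of all values."""
    patterns = [to_pattern(v) for v in values]
    counts = Counter(patterns)
    total = len(values)
    return [v for v, p in zip(values, patterns) if counts[p] / total <= max_share]

# Toy input: four date-like strings and two atypical values.
values = [
    "2012-05-01",
    "1999-12-31",
    "2007-04-11",
    "1982-05-23",
    "Late November",
    "680",
]

print(rare_values(values))
```

Here the four date-like strings share one pattern, so only "Late November" and "680" are flagged for a closer look.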
  12. A vCard may be annotated with the hCard microformat.

    LD4IE Challenge 2014

    E-mail: user1@domain.com
    Given name: John
    Surname: Snow
    Birthday: 1986-02-14

    We can use our database to extract and validate the email:

    vcard:email | mailto : ALPHA PUNCTUATION ALL_LOWERCASE . ALL_LOWERCASE | 0.82
    vcard:email | mailto : ALPHA PUNCTUATION ALL_LOWERCASE . com | 0.69
    vcard:email | mailto : ALPHA @ ALPHANUMERIC . ALL_LOWERCASE | 0.54
    vcard:email | mailto : ALPHA @ ALPHANUMERIC . com | 0.46
    vcard:email | mailto : ALL_UPPERCASE ****@ ALL_LOWERCASE . ALL_LOWERCASE | 0.36

    …also the birthday:

    vcard:bday | NUMBER - SMALL_NUMBER - SMALL_NUMBER | 0.5
    vcard:bday | MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER | 0.5
  13. - Extraction of lexico-syntactic patterns from LD datasets
    - Different use cases:
      - Search for properties
      - Validation of values
      - Information extraction based on patterns
    - Future work:
      - Study of consistency analysis of knowledge bases
      - Extension of patterns to cover other knowledge bases
      - Among others

    500,000 content patterns