Slide 1

Slide 1 text

Emir Muñoz
Fujitsu (Ireland) Limited, National University of Ireland Galway
LD4IE 2014 @ ISWC, Riva del Garda, Trentino, Italy. Oct 20th, 2014
http://bit.ly/1xYTR6Z (@emir_munoz)

Slide 2

Slide 2 text


Slide 3

Slide 3 text

Domain(predicate) ??
Range(predicate) ??

Slide 4

Slide 4 text

Let’s run the following SPARQL query over endpoint…
select distinct ?obj where {?sub ?obj}
The endpoint response is a table with the values for the isbn property:
0
71090
6176526
2
2.7073
140043853
1107020697
2940013968264
0978-02-02+02:00
http://dbpedia.org/resource/N/a
"?"@en
"ISBN 0-312-85182-0"@en
"See text"@en
"various"@en
"ISBN 978-0-465-02656-2, ISBN 0-14-017997-6"@en
"ISBN 0-553-07875-5 & ISBN 0-553-56166-9"@en
"The Claiming of Sleeping Beauty: ISBN 0-452-26656-4"@en
"-2.0"^^
"TBA"@en
"not available"@en
"[[#Bibliography"@en
And some more ...
So, what is the correct range for the isbn property?

Slide 5

Slide 5 text

• LOV Statistics (as of July 7th, 2014):
  • 446 vocabularies
  • 10 classes and 20 properties on average
The range of isbn is http://schema.org/Text

Slide 6

Slide 6 text

…but still, is it what I’m looking for? What is the syntax?

Slide 7

Slide 7 text

• Etymology: apo- + apsis
• Noun: apoapsis (plural apoapsides)
  • (astronomy) The point of a body's elliptical orbit about the system's centre of mass where the distance between the body and the centre of mass is at its maximum.
• Property: apoapsis [http://en.wiktionary.org/wiki/apoapsis]
[Figure labels: Earth, Satellite]
dbr:17049_Miron dbo:apoapsis 4.01288e+11

Slide 8

Slide 8 text

https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/ontology/OntologyDatatypes.scala

Slide 9

Slide 9 text

1488-07-28+02:00    "September 2012"@en    "--08-26+02:00"^^
1982-05-23+02:00    "August 2012"@en       "--01-24+02:00"^^
2007-04-11+02:00    "July 2009"@en         "--06-11+02:00"^^
Lerman et al. (JAIR 2003)
First column: [NUM-NUM-NUM+NUM:NUM] (plain literal)
Second column: [ALPHANUM] (plain literal + lang)
Third column: [--NUM-NUM+NUM:NUM] (typed literal)

Slide 10

Slide 10 text

Content patterns, following Lerman et al. (JAIR 2003):
Values are decomposed into tokens, and each token is represented by a syntactic class. General classes are refined into more specific categories.
[Figure: an input set of values and the content patterns generated from it]
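The decomposition into tokens and syntactic classes can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the regular expression, the function names, and the choice of just two classes (NUM and ALPHANUM, as shown on slide 9) are assumptions, and the sketch handles ASCII text only.

```python
import re

# Assumed tokenisation: runs of ASCII letters/digits, runs of whitespace,
# and single punctuation characters.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|\s+|[^\sA-Za-z0-9]")


def token_class(token: str) -> str:
    """Map one token to a coarse syntactic class (class names taken from slide 9)."""
    if token.isspace():
        return " "          # collapse any whitespace run to one space
    if token.isdigit():
        return "NUM"
    if token.isalnum():
        return "ALPHANUM"
    return token            # punctuation is kept literally


def content_pattern(value: str) -> str:
    """Concatenate the classes of all tokens of a lexical value."""
    return "".join(token_class(t) for t in TOKEN_RE.findall(value))


print(content_pattern("1488-07-28+02:00"))   # NUM-NUM-NUM+NUM:NUM
print(content_pattern("--08-26+02:00"))      # --NUM-NUM+NUM:NUM
```

Note that this sketch does not merge adjacent tokens, so "September 2012" becomes "ALPHANUM NUM" rather than the single [ALPHANUM] shown for the second column on slide 9.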

Slide 11

Slide 11 text

Version 3.9
• 2.4 billion RDF triples
• 53,230 properties
Split Method
• 19.25% plain literals
• 18.02% typed literals
• 62.73% without lang or datatype (xsd:string)

Slide 12

Slide 12 text

• For the apoapsis example, we extracted one pattern:
  http://dbpedia.org/ontology/apoapsis               LARGE/FLOAT_NUMBER    1.0
• And we also found some other related properties:
  http://dbpedia.org/ontology/Planet/apoapsis        LARGE/FLOAT_NUMBER    1.0
  http://dbpedia.org/ontology/Spacecraft/apoapsis    LARGE/FLOAT_NUMBER    1.0
  http://dbpedia.org/property/apoapsis               NUMBER                0.9230769230769231
  http://dbpedia.org/property/apoapsis               LARGE/FLOAT_NUMBER    0.75213675
• For the date example, we extracted 7 patterns:
  http://dbpedia.org/property/date    -- SMALL_NUMBER - SMALL_NUMBER    0.2
  http://dbpedia.org/property/date    ALPHANUMERIC MEDIUM_NUMBER        0.166
  http://dbpedia.org/property/date    ALPHANUMERIC 2012                 0.032
  http://dbpedia.org/property/date    ALPHANUMERIC.ALPHANUMERIC         0.012
  And more …
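The slides do not define the number attached to each pattern. The sketch below assumes the simplest reading, the fraction of a property's values that produce a given pattern. That cannot be the whole story, since the two scores listed for http://dbpedia.org/property/apoapsis add up to more than 1, so the real score evidently lets one value count toward more than one pattern (for example a general and a more specific one). The function name and the input format are hypothetical.

```python
from collections import Counter, defaultdict


def pattern_scores(observations):
    """Relative frequency of each pattern per property.

    observations: iterable of (property, pattern) pairs, one pair per value.
    Returns {property: {pattern: share of that property's values}}.
    """
    counts = defaultdict(Counter)
    for prop, pattern in observations:
        counts[prop][pattern] += 1
    return {
        prop: {pat: n / sum(c.values()) for pat, n in c.items()}
        for prop, c in counts.items()
    }


# Made-up observations for a hypothetical property ex:p (not DBpedia data).
demo = [("ex:p", "NUM-NUM-NUM")] * 3 + [("ex:p", "ALPHANUM")]
print(pattern_scores(demo))
```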

Slide 13

Slide 13 text

• The user has this value: “2014-10-20”.
• What property can they use?
  • dbp:dateCreated, dbp:dateOfProduction, dbp:dateOpened, dbp:dateSigned, dbp:dateOfPremiere, dbp:date, among others.
• What is the property dbp:admCtrOf used for?
  • "town of republic significance of Meleuz"@en (http://dbpedia.org/resource/Meleuz)
  • "town of oblast significance of Oktyabrsk"@en (http://dbpedia.org/resource/Oktyabrsk)
  • "town of republic significance of Sortavala"@en (http://dbpedia.org/resource/Sortavala)
  → It is used to declare Administrative Control Of.

Slide 14

Slide 14 text

• Check for atypical values (outliers)
• Close look into the most (in)frequent patterns
• Possible errors during automatic extraction
• For the dbp:isbn property we can find the following values:
  "summer or autumn 380"@en
  "Late November"@en
  "Fall 1040"@en
  680
  "December, 67 BC"@en
  "April-July 1799"@en
  http://dbpedia.org/resource/New_Year's_Day
  http://dbpedia.org/resource/Second_Intermediate_Period_of_Egypt
  "New moon day of Kartika, celebrations begin two days prior and end two days after that date"@en
Are they or values?
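One way to take a close look at infrequent patterns is to flag every value whose pattern covers only a small share of a property's values. The sketch below is an assumption-laden illustration: `shape` is a crude stand-in for a real content pattern, `rare_values` and the 0.2 threshold are hypothetical, and the sample list reuses ISBN strings from slide 4 with the "ISBN" prefix removed.

```python
import re
from collections import Counter


def shape(value: str) -> str:
    """Crude pattern: each digit run becomes N, each letter run becomes A."""
    return re.sub(
        r"[0-9]+|[A-Za-z]+",
        lambda m: "N" if m.group()[0].isdigit() else "A",
        value,
    )


def rare_values(values, pattern_of, max_share=0.1):
    """Return the values whose pattern accounts for at most max_share of all values."""
    patterns = [pattern_of(v) for v in values]
    counts = Counter(patterns)
    total = len(values)
    return [v for v, p in zip(values, patterns) if counts[p] / total <= max_share]


isbn_values = [
    "0-312-85182-0", "0-14-017997-6", "0-553-07875-5",
    "0-553-56166-9", "0-452-26656-4", "TBA", "See text",
]
print(rare_values(isbn_values, shape, max_share=0.2))   # ['TBA', 'See text']
```

A rare pattern is only a candidate for review: a well-formed but unusual value would be flagged in the same way as an extraction error.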

Slide 15

Slide 15 text

LD4IE Challenge 2014
A vCard, which may be annotated with the hCard microformat:
E-mail: [email protected]
Given name: John
Surname: Snow
Birthday: 1986-02-14
We can use our database to extract and validate the email:
vcard:email    mailto : ALPHA PUNCTUATION ALL_LOWERCASE . ALL_LOWERCASE         0.82
vcard:email    mailto : ALPHA PUNCTUATION ALL_LOWERCASE . com                   0.69
vcard:email    mailto : ALPHA @ ALPHANUMERIC . ALL_LOWERCASE                    0.54
vcard:email    mailto : ALPHA @ ALPHANUMERIC . com                              0.46
vcard:email    mailto : ALL_UPPERCASE ****@ ALL_LOWERCASE . ALL_LOWERCASE       0.36
…also the birthday:
vcard:bday     NUMBER - SMALL_NUMBER - SMALL_NUMBER                             0.5
vcard:bday     MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER                      0.5
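Validating a value against stored patterns such as those above could look roughly like the following sketch. It treats each space-separated pattern element either as a class name or as a literal token. The class definitions are assumptions: the slides do not give the numeric thresholds, so SMALL_NUMBER (below 100) and MEDIUM_NUMBER (100 to 9999) were chosen only so that 1986-02-14 matches the vcard:bday pattern listed above. The address john@example.com is a hypothetical stand-in for the one on the slide.

```python
import re

# Assumed tokenisation: runs of ASCII letters/digits, or single punctuation characters.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")

# Assumed class definitions; the thresholds are illustrative, not from the slides.
CLASSES = {
    "NUMBER": lambda t: t.isdigit(),
    "SMALL_NUMBER": lambda t: t.isdigit() and int(t) < 100,
    "MEDIUM_NUMBER": lambda t: t.isdigit() and 100 <= int(t) < 10000,
    "ALPHA": lambda t: t.isalpha(),
    "ALPHANUMERIC": lambda t: t.isalnum(),
    "ALL_LOWERCASE": lambda t: t.isalpha() and t.islower(),
    "PUNCTUATION": lambda t: len(t) == 1 and not t.isalnum(),
}


def matches(value: str, pattern: str) -> bool:
    """True if every token of value satisfies the pattern element at the same position."""
    tokens = TOKEN_RE.findall(value)
    elements = pattern.split()
    if len(tokens) != len(elements):
        return False
    return all(
        CLASSES[e](t) if e in CLASSES else t == e   # class test, else literal match
        for t, e in zip(tokens, elements)
    )


print(matches("1986-02-14", "MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER"))   # True
print(matches("mailto:john@example.com", "mailto : ALPHA @ ALPHANUMERIC . com"))   # True
```

Because the match is positional, a value with a different number of tokens, such as "14 Feb 1986", is rejected outright.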

Slide 16

Slide 16 text

• Extraction of lexico-syntactic patterns from LD datasets
  500,000 content patterns
• Different use cases:
  • Search for properties
  • Validation of values
  • Information extraction based on patterns
• Future work:
  • Study of consistency analysis of knowledge bases
  • Extension of patterns to cover other knowledge bases
  Among others

Slide 17

Slide 17 text

http://emunoz.org
@emir_munoz
[email protected]
https://github.com/emir-munoz/ld-patterns/