data
• DBpedia provides an RDF version of all Wikipedia structured data (infoboxes)
• But not yet a version of the normal Wikipedia tables, or wikitables
October 12, 2012 -- E. Muñoz
row
• Column headers represent types of information
• The values represent instances of those types
http://en.wikipedia.org/wiki/Galway
Infoboxes (attr-value)

Tables are inherently concise as well as information rich
semantics by themselves.
• Main issues:
  – Complex tables with spans
  – Captions inside the table as another row
  – Not well-formed tables (i.e., not a matrix)
  – We need filters (e.g., min 2 columns, 2 rows)
• We are extracting relations at row level and between the main entity and the table resources
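The well-formedness check and the size filter listed above could be sketched as follows. This is only an illustration under stated assumptions: a table is assumed to be already parsed into a list of rows, the header counts as one of the rows, and the names `is_well_formed` and `passes_filter` are hypothetical, not taken from the actual extraction code.

```python
from typing import List, Optional

# Assumed representation: one inner list of cell strings per table row.
Table = List[List[Optional[str]]]


def is_well_formed(table: Table) -> bool:
    """True when every row has the same, non-zero number of cells (a matrix)."""
    if not table:
        return False
    width = len(table[0])
    return width > 0 and all(len(row) == width for row in table)


def passes_filter(table: Table, min_rows: int = 2, min_cols: int = 2) -> bool:
    """Keep only matrix-shaped tables with at least min_rows x min_cols cells."""
    return (
        is_well_formed(table)
        and len(table) >= min_rows
        and len(table[0]) >= min_cols
    )


# Toy tables, invented for illustration only.
good = [["Name", "Club"], ["Player_A", "Club_X"]]
ragged = [["Name", "Club"], ["Player_A"]]  # second row is missing a cell
tiny = [["only one cell"]]
```

Tables with row or column spans would need to be expanded into a full grid before a check like this one; that step is not shown here.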
the entities in the table
dbpedia.org/resource/AFC_Ajax
  14 dbpedia.org/ontology/team
  14 dbpedia.org/property/clubs
  11 dbpedia.org/property/currentclub
  3 dbpedia.org/property/youthclubs
In his DBpedia page there is no mention of AFC Ajax
http://en.wikipedia.org/wiki/AFC_Ajax
16 players
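Counts like the ones on this slide could be produced by tallying the predicates that link table entities to the article's main entity. The sketch below is a minimal illustration, not the actual pipeline: it assumes the triples are already available in memory as (subject, predicate, object) tuples, it only looks at the direction table entity → main entity, and the player names are invented placeholders.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def count_predicates(
    triples: Iterable[Triple], table_entities: Set[str], main_entity: str
) -> Counter:
    """Count how often each predicate links a table entity to the main entity."""
    counts: Counter = Counter()
    for subject, predicate, obj in triples:
        if subject in table_entities and obj == main_entity:
            counts[predicate] += 1
    return counts


# Hypothetical toy data; Player_A/B/C are placeholders, not real resources.
MAIN = "dbpedia.org/resource/AFC_Ajax"
players = {"Player_A", "Player_B", "Player_C"}
toy_triples = [
    ("Player_A", "dbpedia.org/ontology/team", MAIN),
    ("Player_B", "dbpedia.org/ontology/team", MAIN),
    ("Player_B", "dbpedia.org/property/youthclubs", MAIN),
    ("Player_C", "dbpedia.org/ontology/team", "Some_Other_Club"),
]

counts = count_predicates(toy_triples, players, MAIN)
# Player_C has no triple pointing at the main entity, so it adds nothing.
```

The most frequent predicates are natural candidates for the relation holding between the main entity and entities that, like Player_C here, have no existing link to it.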
GB of Wikipedia pages that comprise
  – 10,531,986 documents (HTML pages)
  – Only 413,256 HTML pages contain tables
  – 2,989,098 tables
  – 905,929 tables after the filter
    • 27.7% of all tables
  – 0.46 tables per page (or 2.15 discarding pages without tables)
relations.
• Store the 5.5M DBpedia (transitive) redirects locally (optimizing time).
• Statistical analysis of Wikipedia tables
  – Number of columns, rows
  – Headers, captions
  – External and internal links
• The next big challenge is the evaluation.
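Resolving transitive redirects against a local copy could look like the sketch below. It assumes the redirects have already been loaded into a plain dictionary mapping each URI to its direct target; the function name `resolve_redirect`, the `max_hops` limit, and the toy URIs are hypothetical choices for illustration.

```python
from typing import Dict


def resolve_redirect(uri: str, redirects: Dict[str, str], max_hops: int = 10) -> str:
    """Follow redirects until a URI with no further redirect is reached.

    Stops early on a cycle or after max_hops steps, returning the last
    URI reached in either case.
    """
    seen = {uri}
    current = uri
    for _ in range(max_hops):
        target = redirects.get(current)
        if target is None or target in seen:
            return current
        seen.add(target)
        current = target
    return current


# Toy redirect table, invented for illustration only.
toy_redirects = {"A": "B", "B": "C", "X": "Y", "Y": "X"}

resolved = resolve_redirect("A", toy_redirects)  # follows A -> B -> C
```

Doing the lookup in a local dictionary avoids one remote DBpedia query per link, which is the time saving the first bullet refers to.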
Improve the ranking function
• Handle redirects before querying DBpedia
• How to evaluate the outcome

Thanks! Q & A
Emir Muñoz
Unit for Reasoning and Querying