
Using Linked Data to Mine RDF from Wikipedia's Tables

Emir Muñoz
February 27, 2014

In: 7th ACM International Conference on Web Search and Data Mining (WSDM 2014), New York City, New York, February 24-28


Transcript

  1. Using Linked Data to Mine
    RDF from Wikipedia’s Tables
    http://emunoz.org/wikitables
    Emir Muñoz
    Fujitsu (Ireland) Limited
    National University of Ireland Galway
    Joint work with A. Hogan and A. Mileo
    WSDM 2014 @ New York City, February 24-28


  2. MOTIVATION (1/10)


  3. MOTIVATION (2/10)
    The tables embedded in Wikipedia articles contain rich,
    semi-structured encyclopaedic content
    … BUT we cannot query all that content…
    A query example:
    Wikipedia's tables (tables in the article body) are ignored
    [Borrowed from Entity Linking tutorial]


  4. Results as of 25-02-2014


  5. First result


  6. Second result: 10 Airlines


  7. Third result: 19 Airlines


  8. MOTIVATION (7/10)
    • Same query in SPARQL over DBpedia
    SELECT ?p ?o WHERE
    { dbr:Airbus_A380 ?p ?o . }
    FAIL


  9.


  10. No evidence of A380


  11. MOTIVATION (10/10)
    • We perform automatic fact extraction (RDF)
    from Wikipedia tables using KBs
    Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/


  12. EXTRACTING RDF FROM TABLES (1/4)
    • As far as we know, DBpedia and YAGO
    ignore tables in the article body
    – Mainly focused on infoboxes
    • Languages such as R2RML can express
    custom mappings from relational database
    tables to RDF
    – Each row as a subject, each column as a
    predicate, and each cell as an object
    – Needs a mapping definition
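
As a rough illustration of the row/column/cell convention just described, here is a minimal Python sketch (not the authors' code, and not R2RML itself); the base IRI, headers, and rows are invented examples.

    BASE = "http://example.org/resource/"  # hypothetical namespace

    headers = ["name", "position", "club"]
    rows = [["David de Gea", "Goalkeeper", "Manchester United F.C."]]

    def table_to_triples(headers, rows, base=BASE):
        """Yield one (subject, predicate, object) triple per cell:
        each row becomes a subject, each column a predicate."""
        for i, row in enumerate(rows):
            subject = f"<{base}row/{i}>"
            for header, cell in zip(headers, row):
                yield (subject, f"<{base}{header}>", f'"{cell}"')

    for s, p, o in table_to_triples(headers, rows):
        print(s, p, o, ".")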


  13. EXTRACTING RDF FROM TABLES (2/4)
    • [Limaye et al. 2010; Mulwad et al. 2010 & 2013]
    presented approaches using an in-house KB and
    small datasets for validation
    – Entity recognition/disambiguation
    – Determine types for each column
    – Determine relationships between columns
    • We focus on Wikipedia tables, running our
    algorithms over the entire corpus with
    “row-centric” features for Machine
    Learning models


  14. EXTRACTING RDF FROM TABLES (3/4)
    • Extraction of two types of relationships
    – Between the article's main entity and an entity in a cell,
    e.g., “Manchester United F.C.” and “David de Gea”
    – Between entities in different columns but the same row
    dbp:currentClub
    dbp:position
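
A minimal Python sketch of enumerating these two kinds of candidate pairs, assuming the table has already been reduced to a matrix of entity IRIs; the names and data are hypothetical, not the paper's implementation.

    protagonist = "dbr:Manchester_United_F.C."  # the article's main entity
    matrix = [
        ["dbr:David_de_Gea", "dbr:Goalkeeper"],
        ["dbr:Wayne_Rooney", "dbr:Forward"],
    ]

    def candidate_pairs(protagonist, matrix):
        """Yield entity pairs to test for relationships in the KB."""
        for row in matrix:
            cells = [c for c in row if c is not None]
            for cell in cells:            # main entity <-> cell entity
                yield (protagonist, cell)
            for i in range(len(cells)):   # cell <-> cell, same row
                for j in range(i + 1, len(cells)):
                    yield (cells[i], cells[j])

    for a, b in candidate_pairs(protagonist, matrix):
        print(a, "<->", b)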


  15. EXTRACTING RDF FROM TABLES (4/4)


  16. WIKITABLES SURVEY (1/2)
    • Wikipedia dump from February 13th 2013
    • Table taxonomy (1.14 million tables in total)


  17. WIKITABLES SURVEY (2/2)
    • Table model
    – Input: a source of tables (a set of tables),
    e.g., a Wikipedia article
    • Each table t ∈ T is modeled as
    an m × n matrix
    • We normalize the tables and convert
    each HTML table into a matrix
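
A minimal Python sketch of this normalization step, under the assumption that each HTML cell has already been parsed into a (text, rowspan, colspan) tuple; spanning cells are copied into every position they cover so the result is a proper m × n matrix.

    def normalize(rows):
        """rows: list of rows, each a list of (text, rowspan, colspan).
        Returns an m x n matrix with spans expanded."""
        grid = {}
        n_cols = 0
        for r, row in enumerate(rows):
            c = 0
            for text, rowspan, colspan in row:
                while (r, c) in grid:  # skip slots taken by earlier rowspans
                    c += 1
                for dr in range(rowspan):
                    for dc in range(colspan):
                        grid[(r + dr, c + dc)] = text
                c += colspan
                n_cols = max(n_cols, c)
        n_rows = 1 + max((r for r, _ in grid), default=-1)
        return [[grid.get((r, c)) for c in range(n_cols)] for r in range(n_rows)]

    # Toy usage: a header spanning two columns, a cell spanning two rows
    print(normalize([
        [("Player", 1, 1), ("Details", 1, 2)],
        [("De Gea", 2, 1), ("GK", 1, 1), ("Spain", 1, 1)],
        [("GK", 1, 1), ("Spain", 1, 1)],
    ]))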


  18. MINING RDF FROM WIKITABLES (1/6)
    • To extract RDF from Wikitables we rely on
    a reference knowledge base: DBpedia
    – Version 3.8
    Wikipedia table → Extract links in the cells →
    Map links to DBpedia → Look up DBpedia to find
    relationships between entities in the same row →
    Candidate relationships
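
A minimal Python sketch of the "map links to DBpedia" step, using the standard Wikipedia-title-to-DBpedia-IRI convention (spaces become underscores, then percent-encoding); the example title is hypothetical.

    from urllib.parse import quote

    def wikilink_to_dbpedia(title: str) -> str:
        """Map a Wikipedia article title to a DBpedia resource IRI."""
        title = title.strip().replace(" ", "_")
        return "http://dbpedia.org/resource/" + quote(title, safe="_()',.-")

    print(wikilink_to_dbpedia("David de Gea"))
    # -> http://dbpedia.org/resource/David_de_Gea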


  19. MINING RDF FROM WIKITABLES (2/6)
    • We aim to discover:
    – Relations between entities on the same row
    – Relations between entities in the table and the
    protagonist of the article
    • Map the links inside the cells to RDF
    resources
    • Get candidate relationships from the KB
    SELECT DISTINCT ?p1 ?p2
    WHERE { { <e1> ?p1 <e2> } UNION { <e2> ?p2 <e1> } }
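
A minimal Python sketch of this lookup against the public DBpedia endpoint, using the SPARQLWrapper library; the two IRIs passed in stand in for the placeholders <e1> and <e2> in the query above.

    from SPARQLWrapper import SPARQLWrapper, JSON

    def candidate_predicates(e1: str, e2: str):
        """Return predicates linking e1 and e2 in either direction."""
        sparql = SPARQLWrapper("http://dbpedia.org/sparql")
        sparql.setQuery(f"""
            SELECT DISTINCT ?p1 ?p2
            WHERE {{ {{ <{e1}> ?p1 <{e2}> }} UNION {{ <{e2}> ?p2 <{e1}> }} }}
        """)
        sparql.setReturnFormat(JSON)
        results = sparql.query().convert()
        return [{k: v["value"] for k, v in b.items()}
                for b in results["results"]["bindings"]]

    print(candidate_predicates(
        "http://dbpedia.org/resource/David_de_Gea",
        "http://dbpedia.org/resource/Manchester_United_F.C."))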


  20. MINING RDF FROM WIKITABLES (3/6)
    • We detected some weak relationships
    • … We need more filtering of the candidate
    relationships
    dbp:currentClub
    dbp:youthClubs


  21. MINING RDF FROM WIKITABLES (4/6)
    • Features at different levels are used to train
    Machine Learning models (see the sketch after
    this list)
    • Article features (e.g., # of tables)
    • Table features (e.g., # rows, # columns, ratios)
    • Cell features (e.g., # of entities, string length,
    has formatting)
    • Column features (e.g., # of entities, # of unique
    entities)
    • Predicate/Column features (e.g., string similarity, # of
    rows where the relation holds)
    • Predicate features (e.g., triple count, unique count)
    • Triple features (e.g., is the triple from the article
    entity or the table body)
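
A minimal Python sketch of a few of the table- and column-level features above; the feature names and the is_entity helper are assumptions for illustration, not the paper's exact feature set.

    def table_features(matrix, is_entity):
        """matrix: m x n list of lists; is_entity(cell) -> bool."""
        m = len(matrix)
        n = len(matrix[0]) if m else 0
        cells = [c for row in matrix for c in row if c is not None]
        features = {
            "n_rows": m,
            "n_cols": n,
            "row_col_ratio": m / n if n else 0.0,
            "avg_cell_length": sum(len(c) for c in cells) / len(cells) if cells else 0.0,
        }
        for j in range(n):
            col = [row[j] for row in matrix if row[j] is not None]
            entities = [c for c in col if is_entity(c)]
            features[f"col{j}_entities"] = len(entities)
            features[f"col{j}_unique_entities"] = len(set(entities))
        return features

    matrix = [["dbr:David_de_Gea", "Goalkeeper"],
              ["dbr:Wayne_Rooney", "Forward"]]
    print(table_features(matrix, lambda c: c.startswith("dbr:")))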


  22. MINING RDF FROM WIKITABLES (5/6)
    • The experimental set-up
    – Wikipedia dump from February 2013
    – DBpedia dump version 3.8
    – 8 machines (ca. 2005) with 4GB of RAM and
    2.2GHz single-core processors
    • After 12 days we got 34.9 million unique
    triples not in DBpedia
    • We manually annotated a sample of 750
    triples to train the ML models


  23. MINING RDF FROM WIKITABLES (6/6)
                 Bagging DT   Simple Logistic   SVM
    accuracy     78.1%        78.53%            72.6%
    precision    81.5%        79.62%            72.4%
    recall       77.4%        79.01%            75.8%
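
For context, a minimal scikit-learn sketch comparing the three model families reported above; synthetic data stands in for the 750 manually annotated triples, so the numbers it prints will not reproduce the table.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in: 750 labeled examples with 20 features each
    X, y = make_classification(n_samples=750, n_features=20, random_state=0)

    models = {
        "Bagging DT": BaggingClassifier(DecisionTreeClassifier(), random_state=0),
        "Simple Logistic": LogisticRegression(max_iter=1000),
        "SVM": SVC(),
    }

    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
        print(f"{name}: mean accuracy {scores.mean():.3f}")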


  24. CONCLUSION
    • In this work we aimed to
    – Interpret the semantics of tables using KBs
    – Enrich KBs with new facts mined from tables
    • With the best model we got 7.9 million
    unique novel triples
    • We still don't
    – Consider literals/string values in the cells
    – Exploit domain/range of predicates
    – Test other KBs like Freebase and YAGO


  25. CONTRAST WITH OTHER PAPERS
    • Most of the related papers use some
    knowledge base, such as DBpedia
    – They can benefit from new RDF triples
    extracted from Wikipedia tables
    • We can use the similarity proposed in
    Knowledge-based graph document modeling, by
    Schuhmacher and Ponzetto, to improve the
    relation extraction
    • And use the paper Trust, but Verify: Predicting
    Contribution Quality for Knowledge Base Construction
    and Curation, by Tan et al., to assess the
    quality of the output triples


  26. Thank you!
    Emir Muñoz
SVM, our third-best model
    http://emunoz.org/wikitables
