What can we learn from entity extraction and topic modeling across 350M documents?

William Gunn
August 16, 2013

My presentation at VIVO 2013 on topic modeling and entity extraction.

Transcript

  1. What can we learn from topic
    modeling on 350M documents?
    William Gunn
    Head of Academic Outreach
    Mendeley
    @mrgunn – https://orcid.org/0000-0002-3555-2054

  2. Who am I?

    PhD Biomedical Science

    I've been active in online science
    communities since 1995

    Established the community program at
    Mendeley – 1700 advisors from 650
    schools in 60 countries.

    Lead outreach to the librarian, academic
    research, and tech communities

  3. Based in London, Mendeley is made up of
    researchers, graduates and software
    developers from...

  4. Two new approaches
    • Embed a tool within the researcher
    workflow to capture data
    • Capture new kinds of data – usage
    of research objects, not just
    citations of papers.

  5. Mendeley extracts research data…
    …and aggregates data in the cloud,
    collecting rich signals from domain experts.

  6. Rich user profile data

  7. TEAM Project
    academic knowledge management solutions
    • Algorithms to determine the content similarity of academic papers
    (see the sketch after this list)
    • Performing text disambiguation and entity recognition to
    differentiate between and relate similar in-text entities and authors
    of research papers.
    • Developing semantic technologies and semantic web languages with
    a focus on metadata integration/validation
    • Investigating profiling and user analysis technologies, e.g. based on
    search logs and document interaction.
    • We will also improve folksonomies and, through that, ontologies of
    text.
    • Finally, tagging behaviour will be analysed to improve tag
    recommendations and strategies.
    • http://team-project.tugraz.at/blog/
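
    (The sketch below is a hypothetical illustration of the first bullet – pairwise
    content similarity via TF-IDF and cosine similarity with scikit-learn. It is not
    the TEAM project's actual algorithm; the abstracts are made up.)

    # Hypothetical sketch: content similarity of paper abstracts via TF-IDF.
    # Not the TEAM project's actual pipeline; the example abstracts are made up.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    abstracts = [
        "Latent Dirichlet allocation for topic modeling of scholarly text.",
        "Named entity recognition and disambiguation in research papers.",
        "Topic models applied to large collections of academic documents.",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(abstracts)   # documents x terms weights
    similarity = cosine_similarity(tfidf)         # documents x documents scores

    # similarity[i][j] near 1.0 means papers i and j share vocabulary; here the
    # first and third abstracts should score higher with each other than with the second.
    print(similarity.round(2))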

  8. Semantics vs. Syntax
    • Language expresses semantics via syntax
    • Syntax is all a computer sees in a research
    article.
    • How do we get to semantics?
    • Topic Modeling!
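
    (As a minimal, assumed illustration of how topic modeling recovers semantics
    from syntax, the sketch below trains a tiny LDA model with gensim on made-up
    snippets. It is not Mendeley's production pipeline.)

    # Minimal LDA sketch with gensim: topics inferred from word co-occurrence.
    # Assumed, simplified example; not Mendeley's actual system.
    from gensim import corpora
    from gensim.models import LdaModel

    docs = [
        "enzyme catalysis reaction rate protein",
        "protein structure folding enzyme assay",
        "network routing protocol latency packet",
        "software architecture network protocol design",
    ]
    tokenized = [d.split() for d in docs]

    dictionary = corpora.Dictionary(tokenized)            # word <-> id mapping
    corpus = [dictionary.doc2bow(t) for t in tokenized]   # bag-of-words vectors

    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=2, passes=10, random_state=0)

    # Each topic is a probability distribution over words – the "semantics"
    # that the raw syntax only hints at.
    for topic_id, words in lda.print_topics(num_words=4):
        print(topic_id, words)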

  9. Distribution of Topics
    (Bar chart of the share of documents per discipline: Bio, Phys, Engineering,
    Comp Sci, Psych & Edu, Business, Law, Other; y-axis 0–35%.)

  10. Subcategories of Comp. Sci.
    (Bar chart: AI, HCI, Info Sci, Software Eng, Networks; y-axis 0–20%.)

  11. (Image-only slide; no transcript text.)

  12. Generated topics – Comp. Sci.

  13. Generated Topics - Biology

  14. Categorization As A Process
    Thing → Process → Reaction → Catalysis → Enzymatic

  15. Categorization As A Process
    Thing → Process → Reaction → Catalysis → Enzymatic

  16. Categorization is imperfect

  17. Categories change over time

  18. CODE Project
    Use case = mining research papers for facts
    to add to LOD repositories and lightweight
    ontologies.
    • Crowd-sourcing-enabled semantic enrichment & integration
    techniques for integrating facts contained in unstructured
    information into the LOD cloud
    • Federated, provenance-enabled querying methods for fact
    discovery in LOD repositories (see the sketch after this list)
    • Web-based visual analysis interfaces to support human-based
    analysis, integration and organisation of facts
    • Socio-economic factors – roles, revenue models and value
    chains – realisable in the envisioned ecosystem.
    • http://code-research.eu/
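
    (The sketch below is a hypothetical illustration of the fact-discovery bullet:
    querying a public Linked Open Data endpoint with the SPARQLWrapper library.
    DBpedia is assumed purely as an example endpoint; this is not the CODE project's
    federated, provenance-enabled system.)

    # Hypothetical sketch: pulling facts from a LOD endpoint (DBpedia assumed).
    # Not the CODE project's federated, provenance-enabled querying system.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?journal ?label WHERE {
            ?journal a dbo:AcademicJournal ;
                     rdfs:label ?label .
            FILTER (lang(?label) = "en")
        } LIMIT 5
    """)
    sparql.setReturnFormat(JSON)

    # Each result row is a candidate fact that could be linked back to a paper.
    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["journal"]["value"], "-", row["label"]["value"])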

  19. (Image-only slide; no transcript text.)

  20. (Image-only slide; no transcript text.)

  21. (Image-only slide; no transcript text.)

  22. Metrics as a discovery tool

  23. Google Analytics for Research

  24. Building a reproducibility dataset
    • Mendeley and Science Exchange have
    started the Reproducibility Initiative
    • working with Figshare & PLOS to host data
    & replication reports
    • building open datasets backing high-
    impact work
    • extending the “executable paper” concept
    to biomedical research

  25. Make it porous & part of the
    web.

    All these examples show that the main
    motivation for people to get data (pictures,
    bookmarks, etc.) off their computers and
    onto the web is that it helps them find
    more of the same.

    Communities must be open if they are to
    thrive.

  26. www.mendeley.com
    [email protected]
    @mrgunn
