My presentation at VIVO 2013 on topic modeling and entity extraction.
What can we learn from topic modeling on 350M documents?
Head of Academic Outreach
@mrgunn – https://orcid.org/0000-0002-3555-2054
Who am I?
PhD Biomedical Science
I've been active in online science
communities since 1995
Established the community program at
Mendeley – 1700 advisors from 650
schools in 60 countries.
Lead outreach to the librarian, academic research, and tech communities
Based in London, Mendeley is researchers, graduates, and software
Two new approaches
• Embed a tool within the researcher workflow to capture data
• Capture new kinds of data – usage of research objects, not just citations of papers
• Data in the cloud
• Collecting rich signals from domain experts
• Rich user profile data
Academic knowledge management solutions
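The "usage of research objects" signal mentioned above can be sketched as a simple aggregation of readership events by discipline. This is a toy illustration only — the event tuples and discipline names below are invented, not Mendeley's actual data model:

```python
from collections import Counter

# Hypothetical readership events: (paper_id, reader_discipline).
# Field names and values are illustrative, not Mendeley's schema.
events = [
    ("paper-1", "Biology"),
    ("paper-1", "Computer Science"),
    ("paper-1", "Biology"),
    ("paper-2", "Physics"),
]

def readership_by_discipline(events, paper_id):
    """Aggregate usage signals for one research object."""
    return Counter(d for p, d in events if p == paper_id)

print(readership_by_discipline(events, "paper-1"))
# Counter({'Biology': 2, 'Computer Science': 1})
```

Unlike a citation count, this kind of usage signal is available immediately after publication and carries reader context (discipline, career stage) along with the raw count.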
• Algorithms to determine the content similarity of academic papers
• Performing text disambiguation and entity recognition to
differentiate between and relate similar in-text entities and authors
of research papers
• Developing semantic technologies and semantic web languages with
a focus on metadata integration/validation
• Investigating profiling and user analysis technologies, e.g. based on
search logs and document interaction
• We will also improve folksonomies and, through that, ontologies
• Finally, tagging behaviour will be analysed to improve tag
recommendations and strategies
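The first bullet — content similarity of papers — can be sketched with TF-IDF weighting and cosine similarity over tokenized abstracts. This is a minimal illustration of the idea, not Mendeley's production algorithm; the toy abstracts are invented:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute sparse TF-IDF vectors for a list of tokenized documents."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))   # document frequency
    idf = {w: math.log(n / df[w]) for w in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({w: tf[w] * idf[w] for w in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy abstracts: two about topic modeling, one about biology.
abstracts = [
    "topic model latent dirichlet allocation corpus".split(),
    "topic model gibbs sampling corpus inference".split(),
    "protein folding molecular dynamics simulation".split(),
]
vecs = tfidf_vectors(abstracts)
```

With these inputs, the two topic-modeling abstracts score higher against each other than either does against the biology abstract, which is the behavior a related-paper recommender needs.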
Semantics vs. Syntax
• Language expresses semantics via syntax
• Syntax is all a computer sees in a research paper
• How do we get to semantics?
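One of the simplest steps from syntax to semantics is dictionary-based entity recognition: match known surface strings in the text and map each to a concept identifier. A toy sketch — the gazetteer entries and identifier scheme are invented for illustration:

```python
import re

# Toy gazetteer mapping surface forms to hypothetical concept IDs.
GAZETTEER = {
    "BRCA1": "gene:672",
    "p53": "gene:7157",
    "Escherichia coli": "taxon:562",
}

def extract_entities(text):
    """Return (surface form, concept id, character offset) per match."""
    hits = []
    for surface, concept in GAZETTEER.items():
        for m in re.finditer(re.escape(surface), text):
            hits.append((surface, concept, m.start()))
    return sorted(hits, key=lambda h: h[2])

sentence = "Mutations in BRCA1 interact with p53 pathways."
entities = extract_entities(sentence)
print(entities)
```

Real entity recognition also has to disambiguate (the same string can name different entities in different fields), which is where the usage and profile signals described earlier become valuable context.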
Distribution of Topics: Bio, Phys, Engineering, Comp. Sci., Business, Law, Other
Subcategories of Comp. Sci.: AI, HCI, Info Sci, Software
Generated topics – Comp. Sci.
Generated Topics - Biology
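Generated topics like those above typically come from a model such as LDA. A minimal collapsed Gibbs sampler on a toy corpus shows the mechanics — at 350M-document scale you would use a distributed implementation (e.g. gensim or Mallet), so treat this purely as a sketch:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized docs.

    Returns the top 3 words per topic by count."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    n_dk = [[0] * n_topics for _ in docs]               # doc-topic counts
    n_kw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    n_k = [0] * n_topics                                # topic totals
    z = []                                              # assignment per token
    for d, doc in enumerate(docs):                      # random initialization
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
        z.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                             # remove current token
                n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
                # resample topic proportional to p(topic | everything else)
                weights = [(n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                           / (n_k[t] + V * beta) for t in range(n_topics)]
                r = rng.random() * sum(weights)
                acc = 0.0
                for t, wt in enumerate(weights):
                    acc += wt
                    if r <= acc:
                        k = t
                        break
                z[d][i] = k                             # re-add token
                n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
    return [sorted(n_kw[t], key=n_kw[t].get, reverse=True)[:3]
            for t in range(n_topics)]

docs = [
    "gene protein cell gene dna".split(),
    "protein cell dna gene cell".split(),
    "code software algorithm code data".split(),
    "software data algorithm code software".split(),
]
topics = lda_gibbs(docs, n_topics=2)
print(topics)
```

On a seeded run with this toy corpus the two topics typically separate the biology vocabulary from the software vocabulary, mirroring the per-discipline topic slides above.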
Categorization As A Process
Categorization is imperfect
Categories change over time
Use case = mining research papers for facts
to add to LOD repositories and light-weight
• Crowd-sourcing enabled semantic enrichment & integration
techniques for integrating facts contained in unstructured
information into the LOD cloud
• Federated, provenance-enabled querying methods for fact
discovery in LOD repositories
• Web-based visual analysis interfaces to support human based
analysis, integration and organisation of facts
• Socio-economic factors – roles, revenue models and value
chains – realisable in the envisioned ecosystem
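A fact mined from a paper enters the LOD cloud as an RDF triple. The sketch below serializes one such fact as an N-Triples line; the URIs are invented examples, not a real vocabulary:

```python
def ntriple(subj, pred, obj):
    """Serialize one fact as an N-Triples line (absolute URIs assumed)."""
    return f"<{subj}> <{pred}> <{obj}> ."

# Hypothetical extracted fact: a paper mentions a gene.
line = ntriple(
    "http://example.org/paper/123",
    "http://example.org/vocab/mentions",
    "http://example.org/gene/BRCA1",
)
print(line)
# <http://example.org/paper/123> <http://example.org/vocab/mentions> <http://example.org/gene/BRCA1> .
```

For the provenance-enabled querying in the second bullet, each such triple would additionally be placed in a named graph recording which paper, extractor, and crowd-sourced verification it came from.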
Metrics as a discovery tool
Google Analytics for Research
Building a reproducibility dataset
• Mendeley and Science Exchange have
started the Reproducibility Initiative
• working with Figshare & PLOS to host data
& replication reports
• building open datasets backing high-
• extending the “executable paper” concept
to biomedical research
Make it porous & part of the
All these examples show that the main
motivation for people to get data (pictures,
bookmarks, etc.) off their computers and
onto the web is that it helps them find
more of the same.
Communities must be open if they are to thrive.