What can we learn from entity extraction and topic modeling across 350M documents?

What can we learn from topic modeling on 350M documents?
William Gunn, Head of Academic Outreach, Mendeley
@mrgunn – https://orcid.org/0000-0002-3555-2054

Who am I?
• PhD Biomedical Science
• I've been active in online science communities since 1995
• Established the community program at Mendeley – 1700 advisors from 650 schools in 60 countries.
• Lead the outreach to librarian, academic research, and tech communities

Based in London, Mendeley is researchers, graduates and software developers from...

Two new approaches
• Embed a tool within the researcher workflow to capture data
• Capture new kinds of data – usage of research objects, not just citations of papers.

...and aggregates data in the cloud
Mendeley extracts research data…
Collecting rich signals from domain experts.

Rich user profile data

TEAM Project academic knowledge management solutions
• Algorithms to determine the content similarity of academic papers
• Performing text disambiguation and entity recognition to differentiate between and relate similar in-text entities and authors of research papers.
• Developing semantic technologies and semantic web languages with the focus of metadata integration/validation
• Investigate profiling and user analysis technologies, e.g. based on search logs and document interaction.
• We will also improve folksonomies and through that, ontologies of text.
• Finally, tagging behaviour will be analysed to improve tag recommendations and strategies.
• http://team-project.tugraz.at/blog/
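The slides do not say which similarity algorithm the TEAM Project used. As one common baseline for "content similarity of academic papers", the sketch below computes TF-IDF vectors and compares them with cosine similarity. The tokenizer, the smoothed IDF formula, and the three abstracts are illustrative assumptions, not material from the talk.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters; a deliberately naive tokenizer."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption): rare terms weigh more, common ones less.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical abstracts, invented for illustration only.
abstracts = [
    "Enzymatic catalysis of protein folding reactions",
    "Protein folding and enzymatic reactions in the cell",
    "Routing protocols for wireless sensor networks",
]
vectors = tfidf_vectors(abstracts)
sim_bio = cosine(vectors[0], vectors[1])    # two biology abstracts
sim_cross = cosine(vectors[0], vectors[2])  # biology vs. networking
print(f"bio/bio: {sim_bio:.3f}  bio/networks: {sim_cross:.3f}")
```

The two biology abstracts share several terms and score above zero, while the biology and networking abstracts share none and score exactly zero. A system working at the scale mentioned in the title would need far more than this, for example stemming, stop-word handling, and approximate nearest-neighbour search.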

Semantics vs. Syntax
• Language expresses semantics via syntax
• Syntax is all a computer sees in a research article.
• How do we get to semantics?
• Topic Modeling!
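The slide names topic modeling but not a specific algorithm. A widely used choice is latent Dirichlet allocation (LDA), so the sketch below shows a minimal collapsed Gibbs sampler for it. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy corpus are all assumptions for illustration, not the method used on the 350M documents.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.

    Returns (doc_topic, topic_word, vocab), where each row of doc_topic
    and topic_word is a probability distribution.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic_counts = [[0] * n_topics for _ in docs]
    topic_word_counts = [[0] * n_words for _ in range(n_topics)]
    topic_totals = [0] * n_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            k = rng.randrange(n_topics)
            doc_assign.append(k)
            doc_topic_counts[d][k] += 1
            topic_word_counts[k][word_id[w]] += 1
            topic_totals[k] += 1
        assignments.append(doc_assign)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic_counts[d][k] -= 1
                topic_word_counts[k][wid] -= 1
                topic_totals[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic_counts[d][t] + alpha)
                    * (topic_word_counts[t][wid] + beta)
                    / (topic_totals[t] + beta * n_words)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic_counts[d][k] += 1
                topic_word_counts[k][wid] += 1
                topic_totals[k] += 1

    doc_topic = [
        [(c + alpha) / (len(doc) + alpha * n_topics) for c in counts]
        for counts, doc in zip(doc_topic_counts, docs)
    ]
    topic_word = [
        [(c + beta) / (topic_totals[k] + beta * n_words) for c in topic_word_counts[k]]
        for k in range(n_topics)
    ]
    return doc_topic, topic_word, vocab


# Toy corpus, invented for illustration: two biology and two networking documents.
docs = [
    "enzyme catalysis reaction protein enzyme".split(),
    "protein reaction enzyme folding protein".split(),
    "network routing protocol packet network".split(),
    "packet protocol network latency routing".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(docs, n_topics=2)
for k, row in enumerate(topic_word):
    top = sorted(zip(row, vocab), reverse=True)[:3]
    print(f"topic {k}:", [word for _, word in top])
```

The sampler only ever sees token identities, which is the "syntax" of the slide. The co-occurrence structure it recovers is a rough stand-in for "semantics". On a corpus this small the topics are not guaranteed to separate cleanly, so no expected output is shown.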

Distribution of Topics
[Bar chart, axis 0% to 35%. Categories: Bio, Phys, Engineer, Comp Sci, Psych & Edu, Business, Law, Other]

Subcategories of Comp. Sci.
[Bar chart, axis 0% to 20%. Categories: AI, HCI, Info Sci, Software Eng, Networks]

Generated topics – Comp. Sci.

Generated Topics - Biology

Categorization As A Process
[Diagram labels: Thing, Process, Reaction, Catalysis, Enzymatic]

Categorization is imperfect

Categories change over time

Code Project
Use case = mining research papers for facts to add to LOD repositories and light-weight ontologies.
• Crowd-sourcing enabled semantic enrichment & integration techniques for integrating facts contained in unstructured information into the LOD cloud
• Federated, provenance-enabled querying methods for fact discovery in LOD repositories
• Web-based visual analysis interfaces to support human based analysis, integration and organisation of facts
• Socio-economic factors – roles, revenue-models and value chains – realisable in the envisioned ecosystem.
• http://code-research.eu/
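To make "facts to add to LOD repositories" concrete, the sketch below represents an extracted fact as a subject/predicate/object triple and serializes it in N-Triples, a line-based RDF format. The `Fact` type, the `example.org` URIs, and the `mentions` predicate are hypothetical; only the Dublin Core `title` term is a real vocabulary entry. The slides do not describe the Code Project's actual data model.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    """One extracted statement: subject and predicate are URIs; the object
    is either a URI or a plain string literal."""
    subject: str
    predicate: str
    obj: str
    obj_is_literal: bool = False


def escape_literal(value):
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_ntriples(fact):
    """Serialize one fact as a single N-Triples line."""
    if fact.obj_is_literal:
        obj = f'"{escape_literal(fact.obj)}"'
    else:
        obj = f"<{fact.obj}>"
    return f"<{fact.subject}> <{fact.predicate}> {obj} ."


# Hypothetical facts mined from a paper (all example.org URIs are invented).
facts = [
    Fact(
        "http://example.org/paper/1",
        "http://purl.org/dc/terms/title",
        'A "quoted" title',
        obj_is_literal=True,
    ),
    Fact(
        "http://example.org/paper/1",
        "http://example.org/vocab/mentions",
        "http://example.org/entity/catalysis",
    ),
]
lines = [to_ntriples(f) for f in facts]
print("\n".join(lines))
```

The provenance called for in the bullets (which paper, which extractor, which crowd-sourced correction) is not modeled here. One option would be a fourth field naming the source graph, as in N-Quads.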

Metrics as a discovery tool

Google Analytics for Research

Building a reproducibility dataset
• Mendeley and Science Exchange have started the Reproducibility Initiative
• working with Figshare & PLOS to host data & replication reports
• building open datasets backing high-impact work
• extending the “executable paper” concept to biomedical research

Make it porous & part of the web.
• All these examples show that the main motivation for people to get data (pictures, bookmarks, etc.) off their computers and onto the web is that it helps them find more of the same.
• Communities must be open if they are to thrive.

www.mendeley.com @mrgunn
