Refreerank

Crowd-sourcing University Rankings with ScikitLearn and NetworkX

Rebecca Murphy

March 01, 2016
Transcript

  1. Overview
     • What is the REF?
     • What did we do?
     • How does it work?
     • How well does it work?

  2. The REF: A Brief Introduction
     • Research Excellence Framework
     • Rank university departments based on research quality
     • Allocate research funding according to quality score

  3. Refreerank: Ranking Venues - Ranking Research
     • Which publication venues are chosen for inclusion?
     • A directed graph from unselected to selected venues reveals:
       ◦ similar topics
       ◦ perceived venue quality
     • Use the graph to rank venues
     • Use the graph to rank departments

  4. Refreerank: Strategy
     • Data:
       ◦ the REF submissions: all submitted publications
       ◦ DBLP: all Computer Science publications
       ◦ Computer Science only
     • Match REF authors with DBLP authors
       ◦ fuzzy matching
       ◦ scikit-learn
     • Create and traverse a directed graph
       ◦ NetworkX
     • Evaluate:
       ◦ compare Refreerank rankings with REF rankings

  5. Refreerank Data: DBLP Entries (1)
     • Shallow XML structure
     • Use the xml.sax library (nice XML parsing tutorial); a parsing sketch follows below

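Slide 5 leans on Python's built-in xml.sax parser for the shallow DBLP dump. As a rough illustration of that step (not the author's actual code), a minimal SAX handler could collect title, authors and venue from the standard dblp.xml record elements; the output structure and the local file path are assumptions:

```python
import xml.sax

class DBLPHandler(xml.sax.ContentHandler):
    """Collect title / authors / venue for each paper-like record in dblp.xml."""

    PAPER_TAGS = {"article", "inproceedings"}          # the DBLP record types of interest
    FIELD_TAGS = {"title", "author", "journal", "booktitle"}

    def __init__(self):
        super().__init__()
        self.papers = []
        self._paper = None
        self._field = None
        self._chars = []

    def startElement(self, name, attrs):
        if name in self.PAPER_TAGS:
            self._paper = {"title": "", "authors": [], "venue": ""}
        elif self._paper is not None and name in self.FIELD_TAGS:
            self._field, self._chars = name, []

    def characters(self, content):
        if self._field is not None:
            self._chars.append(content)

    def endElement(self, name):
        if self._paper is None:
            return
        if name == self._field:
            text = "".join(self._chars).strip()
            if name == "author":
                self._paper["authors"].append(text)
            elif name == "title":
                self._paper["title"] = text
            else:                                      # journal or booktitle
                self._paper["venue"] = text
            self._field = None
        elif name in self.PAPER_TAGS:
            self.papers.append(self._paper)
            self._paper = None

parser = xml.sax.make_parser()
# dblp.xml uses character entities defined in dblp.dtd, so the DTD must sit next to it.
parser.setFeature(xml.sax.handler.feature_external_ges, True)
handler = DBLPHandler()
parser.setContentHandler(handler)
parser.parse("dblp.xml")                               # hypothetical local copy of the dump
print(len(handler.papers), "papers parsed")
```
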
  6. Matching Records: Overview
     • The idea:
       ◦ find REF papers in DBLP records
       ◦ find other papers by REF authors in DBLP
       ◦ build the venue graph
     • The problems:
       ◦ DBLP is very big
       ◦ exact matching fails on small discrepancies, e.g.
         DBLP: SybilInfer: Detecting Sybil Nodes using Social Networks.
         REF:  Sybilinfer: Detecting Sybil nodes using social networks

  7. Matching Records: The Plan
     • Custom locality-sensitive hashing to make feature vectors
       ◦ similar records have similar vectors
       ◦ robust to small discrepancies
     • Dimensionality reduction: random projection
       ◦ keep most of the variance
       ◦ smaller feature space
     • Build a KD-tree
       ◦ fast nearest-neighbour lookup
     • Match REF papers with hashed DBLP paper titles

  8. Matching Records: Locality Sensitive Hashing
     • Example title: "SybilInfer: Detecting Sybil Nodes using Social Networks."
     • 5-mer substrings: sybil, ybili, bilin, …
     • md5 hash of each 5-mer gives an index: 5, 15, 4, …
     • Feature vector: set a 1 at each hashed index, e.g.
       0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1
     • Similar strings give similar feature vectors (see the hashing sketch below)

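A minimal sketch of the hashing step described on slide 8: split a title into 5-mers, md5-hash each one to an index, and set that bit in a fixed-length vector. The 1024-bit vector length is taken from the next slide; the lower-casing and whitespace stripping are assumptions added so the two SybilInfer spellings collide:

```python
import hashlib
import numpy as np

VECTOR_SIZE = 1024   # matches "reduce from 1024 dimensions" on the next slide

def title_to_vector(title, k=5, size=VECTOR_SIZE):
    """Hash every k-mer of a title into a fixed-length binary feature vector."""
    # Lower-casing and dropping whitespace is an assumption; it makes
    # "SybilInfer ..." and "Sybilinfer ..." produce the same 5-mers.
    text = "".join(title.lower().split())
    vec = np.zeros(size, dtype=np.uint8)
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]                              # "sybil", "ybili", "bilin", ...
        index = int(hashlib.md5(kmer.encode("utf-8")).hexdigest(), 16) % size
        vec[index] = 1                                    # set the bit for this 5-mer
    return vec

a = title_to_vector("SybilInfer: Detecting Sybil Nodes using Social Networks.")
b = title_to_vector("Sybilinfer: Detecting Sybil nodes using social networks")
print((a & b).sum() / (a | b).sum())                      # Jaccard similarity, close to 1.0
```
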
  9. Matching Records: Dimensionality Reduction
     • Random projection (Gaussian)
     • Reduce from 1024 to 7 dimensions
     • Retain > 90% of the variance
     • Scikit-learn implementation (sketch below)

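The reduction maps onto scikit-learn's GaussianRandomProjection; a sketch with stand-in data (the real input would be the 1024-bit title vectors from the hashing step):

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
# Stand-in for a stack of 1024-bit title vectors (sparse, so mostly zeros).
X = (rng.random((1000, 1024)) < 0.05).astype(float)

# Project down to 7 dimensions with a Gaussian random matrix, as on the slide.
projector = GaussianRandomProjection(n_components=7, random_state=0)
X_small = projector.fit_transform(X)
print(X_small.shape)   # (1000, 7)
```
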
  10. Matching Records: Nearest Neighbour Search
     • KD-tree: partition the record space
       ◦ pick an axis
       ◦ partition at the median along that axis
       ◦ repeat
     • Close neighbours end up in the same compartment
     • Efficient nearest-neighbour search:
       ◦ O(n log n) creation
       ◦ O(log n) lookup
     • Scikit-learn implementation (sketch below)
     • Thank you Wikipedia!

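scikit-learn also provides the KD-tree; a sketch of the lookup with stand-in 7-dimensional vectors in place of the projected titles:

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X_small = rng.normal(size=(1000, 7))    # stand-in for the projected DBLP title vectors

tree = KDTree(X_small)                  # O(n log n) construction

# A "REF title" that hashes and projects to almost the same point as record 42.
query = X_small[42] + rng.normal(scale=0.01, size=7)
dist, idx = tree.query(query.reshape(1, -1), k=1)   # O(log n) lookup
print(idx[0][0], dist[0][0])            # -> 42 and a small distance
```
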
  11. Matching Records: Performance
     • Authors matched: 93.6%
     • Papers matched: 77.8%
     • Parsing:
       ◦ from 1.4M to 11K papers
       ◦ time: 12 min (+ 11 seconds)

  12. Building the Ranking Graph: Strategy
     • Get all relevant papers from DBLP by matched REF author
     • Build a directed graph from non-included to included venues (graph-building sketch below)
     • Compute the stationary distribution over the normalised graph to get a venue ranking score
     • Combine venue rankings for all academics in an institution to produce a new ranking
       ◦ top 4 papers (REF-like)
       ◦ all relevant papers

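The slides don't show the graph-building code; a hedged sketch of one way to do it with NetworkX, drawing edges from a passed-over venue to a selected venue and counting repeated pairs as weights (the input records and the weighting scheme are assumptions), followed by row-normalisation into a transition matrix like the one on the next slide:

```python
import networkx as nx
import numpy as np

# Hypothetical per-author records: venues of the papers the author had selected
# for the REF, and venues of their other (non-included) DBLP papers.
authors = [
    {"selected": ["Nature", "PLoS Comp Bio"], "other": ["Homeopathy", "J. Neg. Res"]},
    {"selected": ["PNAS"],                    "other": ["PLoS", "J. Neg. Res"]},
]

G = nx.DiGraph()
for record in authors:
    for src in record["other"]:
        for dst in record["selected"]:
            # Edge from the venue that was passed over to the venue that was
            # chosen; accumulate a weight when the same pair recurs.
            if G.has_edge(src, dst):
                G[src][dst]["weight"] += 1
            else:
                G.add_edge(src, dst, weight=1)

# Row-normalise the weighted adjacency matrix into a transition matrix (slide 13).
venues = list(G.nodes)
A = nx.to_numpy_array(G, nodelist=venues, weight="weight")
row_sums = A.sum(axis=1, keepdims=True)
T = np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)
print(venues)
print(T.round(2))
```
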
  13. From Graph to Ranking: Transition Matrix

                      Homeopathy  J. Neg. Res  PLoS  PLoS Comp Bio  PNAS  Nature
      Homeopathy         0.50        0.25      0.25       0.00      0.00   0.00
      J. Neg. Res        0.16        0.08      0.33       0.08      0.17   0.17
      PLoS               0.00        0.00      0.40       0.14      0.25   0.21
      PLoS Comp Bio      0.00        0.00      0.00       0.00      0.00   0.00
      PNAS               0.00        0.00      0.25       0.00      0.25   0.50
      Nature             0.00        0.00      0.00       0.00      0.00   0.00

  14. From Graph to Ranking: Stationary Distributions
     • Better venues are sink nodes
     • Random walk on the graph: move from low- to high-ranked venues
     • Many random walks: the stationary distribution approximates a venue quality score
     • Computed by repeated matrix dot products (power-iteration sketch below)

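The "matrix dot product" bullet suggests power iteration: repeatedly multiply a probability vector by the transition matrix until it settles. A sketch using the slide 13 matrix; the slides do not say how the all-zero sink rows are handled, so the uniform restart and damping below are assumptions (PageRank-style), not the author's method:

```python
import numpy as np

def stationary_distribution(T, damping=0.85, iters=100):
    """Approximate the stationary distribution by repeated matrix dot products."""
    n = T.shape[0]
    T = T.copy()
    # ASSUMPTION: sink rows (all zeros -- the best venues) restart uniformly.
    T[T.sum(axis=1) == 0] = 1.0 / n
    # ASSUMPTION: PageRank-style damping to guarantee convergence.
    M = damping * T + (1 - damping) / n
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        pi = pi @ M          # the "matrix dot product" step
    return pi / pi.sum()

# Transition matrix from slide 13 (rows and columns in the order:
# Homeopathy, J. Neg. Res, PLoS, PLoS Comp Bio, PNAS, Nature).
T = np.array([
    [0.50, 0.25, 0.25, 0.00, 0.00, 0.00],
    [0.16, 0.08, 0.33, 0.08, 0.17, 0.17],
    [0.00, 0.00, 0.40, 0.14, 0.25, 0.21],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.25, 0.00, 0.25, 0.50],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
])
print(stationary_distribution(T).round(3))
```
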
  15. Venue Ranking: Top Venues

      Rank  Venue                        Score
      1     CHI                          12.56
      2     Media Forensics & Security   11.55
      3     POPL                         10.97
      4     J. Comput. Physics            9.97
      5     Theor. Comput. Sci.           9.72

  16. Venue Ranking: Issues
     • All roads lead to CHI …
     • Large venues have disproportionately high scores
     • Large research communities have disproportionately high scores

  17. University Ranking: Computing the Ranking
     • Combine the venue scores of submitted academics (combination sketch below)
     • Papers considered:
       ◦ all
       ◦ top 4
       ◦ top 12

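The slides do not spell out how venue scores are combined into an institution score; a minimal sketch under the assumption that each academic contributes the sum of the venue scores of their top-N papers (the venue scores, institutions and paper lists below are made up for illustration):

```python
# Hypothetical venue scores from the stationary distribution, and hypothetical
# submissions: which venues each submitted academic published in.
venue_score = {"CHI": 12.56, "POPL": 10.97, "Theor. Comput. Sci.": 9.72}

academics = [
    {"institution": "University A", "venues": ["CHI", "POPL", "CHI", "Theor. Comput. Sci."]},
    {"institution": "University B", "venues": ["POPL", "Theor. Comput. Sci."]},
]

TOP_N = 4   # slide 17: consider all papers, the top 4, or the top 12

scores = {}
for academic in academics:
    # ASSUMPTION: each academic contributes the sum of their N best venue scores;
    # the slides do not specify the combination rule.
    best = sorted((venue_score.get(v, 0.0) for v in academic["venues"]), reverse=True)[:TOP_N]
    scores[academic["institution"]] = scores.get(academic["institution"], 0.0) + sum(best)

for uni, total in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(uni, round(total, 2))
```
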
  18. University Ranking: The Results (Top 10)

      Rank  REF                          Refreerank
      1     University College London    University of Warwick
      2     University of Oxford         University College London
      3     University of Edinburgh      University of Liverpool
      4     University of Nottingham     Imperial College London
      5     Imperial College London      University of Oxford
      6     King’s College London        King’s College London
      7     University of Southampton    University of Sheffield
      8     University of Glasgow        University of Cambridge
      9     University of Cambridge      University of Manchester
      10    University of Liverpool      Queen Mary University

  19. Conclusions: What’s Good
     • It works! (Kind of)
     • Approximately reproduces the REF ranking
     • 2 days’ work: cheap and easy
     • REF selection acts as a crowd-sourced quality score

  20. Conclusions: What’s Not Good
     • Large venues / research communities dominate the scores
     • Missing records
     • Some significant differences from the REF score
     • Venue quality != paper quality
     • No DBLP equivalent for most disciplines (yet)