Refreerank

Crowd-sourcing University Rankings with ScikitLearn and NetworkX

Rebecca Murphy

March 01, 2016
Transcript

  1. Overview
     • What is the REF?
     • What did we do?
     • How does it work?
     • How well does it work?

  2. The REF: A Brief Introduction
     • Research Excellence Framework
     • Rank university departments based on research quality
     • Allocate research funding according to quality score

  3. Refreerank: Ranking Venues - Ranking Research
     • Which publication venues are chosen for inclusion?
     • A directed graph from unselected to selected venues reveals:
       ◦ similar topics
       ◦ perceived venue quality
     • Use the graph to rank venues
     • Use the graph to rank departments

  4. Refreerank: Strategy
     • Data:
       ◦ the REF submissions: all submitted publications
       ◦ DBLP: all Computer Science publications
       ◦ Computer Science only
     • Match REF authors with DBLP authors
       ◦ fuzzy matching
       ◦ scikit-learn
     • Create and traverse a directed graph
       ◦ NetworkX
     • Evaluate:
       ◦ compare Refreerank rankings with REF rankings

  5. Refreerank Data: DBLP Entries (1)
     • Shallow XML structure
     • Use the xml.sax library (nice XML parsing tutorial); a parsing sketch follows below

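Slide 5 leans on Python's built-in xml.sax parser for the shallow DBLP dump. As a rough illustration of that step (not the author's actual code), a minimal SAX handler could collect title, authors and venue from the standard dblp.xml record elements; the output structure and the local file path are assumptions:

```python
import xml.sax

class DBLPHandler(xml.sax.ContentHandler):
    """Collect title / authors / venue for each paper-like record in dblp.xml."""

    PAPER_TAGS = {"article", "inproceedings"}          # the DBLP record types of interest
    FIELD_TAGS = {"title", "author", "journal", "booktitle"}

    def __init__(self):
        super().__init__()
        self.papers = []
        self._paper = None
        self._field = None
        self._chars = []

    def startElement(self, name, attrs):
        if name in self.PAPER_TAGS:
            self._paper = {"title": "", "authors": [], "venue": ""}
        elif self._paper is not None and name in self.FIELD_TAGS:
            self._field, self._chars = name, []

    def characters(self, content):
        if self._field is not None:
            self._chars.append(content)

    def endElement(self, name):
        if self._paper is None:
            return
        if name == self._field:
            text = "".join(self._chars).strip()
            if name == "author":
                self._paper["authors"].append(text)
            elif name == "title":
                self._paper["title"] = text
            else:                                      # journal or booktitle
                self._paper["venue"] = text
            self._field = None
        elif name in self.PAPER_TAGS:
            self.papers.append(self._paper)
            self._paper = None

parser = xml.sax.make_parser()
# dblp.xml uses character entities defined in dblp.dtd, so the DTD must sit next to it.
parser.setFeature(xml.sax.handler.feature_external_ges, True)
handler = DBLPHandler()
parser.setContentHandler(handler)
parser.parse("dblp.xml")                               # hypothetical local copy of the dump
print(len(handler.papers), "papers parsed")
```
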
  6. Matching Records: Overview
     • The idea:
       ◦ find REF papers in DBLP records
       ◦ find other papers by REF authors in DBLP
       ◦ build the venue graph
     • The problems:
       ◦ DBLP is very big
       ◦ exact matching fails on small discrepancies, e.g.
         DBLP: SybilInfer: Detecting Sybil Nodes using Social Networks.
         REF:  Sybilinfer: Detecting Sybil nodes using social networks

  7. Matching Records: The Plan
     • Custom locality-sensitive hashing to make feature vectors
       ◦ similar records have similar vectors
       ◦ robust to small discrepancies
     • Dimensionality reduction: random projection
       ◦ keep most of the variance
       ◦ smaller feature space
     • Build a KD-tree
       ◦ fast nearest-neighbour lookup
     • Match REF papers with hashed DBLP paper titles

  8. Matching Records: Locality Sensitive Hashing
     • Example title: "SybilInfer: Detecting Sybil Nodes using Social Networks."
     • 5-mer substrings: sybil, ybili, bilin, …
     • md5 hash of each 5-mer gives an index: 5, 15, 4, …
     • Feature vector: set a 1 at each hashed index, e.g.
       0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1
     • Similar strings give similar feature vectors (see the hashing sketch below)

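A minimal sketch of the hashing step described on slide 8: split a title into 5-mers, md5-hash each one to an index, and set that bit in a fixed-length vector. The 1024-bit vector length is taken from the next slide; the lower-casing and whitespace stripping are assumptions added so the two SybilInfer spellings collide:

```python
import hashlib
import numpy as np

VECTOR_SIZE = 1024   # matches "reduce from 1024 dimensions" on the next slide

def title_to_vector(title, k=5, size=VECTOR_SIZE):
    """Hash every k-mer of a title into a fixed-length binary feature vector."""
    # Lower-casing and dropping whitespace is an assumption; it makes
    # "SybilInfer ..." and "Sybilinfer ..." produce the same 5-mers.
    text = "".join(title.lower().split())
    vec = np.zeros(size, dtype=np.uint8)
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]                              # "sybil", "ybili", "bilin", ...
        index = int(hashlib.md5(kmer.encode("utf-8")).hexdigest(), 16) % size
        vec[index] = 1                                    # set the bit for this 5-mer
    return vec

a = title_to_vector("SybilInfer: Detecting Sybil Nodes using Social Networks.")
b = title_to_vector("Sybilinfer: Detecting Sybil nodes using social networks")
print((a & b).sum() / (a | b).sum())                      # Jaccard similarity, close to 1.0
```
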
  9. Matching Records: Dimensionality Reduction
     • Random projection (Gaussian)
     • Reduce from 1024 to 7 dimensions
     • Retain > 90% of the variance
     • Scikit-learn implementation (sketch below)

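The reduction maps onto scikit-learn's GaussianRandomProjection; a sketch with stand-in data (the real input would be the 1024-bit title vectors from the hashing step):

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
# Stand-in for a stack of 1024-bit title vectors (sparse, so mostly zeros).
X = (rng.random((1000, 1024)) < 0.05).astype(float)

# Project down to 7 dimensions with a Gaussian random matrix, as on the slide.
projector = GaussianRandomProjection(n_components=7, random_state=0)
X_small = projector.fit_transform(X)
print(X_small.shape)   # (1000, 7)
```
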
  10. Matching Records: Nearest Neighbour Search
     • KD-tree: partition the record space
       ◦ pick an axis
       ◦ partition at the median along that axis
       ◦ repeat
     • Close neighbours end up in the same compartment
     • Efficient nearest-neighbour search:
       ◦ O(n log n) creation
       ◦ O(log n) lookup
     • Scikit-learn implementation (sketch below)
     • Thank you Wikipedia!

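scikit-learn also provides the KD-tree; a sketch of the lookup with stand-in 7-dimensional vectors in place of the projected titles:

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X_small = rng.normal(size=(1000, 7))    # stand-in for the projected DBLP title vectors

tree = KDTree(X_small)                  # O(n log n) construction

# A "REF title" that hashes and projects to almost the same point as record 42.
query = X_small[42] + rng.normal(scale=0.01, size=7)
dist, idx = tree.query(query.reshape(1, -1), k=1)   # O(log n) lookup
print(idx[0][0], dist[0][0])            # -> 42 and a small distance
```
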
  11. Matching Records: Performance
     • Authors matched: 93.6%
     • Papers matched: 77.8%
     • Parsing:
       ◦ from 1.4M to 11K papers
       ◦ time: 12 min (+ 11 seconds)

  12. Building the Ranking Graph: Strategy
     • Get all relevant papers from DBLP by matched REF author
     • Build a directed graph from non-included to included venues (graph-building sketch below)
     • Compute the stationary distribution over the normalised graph to get a venue ranking score
     • Combine venue rankings for all academics in an institution to produce a new ranking
       ◦ top 4 papers (REF-like)
       ◦ all relevant papers

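The slides don't show the graph-building code; a hedged sketch of one way to do it with NetworkX, drawing edges from a passed-over venue to a selected venue and counting repeated pairs as weights (the input records and the weighting scheme are assumptions), followed by row-normalisation into a transition matrix like the one on the next slide:

```python
import networkx as nx
import numpy as np

# Hypothetical per-author records: venues of the papers the author had selected
# for the REF, and venues of their other (non-included) DBLP papers.
authors = [
    {"selected": ["Nature", "PLoS Comp Bio"], "other": ["Homeopathy", "J. Neg. Res"]},
    {"selected": ["PNAS"],                    "other": ["PLoS", "J. Neg. Res"]},
]

G = nx.DiGraph()
for record in authors:
    for src in record["other"]:
        for dst in record["selected"]:
            # Edge from the venue that was passed over to the venue that was
            # chosen; accumulate a weight when the same pair recurs.
            if G.has_edge(src, dst):
                G[src][dst]["weight"] += 1
            else:
                G.add_edge(src, dst, weight=1)

# Row-normalise the weighted adjacency matrix into a transition matrix (slide 13).
venues = list(G.nodes)
A = nx.to_numpy_array(G, nodelist=venues, weight="weight")
row_sums = A.sum(axis=1, keepdims=True)
T = np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)
print(venues)
print(T.round(2))
```
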
  13. From Graph to Ranking: Transition Matrix

                      Homeopathy  J. Neg. Res  PLoS  PLoS Comp Bio  PNAS  Nature
      Homeopathy         0.50        0.25      0.25       0.00      0.00   0.00
      J. Neg. Res        0.16        0.08      0.33       0.08      0.17   0.17
      PLoS               0.00        0.00      0.40       0.14      0.25   0.21
      PLoS Comp Bio      0.00        0.00      0.00       0.00      0.00   0.00
      PNAS               0.00        0.00      0.25       0.00      0.25   0.50
      Nature             0.00        0.00      0.00       0.00      0.00   0.00

  14. From Graph to Ranking: Stationary Distributions
     • Better venues are sink nodes
     • Random walk on the graph: move from low- to high-ranked venues
     • Many random walks: the stationary distribution approximates a venue quality score
     • Computed by repeated matrix dot products (power-iteration sketch below)

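The "matrix dot product" bullet suggests power iteration: repeatedly multiply a probability vector by the transition matrix until it settles. A sketch using the slide 13 matrix; the slides do not say how the all-zero sink rows are handled, so the uniform restart and damping below are assumptions (PageRank-style), not the author's method:

```python
import numpy as np

def stationary_distribution(T, damping=0.85, iters=100):
    """Approximate the stationary distribution by repeated matrix dot products."""
    n = T.shape[0]
    T = T.copy()
    # ASSUMPTION: sink rows (all zeros -- the best venues) restart uniformly.
    T[T.sum(axis=1) == 0] = 1.0 / n
    # ASSUMPTION: PageRank-style damping to guarantee convergence.
    M = damping * T + (1 - damping) / n
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        pi = pi @ M          # the "matrix dot product" step
    return pi / pi.sum()

# Transition matrix from slide 13 (rows and columns in the order:
# Homeopathy, J. Neg. Res, PLoS, PLoS Comp Bio, PNAS, Nature).
T = np.array([
    [0.50, 0.25, 0.25, 0.00, 0.00, 0.00],
    [0.16, 0.08, 0.33, 0.08, 0.17, 0.17],
    [0.00, 0.00, 0.40, 0.14, 0.25, 0.21],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.25, 0.00, 0.25, 0.50],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
])
print(stationary_distribution(T).round(3))
```
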
  15. Venue Ranking: Top Venues

      Rank  Venue                        Score
      1     CHI                          12.56
      2     Media Forensics & Security   11.55
      3     POPL                         10.97
      4     J. Comput. Physics            9.97
      5     Theor. Comput. Sci.           9.72

  16. Venue Ranking: Issues
     • All roads lead to CHI …
     • Large venues have disproportionately high scores
     • Large research communities have disproportionately high scores

  17. University Ranking: Computing the Ranking
     • Combine the venue scores of submitted academics (combination sketch below)
     • Papers considered:
       ◦ all
       ◦ top 4
       ◦ top 12

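The slides do not spell out how venue scores are combined into an institution score; a minimal sketch under the assumption that each academic contributes the sum of the venue scores of their top-N papers (the venue scores, institutions and paper lists below are made up for illustration):

```python
# Hypothetical venue scores from the stationary distribution, and hypothetical
# submissions: which venues each submitted academic published in.
venue_score = {"CHI": 12.56, "POPL": 10.97, "Theor. Comput. Sci.": 9.72}

academics = [
    {"institution": "University A", "venues": ["CHI", "POPL", "CHI", "Theor. Comput. Sci."]},
    {"institution": "University B", "venues": ["POPL", "Theor. Comput. Sci."]},
]

TOP_N = 4   # slide 17: consider all papers, the top 4, or the top 12

scores = {}
for academic in academics:
    # ASSUMPTION: each academic contributes the sum of their N best venue scores;
    # the slides do not specify the combination rule.
    best = sorted((venue_score.get(v, 0.0) for v in academic["venues"]), reverse=True)[:TOP_N]
    scores[academic["institution"]] = scores.get(academic["institution"], 0.0) + sum(best)

for uni, total in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(uni, round(total, 2))
```
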
  18. University Ranking: The Results (Top 10)

      Rank  REF                          Refreerank
      1     University College London    University of Warwick
      2     University of Oxford         University College London
      3     University of Edinburgh      University of Liverpool
      4     University of Nottingham     Imperial College London
      5     Imperial College London      University of Oxford
      6     King’s College London        King’s College London
      7     University of Southampton    University of Sheffield
      8     University of Glasgow        University of Cambridge
      9     University of Cambridge      University of Manchester
      10    University of Liverpool      Queen Mary University

  19. Conclusions: What’s Good
     • It works! (Kind of)
     • Approximately reproduces the REF ranking
     • 2 days’ work: cheap and easy
     • REF selection acts as a crowd-sourced quality score

  20. Conclusions: What’s Not Good
     • Large venues / research communities dominate the scores
     • Missing records
     • Some significant differences from the REF score
     • Venue quality != paper quality
     • No DBLP equivalent for most disciplines (yet)