Big Data and Bibliometrics: Crowdsourcing the World’s Largest Open Database of Research

This presentation, delivered at O'Reilly's Strata conference in Santa Clara, covers how Mendeley uses technologies such as Hadoop and Mahout for scaling and recommendations.


William Gunn

March 01, 2012


  1. Big Data and Bibliometrics: Crowdsourcing the World’s Largest Open Database of Research
    William Gunn, Head of Academic Outreach, william.gunn@mendeley.com, @mrgunn
  2. A Big Problem
    “The state of knowledge of the human race is sitting in the scientists’ computers, and is currently not shared […] We need to get it unlocked so we can tackle those huge problems.”
  3. https://secure.flickr.com/photos/mharvey75/2493468041/ https://secure.flickr.com/photos/bfishadow/4237025430/ $31.2B $????

  4. Journal Impact Factor: Impact = Number of citations / Citeable items
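
To make the ratio on slide 4 concrete, here is a minimal sketch. It assumes the commonly used two-year window (citations received in a year to items published in the two preceding years, divided by the citable items published in those years). The function name and all numbers are hypothetical; they only illustrate how the decision about which items count as "citable" moves the result.

```python
def impact_factor(citations: int, citable_items: int) -> float:
    """Citations received divided by the number of citable items."""
    if citable_items <= 0:
        raise ValueError("citable_items must be positive")
    return citations / citable_items


# Hypothetical journal: 1,200 citations to items from the window.
# Counting all 400 published items as citable gives one figure;
# counting only 300 of them gives a noticeably higher one.
broad = impact_factor(1200, 400)
narrow = impact_factor(1200, 300)
print(f"all items citable: {broad:.2f}, fewer items citable: {narrow:.2f}")
```

The numerator is unchanged in both calls, so the whole difference comes from the denominator, which is the part the PLoS Medicine quote on slide 7 describes as negotiable.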

  5. Problems with Impact Factor: It's inaccurate

  6. Problems with Impact Factor

  7. Problems with Impact Factor
    “Thomson Scientific, the sole arbiter of the impact factor game...has no obligation to be accountable to... the authors and readers of scientific research. During discussions with Thomson Scientific over which article types in PLoS Medicine the company deems as “citable,” it became clear that the process of determining a journal's impact factor is unscientific and arbitrary.”
    http://www.plosmedicine.org/article/info:doi/10.1371/journal.pmed.0030291
  8. What matters is who is reading your work!
    Highly tweeted articles are 11x more likely to be highly cited. (Eysenbach 2011) http://www.jmir.org/2011/4/e123/
    The higher the impact factor, the more likely the research is to be retracted, partly due to intense competition. http://bjoern.brembs.net/news766.html.11
  9. https://secure.flickr.com/photos/fireflythegreat/2845637227/

  10. Building the black box Watch research as it happens in

  11. Mendeley makes science more collaborative and transparent: Install Mendeley Desktop
    Mendeley extracts research data… ...and aggregates research data in the cloud
    Collecting rich signals from domain experts.
  12. The world's largest open database of research
    160 million documents uploaded, 1.7 million users
    Cambridge, Stanford University, MIT, Imperial College London, University of Oxford, Harvard University, University of Michigan, University College London, University of California at Berkeley, Columbia University
  13. Rich user profile data

  14. Big data problems we've had to solve
    Metadata extraction: We trained a 2-stage SVM to achieve a precision of 0.91 and a recall of 0.94, beating all other approaches.
    Deduplication: To build the web catalog, we have to cluster and de-duplicate 17TB+ of documents daily.
    Author name disambiguation (aka the “big Wang problem”): We tried a variety of approaches, settling on a method of hierarchical agglomerative clustering.
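
Slide 14 names hierarchical agglomerative clustering as the approach chosen for author name disambiguation but gives no details. The following is a toy sketch of that general technique, not Mendeley's implementation. It assumes each author mention carries a set of co-author names, uses Jaccard overlap of those sets as the similarity, and merges clusters by average linkage until no pair reaches a threshold. The feature choice, the threshold, and all sample data are hypothetical.

```python
from itertools import combinations


def jaccard(a: frozenset, b: frozenset) -> float:
    """Overlap of two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_mentions(mentions, threshold=0.3):
    """Average-linkage agglomerative clustering of author mentions.

    Starts with one cluster per mention and repeatedly merges the most
    similar pair of clusters until the best score drops below threshold.
    """
    clusters = [[m] for m in mentions]
    while len(clusters) > 1:
        best_score, best_pair = -1.0, None
        for i, j in combinations(range(len(clusters)), 2):
            scores = [jaccard(a["coauthors"], b["coauthors"])
                      for a in clusters[i] for b in clusters[j]]
            score = sum(scores) / len(scores)
            if score > best_score:
                best_score, best_pair = score, (i, j)
        if best_score < threshold:
            break
        i, j = best_pair  # i < j, so deleting j keeps index i valid
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Four hypothetical papers that all list an author "Wang, W.".
mentions = [
    {"paper": "p1", "coauthors": frozenset({"Li, J.", "Chen, Y."})},
    {"paper": "p2", "coauthors": frozenset({"Li, J.", "Chen, Y.", "Zhao, Q."})},
    {"paper": "p3", "coauthors": frozenset({"Smith, A.", "Jones, B."})},
    {"paper": "p4", "coauthors": frozenset({"Smith, A."})},
]
grouped = [[m["paper"] for m in c] for c in cluster_mentions(mentions)]
print(grouped)
```

With this sample data the four mentions fall into two groups, suggesting two different people who share a name. A production system would likely combine more signals (affiliation, field, venue) and would need something faster than this quadratic pairwise search.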
  15. Solving our big data problems
    We needed something that provided scalable processing as well as data storage, which made HDFS + MapReduce on AWS a pretty obvious choice.
    Stats
    Search: good user experience, enterprise-class search with easy setup, vibrant open source community
    Scale: SSDs for catalog search, caching index in RAM
  16. None
  17. Solving our big data problems: Deduplication
    PDFs are easier than pictures or audio because the descriptors are already text strings. OCR doesn't work well. Hashing works for trivial modifications, but not for discriminating pre-prints vs. post-prints. You don't necessarily know when you don't have a complete record.
    File hash check (SHA-1), identifier check (e.g. PubMed ID), document fingerprint (fulltext), metadata similarity check, update individual article page
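
The sequence of checks listed on slide 17 can be sketched as a cascade that tries the cheapest test first. This is an illustrative sketch only: the slide does not say how the fingerprint or the metadata similarity is computed, so word shingles with Jaccard overlap and a title comparison via `difflib` are stand-ins, and the thresholds, field names, and sample documents are all hypothetical.

```python
import hashlib
from difflib import SequenceMatcher


def shingles(fulltext, k=5):
    """Fingerprint a text as its set of k-word shingles (assumed scheme)."""
    words = fulltext.lower().split()
    if not words:
        return frozenset()
    return frozenset(" ".join(words[i:i + k])
                     for i in range(max(1, len(words) - k + 1)))


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_stage(doc, rec, fp_threshold=0.8, meta_threshold=0.9):
    """Return the name of the first check that matches, or None."""
    if doc["sha1"] == rec["sha1"]:
        return "file hash"
    if doc["pmid"] and doc["pmid"] == rec["pmid"]:
        return "identifier"
    if jaccard(doc["fingerprint"], rec["fingerprint"]) >= fp_threshold:
        return "fingerprint"
    titles = SequenceMatcher(None, doc["title"].lower(), rec["title"].lower())
    if titles.ratio() >= meta_threshold:
        return "metadata"
    return None


def add_document(catalog, data, title, fulltext, pmid=None):
    """Fold a file into an existing record or create a new one."""
    doc = {"sha1": hashlib.sha1(data).hexdigest(), "pmid": pmid,
           "fingerprint": shingles(fulltext), "title": title, "copies": 1}
    for rec in catalog:
        stage = match_stage(doc, rec)
        if stage is not None:
            rec["copies"] += 1  # stands in for "update individual article page"
            return stage
    catalog.append(doc)
    return None


catalog = []
text = "alpha beta gamma delta epsilon zeta eta"
results = [
    add_document(catalog, b"pdf-1", "Deep sequencing of X", text, pmid="111"),
    add_document(catalog, b"pdf-1", "Deep sequencing of X", text, pmid="111"),
    add_document(catalog, b"pdf-2", "Untitled", "nine ten eleven twelve thirteen", pmid="111"),
    add_document(catalog, b"pdf-3", "Scanned copy", text),
    add_document(catalog, b"pdf-4", "DEEP SEQUENCING OF X", "one two three"),
    add_document(catalog, b"pdf-5", "Protein folding kinetics", "four five six seven eight nine"),
]
print(results, len(catalog))
```

The exact hash only catches byte-identical files, which matches the slide's remark that hashing handles trivial cases but cannot tell a pre-print from a post-print; the later, fuzzier stages exist for everything the hash misses.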
  18. None
  19. None
  20. Recommendations: “searches you haven't run yet”

  21. None
  22. Google Analytics for research

  23. None
  24. None
  25. Mendeley was the 3rd largest UK OpenURL referrer in April 2011, beating Medline and Scopus, with 34k click-throughs.
  26. None
  27. None
  28. Mendeley/PLoS API Binary Battle: $16,001 for the best app
    http://dev.mendeley.com
    Tim O’Reilly, O’Reilly Media; James Powell, CTO, Thomson Reuters; Juan Enriquez, MD, Excel Venture Management; John Wilbanks, VP Science, Creative Commons; Werner Vogels, CTO, Amazon.com
  29. None
  30. None
  31. None
  32. None
  33. None
  34. Select relation: supports, refutes, complements, uses same method, ...
    Result: A human-curated, constantly evolving semantic article database
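
One way to picture the curated relations on slide 34 is as typed links between articles. The sketch below is a hypothetical data model, not Mendeley's schema: it covers only the four relation types the slide spells out (the slide's "..." indicates there are others, which are left out here), and the class names, fields, and article identifiers are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    # Only the relation types named on the slide; the list is incomplete.
    SUPPORTS = "supports"
    REFUTES = "refutes"
    COMPLEMENTS = "complements"
    USES_SAME_METHOD = "uses same method"


@dataclass(frozen=True)
class Link:
    source: str         # hypothetical article identifier
    relation: Relation
    target: str         # hypothetical article identifier
    curator: str        # the person who asserted the relation


def related(links, article, relation):
    """Targets that `article` is linked to by the given relation."""
    return [link.target for link in links
            if link.source == article and link.relation is relation]


links = [
    Link("article-A", Relation.SUPPORTS, "article-B", "curator-1"),
    Link("article-A", Relation.REFUTES, "article-C", "curator-2"),
    Link("article-D", Relation.USES_SAME_METHOD, "article-A", "curator-1"),
]
print(related(links, "article-A", Relation.SUPPORTS))
```

Recording the curator alongside each link keeps the database human-curated in a traceable way, since any assertion can be attributed and later revised.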
  35. None
  36. None