ScholaRec

Recommendation Engine for Scholarly Articles.

Archit Sharma

May 06, 2014

Transcript

  1. Recommender System for Scholarly Articles
    Archit Sharma
    http://work.arcolife.in

  2. Introduction
    • Recommender systems model user preferences in order to suggest items to
    purchase or examine.
    • This project applies that idea to scholarly articles: it generates
    recommendations from the latent information about a user's research
    interests contained in their publication list.
    • The datasets can also be used for other purposes, such as classification,
    clustering, and trend analysis.

  3. What is ScholaRec?
    • ScholaRec is a recommender system for scientific documents.
    • It classifies documents and uses personalization features to recommend
    similar ones.

  4. Features
    • Search across a large collection of articles, reports, and other scholarly
    works
    • Seamless extension to existing online repositories of scholarly articles
    • Robust back-end search engine
    • Interactive user interface
    • Personalization through OpenID and OAuth integration
    • Recommendations based on the user's interests

  5. How does ScholaRec work?

  6. Archive Dump
    PDF to Text
    Keyword extraction
    User feedback rating and content-based filtering
    Custom search algorithms
    Word Similarity
    Representation of recommendation

  7. Flowchart of ScholaRec

  8. Algorithms

  9. Content-based filtering
    • Recommendations are made by comparing items against user profiles. Each item's
    content is represented as a set of identifiers.
    • Content-based filtering estimates a user's ratings for items from that user's
    history. This is the generalization of the aggregation functions used for
    content-based filtering.
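
The comparison of items against user profiles described on this slide can be sketched as follows. This is a minimal illustration, not the ScholaRec implementation: it assumes each item is a set of keyword identifiers, that a user profile is the summed keyword counts of articles in the user's history, and that cosine similarity is the scoring function. The names `build_profile`, `score`, and `recommend`, and the sample data, are hypothetical.

```python
from collections import Counter
from math import sqrt


def build_profile(liked_items):
    """Aggregate the keyword sets of a user's past items into weighted counts."""
    profile = Counter()
    for keywords in liked_items:
        profile.update(keywords)
    return profile


def score(profile, keywords):
    """Cosine similarity between the user profile and one item's keyword set."""
    if not profile or not keywords:
        return 0.0
    dot = sum(profile[k] for k in keywords)  # missing keywords count as 0
    norm_profile = sqrt(sum(v * v for v in profile.values()))
    norm_item = sqrt(len(keywords))  # each item keyword has weight 1
    return dot / (norm_profile * norm_item)


def recommend(profile, catalog, top_n=3):
    """Rank catalog items (id -> keyword set) by similarity to the profile."""
    ranked = sorted(
        catalog,
        key=lambda item_id: score(profile, catalog[item_id]),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical sample data, for illustration only.
history = [
    {"recommender", "filtering", "ratings"},
    {"recommender", "matrix", "factorization"},
]
catalog = {
    "paper_a": {"recommender", "ratings", "evaluation"},
    "paper_b": {"quantum", "optics"},
    "paper_c": {"matrix", "factorization", "recommender"},
}
profile = build_profile(history)
print(recommend(profile, catalog, top_n=2))
```

Here `paper_c` ranks first because it shares three keywords with the profile, including the most heavily weighted one, while `paper_b` shares none and scores zero.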

  10. Other algorithms used
    • Item-based algorithm: the heart of the recommendation engine
    • TF-IDF: search
    • Matrix factorization: table generation and matrix operations
    • Bag-of-words approach: field suggestion
    • Regex-based algorithm: parsing through Lucene/ElasticSearch
    • Word similarity / implicit algorithms: keyword suggestion
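
As an illustration of how TF-IDF can serve search, the sketch below ranks a few documents against a query in plain Python. It is a toy under stated assumptions: whitespace tokenization, term frequency normalized by document length, and `log(N / df)` as the inverse document frequency. A search engine such as ElasticSearch applies its own analyzers and scoring, so this does not reproduce ScholaRec's actual ranking; the function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a real system would use an analyzer."""
    return text.lower().split()


def build_index(docs):
    """Return per-document term counts and inverse document frequencies."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())  # count each term once per document
    n_docs = len(docs)
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    return term_counts, idf


def search(query, term_counts, idf):
    """Rank documents by the summed TF-IDF weight of the query terms."""
    scores = {}
    for doc_id, counts in term_counts.items():
        total = sum(counts.values())
        if total == 0:
            continue  # skip empty documents
        s = sum((counts[t] / total) * idf.get(t, 0.0) for t in tokenize(query))
        if s > 0:
            scores[doc_id] = s
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical sample documents, for illustration only.
docs = {
    "d1": "matrix factorization for recommender systems",
    "d2": "topic models for text mining",
    "d3": "recommender systems for scholarly articles",
}
term_counts, idf = build_index(docs)
print(search("scholarly recommender", term_counts, idf))
```

The word "for" appears in every document, so its IDF is zero and it contributes nothing to the ranking; "scholarly" appears in only one document and therefore dominates the score.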

  11. Data Representation

  12. Technology Stack

  13. A wide variety of free and open-source software tools and libraries
    • Python programming & scripting language
    • Django web framework
    • HTML5, CSS3 & jQuery
    • D3.js (for visualizations)
    • Twitter Bootstrap (responsive UI)
    • arXiv API
    • ElasticSearch & MongoDB
    • GNU/Linux
    • LaTeX
    • Git
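
To make the arXiv API entry concrete, here is a small sketch of querying it from Python. The endpoint `http://export.arxiv.org/api/query` and its `search_query`, `start`, and `max_results` parameters belong to the public arXiv API, which returns an Atom feed. Everything else is an assumption for illustration: the helper names `build_query_url` and `parse_feed` are hypothetical, the sample feed is invented placeholder data rather than a real API response, and the deck does not show how ScholaRec itself calls the API.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def build_query_url(terms, start=0, max_results=10):
    """Build a search URL that matches the terms against all fields."""
    params = {
        "search_query": "all:" + terms,
        "start": start,
        "max_results": max_results,
    }
    return ARXIV_API + "?" + urlencode(params)


def parse_feed(xml_text):
    """Extract the id and whitespace-normalized title of each Atom entry."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", ATOM):
        raw_id = entry.findtext("atom:id", default="", namespaces=ATOM)
        raw_title = entry.findtext("atom:title", default="", namespaces=ATOM)
        entries.append({"id": raw_id.strip(), "title": " ".join(raw_title.split())})
    return entries


# Invented placeholder feed, NOT a real arXiv response.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>An Illustrative
      Title</title>
  </entry>
</feed>"""

print(build_query_url("recommender systems", max_results=5))
print(parse_feed(SAMPLE_FEED))
```

In practice the URL would be fetched (for example with `urllib.request.urlopen`) and the response body passed to `parse_feed`; the sample string stands in for that response so the sketch runs without network access.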

  14. Project Timeline

  15. Task 4: Deciding right algorithms for task
    Task 3: Understanding recommendation algorithms
    Task 2: Data sources (DBLP, arXiv)
    Task 1: Finding application area & deciding academic research
    Dec’13 1st week Jan’14
    2nd and 3rd week Jan 2014

  16. Task 8: Bug testing / user feedback
    Task 7: Implementations & Web development
    Task 6: Data structuring, mining & analysis
    Task 5: Deciding on technology stack
    - 28 April’14
    2 Mar’14 to 19 April’14
    1 Feb’14 to Mar’14
    4th week Jan’14

  17. Market Research
    Existing products such as Google Scholar, Microsoft Virtual Academy, and arXiv let users
    search articles and rate them, but do not recommend them.

  18. Comparison
    • dblp.uni-trier.de
      • More than 2.3 million articles on computer science in October 2013
      • Developer: Alexander Weber
      • Alexa Rank: 8,715 (April 2014)
    • arxiv.org
      • 939,001 e-prints in Physics, Mathematics, Computer Science,
        Quantitative Biology, Quantitative Finance and Statistics
      • Creator: Paul Ginsparg
      • Owner: Cornell Library
      • Submission rate is more than 7,000 per month.
    • scholar.google.com
      • Bibliographic database
      • Owner: Google Inc.
      • High weight on citation counts
      • First search results are often highly cited articles
      • Google Scholar index includes most peer-reviewed online journals

  19. Division of Work

  20. Module A: Front end
    • Browser interface creation
    • Code refactoring and docstrings
    • GitHub page creation
    • Entire documentation & market research

  21. Module B: Backend
    • Data gathering & analysis
    • Packaging & testing
    • Data structuring & transformation
    • Python & shell scripting

  22. Demo

  23. Home Page

  24. Home Page

  25. Results

  26. Packaging & Testing

  27. GitHub Page

  29. Thank you!
    Questions?
    http://arcolife.github.io/scholarec