Building a Recommender System for Medical Research Papers

Jill Cates
October 02, 2019

These slides walk through the steps involved in building a recommender system for QxMD's Read app.

Transcript

  1. Problem
     • QxMD’s Read app lets healthcare professionals stay up to date with medical research
     • Read has a feature where users can curate “collections” of papers to be shared with the community
     • Collections are currently under-utilized
  2. Dataset

     User Features        Datatype
     Institution name     str
     Specialty            str
     Profession           str
     Journal access       bool
     Registration date    datetime
     Last login           datetime

     Paper Features       Datatype
     Journal              str
     Publication date     datetime
     MeSH terms           list
  3. Dataset

     User-paper interaction         Datatype
     Read abstract?                 bool
     Dwell time on abstract (s)     int
     Read full-text?                bool
     Dwell time on full-text (s)    int
     Shared paper?                  bool
     Thumbs-up paper?               bool
     Thumbs-down paper?             bool
  4. Dataset

     User-paper interaction         Datatype
     Read abstract?                 bool
     Dwell time on abstract (s)     int
     Read full-text?                bool
     Dwell time on full-text (s)    int
     Shared paper?                  bool
     Thumbs-up paper?               bool
     Thumbs-down paper?             bool
  5. Data Cleaning
     Remove Outliers
     • Inactive users: haven’t used the platform in 2 years
     • Suspicious users (e.g., bots): multiple shares, paper and abstract reads per minute
     Define Threshold for Collection
     • >= 5 papers per collection
     • Actively curated (last updated in the past 2 years)
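The cleaning rules on this slide can be sketched as two simple filters. This is a minimal illustration, not the pipeline used for Read: the record layouts, the field names (`last_login`, `max_events_per_minute`, `paper_ids`, `last_updated`) and the bot threshold of 10 events per minute are all assumptions made for the example. Only the two-year window and the minimum of 5 papers come from the slide.

```python
from datetime import datetime, timedelta

TWO_YEARS = timedelta(days=2 * 365)   # "2 years" from the slide, approximated in days
MAX_EVENTS_PER_MINUTE = 10            # assumed bot threshold; the slide gives no number
MIN_PAPERS = 5                        # from the slide: >= 5 papers per collection


def clean_users(users, now):
    """Drop inactive users and users whose per-minute activity looks automated."""
    return [
        u for u in users
        if now - u["last_login"] <= TWO_YEARS
        and u["max_events_per_minute"] <= MAX_EVENTS_PER_MINUTE
    ]


def clean_collections(collections, now):
    """Keep collections that are large enough and were updated recently."""
    return [
        c for c in collections
        if len(c["paper_ids"]) >= MIN_PAPERS
        and now - c["last_updated"] <= TWO_YEARS
    ]


# Hypothetical records, for illustration only.
now = datetime(2019, 10, 2)
users = [
    {"id": "u1", "last_login": datetime(2019, 9, 1), "max_events_per_minute": 3},
    {"id": "u2", "last_login": datetime(2016, 1, 1), "max_events_per_minute": 2},     # inactive
    {"id": "u3", "last_login": datetime(2019, 9, 30), "max_events_per_minute": 120},  # bot-like
]
collections = [
    {"id": "c1", "paper_ids": list(range(8)), "last_updated": datetime(2019, 5, 1)},
    {"id": "c2", "paper_ids": list(range(3)), "last_updated": datetime(2019, 5, 1)},  # too small
    {"id": "c3", "paper_ids": list(range(9)), "last_updated": datetime(2015, 5, 1)},  # stale
]

kept_users = clean_users(users, now)
kept_collections = clean_collections(collections, now)
```

With the sample records, only user `u1` and collection `c1` survive both filters.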
  6. Data Transformation
     Two types of recommender systems:
     • Collaborative Filtering: similar users like similar things
     • Content-Based Filtering: relies on user and item features
     [Figure: user-item diagram with example users John, Jim, Anne, Liz, Erica]
  7. Data Transformation
     Create a user-collections (“utility”) matrix
     • User-collection interaction score: represents a user’s interaction with a collection
     [Figure: example utility matrix of interaction scores; axes: users, collections]
  8. Data Transformation
     Create a user-collections (“utility”) matrix
     • Aggregate number of abstract and full-text reads within a collection
     [Figure: example utility matrix of interaction scores; axes: users, collections]
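The aggregation described on this slide can be sketched as follows. The event tuple layout and the user and collection identifiers are hypothetical, and counting one point per abstract read and one per full-text read is an assumption; the slides do not give the exact scoring formula.

```python
from collections import defaultdict

# Hypothetical events: (user_id, collection_id, read_abstract, read_full_text)
events = [
    ("anne", "c1", True, False),
    ("anne", "c1", True, True),
    ("jim", "c2", True, False),
    ("liz", "c1", False, True),
]


def build_utility_matrix(events):
    """Sum abstract and full-text reads per (user, collection) pair."""
    scores = defaultdict(int)
    for user, collection, read_abstract, read_full_text in events:
        scores[(user, collection)] += int(read_abstract) + int(read_full_text)

    users = sorted({u for u, _ in scores})
    collections = sorted({c for _, c in scores})
    # Rows are users, columns are collections; missing pairs score 0.
    matrix = [[scores.get((u, c), 0) for c in collections] for u in users]
    return users, collections, matrix


users, collections, matrix = build_utility_matrix(events)
```

Changing the weights inside the loop (for example, counting shares more heavily than abstract reads) is one way to promote a different behaviour through the interaction score.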
  9. Data Transformation
     Create a user-collections (“utility”) matrix
     [Figure: example utility matrix of interaction scores; axes: users, collections]
     Pros
     - Does not need to know anything about the user or items
     - Can easily modify the “interaction” score based on the behaviour that you want to promote (e.g., abstract reads vs. shares)
     - Computationally efficient (parallelizable)
     - Captures inherent subtle characteristics
     Cons
     - Does not work for new users or items
     - Does not perform well on sparse datasets (i.e., not enough interactions)
  10. Data Transformation
      Create a user-collections (“utility”) matrix
      [Figure: example utility matrix of interaction scores with a “new user” label; axes: users, collections]
      Pros
      - Does not need to know anything about the user or items
      - Can easily modify the “interaction” score based on the behaviour that you want to promote (e.g., abstract reads vs. shares)
      - Computationally efficient (parallelizable)
      - Captures inherent subtle characteristics
      Cons
      - Does not work for new users or items
      - Does not perform well on sparse datasets (i.e., not enough interactions)
  11. Model Training
      Alternating Least Squares
      • Matrix factorization: an unsupervised learning technique
      • Factorize user-item matrix (R) into two latent factor matrices: 1) user-factor matrix (n_users, k), 2) item-factor matrix (k, n_items)
      $R_{mn} \approx P_{mk} \times Q_{nk}^{T} = \hat{R}$
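A minimal sketch of alternating least squares with NumPy may help make the factorization concrete. It is not the implementation behind the slides: it treats every entry of R (including zeros) as observed and adds an L2 penalty, whereas production recommenders often use a weighted implicit-feedback variant. The function name `als`, the hyperparameters `k`, `reg`, `n_iters` and the toy matrix are all assumptions. Here Q is stored as (n_items, k), so its transpose is the (k, n_items) item-factor matrix from the slide.

```python
import numpy as np


def als(R, k=2, reg=0.1, n_iters=20, seed=0):
    """Factorize R (n_users x n_items) as P @ Q.T by alternating ridge solves."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, k))  # user factors
    Q = rng.normal(scale=0.1, size=(n_items, k))  # item factors
    ridge = reg * np.eye(k)

    for _ in range(n_iters):
        # Fix Q and solve for P: P (Q^T Q + reg I) = R Q
        P = np.linalg.solve(Q.T @ Q + ridge, Q.T @ R.T).T
        # Fix P and solve for Q: Q (P^T P + reg I) = R^T P
        Q = np.linalg.solve(P.T @ P + ridge, P.T @ R).T
    return P, Q


# Toy rank-2 utility matrix (hypothetical scores, not taken from the slides).
R = np.array([
    [5.0, 5.0, 0.0, 0.0],
    [4.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 3.0],
    [0.0, 0.0, 4.0, 4.0],
])
P, Q = als(R, k=2)
R_hat = P @ Q.T  # reconstructed scores, used to rank collections per user
```

Each half-step is an ordinary regularized least-squares problem with a closed-form solution, which is why the user and item updates can be computed independently and in parallel.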
  12. Model Evaluation
      Precision@K
      • Proportion of items in a user’s top K recommendations that are relevant
      • Of the user’s top K recommendations, what proportion are relevant?
      • Minimize number of false positives
      $\text{precision} = \frac{TP}{TP + FP}$
      Recall@K
      • Proportion of relevant items that are captured in a user’s top K recommendations
      • Of the user’s relevant items, what proportion were captured in the user’s top K recommendations?
      • Minimize number of false negatives
      $\text{recall} = \frac{TP}{TP + FN}$
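Both metrics can be computed for a single user with a few lines of Python. The function name and the sample collection identifiers are hypothetical; "relevant" is assumed to mean the held-out collections the user actually interacted with.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Return (precision@k, recall@k) for one user's ranked recommendations."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)  # true positives
    precision = hits / len(top_k) if top_k else 0.0      # TP / (TP + FP)
    recall = hits / len(relevant) if relevant else 0.0   # TP / (TP + FN)
    return precision, recall


# Hypothetical ranked recommendations and held-out relevant collections.
recommended = ["c4", "c1", "c7", "c2", "c9"]
relevant = {"c1", "c2", "c3"}
precision, recall = precision_recall_at_k(recommended, relevant, k=4)
```

In this example two of the top four recommendations are relevant, so precision@4 is 0.5, and two of the three relevant collections were captured, so recall@4 is 2/3.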
  13. Generating Collection Titles
      Top MeSH Terms Using TF-IDF
      • Observational studies
      • Anesthesia
      • Respiratory aspiration
      • Child
      • Intensive care unit
      • Reviews
      • Adolescents
      • Combined modality therapy
      • Longitudinal studies
      • Infant, newborn
      • Blood glucose
      • Metabolic diseases
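One way to rank MeSH terms with TF-IDF is sketched below. The slides do not say how the scores were computed, so this assumes each collection is treated as one "document" whose tokens are the MeSH terms of its papers, and uses the plain `tf * log(N / df)` weighting. The function name, the collection names and the term lists are hypothetical.

```python
import math
from collections import Counter


def top_mesh_terms(collections, n=2):
    """Rank each collection's MeSH terms by TF-IDF and keep the top n."""
    n_collections = len(collections)

    # Document frequency: in how many collections does each term appear?
    df = Counter()
    for terms in collections.values():
        df.update(set(terms))

    result = {}
    for name, terms in collections.items():
        tf = Counter(terms)
        scores = {
            term: (count / len(terms)) * math.log(n_collections / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        result[name] = ranked[:n]
    return result


# Hypothetical collections; each list holds the MeSH terms of the collection's papers.
collections = {
    "A": ["Anesthesia", "Anesthesia", "Respiratory aspiration", "Child", "Reviews"],
    "B": ["Blood glucose", "Blood glucose", "Metabolic diseases", "Reviews", "Child"],
    "C": ["Intensive care unit", "Intensive care unit", "Longitudinal studies", "Reviews"],
}
titles = top_mesh_terms(collections, n=2)
```

A term such as "Reviews" that appears in every collection gets an IDF of zero, so it never surfaces as a title candidate, while terms concentrated in one collection rank highest.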