storyStackr

ceboylan
February 10, 2016

fight writer’s block and find your readership’s favorite topics


Transcript

  1. storyStackr: fight writer’s block and find your readership’s favorite topics
     I’m stuck. How do I better engage my readers while still writing in my favorite genre?
     • world’s largest community of readers and writers
     • readers can vote on stories
  2. Pipeline
     • documents (scraped stories)
     • LDA (Latent Dirichlet Allocation): stem words, remove stop words, tf-idf filter
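The preprocessing named on this slide (stemming, stop-word removal, tf-idf filtering) could be sketched roughly as follows. This is a minimal illustration, not the deck's actual code: the stop-word list, the `crude_stem` suffix stripper (a stand-in for a real stemmer), the `min_score` threshold, and the example sentences are all assumptions made up for the example.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is"}


def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split on non-letters, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def tfidf_filter(docs, min_score):
    """Keep terms whose best tf-idf score in any document reaches min_score."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    best = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[term])
            best[term] = max(best[term], tf * idf)
    keep = {term for term, score in best.items() if score >= min_score}
    return [[t for t in doc if t in keep] for doc in docs]


# Made-up example sentences, for illustration only.
stories = [
    "The spy fights the mafia",
    "A spy hunted the detective",
    "The spy and the hero",
]
tokens = [tokenize(s) for s in stories]
filtered = tfidf_filter(tokens, min_score=0.1)
print(filtered)
```

In this toy run, "spy" occurs in every story, so its idf is zero and the filter drops it, which is the point of a tf-idf filter: terms that do not distinguish documents are removed before topic modeling.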
  3. Pipeline
     • documents (scraped stories)
     • GLM (covariates: genre, votes): apply Wattpad genre labels, votes per document
     • LDA (Latent Dirichlet Allocation): stem words, remove stop words, tf-idf filter
  4. Pipeline
     • documents (scraped stories)
     • GLM (covariates: genre, votes): apply Wattpad genre labels, votes per document
     • STM (Structural Topic Model): vote-optimized topic distributions, top terms per topic
     • LDA (Latent Dirichlet Allocation): stem words, remove stop words, tf-idf filter
  5. Pipeline
     • documents (scraped stories)
     • GLM (covariates: genre, votes): apply Wattpad genre labels, votes per document
     • STM (Structural Topic Model): vote-optimized topic distributions, top terms per topic
     • missing and matching key words: user input (tags and summary/first chapter)
     • LDA (Latent Dirichlet Allocation): stem words, remove stop words, tf-idf filter
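The "missing and matching key words" step can be pictured as a comparison between a writer's tags and a topic's top terms. The sketch below is an assumption about how such a comparison might look, not the deck's implementation: `compare_keywords` and the user tags are hypothetical, while the topic terms are the ones listed for Topic 16 later in the transcript.

```python
def compare_keywords(user_tags, topic_terms):
    """Split a topic's top terms into those the writer already uses and those missing."""
    used = {tag.lower() for tag in user_tags}
    matching = [term for term in topic_terms if term in used]
    missing = [term for term in topic_terms if term not in used]
    return matching, missing


# Top terms for Topic 16 as listed in the deck.
topic_16 = ["action", "fight", "love", "mafia", "gang",
            "streetfight", "gun", "fighter", "revenge", "assassin"]

# Hypothetical tags supplied by a writer.
user_tags = ["Action", "love", "revenge", "school"]

matching, missing = compare_keywords(user_tags, topic_16)
print("matching:", matching)
print("missing:", missing)
```

Keeping the topic's own term order in both lists means the strongest terms of the topic are shown first, so a writer sees the most relevant missing key words at the top.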
  6. [Figure: Topic Vector of Document 1 and Topic Vector of Document 2, with topics labeled by genre (Romance, Poetry, Humor, Action, Adventure)]
  7. [Figure: Topic Vector of Document 1 and Topic Vector of Document 2, with topics labeled by genre (Romance, Poetry, Humor, Action, Adventure); each document is labeled “Rom-Coms”]
  8. About me!
     • Ph.D. in Cognitive Neuroscience: neural bases of combinatorial semantics
     • Data Science Fellow
  9. Validation 1: topics as sub-genre features (all docs - messy)
     • 1200 documents across 23 genres
     • each document is a vector of topic weights
     • correlating these topic vectors and ordering by distance should recapitulate genres (and show super- and sub-genre structures)
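The validation idea above (correlate topic vectors, then order documents by distance) can be sketched on toy data. The deck does not say which distance or ordering method was used, so the correlation distance (1 minus Pearson r) and the greedy nearest-neighbour ordering below are assumptions, and the topic weights are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (assumes non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlation_distance(x, y):
    """0 for perfectly correlated vectors, 2 for perfectly anti-correlated ones."""
    return 1.0 - pearson(x, y)


def greedy_order(vectors):
    """Start at document 0 and repeatedly append the closest unvisited document."""
    order = [0]
    remaining = set(range(1, len(vectors)))
    while remaining:
        last = vectors[order[-1]]
        nearest = min(remaining, key=lambda i: correlation_distance(last, vectors[i]))
        order.append(nearest)
        remaining.remove(nearest)
    return order


# Invented topic weights over three topics, for illustration only.
labels = ["romance_a", "action_a", "romance_b", "action_b"]
vectors = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
]

order = greedy_order(vectors)
print([labels[i] for i in order])
```

If topics behave as genre features, documents from the same genre should end up adjacent in the ordering, as the two romance and the two action documents do in this toy case. Hierarchical clustering would be a natural alternative for exposing super- and sub-genre structure.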
  10. Validation 2: topic coherency
      Top 10 terms in topmost topic for 3 randomly chosen genres:
      • Topic 16: action, fight, love, mafia, gang, streetfight, gun, fighter, revenge, assassin
      • Topic 24: mystery, crime, abuse, spy, violence, newadult, psychology, conspiracy, drug, detective
      • Topic 2: science, fiction, power, superhero, adventure, wattpad, super, dystopia, hero, superpower
  11. Validation 2: topic coherency
      Top 10 terms in topmost topic for 3 randomly chosen genres:
      • Topic 16 (Action): action, fight, love, mafia, gang, streetfight, gun, fighter, revenge, assassin
      • Topic 24 (Mystery / Thriller): mystery, crime, abuse, spy, violence, newadult, psychology, conspiracy, drug, detective
      • Topic 2 (Science Fiction): science, fiction, power, superhero, adventure, wattpad, super, dystopia, hero, superpower
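Reading off the "top terms in the topmost topic" amounts to two lookups: find the topic with the largest weight for a document, then sort that topic's terms by weight. The helper names and all weights below are hypothetical, chosen only to make the sketch runnable.

```python
def topmost_topic(doc_topic_weights):
    """Index of the topic with the largest weight in one document's topic vector."""
    return max(range(len(doc_topic_weights)), key=lambda k: doc_topic_weights[k])


def top_terms(term_weights, n=10):
    """The n highest-weighted terms of one topic, strongest first."""
    ranked = sorted(term_weights.items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:n]]


# Hypothetical weights, for illustration only.
doc_topics = [0.1, 0.6, 0.3]
topic_terms = [
    {"science": 0.08, "fiction": 0.07, "hero": 0.02},
    {"action": 0.09, "fight": 0.07, "love": 0.05, "school": 0.01},
    {"mystery": 0.10, "crime": 0.06, "spy": 0.03},
]

best = topmost_topic(doc_topics)
print(best, top_terms(topic_terms[best], n=3))
```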
  12. Back to Front
      • algorithm core: structural topic model (R)
      • data: 1.2k stories, with tags and summaries, across 23 genres
      • back end: R, stm, Python, and pandas to do the heavy lifting; SQLAlchemy for interfacing with SQL (PostgreSQL)
      • front end: Flask, Bootstrap, deployed on AWS
  13. Model
      • stm: structural topic model (R; Roberts et al., 2015)
      • combined LDA (latent Dirichlet allocation) with a regression model whose covariates were genre and votes
      • allowed the topic rank assignment for a given story (document) to vary, by genre, with the degree to which a topic predicts “likes”
      • used KL divergence to choose the number of topics
      • output document-topic and term-topic matrices to pandas and a PostgreSQL database
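The slide does not say how KL divergence was applied when choosing the number of topics, so the sketch below only shows the quantity itself for two discrete distributions. The function names and the example distributions are assumptions for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes.

    Assumes q is non-zero wherever p is non-zero; terms with p == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Symmetrised form, since KL(p || q) and KL(q || p) generally differ."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Made-up distributions, for illustration only.
p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q), symmetric_kl(p, q))
```

KL divergence is zero only when the two distributions are identical and grows as they diverge, which is what makes it usable as a score when comparing models fitted with different numbers of topics.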