Understanding Principal Component Analysis Using Stack Overflow Data

Julia Silge
February 03, 2018

February 2018 talk at rstudio::conf

Transcript

  1. Understanding PCA Using
    Stack Overflow Data
    Julia Silge
    https://juliasilge.com/

  2. Hello
    I am Julia Silge
    Data Scientist, Stack Overflow
    @juliasilge

  4. Silge, J.D., Gebhardt, K., Bergmann, M., & Richstone, D. 2005, AJ, 130, 406

  5. @chrisalbon

  7. Data science at Stack Overflow

  8. Technology Relationships
    I spend lots of time thinking about how technologies are related to each other

  9. What kinds of tags do I visit?
    Tag        Percent
    r          63.1%
    regex      12.1%
    ggplot2     9.7%
    git         6.0%
    dataframe   4.2%

  10. Jamie Zawinski:
    Some people, when confronted with a problem, think,
    "I know, I'll use regular expressions."
    Now they have two problems.

  15. @chrisalbon

  16. User tag visits
       AccountId Tag            Value
     1   6461130 sass           0.00244
     2   1044010 tsql           0.00179
     3    405410 qt             0.00156
     4   3224070 http-headers   0.00306
     5  10525200 asp.net-mvc    0.00403
     6   6349580 amazon-s3      0.00123
     7   6114210 cookies        0.00373
     8   7397910 arrays         0.0237
     9  10997890 user-interface 0.00920
    10   1721510 fonts          0.00181
    11   9553740 sql-server     0.172
    12   3249020 frontend       0.00113
    13  10361710 concurrency    0.000377
    14   2251000 select         0.000527

  17. User tag visits
    library(dplyr)   # %>% and bind_cols()
    library(broom)   # tidy()
    # Cast the long user/tag table to a sparse user-by-tag matrix
    sparse_tag_matrix <- user_tag_counts %>%
      tidytext::cast_sparse(AccountId, Tag, Percent)
    # Center and scale each tag column
    tags_scaled <- scale(sparse_tag_matrix)
    # Truncated PCA, keeping the first 64 components
    tags_pca <- irlba::prcomp_irlba(tags_scaled, n = 64)
    # One row per tag, with its loading on each component
    tidied_pca <- bind_cols(Tag = colnames(tags_scaled),
                            tidy(tags_pca$rotation))

  20. Using R at
    Stack Overflow

  21. Thanks!
    Find me at @juliasilge and https://juliasilge.com/
    Thanks to Nick Larsen, Kevin Montrose, Jason Punyon, and David Robinson
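
The pipeline on slide 17 (pivot user/tag percentages into a user-by-tag matrix, scale it, run PCA, tidy the loadings) can be tried on a toy data set. The sketch below is an illustration, not the code from the talk: the six users and their percentages are invented, and base R's `tapply()` and `prcomp()` stand in for `tidytext::cast_sparse()` and `irlba::prcomp_irlba()` so that it runs without extra packages.

```r
# Toy long-format data: one row per (user, tag) pair.
# All values are invented for illustration; they are not Stack Overflow data.
user_tag_counts <- data.frame(
  AccountId = rep(1:6, each = 4),
  Tag       = rep(c("r", "ggplot2", "javascript", "css"), times = 6),
  Percent   = c(0.60, 0.30, 0.05, 0.05,
                0.55, 0.35, 0.10, 0.00,
                0.70, 0.20, 0.00, 0.10,
                0.05, 0.05, 0.50, 0.40,
                0.10, 0.00, 0.60, 0.30,
                0.00, 0.10, 0.45, 0.45)
)

# Pivot to a dense user-by-tag matrix
# (base-R stand-in for tidytext::cast_sparse, which returns a sparse matrix)
tag_matrix <- with(user_tag_counts,
                   tapply(Percent, list(AccountId, Tag), sum))

# Center and scale each tag column, as on the slide
tags_scaled <- scale(tag_matrix)

# Full PCA with base R (stand-in for the truncated irlba::prcomp_irlba)
tags_pca <- prcomp(tags_scaled)

# One row per tag with its loading on each principal component
loadings <- data.frame(Tag = rownames(tags_pca$rotation),
                       tags_pca$rotation,
                       row.names = NULL)
print(loadings[, c("Tag", "PC1")])
```

With this made-up data, `r` and `ggplot2` load on PC1 with one sign and `javascript` and `css` with the other, which is the kind of technology grouping the talk looks for. The overall sign of a component is arbitrary.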