Understanding PCA Using Stack Overflow Data
Julia Silge
https://juliasilge.com/
Slide 2
Slide 2 text
Hello
I am Julia Silge
Data Scientist, Stack Overflow
@juliasilge
Slide 3
Slide 3 text
No content
Slide 4
Slide 4 text
Silge, J.D., Gebhardt, K., Bergmann, M., & Richstone, D. 2005, AJ, 130, 406
Slide 5
Slide 5 text
@chrisalbon
Slide 6
Slide 6 text
No content
Slide 7
Slide 7 text
Data science at Stack Overflow
Slide 8
Slide 8 text
I spend lots of time thinking about how technologies are related to each other
Technology Relationships
Slide 9
Slide 9 text
What kinds of tags do I visit?
Tag Percent
r 63.1%
regex 12.1%
ggplot2 9.7%
git 6.0%
dataframe 4.2%
Slide 10
Slide 10 text
“Some people, when confronted with a problem, think, "I know, I'll use regular expressions." Now they have two problems.”
Jamie Zawinski
Slide 11
Slide 11 text
No content
Slide 12
Slide 12 text
No content
Slide 13
Slide 13 text
No content
Slide 14
Slide 14 text
No content
Slide 15
Slide 15 text
@chrisalbon
Slide 16
Slide 16 text
    AccountId  Tag             Value
 1    6461130  sass            0.00244
 2    1044010  tsql            0.00179
 3     405410  qt              0.00156
 4    3224070  http-headers    0.00306
 5   10525200  asp.net-mvc     0.00403
 6    6349580  amazon-s3       0.00123
 7    6114210  cookies         0.00373
 8    7397910  arrays          0.0237
 9   10997890  user-interface  0.00920
10    1721510  fonts           0.00181
11    9553740  sql-server      0.172
12    3249020  frontend        0.00113
13   10361710  concurrency     0.000377
14    2251000  select          0.000527
User tag visits
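One way to read the Value column is as the share of each user's traffic that goes to a given tag. That reading is an assumption, as are the raw counts and the Count column name below, which are made up for illustration. Under that assumption, a minimal base-R sketch of how such a column might be derived:

```r
# Hypothetical raw visit counts per user and tag (made-up numbers)
visits <- data.frame(
  AccountId = c(1, 1, 1, 2, 2),
  Tag       = c("r", "regex", "git", "sql-server", "tsql"),
  Count     = c(6, 3, 1, 8, 2)
)

# ave() returns, for every row, the total Count within that row's AccountId,
# so dividing by it gives each tag's share of the user's visits
visits$Value <- visits$Count / ave(visits$Count, visits$AccountId, FUN = sum)

print(visits)
```

Normalizing per user keeps very active and less active users on the same scale before any further analysis.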
Slide 17
Slide 17 text
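library(dplyr)  # provides %>%, bind_cols(), and as_tibble()

# Cast the long user-tag table to a sparse user x tag matrix
sparse_tag_matrix <- user_tag_counts %>%
  tidytext::cast_sparse(AccountId, Tag, Percent)
# Center and scale each tag column, then compute the first 64 components
tags_scaled <- scale(sparse_tag_matrix)
tags_pca <- irlba::prcomp_irlba(tags_scaled, n = 64)
# One row per tag, one column per component loading;
# as_tibble() stands in for the deprecated broom::tidy() method for matrices
tidied_pca <- bind_cols(Tag = colnames(tags_scaled),
                        as_tibble(tags_pca$rotation))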
User tag visits
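To see what the rotation matrix holds, here is a small self-contained sketch on a toy matrix. The data are random and purely illustrative, and it uses base R's prcomp() instead of irlba::prcomp_irlba(), which is a swap made only so the example runs without extra packages.

```r
# Toy user-by-tag matrix (random, made-up values for illustration only)
set.seed(42)
toy <- matrix(runif(40), nrow = 10, ncol = 4,
              dimnames = list(paste0("user", 1:10),
                              c("r", "regex", "ggplot2", "git")))

# Center and scale each tag column, as scale() does in the slide code
toy_scaled <- scale(toy)

# Full PCA with base R; the slide code computes only the leading components
toy_pca <- prcomp(toy_scaled)

# Loadings: one row per tag, one column per principal component
loadings <- data.frame(Tag = rownames(toy_pca$rotation),
                       toy_pca$rotation,
                       row.names = NULL)

# Share of total variance explained by each component
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)

print(loadings)
print(round(var_explained, 3))
```

Each column of the loadings table describes how strongly every tag contributes to that component, which is the shape the tidied_pca object takes before any further reshaping.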
Slide 18
Slide 18 text
No content
Slide 19
Slide 19 text
No content
Slide 20
Slide 20 text
Using R at Stack Overflow
Slide 21
Slide 21 text
Thanks!
Find me at @juliasilge and https://juliasilge.com/
Thanks to
Nick Larsen
Kevin Montrose
Jason Punyon
David Robinson