Slide 1

Slide 1 text

WHAT’S IN A NAME?
Using first names as features for gender inference in Twitter
Wendy Liu and Derek Ruths
School of Computer Science, McGill University
March 25 – AAAI 2013 Spring Symposium on Analyzing Microtext

Slide 2

Slide 2 text

First names carry signal.

Slide 3

Slide 3 text

First names carry signal. How can we use this signal?

Slide 4

Slide 4 text

First names carry signal. How can we use this signal? What are the limitations?

Slide 5

Slide 5 text

Our investigation First name gender

Slide 6

Slide 6 text

Prior work: feature-based classifiers
Burger, J.; Henderson, J.; Kim, G.; and Zarrella, G. 2011. Discriminating Gender on Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
Pennacchiotti, M., and Popescu, A. 2011. A machine learning approach to Twitter user classification. In Proceedings of the International Conference on Weblogs and Social Media.

Slide 7

Slide 7 text

Name-based classifiers vs. baseline classifier

Slide 8

Slide 8 text

But first, we needed a dataset.

Slide 9

Slide 9 text

But first, we needed a dataset. No canonical gender-labelled dataset

Slide 10

Slide 10 text

To avoid: deriving labels from text.

Slide 11

Slide 11 text

To avoid: deriving labels from text. Common practice

Slide 12

Slide 12 text

To avoid: deriving labels from text. Common practice Inflates accuracy

Slide 13

Slide 13 text

Username | Latest status message | Gender?
kelseygreenwell | I can't get over how perfect my prom dress is!
OfficeOfSteve | I would take pictures of myself at the gym, but I'm afraid I'd lose my man card on Twitter when people see that I bench press 40lbs
Zatics | I don't think I'll ever understand why people are obsessed with Nutella

Slide 15

Slide 15 text

Username | Latest status message | Gender?
kelseygreenwell | I can't get over how perfect my prom dress is! | Female
OfficeOfSteve | I would take pictures of myself at the gym, but I'm afraid I'd lose my man card on Twitter when people see that I bench press 40lbs
Zatics | I don't think I'll ever understand why people are obsessed with Nutella

Slide 17

Slide 17 text

Username | Latest status message | Gender?
kelseygreenwell | I can't get over how perfect my prom dress is! | Female
OfficeOfSteve | I would take pictures of myself at the gym, but I'm afraid I'd lose my man card on Twitter when people see that I bench press 40lbs | Male
Zatics | I don't think I'll ever understand why people are obsessed with Nutella

Slide 18

Slide 18 text

Username | Latest status message | Gender?
kelseygreenwell | I can't get over how perfect my prom dress is! | Female
OfficeOfSteve | I would take pictures of myself at the gym, but I'm afraid I'd lose my man card on Twitter when people see that I bench press 40lbs | Male
Zatics | I don't think I'll ever understand why people are obsessed with Nutella

Slide 19

Slide 19 text

Username | Latest status message | Gender?
kelseygreenwell | I can't get over how perfect my prom dress is! | Female
OfficeOfSteve | I would take pictures of myself at the gym, but I'm afraid I'd lose my man card on Twitter when people see that I bench press 40lbs | Male
Zatics | I don't think I'll ever understand why people are obsessed with Nutella | ?

Slide 21

Slide 21 text

Result: cherry-picking users
Excludes users who are difficult to categorise

Slide 22

Slide 22 text

Our approach: profile pictures.

Slide 23

Slide 23 text

Amazon Mechanical Turk

Slide 24

Slide 24 text

Amazon Mechanical Turk 20 users per task

Slide 25

Slide 25 text

Amazon Mechanical Turk 20 users per task 3 workers per user
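The slides state that each user was shown to three workers but do not say how the three judgements were combined. The following is a minimal sketch under an assumed agreement rule; the function `aggregate_labels`, its `min_agreement` parameter, and the sample judgements are hypothetical and not taken from the talk.

```python
from collections import Counter


def aggregate_labels(judgements, min_agreement=3):
    """Return the label that enough workers agree on, or None otherwise.

    Assumption: the agreement rule is illustrative only. The slides say
    each user was labelled by 3 workers, not how disagreements were handled.
    """
    if not judgements:
        return None
    label, votes = Counter(judgements).most_common(1)[0]
    return label if votes >= min_agreement else None


# Hypothetical worker judgements for three users.
sample = {
    "user_a": ["female", "female", "female"],
    "user_b": ["male", "male", "unknown"],
    "user_c": ["male", "female", "unknown"],
}
labels = {user: aggregate_labels(votes) for user, votes in sample.items()}
```

With the default of unanimous agreement, only `user_a` receives a label; lowering `min_agreement` to 2 would also label `user_b`.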

Slide 26

Slide 26 text

Result: 12,681 labelled users. 4,449 male, 8,232 female

Slide 27

Slide 27 text

Result: 12,681 labelled users. 4,449 male, 8,232 female Download from bit.ly/microtext2013

Slide 28

Slide 28 text

The classifier Support vector machine

Slide 29

Slide 29 text

Prior work:
Zamal, F. A.; Liu, W.; and Ruths, D. 2012. Homophily and latent attribute inference: inferring latent attributes of Twitter users from neighbors. In Proceedings of the International Conference on Weblogs and Social Media.
Liu, W.; Zamal, F. A.; and Ruths, D. 2012. Using social media to infer gender composition from commuter populations. In Proceedings of the When the City Meets the Citizen Workshop, the International Conference on Weblogs and Social Media.

Slide 30

Slide 30 text

Features
k-top: words ("hello"), digrams ("he", "el", "ll", "lo"), trigrams ("hel", "ell", "llo"), stems ("hel"), co-stems ("lo"), hashtags ("#hello")
Stems and co-stems: Lovins stemming algorithm
Frequency (number per day): tweets, mentions, hashtags, links, retweets
Ratios: tweets to retweets, followers to followees
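The character n-gram and k-top features can be illustrated with a short sketch. The helper names `char_ngrams`, `k_top`, and `ratio` and the sample word list are hypothetical; Lovins stemming is left out because the Python standard library has no implementation of it.

```python
from collections import Counter


def char_ngrams(word, n):
    """All contiguous character n-grams of a word, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def k_top(items, k):
    """The k most frequent items, most frequent first."""
    return [item for item, _ in Counter(items).most_common(k)]


def ratio(numerator, denominator):
    """A count ratio such as tweets to retweets; 0.0 if the denominator is 0."""
    return numerator / denominator if denominator else 0.0


digrams = char_ngrams("hello", 2)   # ["he", "el", "ll", "lo"]
trigrams = char_ngrams("hello", 3)  # ["hel", "ell", "llo"]

# Hypothetical word stream standing in for a user's tweets.
words = "hello world hello twitter".split()
top_word = k_top(words, 1)          # ["hello"]
```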

Slide 31

Slide 31 text

SVM
Kernel: radial basis function
Parameters: chosen using grid search
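As a rough illustration of an RBF-kernel SVM tuned by grid search, here is a sketch using scikit-learn. The slides do not name the library or the parameter values searched, so the library choice, the synthetic data, and the `param_grid` values are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data; the real features come from tweets and profiles.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Assumed grid: the slides do not list the values that were searched.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

# Try every (C, gamma) pair with 5-fold cross-validation and keep the best.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
best_params = search.best_params_
```

`search.best_score_` then holds the mean cross-validated accuracy of the selected parameter pair.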

Slide 32

Slide 32 text

US census data (1990)
Each name is given a score from -1 (female) to +1 (male)

Slide 33

Slide 33 text

Score, from -1 (female) to +1 (male):
$$\text{score} = \frac{(\text{number of males with this name}) - (\text{number of females with this name})}{\text{number of people with this name}}$$
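The score can be computed directly from per-name counts. This sketch assumes the number of people with a name is the sum of the male and female counts; the function name `gender_score` and the counts below are made up for illustration and are not census figures.

```python
def gender_score(males, females):
    """Name score in [-1, +1]: -1 is all female, +1 is all male."""
    total = males + females  # assumption: everyone counted is male or female
    if total == 0:
        raise ValueError("name does not appear in the counts")
    return (males - females) / total


# Made-up counts (males, females), not census figures.
counts = {"alex": (750, 250), "mary": (0, 1000), "john": (1000, 0)}
scores = {name: gender_score(m, f) for name, (m, f) in counts.items()}
```

A name used equally by both genders scores 0, so the absolute value of the score indicates how strongly the name is associated with one gender.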

Slide 34

Slide 34 text

Three gender inference methods

Slide 35

Slide 35 text

Three gender inference methods Baseline

Slide 36

Slide 36 text

Three gender inference methods Baseline Integrated

Slide 37

Slide 37 text

Three gender inference methods Baseline Integrated Threshold
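The slides name the three methods without defining them. The sketch below is one plausible reading, not the authors' stated procedure: the integrated method is assumed to pass the name score to the classifier as an extra feature, and the threshold method is assumed to label directly from the name when the absolute score is at least τ, deferring to a classifier otherwise. The names `integrated_features`, `threshold_classify`, and `fallback` are hypothetical.

```python
def integrated_features(features, score):
    """Append the name score as one more feature (0.0 when the name is unknown)."""
    return list(features) + [0.0 if score is None else score]


def threshold_classify(score, tau, fallback):
    """Label from the name score when it is decisive, else defer.

    score: name score in [-1, +1], or None if the name is unknown.
    tau: threshold on the absolute score (the slides report 1.0 and 0.7).
    fallback: zero-argument callable returning "male" or "female";
        a hypothetical stand-in for a trained classifier.
    """
    if not 0 < tau <= 1:
        raise ValueError("tau must be in (0, 1]")
    if score is not None and abs(score) >= tau:
        return "male" if score > 0 else "female"
    return fallback()


# A score of 0.8 is decisive at tau = 0.7 but not at tau = 1.0.
label_strict = threshold_classify(0.8, 1.0, fallback=lambda: "female")
label_loose = threshold_classify(0.8, 0.7, fallback=lambda: "female")
```

Lowering τ lets the name decide for more users, at the cost of trusting names that are less strongly associated with one gender.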

Slide 38

Slide 38 text

Testing our methods
4,000 users per gender
10-fold cross-validation
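A balanced sample with 10-fold cross-validation can be set up along these lines. This is a sketch only: the helper names, the placeholder user IDs, and the random seed are assumptions, and the folds here are not stratified by gender.

```python
import random


def balanced_sample(users_by_gender, per_gender, seed=0):
    """Draw the same number of users from each gender, then shuffle."""
    rng = random.Random(seed)
    sample = []
    for gender, users in users_by_gender.items():
        sample.extend((user, gender) for user in rng.sample(users, per_gender))
    rng.shuffle(sample)
    return sample


def k_fold_splits(items, k=10):
    """Yield (train, test) pairs; each item is in exactly one test fold."""
    folds = [items[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, test


# Placeholder IDs sized like the labelled dataset (4,449 male, 8,232 female).
users = {
    "male": [f"m{i}" for i in range(4449)],
    "female": [f"f{i}" for i in range(8232)],
}
data = balanced_sample(users, per_gender=4000)
splits = list(k_fold_splits(data, k=10))
```

Each of the 10 splits trains on 7,200 users and tests on the remaining 800.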

Slide 39

Slide 39 text

Figure 1. SVM classifier results for all methods (axis from 0% to 100%).
Baseline: 83.3%
Integrated: 85.2%
Threshold τ = 1.0: 86.4%
Threshold τ = 0.7: 87.1%
*Error bars were too small and thus were omitted

Slide 40

Slide 40 text

Figure 2. Improvement of each method over the baseline (axis from 0% to 100%).
Baseline: 0.0%
Integrated: 11.4%
Threshold τ = 1.0: 18.6%
Threshold τ = 0.7: 22.8%
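The slides do not define "improvement", but the Figure 2 values match the relative reduction in error rate computed from the Figure 1 accuracies. The sketch below checks that reading; the function name `error_reduction` and the dictionary keys are illustrative.

```python
# Accuracies reported in Figure 1, in percent.
accuracy = {
    "Baseline": 83.3,
    "Integrated": 85.2,
    "Threshold tau=1.0": 86.4,
    "Threshold tau=0.7": 87.1,
}


def error_reduction(acc, base):
    """Percentage of the baseline's error that the method removes."""
    return 100.0 * (acc - base) / (100.0 - base)


improvement = {
    name: round(error_reduction(acc, accuracy["Baseline"]), 1)
    for name, acc in accuracy.items()
}
```

For example, the integrated method cuts the baseline error of 16.7 points by 1.9 points, which is 11.4% of that error.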

Slide 41

Slide 41 text

Trend: name information improves accuracy.

Slide 42

Slide 42 text

Figure 3. Distribution of Twitter names by gender score. Axis: gender-name association, from 1.0 to -1.0. Shares shown: 61%, 23%, 13%, <2%, <2%.

Slide 43

Slide 43 text

Next steps

Slide 44

Slide 44 text

Next steps Improving a priori knowledge

Slide 45

Slide 45 text

Next steps Improving a priori knowledge The n-gram model

Slide 46

Slide 46 text

Conclusions

Slide 47

Slide 47 text

Conclusions Using the name field to improve performance

Slide 48

Slide 48 text

Conclusions Using the name field to improve performance Strategy for constructing datasets

Slide 49

Slide 49 text

Conclusions Using the name field to improve performance Strategy for constructing datasets Download our dataset: bit.ly/microtext2013

Slide 50

Slide 50 text

Thank you!
Wendy Liu (wendy.liu@mail.mcgill.ca) and Derek Ruths (derek.ruths@mcgill.ca)
Network Dynamics Lab (www.networkdynamics.org)
School of Computer Science, McGill University