What's in a name? Using first names as features for gender inference in Twitter

Wendy Liu
March 25, 2013

Presented at the AAAI 2013 Spring Symposium on Analyzing Microtext, held at Stanford University.

Transcript

  1. WHAT’S IN A NAME? Using first names as features for gender inference in Twitter. Wendy Liu and Derek Ruths, School of Computer Science, McGill University. March 25, AAAI 2013 Spring Symposium on Analyzing Microtext
  2. First names carry signal.

  3. First names carry signal. How can we use this signal?

  4. First names carry signal. How can we use this signal? What are the limitations?
  5. Our investigation First name gender

  6. Prior work: feature-based classifiers

    Burger, J.; Henderson, J.; Kim, G.; and Zarrella, G. 2011. Discriminating Gender on Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

    Pennacchiotti, M., and Popescu, A. 2011. A machine learning approach to Twitter user classification. In Proceedings of the International Conference on Weblogs and Social Media.
  7. Name-based classifiers vs. baseline classifier

  8. But first, we needed a dataset.

  9. But first, we needed a dataset. No canonical gender-labelled dataset

  10. To avoid: deriving labels from text.

  11. To avoid: deriving labels from text. Common practice

  12. To avoid: deriving labels from text. Common practice Inflates accuracy

  13-20. Username, latest status message, gender?

    Username | Latest status message | Gender?
    kelseygreenwell | I can't get over how perfect my prom dress is! | Female
    OfficeOfSteve | I would take pictures of myself at the gym, but I'm afraid I'd lose my man card on Twitter when people see that I bench press 40lbs | Male
    Zatics | I don't think I'll ever understand why people are obsessed with Nutella | ?
  21. Result: cherry-picking users Excludes users that are difficult to categorise

  22. Our approach: profile pictures.

  23. Amazon Mechanical Turk

  24. Amazon Mechanical Turk: 20 users per task

  25. Amazon Mechanical Turk: 20 users per task, 3 workers per user
  26. Result: 12,681 labelled users (4,449 male, 8,232 female).

  27. Result: 12,681 labelled users (4,449 male, 8,232 female). Download from bit.ly/microtext2013
  28. The classifier: support vector machine

  29. Prior work:

    Zamal, F. A.; Liu, W.; and Ruths, D. 2012. Homophily and latent attribute inference: inferring latent attributes of Twitter users from neighbors. In Proceedings of the International Conference on Weblogs and Social Media.

    Liu, W.; Zamal, F. A.; and Ruths, D. 2012. Using social media to infer gender composition from commuter populations. In Proceedings of the When the City Meets the Citizen Workshop, the International Conference on Weblogs and Social Media.
  30. Features

    k-top: words ("hello"); digrams ("he", "el", "ll", "lo"); trigrams ("hel", "ell", "llo"); stems ("hel") and co-stems ("lo"), both from the Lovins stemming algorithm; hashtags ("#hello")
    frequency (number per day): tweets, mentions, hashtags, links, retweets
    ratios: tweets to retweets, followers to followees
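The digram and trigram examples on this slide are character n-grams of the word "hello". A minimal Python sketch of that extraction follows; the function name `char_ngrams` is made up for illustration, and the sketch does not cover the k-top selection or the Lovins stemmer, which the slides name but do not detail.

```python
from typing import List


def char_ngrams(word: str, n: int) -> List[str]:
    """Return every run of n consecutive characters in word, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


# Reproduces the slide's examples for "hello".
print(char_ngrams("hello", 2))  # ['he', 'el', 'll', 'lo']
print(char_ngrams("hello", 3))  # ['hel', 'ell', 'llo']
```

A word shorter than `n` yields an empty list, so short tokens simply contribute no n-grams.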
  31. SVM kernel: radial basis function. Parameters chosen using grid search.

  32. US census data (1990): name score from -1 (female) to +1 (male)

  33. score = ((Number of males with this name) - (Number of females with this name)) / (Number of people with this name), from -1 (female) to +1 (male)
  34. Three gender inference methods

  35. Three gender inference methods Baseline

  36. Three gender inference methods Baseline Integrated

  37. Three gender inference methods Baseline Integrated Threshold
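The slides name the threshold method without defining it. One plausible reading, sketched below in Python, is that a user is labelled directly from the name score when its magnitude reaches a threshold τ, and handed to a trained classifier otherwise. The function `threshold_label`, the `fallback` callable, and the handling of missing scores are all assumptions made for illustration.

```python
from typing import Callable, Optional


def threshold_label(score: Optional[float], tau: float,
                    fallback: Callable[[], str]) -> str:
    """Label from the name score when it is decisive, else defer to a classifier.

    score    -- name score in [-1, +1], or None when the name is unknown
    tau      -- positive threshold on the magnitude of the score
    fallback -- hypothetical stand-in for a trained classifier's prediction
    """
    if score is not None and abs(score) >= tau:
        return "male" if score > 0 else "female"
    return fallback()


# With tau = 0.7, a score of -0.8 is decisive; a score of 0.5 is not.
print(threshold_label(-0.8, 0.7, lambda: "male"))   # female
print(threshold_label(0.5, 0.7, lambda: "female"))  # female (from the fallback)
```

Under this reading, lowering τ lets the name score decide more users and leaves fewer for the classifier.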

  38. Testing our methods: 4,000 users per gender, 10-fold cross-validation
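As a rough illustration of the evaluation setup, the Python sketch below splits 8,000 item indices (4,000 per gender) into 10 folds, each used once as the test set. The helper `k_fold_indices` and the fixed seed are assumptions; the sketch shuffles without stratifying by gender, and the slides do not say how the folds were actually drawn.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n: int, k: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k-fold cross-validation over n items."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, round-robin.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(8000, k=10))
print(len(splits))                           # 10
print(len(splits[0][0]), len(splits[0][1]))  # 7200 800
```

Every index appears in exactly one test fold, so each user is scored once by a model that never saw them during training.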

  39. Figure 1. SVM classifier results for all methods (accuracy, axis from 0% to 100%). Baseline: 83.3%; Integrated: 85.2%; Threshold τ = 1.0: 86.4%; Threshold τ = 0.7: 87.1%. *Error bars were too small and thus were omitted.
  40. Figure 2. Improvement of each method over the baseline. Baseline: 0.0%; Integrated: 11.4%; Threshold τ = 1.0: 18.6%; Threshold τ = 0.7: 22.8%.
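The slides do not define "improvement", but the Figure 2 values are consistent with the relative reduction in error rate computed from the Figure 1 accuracies. The Python sketch below checks that arithmetic; the function name `error_reduction` is an assumption, and the pairing of each accuracy with a method follows the order in which the labels and values appear on the slides.

```python
def error_reduction(baseline_acc: float, method_acc: float) -> float:
    """Percentage of the baseline's error that a method removes (accuracies in %)."""
    baseline_error = 100.0 - baseline_acc
    return 100.0 * (method_acc - baseline_acc) / baseline_error


# Accuracies as listed for Figure 1.
accuracy = {
    "Baseline": 83.3,
    "Integrated": 85.2,
    "Threshold tau = 1.0": 86.4,
    "Threshold tau = 0.7": 87.1,
}

for method, acc in accuracy.items():
    print(method, round(error_reduction(accuracy["Baseline"], acc), 1))
```

For example, the baseline error is 100 - 83.3 = 16.7 points, and the integrated method removes 85.2 - 83.3 = 1.9 of them, which is about 11.4% of the baseline error.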
  41. Trend: name information improves accuracy.

  42. Figure 3. Distribution of Twitter names by gender score. [Chart; axis: gender-name association, with ticks from 1.0 to -1.0; bar labels: 61%, 23%, 13%, <2%, <2%]
  43. Next steps

  44. Next steps Improving a priori knowledge

  45. Next steps Improving a priori knowledge The n-gram model

  46. Conclusions

  47. Conclusions Using the name field to improve performance

  48. Conclusions Using the name field to improve performance Strategy for constructing datasets

  49. Conclusions Using the name field to improve performance Strategy for constructing datasets Download our dataset: bit.ly/microtext2013
  50. Thank you! Wendy Liu (wendy.liu@mail.mcgill.ca) and Derek Ruths (derek.ruths@mcgill.ca), Network Dynamics Lab (www.networkdynamics.org), School of Computer Science, McGill University