What's in a name? Using first names as features for gender inference in Twitter

Wendy Liu
March 25, 2013

Presented at the AAAI 2013 Spring Symposium on Analyzing Microtext, held at Stanford University.

Transcript

  1. WHAT’S IN A NAME?
    Using first names as features for gender inference in Twitter
    Wendy Liu and Derek Ruths
    School of Computer Science, McGill University
    March 25 – AAAI 2013 Spring Symposium on Analyzing Microtext

  2. First names carry signal.

  3. First names carry signal.
    How can we use this signal?

  4. First names carry signal.
    How can we use this signal?
    What are the limitations?

  5. Our investigation
    First name gender

  6. Prior work: feature-based classifiers
    Burger, J.; Henderson, J.; Kim, G.; and Zarrella, G. 2011. Discriminating
    Gender on Twitter. In Proceedings of the Conference on Empirical
    Methods in Natural Language Processing.
    Pennacchiotti, M., and Popescu, A. 2011. A machine learning approach
    to Twitter user classification. In Proceedings of the International
    Conference on Weblogs and Social Media.

  7. Name-based classifiers vs. baseline classifier

  8. But first, we needed a dataset.

  9. But first, we needed a dataset.
    No canonical gender-labelled dataset

  10. To avoid: deriving labels from text.

  11. To avoid: deriving labels from text.
    Common practice

  12. To avoid: deriving labels from text.
    Common practice
    Inflates accuracy

  13. Username | Latest status message | Gender?
    kelseygreenwell | I can't get over how perfect my prom dress is! |
    OfficeOfSteve | I would take pictures of myself at the gym, but I'm afraid I'd lose my man card on Twitter when people see that I bench press 40lbs |
    Zatics | I don't think I'll ever understand why people are obsessed with Nutella |

  14. Username | Latest status message | Gender?
    kelseygreenwell | I can't get over how perfect my prom dress is! |
    OfficeOfSteve | I would take pictures of myself at the gym, but I'm afraid I'd lose my man card on Twitter when people see that I bench press 40lbs |
    Zatics | I don't think I'll ever understand why people are obsessed with Nutella |

  15. Username | Latest status message | Gender?
    kelseygreenwell | I can't get over how perfect my prom dress is! | Female
    OfficeOfSteve | I would take pictures of myself at the gym, but I'm afraid I'd lose my man card on Twitter when people see that I bench press 40lbs |
    Zatics | I don't think I'll ever understand why people are obsessed with Nutella |

  16. Username | Latest status message | Gender?
    kelseygreenwell | I can't get over how perfect my prom dress is! | Female
    OfficeOfSteve | I would take pictures of myself at the gym, but I'm afraid I'd lose my man card on Twitter when people see that I bench press 40lbs |
    Zatics | I don't think I'll ever understand why people are obsessed with Nutella |

  17. Username | Latest status message | Gender?
    kelseygreenwell | I can't get over how perfect my prom dress is! | Female
    OfficeOfSteve | I would take pictures of myself at the gym, but I'm afraid I'd lose my man card on Twitter when people see that I bench press 40lbs | Male
    Zatics | I don't think I'll ever understand why people are obsessed with Nutella |

  18. Username | Latest status message | Gender?
    kelseygreenwell | I can't get over how perfect my prom dress is! | Female
    OfficeOfSteve | I would take pictures of myself at the gym, but I'm afraid I'd lose my man card on Twitter when people see that I bench press 40lbs | Male
    Zatics | I don't think I'll ever understand why people are obsessed with Nutella |

  19. Username | Latest status message | Gender?
    kelseygreenwell | I can't get over how perfect my prom dress is! | Female
    OfficeOfSteve | I would take pictures of myself at the gym, but I'm afraid I'd lose my man card on Twitter when people see that I bench press 40lbs | Male
    Zatics | I don't think I'll ever understand why people are obsessed with Nutella | ?

  20. Username | Latest status message | Gender?
    kelseygreenwell | I can't get over how perfect my prom dress is! | Female
    OfficeOfSteve | I would take pictures of myself at the gym, but I'm afraid I'd lose my man card on Twitter when people see that I bench press 40lbs | Male
    Zatics | I don't think I'll ever understand why people are obsessed with Nutella | ?

  21. Result: cherry-picking users
    Excludes users that are difficult to categorise

  22. Our approach: profile pictures.

  23. Amazon Mechanical Turk

  24. Amazon Mechanical Turk
    20 users per task

  25. Amazon Mechanical Turk
    20 users per task
    3 workers per user
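
The slides give the task layout (20 users per task, 3 workers per user) but not how the three judgments for a user were combined. The following is a minimal sketch, assuming a simple majority vote with an optional stricter agreement requirement. The names `batch_users`, `aggregate_labels`, and `min_agreement` are illustrative and do not come from the talk.

```python
from collections import Counter
from typing import Iterable, List, Optional


def batch_users(user_ids: List[str], users_per_task: int = 20) -> List[List[str]]:
    """Split users into tasks of at most `users_per_task` users each."""
    return [user_ids[i:i + users_per_task]
            for i in range(0, len(user_ids), users_per_task)]


def aggregate_labels(labels: Iterable[str], min_agreement: int = 2) -> Optional[str]:
    """Return the most common label if enough workers agree, else None.

    Assumption: majority vote. The talk does not state the actual rule.
    """
    counts = Counter(labels)
    if not counts:
        return None
    label, votes = counts.most_common(1)[0]
    return label if votes >= min_agreement else None
```

Setting `min_agreement=3` would keep only users on whom all three workers agree, which trades dataset size for label confidence.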

  26. Result: 12,681 labelled users.
    4,449 male, 8,232 female

  27. Result: 12,681 labelled users.
    4,449 male, 8,232 female
    Download from bit.ly/microtext2013

  28. The classifier
    Support vector machine

  29. Prior work:
    Zamal, F. A.; Liu, W.; and Ruths, D. 2012. Homophily and latent attribute
    inference: inferring latent attributes of Twitter users from neighbors. In
    Proceedings of the International Conference on Weblogs and Social Media.
    Liu, W.; Zamal, F. A.; and Ruths, D. 2012. Using social media to infer gender
    composition from commuter populations. In Proceedings of the When the City
    Meets the Citizen Workshop, the International Conference on Weblogs and
    Social Media.

  30. Features
    k-top
      words ("hello")
      digrams ("he", "el", "ll", "lo")
      trigrams ("hel", "ell", "llo")
      stems ("hel") (Lovins stemming algorithm)
      co-stems ("lo") (Lovins stemming algorithm)
      hashtags ("#hello")
    frequency (number per day)
      tweets, mentions, hashtags, links, retweets
    ratios
      tweets to retweets
      followers to followees
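
The digram and trigram examples on this slide can be reproduced with a short sketch of character n-gram extraction. The helper names `char_ngrams` and `k_top` are illustrative. The talk does not state how the "k-top" items are ranked, so raw frequency is used here as a stand-in. Lovins stemming is not shown because it is not in the Python standard library.

```python
from collections import Counter
from typing import Iterable, List


def char_ngrams(word: str, n: int) -> List[str]:
    """Return all contiguous character n-grams of `word`, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def k_top(items: Iterable[str], k: int) -> List[str]:
    """Return the k most frequent items.

    Assumption: ranking by raw count; the talk may use another criterion.
    Ties keep first-seen order (Counter.most_common behaviour).
    """
    return [item for item, _ in Counter(items).most_common(k)]


digrams = char_ngrams("hello", 2)   # the slide's digram example
trigrams = char_ngrams("hello", 3)  # the slide's trigram example
```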

  31. SVM kernel
    radial basis function
    parameters chosen using grid-search

  32. US census data (1990)
    name
    score: -1 (female) to +1 (male)

  33. score: -1 (female) to +1 (male)
    $$\text{score} = \frac{(\text{number of males with this name}) - (\text{number of females with this name})}{\text{number of people with this name}}$$
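
The score on this slide translates directly into code. This is a minimal sketch, assuming the number of people with a name is the sum of the male and female counts; the function name `gender_score` is illustrative.

```python
def gender_score(male_count: int, female_count: int) -> float:
    """Score in [-1, +1]: -1 is exclusively female, +1 is exclusively male."""
    total = male_count + female_count  # assumption: total = males + females
    if total == 0:
        raise ValueError("name has no recorded bearers")
    return (male_count - female_count) / total
```

A name held by 30 men and 10 women scores (30 - 10) / 40 = 0.5, and a name split evenly scores 0.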

  34. Three gender inference methods

  35. Three gender inference methods
    Baseline

  36. Three gender inference methods
    Baseline
    Integrated

  37. Three gender inference methods
    Baseline
    Integrated
    Threshold
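
The slides name the three methods without defining them. One plausible reading of the threshold method, sketched below, is to trust the name score when its magnitude reaches τ and otherwise defer to a trained classifier. This reading, the function `threshold_classify`, and the `fallback` callable are assumptions, not the authors' stated procedure.

```python
from typing import Callable, Optional


def threshold_classify(name_score: Optional[float],
                       tau: float,
                       fallback: Callable[[], str]) -> str:
    """Label by name score when |score| >= tau, else use `fallback`.

    Assumption: `fallback` stands in for a trained classifier's prediction.
    """
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must be in (0, 1]")
    if name_score is not None and abs(name_score) >= tau:
        return "male" if name_score > 0 else "female"
    return fallback()
```

Under this reading, τ = 1.0 uses the name only when it is exclusively associated with one gender, while τ = 0.7 also accepts strongly but not exclusively gendered names.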

  38. Testing our methods
    4,000 per gender
    10-fold cross validation
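
As an illustration of the evaluation setup, here is a minimal sketch of 10-fold cross-validation index splitting over 8,000 users (4,000 per gender). The function name `k_fold_indices` and the fixed `seed` are illustrative, and this version shuffles without stratifying by gender, which the talk may or may not have done.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices round-robin into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(8000, k=10))
```

Each of the ten splits trains on 7,200 users and tests on the remaining 800.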

  39. Figure 1. SVM classifier results for all methods.
    Baseline: 83.3%
    Integrated: 85.2%
    Threshold (τ = 1.0): 86.4%
    Threshold (τ = 0.7): 87.1%
    *Error bars were too small and thus were omitted

  40. Figure 2. Improvement of each method over the baseline.
    Baseline: 0.0%
    Integrated: 11.4%
    Threshold (τ = 1.0): 18.6%
    Threshold (τ = 0.7): 22.8%
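
The slides do not define "improvement", but the Figure 2 values are consistent with a relative reduction in error rate computed from the Figure 1 results. A minimal sketch under that assumption follows; the function name `relative_error_reduction` is illustrative.

```python
def relative_error_reduction(baseline_acc: float, method_acc: float) -> float:
    """Percentage of the baseline's error that the method removes.

    Both arguments are accuracies in percent (0 to 100).
    """
    baseline_error = 100.0 - baseline_acc
    method_error = 100.0 - method_acc
    return 100.0 * (baseline_error - method_error) / baseline_error


# Figure 1 values from the talk, in percent.
results = {"Baseline": 83.3, "Integrated": 85.2,
           "Threshold (tau=1.0)": 86.4, "Threshold (tau=0.7)": 87.1}
improvements = {name: round(relative_error_reduction(83.3, acc), 1)
                for name, acc in results.items()}
```

For example, the integrated method cuts the error from 16.7% to 14.8%, and 1.9 / 16.7 is roughly 11.4%.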

  41. Trend: name information improves accuracy.

  42. Figure 3. Distribution of Twitter names by gender score.
    Axis: gender-name association, from 1.0 to -1.0
    Segment labels: 61%, 23%, 13%, <2%, <2%

  43. Next steps

  44. Next steps
    Improving a priori knowledge

  45. Next steps
    Improving a priori knowledge
    The n-gram model

  46. Conclusions

  47. Conclusions
    Using the name field to improve performance

  48. Conclusions
    Using the name field to improve performance
    Strategy for constructing datasets

  49. Conclusions
    Using the name field to improve performance
    Strategy for constructing datasets
    Download our dataset: bit.ly/microtext2013

  50. Thank you!
    Wendy Liu and Derek Ruths
    Network Dynamics Lab (www.networkdynamics.org)
    School of Computer Science, McGill University
