LET'S GET META!
What the data from the joind.in API can
teach us about the PHP community
Nara Kasbergen (@xiehan)
May 11, 2019 | #phpday
Slide 2
Disclaimer / Apology
I'm a little sick today
I'm doing my best
Please bear with me
Slide 3
What is this?
◉ Results from a project to analyze the data from
joind.in's open API
◉ All joind.in events … not just PHP (but mostly)
◉ Not intended to be a very technical session
Slide 4
Who?
Nara Kasbergen
Software Engineer
Erica Liao
Data Analyst
Malorie Hughes
Data Scientist
Slide 5
When?
Serendipity Days is our workplace's version of "hack
days", which takes place 2-3 times a year. We get 2 full
days to work on something we're interested in,
unrelated to our day jobs, individually or in groups.
Then we give short presentations on what we learned.
Slide 6
Why?
◉ Big, juicy data set
◉ Inspired by my own experiences as a female
conference speaker
◉ Question: Is there gender bias in joind.in feedback?
◉ Tech is not a meritocracy
Slide 7
What's next?
◉ Working on a series of 4 blog posts
◉ Tried to have it done before this conference, but we ran
out of time
◉ Plan to open-source the code along with the posts
◉ ...pending legal department approval
Slide 8
Our process, summarized
1. Scrape the data + put it into a MySQL database
2. Clean/improve the data
3. Rudimentary data analysis (simple SQL queries)
4. Data science ☆.。₀:*゚magic ゚*:₀。.☆
Slide 9
Scraping + Cleaning the Data
How I did it
Slide 10
Tech stack
◉ node.js v10 + TypeScript + Sequelize ORM
◉ 100% local, command-line script
◉ data saved to MySQL DB (AWS Aurora)
◉ no unit tests
Slide 11
The joind.in API
api.joind.in/v2.1/events
api.joind.in/v2.1/events/7089
api.joind.in/v2.1/events/7089/talks
api.joind.in/v2.1/talks/26187
api.joind.in/v2.1/users/31070
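These endpoints follow a simple resource pattern, so a scraper mostly needs URL builders plus a loop over paginated results. The following is a minimal Python sketch (the project itself used node.js + TypeScript). The `https://` scheme, the helper names, and the `meta.next_page` pagination field are assumptions, not details taken from the slides.

```python
import json
from typing import Callable, Dict, Iterator
from urllib.request import urlopen

# Assumption: the API is served over HTTPS at the host shown on the slide.
BASE_URL = "https://api.joind.in/v2.1"


def event_url(event_id: int) -> str:
    """URL for a single event, e.g. .../events/7089."""
    return f"{BASE_URL}/events/{event_id}"


def event_talks_url(event_id: int) -> str:
    """URL for the talks of one event, e.g. .../events/7089/talks."""
    return f"{event_url(event_id)}/talks"


def fetch_json(url: str) -> Dict:
    """Download one URL and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def iter_pages(first_url: str, fetch: Callable[[str], Dict] = fetch_json) -> Iterator[Dict]:
    """Yield each page of a listing, following an assumed `meta.next_page` link."""
    url = first_url
    while url:
        page = fetch(url)
        yield page
        # Assumption: the response carries the next page's URL here.
        url = page.get("meta", {}).get("next_page")
```

Passing `fetch` as a parameter keeps the pagination loop testable without touching the network.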
Slide 12
Stats from the end of Day 2
◉ Number of conferences/meetups: 1858 (~75%)
◉ Number of talks: 14654
◉ Number of comments: 55684
◉ Number of commenters: 6844
◉ Number of speakers: 8691
Slide 13
Stats after we scraped everything
◉ Number of conferences/meetups: 2462
◉ Number of talks: 21570
◉ Number of comments: 84020
◉ Number of commenters: 9428
◉ Number of speakers: 12463
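Reading the "(~75%)" on the Day 2 slide as the share of events scraped by that point (an interpretation, not something the slides state), the two sets of counts can be compared directly. A small sketch, with the counts copied from the two stats slides:

```python
# (end of Day 2, after the full scrape), copied from the two stats slides.
COUNTS = {
    "conferences/meetups": (1858, 2462),
    "talks": (14654, 21570),
    "comments": (55684, 84020),
    "commenters": (6844, 9428),
    "speakers": (8691, 12463),
}


def percent_scraped(partial: int, total: int) -> float:
    """Share of the final count that was already present at the end of Day 2."""
    return 100.0 * partial / total


progress = {name: round(percent_scraped(p, t), 1) for name, (p, t) in COUNTS.items()}
for name, pct in progress.items():
    print(f"{name}: {pct}%")
```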
Slide 14
Challenges
◉ joind.in does not have any data about gender
◉ Not all speakers have a joind.in account
◉ Talks can have multiple speakers
◉ Commenters can choose to stay anonymous
◉ Not all comments have a 1-5 rating
Slide 15
Guessing gender based on name
◉ Python libraries:
◉ sexmachine
◉ gender-from-name
◉ genderize.io
◉ gender-api.com
Slide 16
There was a problem!
Can you guess what it is?
Slide 17
Andrea
Slide 18
Andrea
is a female name in the United States
Slide 19
Andrea
is a male name in Italy
Slide 20
The solution: use location
◉ Guess where a user is from based on the country
where they most often attend events
◉ Include that country code (e.g. "IT" for Italy) in the
query to genderize.io / gender-api.com, e.g.
gender-api.com/get?name=Andrea&country=IT
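The location heuristic above can be sketched in a few lines of Python. The function names and the `https://` scheme are assumptions for illustration; only the `name` and `country` query parameters come from the slide, and any authentication the service may require is left out.

```python
from collections import Counter
from typing import Iterable, Optional
from urllib.parse import urlencode


def most_frequent_country(event_countries: Iterable[str]) -> Optional[str]:
    """Guess a user's country as the one where they most often attended events."""
    counts = Counter(event_countries)
    if not counts:
        return None  # user has no attended events with a known country
    return counts.most_common(1)[0][0]


def gender_api_url(first_name: str, country: Optional[str]) -> str:
    """Build a query URL in the shape shown on the slide."""
    params = {"name": first_name}
    if country:
        params["country"] = country
    return "https://gender-api.com/get?" + urlencode(params)


# Country codes of the events one (hypothetical) user attended.
country = most_frequent_country(["IT", "US", "IT"])
print(gender_api_url("Andrea", country))
```

As the next slide shows, this heuristic guesses where someone attends events, not where they are from, so it can mislabel frequent travelers.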
Slide 21
Based on this location logic...
◉ Derick Rethans is from the US
◉ Marco Pivetta is from Germany
◉ Larry Garfield is from the US
◉ Arne Blankerts is from the US
◉ Miro Svrtan is from Croatia
Slide 22
Based on the improved genderizer...
◉ Arne Blankerts is male
◉ Sammy Kaye Powers is male
◉ Michele Orselli (Italian) is male
◉ Andrea Giuliano (Italian) is male
◉ Andrea Skeries (American) is female
Slide 23
Rudimentary Data Analysis
Just using simple SQL queries
Slide 24
Gender distribution among speakers & commenters
Slide 25
How many talks has each speaker given? (Male vs. Female)
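A question like this one maps onto a two-level `GROUP BY`. The sketch below is illustrative only: it uses an in-memory SQLite database as a stand-in for the project's MySQL database, and the table names, column names, and rows are all made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and toy rows; the real schema is not shown in the slides.
conn.executescript("""
    CREATE TABLE speakers (id INTEGER PRIMARY KEY, gender TEXT);
    CREATE TABLE talk_speakers (talk_id INTEGER, speaker_id INTEGER);
    INSERT INTO speakers VALUES (1, 'female'), (2, 'male'), (3, 'male');
    INSERT INTO talk_speakers VALUES
        (10, 1), (11, 1), (10, 2), (12, 3), (13, 3), (14, 3);
""")

# Inner query: talks per speaker. Outer query: how many speakers of each
# gender gave exactly that many talks.
QUERY = """
    SELECT gender, talks_given, COUNT(*) AS speakers
    FROM (
        SELECT s.id, s.gender, COUNT(ts.talk_id) AS talks_given
        FROM speakers AS s
        JOIN talk_speakers AS ts ON ts.speaker_id = s.id
        GROUP BY s.id, s.gender
    ) AS per_speaker
    GROUP BY gender, talks_given
    ORDER BY gender, talks_given
"""

rows = conn.execute(QUERY).fetchall()
for gender, talks_given, speakers in rows:
    print(gender, talks_given, speakers)
```

The join table lets a talk with several speakers count once for each of them.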
Slide 26
Women speakers receive twice as many comments on their talks as men do
Slide 27
Comment word cloud (all genders, all ratings)
Slide 28
Comment word clouds, all ratings: male speakers (left) vs. female speakers (right)
Slide 29
Comment word clouds, low ratings: male speakers (left) vs. female speakers (right)
Slide 30
Conference location vs. average rating (more red = lower, more blue = higher)
What the data scientist told me to say
◉ Logistic regression model
◉ Outcome variable: binarized score [1 if rating 4-5,
0 if rating 1-3]
◉ Predictors of interest: the interaction between
the speakers' gender (% female) and the
commenter's gender.
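The two derived variables on this slide are simple to compute per comment and per talk. A minimal sketch follows; the function names are hypothetical, and how unrated comments and speakers of unknown gender were actually handled is an assumption (here both are set aside).

```python
from typing import Optional, Sequence


def binarize_rating(rating: Optional[int]) -> Optional[int]:
    """Outcome variable: 1 for a rating of 4-5, 0 for a rating of 1-3."""
    if rating is None:
        return None  # assumption: comments without a rating are excluded
    return 1 if rating >= 4 else 0


def percent_female(speaker_genders: Sequence[str]) -> Optional[float]:
    """Share of a talk's speakers guessed to be female, as a percentage."""
    # Assumption: speakers whose gender could not be guessed are ignored.
    known = [g for g in speaker_genders if g in ("female", "male")]
    if not known:
        return None
    return 100.0 * sum(g == "female" for g in known) / len(known)
```

Using a percentage rather than a single label is what lets talks with multiple speakers enter the model.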
Slide 34
What the data scientist told me to say, cont.
◉ Controls: Number of speakers, commenter's
comment index, talk attendance count, year,
conference continent.
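Put together, the model described on these two slides would have roughly the following shape. This is a sketch of a standard logistic regression with an interaction term, not the exact specification used in the project; all symbols are introduced here for illustration. Here y is the binarized score, F the share of female speakers on the talk, C an indicator for the commenter's gender, and x the vector of controls.

```latex
\operatorname{logit}\,\Pr(y = 1)
  = \beta_0 + \beta_1 F + \beta_2 C + \beta_3 (F \times C)
  + \boldsymbol{\gamma}^{\top} \mathbf{x}
```

The coefficient on the interaction term is the quantity of interest: it captures how the effect of speaker gender on the rating differs by commenter gender.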
Slide 35
The results of the model
Slide 36
The results, interpreted
◉ The odds of a male commenter highly rating a
female speaker are 1/4 of his odds of highly rating a
male speaker.
◉ Female commenters highly rate female vs. male
speakers at a ratio of 3:1.
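Odds are not the same as probabilities, so an odds ratio of 1/4 does not mean "a quarter as likely". The sketch below converts between the two; the baseline probability is a made-up number for illustration, not a result from the model.

```python
def probability_to_odds(p: float) -> float:
    """Odds in favor of an event that has probability p (0 <= p < 1)."""
    return p / (1.0 - p)


def odds_to_probability(odds: float) -> float:
    """Probability corresponding to the given odds."""
    return odds / (1.0 + odds)


def apply_odds_ratio(baseline_p: float, odds_ratio: float) -> float:
    """Probability after scaling the baseline odds by an odds ratio."""
    return odds_to_probability(probability_to_odds(baseline_p) * odds_ratio)


# Hypothetical baseline: a commenter rates male speakers highly 80% of the time.
baseline = 0.8
print(apply_odds_ratio(baseline, 1 / 4))
```

With that hypothetical baseline, odds of 4:1 become 1:1, which is a probability of about 0.5 rather than 0.2.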
Slide 37
Recommendations
How do we make it better?
Slide 38
For regular users (commenters)
◉ Comment on men's and women's talks at an equal
rate (don't comment on women's talks more often)
◉ Examine your own biases
◉ Give constructive feedback
◉ Use the feedback sandwich technique!
Slide 39
For joind.in
◉ Add a gender field to the user profile and start
tracking this data internally
◉ Consider UI tweaks, e.g.
◉ Cues for more constructive feedback
◉ Change the rating scale (e.g. 1 to 6 instead of 1 to 5)
◉ Remove ratings entirely…?