[phpday] LET'S GET META! What the data from the joind.in API can teach us about the PHP community

LET'S GET META! What the data from the joind.in API can teach us about the PHP community
Nara Kasbergen (@xiehan) May 11, 2019 | #phpday

Disclaimer / Apology I'm a little sick today. I'm doing
my best. Please bear with me.

What is this? ◉ Results from a project to analyze
the data from joind.in's open API ◉ All joind.in events … not just PHP (but mostly) ◉ Not intended to be a very technical session

Who? ◉ Nara Kasbergen, Software Engineer ◉ Erica Liao, Data Analyst
◉ Malorie Hughes, Data Scientist

When? Serendipity Days is our workplace's version of "hack days",
which takes place 2-3 times a year. We get 2 full days to work on something we're interested in, unrelated to our day jobs, individually or in groups. Then we give short presentations on what we learned.

Why? ◉ Big, juicy data set ◉ Inspired by my
own experiences as a female conference speaker ◉ Question: Is there gender bias in joind.in feedback? ◉ Tech is not a meritocracy

What's next? ◉ Working on a series of 4 blog
posts ◉ Tried to have it done before this conference, but we ran out of time ◉ Plan to open-source the code along with the posts ◉ ...pending legal department approval

Our process, summarized 1. Scrape the data + put it
into a MySQL database 2. Clean/improve the data 3. Rudimentary data analysis (simple SQL queries) 4. Data science ☆.｡₀:*ﾟmagic ﾟ*:₀｡.☆

Scraping + Cleaning the Data How I did it

Tech stack ◉ node.js v10 + TypeScript + Sequelize ORM
◉ 100% local, command-line script ◉ data saved to MySQL DB (AWS Aurora) ◉ no unit tests

The joind.in API api.joind.in/v2.1/events api.joind.in/v2.1/events/7089 api.joind.in/v2.1/events/7089/talks api.joind.in/v2.1/talks/26187 api.joind.in/v2.1/users/31070
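The URL patterns on this slide can be sketched as a small helper. This is only an illustration, not the scraper's actual code: `resourceUrl` and `Resource` are hypothetical names, and the `https://` scheme is an assumption, since the slide lists the hosts and paths without one.

```typescript
// Base path taken from the slide; the https:// scheme is an assumption.
const API_BASE = "https://api.joind.in/v2.1";

// The three top-level resources that appear in the slide's example URLs.
type Resource = "events" | "talks" | "users";

// Build the URL for a collection (no id), a single item (id),
// or the talks that belong to one event (id plus "talks").
function resourceUrl(resource: Resource, id?: number, sub?: "talks"): string {
  const parts: string[] = [API_BASE, resource];
  if (id !== undefined) {
    parts.push(String(id));
    if (sub) parts.push(sub);
  }
  return parts.join("/");
}
```

For example, `resourceUrl("events", 7089, "talks")` reproduces the third URL on the slide, with the assumed scheme prepended.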

Stats from the end of Day 2 ◉ Number of
conferences/meetups: 1858 (~75%) ◉ Number of talks: 14654 ◉ Number of comments: 55684 ◉ Number of commenters: 6844 ◉ Number of speakers: 8691

Stats after we scraped everything ◉ Number of conferences/meetups: 2462
◉ Number of talks: 21570 ◉ Number of comments: 84020 ◉ Number of commenters: 9428 ◉ Number of speakers: 12463

Challenges ◉ joind.in does not have any data about gender
◉ Not all speakers have a joind.in account ◉ Talks can have multiple speakers ◉ Commenters can choose to stay anonymous ◉ Not all comments have a 1-5 rating

Guessing gender based on name ◉ Python libraries: ◉ sexmachine
◉ gender-from-name ◉ genderize.io ◉ gender-api.com

There was a problem! Can you guess what it is?

Andrea

Andrea is a female name in the United States

Andrea is a male name in Italy

The solution: use location ◉ Guess where a user is
from based on the country where they most often attend events ◉ Include that country code (e.g. "IT" for Italy) in the query to genderize.io / gender-api.com, e.g. gender-api.com/get?name=Andrea&country=IT
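A minimal sketch of the location logic described above, assuming each user's attended events have already been reduced to a list of country codes. The function names are hypothetical, and a real request to gender-api.com would likely also need an API key, which is left out here.

```typescript
// Pick the country code that appears most often in a user's attended events.
// On a tie, the country that reached the top count first wins;
// an empty list yields undefined.
function mostFrequentCountry(countryCodes: string[]): string | undefined {
  const counts: Record<string, number> = {};
  let best: string | undefined;
  let bestCount = 0;
  for (const code of countryCodes) {
    const n = (counts[code] || 0) + 1;
    counts[code] = n;
    if (n > bestCount) {
      best = code;
      bestCount = n;
    }
  }
  return best;
}

// Build a query URL in the shape shown on the slide:
// gender-api.com/get?name=Andrea&country=IT
// The country parameter is omitted when no location could be guessed.
function genderQueryUrl(firstName: string, country?: string): string {
  let url = "https://gender-api.com/get?name=" + encodeURIComponent(firstName);
  if (country) url += "&country=" + encodeURIComponent(country);
  return url;
}
```

Calling `genderQueryUrl("Andrea", mostFrequentCountry(["IT", "US", "IT"]))` yields the slide's example query for an Andrea who mostly attends events in Italy.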

Based on this location logic... ◉ Derick Rethans is from
the US ◉ Marco Pivetta is from Germany ◉ Larry Garfield is from the US ◉ Arne Blankerts is from the US ◉ Miro Svrtan is from Croatia

Based on the improved genderizer... ◉ Arne Blankerts is male
◉ Sammy Kaye Powers is male ◉ Michele Orselli (Italian) is male ◉ Andrea Giuliano (Italian) is male ◉ Andrea Skeries (American) is female

Rudimentary Data Analysis Just using simple SQL queries

Gender distribution among speakers & commenters

How many talks has each speaker given? (Male vs. Female)

Women speakers receive double the number of comments on their
talks that men do

Comment word cloud (all genders, all ratings)

Comment word clouds, all ratings: male speakers (left) vs. female
speakers (right)

Comment word clouds, low ratings: male speakers (left) vs. female
speakers (right)

Conference location vs. average rating (more red = lower, more
blue = higher)

Average talk ratings ◉ United States: 4.66 ◉ Brazil: 4.64 ◉ Canada: 4.53
◉ Poland: 4.48 ◉ United Kingdom: 4.47 ◉ France: 4.42 ◉ Germany: 4.37 ◉ Croatia: 4.33 ◉ Italy: 4.32 ◉ Netherlands: 4.28 ◉ Belgium: 4.23
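The talk only says these per-country averages came from simple SQL queries, so the following is just an equivalent sketch in TypeScript, not the query that was actually run. The `RatedComment` shape is hypothetical, and averaging over individual comment ratings (rather than per-talk averages) is an assumption.

```typescript
// Hypothetical row shape: one comment, the country of the event it belongs to,
// and its 1-5 rating (null when the comment has no rating).
interface RatedComment {
  country: string;
  rating: number | null;
}

// Group comments by country and average the ratings, skipping unrated comments.
function averageRatingByCountry(comments: RatedComment[]): Record<string, number> {
  const totals: Record<string, { sum: number; count: number }> = {};
  for (const c of comments) {
    if (c.rating === null) continue; // not all comments have a 1-5 rating
    const entry = totals[c.country] || { sum: 0, count: 0 };
    entry.sum += c.rating;
    entry.count += 1;
    totals[c.country] = entry;
  }
  const averages: Record<string, number> = {};
  for (const country of Object.keys(totals)) {
    const t = totals[country];
    if (t) averages[country] = t.sum / t.count;
  }
  return averages;
}
```

A country whose comments are all unrated does not appear in the result at all, which avoids a division by zero.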

☆.｡₀:*ﾟmagic ﾟ*:₀｡.☆ Data Science

What the data scientist told me to say ◉ Logistic
regression model ◉ Outcome variable: binarized score [1 if rating 4-5, 0 if rating 1-3] ◉ Predictor of interest: the interaction between the speakers' gender (% female) and the commenter's gender.

What the data scientist told me to say, cont. ◉
Controls: Number of speakers, commenter's comment index, talk attendance count, year, conference continent.
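The two model inputs that the slides define precisely, the binarized score and the speakers' "% female", could be derived roughly as follows. This is a sketch under stated assumptions: the names are hypothetical, and ignoring speakers whose gender could not be guessed is an assumption rather than something the talk specifies.

```typescript
// Guessed gender for one speaker; "unknown" covers names the genderizer
// could not classify (an assumed category, not from the slides).
type Gender = "female" | "male" | "unknown";

// Outcome variable as stated on the slide: 1 if the rating is 4-5, 0 if it is 1-3.
function binarizeRating(rating: number): 0 | 1 {
  return rating >= 4 ? 1 : 0;
}

// Share of a talk's speakers guessed to be female, as a number from 0 to 1.
// Speakers with unknown gender are ignored; returns null if none are known.
function fractionFemale(speakers: Gender[]): number | null {
  const known = speakers.filter((g) => g !== "unknown");
  if (known.length === 0) return null;
  return known.filter((g) => g === "female").length / known.length;
}
```

Expressing speaker gender as a fraction is what lets a talk with multiple speakers enter the model as a single row.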

The results of the model

The results, interpreted ◉ The odds of a male commenter highly rating a
female speaker are 1/4 his odds of highly rating a male speaker. ◉ Female commenters highly rate female vs. male speakers at a ratio of 3:1.
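Odds are not the same as probabilities, so it may help to see what an odds ratio of 1/4 means in practice. The sketch below converts between the two. The 80% baseline is a made-up number chosen for easy arithmetic and does not come from the study.

```typescript
// Convert a probability (between 0 and 1, exclusive) to odds: p / (1 - p).
function oddsFromProbability(p: number): number {
  return p / (1 - p);
}

// Convert odds back to a probability: odds / (1 + odds).
function probabilityFromOdds(odds: number): number {
  return odds / (1 + odds);
}

// Apply an odds ratio to a baseline probability and return the new probability.
function applyOddsRatio(baselineProbability: number, oddsRatio: number): number {
  return probabilityFromOdds(oddsFromProbability(baselineProbability) * oddsRatio);
}

// Hypothetical baseline: a male commenter rates male speakers highly 80% of the time.
// That is odds of 4 to 1; a quarter of those odds is 1 to 1.
const hypotheticalResult = applyOddsRatio(0.8, 1 / 4);
```

Under that hypothetical baseline, 1/4 the odds would correspond to about a 50% chance of a high rating, not 20%: an odds ratio scales the odds, and the change in probability depends on where the baseline sits.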

Recommendations How do we make it better?

For regular users (commenters) ◉ Comment on men's and women's
talks at an equal rate (don't comment on women's talks more often) ◉ Examine your own biases ◉ Give constructive feedback ◉ Use the feedback sandwich technique!

For joind.in ◉ Add a gender field to the user
profile and start tracking this data internally ◉ Consider UI tweaks, e.g. ◉ Cues for more constructive feedback ◉ Change the rating scale (e.g. 1 to 6 instead of 1 to 5) ◉ Remove ratings entirely…?

Thank you! Keep in touch: @xiehan on Twitter
