Slide 1

Slide 1 text

LET'S GET META! What the data from the joind.in API can teach us about the PHP community Nara Kasbergen (@xiehan) May 11, 2019 | #phpday

Slide 2

Slide 2 text

Disclaimer / Apology: I'm a little sick today. I'm doing my best. Please bear with me.

Slide 3

Slide 3 text

What is this? ◉ Results from a project to analyze the data from joind.in's open API ◉ All joind.in events … not just PHP (but mostly) ◉ Not intended to be a very technical session

Slide 4

Slide 4 text

Who? ◉ Nara Kasbergen, Software Engineer ◉ Erica Liao, Data Analyst ◉ Malorie Hughes, Data Scientist

Slide 5

Slide 5 text

When? Serendipity Days is our workplace's version of "hack days", which takes place 2-3 times a year. We get 2 full days to work on something we're interested in, unrelated to our day jobs, individually or in groups. Then we give short presentations on what we learned.

Slide 6

Slide 6 text

Why? ◉ Big, juicy data set ◉ Inspired by my own experiences as a female conference speaker ◉ Question: Is there gender bias in joind.in feedback? ◉ Tech is not a meritocracy

Slide 7

Slide 7 text

What's next? ◉ Working on a series of 4 blog posts ◉ Tried to have it done before this conference, but we ran out of time ◉ Plan to open-source the code along with the posts ◉ ...pending legal department approval

Slide 8

Slide 8 text

Our process, summarized 1. Scrape the data + put it into a MySQL database 2. Clean/improve the data 3. Rudimentary data analysis (simple SQL queries) 4. Data science ☆.。₀:*゚magic ゚*:₀。.☆

Slide 9

Slide 9 text

Scraping + Cleaning the Data How I did it

Slide 10

Slide 10 text

Tech stack ◉ Node.js v10 + TypeScript + Sequelize ORM ◉ 100% local, command-line script ◉ Data saved to a MySQL DB (AWS Aurora) ◉ No unit tests

Slide 11

Slide 11 text

The joind.in API ◉ api.joind.in/v2.1/events ◉ api.joind.in/v2.1/events/7089 ◉ api.joind.in/v2.1/events/7089/talks ◉ api.joind.in/v2.1/talks/26187 ◉ api.joind.in/v2.1/users/31070
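The endpoints on this slide nest in a predictable way (events, an event's talks, individual talks, users), so a scraper can build them from a few path parts. The sketch below is in Python for brevity, although the talk's own scraper was written in TypeScript. The `https://` scheme, the `fetch_json` callable, and the `meta.next_page` pagination field are assumptions for illustration, not details taken from the slides.

```python
from typing import Callable, Dict, Iterator, Optional

# Host and version come from the slide; the https:// scheme is an assumption.
BASE_URL = "https://api.joind.in/v2.1"


def endpoint(*parts: object) -> str:
    """Build a URL such as https://api.joind.in/v2.1/events/7089/talks."""
    return "/".join([BASE_URL, *(str(p) for p in parts)])


def crawl(start_url: str, key: str,
          fetch_json: Callable[[str], Dict]) -> Iterator[Dict]:
    """Yield every item stored under `key`, following pagination links.

    ASSUMPTION: each response looks like {key: [...], "meta": {"next_page": url}}.
    `fetch_json` is a hypothetical callable that performs the HTTP GET and
    returns the decoded JSON body.
    """
    url: Optional[str] = start_url
    seen = set()
    while url and url not in seen:  # guard against a page linking to itself
        seen.add(url)
        page = fetch_json(url)
        items = page.get(key, [])
        if not items:  # an empty page means there is nothing left to read
            return
        yield from items
        url = page.get("meta", {}).get("next_page")
```

Passing the fetcher in as a parameter keeps the crawl logic testable without network access, for example `crawl(endpoint("events", 7089, "talks"), "talks", my_fetcher)`.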

Slide 12

Slide 12 text

Stats from the end of Day 2 ◉ Number of conferences/meetups: 1858 (~75% of the eventual total) ◉ Number of talks: 14654 ◉ Number of comments: 55684 ◉ Number of commenters: 6844 ◉ Number of speakers: 8691

Slide 13

Slide 13 text

Stats after we scraped everything ◉ Number of conferences/meetups: 2462 ◉ Number of talks: 21570 ◉ Number of comments: 84020 ◉ Number of commenters: 9428 ◉ Number of speakers: 12463

Slide 14

Slide 14 text

Challenges ◉ joind.in does not have any data about gender ◉ Not all speakers have a joind.in account ◉ Talks can have multiple speakers ◉ Commenters can choose to stay anonymous ◉ Not all comments have a 1-5 rating

Slide 15

Slide 15 text

Guessing gender based on name ◉ Python libraries: sexmachine, gender-from-name ◉ genderize.io ◉ gender-api.com

Slide 16

Slide 16 text

There was a problem! Can you guess what it is?

Slide 17

Slide 17 text

Andrea

Slide 18

Slide 18 text

Andrea is a female name in the United States

Slide 19

Slide 19 text

Andrea is a male name in Italy

Slide 20

Slide 20 text

The solution: use location ◉ Guess where a user is from based on the country where they most often attend events ◉ Include that country code (e.g. "IT" for Italy) in the query to genderize.io / gender-api.com, e.g. gender-api.com/get?name=Andrea&country=IT
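A minimal sketch of this location heuristic: pick the country code a user appears in most often, then add it to the lookup URL in the form shown on the slide. The function names are hypothetical, and a real request to gender-api.com would presumably also need credentials, which are left out here.

```python
from collections import Counter
from typing import Iterable, Optional
from urllib.parse import urlencode


def guess_country(event_countries: Iterable[str]) -> Optional[str]:
    """Return the country code seen most often, or None if there is no data.

    On a tie, the code that appeared first wins (Counter keeps insertion order).
    """
    counts = Counter(code for code in event_countries if code)
    return counts.most_common(1)[0][0] if counts else None


def gender_query_url(first_name: str, country: Optional[str] = None) -> str:
    """Build a lookup URL shaped like the one on the slide.

    ASSUMPTION: only `name` and `country` are sent; authentication is omitted.
    """
    params = {"name": first_name}
    if country:
        params["country"] = country
    return "https://gender-api.com/get?" + urlencode(params)


# Hypothetical attendance history: two events in Italy, one in Germany.
country = guess_country(["IT", "DE", "IT"])
url = gender_query_url("Andrea", country)
```

The heuristic measures where someone attends events, not where they are from, so frequent travellers can end up with a surprising country.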

Slide 21

Slide 21 text

Based on this location logic... ◉ Derick Rethans is from the US ◉ Marco Pivetta is from Germany ◉ Larry Garfield is from the US ◉ Arne Blankerts is from the US ◉ Miro Svrtan is from Croatia

Slide 22

Slide 22 text

Based on the improved genderizer... ◉ Arne Blankerts is male ◉ Sammy Kaye Powers is male ◉ Michele Orselli (Italian) is male ◉ Andrea Giuliano (Italian) is male ◉ Andrea Skeries (American) is female

Slide 23

Slide 23 text

Rudimentary Data Analysis Just using simple SQL queries
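To give a sense of what a "simple SQL query" at this stage might look like, here is a sketch that runs against an in-memory SQLite database rather than the MySQL database used in the project. The table layout, column names, and every row below are invented for illustration. In particular, storing a single `speaker_gender` per talk is a simplification, since talks can have multiple speakers.

```python
import sqlite3

# Hypothetical schema and toy rows; none of this is the project's real data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE talks (id INTEGER PRIMARY KEY, speaker_gender TEXT);
CREATE TABLE comments (id INTEGER PRIMARY KEY, talk_id INTEGER, rating INTEGER);
INSERT INTO talks VALUES (1, 'male'), (2, 'female'), (3, 'male');
INSERT INTO comments (talk_id, rating)
VALUES (1, 5), (1, 4), (2, 5), (2, NULL), (3, 4), (3, 3);
""")

# Comments per talk and average rating, grouped by speaker gender.
# AVG() skips NULL ratings, which covers comments left without a 1-5 score.
QUERY = """
SELECT t.speaker_gender,
       COUNT(c.id) * 1.0 / COUNT(DISTINCT t.id) AS comments_per_talk,
       AVG(c.rating) AS avg_rating
FROM talks t
LEFT JOIN comments c ON c.talk_id = t.id
GROUP BY t.speaker_gender
ORDER BY t.speaker_gender
"""

rows = conn.execute(QUERY).fetchall()
for gender, comments_per_talk, avg_rating in rows:
    print(gender, comments_per_talk, avg_rating)
```

The `LEFT JOIN` keeps talks that received no comments, so they still count in the per-talk average.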

Slide 24

Slide 24 text

Gender distribution among speakers & commenters

Slide 25

Slide 25 text

How many talks has each speaker given? (Male vs. Female)

Slide 26

Slide 26 text

Women speakers receive double the number of comments on their talks that men do

Slide 27

Slide 27 text

Comment word cloud (all genders, all ratings)

Slide 28

Slide 28 text

Comment word clouds, all ratings: male speakers (left) vs. female speakers (right)

Slide 29

Slide 29 text

Comment word clouds, low ratings: male speakers (left) vs. female speakers (right)

Slide 30

Slide 30 text

Conference location vs. average rating (more red = lower, more blue = higher)

Slide 31

Slide 31 text

Average talk ratings ◉ United States: 4.66 ◉ Brazil: 4.64 ◉ Canada: 4.53 ◉ Poland: 4.48 ◉ United Kingdom: 4.47 ◉ France: 4.42 ◉ Germany: 4.37 ◉ Croatia: 4.33 ◉ Italy: 4.32 ◉ Netherlands: 4.28 ◉ Belgium: 4.23

Slide 32

Slide 32 text

☆.。₀:*゚magic ゚*:₀。.☆ Data Science

Slide 33

Slide 33 text

What the data scientist told me to say ◉ Logistic regression model ◉ Outcome variable: binarized score [1 if rating 4-5, 0 if rating 1-3] ◉ Predictor of interest: the interaction between the speakers' gender (% female) and the commenter's gender.

Slide 34

Slide 34 text

What the data scientist told me to say, cont. ◉ Controls: Number of speakers, commenter's comment index, talk attendance count, year, conference continent.
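The model setup described on these two slides can be sketched as a data-preparation step: binarize the rating, express the speakers' gender as a share, and multiply it by the commenter's gender to get the interaction term. The encoding below (a 0/1 `commenter_male` flag, the dictionary keys, the function names) is an assumption for illustration. The control variables are omitted, and rows with unknown gender are assumed to have been filtered out beforehand.

```python
from typing import Dict, Optional, Sequence


def binarize(rating: Optional[int]) -> Optional[int]:
    """Map a 1-5 rating to 1 (rating 4-5) or 0 (rating 1-3); keep None as None."""
    if rating is None:  # not every comment carries a rating
        return None
    if not 1 <= rating <= 5:
        raise ValueError(f"rating out of range: {rating}")
    return 1 if rating >= 4 else 0


def pct_female(speaker_genders: Sequence[str]) -> float:
    """Share of a talk's speakers guessed to be female, between 0.0 and 1.0."""
    if not speaker_genders:
        raise ValueError("a talk needs at least one speaker")
    return sum(g == "female" for g in speaker_genders) / len(speaker_genders)


def model_row(rating: Optional[int], speaker_genders: Sequence[str],
              commenter_gender: str) -> Dict[str, Optional[float]]:
    """Build one observation for a logistic regression (hypothetical encoding)."""
    female_share = pct_female(speaker_genders)
    commenter_male = 1 if commenter_gender == "male" else 0
    return {
        "high_rating": binarize(rating),
        "pct_female": female_share,
        "commenter_male": commenter_male,
        # The interaction term is the product of the two predictors.
        "interaction": female_share * commenter_male,
    }


# A made-up comment: a male commenter rates a two-speaker talk 5 out of 5.
example = model_row(5, ["female", "male"], "male")
```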

Slide 35

Slide 35 text

The results of the model

Slide 36

Slide 36 text

The results, interpreted ◉ The odds of a male commenter highly rating a female speaker are 1/4 of his odds of highly rating a male speaker. ◉ Female commenters highly rate female vs. male speakers at a ratio of 3:1.
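An odds ratio is not the same thing as a ratio of probabilities, which makes results like these easy to misread. The sketch below converts between the two. The 80% baseline is a made-up number chosen for clean arithmetic, not a figure from the analysis.

```python
def odds(p: float) -> float:
    """Convert a probability (0 < p < 1) to odds."""
    return p / (1.0 - p)


def prob(o: float) -> float:
    """Convert odds back to a probability."""
    return o / (1.0 + o)


def apply_odds_ratio(baseline_p: float, odds_ratio: float) -> float:
    """Probability obtained by scaling the baseline odds by `odds_ratio`."""
    return prob(odds(baseline_p) * odds_ratio)


# HYPOTHETICAL baseline: suppose a male commenter gives a high rating to a
# male speaker 80% of the time (odds of 4 to 1). An odds ratio of 1/4 turns
# that into odds of 1 to 1, which is a probability of about 50%, not 20%.
p_female_speaker = apply_odds_ratio(0.80, 0.25)
print(round(p_female_speaker, 3))
```

How large the gap looks in percentage terms depends on the baseline, which the slides do not state.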

Slide 37

Slide 37 text

Recommendations How do we make it better?

Slide 38

Slide 38 text

For regular users (commenters) ◉ Comment on men's and women's talks at an equal rate (don't comment on women's talks more often) ◉ Examine your own biases ◉ Give constructive feedback ◉ Use the feedback sandwich technique!

Slide 39

Slide 39 text

For joind.in ◉ Add a gender field to the user profile and start tracking this data internally ◉ Consider UI tweaks, e.g. ◉ Cues for more constructive feedback ◉ Change the rating scale (e.g. 1 to 6 instead of 1 to 5) ◉ Remove ratings entirely…?

Slide 40

Slide 40 text

Thank you! Keep in touch: @xiehan on Twitter