Like many PHP conferences, phpDay has a long tradition of using the platform joind.in to collect attendee feedback on talks. Speakers use that feedback to improve their presentations, and conference organizers use it to make decisions about future tracks, talks, and speakers. While joind.in isn’t exclusive to PHP conferences, it has its roots in that community, and that is where its usage remains most prevalent. With its longtime maintainers having recently announced their intention to step down and decommission the web server, making way for a new leadership team to step in and breathe new life into the project, now is a great time to take a meta look at joind.in and what it says about the PHP community. In addition to the web interface you’re likely familiar with, joind.in has a public API that exposes virtually all the data it collects, including every comment on every talk at every event in the system. In August 2018, during one of my company’s semiannual hack weeks, I partnered with our data scientist and one of our data analysts to analyze all of the comments and accompanying ratings exposed by the API. Come learn how we did it, what we learned as a result, and what lessons this might hold for us all as members of the PHP community.
LET'S GET META!
What the data from the joind.in API can
teach us about the PHP community
Nara Kasbergen (@xiehan)
May 11, 2019 | #phpday
Disclaimer / Apology
I'm a little sick today
I'm doing my best
Please bear with me
What is this?
◉ Results from a project to analyze the data from
joind.in's open API
◉ All joind.in events … not just PHP (but mostly)
◉ Not intended to be a very technical session
Serendipity Days is our workplace's version of "hack
days", which takes place 2-3 times a year. We get 2 full
days to work on something we're interested in,
unrelated to our day jobs, individually or in groups.
Then we give short presentations on what we learned.
◉ Big, juicy data set
◉ Inspired by my own experiences as a female speaker
◉ Question: Is there gender bias in joind.in feedback?
◉ Tech is not a meritocracy
◉ Working on a series of 4 blog posts
◉ Tried to have it done before this conference, but we ran
out of time
◉ Plan to open-source the code along with the posts
◉ ...pending legal department approval
Our process, summarized
1. Scrape the data + put it into a MySQL database
2. Clean/improve the data
3. Rudimentary data analysis (simple SQL queries)
4. Data science ☆.｡₀:*ﾟmagic ﾟ*:₀｡.☆
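Step 1 of the process above can be sketched roughly like this. The actual scraper was written in node.js + TypeScript with Sequelize; this is an illustrative Python sketch, and the joind.in API field names (`events`, `meta`, `next_page`) are recalled from the public v2.1 API, so verify them against the live docs.

```python
# Sketch of step 1: page through the joind.in API and collect all events.
import json
from urllib.request import urlopen

API_ROOT = "https://api.joind.in/v2.1"

def fetch_all_events(fetch=lambda url: json.load(urlopen(url))):
    """Follow meta.next_page links until the API stops returning events."""
    url = f"{API_ROOT}/events?resultsperpage=100"
    events = []
    while url:
        page = fetch(url)
        batch = page.get("events", [])
        if not batch:
            break
        events.extend(batch)
        # The next page URL lives in the response metadata (if any).
        url = page.get("meta", {}).get("next_page")
    return events
```

In the real project, each batch would then be written to the MySQL database rather than held in memory.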
Scraping + Cleaning the Data
How I did it
◉ node.js v10 + TypeScript + Sequelize ORM
◉ 100% local, command-line script
◉ data saved to MySQL DB (AWS Aurora)
◉ no unit tests
The joind.in API
Stats from the end of Day 2 (of our 2-day hack week)
◉ Number of conferences/meetups: 1858 (~75% of the final total)
◉ Number of talks: 14654
◉ Number of comments: 55684
◉ Number of commenters: 6844
◉ Number of speakers: 8691
Stats after we scraped everything
◉ Number of conferences/meetups: 2462
◉ Number of talks: 21570
◉ Number of comments: 84020
◉ Number of commenters: 9428
◉ Number of speakers: 12463
◉ joind.in does not have any data about gender
◉ Not all speakers have a joind.in account
◉ Talks can have multiple speakers
◉ Commenters can choose to stay anonymous
◉ Not all comments have a 1-5 rating
Guessing gender based on name
◉ Python libraries:
There was a problem!
Can you guess what it is?
"Andrea" is a female name in the United States
"Andrea" is a male name in Italy
The solution: use location
◉ Guess where a user is from based on the country
where they most often attend events
◉ Include that country code (e.g. "IT" for Italy) in the
query to genderize.io / gender-api.com
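A sketch of that lookup in Python. The `country_id` parameter is part of the public genderize.io API (gender-api.com has an equivalent); the response fields shown are the ones genderize.io documents.

```python
# Query genderize.io with a country hint so the same first name can
# resolve differently per country (e.g. "Andrea" in IT vs. US).
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def genderize_url(first_name, country_code=None):
    params = {"name": first_name}
    if country_code:
        params["country_id"] = country_code  # e.g. "IT" for Italy
    return "https://api.genderize.io/?" + urlencode(params)

def guess_gender(first_name, country_code=None):
    """Returns ('male'/'female'/None, probability) from the API."""
    with urlopen(genderize_url(first_name, country_code)) as resp:
        data = json.load(resp)
    return data.get("gender"), data.get("probability")
```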
Based on this location logic...
◉ Derick Rethans is from the US
◉ Marco Pivetta is from Germany
◉ Larry Garfield is from the US
◉ Arne Blankerts is from the US
◉ Miro Svrtan is from Croatia
Based on the improved genderizer...
◉ Arne Blankerts is male
◉ Sammy Kaye Powers is male
◉ Michele Orselli (Italian) is male
◉ Andrea Giuliano (Italian) is male
◉ Andrea Skeries (American) is female
Rudimentary Data Analysis
Just using simple SQL queries
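The queries were of roughly this shape. A runnable toy sketch: the real data lived in MySQL/Aurora and the table and column names here are invented for illustration, with sqlite3 standing in so it runs anywhere.

```python
# Toy version of the "rudimentary analysis": average rating and comment
# count per speaker gender, via a plain SQL aggregate query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE speakers (id INTEGER PRIMARY KEY, gender TEXT);
    CREATE TABLE comments (speaker_id INTEGER, rating INTEGER);
    INSERT INTO speakers VALUES (1, 'male'), (2, 'female');
    INSERT INTO comments VALUES (1, 5), (1, 4), (2, 5), (2, 3), (2, 4);
""")

rows = conn.execute("""
    SELECT s.gender,
           COUNT(*)                AS n_comments,
           ROUND(AVG(c.rating), 2) AS avg_rating
    FROM comments c
    JOIN speakers s ON s.id = c.speaker_id
    GROUP BY s.gender
    ORDER BY s.gender
""").fetchall()
# rows -> [('female', 3, 4.0), ('male', 2, 4.5)]
```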
Gender distribution among speakers & commenters
How many talks has each speaker given? (Male vs. Female)
Women speakers receive twice as many comments on their talks as men do
Comment word cloud (all genders, all ratings)
Comment word clouds, all ratings: male speakers (left) vs. female speakers (right)
Comment word clouds, low ratings: male speakers (left) vs. female speakers (right)
Average talk ratings
Conference location vs. average rating (more red = lower, more blue = higher)
◉ United States: 4.66
◉ Brazil: 4.64
◉ Canada: 4.53
◉ Poland: 4.48
◉ United Kingdom: 4.47
◉ France: 4.42
◉ Germany: 4.37
◉ Croatia: 4.33
◉ Italy: 4.32
◉ Netherlands: 4.28
◉ Belgium: 4.23
What the data scientist told me to say
◉ Logistic regression model
◉ Outcome variable: binarized score [1 if rating 4-5,
0 if rating 1-3]
◉ Predictors of interest: the interaction between the
speakers' gender (% female) and the commenter's gender
What the data scientist told me to say, cont.
◉ Controls: number of speakers, commenter's
comment index, talk attendance count, year
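The model setup can be sketched as follows. This is not the data scientist's actual code: the function names are invented, and a library such as statsmodels would then fit something like `outcome ~ pct_female * commenter_is_female + controls`.

```python
# Sketch of the logistic regression inputs: a binarized outcome and an
# explicit interaction term between speaker and commenter gender.

def binarize(rating):
    """Map a 1-5 rating onto the binary outcome [1 if 4-5, 0 if 1-3]."""
    return 1 if rating >= 4 else 0

def make_row(rating, pct_female_speakers, commenter_is_female):
    return {
        "outcome": binarize(rating),
        "pct_female": pct_female_speakers,      # handles multi-speaker talks
        "commenter_is_female": commenter_is_female,
        # The interaction term lets the effect of speaker gender differ
        # by commenter gender -- the quantity the model actually tests.
        "interaction": pct_female_speakers * commenter_is_female,
    }
```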
The results of the model
◉ The odds of a male commenter highly rating a
female speaker are 1/4 his odds of highly rating a
male speaker
◉ Female commenters highly rate female vs. male
speakers at a ratio of 3:1.
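A quick worked example of what an odds ratio of 1/4 means in probability terms. Only the 1/4 ratio comes from the model; the 90% starting probability is purely illustrative.

```python
# Convert between probability and odds to unpack the 1/4 odds ratio.
def odds(p):
    return p / (1 - p)

def prob(o):
    return o / (1 + o)

base = odds(0.90)           # odds for a male speaker (9 to 1)
female_speaker = base / 4   # apply the 1/4 odds ratio (2.25 to 1)
prob(female_speaker)        # ~0.69: still likely, but far less so
```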
The results, interpreted
How do we make it better?
For regular users (commenters)
◉ Comment on men's and women's talks at an equal
rate (don't comment on women's talks more often)
◉ Examine your own biases
◉ Give constructive feedback
◉ Use the feedback sandwich technique!
For the joind.in maintainers
◉ Add a gender field to the user profile and start
tracking this data internally
◉ Consider UI tweaks, e.g.
◉ Cues for more constructive feedback
◉ Change the rating scale (e.g. 1 to 6 instead of 1 to 5)
◉ Remove ratings entirely…?
Keep in touch: @xiehan on Twitter