[phpday] LET'S GET META! What the data from the joind.in API can teach us about the PHP community


Like many PHP conferences, phpDay has a long tradition of using the platform joind.in to collect attendee feedback on talks, used both by speakers to improve their presentations and by conference organizers to make decisions about future tracks, talks, and speakers. While joind.in isn't exclusive to PHP conferences, it has its roots in that community, and that is where its usage is most prevalent. Since its longtime maintainers recently announced their intention to step down and decommission the web server, making way for a new leadership team to step in and breathe new life into the project, now is a great time to take a meta look at joind.in and what it says about the PHP community.

In addition to the web interface you're likely familiar with, joind.in has a public API that exposes virtually all the data it collects, including every single comment on every single talk at every single event in the system. In August 2018, during one of my company's semiannual hack weeks, I partnered with our data scientist and one of our data analysts to analyze all of the comments and accompanying ratings exposed by the API. Come learn how we did it, what we learned as a result, and what lessons this might hold for us all as PHP community members.


Nara Kasbergen

May 11, 2019

Transcript

  1. LET'S GET META! What the data from the joind.in API can teach us about the PHP community
     Nara Kasbergen (@xiehan)
     May 11, 2019 | #phpday
  2. Disclaimer / Apology
     I'm a little sick today. I'm doing my best. Please bear with me.
  3. What is this?
     ◉ Results from a project to analyze the data from joind.in's open API
     ◉ All joind.in events … not just PHP (but mostly)
     ◉ Not intended to be a very technical session
  4. Who?
     Nara Kasbergen, Software Engineer
     Erica Liao, Data Analyst
     Malorie Hughes, Data Scientist
  5. When? Serendipity Days is our workplace's version of "hack days", which takes place 2-3 times a year. We get 2 full days to work on something we're interested in, unrelated to our day jobs, individually or in groups. Then we give short presentations on what we learned.
  6. Why?
     ◉ Big, juicy data set
     ◉ Inspired by my own experiences as a female conference speaker
     ◉ Question: Is there gender bias in joind.in feedback?
     ◉ Tech is not a meritocracy
  7. What's next?
     ◉ Working on a series of 4 blog posts
     ◉ Tried to have it done before this conference, but we ran out of time
     ◉ Plan to open-source the code along with the posts
     ◉ ...pending legal department approval
  8. Our process, summarized
     1. Scrape the data + put it into a MySQL database
     2. Clean/improve the data
     3. Rudimentary data analysis (simple SQL queries)
     4. Data science ☆.。₀:*゚magic ゚*:₀。.☆
  9. Scraping + Cleaning the Data How I did it

  10. Tech stack
     ◉ node.js v10 + TypeScript + Sequelize ORM
     ◉ 100% local, command-line script
     ◉ data saved to MySQL DB (AWS Aurora)
     ◉ no unit tests
  11. The joind.in API
     api.joind.in/v2.1/events
     api.joind.in/v2.1/events/7089
     api.joind.in/v2.1/events/7089/talks
     api.joind.in/v2.1/talks/26187
     api.joind.in/v2.1/users/31070
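The endpoints above return paginated JSON. Below is a minimal Python sketch of walking the event list; the `start`/`resultsperpage` parameters and the `meta.next_page` link are assumptions based on common joind.in API conventions, and the scraper described in this talk was actually written in TypeScript:

```python
import json
import urllib.request

BASE = "https://api.joind.in/v2.1"

def page_url(resource: str, start: int = 0, per_page: int = 20) -> str:
    """Build a paginated URL; the start/resultsperpage parameter names
    are assumptions based on common joind.in API usage."""
    return f"{BASE}/{resource}?start={start}&resultsperpage={per_page}"

def fetch_page(url: str) -> dict:
    """Fetch one page of JSON (requires network access)."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def fetch_events(max_pages: int = 1) -> list:
    """Walk the event list, following the meta.next_page link if present."""
    events, url = [], page_url("events")
    for _ in range(max_pages):
        data = fetch_page(url)
        events.extend(data.get("events", []))
        url = (data.get("meta") or {}).get("next_page")
        if not url:
            break
    return events
```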

  12. Stats from the end of Day 2
     ◉ Number of conferences/meetups: 1858 (~75%)
     ◉ Number of talks: 14654
     ◉ Number of comments: 55684
     ◉ Number of commenters: 6844
     ◉ Number of speakers: 8691
  13. Stats after we scraped everything
     ◉ Number of conferences/meetups: 2462
     ◉ Number of talks: 21570
     ◉ Number of comments: 84020
     ◉ Number of commenters: 9428
     ◉ Number of speakers: 12463
  14. Challenges
     ◉ joind.in does not have any data about gender
     ◉ Not all speakers have a joind.in account
     ◉ Talks can have multiple speakers
     ◉ Commenters can choose to stay anonymous
     ◉ Not all comments have a 1-5 rating
  15. Guessing gender based on name
     ◉ Python libraries: sexmachine, gender-from-name
     ◉ Web APIs: genderize.io, gender-api.com
  16. There was a problem! Can you guess what it is?

  17. Andrea

  18. Andrea is a female name in the United States

  19. Andrea is a male name in Italy

  20. The solution: use location
     ◉ Guess where a user is from based on the country where they most often attend events
     ◉ Include that country code (e.g. "IT" for Italy) in the query to genderize.io / gender-api.com, e.g. gender-api.com/get?name=Andrea&country=IT
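That country-aware lookup can be sketched in Python. `likely_country` is a hypothetical helper implementing the "most-attended country" heuristic from the slide; note that genderize.io names the parameter `country_id`, while gender-api.com uses `country`:

```python
from collections import Counter
from urllib.parse import urlencode

def likely_country(event_country_codes):
    """Guess a user's country: the one where they most often attend events."""
    counts = Counter(code for code in event_country_codes if code)
    return counts.most_common(1)[0][0] if counts else None

def genderize_url(name, country=None):
    """Build a country-aware genderize.io query URL."""
    params = {"name": name}
    if country:
        params["country_id"] = country  # gender-api.com calls this "country"
    return "https://api.genderize.io/?" + urlencode(params)
```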
  21. Based on this location logic...
     ◉ Derick Rethans is from the US
     ◉ Marco Pivetta is from Germany
     ◉ Larry Garfield is from the US
     ◉ Arne Blankerts is from the US
     ◉ Miro Svrtan is from Croatia
  22. Based on the improved genderizer...
     ◉ Arne Blankerts is male
     ◉ Sammy Kaye Powers is male
     ◉ Michele Orselli (Italian) is male
     ◉ Andrea Giuliano (Italian) is male
     ◉ Andrea Skeries (American) is female
  23. Rudimentary Data Analysis Just using simple SQL queries

  24. Gender distribution among speakers & commenters

  25. How many talks has each speaker given? (Male vs. Female)

  26. Women speakers receive twice as many comments on their talks as men do
  27. Comment word cloud (all genders, all ratings)

  28. Comment word clouds, all ratings: male speakers (left) vs. female speakers (right)
  29. Comment word clouds, low ratings: male speakers (left) vs. female speakers (right)
  30. Conference location vs. average rating (more red = lower, more blue = higher)
  31. Average talk ratings
     ◉ United States: 4.66
     ◉ Brazil: 4.64
     ◉ Canada: 4.53
     ◉ Poland: 4.48
     ◉ United Kingdom: 4.47
     ◉ France: 4.42
     ◉ Germany: 4.37
     ◉ Croatia: 4.33
     ◉ Italy: 4.32
     ◉ Netherlands: 4.28
     ◉ Belgium: 4.23
  32. Data Science ☆.。₀:*゚magic ゚*:₀。.☆

  33. What the data scientist told me to say
     ◉ Logistic regression model
     ◉ Outcome variable: binarized score [1 if rating 4-5, 0 if rating 1-3]
     ◉ Predictors of interest: the interaction between the speakers' gender (% female) and the commenter's gender
  34. What the data scientist told me to say, cont.
     ◉ Controls: number of speakers, commenter's comment index, talk attendance count, year, conference continent
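The binarized outcome variable is simple to state in code. Here is a sketch, followed by a commented-out statsmodels-style formulation of the model described on these two slides; all column names in the formula are invented for illustration:

```python
def binarize(rating):
    """The model's outcome variable: 1 for ratings 4-5, 0 for ratings 1-3."""
    if not 1 <= rating <= 5:
        raise ValueError("joind.in ratings run from 1 to 5")
    return 1 if rating >= 4 else 0

# A hypothetical statsmodels formulation of the logistic regression
# (column names are invented for illustration):
#
#   import statsmodels.formula.api as smf
#   model = smf.logit(
#       "high_rating ~ pct_female_speakers * commenter_is_female"
#       " + n_speakers + comment_index + attendance + year + C(continent)",
#       data=comments_df,
#   ).fit()
```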
  35. The results of the model

  36. The results, interpreted
     ◉ The odds of a male commenter highly rating a female speaker are 1/4 of his odds of highly rating a male speaker.
     ◉ Female commenters highly rate female vs. male speakers at a ratio of 3:1.
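Odds ratios are easy to misread as probability ratios; a quick sketch of the conversion, using an invented baseline purely for illustration:

```python
def prob_from_odds(odds):
    """Convert odds to probability: p = odds / (1 + odds)."""
    return odds / (1 + odds)

# Illustrative only: if a male commenter's odds of highly rating a male
# speaker were 9:1 (p = 0.9), quartering those odds gives 2.25:1, i.e.
# p ≈ 0.69 for a female speaker. The 9:1 baseline is invented; the talk
# reports only the ratio between the two odds.
```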
  37. Recommendations How do we make it better?

  38. For regular users (commenters)
     ◉ Comment on men's and women's talks at an equal rate (don't comment on women's talks more often)
     ◉ Examine your own biases
     ◉ Give constructive feedback
     ◉ Use the feedback sandwich technique!
  39. For joind.in
     ◉ Add a gender field to the user profile and start tracking this data internally
     ◉ Consider UI tweaks, e.g.:
       ◉ Cues for more constructive feedback
       ◉ Change the rating scale (e.g. 1 to 6 instead of 1 to 5)
       ◉ Remove ratings entirely…?
  40. Thank you! Keep in touch: @xiehan on Twitter