[phpday] LET'S GET META! What the data from the joind.in API can teach us about the PHP community

Like many PHP conferences, phpDay has a long tradition of using the platform joind.in to collect attendee feedback on talks, used both by speakers to improve their presentations and by the conference organizers to make decisions about tracks, talks, and speakers in the future. While joind.in isn’t exclusive to PHP conferences, it has its roots in that community, and so that is where its usage is most prevalent. Since its longtime maintainers recently announced their intention to step down and decommission the web server, making way for a new leadership team to step in and breathe new life into the project, now is a great time to take a meta look at joind.in and what it says about the PHP community. In addition to the web interface that you’re likely familiar with, joind.in has a public API that exposes virtually all the data it collects, including every single comment on every single talk at every single event in the system. In August 2018, during one of my company’s semiannual hack weeks, I partnered with our data scientist and one of our data analysts to analyze all of the comments and accompanying ratings exposed by the API. Come learn how we did it, what we learned as a result, and, more importantly, what lessons this might hold for us all as PHP community members.

Nara Kasbergen

May 11, 2019

Transcript

  1. 1.

    LET'S GET META! What the data from the joind.in API can teach us about the PHP community. Nara Kasbergen (@xiehan). May 11, 2019 | #phpday
  2. 2.

    Disclaimer / Apology: I'm a little sick today. I'm doing my best. Please bear with me!
  3. 3.

    What is this? ◉ Results from a project to analyze the data from joind.in's open API ◉ All joind.in events … not just PHP (but mostly) ◉ Not intended to be a very technical session
  4. 5.

    When? Serendipity Days is our workplace's version of "hack days", which takes place 2-3 times a year. We get 2 full days to work on something we're interested in, unrelated to our day jobs, individually or in groups. Then we give short presentations on what we learned.
  5. 6.

    Why? ◉ Big, juicy data set ◉ Inspired by my own experiences as a female conference speaker ◉ Question: Is there gender bias in joind.in feedback? ◉ Tech is not a meritocracy
  6. 7.

    What's next? ◉ Working on a series of 4 blog posts ◉ Tried to have it done before this conference, but we ran out of time ◉ Plan to open-source the code along with the posts ◉ ...pending legal department approval
  7. 8.

    Our process, summarized: 1. Scrape the data + put it into a MySQL database 2. Clean/improve the data 3. Rudimentary data analysis (simple SQL queries) 4. Data science ☆.。₀:*゚magic ゚*:₀。.☆
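Step 1 above (scraping) boils down to walking a paginated REST API. A minimal sketch in TypeScript, assuming joind.in's v2.1 API paginates with `start` and `resultsperpage` query parameters (verify against the API docs before relying on the exact parameter names):

```typescript
// Base URL of the public joind.in API (v2.1 at the time of the talk).
const API_BASE = "https://api.joind.in/v2.1";

// Build the URL for one page of the events listing.
function eventsPageUrl(start: number, perPage: number = 100): string {
  return `${API_BASE}/events?start=${start}&resultsperpage=${perPage}`;
}

// Walk all pages: fetch one page, yield its events, and advance
// `start` until the API returns fewer results than requested.
async function* allEvents(perPage: number = 100): AsyncGenerator<unknown> {
  for (let start = 0; ; start += perPage) {
    const res = await fetch(eventsPageUrl(start, perPage));
    const body = (await res.json()) as { events: unknown[] };
    yield* body.events;
    if (body.events.length < perPage) break; // short page = last page
  }
}
```

The real project then wrote each scraped record to MySQL via Sequelize; that persistence layer is omitted here to keep the sketch self-contained.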
  8. 10.

    Tech stack ◉ node.js v10 + TypeScript + Sequelize ORM ◉ 100% local, command-line script ◉ data saved to MySQL DB (AWS Aurora) ◉ no unit tests
  9. 12.

    Stats from the end of Day 2 ◉ Number of conferences/meetups: 1858 (~75%) ◉ Number of talks: 14654 ◉ Number of comments: 55684 ◉ Number of commenters: 6844 ◉ Number of speakers: 8691
  10. 13.

    Stats after we scraped everything ◉ Number of conferences/meetups: 2462 ◉ Number of talks: 21570 ◉ Number of comments: 84020 ◉ Number of commenters: 9428 ◉ Number of speakers: 12463
  11. 14.

    Challenges ◉ joind.in does not have any data about gender ◉ Not all speakers have a joind.in account ◉ Talks can have multiple speakers ◉ Commenters can choose to stay anonymous ◉ Not all comments have a 1-5 rating
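The last two challenges affect any aggregate you compute. A minimal sketch of handling them when averaging a talk's ratings; the `Comment` shape and the convention that unrated comments carry a rating of 0 are assumptions for illustration, not the actual joind.in API schema:

```typescript
// Illustrative comment shape (not the real API response format).
interface Comment {
  user: string | null; // null = commenter chose to stay anonymous
  rating: number;      // assumed 0 when no 1-5 rating was given
}

// Average only the comments that carry a real 1-5 rating.
function averageRating(comments: Comment[]): number | null {
  const rated = comments.filter((c) => c.rating >= 1 && c.rating <= 5);
  if (rated.length === 0) return null; // talk had no rated comments
  return rated.reduce((sum, c) => sum + c.rating, 0) / rated.length;
}

// Count how many comments on a talk were left anonymously.
function anonymousCount(comments: Comment[]): number {
  return comments.filter((c) => c.user === null).length;
}
```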
  12. 15.

    Guessing gender based on name ◉ Python libraries: ◉ sexmachine ◉ gender-from-name ◉ genderize.io ◉ gender-api.com
  13. 17.
  14. 20.

    The solution: use location ◉ Guess where a user is from based on the country where they most often attend events ◉ Include that country code (e.g. "IT" for Italy) in the query to genderize.io / gender-api.com, e.g. gender-api.com/get?name=Andrea&country=IT
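The heuristic above can be sketched in two small functions: pick the country where a user most often attends events, then pass it as a hint to the gender API. The URL shape follows the gender-api.com example on the slide; treat the exact parameter names as an assumption to verify against the API's docs:

```typescript
// Return the country code appearing most often in a user's event
// history, or null if the history is empty.
function mostFrequentCountry(eventCountries: string[]): string | null {
  const counts = new Map<string, number>();
  for (const c of eventCountries) {
    counts.set(c, (counts.get(c) ?? 0) + 1);
  }
  let best: string | null = null;
  let bestCount = 0;
  for (const [country, n] of counts) {
    if (n > bestCount) {
      best = country;
      bestCount = n;
    }
  }
  return best;
}

// Build the lookup URL, e.g. gender-api.com/get?name=Andrea&country=IT
function genderQueryUrl(name: string, country: string | null): string {
  const base = `https://gender-api.com/get?name=${encodeURIComponent(name)}`;
  return country ? `${base}&country=${country}` : base;
}
```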
  15. 21.

    Based on this location logic... ◉ Derick Rethans is from the US ◉ Marco Pivetta is from Germany ◉ Larry Garfield is from the US ◉ Arne Blankerts is from the US ◉ Miro Svrtan is from Croatia
  16. 22.

    Based on the improved genderizer... ◉ Arne Blankerts is male ◉ Sammy Kaye Powers is male ◉ Michele Orselli (Italian) is male ◉ Andrea Giuliano (Italian) is male ◉ Andrea Skeries (American) is female
  17. 31.

    Average talk ratings ◉ United States: 4.66 ◉ Brazil: 4.64 ◉ Canada: 4.53 ◉ Poland: 4.48 ◉ United Kingdom: 4.47 ◉ France: 4.42 ◉ Germany: 4.37 ◉ Croatia: 4.33 ◉ Italy: 4.32 ◉ Netherlands: 4.28 ◉ Belgium: 4.23
  18. 33.

    What the data scientist told me to say ◉ Logistic regression model ◉ Outcome variable: binarized score [1 if rating 4-5, 0 if rating 1-3] ◉ Predictors of interest: the interaction between the speakers' gender (% female) and the commenter's gender.
  19. 34.

    What the data scientist told me to say, cont. ◉ Controls: number of speakers, commenter's comment index, talk attendance count, year, conference continent.
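Two of the model's inputs described above are easy to make concrete: the binarized outcome (ratings 4-5 count as "high", 1-3 as "low") and the "% female" speaker variable for multi-speaker talks. A sketch, where the convention of returning 0 when no speaker's gender is known is my assumption, not necessarily what the analysis did:

```typescript
// Binarize a 1-5 rating into the model's outcome variable:
// 1 for a high rating (4-5), 0 for a low rating (1-3).
function binarizeRating(rating: number): 0 | 1 {
  if (rating < 1 || rating > 5) {
    throw new RangeError(`rating out of range: ${rating}`);
  }
  return rating >= 4 ? 1 : 0;
}

// Fraction of a talk's speakers inferred to be female, ignoring
// speakers whose gender could not be guessed.
function femaleShare(genders: ("female" | "male" | "unknown")[]): number {
  const known = genders.filter((g) => g !== "unknown");
  if (known.length === 0) return 0; // convention assumed here
  return known.filter((g) => g === "female").length / known.length;
}
```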
  20. 36.

    The results, interpreted ◉ The odds of a male commenter highly rating a female speaker are 1/4 his odds of highly rating a male speaker. ◉ Female commenters highly rate female vs. male speakers at a ratio of 3:1.
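To make the "1/4 the odds" phrasing concrete: an odds ratio compares (high : low) rating odds between two groups. A sketch with hypothetical counts chosen only to illustrate the arithmetic; these are not the talk's actual data, and the real figure came from the regression model, not a raw 2x2 table:

```typescript
// Odds ratio from a 2x2 table of (high, low) rating counts:
// (highA/lowA) is group A's odds of a high rating, likewise for B.
function oddsRatio(
  highA: number, lowA: number, // e.g. ratings given to female speakers
  highB: number, lowB: number, // e.g. ratings given to male speakers
): number {
  return (highA / lowA) / (highB / lowB);
}

// Hypothetical example: high 60 / low 40 for female speakers vs.
// high 150 / low 25 for male speakers gives
// (60/40) / (150/25) = 1.5 / 6 = 0.25, i.e. 1/4 the odds.
```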
  21. 38.

    For regular users (commenters) ◉ Comment on men's and women's talks at an equal rate (don't comment on women's talks more often) ◉ Examine your own biases ◉ Give constructive feedback ◉ Use the feedback sandwich technique!
  22. 39.

    For joind.in ◉ Add a gender field to the user profile and start tracking this data internally ◉ Consider UI tweaks, e.g. ◉ Cues for more constructive feedback ◉ Change the rating scale (e.g. 1 to 6 instead of 1 to 5) ◉ Remove ratings entirely…?