[phpday] LET'S GET META! What the data from the joind.in API can teach us about the PHP community

Like many PHP conferences, phpDay has a long tradition of using the joind.in platform to collect attendee feedback on talks. Speakers use that feedback to improve their presentations, and conference organizers use it to make decisions about future tracks, talks, and speakers. While joind.in isn’t exclusive to PHP conferences, it has its roots in that community, and that is where its usage is most prevalent. Its longtime maintainers recently announced their intention to step down and decommission the web server, making way for a new leadership team to step in and breathe new life into the project, so now is a great time to take a meta look at joind.in and what it says about the PHP community. In addition to the web interface that you’re likely familiar with, joind.in has a public API that exposes virtually all the data it collects, including every single comment on every single talk at every single event in the system. In August 2018, during one of my company’s semiannual hack weeks, I partnered with our data scientist and one of our data analysts to analyze all of the comments and accompanying ratings exposed by the API. Come learn how we did it and, more importantly, what we learned as a result, and what lessons this might hold for us all as PHP community members.

Nara Kasbergen

May 11, 2019

Transcript

  1. LET'S GET META!
    What the data from the joind.in API can
    teach us about the PHP community
    Nara Kasbergen (@xiehan)
    May 11, 2019 | #phpday

  2. Disclaimer / Apology
    I'm a little sick today
    I'm doing my best
    Please bear with me

  3. What is this?
    ◉ Results from a project to analyze the data from
    joind.in's open API
    ◉ All joind.in events … not just PHP (but mostly)
    ◉ Not intended to be a very technical session

  4. Who?
    Nara Kasbergen
    Software Engineer
    Erica Liao
    Data Analyst
    Malorie Hughes
    Data Scientist

  5. When?
    Serendipity Days is our workplace's version of "hack
    days", which takes place 2-3 times a year. We get 2 full
    days to work on something we're interested in,
    unrelated to our day jobs, individually or in groups.
    Then we give short presentations on what we learned.

  6. Why?
    ◉ Big, juicy data set
    ◉ Inspired by my own experiences as a female
    conference speaker
    ◉ Question: Is there gender bias in joind.in feedback?
    ◉ Tech is not a meritocracy

  7. What's next?
    ◉ Working on a series of 4 blog posts
    ◉ Tried to have it done before this conference, but we ran
    out of time
    ◉ Plan to open-source the code along with the posts
    ◉ ...pending legal department approval

  8. Our process, summarized
    1. Scrape the data + put it into a MySQL database
    2. Clean/improve the data
    3. Rudimentary data analysis (simple SQL queries)
    4. Data science ☆.。₀:*゚magic ゚*:₀。.☆

  9. Scraping + Cleaning the Data
    How I did it

  10. Tech stack
    ◉ node.js v10 + TypeScript + Sequelize ORM
    ◉ 100% local, command-line script
    ◉ data saved to MySQL DB (AWS Aurora)
    ◉ no unit tests

  11. The joind.in API
    api.joind.in/v2.1/events
    api.joind.in/v2.1/events/7089
    api.joind.in/v2.1/events/7089/talks
    api.joind.in/v2.1/talks/26187
    api.joind.in/v2.1/users/31070
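
The endpoints listed above follow a regular pattern, which can be sketched as a pair of small URL builders. This is only an illustration, not the scraper's actual code: the `https://` scheme and the names `API_BASE`, `resourceUrl`, and `eventTalksUrl` are assumptions made for the example, since the slide shows the paths without a scheme.

```typescript
// Assumption: the slide lists paths without a scheme; https is assumed here.
const API_BASE = "https://api.joind.in/v2.1";

// The three resource types that appear on the slide.
type Resource = "events" | "talks" | "users";

// Build the URL for a collection (no id) or for a single record (with id).
function resourceUrl(resource: Resource, id?: number): string {
  return id === undefined
    ? `${API_BASE}/${resource}`
    : `${API_BASE}/${resource}/${id}`;
}

// Talks are listed per event, as in /events/7089/talks on the slide.
function eventTalksUrl(eventId: number): string {
  return `${resourceUrl("events", eventId)}/talks`;
}
```

Calling `eventTalksUrl(7089)` reproduces the third URL on the slide, with the assumed scheme in front.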

  12. Stats from the end of Day 2
    ◉ Number of conferences/meetups: 1858 (~75%)
    ◉ Number of talks: 14654
    ◉ Number of comments: 55684
    ◉ Number of commenters: 6844
    ◉ Number of speakers: 8691

  13. Stats after we scraped everything
    ◉ Number of conferences/meetups: 2462
    ◉ Number of talks: 21570
    ◉ Number of comments: 84020
    ◉ Number of commenters: 9428
    ◉ Number of speakers: 12463

  14. Challenges
    ◉ joind.in does not have any data about gender
    ◉ Not all speakers have a joind.in account
    ◉ Talks can have multiple speakers
    ◉ Commenters can choose to stay anonymous
    ◉ Not all comments have a 1-5 rating

  15. Guessing gender based on name
    ◉ Python libraries:
      ◉ sexmachine
      ◉ gender-from-name
    ◉ genderize.io
    ◉ gender-api.com

  16. There was a problem!
    Can you guess what it is?

  17. Andrea

  18. Andrea
    is a female name in the United States

  19. Andrea
    is a male name in Italy

  20. The solution: use location
    ◉ Guess where a user is from based on the country
    where they most often attend events
    ◉ Include that country code (e.g. "IT" for Italy) in the
    query to genderize.io / gender-api.com, e.g.
    gender-api.com/get?name=Andrea&country=IT
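
The two steps above could look roughly like the sketch below: pick the country where a user most often attends events, then include it in the query. The function names are hypothetical, the tie-breaking rule is an assumption, and the URL only mirrors the example on the slide; any authentication parameters the real service may require are left out.

```typescript
// Return the country code that occurs most often, or undefined for an empty list.
// Assumption: on a tie, the code that reached the top count first wins.
function mostFrequentCountry(countryCodes: string[]): string | undefined {
  const counts: { [code: string]: number } = Object.create(null);
  let best: string | undefined;
  let bestCount = 0;
  for (const code of countryCodes) {
    const n = (counts[code] || 0) + 1;
    counts[code] = n;
    if (n > bestCount) {
      best = code;
      bestCount = n;
    }
  }
  return best;
}

// Build a query URL shaped like the slide's example.
// The country parameter is only added when a country could be guessed.
function genderQueryUrl(name: string, countryCode?: string): string {
  let url = `https://gender-api.com/get?name=${encodeURIComponent(name)}`;
  if (countryCode) {
    url += `&country=${encodeURIComponent(countryCode)}`;
  }
  return url;
}
```

For a user whose attended events were in `["IT", "US", "IT"]`, `genderQueryUrl("Andrea", mostFrequentCountry(["IT", "US", "IT"]))` yields the Italian query from the slide.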

  21. Based on this location logic...
    ◉ Derick Rethans is from the US
    ◉ Marco Pivetta is from Germany
    ◉ Larry Garfield is from the US
    ◉ Arne Blankerts is from the US
    ◉ Miro Svrtan is from Croatia

  22. Based on the improved genderizer...
    ◉ Arne Blankerts is male
    ◉ Sammy Kaye Powers is male
    ◉ Michele Orselli (Italian) is male
    ◉ Andrea Giuliano (Italian) is male
    ◉ Andrea Skeries (American) is female

  23. Rudimentary Data Analysis
    Just using simple SQL queries

  24. Gender distribution among speakers & commenters

  25. How many talks has each speaker given? (Male vs. Female)

  26. Women speakers receive twice as many comments on their talks as men do

  27. Comment word cloud (all genders, all ratings)

  28. Comment word clouds, all ratings: male speakers (left) vs. female speakers (right)

  29. Comment word clouds, low ratings: male speakers (left) vs. female speakers (right)

  30. Conference location vs. average rating (more red = lower, more blue = higher)

  31. Average talk ratings
    ◉ United States: 4.66
    ◉ Brazil: 4.64
    ◉ Canada: 4.53
    ◉ Poland: 4.48
    ◉ United Kingdom: 4.47
    ◉ France: 4.42
    ◉ Germany: 4.37
    ◉ Croatia: 4.33
    ◉ Italy: 4.32
    ◉ Netherlands: 4.28
    ◉ Belgium: 4.23

  32. Data Science
    ☆.。₀:*゚magic ゚*:₀。.☆

  33. What the data scientist told me to say
    ◉ Logistic regression model
    ◉ Outcome variable: binarized score [1 if rating 4-5,
    0 if rating 1-3]
    ◉ Independent variables of interest: The interaction
    between speakers' gender (% female) and the
    commenter's gender.
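
The two derived variables described above can be sketched as follows. This is a minimal illustration, not the team's code: the function names and the `GuessedGender` type are hypothetical, and how speakers of unknown gender were handled is not stated in the slides, so counting them in the denominator is an assumption.

```typescript
// Map a 1-5 rating to the binary outcome: 1 for ratings 4-5, 0 for ratings 1-3.
// Comments without a rating return null so they can be left out of the model.
function binarizeRating(rating: number | null): 0 | 1 | null {
  if (rating === null) {
    return null;
  }
  if (rating < 1 || rating > 5) {
    throw new RangeError(`rating out of range: ${rating}`);
  }
  return rating >= 4 ? 1 : 0;
}

type GuessedGender = "female" | "male" | "unknown";

// Share of a talk's speakers guessed to be female, between 0 and 1.
// Assumption: speakers of unknown gender still count toward the total.
function fractionFemale(speakers: GuessedGender[]): number {
  if (speakers.length === 0) {
    throw new RangeError("a talk needs at least one speaker");
  }
  const female = speakers.filter((g) => g === "female").length;
  return female / speakers.length;
}
```

A talk with one female and one male speaker gets a value of 0.5, which is why the speaker variable is a percentage rather than a simple category.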

  34. What the data scientist told me to say, cont.
    ◉ Controls: Number of speakers, commenter's
    comment index, talk attendance count, year,
    conference continent.
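
One plausible way to write down a model of this kind is shown below. The slides do not give the exact specification, so the symbols, the coding of the commenter's gender, and the inclusion of the main effects alongside the interaction are all assumptions.

```latex
% y: binarized score (1 if the rating is 4-5, 0 if it is 1-3)
% f: share of the talk's speakers who are female
% c: indicator for the commenter's gender (coding assumed)
% x: vector of controls (number of speakers, comment index,
%    attendance count, year, conference continent)
\[
\log\frac{\Pr(y = 1)}{1 - \Pr(y = 1)}
  = \beta_0 + \beta_1 f + \beta_2 c + \beta_3 (f \cdot c)
  + \boldsymbol{\gamma}^{\top}\mathbf{x}
\]
```

In this form, the interaction coefficient captures how the effect of the speakers' gender on the odds of a high rating differs by the commenter's gender.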

  35. The results of the model

  36. The results, interpreted
    ◉ The odds of a male commenter highly rating a
    female speaker are 1/4 his odds of highly rating a
    male speaker.
    ◉ Female commenters highly rate female vs. male
    speakers at a ratio of 3:1.
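
Odds are not the same as probabilities, so a short sketch may help when reading these ratios. The helper names are hypothetical, and the baseline probability used in the note after the code is an illustrative number, not a figure from the data.

```typescript
// Convert a probability (between 0 and 1, exclusive of 1) to odds.
function oddsFromProbability(p: number): number {
  return p / (1 - p);
}

// Convert odds back to a probability.
function probabilityFromOdds(odds: number): number {
  return odds / (1 + odds);
}

// Apply an odds ratio to a baseline probability and return the new probability.
function applyOddsRatio(baselineProbability: number, oddsRatio: number): number {
  return probabilityFromOdds(oddsFromProbability(baselineProbability) * oddsRatio);
}
```

If, hypothetically, a commenter rated male speakers highly 80% of the time, his odds would be about 4 to 1. Multiplying those odds by 1/4 gives odds of about 1 to 1, so `applyOddsRatio(0.8, 0.25)` comes out at roughly 0.5. An odds ratio of 1/4 therefore does not mean a quarter of the probability.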

  37. Recommendations
    How do we make it better?

  38. For regular users (commenters)
    ◉ Comment on men's and women's talks at an equal
    rate (don't comment on women's talks more often)
    ◉ Examine your own biases
    ◉ Give constructive feedback
    ◉ Use the feedback sandwich technique!

  39. For joind.in
    ◉ Add a gender field to the user profile and start
    tracking this data internally
    ◉ Consider UI tweaks, e.g.
    ◉ Cues for more constructive feedback
    ◉ Change the rating scale (e.g. 1 to 6 instead of 1 to 5)
    ◉ Remove ratings entirely…?

  40. Thank you!
    Keep in touch: @xiehan on Twitter
