From Pokémon to Donald Trump - Mining and Visualizing weird stuff

In his talk, Markus talks about extracting, analyzing, and visualizing data from unusual sources, covering two of his projects. First, he describes using a Pokémon Go bot to gather and process data on 250k Pokémon spawns in Munich during the peak of the Pokémon Go hype in 2016, and the insights about the logic behind the game that he gained by visualizing this data. Second, he presents his analysis of 16 million comments in r/The_Donald, a community on Reddit devoted to Donald Trump, examining (among other things) the community's language compared to natural English, activity levels over time, and the geographical distribution of users.


MunichDataGeeks

October 07, 2017

Transcript

  1. From Donald Trump to Pokémon - Data mining off the beaten path
  2. /r/The_Donald - Analyzing 17 million comments by Trump followers on reddit
  3. The Data Source: Reddit. Reddit (/ˈrɛdɪt/) is an American social news aggregation, web content rating, and discussion website. Reddit's registered community members can submit content such as text posts or direct links. Registered users can then vote submissions up or down, which determines their position on the page. #4 most visited website in the U.S. and #9 in the world. /r/The_Donald is an Internet forum on Reddit where the participants create discussions and memes about President Donald Trump. Initially created in June 2015 [..], the community has grown to over 500,000 subscribers and [..] is one of the most active communities on the website. The subreddit has been accused [..] of hosting conspiracy theories and content that is racist, misogynistic, anti-Semitic, or white supremacist.
  4. Reddit comment data is available via API, but the data set is yuuuge: > 3.4 bn rows in total
  5. BigQuery to the rescue: all reddit comments (~3.5 bn) → all r/The_Donald comments (19 m) → deduplicated & cleaned (16.7 m). 29 terabytes later..
  6. BigQuery to the rescue: all reddit comments (~3.5 bn) → all r/The_Donald comments (19 m) → deduplicated & cleaned (16.7 m). 29 terabytes later.. Watch your budget: it's easy to spend money on BigQuery. Budget settings do not limit spending, they just inform you that you have just spent a lot of money.
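The extraction step can be sketched as a query against the public reddit comment dump on BigQuery. The dataset/table naming and column names below are assumptions based on the widely mirrored `fh-bigquery` dump (one table per month); adjust them to whatever snapshot you actually use:

```python
# Hypothetical extraction query for one month of the public reddit comment
# dump on BigQuery; table and column names are assumptions.
QUERY = """
SELECT author, author_flair_text, body, created_utc, score, gilded
FROM `fh-bigquery.reddit_comments.2017_01`
WHERE subreddit = 'The_Donald'
"""

def fetch_comments(client):
    # `client` is an authenticated google.cloud.bigquery.Client.
    # BigQuery bills by bytes scanned, hence the budget warning above.
    return list(client.query(QUERY).result())
```

Running a query like this over every monthly table is how the terabytes of scanned data add up.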
  7. What data do we have and what can we do with it?
  8. What data do we have and what can we do with it? Full text in natural language → Sentiment Analysis, Natural Language Processing, Deep Learning (i.e. train an AI).
  9. What data do we have and what can we do with it? Full text in natural language → Sentiment Analysis, Natural Language Processing, Deep Learning (i.e. train an AI). Timestamp and rough location through flair (= user status) → Time Series, Geovisualization, ..
  10. What data do we have and what can we do with it? Full text in natural language → Sentiment Analysis, Natural Language Processing, Deep Learning (i.e. train an AI). Timestamp and rough location through flair (= user status) → Time Series, Geovisualization, .. Ranking of comments by score (upvotes - downvotes) & gilding → Measures, Relations, ..
  11. Enhancing the data (1) - Word counts of comments. 1. Cleanse comments (encoding, ..) & export comment body only. 2. MapReduce to count individual words with Cloud Dataflow: 16.7 million comments, 408 million words, 1.8 million distinct words. 3. Remove stop words ("to", "from", ..). 4. Remove nonsense words ("asdf", ..), variations ("lol", "lool", "l00l"), etc. by importing into SQL and inner joining with an English dictionary. Challenge: retain certain meaningful words ("Obummer", "Killary", "MAGA", ..).
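A toy version of steps 2-4: count words, drop stop words, then keep only dictionary words plus a whitelist of meaningful non-dictionary words. The three sets here are tiny illustrative stand-ins, not the lists used in the talk, and a plain `Counter` stands in for the Cloud Dataflow MapReduce job:

```python
from collections import Counter
import re

STOPWORDS = {"to", "from", "the", "a"}            # illustrative stop-word list
DICTIONARY = {"media", "rapist", "wall"}          # stands in for a full English dictionary
KEEP = {"maga", "obummer", "killary"}             # meaningful words to retain despite the join

def count_words(comments):
    """Count -> filter -> dictionary-join pipeline over comment bodies."""
    words = Counter()
    for body in comments:
        words.update(re.findall(r"[a-z0-9']+", body.lower()))
    # drop stop words; keep only dictionary words plus the whitelist,
    # which filters out nonsense words like "asdf" and variations like "lool"
    return {w: n for w, n in words.items()
            if w not in STOPWORDS and (w in DICTIONARY or w in KEEP)}

counts = count_words(["Build the wall", "MAGA MAGA", "lool asdf"])
```

Here "the" falls to the stop-word list, "lool" and "asdf" to the dictionary join, while "MAGA" survives via the whitelist.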
  12. Enhancing the data (2) - Word sentiments and reference data. Word sentiments: ◉ A word sentiment classifies a word as positive or negative. ◉ There are a number of dictionaries for word sentiments available. ◉ But: sentiments are often dependent on context, so handle with care. Reference data: ◉ To compare The_Donald to natural English, we need a representative sample of the English language. ◉ The Google Trillion Word Corpus contains one trillion words from different sources. ◉ Google's Machine Learning Team published a frequency count for the dictionary.
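A minimal sketch of dictionary-based word sentiment, assuming an AFINN-style lexicon that maps words to scores (the talk does not name the lexicon used). Note that the lookup ignores context entirely, which is exactly why the slide says to handle sentiments with care:

```python
# Tiny illustrative lexicon; real ones (e.g. AFINN) cover thousands of words.
SENTIMENTS = {"great": 1, "corrupt": -1, "rapist": -1}

def comment_sentiment(words):
    """Sum of per-word sentiments; unknown words count as neutral (0)."""
    return sum(SENTIMENTS.get(w, 0) for w in words)

score = comment_sentiment(["the", "corrupt", "media"])
```

A comment like "not corrupt at all" would score negative here, illustrating the context problem.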
  13. 18/20 of the words with the highest relative word frequency* are negative. *Highest relative word frequency compared to reference data (per 100k words), i.e. The_Donald commenters use the word "rapist" 944 times more often per 100k words than our reference sample of natural English.
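The starred measure can be written down directly: occurrences per 100k words in the subreddit, divided by the same rate in the reference corpus. The counts below are made up for illustration; only the formula follows the slide:

```python
def rate_per_100k(count, total_words):
    """Occurrences of a word per 100k words of a corpus."""
    return count / total_words * 100_000

def relative_frequency(word_count, corpus_size, ref_count, ref_size):
    """Ratio > 1 means the word is over-represented in the subreddit."""
    return rate_per_100k(word_count, corpus_size) / rate_per_100k(ref_count, ref_size)

# a word used 4 times per 100k in the subreddit vs 2 per 100k in the
# reference corpus is over-represented by a factor of 2
ratio = relative_frequency(400, 10_000_000, 20, 1_000_000)
```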
  14. Russia and Trump? - Russians are the most active non-US commenters
  15. Activity is constantly increasing and peaks for certain events

  16. Activity is constantly increasing and peaks for certain events: 12.-14.06.2016: ? | 27.07.2016: Donald Trump conducts an AMA (Ask Me Anything = Q&A) | 08.11.2016: election day | 20.01.2017+: inauguration and first days in office.
  17. The shorter a comment, the higher the chance of being upvoted
  18. Known flaws and possible biases. Reference data: the sample for natural English is taken from a different, non-political source, and language in a political setting is different. Better: compare to a neutral subreddit or another political subreddit. Confirmation bias: I started my analysis with the hypothesis that /r/The_Donald is an angry and rather uneducated subreddit, and the results matched my initial expectations. There are plenty of possibilities - especially in word sentiment analysis - to unconsciously manipulate results to confirm my existing bias.
  19. Pokémon Go - Analyzing 206k Pokémon spawns in the Munich area
  20. The Data Source: Pokémon Go. Pokémon Go is a free-to-play, location-based augmented reality game. The game utilizes the player's mobile device's GPS ability to locate, capture, battle, and train [..] Pokémon, which appear on the screen as if they were at the same real-world location as the player.
  21. Pokémon Go has an API, but it is not meant to be used by third parties (UI → Pokémon Go API). Pokémon Go Scanner: Pokémon are discovered by submitting a location to the API; the scanner emulates multiple users walking around and persists the spawn data to SQL. PokéAPI, an open source project providing Pokémon metadata as a RESTful API, is then used to add metadata to the spawn data (GET spawn data → add metadata).
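A rough sketch of the scanner idea: emulate users at points on a coordinate grid around Munich and persist reported spawns to SQL. Here `scan(lat, lng)` is a stand-in for the real (since secured) Pokémon Go API call, and the grid size and spacing are arbitrary illustrative choices:

```python
import sqlite3

def grid(center_lat, center_lng, steps=2, spacing=0.001):
    """Yield a (2*steps+1) x (2*steps+1) grid of coordinates around a center."""
    for i in range(-steps, steps + 1):
        for j in range(-steps, steps + 1):
            yield center_lat + i * spacing, center_lng + j * spacing

def collect(scan, db=":memory:"):
    """Walk the grid, ask `scan` for spawns at each point, persist to SQLite."""
    conn = sqlite3.connect(db)
    conn.execute("CREATE TABLE IF NOT EXISTS spawns "
                 "(pokemon_id INT, lat REAL, lng REAL, ts INT)")
    for lat, lng in grid(48.137, 11.575):  # Munich city center
        for s in scan(lat, lng):
            conn.execute("INSERT INTO spawns VALUES (?, ?, ?, ?)",
                         (s["pokemon_id"], lat, lng, s["ts"]))
    conn.commit()
    return conn
```

Pausing the loop or changing `steps`/`spacing` mid-run produces exactly the missing-data and search-radius flaws discussed later.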
  22. API data is very raw, but SQL magic can help us out here.
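The slide does not show the actual queries, so this is only a guess at the kind of SQL cleanup meant: the same spawn is reported every time the scanner passes by, and a `SELECT DISTINCT` collapses those duplicates into one row per spawn:

```python
import sqlite3

# Raw scanner output with duplicate reports of the same spawn
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw (pokemon_id INT, lat REAL, lng REAL, ts INT)")
conn.executemany("INSERT INTO raw VALUES (?, ?, ?, ?)", [
    (16, 48.137, 11.575, 100),   # Pidgey, reported twice
    (16, 48.137, 11.575, 100),
    (19, 48.140, 11.580, 100),   # Rattata, reported once
])

# Deduplicate: one row per (pokemon_id, lat, lng, ts)
conn.execute("CREATE TABLE spawns AS "
             "SELECT DISTINCT pokemon_id, lat, lng, ts FROM raw")
n = conn.execute("SELECT COUNT(*) FROM spawns").fetchone()[0]
```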
  23. Known flaws and possible biases. Missing data: while collecting the data, I paused the script multiple times for a short while; during these pauses, no data was collected at all. Changing search radius: in addition to stopping the script several times, I also changed its parameters multiple times, e.g. increasing/decreasing the search radius. The data used came from a trial run over 2 days in July 2016. The game developers soon after secured the API, effectively disabling scanners -> a controlled run with fixed parameters never happened.
  24. Evolutions of a Pokémon mostly spawn in the same area - Weedle
  25. Evolutions of a Pokémon mostly spawn in the same area - Kakuna
  26. Evolutions of a Pokémon mostly spawn in the same area - Beedrill
  27. Some Pokémon seem to (illogically) align to characteristics of the environment
  28. Spawn rate does not change over time

  29. Other findings: ◉ 95.6% of all recorded Pokémon are of the lowest evolution level (1), 4.1% have evolved to the second level, and only 0.32% have reached evolution level 3. ◉ There's a Pidgey and Rattata epidemic in Munich, with each accounting for 14% of all Pokémon. ◉ 28 Pokémon never showed up in Munich; 123 out of 151, however, did. ◉ There is a relation between evolution level and spawn rate/probability (not really surprising), but there is no statistically significant relation between evolution level and spread over the map.
  30. THANKS! github.com/markusz works-on-my-machine.com

  31. Top 20 outliers in relative word frequency compared to natural English
  32. Activity