From Pokémon to Donald Trump - Mining and Visualizing weird stuff

In his talk, Markus talks about extracting, analyzing, and visualizing data from unusual sources, covering two of his projects. First, he describes using a Pokémon Go bot to gather and process data on 250k Pokémon spawns in Munich during the peak of the Pokémon Go hype in 2016, and the insights about the logic behind the game that he gained by visualizing this data. Second, he presents his analysis of 16 million comments in r/The_Donald, a community on Reddit devoted to Donald Trump, examining (among other things) the community's language compared to natural English, activity levels over time, and the geographical distribution of users.


MunichDataGeeks

October 07, 2017

Transcript

  1. From Donald Trump to Pokémon - Data mining off the beaten path
  2. /r/The_Donald - Analyzing 17 million comments by Trump followers on reddit
  3. The Data Source: Reddit. Reddit (/ˈrɛdɪt/) is an American social news aggregation, web content rating, and discussion website. Reddit's registered community members can submit content such as text posts or direct links. Registered users can then vote submissions up or down, which determines their position on the page. #4 most visited website in the U.S. and #9 in the world. /r/The_Donald is an Internet forum on Reddit where the participants create discussions and memes about President Donald Trump. Initially created in June 2015 [..], the community has grown to over 500,000 subscribers and [..] is one of the most active communities on the website. The subreddit has been accused [..] of hosting conspiracy theories and content that is racist, misogynistic, anti-Semitic, or white supremacist.
  4. Reddit comment data is available via API, but the data set is yuuuge: > 3.4 bn rows in total
  5. BigQuery to the rescue: all reddit comments (~3.5 bn) → all r/The_Donald comments (19 m) → deduplicated & cleaned (16.7 m). 29 terabytes later..
  6. BigQuery to the rescue: all reddit comments (~3.5 bn) → all r/The_Donald comments (19 m) → deduplicated & cleaned (16.7 m). 29 terabytes later.. Watch your budget: it's easy to spend money on BigQuery. Budget settings do not limit spending, they just inform you that you have just spent a lot of money.
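The extraction step can be sketched as a query against the public reddit comment dump on BigQuery. The dataset/table naming and column names below are assumptions based on the widely mirrored `fh-bigquery` dump (one table per month); adjust them to whatever snapshot you actually use:

```python
# Hypothetical extraction query for one month of the public reddit comment
# dump on BigQuery; table and column names are assumptions.
QUERY = """
SELECT author, author_flair_text, body, created_utc, score, gilded
FROM `fh-bigquery.reddit_comments.2017_01`
WHERE subreddit = 'The_Donald'
"""

def fetch_comments(client):
    # `client` is an authenticated google.cloud.bigquery.Client.
    # BigQuery bills by bytes scanned, hence the budget warning above.
    return list(client.query(QUERY).result())
```

Running a query like this over every monthly table is how the terabytes of scanned data add up.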
  7. What data do we have and what can we do with it?
  8. What data do we have and what can we do with it? Full text in natural language → Sentiment Analysis, Natural Language Processing, Deep Learning (i.e. train an AI).
  9. What data do we have and what can we do with it? Full text in natural language → Sentiment Analysis, Natural Language Processing, Deep Learning (i.e. train an AI). Timestamp and rough location through flair (= user status) → Time Series, Geovisualization, ..
  10. What data do we have and what can we do with it? Full text in natural language → Sentiment Analysis, Natural Language Processing, Deep Learning (i.e. train an AI). Timestamp and rough location through flair (= user status) → Time Series, Geovisualization, .. Ranking of comments by score (upvotes - downvotes) & gilding → Measures, Relations, ..
  11. Enhancing the data (1) - Word counts of comments. 1. Cleanse comments (encoding, ..) & export comment body only. 2. MapReduce to count individual words with Cloud Dataflow: 16.7 million comments, 408 million words, 1.8 million distinct words. 3. Remove stop words ("to", "from", ..). 4. Remove nonsense words ("asdf", ..), variations ("lol", "lool", "l00l"), etc. by importing into SQL and inner joining with an English dictionary. Challenge: retain certain meaningful words ("Obummer", "Killary", "MAGA", ..).
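A toy version of steps 2-4: count words, drop stop words, then keep only dictionary words plus a whitelist of meaningful non-dictionary words. The three sets here are tiny illustrative stand-ins, not the lists used in the talk, and a plain `Counter` stands in for the Cloud Dataflow MapReduce job:

```python
from collections import Counter
import re

STOPWORDS = {"to", "from", "the", "a"}            # illustrative stop-word list
DICTIONARY = {"media", "rapist", "wall"}          # stands in for a full English dictionary
KEEP = {"maga", "obummer", "killary"}             # meaningful words to retain despite the join

def count_words(comments):
    """Count -> filter -> dictionary-join pipeline over comment bodies."""
    words = Counter()
    for body in comments:
        words.update(re.findall(r"[a-z0-9']+", body.lower()))
    # drop stop words; keep only dictionary words plus the whitelist,
    # which filters out nonsense words like "asdf" and variations like "lool"
    return {w: n for w, n in words.items()
            if w not in STOPWORDS and (w in DICTIONARY or w in KEEP)}

counts = count_words(["Build the wall", "MAGA MAGA", "lool asdf"])
```

Here "the" falls to the stop-word list, "lool" and "asdf" to the dictionary join, while "MAGA" survives via the whitelist.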
  12. Enhancing the data (2) - Word sentiments and reference data. Word sentiments: ◉ A word sentiment classifies a word as positive or negative. ◉ There are a number of dictionaries for word sentiments available. ◉ But: sentiments are often dependent on context, so handle with care. Reference data: ◉ To compare The_Donald to natural English, we need a representative sample of the English language. ◉ The Google Trillion Word Corpus contains one trillion words from different sources. ◉ Google's Machine Learning Team published a frequency count for the dictionary.
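A minimal sketch of dictionary-based word sentiment, assuming an AFINN-style lexicon that maps words to scores (the talk does not name the lexicon used). Note that the lookup ignores context entirely, which is exactly why the slide says to handle sentiments with care:

```python
# Tiny illustrative lexicon; real ones (e.g. AFINN) cover thousands of words.
SENTIMENTS = {"great": 1, "corrupt": -1, "rapist": -1}

def comment_sentiment(words):
    """Sum of per-word sentiments; unknown words count as neutral (0)."""
    return sum(SENTIMENTS.get(w, 0) for w in words)

score = comment_sentiment(["the", "corrupt", "media"])
```

A comment like "not corrupt at all" would score negative here, illustrating the context problem.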
  13. 18/20 of the words with the highest relative word frequency* are negative. *Highest relative word frequency compared to reference data (per 100k words), i.e. The_Donald commenters use the word "rapist" 944 times more often per 100k words than our reference sample of natural English.
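The starred measure can be written down directly: occurrences per 100k words in the subreddit, divided by the same rate in the reference corpus. The counts below are made up for illustration; only the formula follows the slide:

```python
def rate_per_100k(count, total_words):
    """Occurrences of a word per 100k words of a corpus."""
    return count / total_words * 100_000

def relative_frequency(word_count, corpus_size, ref_count, ref_size):
    """Ratio > 1 means the word is over-represented in the subreddit."""
    return rate_per_100k(word_count, corpus_size) / rate_per_100k(ref_count, ref_size)

# a word used 4 times per 100k in the subreddit vs 2 per 100k in the
# reference corpus is over-represented by a factor of 2
ratio = relative_frequency(400, 10_000_000, 20, 1_000_000)
```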
  14. Russia and Trump? - Russians are the most active non-US commenters
  15. Activity is constantly increasing and peaks for certain events

  16. Activity is constantly increasing and peaks for certain events: 12.-14.06.2016: ? | 27.07.2016: Donald Trump conducts an AMA (Ask Me Anything = Q&A) | 08.11.2016: election day | 20.01.2017+: inauguration and first days in office.
  17. The shorter a comment, the higher the chance of being upvoted
  18. Known flaws and possible biases. Reference data: the sample for natural English is taken from a different, non-political source, and language in a political setting is different. Better: compare to a neutral subreddit or another political subreddit. Confirmation bias: I started my analysis with the hypothesis that /r/The_Donald is an angry and rather uneducated subreddit, and the results matched my initial expectations. There are plenty of possibilities - especially in word sentiment analysis - to unconsciously manipulate results to confirm my existing bias.
  19. Pokémon Go - Analyzing 206k Pokémon spawns in the Munich area
  20. The Data Source: Pokémon Go. Pokémon Go is a free-to-play, location-based augmented reality game. The game utilizes the player's mobile device's GPS ability to locate, capture, battle, and train [..] Pokémon, which appear on the screen as if they were at the same real-world location as the player.
  21. Pokémon Go has an API, but it is not meant to be used by third parties (UI → Pokémon Go API). Pokémon Go Scanner: Pokémon are discovered by submitting a location to the API; the scanner emulates multiple users walking around and persists the spawn data to SQL. PokéAPI, an open source project providing Pokémon metadata as a RESTful API, is then used to add metadata to the spawn data (GET spawn data → add metadata).
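A rough sketch of the scanner idea: emulate users at points on a coordinate grid around Munich and persist reported spawns to SQL. Here `scan(lat, lng)` is a stand-in for the real (since secured) Pokémon Go API call, and the grid size and spacing are arbitrary illustrative choices:

```python
import sqlite3

def grid(center_lat, center_lng, steps=2, spacing=0.001):
    """Yield a (2*steps+1) x (2*steps+1) grid of coordinates around a center."""
    for i in range(-steps, steps + 1):
        for j in range(-steps, steps + 1):
            yield center_lat + i * spacing, center_lng + j * spacing

def collect(scan, db=":memory:"):
    """Walk the grid, ask `scan` for spawns at each point, persist to SQLite."""
    conn = sqlite3.connect(db)
    conn.execute("CREATE TABLE IF NOT EXISTS spawns "
                 "(pokemon_id INT, lat REAL, lng REAL, ts INT)")
    for lat, lng in grid(48.137, 11.575):  # Munich city center
        for s in scan(lat, lng):
            conn.execute("INSERT INTO spawns VALUES (?, ?, ?, ?)",
                         (s["pokemon_id"], lat, lng, s["ts"]))
    conn.commit()
    return conn
```

Pausing the loop or changing `steps`/`spacing` mid-run produces exactly the missing-data and search-radius flaws discussed later.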
  22. API data is very raw, but SQL magic can help us out here.
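The slide does not show the actual queries, so this is only a guess at the kind of SQL cleanup meant: the same spawn is reported every time the scanner passes by, and a `SELECT DISTINCT` collapses those duplicates into one row per spawn:

```python
import sqlite3

# Raw scanner output with duplicate reports of the same spawn
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw (pokemon_id INT, lat REAL, lng REAL, ts INT)")
conn.executemany("INSERT INTO raw VALUES (?, ?, ?, ?)", [
    (16, 48.137, 11.575, 100),   # Pidgey, reported twice
    (16, 48.137, 11.575, 100),
    (19, 48.140, 11.580, 100),   # Rattata, reported once
])

# Deduplicate: one row per (pokemon_id, lat, lng, ts)
conn.execute("CREATE TABLE spawns AS "
             "SELECT DISTINCT pokemon_id, lat, lng, ts FROM raw")
n = conn.execute("SELECT COUNT(*) FROM spawns").fetchone()[0]
```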
  23. Known flaws and possible biases. Missing data: while collecting the data, I paused the script multiple times for a short while; during these pauses, no data was collected at all. Changing search radius: in addition to stopping the script several times, I also changed its parameters multiple times, e.g. increasing/decreasing the search radius. The data used came from a trial run over 2 days in July 2016. The game developers soon after secured the API, effectively disabling scanners -> a controlled run with fixed parameters never happened.
  24. Evolutions of a Pokémon mostly spawn in the same area - Weedle
  25. Evolutions of a Pokémon mostly spawn in the same area - Kakuna
  26. Evolutions of a Pokémon mostly spawn in the same area - Beedrill
  27. Some Pokémon seem to (illogically) align to characteristics of the environment
  28. Spawn rate does not change over time

  29. Other findings: ◉ 95.6% of all recorded Pokémon are of the lowest evolution level (1), 4.1% have evolved to the second level, and only 0.32% have reached evolution level 3. ◉ There's a Pidgey and Rattata epidemic in Munich, with each accounting for 14% of all Pokémon. ◉ 28 Pokémon never showed up in Munich; 123 out of 151, however, did. ◉ There is a relation between evolution level and spawn rate/probability (not really surprising), but there is no statistically significant relation between evolution level and spread over the map.
  30. THANKS! github.com/markusz works-on-my-machine.com

  31. Top 20 outliers in relative word frequency compared to natural English
  32. Activity