Data Pipelines and Computational
Methods for the Social Sciences
Alex Hanna | Department of Sociology | UW-Madison
Slide 2
Data Pipelines and
Computational Methods
for the Social Sciences
Alex Hanna
PhD Candidate
Sociology
March 9, 2016
@alexhanna // alex-hanna.com
Slide 3
Outline
Slide 6
Slides are available at
http://tinyurl.com/rds-ahanna
Follow along at home and check out links!
Slide 7
Part 1: Twitter and politics
Slide 9
Twitter and politics
Understanding dynamics of elections and Twitter activity
Slide 11
Wisconsin Twitter Collection
Started in early 2012
Currently 40+ TB
Over 50 billion tweets
Slide 13
Olden times
Storage
Hardware: PC
Software: Compressed (gzip) JSON
Host: Social Science Computing Cooperative
Collection
Home-rolled Perl
Slow to process, search, and index
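A minimal Python sketch of why this kind of storage is slow to search. It assumes one JSON tweet per line in each gzip file, which is an assumption (the slides do not describe the exact layout), and the function names and sample data are hypothetical. With no index, every search has to decompress and parse the whole file.

```python
import gzip
import json
import os
import tempfile


def iter_tweets(path):
    """Yield each tweet (a dict) from a gzip file holding one JSON object per line."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip truncated or malformed lines
            if isinstance(record, dict):
                yield record


def count_matching(path, keyword):
    """Count tweets whose text contains keyword (case-insensitive), by full scan."""
    needle = keyword.lower()
    return sum(
        1 for tweet in iter_tweets(path)
        if needle in str(tweet.get("text", "")).lower()
    )


if __name__ == "__main__":
    # Hypothetical sample data, written to a temporary file for the demo.
    sample = [{"text": "Recall election today"}, {"text": "Badgers game tonight"}]
    path = os.path.join(tempfile.mkdtemp(), "tweets.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for tweet in sample:
            out.write(json.dumps(tweet) + "\n")
    print(count_matching(path, "election"))
```

Because `iter_tweets` is a generator, memory use stays flat, but the running time still grows with the size of every file scanned.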
Slide 15
Modern times
Storage
Hardware: 7-node Hadoop cluster
Software: Hive, compressed with Snappy codec
Host: Computer Systems Lab
Collection
Twitter Hosebird client
Quick processing, search, indexing
Slide 16
Part 2: Protest event data
Slide 18
Date: February 6, 1987
Location: Mercury, Nevada
Issue: Peace (anti-nuclear)
Form: Rally
Target: US government
Size: 2,000
Orgs: Greenpeace et al.
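One way to hold a coded event like this in Python is a small record type. This is a sketch only: the class and field names are illustrative assumptions, not a schema from the talk.

```python
from dataclasses import asdict, dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ProtestEvent:
    """One coded protest event; field names are illustrative assumptions."""
    event_date: date
    location: str
    issue: str
    form: str
    target: str
    size: Optional[int] = None  # estimated number of participants, if reported
    organizations: List[str] = field(default_factory=list)


# The example event from the slide. Only Greenpeace is named there ("et al."),
# so the other organizations are left out rather than guessed.
event = ProtestEvent(
    event_date=date(1987, 2, 6),
    location="Mercury, Nevada",
    issue="Peace (anti-nuclear)",
    form="Rally",
    target="US government",
    size=2000,
    organizations=["Greenpeace"],
)

if __name__ == "__main__":
    print(asdict(event))
```

Typed fields make it easy to sort events by date or sum sizes, and `asdict` gives a plain dictionary for export.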
Slide 21
Machine-learning Protest Event Data System (MPEDS)
A system for generating new protest event data with minimal human intervention, using tools from natural language processing and machine learning
Slide 22
Original setup
Dynamics of Collective Action as training data
New York Times Annotated Corpus, XML files
Problem: managing and accessing files is slow and messy
Slide 23
Current setup
Training interface on SSCC VM
Stored in SQLite (moving to MySQL)
Currently in development (GitHub)
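A rough sketch of what storing coder annotations in SQLite could look like from Python, using the standard-library `sqlite3` module. The table layout, column names, and function names are all hypothetical; the slides do not show the actual schema of the training interface.

```python
import sqlite3


def create_schema(conn):
    """Create a hypothetical table holding one coded value per row."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS annotation (
            id         INTEGER PRIMARY KEY,
            article_id TEXT NOT NULL,
            coder      TEXT NOT NULL,
            variable   TEXT NOT NULL,
            value      TEXT NOT NULL
        )
        """
    )


def add_annotation(conn, article_id, coder, variable, value):
    """Insert one annotation; the with-block commits on success, rolls back on error."""
    with conn:
        conn.execute(
            "INSERT INTO annotation (article_id, coder, variable, value) "
            "VALUES (?, ?, ?, ?)",
            (article_id, coder, variable, value),
        )


def annotations_for(conn, article_id):
    """Return (coder, variable, value) tuples for one article, in insertion order."""
    cursor = conn.execute(
        "SELECT coder, variable, value FROM annotation "
        "WHERE article_id = ? ORDER BY id",
        (article_id,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # in-memory database for the demo
    create_schema(conn)
    add_annotation(conn, "article-1", "coder-a", "form", "Rally")
    print(annotations_for(conn, "article-1"))
```

Keeping the SQL in a few small functions should ease a later move to MySQL, although common MySQL drivers use `%s` placeholders rather than `?`.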
Slide 24
Current setup
Document storage (>10 million) in Apache Solr
Quick searching, index built upon insertion
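A hedged sketch of querying a Solr index over HTTP with only the Python standard library. The server address, the core name `articles`, and the field names are hypothetical; `q`, `rows`, `fl`, and `wt` are standard Solr query parameters.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_select_url(base_url, core, query, rows=10, fields=None):
    """Build a URL for Solr's /select handler, asking for a JSON response."""
    params = {"q": query, "rows": rows, "wt": "json"}
    if fields:
        params["fl"] = ",".join(fields)  # restrict the returned fields
    return "{}/{}/select?{}".format(base_url.rstrip("/"), core, urlencode(params))


def search(base_url, core, query, rows=10, fields=None):
    """Run the query against a live Solr server and return the matching documents."""
    with urlopen(build_select_url(base_url, core, query, rows, fields)) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload["response"]["docs"]


if __name__ == "__main__":
    # Hypothetical server, core, and field names; nothing is fetched here.
    print(build_select_url("http://localhost:8983/solr", "articles",
                           "text:protest AND text:rally", rows=5,
                           fields=["id", "title"]))
```

Only `search` needs a running server; `build_select_url` can be checked offline.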
Slide 25
Part 3: Computational Social Science Education
Slide 26
The problem
Social scientists use SPSS or STATA
But not R, Python, Hadoop, or the command-line interface
Teaching literacy to both new and veteran scholars
Slide 28
Old tasks
Data munging
Regression
Graphing
New tasks
Data munging
Web scraping
Large-scale networks
Automated text analysis
Slide 31
Pedagogical approach
Meet people where they are at
How can you get them involved in a meaningful way?
Provide a lab setting for working through problems
Guide people along with hands-on workshops
Make code and instructions integrated and available on the web
GitHub, RMarkdown, IPython/Project Jupyter notebooks
Slide 32
Example 1: Introduction to RStudio
Goal: Traditional tasks
Data handling
Plotting
Univariate and bivariate analysis
Audience: Introductory methods course
Undergraduate sociology students
Some with STATA experience
Slide 33
RStudio: Using Examples
Code blocks and interface
Allowing for a “do-it-yourself” puzzle after initial instructions
Slide 34
Example 2: Blogclub “tworkshops”
Goal: From zero to Hadoop for social media data
Basic UNIX terminal, Python
Various types of analysis
Audience: Mix of ~10 faculty and PhD students in SJMC
Labs take place over the span of a year
Slide 35
Tworkshop syllabus
1. Twitter API and an introduction to the terminal
2. More terminal and your first Python script
3. Basic Python
4. Python modules and I/O
5. Hadoop and MapReduce
6. Basic sentiment analysis
7. Network analysis
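To give a flavor of session 5, here is a toy MapReduce job in plain Python that counts hashtags. It runs in a single process and only imitates the map, shuffle, and reduce phases that Hadoop distributes across nodes. The function names and sample tweets are hypothetical and are not taken from the workshop materials.

```python
import re
from collections import defaultdict

HASHTAG = re.compile(r"#(\w+)")


def map_tweet(tweet):
    """Map phase: emit a (hashtag, 1) pair for every hashtag in the tweet text."""
    for tag in HASHTAG.findall(tweet.get("text", "")):
        yield tag.lower(), 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce phase: collapse the values for one key into a total."""
    return key, sum(values)


def run_job(tweets):
    """Run map, shuffle, and reduce in sequence and return {hashtag: count}."""
    pairs = (pair for tweet in tweets for pair in map_tweet(tweet))
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample tweets.
    sample = [
        {"text": "#WIrecall vote today #wirecall"},
        {"text": "No tags here"},
        {"text": "Rally in #Madison #wirecall"},
    ]
    print(run_job(sample))
```

The mapper and reducer never share state, which is what lets a framework run many copies of each in parallel.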
Slide 36
Example 3: Data Science and Social Science @ NYU
Goal: Data Science in R
Data munging and handling
Visualization, regression
Textual analysis and social network analysis
Web scraping and API access
Audience: Graduate students
Have statistical training
Varying levels of R literacy
More proficient with STATA
Slide 37
Connecting methods with relevant questions
Using a CS + education psychology example: fighting bullying with machine learning
Slide 38
Summary
Computational social science has vastly increased the amount of data that social scientists use in daily practice
Handling data at scale takes training, not to mention trial and error
Slide 39
Takeaways
Play around with different solutions. Ask peers and mentors. Collaborate.
Explore how to apply methods / technologies to your current project.
Don’t be afraid to fail (and fail often!)