Data Pipelines and Computational Methods for the Social Sciences

Alex Hanna, PhD Candidate, Department of Sociology, UW-Madison
March 9, 2016
@alexhanna // alex-hanna.com

Outline

Slides are available at http://tinyurl.com/rds-ahanna
Follow along at home and check out the links!

Part 1: Twitter and politics

Twitter and politics
  Understanding dynamics of elections and Twitter activity

Wisconsin Twitter Collection
  Started in early 2012
  Currently 40+ TB
  Over 50 billion tweets

Olden times
  Storage
    Hardware: PC
    Software: Compressed (gzip) JSON
    Host: Social Science Computing Cooperative
  Collection
    Home-rolled Perl
  Slow to process, search, and index
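
To illustrate why gzip-compressed JSON is slow to search: answering even a simple question means decompressing and parsing every line, because there is no index. The following is a minimal Python sketch, not the original Perl collection code. It assumes one tweet per line with a `text` field; the helper name `count_matching_tweets` and the sample tweets are made up for illustration.

```python
import gzip
import json
import os
import tempfile


def count_matching_tweets(path, keyword):
    """Scan a gzip-compressed file of line-delimited JSON tweets.

    Every query is a full linear pass: each line is decompressed,
    parsed, and tested, because there is no index to consult.
    """
    keyword = keyword.lower()
    matches = 0
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                tweet = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip truncated or malformed records
            if keyword in tweet.get("text", "").lower():
                matches += 1
    return matches


# Tiny made-up sample (assumed layout) so the sketch runs end to end.
sample = [
    {"created_at": "Mon Mar 05 12:00:00 +0000 2012", "text": "Recall election news"},
    {"created_at": "Mon Mar 05 12:01:00 +0000 2012", "text": "Lunch time"},
    {"created_at": "Mon Mar 05 12:02:00 +0000 2012", "text": "More on the ELECTION"},
]
sample_path = os.path.join(tempfile.mkdtemp(), "tweets.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as out:
    for tweet in sample:
        out.write(json.dumps(tweet) + "\n")
    out.write("{not valid json\n")  # simulate a corrupted record

election_count = count_matching_tweets(sample_path, "election")
```

The cost of each query grows with the size of the whole archive, which is manageable for a few files but not for a collection measured in terabytes.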

Modern times
  Storage
    Hardware: 7-node Hadoop cluster
    Software: Hive, compressed with Snappy codec
    Host: Computer Systems Lab
  Collection
    Twitter Hosebird client
  Quick processing, search, indexing

Part 2: Protest event data

Date: February 6, 1987
Location: Mercury, Nevada
Issue: Peace (anti-nuclear)
Form: Rally
Target: US government
Size: 2,000
Orgs: Greenpeace et al.
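
One way to picture a protest event record like the one above is as a typed structure with one field per coded attribute. This is a sketch only: the class name `ProtestEvent` and its field names are hypothetical and are not taken from the MPEDS codebase.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class ProtestEvent:
    """One coded protest event, mirroring the fields on the slide."""

    event_date: date
    location: str
    issue: str
    form: str
    target: str
    size: int
    orgs: List[str] = field(default_factory=list)


# The example event from the slide. "et al." is not expanded because the
# other organizations are not named in the source.
event = ProtestEvent(
    event_date=date(1987, 2, 6),
    location="Mercury, Nevada",
    issue="Peace (anti-nuclear)",
    form="Rally",
    target="US government",
    size=2000,
    orgs=["Greenpeace"],
)
```

Storing the date and size as typed values rather than strings makes later filtering and aggregation (events per year, median size by issue) straightforward.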

Machine-learning Protest Event Data System (MPEDS)
  A system for generating new protest event data with minimal human intervention, using tools from natural language processing and machine learning

Original setup
  Dynamics of Collective Action as training data
  New York Times Annotated Corpus, XML files
  Problem: managing and accessing files is slow and messy

Current setup
  Training interface on SSCC VM
  Stored in SQLite (moving to MySQL)
  Being developed currently (GitHub)
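
As a rough illustration of keeping training annotations in SQLite, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `annotation`, its columns, and the sample rows are all hypothetical; the actual MPEDS schema is not shown in the slides.

```python
import sqlite3

# In-memory database for illustration; a real deployment would use a file path.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE annotation (
        id INTEGER PRIMARY KEY,
        article_id TEXT NOT NULL,
        coder TEXT NOT NULL,
        is_protest INTEGER NOT NULL CHECK (is_protest IN (0, 1))
    )
    """
)

# Made-up annotations: two coders label the same article, one labels another.
rows = [
    ("nyt-0001", "coder_a", 1),
    ("nyt-0001", "coder_b", 1),
    ("nyt-0002", "coder_a", 0),
]
with conn:  # commits on success, rolls back on error
    conn.executemany(
        "INSERT INTO annotation (article_id, coder, is_protest) VALUES (?, ?, ?)",
        rows,
    )

# Articles that at least one coder marked as describing a protest.
protest_articles = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT article_id FROM annotation "
        "WHERE is_protest = 1 ORDER BY article_id"
    )
]
```

Compared with loose XML files, a single database file gives the training interface one place to read and write labels, and plain SQL like this keeps a later move to another relational database manageable.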

Current setup
  Document storage (>10 million) in Apache Solr
  Quick searching, index built upon insertion
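
The idea behind "index built upon insertion" can be sketched with a toy inverted index. This is not how Solr is implemented and does not use its API; the class `TinyIndex` and the sample documents are made up purely to show why lookups stay fast once each document is indexed as it arrives.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps each term to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Index at insertion time, so no separate indexing pass is needed later.
        for term in set(re.findall(r"[a-z0-9]+", text.lower())):
            self.postings[term].add(doc_id)

    def search(self, query):
        # Return ids of documents containing every query term (AND semantics).
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = TinyIndex()
index.add(1, "Anti-nuclear rally held in Nevada")
index.add(2, "City council debates budget")
index.add(3, "Nevada rally draws thousands")
hits = index.search("nevada rally")
```

A search only touches the posting sets for the query terms, so it does not need to rescan every stored document.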

Part 3: Computational Social Science Education

The problem
  Social scientists use SPSS or Stata
  But not R, Python, Hadoop, or the command-line interface
  Teaching literacy to both new and veteran scholars

Old tasks
  Data munging
  Regression
  Graphing
New tasks
  Data munging
  Web scraping
  Large-scale networks
  Automated text analysis

Pedagogical approach
  Meet people where they are at
  How can you get them involved in a meaningful way?
  Provide a lab setting for working through problems
  Guide people along with hands-on workshops
  Make code and instructions integrated and available on the web
    GitHub, RMarkdown, IPython/Project Jupyter notebooks

Example 1: Introduction to RStudio
  Goal: Traditional tasks
    Data handling
    Plotting
    Univariate and bivariate analysis
  Audience: Introductory methods course
    Undergraduate sociology students
    Some with Stata experience

RStudio: Using Examples
  Code blocks and interface
  Allowing for a “do-it-yourself” puzzle after initial instructions

Example 2: Blogclub “tworkshops”
  Goal: From zero to Hadoop for social media data
    Basic UNIX terminal, Python
    Various types of analysis
  Audience: Mix of ~10 faculty and PhD students in SJMC
    Labs taking place over the span of a year

Tworkshop syllabus
  1. Twitter API and an introduction to the terminal
  2. More terminal and your first Python script
  3. Basic Python
  4. Python modules and I/O
  5. Hadoop and MapReduce
  6. Basic sentiment analysis
  7. Network analysis
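
To give a flavor of the Hadoop and MapReduce session in the syllabus above, here is a minimal pure-Python sketch of the map, shuffle, and reduce steps for counting hashtags. It runs locally and does not use Hadoop; the function names and the sample tweets are made up, and this is not the actual workshop material.

```python
import re
from collections import defaultdict


def mapper(tweet_text):
    """Map step: emit a (hashtag, 1) pair for each hashtag in one tweet."""
    for tag in re.findall(r"#(\w+)", tweet_text.lower()):
        yield tag, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce step: sum the counts for one hashtag."""
    return key, sum(values)


# Made-up sample tweets for illustration.
tweets = [
    "Polls are open #wirecall #vote",
    "Long lines downtown #WIrecall",
    "Coffee first #vote",
]
pairs = [pair for text in tweets for pair in mapper(text)]
counts = dict(reducer(key, values) for key, values in shuffle(pairs).items())
```

Because each mapper call looks at one tweet and each reducer call looks at one key, a cluster can run both steps in parallel across many machines, which is what makes the pattern suit very large collections.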

Example 3: Data Science and Social Science @ NYU
  Goal: Data Science in R
    Data munging and handling
    Visualization, regression
    Textual analysis and social network analysis
    Web scraping and API access
  Audience: Graduate students
    Have statistical training
    Varying levels of R literacy
    More proficient with Stata

Connecting methods with relevant questions
  Using a CS + educational psychology example: fighting bullying with machine learning

Summary
  Computational social science has vastly increased the amount of data that social scientists use in daily practice
  Handling data at scale takes training, not to mention trial and error

Takeaways
  Play around with different solutions.
  Ask peers and mentors.
  Collaborate.
  Explore how to apply methods / technologies to your current project.
  Don’t be afraid to fail (and fail often!)

Thanks!
@alexhanna // alex-hanna.com
