Data Pipelines and Computational Methods for the Social Sciences

Slide 1

Slide 1 text

Data Pipelines and Computational Methods for the Social Sciences
Alex Hanna | Department of Sociology | UW-Madison

Slide 2

Slide 2 text

Data Pipelines and Computational Methods for the Social Sciences
Alex Hanna, PhD Candidate, Sociology
March 9, 2016
@alexhanna // alex-hanna.com

Slide 3

Slide 3 text

Outline

Slide 4

Slide 4 text

Outline

Slide 5

Slide 5 text

Outline

Slide 6

Slide 6 text

Slides are available at http://tinyurl.com/rds-ahanna. Follow along at home and check out the links!

Slide 7

Slide 7 text

Part 1: Twitter and politics

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Twitter and politics
Understanding the dynamics of elections and Twitter activity

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

Wisconsin Twitter Collection
- Started in early 2012
- Currently 40+ TB
- Over 50 billion tweets

Slide 12

Slide 12 text

Olden times

Slide 13

Slide 13 text

Olden times
Storage
- Hardware: PC
- Software: Compressed (gzip) JSON
- Host: Social Science Computing Cooperative
Collection
- Home-rolled Perl
Slow to process, search, and index
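To illustrate why this setup was slow to search: with tweets stored as gzip-compressed JSON, every query has to decompress and parse the whole file from start to finish. The sketch below shows that full scan in Python rather than the Perl used at the time. It assumes one JSON object per line and a `text` field on each tweet; the function names and the sample data are hypothetical.

```python
import gzip
import json
import os
import tempfile


def count_matching_tweets(path, keyword):
    """Scan a gzip-compressed file of JSON lines and count tweets whose text mentions keyword."""
    keyword = keyword.lower()
    matches = 0
    # "rt" opens the compressed stream in text mode, decompressing as we read.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                tweet = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip truncated or malformed records
            if keyword in tweet.get("text", "").lower():
                matches += 1
    return matches


def write_sample_file(path):
    """Write three made-up tweets so the example runs without real data."""
    sample = [
        {"id": 1, "text": "Recall election rally in Madison"},
        {"id": 2, "text": "Lunch by the lake"},
        {"id": 3, "text": "Polls close soon for the ELECTION"},
    ]
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for tweet in sample:
            handle.write(json.dumps(tweet) + "\n")


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "tweets.json.gz")
    write_sample_file(demo_path)
    print(count_matching_tweets(demo_path, "election"))
```

Because there is no index, the cost of each search grows with the size of the archive, which is the bottleneck the slide describes.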

Slide 14

Slide 14 text

Modern times

Slide 15

Slide 15 text

Modern times
Storage
- Hardware: 7-node Hadoop cluster
- Software: Hive, compressed with Snappy codec
- Host: Computer Systems Lab
Collection
- Twitter Hosebird client
Quick processing, search, indexing

Slide 16

Slide 16 text

Part 2: Protest event data

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

Date: February 6, 1987
Location: Mercury, Nevada
Issue: Peace (anti-nuclear)
Form: Rally
Target: US government
Size: 2,000
Orgs: Greenpeace et al.
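One way to picture the event record on this slide is as a small typed structure with one attribute per coded field. The sketch below is only an illustration, not the schema used by the project: the class name `ProtestEvent` and its attribute names are hypothetical, and only the values come from the slide.

```python
from dataclasses import dataclass, field, asdict
from datetime import date
from typing import List


@dataclass
class ProtestEvent:
    """One coded protest event; field names are illustrative assumptions."""
    event_date: date
    location: str
    issue: str
    form: str
    target: str
    size: int
    organizations: List[str] = field(default_factory=list)


nevada_rally = ProtestEvent(
    event_date=date(1987, 2, 6),
    location="Mercury, Nevada",
    issue="Peace (anti-nuclear)",
    form="Rally",
    target="US government",
    size=2000,
    organizations=["Greenpeace"],  # the slide lists "Greenpeace et al."
)

# asdict() turns the record into a plain dictionary, convenient for JSON or CSV export.
print(asdict(nevada_rally))
```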

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

Machine-learning Protest Event Data System (MPEDS)
A system for generating new protest event data with minimal human intervention, using tools from natural language processing and machine learning

Slide 22

Slide 22 text

Original setup
- Dynamics of Collective Action as training data
- New York Times Annotated Corpus, XML files
- Problem: managing and accessing files is slow and messy

Slide 23

Slide 23 text

Current setup
- Training interface on SSCC VM
- Stored in SQLite (moving to MySQL)
- Being developed currently (GitHub)
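As a rough sketch of what storing training annotations in SQLite can look like, the example below uses Python's built-in `sqlite3` module with an in-memory database. The table layout, column names, and helper functions are assumptions made for illustration; the slides do not describe the actual schema.

```python
import sqlite3


def create_store(connection):
    """Create a hypothetical annotations table if it does not exist yet."""
    connection.execute(
        """
        CREATE TABLE IF NOT EXISTS annotations (
            id INTEGER PRIMARY KEY,
            article_id TEXT NOT NULL,
            coder TEXT NOT NULL,
            is_protest INTEGER NOT NULL
        )
        """
    )


def add_annotation(connection, article_id, coder, is_protest):
    """Insert one coder decision, using placeholders to avoid SQL injection."""
    connection.execute(
        "INSERT INTO annotations (article_id, coder, is_protest) VALUES (?, ?, ?)",
        (article_id, coder, int(is_protest)),
    )
    connection.commit()


def protest_share(connection):
    """Return the fraction of annotations marked as protest (None if the table is empty)."""
    row = connection.execute("SELECT AVG(is_protest) FROM annotations").fetchone()
    return row[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_store(conn)
    add_annotation(conn, "article-001", "coder_a", True)
    add_annotation(conn, "article-002", "coder_a", False)
    print(protest_share(conn))
```

Keeping the SQL this plain is one reason a later move from SQLite to MySQL tends to be manageable, though the placeholder syntax and connection setup would differ.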

Slide 24

Slide 24 text

Current setup
- Document storage (>10 million) in Apache Solr
- Quick searching, index built upon insertion
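The idea of an index built upon insertion can be shown with a toy inverted index: each document is tokenized as it is added, so a later search only looks up terms instead of rescanning every document. This is a simplified sketch of the general technique, not Solr's API or implementation; the class `TinyIndex` and its methods are hypothetical.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index mapping each term to the set of documents containing it."""

    def __init__(self):
        self.documents = {}
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Store the document and update the index at insertion time."""
        self.documents[doc_id] = text
        for token in text.lower().split():
            term = token.strip(".,;:!?\"'()")
            if term:
                self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every term in the query."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = None
        for term in terms:
            ids = self.postings.get(term, set())
            result = set(ids) if result is None else result & ids
        return result


if __name__ == "__main__":
    index = TinyIndex()
    index.add(1, "Rally against nuclear testing in Nevada.")
    index.add(2, "City council debates budget")
    index.add(3, "Nuclear plant protest draws rally")
    print(sorted(index.search("nuclear rally")))
```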

Slide 25

Slide 25 text

Part 3: Computational Social Science Education

Slide 26

Slide 26 text

The problem
- Social scientists use SPSS or STATA
- But not R, Python, Hadoop, or the command-line interface
- Teaching literacy to both new and veteran scholars

Slide 27

Slide 27 text

Old tasks
- Data munging
- Regression
- Graphing

Slide 28

Slide 28 text

Old tasks
- Data munging
- Regression
- Graphing
New tasks
- Data munging
- Web scraping
- Large-scale networks
- Automated text analysis

Slide 29

Slide 29 text

Pedagogical approach
- Meet people where they are
- How can you get them involved in a meaningful way?

Slide 30

Slide 30 text

Slide 31

Slide 31 text

Pedagogical approach
- Meet people where they are
- How can you get them involved in a meaningful way?
- Provide a lab setting for working through problems
- Guide people along with hands-on workshops
- Make code and instructions integrated and available on the web
  - GitHub, RMarkdown, IPython/Project Jupyter notebooks

Slide 32

Slide 32 text

Example 1: Introduction to RStudio
Goal: Traditional tasks
- Data handling
- Plotting
- Univariate and bivariate analysis
Audience: Introductory methods course
- Undergraduate sociology students
- Some with STATA experience

Slide 33

Slide 33 text

RStudio: Using Examples
- Code blocks and interface
- Allowing for a “do-it-yourself” puzzle after initial instructions

Slide 34

Slide 34 text

Example 2: Blogclub “tworkshops”
Goal: From zero to Hadoop for social media data
- Basic UNIX terminal, Python
- Various types of analysis
Audience: Mix of ~10 faculty and PhD students in SJMC
- Labs taking place over the span of a year

Slide 35

Slide 35 text

Tworkshop syllabus
1. Twitter API and an introduction to the terminal
2. More terminal and your first Python script
3. Basic Python
4. Python modules and I/O
5. Hadoop and MapReduce
6. Basic sentiment analysis
7. Network analysis
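Session 5 of the syllabus covers Hadoop and MapReduce. The map, shuffle, and reduce steps behind that model can be sketched in plain Python without a cluster. The word count below only illustrates the pattern and is not the workshop's actual material; the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(tweets):
    """Map: emit a (word, 1) pair for every word in every tweet."""
    for tweet in tweets:
        for word in tweet.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values for each key to get a count per word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(tweets):
    return reduce_phase(shuffle_phase(map_phase(tweets)))


if __name__ == "__main__":
    print(word_count(["vote today", "Vote early vote often"]))
```

On a real cluster the map and reduce functions run in parallel across nodes and the shuffle moves data between them; here all three steps run in a single process.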

Slide 36

Slide 36 text

Example 3: Data Science and Social Science @ NYU
Goal: Data Science in R
- Data munging and handling
- Visualization, regression
- Textual analysis and social network analysis
- Web scraping and API access
Audience: Graduate students
- Have statistical training
- Varying levels of R literacy
- More proficient with STATA

Slide 37

Slide 37 text

Connecting methods with relevant questions
Using CS + education psychology example: fighting bullying with machine learning

Slide 38

Slide 38 text

Summary
- Computational social science has vastly increased the amount of data that social scientists use in daily practice
- Handling data at scale takes training, not to mention trial and error

Slide 39

Slide 39 text

Takeaways
- Play around with different solutions.
- Ask peers and mentors. Collaborate.
- Explore how to apply methods / technologies to your current project.
- Don’t be afraid to fail (and fail often!)

Slide 40

Slide 40 text

Thanks! [email protected] @alexhanna // alex-hanna.com