Data Pipelines and Computational Methods for the Social Sciences

Presentation given as part of the RDS Holz Brown Bag series, March 2016.

Research Data Services

March 21, 2016

Transcript

  1. Data Pipelines and Computational Methods for the Social Sciences | Alex Hanna | Department of Sociology | UW-Madison
  2. Data Pipelines and Computational Methods for the Social Sciences | Alex Hanna | PhD Candidate, Sociology | March 9, 2016 | @alexhanna // alex-hanna.com
  3. Outline

  4. Outline

  5. Outline

  6. Slides are available at http://tinyurl.com/rds-ahanna (follow along at home and check out the links!)
  7. Part 1: Twitter and politics

  8. None
  9. Twitter and politics: understanding the dynamics of elections and Twitter activity

  10. None
  11. Wisconsin Twitter Collection: started in early 2012; currently 40+ TB; over 50 billion tweets
  12. Olden times

  13. Olden times. Storage: hardware: PC; software: compressed (gzip) JSON; host: Social Science Computing Cooperative. Collection: home-rolled Perl. Slow to process, search, and index.
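To illustrate why the gzip-compressed JSON setup on slide 13 is slow to search, here is a minimal sketch of reading such a file. It is written in Python rather than the home-rolled Perl the slide mentions, and it assumes one tweet object per line; the file layout, the function names, and the sample tweets are all hypothetical. The only Twitter-specific detail used is the `entities.hashtags[].text` field of the tweet JSON.

```python
import gzip
import json
import os
import tempfile
from collections import Counter


def iter_tweets(path):
    """Yield one parsed tweet per non-empty line of a gzip-compressed file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_hashtags(path):
    """Count hashtags by scanning the whole file; there is no index to consult."""
    counts = Counter()
    for tweet in iter_tweets(path):
        for tag in tweet.get("entities", {}).get("hashtags", []):
            counts[tag["text"].lower()] += 1
    return counts


def write_sample(path):
    """Write three made-up tweets so the sketch can run without real data."""
    sample = [
        {"text": "Polls are open #Election #vote",
         "entities": {"hashtags": [{"text": "Election"}, {"text": "vote"}]}},
        {"text": "Results tonight #election",
         "entities": {"hashtags": [{"text": "election"}]}},
        {"text": "No tags here", "entities": {"hashtags": []}},
    ]
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for tweet in sample:
            handle.write(json.dumps(tweet) + "\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample_path = os.path.join(tmp, "sample.json.gz")
        write_sample(sample_path)
        print(count_hashtags(sample_path).most_common())
```

Every question requires decompressing and parsing the full file from the start, which is consistent with the "slow to process, search, and index" complaint on the slide.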
  14. Modern times

  15. Modern times. Storage: hardware: 7-node Hadoop cluster; software: Hive, compressed with Snappy codec; host: Computer Systems Lab. Collection: Twitter Hosebird client. Quick processing, search, indexing.
  16. Part 2: Protest event data

  17. None
  18. Date: February 6, 1987. Location: Mercury, Nevada. Issue: Peace (anti-nuclear). Form: Rally. Target: US government. Size: 2,000. Orgs: Greenpeace et al.
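The fields on slide 18 suggest what a single protest event record looks like. As a rough sketch only, the same record could be held in a Python dataclass; the class name and attribute names below are hypothetical and are not taken from MPEDS or from the Dynamics of Collective Action codebook.

```python
from dataclasses import asdict, dataclass, field
from datetime import date
from typing import List


@dataclass
class ProtestEvent:
    """One coded protest event, mirroring the fields shown on the slide."""
    event_date: date
    location: str
    issue: str
    form: str
    target: str
    size: int
    organizations: List[str] = field(default_factory=list)


# The example values come from the slide; "et al." organizations are unknown,
# so only the one named organization is listed.
example = ProtestEvent(
    event_date=date(1987, 2, 6),
    location="Mercury, Nevada",
    issue="Peace (anti-nuclear)",
    form="Rally",
    target="US government",
    size=2000,
    organizations=["Greenpeace"],
)

if __name__ == "__main__":
    print(asdict(example))
```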
  19. None
  20. None
  21. Machine-learning Protest Event Data System (MPEDS): a system for generating new protest event data with minimal human intervention, using tools from natural language processing and machine learning
  22. Original setup. Dynamics of Collective Action as training data; New York Times Annotated Corpus, XML files. Problem: managing and accessing files is slow and messy.
  23. Current setup. Training interface on SSCC VM; stored in SQLite (moving to MySQL); being developed currently (GitHub).
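Slide 23 says only that the training interface stores its data in SQLite. As an illustration of what that kind of storage might look like, here is a minimal sketch using Python's standard `sqlite3` module. The table name, the columns, and the idea of a yes/no "is this a protest" label are assumptions made for the example, not the actual MPEDS schema.

```python
import sqlite3

# Hypothetical schema: one row per coder decision about one article.
SCHEMA = """
CREATE TABLE IF NOT EXISTS annotation (
    id INTEGER PRIMARY KEY,
    article_id TEXT NOT NULL,
    coder TEXT NOT NULL,
    is_protest INTEGER NOT NULL CHECK (is_protest IN (0, 1))
)
"""


def open_store(path=":memory:"):
    """Open (or create) the annotation store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_annotation(conn, article_id, coder, is_protest):
    """Insert one annotation; the `with` block commits or rolls back."""
    with conn:
        conn.execute(
            "INSERT INTO annotation (article_id, coder, is_protest) "
            "VALUES (?, ?, ?)",
            (article_id, coder, int(is_protest)),
        )


def protest_share(conn):
    """Return the fraction of annotations labeled as protest (None if empty)."""
    return conn.execute("SELECT AVG(is_protest) FROM annotation").fetchone()[0]


if __name__ == "__main__":
    store = open_store()
    add_annotation(store, "nyt-0001", "coder_a", True)
    add_annotation(store, "nyt-0002", "coder_a", False)
    print(protest_share(store))
```

Because the queries are plain SQL with parameter placeholders, a later move to MySQL (as the slide anticipates) would mostly involve swapping the connection library and placeholder style.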
  24. Current setup. Document storage (>10 million) in Apache Solr; quick searching, index built upon insertion.
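The phrase "index built upon insertion" on slide 24 can be illustrated with a toy inverted index. This sketch is plain Python and does not use or imitate the Solr API; the class, its methods, and the sample documents are hypothetical and only show the general idea of updating an index at insert time so that later searches avoid rescanning every document.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps each token to the IDs of documents containing it."""

    def __init__(self):
        self.documents = {}
        self.postings = defaultdict(set)

    @staticmethod
    def tokenize(text):
        """Lowercase the text and split it into alphanumeric tokens."""
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id, text):
        """Store the document and update the index in the same step."""
        self.documents[doc_id] = text
        for token in self.tokenize(text):
            self.postings[token].add(doc_id)

    def search(self, query):
        """Return IDs of documents that contain every token in the query."""
        tokens = self.tokenize(query)
        if not tokens:
            return set()
        result = None
        for token in tokens:
            ids = self.postings.get(token, set())
            result = set(ids) if result is None else result & ids
        return result


if __name__ == "__main__":
    index = TinyIndex()
    index.add("a", "Anti-nuclear rally in Mercury, Nevada")
    index.add("b", "Rally for education funding")
    print(sorted(index.search("nuclear rally")))
```

A search touches only the posting sets for the query tokens, which is the property that makes lookups quick compared with scanning XML files one by one.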
  25. Part 3: Computational Social Science Education

  26. The problem. Social scientists use SPSS or Stata, but not R, Python, Hadoop, or the command-line interface. Teaching literacy to both new and veteran scholars.
  27. Old tasks: data munging, regression, graphing

  28. Old tasks: data munging, regression, graphing. New tasks: data munging, web scraping, large-scale networks, automated text analysis.
  29. Pedagogical approach. Meet people where they are at: how can you get them involved in a meaningful way?
  30. Pedagogical approach. Meet people where they are at: how can you get them involved in a meaningful way? Provide a lab setting for working through problems: guide people along with hands-on workshops.
  31. Pedagogical approach. Meet people where they are at: how can you get them involved in a meaningful way? Provide a lab setting for working through problems: guide people along with hands-on workshops. Make code and instructions integrated and available on the web: GitHub, RMarkdown, IPython/Project Jupyter notebooks.
  32. Example 1: Introduction to RStudio. Goal: traditional tasks (data handling, plotting, univariate and bivariate analysis). Audience: introductory methods course; undergraduate sociology students, some with Stata experience.
  33. RStudio: Using Examples. Code blocks and interface; allowing for a “do-it-yourself” puzzle after initial instructions.
  34. Example 2: Blogclub “tworkshops”. Goal: from zero to Hadoop for social media data (basic UNIX terminal, Python, various types of analysis). Audience: mix of ~10 faculty and PhD students in SJMC. Labs taking place over the timespan of a year.
  35. Tworkshop syllabus: (1) Twitter API and an introduction to the terminal; (2) More terminal and your first Python script; (3) Basic Python; (4) Python modules and I/O; (5) Hadoop and MapReduce; (6) Basic sentiment analysis; (7) Network analysis.
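For readers unfamiliar with the "Hadoop and MapReduce" session in the syllabus on slide 35, the following is a small in-memory sketch of the MapReduce pattern applied to hashtag counting. It is not the workshop's material and does not run on Hadoop; the function names and sample tweets are hypothetical, and the three phases are simulated in a single Python process.

```python
from collections import defaultdict


def map_phase(tweet_text):
    """Map step: emit a (hashtag, 1) pair for each hashtag in one tweet."""
    for word in tweet_text.split():
        tag = word.lower().rstrip(".,!?:;")
        if tag.startswith("#") and len(tag) > 1:
            yield tag, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a total."""
    return key, sum(values)


def run_job(tweets):
    """Run map, shuffle, and reduce in sequence over a list of tweet texts."""
    pairs = (pair for text in tweets for pair in map_phase(text))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    sample = [
        "Polls are open! #Election #vote",
        "Results tonight #election.",
        "No tags here",
    ]
    print(run_job(sample))
```

On a real cluster the map and reduce functions run in parallel on different nodes and the framework handles the shuffle; the logic of each function stays essentially the same.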
  36. Example 3: Data Science and Social Science @ NYU. Goal: data science in R (data munging and handling; visualization, regression; textual analysis and social network analysis; web scraping and API access). Audience: graduate students with statistical training and varying levels of R literacy, more proficient with Stata.
  37. Connecting methods with relevant questions. Using a CS + educational psychology example: fighting bullying with machine learning.
  38. Summary. Computational social science has vastly increased the amount of data that social scientists use in daily practice. Handling data at scale takes training, not to mention trial and error.
  39. Takeaways. Play around with different solutions. Ask peers and mentors. Collaborate. Explore how to apply methods / technologies to your current project. Don’t be afraid to fail (and fail often!)
  40. Thanks! ahanna@ssc.wisc.edu @alexhanna // alex-hanna.com