Data Pipelines and Computational Methods for the Social Sciences

Presentation given as part of the RDS Holz Brown Bag series, March 2016.

Research Data Services

March 21, 2016

Transcript

  1. Data Pipelines and Computational Methods for the Social Sciences

    Alex Hanna | Department of Sociology | UW-Madison
  2. Data Pipelines and Computational Methods for the Social Sciences

    Alex Hanna, PhD Candidate, Sociology. March 9, 2016. @alexhanna // alex-hanna.com
  3. Wisconsin Twitter Collection

    Started in early 2012. Currently 40+ TB. Over 50 billion tweets.
  4. Olden times

    Storage: Hardware: PC. Software: Compressed (gzip) JSON. Host: Social Science Computing Cooperative. Collection: Home-rolled Perl. Slow to process, search, and index.
  5. Modern times

    Storage: Hardware: 7-node Hadoop cluster. Software: Hive, compressed with Snappy codec. Host: Computer Systems Lab. Collection: Twitter Hosebird client. Quick processing, search, indexing.
  6. Date: February 6, 1987. Location: Mercury, Nevada. Issue: Peace (anti-nuclear).

    Form: Rally. Target: US government. Size: 2,000. Orgs: Greenpeace et al.
  7. Machine-learning Protest Event Data System (MPEDS)

    A system for generating new protest event data with minimal human intervention using tools from natural language processing and machine learning.
  8. Original setup

    Dynamics of Collective Action as training data. New York Times Annotated Corpus, XML files. Problem: managing and accessing files is slow and messy.
  9. Current setup

    Training interface on SSCC VM. Stored in SQLite (moving to MySQL). Being developed currently (GitHub).
  10. Current setup

    Document storage (>10 million) in Apache Solr. Quick searching, index built upon insertion.
  11. The problem

    Social scientists use SPSS or STATA, but not R, Python, Hadoop, or the command-line interface. Teaching literacy to both new and veteran scholars.
  12. Old tasks and new tasks

    Old tasks: Data munging, regression, graphing. New tasks: Data munging, web scraping, large-scale networks, automated text analysis.
  13. Pedagogical approach

    Meet people where they are at. How can you get them involved in a meaningful way?
  14. Pedagogical approach

    Meet people where they are at. How can you get them involved in a meaningful way? Provide a lab setting for working through problems. Guide people along with hands-on workshops.
  15. Pedagogical approach

    Meet people where they are at. How can you get them involved in a meaningful way? Provide a lab setting for working through problems. Guide people along with hands-on workshops. Make code and instructions integrated and available on the web: GitHub, RMarkdown, IPython/Project Jupyter notebooks.
  16. Example 1: Introduction to RStudio

    Goal: Traditional tasks (data handling, plotting, univariate and bivariate analysis). Audience: Introductory methods course; undergraduate sociology students, some with STATA experience.
  17. RStudio: Using Examples

    Code blocks and interface. Allowing for a “do-it-yourself” puzzle after initial instructions.
  18. Example 2: Blogclub “tworkshops”

    Goal: From zero to Hadoop for social media data (basic UNIX terminal, Python, various types of analysis). Audience: Mix of ~10 faculty and PhD students in SJMC. Labs taking place over a timespan of a year.
  19. Tworkshop syllabus

    1. Twitter API and an introduction to the terminal
    2. More terminal and your first Python script
    3. Basic Python
    4. Python modules and I/O
    5. Hadoop and MapReduce
    6. Basic sentiment analysis
    7. Network analysis
  20. Example 3: Data Science and Social Science @ NYU

    Goal: Data Science in R (data munging and handling; visualization and regression; textual analysis and social network analysis; web scraping and API access). Audience: Graduate students who have statistical training and varying levels of R literacy, and who are more proficient with STATA.
  21. Connecting methods with relevant questions

    Using CS + education psychology example: fighting bullying with machine learning.
  22. Summary

    Computational social science has vastly increased the amount of data that social scientists use in daily practice. Handling data at scale takes training, not to mention trial and error.
  23. Takeaways

    Play around with different solutions. Ask peers and mentors. Collaborate. Explore how to apply methods / technologies to your current project. Don’t be afraid to fail (and fail often!)
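
The slides mention tweets stored as gzip-compressed JSON (slide 4) and a workshop session on Hadoop and MapReduce (slide 19), but they include no code. The following is only a minimal sketch of that map/reduce pattern in plain Python, not the code used in the talk. It assumes one JSON-encoded tweet per line and the `entities.hashtags[].text` layout of Twitter's classic tweet format; the function names and the sample file are hypothetical.

```python
import gzip
import json
import os
import tempfile
from collections import Counter


def map_hashtags(line):
    """Map step: yield (hashtag, 1) for every hashtag in one JSON tweet line."""
    try:
        tweet = json.loads(line)
    except json.JSONDecodeError:
        return  # skip blank or malformed lines
    if not isinstance(tweet, dict):
        return
    # Assumed layout: {"entities": {"hashtags": [{"text": "..."}]}}
    hashtags = (tweet.get("entities") or {}).get("hashtags") or []
    for tag in hashtags:
        yield tag["text"].lower(), 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def count_hashtags(path):
    """Stream a gzip-compressed file of JSON lines through map and reduce."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        pairs = (pair for line in handle for pair in map_hashtags(line))
        return reduce_counts(pairs)


if __name__ == "__main__":
    # Build a tiny, made-up sample file so the sketch runs on its own.
    sample = [
        {"entities": {"hashtags": [{"text": "Protest"}, {"text": "wi"}]}},
        {"entities": {"hashtags": [{"text": "protest"}]}},
    ]
    path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for tweet in sample:
            handle.write(json.dumps(tweet) + "\n")
    print(count_hashtags(path).most_common())
```

Reading the file line by line keeps memory use flat regardless of file size. On a cluster, the same two steps would be distributed across nodes instead of running in a single process.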