A Tour of Large-Scale Data Analysis Tools In Python

Sean O'Connor

May 28, 2016

Transcript

  1. Sarah Guido & Sean O’Connor, PyCon 2016, May 28th, 2016
    A Tour of Large-Scale Data Analysis Tools in Python
  2. Env Setup: Get pre-reqs installed
    • Download and install VirtualBox from http://bit.ly/vboxdl
    • Download and install Vagrant from http://bit.ly/vagrant-dl
  3. About us!
    • Sean O'Connor
      • Director of Application Engineering
      • Has spoken at a number of conferences
      • @theSeanOC
    • Sarah Guido
      • Lead Data Scientist
      • Has also spoken at a number of conferences
      • @sarah_guido
  4. About this tutorial!
    • Basics of large-scale data processing in Python
      • Pure Python
      • Hadoop/MapReduce
      • Spark
      • Final exercise
    • Mostly hands-on!
  5. All about the data! Decode data
    • A "decode" is a click on a Bitly link
    • Lots of fields:
      • h: Bitly user hash identifier
      • g: Bitly global hash identifier
      • a: browser user agent
      • u: long URL
      • t: timestamp (UTC)
      • c: country (two-letter code)
      • nk: repeat client
      • kw: keyword alias for user hash
      • ckw: custom keyword
      • cy: city (optional)
    1usagov_data, 1usagov_data_small, 1usagov_data_tiny
  6. All about the data! Agency data
    • Fields
      • Agency (ex. Library of Congress)
      • City
      • Domain Name (ex. loc.gov)
      • Domain Type (ex. Federal Agency)
      • Global Hash (same as in decodes data)
      • Hostname (ex. www.loc.gov)
      • State
    agency_map
  7. Pure Python, Exercise 1.1: Count Clicks
    • Open the small data file.
    • Parse each line.
    • Count the clicks per link.
    • Output a TSV of results.
    Try on your own for 5 minutes and then we’ll review a solution.
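
One way Exercise 1.1 might be approached is sketched below. It is not the presenters' solution. It assumes each line of the data file is a JSON object (the slides do not state the file format) and that a "link" is identified by the global hash field `g`; the function names `count_clicks` and `to_tsv` and the sample records are made up for illustration.

```python
import json
from collections import Counter


def count_clicks(lines):
    """Count clicks per link, keyed on the global hash field "g" (an assumption)."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip blank or malformed lines
        if not isinstance(record, dict):
            continue
        link = record.get("g")
        if link:
            counts[link] += 1
    return counts


def to_tsv(counts):
    """Render the counts as TSV rows: link<TAB>count, sorted by link."""
    return "\n".join(f"{link}\t{count}" for link, count in sorted(counts.items()))


# Hypothetical sample records standing in for lines of the small data file.
sample = [
    '{"g": "abc123", "c": "US"}',
    '{"g": "abc123", "c": "CA"}',
    '{"g": "xyz789", "c": "US"}',
    "not json",
]
print(to_tsv(count_clicks(sample)))
```

Because `count_clicks` accepts any iterable of strings, an open file object can be passed in directly, so the file is read one line at a time.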
  8. Pure Python, Exercise 1.2: Count Clicks & Countries
    • Build on Exercise 1.1
    • Count clicks by country in addition to link.
    • Output results as a three column TSV: type, value, count
    Try on your own for 5 minutes and then we’ll review a solution.
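
A possible shape for Exercise 1.2, again as a sketch rather than the presenters' solution: a single `Counter` keyed on `(type, value)` tuples keeps links and countries in one structure that maps directly onto the three-column TSV. The JSON-lines input, the use of `g` for the link, the type labels `"link"` and `"country"`, and the function names are all assumptions.

```python
import json
from collections import Counter


def count_links_and_countries(lines):
    """Count clicks per link ("g") and per country ("c") in a single pass."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip blank or malformed lines
        if not isinstance(record, dict):
            continue
        if record.get("g"):
            counts[("link", record["g"])] += 1
        if record.get("c"):
            counts[("country", record["c"])] += 1
    return counts


def to_tsv(counts):
    """Render rows as type<TAB>value<TAB>count, sorted by type and value."""
    return "\n".join(
        f"{kind}\t{value}\t{count}" for (kind, value), count in sorted(counts.items())
    )


# Hypothetical sample records.
sample = [
    '{"g": "abc123", "c": "US"}',
    '{"g": "abc123", "c": "CA"}',
    '{"g": "xyz789", "c": "US"}',
]
print(to_tsv(count_links_and_countries(sample)))
```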
  9. Pure Python, Exercise 1.3: Top Links & Countries
    • Build on Exercise 1.2
    • Sort your results and return only the top 20 links and 20 countries in descending order.
    • Output results as a three column TSV: type, value, count
    Try on your own for 5 minutes and then we’ll review a solution.
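
The sort-and-truncate step of Exercise 1.3 could look roughly like this. The sketch assumes the counts are already held in a mapping keyed on `(type, value)` tuples, which is one possible layout and not something the slides prescribe. `Counter.most_common(n)` returns the `n` largest entries in descending order.

```python
from collections import Counter, defaultdict


def top_n_by_type(counts, n=20):
    """Return (type, value, count) rows: the top n values per type, descending."""
    by_type = defaultdict(Counter)
    for (kind, value), count in counts.items():
        by_type[kind][value] = count
    rows = []
    for kind in sorted(by_type):
        for value, count in by_type[kind].most_common(n):
            rows.append((kind, value, count))
    return rows


# Hypothetical counts, as if produced by an earlier counting step.
counts = Counter({
    ("link", "a"): 5,
    ("link", "b"): 9,
    ("link", "c"): 1,
    ("country", "US"): 7,
    ("country", "CA"): 3,
})
for kind, value, count in top_n_by_type(counts, n=2):
    print(f"{kind}\t{value}\t{count}")
```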
  10. Pure Python, Exercise 1.4: Join With Agencies
    • Build on Exercise 1.3
    • Merge in the agency data
    • Calculate the top 20 agencies, in addition to the top countries and links.
    • Output results as a three column TSV: type, value, count
    Try on your own for 5 minutes and then we’ll review a solution.
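
The join in Exercise 1.4 can be sketched as a dictionary lookup on the global hash, which the agency data shares with the decode data. The sketch below assumes the agency data has already been loaded into a plain dict from global hash to agency name; the on-disk format of `agency_map` is not shown in the slides, so loading it is left out. `count_agencies` and the sample values are hypothetical.

```python
import json
from collections import Counter


def count_agencies(lines, agency_by_hash):
    """Count clicks per agency by looking up each click's global hash ("g")."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip blank or malformed lines
        if not isinstance(record, dict):
            continue
        agency = agency_by_hash.get(record.get("g"))
        if agency is not None:  # clicks with no matching agency are dropped
            counts[agency] += 1
    return counts


# Hypothetical lookup table and sample records.
agency_by_hash = {
    "abc123": "Library of Congress",
    "xyz789": "Department of State",
}
sample = [
    '{"g": "abc123", "c": "US"}',
    '{"g": "abc123", "c": "CA"}',
    '{"g": "xyz789", "c": "US"}',
    '{"g": "nomatch", "c": "US"}',
]
for agency, count in count_agencies(sample, agency_by_hash).most_common(20):
    print(f"agency\t{agency}\t{count}")
```

Dropping unmatched clicks makes this an inner join; counting them under a placeholder agency instead would be a reasonable alternative.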
  11. Pure Python, Exercise 1.5: Filter By Agency
    • Build on Exercise 1.4
    • Filter to clicks on links with an agency of “Department of State”.
    • Calculate the top 20 links and countries within the filter.
    • Output results as a three column TSV: type, value, count
    Try on your own for 5 minutes and then we’ll review a solution.
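
For Exercise 1.5, the filter can be applied before any counting happens, as in this sketch. As before, it assumes JSON-lines input, the `g` and `c` fields for link and country, and an agency lookup dict that has already been built; `top_within_agency` is a hypothetical name.

```python
import json
from collections import Counter


def top_within_agency(lines, agency_by_hash, agency, n=20):
    """Return (top links, top countries) for clicks on links owned by `agency`."""
    links, countries = Counter(), Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip blank or malformed lines
        if not isinstance(record, dict):
            continue
        if agency_by_hash.get(record.get("g")) != agency:
            continue  # the filter: keep only clicks for the requested agency
        links[record["g"]] += 1
        if record.get("c"):
            countries[record["c"]] += 1
    return links.most_common(n), countries.most_common(n)


# Hypothetical lookup table and sample records.
agency_by_hash = {
    "abc123": "Department of State",
    "xyz789": "Library of Congress",
}
sample = [
    '{"g": "abc123", "c": "US"}',
    '{"g": "abc123", "c": "CA"}',
    '{"g": "abc123", "c": "US"}',
    '{"g": "xyz789", "c": "US"}',
]
top_links, top_countries = top_within_agency(sample, agency_by_hash, "Department of State")
for value, count in top_links:
    print(f"link\t{value}\t{count}")
for value, count in top_countries:
    print(f"country\t{value}\t{count}")
```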
  12. Hadoop, Exercise 2.1: Count Clicks
    • Copy the boilerplate.py file into your working file.
    • Your goal is to count clicks by hash.
    • Write a mapper, combiner, and reducer to accomplish this.
    • Run the job on your VM.
    Try on your own for 5 minutes and then we’ll review a solution.
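
The division of labor between mapper, combiner, and reducer can be illustrated in plain Python without a Hadoop cluster. The sketch below is a simulation, not the contents of boilerplate.py and not mrjob code: `run_local` is a made-up stand-in for the framework that maps and combines each chunk of input, groups values by key, and then reduces. It assumes JSON-lines input and the `g` field as the hash.

```python
import json
from collections import defaultdict


def mapper(_, line):
    """Emit (hash, 1) for each click line."""
    try:
        record = json.loads(line)
    except ValueError:
        return  # skip blank or malformed lines
    if isinstance(record, dict) and record.get("g"):
        yield record["g"], 1


def combiner(key, values):
    """Pre-sum counts on the map side to shrink what gets shuffled."""
    yield key, sum(values)


def reducer(key, values):
    """Sum the partial counts for one hash."""
    yield key, sum(values)


def run_local(lines, chunk_size=2):
    """Hypothetical local driver: map and combine per chunk, shuffle, reduce."""
    shuffled = defaultdict(list)
    for start in range(0, len(lines), chunk_size):
        mapped = defaultdict(list)
        for line in lines[start:start + chunk_size]:
            for key, value in mapper(None, line):
                mapped[key].append(value)
        for key, values in mapped.items():
            for out_key, out_value in combiner(key, values):
                shuffled[out_key].append(out_value)
    results = {}
    for key in sorted(shuffled):
        for out_key, out_value in reducer(key, shuffled[key]):
            results[out_key] = out_value
    return results


# Hypothetical sample records.
sample = [
    '{"g": "abc123"}',
    '{"g": "abc123"}',
    '{"g": "xyz789"}',
    '{"g": "abc123"}',
]
print(run_local(sample))
```

The combiner and reducer are identical here because summation is associative and commutative; that is what makes it safe for a framework to run the combiner zero or more times on partial data.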
  13. Hadoop, Exercise 2.2: Count Clicks & Countries
    • Build on Exercise 2.1
    • Count clicks by country in addition to link.
    • Update your mapper, combiner, and reducer to accomplish this.
    • Run the job on your VM.
    Try on your own for 5 minutes and then we’ll review a solution.
  14. Hadoop, Exercise 2.3: Top Links & Countries
    • Build on Exercise 2.2
    • Sort your results and return only the top 20 links and 20 countries in descending order.
    • Hint: you will want to update your reducer to accomplish this.
    • Run the job on your VM.
    Try on your own for 5 minutes and then we’ll review a solution.
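
One common way to get a top-N out of a MapReduce-style job is to re-key the summed counts by type so that a single reducer call sees every value of that type and can keep the largest N. The sketch below shows that idea in plain Python; it may or may not match the approach the hint has in mind. The function names and the `(type, value)` key layout are assumptions, and `top_n` is a made-up local driver standing in for the framework's grouping step.

```python
import heapq
from collections import defaultdict


def rekey_for_top_n(key, count):
    """Re-key a summed count from (type, value) to type, carrying (count, value)."""
    kind, value = key
    yield kind, (count, value)


def top_n_reducer(kind, count_value_pairs, n=20):
    """Keep only the n largest counts for one type, in descending order."""
    for count, value in heapq.nlargest(n, count_value_pairs):
        yield kind, value, count


def top_n(totals, n=20):
    """Hypothetical local driver: group the re-keyed pairs by type, then reduce."""
    grouped = defaultdict(list)
    for key, count in totals.items():
        for kind, pair in rekey_for_top_n(key, count):
            grouped[kind].append(pair)
    rows = []
    for kind in sorted(grouped):
        rows.extend(top_n_reducer(kind, grouped[kind], n))
    return rows


# Hypothetical totals, as if produced by an earlier counting step.
totals = {
    ("link", "a"): 5,
    ("link", "b"): 9,
    ("link", "c"): 1,
    ("country", "US"): 7,
}
for row in top_n(totals, n=2):
    print("\t".join(str(field) for field in row))
```

`heapq.nlargest` avoids sorting the full list when N is small relative to the number of distinct values.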
  15. Hadoop, Exercise 2.4: Join With Agencies
    • Build on Exercise 2.3
    • Merge in the agency data
    • Calculate the top 20 agencies, in addition to the top countries and links.
    • Run the job on your VM.
    Try on your own for 5 minutes and then we’ll review a solution.
  16. Hadoop, Exercise 2.5: Filter By Agency
    • Build on Exercise 2.4
    • Filter to clicks on links with an agency of “Department of State”.
    • Calculate the top 20 links and countries within the filter.
    • Run the job on your VM.
    Try on your own for 5 minutes and then we’ll review a solution.
  17. Spark: The What
    • Large-scale distributed data processing tool
    • SQL and streaming tools
    • Faster than Hadoop
    • Python, Scala, Java, R APIs
  18. Spark: The Why
    • Fast. Really fast.
    • SQL layer – kind of like Hive
    • Distributed scientific tools
    • Python! Sometimes.
    • Cutting edge technology
  19. Spark: The How
    • Partitions your data to operate over in parallel
    • Capability to add map/reduce features
    • Lazy – only computes when an action is called (ex. collect() or writing to a file)
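
The laziness described on this slide can be modeled with a toy class in plain Python. This is an illustration of the idea only, not Spark's implementation or API: `LazyDataset` is a made-up class whose `map` and `filter` merely record the step, and nothing runs until `collect()` is called. Partitioning and parallelism are not modeled.

```python
class LazyDataset:
    """Toy model of lazy evaluation: transformations are recorded, not executed."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)

    def map(self, fn):
        # Record the step and return a new dataset; no work happens here.
        return LazyDataset(self._data, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._data, self._steps + (("filter", fn),))

    def collect(self):
        # The "action": only now are the recorded steps applied to the data.
        items = iter(self._data)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []


def double(x):
    calls.append(x)  # record when the function actually runs
    return x * 2


dataset = LazyDataset([1, 2, 3]).map(double).filter(lambda x: x > 2)
calls_before_collect = list(calls)  # still empty: nothing has executed yet
result = dataset.collect()          # triggers the whole pipeline
print(calls_before_collect, result, calls)
```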
  20. Spark: Spark at Bitly
    • Lots of data
    • Need for large-scale analysis tools
    • Used to process data, build machine learning systems, explore relationships in data
  21. Spark: Basics
    • RDDs
    • DataFrames
    • Launch the interactive PySpark shell by typing "pyspark" in the VM
    • Let's walk through some basics!
  22. Spark, Exercise 3.1: Top Links & Countries
    • What are the top countries per link?
    • What are the top links per country?
    • Looking at 20 is fine (show() will output 20)
    Try on your own for 5 minutes and then we’ll review a solution.
  23. Spark, Exercise 3.2: Filter by Agency
    • What are the top links with the Library of Congress as the agency?
    • What are the top countries with the Department of Education as the agency?
    Try on your own for 5 minutes and then we’ll review a solution.
  24. Exercise 4: More agencies, links, and countries
    • Find the breakdown of agencies for China (CN), Denmark (DE), and France (FR)
    • Find the most popular links for 3 agencies of your choice
    • How do they differ between countries? Are there any similarities?
    Try on your own for 15 minutes and then we’ll review a solution.
  25. Resources: Check these out to learn more!
    • https://github.com/SeanOC/pycon-2016-tutorial-vm
    • https://github.com/sarguido/large-scale-data-analysis
    • https://www.python.org/
    • https://pythonhosted.org/mrjob/
    • https://spark.apache.org/
      • https://databricks.com/blog
      • http://blog.cloudera.com/blog/category/spark/
  26. THANK YOU
    Sarah Guido, Lead Data Scientist, @sarah_guido
    Sean O'Connor, Director of Application Engineering, @theSeanOC