
A Tour of Large-Scale Data Analysis Tools In Python


Sean O'Connor

May 28, 2016



Transcript

  1. Sarah Guido & Sean O'Connor, PyCon 2016, May 28th, 2016
     A TOUR OF LARGE-SCALE DATA ANALYSIS TOOLS IN PYTHON
  2. Env Setup: get pre-reqs installed
     • Download and install VirtualBox from http://bit.ly/vboxdl
     • Download and install Vagrant from http://bit.ly/vagrant-dl
  3. About us!
     • Sean O'Connor: Director of Application Engineering; has spoken at a number of conferences; @theSeanOC
     • Sarah Guido: Lead Data Scientist; has also spoken at a number of conferences; @sarah_guido
  4. About this tutorial!
     • Basics of large-scale data processing in Python
       • Pure Python
       • Hadoop/MapReduce
       • Spark
       • Final exercise
     • Mostly hands-on!
  5. All about the data! Decode data (1usagov_data, 1usagov_data_small, 1usagov_data_tiny)
     • A "decode" is a click on a Bitly link
     • Lots of fields:
       • h: Bitly user hash identifier
       • g: Bitly global hash identifier
       • a: browser user agent
       • u: long URL
       • t: timestamp (UTC)
       • c: country (two-letter code)
       • nk: repeat client
       • kw: keyword alias for user hash
       • ckw: custom keyword
       • cy: city (optional)
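To make the fields concrete, here is what one decode might look like, assuming the data files are one JSON object per line (the field values below are made up for illustration; only the field names come from the slide above):

```python
import json

# A hypothetical decode record using the fields described above.
line = ('{"h": "usrHash1", "g": "glbHash1", '
        '"u": "http://www.loc.gov/index.html", '
        '"t": 1464393600, "c": "US", "nk": 1, "a": "Mozilla/5.0"}')

record = json.loads(line)
print(record["g"])  # the global hash identifies the link
print(record["c"])  # two-letter country code
```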
  6. All about the data! Agency data (agency_map)
     • Fields:
       • Agency (ex. Library of Congress)
       • City
       • Domain Name (ex. loc.gov)
       • Domain Type (ex. Federal Agency)
       • Global Hash (same as in decodes data)
       • Hostname (ex. www.loc.gov)
       • State
  7. Pure Python: Count Clicks (EXERCISE 1.1)
     • Open the small data file.
     • Parse each line.
     • Count the clicks per link.
     • Output a TSV of results.
     Try on your own for 5 minutes and then we'll review a solution.
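One minimal way to approach this in pure Python, assuming the decode files are JSON lines and the global hash `g` identifies the link (the sample lines below are made up):

```python
import json
from collections import Counter

def count_clicks(lines):
    """Count clicks per link (global hash) from an iterable of JSON lines."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        g = record.get("g")  # global hash identifies the link
        if g:
            counts[g] += 1
    return counts

# Output a TSV of results.
lines = ['{"g": "abc"}', '{"g": "abc"}', '{"g": "xyz"}']
for link, count in count_clicks(lines).items():
    print("%s\t%d" % (link, count))
```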
  8. Pure Python: Count Clicks & Countries (EXERCISE 1.2)
     • Build on Exercise 1.1.
     • Count clicks by country in addition to link.
     • Output results as a three-column TSV: type, value, count
     Try on your own for 5 minutes and then we'll review a solution.
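A sketch of one way to extend the previous count, keying the counter on a (type, value) pair so links and countries share one table (field names from the data slide; sample lines made up):

```python
import json
from collections import Counter

def count_clicks_and_countries(lines):
    """Count clicks per link and per country; keys are (type, value) pairs."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        if record.get("g"):
            counts[("link", record["g"])] += 1
        if record.get("c"):
            counts[("country", record["c"])] += 1
    return counts

lines = ['{"g": "abc", "c": "US"}', '{"g": "abc", "c": "FR"}']
for (kind, value), count in count_clicks_and_countries(lines).items():
    print("%s\t%s\t%d" % (kind, value, count))  # three-column TSV
```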
  9. Pure Python: Top Links & Countries (EXERCISE 1.3)
     • Build on Exercise 1.2.
     • Sort your results and return only the top 20 links and 20 countries in descending order.
     • Output results as a three-column TSV: type, value, count
     Try on your own for 5 minutes and then we'll review a solution.
  10. Pure Python: Join With Agencies (EXERCISE 1.4)
     • Build on Exercise 1.3.
     • Merge in the agency data.
     • Calculate the top 20 agencies, in addition to the top countries and links.
     • Output results as a three-column TSV: type, value, count
     Try on your own for 5 minutes and then we'll review a solution.
  11. Pure Python: Filter By Agency (EXERCISE 1.5)
     • Build on Exercise 1.4.
     • Filter to clicks on links with an agency of "Department of State".
     • Calculate the top 20 links and countries within the filter.
     • Output results as a three-column TSV: type, value, count
     Try on your own for 5 minutes and then we'll review a solution.
  12. Hadoop: Count Clicks (EXERCISE 2.1)
     • Copy the boilerplate.py file into your working file.
     • Your goal is to count clicks by hash.
     • Write a mapper, combiner, and reducer to accomplish this.
     • Run the job on your VM.
     Try on your own for 5 minutes and then we'll review a solution.
  13. Hadoop: Count Clicks & Countries (EXERCISE 2.2)
     • Build on Exercise 2.1.
     • Count clicks by country in addition to link.
     • Update your mapper, combiner, and reducer to accomplish this.
     • Run the job on your VM.
     Try on your own for 5 minutes and then we'll review a solution.
  14. Hadoop: Top Links & Countries (EXERCISE 2.3)
     • Build on Exercise 2.2.
     • Sort your results and return only the top 20 links and 20 countries in descending order.
     • Hint: you will want to update your reducer to accomplish this.
     • Run the job on your VM.
     Try on your own for 5 minutes and then we'll review a solution.
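One way a reducer-side step can select a top 20 once all counts are in hand is `heapq.nlargest`, which avoids sorting the full set (a sketch of the selection idea only, not mrjob-specific code):

```python
import heapq

def top_n(counts, n=20):
    """Select the n largest (value, count) pairs, descending by count."""
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])

counts = {"abc": 5, "xyz": 9, "def": 2}
print(top_n(counts, 2))
```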
  15. Hadoop: Join With Agencies (EXERCISE 2.4)
     • Build on Exercise 2.3.
     • Merge in the agency data.
     • Calculate the top 20 agencies, in addition to the top countries and links.
     • Run the job on your VM.
     Try on your own for 5 minutes and then we'll review a solution.
  16. Hadoop: Filter By Agency (EXERCISE 2.5)
     • Build on Exercise 2.4.
     • Filter to clicks on links with an agency of "Department of State".
     • Calculate the top 20 links and countries within the filter.
     • Run the job on your VM.
     Try on your own for 5 minutes and then we'll review a solution.
  17. Spark: The What
     • Large-scale distributed data processing tool
     • SQL and streaming tools
     • Faster than Hadoop
     • Python, Scala, Java, and R APIs
  18. Spark: The Why
     • Fast. Really fast.
     • SQL layer (kind of like Hive)
     • Distributed scientific tools
     • Python! Sometimes.
     • Cutting-edge technology
  19. Spark: The How
     • Partitions your data to operate over in parallel
     • Capability to add map/reduce features
     • Lazy: only operates when an action is called (ex. collect() or writing to a file)
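Spark's laziness resembles Python generators: transformations only build up a recipe, and nothing executes until an action consumes it. A plain-Python analogy (not Spark code):

```python
def clicks():
    for country in ["US", "FR", "US"]:
        print("processing", country)  # side effect shows when work happens
        yield country

# Like a Spark transformation: building the pipeline runs nothing yet.
pipeline = (c.lower() for c in clicks())

# Like an action such as collect(): consuming the pipeline triggers the work.
result = list(pipeline)
print(result)
```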
  20. Spark: Spark at Bitly
     • Lots of data
     • Need for large-scale analysis tools
     • Used to process data, build machine learning systems, and explore relationships in data
  21. Spark Basics
     • RDDs
     • DataFrames
     • Launch the interactive PySpark shell by typing "pyspark" in the VM
     • Let's walk through some basics!
  22. Spark: Top Links & Countries (EXERCISE 3.1)
     • What are the top countries per link?
     • What are the top links per country?
     • Looking at 20 is fine (show() outputs 20 rows by default)
     Try on your own for 5 minutes and then we'll review a solution.
  23. Spark: Filter by Agency (EXERCISE 3.2)
     • What are the top links with the Library of Congress as the agency?
     • What are the top countries with the Department of Education as the agency?
     Try on your own for 5 minutes and then we'll review a solution.
  24. More Agencies, Links, and Countries (EXERCISE 4)
     • Find the breakdown of agencies for China (CN), Germany (DE), and France (FR)
     • Find the most popular links for 3 agencies of your choice
     • How do they differ between countries? Are there any similarities?
     Try on your own for 15 minutes and then we'll review a solution.
  25. Resources: check these out to learn more!
     • https://github.com/SeanOC/pycon-2016-tutorial-vm
     • https://github.com/sarguido/large-scale-data-analysis
     • https://www.python.org/
     • https://pythonhosted.org/mrjob/
     • https://spark.apache.org/
     • https://databricks.com/blog
     • http://blog.cloudera.com/blog/category/spark/
  26. THANK YOU
     Sarah Guido, Lead Data Scientist, @sarah_guido
     Sean O'Connor, Director of Application Engineering, @theSeanOC