
"Intro to Apache Spark" by Maria Rosario Mestre at PyDataLondon 2014 August Meetup

Talk given by Maria for the PyDataLondon Meetup: http://ianozsvald.com/2014/08/08/pydatalondon-3rd-event/


ianozsvald

August 13, 2014

Transcript

  1. Introduction to Spark: Spark for real beginners

  2. What is Spark?
     • a framework to run applications over clusters of computers
     Why Spark?
     • better memory use than Hadoop (https://amplab.cs.berkeley.edu/benchmark/)
     • good integration with Hadoop
     • Python is supported ⇒ PySpark is Spark’s Python interactive shell
  3. Cluster overview [diagram: a master node running the “application”, with slave nodes]

  4. Example of a job [diagram: a resilient distributed dataset (RDD)]

  5. A typical job on Spark using EC2 + Python
     1/ launch a cluster using the spark-ec2 script:
        ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>
     2a/ launch the application on the cluster using spark-submit, e.g.
        ./bin/spark-submit code.py <application-arguments>
     2b/ or use the interactive shell, e.g.
        ./bin/pyspark --py-files code.py
     ⇒ a SparkContext object is created to “tell Spark how to access a cluster”
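
     The SparkContext mentioned here is created automatically as sc in the pyspark shell (2b), but a script run with spark-submit (2a) has to build its own. A minimal sketch of what such a standalone code.py could look like; the application name is just a placeholder, not something from the deck:

        # code.py: hypothetical standalone PySpark application
        from pyspark import SparkConf, SparkContext

        if __name__ == "__main__":
            conf = SparkConf().setAppName("my_app")      # placeholder application name
            sc = SparkContext(conf=conf)                 # "tells Spark how to access a cluster"
            data = sc.parallelize(range(100))            # tiny RDD just to exercise the context
            print(data.sum())                            # an action forces the computation
            sc.stop()
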
  6. Python adds another level of abstraction

  7. Simple example: word count
     file = sc.textFile('s3n://bucket/text')
     counts = file.map(lambda x: x.replace(',', ' ')
                                  .replace('.', ' ')
                                  .replace('-', ' ')
                                  .lower()) \
                  .flatMap(lambda x: x.split()) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(lambda x, y: x + y)
     input:   ['So wise so young, they say, do never live long.']
     map:     ['so wise so young they say do never live long']
     flatMap: ['so', 'wise', 'so', 'young', 'they', 'say', 'do', 'never', 'live', 'long']
     map:     [('so', 1), ('wise', 1), ('so', 1), ('young', 1), ('they', 1), …, ('long', 1)]
     output:  [('so', 2), ('wise', 1), ('young', 1), ('they', 1), …, ('long', 1)]
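
     The transformations above are lazy, so the “output” line only materialises once an action runs. A local-mode sketch of the same pipeline end to end; the input path is a placeholder standing in for the s3n:// location on the slide:

        # Hedged, runnable version of the word count; "input.txt" is a placeholder path.
        from operator import add
        from pyspark import SparkContext

        sc = SparkContext("local[2]", "wordcount_demo")   # local mode, 2 worker threads
        lines = sc.textFile("input.txt")
        counts = (lines.map(lambda x: x.replace(',', ' ').replace('.', ' ').replace('-', ' ').lower())
                       .flatMap(lambda x: x.split())
                       .map(lambda x: (x, 1))
                       .reduceByKey(add))
        print(counts.collect())                           # collect() is the action that triggers the job
        sc.stop()
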
  8. Use case (1): filtering large datasets
     [{"cookie_id": cookie1, "date": "05/08/2014", "page_visited": url1, "country": "UK"},
      {"cookie_id": cookie2, "date": "06/08/2014", "page_visited": url2, "country": "UK"},
      {"cookie_id": cookie3, "date": "05/08/2014", "page_visited": url3, "country": "US"},
      …
      {"cookie_id": cookie2, "date": "06/08/2014", "page_visited": url1, "country": "US"}]
     Which users have been on url1?
  9. Code:
     result = sc.textFile('s3n://very_large_dataset')\
                .filter(lambda x: x['page_visited'] == 'url1')\
                .map(lambda x: (x['cookie_id'], 1))\
                .reduceByKey(operator.add)
     Output: [(cookie13, 12), (cookie2, 5), (cookie8, 1), …]
     Problem of data sparsity:
     • number of initial partitions = 150,000, original size ≈ 3 TB
     • output is only 16 GB, so ideally ~1,500 partitions
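
     As written, sc.textFile yields plain strings, so in practice each line has to be parsed before fields like 'page_visited' can be looked up. A sketch under the assumption that every line is a JSON object shaped like the records on slide 8; the bucket path is taken from the slide, everything else is illustrative:

        # Sketch only: assumes JSON-encoded log lines with the fields shown on slide 8.
        import json
        import operator
        from pyspark import SparkContext

        sc = SparkContext(appName="filter_large_dataset")   # placeholder app name
        result = (sc.textFile("s3n://very_large_dataset")
                    .map(json.loads)                         # parse each line into a dict
                    .filter(lambda x: x["page_visited"] == "url1")
                    .map(lambda x: (x["cookie_id"], 1))
                    .reduceByKey(operator.add))
        print(result.take(10))                               # e.g. [(cookie13, 12), (cookie2, 5), …]
        sc.stop()
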
  10. [diagram: Map – Filter – Reduce – Shuffle, with labels “one partition”, “empty partition” and “matching record”]
  11. [diagram: Map – Filter – Reduce – Shuffle – Coalesce to one partition]
  12. result = sc.textFile('s3://very_large_file')\
                 .filter(lambda x: x['page_visited'] == 'url1')\
                 .map(lambda x: (x['cookie_id'], 1))\
                 .coalesce(1500)\
                 .reduceByKey(operator.add)
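
      coalesce(1500) merges the existing partitions down without a full shuffle of the data, which is why it suits this “many nearly empty partitions” situation. A toy illustration of the effect, checked with getNumPartitions() (recent PySpark); all numbers here are arbitrary, not from the talk:

        # Toy demonstration of coalesce(); sizes and counts are made up for illustration.
        from pyspark import SparkContext

        sc = SparkContext("local[4]", "coalesce_demo")
        rdd = sc.parallelize(range(1000), numSlices=200)   # start with 200 small partitions
        print(rdd.getNumPartitions())                      # -> 200
        sparse = rdd.filter(lambda x: x % 97 == 0)         # most partitions are now empty
        smaller = sparse.coalesce(4)                       # merge into 4 fuller partitions, no full shuffle
        print(smaller.getNumPartitions())                  # -> 4
        print(smaller.collect())                           # [0, 97, 194, ...]
        sc.stop()
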
  13. Use case (2): filtering by comparing to a small array
      Objective: we would like to filter log events depending on whether the event is a member of a group.
      [{"cookie_id": cookie1, "date": "05/08/2014", "page_visited": url1, "country": "UK"},
       {"cookie_id": cookie2, "date": "06/08/2014", "page_visited": url2, "country": "UK"},
       {"cookie_id": cookie3, "date": "05/08/2014", "page_visited": url3, "country": "US"},
       {"cookie_id": cookie5, "date": "06/08/2014", "page_visited": url1, "country": "US"},
       {"cookie_id": cookie2, "date": "05/08/2014", "page_visited": url4, "country": "UK"}]
      Which users have been on [url1, url3, url5]?
  14. If the array of urls can be held in memory (~1 GB)
      ⇒ broadcast the array to all the machines
      ⇒ do a simple comparison with the keyword “in”
      urls = set([url1, url3, url5])
      broadcastUrls = sc.broadcast(urls)
      input = sc.textFile('s3://very_large_file')
      filtered_events = input.filter(lambda x: x['page_visited'] in broadcastUrls.value)
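
      A sketch of the full broadcast pattern, again assuming JSON-encoded log lines; the url values are illustrative strings and the app name is a placeholder:

        # Sketch of the broadcast-variable filter; field names follow the sample records on slide 13.
        import json
        from pyspark import SparkContext

        sc = SparkContext(appName="broadcast_filter")                 # placeholder app name
        urls = set(["url1", "url3", "url5"])                          # small enough to hold in memory
        broadcastUrls = sc.broadcast(urls)                            # shipped once to every worker

        events = sc.textFile("s3://very_large_file").map(json.loads)  # parse each JSON log line
        filtered_events = events.filter(lambda x: x["page_visited"] in broadcastUrls.value)
        print(filtered_events.map(lambda x: x["cookie_id"]).distinct().take(10))
        sc.stop()
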
  15. Use case (3): scraping pages
      Objective: scrape pages for content categorisation over multiple machines, using the scrapy library in Python.
      1/ input is a list of urls stored in S3
      2/ urls are randomly distributed to machines in the cluster
      3/ each machine runs an independent scrapy job with the same settings
      4/ output is written back to S3 as a json file
  16. ~/spark/bin/spark-submit \
        --py-files scraper/pubCrawler.tar.gz,scraper/scraper_functions.py \
        ~/scraper/scraper.py \
        --bucket s3://bucket_name/crawler/output \
        --input s3n://bucket_name/crawler/list_urls \
        --sizepart 1000
      Up to 6 scraped urls per machine per second!
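
      The deck does not show scraper.py itself; the skeleton below is a hypothetical reconstruction of the distribution pattern it describes (read the url list from S3, spread it over the cluster, run one independent scrape per partition, write JSON back to S3). The scrape_partition helper, the plain urllib fetch standing in for the scrapy crawl, and the partition count are all stand-ins, not code from the talk:

        # Hypothetical skeleton of a distributed scraping driver; not the scraper.py from the talk.
        import json
        from urllib.request import urlopen
        from pyspark import SparkContext

        def scrape_partition(urls):
            # Stand-in for the per-machine scrapy job: fetch each url and emit a JSON record.
            for url in urls:
                try:
                    body = urlopen(url, timeout=10).read()
                    yield json.dumps({"url": url, "length": len(body)})
                except Exception as exc:
                    yield json.dumps({"url": url, "error": str(exc)})

        sc = SparkContext(appName="scraper_demo")                      # placeholder app name
        url_list = sc.textFile("s3n://bucket_name/crawler/list_urls")  # assumed format: one url per line
        results = (url_list.repartition(100)                           # spread urls across the cluster (count is arbitrary)
                           .mapPartitions(scrape_partition))           # one independent scrape per partition
        results.saveAsTextFile("s3://bucket_name/crawler/output")      # JSON lines written back to S3
        sc.stop()
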
  17. My experience: some insights (pros)
      1/ easy code for the manipulation of distributed data structures (it tries to hide implementation details from you)
      2/ for repeated queries, it is very fast
      3/ easy to integrate your own Python functions into the RDD operations
      4/ active community
  18. My experience: some insights (cons)
      1/ very new software: working with the latest commits, sometimes with GitHub repositories from individuals (not the main apache/spark repo)
         - a lot of time spent fixing bugs (reading large files from S3, a web UI bug, a job hanging, …)
         - hard to learn for someone without an engineering background
      2/ Python is always second to see new features
      3/ when an error is bad, the Spark context closes down and all data is lost
      4/ some features still not figured out: persistent-hdfs, checkpointing data
  19. Helpful resources
      • Spark main documentation
      • PySpark API
      • looking at the Python code for RDD operations
      • Spark mailing list
  20. Conclusion
      • Spark has a bit of a learning curve
      • maybe not ideal for a data scientist without any software engineering experience
      • very fast development, active community
      @ Skimlinks: currently recruiting for data scientists and engineers who want to use Spark! If interested, please contact me at maria@skimlinks.com