Slide 1

Introduction to Spark
Spark for real beginners

Slide 2

What is Spark?
• a framework to run applications over clusters of computers
Why Spark?
• better memory use than Hadoop (https://amplab.cs.berkeley.edu/benchmark/)
• good integration with Hadoop
• Python is supported: PySpark is Spark's Python interactive shell
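As a quick taste, a minimal interactive session might look like the sketch below; the PySpark shell starts with a ready-made SparkContext bound to the name sc, so a one-line distributed computation is possible:

$ ./bin/pyspark
>>> sc.parallelize(range(10)).map(lambda x: x * x).sum()   # sum of squares, computed across the workers
285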

Slide 3

[Diagram] Cluster overview: a master node running the "application" and the slave nodes.

Slide 4

[Diagram] Example of a job on a Resilient Distributed Dataset (RDD).
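In code, such a job boils down to building an RDD, chaining transformations, and triggering them with an action; a minimal sketch, assuming a hypothetical input path:

lines = sc.textFile('s3n://bucket/text')      # hypothetical input; builds an RDD, nothing is read yet
lengths = lines.map(len)                      # transformation: recorded lazily
total = lengths.reduce(lambda a, b: a + b)    # action: this is what actually launches the job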

Slide 5

A typical job on Spark using EC2 + Python
1/ launch a cluster using the spark-ec2 script:
   ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>
2a/ launch the application on the cluster using spark-submit, e.g.
   ./bin/spark-submit code.py
2b/ or use the interactive shell, e.g.
   ./bin/pyspark --py-files code.py
⇒ a SparkContext object is created to "tell Spark how to access a cluster"
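In the interactive shell the SparkContext is created for you as sc; in a script submitted with spark-submit it is created explicitly. A minimal sketch of such a code.py (the application name is a made-up example):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('my_app')   # hypothetical name; the master URL can be supplied by spark-submit
sc = SparkContext(conf=conf)              # "tells Spark how to access a cluster"
# ... RDD operations go here ...
sc.stop()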

Slide 6

Python adds another level of abstraction

Slide 7

Simple example: word count

file = sc.textFile('s3n://bucket/text')
counts = file.map(lambda x: x.replace(',', ' ')
                             .replace('.', ' ')
                             .replace('-', ' ')
                             .lower()) \
             .flatMap(lambda x: x.split()) \
             .map(lambda x: (x, 1)) \
             .reduceByKey(lambda x, y: x + y)

input:   ['So wise so young, they say, do never live long.']
map:     ['so wise so young they say do never live long']
flatMap: ['so', 'wise', 'so', 'young', 'they', 'say', 'do', 'never', 'live', 'long']
map:     [('so', 1), ('wise', 1), ('so', 1), ('young', 1), ('they', 1), …, ('long', 1)]
output:  [('so', 2), ('wise', 1), ('young', 1), ('they', 1), …, ('long', 1)]
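All of the above are lazy transformations; an action is needed to actually produce the output. Two typical follow-ups, with a hypothetical output path:

counts.saveAsTextFile('s3n://bucket/word_counts')        # write the (word, count) pairs back to S3
top10 = counts.takeOrdered(10, key=lambda kv: -kv[1])    # or pull the 10 most frequent words to the driver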

Slide 8

Use case (1): filtering large datasets

[{"cookie_id": cookie1, "date": "05/08/2014", "page_visited": url1, "country": "UK"},
 {"cookie_id": cookie2, "date": "06/08/2014", "page_visited": url2, "country": "UK"},
 {"cookie_id": cookie3, "date": "05/08/2014", "page_visited": url3, "country": "US"},
 …
 {"cookie_id": cookie2, "date": "06/08/2014", "page_visited": url1, "country": "US"}]

Which users have been on url1?

Slide 9

Code:

import operator

# each record is treated as a dict of fields, as in the dataset on the previous slide
result = sc.textFile('s3n://very_large_dataset')\
           .filter(lambda x: x['page_visited'] == 'url1')\
           .map(lambda x: (x['cookie_id'], 1))\
           .reduceByKey(operator.add)

Output: [(cookie13, 12), (cookie2, 5), (cookie8, 1), …]

Problem of data sparsity:
• number of initial partitions = 150,000, original size ≈ 3 TB
• output is only 16 GB, so ideally ~1500 partitions

Slide 10

[Diagram] Map – Filter – Reduce – Shuffle. Legend: one partition, empty partition, matching record.

Slide 11

[Diagram] Map – Filter – Reduce – Shuffle – Coalesce to one partition. Legend: one partition.

Slide 12

result = sc.textFile('s3://very_large_file')\
           .filter(lambda x: x['page_visited'] == 'url1')\
           .map(lambda x: (x['cookie_id'], 1))\
           .coalesce(1500)\
           .reduceByKey(operator.add)
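One way to sanity-check the effect of coalesce is to compare partition counts before and after; a small sketch on the same dataset, using the standard getNumPartitions method:

raw = sc.textFile('s3://very_large_file')
print(raw.getNumPartitions())                  # ~150,000 partitions for the ~3 TB input
print(raw.coalesce(1500).getNumPartitions())   # ~1,500 partitions, better matched to the ~16 GB output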

Slide 13

Use case (2): filtering by comparing to a small array
Objective: we would like to filter log events depending on whether the event is a member of a group.

[{"cookie_id": cookie1, "date": "05/08/2014", "page_visited": url1, "country": "UK"},
 {"cookie_id": cookie2, "date": "06/08/2014", "page_visited": url2, "country": "UK"},
 {"cookie_id": cookie3, "date": "05/08/2014", "page_visited": url3, "country": "US"},
 {"cookie_id": cookie5, "date": "06/08/2014", "page_visited": url1, "country": "US"},
 {"cookie_id": cookie2, "date": "05/08/2014", "page_visited": url4, "country": "UK"}]

Which users have been on [url1, url3, url5]?

Slide 14

If the array of urls can be held in memory (~1 GB)
⇒ broadcast the array to all the machines
⇒ do a simple comparison with the keyword "in"

urls = set([url1, url3, url5])
broadcastUrls = sc.broadcast(urls)
input = sc.textFile('s3://very_large_file')
filtered_events = input.filter(lambda x: x['url'] in broadcastUrls.value)
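The same pattern can be tried locally; below is a self-contained toy version (all cookie and url values are made up), which shows that only the broadcast's .value is read inside the closure:

from pyspark import SparkContext

sc = SparkContext('local[*]', 'broadcast_filter_demo')        # local mode, hypothetical app name
broadcastUrls = sc.broadcast(set(['url1', 'url3', 'url5']))   # shipped once per machine, not once per task
events = sc.parallelize([
    {'cookie_id': 'cookie1', 'page_visited': 'url1'},
    {'cookie_id': 'cookie2', 'page_visited': 'url2'},
    {'cookie_id': 'cookie5', 'page_visited': 'url3'},
])
hits = events.filter(lambda e: e['page_visited'] in broadcastUrls.value)
print(hits.collect())    # the cookie1 and cookie5 events
sc.stop()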

Slide 15

Use case (3): scraping pages
Objective: scrape pages for content categorisation over multiple machines, using the scrapy library in Python.
1/ input is a list of urls stored in S3
2/ urls are randomly distributed to machines in the cluster (sketched below)
3/ each machine runs an independent scrapy job with the same settings
4/ output is written back to S3 as a json file
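The scraper itself is not shown in the slides; a rough sketch of how the distribution step could look, where scrape_batch is a hypothetical stand-in for the scrapy-driven function and the partition count is arbitrary:

import json

def scrape_batch(urls):
    # hypothetical stand-in: the real job drives a scrapy crawl with shared settings
    for url in urls:
        yield {'url': url, 'content': None}

url_list = sc.textFile('s3n://bucket_name/crawler/list_urls')
scraped = url_list.repartition(100).mapPartitions(scrape_batch)   # one independent scraping run per partition
scraped.map(json.dumps).saveAsTextFile('s3n://bucket_name/crawler/output')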

Slide 16

~/spark/bin/spark-submit --py-files scraper/pubCrawler.tar.gz,scraper/scraper_functions.py ~/scraper/scraper.py --bucket s3://bucket_name/crawler/output --input s3n://bucket_name/crawler/list_urls --sizepart 1000

Up to 6 scraped urls per machine per second!

Slide 17

My experience: some insights (pros)
1/ easy code for the manipulation of distributed data structures (it tries to hide the implementation details from you)
2/ very fast for repeated queries
3/ easy to integrate your own Python functions into the RDD operations
4/ active community

Slide 18

My experience: some insights (cons)
1/ Very new software: working with the latest commits, sometimes with GitHub repositories from individuals (not the main apache/spark repo)
 - a lot of time spent fixing bugs (reading large files from S3, web UI bug, job hanging...)
 - hard to learn for someone without an engineering background
2/ Python is always second to get new features
3/ When an error is bad, the Spark context closes down and all data is lost
4/ Some features still not figured out: persistent-hdfs, checkpointing data

Slide 19

Helpful resources
• Spark main documentation
• PySpark API
• Looking at the Python code for RDD operations
• Spark mailing list

Slide 20

Conclusion
• Spark has a bit of a learning curve
• Maybe not ideal for a data scientist without any software engineering experience
• Very fast development, active community

@ Skimlinks: currently recruiting for data scientists and engineers who want to use Spark! If interested, please contact me at [email protected]