
"Intro to Apache Spark" by Maria Rosario Mestre at PyDataLondon 2014 August Meetup

ianozsvald
August 13, 2014


Talk given by Maria for the PyDataLondon Meetup: http://ianozsvald.com/2014/08/08/pydatalondon-3rd-event/





  1. Introduction to Spark: Spark for real beginners

  2. What is Spark? • a framework for running applications over clusters of computers
     Why Spark? • better memory use than Hadoop (https://amplab.cs.berkeley.edu/benchmark/) • good integration with Hadoop • Python is supported: PySpark is Spark's Python interactive shell
  3. Cluster overview (diagram): master node running the "application", slave nodes

  4. Example of a job: the Resilient Distributed Dataset (RDD)

  5. A typical job on Spark using EC2 + Python
     1/ launch a cluster using the spark-ec2 script:
        ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>
     2a/ launch the application on the cluster using spark-submit, e.g.
        ./bin/spark-submit code.py <application-arguments>
     2b/ or use the interactive shell, e.g.
        ./bin/pyspark --py-files code.py
     ⇒ a SparkContext object is created to "tell Spark how to access a cluster"
  6. Python adds another level of abstraction

  7. Simple example: word count
     file = sc.textFile('s3n://bucket/text')
     counts = file.map(lambda x: x.replace(',', ' ')
                                  .replace('.', ' ')
                                  .replace('-', ' ')
                                  .lower()) \
                  .flatMap(lambda x: x.split()) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(lambda x, y: x + y)
     input:   ['So wise so young, they say, do never live long.']
     map:     ['so wise so young they say do never live long']
     flatMap: ['so', 'wise', 'so', 'young', 'they', 'say', 'do', 'never', 'live', 'long']
     map:     [('so', 1), ('wise', 1), ('so', 1), ('young', 1), ('they', 1), …, ('long', 1)]
     output:  [('so', 2), ('wise', 1), ('young', 1), ('they', 1), …, ('long', 1)]
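The same map / flatMap / reduceByKey pipeline can be sketched in plain Python, with no cluster, to check the logic on the slide's sample line. This is an illustration of the transformations, not PySpark itself:

```python
from itertools import groupby

lines = ['So wise so young, they say, do never live long.']

# map: normalise punctuation and case (mirrors the .map() step)
cleaned = [x.replace(',', ' ').replace('.', ' ').replace('-', ' ').lower()
           for x in lines]

# flatMap: split each line into words and flatten the result
words = [w for line in cleaned for w in line.split()]

# map to (word, 1) pairs, then reduceByKey via sort + groupby
pairs = [(w, 1) for w in words]
counts = {k: sum(v for _, v in grp)
          for k, grp in groupby(sorted(pairs), key=lambda p: p[0])}

print(counts['so'])  # 'so' appears twice in the sample line
```

The sort-then-group step plays the role of Spark's shuffle: it brings all pairs with the same key together before they are summed.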
  8. Use case (1): filtering large datasets
     [{"cookie_id": cookie1, "date": "05/08/2014", "page_visited": url1, "country": "UK"},
      {"cookie_id": cookie2, "date": "06/08/2014", "page_visited": url2, "country": "UK"},
      {"cookie_id": cookie3, "date": "05/08/2014", "page_visited": url3, "country": "US"},
      …
      {"cookie_id": cookie2, "date": "06/08/2014", "page_visited": url1, "country": "US"}]
     Which users have been on url1?
  9. Code:
     result = sc.textFile('s3n://very_large_dataset') \
         .filter(lambda x: x['page_visited'] == 'url1') \
         .map(lambda x: (x['cookie_id'], 1)) \
         .reduceByKey(operator.add)
     Output: [(cookie13, 12), (cookie2, 5), (cookie8, 1), …]
     Problem of data sparsity:
     • number of initial partitions = 150,000, original size ≈ 3 TB
     • output is only 16 GB, so ideally ~1500 partitions
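The filter-and-count logic can be checked locally in plain Python. One detail worth noting: `sc.textFile()` yields raw strings, so in practice each line has to be parsed (e.g. with `json.loads`) before its fields can be indexed. The records and values below are toy stand-ins, not the talk's data:

```python
import json
from collections import Counter

# toy stand-ins for the slide's records; cookie/url values are illustrative
raw_lines = [
    '{"cookie_id": "cookie1", "page_visited": "url1", "country": "UK"}',
    '{"cookie_id": "cookie2", "page_visited": "url2", "country": "UK"}',
    '{"cookie_id": "cookie2", "page_visited": "url1", "country": "US"}',
]

# sc.textFile() yields raw strings, so each line must be parsed before
# its fields can be indexed; json.loads plays the role of that parse step
records = [json.loads(line) for line in raw_lines]

# filter -> map -> reduceByKey, collapsed here into a Counter
visits_to_url1 = Counter(
    r['cookie_id'] for r in records if r['page_visited'] == 'url1'
)

print(visits_to_url1)  # cookie1 and cookie2 each visited url1 once
```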
  10. (diagram) Map – Filter – Reduce – Shuffle: one partition, empty partition, matching record
  11. (diagram) Map – Filter – Reduce – Shuffle – Coalesce to one partition
  12. result = sc.textFile('s3://very_large_file') \
          .filter(lambda x: x['page_visited'] == 'url1') \
          .map(lambda x: (x['cookie_id'], 1)) \
          .coalesce(1500) \
          .reduceByKey(operator.add)
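The effect of `coalesce(n)` can be sketched with plain lists standing in for partitions. Real `RDD.coalesce` merges co-located partitions to avoid a full shuffle; the round-robin merge below is a simplification with illustrative sizes, meant only to show how many mostly-empty partitions collapse into a few fuller ones:

```python
# 6 input "partitions", most nearly empty after the filter step
partitions = [['a'], [], ['b', 'c'], [], [], ['d']]

def coalesce(parts, n):
    """Merge partitions into n larger ones by concatenation.
    A simplified stand-in for RDD.coalesce, which merges
    partitions without a full shuffle."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(parts):
        merged[i % n].extend(part)
    return merged

out = coalesce(partitions, 2)
print(len(out))                  # 2 partitions instead of 6
print(sum(len(p) for p in out))  # still 4 records in total
```

Fewer, fuller partitions mean fewer near-empty tasks in the reduce stage, which is the point of the slide's `coalesce(1500)`.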
  13. Use case (2): filtering by comparing to a small array
      Objective: filter log events depending on whether the event is a member of a group
      [{"cookie_id": cookie1, "date": "05/08/2014", "page_visited": url1, "country": "UK"},
       {"cookie_id": cookie2, "date": "06/08/2014", "page_visited": url2, "country": "UK"},
       {"cookie_id": cookie3, "date": "05/08/2014", "page_visited": url3, "country": "US"},
       {"cookie_id": cookie5, "date": "06/08/2014", "page_visited": url1, "country": "US"},
       {"cookie_id": cookie2, "date": "05/08/2014", "page_visited": url4, "country": "UK"}]
      Which users have been on [url1, url3, url5]?
  14. If the array of urls can be held in memory (~1 GB)
      ⇒ broadcast the array to all the machines
      ⇒ do a simple comparison with the keyword "in"
      urls = set([url1, url3, url5])
      broadcastUrls = sc.broadcast(urls)
      input = sc.textFile('s3://very_large_file')
      filtered_events = input.filter(lambda x: x['page_visited'] in broadcastUrls.value)
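The broadcast pattern can be mimicked locally: in plain Python the "broadcast variable" is just a shared set, and the filter predicate is identical to the one passed to `.filter()` on the slide. Record values are illustrative stand-ins:

```python
# a set gives O(1) membership tests; sc.broadcast(urls) ships one
# read-only copy of such a set to every worker, instead of
# re-sending it with every task
urls = {'url1', 'url3', 'url5'}

events = [
    {'cookie_id': 'cookie1', 'page_visited': 'url1'},
    {'cookie_id': 'cookie2', 'page_visited': 'url2'},
    {'cookie_id': 'cookie5', 'page_visited': 'url3'},
]

# the same predicate the slide passes to .filter()
filtered_events = [e for e in events if e['page_visited'] in urls]

print([e['cookie_id'] for e in filtered_events])  # ['cookie1', 'cookie5']
```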
  15. Use case (3): scraping pages
      Objective: scrape pages for content categorisation over multiple machines, using the scrapy library in Python
      1/ input is a list of urls stored in S3
      2/ urls are randomly distributed to machines in the cluster
      3/ each machine runs an independent scrapy job with the same settings
      4/ output is written back to S3 as a json file
  16. ~/spark/bin/spark-submit --py-files scraper/pubCrawler.tar.gz,scraper/scraper_functions.py \
          ~/scraper/scraper.py --bucket s3://bucket_name/crawler/output \
          --input s3n://bucket_name/crawler/list_urls --sizepart 1000
      Up to 6 scraped urls per machine per second!
  17. My experience: some insights (pros)
      1/ easy code for manipulating distributed data structures (tries to hide implementation details from you)
      2/ very fast for repeated queries
      3/ easy to integrate your own Python functions into the RDD
      4/ active community
  18. My experience: some insights (cons)
      1/ very new software: working with the latest commits, sometimes GitHub repositories from individuals (not the main apache/spark repo)
         - a lot of time spent fixing bugs (reading large files from S3, web UI bug, job hanging...)
         - hard to learn for someone without an engineering background
      2/ Python is always last to get new features
      3/ when an error is bad, the Spark context shuts down and all data is lost
      4/ some features still not figured out: persistent-hdfs, checkpointing data
  19. Helpful resources • Spark main documentation • PySpark API • reading the Python code for RDD operations • Spark mailing list
  20. Conclusion • Spark has a bit of a learning curve • maybe not ideal for a data scientist without any software engineering experience • very fast development, active community
      @ Skimlinks: currently recruiting data scientists and engineers who want to use Spark! If interested, please contact me at maria@skimlinks.com