
"Intro to Apache Spark" by Maria Rosario Mestre at PyDataLondon 2014 August Meetup

ianozsvald
August 13, 2014

"Intro to Apache Spark" by Maria Rosario Mestre at PyDataLondon 2014 August Meetup

Talk given by Maria for the PyDataLondon Meetup: http://ianozsvald.com/2014/08/08/pydatalondon-3rd-event/


Transcript

  1. What is Spark?
     • a framework to run applications over clusters of computers
     Why Spark?
     • better memory use than Hadoop (https://amplab.cs.berkeley.edu/benchmark/)
     • good integration with Hadoop
     • Python is supported ⇒ PySpark is Spark's Python interactive shell

  2. A typical job on Spark using EC2 + Python
     1/ launch a cluster using the spark-ec2 script:
        ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>
     2a/ launch the application on the cluster using spark-submit, e.g.
        ./bin/spark-submit code.py <application-arguments>
     2b/ or use the interactive shell, e.g.
        ./bin/pyspark --py-files code.py
     ⇒ a SparkContext object is created to "tell Spark how to access a cluster"

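A minimal sketch of what such a script might look like when run with spark-submit (the app name and S3 path are placeholders, not from the talk). In the pyspark shell the SparkContext is already provided as sc; a submitted script creates its own:

    from pyspark import SparkContext

    # Build the SparkContext ("tells Spark how to access a cluster");
    # the master is supplied by spark-submit / the cluster configuration.
    sc = SparkContext(appName='example-app')
    rdd = sc.textFile('s3n://bucket/text')   # placeholder input path
    print(rdd.count())                       # trigger a simple action
    sc.stop()
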
  3. Simple example: word count

     file = sc.textFile('s3n://bucket/text')
     counts = file.map(lambda x: x.replace(',', ' ')
                                  .replace('.', ' ')
                                  .replace('-', ' ')
                                  .lower()) \
                  .flatMap(lambda x: x.split()) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(lambda x, y: x + y)

     input:   ['So wise so young, they say, do never live long.']
     map:     ['so wise so young they say do never live long']
     flatMap: ['so', 'wise', 'so', 'young', 'they', 'say', 'do', 'never', 'live', 'long']
     map:     [('so', 1), ('wise', 1), ('so', 1), ('young', 1), ('they', 1), …, ('long', 1)]
     output:  [('so', 2), ('wise', 1), ('young', 1), ('they', 1), …, ('long', 1)]

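A possible follow-up on the driver side, assuming the counts RDD from the slide above (not shown in the talk); takeOrdered pulls back only the top results rather than the whole dataset:

    # Ten most frequent words; the key sorts by descending count.
    top_words = counts.takeOrdered(10, key=lambda kv: -kv[1])
    print(top_words)   # e.g. [('so', 2), ('wise', 1), ...]

    # Or write the full result back out (placeholder path):
    # counts.saveAsTextFile('s3n://bucket/word_counts')
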
  4. Use case (1): filtering large datasets

     [{"cookie_id": cookie1, "date": "05/08/2014", "page_visited": url1, "country": "UK"},
      {"cookie_id": cookie2, "date": "06/08/2014", "page_visited": url2, "country": "UK"},
      {"cookie_id": cookie3, "date": "05/08/2014", "page_visited": url3, "country": "US"},
      …
      {"cookie_id": cookie2, "date": "06/08/2014", "page_visited": url1, "country": "US"}]

     Which users have been on url1?

  5. Code:

     import operator

     result = sc.textFile('s3n://very_large_dataset') \
                .filter(lambda x: x['page_visited'] == 'url1') \
                .map(lambda x: (x['cookie_id'], 1)) \
                .reduceByKey(operator.add)

     Output: [(cookie13, 12), (cookie2, 5), (cookie8, 1), …]

     Problem of data sparsity:
     • number of initial partitions = 150,000, original size ≈ 3 TB
     • output is only 16 GB, so ideally ~1500 partitions

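One way to act on that sparsity observation (a sketch, not from the talk) is to shrink the number of partitions before writing the much smaller output; coalesce merges partitions without a full shuffle, while repartition forces one. The target of 1500 follows the slide's estimate, and the output path is a placeholder:

    # `result` is the RDD built on the slide above.
    result_small = result.coalesce(1500)        # merge ~150,000 partitions down to ~1500
    # result_small = result.repartition(1500)   # alternative: full shuffle, evenly sized partitions
    result_small.saveAsTextFile('s3n://bucket/url1_visits')
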
  6. Use case (2): filtering by comparison with a small array
     Objective: we would like to filter log events depending on whether the event is a member of a group.

     [{"cookie_id": cookie1, "date": "05/08/2014", "page_visited": url1, "country": "UK"},
      {"cookie_id": cookie2, "date": "06/08/2014", "page_visited": url2, "country": "UK"},
      {"cookie_id": cookie3, "date": "05/08/2014", "page_visited": url3, "country": "US"},
      {"cookie_id": cookie5, "date": "06/08/2014", "page_visited": url1, "country": "US"},
      {"cookie_id": cookie2, "date": "05/08/2014", "page_visited": url4, "country": "UK"}, …]

     Which users have been on [url1, url3, url5]?

  7. If the array of urls can be held in memory (~1 GB)
     ⇒ broadcast the array to all the machines
     ⇒ do a simple comparison with the keyword "in"

     urls = set([url1, url3, url5])
     broadcastUrls = sc.broadcast(urls)
     input = sc.textFile('s3://very_large_file')
     filtered_events = input.filter(lambda x: x['url'] in broadcastUrls.value)

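Note that sc.textFile yields raw strings while the lambda above (and the filter in slide 5) indexes into a dict, so the slides presumably omit a parsing step. A sketch assuming one JSON object per line, with the field name taken from the data example in slide 6:

    import json

    urls = set(['url1', 'url3', 'url5'])
    broadcast_urls = sc.broadcast(urls)     # shipped once to every machine

    events = sc.textFile('s3n://very_large_file').map(json.loads)
    filtered_events = events.filter(
        lambda x: x['page_visited'] in broadcast_urls.value)
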
  8. Use case (3): scraping pages
     Objective: scrape pages for content categorisation over multiple machines, using the scrapy library in Python (see the sketch below):
     1/ input is a list of urls stored in S3
     2/ urls are randomly distributed to machines in the cluster
     3/ each machine runs an independent scrapy job with the same settings
     4/ output is written back to S3 as a JSON file

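A sketch of that workflow; fetch_page below is a plain urllib2 stand-in for the actual scrapy job, whose spider and settings are not shown in the talk, and all paths are placeholders:

    import json
    import urllib2   # Python 2, matching PySpark examples of the time

    def fetch_page(url):
        # Stand-in for the scrapy job run on each machine.
        try:
            html = urllib2.urlopen(url, timeout=10).read()
            return {'url': url, 'length': len(html)}
        except Exception as e:
            return {'url': url, 'error': str(e)}

    urls = sc.textFile('s3n://bucket/url_list')       # 1/ list of urls stored in S3
    urls = urls.repartition(sc.defaultParallelism)    # 2/ spread urls across the cluster
    results = urls.map(fetch_page)                    # 3/ each task scrapes independently
    results.map(json.dumps).saveAsTextFile('s3n://bucket/scraped_output')   # 4/ JSON back to S3
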
  9. My experience: some insights (pros)
     1/ easy code for manipulating distributed data structures (it tries to hide implementation details from you)
     2/ very fast for repeated queries
     3/ easy to integrate your own Python functions into the RDD operations
     4/ active community

  10. My experience: some insights (cons)
      1/ very new software: working with the latest commits, sometimes GitHub repositories from individuals (not the main apache/spark repo)
         - a lot of time spent fixing bugs (reading large files from S3, web UI bug, job hanging…)
         - hard to learn for someone without an engineering background
      2/ Python is always second in line to get new features
      3/ when an error is bad, the Spark context closes down and all data is lost
      4/ some features still not figured out: persistent-hdfs, checkpointing data

  11. Helpful resources
      • Spark main documentation
      • PySpark API
      • looking at the Python code for RDD operations
      • Spark mailing list

  12. Conclusion
      • Spark has a bit of a learning curve
      • maybe not ideal for a data scientist without any software engineering experience
      • very fast development, active community
      @ Skimlinks: currently recruiting for data scientists and engineers who want to use Spark! If interested, please contact me at [email protected]