These slides introduce Hadoop and Spark: why we use Hadoop and Spark, what an RDD is in Spark, how to operate on RDDs, and what is special about Spark.
"Big data is high-volume, high-velocity and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." - Doug Laney, 2012
• … issues
  ◦ open source platform
  ◦ latest stable version
• General limitations
  ◦ Hadoop is not the only answer
  ◦ data collection, aggregation and integration
• Hadoop
  ◦ An open source framework that provides distributed storage for big data (HDFS) and a parallel execution model (MapReduce)
• Spark
  ◦ An open source parallel processing framework that enables users to run large-scale data analytics applications across clustered computers
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. (http://spark.apache.org/)

2014 Sort Benchmark Competition
                    Number of machines    Time
Hadoop MapReduce    2100                  72 min
Spark               207                   23 min
# mapper.py
import sys

# Read lines from standard input and emit "word<TAB>1" for every word.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print('%s\t%s' % (word, 1))

# reducer.py
import sys

# Hadoop Streaming sorts the mapper output by key, so counts for the same
# word arrive on consecutive lines; sum them and emit one total per word.
current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    count = int(count)  # the count arrives as a string on stdin
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Emit the last word (it never reaches the "else" branch above).
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
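For comparison with the streaming mapper and reducer above, here is a minimal sketch of the same word count written with Spark's Python API (SparkContext setup is covered later in these slides); the input path, application name, and local[*] master are placeholder assumptions, not values from the slides.

from pyspark import SparkConf, SparkContext

# Illustrative configuration only; swap in your own app name and master URL.
conf = SparkConf().setAppName("WordCount").setMaster("local[*]")
sc = SparkContext(conf=conf)

counts = (sc.textFile("hdfs:///path/to/input.txt")       # hypothetical input path
            .flatMap(lambda line: line.split())          # like mapper.py: split into words
            .map(lambda word: (word, 1))                 # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))            # like reducer.py: sum per word

for word, count in counts.take(10):
    print('%s\t%s' % (word, count))

sc.stop()

The flatMap/map pair plays the role of mapper.py, while reduceByKey replaces the hand-written grouping logic in reducer.py.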
• Scala
  val conf = new SparkConf().setAppName(appName).setMaster(master)
  val sc = new SparkContext(conf)
• Python
  conf = SparkConf().setAppName(appName).setMaster(master)
  sc = SparkContext(conf=conf)
appName: a name for your application to show on the cluster UI
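Once the SparkContext exists, it can be used immediately to create RDDs. A minimal sketch follows; the application name, local[*] master, and HDFS path are illustrative assumptions rather than values from the slides.

from pyspark import SparkConf, SparkContext

# Illustrative values only; use your own application name and master URL.
conf = SparkConf().setAppName("MyApp").setMaster("local[*]")
sc = SparkContext(conf=conf)

# The two usual ways to create an RDD from the driver:
nums = sc.parallelize([1, 2, 3, 4, 5])           # from an in-memory collection
# lines = sc.textFile("hdfs:///path/to/file")    # from a Hadoop-supported file system

print(nums.count())   # 5
sc.stop()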
What's RDD
"Spark provides a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures."
[Diagram: an RDD split into Partition 1, Partition 2, …]
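To make the quoted properties concrete, here is a small sketch assuming a local run; the four-partition split and the decision to cache are illustrative assumptions, not part of the quoted definition.

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("RDDBasics").setMaster("local[*]")
sc = SparkContext(conf=conf)

# "a collection of elements partitioned across the nodes of the cluster"
rdd = sc.parallelize(range(100), numSlices=4)
print(rdd.getNumPartitions())          # 4

# "operated on in parallel": transformations followed by actions
squares = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

# "persist an RDD in memory, allowing it to be reused efficiently"
squares.cache()
print(squares.count())                 # first action computes and caches the partitions
print(squares.take(5))                 # reuses the cached partitions

sc.stop()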