
Intro to Big Data

An overview of Hadoop and how to get started using Python.

Rahul Rajeev

June 24, 2015

Transcript

  1. Topics • What is Big Data? • Intro to Hadoop • Hadoop ecosystem • HDFS • Map Reduce • Introduction to Spark (topic for the next talk)
  2. 90% of the data in the world today has been created in the last two years alone. http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
  3. Introduction to Hadoop • Open-source software framework/ecosystem • Built for distributed storage and processing of large amounts of data across many machines • Processing is taken to where the data is stored, rather than the other way around
  4. Apache Hadoop framework CORE COMPONENTS • HDFS - distributed storage • Map Reduce - distributed processing • YARN - resource management
  5. Apache Hadoop framework ECOSYSTEM • HDFS, Map Reduce, YARN - core • Mahout - machine learning library • Hive - SQL layer • Pig - MR scripting • Impala - low-latency SQL layer • Sqoop - RDBMS to HDFS transfer • Flume - real-time data ingestion • HBase - distributed datastore • HUE - GUI frontend • Oozie - workflow management • Zookeeper - centralized cluster management • Storm - event streaming • Spark - 10x MR • Shark - SQL layer on Spark
  6. HDFS - HADOOP DISTRIBUTED FILE SYSTEM • Distributed file system • Breaks files into blocks • Redundancy • Replication • Blocks scattered all over the network
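Not on the original slides: a minimal sketch of poking at HDFS from Python with Snakebite, the client the deck uses later in the testing section. The NameNode address (localhost:8020) and the /tmp path are assumptions for illustration.

     # minimal sketch, assuming a local NameNode on port 8020
     from snakebite.client import Client

     client = Client('localhost', 8020)
     # ls() takes a list of paths and yields a dict per entry
     for entry in client.ls(['/tmp']):
         print entry['path'], entry['length']
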
  7. HDFS ADVANTAGES • Scale-out architecture • High availability • Fault tolerance • Flexible access • Load balancing • Tuneable replication • Security
  8. Map Reduce INTRODUCTION • Programming model • Parallel, distributed processing of large amounts of raw data • Processes data directly on the nodes that store it
  9. Map Reduce ALGORITHM - MAP [diagram: a text file of Foo/Bar/Baz words is split across map tasks, and each map task emits the individual words it reads]
  10. Map Reduce ALGORITHM - REDUCE [diagram: the reduce step groups the emitted words and totals them: Foo - 6, Bar - 3, Baz - 3]
  11. Map Reduce WORD COUNT PROBLEM - MAPPER [diagram: an input file of Hello/World/Foobar lines is split across map tasks; each mapper emits one (word, 1) pair per word, e.g. Hello 1, World 1, Foobar 1, World 1, Hello 1]
  12. Map Reduce WORD COUNT PROBLEM - REDUCER [diagram: the mapper output (Hello 1, World 1, Foobar 1, World 1, Hello 1) is shuffled and sorted into reducer input (Foobar 1, Hello 1, Hello 1, World 1, World 1); the reducer emits the totals Foobar 1, Hello 2, World 2]
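Before the Hadoop versions on the next slides, the same flow can be simulated in plain Python; a rough sketch of the map, shuffle-and-sort, and reduce steps for the word-count example above (the input lines here are illustrative, not from the deck):

     # illustrative simulation of map -> shuffle/sort -> reduce
     lines = ["Hello World Foobar", "Hello World"]

     # map: emit a (word, 1) pair for every word
     mapped = [(word, 1) for line in lines for word in line.split()]

     # shuffle and sort: group identical keys together
     mapped.sort()

     # reduce: sum the counts for each key
     counts = {}
     for word, count in mapped:
         counts[word] = counts.get(word, 0) + count

     print counts   # {'Foobar': 1, 'Hello': 2, 'World': 2}
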
  13. mapper.py WORD COUNT

     import sys

     def map_lines(lines):
         for line in lines:
             line = line.strip()
             words = line.split()
             for word in words:
                 # emit each word with a count of 1
                 print '%s\t%s' % (word, 1)

     def main():
         map_lines(sys.stdin)

     if __name__ == '__main__':
         main()

  14. reducer.py WORD COUNT

     import sys

     def reduce_lines(lines):
         current_word = None
         current_count = 0
         word = None
         for line in lines:
             line = line.strip()
             word, count = line.split('\t', 1)
             count = int(count)
             # the keys are sorted - checking for key change
             if current_word == word:
                 current_count += count
             else:
                 if current_word:
                     # write result to STDOUT
                     print '%s\t%s' % (current_word, current_count)
                 current_count = count
                 current_word = word
         # output the last word
         if current_word == word:
             print '%s\t%s' % (current_word, current_count)

     def main():
         reduce_lines(sys.stdin)

     if __name__ == '__main__':
         main()

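Because Hadoop streaming hands the reducer its keys already sorted, the same reducer can also be written more compactly with itertools.groupby; this is a hedged alternative sketch, not the version shown on the slide:

     import sys
     from itertools import groupby

     def reduce_lines(lines):
         # parse "word\tcount" pairs from the sorted mapper output
         pairs = (line.strip().split('\t', 1) for line in lines)
         for word, group in groupby(pairs, key=lambda pair: pair[0]):
             total = sum(int(count) for _, count in group)
             print '%s\t%s' % (word, total)

     if __name__ == '__main__':
         reduce_lines(sys.stdin)
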
  15. Running a MR Job COMMAND

     $ hadoop jar /usr/local/Cellar/hadoop/2.6.0/libexec/share/hadoop/tools/lib/hadoop-*streaming*.jar \
         -file mapper.py \
         -file reducer.py \
         -mapper mapper.py \
         -reducer reducer.py \
         -input hdfs://localhost:8020/tmp/test_log_file.txt \
         -output hdfs://localhost:8020/tmp/output

  16. Running a MR Job OUTPUT

     15/06/18 00:47:20 INFO streaming.PipeMapRed: PipeMapRed exec [/Users/rahul/workspace/hadoop/./reducer.py]
     15/06/18 00:47:20 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
     15/06/18 00:47:20 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
     15/06/18 00:47:20 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
     15/06/18 00:47:20 INFO streaming.PipeMapRed: Records R/W=7/1
     15/06/18 00:47:20 INFO streaming.PipeMapRed: MRErrorThread done
     15/06/18 00:47:20 INFO streaming.PipeMapRed: mapRedFinished
     15/06/18 00:47:20 INFO mapred.Task: Task:attempt_local430186074_0001_r_000000_0 is done. And is in the process of committing
     15/06/18 00:47:20 INFO mapred.LocalJobRunner: 1 / 1 copied.
     15/06/18 00:47:20 INFO mapred.Task: Task attempt_local430186074_0001_r_000000_0 is allowed to commit now
     15/06/18 00:47:20 INFO output.FileOutputCommitter: Saved output of task 'attempt_local430186074_0001_r_000000_0' to hdfs://localhost:8020/tmp/test_output_2/_temporary/0/task_local430186074_0001_r_000000
     15/06/18 00:47:20 INFO mapred.LocalJobRunner: Records R/W=7/1 > reduce
     15/06/18 00:47:20 INFO mapred.Task: Task 'attempt_local430186074_0001_r_000000_0' done.
     15/06/18 00:47:20 INFO mapred.LocalJobRunner: Finishing task: attempt_local430186074_0001_r_000000_0
     15/06/18 00:47:20 INFO mapred.LocalJobRunner: reduce task executor complete.
     15/06/18 00:47:21 INFO mapreduce.Job: Job job_local430186074_0001 running in uber mode : false
     15/06/18 00:47:21 INFO mapreduce.Job: map 100% reduce 100%
     15/06/18 00:47:21 INFO mapreduce.Job: Job job_local430186074_0001 completed successfully
     15/06/18 00:47:21 INFO mapreduce.Job: Counters: 35
       File System Counters: FILE bytes read=3904, bytes written=518111, read operations=0, large read operations=0, write operations=0; HDFS bytes read=70, bytes written=33, read operations=13, large read operations=0, write operations=4
       Map-Reduce Framework: Map input records=3, Map output records=7, Map output bytes=49, Map output materialized bytes=69, Input split bytes=97, Combine input records=0, Combine output records=0, Reduce input groups=5, Reduce shuffle bytes=69, Reduce input records=7, Reduce output records=5, Spilled Records=14, Shuffled Maps=1, Failed Shuffles=0, Merged Map outputs=1, GC time elapsed (ms)=0, Total committed heap usage (bytes)=537919488
       Shuffle Errors: BAD_ID=0, CONNECTION=0, IO_ERROR=0, WRONG_LENGTH=0, WRONG_MAP=0, WRONG_REDUCE=0
       File Input Format Counters: Bytes Read=35
       File Output Format Counters: Bytes Written=33
     15/06/18 00:47:21 INFO streaming.StreamJob: Output directory: hdfs://localhost:8020/tmp/test_output_2

  17. Testing MR Jobs TESTING FOR ACCURACY • Unit tests can cover individual mappers and reducers • End-to-end flows are harder to test, especially when multiple MR jobs run as part of a workflow
  18. Testing MR Jobs UNIT TEST

     class TestMapper(object):
         def test_one_line(self, capsys):
             expected_output = self._format_output([
                 ["Hello", "1"],
                 ["World", "1"]])
             map_lines(["Hello World"])
             out, err = capsys.readouterr()
             assert out == expected_output

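A matching unit test for the reducer follows the same pattern; a hedged sketch, assuming the reduce_lines function from the earlier slide and pytest's capsys fixture:

     class TestReducer(object):
         def test_sorted_input(self, capsys):
             # keys must arrive already sorted, as they would after shuffle and sort
             reduce_lines(["Hello\t1", "Hello\t1", "World\t1"])
             out, err = capsys.readouterr()
             assert out == "Hello\t2\nWorld\t1\n"
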
  19. Testing MR Jobs END TO END TESTING 1. Add input files into HDFS • Snakebite - Python HDFS client 2. Run the multi-step job • Use a runner script 3. Retrieve and parse output files from HDFS 4. Assert!
  20. Testing MR Jobs END TO END TESTING

     # step 1
     def setup_lines_on_hdfs(self, input_array, dest_dir):
         self.write_data_to_tmp(input_array)
         hdfs = SnakebiteClient(HDFS_HOST, port=HDFS_PORT)
         hdfs.copy_from_local(
             local_src_dir=self.this_test_root_dir,
             remote_dest_dir=dest_dir)

  21. Testing MR Jobs END TO END TESTING

     # step 2 : Execute the map reduce jobs
     self.run_mr(input_path, intermediate_path, self.mapper1, self.reducer1)
     self.run_mr(intermediate_path, output_path, self.mapper2, self.reducer2)

     def run_mr(self, inputs, output_dir, mapper_path, reducer_path, args=[]):
         hadoop = which('hadoop')
         cmd_args = [hadoop, "jar", HADOOP_STREAMING_JAR] + args + [
             "-input", HDFS_URI + inputs,
             "-output", HDFS_URI + output_dir,
             "-mapper", mapper_path,
             "-reducer", reducer_path]
         cmd = ' '.join(cmd_args)
         self._run_cmd(cmd)

  22. Testing MR Jobs END TO END TESTING

     # step 3 : Retrieve data from HDFS
     hdfs = SnakebiteClient(HDFS_HOST, port=HDFS_PORT)
     combined_stats = defaultdict(int)
     output_text = hdfs.cat([output_path + "part-*"]).next()

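Step 4 is then an ordinary assertion on the parsed output; a hedged sketch of what it could look like, assuming the tab-separated counts produced by the reducer above (the expected values are illustrative):

     # step 4 : parse the reducer output and assert on it
     for line in output_text.splitlines():
         word, count = line.split('\t', 1)
         combined_stats[word] += int(count)

     assert combined_stats['Hello'] == 2
     assert combined_stats['World'] == 2
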
  23. Testing MR Jobs LOCAL END TO END TESTING

     # Replace steps 1 - 3. Directly assert on the output.
     $ cat input_file.txt | ./mapper.py | sort | ./reducer.py > output.txt

  24. Testing MR Jobs BDD TESTING WITH LETTUCE/CUCUMBER

     Scenario: Testing the Job001 workflow
       When I add test1_input file into HDFS
       And I run the task Job001 on that file
       Then the task must complete with status Success
       And job output must match test1_output file

     Scenario: Navigate to DFS-Home and check the status of any directory
       When I go to "http://your-namenode:50070/dfshealth.jsp"
       Then the request should succeed
       And the directories should be "Active"

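Each scenario step maps to a Python step definition; a minimal Lettuce sketch for the first step, reusing the SnakebiteClient helper and constants from the earlier slides (the fixture path and destination directory are assumptions for illustration):

     from lettuce import step, world

     @step(r'I add (\w+) file into HDFS')
     def add_file_into_hdfs(step, filename):
         # copy the named fixture into HDFS so the job under test can read it
         hdfs = SnakebiteClient(HDFS_HOST, port=HDFS_PORT)
         hdfs.copy_from_local(
             local_src_dir='fixtures/' + filename,
             remote_dest_dir='/tmp/test_input/')
         world.input_file = filename
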
  25. HDFS and Map Reduce STRENGTHS • Linear scalability • You can easily predict job times for different data set sizes • File-type agnostic • Store and process all types of files - no need to specify a schema on load • Transparent parallelism
  26. Map Reduce DESIGN PATTERNS • Summarisation patterns • Counting • Min/Max • Statistics • Reverse indexing • Filtering patterns • Sampling • Top-N lists (see the sketch below) • Structural patterns • Combining data sets
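As an example of one of these patterns, a hedged sketch of a Top-N filter in streaming style: each mapper keeps only its local top N records, and a single reducer can then merge those small lists into the global top N. The input format ("word\tcount" lines) and N=10 are assumptions, not from the deck:

     import sys
     import heapq

     N = 10

     def top_n_mapper(lines):
         # keep only the N largest counts seen by this mapper
         top = []
         for line in lines:
             word, count = line.strip().split('\t', 1)
             heapq.heappush(top, (int(count), word))
             if len(top) > N:
                 heapq.heappop(top)
         # emit the local top-N; a single reducer merges them globally
         for count, word in top:
             print '%s\t%s' % (word, count)

     if __name__ == '__main__':
         top_n_mapper(sys.stdin)
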
  27. Map Reduce - one size fits all? MAYBE NOT… [diagram: chained jobs MR1 (M1, R1) and MR2 (M2, R2) with a full write to HDFS between them] • Most algorithms that can't be expressed as a single MR job find the chaining model very restrictive • Lack of flexibility inhibits optimisations • Full dump to disk between jobs - too much disk IO • Only one mapper and reducer per MR job
  28. Apache Spark INTRODUCTION • Fast and general engine for large-scale data processing • Works with Scala, Java, Python or R • Runs everywhere • Standalone cluster mode • HDFS with YARN/HBase • EC2
  29. Apache Spark COMPARISON TO HADOOP MAP REDUCE • Less code (see the PySpark sketch below) • Supports in-memory storage • Highly customizable • Resilient Distributed Datasets • Fault tolerant • Can be cached • 10x-100x faster than MR jobs http://www.slideshare.net/MapRTechnologies/spark-overviewjune2014
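The "less code" point is easy to see with the same word count written for PySpark; a minimal sketch, assuming a local Spark 1.x installation and reusing the deck's HDFS input path for illustration:

     from pyspark import SparkContext

     sc = SparkContext('local', 'wordcount')
     counts = (sc.textFile('hdfs://localhost:8020/tmp/test_log_file.txt')
                 .flatMap(lambda line: line.split())      # map: emit every word
                 .map(lambda word: (word, 1))             # pair each word with 1
                 .reduceByKey(lambda a, b: a + b))        # reduce: sum per word
     print counts.collect()
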