
Intro to Big Data

An overview of Hadoop and how to get started using Python.

Rahul Rajeev

June 24, 2015

Transcript

  1. Topics • What is Big Data? • Intro to Hadoop • Hadoop ecosystem • HDFS • Map Reduce • Introduction to Spark (topic for the next talk)
  2. 90% of the data in the world today has been created in the last two years alone. http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
  3. Introduction to Hadoop • Open-source software framework/ecosystem • Built for distributed storage and processing of large amounts of data across many machines • Processing is taken to where the data is stored, rather than the other way around
  4. Apache Hadoop framework CORE COMPONENTS • HDFS - distributed storage • Map Reduce - distributed processing • YARN - resource management
  5. Apache Hadoop framework ECOSYSTEM • HDFS, Map Reduce, YARN - core • Mahout - machine learning library • Hive - SQL layer • Pig - MR scripting • Impala - low-latency SQL layer • Sqoop - RDBMS to HDFS transfer • Flume - real-time data ingestion • HBase - distributed datastore • HUE - GUI frontend • Oozie - workflow management • Zookeeper - centralized cluster management • Storm - event streaming • Spark - 10x MR • Shark - SQL layer on Spark
  6. HDFS - HADOOP DISTRIBUTED FILE SYSTEM • Distributed file system • Breaks files into blocks • Redundancy • Replication • Blocks scattered all over the network
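Not on the original slides: a minimal sketch of poking at HDFS from Python with Snakebite, the client the deck uses later in the testing section. The NameNode address (localhost:8020) and the /tmp path are assumptions for illustration.

     # minimal sketch, assuming a local NameNode on port 8020
     from snakebite.client import Client

     client = Client('localhost', 8020)
     # ls() takes a list of paths and yields a dict per entry
     for entry in client.ls(['/tmp']):
         print entry['path'], entry['length']
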
  7. HDFS ADVANTAGES • Scale-out architecture • High availability • Fault tolerance • Flexible access • Load balancing • Tuneable replication • Security
  8. Map Reduce INTRODUCTION • Programming model • Parallel, distributed processing of large amounts of raw data • Processes data directly on the nodes that store it
  9. Map Reduce ALGORITHM - MAP [diagram: a text file of Foo/Bar/Baz words is split across map tasks, and each map task emits the individual words it reads]
  10. Map Reduce ALGORITHM - REDUCE [diagram: the reduce step groups the emitted words and totals them: Foo - 6, Bar - 3, Baz - 3]
  11. Map Reduce WORD COUNT PROBLEM - MAPPER [diagram: an input file of Hello/World/Foobar lines is split across map tasks; each mapper emits one (word, 1) pair per word, e.g. Hello 1, World 1, Foobar 1, World 1, Hello 1]
  12. Map Reduce WORD COUNT PROBLEM - REDUCER [diagram: the mapper output (Hello 1, World 1, Foobar 1, World 1, Hello 1) is shuffled and sorted into reducer input (Foobar 1, Hello 1, Hello 1, World 1, World 1); the reducer emits the totals Foobar 1, Hello 2, World 2]
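Before the Hadoop versions on the next slides, the same flow can be simulated in plain Python; a rough sketch of the map, shuffle-and-sort, and reduce steps for the word-count example above (the input lines here are illustrative, not from the deck):

     # illustrative simulation of map -> shuffle/sort -> reduce
     lines = ["Hello World Foobar", "Hello World"]

     # map: emit a (word, 1) pair for every word
     mapped = [(word, 1) for line in lines for word in line.split()]

     # shuffle and sort: group identical keys together
     mapped.sort()

     # reduce: sum the counts for each key
     counts = {}
     for word, count in mapped:
         counts[word] = counts.get(word, 0) + count

     print counts   # {'Foobar': 1, 'Hello': 2, 'World': 2}
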
  13. mapper.py WORD COUNT

     import sys

     def map_lines(lines):
         for line in lines:
             line = line.strip()
             words = line.split()
             for word in words:
                 # emit each word with a count of 1
                 print '%s\t%s' % (word, 1)

     def main():
         map_lines(sys.stdin)

     if __name__ == '__main__':
         main()

  14. reducer.py WORD COUNT

     import sys

     def reduce_lines(lines):
         current_word = None
         current_count = 0
         word = None
         for line in lines:
             line = line.strip()
             word, count = line.split('\t', 1)
             count = int(count)
             # the keys are sorted - checking for key change
             if current_word == word:
                 current_count += count
             else:
                 if current_word:
                     # write result to STDOUT
                     print '%s\t%s' % (current_word, current_count)
                 current_count = count
                 current_word = word
         # output the last word
         if current_word == word:
             print '%s\t%s' % (current_word, current_count)

     def main():
         reduce_lines(sys.stdin)

     if __name__ == '__main__':
         main()

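Because Hadoop streaming hands the reducer its keys already sorted, the same reducer can also be written more compactly with itertools.groupby; this is a hedged alternative sketch, not the version shown on the slide:

     import sys
     from itertools import groupby

     def reduce_lines(lines):
         # parse "word\tcount" pairs from the sorted mapper output
         pairs = (line.strip().split('\t', 1) for line in lines)
         for word, group in groupby(pairs, key=lambda pair: pair[0]):
             total = sum(int(count) for _, count in group)
             print '%s\t%s' % (word, total)

     if __name__ == '__main__':
         reduce_lines(sys.stdin)
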
  15. Running a MR Job COMMAND

     $ hadoop jar /usr/local/Cellar/hadoop/2.6.0/libexec/share/hadoop/tools/lib/hadoop-*streaming*.jar \
         -file mapper.py \
         -file reducer.py \
         -mapper mapper.py \
         -reducer reducer.py \
         -input hdfs://localhost:8020/tmp/test_log_file.txt \
         -output hdfs://localhost:8020/tmp/output

  16. Running a MR Job OUTPUT

     15/06/18 00:47:20 INFO streaming.PipeMapRed: PipeMapRed exec [/Users/rahul/workspace/hadoop/./reducer.py]
     15/06/18 00:47:20 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
     15/06/18 00:47:20 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
     15/06/18 00:47:20 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
     15/06/18 00:47:20 INFO streaming.PipeMapRed: Records R/W=7/1
     15/06/18 00:47:20 INFO streaming.PipeMapRed: MRErrorThread done
     15/06/18 00:47:20 INFO streaming.PipeMapRed: mapRedFinished
     15/06/18 00:47:20 INFO mapred.Task: Task:attempt_local430186074_0001_r_000000_0 is done. And is in the process of committing
     15/06/18 00:47:20 INFO mapred.LocalJobRunner: 1 / 1 copied.
     15/06/18 00:47:20 INFO mapred.Task: Task attempt_local430186074_0001_r_000000_0 is allowed to commit now
     15/06/18 00:47:20 INFO output.FileOutputCommitter: Saved output of task 'attempt_local430186074_0001_r_000000_0' to hdfs://localhost:8020/tmp/test_output_2/_temporary/0/task_local430186074_0001_r_000000
     15/06/18 00:47:20 INFO mapred.LocalJobRunner: Records R/W=7/1 > reduce
     15/06/18 00:47:20 INFO mapred.Task: Task 'attempt_local430186074_0001_r_000000_0' done.
     15/06/18 00:47:20 INFO mapred.LocalJobRunner: Finishing task: attempt_local430186074_0001_r_000000_0
     15/06/18 00:47:20 INFO mapred.LocalJobRunner: reduce task executor complete.
     15/06/18 00:47:21 INFO mapreduce.Job: Job job_local430186074_0001 running in uber mode : false
     15/06/18 00:47:21 INFO mapreduce.Job: map 100% reduce 100%
     15/06/18 00:47:21 INFO mapreduce.Job: Job job_local430186074_0001 completed successfully
     15/06/18 00:47:21 INFO mapreduce.Job: Counters: 35
       File System Counters: FILE bytes read=3904, bytes written=518111, read operations=0, large read operations=0, write operations=0; HDFS bytes read=70, bytes written=33, read operations=13, large read operations=0, write operations=4
       Map-Reduce Framework: Map input records=3, Map output records=7, Map output bytes=49, Map output materialized bytes=69, Input split bytes=97, Combine input records=0, Combine output records=0, Reduce input groups=5, Reduce shuffle bytes=69, Reduce input records=7, Reduce output records=5, Spilled Records=14, Shuffled Maps=1, Failed Shuffles=0, Merged Map outputs=1, GC time elapsed (ms)=0, Total committed heap usage (bytes)=537919488
       Shuffle Errors: BAD_ID=0, CONNECTION=0, IO_ERROR=0, WRONG_LENGTH=0, WRONG_MAP=0, WRONG_REDUCE=0
       File Input Format Counters: Bytes Read=35
       File Output Format Counters: Bytes Written=33
     15/06/18 00:47:21 INFO streaming.StreamJob: Output directory: hdfs://localhost:8020/tmp/test_output_2

  17. Testing MR Jobs TESTING FOR ACCURACY • Unit tests can cover individual mappers and reducers • End-to-end flows are harder to test, especially when multiple MR jobs run as part of a workflow
  18. Testing MR Jobs UNIT TEST

     class TestMapper(object):
         def test_one_line(self, capsys):
             expected_output = self._format_output([
                 ["Hello", "1"],
                 ["World", "1"]])
             map_lines(["Hello World"])
             out, err = capsys.readouterr()
             assert out == expected_output

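A matching unit test for the reducer follows the same pattern; a hedged sketch, assuming the reduce_lines function from the earlier slide and pytest's capsys fixture:

     class TestReducer(object):
         def test_sorted_input(self, capsys):
             # keys must arrive already sorted, as they would after shuffle and sort
             reduce_lines(["Hello\t1", "Hello\t1", "World\t1"])
             out, err = capsys.readouterr()
             assert out == "Hello\t2\nWorld\t1\n"
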
  19. Testing MR Jobs END TO END TESTING 1. Add input files into HDFS • Snakebite - Python HDFS client 2. Run the multi-step job • Use a runner script 3. Retrieve and parse output files from HDFS 4. Assert!
  20. Testing MR Jobs END TO END TESTING

     # step 1
     def setup_lines_on_hdfs(self, input_array, dest_dir):
         self.write_data_to_tmp(input_array)
         hdfs = SnakebiteClient(HDFS_HOST, port=HDFS_PORT)
         hdfs.copy_from_local(
             local_src_dir=self.this_test_root_dir,
             remote_dest_dir=dest_dir)

  21. Testing MR Jobs END TO END TESTING

     # step 2 : Execute the map reduce jobs
     self.run_mr(input_path, intermediate_path, self.mapper1, self.reducer1)
     self.run_mr(intermediate_path, output_path, self.mapper2, self.reducer2)

     def run_mr(self, inputs, output_dir, mapper_path, reducer_path, args=[]):
         hadoop = which('hadoop')
         cmd_args = [hadoop, "jar", HADOOP_STREAMING_JAR] + args + [
             "-input", HDFS_URI + inputs,
             "-output", HDFS_URI + output_dir,
             "-mapper", mapper_path,
             "-reducer", reducer_path]
         cmd = ' '.join(cmd_args)
         self._run_cmd(cmd)

  22. Testing MR Jobs END TO END TESTING

     # step 3 : Retrieve data from HDFS
     hdfs = SnakebiteClient(HDFS_HOST, port=HDFS_PORT)
     combined_stats = defaultdict(int)
     output_text = hdfs.cat([output_path + "part-*"]).next()

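Step 4 is then an ordinary assertion on the parsed output; a hedged sketch of what it could look like, assuming the tab-separated counts produced by the reducer above (the expected values are illustrative):

     # step 4 : parse the reducer output and assert on it
     for line in output_text.splitlines():
         word, count = line.split('\t', 1)
         combined_stats[word] += int(count)

     assert combined_stats['Hello'] == 2
     assert combined_stats['World'] == 2
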
  23. Testing MR Jobs LOCAL END TO END TESTING

     # Replace steps 1 - 3. Directly assert on the output.
     $ cat input_file.txt | ./mapper.py | sort | ./reducer.py > output.txt

  24. Testing MR Jobs BDD TESTING WITH LETTUCE/CUCUMBER

     Scenario: Testing the Job001 workflow
       When I add test1_input file into HDFS
       And I run the task Job001 on that file
       Then the task must complete with status Success
       And job output must match test1_output file

     Scenario: Navigate to DFS-Home and check the status of any directory
       When I go to "http://your-namenode:50070/dfshealth.jsp"
       Then the request should succeed
       And the directories should be "Active"

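Each scenario step maps to a Python step definition; a minimal Lettuce sketch for the first step, reusing the SnakebiteClient helper and constants from the earlier slides (the fixture path and destination directory are assumptions for illustration):

     from lettuce import step, world

     @step(r'I add (\w+) file into HDFS')
     def add_file_into_hdfs(step, filename):
         # copy the named fixture into HDFS so the job under test can read it
         hdfs = SnakebiteClient(HDFS_HOST, port=HDFS_PORT)
         hdfs.copy_from_local(
             local_src_dir='fixtures/' + filename,
             remote_dest_dir='/tmp/test_input/')
         world.input_file = filename
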
  25. HDFS and Map Reduce STRENGTHS • Linear scalability • You can easily predict job times for different data set sizes • File-type agnostic • Store and process all types of files - no need to specify a schema on load • Transparent parallelism
  26. Map Reduce DESIGN PATTERNS • Summarisation patterns • Counting • Min/Max • Statistics • Reverse indexing • Filtering patterns • Sampling • Top-N lists (see the sketch below) • Structural patterns • Combining data sets
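As an example of one of these patterns, a hedged sketch of a Top-N filter in streaming style: each mapper keeps only its local top N records, and a single reducer can then merge those small lists into the global top N. The input format ("word\tcount" lines) and N=10 are assumptions, not from the deck:

     import sys
     import heapq

     N = 10

     def top_n_mapper(lines):
         # keep only the N largest counts seen by this mapper
         top = []
         for line in lines:
             word, count = line.strip().split('\t', 1)
             heapq.heappush(top, (int(count), word))
             if len(top) > N:
                 heapq.heappop(top)
         # emit the local top-N; a single reducer merges them globally
         for count, word in top:
             print '%s\t%s' % (word, count)

     if __name__ == '__main__':
         top_n_mapper(sys.stdin)
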
  27. Map Reduce - one size fits all? MAYBE NOT… [diagram: chained jobs MR1 (M1, R1) and MR2 (M2, R2) with a full write to HDFS between them] • Most algorithms that can't be expressed as a single MR job find the chaining model very restrictive • Lack of flexibility inhibits optimisations • Full dump to disk between jobs - too much disk IO • Only one mapper and reducer per MR job
  28. Apache Spark INTRODUCTION • Fast and general engine for large-scale data processing • Works with Scala, Java, Python or R • Runs everywhere • Standalone cluster mode • HDFS with YARN/HBase • EC2
  29. Apache Spark COMPARISON TO HADOOP MAP REDUCE • Less code (see the PySpark sketch below) • Supports in-memory storage • Highly customizable • Resilient Distributed Datasets • Fault tolerant • Can be cached • 10x-100x faster than MR jobs http://www.slideshare.net/MapRTechnologies/spark-overviewjune2014
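The "less code" point is easy to see with the same word count written for PySpark; a minimal sketch, assuming a local Spark 1.x installation and reusing the deck's HDFS input path for illustration:

     from pyspark import SparkContext

     sc = SparkContext('local', 'wordcount')
     counts = (sc.textFile('hdfs://localhost:8020/tmp/test_log_file.txt')
                 .flatMap(lambda line: line.split())      # map: emit every word
                 .map(lambda word: (word, 1))             # pair each word with 1
                 .reduceByKey(lambda a, b: a + b))        # reduce: sum per word
     print counts.collect()
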