
Introduction to Hadoop and Big Data

Joe Alex
October 20, 2012

Transcript

1. Who am I
   • Joe Alex – Software Architect / Data Scientist
     – Loves to code in Java and Scala
     – Areas of interest: Big Data, Data Analytics, Machine Learning, Hadoop, Cassandra
     – Currently Team Lead for the Managed Security Services Portal at Verizon
2. New kind of data
   • Social – messages, posts, blogs, photos, videos, maps, graphs, friends
   • Machine – sensors, firewalls, routers, logs, metrics, health monitoring, cell phones, credit card transactions
3. New kind of data
   • Volume – massive, TBs to PBs
     – Convert 350 billion annual meter readings to better predict power consumption
     – Turn 12 terabytes of Tweets created each day into improved product sentiment analysis
   • Types – structured, semi-structured, unstructured
     – Text, audio, video, click streams, logs, machine data
     – Monitor hundreds of live video feeds from surveillance cameras to target points of interest
   • Velocity (time sensitive) – ideally processed as it streams (real-time, near-real-time, batch)
     – Scrutinize 5 million trade events created each day to identify potential fraud
     – Analyze 500 million daily call detail records in real time to predict customer churn faster
4. What is Big Data about
   • We are drowning in a sea of data, and we often throw much of it away
   • Still, we can't make much sense of it
   • We treat data as a cost
   • But data is an opportunity
   • This is what Big Data is about – new insights, new business
5. Big Data Domains
   • Digital marketing
   • Data discovery – patterns, trends
   • Fraud detection
   • Machine-generated data analytics – remote device insight, sensing, location-based intelligence
   • Social
   • Data retention
6. Big Data Architecture
   • Traditional – high availability, RDBMS, structured data
   • Big Data – high scalability/availability/flexibility, compute and storage on the same nodes, structured/semi-structured/unstructured data
7. Apache Hadoop
   • Open source project under the Apache Software Foundation
   • Based on papers published by Google
     – MapReduce: http://research.google.com/archive/mapreduce.html
     – GFS: http://research.google.com/archive/gfs.html
8. Reliability
   • "Failure is the defining difference between distributed and local programming"
     – Ken Arnold, CORBA designer
9. Why Hadoop
   • Data processed by Google every month: 400 PB… in 2007
   • Average job size: 180 GB
   • Time 180 GB of data would take to read sequentially off a single disk drive: roughly 45 minutes (180 GB ÷ 75 MB/sec ≈ 2,400 seconds ≈ 40 minutes)
   • Solution: parallel reads
     – 1 HDD ≈ 75 MB/sec
     – 1,000 HDDs ≈ 75 GB/sec (far more acceptable)
   • Data access speed is the bottleneck – we can process data very quickly, but we can only read/write it very slowly
10. Core Components
   • Hadoop consists of two core components
     – The Hadoop Distributed File System (HDFS)
     – MapReduce
   • There are many other projects based around core Hadoop
     – Often referred to as the "Hadoop Ecosystem"
     – Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.
   • A set of machines running HDFS and MapReduce is known as a Hadoop cluster
   • Individual machines are known as nodes; a cluster can have as few as one node or as many as several thousand
   • More nodes = better performance
11. System Requirements
   • The system should support partial failure
     – Failure of one part of the system should result in a graceful decline in performance, not a full halt
   • The system should support data recoverability
     – If components fail, their workload should be picked up by still-functioning units
   • The system should support individual recoverability
     – Nodes that fail and restart should be able to rejoin the group activity without a full group restart
12. System Requirements (cont'd)
   • The system should be consistent
     – Concurrent operations or partial internal failures should not change the results of a job
   • The system should be scalable
     – Adding load to the system should not cause outright failure; instead, it should result in a graceful decline
     – Increasing resources should support a proportional increase in load capacity
13. Hadoop's radical approach
   • Hadoop provides a radical approach to these issues:
     – Nodes talk to each other as little as possible – ideally, never
     – This is known as a "shared nothing" architecture
     – The programmer should not write code that explicitly communicates between nodes
   • Data is spread throughout the machines in the cluster
     – Data distribution happens when data is loaded onto the cluster
   • Instead of bringing data to the processors, Hadoop brings the processing to the data
14. Hadoop's radical approach
   • Batch oriented
   • Data locality (code is shipped to the data, not the other way around)
   • Heavy parallelization
   • Process management
   • Append-only files
   • Express your computation in MapReduce and get parallelism and scalability for free
15. Core Hadoop Daemons
   • Each node in a Hadoop installation runs one or more daemons executing MapReduce code or HDFS commands. Each daemon's responsibilities in the cluster are:
     – NameNode: manages HDFS and communicates with every DataNode daemon in the cluster
     – JobTracker: dispatches jobs and assigns splits to mappers or reducers as each stage completes
     – TaskTracker: executes tasks sent by the JobTracker and reports status
     – DataNode: manages HDFS content on the node and updates status to the NameNode
16. Config files
   • hadoop-env.sh — environment configuration, JVM settings, etc.
   • core-site.xml — site-wide configuration
   • hdfs-site.xml — HDFS block size, NameNode and DataNode directories
   • mapred-site.xml — total MapReduce tasks, JobTracker address
   • masters, slaves files — NameNode, JobTracker, DataNode, and TaskTracker addresses, as appropriate (example snippets follow below)
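For illustration, minimal Hadoop 1.x-era versions of these files might look roughly like this; the hostnames, ports, and values below are assumptions for the sketch, not taken from the deck:

   <!-- core-site.xml -->
   <configuration>
     <property>
       <name>fs.default.name</name>
       <value>hdfs://namenode.example.com:8020</value>  <!-- where clients find the NameNode -->
     </property>
   </configuration>

   <!-- hdfs-site.xml -->
   <configuration>
     <property>
       <name>dfs.block.size</name>
       <value>67108864</value>  <!-- 64 MB blocks -->
     </property>
     <property>
       <name>dfs.replication</name>
       <value>3</value>  <!-- replicas per block -->
     </property>
   </configuration>

   <!-- mapred-site.xml -->
   <configuration>
     <property>
       <name>mapred.job.tracker</name>
       <value>jobtracker.example.com:8021</value>  <!-- where TaskTrackers and clients find the JobTracker -->
     </property>
   </configuration>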
17. HDFS: Hadoop Distributed File System
   • Based on Google's GFS (Google File System)
   • Provides redundant storage of massive amounts of data
     – Using cheap, unreliable computers
   • At load time, data is distributed across all nodes
     – Provides for efficient MapReduce processing
18. HDFS Assumptions
   • High component failure rates
     – Inexpensive components fail all the time
   • A "modest" number of HUGE files
     – Just a few million
     – Each file likely to be 100 MB or larger; multi-gigabyte files are typical
   • Large streaming reads
     – Not random access
   • High sustained throughput is favored over low latency
19. HDFS Features
   • Operates "on top of" an existing filesystem
   • Files are stored as blocks
     – Much larger than for most filesystems – the default is 64 MB
   • Provides reliability through replication
     – Each block is replicated across three or more DataNodes
   • A single NameNode stores metadata and coordinates access
     – Provides simple, centralized management
   • No data caching
     – Would provide little benefit due to large datasets and streaming reads
   • A familiar interface, but a customized API
     – Simplify the problem and focus on distributed applications (a short example using the HDFS Java API follows below)
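As a minimal sketch of that API, the following writes and reads a file on HDFS through the Java client; the path and file contents are made up for the example:

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FSDataInputStream;
   import org.apache.hadoop.fs.FSDataOutputStream;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;

   public class HdfsExample {
     public static void main(String[] args) throws Exception {
       Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml from the classpath
       FileSystem fs = FileSystem.get(conf);          // HDFS client handle

       Path path = new Path("/user/joe/hello.txt");   // hypothetical path
       FSDataOutputStream out = fs.create(path);      // HDFS splits the file into blocks and replicates them
       out.writeUTF("hello, hdfs");
       out.close();

       FSDataInputStream in = fs.open(path);          // large, streaming reads are the intended access pattern
       System.out.println(in.readUTF());
       in.close();
     }
   }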
20. MapReduce
   • MapReduce is a method for distributing a task across multiple nodes in the Hadoop cluster
   • Consists of two phases: Map, and then Reduce
     – Between the two is a stage known as the shuffle and sort
   • Each Map task operates on a discrete portion of the overall dataset, typically one HDFS block of data
   • After all Maps are complete, the MapReduce system distributes the intermediate data to the nodes which perform the Reduce phase
21. Features of MapReduce
   • Automatic parallelization and distribution
   • Fault tolerance
   • Status and monitoring tools
   • A clean abstraction for programmers
     – MapReduce programs are usually written in Java
     – They can be written in any scripting language using Hadoop Streaming
     – All of Hadoop itself is written in Java
   • MapReduce abstracts all the "housekeeping" away from the developer
     – The developer can concentrate simply on writing the Map and Reduce functions
22. MapReduce example
   • Map
       // assume input is a set of text files;
       // k is a line offset, v is the line at that offset
       let map(k, v) =
         for each word in v:
           emit(word, 1)
   • Reduce
       // k is a word, vals is a list of 1s
       let reduce(k, vals) =
         emit(k, vals.length())
23. Streaming API
   • Many organizations have developers skilled in languages other than Java
     – Perl, Ruby, Python, etc.
   • The Streaming API allows developers to use any language they wish to write Mappers and Reducers
     – As long as the language can read from standard input and write to standard output (an example invocation follows below)
   • Advantages of the Streaming API:
     – No need for non-Java coders to learn Java
     – Fast development time
     – Ability to use existing code libraries
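For illustration, a streaming job can be launched from the command line roughly like this, with ordinary Unix programs acting as mapper and reducer; the jar location varies by version and distribution, and the input/output paths are hypothetical:

   hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
       -input  /user/joe/input \
       -output /user/joe/output \
       -mapper /bin/cat \
       -reducer /usr/bin/wc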
24. Job Driver
   • Job Driver (a complete, runnable sketch follows below):
       JobConf conf = new JobConf(WordCount.class);
       FileInputFormat.setInputPaths(conf, new Path(args[0]));
       FileOutputFormat.setOutputPath(conf, new Path(args[1]));
       conf.setMapperClass(WordMapper.class);
       conf.setMapOutputKeyClass(Text.class);
       conf.setMapOutputValueClass(IntWritable.class);
       conf.setReducerClass(SumReducer.class);
       conf.setOutputKeyClass(Text.class);
       conf.setOutputValueClass(IntWritable.class);
       JobClient.runJob(conf);
   • The driver is submitted to the Hadoop cluster for processing, along with the rest of the code in a .jar file.
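For context, here is a minimal sketch of how that snippet is usually wrapped into a complete driver class using the old org.apache.hadoop.mapred API; the class/method structure, imports, and job name are assumptions, while the configuration calls are the ones from the slide:

   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.io.IntWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapred.FileInputFormat;
   import org.apache.hadoop.mapred.FileOutputFormat;
   import org.apache.hadoop.mapred.JobClient;
   import org.apache.hadoop.mapred.JobConf;

   public class WordCount {
     public static void main(String[] args) throws Exception {
       JobConf conf = new JobConf(WordCount.class);
       conf.setJobName("wordcount");                        // optional; shows up in the JobTracker UI

       FileInputFormat.setInputPaths(conf, new Path(args[0]));
       FileOutputFormat.setOutputPath(conf, new Path(args[1]));

       conf.setMapperClass(WordMapper.class);
       conf.setMapOutputKeyClass(Text.class);
       conf.setMapOutputValueClass(IntWritable.class);

       conf.setReducerClass(SumReducer.class);
       conf.setOutputKeyClass(Text.class);
       conf.setOutputValueClass(IntWritable.class);

       JobClient.runJob(conf);                              // submits the job and waits for completion
     }
   }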
25. Mapper
   • The basic Java code implementation for the mapper has the form (a filled-in sketch of the body follows below):
       public class WordMapper extends MapReduceBase
           implements Mapper<LongWritable, Text, Text, IntWritable> {
         public void map(LongWritable key, Text value,
                         OutputCollector<Text, IntWritable> collector,
                         Reporter reporter) throws IOException {
           /* implementation here */
         }
       }
   • The implementation itself uses standard Java text manipulation tools; you can use regular expressions, scanners, whatever is necessary.
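For word count, the body left as /* implementation here */ might be filled in roughly as follows; the whitespace tokenizing logic is an assumption for the sketch, not taken from the deck:

   import java.io.IOException;
   import org.apache.hadoop.io.IntWritable;
   import org.apache.hadoop.io.LongWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapred.MapReduceBase;
   import org.apache.hadoop.mapred.Mapper;
   import org.apache.hadoop.mapred.OutputCollector;
   import org.apache.hadoop.mapred.Reporter;

   public class WordMapper extends MapReduceBase
       implements Mapper<LongWritable, Text, Text, IntWritable> {

     private static final IntWritable ONE = new IntWritable(1);
     private final Text word = new Text();

     public void map(LongWritable key, Text value,
                     OutputCollector<Text, IntWritable> collector,
                     Reporter reporter) throws IOException {
       // Split the line on whitespace and emit (word, 1) for every token
       for (String token : value.toString().split("\\s+")) {
         if (!token.isEmpty()) {
           word.set(token);
           collector.collect(word, ONE);
         }
       }
     }
   }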
26. Reducer
   • Reducer (a filled-in sketch follows below):
       public class SumReducer extends MapReduceBase
           implements Reducer<Text, IntWritable, Text, IntWritable> {
         public void reduce(Text key, Iterator<IntWritable> values,
                            OutputCollector<Text, IntWritable> collector,
                            Reporter reporter) throws IOException {
           /* implementation */
         }
       }
   • The reducer iterates over the keys and values generated in the previous step and sums up the occurrences of each word
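A minimal sketch of that summing logic, matching the mapper above:

   import java.io.IOException;
   import java.util.Iterator;
   import org.apache.hadoop.io.IntWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapred.MapReduceBase;
   import org.apache.hadoop.mapred.OutputCollector;
   import org.apache.hadoop.mapred.Reducer;
   import org.apache.hadoop.mapred.Reporter;

   public class SumReducer extends MapReduceBase
       implements Reducer<Text, IntWritable, Text, IntWritable> {

     public void reduce(Text key, Iterator<IntWritable> values,
                        OutputCollector<Text, IntWritable> collector,
                        Reporter reporter) throws IOException {
       // Sum the 1s emitted by the mappers for this word
       int sum = 0;
       while (values.hasNext()) {
         sum += values.next().get();
       }
       collector.collect(key, new IntWritable(sum));
     }
   }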
27. Input/Output Formats
   • Input Formats (example below)
     – KeyValueTextInputFormat — each line represents a key and value delimited by a separator
     – TextInputFormat — the key is the byte offset, the value is the text of the line
     – SequenceFileInputFormat — raw format of serialized key/value pairs
   • Output Formats
     – Specify the final output
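As a sketch, continuing the JobConf named conf from the driver on slide 24, the formats are selected before the job is run (the commented lines show alternatives):

   import org.apache.hadoop.mapred.KeyValueTextInputFormat;
   import org.apache.hadoop.mapred.SequenceFileOutputFormat;
   import org.apache.hadoop.mapred.TextInputFormat;
   import org.apache.hadoop.mapred.TextOutputFormat;

   conf.setInputFormat(TextInputFormat.class);       // key = byte offset, value = the line (the default)
   conf.setOutputFormat(TextOutputFormat.class);     // writes key<TAB>value text lines
   // conf.setInputFormat(KeyValueTextInputFormat.class);     // each line already split into key/value
   // conf.setOutputFormat(SequenceFileOutputFormat.class);   // serialized binary key/value pairs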
28. Hadoop Eco System: Hive
   • Hive – SQL-based data warehousing app (example query below)
     – Data analysts are far more familiar with SQL than with Java, etc.
     – Hive allows users to query data using HiveQL, a language very similar to standard SQL
     – Hive turns HiveQL queries into standard MapReduce jobs, automatically runs the jobs, and displays the results to the user
     – Note that Hive is not an RDBMS
       • Results take many seconds, minutes, or even hours to be produced
       • It is not possible to modify the data using HiveQL
     – Features for analyzing very large data sets
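For illustration, a hypothetical HiveQL word count over a table of log lines might look like this; the table and column names are made up, and the whole thing is compiled into MapReduce jobs by Hive:

   CREATE TABLE logs (line STRING);
   LOAD DATA INPATH '/user/joe/logs' INTO TABLE logs;

   SELECT word, COUNT(*) AS cnt
   FROM logs LATERAL VIEW explode(split(line, ' ')) w AS word
   GROUP BY word;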
29. Hadoop Eco System: Pig
   • Pig – data-flow oriented language (example script below)
     – Pig can be used as an alternative to writing MapReduce jobs in Java (or some other language)
     – Provides a scripting language known as Pig Latin
     – Abstracts MapReduce details away from the user
     – Made up of a set of operations that are applied to the input data to produce output
     – Fairly easy to write complex tasks such as joins of multiple datasets
     – Under the covers, Pig Latin scripts are converted to MapReduce jobs
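As a sketch, the same word count expressed in Pig Latin (paths are illustrative):

   -- load each line as a single chararray field
   lines  = LOAD '/user/joe/input' USING TextLoader() AS (line:chararray);
   words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
   grpd   = GROUP words BY word;
   counts = FOREACH grpd GENERATE group, COUNT(words);
   STORE counts INTO '/user/joe/wordcount-output';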
30. Hadoop Eco System: HBase
   • HBase – distributed, sparse, column-oriented datastore (a short client example follows below)
     – Distributed: designed to use multiple machines to store and serve data
     – Sparse: each row may or may not have values for all columns
     – Column-oriented: data is stored grouped by column rather than by row; columns are grouped into "column families", which define which columns are physically stored together
     – Leverages HDFS
     – Modeled after Google's BigTable datastore
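A minimal sketch of a put and a get using the HBase Java client API of that era; the table, row key, column family, and values below are hypothetical:

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.hbase.HBaseConfiguration;
   import org.apache.hadoop.hbase.client.Get;
   import org.apache.hadoop.hbase.client.HTable;
   import org.apache.hadoop.hbase.client.Put;
   import org.apache.hadoop.hbase.client.Result;
   import org.apache.hadoop.hbase.util.Bytes;

   public class HBaseExample {
     public static void main(String[] args) throws Exception {
       Configuration conf = HBaseConfiguration.create();     // reads hbase-site.xml from the classpath
       HTable table = new HTable(conf, "metrics");           // hypothetical table name

       Put put = new Put(Bytes.toBytes("host01"));           // row key
       put.add(Bytes.toBytes("cpu"), Bytes.toBytes("load"),  // column family "cpu", qualifier "load"
               Bytes.toBytes("0.75"));
       table.put(put);

       Get get = new Get(Bytes.toBytes("host01"));
       Result result = table.get(get);
       byte[] value = result.getValue(Bytes.toBytes("cpu"), Bytes.toBytes("load"));
       System.out.println(Bytes.toString(value));

       table.close();
     }
   }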
31. Hadoop Eco System: Others
   • Flume
     – Flume is a distributed, reliable, available service for efficiently moving large amounts of data as it is produced
     – Ideally suited to gathering logs from multiple systems and inserting them into HDFS as they are generated
   • Sqoop
     – Sqoop is "the SQL-to-Hadoop database import tool"
     – Designed to import data from an RDBMS into Hadoop
     – Can also send data the other way, from Hadoop to an RDBMS
     – Uses JDBC to connect to the RDBMS (an example import follows below)
   • Oozie
     – Workflow scheduler for chaining Hadoop jobs together
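As a sketch, a Sqoop import from a relational table into HDFS might be invoked like this; the JDBC URL, credentials, table name, and target directory are made up, and -P prompts for the password on the console:

   sqoop import \
       --connect jdbc:mysql://db.example.com/sales \
       --username report -P \
       --table orders \
       --target-dir /user/joe/orders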
32. Hadoop Eco System: Others
   • ZooKeeper
     – Distributed consensus engine
     – Provides well-defined concurrent access semantics:
       • Leader election
       • Service discovery
       • Distributed locking / mutual exclusion
   • Avro
     – Serialization and RPC framework
   • Mahout
     – Machine learning library
33. Next Gen
   • Storm – distributed realtime computation; makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing
   • Spark – an open source cluster computing system that aims to make data analytics fast
   • Impala – low-latency, real-time queries over data stored in Hadoop