
Intro to Big Data using Hadoop

Sergejus
April 21, 2012


Transcript

  1. Information is powerful… but it is how we use it that will define us
  2. Big Data (globally)
     – creates over 30 billion pieces of content per day
     – stores 30 petabytes of data
     – produces over 90 million tweets per day
  3. Big Data (our example)
     – logs over 300 gigabytes of transactions per day
     – stores more than 1.5 terabytes of aggregated data
  4. Big Data Challenges
     – sorting 10 TB on a single node takes ~2.5 days; a 100-node cluster does it in ~35 minutes
     – the speedup is roughly linear: 2.5 days ≈ 3,600 minutes, and 3,600 / 100 ≈ 36 minutes
  5. Big Data Challenges
     – “Fat” servers imply high cost: use cheap commodity nodes instead
     – a large number of cheap nodes implies frequent failures: leverage automatic fault-tolerance
  6. Big Data Challenges
     We need a new data-parallel programming model for clusters of commodity machines
  7. MapReduce
     – published in 2004 by Google: "MapReduce: Simplified Data Processing on Large Clusters"
     – popularized by the Apache Hadoop project: used by Yahoo!, Facebook, Twitter, Amazon, …
  8. Word Count Example (diagram: Input → Map → Shuffle & Sort → Reduce → Output)
     Input lines: "the quick brown fox", "the fox ate the mouse", "how now brown cow"
     Output: the, 3; brown, 2; fox, 2; how, 1; now, 1; quick, 1; ate, 1; mouse, 1; cow, 1
  9. Word Count Example (diagram, continued: the Map stage)
     Each mapper emits (word, 1) for every word in its input line:
     "the quick brown fox" → the, 1; quick, 1; brown, 1; fox, 1
     "the fox ate the mouse" → the, 1; fox, 1; ate, 1; the, 1; mouse, 1
     "how now brown cow" → how, 1; now, 1; brown, 1; cow, 1
  10. Word Count Example (diagram, continued: Shuffle & Sort and Reduce)
      Shuffle & Sort groups the pairs by key: the, [1,1,1]; brown, [1,1]; fox, [1,1]; how, [1]; now, [1]; quick, [1]; ate, [1]; mouse, [1]; cow, [1]
      Each reducer sums the grouped values: the, 3; brown, 2; fox, 2; how, 1; now, 1; quick, 1; ate, 1; mouse, 1; cow, 1
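      A minimal Java sketch of the two phases above, written against the Hadoop MapReduce API; the class names WordCountMapper and WordCountReducer are illustrative, not taken from the deck:

      import java.io.IOException;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;

      // Map phase: emit (word, 1) for every word in the input line.
      class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          protected void map(LongWritable offset, Text line, Context context)
                  throws IOException, InterruptedException {
              for (String token : line.toString().split("[^a-zA-Z]+")) {
                  if (!token.isEmpty()) {
                      word.set(token.toLowerCase());
                      context.write(word, ONE);
                  }
              }
          }
      }

      // Reduce phase: sum the 1s that the shuffle grouped under each word.
      class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          @Override
          protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable count : counts) {
                  sum += count.get();
              }
              context.write(word, new IntWritable(sum));
          }
      }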
  11. Hadoop Overview
      – open source implementation of the Google MapReduce paper and the Google File System (GFS) paper
      – first release in 2008 by Yahoo!, with wide adoption by Facebook, Twitter, Amazon, etc.
  12. Hadoop Core (HDFS)
      Layer diagram: MapReduce (Job Scheduling / Execution System) on top of the Hadoop Distributed File System (HDFS)
      • Name Node stores file metadata
      • files split into 64 MB blocks
      • blocks replicated across 3 Data Nodes
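      A minimal sketch of writing a file into HDFS with the Java FileSystem API; the path is hypothetical, and the block size and replication settings simply mirror the defaults named on the slide (64 MB blocks, 3 replicas):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class HdfsWriteSketch {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              conf.setLong("dfs.block.size", 64L * 1024 * 1024); // 64 MB blocks (slide default)
              conf.setInt("dfs.replication", 3);                 // 3 replicas per block (slide default)

              FileSystem fs = FileSystem.get(conf);              // metadata operations go to the Name Node
              Path path = new Path("/example/count/input.txt");  // hypothetical path
              FSDataOutputStream out = fs.create(path);          // block contents are streamed to Data Nodes
              out.writeBytes("the quick brown fox\n");
              out.close();
              fs.close();
          }
      }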
  13. Hadoop Core (HDFS)
      Diagram: MapReduce (Job Scheduling / Execution System) on top of HDFS, with the Name Node and Data Nodes inside the HDFS layer
  14. Hadoop Core (MapReduce)
      • Job Tracker distributes tasks and handles failures
      • tasks are assigned based on data locality
      • Task Trackers can execute multiple tasks
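      A minimal sketch of the job submission side, assuming the WordCountMapper and WordCountReducer classes from the earlier sketch and hypothetical input/output paths; once submitted, the Job Tracker schedules the map tasks on Task Trackers, preferring nodes that already hold the input blocks (data locality):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class WordCountDriver {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              Job job = new Job(conf, "word count"); // Hadoop 1.x style; later releases use Job.getInstance(conf)
              job.setJarByClass(WordCountDriver.class);

              job.setMapperClass(WordCountMapper.class);
              job.setReducerClass(WordCountReducer.class);
              job.setOutputKeyClass(Text.class);
              job.setOutputValueClass(IntWritable.class);

              FileInputFormat.addInputPath(job, new Path("/example/count/input"));    // hypothetical paths
              FileOutputFormat.setOutputPath(job, new Path("/example/count/output"));

              // waitForCompletion submits the job to the Job Tracker and polls until it finishes.
              System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
      }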
  15. Hadoop Core (MapReduce)
      Diagram: the Job Tracker and Task Trackers in the MapReduce layer, alongside the Name Node and Data Nodes in the HDFS layer
  16. Hadoop Ecosystem
      Diagram: components built around HDFS and MapReduce, including HBase, Pig (ETL), Hive (BI), Sqoop (RDBMS), Avro, and Zookeeper
  17. JavaScript MapReduce

      // map: emit (word, 1) for every alphabetic word in the input value
      var map = function (key, value, context) {
          var words = value.split(/[^a-zA-Z]/);
          for (var i = 0; i < words.length; i++) {
              if (words[i] !== "") {
                  context.write(words[i].toLowerCase(), 1);
              }
          }
      };

      // reduce: sum the counts grouped under each word
      var reduce = function (key, values, context) {
          var sum = 0;
          while (values.hasNext()) {
              sum += parseInt(values.next());
          }
          context.write(key, sum);
      };
  18. Pig

      words = LOAD '/example/count' AS (word: chararray, count: int);
      popular_words = ORDER words BY count DESC;
      top_popular_words = LIMIT popular_words 10;
      DUMP top_popular_words;
  19. Hive

      CREATE EXTERNAL TABLE WordCount (
          word string,
          count int
      )
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE
      LOCATION "/example/count";

      SELECT * FROM WordCount ORDER BY count DESC LIMIT 10;