Big Data: a brief introduction using Hadoop

Mattia Bertorello

September 22, 2014

Transcript

  1. Summary • The Big Data Hype • Big Data introduction

    • NoSQL • Hadoop Framework: HDFS, YARN • MapReduce • High level languages: Hive, Pig, Impala • Hadoop Connectors • Demo time
  2. The Hype An interesting Google Trend Big Data factoid is

    that India and South Korea are rated with the highest interest, with the USA a distant third. So, all of the Big Data vendors should now focus on India and South Korea, and leave my email inbox clean Steve Hamby, CTO Orbis Technologies http://www.huffingtonpost.com/steve-hamby/the-big-data-nemesis-simp_b_1940169.html
  3. The Hype Everything is on the Internet. The Internet has

    a lot of data. Therefore, everything is big data. Alistair Croll http://radar.oreilly.com/2012/08/three-kinds-of-big-data.html
  4. The Hype Gartner got its hype cycle wrong this time.

    Big data is already well along on the so-called Plateau of Productivity as its countless success stories already prove. …Today, it is those big data skeptics that we should not take too seriously. Irfan Khan, Vice President and Chief Technology Officer for Sybase http://www.itworld.com/it-managementstrategy/293397/gartner-dead-wrong-about-big-data-hype-cycle
  5. What is Big Data? Big data is an all-encompassing term

    for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications. http://en.wikipedia.org/wiki/Big_data
  6. What is Big Data? “The term ‘Big Data’ applies to

    information that can't be processed or analyzed using traditional processes or tools.”
  7. What is Big Data? “Big Data” is data whose scale,

    diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it… No single standard definition
  8. The 3 (sometimes 4) V’s of Big Data Volume:

    data size • Velocity: speed of change, streaming data • Variety: different forms of data sources • Veracity: uncertainty of the data
  9. Why Big Data? From the dawn of civilization until 2003,

    humankind generated five exabytes of data. Now we produce five exabytes every two days… and the pace is accelerating. Eric Schmidt, Executive Chairman, Google
  10. Where is Big Data? Health care • Telecom data •

    Insurance • Retail transactions • Public sector • Social media activity • Financial data
  11. Why Big Data? Most of the software used for Big

    Data projects is open-source.
  12. NoSQL - Features Schemas: typically dynamic • Scaling:

    horizontal • Development model: open-source • Supports transactions: no • Data manipulation: through object-oriented APIs
  13. NoSQL - Data models Key-value store: Redis, Riak, Voldemort

    • Column-oriented store: Cassandra, HBase • Document-oriented store: MongoDB, CouchDB • Graph database: Neo4j, OrientDB
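As an added illustration (not from the deck) of the key-value model above, and of the "data manipulation through object-oriented APIs" point on the features slide, here is a minimal sketch using the Jedis client for Redis; the host, port and key names are assumptions made for the example.

    // Minimal sketch: manipulating a key-value store (Redis) through an
    // object-oriented API, using the Jedis client. Host and keys are
    // illustrative; the deck itself shows no client code.
    import redis.clients.jedis.Jedis;

    public class KeyValueExample {
        public static void main(String[] args) {
            Jedis jedis = new Jedis("localhost", 6379);        // local Redis server
            jedis.set("page:home:views", "42");                // store a value under a key
            jedis.incr("page:home:views");                     // atomic increment, no schema needed
            System.out.println(jedis.get("page:home:views")); // prints 43
            jedis.close();
        }
    }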
  14. NoSQL - CAP Theorem Consistency, Availability, Partition tolerance:

    pick two. CP: HBase, MongoDB, Redis • AP: Cassandra, CouchDB, Dynamo, Riak, Voldemort • CA: RDBMSs (MySQL, Postgres), Neo4j
  15. Hadoop A scalable fault-tolerant distributed system for data storage and

    processing on a cluster. (open source under Apache license)
  16. Hadoop Hadoop moves the data-processing tasks to where the

    data is actually stored, rather than moving the data to the computation as OpenMPI-like software does
  17. Hadoop Core Hadoop (2.x) has two main systems: •

    HDFS: Hadoop Distributed File System • YARN: Yet Another Resource Negotiator
  18. HDFS Features Scalable, high-performance, simple to expand •

    Fault tolerance: data replication • Data divided into blocks of 64 MB (by default) • Runs on commodity hardware (e.g. no RAID) • Not fully POSIX-compliant • Write-once and read-many
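As a minimal sketch of the write-once / read-many access pattern through the HDFS Java FileSystem API (added for illustration; the path, replication factor and cluster configuration are assumptions, not taken from the deck):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/demo/sample.txt"); // hypothetical path

            // Write the file once; HDFS splits it into blocks and replicates them.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("hello hdfs");
            }
            fs.setReplication(file, (short) 3);            // request 3 replicas

            // Read it back as many times as needed.
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
            fs.close();
        }
    }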
  19. HDFS Architecture [diagram] The Namenode holds the metadata

    (e.g. /home/data/largeFile mapped to blocks {1,2,3,4,5,6}); the Datanodes, spread across Rack 1 and Rack 2, store the replicated blocks; clients perform metadata operations against the Namenode and read/write block operations directly against the Datanodes
  20. YARN - Why? [diagram] Hadoop v1: HDFS (distributed redundant

    storage) + MapReduce (both resource management and data processing). Hadoop v2: HDFS (distributed redundant storage) + YARN (resource management) + MapReduce and other frameworks, e.g. MPI (data processing)
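As an added, hedged sketch of what this change means in practice: on Hadoop 2 a MapReduce job is submitted as a YARN application. The snippet uses the newer org.apache.hadoop.mapreduce API with the default (identity) Mapper and Reducer so it stays self-contained; the class and job names are made up, not from the deck.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SubmitToYarn {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // On Hadoop 2 this submits an application to the YARN ResourceManager,
            // which starts an ApplicationMaster that requests containers and runs
            // the map and reduce tasks inside them.
            Job job = Job.getInstance(conf, "identity-copy");
            job.setJarByClass(SubmitToYarn.class);
            // Default (identity) Mapper and Reducer: each record (byte offset, line)
            // passes through map, shuffle and reduce unchanged; the point here is
            // only the submission path through YARN.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }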
  21. MapReduce - Functional Programming val a = Array(1,2,3)

    Map: a.map(_ + 1) gives Array[Int] = Array(2, 3, 4) • Reduce: a.reduce(_ + _) gives Int = 6
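For comparison (an added sketch, not part of the deck), the same map and reduce steps expressed with Java 8 streams, the deck's main language:

    import java.util.Arrays;

    public class MapReduceFp {
        public static void main(String[] args) {
            int[] a = {1, 2, 3};
            int[] mapped = Arrays.stream(a).map(x -> x + 1).toArray(); // [2, 3, 4]
            int sum = Arrays.stream(a).reduce(0, Integer::sum);        // 6
            System.out.println(Arrays.toString(mapped) + " " + sum);
        }
    }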
  22. MapReduce - Hadoop framework [diagram] INPUT is fed to MAP

    tasks, each (K, V) -> (K, V); their output is shuffled to REDUCE tasks, each (K, [V, ...]) -> (K, V), which write the OUTPUT (many map tasks feed fewer reduce tasks)
  23. MapReduce Hello World [word-count walkthrough] Input lines containing the

    words Hadoop, HBase, BigData and NoSQL go through Splitting (one split per map task), Mapping (each word emitted as (word, 1)), Shuffling (pairs grouped by key) and Reducing (counts summed per key); final result: Hadoop, 2 • HBase, 2 • BigData, 3 • NoSQL, 2
  24. MapReduce Word count Java

    public class WordCount extends Configured implements Tool {

      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {

        static enum Counters { INPUT_WORDS }

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // In the full example these fields are filled in configure(JobConf)
        // from job parameters; they are declared here so the snippet is complete.
        private boolean caseSensitive = true;
        private List<String> patternsToSkip = new ArrayList<String>();
        private long numRecords = 0;
        private String inputFile;

        public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          String line = (caseSensitive) ? value.toString()
              : value.toString().toLowerCase();
          for (String pattern : patternsToSkip) {
            line = line.replaceAll(pattern, "");
          }
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
            reporter.incrCounter(Counters.INPUT_WORDS, 1);
          }
          if ((++numRecords % 100) == 0) {
            reporter.setStatus("Finished processing " + numRecords
                + " records " + "from the input file: " + inputFile);
          }
        }
      }
  25. MapReduce Word count Java

      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }

      public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // other_args: the non-option command-line arguments; the full example
        // builds this list while parsing options such as -skip.
        List<String> other_args = new ArrayList<String>(Arrays.asList(args));

        FileInputFormat.setInputPaths(conf, new Path(other_args.get(0)));
        FileOutputFormat.setOutputPath(conf, new Path(other_args.get(1)));

        JobClient.runJob(conf);
        return 0;
      }

      public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new WordCount(), args);
        System.exit(res);
      }
    }
  26. Hive Data warehouse infrastructure built on top of Hadoop •

    Query the data using a SQL-like language called HiveQL • Built-in user-defined functions (UDFs) • Custom mappers and reducers • Different storage types: HDFS or HBase
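The deck mentions user-defined functions without showing one; as a hedged sketch, a minimal Hive UDF in Java using the classic org.apache.hadoop.hive.ql.exec.UDF base class (the class name and behaviour here are made up for illustration):

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    // Minimal Hive UDF sketch: upper-cases a string column.
    public class ToUpper extends UDF {
        // Hive calls evaluate() once per row; null in, null out.
        public Text evaluate(Text input) {
            if (input == null) {
                return null;
            }
            return new Text(input.toString().toUpperCase());
        }
    }

Packaged in a jar, such a function would be registered with ADD JAR and CREATE TEMPORARY FUNCTION to_upper AS 'ToUpper', and then used in HiveQL like any built-in function.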
  27. Hive Does NOT offer real-time queries or row-level updates •

    Query execution via MapReduce • Hive on Spark is not available yet (HIVE-7292)
  28. Hive Examples

    SELECT hour_view, sum(counter)
    FROM (
      SELECT lang, hour(from_unixtime(datetime)) as hour_view,
             sum(page_views) as counter
      FROM page_view_wiki_parquet
      WHERE lang = 'it'
      GROUP BY lang, datetime
    ) it_views
    GROUP BY hour_view
    ORDER BY hour_view ASC
  29. Pig Latin Pig Latin is procedural and fits very naturally

    into the pipeline paradigm, while SQL is instead declarative • A Pig Latin script describes a directed acyclic graph (DAG) • It is able to store data at any point during the pipeline
  30. Hive Examples

    CREATE EXTERNAL TABLE page_counts_gz(lang STRING, page_name STRING,
      page_views BIGINT, page_weight BIGINT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
    STORED AS TEXTFILE
    LOCATION '/user/admin/pagecounts_text';

    CREATE EXTERNAL TABLE page_counts_parquet(lang STRING, page_name STRING,
      page_views BIGINT, page_weight BIGINT, datetime BIGINT)
    STORED AS PARQUET
    LOCATION '/user/admin/pagecounts_parquet';
  31. Hive Examples

    INSERT OVERWRITE TABLE page_counts_parquet
    SELECT pcg.lang, pcg.page_name, pcg.page_views, pcg.page_weight,
      (unix_timestamp(
         regexp_extract(INPUT__FILE__NAME, '.*?pagecounts-([0-9]*-[0-9]*)', 1),
         'yyyyMMdd-HHmmss') + 7200) as datetime
    FROM page_counts_gz pcg;
  32. Pig Latin Examples

    A = LOAD 'page_views_text' USING PigStorage(' ', '-tagFile')
        AS (filename:chararray, lang:chararray, page_name:chararray,
            view_count:long, page_size:long);
    B = FOREACH A {
        date = REGEX_EXTRACT(filename, '.*?pagecounts-([0-9]*-[0-9]*)', 1);
        date_object = ToDate(date, 'yyyyMMdd-HHmmss', '+00:00');
        unixtime = ToUnixTime(date_object);
        GENERATE (long)unixtime AS datetime, lang, page_name, view_count, page_size;
    }
    STORE B INTO 'page_views_parquet' USING parquet.pig.ParquetStorer;
  33. Cloudera Impala Open-source Massively Parallel Processing (MPP) SQL

    query engine for data stored in a computer cluster running Apache Hadoop • Real-time queries over HDFS and HBase • Easy for Hive users to migrate • For some things MapReduce is just too slow
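As a hedged sketch (not from the deck) of why migration is easy for Hive users: Impala speaks the HiveServer2 protocol, so it can be queried from Java through the standard JDBC API. The host name, the port (21050 is Impala's usual HiveServer2-protocol port) and the noSasl setting assume an unsecured test cluster; the query reuses the page_counts_parquet table from the Hive slides, assuming it is visible in the shared metastore.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ImpalaQuery {
        public static void main(String[] args) throws Exception {
            // HiveServer2-compatible JDBC driver, assumed to be on the classpath.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://impala-host:21050/default;auth=noSasl");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT lang, sum(page_views) FROM page_counts_parquet GROUP BY lang")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }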
  34. Hadoop Connectors Sqoop: transferring data between relational databases and

    Hadoop • Flume: collecting, aggregating, and moving large amounts of log data • MongoDB Hadoop Connector
  35. Other tools or possible presentations What is the next big

    thing in big data? It’s called Spark. • Streaming data: S4, Storm, Spark Streaming, Kafka • Graph analysis: Giraph, GraphX (Spark) • Machine learning: Mahout, MLlib (Spark) • Other NoSQL databases: Solr, HBase, Cassandra