Slide 1

Big Data: a brief introduction using Hadoop
Mattia Bertorello

Slide 2

Summary
● The Big Data Hype
● Big Data introduction
● NoSQL
● Hadoop Framework: HDFS, YARN
● MapReduce
● High level languages: Hive, Pig, Impala
● Hadoop Connectors
● Demo time

Slide 3

The Big Data Hype

Slide 4

No content

Slide 5

No content

Slide 6

The Hype
"An interesting Google Trend Big Data factoid is that India and South Korea are rated with the highest interest, with the USA a distant third. So, all of the Big Data vendors should now focus on India and South Korea, and leave my email inbox clean."
Steve Hamby, CTO, Orbis Technologies
http://www.huffingtonpost.com/steve-hamby/the-big-data-nemesis-simp_b_1940169.html

Slide 7

The Hype
"Everything is on the Internet. The Internet has a lot of data. Therefore, everything is big data."
Alistair Croll
http://radar.oreilly.com/2012/08/three-kinds-of-big-data.html

Slide 8

The Hype
"Gartner got its hype cycle wrong this time. Big data is already well along on the so-called Plateau of Productivity, as its countless success stories already prove. … Today, it is those big data skeptics that we should not take too seriously."
Irfan Khan, Vice President and Chief Technology Officer, Sybase
http://www.itworld.com/it-managementstrategy/293397/gartner-dead-wrong-about-big-data-hype-cycle

Slide 9

Big Data: What it is and why it matters

Slide 10

What is Big Data?
"Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications."
http://en.wikipedia.org/wiki/Big_data

Slide 11

What is Big Data?
“The term ‘Big Data’ applies to information that can't be processed or analyzed using traditional processes or tools.”

Slide 12

What is Big Data?
“Big Data” is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and to extract value and hidden knowledge from it…
There is no single standard definition.

Slide 13

The 3 (sometimes 4) V's of Big Data
Volume: data size
Velocity: speed of change, streaming data
Variety: different forms of data sources
Veracity: uncertainty of data

Slide 14

Why Big Data?

Slide 15

Why Big Data?
"From the dawn of civilization until 2003, humankind generated five exabytes of data. Now we produce five exabytes every two days… and the pace is accelerating."
Eric Schmidt, Executive Chairman, Google

Slide 16

Where is Big Data?
Health care
Telecom data
Insurance
Retail transactions
Public sector
Social media activity
Financial data

Slide 17

Why Big Data?
Most of the software used for Big Data projects is open source.

Slide 18

Not Only SQL

Slide 19

NoSQL - Features
Schemas: typically dynamic
Scaling: horizontal
Development model: open source
Supports transactions: no
Data manipulation: through object-oriented APIs

Slide 20

NoSQL - Data models
Key–Value Store: Redis, Riak, Voldemort
Column-Oriented Store: Cassandra, HBase
Document-Oriented Store: MongoDB, CouchDB
Graph Database: Neo4j, OrientDB
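To make the first of these models concrete, here is a minimal sketch (not from the slides) of talking to Redis from Java through the Jedis client; the host, port, and key name are assumptions.

import redis.clients.jedis.Jedis;

public class KeyValueExample {
  public static void main(String[] args) {
    // Assumes a local Redis instance on the default port 6379.
    Jedis jedis = new Jedis("localhost", 6379);

    // A key-value store exposes its data model directly:
    // opaque values addressed by keys, no fixed schema.
    jedis.set("page:hadoop:views", "42");
    String views = jedis.get("page:hadoop:views");
    System.out.println("views = " + views);

    jedis.close();
  }
}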

Slide 21

NoSQL - CAP Theorem: of Consistency, Availability, and Partition tolerance, pick two
CP: HBase, MongoDB, Redis
AP: Cassandra, CouchDB, Dynamo, Riak, Voldemort
CA: RDBMSs (MySQL, Postgres), Neo4j

Slide 22

NoSQL
[Chart: the data models plotted by data size versus data complexity — Key-Value stores, Column-Oriented stores, Document Databases, Graph Databases, and RDBMSs.]

Slide 23

No content

Slide 24

Hadoop: What is it?

Slide 25

Hadoop
A scalable, fault-tolerant, distributed system for data storage and processing on a cluster (open source under the Apache license).

Slide 26

Hadoop
Hadoop moves the data-processing tasks to where the data is actually stored, rather than moving the data to the computation as OpenMPI-like software does.

Slide 27

Hadoop Core
Hadoop (2.x) has two main systems:
● HDFS: Hadoop Distributed File System
● YARN: Yet Another Resource Negotiator

Slide 28

Hadoop Distributed File System

Slide 29

HDFS Features
● Scalable, high performance, simple to expand
● Fault tolerant: data replication
● Data divided into blocks of 64 MB (default)
● Runs on commodity hardware (e.g. no RAID)
● Not fully POSIX-compliant
● Write once, read many
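To sketch what the write-once, read-many model looks like from client code (this is not from the slides, and the paths are assumptions), the Hadoop Java FileSystem API can copy a local file into HDFS and list the result:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS and the rest of the cluster
    // configuration from the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write once: the file is split into blocks and each block
    // is replicated across Datanodes.
    fs.copyFromLocalFile(new Path("/tmp/largeFile"),
                         new Path("/home/data/largeFile"));

    // Read many: list what is stored under the directory.
    for (FileStatus status : fs.listStatus(new Path("/home/data"))) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }
  }
}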

Slide 30

HDFS Architecture
[Diagram: the Namenode holds the metadata, e.g. /home/data/largeFile → blocks {1, 2, 3, 4, 5, 6}; Datanodes in Rack 1 and Rack 2 store the replicated blocks; clients send metadata operations to the Namenode and block read/write operations directly to the Datanodes.]

Slide 31

Resource Manager YARN

Slide 32

YARN: Why?
Hadoop v1: HDFS provides distributed redundant storage; MapReduce handles both resource management and data processing.
Hadoop v2: HDFS provides distributed redundant storage; YARN handles resource management; MapReduce and other frameworks (e.g. MPI) handle only data processing.

Slide 33

YARN

Slide 34

MapReduce

Slide 35

MapReduce: Functional Programming
val a = Array(1, 2, 3)
Map: a.map(_ + 1) // Array[Int] = Array(2, 3, 4)
Reduce: a.reduce(_ + _) // Int = 6

Slide 36

MapReduce in the Hadoop framework
[Diagram: INPUT → parallel MAP tasks → shuffle → REDUCE tasks → OUTPUT]
Map: (K, V) -> (K, V)
Reduce: (K, [V, ...]) -> (K, V)

Slide 37

MapReduce "Hello World": word count
[Diagram: the input lines "Hadoop BigData NoSQL", "HBase BigData NoSQL", and "HBase NoSQL Hadoop" flow through Splitting (one line per split), Mapping (each word emitted as a (word, 1) pair), Shuffling (pairs grouped by word), and Reducing (the 1s summed per word) to the final result, one (word, count) pair per word.]
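Before the full Hadoop version on the next slides, the same split → map → group → reduce logic can be sketched with plain Java 8 streams. This is a local, single-machine analogy (not Hadoop code, and not from the slides); the input lines are the ones from the diagram above.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordCountLocal {
  public static void main(String[] args) {
    List<String> lines = Arrays.asList(
        "Hadoop BigData NoSQL",
        "HBase BigData NoSQL",
        "HBase NoSQL Hadoop");

    // Map: split every line into words.
    // Reduce: group equal words together and count them.
    Map<String, Long> counts = lines.stream()
        .flatMap(line -> Arrays.stream(line.split(" ")))
        .collect(Collectors.groupingBy(Function.identity(),
                                       Collectors.counting()));

    System.out.println(counts); // e.g. {BigData=2, HBase=2, Hadoop=2, NoSQL=3}
  }
}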

Slide 38

MapReduce Word Count (Java)

// Imports (java.io.*, java.util.*, org.apache.hadoop.*) omitted, as on the slide.
public class WordCount extends Configured implements Tool {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    static enum Counters { INPUT_WORDS }

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // These fields are set in configure() from the job
    // configuration; that method is omitted on the slide.
    private boolean caseSensitive = true;
    private Set<String> patternsToSkip = new HashSet<String>();
    private long numRecords = 0;
    private String inputFile;

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = (caseSensitive) ? value.toString()
                                    : value.toString().toLowerCase();
      for (String pattern : patternsToSkip) {
        line = line.replaceAll(pattern, "");
      }
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
        reporter.incrCounter(Counters.INPUT_WORDS, 1);
      }
      if ((++numRecords % 100) == 0) {
        reporter.setStatus("Finished processing " + numRecords
            + " records from the input file: " + inputFile);
      }
    }
  }

Slide 39

MapReduce Word Count (Java, continued)

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // other_args holds the non-option command-line arguments;
    // the parsing that fills it is omitted on the slide.
    FileInputFormat.setInputPaths(conf, new Path(other_args.get(0)));
    FileOutputFormat.setOutputPath(conf, new Path(other_args.get(1)));

    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new WordCount(), args);
    System.exit(res);
  }
}

Slide 40

MapReduce

Slide 41

High Level Languages: Hive

Slide 42

Hive
● Data Warehouse infrastructure built on top of Hadoop
● Query the data using a SQL-like language called HiveQL
● Built-in user defined functions (UDFs)
● Custom mappers and reducers
● Different storage types: HDFS or HBase

Slide 43

Hive
● Does NOT offer real-time queries or row-level updates
● Query execution via MapReduce
● Hive on Spark is not available yet (HIVE-7292)

Slide 44

Hive Examples
SELECT hour_view, sum(counter)
FROM (
  SELECT lang,
         hour(from_unixtime(datetime)) AS hour_view,
         sum(page_views) AS counter
  FROM page_view_wiki_parquet
  WHERE lang = 'it'
  GROUP BY lang, datetime
) it_views
GROUP BY hour_view
ORDER BY hour_view ASC

Slide 45

High Level Languages: Pig Latin

Slide 46

Pig Latin
● Pig Latin is procedural and fits very naturally in the pipeline paradigm, while SQL is declarative
● A Pig Latin script describes a directed acyclic graph (DAG)
● It is able to store data at any point during the pipeline

Slide 47

Hive Examples
CREATE EXTERNAL TABLE page_counts_gz (
  lang STRING,
  page_name STRING,
  page_views BIGINT,
  page_weight BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION '/user/admin/pagecounts_text';

CREATE EXTERNAL TABLE page_counts_parquet (
  lang STRING,
  page_name STRING,
  page_views BIGINT,
  page_weight BIGINT,
  datetime BIGINT
)
STORED AS PARQUET
LOCATION '/user/admin/pagecounts_parquet';

Slide 48

Hive Examples
INSERT OVERWRITE TABLE page_counts_parquet
SELECT pcg.lang, pcg.page_name, pcg.page_views, pcg.page_weight,
       (unix_timestamp(
          regexp_extract(INPUT__FILE__NAME,
                         '.*?pagecounts-([0-9]*-[0-9]*)', 1),
          'yyyyMMdd-HHmmss') + 7200) AS datetime
FROM page_counts_gz pcg;

Slide 49

Pig Latin Examples
A = LOAD 'page_views_text'
    USING PigStorage(' ', '-tagFile')
    AS (filename:chararray, lang:chararray, page_name:chararray,
        view_count:long, page_size:long);

B = FOREACH A {
      date = REGEX_EXTRACT(filename, '.*?pagecounts-([0-9]*-[0-9]*)', 1);
      date_object = ToDate(date, 'yyyyMMdd-HHmmss', '+00:00');
      unixtime = ToUnixTime(date_object);
      GENERATE (long)unixtime AS datetime, lang, page_name,
               view_count, page_size;
    };

STORE B INTO 'page_views_parquet' USING parquet.pig.ParquetStorer;

Slide 50

High Level Languages: Cloudera Impala

Slide 51

Cloudera Impala
● Open-source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop
● Real-time queries over HDFS and HBase
● Easy for Hive users to migrate
● For some things, MapReduce is just too slow
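Since Impala exposes a HiveServer2-compatible endpoint, existing Hive users can typically query it from Java over JDBC. The sketch below is not from the slides: the host name, the 21050 port, and the noSasl setting are assumptions about an unsecured Cloudera-style setup, and page_counts_parquet is the table defined in the earlier Hive example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaQueryExample {
  public static void main(String[] args) throws Exception {
    // The Hive JDBC driver, speaking the HiveServer2 protocol to Impala.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://impala-host:21050/;auth=noSasl");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT lang, sum(page_views) FROM page_counts_parquet GROUP BY lang")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}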

Slide 52

No content

Slide 53

Hadoop Sources

Slide 54

Hadoop Connectors
● Sqoop: transferring data between relational databases and Hadoop
● Flume: collecting, aggregating, and moving large amounts of log data
● MongoDB Hadoop Connector

Slide 55

Other Tools

Slide 56

Other tools, or possible presentations
● What is the next big thing in big data? It's called Spark.
● Streaming data: S4, Storm, Spark Streaming, Kafka
● Graph analysis: Giraph, GraphX (Spark)
● Machine learning: Mahout, MLlib (Spark)
● Other NoSQL databases: Solr, HBase, Cassandra

Slide 57

Thank you for your attention

Slide 58

Demo Time