Big Data: a brief introduction using Hadoop

Mattia Bertorello

September 22, 2014

Transcript

  1. Summary • The Big Data Hype • Big Data introduction

    • NoSQL • Hadoop Framework: HDFS, YARN • MapReduce • High level languages: Hive, Pig, Impala • Hadoop Connectors • Demo time
  2. The Hype An interesting Google Trend Big Data factoid is

    that India and South Korea are rated with the highest interest, with the USA a distant third. So, all of the Big Data vendors should now focus on India and South Korea, and leave my email inbox clean Steve Hamby, CTO Orbis Technologies http://www.huffingtonpost.com/steve-hamby/the-big-data-nemesis-simp_b_1940169.html
  3. The Hype Everything is on the Internet. The Internet has

    a lot of data. Therefore, everything is big data. Alistair Croll http://radar.oreilly.com/2012/08/three-kinds-of-big-data.html
  4. The Hype Gartner got its hype cycle wrong this time.

    Big data is already well along on the so-called Plateau of Productivity as its countless success stories already prove. …Today, it is those big data skeptics that we should not take too seriously. Irfan Khan, Vice President and Chief Technology Officer for Sybase http://www.itworld.com/it-managementstrategy/293397/gartner-dead-wrong-about-big-data-hype-cycle
  5. What is Big Data? Big data is an all-encompassing term

    for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications. http://en.wikipedia.org/wiki/Big_data
  6. What is Big Data? “The term ‘Big Data’ applies to

    information that can't be processed or analyzed using traditional processes or tools.”
  7. What is Big Data? “Big Data” is data whose scale,

    diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it… No single standard definition
  8. The 3 (sometimes 4) V’s of Big Data Volume:

    data size • Velocity: speed of change, streaming data • Variety: different forms of data sources • Veracity: uncertainty of the data
  9. Why Big Data? From the dawn of civilization until 2003,

    humankind generated five exabytes of data. Now we produce five exabytes every two days… and the pace is accelerating. Eric Schmidt, Executive Chairman, Google
  10. Where is Big Data? Health care • Telecom data •

    Insurance • Retail transactions • Public sector • Social media activity • Financial data
  11. Why Big Data? Most of the software used for Big

    Data projects is open-source.
  12. NoSQL - Features Schemas: typically dynamic • Scaling:

    horizontal • Development model: open-source • Supports transactions: no • Data manipulation: through object-oriented APIs
  13. NoSQL - Data models Key-value store: Redis, Riak, Voldemort

    • Column-oriented store: Cassandra, HBase • Document-oriented store: MongoDB, CouchDB • Graph database: Neo4j, OrientDB
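As an added illustration (not from the deck) of the key-value model above, and of the "data manipulation through object-oriented APIs" point on the features slide, here is a minimal sketch using the Jedis client for Redis; the host, port and key names are assumptions made for the example.

    // Minimal sketch: manipulating a key-value store (Redis) through an
    // object-oriented API, using the Jedis client. Host and keys are
    // illustrative; the deck itself shows no client code.
    import redis.clients.jedis.Jedis;

    public class KeyValueExample {
        public static void main(String[] args) {
            Jedis jedis = new Jedis("localhost", 6379);        // local Redis server
            jedis.set("page:home:views", "42");                // store a value under a key
            jedis.incr("page:home:views");                     // atomic increment, no schema needed
            System.out.println(jedis.get("page:home:views")); // prints 43
            jedis.close();
        }
    }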
  14. NoSQL - CAP Theorem Consistency, Availability, Partition tolerance:

    pick two. CP: HBase, MongoDB, Redis • AP: Cassandra, CouchDB, Dynamo, Riak, Voldemort • CA: RDBMSs (MySQL, Postgres), Neo4j
  15. Hadoop A scalable fault-tolerant distributed system for data storage and

    processing on a cluster. (open source under Apache license)
  16. Hadoop Hadoop moves the data-processing tasks to where the

    data is actually stored, rather than moving the data to the computation as OpenMPI-like software does
  17. Hadoop Core Hadoop (2.x) has two main systems: •

    HDFS: Hadoop Distributed File System • YARN: Yet Another Resource Negotiator
  18. HDFS Features Scalable, high-performance, simple to expand •

    Fault tolerance: data replication • Data divided into blocks of 64 MB (by default) • Runs on commodity hardware (e.g. no RAID) • Not fully POSIX-compliant • Write-once and read-many
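As a minimal sketch of the write-once / read-many access pattern through the HDFS Java FileSystem API (added for illustration; the path, replication factor and cluster configuration are assumptions, not taken from the deck):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/demo/sample.txt"); // hypothetical path

            // Write the file once; HDFS splits it into blocks and replicates them.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("hello hdfs");
            }
            fs.setReplication(file, (short) 3);            // request 3 replicas

            // Read it back as many times as needed.
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
            fs.close();
        }
    }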
  19. HDFS Architecture [diagram] The Namenode holds the metadata

    (e.g. /home/data/largeFile mapped to blocks {1,2,3,4,5,6}); the Datanodes, spread across Rack 1 and Rack 2, store the replicated blocks; clients perform metadata operations against the Namenode and read/write block operations directly against the Datanodes
  20. YARN - Why? [diagram] Hadoop v1: HDFS (distributed redundant

    storage) + MapReduce (both resource management and data processing). Hadoop v2: HDFS (distributed redundant storage) + YARN (resource management) + MapReduce and other frameworks, e.g. MPI (data processing)
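As an added, hedged sketch of what this change means in practice: on Hadoop 2 a MapReduce job is submitted as a YARN application. The snippet uses the newer org.apache.hadoop.mapreduce API with the default (identity) Mapper and Reducer so it stays self-contained; the class and job names are made up, not from the deck.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SubmitToYarn {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // On Hadoop 2 this submits an application to the YARN ResourceManager,
            // which starts an ApplicationMaster that requests containers and runs
            // the map and reduce tasks inside them.
            Job job = Job.getInstance(conf, "identity-copy");
            job.setJarByClass(SubmitToYarn.class);
            // Default (identity) Mapper and Reducer: each record (byte offset, line)
            // passes through map, shuffle and reduce unchanged; the point here is
            // only the submission path through YARN.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }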
  21. MapReduce - Functional Programming val a = Array(1,2,3)

    Map: a.map(_ + 1) gives Array[Int] = Array(2, 3, 4) • Reduce: a.reduce(_ + _) gives Int = 6
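For comparison (an added sketch, not part of the deck), the same map and reduce steps expressed with Java 8 streams, the deck's main language:

    import java.util.Arrays;

    public class MapReduceFp {
        public static void main(String[] args) {
            int[] a = {1, 2, 3};
            int[] mapped = Arrays.stream(a).map(x -> x + 1).toArray(); // [2, 3, 4]
            int sum = Arrays.stream(a).reduce(0, Integer::sum);        // 6
            System.out.println(Arrays.toString(mapped) + " " + sum);
        }
    }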
  22. MapReduce - Hadoop framework [diagram] INPUT is fed to MAP

    tasks, each (K, V) -> (K, V); their output is shuffled to REDUCE tasks, each (K, [V, ...]) -> (K, V), which write the OUTPUT (many map tasks feed fewer reduce tasks)
  23. MapReduce Hello World [word-count walkthrough] Input lines containing the

    words Hadoop, HBase, BigData and NoSQL go through Splitting (one split per map task), Mapping (each word emitted as (word, 1)), Shuffling (pairs grouped by key) and Reducing (counts summed per key); final result: Hadoop, 2 • HBase, 2 • BigData, 3 • NoSQL, 2
  24. MapReduce Word count Java

    public class WordCount extends Configured implements Tool {

      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {

        static enum Counters { INPUT_WORDS }

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // In the full example these fields are filled in configure(JobConf)
        // from job parameters; they are declared here so the snippet is complete.
        private boolean caseSensitive = true;
        private List<String> patternsToSkip = new ArrayList<String>();
        private long numRecords = 0;
        private String inputFile;

        public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          String line = (caseSensitive) ? value.toString()
              : value.toString().toLowerCase();
          for (String pattern : patternsToSkip) {
            line = line.replaceAll(pattern, "");
          }
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
            reporter.incrCounter(Counters.INPUT_WORDS, 1);
          }
          if ((++numRecords % 100) == 0) {
            reporter.setStatus("Finished processing " + numRecords
                + " records " + "from the input file: " + inputFile);
          }
        }
      }
  25. MapReduce Word count Java

      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }

      public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // other_args: the non-option command-line arguments; the full example
        // builds this list while parsing options such as -skip.
        List<String> other_args = new ArrayList<String>(Arrays.asList(args));

        FileInputFormat.setInputPaths(conf, new Path(other_args.get(0)));
        FileOutputFormat.setOutputPath(conf, new Path(other_args.get(1)));

        JobClient.runJob(conf);
        return 0;
      }

      public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new WordCount(), args);
        System.exit(res);
      }
    }
  26. Hive Data warehouse infrastructure built on top of Hadoop •

    Query the data using a SQL-like language called HiveQL • Built-in user-defined functions (UDFs) • Custom mappers and reducers • Different storage types: HDFS or HBase
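The deck mentions user-defined functions without showing one; as a hedged sketch, a minimal Hive UDF in Java using the classic org.apache.hadoop.hive.ql.exec.UDF base class (the class name and behaviour here are made up for illustration):

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    // Minimal Hive UDF sketch: upper-cases a string column.
    public class ToUpper extends UDF {
        // Hive calls evaluate() once per row; null in, null out.
        public Text evaluate(Text input) {
            if (input == null) {
                return null;
            }
            return new Text(input.toString().toUpperCase());
        }
    }

Packaged in a jar, such a function would be registered with ADD JAR and CREATE TEMPORARY FUNCTION to_upper AS 'ToUpper', and then used in HiveQL like any built-in function.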
  27. Hive Does NOT offer real-time queries or row-level updates •

    Query execution via MapReduce • Hive on Spark is not available yet (HIVE-7292)
  28. Hive Examples

    SELECT hour_view, sum(counter)
    FROM (
      SELECT lang, hour(from_unixtime(datetime)) as hour_view,
             sum(page_views) as counter
      FROM page_view_wiki_parquet
      WHERE lang = 'it'
      GROUP BY lang, datetime
    ) it_views
    GROUP BY hour_view
    ORDER BY hour_view ASC
  29. Pig Latin Pig Latin is procedural and fits very naturally

    into the pipeline paradigm, while SQL is instead declarative • A Pig Latin script describes a directed acyclic graph (DAG) • It is able to store data at any point during the pipeline
  30. Hive Examples

    CREATE EXTERNAL TABLE page_counts_gz(lang STRING, page_name STRING,
      page_views BIGINT, page_weight BIGINT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
    STORED AS TEXTFILE
    LOCATION '/user/admin/pagecounts_text';

    CREATE EXTERNAL TABLE page_counts_parquet(lang STRING, page_name STRING,
      page_views BIGINT, page_weight BIGINT, datetime BIGINT)
    STORED AS PARQUET
    LOCATION '/user/admin/pagecounts_parquet';
  31. Hive Examples

    INSERT OVERWRITE TABLE page_counts_parquet
    SELECT pcg.lang, pcg.page_name, pcg.page_views, pcg.page_weight,
      (unix_timestamp(
         regexp_extract(INPUT__FILE__NAME, '.*?pagecounts-([0-9]*-[0-9]*)', 1),
         'yyyyMMdd-HHmmss') + 7200) as datetime
    FROM page_counts_gz pcg;
  32. Pig Latin Examples

    A = LOAD 'page_views_text' USING PigStorage(' ', '-tagFile')
        AS (filename:chararray, lang:chararray, page_name:chararray,
            view_count:long, page_size:long);
    B = FOREACH A {
        date = REGEX_EXTRACT(filename, '.*?pagecounts-([0-9]*-[0-9]*)', 1);
        date_object = ToDate(date, 'yyyyMMdd-HHmmss', '+00:00');
        unixtime = ToUnixTime(date_object);
        GENERATE (long)unixtime AS datetime, lang, page_name, view_count, page_size;
    }
    STORE B INTO 'page_views_parquet' USING parquet.pig.ParquetStorer;
  33. Cloudera Impala Open-source Massively Parallel Processing (MPP) SQL

    query engine for data stored in a computer cluster running Apache Hadoop • Real-time queries over HDFS and HBase • Easy for Hive users to migrate • For some things MapReduce is just too slow
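As a hedged sketch (not from the deck) of why migration is easy for Hive users: Impala speaks the HiveServer2 protocol, so it can be queried from Java through the standard JDBC API. The host name, the port (21050 is Impala's usual HiveServer2-protocol port) and the noSasl setting assume an unsecured test cluster; the query reuses the page_counts_parquet table from the Hive slides, assuming it is visible in the shared metastore.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ImpalaQuery {
        public static void main(String[] args) throws Exception {
            // HiveServer2-compatible JDBC driver, assumed to be on the classpath.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://impala-host:21050/default;auth=noSasl");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT lang, sum(page_views) FROM page_counts_parquet GROUP BY lang")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }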
  34. Hadoop Connectors Sqoop: transferring data between relational databases and

    Hadoop • Flume: collecting, aggregating, and moving large amounts of log data • MongoDB Hadoop Connector
  35. Other tools or possible presentations What is the next big

    thing in big data? It’s called Spark. • Streaming data: S4, Storm, Spark Streaming, Kafka • Graph analysis: Giraph, GraphX (Spark) • Machine learning: Mahout, MLlib (Spark) • Other NoSQL databases: Solr, HBase, Cassandra