Big Data (Terabytes or Petabytes)
Systems and enterprises generate huge amounts of data, from terabytes to petabytes of information. Data is everywhere; we either ignore it or destroy it.
Examples:
• Rolling web log data
• Network logs of data flowing through various networks
• System logs
• Click information from websites
• Stock trading data from stock exchanges
Examples of personal data: photo streams, social media streams.
Data need not be in a table in some RDBMS:
• An airline jet collects 10 terabytes of sensor data for every 30 minutes of flying time.
• The NYSE generates about one terabyte of new trade data per day, used to perform stock trading analytics to determine trends for optimal trades.
Why 'Big'?
• The internet is the biggest source of data, estimated at 1.8 zettabytes (1 zettabyte = 10^21 bytes).
• Searching the internet means knowing what is out there and saving the data.
• Being able to determine what an end user is looking for and getting it from this vast store (searching and ranking data).
• Getting the results back fast enough that the user hasn't moved on to something else (indexing).

Defining the Big Data Problem
• Data will continue to grow indefinitely, so we need hardware that grows with the data.
• We cannot keep buying bigger machines, because after a while they become cost-prohibitive.
• Hardware should 'grow' as data grows, scaling horizontally.
• Additional hardware should result in a proportional increase in performance.
Facebook Example
• Users: approximately 1 in every 13 people on earth; half of them are logged in on any given day.
• A record-breaking 750 million photos were uploaded to Facebook over New Year's weekend.
• There are 206.2 million internet users in the US, meaning 71.2% of the US web audience is on Facebook.
• Facebook users spend 10.5 billion minutes (almost 20,000 years) online on the social network.
• An average of 3.2 billion likes and comments are posted every day.

20 Minutes on Facebook:
• Links shared: 1,000,000
• Event invites: 1,484,000
• Friend requests accepted: 1,972,000
• Photos uploaded: 2,716,000
• Messages sent: 2,716,000
• Tagged photos: 1,323,000
• Status updates: 1,851,000
• Wall posts: 1,587,000
• Comments made: 10,208,000
Twitter Example
• The United States' 141.8 million accounts represent 27.4 percent of all Twitter users, well ahead of Brazil, Japan, the UK, and Indonesia.
• 79% of US Twitter users are more likely to recommend brands they follow.
• 67% of US Twitter users are more likely to buy from brands they follow.
• 57% of all companies that use social media for business use Twitter.
Characteristics of Big Data
• Volume: tweets created each day.
• Velocity: scrutinizing 5 million trade events created each day to identify potential fraud.
• Variety: sensor data, audio, video, click streams, log files, and more.
Data Volume Is Growing Exponentially
• The world's information doubles every two years.
• Over the next 10 years: the number of servers worldwide will grow by 10x, the amount of information managed by enterprise data centers will grow by 50x, and the number of "files" enterprise data centers handle will grow by 75x.

Humanity Passed the 1-Zettabyte Mark in 2010
• A zettabyte is 1,000,000,000,000,000,000,000 bytes (21 zeroes for the counting), or one trillion gigabytes. That is enough data to fill roughly 62.5 billion 16-gigabyte iPads (10^21 bytes / 16 x 10^9 bytes per iPad).
• Unit ladder: 1,000 gigabytes = 1 terabyte; 1,000 terabytes = 1 petabyte; 1 million terabytes = 1 exabyte; 1 billion terabytes = 1 zettabyte.
Stored Digital Information (exabytes)
[Chart: growth of stored digital information, split into unstructured application data and relational business transaction data]
• 2,500 exabytes of new information in 2012, with the internet as the primary driver.
• The digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 zettabytes this year.
[Diagram: the traditional analytics pipeline. Instrumentation and collection feed a storage-only grid holding the original raw data (mostly append); an ETL compute grid aggregates it into an RDBMS; interactive apps query the aggregated data.]
Problems with this architecture:
1) Can't explore the original high-fidelity raw data
2) Moving data to compute doesn't scale
3) Archiving = premature data death
Scale Vertically vs. Scale Horizontally
Starting point: a single-core 64-bit processor with 8 GB RAM.
• Scale vertically: add a bigger CPU and more storage, e.g. upgrade to a quad-core 64-bit processor with 32 GB RAM.
• Scale horizontally: add another computer with similar memory/CPU/storage, e.g. a second quad-core 64-bit machine with 32 GB RAM.
Which approach should we take?
What Is Hadoop?
Hadoop is a framework for the distributed processing of large data sets across clusters of commodity computers using a simple programming model. It is open-source data management with scale-out storage and distributed processing. It scales horizontally to manage petabytes of data and abstracts away distributed storage and computing: storage is hidden behind the Hadoop Distributed File System (HDFS), and analytics run via the MapReduce framework.
The Hadoop services:
• HDFS: a distributed file system
• MapReduce: a distributed data processing model
• HBase: a distributed column-oriented database
• Hive: a distributed data warehouse
• Pig: a data flow language
• ZooKeeper: a distributed, highly available coordination service
• Sqoop: a tool for efficient bulk transfer of data between Hadoop and other sources such as RDBMSs
• Oozie: a service for running and scheduling workflows of Hadoop jobs
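A minimal sketch of the MapReduce programming model is the canonical word-count job below. It assumes a standard Hadoop installation; the input and output HDFS paths are supplied on the command line, and everything else uses the stock org.apache.hadoop.mapreduce API.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: runs near the data, emits (word, 1) for every token.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: sums the counts emitted for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) sum += val.get();
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged into a jar, it would run with something like: hadoop jar wordcount.jar WordCount /input /output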
Hadoop Core Components
• HDFS (Storage): self-healing, high-bandwidth clustered storage. Blocks are stored redundantly across Data Nodes; the Name Node tracks their locations.
• MapReduce (Processing): splits a task across processors "near" the data and assembles the results.
[Diagram: the Name Node (HDFS master) and Job Tracker (MapReduce master) above a row of slave machines, each running a Data Node paired with a Task Tracker]
What Is HDFS Designed For?
• Very large files: files in the MB, GB, and TB range.
• Write once, read many times.
• Works on commodity hardware.
What is Hadoop not designed for?
• Low-latency data access
• Lots of small files
• Multiple writers and arbitrary file modifications

HDFS Concepts
Name Node (Master)
• Manages the blocks present on the Data Nodes.
• Manages the file system tree and other metadata.
• Responsible for maintaining the namespace image and edit log files; any change to the file system namespace or its properties is recorded by the Name Node.
• Holds the mapping of file blocks to Data Nodes (the physical location of data), and so is aware of the Data Nodes for a particular file.
• Important files for the Name Node:
  - Image: file consisting of metadata information
  - Checkpoint: persistent record of the image, stored in the native file system
  - Journal: modification log of the image, stored in the local file system

Data Node (Slave Node)
• Slaves deployed on each machine that provide the actual storage.
• Responsible for serving read and write requests from clients; they store and retrieve blocks when they are told to.
• Report back to the Name Node periodically with the list of blocks they hold.
• Data Nodes are the workhorses of the Hadoop file system.
• A block replica is represented by two files: one that stores the data itself, and one that stores the block metadata, including the checksum for the block and the block's generation timestamp.
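To see the Name Node's block-to-Data-Node mapping from the client side, here is a small sketch using the standard FileSystem API. The path /data/sample.txt and a reachable cluster configuration are assumptions for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");     // hypothetical existing file
        FileStatus status = fs.getFileStatus(file);
        // The Name Node answers this query from its block-to-Data-Node map.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
          System.out.println("offset=" + b.getOffset()
              + " length=" + b.getLength()
              + " hosts=" + String.join(",", b.getHosts()));
        }
        fs.close();
      }
    }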
Secondary Name Node
• Connects to the Name Node every hour.*
• Performs housekeeping and keeps a backup of the Name Node metadata; the saved metadata can be used to rebuild a failed Name Node.
• Its primary role is to periodically merge the namespace image with the edit log, keeping the edit log's size within a limit.
• Usually runs on a separate physical machine, because it requires as much memory as the Name Node to perform the merge.
• "You give me metadata every hour, I will make it secure."
• Note: the Name Node itself remains a single point of failure for the metadata.
Name Node Metadata
• No demand paging of FS metadata.
• Types of metadata: the list of files; the list of blocks for each file; the list of Data Nodes for each block; file attributes such as access time and replication factor.
• A transaction log records file creations, file deletions, and so on.
Job Tracker
• Clients submit jobs to the Job Tracker; jobs are kept in a queue, managed by either the FIFO Scheduler or the Capacity Scheduler.
• The Job Tracker determines the location of the data through the Name Node.
• The Job Tracker finds available Task Trackers, preferring slots near the data.
• The Job Tracker submits the work to the chosen Task Tracker.
• The Task Tracker monitors the task and sends updates to the Job Tracker.
• After completion, the Job Tracker updates the job's status (see the sketch below).
• The Job Tracker is a single point of failure.
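As an illustrative sketch of this flow, the snippet below submits a job asynchronously and polls the progress that the Task Trackers report back through the Job Tracker. It relies on Hadoop's default identity mapper and reducer (the job simply copies its input as (offset, line) pairs), so the only assumptions are the input/output paths in args.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SubmitAndPoll {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "identity pass-through");
        job.setJarByClass(SubmitAndPoll.class);
        // No mapper/reducer set: Hadoop's defaults pass records through unchanged.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.submit();                      // hands the job to the scheduler's queue
        while (!job.isComplete()) {        // progress flows Task Tracker -> Job Tracker -> client
          System.out.printf("map %.0f%% reduce %.0f%%%n",
              job.mapProgress() * 100, job.reduceProgress() * 100);
          Thread.sleep(5000);
        }
        System.out.println(job.isSuccessful() ? "succeeded" : "failed");
      }
    }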
Writing a File to HDFS
• The client creates the file by giving its path to the Name Node.
• For each block, the Name Node returns the list of Data Nodes that will host its replicas.
• The client pipelines the data to the chosen Data Nodes.
• Each Data Node confirms the creation of the block replica to the Name Node.
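From the client's point of view, all of the above happens inside create() and the stream's write path. A minimal sketch, where the path /tmp/hello.txt is a hypothetical example:

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path out = new Path("/tmp/hello.txt");          // hypothetical path
        // create() registers the file with the Name Node; as blocks fill up,
        // the client library pipelines them to the Data Nodes it was given.
        try (FSDataOutputStream stream = fs.create(out, true /* overwrite */)) {
          stream.write("hello, hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
      }
    }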
Reading a File from HDFS
• The client asks the Name Node for the list of blocks and the location of each block; the locations are ordered by their distance from the reader.
• The Name Node returns, for each block of the file, the list of Data Nodes hosting its replicas.
• The client then reads directly from the Data Nodes without contacting the Name Node again.
• Along with the data, a checksum is shipped for verifying data integrity. Why? If a replica is corrupt, the client informs the Name Node and tries to get the data from another Data Node.
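And the matching read sketch: open() performs the block-location lookup against the Name Node once, after which the stream pulls data, verifying checksums as it goes, directly from the Data Nodes. Again /tmp/hello.txt is hypothetical.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path in = new Path("/tmp/hello.txt");           // hypothetical path
        // The stream reads straight from Data Nodes and checks checksums;
        // a corrupt replica triggers a retry against another Data Node.
        try (FSDataInputStream stream = fs.open(in);
             BufferedReader reader = new BufferedReader(
                 new InputStreamReader(stream, StandardCharsets.UTF_8))) {
          String line;
          while ((line = reader.readLine()) != null) {
            System.out.println(line);
          }
        }
        fs.close();
      }
    }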
Replica Placement
• 1st replica on one node in the local rack.
• 2nd replica on a node in a different rack.
• 3rd replica on a different node in that same remote rack.
Replica Selection
• Closest to the reader: if a replica exists on the same rack as the reader node, that replica is preferred.
• This reduces bandwidth consumption and improves read latency.
• Block size and replication can be configured per file; an application can specify the replication of a file, as the sketch below shows.
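A small sketch of per-file replication via the public FileSystem API. The file path is hypothetical, and the cluster-wide default of 3 replicas (governed by the dfs.replication property) is an assumption.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PerFileReplication {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/hello.txt");         // hypothetical path
        // Ask the Name Node to keep 5 replicas of this file's blocks instead
        // of the cluster default (typically 3, set by dfs.replication).
        fs.setReplication(file, (short) 5);
        System.out.println("replication now: "
            + fs.getFileStatus(file).getReplication());
        fs.close();
      }
    }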
RDBMS vs. Hadoop

  Row            RDBMS                           HADOOP
  Data types     Structured                      Multi and unstructured
  Processing     Limited, no data processing     Processing coupled with data
  Governance     Standards & structured          Loosely structured
  Schema         Required on write               Required on read
  Speed          Reads are fast                  Writes are fast
  Cost           Software license                Support only
  Resources      Known entity                    Growing, complexities, wide
  Best fit use   Interactive OLAP analytics,     Data discovery, processing
                 complex ACID transactions,      unstructured data, massive
                 operational data store          storage/processing

[Graphic: EDW, MPP, NoSQL, RDBMS, and HADOOP platforms compared side by side]