
Big Data & Hadoop Overview

StratApps
February 14, 2014

Transcript

  1. Agenda for Day 1
    1) Understanding what Big Data is
    2) Understanding what Hadoop is (architecture)
    3) Understanding the main Hadoop components
    4) HDFS and its architecture
  2. What and where is Big Data?
    Lots of data (terabytes or petabytes): systems and enterprises generate huge amounts of data, from terabytes up to petabytes of information. Data is everywhere; we either ignore it or destroy it.
    Examples: rolling web log data, network logs of data flowing through various networks, system logs, click information from websites, stock trading data from stock exchanges.
    Examples of personal data: photo streams, social media streams. Data need not be in a table in some RDBMS.
    An airline jet collects 10 terabytes of sensor data for every 30 minutes of flying time. The NYSE generates about one terabyte of new trade data per day, used for stock trading analytics to determine trends for optimal trades.
  3. What and where is Big Data? What makes Big Data 'big'?
    The internet is the biggest source of data, estimated at 1.8 zettabytes (1 zettabyte = 10^21 bytes).
    Searching the internet means knowing what is out there and saving the data; being able to determine what an end user is looking for and getting it from this vast store (searching and ranking data); and getting the results back quickly enough that the user hasn't moved on to something else (indexing).
    Defining the Big Data problem: data will continue to grow indefinitely, so we need hardware that grows with the data. We cannot keep buying bigger machines because after a while they become cost-prohibitive. Hardware should 'grow' as data grows and scale horizontally, and additional hardware should result in a proportional increase in performance.
  4. Facebook Example
    As of 2011 there were 500,000,000 active Facebook users, approximately 1 in every 13 people on Earth, and half of them are logged in on any given day. A record-breaking 750 million photos were uploaded to Facebook over New Year's weekend. There are 206.2 million internet users in the US, which means 71.2% of the US web audience is on Facebook. Facebook users spend 10.5 billion minutes (almost 20,000 years) online on the social network, and an average of 3.2 billion likes and comments are posted every day.
    Every 20 minutes on Facebook:
      Links shared: 1,000,000
      Event invites: 1,484,000
      Friend requests accepted: 1,972,000
      Photos uploaded: 2,716,000
      Messages sent: 2,716,000
      Tagged photos: 1,323,000
      Status updates: 1,851,000
      Wall posts: 1,587,000
      Comments made: 10,208,000
  5. Twitter Example
    Twitter has over 500 million registered users. The USA's 141.8 million accounts represent 27.4 percent of all Twitter users, well ahead of Brazil, Japan, the UK and Indonesia. 79% of US Twitter users are more likely to recommend brands they follow, 67% are more likely to buy from brands they follow, and 57% of all companies that use social media for business use Twitter.
  6. Characteristics of Big Data (http://www-01.ibm.com/software/data/bigdata/)
    Volume: 12 terabytes of tweets created each day.
    Velocity: scrutinize 5 million trade events created each day to identify potential fraud.
    Variety: sensor data, audio, video, click streams, log files and more.
  7. Data Volume is Growing Exponentially
    Estimated global data volume: 1.8 ZB in 2011, 7.9 ZB in 2015. The world's information doubles every two years. Over the next 10 years the number of servers worldwide will grow by 10x, the amount of information managed by enterprise data centers will grow by 50x, and the number of "files" enterprise data centers handle will grow by 75x.
    Humanity passed the 1 zettabyte mark in 2010. A zettabyte is 1,000,000,000,000,000,000,000 bytes (that's 21 zeroes), or one trillion gigabytes: enough data to fill roughly 75 billion 16 GB iPads.
    Units: 1,000 gigabytes = 1 terabyte; 1,000 terabytes = 1 petabyte; 1 million terabytes = 1 exabyte; 1 billion terabytes = 1 zettabyte.
  8. Unstructured Data is Exploding
    [Chart: stored digital information (exabytes), 1970 to 2010, split into complex/unstructured application data vs. relational business transaction data.]
    Roughly 2,500 exabytes of new information were created in 2012, with the internet as the primary driver. The digital universe grew by 62% last year to 800,000 petabytes and will grow to 1.2 zettabytes this year.
  9. Common Big Data Customer Scenarios
    Web & e-tailing: recommendation engines, ad targeting, search quality, abuse and click-fraud detection.
    Financial services: modeling true risk, threat analysis, fraud detection, trade surveillance, credit scoring and analysis.
    Retail: point-of-sale transaction analysis, customer churn analysis, sentiment analysis.
  10. Limitations of the Existing Data Analytics Architecture
    The typical stack moves data from instrumentation and collection into a storage-only grid holding the original raw data (mostly append), through an ETL compute grid, and into an RDBMS of aggregated data that feeds BI reports and interactive apps. Its limitations:
    1) Can't explore the original high-fidelity raw data.
    2) Moving data to compute doesn't scale.
    3) Archiving = premature data death.
  11. Solution: A Combined Storage & Compute Layer
    Hadoop provides a combined storage + compute grid that sits between instrumentation/collection (mostly append) and the RDBMS of aggregated data feeding BI reports and interactive apps. Its benefits:
    1) Data exploration and advanced analytics.
    2) Scalable throughput for ETL and aggregation.
    3) Keep data alive forever.
  12. Why DFS?
    Start from a single-core 64-bit processor with 8 GB RAM. There are two ways to grow:
    Scale vertically: add a bigger CPU and more storage, e.g. a quad-core 64-bit processor with 32 GB RAM, then an 8-core 64-bit processor with 128 GB RAM.
    Scale horizontally: add another computer with similar memory, CPU and storage, e.g. another quad-core 64-bit processor with 32 GB RAM.
  13. Why DFS?
    Reading 1 TB of data on 1 machine with 4 I/O channels at 100 MB/s per channel takes about 45 minutes. Spread across 10 machines, each with 4 I/O channels at 100 MB/s per channel, the same read takes about 4.5 minutes (a back-of-the-envelope check follows this slide).
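    A rough check of those numbers, assuming 1 TB is about 10^6 MB and that all four channels stream in parallel:

```latex
% One machine: 4 I/O channels x 100 MB/s = 400 MB/s aggregate bandwidth
\[
t_{\text{1 machine}} = \frac{10^{6}\ \text{MB}}{4 \times 100\ \text{MB/s}} = 2500\ \text{s} \approx 42\ \text{min}
\]
% Ten machines each read one tenth of the data in parallel
\[
t_{\text{10 machines}} = \frac{t_{\text{1 machine}}}{10} = 250\ \text{s} \approx 4.2\ \text{min}
\]
```

    That lands close to the slide's 45 and 4.5 minutes once seek and coordination overhead are added.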
  14. What Is Hadoop?
    Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model. It is open-source data management with scale-out storage and distributed processing: it scales horizontally to manage petabytes of data and abstracts away distributed storage and computing. Storage is hidden behind the Hadoop Distributed File System (HDFS) and analytics run via the MapReduce framework.
    The Hadoop services:
      HDFS: a distributed file system
      MapReduce: a distributed data processing model
      HBase: a distributed column-oriented database
      Hive: a distributed data warehouse
      Pig: a data flow language
      ZooKeeper: a distributed, highly available coordination service
      Sqoop: a tool for efficient bulk transfer of data between Hadoop and other sources such as RDBMSs
      Oozie: a service for running and scheduling workflows of Hadoop jobs
  15. Hadoop Eco-System
    HDFS (Hadoop Distributed File System) sits at the base, with the MapReduce framework and HBase on top of it. Hive (data warehouse system), Pig Latin (data analysis) and Mahout (machine learning) sit above MapReduce, and Apache Oozie handles workflow. Flume imports unstructured or semi-structured data, while Sqoop imports or exports structured data.
  16. Hadoop Core Components
    HDFS, the Hadoop Distributed File System (storage): data is distributed across "nodes" and natively redundant, and the NameNode tracks block locations. Self-healing, high-bandwidth clustered storage.
    MapReduce (processing): splits a task across processors "near" the data and assembles the results. The JobTracker coordinates TaskTrackers, which run alongside the DataNodes that store HDFS blocks under the NameNode. (A minimal word-count sketch follows this slide.)
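    To make "splits a task across processors near the data" concrete, here is a minimal sketch of the classic word-count job written against the org.apache.hadoop.mapreduce API. The class names are illustrative, not from the deck: the map phase runs on each input split and emits (word, 1) pairs, and the reduce phase assembles per-word totals.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: runs on one input split, "near" the data, and emits (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: receives all counts for a given word and sums them.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```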
  17. HDFS Overview: HDFS Design
    What Hadoop is designed for: storing very large files (each in the MB, GB or TB range), write once / read many times, and working on commodity hardware.
    What Hadoop is not designed for: low-latency data access, lots of small files, multiple writers and arbitrary file modifications.
    The HDFS concepts: data will continue to grow indefinitely, so we need hardware that grows with the data. We cannot keep buying bigger machines because after a while they become cost-prohibitive. Hardware should 'grow' as data grows and scale horizontally, and additional hardware should result in a proportional increase in performance.
  18. HDFS Architecture
    Clients issue metadata operations to the NameNode (with a Secondary NameNode alongside it) and read from or write to DataNodes arranged in racks (Rack 1, Rack 2). The NameNode issues block operations to the DataNodes, and written blocks are replicated between DataNodes across racks.
  19. Main Components Of HDFS
    NameNode (master node):
      Maintains and manages the blocks present on the DataNodes.
      Manages the file system tree and other metadata, and is responsible for maintaining the namespace image and edit log files; any change to the file system namespace or its properties is recorded by the NameNode.
      Maps file blocks to DataNodes (the physical location of data), so it is aware of the DataNodes holding a particular file.
      Important files for the NameNode: the image (the metadata itself), the checkpoint (a persistent record of the image stored in the native file system), and the journal (a modification log of the image stored in the local file system).
    DataNode (slave node):
      Slaves deployed on each machine that provide the actual storage; they are the workhorses of the Hadoop file system.
      Responsible for serving read and write requests from clients; they store and retrieve blocks when told to, and report back to the NameNode periodically with the list of blocks they hold.
      A block replica is represented by two files: one that stores the data itself, and one that stores the block metadata, including the checksum for the block and the block's generation timestamp.
  20. Secondary Name Node
    The Secondary NameNode is not a hot standby for the NameNode. It connects to the NameNode every hour (by default) for housekeeping and a backup of the NameNode metadata; the saved metadata can be used to rebuild a failed NameNode. Its primary role is to periodically merge the namespace image with the edit log so the edit log stays within a size limit. The Secondary NameNode usually runs on a separate physical machine because it requires as much memory as the NameNode to perform the merge. The NameNode itself remains a single point of failure.
  21. Name Node Metadata
    Metadata is kept in memory: the entire metadata lives in main memory, with no demand paging of file system metadata.
    Types of metadata: the list of files, the list of blocks for each file, the list of DataNodes for each block, and file attributes such as access time and replication factor. An illustrative query of this metadata follows this slide.
    A transaction log records file creations, file deletions, and so on.
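    A hedged illustration of the kind of metadata the NameNode serves, queried through the client-side FileSystem API. The NameNode URI and file path are assumptions made up for the example.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeMetadataExample {
    public static void main(String[] args) throws Exception {
        // Assumed NameNode address; in practice it comes from core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());

        // File attributes held by the NameNode: size, replication factor, access time, ...
        FileStatus status = fs.getFileStatus(new Path("/user/demo/hello.txt"));
        System.out.println("replication = " + status.getReplication()
                + ", access time = " + status.getAccessTime());

        // The list of blocks for the file and the DataNodes holding each block.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block at offset " + block.getOffset()
                    + " on hosts " + String.join(",", block.getHosts()));
        }
    }
}
```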
  22. Job Tracker
    Job submission, client side: the user 1) copies the input files to DFS and 2) submits the job to the client, which 3) gets the input files' info from DFS, 4) creates the splits, 5) uploads the job information (job.xml, job.jar) to DFS, and 6) submits the job to the JobTracker.
  23. Job Tracker (contd.)
    JobTracker side: the submitted job (step 6) enters the job queue; the JobTracker 7) initializes the job, 8) reads the job files (job.xml, job.jar) and input splits from DFS, and 9) creates the map and reduce tasks, with as many maps as there are splits.
  24. Job Tracker (contd.)
    TaskTrackers on hosts H1 to H4 send 10) heartbeats to the JobTracker, which 11) picks tasks from the job queue (data-local if possible) and 12) assigns them to the TaskTrackers.
  25. Job Tracker
    The JobTracker is the MapReduce master, delegating jobs to TaskTrackers. Clients submit jobs to the JobTracker and the jobs are kept in a queue (FIFO scheduler or capacity scheduler). The JobTracker determines the location of the data through the NameNode, finds available TaskTrackers (preferring slots near the data), and submits the work to them. The TaskTracker monitors its tasks and sends updates to the JobTracker; after completion the JobTracker updates the job's status. The JobTracker is a single point of failure. A minimal job-submission sketch follows this slide.
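    As a sketch of what "client submits jobs to the JobTracker" looks like in code, here is a minimal driver for the word-count mapper and reducer shown after slide 16, written against the org.apache.hadoop.mapreduce API. The input and output paths come from the command line; the class names are the illustrative ones used earlier, not from the deck.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Describe the job: the jar, the mapper/reducer classes and the output types.
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input splits are computed from these paths; one map task per split.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job to the cluster and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```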
  26. Anatomy of a File Write
    The HDFS client asks the NameNode to add a block (giving the source file), writes the data through a pipeline of DataNodes in the cluster, and the DataNodes report the blocks received back to the NameNode.
  27. Client Writing a file to HDFS
    The client creates a new file by giving its path to the NameNode. For each block, the NameNode returns the list of DataNodes that will host its replicas; the client pipelines the data to the chosen DataNodes, and the DataNodes confirm the creation of the block replicas to the NameNode. A client-side sketch follows this slide.
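    A hedged client-side sketch of that write path using Hadoop's FileSystem API. The NameNode URI and file path are assumptions for the example; the pipelining to DataNodes happens inside the output stream returned by create().

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; normally picked up from core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // create() asks the NameNode for a new file entry; the data then flows
        // directly through the DataNode pipeline chosen for each block.
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"))) {
            out.writeUTF("Hello, HDFS!");
        }
    }
}
```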
  28. Anatomy of a File Read
    The HDFS client asks the NameNode for the block locations and then reads the blocks directly from the DataNodes in the cluster.
  29. Client Reading a file from HDFS
    The client connects to the NameNode and asks for the list of blocks and, for each block, the DataNodes hosting its replicas; the locations are ordered by distance from the reader. The client then reads directly from the DataNodes without contacting the NameNode again. Along with the data, a checksum is shipped for verifying data integrity. Why? If a replica is corrupt, the client informs the NameNode and tries to get the data from another DataNode. A client-side sketch follows this slide.
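    A hedged client-side sketch of that read path, again using the FileSystem API with an assumed NameNode URI and file path.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // open() fetches the ordered block locations from the NameNode; the stream
        // then reads each block straight from the closest DataNode, verifying
        // checksums as it goes.
        try (FSDataInputStream in = fs.open(new Path("/user/demo/hello.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```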
  30. Replication and Rack Awareness
    Replica placement policy: the 1st replica goes on one node in the local rack, the 2nd replica on a different node in the local rack, and the 3rd replica on a node in a different rack.
    Replica selection: the replica closest to the reader is used; if a replica exists on the same rack as the reader node, that replica is preferred. This reduces bandwidth consumption and improves read latency.
    Block size and replication can be configured per file, and an application can specify the replication factor of a file (see the sketch after this slide).
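    A hedged sketch of per-file replication and block-size settings through the FileSystem API. The NameNode URI, paths, replication factor of 3 and 128 MB block size are all illustrative values, not from the deck.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());

        // Change the replication factor of an existing file to 3; the NameNode
        // schedules the extra copies according to the rack-aware placement policy.
        fs.setReplication(new Path("/user/demo/hello.txt"), (short) 3);

        // Block size and replication can also be chosen when a file is created.
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/big.dat"),
                true,               // overwrite if it exists
                4096,               // client-side buffer size in bytes
                (short) 2,          // replication factor for this file
                128L * 1024 * 1024  // block size: 128 MB
        )) {
            out.writeUTF("per-file settings demo");
        }
    }
}
```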
  31. Big Data: It's about Scale and Structure
    RDBMS (EDW, MPP, NoSQL) vs. Hadoop:
      Data types: structured | multi and unstructured
      Processing: limited, no data processing | processing coupled with data
      Governance: standards & structured | loosely structured
      Schema: required on write | required on read
      Speed: reads are fast | writes are fast
      Cost: software license | support only
      Resources: known entity | growing, complexities, wide
      Best-fit use: interactive OLAP analytics, complex ACID transactions, operational data store | data discovery, processing unstructured data, massive storage/processing