Introduction to Big Data

Shankar

July 06, 2011

Transcript

1. Big Data in the News
   ›  Savings
      ›  American Health-Care: $300 Billion/Year
      ›  European Public Sector: €250 Billion/Year
   ›  Productivity Margins: 60% Increase
   Source: McKinsey Global Institute

2. Topics
   ›  What do we collect today?
   ›  DBMS Landscape
   ›  The Disconnect
   ›  The Need
   ›  What is Big Data?
   ›  Characteristics
   ›  Approach
   ›  Architectural Requirements
   ›  Techniques
   ›  Challenges
   ›  Solutions
   ›  Issues
   ›  Deep Dive – Practical Approaches to Big Data
      ›  Hadoop
      ›  Aster Data

3. What do we collect?
   ›  In 2010, people stored enough data to fill 60,000 Libraries of Congress (the LoC had collected 235 TB as of April 2011)
   ›  YouTube receives 24 hours of video every minute
   ›  5 billion mobile phones were in use in 2010
   ›  Tesco (British retailer) collects 1.5 billion pieces of information to adjust prices and promotions
   ›  Amazon.com: 30% of sales come from its recommendation engine
   ›  Placecast, Mobclix: track-and-target systems deliver contextual promotions
   ›  A Boeing jet engine produces 20 TB/hour for engineers to examine in real time to make improvements
   Sources: Forrester, The Economist, McKinsey Global Institute

4. Collect More
   ›  Business Operations
      ›  Transactions: Registers, Gateways
   ›  Customer Information
      ›  CRM
   ›  Product Information
      ›  Barcodes, RFID
   ›  Web
      ›  Pages, Web Repositories
      ›  Unstructured Information
      ›  Social Media
   ›  Signals
      ›  Mobile
      ›  GPS, GeoSpatial

5. DBMS Solutions
   ›  Legacy Strengths
      ›  Faster Retrieval, Efficient Storage
      ›  Divide and Access
      ›  Data Consolidation
      ›  Broader Tables: Access All as a Row
      ›  Fine-Grained Access
      ›  Security: Rules and Policies
   ›  Problems
      ›  Data Growth, even when storage cost is not an issue
      ›  Scalability Issues
      ›  Performance Issues
      ›  New Types of Requirements
      ›  Deciding what to analyze, when, and how
      ›  Cost of a change in the subject area to analyze

6. The Disconnect
   ›  Old DBMS vs. New Data Types/Structures
   ›  Old DBMS vs. New Volume
   ›  Old DBMS vs. New Analysis
   ›  Old DBMS vs. Data Retention
   ›  Old DBMS vs. Data Element Striping
   ›  Old DBMS vs. Data Infrastructure
   ›  Old DBMS vs. One DB Platform for All

7. The Need: A New Approach
   ›  A system that can handle high-volume data
   ›  Performs complex operations
   ›  Scalable
   ›  Robust
   ›  Highly Available
   ›  Fault Tolerant
   ›  Economical

8. Big Data
   "Tools and techniques to manage different types of data, in high volume and at high velocity, with varied requirements to mine them"
   Characteristics
   ›  Size
      ›  Scale up and scale out: Terabytes, Petabytes, ...
   ›  Structure
      ›  Structured
      ›  Unstructured: Audio, Video, Text, GeoSpatial
      ›  Schema-less Structures
   ›  Stream
      ›  Torrent of real-time information
   ›  Operation
      ›  Massively Parallel Processing (MPP)

9. Approach
   Hardware
   ›  Commodity Hardware
   ›  Appliance
   ›  Dynamic Scaling
   ›  Fault Tolerant, Highly Available
   ›  No constraints on Storage
   ›  Cloud: Virtual Environment, Storage
   Processing Models
   ›  In-memory
   ›  In-database
   ›  Interfaces/Adapters
   ›  Workload Management
   ›  Distributed Data Processing
   Software
   ›  Frameworks – Hadoop, MapReduce, Vrije, BOOM, Bloom
   ›  Open Source
   ›  Proprietary

10. Challenges
   ›  Volumetric Analysis
   ›  Complexity
   ›  Streaming Data / Real-Time Data
   ›  Network Topology
   ›  Infrastructure
   ›  Pattern-based Strategy

11. Techniques
   ›  Controlled and Variate Testing
   ›  Mining
   ›  Machine Learning
   ›  Natural Language Processing (NLP)
   ›  Cohort Analysis
   ›  Network or Path Analysis
   ›  Predictive Models
   ›  Crowdsourcing
   ›  Regression Models
   ›  Sentiment Analysis
   ›  Signal Processing
   ›  Spatial Analytics
   ›  Visualization
   ›  Time-series Analysis

12. Solutions
   ›  IBM: InfoSphere BigInsights, Streams
   ›  Teradata/Aster Data: nCluster, SQL-MR
   ›  Frameworks: Hadoop, MapReduce
   ›  Infobright*, Splunk, Cloudera*
   ›  Cassandra; NoSQL, NewSQL
   ›  Google's Bigtable
   ›  Appliances: Teradata, Netezza (IBM)
   ›  Columnar Databases: Vertica (HP), ParAccel
   * Managed services available

13. Issues
   ›  Latency
   ›  Faultiness
   ›  Accuracy
   ›  ACID
      ›  Atomicity
      ›  Consistency
      ›  Isolation
      ›  Durability
   ›  Setup Cost
   ›  Development Cost
   ›  Cost-to-fly

14. Hadoop
   ›  Top-level Apache project
   ›  Open source
   ›  Software framework, written in Java
   ›  Inspired by Google's white papers on MapReduce (MR), the Google File System (GFS), and Bigtable
   ›  Originally developed to support Apache Nutch
   ›  Designed for
      ›  Large-scale data processing
      ›  Batch processing
      ›  Sophisticated analysis
      ›  Structured and unstructured data
   A DB architect's take on Hadoop: "Heck, Another Darn Obscure Open-source Project"

15. Why Hadoop?
   ›  Runs on commodity hardware
   ›  Portable across heterogeneous hardware and software platforms
   ›  Shared-nothing architecture
   ›  Scale hardware whenever you want; the system compensates for hardware scaling and issues (if any)
   ›  Runs large-scale, high-volume data processes
   ›  Scales well with complex analysis jobs
   ›  (Hardware) "Failure is an option"
   ›  Ideal for consolidating data from both new and legacy data sources
   ›  Highly integrable
   ›  Value to the business

16. Hadoop Ecosystem
   ›  HDFS – Hadoop Distributed File System
   ›  MapReduce – software framework for clustered, distributed data processing
   ›  ZooKeeper – distributed coordination service
   ›  Avro – data serialization
   ›  Chukwa – data collection system for monitoring distributed systems
   ›  HBase – data storage for distributed large tables
   ›  Hive – data warehouse
   ›  Pig – high-level query language
   ›  Scribe – log collection
   ›  UDFs – user-defined functions

17. Hadoop Distributed File System (HDFS)
   ›  Master/slave architecture
   ›  Runs on commodity hardware
   ›  Fault tolerant
   ›  Handles large volumes of data
   ›  Provides high throughput
   ›  Streaming data access
   ›  Simple file coherency model
   ›  Portable to heterogeneous hardware and software
   ›  Robust: handles disk failures, replication (and re-replication)
   ›  Performs cluster rebalancing and data integrity checks

18. HDFS Architecture
   Name node
   •  File system operations
   •  Maps data nodes
   Data node
   •  Processes reads/writes
   •  Handles data blocks
   •  Replication

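To make the name-node/data-node split concrete, here is a minimal sketch of reading a file through Hadoop's Java client API (org.apache.hadoop.fs.FileSystem). This example is not from the deck: the cluster address hdfs://namenode:9000 and the path /data/sample.txt are hypothetical placeholders, and the fs.defaultFS key assumes a Hadoop 2.x-style configuration.

    // Minimal sketch: read a file from HDFS via the Java client API.
    // Hypothetical cluster address and file path for illustration only.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical address
            FileSystem fs = FileSystem.get(conf);

            // The name node resolves the path to block locations;
            // the client then streams the blocks from the data nodes.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/data/sample.txt"))))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
            fs.close();
        }
    }
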
19. Hadoop M/R
   ›  Each run is tagged by a job
   ›  Splits the input data set into separate chunks
   ›  Chunks are processed by map tasks, in parallel
   ›  Sorts the output of the maps
   ›  Sorted output is processed by reduce tasks, in parallel
   ›  Input and output are typically stored in a file system
   ›  The framework takes care of
      ›  Scheduling tasks
      ›  Monitoring
      ›  Re-executing failed tasks
      ›  Infrastructure issues: load balancing, load redistribution, replication, failover

20. Mapper Function
   input | map  | shuffle | reduce  | output
   cat * | grep | sort    | uniq -c | cat > file

21. Reduce Function
   input | map  | shuffle | reduce  | output
   cat * | grep | sort    | uniq -c | cat > file

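The pipeline analogy above maps one-to-one onto Hadoop's Java API. Below is a minimal word-count sketch against the org.apache.hadoop.mapreduce API (Hadoop 2.x style; not from the deck): the mapper plays the grep/tokenize role, the framework's shuffle is the sort, and the reducer is uniq -c.

    // Minimal word-count sketch for Hadoop MapReduce (Hadoop 2.x API).
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every token in the input split ("grep").
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: the framework has already sorted and grouped by word;
        // sum the counts for each key (the "uniq -c" step).
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

A job like this is typically packaged into a jar and launched with hadoop jar wordcount.jar WordCount <input> <output> (the jar and class names here are placeholders); the output directory must not already exist, or the job fails fast.
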
22. Aster Data
   ›  Now part of Teradata
   ›  Massively parallel
   ›  SQL layer on MR (MapReduce)
   ›  In-database analytics
   ›  Appliance vs. software stack model
   ›  Cloud options
   ›  nPath and statistical options
   ›  Data integration

23. Thank You
   "You either scale to where your customer base takes you or you die."
   – Jim Starkey, Founder and CTO, NimbusDB
   "Our philosophy is to build infrastructure using the best tools available for the job and we are constantly evaluating better ways to do things when and where it matters."
   – Facebook
   "In any year we probably generate more data than the Walt Disney Co. did in the first 80 years of existence."
   – Bud Albers, Disney