Introduction to Big Data

Shankar

July 06, 2011

Transcript

1. Big Data in the News
   ›  Savings
      ›  American Health-Care: $300 Billion/Year
      ›  European Public Sector: €250 Billion/Year
   ›  Productivity Margins: 60% Increase
   Source: McKinsey Global Institute

2. Topics
   ›  What do we collect today?
   ›  DBMS Landscape
   ›  The Disconnect
   ›  The Need
   ›  What is Big Data?
   ›  Characteristics
   ›  Approach
   ›  Architectural Requirements
   ›  Techniques
   ›  Challenges
   ›  Solutions
   ›  Issues
   ›  Deep Dive – Practical Approaches to Big Data
      ›  Hadoop
      ›  Aster Data

3. What do we collect?
   ›  In 2010, people stored enough data to fill 60,000 Libraries of Congress (the LoC had collected 235 TB as of April 2011)
   ›  YouTube receives 24 hours of video every minute
   ›  5 billion mobile phones were in use in 2010
   ›  Tesco (British retailer) collects 1.5 billion pieces of information to adjust prices and promotions
   ›  Amazon.com: 30% of sales come from its recommendation engine
   ›  Placecast, Mobclix: track-and-target systems deliver contextual promotions
   ›  A Boeing jet engine produces 20 TB/hour for engineers to examine in real time to make improvements
   Sources: Forrester, The Economist, McKinsey Global Institute

4. Collect More
   ›  Business Operations
      ›  Transactions: Registers, Gateways
   ›  Customer Information
      ›  CRM
   ›  Product Information
      ›  Barcodes, RFID
   ›  Web
      ›  Pages, Web Repositories
      ›  Unstructured Information
      ›  Social Media
   ›  Signals
      ›  Mobile
      ›  GPS, GeoSpatial

5. DBMS Solutions
   ›  Legacy Strengths
      ›  Faster Retrieval, Efficient Storage
      ›  Divide and Access
      ›  Data Consolidation
      ›  Broader Tables: Access All as a Row
      ›  Fine-Grained Access
      ›  Security: Rules and Policies
   ›  Problems
      ›  Data Growth, even when storage cost is not an issue
      ›  Scalability Issues
      ›  Performance Issues
      ›  New Types of Requirements
      ›  Deciding what to analyze, when, and how
      ›  Cost of a change in the subject area to analyze

6. The Disconnect
   ›  Old DBMS vs. New Data Types/Structures
   ›  Old DBMS vs. New Volume
   ›  Old DBMS vs. New Analysis
   ›  Old DBMS vs. Data Retention
   ›  Old DBMS vs. Data Element Striping
   ›  Old DBMS vs. Data Infrastructure
   ›  Old DBMS vs. One DB Platform for All

7. The Need: A New Approach
   ›  A system that can handle high-volume data
   ›  Performs complex operations
   ›  Scalable
   ›  Robust
   ›  Highly Available
   ›  Fault Tolerant
   ›  Economical

8. Big Data
   "Tools and techniques to manage different types of data, in high volume and at high velocity, with varied requirements to mine them"
   Characteristics
   ›  Size
      ›  Scale up and scale out: Terabytes, Petabytes, ...
   ›  Structure
      ›  Structured
      ›  Unstructured: Audio, Video, Text, GeoSpatial
      ›  Schema-less Structures
   ›  Stream
      ›  Torrent of real-time information
   ›  Operation
      ›  Massively Parallel Processing (MPP)

9. Approach
   Hardware
   ›  Commodity Hardware
   ›  Appliance
   ›  Dynamic Scaling
   ›  Fault Tolerant, Highly Available
   ›  No constraints on Storage
   ›  Cloud: Virtual Environment, Storage
   Processing Models
   ›  In-memory
   ›  In-database
   ›  Interfaces/Adapters
   ›  Workload Management
   ›  Distributed Data Processing
   Software
   ›  Frameworks – Hadoop, MapReduce, Vrije, BOOM, Bloom
   ›  Open Source
   ›  Proprietary

10. Challenges
   ›  Volumetric Analysis
   ›  Complexity
   ›  Streaming Data / Real-Time Data
   ›  Network Topology
   ›  Infrastructure
   ›  Pattern-based Strategy

11. Techniques
   ›  Controlled and Variate Testing
   ›  Mining
   ›  Machine Learning
   ›  Natural Language Processing (NLP)
   ›  Cohort Analysis
   ›  Network or Path Analysis
   ›  Predictive Models
   ›  Crowdsourcing
   ›  Regression Models
   ›  Sentiment Analysis
   ›  Signal Processing
   ›  Spatial Analytics
   ›  Visualization
   ›  Time-series Analysis

12. Solutions
   ›  IBM: InfoSphere BigInsights, Streams
   ›  Teradata/Aster Data: nCluster, SQL-MR
   ›  Frameworks: Hadoop, MapReduce
   ›  Infobright*, Splunk, Cloudera*
   ›  Cassandra; NoSQL, NewSQL
   ›  Google's Bigtable
   ›  Appliances: Teradata, Netezza (IBM)
   ›  Columnar Databases: Vertica (HP), ParAccel
   * Managed services available

13. Issues
   ›  Latency
   ›  Faultiness
   ›  Accuracy
   ›  ACID
      ›  Atomicity
      ›  Consistency
      ›  Isolation
      ›  Durability
   ›  Setup Cost
   ›  Development Cost
   ›  Cost-to-fly

14. Hadoop
   ›  Top-level Apache project
   ›  Open source
   ›  Software framework, written in Java
   ›  Inspired by Google's white papers on MapReduce (MR), the Google File System (GFS), and Bigtable
   ›  Originally developed to support Apache Nutch
   ›  Designed for
      ›  Large-scale data processing
      ›  Batch processing
      ›  Sophisticated analysis
      ›  Structured and unstructured data
   A DB architect's take on Hadoop: "Heck, Another Darn Obscure Open-source Project"

15. Why Hadoop?
   ›  Runs on commodity hardware
   ›  Portable across heterogeneous hardware and software platforms
   ›  Shared-nothing architecture
   ›  Scale hardware whenever you want; the system compensates for hardware scaling and issues (if any)
   ›  Runs large-scale, high-volume data processes
   ›  Scales well with complex analysis jobs
   ›  (Hardware) "Failure is an option"
   ›  Ideal for consolidating data from both new and legacy data sources
   ›  Highly integrable
   ›  Value to the business

16. Hadoop Ecosystem
   ›  HDFS – Hadoop Distributed File System
   ›  MapReduce – software framework for clustered, distributed data processing
   ›  ZooKeeper – distributed coordination service
   ›  Avro – data serialization
   ›  Chukwa – data collection system for monitoring distributed systems
   ›  HBase – data storage for distributed large tables
   ›  Hive – data warehouse
   ›  Pig – high-level query language
   ›  Scribe – log collection
   ›  UDFs – user-defined functions

17. Hadoop Distributed File System (HDFS)
   ›  Master/slave architecture
   ›  Runs on commodity hardware
   ›  Fault tolerant
   ›  Handles large volumes of data
   ›  Provides high throughput
   ›  Streaming data access
   ›  Simple file coherency model
   ›  Portable to heterogeneous hardware and software
   ›  Robust: handles disk failures, replication (and re-replication)
   ›  Performs cluster rebalancing and data integrity checks

18. HDFS Architecture
   Name node
   •  File system operations
   •  Maps data nodes
   Data node
   •  Processes reads/writes
   •  Handles data blocks
   •  Replication

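To make the name-node/data-node split concrete, here is a minimal sketch of reading a file through Hadoop's Java client API (org.apache.hadoop.fs.FileSystem). This example is not from the deck: the cluster address hdfs://namenode:9000 and the path /data/sample.txt are hypothetical placeholders, and the fs.defaultFS key assumes a Hadoop 2.x-style configuration.

    // Minimal sketch: read a file from HDFS via the Java client API.
    // Hypothetical cluster address and file path for illustration only.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical address
            FileSystem fs = FileSystem.get(conf);

            // The name node resolves the path to block locations;
            // the client then streams the blocks from the data nodes.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/data/sample.txt"))))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
            fs.close();
        }
    }
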
19. Hadoop M/R
   ›  Each run is tagged by a job
   ›  Splits the input data set into separate chunks
   ›  Chunks are processed by map tasks, in parallel
   ›  Sorts the output of the maps
   ›  Sorted output is processed by reduce tasks, in parallel
   ›  Input and output are typically stored in a file system
   ›  The framework takes care of
      ›  Scheduling tasks
      ›  Monitoring
      ›  Re-executing failed tasks
      ›  Infrastructure issues: load balancing, load redistribution, replication, failover

20. Mapper Function
   input | map  | shuffle | reduce  | output
   cat * | grep | sort    | uniq -c | cat > file

21. Reduce Function
   input | map  | shuffle | reduce  | output
   cat * | grep | sort    | uniq -c | cat > file

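The pipeline analogy above maps one-to-one onto Hadoop's Java API. Below is a minimal word-count sketch against the org.apache.hadoop.mapreduce API (Hadoop 2.x style; not from the deck): the mapper plays the grep/tokenize role, the framework's shuffle is the sort, and the reducer is uniq -c.

    // Minimal word-count sketch for Hadoop MapReduce (Hadoop 2.x API).
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every token in the input split ("grep").
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: the framework has already sorted and grouped by word;
        // sum the counts for each key (the "uniq -c" step).
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

A job like this is typically packaged into a jar and launched with hadoop jar wordcount.jar WordCount <input> <output> (the jar and class names here are placeholders); the output directory must not already exist, or the job fails fast.
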
22. Aster Data
   ›  Now part of Teradata
   ›  Massively parallel
   ›  SQL layer on MR (MapReduce)
   ›  In-database analytics
   ›  Appliance vs. software stack model
   ›  Cloud options
   ›  nPath and statistical options
   ›  Data integration

23. Thank You
   "You either scale to where your customer base takes you or you die."
   – Jim Starkey, Founder and CTO, NimbusDB
   "Our philosophy is to build infrastructure using the best tools available for the job and we are constantly evaluating better ways to do things when and where it matters."
   – Facebook
   "In any year we probably generate more data than the Walt Disney Co. did in the first 80 years of existence."
   – Bud Albers, Disney