Big Data

Shankar
February 13, 2013

Presentation on Big Data during Kurukshetra 2013

Transcript

  1. Topics
    • Data Management Today
    • New Interests, Expectations, Problems
    • Big Data
    • New Approach
    • Big Data Ecosystem
    • Q & A
  2. Data Management Today
    • Relational Databases: Oracle, MySQL, MS-SQL Server
    • Data Warehouse Appliances: Teradata, IBM-Netezza
    • Legacy Systems: Mainframes
  3. New Interests, Expectations
    • Collect More, Data-Mine More
    • Complex Data Integration
    • Advanced Analytics
    • Social Data Analysis
    • Machine Data Analysis
    • Realtime Data Analysis
    • Actionable Insights
    • Extension of Investments
    • Talent Management
    • ROI
    • TCO
    • Business Continuity
  4. How Big is Data? 5 BIG Facts (as of Oct 2012)
    • 90% of the world’s data was created in the last two years
    • $214 is the average amount companies have to spend per compromised customer when a data breach occurs
    • 500,000+ data centers across the world are large enough to fill 5,955 football fields
    • 247bn e-mail messages are sent each day… about 80% of them are spam
    • 2.7bn is the average number of “likes” and “comments” posted on Facebook daily
    • It would take 2,000 hours to watch all the YouTube videos uploaded while we’re talking on this panel* (*this is 3x more than just 2 short years ago)
  5. New Problems
    • Unpredictable Volume
    • Data Processing Issues
    • Data Integration Issues
    • Identifying Source-of-Truth
    • Store vs. Analyze
    • Data Retrieval Requirements
    • Computing Limitations
    • Information vs. Insights
    • Business Requirements
    • Regulatory Requirements
    • True Value-of-Data
    • Price to Performance Dilemma
  6. What is Big Data? Volume, Velocity, Complexity, Variety (Source: Ventana Research)
    • Variety
      • Structured data: OLTP, DW, ODS, data marts
      • Unstructured data: text, audio, video, click streams, log files
    • Volume
      • Very large data sets
      • Sizes from 100 TB to 50 PB
      • Larger than “one machine”
      • Whole data set analysis replaces “sampling”
    • Velocity
      • Real-time data, streaming data
      • High volume / low latency
      • Write heavy, read heavy; both are common
    • Complexity
      • Data acquisition
      • Analysis
      • Deriving insights
  7. New Approach
    • Commodity Hardware
    • Open Compute Project
    • Open Source Solutions, Frameworks
    • Value Added Products: Cloudera, DataStax, 10gen
    • Research Oriented Product Development
    • Augmented Ecosystem
  8. Big Data: Ecosystem (diagram; layers and labels as recovered from the slide)
    • BI / Reporting: Data Engineering (Performance Reporting, Enterprise Metrics); Data Agility (Data Mining, OLAP Modeling, etc.)
    • Advanced Visualizations: Data Delivery (Dashboards, Scorecard (Strategy Maps), Spatial & Temporal Analysis)
    • Advanced Analytics: Predictive & Optimization Modeling, Business Processes Analysis, Functional Analysis
    • Data Integration & Management: Data Filtering, Data Consolidation & Warehousing, Data Quality, Metadata Management, Job Scheduling, Data Economics
    • Data Storage and Processing: Data Storage, Data Processing
    • Distributed Infrastructure
    • Capability labels: Data Consolidation, Data Economics, Data Engineering, Data Agility, Data Delivery, Data Visualization, Data Analytics
    • Tools shown: Mahout, R, Avro, Sqoop, Flume, Scribe, Oozie, Zookeeper, Traditional ETL with Hadoop connectors, Chukwa, Datameer, Tableau, Hive, Pig, Other BI Tools with Hadoop connectors, MapReduce, Crunch, Cassandra, Pangool, HDFS, Splunk, MADlib, Native Hadoop ETL Integration, HBase, Karmasphere, Lucene, SAS Big Data Visual Analytics, Spotfire
    • Legend: Hadoop components; Open source Hadoop platforms; 3rd party Hadoop supporting platforms
  9. What can Big Data do that traditional data warehousing and analytics cannot?
    Traditional DW → Big Data
    • Complete records from known transactional systems. → Data from many different internal & external sources with unknown quality and/or utility.
    • Data is structured, and data fields have known (and often complex) interrelationships. → Loosely structured data. Flat schemas with few complex interrelationships; connections between data elements have to be probabilistically inferred.
    • Multiple terabytes of data → Multiple petabytes of data
    • Mostly scale-up architecture → Scale-out architecture
    • Analytics run on a stable data model. → The analytic models are larger and require very large amounts of hardware resources to process them in a timely manner.
    • Low performance/cost ratio, as most of the software/hardware platforms are proprietary and license based → High performance/cost ratio, as most of the software/hardware platforms are commodity, free, open source
  10. What can Big Data do that traditional data warehousing and analytics cannot?
    Traditional DW → Big Data
    • Aggregate data (structured) → Raw data (structured and unstructured)
    • Aggregate / segment analytics → Individual-level analytics, micro-segmentation, individualized offers to customers
    • Mainstream analytics: structured analysis, OLAP cubes → Outlier analytics, pattern discovery, simulation and modeling, machine learning
    • Sample data is used for identifying patterns → Entire population of granular data can be leveraged
    • Reports & dashboards are done on a production basis → Real-time operational analytics and reporting. Intra-day decision making.
    • Traditional models are good for small amounts of data due to time constraints → Big models: computationally intensive analyses, simulations, models with many parameters
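
The scale-out architecture and whole-population analysis contrasted in slides 9 and 10 can be illustrated with a minimal sketch. It is not taken from the deck: the record layout, the function names (`partition`, `map_partition`, `merge`, `count_events`) and the sample click-stream data are all hypothetical, and the partitions are processed in a single Python process here rather than on separate machines. The idea it tries to show is the MapReduce-style pattern: split the full data set into partitions, compute a partial result per partition independently, then merge the partial results.

```python
from collections import Counter
from functools import reduce


def partition(records, n_parts):
    """Split records into n_parts chunks (round-robin); each chunk could live on its own node."""
    return [records[i::n_parts] for i in range(n_parts)]


def map_partition(chunk):
    """'Map' step: count events per customer within one partition only."""
    return Counter(customer for customer, _event in chunk)


def merge(left, right):
    """'Reduce' step: combine two partial per-customer counts."""
    return left + right


def count_events(records, n_parts=4):
    """Count events per customer over the entire data set, one partition at a time."""
    partials = [map_partition(chunk) for chunk in partition(records, n_parts)]
    return reduce(merge, partials, Counter())


# Hypothetical click-stream records: (customer_id, event)
records = [
    ("c1", "view"), ("c2", "view"), ("c1", "click"),
    ("c3", "view"), ("c1", "buy"), ("c2", "click"),
]

totals = count_events(records, n_parts=3)
print(totals["c1"], totals["c2"], totals["c3"])
```

Because each partition is counted independently and the merge step is just addition, the result does not depend on how many partitions are used. That property is what would let the same logic be spread across more commodity machines as the data grows, instead of moving to a larger single server.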