
Apache Hadoop's Role in Your Big Data Architecture

cj_harris5
October 02, 2013

Transcript

  1. Apache Hadoop's Role in Your Big Data Architecture

    Chris Harris, EMEA, Hortonworks
    [email protected]
    Twitter: cj_harris5
    © Hortonworks Inc. 2012
  2. Agenda

    •  The Growth of Enterprise Data
    •  Hadoop Market Drivers
    •  Hortonworks – an Overview
    •  The Future of Hadoop and Big Data
  3. Data Explosion: The Growth of Data in the Enterprise

    By 2015, organizations that build a modern information management system will outperform their peers financially by 20 percent. – Gartner, Mark Beyer, “Information Management in the 21st Century”

    •  1 Zettabyte (ZB) = 1 billion TB
    •  15x growth rate of machine-generated data by 2020 (Source: IDC)
  4. Next Generation Data Architecture Drivers

    Business Drivers
    •  From reactive analytics to proactive customer interaction
    •  Find insights for competitive advantage & optimal returns

    Financial Drivers
    •  Cost of data systems, as % of IT spend, continues to grow
    •  Cost advantages of commodity hardware & open source

    Technical Drivers
    •  Data continues to grow exponentially
    •  Data is increasingly everywhere and in many formats
  5. Market Transitioning into Early Majority

    [Figure: technology adoption curve; x-axis: time, y-axis: relative % of customers]
    •  Before the CHASM, customers want technology & performance: innovators (technology enthusiasts) and early adopters (visionaries)
    •  After the CHASM, customers want solutions & convenience: early majority (pragmatists), late majority (conservatives) and laggards (skeptics)

    Source: Geoffrey Moore, Crossing the Chasm
  6. 6 Key Hadoop Data Types

    1.  Sentiment: Understand how your customers feel about your brand and products, right now
    2.  Clickstream: Capture and analyze website visitors’ data trails and optimize your website
    3.  Sensor/Machine: Discover patterns in data streaming automatically from remote sensors and machines
    4.  Geographic: Analyze location-based data to manage operations where they occur
    5.  Server Logs: Research logs to diagnose process failures and prevent security breaches
    6.  Text: Understand patterns in text across millions of web pages, emails, and documents
  7. Apache Hadoop Enterprise Use Cases (1 of 2)

    | Vertical | Use Case | Data Type |
    | --- | --- | --- |
    | Financial Services | New Account Risk Screens | Text, Server Logs |
    | Financial Services | Fraud Prevention | Server Logs |
    | Financial Services | Trading Risk | Server Logs |
    | Financial Services | Maximize Deposit Spread | Text, Server Logs |
    | Financial Services | Insurance Underwriting | Geographic, Sensor, Text |
    | Financial Services | Accelerate Loan Processing | Text |
    | Telecom | Call Detail Records (CDRs) | Machine, Geographic |
    | Telecom | Infrastructure Investment | Machine, Server Logs |
    | Telecom | Next Product to Buy (NPTB) | Clickstream |
    | Telecom | Real-time Bandwidth Allocation | Server Logs, Text, Sentiment |
    | Telecom | New Product Development | Machine, Geographic |
    | Retail | 360° View of the Customer | Clickstream, Text |
    | Retail | Analyze Brand Sentiment | Sentiment |
    | Retail | Localized, Personalized Promotions | Geographic |
    | Retail | Website Optimization | Clickstream |
    | Retail | Optimal Store Layout | Sensor |
  8. Apache Hadoop Enterprise Use Cases (2 of 2)

    | Vertical | Use Case | Data Type |
    | --- | --- | --- |
    | Manufacturing | Supply Chain and Logistics | Sensor |
    | Manufacturing | Assembly Line Quality Assurance | Sensor |
    | Manufacturing | Proactive Maintenance | Machine |
    | Manufacturing | Crowdsourced Quality Assurance | Sentiment |
    | Healthcare | Genomic Sequencing | Structured |
    | Healthcare | Real-time Data for Blood Sampling | Sensor, Server Logs |
    | Healthcare | Rapid, Mobile Detection of Autism | Unstructured |
    | Healthcare | Reducing Cost of Cancer Treatment | Sensor, Unstructured |
    | Healthcare | Perpetual Storage of Research Data | Sensor, Unstructured |
  9. New Account Risk Screens

    Business Problem
    •  Banks take thousands of new account applications daily
    •  Text-based 3rd party risk reports are displayed to the banker
    •  Bankers can (and do) override risk recommendations to open an account
    •  Account charge-offs and fraud cost banks millions

    Solution
    •  HDP helps senior managers control new account risk
    •  Match banker decisions with the multiple sources of information they use to make those decisions
    •  Correct risky behavior by sanctioning individuals, updating policies, improving training or identifying fraud

    Financial Services. Data: Text, Server Logs
  10. Fraud Prevention

    Business Problem
    •  Financial institutions are always at risk of fraud
    •  Fraudsters test bank systems for vulnerabilities
    •  This testing leaves subtle patterns that often go undetected by bank employees or law enforcement
    •  Fraud losses cost banks millions

    Solution
    •  HDP reduces the cost to detect fraudulent activity
    •  HDP stores more types of data for longer
    •  Analysis of data in the “data lake” exposes fraudulent patterns that would otherwise have gone undetected

    Financial Services. Data: Server Logs
  11. Call Detail Records (CDRs)

    Business Problem
    •  Telcos perform forensics on dropped calls and sound quality
    •  Call detail records flow in at a rate of millions per second
    •  High volume makes pattern recognition and root cause analysis difficult, yet both need to happen in real time
    •  Delay causes attrition and harms servicing margins

    Solution
    •  HDP can ingest millions of CDRs per second
    •  HDP facilitates data retention and root cause analysis
    •  Continuously improve call quality, customer satisfaction and servicing margins

    Telecom. Data: Machine, Geo
  12. Infrastructure Investment

    Business Problem
    •  Telecom marketing and capacity planning are coordinated
    •  Consumption of bandwidth and services can be out of sync with plans for new towers and transmission lines
    •  Mismatch between infrastructure investments and the actual return on investment puts revenue at risk

    Solution
    •  HDP helps telcos understand service consumption in a particular state, county or neighborhood
    •  Analyze Call Detail Records (CDRs) and network loads more intelligently, over longer periods of time
    •  Plan infrastructure with more precision and less variability

    Telecom. Data: Machine, Logs
  13. 360° View of the Customer

    Business Problem
    •  Retailers interact with customers across multiple channels
    •  Customer interaction and purchase data is often siloed
    •  Few retailers can correlate customer purchases with marketing campaigns and online browsing behavior
    •  Merging data in relational databases is expensive

    Solution
    •  HDP gives retailers a 360° view of customer behavior
    •  Store data longer & track phases of the customer lifecycle
    •  Gain competitive advantage: increase sales, reduce supply chain expenses and retain the best customers

    Retail. Data: Clickstream, Text
  14. Analyze Brand Sentiment

    Business Problem
    •  Enterprises lack a reliable way to track their brand health
    •  It is difficult to analyze how advertising, competitor moves, product launches or news stories affect the brand
    •  Internal brand studies can be slow, expensive and flawed

    Solution
    •  HDP allows quick, unbiased brand sentiment snapshots
    •  Analyze sentiment from Twitter, Facebook, LinkedIn or industry-specific social media streams
    •  Retailers better understand customer perceptions, to align their communications, products and promotions with those perceptions and expectations

    Retail. Data: Sentiment
  15. Supply Chain and Logistics

    Business Problem
    •  Manufacturers need just-in-time availability of components
    •  Stock-outs cause harmful production delays
    •  Sensors and RFID tags reduce the cost of capturing more supply chain data, which needs storage and processing

    Solution
    •  HDP stores unstructured, streaming, “dirty” sensor data
    •  Manufacturers get lead time to make alternative arrangements for supply chain disruptions
    •  Prevent stock-outs, reduce supply chain costs and improve margins for the finished product

    Manufacturing. Data: Sensor
  16. Assembly Line Quality Assurance

    Business Problem
    •  High-tech manufacturing uses sensors to capture data at critical steps in the manufacturing process
    •  Sensor data helps diagnose errors with returned products
    •  Much data is discarded because of high storage costs
    •  Lean margins mean small budgets for data analysis

    Solution
    •  HDP stores unstructured, streaming, “dirty” sensor data
    •  Manufacturers can proactively analyze more data, over a longer time, to detect subtle issues that would otherwise go undetected
    •  Sensor data managed with HDP can help a manufacturer reduce warranty costs and earn a reputation for quality

    Manufacturing. Data: Sensor
  17. Growth Pressures Existing Data Architectures

    [Diagram: existing data architecture]
    •  Applications: Packaged Analytic App, Custom Analytic App
    •  Data systems: traditional repos (RDBMS, EDW, MPP)
    •  Data sources: OLTP, POS systems; traditional sources (RDBMS, OLTP, OLAP)
    •  Operational tools: manage & monitor
    •  Dev & data tools: build & test

    Data growth: 8% annually
  18. An Emerging Data Architecture

    [Diagram: emerging data architecture]
    •  Applications: Packaged Analytic App, Custom Analytic App
    •  Data systems: traditional repos (RDBMS, EDW, MPP) alongside an Enterprise Hadoop Platform
    •  Data sources: OLTP, POS systems; traditional sources (RDBMS, OLTP, OLAP); new sources (web logs, email, sensors, social media)
    •  Operational tools: manage & monitor
    •  Dev & data tools: build & test

    Data growth: 85% annually
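
To put the two growth figures quoted on slides 17 and 18 (8% and 85% annually) side by side, here is a minimal sketch that assumes simple compound annual growth. The function names are hypothetical and the five-year horizon is an arbitrary choice for illustration; the slides do not specify how the figures were derived.

```python
import math


def doubling_time_years(annual_growth_rate: float) -> float:
    """Years needed for a quantity to double under compound annual growth."""
    return math.log(2) / math.log(1 + annual_growth_rate)


def projected_size(initial: float, annual_growth_rate: float, years: int) -> float:
    """Size after `years` of compound growth, starting from `initial`."""
    return initial * (1 + annual_growth_rate) ** years


# Rates taken from the slides; the 5-year horizon is an assumption.
for rate in (0.08, 0.85):
    print(
        f"{rate:.0%} annual growth: doubles in about "
        f"{doubling_time_years(rate):.1f} years, "
        f"{projected_size(1.0, rate, 5):.1f}x the starting volume after 5 years"
    )
```

Under these assumptions, data growing at 8% takes roughly nine years to double, while data growing at 85% doubles in a little over one year.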
  19. Agenda

    •  The Growth of Enterprise Data
    •  Hadoop Market Drivers
    •  An Overview
    •  The Future of Hadoop and Big Data
  20. A Brief History of Apache Hadoop

    [Figure: timeline from 2004 to 2013]
    •  2005: Yahoo! creates team under E14 to work on Hadoop
    •  2011: Hortonworks created to focus on “Enterprise Hadoop”
    •  Other milestones shown on the timeline (dates not recoverable from the transcript): Apache Project Established; Yahoo! begins to operate at scale; Hortonworks Data Platform; Enterprise Hadoop
  21. Leadership Starts at the Core

    •  Driving next generation Hadoop
       –  YARN, MapReduce2, HDFS2, High Availability, Disaster Recovery
    •  420k+ lines authored since 2006
       –  More than twice nearest contributor
    •  Deeply integrating w/ecosystem
       –  Enabling new deployment platforms (ex. Windows & Azure, Linux & VMware HA)
       –  Creating deeply engineered solutions (ex. Teradata big data appliance)
    •  All Apache, NO holdbacks
       –  100% of code contributed to Apache
  22. Agenda

    •  The Growth of Enterprise Data
    •  Hadoop Market Drivers
    •  Hortonworks – an Overview
    •  The Future of Hadoop and Big Data
  23. The 1st Generation of Hadoop: Batch

    HADOOP 1.0: Built for Web-Scale Batch Apps

    [Diagram: separate single-app stacks labeled BATCH, INTERACTIVE and ONLINE on HDFS]
    •  All other usage patterns must leverage that same infrastructure
    •  Forces the creation of silos for managing mixed workloads
  24. The Enterprise Requirement: Beyond Batch

    To become an enterprise viable data platform, customers have told us they want to store ALL DATA in one place and interact with it in MULTIPLE WAYS – simultaneously & with predictable levels of service.

    [Diagram: workload types on top of HDFS (Redundant, Reliable Storage)]
    BATCH, INTERACTIVE, STREAMING, GRAPH, IN-MEMORY, HPC MPI, ONLINE, SEARCH
  25. YARN: Taking Hadoop Beyond Batch

    •  Created to manage resource needs across all uses
    •  Ensures predictable performance & QoS for all apps
    •  Enables apps to run “IN” Hadoop rather than “ON”
       –  Key to leveraging all other common services of the Hadoop platform: security, data lifecycle management, etc.

    [Diagram: Applications Run Natively IN Hadoop]
    •  BATCH (MapReduce)
    •  INTERACTIVE (Tez)
    •  STREAMING (Storm, S4,…)
    •  GRAPH (Giraph)
    •  IN-MEMORY (Spark)
    •  HPC MPI (OpenMPI)
    •  ONLINE (HBase)
    •  OTHER (Search) (Weave…)
    •  YARN (Cluster Resource Management)
    •  HDFS2 (Redundant, Reliable Storage)
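
The idea of one resource manager giving each workload a predictable share of a single cluster can be illustrated with a toy calculation. The sketch below is not YARN's scheduler or API; the function name, the workload names, and the percentage shares are all hypothetical, chosen only to show how guaranteed shares of a fixed pool might be computed.

```python
from __future__ import annotations


def guaranteed_containers(total_containers: int, shares_pct: dict[str, int]) -> dict[str, int]:
    """Split a fixed pool of containers by integer percentage shares.

    Toy model only (an assumption for illustration, not YARN's algorithm):
    each workload gets the floor of its share, and any containers left over
    are handed out one at a time, largest share first.
    """
    if sum(shares_pct.values()) != 100:
        raise ValueError("shares must sum to 100 percent")

    allocation = {
        name: total_containers * pct // 100 for name, pct in shares_pct.items()
    }
    leftover = total_containers - sum(allocation.values())
    largest_first = sorted(shares_pct, key=lambda name: shares_pct[name], reverse=True)
    for name in largest_first[:leftover]:
        allocation[name] += 1
    return allocation


# Hypothetical mixed workload sharing one cluster of 200 containers.
shares = {"batch": 50, "interactive": 30, "streaming": 20}
print(guaranteed_containers(200, shares))
```

Because every workload draws from the same pool, capacity one workload leaves idle could in principle be lent to another, which is the contrast with the per-application silos described for Hadoop 1.0.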
  26. The Future of Hadoop and Big Data

    •  The next generation data architecture is evolving rapidly
       –  Store ALL data in a Hadoop data reservoir
       –  Push subsets of data to a final platform for processing
    •  Hadoop 2.0 takes Hadoop beyond “Batch”
       –  The 2.0 YARN-based architecture enables mixed-use workloads with enterprise resource management
    •  Enabling a new generation of applications at scale
       –  Based on new data types (sensor, sentiment, clickstream, etc.) or keeping existing types for much longer
  27. Hortonworks Sandbox

    •  Hands-on tutorials integrated into Sandbox
    •  HDP environment for evaluation