Hadoop at Yahoo Hack Day

cj_harris5
April 27, 2013



Transcript

  1. © Hortonworks Inc. 2013 Web giants proved the ROI in data products

    Applying data science to large amounts of data:
    • Amazon: 35% of product sales come from product recommendations
    • Netflix: 75% of streaming video results from recommendations
    • Prediction of click-through rates
  2. Data science is a natural next step after business intelligence

    • Business Intelligence: measure & count; simple analytics
      – Dashboards, Reports, Score-cards
    • Data Science: discovery & prediction; complex analytics; “data product”
      – Affinity Analysis, Outlier Detection, Clustering, Recommendation, Regression, Classification
    [Diagram labels: Value; Refine, Extract, Enrich; Discovery, Prediction]
  3. Key use-cases in Finance/Insurance

    • Customer risk profiling:
      – How likely is this customer to pay back his mortgage?
      – How likely is this customer to get sick?
    • Fraud detection:
      – Detect illegal credit card activity and alert bank/consumer
      – Detect illegal insurance claims
    • Internal fraud detection (compliance):
      – Is this employee accessing financial information they are not allowed to access?
  4. Key use-cases in Telco/Mobile

    • Customer life-time-value prediction
      – What is the LTV for customer X?
    • Marketing
      – Which new mobile phone should we offer to customer X so that they remain with us?
      – Location based advertising
    • Failure prediction
      – When will equipment X in cell tower Y fail?
    • Cell Tower Management
      – Predict load and bandwidth on cell towers to optimize network
  5. Key use-cases in Healthcare

    • Clinical Decision Support:
      – What is the ideal treatment for this patient?
    • Cost management:
      – What is the expected overall cost of treatment for this patient over the life of the disease?
    • Diagnostics:
      – Given these test results, what is the likelihood of cancer?
    • Epidemic management
      – Predict size and location of epidemic spread
  6. A Brief History of Apache Hadoop

    • Focus on INNOVATION, 2005: Yahoo! creates team under E14 to work on Hadoop
    • Focus on OPERATIONS, 2008: Yahoo team extends focus to operations to support multiple projects & growing clusters
    • STABILITY, 2011: Hortonworks created to focus on “Enterprise Hadoop”. Starts with 24 key Hadoop engineers from Yahoo
    [Timeline, 2004 to 2013, with milestones: Apache Project Established; Yahoo! begins to Operate at scale; Enterprise Hadoop; Hortonworks Data Platform]
  7. Leadership that Starts at the Core

    • Driving next generation Hadoop
      – YARN, MapReduce2, HDFS2, High Availability, Disaster Recovery
    • 420k+ lines authored since 2006
      – More than twice nearest contributor
    • Deeply integrating w/ecosystem
      – Enabling new deployment platforms (ex. Windows & Azure, Linux & VMware HA)
      – Creating deeply engineered solutions (ex. Teradata big data appliance)
    • All Apache, NO holdbacks
      – 100% of code contributed to Apache
  8. Operational Data Refinery

    Refine (of Refine, Explore, Enrich): collect data and apply a known algorithm to it in trusted operational process.
    1. Capture: capture all data
    2. Process: parse, cleanse, apply structure & transform
    3. Exchange: push to existing data warehouse for use with existing analytic tools
    [Diagram labels: DATA SOURCES: Traditional Sources (RDBMS, OLTP, OLAP), New Sources (web logs, email, sensor data, social media); DATA SYSTEMS: TRADITIONAL REPOS (RDBMS, EDW, MPP); APPLICATIONS: Business Analytics, Custom Applications, Enterprise Applications]
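The Capture, Process, Exchange steps above can be sketched in a few lines of plain Python. This is only an illustration of "parse, cleanse, apply structure", not how HDP implements it: the log layout, the field names and the `parse`/`cleanse`/`refine` helpers are all assumptions made up for this example.

```python
import re
from typing import Iterable, Iterator, Optional

# Hypothetical log layout (an assumption for this sketch):
# "<ip> <ISO timestamp> <HTTP method> <path> <status>"
LOG_PATTERN = re.compile(
    r"^(?P<ip>\S+) (?P<ts>\S+) (?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})$"
)

def parse(line: str) -> Optional[dict]:
    """Parse one raw line; return None when it does not fit the layout."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None

def cleanse(record: dict) -> dict:
    """Normalize the path and give the status code a numeric type."""
    return {**record, "path": record["path"].lower(), "status": int(record["status"])}

def refine(raw_lines: Iterable[str]) -> Iterator[dict]:
    """Process step: parse, drop unparseable lines, cleanse, emit structured rows."""
    for line in raw_lines:
        record = parse(line)
        if record is not None:
            yield cleanse(record)

# Capture step: keep every raw line, including ones that turn out to be bad.
raw = [
    "10.0.0.1 2013-04-27T10:00:00 GET /Home 200",
    "corrupted line",
    "10.0.0.2 2013-04-27T10:00:05 POST /Cart 500",
]

# Exchange step: these structured rows are what would be pushed to a warehouse.
rows = list(refine(raw))
```

In a real refinery the same three stages would run as distributed jobs over files in the cluster; the sketch only shows the shape of the transformation.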
  9. Key Capability in Hadoop: Late binding

    • With traditional ETL, structure must be agreed upon far in advance and is difficult to change.
    • With Hadoop, capture all data, then structure the data as business needs evolve.
    [Diagram, traditional ETL: sources (WEB LOGS, CLICK STREAMS; MACHINE GENERATED; OLTP), ETL Server, Store Transformed Data, Data Mart / EDW, Client Apps]
    [Diagram, Hadoop: sources (WEB LOGS, CLICK STREAMS; MACHINE GENERATED; OLTP), Hortonworks HDP, Dynamically Apply Transformations, Data Mart / EDW, Client Apps]
    [Other labels: HORTONWORKS DATA PLATFORM; HADOOP CORE; DATA SERVICES; OPERATIONAL SERVICES]
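Late binding is often described as schema-on-read: raw records are stored untouched and a schema is chosen only when a question is asked. The sketch below illustrates that idea in plain Python under stated assumptions; the JSON event layout, the field names and the `read_with_schema` helper are hypothetical and not part of Hadoop or HDP.

```python
import json
from typing import Callable, Dict, Iterable, Iterator

# Raw events are captured once, exactly as they arrived (hypothetical layout).
RAW_EVENTS = [
    '{"user": "u1", "page": "/home", "ms": 120, "agent": "mobile"}',
    '{"user": "u2", "page": "/cart", "ms": 340, "agent": "desktop"}',
    '{"user": "u1", "page": "/cart", "ms": 95, "agent": "mobile"}',
]

Schema = Dict[str, Callable]  # field name -> type converter

def read_with_schema(raw_events: Iterable[str], schema: Schema) -> Iterator[dict]:
    """Bind a schema at read time: keep only the requested fields, typed."""
    for line in raw_events:
        event = json.loads(line)
        yield {field: convert(event[field]) for field, convert in schema.items()}

# Two business questions, two schemas, one unchanged copy of the raw data.
page_view_schema: Schema = {"user": str, "page": str}
latency_schema: Schema = {"agent": str, "ms": int}

page_views = list(read_with_schema(RAW_EVENTS, page_view_schema))
latencies = list(read_with_schema(RAW_EVENTS, latency_schema))
```

With an ETL-first design, adding the latency question later would mean changing the load job and reloading; here it only needs a new schema over data that was already captured.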
  10. Big Data Exploration & Visualization

    Explore (of Refine, Explore, Enrich): collect data and perform iterative investigation for value.
    1. Capture: capture all data
    2. Process: parse, cleanse, apply structure & transform
    3. Exchange: explore and visualize with analytics tools supporting Hadoop
    [Diagram labels: DATA SOURCES: Traditional Sources (RDBMS, OLTP, OLAP), New Sources (web logs, email, sensor data, social media); DATA SYSTEMS: TRADITIONAL REPOS (RDBMS, EDW, MPP); APPLICATIONS: Business Analytics, Custom Applications, Enterprise Applications]
  11. Visualization Tooling

    • Robust visualization and business tooling
    • Ensures scalability when working with large datasets
    [Diagram labels: Native Excel support; Web browser support; Mobile support]
  12. Application Enrichment

    Enrich (of Refine, Explore, Enrich): collect data, analyze and present salient results for online apps.
    1. Capture: capture all data
    2. Process: parse, cleanse, apply structure & transform
    3. Exchange: incorporate data directly into applications
    [Diagram labels: DATA SOURCES: Traditional Sources (RDBMS, OLTP, OLAP), New Sources (web logs, email, sensor data, social media); DATA SYSTEMS: TRADITIONAL REPOS (RDBMS, EDW, MPP), NOSQL; APPLICATIONS: Custom Applications, Enterprise Applications]
  13. Web giants proved the ROI in data products

    Applying data science to large amounts of data:
    • Amazon: 35% of product sales come from product recommendations
    • Netflix: 75% of streaming video results from recommendations
    • Prediction of click-through rates
  14. Interoperating With Your Tools

    [Diagram labels: APPLICATIONS (Microsoft Applications); DATA SYSTEMS: TRADITIONAL REPOS (Viewpoint), DEV & DATA TOOLS, OPERATIONAL TOOLS; DATA SOURCES: MOBILE DATA; OLTP, POS SYSTEMS; Traditional Sources (RDBMS, OLTP, OLAP); New Sources (web logs, email, sensor data, social media)]
  15. Enhancing the Core of Apache Hadoop

    Deliver high-scale storage & processing with enterprise-ready platform services.
    [Diagram labels: HADOOP CORE: HDFS, MAP REDUCE, YARN (in 2.0); PLATFORM SERVICES: Enterprise Readiness]
    Unique Focus Areas:
    • Bigger, faster, more flexible: continued focus on speed & scale and enabling near-real-time apps
    • Tested & certified at scale: run ~1300 system tests on large Yahoo clusters for every release
    • Enterprise-ready services: high availability, disaster recovery, snapshots, security, …
  16. Data Services for Full Data Lifecycle

    Provide data services to store, process & access data in many ways.
    [Diagram labels: DATA SERVICES: WEBHDFS, HCATALOG, HIVE, PIG, HBASE, SQOOP, FLUME; HADOOP CORE: Distributed Storage & Processing; PLATFORM SERVICES: Enterprise Readiness]
    Unique Focus Areas:
    • Apache HCatalog: metadata services for consistent table access to Hadoop data
    • Apache Hive: explore & process Hadoop data via SQL & ODBC-compliant BI tools
    • Apache HBase: NoSQL database for Hadoop
    • WebHDFS: access Hadoop files via scalable REST API
    • Talend Open Studio for Big Data: graphical data integration tools
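To make the WebHDFS bullet concrete: WebHDFS exposes file operations as HTTP calls of the form `http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>`. The sketch below only builds such URLs and sends nothing over the network. The host name, port and file paths are placeholders chosen for illustration, and `webhdfs_url` is a hypothetical helper, not part of any Hadoop client library.

```python
from urllib.parse import quote, urlencode

def webhdfs_url(host: str, port: int, path: str, op: str, **params) -> str:
    """Build a WebHDFS REST URL for an absolute HDFS path and an operation."""
    query = urlencode({"op": op, **params})
    # quote() keeps "/" and percent-encodes characters such as spaces.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"

# Placeholder NameNode address; substitute the values for a real cluster.
HOST, PORT = "namenode.example.com", 50070

# List a directory (LISTSTATUS) and read the first KiB of a file (OPEN).
list_url = webhdfs_url(HOST, PORT, "/data/web logs", "LISTSTATUS")
open_url = webhdfs_url(HOST, PORT, "/data/raw/part-00000", "OPEN", offset=0, length=1024)
```

Any HTTP client can then issue a GET against these URLs, which is what lets tools without Hadoop client libraries read cluster data.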
  17. Operational Services for Ease of Use

    Include complete operational services for productive operations & management.
    [Diagram labels: OPERATIONAL SERVICES: OOZIE, AMBARI; DATA SERVICES: Store, Process and Access Data; HADOOP CORE: Distributed Storage & Processing; PLATFORM SERVICES: Enterprise Readiness]
    Unique Focus Area:
    • Apache Ambari: provision, manage & monitor a cluster; complete REST APIs to integrate with existing operational tools; job & task visualizer to diagnose issues
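As a rough illustration of integrating an operational tool through Ambari's REST API, the sketch below prepares (but does not send) an authenticated GET request. It assumes the API is served under `/api/v1` with HTTP basic authentication; the server address, the cluster name `demo`, the credentials and the `ambari_request` helper are placeholders invented for this example.

```python
import base64
from urllib.request import Request

def ambari_request(base_url: str, resource: str, user: str, password: str) -> Request:
    """Prepare a GET request for an Ambari REST resource using basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return Request(
        f"{base_url}/api/v1/{resource}",
        headers={"Authorization": f"Basic {token}"},
    )

# Placeholder server, cluster name and credentials (assumptions for this sketch).
req = ambari_request(
    "http://ambari.example.com:8080", "clusters/demo/services", "admin", "admin"
)
# Passing req to urllib.request.urlopen() would perform the call against a live
# server; the response body is JSON that a monitoring tool could parse.
```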
  18. Hortonworks Process for Enterprise Hadoop

    • Upstream Community Projects: Apache Hadoop, Apache Hive, Apache Pig, Apache HBase, Apache HCatalog, Apache Ambari, Other Apache Projects
    • Downstream Enterprise Product: Hortonworks Data Platform
    • Virtuous cycle when development & fixed issues are done upstream & stable project releases flow downstream
    • No Lock-in: integrated, tested & certified distribution lowers risk by ensuring close alignment with Apache projects
    [Diagram labels: Design & Develop, Test & Patch, Release; Design & Develop, Integrate & Test, Package & Certify, Distribute; Stable Project Releases; Fixed Issues]
  19. Deployable Across a Range of Options

    Only Hortonworks allows you to deploy seamlessly across any deployment option:
    • Linux & Windows
    • Azure, Rackspace & other clouds
    • Virtual platforms
    • Big data appliances
    [Diagram labels: HORTONWORKS DATA PLATFORM (HDP) on OS, Cloud, VM, Appliance; HADOOP CORE: Distributed Storage & Processing; PLATFORM SERVICES: Enterprise Readiness; DATA SERVICES: Store, Process and Access Data; OPERATIONAL SERVICES: Manage & Operate at Scale]
  20. Refine-Explore-Enrich Demo

    Hands-on tutorials integrated into Sandbox HDP environment for evaluation.
    The Sandbox lets you experience Apache Hadoop from the convenience of your own laptop: no data center, no cloud and no internet connection needed!
    The Hortonworks Sandbox is:
    • A free download: http://hortonworks.com/products/hortonworks-sandbox/
    • A complete, self-contained virtual machine with Apache Hadoop pre-configured
    • A personal, portable and standalone Hadoop environment
    • A set of hands-on, step-by-step tutorials that allow you to learn and explore Hadoop
  21. Hortonworks & Microsoft

    HDInsight
    • Big Data Insight for Millions, Massive expansion of Hadoop
    • Simplifies Hadoop, Enterprise Ready
    • Hortonworks Data Platform used for Hadoop on Windows Server and Azure
    • An engineered, open source solution
      – Hadoop engineered for Windows
      – Hadoop powered Microsoft business tools
      – Ops integration with MS System Center
      – Bidirectional connectors for SQL Server
      – Support for Hyper-V, deploy Hadoop on VMs
      – Opens the .NET developer community to Hadoop
      – JavaScript for Hadoop
      – Deploy on Azure in 10 minutes
    • Excel
    • PowerPivot (BI)
    • PowerView (visualization)
    • SharePoint