Phoenix Data Conference 2014 - Nitin Motgi

The evolution of Hadoop & Rise of Data Applications

teamclairvoyant

October 25, 2014


Transcript

  1. Who am I?
     • Founder/CTO, Cask Data, Inc. (http://cask.co)
     • Building Big Data applications & platforms since 2007
     • Twitter handle: @nmotgi
     • Previously: Architect & Engineering Lead, C.O.R.E @ Yahoo!; earlier at Altera and FedEx
  2. What is Hadoop? The open source big data platform.
     YARN – Computation Layer
     • Many programming models: MapReduce, SQL, Streaming, ML, …
     • Multi-user, with queues, priorities, etc.
     HDFS – Hadoop Distributed File System
     • Data replicated on 3 computers
     • Automatically replaces lost data / computers
     • Very high bandwidth, not IOPS-optimized
     • Scalable: efficient storage and processing of petabytes of data; scales linearly by adding commodity hardware
     • Reliable: ability to self-heal on hardware failures
     • Flexible: stores all types of data in many formats
     • Economical: open source (no vendor lock-in) on commodity hardware
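The MapReduce programming model named above can be sketched in a few lines. This is a minimal, single-process simulation of the three phases (map, shuffle, reduce) for word counting; it is illustrative only and uses none of Hadoop's actual Java API.

```python
from collections import defaultdict

def map_phase(records):
    # Mapper: emit a (word, 1) pair for every word in every input line.
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group all values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts emitted for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop stores data", "hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

On a real cluster the mappers and reducers run in parallel across machines and the shuffle moves data over the network; the logic per phase, however, is exactly this simple, which is what made the model attractive.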
  3. Early History
     • 2004: Google publishes the Google File System (GFS) and MapReduce papers
     • 2005: Yahoo! staffs Juggernaut, an open source DFS & MapReduce effort
     • Differentiate via open source contributions
     • Avoid building proprietary systems that will be obsolesced
     • Leverage the wider community for building one infrastructure
     • Doug Cutting starts Nutch DFS & MapReduce and joins Yahoo!
     • 2006: Hadoop is born! Science clusters are launched as early POCs; Yahoo! commits to Hadoop and to scaling it, and starts a Hadoop team
  4. Hadoop Use-cases Now!
     Horizontal use-cases:
     • Web/app logs, social data analytics, marketing optimization
     • Sensor data, IoT, machine log analytics
     • Business intelligence, machine learning
     • Fraud detection, recommendation systems, ad targeting, customer intimacy
     • Data Lake / Data Hub
     Industries:
     • Telecom, finance, banking, government
     • Retail, call centers
     • Media, advertising, marketing, gaming, social
     • Oil drilling, utilities
  5. Yahoo! C.O.R.E
     A science Hadoop cluster turns user behavior into categorization models (rebuilt weekly); a production Hadoop cluster applies those models to identify user interests and publishes serving maps (user → interests) to the serving systems every five minutes, driving engaged users.
     » Machine learning to build ever better categorization models
     » Identify user interests using categorization models
  6. Yahoo! C.O.R.E Architecture
     Stack: HDFS, HBase, MapReduce, Flume, with ingestion, experimentation, and modeling frameworks on top.
     • More than 2 billion user profiles, each profile ~120K
     • Models updated every ~5 minutes
     • 8M events / 5 minutes
     • 10–15 experiments / day
     • New model deployment in hours
     • NLP, OLR, SVD, PCA, multi-arm bandit, business rules
     • Lambda Architecture – real-time & batch
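The Lambda Architecture mentioned on this slide merges a batch view, computed over all historical events, with a real-time view covering only the events since the last batch run. A minimal sketch of that query-time merge, with illustrative names (not Yahoo!'s or CDAP's actual code):

```python
# Batch layer: complete but stale, recomputed periodically over all data.
def batch_view(all_events):
    view = {}
    for user, interest in all_events:
        view.setdefault(user, set()).add(interest)
    return view

# Speed layer: fresh but partial, built incrementally from recent events.
def realtime_view(recent_events):
    view = {}
    for user, interest in recent_events:
        view.setdefault(user, set()).add(interest)
    return view

# Serving layer: answer queries from the union of both views.
def query(user, batch, realtime):
    return batch.get(user, set()) | realtime.get(user, set())

historical = [("u1", "sports"), ("u2", "finance")]
recent = [("u1", "travel")]
print(query("u1", batch_view(historical), realtime_view(recent)))
```

The point of the pattern is that neither layer alone is sufficient: the batch view misses the last few minutes of events, and the real-time view misses everything before them; merging at query time gives both freshness and completeness.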
  7. Traditional view of Data and Apps
     • Live apps and data: CRM, web site traffic, ERP
     • Offline data and analytics: ETL into a data warehouse, then OLAP analysis, data mining, and reporting
     • A feedback loop carries insights from analytics back to the live systems
  8. Prevailing view of where Hadoop fits
     • An inexpensive way to store data to be usefully accessed somewhere else: Hadoop sits beside the traditional ETL path from CRM, web site traffic, and ERP into the data warehouse, feeding the same reporting, data mining, and OLAP analysis.
     • A new way to gain insight into unstructured data via data science and analytics: Hadoop adds its own virtualization, analytics, and feedback loop alongside the warehouse.
  9. An emerging approach: data applications, combining analytics and action
     • Real-time data ingestion
     • Batch and real-time (hybrid) processing
     • Multiple data access methods
     • Built on a common data platform
     Data applications unlock business value: new data apps run on a data application platform over the Hadoop data lake, alongside the EDW, existing transactions, existing analytics, and existing Hadoop analytics.
  10. Is Hadoop an OS?
      Traditional stack: hardware (processor, storage & memory) → OS (kernel, runtimes, scheduler) → application framework → apps.
      Hadoop stack: commodity hardware (compute, storage) → Hadoop (a distributed OS) → ? → apps.
  11. Hadoop Application Model
      • Storage: HDFS, HBase, Accumulo
      • Batch: YARN (MR, Pig, Hive, Spark)
      • Realtime: YARN (Storm, Spark)
      • Collection: message bus (Kafka, Flume, Scribe) from sources
      • Interactive: serving systems (Cassandra, MongoDB, …)
  12. But, Hadoop is hard
      • Explosion of technologies solving the same problem, e.g. Spark, Impala, Hive, Tez, HAWQ
      • Hadoop talent is scarce and expensive
      • Competing vendors with competing solutions in the market; each vendor pushes a slightly different collection of software
      • Open source plus proprietary modifications
      • Vendors providing vertical solutions to a problem
      • Ecosystem and tools not focused on developers
      Source: http://readwrite.com/2014/08/13/hadoop-slow-security-issues-still-popular
  13. But, Hadoop is hard
      • Lack of Hadoop skills / big data developers
      • Data ingestion / ETL blocks progress
      • Security and governance requirements
      • Integration code, or building core system services
      • Time needed for testing, production, and deployment
      • Time to market / time to value
  14. What do I want to do? Create a cluster:
      > create cluster <name> <template> <size> <optional settings>
      > start cluster <name>
      > stop cluster <name>
      > add services <service-list> to cluster <name>
  15. What else I want…
      Manage data:
      > create dataset <name> with meta a=b, c=d
      > update meta a=e, c=e, x=y
      > set dataset <name> ttl <time>
      > truncate dataset <name>
      > delete dataset <name>
      Push data:
      > create dataset <name> with meta a=b, c=d
      > ingest event '<event>' with meta x=y, u=v into dataset <name>
      > ingest events from file <path-to-file> with meta x=y, u=v into dataset <name>
      > ingest file /data/file.csv of type CSV with meta a=b, c=d into dataset <name>
      > ingest file /data/file.txt of type TXT with meta a=b, c=d into dataset <name>
      > ingest file /data/file.avro of type AVRO with meta a=b, c=d into dataset <name>
      Pull data:
      > create dataset <name> with meta a=b, c=d
      > attach kafka to <name> with properties a=b, c=d as <pipe>
      > start pipe <pipe>
      > stop pipe <pipe>
      > detach <pipe>
      > attach twitter to <name> with properties a=b, c=d as <pipe>
      > attach jms to <name> with properties a=b, c=d as <pipe>
      Explore data:
      > create dataset <name> with meta a=b, c=d
      > ingest events from file <path-to-file> into dataset <name>
      > create view on dataset <name> as <view-name> with schema '<json schema>'
      > run query "SELECT f1, f2, f3 FROM view-name"
      > schedule query "INSERT INTO <dataset> SELECT f1, f2, f3 FROM view-name" to run every 1 hour
  16. I want more…
      Metadata of data:
      > show dataset <name> lineage
      > show dataset <name> metadata
      > show dataset <name> permissions
      > show dataset <name> audit log
      > show programs using dataset <name>
      Transform & store:
      > create dataset <name> with properties a=b, c=d
      > ingest directory /data/*.avro of type AVRO into dataset <name>
      > create view on dataset <name> as <view-name> with schema '<json-schema>'
      > run query "SELECT f1, f2, f3 FROM <name>"
      > run query "SELECT f1, f2, f3 FROM <name>" sink to HBASE with <schema> and properties tablename=<name>, a=b
      > run query "SELECT f1, f2, f3 FROM <name>" sink to HIVE with <schema> and properties tablename=<name>, c=d
      > run transformation on dataset <name> using <file> and sink to HBASE with <schema>
      Process & egress:
      > deploy program spark-pca.jar as spark-pca
      > run spark-pca using dataset <name> start -1d to now
      > list datasets spark-pca
      > run rpc spark-pca-dataset.<method>(x,y,z)
      > run query "SELECT * FROM spark-pca-dataset"
      > show dataset <name> lineage
      > show dataset <name> audit log
      > show dataset <name> summary
      > show metrics spark-pca
      > show logs spark-pca
  17. 100% Open Source Projects
      • CASK DATA APP PLATFORM (cdap.io): virtualization for Hadoop data and apps
      • coopr.io: clusters with a click
      • tigon.io: real-time streaming for the real world; thread abstraction on YARN
      • tephra.io: transactions for Apache HBase
  18. Cask Data Application Platform
      • Application innovation: enable a new class of applications to drive greater business value, including those requiring real-time and batch processing
      • Simplified development: simplify big data app development – more apps, faster, with less dependence on Hadoop expertise
      • Production-ready applications: avoid compromising operational transparency and control – security, logging, metrics, lineage, and more
      Two pillars: data virtualization (logical representations of data) and app virtualization (standardized containers for apps).
  19. Data Virtualization
      Logical representations of physical data as CDAP Datasets within the CDAP Runtime Environment.
      • Streams for data ingestion: support Kafka, Flume, REST, and custom-implemented protocols; time-stamped, ordered, and horizontally scalable
      • Reusable libraries for common big data access patterns: secondary indexes, time series, key-value, objects, geospatial, OLAP cubes, and more; libraries expose each pattern as RPCs, batch scans, and SQL tables
      • Data available to multiple applications and different paradigms: unified batch and real-time processing, with the same data used concurrently by MapReduce, Hive, Spark, Flows, and more; expose data as REST services to quickly enable data as a service
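The "reusable access pattern" idea above can be made concrete with a hypothetical sketch: one logical key-value dataset exposed both as point lookups (RPC style, for serving) and as a full scan (batch style, for a MapReduce job). This illustrates the concept only; it is not CDAP's actual Dataset API.

```python
class KeyValueDataset:
    """Illustrative key-value dataset usable from serving and batch paths."""

    def __init__(self):
        self._store = {}

    def write(self, key, value):
        self._store[key] = value

    def read(self, key):
        # RPC-style point lookup, as a serving application would use it.
        return self._store.get(key)

    def scan(self):
        # Batch-style ordered full scan, as a batch job would consume it.
        return sorted(self._store.items())

ds = KeyValueDataset()
ds.write("user:1", {"interest": "sports"})
ds.write("user:2", {"interest": "finance"})
print(ds.read("user:1"))  # {'interest': 'sports'}
print(len(ds.scan()))     # 2
```

The design point is that both access styles go through one logical dataset, so the same data can back a real-time service and a batch analysis without a copy.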
  20. App Virtualization
      Applications deployed as CDAP Containers within the CDAP Runtime Environment.
      • Framework-level guarantees: integrated transactions mean applications aren't required to be idempotent; ingestion capabilities and processing engines provide partitioning, ordering, and exactly-once execution
      • Full development lifecycle and production deployment: portable and scalable from laptop to cluster, with support for testing and continuous integration; logging, metrics, security, and management with low developer overhead
      • Standardization of applications across programming paradigms: take advantage of Spark, Cascading, Hive, etc. and their user APIs without worrying about the details of integrating with each system; real-time and batch applications can be packaged, deployed, and managed together
  21. CDAP Architecture
      • Router: provides a single point of access for data, app, service, and management APIs, with integrated discovery, load balancing, and horizontal scalability
      • Transaction engine: enables ACID properties for data operations from within any program container, real-time and batch
      • Runtime services: includes services for apps and data, like security, discovery, and management, throughout the app and data lifecycle
  22. Customer intimacy
      Drive website engagement and revenue through personalization based on real-time and retrospective data.
      • Real-time recommendations from predictive analytics
      • Highly targeted ads for contextual experiences
      • Large-scale click stream analysis to drive customer account expansion
  23. Anomaly detection
      Data applications can detect patterns and act on trends and anomalies.
      • Fraud in financial transactions
      • Breaches in IT systems
      • Threats to public safety
      • Medical anomalies, providing better and faster diagnosis for improved therapeutic and public health outcomes
  24. Operationalized analytics
      Hadoop can ask new questions of old data. However, many solutions will benefit from asking old questions of new data.
      • Predictive maintenance
      • Supply chain management
      • Optimization of complex systems such as manufacturing lines, computer networks, and transportation systems
  25. Thank You!
      Clone it, build it, play with it – or get it from http://cask.co/downloads and play with it!
      • CDAP: github.com/caskdata/cdap | Website: http://cdap.io | Group: [email protected] | IRC: #cdap
      • Coopr: github.com/caskdata/coopr | Website: coopr.io (or http://coo.pr) | Group: [email protected] | IRC: #coopr
      • Tephra: github.com/caskdata/tephra | Website: tephra.io | Group: [email protected] | IRC: #tephra
      • Tigon: github.com/caskdata/tigon | Website: tigon.io | Group: [email protected] | IRC: #tigon
      Contribute and help us make it better! @nmotgi