
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain 2014

Talk by Sean Owen

Big Data Spain

November 25, 2014

Transcript

  1. STATE OF PLAY SEAN OWEN DIRECTOR OF DATA SCIENCE CLOUDERA

  2. State of Play: Data Science on Hadoop in 2015
    Sean Owen // Director, Data Science @ Cloudera

  3. About …
    • Engineer
    • Data Science @ Cloudera
    • Oryx project founder
    • Committer, erstwhile VP Apache Mahout
    • Apache Spark contributor / personality
    • Co-author, Mahout in Action / Advanced Analytics on Spark
    • sowen@cloudera.com / @sean_r_owen

  4. Where Is My Magic Wand?

  5. We Like Hadoop Because …
    It’s Aspirational
    • (Was) Shiny New Toy
    • Be Like Yahoo, Google, FB
    • Data as Strategy
    It Costs Less
    • Free – Just Add Hardware
    • Open, Standard
    • Cost-Savings Projects
    We Get More Computing
    • Bigger and Faster is Better
    • Fewer Hacks to Survive Scale
    • Do The Previously Impossible
    www.avalonconsulting.net/blog/485-thinking-beyond-shiny-and-new
    www.pianta.co.uk/massive-sale-now-on/
    www.google.com/about/careers/locations/mayes-county/

  6. Incremental Today vs. Revolutionary Tomorrow
    • We set up a prototype Hadoop cluster as part of a big data POC
    • We cut our IT budget by 22% by moving some operations to Hadoop
    • Our SQL queries are 3 times faster and overnight reports finish in 39 minutes now
    • We do the same things with data, but do them notably better.
    • We want to become a real-time product business that reacts to new machine sensor data in seconds, not days
    • We want to predict which merchants will take out a business loan this month
    • We want a complete customer profile that “understands” what they want at any time
    • We think there is a magic wand available?

  7. Phase 1. Collect Data
    Phase 2. Data Science?
    Phase 3. Profit!

  8. Demystifying with Data Science
    • Machine Learning is not new
    • Big Machine Learning is qualitatively different
      – More data beats algorithm improvement
      – Scale trumps noise and sample size effects
      – Can brute-force manual tasks
        • Feature selection
        • Hyperparameter tuning
    • Engineering “Big” is Difficult
      – Build new scalable data platforms
      – Re-engineering parallel algorithms

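The "brute-force" point about hyperparameter tuning can be illustrated with a plain grid search. This is only a minimal sketch in plain Scala, not anything shown in the talk: the scoring function `validationError`, the parameter names `regularization` and `rank`, and the grid values are all hypothetical stand-ins for "train a model with these settings and measure its error on held-out data".

```scala
object GridSearchSketch {
  // Hypothetical stand-in for "train a model and measure validation error".
  // A real version would fit a model on training data for each setting.
  def validationError(regularization: Double, rank: Int): Double =
    math.pow(regularization - 0.1, 2) + math.pow(rank - 20, 2) / 1000.0

  // Hypothetical grid of settings to try.
  val regularizations = Seq(0.001, 0.01, 0.1, 1.0)
  val ranks = Seq(10, 20, 50)

  // Brute force: evaluate every combination and keep the lowest error.
  val candidates = for (reg <- regularizations; rank <- ranks)
    yield ((reg, rank), validationError(reg, rank))
  val (bestParams, bestError) = candidates.minBy(_._2)

  def main(args: Array[String]): Unit =
    println(s"best params: $bestParams, error: $bestError")
}
```

Each combination is evaluated independently of the others, which is what makes this kind of search a natural candidate for running in parallel on a cluster.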
  9. What is Data Science?
    What skill sets does it require?
    What tools are commonly used?
    How do we architect data products?
    How do we get started?

  10. Three Camps

  11. s3.amazonaws.com/aws.drewconway.com/viz/venn_diagram/data_science.html

  12. Business

  13. Business

  14. Engineering vs. Statistics
    Programming languages vs. Statistical environments, BI tools
    Systems languages vs. High-level languages
    Latency, throughput vs. Accuracy
    Huge data vs. Medium-sized data
    Online problems vs. Offline work
    Automated vs. Ad-hoc
    Developers, Engineers vs. Statisticians, Analysts

  15. Data Science + Hadoop

  16. Engineering, Statistics & Hadoop: Before Gap.

  17. Engineering, Statistics & Hadoop: 2014 YARN RM

  18. Apache Spark: Something for Everyone
    • Now Apache TLP
      – From UC Berkeley AMPLab
      – … inspired by MS DryadLINQ
    • Scala-based
      – Expressive, efficient
      – JVM-based
    • Scala-like abstractions
      – RDD: Resilient Distributed (immutable) Dataset
      – Distributed works like local
      – Like Apache Crunch is Collection-like
    • Read-Evaluate-Print-Loop
      – Interactive
      – No compile/deploy cycle needed
    • Python API too
    • Natively Distributed
    • Hadoop-friendly
      – Integrate with where data already is
      – ETL no longer separate
    • Subprojects: MLlib and more

  19. Statisticians: Shell, Concise Syntax
    <row Id="4" ... Tags="...c#...winforms..."/>
    (4,"c#") (4,"winforms") ...
    (4,3104,1.0) (4,2148819,1.0) ...

    scala> val postIDTags = postsXML.flatMap { line =>
             // Capture the post Id and the raw (XML-escaped) Tags attribute
             val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
             // Each tag is wrapped in escaped angle brackets: &lt;tag&gt;
             val tagRegex = "&lt;([^&]+)&gt;".r
             idTagRegex.findFirstMatchIn(line) match {
               case None => None
               case Some(m) => {
                 val postID = m.group(1).toInt
                 val tagsString = m.group(2)
                 val tags = tagRegex.findAllMatchIn(tagsString)
                   .map(_.group(1)).toList
                 // Emit one (postID, tag) pair per tag
                 tags.map((postID,_))
               }
             }
           }

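The shell snippet on this slide depends on `postsXML`, an RDD that the transcript does not define. As a minimal sketch, the same regex logic can be tried on an ordinary Scala `List`, which also illustrates the earlier "distributed works like local" point: the `flatMap` call has the same shape on a local collection. The object name `ParseTagsLocally`, the helper `parseLine`, and the two sample rows are hypothetical and not taken from the talk or its data set.

```scala
object ParseTagsLocally {
  // Regexes copied from the slide: post Id and the XML-escaped Tags attribute.
  val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
  val tagRegex = "&lt;([^&]+)&gt;".r

  // Turn one XML row into (postID, tag) pairs; rows without tags yield nothing.
  def parseLine(line: String): List[(Int, String)] =
    idTagRegex.findFirstMatchIn(line) match {
      case None => Nil
      case Some(m) =>
        val postID = m.group(1).toInt
        tagRegex.findAllMatchIn(m.group(2)).map(_.group(1)).toList.map((postID, _))
    }

  // Hypothetical sample rows, made up for illustration.
  val sampleLines = List(
    """<row Id="4" Title="example" Tags="&lt;c#&gt;&lt;winforms&gt;" />""",
    """<row Id="5" Title="no tags here" />"""
  )

  // Same flatMap shape as on the slide, but over a local List instead of an RDD.
  val postIDTags: List[(Int, String)] = sampleLines.flatMap(parseLine)

  def main(args: Array[String]): Unit = postIDTags.foreach(println)
}
```

The first sample row produces the pairs `(4,c#)` and `(4,winforms)`, matching the output shown on the slide; the second row has no `Tags` attribute and contributes nothing.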
  20. Engineers: Distributed, Manageable

  21. 2015 is Time to Operationalize

  22. From Exploratory to Operational
    Exploratory Analytics, Operational Analytics
    • Explore Data
    • Pick Model
    • Build Model at Scale, Offline
    • Continuously Update Model
    • Score Model in Real-Time

  23. Lambda λ Architecture
    noun. 1. Name of a design idea you’ve had before but didn’t realize was a thing that needed a name.

  24. Lambda Architecture λ: Streaming
    • Lambda Architecture
      – Batch Layer: compute full answer offline, in batch
      – Speed Layer: compute approximate answer online, in near-real-time
      – Serving Layer: stitch speed/batch answers together in real-time
    • Great fit for big, real-time ML
    • Ecosystem has right components now
      – Batch: Spark + MLlib
      – Speed: Spark Streaming
      – Serving: Tomcat / Jetty
      – Data Fabric: Kafka, HDFS

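The three layers can be sketched with a toy event counter. This is a minimal illustration in plain Scala under stated assumptions, not the Oryx or Spark implementation: the function names `batchView`, `speedView` and `query`, and the sample events, are hypothetical. In this toy case the speed layer's counts happen to be exact, whereas the slide describes the speed layer's answer as approximate in general.

```scala
object LambdaSketch {
  // Batch layer: full answer computed offline over all historical events.
  def batchView(history: Seq[String]): Map[String, Long] =
    history.groupBy(identity).map { case (k, v) => k -> v.size.toLong }

  // Speed layer: incremental update for each event seen since the last batch run.
  def speedView(current: Map[String, Long], event: String): Map[String, Long] =
    current.updated(event, current.getOrElse(event, 0L) + 1L)

  // Serving layer: stitch the batch and speed answers together at query time.
  def query(batch: Map[String, Long], speed: Map[String, Long], key: String): Long =
    batch.getOrElse(key, 0L) + speed.getOrElse(key, 0L)

  // Hypothetical events, made up for illustration.
  val history = Seq("click", "view", "click")
  val recent = Seq("click", "purchase")

  val batch: Map[String, Long] = batchView(history)
  val speed: Map[String, Long] = recent.foldLeft(Map.empty[String, Long])(speedView)

  def main(args: Array[String]): Unit =
    Seq("click", "view", "purchase").foreach { k =>
      println(s"$k -> ${query(batch, speed, k)}")
    }
}
```

When the batch layer next recomputes over the full history, the speed view can be reset to empty, so any drift in the online estimate is bounded by the time between batch runs.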
  25. Oryx 2: Lambda for ML (alpha) github.com/OryxProject/oryx

  26. Thank You sowen@cloudera.com @sean_r_owen

  27. 17TH ~ 18th NOV 2014 MADRID (SPAIN)