State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain 2014

Talk by Sean Owen

Big Data Spain

November 25, 2014

Transcript

  1. State of Play: Data Science on Hadoop in 2015

    Sean Owen // Director, Data Science @ Cloudera
  2. About …

    • Engineer
    • Data Science @ Cloudera
    • Oryx project founder
    • Committer, erstwhile VP Apache Mahout
    • Apache Spark contributor / personality
    • Co-author, Mahout in Action / Advanced Analytics on Spark
    • [email protected] / @sean_r_owen
  3. We Like Hadoop Because …

    It’s Aspirational
    • (Was) Shiny New Toy
    • Be Like Yahoo, Google, FB
    • Data as Strategy

    It Costs Less
    • Free – Just Add Hardware
    • Open, Standard
    • Cost-Savings Projects

    We Get More Computing
    • Bigger and Faster is Better
    • Fewer Hacks to Survive Scale
    • Do The Previously Impossible

    www.avalonconsulting.net/blog/485-thinking-beyond-shiny-and-new
    www.pianta.co.uk/massive-sale-now-on/
    www.google.com/about/careers/locations/mayes-county/
  4. Incremental Today vs. Revolutionary Tomorrow

    Incremental Today
    • We set up a prototype Hadoop cluster as part of a big data POC
    • We cut our IT budget by 22% by moving some operations to Hadoop
    • Our SQL queries are 3 times faster and overnight reports finish in 39 minutes now
    • We do the same things with data, but do them notably better.

    Revolutionary Tomorrow
    • We want to become a real-time product business that reacts to new machine sensor data in seconds, not days
    • We want to predict which merchants will take out a business loan this month
    • We want a complete customer profile that “understands” what they want at any time
    • We think there is a magic wand available?
  5. Demystifying with Data Science

    • Machine Learning is not new
    • Big Machine Learning is qualitatively different
      – More data beats algorithm improvement
      – Scale trumps noise and sample size effects
      – Can brute-force manual tasks
        • Feature selection
        • Hyperparameter tuning
    • Engineering “Big” is Difficult
      – Build new scalable data platforms
      – Re-engineering parallel algorithms
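The "brute-force" point on hyperparameter tuning can be illustrated with a plain grid search. This is only a minimal sketch: the model (a one-dimensional ridge regression without intercept), the data points, and the candidate grid are hypothetical, chosen so that the example stays self-contained.

```scala
// Minimal sketch: brute-force hyperparameter tuning by grid search.
// The model, data and candidate grid are hypothetical illustrations.
object GridSearchSketch {
  // Closed-form ridge fit for y ≈ w * x: w = Σxy / (Σx² + lambda)
  def fit(train: Seq[(Double, Double)], lambda: Double): Double = {
    val sxy = train.map { case (x, y) => x * y }.sum
    val sxx = train.map { case (x, _) => x * x }.sum
    sxy / (sxx + lambda)
  }

  // Mean squared error of weight w on a held-out set
  def mse(w: Double, data: Seq[(Double, Double)]): Double =
    data.map { case (x, y) => math.pow(y - w * x, 2) }.sum / data.size

  // Evaluate every candidate and keep the one with the lowest validation error
  def bestLambda(train: Seq[(Double, Double)],
                 validation: Seq[(Double, Double)],
                 grid: Seq[Double]): (Double, Double) =
    grid.map(l => (l, mse(fit(train, l), validation))).minBy(_._2)

  def main(args: Array[String]): Unit = {
    val train = Seq((1.0, 2.1), (2.0, 3.9), (3.0, 6.2))
    val validation = Seq((4.0, 8.0), (5.0, 10.1))
    val (lambda, err) = bestLambda(train, validation, Seq(0.0, 0.1, 1.0, 10.0))
    println(s"best lambda = $lambda, validation MSE = $err")
  }
}
```

Each candidate is evaluated independently of the others, which is why this kind of search lends itself to being spread across a cluster.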
  6. What is Data Science?

    What skill sets does it require?
    What tools are commonly used?
    How do we architect data products?
    How do we get started?
  7. Engineering vs. Statistics

    Programming languages vs. Statistical environments, BI tools
    Systems languages vs. High-level languages
    Latency, throughput vs. Accuracy
    Huge data vs. Medium-sized data
    Online problems vs. Offline work
    Automated vs. Ad-hoc
    Developers, Engineers vs. Statisticians, Analysts
  8. Apache Spark: Something for Everyone

    • Now Apache TLP
      – From UC Berkeley AMPLab
      – … inspired by MS DryadLINQ
    • Scala-based
      – Expressive, efficient
      – JVM-based
    • Scala-like abstractions
      – RDD: Resilient Distributed (immutable) Dataset
      – Distributed works like local
      – Collection-like, as Apache Crunch is
    • Read-Evaluate-Print-Loop
      – Interactive
      – No compile/deploy cycle needed
    • Python API too
    • Natively Distributed
    • Hadoop-friendly
      – Integrate with where data already is
      – ETL no longer separate
    • Subprojects: MLlib and more
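To make "distributed works like local" concrete, here is a small sketch of a transformation chain on an ordinary local Scala collection. The input lines are hypothetical. Spark's RDD offers like-named operations such as `flatMap`, `filter`, and `map`; the grouping step here uses the local `groupBy`, whereas on a pair RDD one would more typically reach for `reduceByKey`.

```scala
// Sketch: a collection-style pipeline on a plain local Seq.
// No Spark dependency; the sample lines are hypothetical.
object CollectionLikeSketch {
  def wordCounts(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("\\s+")) // one record per word
      .filter(_.nonEmpty)                   // drop empty tokens
      .map(word => (word, 1))               // pair each word with a count of 1
      .groupBy(_._1)                        // local stand-in for a by-key reduce
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val lines = Seq("spark on hadoop", "data science on hadoop")
    println(wordCounts(lines))
  }
}
```

Every step returns a new collection and leaves its input untouched, which mirrors the immutability noted for RDDs.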
  9. Statisticians: Shell, Concise Syntax

    <row Id="4" ... Tags="...c#...winforms..."/>
    (4,"c#") (4,"winforms") ...
    (4,3104,1.0) (4,2148819,1.0) ...

    scala> val postIDTags = postsXML.flatMap { line =>
             // Capture the post ID and the raw Tags attribute of a row
             val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
             // Each tag is wrapped in escaped angle brackets: &lt;tag&gt;
             val tagRegex = "&lt;([^&]+)&gt;".r
             idTagRegex.findFirstMatchIn(line) match {
               case None => None
               case Some(m) => {
                 val postID = m.group(1).toInt
                 val tagsString = m.group(2)
                 val tags = tagRegex.findAllMatchIn(tagsString)
                   .map(_.group(1)).toList
                 // Emit one (postID, tag) pair per tag
                 tags.map((postID,_))
               }
             }
           }
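As a rough illustration of what the shell snippet does, the same two regular expressions can be tried on a local `Seq` without Spark. The sample rows below are hypothetical; the escaped `&lt;tag&gt;` form of the `Tags` attribute is an assumption inferred from `tagRegex`. The `Option` match is written here as `.toList.flatMap`, a small restructuring rather than the slide's exact code.

```scala
// Sketch: the slide's parsing logic applied to a local Seq instead of an RDD.
object TagExtractionSketch {
  private val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
  private val tagRegex = "&lt;([^&]+)&gt;".r

  // Returns one (postID, tag) pair per tag; rows without Id/Tags yield nothing
  def postIDTags(lines: Seq[String]): Seq[(Int, String)] =
    lines.flatMap { line =>
      idTagRegex.findFirstMatchIn(line).toList.flatMap { m =>
        val postID = m.group(1).toInt
        tagRegex.findAllMatchIn(m.group(2)).map(t => (postID, t.group(1))).toList
      }
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical rows, assumed to resemble the input shown on the slide
    val sample = Seq(
      "<row Id=\"4\" Tags=\"&lt;c#&gt;&lt;winforms&gt;\"/>",
      "<row Id=\"7\" Body=\"no tags here\"/>")
    postIDTags(sample).foreach(println)
  }
}
```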
  10. From Exploratory to Operational

    Exploratory Analytics, Operational Analytics

    • Explore Data
    • Pick Model
    • Build Model at Scale, Offline
    • Continuously Update Model
    • Score Model in Real-Time
  11. Lambda λ Architecture

    noun. 1. Name of a design idea you’ve had before but didn’t realize was a thing that needed a name.
  12. Lambda Architecture λ: Streaming

    • Lambda Architecture
      – Batch Layer: compute full answer offline, in batch
      – Speed Layer: compute approximate answer online, in near-real-time
      – Serving Layer: stitch speed/batch answers together in real-time
    • Great fit for big, real-time ML
    • Ecosystem has right components now
      – Batch: Spark + MLlib
      – Speed: Spark Streaming
      – Serving: Tomcat / Jetty
      – Data Fabric: Kafka, HDFS
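The division of labour between the three layers can be sketched with simple per-key event counts. This is a toy illustration in plain Scala, not tied to Spark, Kafka, or any serving container; the object and function names are hypothetical. For simplicity the speed layer's increment is exact here, whereas for a real model it would usually be an approximate update.

```scala
// Sketch: batch, speed and serving layers over per-key event counts.
object LambdaSketch {
  type Counts = Map[String, Long]

  // Batch layer: recompute the full answer from all historical events
  def batchView(history: Seq[String]): Counts =
    history.groupBy(identity).map { case (k, vs) => (k, vs.size.toLong) }

  // Speed layer: fold one newly arrived event into a small incremental view
  def speedUpdate(view: Counts, event: String): Counts =
    view.updated(event, view.getOrElse(event, 0L) + 1L)

  // Serving layer: stitch the batch and speed views together at query time
  def query(batch: Counts, speed: Counts, key: String): Long =
    batch.getOrElse(key, 0L) + speed.getOrElse(key, 0L)

  def main(args: Array[String]): Unit = {
    val batch = batchView(Seq("a", "b", "a"))
    // Events that arrived after the last batch run
    val speed = Seq("a", "c").foldLeft(Map.empty[String, Long])(speedUpdate)
    Seq("a", "b", "c").foreach(k => println(s"$k -> ${query(batch, speed, k)}"))
  }
}
```

When the next batch run completes, its output replaces the old batch view and the speed view can be reset, so any drift in the incremental view is bounded by the batch interval.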