
Big Data [sorry] & Data Science: What Does a Data Scientist Do?

Talk by Carlos Somohano at The Cloud and Big Data: HDInsight on Azure, London, 25/01/13

Data Science London

January 26, 2013

Transcript

  1. The Data Scientist – Billed Platypus / The Platypus – Billed Data Scientist

    Hacking, Statistics, Visualization, Machine Learning, Programming, Science, Data Mining, Math
  2. Man on the Moon – Small Data!

    Apollo XI
    - Speed: 3,500 km/hour
    - Weight: 13,500 kg
    - Lots of complex data

    Man on the Moon
    - Distance: 356,000 km
    - Never been there before
    - Must return to Earth

    Computer Program
    - Date: 1969
    - 64 Kb, 2 Kb RAM, Fortran
    - Must work 1st time
  3. Think About It – We Live in Crazy Times!

    - Apollo XI, 1969: 64 Kb
    - SkyDive Stratos, 2012: tens of gigabytes
  4. What is Big Data? IT mumbo-jumbo

    A fashionable term typically used by some IT vendors to remarket old-fashioned software & hardware
  5. What is Big Data? The n-Vs

    Volume … Variety … Velocity … (add your own V here…) So what?
  6. Change! Water Cooler Chat

    - We need to parallelize data operations, but it’s too costly & complex…
    - The business can’t get access to all the relevant data, we need external data…
    - We can’t match customer master data to live customer interactions…
    - We can’t just force everything into a star-schema…
    - These BI reports and charts don’t tell us anything we didn’t know…
    - We are missing the ETL window, the data we needed didn’t arrive on time…
    - We can’t predict with confidence if we can’t explore data & develop our own models
  7. What is Big Data? Force of Change

    Big Data forces you to change the way you collect, store, manage, analyze and visualize data
  8. Big Data = Crude Oil [not New Oil]

    Think of data as ‘crude oil.’ Big Data is about extracting the ‘crude oil,’ transporting it in ‘mega-tankers,’ siphoning it through ‘pipelines,’ and storing it in massive ‘silos’… All ‘this’ is about IT Big Data… fine and well… BUT
  9. The Science [and Art] of…

    - Discovering what we don’t know from data
    - Obtaining predictive, actionable insight from data
    - Creating Data Products that have business impact now
    - Communicating relevant business stories from data
    - Building confidence in decisions that drive business value
  10. Brief History of Data Science

    - 6th C BC - 1st C BC – The Greeks! Pyrrhonism, Skepticism & Empiricism…
    - 1974 – Peter Naur @UoC: Datalogy & Data Science
    - 2001 – William S. Cleveland @CSU: “Data Science: An Action Plan …”
    - 2002 – Committee on Data for Science & Technology (CODATA)
    - 2003 – Journal of Data Science
    - 2009 – Jeff Hammerbacher @Facebook: What does a Data Scientist Do?
    - 2010 – Drew Conway @NYU: The Data Science Venn Diagram
    - 2010 – Hilary Mason & Chris Wiggins @Dataists
    - 2010 – Mike Loukides @O’Reilly: “What is Data Science?”
    - 2011 – DJ Patil @LinkedIn: data scientist vs. data analyst
  11. Jeff Hammerbacher, 2009

    “... on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization.”
  12. Mike Loukides, 2010

    “Data science enables the creation of data products.”

    “Whether... data is search terms, voice samples, or product reviews,... users are in a feedback loop in which they contribute to the products they use. That’s the beginning of data science.”
  13. Hilary Mason & Chris Wiggins, 2010

    “Data science is clearly a blend of the hackers’ arts, statistics and machine learning...; and the expertise in mathematics and the domain of the data for the analysis to be interpretable... It requires creative decisions and open-mindedness in a scientific context.”
  14. DJ Patil, 2011

    “We realized that as our organizations grew, we both had to figure out what to call the people on our teams. ‘Business analyst’ and ‘Data analyst’ seemed too limiting. The focus of our teams was to work on data applications that would have an immediate and massive impact on the business. The term that seemed to fit best was data scientist: those who use both data and science to create something new.”
  15. The Data Scientist – Billed Platypus / The Platypus – Billed Data Scientist

    Hacking, Statistics, Visualization, Machine Learning, Programming, Science, Data Mining, Math
  16. Class DataScientist {

    - Is skeptical, curious. Has inquisitive mind
    - Knows Machine Learning, Statistics, Probability
    - Applies Scientific Method. Runs Experiments
    - Is good at Coding & Hacking
    - Able to deal with IT Data Engineering
    - Knows how to build data products
    - Able to find answers to known unknowns
    - Tells relevant business stories from data
    - Has Domain Knowledge

    }
  17. 10 Things [most] Data Scientists Do

    1. Ask Good Questions. What is what… we don’t know? …we’d like to know?
    2. Define and Test a Hypothesis. Run experiments
    3. Scoop, Scrape, Sink, & Sample Business Relevant Data
    4. Munge and Wrestle Data. Tame Data
    5. Explore Data, Discover Data Playfully. Discover unknowns
    6. Model Data. Model Algorithms
    7. Understand Data Relationships
    8. Tell the Machine How to Learn from Data
    9. Create Data Products that Deliver Actionable Insight
    10. Tell Relevant Business Stories from Data
  18. [Sort of a] Data Scientist Toolkit

    - Java, R, Python… (bonus: Clojure, Haskell, Scala)
    - Hadoop, HDFS & MapReduce… (bonus: Spark, Storm)
    - HBase, Pig & Hive… (bonus: Shark, Impala, Cascalog)
    - ETL, Webscrapers, Flume, Sqoop… (bonus: Hume)
    - SQL, RDBMS, DW, OLAP…
    - Knime, Weka, RapidMiner… (bonus: SciPy, NumPy, scikit-learn, pandas)
    - D3.js, Gephi, ggplot2, Tableau, Flare, Shiny…
    - SPSS, Matlab, SAS… (the enterprise man)
    - NoSQL, MongoDB, Couchbase, Cassandra…
    - And Yes! … MS-Excel: the most used, most underrated DS tool
  19. [Some] Data Science Principles

    1. Socio-Technical Systems (STS) are complex!
    2. Data is never at rest
    3. Data is dirty, deal with it
    4. SVoT = LOL!
    5. Data munging & data wrestling > 70% of the time
    6. Simplification. Reduction. Distillation
    7. Curiosity. Empiricism. Skepticism
  20. Knowns & Unknowns

    “There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don’t know. But there are also unknown unknowns. There are things we don’t know we don’t know.” Donald Rumsfeld
  21. DIKUW FTW!

    D I K U W: Data, Information, Knowledge, Understanding, Wisdom
    - Raw, What, How to, Why, When
    - Numbers, Description, Experience, Cause & Effect, Prediction
    - Letters, Context, Tested, Proven, What’s best
    - Symbols, Relationship, Instruction
    - Signals, Reports, Programs, Models
    - PAST → FUTURE
    - Data Engineer, Data Analyst, Data Miner, Data Scientist
    - Unknown Unknowns, Known Unknowns, Known Knowns
  22. Data Discovery

    “The new reality for Business Intelligence and Big Data,” Applied Data Labs

    Data Scientist, Data Analyst
  23. Data Models vs. Algorithmic Models

    “Statistical Modeling: The Two Cultures,” Leo Breiman, 2001

    Data Modeling: we understand the world
    - How well ‘my data model’ works
    - Statisticians, Data Analysts, Data Miners
    - Linear Regression, Logistic Regression
    - Known Distributions, Confidence Intervals
    - Predictor Variables & Goodness of Fit
    - Y ← F(X, random noise, parameters)

    VS.

    Algorithmic Modeling: we don’t understand the world
    - The world produces data in a black box
    - Data Scientists
    - Machine Learning, AI & Neural Nets
    - Random Forests, SVM, GBT
    - Unknown Multivariate Distributions
    - Iterative, Predictive Accuracy
    - Y ← [Black Box, e.g. Random Forests] ← X
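
A minimal sketch of the contrast on this slide, not taken from the talk. It assumes made-up synthetic data (y = sin(x) plus noise) and uses a k-nearest-neighbour averager as a stand-in for the "black box" family; the slide itself names Random Forests, SVM and GBT instead. The function names `make_data`, `fit_line`, `fit_knn` and `mse` are hypothetical. The data model commits to a straight-line form and estimates its parameters; the algorithmic model assumes no form and is judged only on held-out predictive accuracy.

```python
import math
import random

def make_data(n, rng):
    """Hypothetical synthetic data: y = sin(x) + Gaussian noise."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Data modeling: assume y = a + b*x + noise, estimate a and b by least squares."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return lambda x: a + b * x

def fit_knn(xs, ys, k=5):
    """Algorithmic modeling: no assumed form; average the k nearest training targets."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)

    return predict

def mse(model, xs, ys):
    """Mean squared prediction error on a held-out set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)  # fixed seed so the run is repeatable
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(100, rng)

mse_line = mse(fit_line(train_x, train_y), test_x, test_y)
mse_knn = mse(fit_knn(train_x, train_y), test_x, test_y)
print(f"linear model test MSE: {mse_line:.3f}")
print(f"k-NN test MSE:         {mse_knn:.3f}")
```

The k-NN predictor wins here only because the synthetic signal is deliberately non-linear, so the straight-line assumption is wrong; on data that really is linear, the simpler data model would be the better choice and would also be easier to explain.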
  24. Learning from Data is Tricky

    - Statistical vs. Machine Learning
    - Supervised vs. Unsupervised Learning
    - Induction vs. Deduction
    - Sampling & Confidence Intervals
    - Probability & Distribution
    - Deviation & Variance
    - Correlation vs. Causation
    - Causation & Prediction
  25. More Data or Better Models?

    - More Data Beats Better Algorithms, Omar Tawakol @BlueKai
    - Better Algorithms Beat More Data, Mark Torrance @RocketFuel
    - More Data or Better Models, Xavier Amatriain @Netflix
    - On Chomsky & 2 Cultures of Statistical Learning, Peter Norvig @Google
    - Specialist Knowledge is Useless & Unhelpful, Jeremy Howard @Kaggle
  26. Data Science Process - I

    The World → Ingest Raw Data → Munch Data → The Dataset

    - The World: Product manufactured, Goods shipped, Product purchased, Phone calls made, Energy consumed, Fraud committed, Repair requested, System
    - Ingest Raw Data: Transactions, Web-scraping, Web clicks & logs, Sensor data, Mobile data, Docs, emails, XLS, Social feeds, RSS, Flume & Sink, HDFS
    - Munch Data: MapReduce, ETL, ELT, Data Wrangle, Data Cleansing, Data Jujitsu, Dim Reduction, Sample, Select, Join, Bind
    - The Dataset: Independency? Correlation? Covariance? Causality? Dimensionality? Missing values? Relevant?

    1. Known unknowns?
    2. We’d like to know…?
    3. Outcomes?
    4. What data?
    5. Hypothesis?
  27. Data Science Process - II

    - The Dataset: Explore Data, Discover Data
    - Learn From Data: Description & Inference, Data & Algorithm Models, Machine Learning, Networks & Graphs, Regression & Prediction, Classification & Clustering, Experiments & Iteration
    - Deliver Insight: Represent Data, Visualize Insight
    - Objectives, Levers, Modeling, Simulation, Optimization, Visualization
    - Data Product: Actionable, Predictive, Immediate Impact, Business Value, Easy to explain
  28. A Data Product Is…

    - … Curated and crafted from raw data
    - … A result of exploration and iterations
    - … A machine that learns from data
    - … An answer to known unknowns or unknown unknowns
    - … A mechanism that triggers immediate business value
    - … A probabilistic window of future events or behavior
  29. Data Jiu-Jitsu

    Data Jiu-Jitsu: ability to turn big data into data products that generate immediate business value (DJ Patil @LinkedIn)

    Data → Data Scientist (Jiu-Jitsu fight) → Data Product → $$$$
  30. Developing Data Products

    - Objectives: What outcome am I trying to achieve?
    - Levers: What inputs can we control?
    - Data: What data can we collect?
    - Models: How the levers influence the objectives

    Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products,” Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
  31. Objective-Based Data Products

    What outcome am I trying to achieve?

    The Model Assembly Line: Data → Modeler → Simulator → Optimizer → Actionable Outcome

    Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products,” Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
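
The Modeler → Simulator → Optimizer chain can be sketched in a few lines. This is an illustrative toy, not code from the talk or from the Drivetrain article: the objective is assumed to be revenue per visitor, the single lever is price, and the linear demand curve in `purchase_probability` is invented rather than fitted to any data. All three function names are hypothetical.

```python
def purchase_probability(price):
    """Modeler: hypothetical demand curve (an assumption, not fitted to real data)."""
    return max(0.0, 1.0 - price / 20.0)

def simulate_revenue(price):
    """Simulator: expected revenue per visitor if the price lever is set to `price`."""
    return price * purchase_probability(price)

def optimize_price(candidate_prices):
    """Optimizer: choose the lever setting with the best simulated outcome."""
    return max(candidate_prices, key=simulate_revenue)

# Actionable outcome: the price to charge, among whole-number candidates 1..20.
best_price = optimize_price(range(1, 21))
print(best_price, simulate_revenue(best_price))
```

In a real data product the modeler would be learned from collected data and the simulator would combine several models; the point of the sketch is only that the optimizer searches over levers, not over raw predictions.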
  32. Optimize CLV: Product Recommendations

    1. Products the customer may like
    2. Price Elasticity
    3. Probability of Purchase w/o Recommendation
    4. Purchase Sequence
    5. Causality Model
    6. Patience Model

    Customer Lifecycle Value: Data → Modeler → Simulator → Optimizer → Visualizer

    Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products,” Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
  33. Automated Fruits Procurement

    - 12,000 stores
    - 300 fruits
    - Avg. shelf life < 3 days
    - Confirm purchase orders in less than 2 hours

    Safety stock levels? Demand vs. stock? Price vs. demand? Anomalies? Fruit shortages? Fruit write-offs?

    Adapted from Blueyonder
  34. Strawberries & the Weather

    Why these huge stock write-offs? Sudden increase in temperature: no sales vs. X,XXX sales predicted

    A predictive model that calculates strawberry purchases based on:
    - Weather forecast
    - Store temperature
    - Freezer sensor data
    - Remaining stock per shelf life
    - Sales TPoS feeds
    - Web searches, social mentions

    Adapted from Blueyonder
  35. Personalized Social Recommendations

    - Prediction: Personalized Skills Recommendation
    - Collaborative Filtering: Matching Skills to People

    Adapted from “Developing Data Products” by Peter Skomoroch, 5 Dec 2012. Copyright LinkedIn
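
As a rough illustration of "matching skills to people" with collaborative filtering, here is a small user-based neighbourhood sketch. It is an assumption-laden toy, not LinkedIn's method: the `profiles` data are invented, similarity is plain Jaccard overlap between skill sets, and `jaccard` and `recommend_skills` are hypothetical helper names.

```python
from collections import Counter

# Hypothetical toy data: person -> set of skills listed on their profile.
profiles = {
    "ana": {"python", "statistics", "hadoop"},
    "ben": {"python", "statistics", "r"},
    "cho": {"java", "hadoop", "hive"},
    "dee": {"python", "r"},
}

def jaccard(a, b):
    """Similarity of two skill sets: size of the overlap divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_skills(person, profiles, top_n=2):
    """Score each skill the person lacks by the similarity of the people who have it."""
    own = profiles[person]
    scores = Counter()
    for other, skills in profiles.items():
        if other == person:
            continue
        similarity = jaccard(own, skills)
        for skill in skills - own:
            scores[skill] += similarity
    # Highest score first; ties broken alphabetically so the result is deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [skill for skill, score in ranked[:top_n] if score > 0]

print(recommend_skills("dee", profiles))
```

For "dee", the closest profile is "ben" (two shared skills out of three), so the skill "ben" has and "dee" lacks ranks first, followed by the skill contributed by the weaker match "ana".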
  36. Colas: In Which US State Do I Invest Mktg. $?

    What the Business Analyst sent vs. what the Data Scientist did…
  37. Interested in Data Science?

    - Join our community: http://www.meetup.com/Data-Science-London/
    - Follow us on Twitter: @ds_ldn
    - Check out our blog: http://datasciencelondon.org