From Developer to Data Scientist (KCDC 2018)

Due to recent advances in technology, humanity is collecting vast amounts of data at an unprecedented rate, making the skills necessary to mine insights from this data increasingly valuable. So what does it take for a Developer to enter the world of data science?

Join me on a journey into the world of big data and machine learning where we will explore what the work actually looks like, identify which skills are most important, and design a road map for how you too can join this exciting and profitable industry.

Gaines Kergosien

July 12, 2018

Transcript

  1. From Developer to Data Scientist - Gaines Kergosien @GainesK
    A journey into the world of data analysis.

  2. Titanium Sponsors
    Platinum Sponsors
    Gold Sponsors

  3. The Demand
    “I keep saying that the sexy job in the next 10 years
    will be statisticians. And I’m not kidding.”
    – Google’s Chief Economist

  4. The Demand
    Worldwide spending on big data and analytics could
    top $210 billion by 2020 (40% growth from 2017).
    – IDC

  5. The Demand
    Worldwide revenues for big data and business
    analytics (BDA) will grow from $130.1 billion in 2016 to
    more than $203 billion in 2020.
    – Worldwide Semiannual Big Data and Analytics Spending Guide
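
The two figures on the slide above imply an average yearly growth rate. A minimal sketch of that arithmetic, assuming steady compound growth between 2016 and 2020 (the helper name `cagr` is illustrative and does not come from the deck):

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted on the slide, in billions of US dollars.
rate = cagr(130.1, 203.0, 2020 - 2016)
print(f"Implied annual growth: {rate:.1%}")  # a little under 12% per year
```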

  6. The Demand
    “The U.S. economy could be short as many as 250,000
    data scientists by 2024.”
    – McKinsey Global

  7. 2017

  8. 2017
    2016 Infrastructure

  9. 2017
    2016 Analytics

  10. 2017
    2016 Applications

  11. (no transcribed text)

  12. Big Data

  13. Big Data
    Volume

  14. The Volume

  15. Big Data
    Volume
    • Records
    • Transactions
    • Tables & Files
    Variety
    • Structured
    • Unstructured
    • Semi-structured

  16. The Variety
    Unstructured Text
    • Books
    • Blog Posts
    • Comments
    • Tweets
    • Photos
    • Video
    • Audio

  17. Big Data
    Volume
    • Records
    • Transactions
    • Tables & Files
    Variety
    • Structured
    • Unstructured
    • Semi-structured
    Velocity
    • Real Time
    • Near Time
    • Batch
    • Streams

  18. The Velocity
    Twitter
    • 6,000 tweets per second
    • 500 million tweets/day
    Facebook
    • 300 million photos/day
    NY Stock Exchange
    • captures 1TB of trade information each session
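
The two Twitter figures above are roughly consistent with each other, which a quick back-of-the-envelope check shows. This is only a sketch of the arithmetic, assuming the per-second rate holds evenly across a full day:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

tweets_per_second = 6_000  # figure quoted on the slide
tweets_per_day = tweets_per_second * SECONDS_PER_DAY

# Prints 518,400,000, in line with the slide's "500 million tweets/day".
print(f"{tweets_per_day:,} tweets/day")
```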

  19. Big Data
    Volume
    • Records
    • Transactions
    • Tables & Files
    Variety
    • Structured
    • Unstructured
    • Semi-structured
    Velocity
    • Real Time
    • Near Time
    • Batch
    • Streams

  20. The Skills
    Subject Matter Expertise
    Statistics
    • Choose Procedures
    • Diagnose Problems
    • Develop Procedures
    Hacking Expertise
    • Technical Skills
    • Creativity
    • Values
    • Goals
    • Constraints
    Machine Learning
    Traditional Research
    Traditional Software
    Data Science

  21. The Skills
    Subject Matter Expertise
    Hacking Expertise
    Social Sciences
    Statistics
    Machine Learning
    Traditional Software
    Data Science
    Traditional Research
    Holistic Research
    Socially Unaware
    Domain Unaware
    Holistic Software

  22. The Overlap
    Data Science
    Big Data
    Big Data Science
    Big Data: Volume, Variety, Velocity

  23. The Buzzwords
    • Data Lake
    • Data Mining
    • Unstructured Data
    • Dark Data
    • Fast Data
    • Edge Analytics
    • Predictive Analytics
    • Data Visualization

  24. The Phrase “Data Science”
    Data
    • Define
    • Collect
    • Store
    • Explore
    Science
    • Hypothesis
    • Plan Approach
    • Analysis
    • Report Results

  25. The Data Scientist
    Data scientists apply sophisticated quantitative and
    computer science skills to both structure and analyze
    massive stores or continuous streams of unstructured
    data, with the intent to derive insights and prescribe
    action.
    – Burtch Works

  26. The Data Scientist – Simple Definition

  27. The Job – Core Skills
    • Data Acquisition
    • Data Cleaning/Transformation
    • Analytics
    • Prescribing Actions
    • Programming/Automation

  28. The Job – Daily Activities
    • Educate the business
    • Design big data architecture
    • Look for problems to solve
    • Research new techniques
    • Collate data for analysis (ETL)*
    • Implement algorithms
    • Present insights

  29. The Maturity – Years of Experience
    https://www.burtchworks.com/files/2014/07/Burtch-Works-Study_DS-2017-final.pdf

  30. The Education
    https://www.burtchworks.com/files/2014/07/Burtch-Works-Study_DS-2017-final.pdf

  31. The US Distribution
    https://www.burtchworks.com/files/2014/07/Burtch-Works-Study_DS-2017-final.pdf

  32. The Areas of Study
    https://www.burtchworks.com/files/2014/07/Burtch-Works-Study_DS-2017-final.pdf

  33. The Industries
    2016 2017
    https://www.burtchworks.com/files/2014/07/Burtch-Works-Study_DS-2017-final.pdf

  34. The Five Questions
    • Classification – Is this A or B?
    • Anomaly Detection – Is this weird?
    • Regression – How much or how many?
    • Clustering – How is this organized?
    • Reinforcement Learning – What should I do next?

  35. The Analysis Tools

  36. The Tool Trends
    Python
    KNIME
    RapidMiner
    R
    SPSS
    SAS
    Hadoop

  37. The Top Tools
    • SQL
    • Excel
    • Python
    • R
    • MySQL

  38. The Languages
    • SPSS
    • Matlab
    • Julia
    • Kafka/Storm
    • R
    • Python
    • Java/Scala
    • Stata
    • SAS
    http://www.kdnuggets.com/2015/05/r-vs-python-data-science.html

  39. The Languages – SAS, Python or R?
    http://www.burtchworks.com/2016/07/13/sas-r-python-survey-2016-tool-analytics-pros-prefer/

  40. The Languages – Trends
    http://www.burtchworks.com/2016/07/13/sas-r-python-survey-2016-tool-analytics-pros-prefer/

  41. The Languages – Industries
    http://www.burtchworks.com/2016/07/13/sas-r-python-survey-2016-tool-analytics-pros-prefer/

  42. The Languages – Education

  43. The Languages – Trends

  44. The Languages – The Future

  45. The R
    • R Statistical Programming Language
    • Based on the S programming language
    • R Development Environment
    • Statistical and Visual Analysis
    • Cross-Platform
    • Free Open Source
    • Active User Community
    • Over 9,000 Extension Packages

  46. The Python
    • Created in 1991 to emphasize productivity and code readability
    • Easier learning curve than R
    • Free Open Source
    • Active User Community

  47. The Hadoop Collective
    • Pig
    • Hive
    • HBase
    • Storm
    • Spark
    • etc.

  48. The Wrangling – Sampling
    Random
    • Simple Random
    • Stratified
    • Systematic/Sequential
    • Cluster
    Nonrandom
    • Judgement
    • Convenience
    • Snowball
    Stratified Sampling
    • Sequential Sampling
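
Three of the random sampling schemes named above can be sketched with Python's standard library alone. The population, the `region` field, and the function names are hypothetical, chosen only for illustration:

```python
import random
from collections import defaultdict


def simple_random_sample(items, k, rng):
    """Draw k items uniformly at random, without replacement."""
    return rng.sample(items, k)


def systematic_sample(items, step, rng):
    """Take every `step`-th item, starting from a random offset."""
    start = rng.randrange(step)
    return items[start::step]


def stratified_sample(items, key, fraction, rng):
    """Sample the same fraction from each stratum (group) defined by `key`."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 customers, 80 in "east" and 20 in "west".
population = [{"id": i, "region": "east" if i < 80 else "west"} for i in range(100)]

srs = simple_random_sample(population, 10, rng)
every_tenth = systematic_sample(population, 10, rng)
strat = stratified_sample(population, lambda p: p["region"], 0.1, rng)
```

Stratifying keeps the 80/20 regional split in the sample (8 east, 2 west), whereas a simple random sample of 10 could by chance miss the smaller region entirely.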

  49. The Wrangling – Data Reconciliation
    • Discard
    • Infer
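
The slide names two options and nothing more. One plausible reading, shown here as an assumption, is that they are two ways to handle missing values: drop the incomplete rows, or fill the gap with an estimate. Mean imputation stands in for "infer" in this sketch; the deck does not say which inference method is meant, and the records are made up.

```python
from statistics import mean

# Hypothetical records; None marks a missing age.
records = [
    {"name": "a", "age": 30},
    {"name": "b", "age": None},
    {"name": "c", "age": 40},
]


def discard_missing(rows, field):
    """Discard: drop every row where `field` is missing."""
    return [row for row in rows if row[field] is not None]


def infer_missing(rows, field):
    """Infer: fill missing values with the mean of the observed ones."""
    observed = [row[field] for row in rows if row[field] is not None]
    fill = mean(observed)
    return [{**row, field: fill if row[field] is None else row[field]} for row in rows]


kept = discard_missing(records, "age")   # two complete rows remain
filled = infer_missing(records, "age")   # row "b" gets the mean of 30 and 40
```

Discarding is simpler but shrinks the data set; inferring keeps every row at the cost of introducing estimated values.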

  50. The Wrangling
    Normalize Numeric Values
    • Standard Unit of Measure
    • Subtract Average (Mean = 0)
    • Divide by Standard Deviation
    Reduce Dimensionality
    • Irrelevant Input Variables
    • Redundant Input Variables
    Add Derivative Values
    • Generalize Attributes
    • Discretize Attributes to Categories
    • Binarize Categorical Attributes
    Design Training Data
    • Select
    • Combine
    • Aggregate
    Power and Log Transformation
    • Approximate Normal Distribution
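
A few of the wrangling steps listed above are short enough to sketch directly: standardizing (subtract the mean, divide by the standard deviation), binarizing a categorical attribute, and a log transformation. The data and function names are hypothetical, and the sketch uses the population standard deviation, which is one of several reasonable choices:

```python
import math
from statistics import mean, pstdev


def standardize(values):
    """Subtract the mean and divide by the standard deviation (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation; must be non-zero
    return [(v - mu) / sigma for v in values]


def binarize(categories):
    """Turn a categorical attribute into 0/1 indicator columns."""
    levels = sorted(set(categories))
    return [{level: int(c == level) for level in levels} for c in categories]


def log_transform(values):
    """Compress a right-skewed, strictly positive variable with a natural log."""
    return [math.log(v) for v in values]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical data
z = standardize(heights_cm)                       # mean 0, standard deviation 1
colors = binarize(["red", "blue", "red"])         # first row: blue=0, red=1
incomes = log_transform([1_000, 10_000, 100_000]) # evenly spaced after the log
```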

  51. The Math
    • basic statistics (e.g., p-value)
    • statistical modeling
    • statistical tests
    • experiment design
    • distributions
    • maximum likelihood estimators
    • probability theory
    • linear algebra
    • multivariable calculus
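
For developers, a p-value is often easiest to grasp through simulation. The sketch below estimates one with a permutation test, a technique chosen here for illustration rather than taken from the deck: shuffle the group labels many times and count how often the shuffled difference in means is at least as large as the observed one. The measurements are made up.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_permutations=2_000, seed=0):
    """Estimate a two-sided p-value for a difference in means by shuffling labels."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations


# Hypothetical measurements from two groups.
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2]
variant = [12.9, 13.1, 12.7, 13.0, 12.8, 13.2]

p = permutation_p_value(control, variant)
print(f"estimated p-value: {p:.4f}")
```

A small value means a difference this large would rarely arise if the group labels were interchangeable.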

  52. The Visualization Tools
    • Tableau (enterprise visualization products) - www.tableau.com
    • ggvis (R visualization package) - ggvis.rstudio.com
    • ggplot (plotting system) - ggplot.yhathq.com
    • D3.js (declarative DOM manipulation) - d3js.org
    • Vega (visualization grammar) - trifacta.github.com/vega
    • Rickshaw (charting library) - code.shutterstock.com/rickshaw
    • modest maps (map library) - modestmaps.com
    • Chart.js (plotting library) - www.chartjs.org

  53. Machine Learning
    “Machine Learning is the science of getting computers
    to learn and act like humans do, and improve their
    learning over time in autonomous fashion, by feeding
    them data and information in the form of observations
    and real-world interactions.”
    • Representation
    • Evaluation
    • Optimization
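
The three bullets above (representation, evaluation, optimization) can be made concrete with the smallest possible learner. This is a sketch under stated assumptions: a straight line as the representation, mean squared error as the evaluation, and plain gradient descent as the optimization. The data, learning rate, and step count are illustrative choices, not values from the deck.

```python
def predict(w, b, x):
    """Representation: a straight line y = w * x + b."""
    return w * x + b


def mse(w, b, xs, ys):
    """Evaluation: mean squared error of the line on the data."""
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit(xs, ys, lr=0.05, steps=2000):
    """Optimization: plain gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = sum(2 * (predict(w, b, x) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (predict(w, b, x) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # hypothetical data generated by y = 2x + 1

w, b = fit(xs, ys)
print(f"w={w:.3f}, b={b:.3f}, error={mse(w, b, xs, ys):.2e}")
```

Swapping any one of the three pieces (a different model, a different error measure, a different search procedure) gives a different learning algorithm.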

  54. Machine Learning – Supervised

  55. Machine Learning – Unsupervised

  56. Generative Adversarial Networks (GAN)

  57. Artificial Intelligence
    • Narrow AI
    • General AI

  58. The Cloud - MLaaS & AIaaS
    • IBM Watson
    • Microsoft Azure Machine Learning API
    • Google Prediction API
    • Amazon Machine Learning API
    • BigML

  59. The Machine Learning
    Concepts
    • k-nearest neighbors
    • random forests
    • ensemble methods
    • …use Python libraries!
    Tools
    • Weka - www.cs.waikato.ac.nz/ml/weka/
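
Of the concepts listed above, k-nearest neighbors is the simplest to write out by hand. The following is a minimal sketch using only the standard library (Python 3.8+ for `math.dist`); the training points and labels are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-D training data as (features, label) pairs.
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((4.0, 4.2), "B"), ((4.1, 3.9), "B"), ((3.8, 4.0), "B"),
]

print(knn_predict(train, (1.1, 1.0)))  # the three nearest points are all "A"
print(knn_predict(train, (4.0, 4.0)))  # the three nearest points are all "B"
```

In practice the slide's advice to use Python libraries applies: scikit-learn's `KNeighborsClassifier`, for example, provides the same idea with faster neighbor search.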

  60. The Results
    • Report
    • Presentation
    • Demo
    • Prototype
    • Component

  61. The Skills
    • Data Analyst (A)
    • Data Engineer (B)
    • Academic (Ab)
    • Generalist (AB)

  62. The Training
    • Coursera - www.coursera.org
    • edX - www.edx.org
    • Udacity - www.udacity.com
    • Kaggle - www.kaggle.com
    • YouTube - projects.iq.harvard.edu/stat110/youtube
    • Boot Camps

  63. The Path
    1. Fundamentals
    2. Statistics
    3. Programming
    4. ML
    5. Text Mining
    6. Visualization
    7. Big Data
    8. Data Munging
    9. Toolbox

  64. Q & A
    Slides at DotNetDude.net