From Developer to Data Scientist (KCDC 2018)

Due to recent advances in technology, humanity is collecting vast amounts of data at an unprecedented rate, making the skills necessary to mine insights from this data increasingly valuable. So what does it take for a Developer to enter the world of data science?

Join me on a journey into the world of big data and machine learning where we will explore what the work actually looks like, identify which skills are most important, and design a road map for how you too can join this exciting and profitable industry.

Gaines Kergosien

July 12, 2018

Transcript

  1. From Developer to Data Scientist

    A journey into the world of data analysis. Gaines Kergosien (@GainesK)
  2. The Demand

    “I keep saying that the sexy job in the next 10 years will be statisticians. And I’m not kidding.” – Google’s Chief Economist
  3. The Demand

    Worldwide spending on big data and analytics could top $210 billion by 2020 (40% growth from 2017). – IDC
  4. The Demand

    Worldwide revenues for big data and business analytics (BDA) will grow from $130.1 billion in 2016 to more than $203 billion in 2020. – Worldwide Semiannual Big Data and Analytics Spending Guide
  5. The Demand

    “The U.S. economy could be short as many as 250,000 data scientists by 2024.” – McKinsey Global
  6. Big Data

    Volume: • Records • Transactions • Tables & Files
    Variety: • Structured • Unstructured • Semi-structured
  7. The Variety

    Unstructured Text • Books • Blog Posts • Comments • Tweets • Photos • Video • Audio
  8. Big Data

    Volume: • Records • Transactions • Tables & Files
    Variety: • Structured • Unstructured • Semi-structured
    Velocity: • Real Time • Near Time • Batch • Streams
  9. The Velocity

    Twitter: • 6,000 tweets per second • 500 million tweets/day
    Facebook: • 300 million photos/day
    NY Stock Exchange: • captures 1TB of trade information each session
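As a quick sanity check on the velocity figures above, the per-second and per-day Twitter numbers can be compared with a little arithmetic. This is only a sketch: the input figures are the ones quoted on the slide, and the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

# Figures quoted on the slide (not independently verified here).
tweets_per_second = 6_000
photos_per_day = 300_000_000

# 6,000 tweets/second sustained over a day is roughly the quoted 500 million/day.
tweets_per_day = tweets_per_second * SECONDS_PER_DAY

# The Facebook figure expressed as a per-second rate, for comparison.
photos_per_second = photos_per_day / SECONDS_PER_DAY

print(f"{tweets_per_day:,} tweets/day")
print(f"{photos_per_second:,.0f} photos/second")
```

The computed daily total lands slightly above 500 million, so the two Twitter figures are consistent with each other as rounded numbers.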
  10. Big Data

    Volume: • Records • Transactions • Tables & Files
    Variety: • Structured • Unstructured • Semi-structured
    Velocity: • Real Time • Near Time • Batch • Streams
  11. The Skills

    Statistics: • Choose Procedures • Diagnose Problems • Develop Procedures
    Hacking Expertise: • Technical Skills • Creativity
    Subject Matter Expertise: • Values • Goals • Constraints
    Overlaps: • Machine Learning • Traditional Research • Traditional Software • Data Science
  12. The Skills

    Circles: • Subject Matter Expertise • Hacking Expertise • Social Sciences • Statistics
    Overlaps: • Machine Learning • Traditional Software • Data Science • Traditional Research • Holistic Research • Socially Unaware • Domain Unaware • Holistic Software
  13. The Overlap

    • Data Science • Big Data • Big Data Science
    Big Data: • Volume • Variety • Velocity
  14. The Buzzwords

    • Data Lake • Data Mining • Unstructured Data • Dark Data • Fast Data • Edge Analytics • Predictive Analytics • Data Visualization
  15. The Phrase “Data Science”

    Data: • Define • Collect • Store • Explore
    Science: • Hypothesis • Plan Approach • Analysis • Report Results
  16. The Data Scientist

    Data scientists apply sophisticated quantitative and computer science skills to both structure and analyze massive stores or continuous streams of unstructured data, with the intent to derive insights and prescribe action. – Burtch Works
  17. The Job – Core Skills

    • Data Acquisition • Data Cleaning/Transformation • Analytics • Prescribing Actions • Programming/Automation
  18. The Job – Daily Activities

    • Educate the business • Design big data architecture • Look for problems to solve • Research new techniques • Collate data for analysis (ETL)* • Implement algorithms • Present insights
  19. The Maturity – Years of Experience

    Source: https://www.burtchworks.com/files/2014/07/Burtch-Works-Study_DS-2017-final.pdf
  20. The Education

    Source: https://www.burtchworks.com/files/2014/07/Burtch-Works-Study_DS-2017-final.pdf
  21. The US Distribution

    Source: https://www.burtchworks.com/files/2014/07/Burtch-Works-Study_DS-2017-final.pdf
  22. The Areas of Study

    Source: https://www.burtchworks.com/files/2014/07/Burtch-Works-Study_DS-2017-final.pdf
  23. The Industries (2016, 2017)

    Source: https://www.burtchworks.com/files/2014/07/Burtch-Works-Study_DS-2017-final.pdf
  24. The Five Questions

    • Classification – Is this A or B?
    • Anomaly Detection – Is this weird?
    • Regression – How much or how many?
    • Clustering – How is this organized?
    • Reinforcement Learning – What should I do next?
  25. The Tool Trends

    • Python • KNIME • RapidMiner • R • SPSS • SAS • Hadoop
  26. The Top Tools

    • SQL • Excel • Python • R • MySQL
  27. The Languages

    • SPSS • MATLAB • Julia • Kafka/Storm • R • Python • Java/Scala • Stata • SAS
    Source: http://www.kdnuggets.com/2015/05/r-vs-python-data-science.html
  28. The Languages – SAS, Python or R?

    Source: http://www.burtchworks.com/2016/07/13/sas-r-python-survey-2016-tool-analytics-pros-prefer/
  29. The Languages – Trends

    Source: http://www.burtchworks.com/2016/07/13/sas-r-python-survey-2016-tool-analytics-pros-prefer/
  30. The Languages – Industries

    Source: http://www.burtchworks.com/2016/07/13/sas-r-python-survey-2016-tool-analytics-pros-prefer/
  31. The R

    • R Statistical Programming Language • Based on the S programming language • R Development Environment • Statistical and Visual Analysis • Cross-Platform • Free Open Source • Active User Community • Over 9,000 Extension Packages
  32. The Python

    • Created in 1991 to emphasize productivity and code readability • Easier learning curve than R • Free Open Source • Active User Community
  33. The Hadoop Collective

    • Pig • Hive • HBase • Storm • Spark • etc.
  34. The Wrangling – Sampling

    Random: • Simple Random • Stratified • Systematic/Sequential • Cluster
    Nonrandom: • Judgement • Convenience • Snowball
    Figures: • Stratified Sampling • Sequential Sampling
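To make three of the random sampling strategies above concrete, here is a minimal sketch using only the Python standard library. The function names, the `segment` field, and the 20% sampling fraction are illustrative assumptions, not something taken from the slides.

```python
import random
from collections import defaultdict


def simple_random_sample(records, n, seed=0):
    """Simple random: every record has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(records, n)


def systematic_sample(records, step, start=0):
    """Systematic/sequential: take every `step`-th record from `start`."""
    return records[start::step]


def stratified_sample(records, key, fraction, seed=0):
    """Stratified: sample the same fraction from each group (stratum)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical data: 100 records, 25 "paid" and 75 "free".
records = [{"id": i, "segment": "paid" if i % 4 == 0 else "free"}
           for i in range(100)]

random_pick = simple_random_sample(records, 10)
every_tenth = systematic_sample(records, step=10)
by_segment = stratified_sample(records, key=lambda r: r["segment"], fraction=0.2)
```

The stratified version preserves the 1:3 paid-to-free ratio in the sample, which a small simple random sample is not guaranteed to do.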
  35. The Wrangling – Data Reconciliation

    • Discard • Infer
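The two reconciliation options above can be sketched for a column with missing values. The slide only says "Infer"; filling gaps with the mean of the observed values is one common choice and is an assumption here, as are the sample readings and function names.

```python
from statistics import mean

# Hypothetical sensor readings; None marks a missing value.
readings = [21.5, None, 22.0, 23.5, None, 21.0]


def discard_missing(values):
    """Discard: drop records that have no value."""
    return [v for v in values if v is not None]


def infer_missing(values):
    """Infer: replace each gap with the mean of the observed values.

    Mean imputation is an assumption; the slide does not name a method.
    """
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


kept = discard_missing(readings)
filled = infer_missing(readings)
```

Discarding shrinks the data set but adds no invented values, while inferring keeps every record at the cost of introducing estimates.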
  36. The Wrangling

    Normalize Numeric Values: • Standard Unit of Measure • Subtract Average (Mean = 0) • Divide by Standard Deviation
    Reduce Dimensionality: • Irrelevant Input Variables • Redundant Input Variables
    Add Derivative Values: • Generalize Attributes • Discretize Attributes to Categories • Binarize Categorical Attributes
    Design Training Data: • Select • Combine • Aggregate
    Power and Log transformation: • Approximate Normal Distribution
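A few of the wrangling steps above can be sketched in plain Python: subtracting the mean and dividing by the standard deviation, discretizing a number into a category, and binarizing a category. The function names, bin edges, and sample values are illustrative assumptions; the population standard deviation is used, and the input is assumed not to be constant.

```python
from statistics import mean, pstdev


def standardize(values):
    """Subtract the mean, then divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumes the values are not all identical
    return [(v - mu) / sigma for v in values]


def discretize(value, edges, labels):
    """Map a number to the first category whose upper edge exceeds it."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]  # at or above the last edge


def binarize(category, categories):
    """Turn one categorical value into 0/1 indicator columns."""
    return [1 if category == c else 0 for c in categories]


ages = [22, 35, 47, 58, 63]  # hypothetical values
z_scores = standardize(ages)
age_band = discretize(35, edges=[30, 50], labels=["young", "middle", "senior"])
tool_flags = binarize("R", ["Python", "R", "SAS"])
```

After standardizing, the values have a mean of 0 and a standard deviation of 1, which puts attributes measured on different scales on a comparable footing.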
  37. The Math

    • basic statistics (e.g., p-value) • statistical modeling • statistical tests • experiment design • distributions • maximum likelihood estimators • probability theory • linear algebra • multivariable calculus
  38. The Visualization Tools

    • Tableau (enterprise visualization products) - www.tableau.com
    • ggvis (R visualization package) - ggvis.rstudio.com
    • ggplot (plotting system) - ggplot.yhathq.com
    • D3.js (declarative DOM manipulation) - d3js.org
    • Vega (visualization grammar) - trifacta.github.com/vega
    • Rickshaw (charting library) - code.shutterstock.com/rickshaw
    • Modest Maps (map library) - modestmaps.com
    • Chart.js (plotting library) - www.chartjs.org
  39. Machine Learning

    “Machine Learning is the science of getting computers to learn and act like humans do, and improve their learning over time in autonomous fashion, by feeding them data and information in the form of observations and real-world interactions.”
    • Representation • Evaluation • Optimization
  40. Artificial Intelligence

    • Narrow AI • General AI
  41. The Cloud - MLaaS & AIaaS

    • IBM Watson • Microsoft Azure Machine Learning API • Google Prediction API • Amazon Machine Learning API • BigML
  42. The Machine Learning

    Concepts: • k-nearest neighbors • random forests • ensemble methods • …use Python libraries!
    Tools: • Weka - www.cs.waikato.ac.nz/ml/weka/
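To give a feel for the first concept above, here is a minimal k-nearest neighbors classifier written from scratch with the standard library. It is a teaching sketch under simple assumptions (Euclidean distance, majority vote, a tiny made-up training set); as the slide suggests, real work would normally lean on an established Python library instead.

```python
from collections import Counter
from math import dist  # Euclidean distance, available since Python 3.8


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k closest training points."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training data: (features, label) pairs in two clusters.
train = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((0.9, 1.3), "A"),
    ((4.0, 4.2), "B"),
    ((4.1, 3.9), "B"),
    ((3.8, 4.0), "B"),
]

near_a = knn_predict(train, (1.1, 1.0))
near_b = knn_predict(train, (4.0, 4.0))
```

This answers the classification question "Is this A or B?" without any training step: the stored examples are the model, and each prediction is a distance lookup.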
  43. The Results

    • Report • Presentation • Demo • Prototype • Component
  44. The Skills

    • Data Analyst (A) • Data Engineer (B) • Academic (Ab) • Generalist (AB)
  45. The Training

    • Coursera - www.coursera.org • edX - www.edx.org • Udacity - www.udacity.com • Kaggle - www.kaggle.com • YouTube - projects.iq.harvard.edu/stat110/youtube • Boot Camps
  46. The Path

    1. Fundamentals 2. Statistics 3. Programming 4. ML 5. Text Mining 6. Visualization 7. Big Data 8. Data Munging 9. Toolbox
  47. Q & A

    Slides at DotNetDude.net
    Recap figures: The Skills (Venn diagram) • The Overlap (Big Data, Data Science, Big Data Science)