Data Science: Raw Data to Actionable Insights

Invited talk to students, as part of "Futuristic Technologies in Computer Science" by the Computer Society of India, October 2020.

dharmeshkakadia

October 10, 2020

Transcript

  1. Dharmesh Kakadia, Microsoft

  2. • Interested in bringing AI applications to the masses
    • Sr. Applied Scientist at Microsoft
    • Worked across Office and Azure products
    • Before that, spent a couple of years at Microsoft Research
    • Among other things, author of “Apache Mesos Essentials”
    • Opinions are mine and not Microsoft’s
    • You can find me as @dharmeshkakadia everywhere
  4. Problem + Data Magic = Solution

  6. Problem + Data Science (not Magic) = Solution

  7. “Any sufficiently advanced technology is indistinguishable from magic” - Arthur C. Clarke
  8. Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. https://en.wikipedia.org/wiki/Data_science
    Diagram labels: Software Development, Domain Knowledge, Statistics
  9. Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. https://xkcd.com/1838/
    Diagram labels: Software Development, Domain Knowledge, Statistics
  11. Wide range of applications in industry
    • Classification (spam or not spam)
    • Recommendations (Amazon and Netflix recommendations)
    • Pattern detection and grouping
    • Anomaly detection (fraud detection)
    • Recognition (image, text, audio, video, facial)
    • Actionable insights (via dashboards, reports, visualizations, …)
    • Scoring and ranking (credit score)
    • Optimization (risk management)
    • Forecasts (sales and revenue)
  13. • Understanding the problem
    • Data gathering and augmentation
    • Data cleaning
    • Data exploration
    • Feature engineering
    • Modeling and experimentation
    • Data visualization and actionable insights
    https://www.pinterest.com/pin/607704543450983191/
  14. Can data science help with this problem?
    • Conclusive
    • Measurable
    • Specific
    http://radar.oreilly.com/2013/04/why-why-why.html
    What kind of data science problem is this?
    • Classification
    • Regression
    • Time series prediction
    • …
  15. • Collect relevant data from the product
    • Find applicable public datasets
    • Domain-specific datasets
    • Knowledge bases
    • Wikipedia
    • Research datasets
    • Search results
    • Entity linking
    • 3rd party data vendors
    https://www.alamy.com/stock-image-business-man-digging-a-hole-in-the-ground-to-search-for-useful-information-166716913.html
  16. • Less attractive and sometimes frustrating, but a crucial step
    • Verify frequency distributions
    • Remove outliers
    • Remove NULLs
    • Add/replace default values
    • Uniform formatting
    • Type conversion
    • Domain-specific cleaning approaches
    • Space removal
    • Image segmentation
    https://www.pinterest.com/pin/607704543450983191/
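
Several of the cleaning steps on this slide can be sketched in a few lines of plain Python. This is only an illustration: the row layout `(user_id, minutes, country)`, the default value `"unknown"`, and the 1.5 x IQR outlier rule are assumptions made for the example, not something the talk prescribes.

```python
from statistics import quantiles


def clean_sessions(rows, default_country="unknown"):
    """Clean raw (user_id, minutes, country) rows; the field names are hypothetical."""
    cleaned = []
    for user_id, minutes, country in rows:
        # Remove NULLs: drop rows missing a required field.
        if user_id is None or minutes is None:
            continue
        try:
            # Space removal and type conversion in one step.
            minutes = float(str(minutes).strip())
        except ValueError:
            continue  # not a number, so the row cannot be used
        # Add a default value, then apply uniform formatting.
        country = (country or default_country).strip().lower()
        cleaned.append((str(user_id).strip(), minutes, country))
    return cleaned


def remove_outliers(rows, k=1.5):
    """Drop rows whose minutes fall outside the k * IQR fences (an assumed rule)."""
    values = [row[1] for row in rows]
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [row for row in rows if low <= row[1] <= high]
```

In practice the right cleaning rules depend on the domain; the interquartile-range fence here is one common default and can be swapped for any other criterion.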
  17. • Understand what is in a dataset and the characteristics of the data by using statistical techniques
    • Focus on characterizations such as:
      • Size, quantity, accuracy
      • Frequency distributions
      • Variance
      • Ranges of data
      • Visual exploration (boxplot, histogram, …)
    https://www.pinterest.com/pin/607704543450983191/
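
The characterizations listed on this slide (size, range, variance, frequency distribution) map directly onto a few standard-library calls. The sketch below is illustrative only; the function names `describe` and `frequency` and the choice of fixed-width bins are assumptions for the example.

```python
from collections import Counter
from statistics import mean, pvariance


def describe(values):
    """Basic size, range, and spread of a numeric column."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "variance": pvariance(values),  # population variance
    }


def frequency(values, bin_width):
    """Frequency distribution: number of values per bin, keyed by the bin's lower edge."""
    return Counter((v // bin_width) * bin_width for v in values)


if __name__ == "__main__":
    minutes = [5, 12, 18, 25, 25, 41]  # made-up session lengths
    print(describe(minutes))
    # A text histogram stands in for the visual exploration step.
    for edge, count in sorted(frequency(minutes, 10).items()):
        print(f"{edge:>3}-{edge + 9:<3} {'#' * count}")
```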
  18. • Process of using domain knowledge to extract features from raw data via data mining techniques
    • These features can be used to improve the performance of machine learning algorithms
    • More of an art, and heavily dependent on domain knowledge
    • Better features lead to:
      • More flexibility in the modeling
      • Less complex models
      • Better accuracy
    https://jyu-theartofml.github.io/posts/feature_eng
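
As a small illustration of turning raw data into features, the sketch below derives the two features named in the deck's e-learning example (average session time and quiz completion rate) from per-session events. The event keys `user`, `session_minutes`, `quizzes_started`, and `quizzes_completed` are hypothetical; real telemetry would look different.

```python
from collections import defaultdict


def build_features(events):
    """Aggregate raw per-session events into one feature row per user.

    Each event is a dict with hypothetical keys:
    user, session_minutes, quizzes_started, quizzes_completed.
    """
    totals = defaultdict(lambda: {"sessions": 0, "minutes": 0.0, "started": 0, "completed": 0})
    for event in events:
        t = totals[event["user"]]
        t["sessions"] += 1
        t["minutes"] += event["session_minutes"]
        t["started"] += event["quizzes_started"]
        t["completed"] += event["quizzes_completed"]

    features = {}
    for user, t in totals.items():
        features[user] = {
            "avg_session_minutes": t["minutes"] / t["sessions"],
            # Guard against users who never started a quiz.
            "quiz_completion_rate": t["completed"] / t["started"] if t["started"] else 0.0,
        }
    return features
```

Choosing a ratio such as completion rate, rather than a raw count, is the kind of domain-driven decision the slide refers to: it makes heavy and light users comparable.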
  19. • Trying out different algorithms and techniques
    • Measuring accuracy
    • Repeat and refine
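
The try, measure, repeat loop can be sketched without any ML library. The two "algorithms" below (a majority-class baseline and a single-feature threshold rule) are deliberately tiny stand-ins chosen for the example, not models from the talk; in practice the candidates would come from a library such as scikit-learn.

```python
def accuracy(model, rows):
    """Fraction of (x, label) rows the model predicts correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)


def fit_majority(train):
    """Baseline: always predict the most common training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


def fit_threshold(train):
    """Predict 1 when x >= t, choosing the t with the best training accuracy."""
    best_score, best_model = -1.0, None
    for t in sorted({x for x, _ in train}):
        model = lambda x, t=t: int(x >= t)  # bind t now, not at call time
        score = accuracy(model, train)
        if score > best_score:
            best_score, best_model = score, model
    return best_model


def compare(candidates, train, test):
    """Fit every candidate on train and report its accuracy on held-out test rows."""
    return {name: accuracy(fit(train), test) for name, fit in candidates.items()}
```

Measuring on held-out rows rather than on the training data is what makes the comparison meaningful; the loop is then repeated with new features or candidates.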
  20. • Data visualization is the graphical representation of information and data
    • The goal is to engage all the stakeholders, convey the data scientists' findings, and recommend a solution
    • Sometimes extended to monitor the impact once the solution is delivered
    https://hbr.org/2016/06/visualizations-that-really-work
  22. • Problem: Users are not continuing to use our e-learning app
    • Data gathering:
      • Feature usage
      • Engagement with features
      • Time spent…
    • Data cleaning: Remove outliers
    • Exploration: User segmentation
    • Feature engineering: Avg. session time, quiz completion rate
    • Modeling: What features predict user retention?
    • Visualization & presentation: Sorted list of features contributing to retention (correlation coefficient)
    • Action: Get people to use these 3 features within the first 2 weeks of signup
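
The "sorted list of features by correlation coefficient" from this example could be produced along the following lines. This is a sketch under assumptions: it uses the Pearson coefficient (the slide does not say which coefficient was used), and the feature names and numbers are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def rank_features(features, retained):
    """Sort features by the strength of their correlation with retention."""
    scores = {name: pearson(values, retained) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    retained = [1, 1, 0, 0, 1, 0]  # 1 = user still active (made-up data)
    features = {
        "quiz_completion_rate": [0.9, 0.8, 0.2, 0.1, 0.7, 0.3],
        "avg_session_minutes": [30, 12, 25, 10, 18, 20],
    }
    for name, score in rank_features(features, retained):
        print(f"{name}: {score:+.2f}")
```

A high coefficient shows association with retention, not that the feature causes it, so the resulting action is best validated with an experiment.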
  23. • Problem: Meetings, user research
    • Data gathering: Telemetry, databases, public datasets, …
    • Data cleaning: Python, R, Matlab, …
    • Exploration: Jupyter notebooks, SQL
    • Feature engineering: Spark, Pandas
    • Modeling: TensorFlow, PyTorch, Spark, Pandas, scikit-learn, …
    • Visualization & presentation: Tableau, PowerPoint, dashboards, …
    • Action: New product features, …
  25. • Courses in math, programming, AI, databases, …
    • Online courses focused on data science and machine learning
    • Books
    • Contribute to open source projects (Google Summer of Code is a good example)
    • Online communities (Reddit, Stack Overflow, …)
    • Hackathons
    • Participate in solving real-world problems on Kaggle
  26. Microsoft

  27. Microsoft