Data Science : Raw Data to Actionable Insights

Invited talk to students as part of "Futuristic Technologies in Computer Science" by the Computer Society of India, October 2020.


October 10, 2020


  1. Microsoft • Sr. Applied Scientist with Microsoft • Interested in bringing AI applications to the masses • Worked across Office and Azure products • Prior to that, spent a couple of years at Microsoft Research • Among other things, author of “Apache Mesos Essentials” • Opinions are mine and not Microsoft’s • You can find me as @dharmeshkakadia everywhere
  2. Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. https://en.wikipedia.org/wiki/Data_science • Software Development • Domain Knowledge • Statistics
  3. Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. https://xkcd.com/1838/ • Software Development • Domain Knowledge • Statistics
  4. Wide range of applications in industry • Classification (spam or not spam) • Recommendations (Amazon and Netflix recommendations) • Pattern detection and grouping • Anomaly detection (fraud detection) • Recognition (image, text, audio, video, facial) • Actionable insights (via dashboards, reports, visualizations, …) • Scoring and ranking (credit score) • Optimization (risk management) • Forecasts (sales and revenue)
  5. Understanding the problem • Data gathering and augmentation • Data cleaning • Data exploration • Feature engineering • Modeling and experimentation • Data visualization and actionable insights https://www.pinterest.com/pin/607704543450983191/
  6. Can data science help with this problem? • Conclusive • Measurable • Specific http://radar.oreilly.com/2013/04/why-why-why.html What kind of data science problem is this? • Classification • Regression • Time series prediction • …
  7. Collect relevant data from the product • Find applicable public datasets • Domain-specific datasets • Knowledge bases • Wikipedia • Research datasets • Search results • Entity linking • Third-party data vendors https://www.alamy.com/stock-image-business-man-digging-a-hole-in-the-ground-to-search-for-useful-information-166716913.html
  8. Less attractive and sometimes frustrating, but a crucial step • Verify frequency distributions • Remove outliers • Remove NULLs • Add/replace default values • Uniform formatting • Type conversion • Domain-specific cleaning approaches • Space removal • Image segmentation https://www.pinterest.com/pin/607704543450983191/
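
A minimal sketch of a few of the cleaning steps listed on slide 8 (space removal, handling NULLs, type conversion, outlier removal), using only the Python standard library. The `raw` values, the helper names `clean` and `remove_outliers`, and the 1.5 × IQR outlier rule are illustrative assumptions and do not come from the talk.

```python
from statistics import quantiles

# Hypothetical raw export: session length in minutes, stored as messy strings.
raw = [" 12", "15 ", None, "14", "", "13", "400", "16", "11"]


def clean(values, default=None):
    """Strip whitespace, drop (or default) missing values, convert to float."""
    cleaned = []
    for value in values:
        if value is None or value.strip() == "":
            # Missing value: substitute the default if one was given, else drop it.
            if default is not None:
                cleaned.append(float(default))
            continue
        cleaned.append(float(value.strip()))
    return cleaned


def remove_outliers(values, k=1.5):
    """Keep values within k * IQR of the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


numbers = clean(raw)
trimmed = remove_outliers(numbers)
print(trimmed)
```

With this sample, the two missing entries are dropped and the 400-minute session falls outside the IQR fence, so it is removed. Whether an extreme value is an error or a genuine signal depends on the domain, which is why the slide lists domain-specific cleaning separately.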
  9. Understand what is in a dataset and the characteristics of the data by using statistical techniques • Focus on characterizations such as: size, quantity, accuracy • Frequency distributions • Variance • Ranges of data • Visual exploration (boxplot, histogram, …) https://www.pinterest.com/pin/607704543450983191/
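
The characterizations on slide 9 (size, range, variance, frequency distribution) can be sketched in a few lines. This is an illustrative example under assumed data: the `scores` list and the helper names `summarize` and `frequency_by_bucket` are hypothetical, and a crude text histogram stands in for the boxplots and histograms the slide mentions.

```python
from collections import Counter
from statistics import mean, variance

# Hypothetical dataset: quiz scores out of 100.
scores = [55, 62, 62, 70, 71, 71, 71, 85, 90, 98]


def summarize(values):
    """Basic characterization: size, range, central tendency, and spread."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "variance": variance(values),  # sample variance
    }


def frequency_by_bucket(values, width=10):
    """Frequency distribution: count values per bucket of the given width."""
    buckets = Counter((v // width) * width for v in values)
    return dict(sorted(buckets.items()))


summary = summarize(scores)
distribution = frequency_by_bucket(scores)

print(summary)
# Text histogram: one '#' per value in each bucket.
for bucket, count in distribution.items():
    print(f"{bucket:>3}+ {'#' * count}")
```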
  10. Process of using domain knowledge to extract features from raw data via data mining techniques • These features can be used to improve the performance of machine learning algorithms • A lot more artsy and domain-knowledge dependent • Better features lead to: more flexibility in the modeling • less complex models • better accuracy https://jyu-theartofml.github.io/posts/feature_eng
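
As a hedged illustration of turning raw data into features, the sketch below derives two per-user features of the kind the deck's e-learning example names (average session time and quiz completion rate) from a raw event log. The `events` layout, the field order, and the function name `build_features` are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical raw events, one row per session:
# (user_id, session_minutes, quizzes_started, quizzes_completed)
events = [
    ("u1", 30, 2, 2),
    ("u1", 20, 1, 0),
    ("u2", 5, 1, 0),
    ("u2", 15, 0, 0),
]


def build_features(rows):
    """Aggregate raw session rows into one feature record per user."""
    totals = defaultdict(
        lambda: {"sessions": 0, "minutes": 0, "started": 0, "completed": 0}
    )
    for user, minutes, started, completed in rows:
        t = totals[user]
        t["sessions"] += 1
        t["minutes"] += minutes
        t["started"] += started
        t["completed"] += completed

    features = {}
    for user, t in totals.items():
        features[user] = {
            "avg_session_minutes": t["minutes"] / t["sessions"],
            # Guard against users who never started a quiz.
            "quiz_completion_rate": (
                t["completed"] / t["started"] if t["started"] else 0.0
            ),
        }
    return features


features = build_features(events)
for user, record in features.items():
    print(user, record)
```

Choosing a ratio (completion rate) instead of a raw count is the domain-knowledge part: it separates how engaged a user is from how long they have been around.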
  11. Data visualization is the graphical representation of information and data • Goal is to engage with all the stakeholders, convey the findings of the data scientists, and recommend a solution • Sometimes extended to monitor the impact once the solution is delivered https://hbr.org/2016/06/visualizations-that-really-work
  12. Problem: Users are not continuing to use our e-learning app • Data gathering: Feature usage, engagement with features, time spent, … • Data cleaning: Remove outliers • Exploration: User segmentation • Feature engineering: Avg. session time, quiz completion rate • Modeling: What features predict user retention? • Visualization & Presentation: Sorted list of features contributing to retention (correlation coefficient) • Action: Get people to use these 3 features within the first 2 weeks of signup
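
The "sorted list of features contributing to retention (correlation coefficient)" from slide 12 could be produced roughly as follows. This is a sketch under stated assumptions: the per-user values, the `retained` labels, and the third feature `profile_photo_uploaded` are made up for illustration, and Pearson correlation is one reasonable reading of "correlation coefficient", not necessarily the method used in the talk.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical per-user feature columns (six users, same order in every list).
feature_columns = {
    "avg_session_minutes": [25, 10, 30, 5, 20, 8],
    "quiz_completion_rate": [0.9, 0.2, 0.8, 0.1, 0.7, 0.3],
    "profile_photo_uploaded": [1, 0, 0, 1, 1, 0],
}
# Hypothetical label: 1 if the user was still active after the observation window.
retained = [1, 0, 1, 0, 1, 0]

# Rank features by the strength (absolute value) of their correlation with retention.
ranking = sorted(
    ((name, pearson(values, retained)) for name, values in feature_columns.items()),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)

for name, r in ranking:
    print(f"{name:<24} r = {r:+.3f}")
```

A high correlation only flags a feature as a candidate. Whether nudging users toward it actually improves retention is something the "Action" step has to test, for example with an experiment.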
  13. Problem: Meetings, user research • Data gathering: Telemetry, databases, public datasets, … • Data cleaning: Python, R, MATLAB, … • Exploration: Jupyter notebooks, SQL, … • Feature engineering: Spark, Pandas • Modeling: TensorFlow, PyTorch, Spark, Pandas, scikit-learn, … • Visualization & Presentation: Tableau, PowerPoint, dashboards, … • Action: New product features, …
  14. Courses in Math, Programming, AI, Databases, … • Online courses focused on data science and machine learning • Books • Contribute to open source projects (Google Summer of Code is a good example) • Online communities (Reddit, Stack Overflow, …) • Hackathons • Participate in solving real-world problems on Kaggle