Slide 1

Slide 1 text

Microsoft Dharmesh Kakadia Microsoft

Slide 2

Slide 2 text

• Interested in bringing AI applications to the masses
• Sr. Applied Scientist with Microsoft
• Worked across Office and Azure products
• Prior to that, spent a couple of years at Microsoft Research
• Among other things, author of “Apache Mesos Essentials”
• Opinions are mine and not Microsoft’s
• You can find me as @dharmeshkakadia everywhere

Slide 3

Slide 3 text

Microsoft

Slide 4

Slide 4 text

Problem + Data Magic = Solution

Slide 5

Slide 5 text

Microsoft

Slide 6

Slide 6 text

Problem + Data Science (not Magic) = Solution

Slide 7

Slide 7 text

“Any sufficiently advanced technology is indistinguishable from magic” - Arthur C. Clarke

Slide 8

Slide 8 text

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.
https://en.wikipedia.org/wiki/Data_science
Diagram labels: Software Development, Domain Knowledge, Statistics

Slide 9

Slide 9 text

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.
https://xkcd.com/1838/
Diagram labels: Software Development, Domain Knowledge, Statistics

Slide 10

Slide 10 text

Microsoft

Slide 11

Slide 11 text

Wide range of applications in industry
• Classification (spam or not spam)
• Recommendations (Amazon and Netflix recommendations)
• Pattern detection and grouping
• Anomaly detection (fraud detection)
• Recognition (image, text, audio, video, facial)
• Actionable insights (via dashboards, reports, visualizations, …)
• Scoring and ranking (credit score)
• Optimization (risk management)
• Forecasts (sales and revenue)

Slide 12

Slide 12 text

Microsoft

Slide 13

Slide 13 text

• Understanding the problem
• Data gathering and augmentation
• Data cleaning
• Data exploration
• Feature engineering
• Modeling and experimentation
• Data visualization and actionable insights
https://www.pinterest.com/pin/607704543450983191/

Slide 14

Slide 14 text

Can data science help with this problem?
• Conclusive
• Measurable
• Specific
http://radar.oreilly.com/2013/04/why-why-why.html
What kind of data science problem is this?
• Classification
• Regression
• Time series prediction
• …

Slide 15

Slide 15 text

• Collect relevant data from the product
• Find applicable public datasets
• Domain-specific datasets
• Knowledge bases
• Wikipedia
• Research datasets
• Search results
• Entity linking
• Third-party data vendors
https://www.alamy.com/stock-image-business-man-digging-a-hole-in-the-ground-to-search-for-useful-information-166716913.html

Slide 16

Slide 16 text

• Less attractive and sometimes frustrating, but a crucial step
• Verify frequency distributions
• Remove outliers
• Remove NULLs
• Add/replace default values
• Uniform formatting
• Type conversion
• Domain-specific cleaning approaches
• Space removal
• Image segmentation
https://www.pinterest.com/pin/607704543450983191/
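A minimal sketch of a few of these cleaning steps (space removal, type conversion, NULL handling, outlier removal), using only the Python standard library. The records, the `clean` and `remove_outliers` helpers, and the 1.5 × IQR cutoff are illustrative assumptions, not something prescribed by the talk.

```python
from statistics import quantiles

# Hypothetical raw records: (user_id, session length in minutes, stored as text).
raw = [
    ("u1", " 12.5 "),
    ("u2", None),      # missing value
    ("u3", "15"),
    ("u4", "14.0"),
    ("u5", "13"),
    ("u6", "900"),     # implausibly long session
    ("u7", "11"),
    ("u8", "16"),
]

def clean(records, default=None):
    """Strip spaces and convert to float; drop missing values or fill a default."""
    cleaned = []
    for user_id, value in records:
        if value is None or not value.strip():
            if default is None:
                continue                      # remove NULLs
            cleaned.append((user_id, float(default)))
        else:
            cleaned.append((user_id, float(value.strip())))
    return cleaned

def remove_outliers(records, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (an assumed rule of thumb)."""
    values = [v for _, v in records]
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(u, v) for u, v in records if low <= v <= high]

rows = remove_outliers(clean(raw))
print(rows)
```

Whether a missing value should be dropped or replaced by a default depends on the problem; `clean(raw, default=0)` shows the second option.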

Slide 17

Slide 17 text

• Understand what is in a dataset and the characteristics of the data by using statistical techniques
• Focus on characterizations such as
  • Size, quantity, accuracy
  • Frequency distributions
  • Variance
  • Ranges of data
• Visual exploration (boxplot, histogram, …)
https://www.pinterest.com/pin/607704543450983191/
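The characterizations above can be sketched with the Python standard library. The `describe` and `histogram` helpers and the sample values are hypothetical; in practice a notebook with a plotting library would usually replace the text histogram.

```python
from collections import Counter
from statistics import mean, median, pvariance

def describe(values):
    """Basic characterization of one numeric column."""
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "variance": pvariance(values),
        "min": min(values),
        "max": max(values),
    }

def histogram(values, bin_width):
    """Frequency distribution: lower edge of each bin -> number of values in it."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical session lengths in minutes.
session_minutes = [11, 12, 13, 14, 15, 16, 24, 31]

summary = describe(session_minutes)
bins = histogram(session_minutes, bin_width=10)

print(summary)
for edge, n in bins.items():
    # Crude text histogram, one '#' per observation.
    print(f"{edge:>3}-{edge + 9:<3} {'#' * n}")
```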

Slide 18

Slide 18 text

• Process of using domain knowledge to extract features from raw data via data mining techniques
• These features can be used to improve the performance of machine learning algorithms
• A lot more artsy and domain-knowledge dependent
• Better features lead to
  • More flexibility in the modeling
  • Less complex models
  • Better accuracy
https://jyu-theartofml.github.io/posts/feature_eng
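As a rough illustration, raw per-session logs can be turned into per-user features such as average session time and quiz completion rate. The log layout, the `build_features` helper, and the feature names are assumptions made for this sketch.

```python
from collections import defaultdict

# Hypothetical raw log: (user_id, session_minutes, quizzes_started, quizzes_completed).
sessions = [
    ("u1", 10.0, 2, 1),
    ("u1", 20.0, 2, 2),
    ("u2", 5.0, 1, 0),
    ("u3", 30.0, 0, 0),
]

def build_features(rows):
    """Aggregate raw sessions into one feature dictionary per user."""
    totals = defaultdict(
        lambda: {"minutes": 0.0, "sessions": 0, "started": 0, "completed": 0}
    )
    for user, minutes, started, completed in rows:
        t = totals[user]
        t["minutes"] += minutes
        t["sessions"] += 1
        t["started"] += started
        t["completed"] += completed

    features = {}
    for user, t in totals.items():
        features[user] = {
            "avg_session_minutes": t["minutes"] / t["sessions"],
            # Users who never started a quiz get a rate of 0.0 (a modeling choice).
            "quiz_completion_rate": (
                t["completed"] / t["started"] if t["started"] else 0.0
            ),
        }
    return features

features = build_features(sessions)
print(features["u1"])
```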

Slide 19

Slide 19 text

• Trying out different algorithms and techniques
• Measuring accuracy
• Repeating and refining
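The try, measure, repeat loop can be sketched without any ML library. The two candidate "models" below (a majority-class baseline and a single learned threshold), the synthetic data, and the modulo-based train/test split are all assumptions chosen to keep the example small and deterministic; a real project would typically use a library such as scikit-learn.

```python
def accuracy(model, data):
    """Fraction of examples for which the model predicts the true label."""
    return sum(model(x) == y for x, y in data) / len(data)

def fit_majority(train):
    """Baseline: always predict the most frequent training label."""
    labels = [y for _, y in train]
    most_common = max(set(labels), key=labels.count)
    return lambda x: most_common

def fit_threshold(train):
    """Predict 1 when x >= t, with t chosen to maximize training accuracy."""
    candidates = sorted({x for x, _ in train})
    best = max(candidates, key=lambda t: accuracy(lambda x: int(x >= t), train))
    return lambda x: int(x >= best)

# Hypothetical data: x = average session minutes, y = 1 if the user was retained.
data = [(x, int(x >= 30)) for x in range(60)]
train = [(x, y) for x, y in data if x % 4 != 0]   # 75% for fitting
test = [(x, y) for x, y in data if x % 4 == 0]    # 25% held out for measuring

candidates = {"majority baseline": fit_majority, "threshold": fit_threshold}
results = {name: accuracy(fit(train), test) for name, fit in candidates.items()}

for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

Measuring on held-out data rather than on the training data is what makes the comparison between candidates meaningful; each refinement adds or adjusts a candidate and reruns the same measurement.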

Slide 20

Slide 20 text

• Data visualization is the graphical representation of information and data
• The goal is to engage all the stakeholders, convey the data scientists' findings, and recommend a solution
• Sometimes extended to monitor the impact once the solution is delivered
https://hbr.org/2016/06/visualizations-that-really-work

Slide 21

Slide 21 text

Microsoft

Slide 22

Slide 22 text

• Problem: Users are not continuing to use our e-learning app
• Data gathering:
  • Feature usage
  • Engagement with features
  • Time spent…
• Data cleaning: Remove outliers
• Exploration: User segmentation
• Feature engineering: Avg. session time, quiz completion rate
• Modeling: What features predict user retention?
• Visualization & presentation: Sorted list of features contributing to retention (correlation coefficient)
• Action: Get users to use these 3 features within the first 2 weeks of signup
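The "sorted list of features contributing to retention" could be produced roughly as follows, using the Pearson correlation coefficient between each feature and a 0/1 retention label. All values are made up for illustration, and `profile_photo_uploaded` is a hypothetical feature that does not come from the talk.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical per-user feature values (one entry per user).
features = {
    "avg_session_minutes": [5, 10, 15, 20, 25, 30],
    "quiz_completion_rate": [0.1, 0.2, 0.8, 0.9, 0.7, 1.0],
    "profile_photo_uploaded": [1, 0, 1, 0, 1, 0],
}
retained = [0, 0, 1, 1, 1, 1]   # 1 = user was still active after the period

# Rank features by the strength of their correlation with retention.
ranking = sorted(
    ((name, pearson(values, retained)) for name, values in features.items()),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)

for name, r in ranking:
    print(f"{name:<24} {r:+.2f}")
```

A high coefficient only says that a feature moves together with retention; whether nudging users toward that feature actually improves retention still has to be tested.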

Slide 23

Slide 23 text

• Problem: Meetings, user research
• Data gathering: Telemetry, databases, public datasets, …
• Data cleaning: Python, R, MATLAB, …
• Exploration: Jupyter notebooks, SQL
• Feature engineering: Spark, Pandas
• Modeling: TensorFlow, PyTorch, Spark, Pandas, scikit-learn, …
• Visualization & presentation: Tableau, PowerPoint, dashboards, …
• Action: New product features

Slide 24

Slide 24 text

Microsoft

Slide 25

Slide 25 text

• Courses in math, programming, AI, databases, …
• Online courses focused on data science and machine learning
• Books
• Contribute to open source projects (Google Summer of Code is a good example)
• Online communities (Reddit, Stack Overflow, …)
• Hackathons
• Participate in solving real-world problems on Kaggle

Slide 26

Slide 26 text

Microsoft

Slide 27

Slide 27 text

Microsoft