Sr. Applied Scientist at Microsoft • Worked across Office and Azure products • Prior to that, spent a couple of years at Microsoft Research • Among other things, author of “Apache Mesos Essentials” • Opinions are mine and not Microsoft’s • You can find me as @dharmeshkakadia everywhere
scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. https://en.wikipedia.org/wiki/Data_science https://xkcd.com/1838/ • Software Development • Domain Knowledge • Statistics
• Data cleaning • Data exploration • Feature Engineering • Modeling and experimentation • Data visualization and actionable insights https://www.pinterest.com/pin/607704543450983191/
• Measurable • Specific http://radar.oreilly.com/2013/04/why-why-why.html What kind of data science problem is this? • Classification • Regression • Time series prediction …
applicable public datasets • Domain specific datasets • Knowledge bases • Wikipedia • Research datasets • Search results • Entity linking • 3rd party data vendors https://www.alamy.com/stock-image-business-man-digging-a-hole-in-the-ground-to-search-for-useful-information-166716913.html
characteristics of the data by using statistical techniques • Focus on characterizations such as • size, quantity, accuracy • Frequency distributions • Variance • Ranges of data • Visual exploration (Boxplot, histogram, …) https://www.pinterest.com/pin/607704543450983191/
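The characterizations listed above can be sketched with Python's standard library alone. This is a minimal illustration, not a prescribed workflow: the `session_minutes` sample, the `describe` and `frequency_distribution` helpers, and the bin width of 10 are all assumptions made up for the example.

```python
import statistics
from collections import Counter

# Hypothetical sample: minutes spent per session (illustrative values only).
session_minutes = [12, 15, 9, 30, 15, 22, 15, 8, 45, 12]


def describe(values):
    """Return basic characterizations of a numeric sample."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "size": len(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "mean": statistics.mean(values),
        "variance": statistics.variance(values),  # sample variance
        "quartiles": (q1, median, q3),  # the edges and midline of a boxplot's box
    }


def frequency_distribution(values, bin_width):
    """Count values per fixed-width bin, keyed by the bin's lower edge."""
    return Counter((v // bin_width) * bin_width for v in values)


summary = describe(session_minutes)
histogram = frequency_distribution(session_minutes, bin_width=10)

for name, value in summary.items():
    print(f"{name}: {value}")

# Text histogram: one '#' per observation in each bin.
for lower in sorted(histogram):
    print(f"{lower:>3}-{lower + 9:<3} {'#' * histogram[lower]}")
```

A plotting library would normally take over for the boxplot and histogram; the text rendering here only keeps the sketch dependency-free.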
from raw data via data mining techniques • They can be used to improve the performance of machine learning algorithms. • More of an art, and heavily dependent on domain knowledge • Better features lead to • More flexibility in the modeling • Less complex models • Better accuracy https://jyu-theartofml.github.io/posts/feature_eng
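As a rough sketch of turning raw data into features, the snippet below aggregates a raw session log into one feature row per user. The log layout, the `engineer_features` helper, and the chosen features (`avg_session_minutes`, `weekend_share`) are hypothetical; which features are worth deriving depends on the domain.

```python
from datetime import datetime

# Hypothetical raw event log: (user_id, session_start, session_end) as ISO strings.
raw_sessions = [
    ("u1", "2024-01-06T09:00:00", "2024-01-06T09:30:00"),
    ("u1", "2024-01-08T20:00:00", "2024-01-08T20:10:00"),
    ("u2", "2024-01-07T13:00:00", "2024-01-07T13:45:00"),
]


def engineer_features(sessions):
    """Aggregate raw sessions into one feature row per user."""
    per_user = {}
    for user_id, start, end in sessions:
        start_dt = datetime.fromisoformat(start)
        end_dt = datetime.fromisoformat(end)
        minutes = (end_dt - start_dt).total_seconds() / 60
        row = per_user.setdefault(
            user_id,
            {"sessions": 0, "total_minutes": 0.0, "weekend_sessions": 0},
        )
        row["sessions"] += 1
        row["total_minutes"] += minutes
        if start_dt.weekday() >= 5:  # Monday is 0, so 5 and 6 are the weekend
            row["weekend_sessions"] += 1
    # Derived ratios are often more informative to a model than raw totals.
    for row in per_user.values():
        row["avg_session_minutes"] = row["total_minutes"] / row["sessions"]
        row["weekend_share"] = row["weekend_sessions"] / row["sessions"]
    return per_user


features = engineer_features(raw_sessions)
for user_id, row in features.items():
    print(user_id, row["avg_session_minutes"], row["weekend_share"])
```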
and data • The goal is to engage with all the stakeholders, convey the data scientists' findings, and recommend a solution • Sometimes extended to monitoring the impact once the solution is delivered https://hbr.org/2016/06/visualizations-that-really-work
e-learning app • Data gathering: • Feature usage • Engagement with features • Time spent… • Data cleaning: Remove outliers • Exploration: User segmentation • Feature engineering: Avg. session time, Quiz completion rate • Modeling: Which features predict user retention? • Visualization & Presentation: Sorted list of features contributing to retention (correlation coefficient) • Action: Get users to use these 3 features within the first 2 weeks of signup
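The cleaning, modeling, and presentation steps of this case study can be sketched as follows. The feature names come from the slide, but the user rows, the 1.5 × IQR outlier rule, and the helper functions are assumptions chosen for illustration; the original analysis may have used different data and methods.

```python
import statistics

# Hypothetical per-user rows; feature names follow the case study, values are made up.
users = [
    {"avg_session_min": 5, "quiz_completion_rate": 0.1, "retained": 0},
    {"avg_session_min": 8, "quiz_completion_rate": 0.2, "retained": 0},
    {"avg_session_min": 12, "quiz_completion_rate": 0.6, "retained": 1},
    {"avg_session_min": 15, "quiz_completion_rate": 0.5, "retained": 1},
    {"avg_session_min": 20, "quiz_completion_rate": 0.9, "retained": 1},
    {"avg_session_min": 10, "quiz_completion_rate": 0.3, "retained": 0},
    {"avg_session_min": 18, "quiz_completion_rate": 0.8, "retained": 1},
    {"avg_session_min": 600, "quiz_completion_rate": 0.0, "retained": 0},  # implausible value
]

FEATURES = ["avg_session_min", "quiz_completion_rate"]


def remove_outliers(rows, key):
    """Drop rows whose `key` value falls outside the 1.5 * IQR fences."""
    q1, _, q3 = statistics.quantiles([r[key] for r in rows], n=4)
    spread = 1.5 * (q3 - q1)
    return [r for r in rows if q1 - spread <= r[key] <= q3 + spread]


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def rank_features(rows, features, target="retained"):
    """Sort features by the absolute correlation with the target, strongest first."""
    labels = [r[target] for r in rows]
    scores = [(f, pearson([r[f] for r in rows], labels)) for f in features]
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)


cleaned = remove_outliers(users, "avg_session_min")
ranking = rank_features(cleaned, FEATURES)
for name, r in ranking:
    print(f"{name}: r = {r:+.2f}")
```

A ranking like this shows association only; whether nudging users toward the top features actually improves retention still has to be checked once the action is rolled out.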
Online courses focused on data science and machine learning • Books • Contribute to open source projects (Google Summer of Code is a good example) • Online communities (Reddit, Stack Overflow, ...) • Hackathons • Participate in solving real-world problems on Kaggle