Data Science: Raw Data to Actionable Insights

Invited talk to students, as part of "Futuristic Technologies in Computer Science" by the Computer Society of India, October 2020.

dharmeshkakadia

October 10, 2020

Transcript

  1. Microsoft
    Dharmesh Kakadia
    Microsoft

  2. Microsoft
    • Interested in bringing AI applications to the masses
    • Sr. Applied Scientist at Microsoft
    • Worked across Office and Azure products
    • Prior to that, spent a couple of years at Microsoft Research
    • Among other things, author of “Apache Mesos Essentials”
    • Opinions are mine and not Microsoft’s
    • You can find me as @dharmeshkakadia everywhere

  3. Microsoft

  4. Microsoft
    Problem + Data Magic = Solution

  5. Microsoft

  6. Microsoft
    Problem + Data Science (not Magic) = Solution

  7. Microsoft
    “Any sufficiently advanced technology is indistinguishable from magic.”
    - Arthur C. Clarke

  8. Microsoft
    Data science is an interdisciplinary field that uses scientific methods,
    processes, algorithms and systems to extract knowledge and insights from
    structured and unstructured data.
    https://en.wikipedia.org/wiki/Data_science
    Diagram labels: Software Development, Domain Knowledge, Statistics

  9. Microsoft
    Data science is an interdisciplinary field that uses scientific methods,
    processes, algorithms and systems to extract knowledge and insights from
    structured and unstructured data.
    https://xkcd.com/1838/
    Diagram labels: Software Development, Domain Knowledge, Statistics

  10. Microsoft

  11. Microsoft
    Wide range of applications in industry
    • Classification (spam or not spam)
    • Recommendations (Amazon and Netflix recommendations)
    • Pattern detection and grouping
    • Anomaly detection (fraud detection)
    • Recognition (image, text, audio, video, facial)
    • Actionable insights (via dashboards, reports, visualizations, …)
    • Scoring and ranking (credit score)
    • Optimization (risk management)
    • Forecasts (sales and revenue)

  12. Microsoft

  13. Microsoft
    • Understanding the problem
    • Data gathering and augmentation
    • Data cleaning
    • Data exploration
    • Feature engineering
    • Modeling and experimentation
    • Data visualization and actionable insights
    https://www.pinterest.com/pin/607704543450983191/

  14. Microsoft
    Can data science help with this problem?
    • Conclusive
    • Measurable
    • Specific
    http://radar.oreilly.com/2013/04/why-why-why.html
    What kind of data science problem is this?
    • Classification
    • Regression
    • Time series prediction

  15. Microsoft
    • Collect relevant data from the product
    • Find applicable public datasets
    • Domain-specific datasets
    • Knowledge bases
    • Wikipedia
    • Research datasets
    • Search results
    • Entity linking
    • Third-party data vendors
    https://www.alamy.com/stock-image-business-man-digging-a-hole-in-the-ground-to-search-for-useful-information-166716913.html

  16. Microsoft
    • Less attractive and sometimes frustrating, but a crucial step
    • Verify frequency distributions
    • Remove outliers
    • Remove NULLs
    • Add/replace default values
    • Uniform formatting
    • Type conversion
    • Domain-specific cleaning approaches
    • Space removal
    • Image segmentation
    https://www.pinterest.com/pin/607704543450983191/
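
A minimal sketch of a few of the cleaning steps above (NULL removal, default values, space removal, uniform formatting, type conversion, outlier removal), using only the Python standard library. The records, field names, and the 1.5 × IQR outlier rule are illustrative assumptions, not something taken from the talk.

```python
from statistics import quantiles

# Hypothetical raw records; field names and values are made up for illustration.
raw_rows = [
    {"user": " Alice ", "country": "IN", "minutes": "12"},
    {"user": "bob", "country": None, "minutes": "15"},
    {"user": "CAROL", "country": "us", "minutes": None},   # missing value
    {"user": "dave", "country": "IN", "minutes": "14"},
    {"user": "erin", "country": "in", "minutes": "13"},
    {"user": "frank", "country": "US", "minutes": "900"},  # likely outlier
]


def clean(rows, default_country="UNKNOWN"):
    cleaned = []
    for row in rows:
        # Remove NULLs: drop rows missing the value we want to analyse.
        if row["minutes"] is None:
            continue
        cleaned.append({
            # Space removal and uniform formatting.
            "user": row["user"].strip().lower(),
            # Add/replace default values, then normalise case.
            "country": (row["country"] or default_country).upper(),
            # Type conversion: string -> float.
            "minutes": float(row["minutes"]),
        })
    return cleaned


def remove_outliers(rows, key="minutes", k=1.5):
    # Assumed rule: keep values within k * IQR of the first and third quartiles.
    values = [row[key] for row in rows]
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [row for row in rows if low <= row[key] <= high]


rows = remove_outliers(clean(raw_rows))
```

Which rows count as outliers depends heavily on the rule and the domain; on real data the threshold would be checked against the frequency distributions first.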

  17. Microsoft
    • Understand what is in a dataset and the characteristics of the data by using statistical techniques
    • Focus on characterizations such as
    • Size, quantity, accuracy
    • Frequency distributions
    • Variance
    • Ranges of data
    • Visual exploration (boxplot, histogram, …)
    https://www.pinterest.com/pin/607704543450983191/
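
The characterizations above can be sketched with the Python standard library alone. The `scores` column and the helper names `describe` and `text_histogram` are hypothetical, chosen only to illustrate size, range, variance, frequency distribution, and a crude visual check.

```python
from collections import Counter
from statistics import mean, median, pvariance

# Hypothetical column of quiz scores; the values are illustrative only.
scores = [55, 70, 70, 80, 85, 90, 90, 90, 100]


def describe(values):
    """Return basic characterizations of a numeric column."""
    return {
        "size": len(values),
        "range": (min(values), max(values)),
        "mean": mean(values),
        "median": median(values),
        "variance": pvariance(values),  # population variance
        "frequencies": Counter(values),
    }


def text_histogram(values, bin_width=10):
    """Print a crude text histogram for quick visual exploration."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    for start in sorted(bins):
        print(f"{start:>3}-{start + bin_width - 1:<3} {'#' * bins[start]}")


summary = describe(scores)
text_histogram(scores)
```

In practice a notebook with plotting support would replace the text histogram; the point is only that exploration starts with simple summaries.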

  18. Microsoft
    • Process of using domain knowledge to extract features from raw data via data mining techniques
    • These features can be used to improve the performance of machine learning algorithms
    • More of an art than the other steps, and heavily dependent on domain knowledge
    • Better features lead to
    • More flexibility in the modeling
    • Less complex models
    • Better accuracy
    https://jyu-theartofml.github.io/posts/feature_eng
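
As a rough illustration of turning raw data into features, the sketch below aggregates a hypothetical per-session event log into one row per user. The log layout and the function name `build_features` are assumptions; the two derived features are the ones named in the talk's e-learning example (average session time and quiz completion rate).

```python
from collections import defaultdict

# Hypothetical raw log rows:
# (user_id, session_minutes, quizzes_started, quizzes_completed)
events = [
    ("u1", 30, 2, 2),
    ("u1", 10, 1, 0),
    ("u2", 5, 1, 0),
    ("u2", 15, 0, 0),
    ("u3", 45, 3, 3),
]


def build_features(rows):
    """Aggregate raw per-session rows into one feature row per user."""
    totals = defaultdict(
        lambda: {"sessions": 0, "minutes": 0, "started": 0, "completed": 0}
    )
    for user, minutes, started, completed in rows:
        t = totals[user]
        t["sessions"] += 1
        t["minutes"] += minutes
        t["started"] += started
        t["completed"] += completed

    features = {}
    for user, t in totals.items():
        features[user] = {
            "avg_session_minutes": t["minutes"] / t["sessions"],
            # Guard against users who never started a quiz.
            "quiz_completion_rate": (
                t["completed"] / t["started"] if t["started"] else 0.0
            ),
        }
    return features


features = build_features(events)
```

Choosing a ratio such as completion rate rather than a raw count is where the domain knowledge comes in: it makes heavy and light users comparable.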

  19. Microsoft
    • Trying out different algorithms and techniques
    • Measuring accuracy
    • Repeat and refine
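
A toy version of this loop, assuming a synthetic one-feature dataset and two deliberately simple candidate "algorithms" (a majority-class baseline and a single-threshold rule). Everything here is illustrative; a real project would use a library such as scikit-learn and a proper validation scheme.

```python
# Synthetic data (an assumption for illustration): one feature x in [0, 1),
# labelled 1 when x is above 0.6.
data = [(i / 100, int(i > 60)) for i in range(100)]
train, test = data[::2], data[1::2]  # simple interleaved hold-out split


def accuracy(predict, rows):
    """Fraction of rows where the prediction matches the label."""
    return sum(predict(x) == y for x, y in rows) / len(rows)


def fit_majority(rows):
    """Baseline: always predict the most common training label."""
    ones = sum(y for _, y in rows)
    label = int(ones * 2 >= len(rows))
    return lambda x: label


def fit_threshold(rows):
    """Pick the cut-off on x with the best training accuracy."""
    candidates = sorted(x for x, _ in rows)
    best = max(candidates, key=lambda t: accuracy(lambda x: int(x > t), rows))
    return lambda x: int(x > best)


# Experiment: fit each candidate on train, measure accuracy on held-out data.
results = {
    name: accuracy(fit(train), test)
    for name, fit in [("majority", fit_majority), ("threshold", fit_threshold)]
}
best_model = max(results, key=results.get)
```

Keeping a trivial baseline in the comparison makes it obvious whether a more elaborate model is actually earning its complexity.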

  20. Microsoft
    • Data visualization is the graphical representation of information and data
    • The goal is to engage all the stakeholders, convey the data scientists' findings, and recommend a solution
    • Sometimes extended to monitor the impact once the solution is delivered
    https://hbr.org/2016/06/visualizations-that-really-work

  21. Microsoft

  22. Microsoft
    • Problem: Users are not continuing to use our e-learning app
    • Data gathering:
    • Feature usage
    • Engagement with features
    • Time spent…
    • Data cleaning: Remove outliers
    • Exploration: User segmentation
    • Feature engineering: Avg. session time, quiz completion rate
    • Modeling: What features predict user retention?
    • Visualization & Presentation: Sorted list of features contributing to retention (correlation coefficient)
    • Action: Get people to use these 3 features within the first 2 weeks of signup
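
The "sorted list of features" step could look roughly like the sketch below. It assumes the Pearson correlation coefficient (the talk does not say which coefficient was used) and a made-up per-user table; `videos_watched` is a hypothetical third feature added only to show a weak predictor.

```python
from math import sqrt

# Hypothetical per-user table; all numbers are invented for illustration.
users = [
    {"avg_session_minutes": 5, "quiz_completion_rate": 0.1, "videos_watched": 3, "retained": 0},
    {"avg_session_minutes": 8, "quiz_completion_rate": 0.2, "videos_watched": 9, "retained": 0},
    {"avg_session_minutes": 20, "quiz_completion_rate": 0.7, "videos_watched": 4, "retained": 1},
    {"avg_session_minutes": 25, "quiz_completion_rate": 0.9, "videos_watched": 8, "retained": 1},
    {"avg_session_minutes": 12, "quiz_completion_rate": 0.8, "videos_watched": 5, "retained": 1},
    {"avg_session_minutes": 15, "quiz_completion_rate": 0.3, "videos_watched": 6, "retained": 0},
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def rank_features(rows, target="retained"):
    """Sort features by the strength of their correlation with the target."""
    ys = [row[target] for row in rows]
    names = [name for name in rows[0] if name != target]
    scores = {name: pearson([row[name] for row in rows], ys) for name in names}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


ranking = rank_features(users)
for name, r in ranking:
    print(f"{name:<22} {r:+.2f}")
```

Correlation only suggests which features are worth acting on; whether nudging users toward them improves retention still has to be measured after the change ships.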

  23. Microsoft
    • Problem: Meetings, user research
    • Data gathering: Telemetry, databases, public datasets, …
    • Data cleaning: Python, R, MATLAB, …
    • Exploration: Jupyter notebooks, SQL, …
    • Feature engineering: Spark, pandas
    • Modeling: TensorFlow, PyTorch, Spark, pandas, scikit-learn, …
    • Visualization & Presentation: Tableau, PowerPoint, dashboards, …
    • Action: New product features, …

  24. Microsoft

  25. Microsoft
    • Courses in math, programming, AI, databases, …
    • Online courses focused on data science and machine learning
    • Books
    • Contribute to open source projects (Google Summer of Code is a good example)
    • Online communities (Reddit, Stack Overflow, …)
    • Hackathons
    • Participate in solving real-world problems on Kaggle

  26. Microsoft

  27. Microsoft
