Introduction to Workflow for Data Science

Introduction, examples, use cases and tools used in data science projects.

Santiago Lizardo

December 11, 2019

Transcript

  1. About the talk • High level overview of the data science workflow • Use case in bug triage • Tooling for machine learning • Small demo
  2. Business question examples • Predictions • Are we going to grow by more than X% next quarter? • Is this customer going to leave in the next 5 months? • Classification • Is this a server or a workstation? • What type of server is this? (web server, email, file, proxy, …) • Anomaly detection • Is this spike in uploads normal?
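The anomaly detection question can be made concrete with a very simple statistical check. The sketch below assumes a z-score rule with a threshold of 3 standard deviations; the function name, the threshold and the upload counts are illustrative assumptions, not something shown in the talk.

```python
from statistics import mean, stdev


def is_anomalous(history, new_value, threshold=3.0):
    """Flag new_value if it lies more than `threshold` sample standard
    deviations away from the mean of the historical values."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All historical values are identical: anything different is unusual.
        return new_value != mu
    return abs(new_value - mu) / sigma > threshold


# Hypothetical daily upload counts for the past week.
daily_uploads = [120, 131, 118, 125, 129, 122, 127]

print(is_anomalous(daily_uploads, 126))  # a typical day
print(is_anomalous(daily_uploads, 400))  # a large spike
```

A rule like this is only a baseline. Seasonal traffic or trends would usually call for a model that accounts for them.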
  3. Industry standards • Knowledge discovery in databases (KDD) • Analytics Solutions Unified Method for Data Mining (ASUM-DM) • Cross-Industry Standard Process for Data Mining (CRISP-DM) • Team Data Science Process (TDSP)
  4. Data wrangling • Data… • …sourcing • …exploration (EDA) • …cleaning • …imputation • …enrichment (Feature engineering)
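As a rough illustration of the cleaning, imputation and enrichment steps, here is a minimal sketch using only the Python standard library. The records, column names and the median-fill strategy are assumptions chosen for the example. In practice a library such as Pandas would typically do this work.

```python
from statistics import median

# Hypothetical raw records; "" and None stand for missing values.
raw_rows = [
    {"host": "web-01 ", "cpu_cores": "8", "ram_gb": "32"},
    {"host": "mail-01", "cpu_cores": "", "ram_gb": "16"},
    {"host": "WS-17", "cpu_cores": "4", "ram_gb": None},
]


def to_number(value):
    """Convert a raw cell to float, or None when it is missing."""
    if value is None or str(value).strip() == "":
        return None
    return float(value)


def clean_and_impute(rows, numeric_columns):
    # Cleaning: normalise the text column and parse the numeric cells.
    cleaned = []
    for row in rows:
        new_row = {"host": row["host"].strip().lower()}
        for col in numeric_columns:
            new_row[col] = to_number(row[col])
        cleaned.append(new_row)

    # Imputation: replace each missing value with its column median.
    for col in numeric_columns:
        observed = [r[col] for r in cleaned if r[col] is not None]
        fill = median(observed)
        for r in cleaned:
            if r[col] is None:
                r[col] = fill

    # Enrichment (feature engineering): derive a new column.
    for r in cleaned:
        r["ram_per_core"] = r["ram_gb"] / r["cpu_cores"]
    return cleaned


rows = clean_and_impute(raw_rows, ["cpu_cores", "ram_gb"])
for r in rows:
    print(r)
```

Median imputation is only one option. Dropping incomplete rows or using a model-based fill may suit other data sets better.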
  5. Data wrangling | Data ingestion • Identify the data source(s) • Static dumps (CSV, TSV, Parquet, …) • Live databases (SQL, NoSQL, …) • Data scraping (Web) • Database (SQL) • Exposing it to the data science project
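A small sketch of two of the ingestion paths listed above, a static CSV dump and a SQL database, both exposed to the project as the same list-of-dictionaries shape. The table name, column names and sample rows are assumptions. An in-memory string and an in-memory SQLite database stand in for a real file and a real database server.

```python
import csv
import io
import sqlite3


def load_csv(handle):
    """Read ticket rows from a CSV dump, converting ids to integers."""
    return [
        {"ticket_id": int(row["ticket_id"]), "summary": row["summary"]}
        for row in csv.DictReader(handle)
    ]


def load_sql(conn):
    """Read ticket rows from a SQL database as plain dictionaries."""
    conn.row_factory = sqlite3.Row
    cursor = conn.execute(
        "SELECT ticket_id, summary FROM tickets ORDER BY ticket_id"
    )
    return [dict(row) for row in cursor.fetchall()]


# Static dump: an in-memory string stands in for a CSV file on disk.
csv_dump = io.StringIO(
    "ticket_id,summary\n1,Login page crashes\n2,Slow report export\n"
)

# Live database: in-memory SQLite stands in for a real SQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (ticket_id INTEGER, summary TEXT)")
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?)",
    [(3, "Email bounces"), (4, "Proxy timeout")],
)

# Both sources end up in one uniform structure for the rest of the project.
all_tickets = load_csv(csv_dump) + load_sql(conn)
conn.close()
print(len(all_tickets))
```

Normalising every source to one record shape early keeps the later exploration and cleaning steps independent of where the data came from.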
  6. Use case: Triage • Every morning, new bugs are reviewed • The responsible engineering team is identified and assigned • Assignment is based on the description, among other attributes
  7. Use case: Triage • Can it be programmed? • Too many rules • Use information about previous 30K tickets to classify new ones • Error margin is acceptable
  8. Use case: Triage • Jira ticket attributes to read from: • Summary • Components • Labels • Area • Repository • Jira ticket attribute to update: • Team
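One way to picture these attributes in code is a small record type that separates the inputs from the single output. The class name, field types and the `as_text` helper below are hypothetical. The slide only names the attributes, not how they are stored.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Ticket:
    # Attributes read from Jira (model inputs).
    summary: str
    components: List[str] = field(default_factory=list)
    labels: List[str] = field(default_factory=list)
    area: str = ""
    repository: str = ""
    # Attribute written back to Jira (model output); None until predicted.
    team: Optional[str] = None

    def as_text(self) -> str:
        """Flatten the input attributes into one string for a text model."""
        parts = [
            self.summary,
            *self.components,
            *self.labels,
            self.area,
            self.repository,
        ]
        return " ".join(p for p in parts if p)


# Hypothetical ticket used only to show the flattened form.
ticket = Ticket(
    summary="Login page crashes",
    components=["auth"],
    labels=["regression"],
    repository="web-frontend",
)
print(ticket.as_text())
```

Joining all input fields into one string is a simplification. Treating components or labels as separate categorical features is an equally reasonable design.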
  9. Use case: Triage • High level plan • Extract data from Jira → CSV • Use Python and Pandas to read CSV and clean it up • Use NLP to extract the main info from text • Remove stopwords, punctuation, etc. • Use stemming to get to the root of words • Use XGBClassifier (from the xgboost package, which exposes a scikit-learn-compatible API) • Hook Python script to ML algorithm
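The text-processing part of this plan can be sketched end to end with the standard library alone. Note the substitutions: the stopword list and suffix-stripping stemmer are deliberately naive stand-ins for a real NLP toolkit, and a small multinomial naive Bayes classifier stands in for the XGBClassifier named in the talk so that the example has no third-party dependencies. The ticket texts and team names are invented.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "is", "in", "on", "to", "when", "and", "of", "for"}


def stem(word):
    """Very naive suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop punctuation and stopwords, then stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]


def train(examples):
    """Count stemmed words per team from (text, team) pairs."""
    word_counts = defaultdict(Counter)
    team_counts = Counter()
    for text, team in examples:
        team_counts[team] += 1
        word_counts[team].update(preprocess(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, team_counts, vocabulary


def predict(model, text):
    """Return the team with the highest naive Bayes log-probability."""
    word_counts, team_counts, vocabulary = model
    total = sum(team_counts.values())
    best_team, best_score = None, -math.inf
    for team in team_counts:
        score = math.log(team_counts[team] / total)
        # Add-one smoothing so unseen words do not zero out a team.
        denom = sum(word_counts[team].values()) + len(vocabulary)
        for token in preprocess(text):
            score += math.log((word_counts[team][token] + 1) / denom)
        if score > best_score:
            best_team, best_score = team, score
    return best_team


# Hypothetical historical tickets with their assigned teams.
history = [
    ("Login page crashes on submit", "frontend"),
    ("Button styling broken in settings page", "frontend"),
    ("Database query times out", "backend"),
    ("API returns error for large uploads", "backend"),
]
model = train(history)

print(preprocess("The login page crashes when loading!"))
print(predict(model, "Settings page crashing"))
print(predict(model, "Query returns database error"))
```

With the real tooling, the preprocessed text would usually be turned into numeric features (for example token counts) before being passed to a gradient-boosted classifier. That vectorisation step is not shown here.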
  10. Tooling • Local notebooks • Jupyter notebooks (.ipynb) • Cloud notebooks • Google Colaboratory • Amazon SageMaker
  11. Closing thoughts • Data science is for the many, not the few • Save data now for what you might want to ask in the future • Workflows are useful tools in projects • Data science can be hard, but so can some business requirements