PyData Academy

Data scientist’s curriculum – Help solve problems in data science with Python. Initiative by Continuum Analytics (http://continuum.io)

Talk from PyData 2012 in NYC. Video: https://vimeo.com/53095577

Github repository: https://github.com/ContinuumIO/PyDataAcademy/

Stefan Urbanek

October 27, 2012

Transcript

  1. PyData Academy ▪ data scientist’s curriculum ▪ Štefan Urbánek ▪ @Stiivi ▪ stefan.urbanek@continuum.io ▪ October 2012

  2. Help solve problems in data science with Python

  3. ▪ provide wider knowledge of data-related topics ▪ give insight into data processing for data analysts ▪ give insight into data analysis for data engineers

  4. Knowledge Areas

  5. [Diagram: knowledge areas and their topics, each topic marked with a level]
    Knowledge areas: Governance; Presentation, Analysis, Publishing; Analytical Modeling; Cleansing, Transformation and Integration; Extraction; Discovery and Acquisition; Data Sources; Tools and Technologies
    Topics: Audit of Existing Resource, Searching and Finding, Data Relevance, Crowd Sourcing, Completeness and Stopping, Programming Basics, Crawling, Scraping, Parsing, Loading to Data Store, Automation, Normalisation, Merge/Join, Mapping, Fuzzy Matching, Handling Manual Corrections, Entity Uniqueness, Treating Duplicates, Natural Language Processing, Indexing and Optimisation, Data Granularity, Data Formats and Standards, Concept Modelling, Handling Changing Dimensions, ETL Process Management, Data Quality Management, Auditability and Provenance, Reference Data Management, Metadata, Regression, Outliers, Clustering, Graph/Network Metrics, Pivoting, OLAP, Business Rules, Visualisation and Plotting, Sorting and Filtering, Visualisation Method Selection, Publishing Online, Map Geo-Tagging, Story Telling, Data Processing Pipeline, Advanced Programming, Using Reference Data, Simulation, Manual Digitisation, Bulk Digitisation, Data Pipes, SQL
    Data sources: web pages, text documents, structured documents, databases, scientific data
    Level: non-technical, beginner, advanced, expert

  6. Knowledge Areas based on data processing pipeline ▪ Data Governance ▪ Analysis and Presentation ▪ Extraction, Transformation, Loading ▪ Data Sources ▪ Technologies and Utilities

  7. Extraction ▪ Discovery and Acquisition ▪ Cleansing, Transformation and Integration ▪ Analytical Modeling

  8. Analysis ▪ Presentation and Publishing ▪ Automated Decisioning

  9. Quality Management ▪ Process Management ▪ Data Provenance ▪ Master Data Management

  10. Advanced Python ▪ SQL ▪ Python Basics ▪ Unit Testing ▪ Python Optimization

  11. Modular Organization

  12. [Diagram] Module ▪ tutorial ▪ required module ▪ example ▪ exercise ▪ recommended requirement ▪ recommended follow-up ▪ example dataset ▪ example result ▪ required knowledge ▪ slide deck ▪ screencast ▪ IPython notebook ▪ source package ▪ markers: 1, 2, 3, N, T, U

  13. T tool focused ▪ U utility knowledge ▪ N business strategy/non-tech ▪ 1 beginner ▪ 2 advanced ▪ 3 expert

  14. Tracks ▪ collection of context-related modules ▪ problem area or skill-set oriented ▪ [diagram labels: numpy, scipy, pandas, "Production python", Data visualization, Optimization, each with level and type markers]

  15. Features ▪ hands-on ▪ curated real datasets for real problems ▪ narration + slides + notebooks + exercises + ...

  16. attribution share alike

  17. github.com/ContinuumIO/PyDataAcademy … coming soon.
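
The module and track organization from slides 12 to 14 can be sketched as a small data model. This is only an illustrative sketch under stated assumptions, not code from the PyDataAcademy repository: the class names `Module` and `Track`, their field names, the helper `missing_requirements`, and the levels and requirements given to the example modules are all hypothetical. The level codes (N, 1, 2, 3) and type codes (T, U) follow the legend on slide 13.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Level and type codes as listed in the slide 13 legend.
LEVELS = {
    "N": "business strategy/non-tech",
    "1": "beginner",
    "2": "advanced",
    "3": "expert",
}
MODULE_TYPES = {"T": "tool focused", "U": "utility knowledge"}


@dataclass
class Module:
    """One curriculum module (hypothetical structure)."""
    name: str
    level: str                                           # key of LEVELS
    kind: Optional[str] = None                           # key of MODULE_TYPES, if any
    required: List[str] = field(default_factory=list)    # required modules
    follow_up: List[str] = field(default_factory=list)   # recommended follow-ups

    def __post_init__(self):
        # Reject codes that are not part of the slide 13 legend.
        if self.level not in LEVELS:
            raise ValueError(f"unknown level: {self.level!r}")
        if self.kind is not None and self.kind not in MODULE_TYPES:
            raise ValueError(f"unknown module type: {self.kind!r}")


@dataclass
class Track:
    """A collection of context-related modules (hypothetical structure)."""
    name: str
    modules: List[Module] = field(default_factory=list)

    def missing_requirements(self) -> List[str]:
        """Names of required modules that are not part of this track."""
        present = {m.name for m in self.modules}
        needed = {r for m in self.modules for r in m.required}
        return sorted(needed - present)


# Example with made-up levels and requirements (assumptions, not from the slides).
track = Track(
    "Hypothetical data analysis track",
    [
        Module("numpy", level="1", kind="T"),
        Module(
            "pandas",
            level="2",
            kind="T",
            required=["numpy", "Python Basics"],
            follow_up=["Data visualization"],
        ),
    ],
)
print(track.missing_requirements())  # ['Python Basics']
```

Keeping requirements as plain module names keeps the sketch small; a fuller model could link `Module` objects directly and attach the materials listed on slide 12 (slide deck, screencast, notebook, source package).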