PyData Academy

Data scientist’s curriculum – Help solve problems in data science with Python. Initiative by Continuum Analytics (http://continuum.io)

Talk from PyData 2012 in NYC. Video: https://vimeo.com/53095577

Github repository: https://github.com/ContinuumIO/PyDataAcademy/

Stefan Urbanek

October 27, 2012

Transcript

  1. PyData Academy
    data scientist’s curriculum
    Štefan Urbánek ■ @Stiivi ■ October 2012

  2. Help solve problems in data science
    with Python

  3. ■ provide wider knowledge of data-related topics
    ■ give insight into data processing for data analysts
    ■ give insight into data analysis for data engineers

  4. Knowledge Areas

  5. [Diagram: knowledge areas arranged by pipeline layer, with topics tagged by level]
    Layers: Governance; Presentation, Analysis, Publishing; Analytical Modeling; Cleansing, Transformation and Integration; Extraction; Discovery and Acquisition; Data Sources; Tools and Technologies
    Topics: Audit of Existing Resource, Searching and Finding, Data Relevance, Crowd Sourcing, Completeness and Stopping, Programming Basics, Crawling, Scraping, Parsing, Loading to Data Store, Automation, Normalisation, Merge/Join, Mapping, Fuzzy Matching, Handling Manual Corrections, Entity Uniqueness, Treating Duplicates, Natural Language Processing, Indexing and Optimisation, Data Granularity, Data Formats and Standards, Concept Modelling, Handling Changing Dimensions, ETL Process Management, Data Quality Management, Auditability and Provenance, Reference Data Management, Metadata, Regression, Outliers, Clustering, Graph/Network Metrics, Pivoting, OLAP, Business Rules, Visualisation and Plotting, Sorting and Filtering, Visualisation Method Selection, Publishing Online, Map Geo-Tagging, Story Telling, Data Processing Pipeline, Advanced Programming, Using Reference Data, Simulation, Manual Digitisation, Bulk Digitisation, Data Pipes, SQL
    Data sources: web pages, text documents, structured documents, databases, scientific data
    Level: N non-technical, 1 beginner, 2 advanced, 3 expert

  6. Knowledge Areas
    based on data processing pipeline
    Data Governance
    Analysis and Presentation
    Extraction, Transformation, Loading
    Data Sources
    Technologies and Utilities

  7. Extraction
    Discovery and Acquisition
    Cleansing, Transformation and Integration
    Analytical Modeling
    Data Governance
    Analysis and Presentation
    Extraction, Transformation, Loading
    Data Sources
    Technologies and Utilities

  8. Analysis
    Presentation and Publishing
    Automated Decisioning
    Data Governance
    Analysis and Presentation
    Extraction, Transformation, Loading
    Data Sources
    Technologies and Utilities

  9. Quality Management
    Process Management
    Data Provenance
    Master Data Management
    Data Governance
    Analysis and Presentation
    Extraction, Transformation, Loading
    Data Sources
    Technologies and Utilities

  10. Advanced Python
    SQL
    Python Basics
    Unit Testing
    Python Optimization
    Data Governance
    Analysis and Presentation
    Extraction, Transformation, Loading
    Data Sources
    Technologies and Utilities

  11. Modular Organization

  12. Module
    [Diagram: anatomy of a module]
    Contents: tutorial, example, exercise
    Links: required module, recommended requirement, recommended follow-up
    Materials: example dataset, example result, required knowledge, slide deck, screencast, ipython notebook, source package
    Labels: N 1 2 3 T U

  13. T tool focused
    U utility knowledge
    N business strategy/non-tech
    1 beginner
    2 advanced
    3 expert

  14. Tracks
    ■ collection of context-related modules
    ■ problem area or skill-set oriented
    [Diagram: example modules tagged with level and type labels: numpy, scipy, pandas, "Production python", Data visualization, Optimization]
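
The module and track organization on slides 12 to 14 could be modelled roughly as below. This is only a sketch: the `Module` and `Track` classes, their field names, the track title, and the requirement links between numpy, scipy and pandas are hypothetical illustrations, not taken from the PyData Academy repository. Only the level labels (N, 1, 2, 3) and type labels (T, U) come from the slides.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Optional

# Labels as given on the legend slide.
LEVELS = {"N": "business strategy/non-tech", "1": "beginner",
          "2": "advanced", "3": "expert"}
KINDS = {"T": "tool focused", "U": "utility knowledge"}


@dataclass
class Module:
    """Hypothetical model of one curriculum module."""
    name: str
    level: str                      # one of LEVELS
    kind: Optional[str] = None      # one of KINDS, or None
    required: list[str] = field(default_factory=list)    # required modules
    follow_ups: list[str] = field(default_factory=list)  # recommended follow-ups

    def __post_init__(self):
        if self.level not in LEVELS:
            raise ValueError(f"unknown level: {self.level!r}")
        if self.kind is not None and self.kind not in KINDS:
            raise ValueError(f"unknown kind: {self.kind!r}")


@dataclass
class Track:
    """Hypothetical model of a track: a collection of related modules."""
    title: str
    modules: list[Module]

    def study_order(self) -> list[str]:
        """Return module names so that required modules come first."""
        names = {m.name for m in self.modules}
        # Requirements outside this track are ignored.
        graph = {m.name: {r for r in m.required if r in names}
                 for m in self.modules}
        return list(TopologicalSorter(graph).static_order())


# Illustrative track; the requirement links are assumptions.
track = Track("Numerical Python", [
    Module("scipy", level="2", kind="T", required=["numpy"]),
    Module("pandas", level="1", kind="T", required=["numpy"]),
    Module("numpy", level="1", kind="T", follow_ups=["scipy", "pandas"]),
])
order = track.study_order()
```

Here `order` lists numpy before both scipy and pandas, regardless of the order in which the modules were added to the track.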

  15. Features
    ■ hands-on
    ■ curated real datasets for real problems
    ■ narration + slides + notebooks + exercises + ...

  16. attribution share alike

  17. github.com/ContinuumIO/PyDataAcademy
    … coming soon.
