
The Open Source Data Tooling Landscape

Given for Coiled webinar on August 24, 2021.

Carol Willing

August 24, 2021


Transcript

  1. The Open Source Data
    Tooling Landscape
    Carol Willing


    VP of Learning


    Noteable
    web: noteable.io


    email: carol AT noteable.io


    twitter: @WillingCarol


    github: willingc


  2. The 10 Best Practices
    for Remote Software
    Engineering


    Focusing on the human element of remote software engineer
    productivity


    Vanessa Sochat


    DOI:10.1145/3459613


    Attribution: xkcd
    1 Today


  3. Common Data
    Challenges
    Exploring Solutions with


    Open Source Data Tools
    2 Data


  4. SCALE


  5. SPEED


  6. CONNECTIONS


  7. CHOICES


  8. The Data Pipeline
    Perspectives
    Attribution: Red Bull
    3 People


  9. The Data Pipeline
    Executives


    Opportunity and Fear


  10. The Data Pipeline
    Engineers


    Infrastructure and Process
    Executives


    Opportunity and Fear


  11. The Data Pipeline
    Engineers


    Infrastructure and Process
    Data Scientists


    Algorithms and Models
    Executives


    Opportunity and Fear


  12. The Data Pipeline
    Engineers


    Infrastructure and Process
    Data Scientists


    Algorithms and Models
    Executives


    Opportunity and Fear
    Users


    Productivity and Needs


  13. Attribution: Red Bull
    Start small...


  14. @WillingCarol
    Justine Dupont surfs the greatest wave of her life in Nazaré, Portugal
    © Rafael G. Riancho / Red Bull Content Pool

    ...and scale.


  15. Open Source Data
    Tooling Landscape
    4 Ecosystem


  16. Python


    R


    Julia


    Fortran


    SQL
    C++


    Go


    Rust


    Java


    Scala
    4 Ecosystem
    Programming Languages
    JavaScript


    TypeScript
    Data Analysis Workflows Interactivity


  17. 4 Ecosystem Data Workflow
    Project


    Definition
    Data


    Collection
    Computation


    and Modeling
    Evaluation
    Deploy at


    Scale
    Monitoring
    Data


    Preparation
    Exploratory


    Analysis
    Share


    Results
    Revisit


    Goals
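
A minimal sketch of how three of the stages named on this slide (data preparation, exploratory analysis, evaluation) might look in code, using only the Python standard library. The readings, the function names, and the `max_mean` goal are all hypothetical, chosen for illustration; they do not come from the talk.

```python
import statistics

# Hypothetical raw readings; None and negative values stand in for bad records.
RAW_READINGS = [12.0, None, 15.5, -1.0, 14.0, 13.5]


def prepare(readings):
    """Data preparation: drop missing and invalid (negative) values."""
    return [r for r in readings if r is not None and r >= 0]


def explore(readings):
    """Exploratory analysis: basic descriptive statistics."""
    return {
        "count": len(readings),
        "mean": statistics.mean(readings),
        "stdev": statistics.stdev(readings),
    }


def evaluate(summary, max_mean):
    """Evaluation: compare a metric against a goal (hypothetical threshold)."""
    return summary["mean"] <= max_mean


clean = prepare(RAW_READINGS)
summary = explore(clean)
meets_goal = evaluate(summary, max_mean=15.0)
print(summary, meets_goal)
```

In practice each stage would typically be handled by a dedicated tool rather than a hand-written function; the point here is only that each stage consumes the output of the one before it.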


  18. Challenges
    ‣ Foundation (existing infrastructure to cloud)


    ‣ Variability (DIY to Hosted/Managed Service)


    ‣ Complexity


    ‣ Language ecosystems


    ‣ Growth


  19. Challenges


    (cont.)
    ‣ Best practices / de facto standards


    ‣ Jargon


    ‣ Abstractions


    ‣ Hype
    CRISP-DM
    Attribution: IBM
    Cross-industry standard process for data mining


    1996


  20. 4 Ecosystem Taxonomy
    Business Goals


    People


    Ethics
    Model creation


    Training


    Testing
    Project


    Definition
    Data


    Collection
    Computation


    and Modeling
    Cleaning


    Labeling


    Validating
    Data


    Preparation
    Ingest
    Exploratory


    Analysis
    Descriptive


    statistics


    Visualization
    Evaluation
    Deploy at


    Scale
    Monitoring
    Share


    Results
    Revisit


    Goals
    Charts


    Reports


    Dashboard


    Web app
    Scheduling


    CI/CD


    Platform
    Metrics


    Comparison


    Satisfy goals
    Automation


    Infrastructure


    Model


    Observability
    Technical


    Business


    Ethical


  21. 4 Ecosystem Julia Taxonomy
    Business Goals


    People


    Ethics
    Model creation


    Training


    Testing
    Project


    Definition
    Data


    Collection
    Computation


    and Modeling
    Cleaning


    Labeling


    Validating
    Data


    Preparation
    Ingest
    Exploratory


    Analysis
    Descriptive


    statistics


    Visualization
    Evaluation
    Deploy at


    Scale
    Monitoring
    Share


    Results
    Revisit


    Goals
    Charts


    Reports


    Dashboard


    Web app
    Workflow


    Scheduling


    CI/CD


    Platform
    Metrics


    Comparison


    Satisfy goals
    Automation


    Infrastructure


    Model


    Observability
    Technical


    Business


    Ethical
    DrWatson.jl


    ParameterSchedulers.jl


    Pluto.jl


    IJulia


    JupyterLab


    nteract


    VSCode
    Plots.jl (Viz)


    Gadfly.jl (Viz)


    Makie.jl (Viz - GPU)
    Flux.jl (ML)


    Knet.jl (ML/DL)


    MLJ.jl (ML)


    Mocha.jl (ML/DL)


    TensorFlow.jl (ML/DL wrapper)


    JuMP (optimization)
    DataFrames.jl
    ProgressMeter.jl


  22. 4 Ecosystem Python Taxonomy
    Business Goals


    People


    Ethics
    Model creation


    Training


    Testing
    Project


    Definition
    Data


    Collection
    Computation


    and Modeling
    Cleaning


    Labeling


    Validating
    Data


    Preparation
    Ingest
    Exploratory


    Analysis
    Descriptive


    statistics


    Visualization
    Evaluation
    Deploy at


    Scale
    Monitoring
    Share


    Results
    Revisit


    Goals
    Charts


    Reports


    Dashboard


    Web app
    Workflow


    Scheduling


    CI/CD


    Platform
    Metrics


    Comparison


    Satisfy goals
    Automation


    Infrastructure


    Model


    Observability
    Technical


    Business


    Ethical
    Dask


    JupyterHub


    Binder


    Kubernetes


    papermill


    Dagster


    Airflow


    prefect


    scipy


    statsmodels


    JupyterLab


    nteract


    VSCode
    matplotlib


    seaborn


    altair


    plotly
    numpy


    scikit-learn


    pytorch


    tensorflow
    pandas


    PyJanitor


    dask
    datasette


    evidently
    bokeh


    panel


    voila


    dash


    python scripts
    napari


    geopandas
    feast


    keras


    fastai


    fairlearn



  23. 4 Ecosystem R Taxonomy
    Business Goals


    People


    Ethics
    Model creation


    Training


    Testing
    Project


    Definition
    Data


    Collection
    Computation


    and Modeling
    Cleaning


    Labeling


    Validating
    Data


    Preparation
    Ingest
    Exploratory


    Analysis
    Descriptive


    statistics


    Visualization
    Evaluation
    Deploy at


    Scale
    Monitoring
    Share


    Results
    Revisit


    Goals
    Charts


    Reports


    Dashboard


    Web app
    Scheduling


    CI/CD


    Platform
    Metrics


    Comparison


    Satisfy goals
    Automation


    Infrastructure


    Model


    Observability
    Technical


    Business


    Ethical
    RStudio


    JupyterLab


    IRkernel


    ggplot
    tidyverse


    dplyr


    tidyr


    lubridate


    readr


    readxl


    googlesheets4
    ggplot2


    rmarkdown


    Shiny


    plumber
    purrr


    reticulate


    Keras


    Tensorflow
    sparklyr
    ropensci.org
    knitr


    forcats
    mlr3


    CNTK


    theano


  24. Algorithmic
    Business Thinking
    (ABT)
    5 Management
    Paul McDonagh-Smith


    MIT Sloan School of Management
    https://mitsloan.mit.edu/faculty/directory/paul-mcdonagh-smith
    https://www.youtube.com/watch?v=bqtn2tYg-kw


  25. @WillingCarol
    Justine Dupont surfs the greatest wave of her life in Nazaré, Portugal
    © Rafael G. Riancho / Red Bull Content Pool

    Got data at scale?


    Use open source tools.


  26. web: noteable.io


    email: carol AT noteable.io


    twitter: @WillingCarol


    github: willingc
    Thank you
    The Open Source Data
    Tooling Landscape
    Carol Willing


    VP of Learning


    Noteable


  27. 6 Additional Resources
    https://krzjoa.github.io/awesome-python-data-science/#/
    https://github.com/FavioVazquez/ds-cheatsheets
    https://www.the-modeling-agency.com/crisp-dm.pdf
    https://github.com/academic/awesome-datascience
