The Open Source Data Tooling Landscape

Given for Coiled webinar on August 24, 2021.

Carol Willing

August 24, 2021

Transcript

  1. The Open Source Data Tooling Landscape

    Carol Willing, VP of Learning, Noteable
    web: noteable.io, email: carol AT noteable.io, twitter: @WillingCarol, github: willingc
  2. 1 Today: The 10 Best Practices for Remote Software Engineering

    Focusing on the human element of remote software engineer productivity
    Vanessa Sochat, DOI:10.1145/3459613
    Attribution: xkcd
  3. The Data Pipeline

    ‣ Engineers: Infrastructure and Process
    ‣ Data Scientists: Algorithms and Models
    ‣ Executives: Opportunity and Fear
    ‣ Users: Productivity and Needs
  4. ...and scale.

    Justine Dupont surfs the greatest wave of her life in Nazaré, Portugal © Rafael G. Riancho / Red Bull Content Pool
  5. 4 Ecosystem: Programming Languages

    Python, R, Julia, Fortran, SQL, C++, Go, Rust, Java, Scala, JavaScript, TypeScript
    Data Analysis, Workflows, Interactivity
  6. 4 Ecosystem: Data Workflow

    Project Definition, Data Collection, Computation and Modeling, Evaluation, Deploy at Scale, Monitoring, Data Preparation, Exploratory Analysis, Share Results, Revisit Goals
  7. Challenges

    ‣ Foundation (existing infrastructure to cloud)
    ‣ Variability (DIY to Hosted/Managed Service)
    ‣ Complexity
    ‣ Language ecosystems
    ‣ Growth
  8. Challenges (cont.)

    ‣ Best practices / de facto standards
    ‣ Jargon
    ‣ Abstractions
    ‣ Hype
    CRISP-DM: Cross-industry standard process for data mining, 1996. Attribution: IBM
  9. 4 Ecosystem: Taxonomy

    Business Goals People Ethics Model creation Training Testing Project Definition Data Collection Computation and Modeling Cleaning Labeling Validating Data Preparation Ingest Exploratory Analysis Descriptive statistics Visualization Evaluation Deploy at Scale Monitoring Share Results Revisit Goals Charts Reports Dashboard Web app Scheduling CI/CD Platform Metrics Comparison Satisfy goals Automation Infrastructure Model Observability Technical Business Ethical
  10. 4 Ecosystem: Julia

    Taxonomy: Business Goals People Ethics Model creation Training Testing Project Definition Data Collection Computation and Modeling Cleaning Labeling Validating Data Preparation Ingest Exploratory Analysis Descriptive statistics Visualization Evaluation Deploy at Scale Monitoring Share Results Revisit Goals Charts Reports Dashboard Web app Workflow Scheduling CI/CD Platform Metrics Comparison Satisfy goals Automation Infrastructure Model Observability Technical Business Ethical
    DrWatson.jl, ParameterSchedulers.jl, Pluto.jl, IJulia, JupyterLab, nteract, VSCode, Plots.jl (Viz), Gadfly.jl (Viz), Makie.jl (Viz - GPU), Flux.jl (ML), Knet.jl (ML/BL), MLJ.jl (ML), Mocha.jl (ML/DL), TensorFlow.jl (ML/DL wrapper), JuMP (optimization), DataFrames.jl, ProgressMeter.jl
  11. 4 Ecosystem: Python

    Taxonomy: Business Goals People Ethics Model creation Training Testing Project Definition Data Collection Computation and Modeling Cleaning Labeling Validating Data Preparation Ingest Exploratory Analysis Descriptive statistics Visualization Evaluation Deploy at Scale Monitoring Share Results Revisit Goals Charts Reports Dashboard Web app Workflow Scheduling CI/CD Platform Metrics Comparison Satisfy goals Automation Infrastructure Model Observability Technical Business Ethical
    Dask, JupyterHub, Binder, Kubernetes, papermill, Dagster, Airflow, prefect, scipy, statsmodels, JupyterLab, nteract, VSCode, matplotlib, seaborn, altair, plotly, numpy, scikit-learn, pytorch, tensorflow, pandas, PyJanitor, dask, datasette, evidently, bokeh, panel, voila, dash, python scripts, napari, geopandas, feast, keras, fastai, fairlearn
  12. 4 Ecosystem: R

    Taxonomy: Business Goals People Ethics Model creation Training Testing Project Definition Data Collection Computation and Modeling Cleaning Labeling Validating Data Preparation Ingest Exploratory Analysis Descriptive statistics Visualization Evaluation Deploy at Scale Monitoring Share Results Revisit Goals Charts Reports Dashboard Web app Scheduling CI/CD Platform Metrics Comparison Satisfy goals Automation Infrastructure Model Observability Technical Business Ethical
    RStudio, JupyterLab, IRkernel, ggplot, tidyverse, dplyr, tidyr, lubridate, readr, readxl, googlesheets4, ggplot2, rmarkdown, Shiny, plumber, purrr, reticulate, Keras, Tensorflow, sparklyr, ropensci.org, knitr, forcats, mlr3, CNTK, Theano
  13. 5 Management: Algorithmic Business Thinking (ABT)

    Paul McDonagh-Smith, MIT Sloan School of Management
    https://mitsloan.mit.edu/faculty/directory/paul-mcdonagh-smith
    https://www.youtube.com/watch?v=bqtn2tYg-kw
  14. Got data at scale? Use open source tools.

    Justine Dupont surfs the greatest wave of her life in Nazaré, Portugal © Rafael G. Riancho / Red Bull Content Pool
  15. Thank you

    The Open Source Data Tooling Landscape
    Carol Willing, VP of Learning, Noteable
    web: noteable.io, email: carol AT noteable.io, twitter: @WillingCarol, github: willingc