The Open Source Data Tooling Landscape

Given for Coiled webinar on August 24, 2021.

Carol Willing

Transcript

  1. The Open Source Data Tooling Landscape Carol Willing VP of

    Learning Noteable web: noteable.io email: carol AT noteable.io twitter: @WillingCarol github: willingc
  2. Headline Slide Sub-headline The 10 Best Practices for Remote Software

    Engineering Focusing on the human element of remote software engineer productivity Vanessa Sochat DOI:10.1145/3459613 Attribution: xkcd 1 Today
  3. Common Data Challenges Exploring Solutions with Open Source Data Tools

    2 Data
  4. SCALE

  5. SPEED

  6. CONNECTIONS

  7. CHOICES

  8. The Data Pipeline Perspectives Attribution: Red Bull 3 People

  9. The Data Pipeline Executives Opportunity and Fear

  10. The Data Pipeline Engineers Infrastructure and Process Executives Opportunity and

    Fear
  11. The Data Pipeline Engineers Infrastructure and Process Data Scientists Algorithms

    and Models Executives Opportunity and Fear
  12. The Data Pipeline Engineers Infrastructure and Process Data Scientists Algorithms

    and Models Executives Opportunity and Fear Users Productivity and Needs
  13. Attribution: Red Bull Start small...

  14. @WillingCarol 14 Justine Dupont surfs the greatest wave of her

    life in Nazaré, Portugal © Rafael G. Riancho / Red Bull Content Pool ...and scale.
  15. Open Source Data Tooling Landscape 4 Ecosystem

  16. Python R Julia Fortran SQL C++ Go Rust Java Scala

    4 Ecosystem Programming Languages JavaScript TypeScript Data Analysis Workflows Interactivity
  17. 4 Ecosystem Data Workflow Project Definition Data Collection

    Computation and Modeling Evaluation Deploy at Scale Monitoring Data Preparation Exploratory Analysis Share Results Revisit Goals
  18. Challenges ‣ Foundation (existing infrastructure to cloud) ‣ Variability (DIY

    to Hosted/Managed Service) ‣ Complexity ‣ Language ecosystems ‣ Growth
  19. Challenges (cont.) ‣ Best practices / de facto standards ‣

    Jargon ‣ Abstractions ‣ Hype CRISP-DM Attribution: IBM Cross-industry standard process for data mining 1996
  20. 4 Ecosystem Taxonomy Business Goals People Ethics Model creation Training

    Testing Project Definition Data Collection Computation and Modeling Cleaning Labeling Validating Data Preparation Ingest Exploratory Analysis Descriptive statistics Visualization Evaluation Deploy at Scale Monitoring Share Results Revisit Goals Charts Reports Dashboard Web app Scheduling CI/CD Platform Metrics Comparison Satisfy goals Automation Infrastructure Model Observability Technical Business Ethical
  21. 4 Ecosystem Julia Taxonomy Business Goals People Ethics Model creation

    Training Testing Project Definition Data Collection Computation and Modeling Cleaning Labeling Validating Data Preparation Ingest Exploratory Analysis Descriptive statistics Visualization Evaluation Deploy at Scale Monitoring Share Results Revisit Goals Charts Reports Dashboard Web app Workflow Scheduling CI/CD Platform Metrics Comparison Satisfy goals Automation Infrastructure Model Observability Technical Business Ethical DrWatson.jl ParameterSchedulers.jl Pluto.jl IJulia JupyterLab nteract VSCode Plots.jl (Viz) Gadfly.jl (Viz) Makie.jl (Viz - GPU) Flux.jl (ML) Knet.jl (ML/DL) MLJ.jl (ML) Mocha.jl (ML/DL) TensorFlow.jl (ML/DL wrapper) JuMP (optimization) DataFrames.jl ProgressMeter.jl
  22. 4 Ecosystem Python Taxonomy Business Goals People Ethics Model creation

    Training Testing Project Definition Data Collection Computation and Modeling Cleaning Labeling Validating Data Preparation Ingest Exploratory Analysis Descriptive statistics Visualization Evaluation Deploy at Scale Monitoring Share Results Revisit Goals Charts Reports Dashboard Web app Workflow Scheduling CI/CD Platform Metrics Comparison Satisfy goals Automation Infrastructure Model Observability Technical Business Ethical Dask JupyterHub Binder Kubernetes papermill Dagster Airflow prefect scipy statsmodels JupyterLab nteract VSCode matplotlib seaborn altair plotly numpy scikit-learn pytorch tensorflow pandas PyJanitor dask datasette evidently bokeh panel voila dash python scripts napari geopandas feast keras fastai fairlearn
  23. 4 Ecosystem R Taxonomy Business Goals People Ethics Model creation

    Training Testing Project Definition Data Collection Computation and Modeling Cleaning Labeling Validating Data Preparation Ingest Exploratory Analysis Descriptive statistics Visualization Evaluation Deploy at Scale Monitoring Share Results Revisit Goals Charts Reports Dashboard Web app Scheduling CI/CD Platform Metrics Comparison Satisfy goals Automation Infrastructure Model Observability Technical Business Ethical RStudio JupyterLab IRkernel ggplot tidyverse dplyr tidyr lubridate readr readxl googlesheets4 ggplot2 rmarkdown Shiny plumber purrr reticulate Keras Tensorflow sparklyr ropensci.org knitr forcats mlr3 CNTK Theano
  24. Algorithmic Business Thinking (ABT) 5 Management Paul McDonagh-Smith MIT Sloan

    School of Management https://mitsloan.mit.edu/faculty/directory/paul-mcdonagh-smith https://www.youtube.com/watch?v=bqtn2tYg-kw
  25. @WillingCarol 25 Justine Dupont surfs the greatest wave of her

    life in Nazaré, Portugal © Rafael G. Riancho / Red Bull Content Pool Got data at scale? Use open source tools.
  26. web: noteable.io email: carol AT noteable.io twitter: @WillingCarol github: willingc

    Thank you The Open Source Data Tooling Landscape Carol Willing VP of Learning Noteable
  27. 6 Additional Resources https://krzjoa.github.io/awesome-python-data-science/#/ https://github.com/FavioVazquez/ds-cheatsheets https://www.the-modeling-agency.com/crisp-dm.pdf https://github.com/academic/awesome-datascience