Slide 1

Slide 1 text

The Open Source Data Tooling Landscape
Carol Willing
VP of Learning, Noteable
web: noteable.io
email: carol AT noteable.io
twitter: @WillingCarol
github: willingc

Slide 2

Slide 2 text

1 Today
The 10 Best Practices for Remote Software Engineering
Focusing on the human element of remote software engineer productivity
Vanessa Sochat, DOI:10.1145/3459613
Attribution: xkcd

Slide 3

Slide 3 text

2 Data
Common Data Challenges
Exploring Solutions with Open Source Data Tools

Slide 4

Slide 4 text

SCALE

Slide 5

Slide 5 text

SPEED

Slide 6

Slide 6 text

CONNECTIONS

Slide 7

Slide 7 text

CHOICES

Slide 8

Slide 8 text

3 People
The Data Pipeline Perspectives
Attribution: Red Bull

Slide 9

Slide 9 text

The Data Pipeline
Executives: Opportunity and Fear

Slide 10

Slide 10 text

The Data Pipeline
Engineers: Infrastructure and Process
Executives: Opportunity and Fear

Slide 11

Slide 11 text

The Data Pipeline
Engineers: Infrastructure and Process
Data Scientists: Algorithms and Models
Executives: Opportunity and Fear

Slide 12

Slide 12 text

The Data Pipeline
Engineers: Infrastructure and Process
Data Scientists: Algorithms and Models
Executives: Opportunity and Fear
Users: Productivity and Needs

Slide 13

Slide 13 text

Start small...
Attribution: Red Bull

Slide 14

Slide 14 text

...and scale.
Justine Dupont surfs the greatest wave of her life in Nazaré, Portugal © Rafael G. Riancho / Red Bull Content Pool

Slide 15

Slide 15 text

4 Ecosystem
Open Source Data Tooling Landscape

Slide 16

Slide 16 text

4 Ecosystem
Programming Languages
Python, R, Julia, Fortran, SQL, C++, Go, Rust, Java, Scala, JavaScript, TypeScript
Data Analysis, Workflows, Interactivity

Slide 17

Slide 17 text

4 Ecosystem
Data Workflow
Project Definition, Data Collection, Computation and Modeling, Evaluation, Deploy at Scale, Monitoring, Data Preparation, Exploratory Analysis, Share Results, Revisit Goals

Slide 18

Slide 18 text

Challenges
‣ Foundation (existing infrastructure to cloud)
‣ Variability (DIY to Hosted/Managed Service)
‣ Complexity
‣ Language ecosystems
‣ Growth

Slide 19

Slide 19 text

Challenges (cont.)
‣ Best practices / de facto standards
‣ Jargon
‣ Abstractions
‣ Hype
CRISP-DM: Cross-industry standard process for data mining (1996)
Attribution: IBM

Slide 20

Slide 20 text

4 Ecosystem
Taxonomy
Business Goals People Ethics Model creation Training Testing Project Definition Data Collection Computation and Modeling Cleaning Labeling Validating Data Preparation Ingest Exploratory Analysis Descriptive statistics Visualization Evaluation Deploy at Scale Monitoring Share Results Revisit Goals Charts Reports Dashboard Web app Scheduling CI/CD Platform Metrics Comparison Satisfy goals Automation Infrastructure Model Observability Technical Business Ethical

Slide 21

Slide 21 text

4 Ecosystem
Julia
Taxonomy Business Goals People Ethics Model creation Training Testing Project Definition Data Collection Computation and Modeling Cleaning Labeling Validating Data Preparation Ingest Exploratory Analysis Descriptive statistics Visualization Evaluation Deploy at Scale Monitoring Share Results Revisit Goals Charts Reports Dashboard Web app Workflow Scheduling CI/CD Platform Metrics Comparison Satisfy goals Automation Infrastructure Model Observability Technical Business Ethical
DrWatson.jl, ParameterSchedulers.jl, Pluto.jl, IJulia, JupyterLab, nteract, VSCode, Plots.jl (Viz), Gadfly.jl (Viz), Makie.jl (Viz - GPU), Flux.jl (ML), Knet.jl (ML/DL), MLJ.jl (ML), Mocha.jl (ML/DL), TensorFlow.jl (ML/DL wrapper), JuMP (optimization), DataFrames.jl, ProgressMeter.jl

Slide 22

Slide 22 text

4 Ecosystem
Python
Taxonomy Business Goals People Ethics Model creation Training Testing Project Definition Data Collection Computation and Modeling Cleaning Labeling Validating Data Preparation Ingest Exploratory Analysis Descriptive statistics Visualization Evaluation Deploy at Scale Monitoring Share Results Revisit Goals Charts Reports Dashboard Web app Workflow Scheduling CI/CD Platform Metrics Comparison Satisfy goals Automation Infrastructure Model Observability Technical Business Ethical
Dask, JupyterHub, Binder, Kubernetes, papermill, Dagster, Airflow, prefect, scipy, statsmodels, JupyterLab, nteract, VSCode, matplotlib, seaborn, altair, plotly, numpy, scikit-learn, pytorch, tensorflow, pandas, PyJanitor, dask, datasette, evidently, bokeh, panel, voila, dash, python scripts, napari, geopandas, feast, keras, fastai, fairlearn

Slide 23

Slide 23 text

4 Ecosystem
R
Taxonomy Business Goals People Ethics Model creation Training Testing Project Definition Data Collection Computation and Modeling Cleaning Labeling Validating Data Preparation Ingest Exploratory Analysis Descriptive statistics Visualization Evaluation Deploy at Scale Monitoring Share Results Revisit Goals Charts Reports Dashboard Web app Scheduling CI/CD Platform Metrics Comparison Satisfy goals Automation Infrastructure Model Observability Technical Business Ethical
RStudio, JupyterLab, IRkernel, ggplot, tidyverse, dplyr, tidyr, lubridate, readr, readxl, googlesheets4, ggplot2, rmarkdown, Shiny, plumber, purrr, reticulate, Keras, Tensorflow, sparklyr, ropensci.org, knitr, forcats, mlr3, CNTK, Theano

Slide 24

Slide 24 text

5 Management
Algorithmic Business Thinking (ABT)
Paul McDonagh-Smith, MIT Sloan School of Management
https://mitsloan.mit.edu/faculty/directory/paul-mcdonagh-smith
https://www.youtube.com/watch?v=bqtn2tYg-kw

Slide 25

Slide 25 text

Got data at scale? Use open source tools.
Justine Dupont surfs the greatest wave of her life in Nazaré, Portugal © Rafael G. Riancho / Red Bull Content Pool

Slide 26

Slide 26 text

Thank you
The Open Source Data Tooling Landscape
Carol Willing
VP of Learning, Noteable
web: noteable.io
email: carol AT noteable.io
twitter: @WillingCarol
github: willingc

Slide 27

Slide 27 text

6 Additional Resources
https://krzjoa.github.io/awesome-python-data-science/#/
https://github.com/FavioVazquez/ds-cheatsheets
https://www.the-modeling-agency.com/crisp-dm.pdf
https://github.com/academic/awesome-datascience