Given for Coiled webinar on August 24, 2021.
The Open Source Data Tooling Landscape
Carol Willing
VP of Learning
Noteable
web: noteable.io
email: carol AT noteable.io
twitter: @WillingCarol
github: willingc
1 Today
The 10 Best Practices for Remote Software Engineering
Focusing on the human element of remote software engineer productivity
Vanessa Sochat
DOI: 10.1145/3459613
Attribution: xkcd
2 Data
Common Data Challenges
Exploring Solutions with Open Source Data Tools
SCALE
SPEED
CONNECTIONS
CHOICES
3 People
The Data Pipeline
Perspectives
Attribution: Red Bull
The Data Pipeline
Executives: Opportunity and Fear
Engineers: Infrastructure and Process
Data Scientists: Algorithms and Models
Users: Productivity and Needs
Start small...
Attribution: Red Bull
...and scale.
Justine Dupont surfs the greatest wave of her life in Nazaré, Portugal
© Rafael G. Riancho / Red Bull Content Pool
4 Ecosystem
Open Source Data Tooling Landscape
4 Ecosystem
Programming Languages
Python, R, Julia, Fortran, SQL, C++, Go, Rust, Java, Scala, JavaScript, TypeScript
Data Analysis, Workflows, Interactivity
4 Ecosystem
Data Workflow
Project Definition, Data Collection, Computation and Modeling, Evaluation, Deploy at Scale, Monitoring, Data Preparation, Exploratory Analysis, Share Results, Revisit Goals
Challenges
‣ Foundation (existing infrastructure to cloud)
‣ Variability (DIY to Hosted/Managed Service)
‣ Complexity
‣ Language ecosystems
‣ Growth
Challenges (cont.)
‣ Best practices / de facto standards
‣ Jargon
‣ Abstractions
‣ Hype
CRISP-DM: Cross-industry standard process for data mining (1996)
Attribution: IBM
4 Ecosystem
Taxonomy
Project Definition, Data Collection, Computation and Modeling, Evaluation, Deploy at Scale, Monitoring, Data Preparation, Exploratory Analysis, Share Results, Revisit Goals
Business Goals, People, Ethics, Model creation, Training, Testing, Cleaning, Labeling, Validating, Ingest, Descriptive statistics, Visualization, Charts, Reports, Dashboard, Web app, Scheduling, CI/CD, Platform, Metrics, Comparison, Satisfy goals, Automation, Infrastructure, Model, Observability, Technical, Business, Ethical
4 Ecosystem
Julia Taxonomy
DrWatson.jl, ParameterSchedulers.jl, Pluto.jl, IJulia, JupyterLab, nteract, VS Code, Plots.jl (Viz), Gadfly.jl (Viz), Makie.jl (Viz - GPU), Flux.jl (ML), Knet.jl (ML/DL), MLJ.jl (ML), Mocha.jl (ML/DL), TensorFlow.jl (ML/DL wrapper), JuMP (optimization), DataFrames.jl, ProgressMeter.jl
4 Ecosystem
Python Taxonomy
Dask, JupyterHub, Binder, Kubernetes, papermill, Dagster, Airflow, prefect, scipy, statsmodels, JupyterLab, nteract, VS Code, matplotlib, seaborn, altair, plotly, numpy, scikit-learn, pytorch, tensorflow, pandas, PyJanitor, datasette, evidently, bokeh, panel, voila, dash, python scripts, napari, geopandas, feast, keras, fastai, fairlearn
4 Ecosystem
R Taxonomy
RStudio, JupyterLab, IRkernel, tidyverse, dplyr, tidyr, lubridate, readr, readxl, googlesheets4, ggplot2, rmarkdown, Shiny, plumber, purrr, reticulate, Keras, TensorFlow, sparklyr, ropensci.org, knitr, forcats, mlr3, CNTK, Theano
5 Management
Algorithmic Business Thinking (ABT)
Paul McDonagh-Smith
MIT Sloan School of Management
https://mitsloan.mit.edu/faculty/directory/paul-mcdonagh-smith
https://www.youtube.com/watch?v=bqtn2tYg-kw
Got data at scale? Use open source tools.
Justine Dupont surfs the greatest wave of her life in Nazaré, Portugal
© Rafael G. Riancho / Red Bull Content Pool
Thank you
6 Additional Resources
https://krzjoa.github.io/awesome-python-data-science/#/
https://github.com/FavioVazquez/ds-cheatsheets
https://www.the-modeling-agency.com/crisp-dm.pdf
https://github.com/academic/awesome-datascience