Slide 1

Slide 1 text

Anaconda for GIS professionals
GeoRodeo 2017
Christine Doig, Continuum Analytics
May 19th, 2017
© 2017 Continuum Analytics - Confidential & Proprietary

Slide 2

Slide 2 text

Agenda
• Introduction to Anaconda
• Data Science Development & Deployment with Anaconda Projects
• Anaconda Projects GIS examples
• OSS libraries for GIS professionals:
  • Bokeh, interactive data visualizations
  • Datashader, graphics pipeline system for creating meaningful representations of large amounts of data
  • Dask, flexible parallel computing library for analytics
  • Other libraries: GeoViews and HoloViews

Slide 3

Slide 3 text

Introduction to Anaconda

Slide 4

Slide 4 text

Anaconda, the leading Data Science ecosystem with over 4M users

Slide 5

Slide 5 text

ANACONDA DISTRIBUTION
Python & R distribution with 1000+ curated packages that makes it easy to get started with Data Science
Application areas: Distributed Systems, Business Intelligence, Web, Scientific Computing / HPC, Machine Learning / Statistics
Example packages: Numba, dask, xlwings, Airflow, Blaze

Slide 6

Slide 6 text

https://www.continuum.io/downloads

Slide 7

Slide 7 text

What’s in ANACONDA DISTRIBUTION?

Slide 8

Slide 8 text

• Install data science libraries
  $ conda install pandas
• Manage package versions
  $ conda install pandas=0.14
• Create isolated environments
  $ conda create -n myenv python=3.5 pandas=0.18
• Update package version
  $ conda update pandas

Slide 9

Slide 9 text

…

Slide 10

Slide 10 text

anaconda-project.yml
• Define and manage:
  • project package dependencies
  • deployment commands
  • data
  • …
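
As a rough illustration of how those concerns might map onto a project file, the sketch below assembles the text of a minimal anaconda-project.yml from Python. The project name, package pins, command names, and download URL are hypothetical placeholders. The top-level keys (`packages`, `commands`, `downloads`) and the command types (`bokeh_app`, `notebook`) are assumed from the anaconda-project format and should be checked against its documentation.

```python
# Sketch only: builds the text of a minimal anaconda-project.yml.
# All names, versions, and the URL below are hypothetical placeholders.

def render_project_file(name, packages, commands, downloads):
    """Return YAML text with one section per concern listed on the slide."""
    lines = ["name: " + name, "", "packages:"]  # project package dependencies
    lines += ["  - " + spec for spec in packages]
    lines += ["", "commands:"]  # deployment commands
    for cmd_name, (kind, target) in commands.items():
        lines += ["  " + cmd_name + ":", "    " + kind + ": " + target]
    lines += ["", "downloads:"]  # data
    for var, url in downloads.items():
        lines += ["  " + var + ":", "    url: " + url]
    return "\n".join(lines) + "\n"


project_yaml = render_project_file(
    name="gis-demo",
    packages=["python=3.5", "pandas=0.18", "bokeh"],
    commands={"dashboard": ("bokeh_app", "."), "report": ("notebook", "report.ipynb")},
    downloads={"TAXI_DATA": "https://example.com/taxi.csv"},
)
print(project_yaml)
```

Keeping dependencies, commands, and data in one file is what lets the same project be recreated on another machine from the file alone.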

Slide 11

Slide 11 text

• Launch applications
• Manage package versions and environments
• Create and upload projects

Slide 12

Slide 12 text

Data Science Development and Deployment

Slide 13

Slide 13 text

Explore, Analyze & Collaborate
Biz Analyst, Data Scientists

Slide 14

Slide 14 text

Scale, Deploy & Operate
DevOps, Developer, Data Engineers

Slide 15

Slide 15 text

Challenges in data science development
How do you…
• Download and install data science libraries?
• Manage versions and dependencies?
• Upgrade libraries?
• Isolate dependencies between projects?
WITH ANACONDA DISTRIBUTION & CONDA

Slide 16

Slide 16 text

What do data scientists develop?
Workflows: Data Query; Visualize; Clean & Tidy; Predict, Simulate, & Optimize
• Interactive data visualizations and dashboards
• Jupyter notebooks
• Scripts
• Predictive models
• Processed Data

Slide 17

Slide 17 text

Data Science Development (Laptop)
Languages: Python, R
Libraries: scikit-learn, Bokeh, TensorFlow, Jupyter, pandas, matplotlib, seaborn, dask, numba
Project files: script 1, script 2, script 3, notebook A, dataset Z

Slide 18

Slide 18 text

Challenges in data science development and deployment
How do you…
• Share your data science project with others?
• Ensure that you can reproduce your analysis?
• Deploy your project?
WITH ANACONDA PROJECTS

Slide 19

Slide 19 text

Data Science Development (Laptop): Project 1, Project 2, Project 3
Data Science Deployment (Server): Project 1, Project 2, Project 3

Slide 20

Slide 20 text

Data Science Development (Laptop): Project 1, Project 2, Project 3
Data Science Development and Deployment (Anaconda Enterprise): Project 1, Project 2, Project 3
Containers: Container 1, Container 2, Container 3, Container 4

Slide 21

Slide 21 text

Anaconda Enterprise
• Dependencies
• Data
• Deployment commands
• Security
• Scalability
• Availability

Slide 22

Slide 22 text

Innovator Program
http://go.continuum.io/anaconda-enterprise-innovator/

Slide 23

Slide 23 text

Anaconda Projects GIS examples

Slide 24

Slide 24 text


Slide 25

Slide 25 text


Slide 26

Slide 26 text

Datashading a 1-billion-point OpenStreetMap database

Slide 27

Slide 27 text

2010 US Census data (by population density and race)

Slide 28

Slide 28 text

Gerrymandering (congressional districts and race): Houston

Slide 29

Slide 29 text

Examples available:
• https://anaconda.org/koverholt/projects
  - datashader_nyctaxi
  - deck_gl_geojson
• https://anaconda.org/jbednar/osm-1billion/notebook
• https://anaconda.org/jbednar/census/notebook
• https://anaconda.org/jbednar/census-hv-dask/notebook

Slide 30

Slide 30 text

Bokeh

Slide 31

Slide 31 text

Bokeh
Interactive visualization framework that targets modern web browsers for presentation
• No JavaScript
• Python, R, Scala and Lua bindings
• Easy to embed in web applications
• Server apps: data can be updated, and UI and selection events can be processed to trigger more visual updates.
http://bokeh.pydata.org/en/latest/

Slide 32

Slide 32 text

http://bokeh.pydata.org/en/latest/docs/gallery/texas.html

Slide 33

Slide 33 text

Datashader

Slide 34

Slide 34 text

Motivation
• Visualize large amounts of data in a meaningful way
• Interactively explore the data

Slide 35

Slide 35 text

Datashader
Plotting pitfalls illustrated:
• Overplotting
• Oversaturation
• Undersampling
https://anaconda.org/jbednar/plotting_pitfalls/notebook

Slide 36

Slide 36 text

Datashader
Graphics pipeline system for creating meaningful representations of large amounts of data
• Provides automatic, nearly parameter-free visualization of datasets
• Allows extensive customization of each step in the data-processing pipeline
• Supports automatic downsampling and re-rendering with Bokeh and the Jupyter notebook
• Works well with dask and numba to handle very large datasets in and out of core (with examples using billions of datapoints)
https://github.com/bokeh/datashader
Figure: NYC census data by race
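
The core idea of such a pipeline, aggregating points into a fixed grid before assigning intensities, can be sketched without Datashader itself. The pure-Python sketch below is not the Datashader API: the function names `aggregate_counts` and `shade_log` and the sample points are hypothetical, chosen only to show how counting per cell and then rescaling keeps both dense and sparse regions visible.

```python
import math


def aggregate_counts(points, x_range, y_range, width, height):
    """Count how many (x, y) points fall into each cell of a width x height grid."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # skip points outside the viewport
        # Map the coordinate to a cell index; clamp the upper edge into the last cell.
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        grid[row][col] += 1
    return grid


def shade_log(grid):
    """Map counts to intensities in [0, 1] on a log scale."""
    peak = max(max(row) for row in grid)
    if peak == 0:
        return [[0.0] * len(row) for row in grid]
    return [[math.log1p(c) / math.log1p(peak) for c in row] for row in grid]


# Hypothetical data: 1000 points stacked at one location plus 3 isolated points.
points = [(0.1, 0.1)] * 1000 + [(0.9, 0.1), (0.1, 0.9), (0.9, 0.9)]
counts = aggregate_counts(points, (0.0, 1.0), (0.0, 1.0), width=2, height=2)
intensity = shade_log(counts)
print(counts)
print(intensity)
```

With a linear scale the three isolated cells would receive an intensity of 0.001 and effectively vanish next to the dense cell; on the log scale they land at roughly 0.1. Because the grid has a fixed size, the cost of the shading step depends on the number of cells, not on the number of points.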

Slide 37

Slide 37 text

Dask

Slide 38

Slide 38 text

Motivation
• Data is larger than the laptop's memory
• The solution should look similar to pandas

Slide 39

Slide 39 text

Dask Dataframes

# pandas: the whole file is loaded into memory
>>> import pandas as pd
>>> df = pd.read_csv('iris.csv')
>>> df.head()
   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
# the filter value must match the species labels shown above
>>> max_sepal_length_setosa = df[df.species == 'Iris-setosa'].sepal_length.max()
>>> max_sepal_length_setosa
5.7999999999999998

# dask: same API, reading many files lazily
>>> import dask.dataframe as dd
>>> ddf = dd.read_csv('*.csv')
>>> ddf.head()
   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
…
>>> d_max_sepal_length_setosa = ddf[ddf.species == 'Iris-setosa'].sepal_length.max()
# nothing is computed until .compute() is called
>>> d_max_sepal_length_setosa.compute()
5.7999999999999998

Dask dataframes look and feel like pandas dataframes, but operate on datasets larger than memory using multiple threads
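
To illustrate the idea behind that last statement, here is a minimal standard-library sketch of a partitioned reduction: each chunk of rows is filtered and reduced on its own thread, and the per-chunk results are then combined. It is not Dask's API or implementation. The helper names and the two in-memory CSV partitions (built from the five rows shown in the slide) are assumptions made for illustration.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

HEADER = "sepal_length,sepal_width,petal_length,petal_width,species\n"
# Two hypothetical partitions standing in for separate CSV files.
PARTITIONS = [
    HEADER + "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n",
    HEADER + "4.7,3.2,1.3,0.2,Iris-setosa\n4.6,3.1,1.5,0.2,Iris-setosa\n"
             "5.0,3.6,1.4,0.2,Iris-setosa\n",
]


def partition_max(text, species):
    """Max sepal_length for one partition, or None if no row matches."""
    rows = csv.DictReader(io.StringIO(text))
    values = [float(r["sepal_length"]) for r in rows if r["species"] == species]
    return max(values) if values else None


def partitioned_max(partitions, species):
    """Reduce each partition on a worker thread, then combine the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda p: partition_max(p, species), partitions))
    partials = [p for p in partials if p is not None]
    return max(partials) if partials else None


result = partitioned_max(PARTITIONS, "Iris-setosa")
print(result)
```

In a real setting each worker would read its own file, so only one partition per worker needs to be in memory at a time. Dask additionally records the work as a task graph and defers execution until `.compute()` is called.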

Slide 40

Slide 40 text

Distributed
http://distributed.readthedocs.io/en/latest/
Distributed is a lightweight library for distributed computing in Python. It extends dask APIs to moderate-sized clusters.

Slide 41

Slide 41 text

Web UI
Dask.distributed includes a web interface that delivers information about the current state of the network. It helps to track progress, identify performance issues, and debug failures over a normal web page in real time.

Slide 42

Slide 42 text

HoloViews & GeoViews

Slide 43

Slide 43 text

HoloViews is a Python library that makes analyzing and visualizing scientific or engineering data much simpler, more intuitive, and more easily reproducible.
http://holoviews.org/index.html

Slide 44

Slide 44 text

GeoViews is a Python library that makes it easy to explore and visualize geographical, meteorological, and oceanographic datasets, such as those used in weather, climate, and remote sensing research.
http://geo.holoviews.org/

Slide 45

Slide 45 text

Resources
• Bokeh documentation: http://bokeh.pydata.org/en/latest/
• Bokeh demos: https://demo.bokehplots.com/
• Datashader documentation: http://datashader.readthedocs.org/
• Bokeh + datashader tutorial: https://github.com/bokeh/bokeh-notebooks
• Bokeh webinar: http://go.continuum.io/hassle-free-data-science-apps/
• Datashader webinar: http://go.continuum.io/datashader/
• GeoViews blogpost: https://www.continuum.io/blog/developer-blog/introducing-geoviews
• GeoViews documentation: http://geo.holoviews.org/index.html
• HoloViews documentation: http://holoviews.org/index.html