GeoRodeo 2017 Anaconda

Christine Doig

May 19, 2017

Transcript

  1. Anaconda for GIS professionals

    GeoRodeo 2017
    Christine Doig, Continuum Analytics
    May 19th, 2017
    © 2017 Continuum Analytics - Confidential & Proprietary
  2. Agenda

    • Introduction to Anaconda
    • Data Science Development & Deployment with Anaconda Projects
    • Anaconda Projects GIS examples
    • OSS libraries for GIS professionals:
      • Bokeh, interactive data visualizations
      • Datashader, graphics pipeline system for creating meaningful representations of large amounts of data
      • Dask, flexible parallel computing library for analytics
      • Other libraries: GeoViews and HoloViews
  3. Introduction to Anaconda

  4. Anaconda

    The leading Data Science ecosystem, with over 4M users
  5. Anaconda Distribution

    Python & R distribution with 1000+ curated packages that makes it easy to get started with Data Science
    Areas shown: Distributed Systems, Business Intelligence, Web, Scientific Computing / HPC, Machine Learning / Statistics
    Packages shown: Numba, dask, xlwings, Airflow, Blaze
  6. https://www.continuum.io/downloads

  7. What’s in Anaconda Distribution?
  8. conda

    • Install data science libraries: $ conda install pandas
    • Manage package versions: $ conda install pandas=0.14
    • Create isolated environments: $ conda create -n myenv python=3.5 pandas=0.18
    • Update package version: $ conda update pandas
  9. …

  10. anaconda-project.yml

    Define and manage:
    • project package dependencies
    • deployment commands
    • data
    • …
  11. Launch applications • Manage package versions and environments • Create and upload projects
  12. Data Science Development and Deployment

  13. Explore, Analyze & Collaborate

    Biz Analyst, Data Scientists
  14. Scale, Deploy & Operate

    DevOps, Developer, Data Engineers
  15. Challenges in data science development

    How do you…
    • Download and install data science libraries?
    • Manage versions and dependencies?
    • Upgrade libraries?
    • Isolate dependencies between projects?
    With Anaconda Distribution & conda
  16. What do data scientists develop?

    Workflows: Data, Query, Visualize, Clean & Tidy, Predict, Simulate, & Optimize
    • Interactive data visualizations and dashboards
    • Jupyter notebooks
    • Scripts
    • Predictive models
    • Processed Data
  17. Data Science Development (Laptop)

    Languages: Python, R
    Libraries: scikit-learn, Bokeh, TensorFlow, Jupyter, pandas, matplotlib, seaborn, dask, numba
    Project contents: script 1, script 2, script 3, notebook A, dataset Z
  18. Challenges in data science development and deployment

    How do you…
    • Share your data science project with others?
    • Ensure that you can reproduce your analysis?
    • Deploy your project?
    With Anaconda Projects
  19. Data Science Development and Data Science Deployment

    Laptop (Data Science Development): Project 1, Project 2, Project 3
    Server (Data Science Deployment): Project 1, Project 2, Project 3
  20. Data Science Development and Deployment

    Laptop (Data Science Development): Project 1, Project 2, Project 3
    Anaconda Enterprise (Data Science Development and Deployment): Project 1, Project 2, Project 3; Container 1, Container 2, Container 3, Container 4
  21. Anaconda Enterprise

    • Dependencies
    • Data
    • Deployment commands
    • Security
    • Scalability
    • Availability
  22. Innovator Program

    http://go.continuum.io/anaconda-enterprise-innovator/
  23. Anaconda Projects GIS examples

  24. [image only]

  25. [image only]

  26. Datashading a 1-billion-point OpenStreetMap database
  27. 2010 US Census data (by population density and race)
  28. Gerrymandering (congressional districts and race)

    Houston
  29. Examples available:

    • https://anaconda.org/koverholt/projects
      - datashader_nyctaxi
      - deck_gl_geojson
    • https://anaconda.org/jbednar/osm-1billion/notebook
    • https://anaconda.org/jbednar/census/notebook
    • https://anaconda.org/jbednar/census-hv-dask/notebook
  30. Bokeh

  31. Bokeh

    Interactive visualization framework that targets modern web browsers for presentation
    • No JavaScript required
    • Python, R, Scala and Lua bindings
    • Easy to embed in web applications
    • Server apps: data can be updated, and UI and selection events can be processed to trigger more visual updates
    http://bokeh.pydata.org/en/latest/
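A minimal sketch of the Python binding, assuming a recent Bokeh release is installed. The sample coordinates and titles are hypothetical and are not taken from the talk; the block only illustrates the "no JavaScript, easy to embed" points by building a plot in Python and rendering it to a standalone HTML string.

```python
from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import CDN

# Hypothetical sample data: approximate (longitude, latitude) of three Texas cities.
lons = [-97.74, -95.37, -96.80]
lats = [30.27, 29.76, 32.78]

# Build the plot entirely in Python; Bokeh generates the browser-side code.
plot = figure(title="Sample points",
              x_axis_label="longitude",
              y_axis_label="latitude")
plot.scatter(lons, lats, size=10)

# Render a self-contained HTML page that loads BokehJS from the CDN.
# The resulting string can be written to a file or served by a web application.
html = file_html(plot, CDN, "Sample points")
```

Writing `html` to a file and opening it in a browser should show an interactive plot with pan and zoom tools.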
  32. http://bokeh.pydata.org/en/latest/docs/gallery/texas.html

  33. Datashader

  34. Motivation

    • Visualize large amounts of data in a meaningful way
    • Interactively explore the data
  35. Datashader

    Plotting pitfalls: Overplotting, Oversaturation, Undersampling
    https://anaconda.org/jbednar/plotting_pitfalls/notebook
  36. Datashader

    Graphics pipeline system for creating meaningful representations of large amounts of data
    • Provides automatic, nearly parameter-free visualization of datasets
    • Allows extensive customization of each step in the data-processing pipeline
    • Supports automatic downsampling and re-rendering with Bokeh and the Jupyter notebook
    • Works well with dask and numba to handle very large datasets in and out of core (with examples using billions of datapoints)
    https://github.com/bokeh/datashader
    Image: NYC census data by race
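The general idea behind such a pipeline (aggregate points into a fixed pixel grid, then map the per-pixel counts to shades) can be sketched in plain Python. This is an illustration of the technique only, not Datashader's API or implementation: the function names `aggregate_counts` and `shade_log`, the character ramp, and the random sample data are all hypothetical, and a log scale is chosen here simply because it keeps dense cells from saturating.

```python
import math
import random


def aggregate_counts(points, x_range, y_range, width, height):
    """Count how many points fall into each cell of a width x height grid."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # point lies outside the viewport
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        grid[row][col] += 1
    return grid


def shade_log(grid, levels=" .:-=+*#%@"):
    """Map counts to characters on a log scale; empty cells map to a space."""
    peak = max(max(row) for row in grid)
    if peak == 0:
        return [levels[0] * len(row) for row in grid]
    top = math.log1p(peak)
    last = len(levels) - 1
    return ["".join(levels[round(math.log1p(c) / top * last)] for c in row)
            for row in grid]


# Hypothetical data: 100,000 points from a 2D normal distribution.
random.seed(0)
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100_000)]

grid = aggregate_counts(points, (-3, 3), (-3, 3), width=40, height=20)
for line in reversed(shade_log(grid)):  # print with the y axis pointing up
    print(line)
```

Because the output size depends only on the grid, not on the number of points, the same approach scales to far larger inputs than drawing one marker per point, and every point contributes to the picture instead of being hidden by overplotting.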
  37. Dask

  38. Motivation

    • Data is larger than the laptop's memory
    • The solution should look similar to a pandas solution
  39. Dask Dataframes

    >>> import pandas as pd
    >>> df = pd.read_csv('iris.csv')
    >>> df.head()
       sepal_length  sepal_width  petal_length  petal_width      species
    0           5.1          3.5           1.4          0.2  Iris-setosa
    1           4.9          3.0           1.4          0.2  Iris-setosa
    2           4.7          3.2           1.3          0.2  Iris-setosa
    3           4.6          3.1           1.5          0.2  Iris-setosa
    4           5.0          3.6           1.4          0.2  Iris-setosa
    >>> # The species label must match the data exactly ('Iris-setosa')
    >>> max_sepal_length_setosa = df[df.species == 'Iris-setosa'].sepal_length.max()
    >>> max_sepal_length_setosa
    5.7999999999999998

    >>> import dask.dataframe as dd
    >>> ddf = dd.read_csv('*.csv')  # one logical dataframe over many CSV files
    >>> ddf.head()
       sepal_length  sepal_width  petal_length  petal_width      species
    0           5.1          3.5           1.4          0.2  Iris-setosa
    1           4.9          3.0           1.4          0.2  Iris-setosa
    2           4.7          3.2           1.3          0.2  Iris-setosa
    3           4.6          3.1           1.5          0.2  Iris-setosa
    4           5.0          3.6           1.4          0.2  Iris-setosa
    …
    >>> # Lazy: this only builds a task graph; .compute() runs it
    >>> d_max_sepal_length_setosa = ddf[ddf.species == 'Iris-setosa'].sepal_length.max()
    >>> d_max_sepal_length_setosa.compute()
    5.7999999999999998

    Dask dataframes look and feel like pandas dataframes, but operate on datasets larger than memory using multiple threads
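Why a maximum can be computed over data that does not fit in memory can be sketched with the standard library alone: process one chunk at a time, keep a small partial result per chunk, and combine the partials at the end. This is a conceptual illustration, not Dask's implementation; the helper names `iter_chunks`, `chunk_max`, and `chunked_max`, the tiny chunk size, and the inline sample rows are hypothetical.

```python
import csv
import io
from itertools import islice

# Hypothetical sample rows in the same column layout as the slide's iris.csv.
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
5.8,4.0,1.2,0.2,Iris-setosa
6.3,3.3,6.0,2.5,Iris-virginica
"""


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows, so only one chunk is held at a time."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def chunk_max(chunk, species):
    """Partial result for one chunk: max sepal_length for the species, or None."""
    values = [float(r["sepal_length"]) for r in chunk if r["species"] == species]
    return max(values) if values else None


def chunked_max(source, species, chunk_size=2):
    """Combine the per-chunk maxima into the overall maximum."""
    reader = csv.DictReader(source)
    partials = (chunk_max(c, species) for c in iter_chunks(reader, chunk_size))
    found = [p for p in partials if p is not None]
    return max(found) if found else None


result = chunked_max(io.StringIO(SAMPLE_CSV), "Iris-setosa")
print(result)  # 5.8
```

The per-chunk step is independent of the other chunks, which is what makes it possible to run such steps on multiple threads or machines.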
  40. Distributed

    http://distributed.readthedocs.io/en/latest/
    Distributed is a lightweight library for distributed computing in Python. It extends dask APIs to moderate-sized clusters.
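The submit-and-gather style of distributed task execution can be sketched with Python's standard `concurrent.futures`, used here as a local stand-in rather than the Distributed library itself. The partition values and the `partition_max` helper are hypothetical; with a real cluster, the assumption is that a `dask.distributed` `Client` would take the place of the executor.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_max(values):
    """Work done independently on one partition of the data."""
    return max(values)


# Hypothetical partitions of a larger column of sepal lengths.
partitions = [[5.1, 4.9], [4.7, 5.8], [5.0, 4.6]]

with ThreadPoolExecutor(max_workers=3) as pool:
    # Submit one task per partition; each call returns a future immediately.
    futures = [pool.submit(partition_max, part) for part in partitions]
    # Gather the partial results once the tasks have finished.
    partial_maxima = [f.result() for f in futures]

overall_max = max(partial_maxima)
print(overall_max)  # 5.8
```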
  41. Web UI

    Dask.distributed includes a web interface that reports the current state of the network. It helps to track progress, identify performance issues, and debug failures over a normal web page in real time.
  42. HoloViews & GeoViews

  43. HoloViews

    HoloViews is a Python library that makes analyzing and visualizing scientific or engineering data much simpler, more intuitive, and more easily reproducible.
    http://holoviews.org/index.html
  44. GeoViews

    GeoViews is a Python library that makes it easy to explore and visualize geographical, meteorological, and oceanographic datasets, such as those used in weather, climate, and remote sensing research.
    http://geo.holoviews.org/
  45. Resources

    • Bokeh documentation: http://bokeh.pydata.org/en/latest/
    • Bokeh demos: https://demo.bokehplots.com/
    • Datashader documentation: http://datashader.readthedocs.org/
    • Bokeh + datashader tutorial: https://github.com/bokeh/bokeh-notebooks
    • Bokeh webinar: http://go.continuum.io/hassle-free-data-science-apps/
    • Datashader webinar: http://go.continuum.io/datashader/
    • GeoViews blogpost: https://www.continuum.io/blog/developer-blog/introducing-geoviews
    • GeoViews documentation: http://geo.holoviews.org/index.html
    • HoloViews documentation: http://holoviews.org/index.html