
GeoRodeo 2017 Anaconda


Christine Doig

May 19, 2017

Transcript

  1. © 2016 Continuum Analytics - Confidential & Proprietary
    © 2017 Continuum Analytics - Confidential & Proprietary
    Anaconda
    for GIS professionals
    GeoRodeo 2017
    Christine Doig, Continuum Analytics
    May 19th, 2017


  2. Agenda
    • Introduction to Anaconda
    • Data Science Development & Deployment with Anaconda Projects
    • Anaconda Projects GIS examples
    • OSS libraries for GIS professionals:
      • Bokeh, interactive data visualizations
      • Datashader, graphics pipeline system for creating meaningful representations of large amounts of data
      • Dask, flexible parallel computing library for analytics
      • Other libraries: GeoViews and HoloViews

  3. Introduction to Anaconda


  4. Anaconda, the leading Data Science ecosystem with over 4M users

  5. ANACONDA DISTRIBUTION
    Python & R distribution with 1000+ curated packages that makes it easy to get started with Data Science
    Domains: Distributed Systems, Business Intelligence, Web, Scientific Computing / HPC, Machine Learning / Statistics
    Projects: Numba, dask, xlwings, Airflow, Blaze

  6. https://www.continuum.io/downloads

  7. What’s in ANACONDA DISTRIBUTION?

  8. • Install data science libraries
      $ conda install pandas
    • Manage package versions
      $ conda install pandas=0.14
    • Create isolated environments
      $ conda create -n myenv python=3.5 pandas=0.18
    • Update package version
      $ conda update pandas


  10. anaconda-project.yml
    • Define and manage:
      • project package dependencies
      • deployment commands
      • data
      • …

  11. • Launch applications
    • Manage package versions and environments
    • Create and upload projects

  12. Data Science Development and Deployment


  13. Biz Analyst
    Data Scientists
    Explore, Analyze & Collaborate

  14. DevOps
    Scale, Deploy & Operate
    Developer
    Data Engineers

  15. Challenges in data science development
    How do you…
    • Download and install data science libraries?
    • Manage versions and dependencies?
    • Upgrade libraries?
    • Isolate dependencies between projects?
    WITH ANACONDA DISTRIBUTION & CONDA

  16. What do data scientists develop?
    [Diagram labels: Workflows; Data; Query; Visualize; Clean & Tidy; Predict, Simulate, & Optimize]
    • Interactive data visualizations and dashboards
    • Jupyter notebooks
    • Scripts
    • Predictive models
    • Processed data

  17. Data Science Development (Laptop)
    Languages: Python, R
    Libraries: scikit-learn, Bokeh, TensorFlow, Jupyter, pandas, matplotlib, seaborn, dask, numba
    Files: script 1, script 2, script 3, notebook A, dataset Z

  18. Challenges in data science development and deployment
    How do you…
    • Share your data science project with others?
    • Ensure that you can reproduce your analysis?
    • Deploy your project?
    WITH ANACONDA PROJECTS

  19. Data Science Development (Laptop): Project 1, Project 2, Project 3
    Data Science Deployment (Server): Project 1, Project 2, Project 3

  20. Data Science Development (Laptop): Project 1, Project 2, Project 3
    Data Science Development and Deployment (Anaconda Enterprise): Project 1, Project 2, Project 3
    Containers: Container 1, Container 2, Container 3, Container 4

  21. Anaconda Enterprise
    • Dependencies
    • Data
    • Deployment commands
    • Security
    • Scalability
    • Availability

  22. Innovator Program
    http://go.continuum.io/anaconda-enterprise-innovator/

  23. Anaconda Projects GIS examples



  26. Datashading a 1-billion-point OpenStreetMap database

  27. 2010 US Census data (by population density and race)

  28. Gerrymandering (congressional districts and race)
    Houston

  29. Examples available:
    https://anaconda.org/koverholt/projects
      - datashader_nyctaxi
      - deck_gl_geojson
    https://anaconda.org/jbednar/osm-1billion/notebook
    https://anaconda.org/jbednar/census/notebook
    https://anaconda.org/jbednar/census-hv-dask/notebook

  30. Bokeh


  31. Bokeh
    Interactive visualization framework that targets modern web browsers for presentation
    • No JavaScript
    • Python, R, Scala and Lua bindings
    • Easy to embed in web applications
    • Server apps: data can be updated, and UI and selection events can be processed to trigger more visual updates.
    http://bokeh.pydata.org/en/latest/

  32. http://bokeh.pydata.org/en/latest/docs/gallery/texas.html

  33. Datashader


  34. Motivation
    • Visualize large amounts of data in a meaningful way
    • Interactively explore the data

  35. Datashader
    • Overplotting
    • Oversaturation
    • Undersampling
    https://anaconda.org/jbednar/plotting_pitfalls/notebook
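    To make the oversaturation pitfall concrete, here is a small sketch. It assumes standard alpha compositing of identically colored points, which is an assumption added here and not something stated on the slide: if each point is drawn with opacity alpha, a pixel covered by n overlapping points ends up with opacity 1 - (1 - alpha)^n. The function name and the sample values are illustrative only.

```python
def stacked_opacity(alpha: float, n: int) -> float:
    """Opacity of a pixel after n points of opacity `alpha` are drawn on top of each other.

    Assumes standard 'over' alpha compositing of identically colored points.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    # Each new point lets (1 - alpha) of the background show through.
    return 1.0 - (1.0 - alpha) ** n


if __name__ == "__main__":
    for n in (1, 10, 50, 500):
        print(f"{n:>4} overlapping points -> opacity {stacked_opacity(0.1, n):.3f}")
```

    With alpha = 0.1, a pixel holding 50 points is already more than 99% opaque, so regions with 50 and 500 points become nearly indistinguishable.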


  36. Datashader
    Graphics pipeline system for creating meaningful representations of large amounts of data
    • Provides automatic, nearly parameter-free visualization of datasets
    • Allows extensive customization of each step in the data-processing pipeline
    • Supports automatic downsampling and re-rendering with Bokeh and the Jupyter notebook
    • Works well with dask and numba to handle very large datasets in and out of core (with examples using billions of datapoints)
    https://github.com/bokeh/datashader
    [Image: NYC census data by race]
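    The pipeline idea can be sketched in plain Python. This is not Datashader's API; it is a minimal illustration under the assumption that the two core steps are (1) aggregating points into a fixed-size grid of counts and (2) mapping those counts to pixel intensities with a transfer function. The names aggregate_counts and shade_log, and the choice of a log scale, are hypothetical and chosen for this example.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def aggregate_counts(points: Sequence[Point],
                     x_range: Tuple[float, float],
                     y_range: Tuple[float, float],
                     width: int,
                     height: int) -> List[List[int]]:
    """Step 1: bin every point into a width x height grid and count points per cell."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # point lies outside the viewport
        # Map the coordinate to a cell index; clamp so the upper edge lands in the last cell.
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        grid[row][col] += 1
    return grid


def shade_log(grid: List[List[int]]) -> List[List[int]]:
    """Step 2: map counts to 0-255 intensities on a log scale.

    Empty cells stay 0 and the densest cell maps to 255.
    """
    peak = max(max(row) for row in grid)
    if peak == 0:
        return [[0] * len(row) for row in grid]
    scale = 255 / math.log1p(peak)
    return [[round(math.log1p(count) * scale) for count in row] for row in grid]


if __name__ == "__main__":
    pts = [(0.1, 0.1)] * 1000 + [(0.9, 0.9)] * 10 + [(0.5, 0.5)]
    counts = aggregate_counts(pts, (0.0, 1.0), (0.0, 1.0), width=4, height=4)
    for row in shade_log(counts):
        print(row)
```

    Because the output grid has a fixed size, the cost of the shading step depends on the number of pixels and not on the number of points, and the log scale keeps sparse cells visible next to very dense ones.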


  37. Dask


  38. Motivation
    • Data is larger than the laptop's memory
    • Want a pandas-like solution

  39. Dask Dataframes
    >>> # pandas: the whole file is loaded into memory
    >>> import pandas as pd
    >>> df = pd.read_csv('iris.csv')
    >>> df.head()
       sepal_length  sepal_width  petal_length  petal_width      species
    0           5.1          3.5           1.4          0.2  Iris-setosa
    1           4.9          3.0           1.4          0.2  Iris-setosa
    2           4.7          3.2           1.3          0.2  Iris-setosa
    3           4.6          3.1           1.5          0.2  Iris-setosa
    4           5.0          3.6           1.4          0.2  Iris-setosa
    >>> max_sepal_length_setosa = df[df.species == 'Iris-setosa'].sepal_length.max()
    >>> max_sepal_length_setosa
    5.7999999999999998
    >>> # dask: same API, reading many CSV files as one lazy dataframe
    >>> import dask.dataframe as dd
    >>> ddf = dd.read_csv('*.csv')
    >>> ddf.head()
       sepal_length  sepal_width  petal_length  petal_width      species
    0           5.1          3.5           1.4          0.2  Iris-setosa
    1           4.9          3.0           1.4          0.2  Iris-setosa
    2           4.7          3.2           1.3          0.2  Iris-setosa
    3           4.6          3.1           1.5          0.2  Iris-setosa
    4           5.0          3.6           1.4          0.2  Iris-setosa
    >>> # Builds a lazy result; nothing is computed until compute() is called
    >>> d_max_sepal_length_setosa = ddf[ddf.species == 'Iris-setosa'].sepal_length.max()
    >>> d_max_sepal_length_setosa.compute()
    5.7999999999999998
    Dask dataframes look and feel like pandas dataframes, but operate on datasets larger than memory using multiple threads
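    The idea behind the comparison above can be sketched with only the standard library. This is not Dask's API or implementation; it is a minimal illustration, assuming the computation is split into per-partition maxima that are reduced on a thread pool and then combined. The function names and the tiny in-memory partitions are hypothetical stand-ins for the CSV files matched by '*.csv'.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, Optional


def partition_max(csv_text: str, species: str) -> Optional[float]:
    """Largest sepal_length for `species` within one partition (one CSV chunk)."""
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [float(r["sepal_length"]) for r in rows if r["species"] == species]
    return max(values) if values else None


def parallel_max(partitions: Iterable[str], species: str) -> Optional[float]:
    """Reduce each partition on a thread pool, then combine the partial maxima."""
    chunks = list(partitions)
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda text: partition_max(text, species), chunks))
    found = [p for p in partials if p is not None]
    return max(found) if found else None


if __name__ == "__main__":
    # Two tiny in-memory "files"; a real workload would read one file per partition from disk.
    part_a = "sepal_length,species\n5.1,Iris-setosa\n4.9,Iris-setosa\n7.0,Iris-versicolor\n"
    part_b = "sepal_length,species\n5.8,Iris-setosa\n6.3,Iris-virginica\n"
    print(parallel_max([part_a, part_b], "Iris-setosa"))
```

    Only one partition per worker needs to be held at a time, and only the small partial results are combined at the end, which is why this pattern works for data that does not fit in memory as a whole.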


  40. Distributed
    Distributed is a lightweight library for distributed computing in Python. It extends dask APIs to moderate-sized clusters.
    http://distributed.readthedocs.io/en/latest/

  41. Web UI
    Dask.distributed includes a web interface that delivers information about the current state of the network. It helps to track progress, identify performance issues, and debug failures over a normal web page in real time.

  42. HoloViews & GeoViews


  43. HoloViews is a Python library that makes analyzing and visualizing scientific or engineering data much simpler, more intuitive, and more easily reproducible.
    http://holoviews.org/index.html

  44. GeoViews is a Python library that makes it easy to explore and visualize geographical, meteorological, and oceanographic datasets, such as those used in weather, climate, and remote sensing research.
    http://geo.holoviews.org/

  45. Resources
    Bokeh documentation: http://bokeh.pydata.org/en/latest/
    Bokeh demos: https://demo.bokehplots.com/
    Datashader documentation: http://datashader.readthedocs.org/
    Bokeh + Datashader tutorial: https://github.com/bokeh/bokeh-notebooks
    Bokeh webinar: http://go.continuum.io/hassle-free-data-science-apps/
    Datashader webinar: http://go.continuum.io/datashader/
    GeoViews blog post: https://www.continuum.io/blog/developer-blog/introducing-geoviews
    GeoViews documentation: http://geo.holoviews.org/index.html
    HoloViews documentation: http://holoviews.org/index.html