Python for Data Science at Pivotal

Ian Huston
August 12, 2015

Talk given at Python Ireland meetup

Python is a key part of the technology stack used at Pivotal for data science projects. In this talk I will outline how Python and the PyData ecosystem of projects have been used in a variety of customer projects, in small-, medium- and big-data scenarios.

Transcript

  1. Python for Data Science at Pivotal
    Ian Huston, Data Scientist
    Python Ireland, August 2015

  2. Who am I?
    © Copyright 2015 Pivotal. All rights reserved.
    •  Ian Huston
    •  @ianhuston
    •  www.ianhuston.net
    •  Data Scientist
    •  Use PyData stack for predictive analytics and machine learning
    •  Previously a theoretical physicist using Python for numerical simulations & HPC

  3. Who are Pivotal?
    OPEN DATA PLATFORM
    Pivotal Big Data Suite

  4. NOW HIRING IN DUBLIN!

  5. Data Scientist (n.):

    Person who is better at statistics than any
    software engineer and better at software
    engineering than any statistician.

    - Josh Wills

  6. Plan
    1.  Python for Data packages and tools
    2.  Python in your database
    3.  Python in the cloud

  8. Why Python?
    •  Powerful & simple syntax – great for interactive work
    •  Backed up with fast C & Fortran numerical libraries
    •  Growing community and set of libraries
    •  R is still extremely popular in data science
    •  The GIL and limited multi-core support make scaling Python difficult

  9. 1.  PyData packages and tools

  10. Python for Data community
    •  PyCon Ireland Data Science track!
    •  PyData conferences + videos
      –  London, Berlin, multiple US locations
    •  #PyData on Twitter

  11. Packages - Data Manipulation
    •  Low level array operations
    •  Data tables and in-memory manipulation
    •  Parallel out-of-core array manipulation
    •  High level interface for databases and different computational backends
    NumPy
    Dask

  12. Packages - Modelling
    •  FFTs, integration, other general algorithms
    •  Statistical distributions and tests
    •  Machine Learning pipelines
    •  Bayesian Probabilistic Programming
    SciPy
    PyMC3

  13. Packages - Visualisation
    •  Widely used and powerful plotting package
    •  Opinionated but beautiful data visualisations
    •  Interactive plotting with server option
    •  Graphics API with translation between languages (e.g. Python -> D3)
    seaborn
    Bokeh

  14. IPython Notebooks
    http://nbviewer.ipython.org/gist/fonnesbeck/2352771

  16. PREDICT THE DESTINATION

  17. PREDICT THE RANGE

  18. Connected Car
    http://tinyurl.com/pivotal-car
    https://github.com/pivotal/IoT-ConnectedCar

  19. 2.  In-Database Python
    (and R, Java, C, etc.)

  20. Bring your code to the data
    •  Procedural Python – support in PostgreSQL + others
    •  Use the expressive power of Python inside the database
    •  Reduce/remove large data movements
    •  Couple with distributed databases for simple parallelisation

  21. CREATE FUNCTION
        pymax (a integer, b integer)
    RETURNS integer
    AS $$
        if a > b:
            return a
        return b
    $$ LANGUAGE plpythonu;

    Slide callouts: SQL wrapper, Language, Normal Python
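
Because the body between the `$$` markers is ordinary Python, the same logic can be exercised outside the database before it is wrapped in SQL. A minimal sketch, assuming a plain function named `pymax` that mirrors the slide (the type hints and docstring are additions for illustration):

```python
def pymax(a: int, b: int) -> int:
    """Return the larger of two integers, mirroring the PL/Python body."""
    if a > b:
        return a
    return b


# The same comparison the database would run once per row.
larger = pymax(3, 7)
```

Keeping the function body free of database-specific calls is what makes this kind of local check possible.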

  22. Data Parallelism
    •  Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks.
    •  Examples:
      –  Measure the height of each student in a classroom (explicitly parallelizable by student)
      –  MapReduce
      –  map() function in Python
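
The `map()` example can be made concrete. The sketch below applies one function to each record independently, first with the built-in `map()` and then with an executor's `map()`; the student names, heights, and the `to_metres` helper are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: heights in centimetres, one record per student.
heights_cm = {"Aoife": 162.0, "Brian": 175.5, "Ciara": 158.2}


def to_metres(record):
    """Convert one (name, height_cm) pair; it needs no other record."""
    name, cm = record
    return name, cm / 100.0


# Serial version: the built-in map() applies the function to each record.
serial = dict(map(to_metres, heights_cm.items()))

# Parallel version: same function, same records, but the executor is free
# to run the calls concurrently because they share no state.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = dict(pool.map(to_metres, heights_cm.items()))
```

Threads are used here only to keep the sketch short; for CPU-bound work, `ProcessPoolExecutor` offers the same `map()` interface.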

  23. PostgreSQL
    PostgreSQL PostgreSQL PostgreSQL PostgreSQL

  24. PostgreSQL

  25. BENEFITS:
    Reuse existing Python code
    Access Python libraries
    Implicit parallelism

  26. Natural Language Processing in-database
    •  Business Problem: Want to understand what is being discussed in millions of documents and whether authors feel positive about us
    •  Topic Modelling: Characterise documents based on topics contained within
    •  Sentiment Analysis: Score documents based on ‘sentiment’ (positive or negative)
    Natural Language ToolKit (NLTK)

  27. Topic and Sentiment Analysis Pipeline
    Documents -> Load into database -> Parallel Parsing of JSON using PL/Python -> Topic Modelling, Sentiment Analysis -> D3.js
    http://vimeo.com/79558274
    NLTK
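
The parsing and scoring stages amount to applying a function to each document independently, which is why they suit in-database parallelism. A toy sketch, assuming documents arrive as JSON strings with a `text` field and using a tiny hand-written word list in place of NLTK (the field name, word lists, and function names are all hypothetical):

```python
import json

# Hypothetical stand-in lexicon; a real pipeline would use NLTK or a trained model.
POSITIVE = {"great", "good", "love"}
NEGATIVE = {"bad", "slow", "hate"}


def parse_document(raw):
    """Parse one JSON document and return its text field (empty if missing)."""
    return json.loads(raw).get("text", "")


def sentiment_score(text):
    """Count positive words minus negative words in the text."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


raw_documents = [
    '{"id": 1, "text": "I love this product, great support!"}',
    '{"id": 2, "text": "Delivery was slow and the app is bad."}',
    '{"id": 3, "text": "It arrived on Tuesday."}',
]

# Each document is handled on its own, so this step could run once per row.
scores = [sentiment_score(parse_document(raw)) for raw in raw_documents]
```

A score above zero reads as positive and below zero as negative; word counting this crude is only meant to show the shape of the per-document function.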

  28. 3.  Python in the Cloud

  29. What do data scientists need?

  30. Cloud Applications Haiku

    Here is my source code
    Run it on the cloud for me
    I do not care how.
    -  Onsi Fakhouri
    @onsijoe

  31. What is Cloud Foundry?
    http://cloudfoundry.org

    Open Source
    Multi-Cloud Platform

    Simple App Deployment,
    Scaling & Availability

  32. $ cf push
     

  33. Simple Flask App Demo
    •  Simple one page “Hello World” web app
    •  Video: https://www.youtube.com/watch?v=QOfD6tnoAB8
    •  Demonstrates:
      –  Installation of requirements
      –  Scaling properties
    •  Need to provide:
      –  App files
      –  Dependencies listed in requirements.txt file
      –  Optional manifest.yml file with configuration for deployment
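
For reference, the app file in a demo like this can be very small. The sketch below uses the standard library's `wsgiref` rather than Flask so that it runs without extra dependencies; a Flask app would follow the same shape. It assumes the platform supplies the listening port in a `PORT` environment variable, and the fallback port 8080 and the function names are arbitrary choices.

```python
import os
from wsgiref.simple_server import make_server


def application(environ, start_response):
    """Minimal WSGI app: answer every request with a plain-text greeting."""
    body = b"Hello World"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def get_port(default=8080):
    """Read the port from the environment (assumed to be set by the platform)."""
    return int(os.environ.get("PORT", default))


def main():
    """Serve forever; bind to all interfaces so the router can reach the app."""
    with make_server("0.0.0.0", get_port(), application) as server:
        server.serve_forever()
```

Calling `main()` at the bottom of the file turns it into the script named in the start command. With Flask, the dependency would also be listed in requirements.txt.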

  34. WHAT JUST HAPPENED?
    Components: Source Code, $ cf push, Cloud Controller, CF Router, Instance (x2), Browser
    1. Upload code
    2. Set up domain
    3. Install Python & Dependencies
    4. Copy app into containerised instances
    5. Start app and accept connections
    Send request to URL
    5. Load balance between instances

  35. Python on Cloud Foundry
    •  First class language (with Go, Java, Ruby, Node.js, PHP)
    •  Automatic app type detection
      –  Looks for requirements.txt or setup.py
    •  Buildpack takes care of
      –  Detecting that a Python app is being pushed
      –  Installing Python interpreter
      –  Installing packages in requirements.txt using pip
      –  Starting web app as requested (e.g. python myapp.py)
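
The detection step can be pictured as a simple file check. This is an illustration of the idea in Python, not the buildpack's actual detection script; the function names are hypothetical and the marker files are the two named on the slide.

```python
import os
import tempfile

# Marker files named on the slide; the presence of either suggests a Python app.
MARKER_FILES = ("requirements.txt", "setup.py")


def looks_like_python_app(app_dir):
    """Return True if app_dir contains any of the marker files."""
    return any(
        os.path.isfile(os.path.join(app_dir, name)) for name in MARKER_FILES
    )


def demo():
    """Create two throwaway directories and run the check against each."""
    with tempfile.TemporaryDirectory() as python_app, \
            tempfile.TemporaryDirectory() as other_app:
        with open(os.path.join(python_app, "requirements.txt"), "w") as fh:
            fh.write("flask\n")
        return looks_like_python_app(python_app), looks_like_python_app(other_app)
```

Calling `demo()` checks one directory that contains requirements.txt and one that is empty.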

  36. Official Python Buildpack
    ✓  Great for simple pip based requirements
    ✓  Well tested and officially maintained
    ✓  Covers both Python 2 and 3
    ✗  Suffers from the Python Packaging Problem:
      –  Hard to build packages with C, C++ or Fortran extensions
      –  Complicated local configuration of libraries and paths needed
      –  Takes a long time to build main PyData packages from source

  37. Using conda for package management
    •  http://conda.pydata.org
    •  Benefits:
      –  Uses precompiled binary packages
      –  No fiddling with Fortran or C compilers and library paths
      –  Known good combinations of main package versions
      –  Really simple environment management (better than virtualenv)
      –  Easy to run Python 2 and 3 side-by-side
    Go try it out if you haven’t already!

  38. How to use the conda buildpack
    https://github.com/ihuston/python-conda-buildpack
    •  Specify as a custom buildpack when pushing app with manifest or -b command line option.
    •  Export your current environment to an environment.yml file
    •  Or write requirements.txt (pip) and conda_requirements.txt
    •  Send me feedback & pull requests!

  39. PREDICTION API ARCHITECTURE
    Components: REST API, Data Ingest, Model, Redis
    •  Send data as JSON
    •  Create Model
    •  Kicking off periodic retraining
    •  Save training data
    •  Save model object
    •  Send JSON data without label
    •  Receive prediction from trained model instance
    Deployed at: http://dsoncf.cfapps.io
    Code: https://github.com/pivotalsoftware/ds-cfpylearning
    $ cf create-service rediscloud PLAN_NAME INSTANCE_NAME
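
The flow in this architecture (ingest labelled JSON, retrain, then score unlabelled JSON) can be sketched without any services. The code below is not taken from the linked repository: a plain dict stands in for Redis, the model is a one-variable least-squares line, and the key names, field names (`x`, `label`), and function names are all hypothetical.

```python
import json

# In-memory stand-in for Redis: maps string keys to JSON strings.
store = {}


def ingest(model_name, payload_json):
    """Append one labelled observation, sent as JSON, to the training data."""
    key = "data:" + model_name
    rows = json.loads(store.get(key, "[]"))
    rows.append(json.loads(payload_json))
    store[key] = json.dumps(rows)


def retrain(model_name):
    """Fit y = slope * x + intercept by least squares and save the model."""
    rows = json.loads(store["data:" + model_name])
    n = len(rows)
    mean_x = sum(r["x"] for r in rows) / n
    mean_y = sum(r["label"] for r in rows) / n
    sxx = sum((r["x"] - mean_x) ** 2 for r in rows)
    sxy = sum((r["x"] - mean_x) * (r["label"] - mean_y) for r in rows)
    slope = sxy / sxx  # assumes the x values are not all identical
    model = {"slope": slope, "intercept": mean_y - slope * mean_x}
    store["model:" + model_name] = json.dumps(model)


def predict(model_name, payload_json):
    """Score one unlabelled observation with the saved model."""
    model = json.loads(store["model:" + model_name])
    x = json.loads(payload_json)["x"]
    return json.dumps({"prediction": model["slope"] * x + model["intercept"]})


# Example: three labelled points on the line y = 2x + 1, then one prediction.
for x, y in [(0, 1), (1, 3), (2, 5)]:
    ingest("demo", json.dumps({"x": x, "label": y}))
retrain("demo")
result = json.loads(predict("demo", json.dumps({"x": 3})))
```

Storing both the training data and the fitted model as serialised values is what lets a separate process retrain periodically while the API keeps serving predictions.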

  40. TRANSPORT DISRUPTION PREDICTIONS
    http://ds-demo-transport.cfapps.io

  41. Show off your data science related Cloud Foundry apps:

    Twitter: @dsoncf
    http://dsoncf.com

  42. Resources
    •  PyData.org
    •  PL/Python – see PostgreSQL docs
    •  CloudFoundry.org
    We’re hiring in Dublin & London:
    pivotal.io/careers
    Kevin Olsen [email protected]

  43. @ianhuston

  44. Appendix

  45. WHAT JUST HAPPENED?
    Components: Source Code, $ cf push, Cloud Controller, CF Router, Instance (x2), Browser
    1. Upload code
    2. Set up domain
    3. Install Python & Dependencies
    4. Copy app into containerised instances
    5. Start app and accept connections
    Send request to URL
    5. Load balance between instances

  46. What just happened?
    1.  Application code is uploaded to CF
    2.  Domain URL is set up ready for routing
    3.  Cloud controller builds application in container:
      –  Python interpreter selected
      –  Dependencies installed with pip
    4.  Container is replicated to provide instances
    5.  App starts and Router load balances requests
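
Step 5 can be pictured as the router handing successive requests to different instances. A minimal sketch, assuming simple round-robin selection (the slide does not say which strategy the Cloud Foundry router uses) and hypothetical instance names:

```python
from itertools import cycle


class RoundRobinRouter:
    """Hand each incoming request to the next instance in turn."""

    def __init__(self, instances):
        if not instances:
            raise ValueError("at least one instance is required")
        self._next_instance = cycle(list(instances))

    def route(self, request):
        """Return (instance, request) for the instance chosen to serve it."""
        return next(self._next_instance), request


router = RoundRobinRouter(["instance-0", "instance-1", "instance-2"])
assignments = [router.route("GET /")[0] for _ in range(5)]
```

Because every instance is a replica of the same container, any of them can serve any request, which is what makes a strategy this simple workable for stateless apps.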

  47. Containers vs Buildpacks
    Container (e.g. Docker)
      application layer
      runtime layer
      OS image
      system brings fixed host OS Kernel
    Buildpack
      application layer
      runtime layer*
      OS image
      system brings fixed host OS Kernel
    * Devs may bring a custom buildpack
    Legend: System Provides, Dev Provides
    App container
