Python for Data Science at Pivotal

Ian Huston
August 12, 2015

Talk given at Python Ireland meetup

Python is a key part of the technology stack used at Pivotal for data science projects. In this talk I will outline how Python and the PyData ecosystem of projects have been used in a variety of customer projects, in small-, medium- and big-data scenarios.

Transcript

  1. Python for Data Science at Pivotal
    Ian Huston, Data Scientist
    Python Ireland, August 2015
  2. Who am I?
    • Ian Huston
    • @ianhuston
    • www.ianhuston.net
    • Data Scientist
    • Use PyData stack for predictive analytics and machine learning
    • Previously a theoretical physicist using Python for numerical simulations & HPC
    © Copyright 2015 Pivotal. All rights reserved.
  3. Who are Pivotal?
    • Open Data Platform
    • Pivotal Big Data Suite
  4. Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.
    - Josh Wills
  5. Plan
    1. Python for Data packages and tools
    2. Python in your database
    3. Python in the cloud
  6. Why Python?
    • Powerful & simple syntax: great for interactive work
    • Backed up with fast C & Fortran numerical libraries
    • Growing community and set of libraries
    • R is still extremely popular in data science
    • The GIL and limited multi-core support make scaling Python difficult
  7. Python for Data community
    • PyCon Ireland Data Science track!
    • PyData conferences + videos
      - London, Berlin, multiple US locations
    • #PyData on Twitter
  8. Packages - Data Manipulation
    • Low level array operations (NumPy)
    • Data tables and in-memory manipulation
    • Parallel out-of-core array manipulation (Dask)
    • High level interface for databases and different computational backends
  9. Packages - Modelling
    • FFTs, integration, other general algorithms (SciPy)
    • Statistical distributions and tests
    • Machine Learning pipelines
    • Bayesian Probabilistic Programming (PyMC3)
  10. Packages - Visualisation
    • Widely used and powerful plotting package
    • Opinionated but beautiful data visualisations (seaborn)
    • Interactive plotting with server option (Bokeh)
    • Graphics API with translation between languages (e.g. Python -> D3)
  11. IPython Notebooks
    http://nbviewer.ipython.org/gist/fonnesbeck/2352771
  12. Connected Car
    http://tinyurl.com/pivotal-car
    https://github.com/pivotal/IoT-ConnectedCar
  13. Bring your code to the data
    • Procedural Python: support in PostgreSQL + others
    • Use the expressive power of Python inside the database
    • Reduce/remove large data movements
    • Couple with distributed databases for simple parallelisation
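  14. PL/Python function example

        -- SQL wrapper: name, argument types and return type
        CREATE FUNCTION pymax (a integer, b integer)
          RETURNS integer
        AS $$
          # Normal Python: the function body
          if a > b:
            return a
          return b
        $$ LANGUAGE plpythonu;  -- Language: PL/Python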
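Because the body of a PL/Python function is ordinary Python, the same logic can be exercised in a plain interpreter before it is wrapped in SQL. The sketch below mirrors the pymax body; the standalone `def` and the sample values are illustrative assumptions and do not appear on the slide.

```python
def pymax(a, b):
    """Return the larger of two integers, mirroring the PL/Python body."""
    if a > b:
        return a
    return b


if __name__ == "__main__":
    # Sample values are made up for illustration.
    print(pymax(3, 7))
    print(pymax(10, 2))
```

Once the function has been created in the database, the equivalent call would be a query such as `SELECT pymax(3, 7);`.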
  15. Data Parallelism
    • Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks.
    • Examples:
      - Measure the height of each student in a classroom (explicitly parallelizable by student)
      - MapReduce
      - map() function in Python
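The classroom example can be sketched with Python's built-in `map()` and with the `map()` method of a thread pool, which has the same call shape. This is a minimal illustration, not code from the talk: the student names, heights and function names are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical classroom data: student name -> height in centimetres.
HEIGHTS_CM = {"Aoife": 162, "Brian": 175, "Ciara": 158, "Declan": 181}


def measure(student):
    """Measure one student; needs no information about any other student."""
    return student, HEIGHTS_CM[student]


def measure_serial(students):
    # Built-in map applies the function to each item independently.
    return list(map(measure, students))


def measure_parallel(students, workers=4):
    # Same call shape, but the independent tasks are spread over a thread pool.
    # Executor.map returns results in the order of the inputs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(measure, students))


if __name__ == "__main__":
    students = sorted(HEIGHTS_CM)
    print(measure_serial(students))
    print(measure_parallel(students))
```

Threads are used here only to keep the sketch simple. For CPU-bound work, `ProcessPoolExecutor` offers the same `map()` interface while avoiding the GIL.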
  16. Natural Language Processing in-database
    • Business Problem: Want to understand what is being discussed in millions of documents and whether authors feel positive about us
    • Topic Modelling: Characterise documents based on topics contained within
    • Sentiment Analysis: Score documents based on ‘sentiment’ (positive or negative)
    • Natural Language ToolKit (NLTK)
  17. Topic and Sentiment Analysis Pipeline
    • Documents
    • Load into database
    • Parallel Parsing of JSON using PL/Python
    • Topic Modelling
    • Sentiment Analysis
    • D3.js
    • NLTK
    http://vimeo.com/79558274
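The parsing and scoring stages both work on one document at a time, which is what makes them easy to run in parallel. The sketch below shows that shape using only the standard library. It is not the pipeline from the talk: a tiny hand-written word list stands in for NLTK, and the `text` and `author` field names, the word lists and the function names are all assumptions.

```python
import json

# Toy word lists standing in for a real sentiment lexicon (assumption).
POSITIVE = {"great", "love", "good", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}


def parse_document(raw_json):
    """Parse one JSON document and return its text field, lower-cased."""
    record = json.loads(raw_json)
    # The "text" field name is hypothetical.
    return record.get("text", "").lower()


def sentiment_score(text):
    """Count positive words minus negative words in the text."""
    words = [w.strip(".,!?;:") for w in text.split()]
    positives = sum(1 for w in words if w in POSITIVE)
    negatives = sum(1 for w in words if w in NEGATIVE)
    return positives - negatives


def score_documents(raw_documents):
    # Each document is handled independently, so this map is data parallel.
    return [sentiment_score(parse_document(raw)) for raw in raw_documents]


if __name__ == "__main__":
    docs = [
        '{"author": "a", "text": "Great product, love it!"}',
        '{"author": "b", "text": "Terrible support and poor docs."}',
    ]
    print(score_documents(docs))
```

Inside a database, functions like `parse_document` and `sentiment_score` would form the body of a PL/Python function applied to each row.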
  18. Cloud Applications Haiku
    Here is my source code
    Run it on the cloud for me
    I do not care how.
    - Onsi Fakhouri, @onsijoe
  19. Simple Flask App Demo
    • Simple one page “Hello World” web app
    • Video: https://www.youtube.com/watch?v=QOfD6tnoAB8
    • Demonstrates:
      - Installation of requirements
      - Scaling properties
    • Need to Provide:
      - App files
      - Dependencies listed in requirements.txt file
      - Optional manifest.yml file with configuration for deployment
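A one-page app of the kind described above might look like the following minimal sketch. It is not the demo's actual source: the route, the greeting text and the fallback port are assumptions, and it presumes Flask is installed (it would be listed in requirements.txt).

```python
import os

from flask import Flask

app = Flask(__name__)


@app.route("/")
def hello():
    # Single page returning a plain-text greeting.
    return "Hello World!"


if __name__ == "__main__":
    # Assumption: the platform passes the listening port in the PORT
    # environment variable; fall back to 5000 for local runs.
    port = int(os.environ.get("PORT", 5000))
    app.run(host="0.0.0.0", port=port)
```

Saved under a hypothetical name such as hello.py, this file plus a requirements.txt containing the line `Flask` would cover the "App files" and "Dependencies" items in the list.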
  20. What just happened? (diagram)
    $ cf push
    Components: Source Code, Cloud Controller, CF Router, Instance, Instance, Browser
    1. Upload code
    2. Set up domain
    3. Install Python & Dependencies
    4. Copy app into containerised instances
    5. Start app and accept connections
    5. Load balance between instances
    Send request to URL
  21. Python on Cloud Foundry
    • First class language (with Go, Java, Ruby, Node.js, PHP)
    • Automatic app type detection
      - Looks for requirements.txt or setup.py
    • Buildpack takes care of
      - Detecting that a Python app is being pushed
      - Installing Python interpreter
      - Installing packages in requirements.txt using pip
      - Starting web app as requested (e.g. python myapp.py)
  22. Official Python Buildpack
    ✓ Great for simple pip based requirements
    ✓ Well tested and officially maintained
    ✓ Covers both Python 2 and 3
    ✗ Suffers from the Python Packaging Problem:
      - Hard to build packages with C, C++ or Fortran extensions
      - Complicated local configuration of libraries and paths needed
      - Takes a long time to build main PyData packages from source
  23. Using conda for package management
    • http://conda.pydata.org
    • Benefits:
      - Uses precompiled binary packages
      - No fiddling with Fortran or C compilers and library paths
      - Known good combinations of main package versions
      - Really simple environment management (better than virtualenv)
      - Easy to run Python 2 and 3 side-by-side
    Go try it out if you haven’t already!
  24. How to use the conda buildpack
    https://github.com/ihuston/python-conda-buildpack
    • Specify as a custom buildpack when pushing app with manifest or -b command line option.
    • Export your current environment to an environment.yml file
    • Or write requirements.txt (pip) and conda_requirements.txt
    • Send me feedback & pull requests!
  25. Prediction API Architecture (diagram)
    $ cf create-service rediscloud PLAN_NAME INSTANCE_NAME
    Components: REST API, Data Ingest, Model, Redis
    • Send data as JSON
    • Create Model
    • Kicking off periodic retraining
    • Save training data
    • Save model object
    • Send JSON data without label
    • Receive prediction from trained model instance
    Deployed at: http://dsoncf.cfapps.io
    Code: https://github.com/pivotalsoftware/ds-cfpylearning
  26. Show off your data science related Cloud Foundry apps:
    Twitter: @dsoncf
    http://dsoncf.com
  27. Resources
    • PyData.org
    • PL/Python - see PostgreSQL docs
    • CloudFoundry.org
    We’re hiring in Dublin & London: pivotal.io/careers
    Kevin Olsen [email protected]
  28. What just happened? (diagram)
    $ cf push
    Components: Source Code, Cloud Controller, CF Router, Instance, Instance, Browser
    1. Upload code
    2. Set up domain
    3. Install Python & Dependencies
    4. Copy app into containerised instances
    5. Start app and accept connections
    5. Load balance between instances
    Send request to URL
  29. What just happened?
    1. Application code is uploaded to CF
    2. Domain URL is set up ready for routing
    3. Cloud controller builds application in container:
      - Python interpreter selected
      - Dependencies installed with pip
    4. Container is replicated to provide instances
    5. App starts and Router load balances requests
  30. Containers vs Buildpacks (diagram)
    Legend: System Provides, Dev Provides
    • Container (e.g. Docker): application layer, runtime layer, OS image; system brings fixed host OS kernel
    • Buildpack app container: application layer, runtime layer*, OS image; system brings fixed host OS kernel
    * Devs may bring a custom buildpack