Python for Data Science at Pivotal

Ian Huston
August 12, 2015

Talk given at Python Ireland meetup

Python is a key part of the technology stack used at Pivotal for data science projects. In this talk I will outline how Python and the PyData ecosystem of projects have been used in a variety of customer projects, in small-, medium- and big-data scenarios.

Transcript

  1. Python for Data Science at Pivotal

    Ian Huston, Data Scientist. Python Ireland, August 2015
  2. Who am I?

    • Ian Huston
    • @ianhuston
    • www.ianhuston.net
    • Data Scientist
    • Use PyData stack for predictive analytics and machine learning
    • Previously a theoretical physicist using Python for numerical simulations & HPC

    © Copyright 2015 Pivotal. All rights reserved.
  3. Who are Pivotal?

    OPEN DATA PLATFORM, Pivotal Big Data Suite
  4. NOW HIRING IN DUBLIN!
  5. Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician. - Josh Wills
  6. Plan

    1. Python for Data packages and tools
    2. Python in your database
    3. Python in the cloud
  7. None
  8. Why Python?

    • Powerful & simple syntax – great for interactive work
    • Backed up with fast C & Fortran numerical libraries
    • Growing community and set of libraries
    • R is still extremely popular in data science
    • GIL and multi-core support make scaling Python difficult
  9. 1. PyData packages and tools

  10. Python for Data community

    • PyCon Ireland Data Science track!
    • PyData conferences + videos
      – London, Berlin, multiple US locations
    • #PyData on Twitter
  11. Packages - Data Manipulation

    • Low level array operations
    • Data tables and in-memory manipulation
    • Parallel out-of-core array manipulation
    • High level interface for databases and different computational backends

    NumPy, Dask
  12. Packages - Modelling

    • FFTs, integration, other general algorithms
    • Statistical distributions and tests
    • Machine Learning pipelines
    • Bayesian Probabilistic Programming

    SciPy, PyMC3
  13. Packages - Visualisation

    • Widely used and powerful plotting package
    • Opinionated but beautiful data visualisations
    • Interactive plotting with server option
    • Graphics API with translation between languages (e.g. Python -> D3)

    seaborn, Bokeh
  14. IPython Notebooks

    http://nbviewer.ipython.org/gist/fonnesbeck/2352771
  15. None

  16. PREDICT THE DESTINATION
  17. PREDICT THE RANGE
  18. Connected Car

    http://tinyurl.com/pivotal-car
    https://github.com/pivotal/IoT-ConnectedCar
  19. 2. In-Database Python (and R, Java, C, etc.)

  20. Bring your code to the data

    • Procedural Python – support in PostgreSQL + others
    • Use the expressive power of Python inside the database
    • Reduce/remove large data movements
    • Couple with distributed databases for simple parallelisation
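  21. CREATE FUNCTION pymax (a integer, b integer)
        RETURNS integer
      AS $$
        if a > b:
          return a
        return b
      $$ LANGUAGE plpythonu;

    Slide callouts: SQL wrapper (the CREATE FUNCTION ... AS $$ and closing $$ lines), Language (plpythonu), Normal Python (the body between the $$ markers)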
  22. Data Parallelism

    • Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks.
    • Examples:
      – Measure the height of each student in a classroom (explicitly parallelizable by student)
      – MapReduce
      – map() function in Python
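The map() example on this slide can be sketched in a few lines. This is an illustration only, not code from the talk: the student records and the measure() helper are hypothetical, and a thread pool stands in for whatever executes the tasks in parallel. The point is that the per-record function needs nothing from any other record, so the same function can be mapped serially or by a pool of workers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: (student, height in cm) pairs standing in for the classroom.
students = [("Aoife", 162.0), ("Brian", 175.5), ("Ciara", 158.2), ("Donal", 181.0)]


def measure(student):
    """Process one record; it needs nothing from any other record."""
    name, height_cm = student
    return name, round(height_cm / 100.0, 3)


# Serial version using the built-in map().
serial = list(map(measure, students))

# Parallel version: the same function, mapped by a pool of workers.
# pool.map() returns results in input order, like the built-in map().
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(measure, students))

print(serial == parallel)
```

Threads are used here only to keep the sketch short. For CPU-bound work the GIL limits what threads gain, which is one reason to hand the mapping to separate processes or to a distributed database.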
  23. [Diagram: five boxes, each labelled PostgreSQL]
  24. PostgreSQL

  25. BENEFITS:

    • Reuse existing Python code
    • Access Python libraries
    • Implicit parallelism

  26. Natural Language Processing in-database

    • Business Problem: Want to understand what is being discussed in millions of documents and whether authors feel positive about us
    • Topic Modelling: Characterise documents based on topics contained within
    • Sentiment Analysis: Score documents based on ‘sentiment’ (positive or negative)

    Natural Language ToolKit (NLTK)
  27. Topic and Sentiment Analysis Pipeline

    Diagram labels:
    • Documents
    • Load into database
    • Parallel Parsing of JSON using PL/Python
    • Topic Modelling
    • Sentiment Analysis
    • D3.js

    NLTK
    http://vimeo.com/79558274
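The JSON parsing stage could look roughly like the sketch below. It is not the code used in the project: the extract_text() helper and the "body" field name are hypothetical, since the talk does not show the document schema. It shows the kind of per-row Python that would sit inside a PL/Python function body, where each row is parsed independently of the others.

```python
import json


def extract_text(raw_json):
    """Return the document text from one JSON record, or None if unusable.

    The "body" field name is hypothetical; the talk does not show the schema.
    """
    try:
        record = json.loads(raw_json)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError.
        return None
    if not isinstance(record, dict):
        return None
    return record.get("body")


# Hypothetical rows, as they might arrive from a table of raw documents.
rows = [
    '{"id": 1, "body": "Great service, very happy"}',
    '{"id": 2, "body": "Delivery was late"}',
    'not valid json',
]
texts = [extract_text(row) for row in rows]
print(texts)
```

Returning None for malformed rows, rather than raising, keeps one bad document from aborting a query over millions of rows; whether that is the right trade-off depends on the project.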
  28. 3.  Python in the Cloud

  29. What do data scientists need?

  30. Cloud Applications Haiku

    Here is my source code
    Run it on the cloud for me
    I do not care how.
    - Onsi Fakhouri @onsijoe
  31. What is Cloud Foundry?

    http://cloudfoundry.org
    • Open Source Multi-Cloud Platform
    • Simple App Deployment, Scaling & Availability
  32. $ cf push  

  33. Simple Flask App Demo

    • Simple one page “Hello World” web app
    • Video: https://www.youtube.com/watch?v=QOfD6tnoAB8
    • Demonstrates:
      – Installation of requirements
      – Scaling properties
    • Need to Provide:
      – App files
      – Dependencies listed in requirements.txt file
      – Optional manifest.yml file with configuration for deployment
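For readers who have not seen such an app, here is a minimal sketch of a one-page "Hello World" web app. It is not the demo app from the video: to stay dependency-free it uses the standard-library wsgiref module instead of Flask, and the app() and serve() names and the 8080 fallback are choices made for this sketch. Cloud Foundry tells the app which port to listen on through the PORT environment variable.

```python
import os
from wsgiref.simple_server import make_server
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """Answer every request with a plain-text greeting."""
    body = b"Hello World"
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def serve():
    """Listen on the port given in PORT; 8080 is a local fallback for this sketch."""
    port = int(os.environ.get("PORT", "8080"))
    with make_server("0.0.0.0", port, app) as server:
        server.serve_forever()


# Exercise the app once without opening a socket.
environ = {}
setup_testing_defaults(environ)
status_seen = []
response = app(environ, lambda status, headers: status_seen.append(status))
print(status_seen[0], b"".join(response))
```

The sketch deliberately does not call serve() itself, so it can be run or imported without blocking; a deployed app would call serve() from its start command. A Flask version would replace app() with a Flask application object and list Flask in requirements.txt.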
  34. WHAT JUST HAPPENED?

    Diagram components: $ cf push, Source Code, Cloud Controller, CF Router, Instances, Browser

    1. Upload code
    2. Set up domain
    3. Install Python & Dependencies
    4. Copy app into containerised instances
    5. Start app and accept connections
    5. Load balance between instances

    Browser: Send request to URL
  35. Python on Cloud Foundry

    • First class language (with Go, Java, Ruby, Node.js, PHP)
    • Automatic app type detection
      – Looks for requirements.txt or setup.py
    • Buildpack takes care of
      – Detecting that a Python app is being pushed
      – Installing Python interpreter
      – Installing packages in requirements.txt using pip
      – Starting web app as requested (e.g. python myapp.py)
  36. Official Python Buildpack

    ✓ Great for simple pip based requirements
    ✓ Well tested and officially maintained
    ✓ Covers both Python 2 and 3
    ✗ Suffers from the Python Packaging Problem:
      - Hard to build packages with C, C++ or Fortran extensions
      - Complicated local configuration of libraries and paths needed
      - Takes a long time to build main PyData packages from source
  37. Using conda for package management

    • http://conda.pydata.org
    • Benefits:
      – Uses precompiled binary packages
      – No fiddling with Fortran or C compilers and library paths
      – Known good combinations of main package versions
      – Really simple environment management (better than virtualenv)
      – Easy to run Python 2 and 3 side-by-side

    Go try it out if you haven’t already!
  38. How to use the conda buildpack

    https://github.com/ihuston/python-conda-buildpack
    • Specify as a custom buildpack when pushing app with manifest or -b command line option.
    • Export your current environment to an environment.yml file
    • Or write requirements.txt (pip) and conda_requirements.txt
    • Send me feedback & pull requests!
  39. PREDICTION API ARCHITECTURE

    Diagram labels:
    • REST API
    • Data Ingest: Send data as JSON
    • Model Create
    • Redis
    • Kicking off periodic retraining
    • Save training data
    • Save model object
    • Send JSON data without label
    • Receive prediction from trained model instance

    $ cf create-service rediscloud PLAN_NAME INSTANCE_NAME

    Deployed at: http://dsoncf.cfapps.io
    Code: https://github.com/pivotalsoftware/ds-cfpylearning
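The flow named on this diagram (ingest labelled JSON, retrain, save the model, score unlabelled JSON) can be sketched as below. None of this is taken from the ds-cfpylearning code: the function names, the key layout and the toy mean-of-labels model are all hypothetical, and a plain dict stands in for Redis so the sketch runs without a server.

```python
import json

# Stand-in for Redis: a plain dict. A real deployment would use a Redis client.
store = {}


def ingest(model_name, payload):
    """Data ingest: save one labelled JSON record as training data."""
    store.setdefault(model_name + ":data", []).append(json.loads(payload))


def retrain(model_name):
    """Retraining: fit a toy model (mean of the labels) and save it."""
    rows = store[model_name + ":data"]
    mean = sum(row["label"] for row in rows) / len(rows)
    # Hypothetical serialisation: the model's parameters stored as JSON.
    store[model_name + ":model"] = json.dumps({"mean": mean})


def predict(model_name, payload):
    """Prediction: score one unlabelled JSON record with the saved model."""
    json.loads(payload)  # parsed to validate the request; the toy model ignores features
    model = json.loads(store[model_name + ":model"])
    return model["mean"]


ingest("demo", '{"x": 1.0, "label": 2.0}')
ingest("demo", '{"x": 2.0, "label": 4.0}')
retrain("demo")
prediction = predict("demo", '{"x": 3.0}')
print(prediction)
```

Keeping training data and the fitted model in an external store, rather than in the app's memory, is what lets every app instance serve predictions from the same model after a retrain.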
  40. TRANSPORT DISRUPTION PREDICTIONS

    http://ds-demo-transport.cfapps.io

  41. Show off your data science related Cloud Foundry apps:

    Twitter: @dsoncf
    http://dsoncf.com
  42. Resources

    • PyData.org
    • PL/Python – see PostgreSQL docs
    • CloudFoundry.org

    We’re hiring in Dublin & London: pivotal.io/careers
    Kevin Olsen kolsen@pivotal.io
  43. @ianhuston

  44. Appendix

  45. WHAT JUST HAPPENED? (same diagram as slide 34)
  46. What just happened?

    1. Application code is uploaded to CF
    2. Domain URL is set up ready for routing
    3. Cloud controller builds application in container:
       – Python interpreter selected
       – Dependencies installed with pip
    4. Container is replicated to provide instances
    5. App starts and Router load balances requests
  47. Containers vs Buildpacks

    Container (e.g. Docker):
    • application layer
    • runtime layer
    • OS image
    • Host OS Kernel (system brings fixed)

    Buildpack App container:
    • application layer
    • runtime layer*
    • OS image
    • Host OS Kernel (system brings fixed)

    Legend: System Provides, Dev Provides
    * Devs may bring a custom buildpack