Slide 1

Slide 1 text

Python for Data Science at Pivotal • Ian Huston, Data Scientist • Python Ireland, August 2015

Slide 2

Slide 2 text

© Copyright 2015 Pivotal. All rights reserved.
Who am I?
• Ian Huston
• @ianhuston
• www.ianhuston.net
• Data Scientist
• Use PyData stack for predictive analytics and machine learning
• Previously a theoretical physicist using Python for numerical simulations & HPC

Slide 3

Slide 3 text

Who are Pivotal? OPEN DATA PLATFORM, Pivotal Big Data Suite

Slide 4

Slide 4 text

NOW HIRING IN DUBLIN!

Slide 5

Slide 5 text

Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician. - Josh Wills

Slide 6

Slide 6 text

Plan
1. Python for Data packages and tools
2. Python in your database
3. Python in the cloud

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

Why Python?
• Powerful & simple syntax – great for interactive work
• Backed up with fast C & Fortran numerical libraries
• Growing community and set of libraries
• R is still extremely popular in data science
• The GIL and limited multi-core support make scaling Python difficult

Slide 9

Slide 9 text

1. PyData packages and tools

Slide 10

Slide 10 text

Python for Data community
• PyCon Ireland Data Science track!
• PyData conferences + videos
  – London, Berlin, multiple US locations
• #PyData on Twitter

Slide 11

Slide 11 text

Packages - Data Manipulation
• Low level array operations
• Data tables and in-memory manipulation
• Parallel out-of-core array manipulation
• High level interface for databases and different computational backends
NumPy, Dask

Slide 12

Slide 12 text

Packages - Modelling
• FFTs, integration, other general algorithms
• Statistical distributions and tests
• Machine Learning pipelines
• Bayesian Probabilistic Programming
SciPy, PyMC3

Slide 13

Slide 13 text

Packages - Visualisation
• Widely used and powerful plotting package
• Opinionated but beautiful data visualisations
• Interactive plotting with server option
• Graphics API with translation between languages (e.g. Python -> D3)
seaborn, Bokeh

Slide 14

Slide 14 text

IPython Notebooks
http://nbviewer.ipython.org/gist/fonnesbeck/2352771

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

PREDICT THE DESTINATION

Slide 17

Slide 17 text

PREDICT THE RANGE

Slide 18

Slide 18 text

Connected Car
http://tinyurl.com/pivotal-car
https://github.com/pivotal/IoT-ConnectedCar

Slide 19

Slide 19 text

2. In-Database Python (and R, Java, C, etc.)

Slide 20

Slide 20 text

Bring your code to the data
• Procedural Python – support in PostgreSQL + others
• Use the expressive power of Python inside the database
• Reduce/remove large data movements
• Couple with distributed databases for simple parallelisation

Slide 21

Slide 21 text

CREATE FUNCTION pymax (a integer, b integer)
  RETURNS integer
AS $$
  if a > b:
    return a
  return b
$$ LANGUAGE plpythonu;

Labels on the slide: SQL wrapper, Language, Normal Python
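Because the body between the `$$` markers is ordinary Python, one way to check it before wrapping it in SQL is to run the same logic as a plain function. This is a minimal sketch: the name `pymax` mirrors the slide, and the sample values are illustrative assumptions rather than anything from the talk.

```python
def pymax(a, b):
    """Same logic as the PL/Python body: return the larger of two integers."""
    if a > b:
        return a
    return b


if __name__ == "__main__":
    # Illustrative values only; in the database these would arrive as SQL arguments.
    print(pymax(3, 7))
```

Once the function is created in the database, it would be called from SQL like any other function, for example inside a SELECT over two integer columns.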

Slide 22

Slide 22 text

Data Parallelism
• Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks.
• Examples:
  – Measure the height of each student in a classroom (explicitly parallelizable by student)
  – MapReduce
  – map() function in Python
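The classroom example can be sketched with Python's built-in `map()` and with the standard library's `concurrent.futures`. The student records and the function name `measure_height_cm` are invented for illustration; the point is only that each call depends on nothing but its own input.

```python
from concurrent.futures import ThreadPoolExecutor


def measure_height_cm(student):
    """Independent per-student task; nothing is shared between calls."""
    return student["height_cm"]


# Hypothetical classroom data, invented for this example.
students = [
    {"name": "A", "height_cm": 150},
    {"name": "B", "height_cm": 162},
    {"name": "C", "height_cm": 158},
]

# Serial version using the built-in map().
serial_heights = list(map(measure_height_cm, students))

# Same function and data spread over worker threads; results keep input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel_heights = list(pool.map(measure_height_cm, students))
```

Threads are used here only to keep the sketch short. For CPU-bound work the GIL limits what threads can gain, so `ProcessPoolExecutor` from the same module is probably the more suitable choice.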

Slide 23

Slide 23 text

[Diagram: five PostgreSQL instances]

Slide 24

Slide 24 text

PostgreSQL

Slide 25

Slide 25 text

BENEFITS:
• Reuse existing Python code
• Access Python libraries
• Implicit parallelism

Slide 26

Slide 26 text

Natural Language Processing in-database
• Business Problem: Want to understand what is being discussed in millions of documents and whether authors feel positive about us
• Topic Modelling: Characterise documents based on topics contained within
• Sentiment Analysis: Score documents based on ‘sentiment’ (positive or negative)
Natural Language ToolKit (NLTK)
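To make "score documents based on sentiment" concrete, here is a deliberately tiny lexicon-based scorer using only the standard library. It is not NLTK and not the method used in the talk; the word lists and the name `sentiment_score` are assumptions made up for this sketch.

```python
import re

# Tiny hand-made word lists, invented for illustration; a real lexicon would be far larger.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}


def sentiment_score(text):
    """Return (#positive - #negative) / #tokens, or 0.0 for empty text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    return (pos - neg) / len(tokens)
```

A function of this shape takes one document and returns one number, so it could in principle be applied row by row, which is what makes the in-database approach attractive.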

Slide 27

Slide 27 text

Topic and Sentiment Analysis Pipeline
Documents → Load into database → Parallel Parsing of JSON using PL/Python → Topic Modelling, Sentiment Analysis → D3.js
NLTK
http://vimeo.com/79558274

Slide 28

Slide 28 text

3.  Python in the Cloud

Slide 29

Slide 29 text

What do data scientists need?

Slide 30

Slide 30 text

Cloud Applications Haiku
Here is my source code
Run it on the cloud for me
I do not care how.
- Onsi Fakhouri @onsijoe

Slide 31

Slide 31 text

What is Cloud Foundry?
http://cloudfoundry.org
• Open Source Multi-Cloud Platform
• Simple App Deployment, Scaling & Availability

Slide 32

Slide 32 text

$ cf push  

Slide 33

Slide 33 text

Simple Flask App Demo
• Simple one page “Hello World” web app
• Video: https://www.youtube.com/watch?v=QOfD6tnoAB8
• Demonstrates:
  – Installation of requirements
  – Scaling properties
• Need to Provide:
  – App files
  – Dependencies listed in requirements.txt file
  – Optional manifest.yml file with configuration for deployment
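For a feel of how small the "app files" can be, here is a one-page "Hello World" web app. To keep the sketch free of third-party dependencies it uses the standard library's `wsgiref` instead of Flask, so it is a stand-in for the demo app rather than a copy of it. It assumes the platform passes the listening port in a `PORT` environment variable; the 8080 fallback is an arbitrary choice for local runs.

```python
import os
from wsgiref.simple_server import make_server


def app(environ, start_response):
    """Minimal WSGI callable that answers every request with a greeting."""
    body = b"Hello World"
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def main():
    # Assumption: the platform supplies the port via PORT; 8080 is a local fallback.
    port = int(os.environ.get("PORT", "8080"))
    with make_server("0.0.0.0", port, app) as server:
        server.serve_forever()
```

To serve it, call `main()` at the end of the file. A Flask version would follow the same pattern, with `flask` listed in requirements.txt.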

Slide 34

Slide 34 text

WHAT JUST HAPPENED?
$ cf push
1. Upload code
2. Set up domain
3. Install Python & Dependencies
4. Copy app into containerised instances
5. Start app and accept connections; load balance between instances
[Diagram: Source Code, Cloud Controller, CF Router, two Instances, Browser (send request to URL)]

Slide 35

Slide 35 text

Python on Cloud Foundry
• First class language (with Go, Java, Ruby, Node.js, PHP)
• Automatic app type detection
  – Looks for requirements.txt or setup.py
• Buildpack takes care of
  – Detecting that a Python app is being pushed
  – Installing Python interpreter
  – Installing packages in requirements.txt using pip
  – Starting web app as requested (e.g. python myapp.py)
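The detection step can be pictured as a simple file check. The sketch below is not the buildpack's actual detect script; it only illustrates the rule stated on the slide, and the names `PYTHON_MARKERS`, `looks_like_python_app` and `demo` are hypothetical.

```python
import os
import tempfile

# Marker files named on the slide.
PYTHON_MARKERS = ("requirements.txt", "setup.py")


def looks_like_python_app(app_dir):
    """Return True if the directory contains any of the marker files."""
    return any(
        os.path.isfile(os.path.join(app_dir, name)) for name in PYTHON_MARKERS
    )


def demo():
    """Create throwaway app directories and report the detection result for each."""
    with tempfile.TemporaryDirectory() as with_reqs, \
            tempfile.TemporaryDirectory() as empty:
        with open(os.path.join(with_reqs, "requirements.txt"), "w") as fh:
            fh.write("flask\n")
        return looks_like_python_app(with_reqs), looks_like_python_app(empty)
```

Calling `demo()` checks one directory that contains a requirements.txt and one that is empty, returning a detection flag for each.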

Slide 36

Slide 36 text

Official Python Buildpack
✓ Great for simple pip based requirements
✓ Well tested and officially maintained
✓ Covers both Python 2 and 3
✗ Suffers from the Python Packaging Problem:
  - Hard to build packages with C, C++ or Fortran extensions
  - Complicated local configuration of libraries and paths needed
  - Takes a long time to build main PyData packages from source

Slide 37

Slide 37 text

Using conda for package management
• http://conda.pydata.org
• Benefits:
  – Uses precompiled binary packages
  – No fiddling with Fortran or C compilers and library paths
  – Known good combinations of main package versions
  – Really simple environment management (better than virtualenv)
  – Easy to run Python 2 and 3 side-by-side
Go try it out if you haven’t already!

Slide 38

Slide 38 text

How to use the conda buildpack
https://github.com/ihuston/python-conda-buildpack
• Specify as a custom buildpack when pushing app with manifest or -b command line option.
• Export your current environment to an environment.yml file
• Or write requirements.txt (pip) and conda_requirements.txt
• Send me feedback & pull requests!

Slide 39

Slide 39 text

PREDICTION API ARCHITECTURE
Components: REST API, Data Ingest, Model, Redis
• Send data as JSON
• Save training data
• Create Model, kicking off periodic retraining
• Save model object
• Send JSON data without label
• Receive prediction from trained model instance
$ cf create-service rediscloud PLAN_NAME INSTANCE_NAME
Deployed at: http://dsoncf.cfapps.io
Code: https://github.com/pivotalsoftware/ds-cfpylearning
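The ingest, retrain and predict flow can be sketched in a few lines. This is a toy under stated assumptions, not the code in the linked repository: a plain dict stands in for Redis, the model is a one-variable least-squares line, and the class name `PredictionService` and the JSON field name `x` are hypothetical (`label` follows the slide's wording).

```python
import json
import pickle


class PredictionService:
    """Toy version of the ingest / retrain / predict flow, for illustration only."""

    def __init__(self):
        # Assumption: a plain dict stands in for the Redis store on the slide.
        self.store = {"training_data": [], "model": None}

    def ingest(self, payload):
        """Save one labelled JSON record, e.g. '{"x": 1.0, "label": 2.0}'."""
        record = json.loads(payload)
        self.store["training_data"].append(
            (float(record["x"]), float(record["label"]))
        )

    def retrain(self):
        """Fit y = slope * x + intercept by least squares and save the pickled model."""
        data = self.store["training_data"]
        n = len(data)
        if n < 2:
            raise ValueError("need at least two records to fit a line")
        mean_x = sum(x for x, _ in data) / n
        mean_y = sum(y for _, y in data) / n
        var_x = sum((x - mean_x) ** 2 for x, _ in data)
        if var_x == 0:
            raise ValueError("all x values are identical")
        slope = sum((x - mean_x) * (y - mean_y) for x, y in data) / var_x
        model = {"slope": slope, "intercept": mean_y - slope * mean_x}
        self.store["model"] = pickle.dumps(model)

    def predict(self, payload):
        """Return a JSON prediction for an unlabelled record, e.g. '{"x": 3.0}'."""
        if self.store["model"] is None:
            raise RuntimeError("no trained model yet")
        model = pickle.loads(self.store["model"])
        x = float(json.loads(payload)["x"])
        return json.dumps({"prediction": model["slope"] * x + model["intercept"]})
```

Keeping the training data and the pickled model in a shared store rather than in process memory is what would let several app instances serve predictions from the same trained model.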

Slide 40

Slide 40 text

TRANSPORT DISRUPTION PREDICTIONS
http://ds-demo-transport.cfapps.io

Slide 41

Slide 41 text

Show off your data science related Cloud Foundry apps:
Twitter: @dsoncf
http://dsoncf.com

Slide 42

Slide 42 text

Resources
• PyData.org
• PL/Python – see PostgreSQL docs
• CloudFoundry.org
We’re hiring in Dublin & London: pivotal.io/careers
Kevin Olsen [email protected]

Slide 43

Slide 43 text

@ianhuston

Slide 44

Slide 44 text

Appendix

Slide 45

Slide 45 text

WHAT JUST HAPPENED? (diagram repeated from Slide 34)

Slide 46

Slide 46 text

What just happened?
1. Application code is uploaded to CF
2. Domain URL is set up ready for routing
3. Cloud controller builds application in container:
  – Python interpreter selected
  – Dependencies installed with pip
4. Container is replicated to provide instances
5. App starts and Router load balances requests

Slide 47

Slide 47 text

Containers vs Buildpacks
• Container (e.g. Docker): Dev provides application layer, runtime layer and OS image; system brings fixed host OS kernel
• Buildpack app container: Dev provides application layer; system provides runtime layer* and OS image; system brings fixed host OS kernel
* Devs may bring a custom buildpack