Slide 1

Python For Data Science

Slide 2

WHOAMI: Sr. Database Manager at Mito. Organizer of the Budapest Database Meetup. Supporting member of the budapest.py Meetup. Member of the Python Software Foundation and NumFOCUS. Bence Faludi | @bfaludi | [email protected]

Slide 3

1. See the possibilities in the environment
2. Embrace the fact that Python is easy
3. Learn about new libraries

Slide 4

Theme: Python across the whole data workflow.
- Data Harvesting
- Data Cleansing
- Analyzing Data
- Visualisation
- Machine Learning
- Data Reporting, Publish
- ETL

Slide 5

Theme map: the ecosystem around Python, by stage.
- Data Harvesting: Crawling, Web Scraping, API
- Data Cleansing
- Analyzing Data
- Visualisation
- Machine Learning: Scikit-Learn, Scikit-Image
- Data Reporting, Publish: CKAN, Cubes
- ETL: mETL, Bubbles, Luigi
Libraries in the cloud: NumPy, SciPy, Pandas, NLTK, IPython, Matplotlib, Plotly, Bokeh, Numba, SymPy, mply, PyBrain, Vispy, NetworkX.

Slide 7

Why Python? It's awesome and popular! A free and open-source language. Readable syntax. Easy to learn, with an active community. A large number of libraries. A high-level language.

Slide 8

Download Anaconda to Start: a free, enterprise-ready, cross-platform Python distribution for large-scale data processing, predictive analytics, and scientific computing.

Slide 9

Download Anaconda to Start: a free, enterprise-ready, cross-platform Python distribution for large-scale data processing, predictive analytics, and scientific computing.
- Python 2.7: extended support for this end-of-life release. It works perfectly, but unicode handling is not a dream.
- Python 3.4: slightly worse library support, but most of the popular packages have been ported to Python 3. Use it if you have the opportunity.
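
The unicode difference in practice, as a minimal sketch (the Hungarian sample string is just an illustration):

    # -*- coding: utf-8 -*-
    # Python 2: str holds bytes, and unicode literals need a u'' prefix,
    # e.g. text = u'árvíztűrő tükörfúrógép'.
    # Python 3: every str is unicode by default.
    text = 'árvíztűrő tükörfúrógép'
    print(text.upper())          # unicode-aware upper-casing
    print(text.encode('utf-8'))  # bytes only when explicitly requested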

Slide 10

Syntax

    print("Hello world")

Slide 11

Syntax

    def fn(name):
        print('Hello {}'.format(name))

    fn('John')

Slide 12

Syntax

    names = []
    if name == 'John':
        names.append({'name': name, 'length': len(name)})
    print(names)

Slide 13

Syntax

    for x in range(1, 11):
        for y in range(1, 11):
            print('%d * %d = %d' % (x, y, x * y))

Slide 14

IPython Notebook The IPython Notebook is a web-based interactive environment where you can combine code execution, text, mathematics, plots and rich media into a single document.

Slide 17

Theme map, highlighting Data Harvesting: Crawling, Web Scraping, API.

Slide 18

API: a specification of remote calls exposed to the API consumers. Work with general libraries (requests, urllib) or with application-specific libraries (facebook-sdk, python-twitter, python-linkedin).

    import requests
    print(requests.get('http://www.omdbapi.com/?s=%s' % 'Iron Man').json())

Slide 19

Web Scraping: extract information from structured documents, either with XML/HTML parser libraries (lxml, xmlsquash, beautifulsoup4) or with the Scrapy framework.

    import requests, bs4

    p = bs4.BeautifulSoup(requests.get('http://index.hu').text)
    print(p.select('div.cim h1.cikkcim a')[0])

Slide 20

Crawling: crawlers are used to navigate through web documents.

    import requests, bs4

    data = {}

    def crawler(url, level=0, maximum_depth=None):
        if maximum_depth and level >= maximum_depth:
            return
        data[url] = requests.get(url).text
        p = bs4.BeautifulSoup(data[url])
        for a in p.select('a'):
            href = a.attrs.get('href')
            # Skip missing or relative links and pages we already fetched.
            if not href or not href.startswith('http') or href in data:
                continue
            crawler(href, level + 1, maximum_depth)

    crawler('http://index.hu', maximum_depth=3)

Slide 21

Theme map, highlighting Data Cleansing.

Slide 22

Poor Quality Data: missing fields or wrong data. Incorrect characters or character encoding. Unknown date formats, e.g. 4/7/2014. Nonexistent data validation. Non-normalized input.

Slide 23

Data Cleansing: detect anomalies in the harvested data and standardise it to get structured data for analysis. You can use validation libraries for individual fields (phonenumbers, validate_email) and other libraries for standardisation (dateutil, …).
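
A minimal sketch of field-level cleansing with these libraries (the phone number and the e-mail address are made-up examples):

    import phonenumbers
    from dateutil import parser
    from validate_email import validate_email

    n = phonenumbers.parse('+36 30 123 4567', 'HU')  # hypothetical number
    print(phonenumbers.is_valid_number(n))
    print(phonenumbers.format_number(n, phonenumbers.PhoneNumberFormat.E164))

    # The ambiguous date format from the previous slide:
    print(parser.parse('4/7/2014'))                 # April 7 by default
    print(parser.parse('4/7/2014', dayfirst=True))  # July 4 with day-first parsing

    print(validate_email('[email protected]'))       # syntactic check only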

Slide 24

Theme map, highlighting Analyzing Data and Visualisation.

Slide 25

NumPy: the standard package for scientific and numerical computing. Offers an N-dimensional array object. Linear algebra operations with speed in mind. Vectorisation, broadcasting, aggregation functions. Easy interface to C/C++/Fortran code. Simple-to-understand data structure.

Slide 26

NumPy Example

    from numpy import *

    a = arange(15).reshape(3, 5)
    print(a)
    # array([[ 0,  1,  2,  3,  4],
    #        [ 5,  6,  7,  8,  9],
    #        [10, 11, 12, 13, 14]])
    print(a.shape)       # (3, 5)
    print(a.ndim)        # 2
    print(a.dtype.name)  # 'int64'
    print(a.itemsize)    # 8
    print(a.size)        # 15
    print(type(a))       # numpy.ndarray

    b = array([6, 7, 8])
    print(b)             # array([6, 7, 8])
    print(type(b))       # numpy.ndarray

Slide 27

SciPy: a collection of high-level mathematical operations. Depends on NumPy's data structures. Efficient numerical functions for regression, interpolation, integration and optimisation. Vectorisation, broadcasting, aggregation functions.

Slide 28

Matplotlib: display and plot your data quickly.
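
A minimal sketch of a quick plot (the data and the output file name are arbitrary):

    import matplotlib.pyplot as plt

    xs = list(range(1, 11))
    plt.plot(xs, [x * x for x in xs], 'o-')  # markers joined by a line
    plt.xlabel('x')
    plt.ylabel('x squared')
    plt.savefig('squares.png', dpi=96)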

Slide 29

SciPy + Matplotlib Example

    import numpy as np
    from scipy import special, optimize
    import matplotlib.pyplot as plt

    # Compute maximum
    f = lambda x: -special.jv(3, x)
    sol = optimize.minimize(f, 1.0)

    # Plot
    x = np.linspace(0, 10, 5000)
    plt.plot(x, special.jv(3, x), '-', sol.x, -sol.fun, 'o')

    # Produce output
    plt.savefig('output.png', dpi=96)

Slide 30

IPython: a rich architecture for interactive computing. Interactive shells. Browser-based notebook with support for code, text, mathematical expressions, inline plots, etc. Flexible, embeddable interpreters. Easy to use. Architecture for parallel computing.

Slide 31

IPython Notebook

Slide 32

NumPy & SciPy & Matplotlib & IPython together provide a MATLAB-"ish" environment.

Slide 33

Pandas: a Data Analysis Library that helps you carry out your entire data analysis workflow in Python. Fast and efficient DataFrame object for data manipulation. Reading and writing data in multiple formats. Reshaping, slicing, indexing, subsetting of large data sets. Merging and joining data sets. Optimised for performance (critical parts in Cython/C).

Slide 34

Pandas Example

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    dates = pd.date_range('20130101', periods=6)
    print(dates)
    # [2013-01-01, ..., 2013-01-06]
    # Length: 6, Freq: D, Timezone: None

    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    print(df)
    #                    A         B         C         D
    # 2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
    # 2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    # 2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    # 2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    # 2013-01-05 -0.424972  0.567020  0.276232 -1.087401
    # 2013-01-06 -0.673690  0.113648 -1.478427  0.524988

Slide 35

Pandas Example

    print(df.head())
    #                    A         B         C         D
    # 2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
    # 2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    # 2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    # 2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    # 2013-01-05 -0.424972  0.567020  0.276232 -1.087401

    print(df.tail(3))
    #                    A         B         C         D
    # 2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    # 2013-01-05 -0.424972  0.567020  0.276232 -1.087401
    # 2013-01-06 -0.673690  0.113648 -1.478427  0.524988

Slide 36

Pandas Example

    print(df.describe())
    #               A         B         C         D
    # count  6.000000  6.000000  6.000000  6.000000
    # mean   0.073711 -0.431125 -0.687758 -0.233103
    # std    0.843157  0.922818  0.779887  0.973118
    # min   -0.861849 -2.104569 -1.509059 -1.135632
    # 25%   -0.611510 -0.600794 -1.368714 -1.076610
    # 50%    0.022070 -0.228039 -0.767252 -0.386188
    # 75%    0.658444  0.041933 -0.034326  0.461706
    # max    1.212112  0.567020  0.276232  1.071804

    print(df.loc['20130102':'20130104', ['A', 'B']])
    #                    A         B
    # 2013-01-02  1.212112 -0.173215
    # 2013-01-03 -0.861849 -2.104569
    # 2013-01-04  0.721555 -0.706771

Slide 37

Pandas Example

    df = pd.DataFrame({
        'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
        'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
        'C': np.random.randn(8),
        'D': np.random.randn(8),
    })
    print(df)
    #      A      B         C         D
    # 0  foo    one -1.202872 -0.055224
    # 1  bar    one -1.814470  2.395985
    # 2  foo    two  1.018601  1.552825
    # 3  bar  three -0.595447  0.166599
    # 4  foo    two  1.395433  0.047609
    # 5  bar    two -0.392670 -0.136473
    # 6  foo    one  0.007207 -0.561757
    # 7  foo  three  1.928123 -1.623033

    print(df.groupby('A').sum())
    #             C        D
    # A
    # bar -2.802588  2.42611
    # foo  3.146492 -0.63958

Slide 38

Bokeh: a Python interactive visualization library that targets modern web browsers for presentation.

    import numpy as np
    from bokeh.plotting import *

    N = 1000
    x = np.linspace(0, 10, N)
    y = np.linspace(0, 10, N)
    xx, yy = np.meshgrid(x, y)
    d = np.sin(xx) * np.cos(yy)

    output_file("image.html", title="image.py example")
    image(
        image=[d], x=[0], y=[0], dw=[10], dh=[10],
        palette=["Spectral-11"],
        x_range=[0, 10], y_range=[0, 10],
        tools="pan,wheel_zoom,box_zoom,reset,previewsave",
        name="image_example",
    )
    show()  # open a browser

Slide 40

NLTK: a leading platform for building Python programs to work with human language data. Over 50 lexical resources included. Tokenizing: breaking text into segments. Stemming: reducing words to their stem. Classification: organising text based on tags and rules. Tagging: adding tense, related terms, properties, etc.
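
A minimal sketch of tokenizing, stemming and tagging (the sentence is arbitrary, and the downloadable resource names vary between NLTK versions):

    import nltk
    from nltk.stem import PorterStemmer

    # One-time resource downloads, e.g. nltk.download('punkt')

    tokens = nltk.word_tokenize('Python makes working with language data easy.')
    print(tokens)                                     # tokenizing
    print([PorterStemmer().stem(t) for t in tokens])  # stemming
    print(nltk.pos_tag(tokens))                       # tagging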

Slide 41

Theme map, highlighting Machine Learning: Scikit-Learn, Scikit-Image.

Slide 45

Scikit-Learn: an open-source library for Machine Learning with a simple fit, predict and transform API. Built on NumPy, SciPy and matplotlib. Covers classification, regression, clustering and dimensionality reduction.
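
A minimal sketch of the fit/predict API on the bundled iris dataset (the classifier choice is arbitrary):

    from sklearn import datasets
    from sklearn.svm import SVC

    iris = datasets.load_iris()
    clf = SVC()
    clf.fit(iris.data[:-10], iris.target[:-10])  # fit on all but the last 10 samples
    print(clf.predict(iris.data[-10:]))          # predict the held-out samples
    print(iris.target[-10:])                     # true labels, for comparison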

Slide 46

No content

Slide 47

Scikit-Learn

Slide 48

Scikit-Image: a collection of algorithms for image processing. Canny edge detector.
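
A minimal sketch of the Canny detector on a bundled sample image (older scikit-image releases exposed it as skimage.filter.canny):

    from skimage import data, feature

    image = data.camera()                  # built-in grayscale sample image
    edges = feature.canny(image, sigma=3)  # boolean edge map
    print(edges.shape, edges.sum())        # size and number of edge pixels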

Slide 49

Scikit-Image: a collection of algorithms for image processing. Template Matching.
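
A minimal sketch of template matching, cropping one coin from a bundled sample image and locating it again (the crop coordinates are arbitrary):

    import numpy as np
    from skimage import data, feature

    image = data.coins()
    template = image[170:220, 75:130]  # crop one coin as the template
    result = feature.match_template(image, template)
    row, col = np.unravel_index(np.argmax(result), result.shape)
    print('best match at:', row, col)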

Slide 50

Theme map, highlighting Data Reporting, Publish: CKAN, Cubes.

Slide 51

CKAN: a powerful data management system that makes data accessible by providing tools to streamline publishing, sharing, finding and using data. Web interface with API. Data visualisation and analytics. Workflow support. Integrated data storage.
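
A minimal sketch of reading from a CKAN portal through its Action API (demo.ckan.org is a public demo instance; any CKAN site exposes the same endpoints):

    import requests

    # List the datasets published on a CKAN instance.
    resp = requests.get('https://demo.ckan.org/api/3/action/package_list')
    datasets = resp.json()['result']
    print(len(datasets), datasets[:5])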

Slide 52

Cubes: reporting applications and aggregate browsing of multidimensionally modelled data. Analytical modelling and OLAP. Slicing and dicing, aggregation browser. OLAP server. SQL backend.
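
A minimal sketch following the Cubes tutorial flow (the SQLite URL, model.json and the cube name are placeholders):

    from cubes import Workspace

    workspace = Workspace()
    workspace.register_default_store('sql', url='sqlite:///data.sqlite')
    workspace.import_model('model.json')  # logical model describing the cube

    browser = workspace.browser('sales')  # hypothetical cube name
    result = browser.aggregate()
    print(result.summary)                 # aggregates over the whole cube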

Slide 53

Theme map, highlighting ETL: mETL, Bubbles, Luigi.

Slide 54

ETL using Bubbles: a framework for data processing and data quality measurement. Abstraction from the backend storage. Focus on the pipeline. Easy SCDs. Extensible.

    p = Pipeline()
    p.source_object("csv_source", "data.csv")
    p.distinct("category")
    p.pretty_print()
    p.run()

Slide 55

ETL using mETL: a versatile loader with easy configuration. No GUI; configuration via the YAML format. Checks differences between migrations. Quick transformations and manipulations. Easy to extend. 9 source types, 11 target types, 35+ built-in transformations.

    source:
      type: CSV
      resource: input.csv
      headerRow: 0
      skipRow: 1
      fields:
        - name: Name
        - name: Age
          type: Integer
    target:
      type: JSON
      resource: output.json

Slide 56

ETL using Luigi: batch data processing with data flow support. Dependency definitions. Hadoop integration. Data flow visualisation. Command line integration.
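
A minimal sketch of a Luigi task (the file names are placeholders; dependencies between tasks would be declared via a requires() method):

    import luigi

    class WordCount(luigi.Task):
        path = luigi.Parameter()  # input file, passed on the command line

        def output(self):
            return luigi.LocalTarget(self.path + '.count')

        def run(self):
            with open(self.path) as f:
                words = len(f.read().split())
            with self.output().open('w') as out:
                out.write(str(words))

    if __name__ == '__main__':
        # e.g. python wordcount.py WordCount --path book.txt --local-scheduler
        luigi.run()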

Slide 57

Summary map: the full ecosystem, Python at the centre, with all the stages and libraries shown above.

Slide 58

Problems: We need more data cleansing APIs. Python should be faster out of the box. Collaboration is not so easy. Visualisation is still hard. Heterogeneous tools would be cool. Python 3+.

Slide 59

Summary: Python across the whole data workflow.
- Data Harvesting
- Data Cleansing
- Analyzing Data
- Visualisation
- Machine Learning
- Data Reporting, Publish
- ETL

Slide 60

Thanks for Your Attention! @bfaludi