
Python for Data Science

Bence Faludi
November 26, 2014

I will show you the most important packages for Data Science in Python. The presentation was delivered at Budapest BI Fórum, 2014. http://2014.budapestbiforum.hu


Transcript

  1. WHOAMI: Sr. Database Manager at Mito. Organizer of Budapest Database Meetup. Support member of budapest.py Meetup. Member of the Python Software Foundation and NumFOCUS. Bence Faludi | @bfaludi | [email protected]
  2. Goals: 1) See the possibilities in the environment. 2) Embrace the fact that Python is easy. 3) Learn about new libraries.
  3. Theme: Python covers the whole pipeline. Data Harvesting (Crawling, Web Scraping, API); Data Cleansing; Analyzing Data (NumPy, SciPy, Pandas, NLTK, IPython, SymPy, NetworkX, Numba); Visualisation (Plotly, Matplotlib, Bokeh, Vispy, mply); Machine Learning (Scikit-Learn, Scikit-Image, PyBrain); Data Reporting, Publish (CKAN, Cubes); ETL (mETL, Bubbles, Luigi).
  5. Why Python? It's awesome and popular! Free and Open Source language. Readable syntax. Easy to learn and has an active community. Large amount of libraries. High-level language.
  6. Download Anaconda to Start: a free, enterprise-ready, cross-platform Python distribution for large-scale data processing, predictive analytics, and scientific computing.
  7. Python 2.7 vs Python 3.4. Python 2.7: extended support of this end-of-life release; it works perfectly, but unicode handling is not a dream. Python 3.4: slightly worse library support, but most of the popular packages have been ported to Python 3; use it if you have the opportunity.
  8. Syntax
    name = 'John'
    names = []
    if name == 'John':
        names.append({'name': name, 'length': len(name)})
    print(names)
  9. Syntax
    for x in range(1, 11):
        for y in range(1, 11):
            print('%d * %d = %d' % (x, y, x * y))
  10. IPython Notebook: a web-based interactive environment where you can combine code execution, text, mathematics, plots and rich media into a single document.
  13. Theme: Data Harvesting (Crawling, Web Scraping, API).
  14. API: specification of remote calls exposed to the API consumers. With general libraries (requests, urllib) or with application-specific libraries (facebook-sdk, python-twitter, python-linkedin).
    import requests
    print(requests.get('http://www.omdbapi.com/?s=%s' % 'Iron Man').json())
  15. Web Scraping: extract information from structured documents. With XML parser libraries (lxml, xmlsquash, beautifulsoup4) or with the Scrapy framework.
    import requests, bs4
    p = bs4.BeautifulSoup(requests.get('http://index.hu').text)
    print(p.select('div.cim h1.cikkcim a')[0])
  16. Crawling: crawlers are used to navigate through web documents.
    import requests, bs4

    data = {}

    def crawler(url, level=0, maximum_depth=None):
        if maximum_depth and level >= maximum_depth:
            return
        data[url] = requests.get(url).text
        p = bs4.BeautifulSoup(data[url])
        for a in p.select('a'):
            href = a.attrs.get('href')
            # skip missing or relative links and pages we have already visited
            if not href or not href.startswith('http') or href in data:
                continue
            crawler(href, level + 1, maximum_depth)

    crawler('http://index.hu', maximum_depth=3)
  17. Theme: Data Cleansing.
  18. Poor Quality Data: missing fields or wrong data. Incorrect characters or character encoding. Unknown date formats, e.g. 4/7/2014. Non-existing data validation. Not normalized input.
  19. Data Cleansing: detect anomalies in harvested data and standardise it to get structured data for analysis. You can use validation libraries for individual fields (phonenumbers, validate_email) and other libraries for standardisation (dateutil, …).
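The unknown-date-format problem mentioned above (e.g. 4/7/2014) can be handled with the standard library alone; a minimal sketch that tries a list of candidate formats in order (the format list and function name are my own, not from the slides):

```python
from datetime import datetime

def normalize_date(value, formats=('%Y-%m-%d', '%m/%d/%Y', '%d/%m/%Y')):
    """Try each candidate format in order; return an ISO 8601 date string."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError('Unknown date format: %r' % value)

print(normalize_date('4/7/2014'))  # first matching format wins: '2014-04-07'
```

Note that ambiguous inputs such as 4/7/2014 are resolved by the order of the format list, so that order is a policy decision, not something the code can infer.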
  20. Theme: Analyzing Data and Visualisation.
  21. NumPy: the standard package for scientific and numerical computing. Offers an N-dimensional array object. Linear algebra operations with speed in mind. Vectorisation, broadcasting, aggregation functions. Easy interface to C/C++/Fortran code. Simple-to-understand data structure.
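The vectorisation, broadcasting and aggregation claims above can be shown in a few lines (a minimal illustration of my own, not from the deck):

```python
import numpy as np

a = np.arange(15).reshape(3, 5)    # 3x5 matrix of 0..14
row_means = a.mean(axis=1)         # aggregate per row: [2., 7., 12.]
centered = a - row_means[:, None]  # broadcasting: (3,1) stretches across (3,5)
print(centered.sum())              # rows are mean-centered, so the total is 0.0
```

No Python-level loop runs here: the subtraction is applied element-wise in C, which is where NumPy's speed comes from.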
  22. NumPy Example
    from numpy import *

    a = arange(15).reshape(3, 5)
    print(a)
    # array([[ 0,  1,  2,  3,  4],
    #        [ 5,  6,  7,  8,  9],
    #        [10, 11, 12, 13, 14]])
    print(a.shape)       # (3, 5)
    print(a.ndim)        # 2
    print(a.dtype.name)  # 'int64'
    print(a.itemsize)    # 8
    print(a.size)        # 15
    print(type(a))       # numpy.ndarray

    b = array([6, 7, 8])
    print(b)             # array([6, 7, 8])
    print(type(b))       # numpy.ndarray
  23. SciPy: collections of high-level mathematical operations. Depends on NumPy's data structures. Efficient numerical functions for regression, interpolation, integration, optimisation. Vectorisation, broadcasting, aggregation functions.
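The integration and interpolation functions mentioned above can be sketched briefly (my own toy example, assuming SciPy is installed):

```python
from scipy import integrate, interpolate

# Numerical integration: quad returns (value, error estimate)
value, err = integrate.quad(lambda x: x ** 2, 0, 1)
print(round(value, 6))  # integral of x^2 over [0, 1] is 1/3: 0.333333

# Linear interpolation through three known points
f = interpolate.interp1d([0, 1, 2], [0, 10, 20])
print(float(f(0.5)))    # halfway between 0 and 10: 5.0
```

Both calls hide the numerical machinery (adaptive quadrature, piecewise interpolants) behind one-line interfaces, which is the point of the slide.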
  24. SciPy + Matplotlib Example
    import numpy as np
    from scipy import special, optimize
    import matplotlib.pyplot as plt

    # Compute maximum
    f = lambda x: -special.jv(3, x)
    sol = optimize.minimize(f, 1.0)

    # Plot
    x = np.linspace(0, 10, 5000)
    plt.plot(x, special.jv(3, x), '-', sol.x, -sol.fun, 'o')

    # Produce output
    plt.savefig('output.png', dpi=96)
  25. IPython: rich architecture for interactive computing. Interactive shells. Browser-based notebook with support for code, text, mathematical expressions, inline plots, etc. Flexible, embeddable interpreters. Easy to use. Architecture for parallel computing.
  26. Pandas: a Data Analysis Library which helps you carry out your entire data analysis workflow in Python. Fast and efficient DataFrame object for data manipulation. Reading and writing data in multiple formats. Reshaping, slicing, indexing, subsetting of large data sets. Merging and joining data sets. Optimised for performance (critical parts in Cython/C).
  27. Pandas Example
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    dates = pd.date_range('20130101', periods=6)
    print(dates)
    # <class 'pandas.tseries.index.DatetimeIndex'>
    # [2013-01-01, ..., 2013-01-06]
    # Length: 6, Freq: D, Timezone: None

    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    print(df)
    #                    A         B         C         D
    # 2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
    # 2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    # 2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    # 2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    # 2013-01-05 -0.424972  0.567020  0.276232 -1.087401
    # 2013-01-06 -0.673690  0.113648 -1.478427  0.524988
  28. Pandas Example
    print(df.head())
    #                    A         B         C         D
    # 2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
    # 2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    # 2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    # 2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    # 2013-01-05 -0.424972  0.567020  0.276232 -1.087401

    print(df.tail(3))
    #                    A         B         C         D
    # 2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    # 2013-01-05 -0.424972  0.567020  0.276232 -1.087401
    # 2013-01-06 -0.673690  0.113648 -1.478427  0.524988
  29. Pandas Example
    print(df.describe())
    #               A         B         C         D
    # count  6.000000  6.000000  6.000000  6.000000
    # mean   0.073711 -0.431125 -0.687758 -0.233103
    # std    0.843157  0.922818  0.779887  0.973118
    # min   -0.861849 -2.104569 -1.509059 -1.135632
    # 25%   -0.611510 -0.600794 -1.368714 -1.076610
    # 50%    0.022070 -0.228039 -0.767252 -0.386188
    # 75%    0.658444  0.041933 -0.034326  0.461706
    # max    1.212112  0.567020  0.276232  1.071804

    print(df.loc['20130102':'20130104', ['A', 'B']])
    #                    A         B
    # 2013-01-02  1.212112 -0.173215
    # 2013-01-03 -0.861849 -2.104569
    # 2013-01-04  0.721555 -0.706771
  30. Pandas Example
    df = pd.DataFrame({
        'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
        'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
        'C': np.random.randn(8),
        'D': np.random.randn(8)
    })
    print(df)
    #      A      B         C         D
    # 0  foo    one -1.202872 -0.055224
    # 1  bar    one -1.814470  2.395985
    # 2  foo    two  1.018601  1.552825
    # 3  bar  three -0.595447  0.166599
    # 4  foo    two  1.395433  0.047609
    # 5  bar    two -0.392670 -0.136473
    # 6  foo    one  0.007207 -0.561757
    # 7  foo  three  1.928123 -1.623033

    print(df.groupby('A').sum())
    #             C        D
    # A
    # bar -2.802588  2.42611
    # foo  3.146492 -0.63958
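Slide 26 also lists merging and joining of data sets, but no example for that reaches the deck; a minimal sketch with my own toy frames:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

# Inner join on 'key': only keys present in both frames survive
merged = pd.merge(left, right, on='key', how='inner')
print(merged)
#    key  lval  rval
# 0  foo     1     4
```

Switching `how` to 'left', 'right' or 'outer' changes which unmatched rows are kept, mirroring SQL join semantics.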
  31. Bokeh: a Python interactive visualization library that targets modern web browsers for presentation.
    import numpy as np
    from bokeh.plotting import *

    N = 1000
    x = np.linspace(0, 10, N)
    y = np.linspace(0, 10, N)
    xx, yy = np.meshgrid(x, y)
    d = np.sin(xx) * np.cos(yy)

    output_file("image.html", title="image.py example")
    image(
        image=[d], x=[0], y=[0], dw=[10], dh=[10],
        palette=["Spectral-11"],
        x_range=[0, 10], y_range=[0, 10],
        tools="pan,wheel_zoom,box_zoom,reset,previewsave",
        name="image_example",
    )
    show()  # open a browser
  33. NLTK: a leading platform for building Python programs to work with human language data. Over 50 lexical resources included. Tokenizing: breaking text into segments. Stemming: reducing words to their stem. Classification: organising text based on tags and rules. Tagging: adding tense, related terms, properties, etc.
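The stemming bullet above can be demonstrated without downloading any corpora (a minimal sketch of my own using NLTK's Porter stemmer, assuming NLTK is installed):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ['running', 'ran', 'fairly']:
    # The Porter algorithm strips suffixes by rule; it is not a dictionary lookup
    print(word, '->', stemmer.stem(word))
# running -> run
# ran -> ran
# fairly -> fairli
```

Note that stems need not be dictionary words ('fairli'); stemming only normalises variants to a common key, which is usually enough for indexing and classification.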
  34. Theme: Machine Learning.
  35. Scikit-Learn: open source library for Machine Learning with a simple fit, predict and transform API. Built on NumPy, SciPy and matplotlib. Covers Classification, Regression, Clustering and Dimensionality reduction.
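The fit/predict API named above, sketched with a tiny clustering example (my own toy data, assuming scikit-learn is available):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of 2-D points
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# fit() learns the cluster centres; predict() assigns a label to each point
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.predict(X)
print(labels)  # the first three points share one label, the last three the other
```

Every scikit-learn estimator follows the same pattern, so swapping KMeans for a classifier or a regressor changes the algorithm but not the calling code.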
  39. Theme: Data Reporting, Publish.
  40. CKAN: a powerful data management system that makes data accessible by providing tools to streamline publishing, sharing, finding and using data. Web interface with API. Data visualisation and analytics. Workflow support. Integrated data storage.
  41. Cubes: reporting applications and aggregate browsing of multi-dimensionally modelled data. Analytical modelling and OLAP. Slicing and dicing, aggregation browser. OLAP server. SQL backend.
  42. Theme: ETL.
  43. ETL using Bubbles: a framework for data processing and data quality measurement. Abstraction from the backend storage. Focus on the pipeline. Easy SCDs (slowly changing dimensions). Extensible.
    p = Pipeline()
    p.source_object("csv_source", "data.csv")
    p.distinct("category")
    p.pretty_print()
    p.run()
  44. ETL using mETL: versatile loader with easy configuration. No GUI, configuration via YAML format. Checks differences between migrations. Quick transformations and manipulations. Easy to extend. 9 source types, 11 target types, 35+ built-in transformations.
    source:
      type: CSV
      resource: input.csv
      headerRow: 0
      skipRow: 1
      fields:
        - name: Name
        - name: Age
          type: Integer
    target:
      type: JSON
      resource: output.json
  45. ETL using Luigi: batch data processing with data flow support. Dependency definitions. Hadoop integration. Data flow visualisation. Command line integration.
  46. Summary: Python covers the whole pipeline, from Data Harvesting and Data Cleansing through Analyzing Data, Visualisation and Machine Learning to ETL and Data Reporting, Publish.
  47. Problems: We need more data cleansing APIs. Python should be quicker out of the box. Collaboration is not so easy. Visualisation is still hard. Heterogeneous tools would be cool. Python 3+!