
Python for Data Science

Bence Faludi
November 26, 2014

I will show you the most important packages for Data Science in Python. The presentation was delivered at Budapest BI Fórum, 2014. http://2014.budapestbiforum.hu


Transcript

  1. WHOAMI: Sr. Database Manager at Mito. Organizer of Budapest Database Meetup. Support member of budapest.py Meetup. Member of the Python Software Foundation and NumFOCUS. Bence Faludi | @bfaludi | [email protected]
  2. Goals: 1) See the possibilities in the environment. 2) Embrace the fact that Python is easy. 3) Learn about new libraries.
  3. Theme: Python covers the whole pipeline. Data Harvesting (Crawling, Web Scraping, API); Data Cleansing; Analyzing Data (NumPy, SciPy, Pandas, NLTK, IPython, SymPy, NetworkX, Numba); Visualisation (Plotly, Matplotlib, Bokeh, Vispy, mply); Machine Learning (Scikit-Learn, Scikit-Image, PyBrain); Data Reporting, Publish (CKAN, Cubes); ETL (mETL, Bubbles, Luigi).
  5. Why Python? It's awesome and popular! Free and Open Source language. Readable syntax. Easy to learn and has an active community. Large amount of libraries. High-level language.
  6. Download Anaconda to Start: a free, enterprise-ready, cross-platform Python distribution for large-scale data processing, predictive analytics, and scientific computing.
  7. Python 2.7 vs Python 3.4. Python 2.7: extended support of this end-of-life release; it works perfectly, but unicode handling is not a dream. Python 3.4: slightly worse library support, but most of the popular packages have been ported to Python 3; use it if you have the opportunity.
  8. Syntax
    name = 'John'
    names = []
    if name == 'John':
        names.append({'name': name, 'length': len(name)})
    print(names)
  9. Syntax
    for x in range(1, 11):
        for y in range(1, 11):
            print('%d * %d = %d' % (x, y, x * y))
  10. IPython Notebook: a web-based interactive environment where you can combine code execution, text, mathematics, plots and rich media into a single document.
  13. Theme: Data Harvesting (Crawling, Web Scraping, API).
  14. API: specification of remote calls exposed to the API consumers. With general libraries (requests, urllib) or with application-specific libraries (facebook-sdk, python-twitter, python-linkedin).
    import requests
    print(requests.get('http://www.omdbapi.com/?s=%s' % 'Iron Man').json())
  15. Web Scraping: extract information from structured documents. With XML parser libraries (lxml, xmlsquash, beautifulsoup4) or with the Scrapy framework.
    import requests, bs4
    p = bs4.BeautifulSoup(requests.get('http://index.hu').text)
    print(p.select('div.cim h1.cikkcim a')[0])
  16. Crawling: crawlers are used to navigate through web documents.
    import requests, bs4

    data = {}

    def crawler(url, level=0, maximum_depth=None):
        if maximum_depth and level >= maximum_depth:
            return
        data[url] = requests.get(url).text
        p = bs4.BeautifulSoup(data[url])
        for a in p.select('a'):
            href = a.attrs.get('href')
            # skip missing or relative links and pages we have already visited
            if not href or not href.startswith('http') or href in data:
                continue
            crawler(href, level + 1, maximum_depth)

    crawler('http://index.hu', maximum_depth=3)
  17. Theme: Data Cleansing.
  18. Poor Quality Data: missing fields or wrong data. Incorrect characters or character encoding. Unknown date formats, e.g. 4/7/2014. Non-existing data validation. Not normalized input.
  19. Data Cleansing: detect anomalies in harvested data and standardise it to get structured data for analysis. You can use validation libraries for individual fields (phonenumbers, validate_email) and other libraries for standardisation (dateutil, …).
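The unknown-date-format problem mentioned above (e.g. 4/7/2014) can be handled with the standard library alone; a minimal sketch that tries a list of candidate formats in order (the format list and function name are my own, not from the slides):

```python
from datetime import datetime

def normalize_date(value, formats=('%Y-%m-%d', '%m/%d/%Y', '%d/%m/%Y')):
    """Try each candidate format in order; return an ISO 8601 date string."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError('Unknown date format: %r' % value)

print(normalize_date('4/7/2014'))  # first matching format wins: '2014-04-07'
```

Note that ambiguous inputs such as 4/7/2014 are resolved by the order of the format list, so that order is a policy decision, not something the code can infer.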
  20. Theme: Analyzing Data and Visualisation.
  21. NumPy: the standard package for scientific and numerical computing. Offers an N-dimensional array object. Linear algebra operations with speed in mind. Vectorisation, broadcasting, aggregation functions. Easy interface to C/C++/Fortran code. Simple-to-understand data structure.
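The vectorisation, broadcasting and aggregation claims above can be shown in a few lines (a minimal illustration of my own, not from the deck):

```python
import numpy as np

a = np.arange(15).reshape(3, 5)    # 3x5 matrix of 0..14
row_means = a.mean(axis=1)         # aggregate per row: [2., 7., 12.]
centered = a - row_means[:, None]  # broadcasting: (3,1) stretches across (3,5)
print(centered.sum())              # rows are mean-centered, so the total is 0.0
```

No Python-level loop runs here: the subtraction is applied element-wise in C, which is where NumPy's speed comes from.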
  22. NumPy Example
    from numpy import *

    a = arange(15).reshape(3, 5)
    print(a)
    # array([[ 0,  1,  2,  3,  4],
    #        [ 5,  6,  7,  8,  9],
    #        [10, 11, 12, 13, 14]])
    print(a.shape)       # (3, 5)
    print(a.ndim)        # 2
    print(a.dtype.name)  # 'int64'
    print(a.itemsize)    # 8
    print(a.size)        # 15
    print(type(a))       # numpy.ndarray

    b = array([6, 7, 8])
    print(b)             # array([6, 7, 8])
    print(type(b))       # numpy.ndarray
  23. SciPy: collections of high-level mathematical operations. Depends on NumPy's data structures. Efficient numerical functions for regression, interpolation, integration, optimisation. Vectorisation, broadcasting, aggregation functions.
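The integration and interpolation functions mentioned above can be sketched briefly (my own toy example, assuming SciPy is installed):

```python
from scipy import integrate, interpolate

# Numerical integration: quad returns (value, error estimate)
value, err = integrate.quad(lambda x: x ** 2, 0, 1)
print(round(value, 6))  # integral of x^2 over [0, 1] is 1/3: 0.333333

# Linear interpolation through three known points
f = interpolate.interp1d([0, 1, 2], [0, 10, 20])
print(float(f(0.5)))    # halfway between 0 and 10: 5.0
```

Both calls hide the numerical machinery (adaptive quadrature, piecewise interpolants) behind one-line interfaces, which is the point of the slide.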
  24. SciPy + Matplotlib Example
    import numpy as np
    from scipy import special, optimize
    import matplotlib.pyplot as plt

    # Compute maximum
    f = lambda x: -special.jv(3, x)
    sol = optimize.minimize(f, 1.0)

    # Plot
    x = np.linspace(0, 10, 5000)
    plt.plot(x, special.jv(3, x), '-', sol.x, -sol.fun, 'o')

    # Produce output
    plt.savefig('output.png', dpi=96)
  25. IPython: rich architecture for interactive computing. Interactive shells. Browser-based notebook with support for code, text, mathematical expressions, inline plots, etc. Flexible, embeddable interpreters. Easy to use. Architecture for parallel computing.
  26. Pandas: a Data Analysis Library which helps you carry out your entire data analysis workflow in Python. Fast and efficient DataFrame object for data manipulation. Reading and writing data in multiple formats. Reshaping, slicing, indexing, subsetting of large data sets. Merging and joining data sets. Optimised for performance (critical parts in Cython/C).
  27. Pandas Example
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    dates = pd.date_range('20130101', periods=6)
    print(dates)
    # <class 'pandas.tseries.index.DatetimeIndex'>
    # [2013-01-01, ..., 2013-01-06]
    # Length: 6, Freq: D, Timezone: None

    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    print(df)
    #                    A         B         C         D
    # 2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
    # 2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    # 2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    # 2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    # 2013-01-05 -0.424972  0.567020  0.276232 -1.087401
    # 2013-01-06 -0.673690  0.113648 -1.478427  0.524988
  28. Pandas Example
    print(df.head())
    #                    A         B         C         D
    # 2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
    # 2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    # 2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    # 2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    # 2013-01-05 -0.424972  0.567020  0.276232 -1.087401

    print(df.tail(3))
    #                    A         B         C         D
    # 2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    # 2013-01-05 -0.424972  0.567020  0.276232 -1.087401
    # 2013-01-06 -0.673690  0.113648 -1.478427  0.524988
  29. Pandas Example
    print(df.describe())
    #               A         B         C         D
    # count  6.000000  6.000000  6.000000  6.000000
    # mean   0.073711 -0.431125 -0.687758 -0.233103
    # std    0.843157  0.922818  0.779887  0.973118
    # min   -0.861849 -2.104569 -1.509059 -1.135632
    # 25%   -0.611510 -0.600794 -1.368714 -1.076610
    # 50%    0.022070 -0.228039 -0.767252 -0.386188
    # 75%    0.658444  0.041933 -0.034326  0.461706
    # max    1.212112  0.567020  0.276232  1.071804

    print(df.loc['20130102':'20130104', ['A', 'B']])
    #                    A         B
    # 2013-01-02  1.212112 -0.173215
    # 2013-01-03 -0.861849 -2.104569
    # 2013-01-04  0.721555 -0.706771
  30. Pandas Example
    df = pd.DataFrame({
        'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
        'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
        'C': np.random.randn(8),
        'D': np.random.randn(8)
    })
    print(df)
    #      A      B         C         D
    # 0  foo    one -1.202872 -0.055224
    # 1  bar    one -1.814470  2.395985
    # 2  foo    two  1.018601  1.552825
    # 3  bar  three -0.595447  0.166599
    # 4  foo    two  1.395433  0.047609
    # 5  bar    two -0.392670 -0.136473
    # 6  foo    one  0.007207 -0.561757
    # 7  foo  three  1.928123 -1.623033

    print(df.groupby('A').sum())
    #             C        D
    # A
    # bar -2.802588  2.42611
    # foo  3.146492 -0.63958
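Slide 26 also lists merging and joining of data sets, but no example for that reaches the deck; a minimal sketch with my own toy frames:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

# Inner join on 'key': only keys present in both frames survive
merged = pd.merge(left, right, on='key', how='inner')
print(merged)
#    key  lval  rval
# 0  foo     1     4
```

Switching `how` to 'left', 'right' or 'outer' changes which unmatched rows are kept, mirroring SQL join semantics.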
  31. Bokeh: a Python interactive visualization library that targets modern web browsers for presentation.
    import numpy as np
    from bokeh.plotting import *

    N = 1000
    x = np.linspace(0, 10, N)
    y = np.linspace(0, 10, N)
    xx, yy = np.meshgrid(x, y)
    d = np.sin(xx) * np.cos(yy)

    output_file("image.html", title="image.py example")
    image(
        image=[d], x=[0], y=[0], dw=[10], dh=[10],
        palette=["Spectral-11"],
        x_range=[0, 10], y_range=[0, 10],
        tools="pan,wheel_zoom,box_zoom,reset,previewsave",
        name="image_example",
    )
    show()  # open a browser
  33. NLTK: a leading platform for building Python programs to work with human language data. Over 50 lexical resources included. Tokenizing: breaking text into segments. Stemming: reducing words to their stem. Classification: organising text based on tags and rules. Tagging: adding tense, related terms, properties, etc.
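The stemming bullet above can be demonstrated without downloading any corpora (a minimal sketch of my own using NLTK's Porter stemmer, assuming NLTK is installed):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ['running', 'ran', 'fairly']:
    # The Porter algorithm strips suffixes by rule; it is not a dictionary lookup
    print(word, '->', stemmer.stem(word))
# running -> run
# ran -> ran
# fairly -> fairli
```

Note that stems need not be dictionary words ('fairli'); stemming only normalises variants to a common key, which is usually enough for indexing and classification.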
  34. Theme: Machine Learning.
  35. Scikit-Learn: open source library for Machine Learning with a simple fit, predict and transform API. Built on NumPy, SciPy and matplotlib. Covers Classification, Regression, Clustering and Dimensionality reduction.
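The fit/predict API named above, sketched with a tiny clustering example (my own toy data, assuming scikit-learn is available):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of 2-D points
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# fit() learns the cluster centres; predict() assigns a label to each point
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.predict(X)
print(labels)  # the first three points share one label, the last three the other
```

Every scikit-learn estimator follows the same pattern, so swapping KMeans for a classifier or a regressor changes the algorithm but not the calling code.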
  39. Theme: Data Reporting, Publish.
  40. CKAN: a powerful data management system that makes data accessible by providing tools to streamline publishing, sharing, finding and using data. Web interface with API. Data visualisation and analytics. Workflow support. Integrated data storage.
  41. Cubes: reporting applications and aggregate browsing of multi-dimensionally modelled data. Analytical modelling and OLAP. Slicing and dicing, aggregation browser. OLAP server. SQL backend.
  42. Theme: ETL.
  43. ETL using Bubbles: a framework for data processing and data quality measurement. Abstraction from the backend storage. Focus on the pipeline. Easy SCDs (slowly changing dimensions). Extensible.
    p = Pipeline()
    p.source_object("csv_source", "data.csv")
    p.distinct("category")
    p.pretty_print()
    p.run()
  44. ETL using mETL: versatile loader with easy configuration. No GUI, configuration via YAML format. Checks differences between migrations. Quick transformations and manipulations. Easy to extend. 9 source types, 11 target types, 35+ built-in transformations.
    source:
      type: CSV
      resource: input.csv
      headerRow: 0
      skipRow: 1
      fields:
        - name: Name
        - name: Age
          type: Integer
    target:
      type: JSON
      resource: output.json
  45. ETL using Luigi: batch data processing with data flow support. Dependency definitions. Hadoop integration. Data flow visualisation. Command line integration.
  46. Summary: Python covers the whole pipeline, from Data Harvesting and Data Cleansing through Analyzing Data, Visualisation and Machine Learning to ETL and Data Reporting, Publish.
  47. Problems: We need more data cleansing APIs. Python should be quicker out of the box. Collaboration is not so easy. Visualisation is still hard. Heterogeneous tools would be cool. Python 3+!