Slide 1

DATA PROCESSING & MACHINE LEARNING WITH PYTHON
AHMED KACHKACH @HALFLINGS - PYCON SWEDEN, MAY 2015

Slide 2

Who am I?
• Ahmed Kachkach <kachkach.com>
• Machine Learning master student @ KTH
• Interested in all things data, Python, web dev
• On Twitter & GitHub: @halflings

Slide 3

So... what's "data science"? A new buzzword to use in business stock photos?
much big data very NoSQL pls bitcoin wow nodejs such kony 2012

Slide 4

Data science = Statistics + Computer science!
"All your Bayes are belong to us!"
"No one can spell my name correctly :("

Slide 5

A typical data analysis pipeline
From "Biomedical Named Entity Recognition: A Survey of Machine-Learning Tools"

Slide 6

Subject of this talk
• Data analysis is a long process that requires different steps & various tools
• Many challenges at each step of the pipeline
• This is a "quick" overview of all steps of this process!

Slide 7

Outline
• Why Python?
• Fetching and cleaning data (requests, lxml, pandas, ...)
• Analysis / Machine learning (scikit-learn, SimpleCV)
• Visualization and data exploration (matplotlib, IPython, ...)

Slide 8

WHAT MAKES PYTHON GREAT FOR DATA ANALYSIS

Slide 9

Why Python?
• Clean syntax, dynamic language (good and bad)
• Strong principles, like simplicity and explicitness
• Incredibly active community!

Slide 10

Some useful features
• List comprehensions:
  [line.rstrip().lower() for line in file if not line.startswith("#")]
• Useful operators:
  map(str.upper, ["hey", "what's up?"])  # ["HEY", "WHAT'S UP?"]
  any(w.startswith("s") for w in {"mloukhiya", "saykouk"})  # True
  sorted(countries, key=lambda country: country.capital.size)
• Closures, decorators, first-class functions, ...
  @functools.lru_cache(maxsize=None)
  def some_expensive_function(x):
      # do stuff ...

Slide 11

FETCHING, CLEANING, VALIDATING DATA
IT ALL STARTS BY GETTING THE RAW STUFF

Slide 12

Raw data
• Data comes in all shapes and colors, or formats and encodings
• Finding a needle in a haystack (of irrelevant data)
• Acquiring different types of data requires different tools

Slide 13

FETCHING DATA FROM THE WEB

Slide 14

requests: HTTP for Humans
urllib2 can do the job... but things can quickly get messy as soon as you need to handle parameter encoding, retry logic, SSL, etc.
requests removes all the boilerplate code and provides a truly Pythonic API!

Don't miss Kenneth Reitz's talk tomorrow on some of Python's crufty APIs!

Slide 15

Fetching data from the web

import requests

print requests.get('http://example.com').text
# "... Example Domain ..."

Slide 16

Communicating with APIs

import requests

print requests.get("https://www.googleapis.com/books/v1/volumes",
                   params={"q": "machine learning"}).json()['items']
# [ {"volumeInfo": {
#       "title": "KTH",
#       "subtitle": "i.e. Kungliga Tekniska högskolan 1912-62 i.e. nittonhundra tolv till sextiotvå . Kungl. Tekniska Högskolan i Stockholm under 50 år",
#       . . . },
#   . . . ]

Slide 17

Parsing an HTML page

import lxml.html

page = lxml.html.parse('http://www.blocket.se/stockholm?q=apple')

# Querying by CSS class
print page.getroot().find_class('item_row')

# Querying using XPath
print page.xpath('//img[contains(@class, "item_image")]/@src')

Slide 18

scrapy: your own personal "Google Killer"™
scrapy lets you build web crawlers to fetch structured data from web pages.

Slide 19

Building a web crawler

from scrapy import Spider, Item, Field

class Post(Item):
    title = Field()

class BlogSpider(Spider):
    name, start_urls = 'blogspider', ['http://blog.scrapinghub.com']

    def parse(self, response):
        return [Post(title=e.extract()) for e in response.css("h2 a::text")]

And run it with: scrapy runspider myspider.py

Slide 20

pandas: Excel on steroids
Heavily inspired by R's data frames, pandas is a must-have for data analysis!
• Handles various inputs (csv, json, sql, excel, ...)
• Easy data validation and aggregation
• Lets you explore your data in many ways (more on that later)

Robin Linderborg gave a talk about Pandas right before me so... go back in time and check it out!

Slide 21

Reading local data

import pandas

df = pandas.read_csv('cars.csv')

# Filling missing values
df['Description'] = df['Description'].fillna("No description is available.")
df['Price'] = df['Price'].interpolate()
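
Beyond reading and filling data, pandas also makes the aggregation and exploration mentioned on the previous slide very short. A minimal sketch, assuming the hypothetical cars.csv has 'Make' and 'Price' columns (not from the original slides):

import pandas

df = pandas.read_csv('cars.csv')

# First rows and summary statistics of the numerical columns
print df.head()
print df.describe()

# Aggregation: average price per make
print df.groupby('Make')['Price'].mean()

# Boolean indexing: keep only the expensive cars
print df[df['Price'] > 100000]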

Slide 22

MACHINE LEARNING: ANALYZING THE DATA
WE HAVE THE DATA, LET'S MAKE SOMETHING OUT OF IT!

Slide 23

PART 1: PRE-PROCESSING

Slide 24

Pre-processing data
Pre-processing is often a vital step to change our data into a representation usable by our ML models. Among the most common steps:
• Feature extraction & Vectorization
• Scaling/Normalization
• Feature selection/Dimensionality reduction

Slide 25

Feature extraction
Raw data comes in multiple shapes:
• Image
• Text
• Structured data (database table, dictionary, etc.)

We need to extract relevant features from this data.

Slide 26

Example: text documents
Converting text documents to a vector representation using TF-IDF:

from sklearn import feature_extraction

corpus = ['Cats really are great.',
          'I like cats but I still prefer dogs.',
          'Dogs are the best.',
          'I like trains']

tfidf = feature_extraction.text.TfidfVectorizer()
print tfidf.fit_transform(corpus)
print tfidf.get_feature_names()

Slide 27

Vectorization
Your features may be in various forms:
• Numerical variables (ex: weight)
• Categorical variables (ex: country of origin)
• Boolean variables (ex: active account)
We have to represent all these variables in the vector space model to train our models.

Slide 28

Example: DictVectorizer
Transforming key->value pairs to vectors:

from sklearn import feature_extraction

data = [{"weight": 60., "sex": "female", "student": True},
        {"weight": 80.1, "sex": "male", "student": False},
        {"weight": 65.3, "sex": "male", "student": True},
        {"weight": 58.5, "sex": "female", "student": False}]

vectorizer = feature_extraction.DictVectorizer(sparse=False)
vectors = vectorizer.fit_transform(data)
print vectors
print vectorizer.get_feature_names()

Slide 29

Normalization
Many models are sensitive to the scale of the input data, so it's often a good idea to normalize the data we feed them (each sample is normalized independently):

from sklearn import preprocessing

data = [[10., 2345., 0., 2.],
        [3., -3490., 0.1, 1.99],
        [13., 3903., -0.2, 2.11]]

print preprocessing.normalize(data)

Slide 30

Dimensionality reduction Beware the curse of dimensionality!
 
 Many features can be invariant or heavily correlated with other features. A dimensionality reduction algorithm (like PCA) can help us get better performance and faster training/predictions.
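
As a hedged illustration (not from the original slides), PCA in scikit-learn follows the same fit/transform pattern as the other pre-processing steps; here it projects the 4-dimensional iris data down to 2 components:

from sklearn import datasets, decomposition

iris = datasets.load_iris()

# Keep only the 2 directions of maximum variance
pca = decomposition.PCA(n_components=2)
X_reduced = pca.fit_transform(iris.data)

print X_reduced.shape                  # (150, 2)
print pca.explained_variance_ratio_    # share of variance kept by each component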

Slide 31

Kernel PCA
Kernel PCA can be used for non-linear decompositions.
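
A minimal sketch of KernelPCA (the make_circles toy data and the RBF kernel with gamma=10 are assumptions borrowed from scikit-learn's own examples, not from the slides):

from sklearn import datasets, decomposition

# Two concentric circles: impossible to separate with a linear projection
X, y = datasets.make_circles(n_samples=400, factor=0.3, noise=0.05)

kpca = decomposition.KernelPCA(n_components=2, kernel='rbf', gamma=10)
X_kpca = kpca.fit_transform(X)
print X_kpca.shape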

Slide 32

PART 2: TRAINING & USING MODELS

Slide 33

Let the fun begin! We finally have some clean, relevant and structured data. Let’s train some models!

Slide 34

Classification
Given labeled inputs, we want to identify which class a new data point belongs to.
Many methods exist, none of them better than all the others ("no free lunch"): which one works best depends on the hypotheses we make about the data.

Among these classifiers: Decision Trees, Naive Bayes, SVMs, KNN, ...
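
All of these classifiers expose the same fit/predict interface in scikit-learn, so swapping one for another is a one-line change. A hedged sketch on the iris data (the sample flower measurements are made up):

from sklearn import datasets, naive_bayes, tree

iris = datasets.load_iris()
X, y = iris.data, iris.target

for clf in [tree.DecisionTreeClassifier(), naive_bayes.GaussianNB()]:
    clf.fit(X, y)
    # Predict the class of one unseen flower (sepal/petal measurements)
    print clf.__class__.__name__, clf.predict([[5.0, 3.5, 1.4, 0.2]])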

Slide 35

ENOUGH ANAGRAMS.
 
 LET’S SEE A TYPICALLY SWEDISH CLASSIFICATION PROBLEM:
 
 PASTRIES

Slide 36

Training the Support Pastries Machine
Psst, dear SPM, these are "kanelbullar". Yum! *data-analysis intensifies*

Slide 37

Training the Support Pastries Machine
These are not kanelbullar! Ugh! *crunching vectors*

Slide 38

Using the model

Slide 39

Using the model

Slide 40

Example: Support Vector Machine

from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target

# Training the model
clf = svm.SVC(kernel='rbf')
clf.fit(X, y)

# Doing predictions
new_data = [[4.85, 3.1], [5.61, 3.02], [6.63, 3.13]]
print clf.predict(new_data)

Slide 41

Regression
Given a series of inputs and their target outputs, we want to predict the output for new inputs.

Slide 42

Example: LinearRegression

import numpy as np
from sklearn import linear_model

def f(x):
    return x + np.random.random() * 3.

X = np.arange(0, 5, 0.5)
X = X.reshape((len(X), 1))
y = map(f, X)

clf = linear_model.LinearRegression()
clf.fit(X, y)

Slide 43

Example: LinearClassifier

Slide 44

Clustering
Grouping similar data points together. This can be done either with a known number of clusters (KMeans, hierarchical clustering, ...) or with an unknown number of clusters (mean-shift, DBSCAN, ...).
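
When the number of clusters is known in advance, KMeans is the usual starting point. A minimal sketch on toy blobs (not from the slides):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated blobs, clustered back into three groups
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=0)

kmeans = KMeans(n_clusters=3).fit(X)
print kmeans.labels_
print kmeans.cluster_centers_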

Slide 45

Comparison of clustering techniques

Slide 46

Example: DBSCAN

from sklearn.cluster import DBSCAN
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=200, centers=centers, cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
labels = db.labels_

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.show()

Slide 47

MODEL EVALUATION

Slide 48

"No free hunch"
Looking at your program's output and saying "Mmh, that looks about right" really isn't a sane way to evaluate your models.
 
 scikit-learn makes it extremely easy to do systematic model evaluation.

Slide 49

Integrated model evaluation
• Most scikit-learn classifiers have a score function that takes a list of inputs and the target outputs.
• Scoring functions let you calculate some of these values:
  • accuracy
  • precision/recall
  • mean absolute error / mean squared error
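
A hedged sketch of the score function and one extra metric, using a held-out test set (not from the original slides; the 70/30 split is arbitrary, and the same sklearn.cross_validation module as on the next slide is assumed):

from sklearn import datasets, svm
from sklearn.cross_validation import train_test_split
from sklearn.metrics import confusion_matrix

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

clf = svm.SVC().fit(X_train, y_train)

# Mean accuracy on data the model has never seen
print clf.score(X_test, y_test)

# Where the classifier gets confused, class by class
print confusion_matrix(y_test, clf.predict(X_test))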

Slide 50

Cross-validation

from sklearn import svm, cross_validation, datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target
model = svm.SVC()

print cross_validation.cross_val_score(model, X, y, scoring='precision')
print cross_validation.cross_val_score(model, X, y, scoring='mean_squared_error')

Slide 51

VISUALIZE AND EXPLORE DATA
SCATTER PLOTS, GRAPHS, HEAT-MAPS AND OTHER FANCY THINGS.

Slide 52

Matplotlib
The "go-to" plotting library in Python, integrated with most scientific/data libraries (pandas, scikit-learn, etc.).
Easy to use, supports many kinds of plots and offers a high level of customizability, but plots can be pretty "ugly" by default and don't integrate well with web apps out of the box.
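
A minimal, hedged matplotlib sketch (not part of the original slides): a labelled scatter plot of some noisy synthetic data.

import numpy as np
import matplotlib.pyplot as plt

x = np.random.rand(100)
y = 2 * x + np.random.normal(scale=0.1, size=100)

plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.title('A noisy linear relationship')
plt.show()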

Slide 53

Bokeh
Simple API, more diverse plots, allows plotting interactive graphs that can be shared on the web (rendered in the browser with BokehJS).
 
 Example: http://bokeh.pydata.org/en/latest/docs/gallery/texas.html
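
A hedged sketch of Bokeh's plotting interface (figure / line / show; the file name and data here are made up, not from the slides):

from bokeh.plotting import figure, output_file, show

# The plot is written to a standalone, interactive HTML page
output_file('lines.html')

p = figure(title='A simple line plot', x_axis_label='x', y_axis_label='y')
p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], line_width=2)

show(p)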

Slide 54

ggplot
Similar to R's ggplot2, arguably fancier plots than matplotlib!

Slide 55

Jupyter / IPython notebook
Needs a whole talk by itself!
 Lets you build interactive notebooks, usable right in the browser. Notebooks can easily be shared, exported as static web pages or even presentations & books!
 Really popular in the data science community.
 
 Demo? (if there’s enough time left!)

Slide 56

THAT'S ALL FOLKS!

QUESTIONS? THANKS FOR YOUR ATTENTION!

AHMED KACHKACH <KACHKACH.COM>
@HALFLINGS