Slide 1

Slide 1 text

The Artful Business of Data Mining: Computational Statistics with Open Source Tools Wednesday 20 March 13

Slide 2

Slide 2 text

David Coallier @davidcoallier

Slide 3

Slide 3 text

Data Scientist At Engine Yard (.com)

Slide 4

Slide 4 text

Find Data

Slide 5

Slide 5 text

Clean Data

Slide 6

Slide 6 text

Analyse Data?

Slide 7

Slide 7 text

Analyse Data

Slide 8

Slide 8 text

Question Data

Slide 9

Slide 9 text

Report Findings

Slide 10

Slide 10 text

Data Scientist

Slide 11

Slide 11 text

Data Janitor

Slide 12

Slide 12 text

Actual Tasks

Slide 13

Slide 13 text

“If your model is elegant, it’s probably wrong”

Slide 14

Slide 14 text

“The Times They Are a-Changin’” - Bob Dylan

Slide 15

Slide 15 text

Python & R

Slide 16

Slide 16 text

SciPy http://www.scipy.org

Slide 17

Slide 17 text

scipy.stats

Slide 18

Slide 18 text

scipy.stats Descriptive Statistics

Slide 19

Slide 19 text

from scipy.stats import describe
s = [1, 2, 1, 3, 4, 5]
# nobs, min/max, mean, variance, skewness and kurtosis in one call
print(describe(s))
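As a small sketch of what `describe` gives back, the result can be unpacked field by field. The variable name `result` is only illustrative; the field names are those of SciPy's result object, and the variance uses the default of one delta degree of freedom.

```python
from scipy.stats import describe

s = [1, 2, 1, 3, 4, 5]
result = describe(s)

# The result behaves like a named tuple, so each statistic has a name.
print(result.nobs)      # number of observations
print(result.minmax)    # (minimum, maximum)
print(result.mean)      # arithmetic mean
print(result.variance)  # sample variance (ddof=1 by default)
print(result.skewness, result.kurtosis)
```

For this particular sample the mean and the sample variance happen to coincide at 8/3.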

Slide 20

Slide 20 text

scipy.stats Probability Distributions

Slide 21

Slide 21 text

Example Poisson Distribution

Slide 22

Slide 22 text

$f(k; \lambda) = \dfrac{\lambda^k e^{-\lambda}}{k!}$ for integer $k \ge 0$

Slide 23

Slide 23 text

from scipy.stats import poisson
# Probability of each count k when the rate (lambda) is 2
p = poisson.pmf([1, 2, 3, 4, 1, 2, 3], 2)

Slide 24

Slide 24 text

print(p.mean())
print(p.sum())
...
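To make the Poisson formula concrete, here is a minimal sketch that evaluates the probability mass function by hand using only the standard library. The helper name `poisson_pmf` and the cut-off of 50 terms are assumptions made for illustration, not part of the talk.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of observing k events at rate lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 2
# Summing over k = 0..49 captures essentially all of the mass for lam = 2.
probs = [poisson_pmf(k, lam) for k in range(50)]
total = sum(probs)                              # should be very close to 1
mean = sum(k * p for k, p in enumerate(probs))  # should be very close to lam

print(total, mean)
```

Note that averaging an array of pmf values, as `p.mean()` does, gives the average probability over the chosen k values rather than the mean of the distribution. The distribution mean is the probability-weighted sum computed here.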

Slide 25

Slide 25 text

NumPy http://www.numpy.org/

Slide 26

Slide 26 text

NumPy Linear Algebra

Slide 27

Slide 27 text

$\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$

Slide 28

Slide 28 text

import numpy as np
x = np.array([[1, 0], [0, 1]])
# eig returns (eigenvalues, eigenvectors), in that order
val, vec = np.linalg.eig(x)
np.linalg.eigvals(x)  # eigenvalues only

Slide 29

Slide 29 text

>>> np.linalg.eig(x)
(array([ 1.,  1.]),
 array([[ 1.,  0.],
        [ 0.,  1.]]))
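The identity matrix makes for a slightly degenerate example, since every vector is an eigenvector. The sketch below uses a different symmetric matrix, chosen purely for illustration, and checks the defining relation A v = λ v for each eigenpair returned.

```python
import numpy as np

# A small symmetric matrix (an illustrative choice, not from the slides).
a = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eig returns the eigenvalues first, then the eigenvectors as columns.
vals, vecs = np.linalg.eig(a)

# Each column v of vecs should satisfy a @ v == lambda * v.
lhs = a @ vecs
rhs = vecs * vals  # broadcasting scales column j by vals[j]
print(np.allclose(lhs, rhs))
print(np.sort(vals))
```

The order of the eigenvalues is not guaranteed, which is why they are sorted before printing.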

Slide 30

Slide 30 text

Matplotlib Python Plotting

Slide 31

Slide 31 text

statsmodels Advanced Statistics Modeling

Slide 32

Slide 32 text

NLTK Natural Language Toolkit

Slide 33

Slide 33 text

scikit-learn Machine Learning

Slide 34

Slide 34 text

from sklearn import tree
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
clf.predict([[2., 2.]])  # array([1])
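To see why the classifier answers 1 for the point (2, 2), the learned rule can be printed. This is a sketch under a few assumptions: the feature names `x0` and `x1` are made up for display, `random_state=0` is only there for repeatability, and `export_text` requires a reasonably recent scikit-learn.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = DecisionTreeClassifier(random_state=0).fit(X, Y)

# With two training points the tree is a single threshold on one feature.
rules = export_text(clf, feature_names=["x0", "x1"])
print(rules)

above = clf.predict([[2.0, 2.0]])    # beyond the class-1 training point
below = clf.predict([[-1.0, -1.0]])  # beyond the class-0 training point
print(above, below)
```

Anything above the threshold falls on the class-1 side, so points far outside the training data still get a confident answer. That is worth remembering before trusting such a prediction.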

Slide 35

Slide 35 text

PyBrain ... Machine Learning

Slide 36

Slide 36 text

PyMC Bayesian Inference

Slide 37

Slide 37 text

Pattern Web Mining for Python

Slide 38

Slide 38 text

NetworkX Study Networks

Slide 39

Slide 39 text

MILK MOAR machine LEARNING!

Slide 40

Slide 40 text

Pandas easy-to-use data structures

Slide 41

Slide 41 text

from pandas import DataFrame
x = DataFrame([{"age": 26}, {"age": 19}, {"age": 21}, {"age": 18}])
# Boolean mask: keep only the rows where age is over 20
print(x[x['age'] > 20].count())
print(x[x['age'] > 20].mean())
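The filtering step can be spelled out a little more. This is a minimal sketch of the same idea, with the intermediate names `over_20`, `n_over_20` and `mean_over_20` introduced only for illustration.

```python
import pandas as pd

x = pd.DataFrame({"age": [26, 19, 21, 18]})

# x["age"] > 20 is a boolean Series; indexing with it keeps the True rows.
over_20 = x[x["age"] > 20]

n_over_20 = int(over_20["age"].count())
mean_over_20 = float(over_20["age"].mean())
print(n_over_20, mean_over_20)
```

Only the ages 26 and 21 pass the filter, so the count is 2 and the mean is 23.5.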

Slide 42

Slide 42 text

R

Slide 43

Slide 43 text

RStudio The IDE

Slide 44

Slide 44 text

lubridate and zoo Dealing with Dates...

Slide 45

Slide 45 text

yy/mm/dd, mm/dd/yy, YYYY-mm-dd HH:MM:ss TZ, yy-mm-dd, 1363784094.513425, yy/mm, different timezone
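The talk handles dates with lubridate and zoo in R. As a rough Python sketch of the same chore, the snippet below normalises several of the listed formats to one representation. The sample strings, the pairing of each string with its format, and the choice to treat naive timestamps as UTC are all assumptions made for illustration.

```python
from datetime import datetime, timezone

# Each raw value is paired with the format its (hypothetical) source uses.
RAW = [
    ("13/03/20", "%y/%m/%d"),
    ("03/20/13", "%m/%d/%y"),
    ("2013-03-20 12:54:54", "%Y-%m-%d %H:%M:%S"),
    ("13-03-20", "%y-%m-%d"),
]

def to_utc(text, fmt):
    """Parse text with an explicit format and tag it as UTC (an assumption)."""
    return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)

parsed = [to_utc(text, fmt) for text, fmt in RAW]

# The Unix timestamp from the slide, interpreted as seconds since the epoch.
epoch = datetime.fromtimestamp(1363784094.513425, tz=timezone.utc)

days = {d.date().isoformat() for d in parsed + [epoch]}
print(days)

# The same string means different days under different formats.
as_ymd = datetime.strptime("03/04/13", "%y/%m/%d").date()
as_mdy = datetime.strptime("03/04/13", "%m/%d/%y").date()
print(as_ymd, as_mdy)
```

Guessing the format from the string alone is unsafe, which is why each source is given an explicit format here.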

Slide 46

Slide 46 text

reshape2 Reshape your Data

Slide 47

Slide 47 text

ggplot2 Visualise your Data

Slide 48

Slide 48 text

RCurl, RJSONIO Find more Data

Slide 49

Slide 49 text

Hmisc Miscellaneous useful functions

Slide 50

Slide 50 text

forecast Can you guess?

Slide 51

Slide 51 text

garch and rugarch

Slide 52

Slide 52 text

quantmod Statistical Financial Trading

Slide 53

Slide 53 text

xts Extensible Time Series

Slide 54

Slide 54 text

igraph Study Networks

Slide 55

Slide 55 text

maptools Read & View Maps

Slide 56

Slide 56 text

library(maps)  # provides map()
map('state', region = c(row.names(USArrests)),
    col = cm.colors(16, 1)[floor(USArrests$Rape / max(USArrests$Rape) * 28)],
    fill = T)

Slide 57

Slide 57 text

Storage

Slide 58

Slide 58 text

Oppose “big” Data

Slide 59

Slide 59 text

“Learn how to sample”
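One way to read this advice: a modest random sample often answers the question without touching the full data set. The sketch below is an illustration under stated assumptions. The population is synthetic (normally distributed, mean 100, standard deviation 15), and the sizes and seed are arbitrary.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# A synthetic "big" population; real data would be read from storage.
population = [random.gauss(100, 15) for _ in range(100_000)]

# Simple random sample without replacement: 1% of the rows.
sample = random.sample(population, 1_000)

full_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
# Standard error of the sample mean: how far off the estimate is likely to be.
stderr = statistics.stdev(sample) / len(sample) ** 0.5

print(full_mean, sample_mean, stderr)
```

The standard error shrinks with the square root of the sample size, so quadrupling the sample only halves the uncertainty.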

Slide 60

Slide 60 text

Experiments

Slide 61

Slide 61 text

What Do You Want to Answer?

Slide 62

Slide 62 text

Understand Your Audience

Slide 63

Slide 63 text

Scientific Reporting

Slide 64

Slide 64 text

Busy-ness Time is money

Slide 65

Slide 65 text

Public Visualisation

Slide 66

Slide 66 text

Best Visualisation, Bad Data

Slide 67

Slide 67 text

Best Forecasting models... Bad Visualisation

Slide 68

Slide 68 text


Slide 69

Slide 69 text


Slide 70

Slide 70 text

Seanchaí (storyteller)

Slide 71

Slide 71 text


Slide 72

Slide 72 text

Feel it

Slide 73

Slide 73 text


Slide 74

Slide 74 text


Slide 75

Slide 75 text


Slide 76

Slide 76 text

“Don’t be scared of bar charts.”

Slide 77

Slide 77 text

Mathematical Statistics, Engineering, Business, Economics, Curiosity

Slide 78

Slide 78 text

davidcoallier.github.com @davidcoallier on Twitter