Slide 1
Slide 1 text
The Artful Business of Data Mining: Computational Statistics with Open Source Tools Wednesday 20 March 13
Slide 2
Slide 2 text
David Coallier @davidcoallier
Slide 3
Slide 3 text
Data Scientist At Engine Yard (.com)
Slide 4
Slide 4 text
Find Data
Slide 5
Slide 5 text
Clean Data
Slide 6
Slide 6 text
Analyse Data?
Slide 7
Slide 7 text
Analyse Data
Slide 8
Slide 8 text
Question Data
Slide 9
Slide 9 text
Report Findings
Slide 10
Slide 10 text
Data Scientist
Slide 11
Slide 11 text
Data Janitor
Slide 12
Slide 12 text
Actual Tasks
Slide 13
Slide 13 text
“If your model is elegant, it’s probably wrong”
Slide 14
Slide 14 text
“The Times they are a-Changing”, Bob Dylan
Slide 15
Slide 15 text
Python & R
Slide 16
Slide 16 text
SciPy http://www.scipy.org
Slide 17
Slide 17 text
scipy.stats
Slide 18
Slide 18 text
scipy.stats Descriptive Statistics
Slide 19
Slide 19 text
from scipy.stats import describe

s = [1, 2, 1, 3, 4, 5]
# Summary statistics: count, min/max, mean, variance, skewness, kurtosis
print(describe(s))
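As a hedged illustration of what `describe` hands back: in current SciPy releases the result is a named tuple, so individual statistics can be read by field name. The variable names below are illustrative and do not come from the talk.

```python
from scipy.stats import describe

s = [1, 2, 1, 3, 4, 5]
result = describe(s)

# Fields of the returned named tuple
n_obs = result.nobs              # number of observations
lowest, highest = result.minmax  # smallest and largest value
mean = result.mean
variance = result.variance       # unbiased estimate (ddof=1) by default

print(n_obs, lowest, highest, mean, variance)
```

Reading fields by name avoids depending on the order of the tuple.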
Slide 20
Slide 20 text
scipy.stats Probability Distributions
Slide 21
Slide 21 text
Example Poisson Distribution
Slide 22
Slide 22 text
$f(k; \lambda) = \dfrac{\lambda^k e^{-\lambda}}{k!}$ for $k \ge 0$
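The Poisson probability mass function, f(k; λ) = λ^k e^(-λ) / k!, can be checked numerically with nothing but the standard library. This is a minimal sketch, not code from the talk; the helper name `poisson_pmf` and the cut-off of 60 terms are assumptions made for illustration.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the mean rate is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 2.0
ks = range(0, 60)  # the tail beyond k = 59 is negligible for lam = 2
probs = [poisson_pmf(k, lam) for k in ks]

total = sum(probs)                            # should be very close to 1
mean = sum(k * p for k, p in zip(ks, probs))  # should be very close to lam

print(total, mean)
```

The probabilities summing to 1 and the mean coming out at λ are two quick sanity checks on the formula.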
Slide 23
Slide 23 text
from scipy.stats import poisson

# Poisson probabilities at each k, with rate (mu) 2
p = poisson.pmf([1, 2, 3, 4, 1, 2, 3], 2)
Slide 24
Slide 24 text
print(p.mean())
print(p.sum())
...
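One caveat worth sketching: `p` is an array of probabilities evaluated at the listed points, so `p.mean()` and `p.sum()` summarise that array rather than the distribution. The snippet below contrasts them with the distribution's own mean and variance; `ks`, `dist_mean` and `dist_var` are illustrative names.

```python
from scipy.stats import poisson

mu = 2
ks = [1, 2, 3, 4, 1, 2, 3]
p = poisson.pmf(ks, mu)

# Summaries of the evaluated probabilities (not of the distribution itself)
print(p.mean(), p.sum())

# Moments of the Poisson distribution with rate mu
dist_mean = poisson.mean(mu)
dist_var = poisson.var(mu)
print(dist_mean, dist_var)
```

Because several points are listed twice, `p.sum()` comes out at about 1.53 here, which shows that it is not a total probability.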
Slide 25
Slide 25 text
NumPy http://www.numpy.org/
Slide 26
Slide 26 text
NumPy Linear Algebra
Slide 27
Slide 27 text
$\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
Slide 28
Slide 28 text
import numpy as np

x = np.array([[1, 0],
              [0, 1]])
# eig returns the eigenvalues first, then the eigenvectors
val, vec = np.linalg.eig(x)
np.linalg.eigvals(x)  # eigenvalues only
Slide 29
Slide 29 text
>>> np.linalg.eig(x)
(array([ 1.,  1.]),
 array([[ 1.,  0.],
        [ 0.,  1.]]))
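The identity matrix makes for a rather uneventful decomposition, so here is a hedged sketch with a different matrix (chosen for illustration, not taken from the talk). It checks the defining relation A v = λ v for each eigenpair.

```python
import numpy as np

a = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eig returns eigenvalues first, then eigenvectors as columns
values, vectors = np.linalg.eig(a)

for i in range(len(values)):
    v = vectors[:, i]
    # Check the defining relation a @ v == lambda * v
    assert np.allclose(a @ v, values[i] * v)

print(sorted(values))
```

The eigenvectors are the columns of the second return value, which is why the loop slices with `vectors[:, i]`.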
Slide 30
Slide 30 text
Matplotlib Python Plotting
Slide 31
Slide 31 text
statsmodels Advanced Statistics Modeling
Slide 32
Slide 32 text
NLTK Natural Language Tool Kit
Slide 33
Slide 33 text
scikit-learn Machine Learning
Slide 34
Slide 34 text
from sklearn import tree

X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
clf.predict([[2., 2.]])  # array([1])
Slide 35
Slide 35 text
PyBrain ... Machine Learning
Slide 36
Slide 36 text
PyMC Bayesian Inference
Slide 37
Slide 37 text
Pattern Web Mining for Python
Slide 38
Slide 38 text
NetworkX Study Networks
Slide 39
Slide 39 text
MILK MOAR machine LEARNING!
Slide 40
Slide 40 text
Pandas easy-to-use data structures
Slide 41
Slide 41 text
from pandas import DataFrame

x = DataFrame([
    {"age": 26},
    {"age": 19},
    {"age": 21},
    {"age": 18},
])
# Keep only the rows where age is over 20, then summarise them
print(x[x['age'] > 20].count())
print(x[x['age'] > 20].mean())
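The filtering step can be made explicit by naming the boolean mask. This is a small sketch using the same four ages; `people`, `mask` and `over_20` are illustrative names rather than anything from the talk.

```python
import pandas as pd

people = pd.DataFrame({"age": [26, 19, 21, 18]})

mask = people["age"] > 20  # boolean Series: True, False, True, False
over_20 = people[mask]     # keeps only the rows where mask is True

count = int(over_20["age"].count())
mean_age = float(over_20["age"].mean())
print(count, mean_age)
```

Naming the mask once also avoids evaluating the same comparison twice.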
Slide 42
Slide 42 text
R
Slide 43
Slide 43 text
RStudio The IDE
Slide 44
Slide 44 text
lubridate and zoo Dealing with Dates...
Slide 45
Slide 45 text
yy/mm/dd
mm/dd/yy
YYYY-mm-dd HH:MM:ss TZ
yy-mm-dd
1363784094.513425
yy/mm
different timezone
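To give a feel for why mixed date formats are painful, here is a minimal Python sketch (standard library only) that tries a few layouts in turn and normalises everything to UTC. The format list, the function name `parse_timestamp`, and the rule that naive values are already in UTC are all assumptions for illustration. Note that yy/mm/dd and mm/dd/yy cannot always be told apart from the string alone, so a real pipeline needs to know its source.

```python
from datetime import datetime, timezone

# Candidate layouts, tried in order (an illustrative list, not exhaustive)
FORMATS = ["%Y-%m-%d %H:%M:%S", "%y/%m/%d", "%y-%m-%d"]

def parse_timestamp(raw):
    """Return a timezone-aware UTC datetime, or raise ValueError."""
    raw = raw.strip()
    try:
        # Plain numbers are treated as seconds since the Unix epoch
        return datetime.fromtimestamp(float(raw), tz=timezone.utc)
    except ValueError:
        pass
    for fmt in FORMATS:
        try:
            # Assumption: values without a zone are already in UTC
            return datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError("unrecognised date format: %r" % raw)

for text in ["13/03/20", "2013-03-20 12:54:54", "1363784094.513425"]:
    print(text, "->", parse_timestamp(text).isoformat())
```

All three inputs in the loop fall on 20 March 2013, the date on the slides.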
Slide 46
Slide 46 text
reshape2 Reshape your Data
Slide 47
Slide 47 text
ggplot2 Visualise your Data
Slide 48
Slide 48 text
RCurl, RJSONIO Find more Data
Slide 49
Slide 49 text
Hmisc Miscellaneous useful functions
Slide 50
Slide 50 text
forecast Can you guess?
Slide 51
Slide 51 text
garch and rugarch
Slide 52
Slide 52 text
quantmod Statistical Financial Trading
Slide 53
Slide 53 text
xts Extensible Time Series
Slide 54
Slide 54 text
igraph Study Networks
Slide 55
Slide 55 text
maptools Read & View Maps
Slide 56
Slide 56 text
library(maps)  # map() is provided by the maps package

map('state',
    region = c(row.names(USArrests)),
    col = cm.colors(16, 1)[floor(USArrests$Rape / max(USArrests$Rape) * 28)],
    fill = TRUE)
Slide 57
Slide 57 text
Storage
Slide 58
Slide 58 text
Oppose “big” Data
Slide 59
Slide 59 text
“Learn how to sample”
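As a rough sketch of the idea, a simple random sample can estimate a summary statistic without touching every record. The population below is made up for illustration, and the sample size and seed are arbitrary assumptions.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical "large" dataset: the integers 0 .. 999,999
population = range(1_000_000)
population_mean = (len(population) - 1) / 2  # exact mean of 0 .. n-1

# Simple random sample without replacement, 1% of the data
sample = rng.sample(population, 10_000)
sample_mean = statistics.mean(sample)

print(population_mean, sample_mean)
```

With 10,000 draws the estimate typically lands within a fraction of a percent of the range from the true mean, at one hundredth of the data volume.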
Slide 60
Slide 60 text
Experiments
Slide 61
Slide 61 text
What Do You Want to Answer?
Slide 62
Slide 62 text
Understand Your Audience
Slide 63
Slide 63 text
Scientific Reporting
Slide 64
Slide 64 text
Busy-ness Time is money
Slide 65
Slide 65 text
Public Visualisation
Slide 66
Slide 66 text
Best Visualisation, Bad Data
Slide 67
Slide 67 text
Best Forecasting models... Bad Visualisation
Slide 70
Slide 70 text
Seanchaí
Slide 72
Slide 72 text
Feel it
Slide 76
Slide 76 text
“Don’t be scared of bar charts.”
Slide 77
Slide 77 text
Mathematical Statistics Engineering Business Economics Curiosity
Slide 78
Slide 78 text
davidcoallier.github.com @davidcoallier on Twitter