The Artful Business of Data Mining: Computational Statistics with Open Source Tools

This talk covers the concepts of data mining and data analysis with open source tools, mainly Python and R, along with the libraries and tools I have used and currently use at Engine Yard.

David Coallier

March 20, 2013

Transcript

  1. The Artful Business of Data Mining: Computational Statistics with Open Source Tools Wednesday 20 March 13
  2. David Coallier @davidcoallier

  3. Data Scientist At Engine Yard (.com)

  4. Find Data

  5. Clean Data

  6. Analyse Data?

  7. Analyse Data

  8. Question Data

  9. Report Findings

  10. Data Scientist

  11. Data Janitor

  12. Actual Tasks

  13. “If your model is elegant, it’s probably wrong”

  14. “The Times they are a-Changing” — Bob Dylan

  15. Python & R

  16. SciPy http://www.scipy.org

  17. scipy.stats

  18. scipy.stats Descriptive Statistics

  19. from scipy.stats import describe
      s = [1, 2, 1, 3, 4, 5]
      print(describe(s))
  20. scipy.stats Probability Distributions

  21. Example Poisson Distribution

  22. $f(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$ for $k \geq 0$
  23. from scipy.stats import poisson
      p = poisson.pmf([1, 2, 3, 4, 1, 2, 3], 2)

  24. print(p.mean())
      print(p.sum())
      ...

  25. NumPy http://www.numpy.org/

  26. NumPy Linear Algebra

  27. $\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
  28. import numpy as np
      x = np.array([[1, 0],
                    [0, 1]])
      val, vec = np.linalg.eig(x)  # eig returns eigenvalues first, then eigenvectors
      np.linalg.eigvals(x)
  29. >>> np.linalg.eig(x)
      (array([ 1.,  1.]),
       array([[ 1.,  0.],
              [ 0.,  1.]]))
  30. Matplotlib Python Plotting

  31. statsmodels Advanced Statistics Modeling

  32. NLTK Natural Language Tool Kit

  33. scikit-learn Machine Learning

  34. from sklearn import tree
      X = [[0, 0], [1, 1]]
      Y = [0, 1]
      clf = tree.DecisionTreeClassifier()
      clf = clf.fit(X, Y)
      clf.predict([[2., 2.]])
      >>> array([1])
  35. PyBrain ... Machine Learning

  36. PyMC Bayesian Inference

  37. Pattern Web Mining for Python

  38. NetworkX Study Networks

  39. MILK MOAR machine LEARNING!

  40. Pandas easy-to-use data structures

  41. from pandas import DataFrame
      x = DataFrame([
          {"age": 26}, {"age": 19}, {"age": 21}, {"age": 18}
      ])
      print(x[x['age'] > 20].count())
      print(x[x['age'] > 20].mean())
  42. R

  43. RStudio The IDE

  44. lubridate and zoo Dealing with Dates...

  45. yy/mm/dd, mm/dd/yy, YYYY-mm-dd HH:MM:ss TZ, yy-mm-dd, 1363784094.513425, yy/mm, different timezone
  46. reshape2 Reshape your Data

  47. ggplot2 Visualise your Data

  48. RCurl, RJSONIO Find more Data

  49. Hmisc Miscellaneous useful functions

  50. forecast Can you guess?

  51. garch And rugarch

  52. quantmod Statistical Financial Trading

  53. xts Extensible Time Series

  54. igraph Study Networks

  55. maptools Read & View Maps

  56. library(maps)  # map() and the 'state' database come from the maps package
      map('state', region = c(row.names(USArrests)),
          col = cm.colors(16, 1)[floor(USArrests$Rape / max(USArrests$Rape) * 28)],
          fill = TRUE)
  57. Storage

  58. Oppose “big” Data

  59. “Learn how to sample”

  60. Experiments

  61. What Do You Want to Answer?

  62. Understand Your Audience

  63. Scientific Reporting

  64. Busy-ness Time is money

  65. Public Visualisation

  66. Best Visualisation, Bad Data

  67. Best Forecasting models... Bad Visualisation

  68. Wednesday 20 March 13

  69. Wednesday 20 March 13

  70. Seanchaí (Irish: a traditional storyteller)

  71. Wednesday 20 March 13

  72. Feel it

  73. Wednesday 20 March 13

  74. Wednesday 20 March 13

  75. Wednesday 20 March 13

  76. “Don’t be scared of bar charts.”

  77. Mathematical Statistics Engineering Business Economics Curiosity

  78. davidcoallier.github.com @davidcoallier on Twitter
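
The Poisson slides (21 to 24) can be checked by hand. The following is a minimal sketch, not the speaker's code: it implements the standard Poisson probability mass function with the standard library only, using the same `k` values and rate as the `poisson.pmf` call in the transcript. The helper name `poisson_pmf` and the cut-off of 50 terms are assumptions made for illustration.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events for a Poisson distribution with rate lam."""
    if k < 0:
        return 0.0
    return lam ** k * math.exp(-lam) / math.factorial(k)


lam = 2.0                    # same rate as the slide's poisson.pmf(..., 2)
ks = [1, 2, 3, 4, 1, 2, 3]   # same k values as the slide
p = [poisson_pmf(k, lam) for k in ks]

# The PMF should sum to 1 over all k >= 0. Truncating at k = 50 is an
# arbitrary cut-off (an assumption) that is ample for lam = 2.
total = sum(poisson_pmf(k, lam) for k in range(51))

print(p)
print(total)
```

With SciPy installed, `scipy.stats.poisson.pmf(ks, lam)` should agree with `p` up to floating-point error.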
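
For the NumPy linear algebra slides (26 to 29), a short sketch of how the result of `np.linalg.eig` can be verified. It assumes only NumPy itself; the variable names `values`, `vectors` and `only_values` are illustrative and do not come from the talk.

```python
import numpy as np

x = np.array([[1.0, 0.0],
              [0.0, 1.0]])

# np.linalg.eig returns the eigenvalues first, then the eigenvectors,
# stored as the columns of the second array.
values, vectors = np.linalg.eig(x)

# Each eigenpair should satisfy x @ v == lambda * v.
for i in range(len(values)):
    v = vectors[:, i]
    assert np.allclose(x @ v, values[i] * v)

# eigvals skips the eigenvectors and returns only the eigenvalues.
only_values = np.linalg.eigvals(x)

print(values)
print(vectors)
```

Unpacking in the order `values, vectors` matters: the two results are easy to swap by accident, and for the identity matrix the mistake is hard to spot because both eigenvalues equal 1.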
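
Slide 45 lists the mix of date formats that real data tends to contain. The talk handles these in R with lubridate and zoo; as a rough Python equivalent, the sketch below tries a list of candidate formats in order and normalises everything to UTC. The function name `parse_timestamp`, the format order, the use of numeric offsets for "TZ", and the choice to treat values without a timezone as UTC are all assumptions made for illustration.

```python
from datetime import datetime, timezone

# Candidate formats, tried in order. The order is an assumption: a value
# like "13/03/20" is ambiguous between yy/mm/dd and mm/dd/yy, so the
# first format that parses wins.
FORMATS = [
    "%Y-%m-%d %H:%M:%S %z",
    "%Y-%m-%d %H:%M:%S",
    "%y/%m/%d",
    "%m/%d/%y",
    "%y-%m-%d",
    "%y/%m",
]


def parse_timestamp(raw: str) -> datetime:
    """Parse a raw date string into a timezone-aware UTC datetime."""
    raw = raw.strip()
    # Epoch seconds such as "1363784094.513425".
    try:
        return datetime.fromtimestamp(float(raw), tz=timezone.utc)
    except ValueError:
        pass
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(raw, fmt)
        except ValueError:
            continue
        # Values without a timezone are assumed to be UTC already.
        if parsed.tzinfo is None:
            return parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)
    raise ValueError(f"unrecognised date format: {raw!r}")


print(parse_timestamp("1363784094.513425"))
print(parse_timestamp("2013-03-20 12:54:54 +0100"))
print(parse_timestamp("03/20/13"))
```

Trying formats in a fixed order keeps the behaviour predictable, but it cannot resolve genuinely ambiguous inputs; those need knowledge of where the data came from.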