
Data Mining 101

Ali Akbar S.
February 21, 2015

Introduction to data mining based on CRISP DM





  1. Data Mining 101 Okiriza Wibisono - @okiriza Ali Akbar Septiandri

    - @aliakbars
  2. Outline Introduction • Terminology • Potential applications • Venn diagram Process overview • Business

    understanding • Data understanding (exploration) • Data preparation (preprocessing) • Modeling • Evaluation • Deployment (presentation) Tools & Resources
  3. Introduction – Terminology Data mining Knowledge Discovery in Databases Big

    data analytics Statistics Data science
  4. “ ” The process of collecting, searching through, and analyzing

    a large amount of data in a database, as to discover patterns or relationships. Data Mining - dictionary.reference.com
  5. Introduction – Potential Application Customer segmentation Recommendation engine Social media

  6. “ ” What should we do? Where to start? Do

    I have to get a master's degree in statistics?
  7. http://tomfishburne.com.s3.amazonaws.com/site/wp-content/uploads/2014/01/140113.bigdata.jpg

  8. Data Science Venn Diagram http://drewconway.com/zia/2013/3/26/the- data-science-venn-diagram

  9. And now the business process…

  10. CRISP DM Methodology http://lyle.smu.edu/~mhd/8331f03/crisp.pdf

  11. Business Understanding CRISP DM Methodology

  12. Objective Statement Bottom-up Top-down

  13. Objective Statement Data vs Problem

  14. Situation Assessment Inventory of Resources Requirements, Assumptions, and Constraints Risks

    and Contingencies Terminology Costs and Benefits http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
  15. Situation Assessment – Inventory of Resources Resource Data, Knowledge, Tools

    Hardware Personnel http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
  16. Situation Assessment – Requirements, Assumptions, and Constraints Requirements Scheduling Accuracy

    Security Assumptions Data quality External factors Reporting type Constraints Legal issues Budget Resources http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
  17. Situation Assessment – Risks and Contingencies Contingency Plan Financial Organizational

    Business http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
  18. Situation Assessment – Terminology Write down related terminology http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf http://www.partnersmn.com/wp-content/uploads/2010/08/5b8567b2b4e2d1cfd1a31b2b8a0ecebc1.jpg

  19. Situation Assessment – Costs and Benefits Money, money, money! http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf

  20. “ ” How to evaluate the results? Define your success

  21. Data Understanding CRISP DM Methodology

  22. Data Collection External vs Internal

  23. Watch out!

  24. “ ” visible ≠ accessible ≠ storable ≠ presentable Victor

    Lavrenko – Text Technologies http://www.inf.ed.ac.uk/teaching/courses/tts/pdf/crawl-2x2.pdf
  25. Data Exploration – Visualization Heuristics • Visualize fast. Visualize reactively.

    • Go for high-information 2D visualizations. • Select data subsets to visualize. http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf
  26. Data Exploration – Visualization Heuristics • Never let anomalies pass

    you by. Dig deeper. • Use your visualizations to inform potential models. Use your potential models to direct your visualizations. • Expect problems in your data. http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf
  27. “ ” This is the cheapest and most informative stage

    of data mining. Nigel Goddard – DME Visualization http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf
  28. Data Exploration – Visualization Tools • Column/bar: large changes •

    Line, curve: small changes over long periods • Histogram: frequency distribution https://nces.ed.gov/nceskids/help/user_guide/graph/whentouse.asp
  29. Data Preparation CRISP DM Methodology

  30. “ ” Which one should I include (or exclude)? Data

  31. Data Cleaning Dirty data: missing values • incomplete • outdated • duplicated • outliers

    Remember: Expect problems in your data.
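A minimal cleaning sketch with pandas (which the deck lists among its tools); the table and its column names are invented for illustration, showing one fix for each kind of dirty data above:

```python
import numpy as np
import pandas as pd

# Hypothetical customer table exhibiting the problems above:
# a missing value, an exact duplicate, and an outlier.
df = pd.DataFrame({
    "age": [25, 30, np.nan, 30, 999],   # NaN = missing, 999 = outlier
    "city": ["Jakarta", "Bandung", "Jakarta", "Bandung", "Depok"],
})

df = df.drop_duplicates()                               # drop exact duplicates
df = df[df["age"].isna() | df["age"].between(0, 120)]   # drop impossible ages
df["age"] = df["age"].fillna(df["age"].median())        # impute missing values
```

Whether to drop, cap, or impute each case depends on the business understanding from the earlier phase, not on the code.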
  32. Data Construction • Feature engineering – derived attributes, e.g.: year

    from timestamp, quarter from timestamp, BMI from weight and height, log(x) for skewed data (e.g. house prices)
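The derived attributes listed above can be sketched in pandas; the rows and column names are assumptions, not from the deck:

```python
import numpy as np
import pandas as pd

# Illustrative records combining the deck's examples in one table.
records = pd.DataFrame({
    "sold_at": pd.to_datetime(["2015-02-21", "2014-07-01"]),
    "weight_kg": [70.0, 80.0],
    "height_m": [1.75, 1.60],
    "price": [100_000.0, 1_000_000.0],
})

records["year"] = records["sold_at"].dt.year        # year from timestamp
records["quarter"] = records["sold_at"].dt.quarter  # quarter from timestamp
records["bmi"] = records["weight_kg"] / records["height_m"] ** 2
records["log_price"] = np.log(records["price"])     # tame the right skew
```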
  33. Data Splitting Two kinds of data splitting: Training-Validation-Testing and Cross Validation

  34. Data Splitting – Training-Validation-Testing Training: construct the classifier • Validation: pick the

    algorithm and knob settings (tree depth, k in kNN, C in SVM) • Testing: estimate the future error rate. Split randomly to avoid bias http://www.inf.ed.ac.uk/teaching/courses/iaml/slides/eval-2x2.pdf
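A three-way split is not built into scikit-learn directly, but a common sketch chains two random splits; the 60/20/20 proportions here are an illustrative choice, not from the deck:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# 60/20/20 split via two successive random (shuffled) splits,
# matching the "split randomly to avoid bias" advice.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0)
```

The test set is touched exactly once, at the very end, to estimate the future error rate.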
  35. Data Splitting – Cross Validation Every point is both training

    and testing, never at the same time
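The cross-validation property stated above can be verified mechanically with scikit-learn's `KFold` (a sketch on toy data):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)
test_counts = np.zeros(10, dtype=int)

# 5-fold CV: each point lands in exactly one test fold
# and in the training set of the other four folds.
for train_idx, test_idx in KFold(n_splits=5).split(X):
    assert set(train_idx).isdisjoint(test_idx)  # never both at the same time
    test_counts[test_idx] += 1
```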
  36. Dimensionality Reduction Principal Component Analysis vs Linear Discriminant Analysis
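The contrast between the two methods shows up in their scikit-learn APIs: PCA is unsupervised and ignores labels, while LDA is supervised and needs them. A sketch on synthetic data (shapes and labels are invented):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)   # binary labels for the supervised method

X_pca = PCA(n_components=2).fit_transform(X)     # unsupervised: max variance
# LDA is supervised and yields at most n_classes - 1 components.
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)
```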

  37. Modeling CRISP DM Methodology

  38. Machine Learning Classification Regression Ranking Clustering

  39. Model Selection Regression techniques and generalization bounds: linear regression • kernel ridge

    regression • support vector regression • Lasso
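All four techniques share scikit-learn's fit/score interface, so trying them side by side is cheap; this sketch uses a made-up nonlinear target to show why "it depends" (the data and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.1, size=200)  # nonlinear target

models = {
    "linear": LinearRegression(),
    "kernel_ridge": KernelRidge(kernel="rbf"),
    "svr": SVR(kernel="rbf"),
    "lasso": Lasso(alpha=0.1),
}
scores = {name: m.fit(X, y).score(X, y) for name, m in models.items()}
# The linear models cannot capture the quadratic shape; the kernel methods can.
```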
  40. “ ” Which one should I choose? Should I use

    all of them?
  41. It depends on…

  42. Model Selection Assumptions: The predictors are linearly independent. The error

    is a random variable with a mean of zero conditional on the explanatory variables. The sample is representative of the population for the inference/prediction. Interpretability: the understandability of why the model is true or how the model is induced from the data. https://chenhaot.com/pubs/mldg-interpretability.pdf
  43. Beware of Overfitting! http://pingax.com/wp-content/uploads/2014/05/underfitting-overfitting.png
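The trap in the picture can be reproduced in a few lines: a high-degree polynomial always fits the training points at least as well as a line, which is precisely why training fit alone is misleading. A sketch on invented sine-plus-noise data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, size=(15, 1)), axis=0)
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(scale=0.2, size=15)

def train_r2(degree):
    """R^2 measured on the training points themselves."""
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    return model.fit(X, y).score(X, y)

# degree 9 beats degree 1 on the training data, but that says
# nothing about generalization: evaluate on held-out data instead.
```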

  44. Model Assessment Regression: (R)MSE • Mean Absolute Error •

    Correlation Coefficient; Classification: Accuracy • Precision • Recall • F-score; Descriptive: Std. Error • p-value • Confidence Interval
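Most of the metrics above are one-liners in `sklearn.metrics`; the true/predicted values here are hypothetical, just to show where each metric applies:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, mean_absolute_error,
                             mean_squared_error, precision_score, recall_score)

# Regression: hypothetical continuous predictions.
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.0, 3.0])
mse = mean_squared_error(y_true, y_pred)    # penalizes large errors more
rmse = np.sqrt(mse)                         # back in the target's units
mae = mean_absolute_error(y_true, y_pred)

# Classification: hypothetical binary labels.
c_true = [1, 0, 1, 1, 0]
c_pred = [1, 0, 0, 1, 1]
acc = accuracy_score(c_true, c_pred)        # fraction correct
prec = precision_score(c_true, c_pred)      # of predicted 1s, how many are real
rec = recall_score(c_true, c_pred)          # of real 1s, how many were found
f1 = f1_score(c_true, c_pred)               # harmonic mean of precision/recall
```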
  45. Evaluation CRISP DM Methodology

  46. “ ” Does my model solve the problem? What is

    the impact? Is it novel? How useful is the solution?
  47. Deployment CRISP DM Methodology

  48. The Tasks Plan deployment Plan monitoring and maintenance Produce final

    report Review project
  49. Tools & Resources • Text mining: NLTK, spaCy, OpenNLP •

    Query expansion & clustering: Carrot2, Weka • Data mining & machine learning: Weka, scikit-learn • Languages: R, Python, Julia, Java, Matlab, Mathematica, Haskell, Scala • Python libraries: Pandas, SciPy, NumPy, scikit-learn • Infrastructure: AWS, Hadoop, Google Cloud, Azure, Apache Spark • Visualization: D3.js • Community: Big Data & Open Data Indonesia
  50. “ ” Thank you! Data Mining 101 – Python-ID Meetup

    February 2015 Okiriza Wibisono - @okiriza Ali Akbar Septiandri - @aliakbars