Data Mining 101

Ali Akbar S.
February 21, 2015

Introduction to data mining based on CRISP DM

Transcript

  1. Outline
    Introduction: • Terminology • Potential application • Venn diagram
    Process overview: • Business understanding • Data understanding (exploration) • Data preparation (preprocessing) • Modeling • Evaluation • Deployment (presentation)
    Tools & Resources
  2. “The process of collecting, searching through, and analyzing a large amount of data in a database, as to discover patterns or relationships.”
    Data Mining - dictionary.reference.com
  3. “What should we do? Where to start? Do I have to get a master's degree in statistics?”
  4. Situation Assessment
    • Inventory of Resources • Requirements, Assumptions, and Constraints • Risks and Contingencies • Terminology • Costs and Benefits
    http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
  5. Situation Assessment – Inventory of Resources
    Resource: • Data, Knowledge, Tools • Hardware • Personnel
    http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
  6. Situation Assessment – Requirements, Assumptions, and Constraints
    Requirements: • Scheduling • Accuracy • Security
    Assumptions: • Data quality • External factors • Reporting type
    Constraints: • Legal issues • Budget • Resources
    http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
  7. Situation Assessment – Risks and Contingencies
    Contingency Plan: • Financial • Organizational • Business
    http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
  8. “visible ≠ accessible ≠ storable ≠ presentable”
    Victor Lavrenko – Text Technologies http://www.inf.ed.ac.uk/teaching/courses/tts/pdf/crawl-2x2.pdf
  9. Data Exploration – Visualization Heuristics
    • Visualize fast. Visualize reactively. • Go for high information 2D visualizations. • Select data subsets to visualize.
    http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf
  10. Data Exploration – Visualization Heuristics
    • Never let anomalies pass you by. Dig deeper. • Use your visualizations to inform potential models. Use your potential model to direct your visualizations. • Expect problems in your data.
    http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf
  11. “This is the cheapest and most informative stage of data mining.”
    Nigel Goddard – DME Visualization http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf
  12. Data Exploration – Visualization Tools
    • Column/bar: large changes • Line, curve: small changes over long periods • Histogram: frequency distribution
    https://nces.ed.gov/nceskids/help/user_guide/graph/whentouse.asp
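A histogram's frequency distribution can be sketched without any plotting library. This is a minimal illustration using only the standard library; the `histogram` helper, the `ages` sample, and the bin width of 10 are assumptions made up for the example and do not come from the slides.

```python
def histogram(values, bin_width):
    """Count how many values fall into each bin of the given width."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width  # lower edge of the bin
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical sample data.
ages = [21, 25, 34, 37, 38, 42, 45, 58]
hist = histogram(ages, 10)  # {20: 2, 30: 3, 40: 2, 50: 1}

# Print a quick text rendering, one row per bin.
for start, count in hist.items():
    print(f"{start:>3}-{start + 9:<3} {'#' * count}")
```

A text rendering like this is often enough for a first look before moving to a plotting tool.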
  13. Data Construction
    • Feature engineering – derived attributes, e.g.: year from timestamp; quarter from timestamp; BMI from weight and height; log(x) for skewed data (e.g. house price)
  14. Data Splitting – Training-Validation-Testing
    Training: • Construct classifier
    Validation: • Pick algorithm • Knob settings (tree depth, k in kNN, C in SVM)
    Testing: • Estimate future error rate
    Split randomly to avoid bias
    http://www.inf.ed.ac.uk/teaching/courses/iaml/slides/eval-2x2.pdf
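A random three-way split can be sketched as follows, using only the standard library. The function name, the 60/20/20 proportions, and the fixed seed are assumptions for illustration; the slides do not prescribe any particular ratio.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle rows, then cut them into training, validation, and test sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)      # random order avoids bias from the original ordering
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical dataset of 100 row indices.
train, val, test = train_val_test_split(range(100))
```

The training set is used to fit the model, the validation set to compare algorithms and knob settings, and the test set is touched only once at the end to estimate the future error rate.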
  15. Model Selection
    Assumptions: • The predictors are linearly independent • The error is a random variable with a mean of zero conditional on the explanatory variables • The sample is representative of the population for the inference prediction
    Interpretability: The understandability of why the model is true or how the model is induced from https://chenhaot.com/pubs/mldg-interpretability.pdf
  16. Model Assessment
    Regression: • (R)MSE • Mean Absolute Error • Correlation Coefficient
    Classification: • Accuracy • Precision • Recall • F-score
    Descriptive: • Std. Error • p-value • Confidence Interval
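Several of the regression and classification measures named on this slide can be computed by hand. The sketch below uses only the standard library; the function names and the toy labels are made up for the example, and the F-score shown is the balanced F1 variant, which is one common choice rather than something the slides specify.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical regression targets and predictions.
reg_rmse = rmse([3.0, 5.0, 2.0, 7.0], [2.0, 5.0, 4.0, 7.0])
reg_mae = mae([3.0, 5.0, 2.0, 7.0], [2.0, 5.0, 4.0, 7.0])  # 0.75

# Hypothetical binary labels and predictions.
labels = [1, 0, 1, 1, 0, 1]
preds = [1, 0, 0, 1, 1, 1]
acc = accuracy(labels, preds)
prec, rec, f1 = precision_recall_f1(labels, preds)  # (0.75, 0.75, 0.75)
```

In practice, libraries such as scikit-learn provide tested versions of these measures; writing them out mainly helps to see what each one counts.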
  17. “Does my model solve the problem? What is the impact? Is it novel? How useful is the solution?”
  18. Tools & Resources
    • Text mining: NLTK, spaCy, OpenNLP
    • Query expansion & clustering: Carrot2, Weka
    • Data mining & machine learning: Weka, scikit-learn
    • Language: R, Python, Julia, Java, Matlab, Mathematica, Haskell, Scala
    • Python lib: Pandas, SciPy, NumPy, scikit-learn
    • Infrastructure: AWS, Hadoop, Google Cloud, Azure, Apache Spark
    • Visualization: D3.js
    • Community: Big Data & Open Data Indonesia
  19. Thank you!
    Data Mining 101 – Python-ID Meetup, February 2015
    Okiriza Wibisono - @okiriza, Ali Akbar Septiandri - @aliakbars