Data Mining 101

Ali Akbar S.
February 21, 2015

Introduction to data mining based on CRISP DM

Transcript

  1. Data Mining 101
    Okiriza Wibisono - @okiriza
    Ali Akbar Septiandri - @aliakbars

  2. Outline
    Introduction
    • Terminology
    • Potential application
    • Venn diagram
    Process overview
    • Business understanding
    • Data understanding (exploration)
    • Data preparation (preprocessing)
    • Modeling
    • Evaluation
    • Deployment (presentation)
    Tools & Resources

  3. Introduction – Terminology
    • Data mining
    • Knowledge Discovery in Databases
    • Big data analytics
    • Statistics
    • Data science

  4. The process of collecting, searching through, and analyzing a large amount of data in a database, as to discover patterns or relationships.
    Data Mining - dictionary.reference.com

  5. Introduction – Potential Application
    • Customer segmentation
    • Recommendation engine
    • Social media mining

  6. What should we do?
    Where to start? Do I have to get a master's degree in statistics?

  7. http://tomfishburne.com.s3.amazonaws.com/site/wp-content/uploads/2014/01/140113.bigdata.jpg

  8. Data Science Venn Diagram
    http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

  9. And now the business process…

  10. CRISP DM Methodology
    http://lyle.smu.edu/~mhd/8331f03/crisp.pdf

  11. Business Understanding
    CRISP DM Methodology

  12. Objective Statement
    Bottom-up
    Top-down

  13. Objective Statement
    Data vs Problem

  14. Situation Assessment
    Inventory of Resources
    Requirements, Assumptions, and Constraints
    Risks and Contingencies
    Terminology
    Costs and Benefits
    http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf

  15. Situation Assessment – Inventory of Resources
    Resource
    • Data, Knowledge, Tools
    • Hardware
    • Personnel
    http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf

  16. Situation Assessment – Requirements, Assumptions, and Constraints
    Requirements
    • Scheduling
    • Accuracy
    • Security
    Assumptions
    • Data quality
    • External factors
    • Reporting type
    Constraints
    • Legal issues
    • Budget
    • Resources
    http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf

  17. Situation Assessment – Risks and Contingencies
    Contingency Plan
    Financial
    Organizational
    Business
    http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf

  18. Situation Assessment – Terminology
    Write down related terminology
    http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
    http://www.partnersmn.com/wp-content/uploads/2010/08/5b8567b2b4e2d1cfd1a31b2b8a0ecebc1.jpg

  19. Situation Assessment – Costs and Benefits
    Money, money, money!
    http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
    http://www.centuryproductsllc.com/wp-content/uploads/holding-money.jpg

  20. How to evaluate the results?
    Define your success criteria!

  21. Data Understanding
    CRISP DM Methodology

  22. Data Collection
    External vs Internal

  23. Watch out!

  24. visible ≠ accessible ≠ storable ≠ presentable
    Victor Lavrenko – Text Technologies
    http://www.inf.ed.ac.uk/teaching/courses/tts/pdf/crawl-2x2.pdf

  25. Data Exploration – Visualization Heuristics
    • Visualize fast. Visualize reactively.
    • Go for high information 2D visualizations.
    • Select data subsets to visualize.
    http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf

  26. Data Exploration – Visualization Heuristics
    • Never let anomalies pass you by. Dig deeper.
    • Use your visualizations to inform potential models. Use your potential model to direct your visualizations.
    • Expect problems in your data.
    http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf

  27. This is the cheapest and most informative stage of data mining.
    Nigel Goddard – DME Visualization
    http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf

  28. Data Exploration – Visualization Tools
    • Column/bar: Large change
    • Line, curve: Small change, long periods
    • Histogram: Frequency distribution
    https://nces.ed.gov/nceskids/help/user_guide/graph/whentouse.asp
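
A minimal sketch of the frequency distribution that a histogram displays, using only the Python standard library. The `ages` sample, the `histogram` and `render` helpers, and the bin width of 10 are all made up for illustration; a plotting library would normally draw the bars.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count how many integer values fall into each bin of width `bin_width`."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def render(hist, bin_width):
    """Return one text bar per bin, e.g. '20-29 | ####'."""
    return [f"{lo}-{lo + bin_width - 1} | {'#' * n}" for lo, n in hist.items()]


# Hypothetical sample data, not taken from the slides.
ages = [21, 22, 25, 29, 31, 33, 34, 35, 38, 41, 47, 52]
age_hist = histogram(ages, bin_width=10)
for line in render(age_hist, 10):
    print(line)
```

Changing `bin_width` is a cheap way to "visualize reactively": the same data can look quite different at a coarser or finer resolution.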

  29. Data Preparation
    CRISP DM Methodology

  30. Which one should I include (or exclude)?
    Data Selection

  31. Data Cleaning
    Dirty Data
    • Missing value
    • Incomplete
    • Outdated
    • Duplication
    • Outlier
    Remember: Expect problems in your data.
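
A rough sketch of three of the cleaning steps listed above (duplication, missing value, outlier), using only the standard library. The slides do not prescribe a method: the row layout, the helper names, and the outlier rule (a modified z-score based on the median absolute deviation, with the commonly used 3.5 cutoff) are assumptions chosen for illustration.

```python
from statistics import median


def drop_duplicates(rows):
    """Keep the first occurrence of each row (rows are flat dicts)."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def drop_missing(rows, field):
    """Remove rows where `field` is absent or None."""
    return [r for r in rows if r.get(field) is not None]


def drop_outliers(rows, field, threshold=3.5):
    """Remove rows whose modified z-score on `field` exceeds `threshold`."""
    values = [r[field] for r in rows]
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:  # no spread to measure against; keep everything
        return rows
    return [r for r in rows if 0.6745 * abs(r[field] - med) / mad <= threshold]


# Hypothetical records, not taken from the slides.
raw = [
    {"id": 1, "price": 100},
    {"id": 2, "price": 110},
    {"id": 2, "price": 110},   # duplicate
    {"id": 3, "price": None},  # missing value
    {"id": 4, "price": 95},
    {"id": 5, "price": 105},
    {"id": 6, "price": 5000},  # outlier
]
cleaned = drop_outliers(drop_missing(drop_duplicates(raw), "price"), "price")
print([r["id"] for r in cleaned])
```

Whether an outlier should be dropped, capped, or kept is a judgment call that depends on the business problem; the sketch only shows one way to flag it.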

  32. Data Construction
    Feature engineering – derived attributes, e.g.:
    • year from timestamp
    • quarter from timestamp
    • BMI from weight and height
    • Log(x) for skewed data (e.g. house price)
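
The derived attributes above could be computed roughly as follows. This is a sketch under assumptions: the record layout and field names (`timestamp`, `weight_kg`, `height_cm`, `house_price`) are hypothetical, and the natural logarithm is used for the log transform.

```python
import math
from datetime import datetime


def derive_features(record):
    """Build the derived attributes from one raw record (a dict)."""
    ts = datetime.fromisoformat(record["timestamp"])
    height_m = record["height_cm"] / 100
    return {
        "year": ts.year,
        "quarter": (ts.month - 1) // 3 + 1,          # months 1-3 -> Q1, 4-6 -> Q2, ...
        "bmi": record["weight_kg"] / height_m ** 2,  # kg / m^2
        "log_price": math.log(record["house_price"]),
    }


# Hypothetical record, not taken from the slides.
sample = {
    "timestamp": "2015-02-21T19:00:00",
    "weight_kg": 70.0,
    "height_cm": 175.0,
    "house_price": 250000.0,
}
features = derive_features(sample)
print(features)
```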

  33. Data Splitting
    Two kinds of data splitting:
    Training-Validation-Testing
    Cross Validation

  34. Data Splitting – Training-Validation-Testing
    Training
    • Construct classifier
    Validation
    • Pick algorithm
    • Knob settings (tree depth, k in kNN, c in SVM)
    Testing
    • Estimate future error rate
    Split randomly to avoid bias
    http://www.inf.ed.ac.uk/teaching/courses/iaml/slides/eval-2x2.pdf
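
A minimal sketch of a random three-way split, using only the standard library. The 60/20/20 proportions, the fixed seed, and the function name are assumptions for illustration; the slides only ask that the split be random.

```python
import random


def train_val_test_split(items, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle `items` and cut them into train, validation and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

The test portion should be touched only once, after the algorithm and its knob settings have been chosen on the validation portion.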

  35. Data Splitting – Cross Validation
    Every point is both training and testing, never at the same time
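
One way to sketch k-fold cross validation with the standard library is to generate index lists for each fold. The function name and the choice of 10 samples in 3 folds are illustrative assumptions; in practice the data would usually be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each index appears in exactly one test fold and in the training part of every other fold, which is the property stated on the slide.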

  36. Dimensionality Reduction
    Principal Component Analysis vs Linear Discriminant Analysis

  37. Modeling
    CRISP DM Methodology

  38. Machine Learning
    • Classification
    • Regression
    • Ranking
    • Clustering

  39. Model Selection
    Regression
    Technique / Generalization bound
    • Linear regression
    • Kernel ridge regression
    • Support vector regression
    • Lasso

  40. Which one should I choose?
    Should I use all of them?

  41. It depends on…

  42. Model Selection
    Assumptions
    • The predictors are linearly independent
    • The error is a random variable with a mean of zero conditional on the explanatory variables
    • The sample is representative of the population for the inference prediction
    Interpretability
    • The understandability of why the model is true or how the model is induced from
    https://chenhaot.com/pubs/mldg-interpretability.pdf

  43. Beware of Overfitting!
    http://pingax.com/wp-content/uploads/2014/05/underfitting-overfitting.png

  44. Model Assessment
    Regression
    • (R)MSE
    • Mean Absolute Error
    • Correlation Coefficient
    Classification
    • Accuracy
    • Precision
    • Recall
    • F-score
    Descriptive
    • Std. Error
    • p-value
    • Confidence Interval
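
A short sketch of several of the regression and classification metrics above, written with the standard library only. The function names, the choice of `1` as the positive class, and the toy inputs are assumptions for illustration; the F-score shown is the balanced F1 variant.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy values, not taken from the slides.
reg_rmse = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
reg_mae = mae([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
labels, preds = [1, 0, 1, 1, 0], [1, 1, 1, 0, 0]
acc = accuracy(labels, preds)
prec, rec, f1 = precision_recall_f1(labels, preds)
print(reg_rmse, reg_mae, acc, prec, rec, f1)
```

Accuracy alone can mislead on imbalanced classes, which is one reason precision, recall and F-score are listed next to it.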

  45. Evaluation
    CRISP DM Methodology

  46. Does my model solve the problem?
    What is the impact? Is it novel? How useful is the solution?

  47. Deployment
    CRISP DM Methodology

  48. The Tasks
    • Plan deployment
    • Plan monitoring and maintenance
    • Produce final report
    • Review project

  49. Tools & Resources
    • Text mining: NLTK, spaCy, OpenNLP
    • Query expansion & clustering: Carrot2, Weka
    • Data mining & machine learning: Weka, scikit-learn
    • Language: R, Python, Julia, Java, Matlab, Mathematica, Haskell, Scala
    • Python lib: Pandas, SciPy, NumPy, scikit-learn
    • Infrastructure: AWS, Hadoop, Google Cloud, Azure, Apache Spark
    • Visualization: D3.js
    • Community: Big Data & Open Data Indonesia

  50. Thank you!
    Data Mining 101 – Python-ID Meetup February 2015
    Okiriza Wibisono - @okiriza
    Ali Akbar Septiandri - @aliakbars
