Slide 1

Slide 1 text

Data Mining 101 Okiriza Wibisono - @okiriza Ali Akbar Septiandri - @aliakbars

Slide 2

Slide 2 text

Outline
Introduction
• Terminology
• Potential application
• Venn diagram
Process overview
• Business understanding
• Data understanding (exploration)
• Data preparation (preprocessing)
• Modeling
• Evaluation
• Deployment (presentation)
Tools & Resources

Slide 3

Slide 3 text

Introduction – Terminology
• Data mining
• Knowledge Discovery in Databases
• Big data analytics
• Statistics
• Data science

Slide 4

Slide 4 text

“The process of collecting, searching through, and analyzing a large amount of data in a database, as to discover patterns or relationships.”
Data Mining - dictionary.reference.com

Slide 5

Slide 5 text

Introduction – Potential Application
• Customer segmentation
• Recommendation engine
• Social media mining

Slide 6

Slide 6 text

“What should we do? Where to start? Do I have to get a master's degree in statistics?”

Slide 7

Slide 7 text

http://tomfishburne.com.s3.amazonaws.com/site/wp-content/uploads/2014/01/140113.bigdata.jpg

Slide 8

Slide 8 text

Data Science Venn Diagram
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Slide 9

Slide 9 text

And now the business process…

Slide 10

Slide 10 text

CRISP DM Methodology http://lyle.smu.edu/~mhd/8331f03/crisp.pdf

Slide 11

Slide 11 text

Business Understanding CRISP DM Methodology

Slide 12

Slide 12 text

Objective Statement Bottom-up Top-down

Slide 13

Slide 13 text

Objective Statement: Data vs Problem

Slide 14

Slide 14 text

Situation Assessment
• Inventory of Resources
• Requirements, Assumptions, and Constraints
• Risks and Contingencies
• Terminology
• Costs and Benefits
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf

Slide 15

Slide 15 text

Situation Assessment – Inventory of Resources
Resource
• Data, Knowledge, Tools
• Hardware
• Personnel
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf

Slide 16

Slide 16 text

Situation Assessment – Requirements, Assumptions, and Constraints
Requirements
• Scheduling
• Accuracy
• Security
Assumptions
• Data quality
• External factors
• Reporting type
Constraints
• Legal issues
• Budget
• Resources
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf

Slide 17

Slide 17 text

Situation Assessment – Risks and Contingencies
Contingency Plan
• Financial
• Organizational
• Business
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf

Slide 18

Slide 18 text

Situation Assessment – Terminology Write down related terminology http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf http://www.partnersmn.com/wp-content/uploads/2010/08/5b8567b2b4e2d1cfd1a31b2b8a0ecebc1.jpg

Slide 19

Slide 19 text

Situation Assessment – Costs and Benefits Money, money, money! http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf http://www.centuryproductsllc.com/wp-content/uploads/holding-money.jpg

Slide 20

Slide 20 text

“How to evaluate the results?”
Define your success criteria!

Slide 21

Slide 21 text

Data Understanding CRISP DM Methodology

Slide 22

Slide 22 text

Data Collection: External vs Internal

Slide 23

Slide 23 text

Watch out!

Slide 24

Slide 24 text

“visible ≠ accessible ≠ storable ≠ presentable”
Victor Lavrenko – Text Technologies
http://www.inf.ed.ac.uk/teaching/courses/tts/pdf/crawl-2x2.pdf

Slide 25

Slide 25 text

Data Exploration – Visualization Heuristics
• Visualize fast. Visualize reactively.
• Go for high information 2D visualizations.
• Select data subsets to visualize.
http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf

Slide 26

Slide 26 text

Data Exploration – Visualization Heuristics
• Never let anomalies pass you by. Dig deeper.
• Use your visualizations to inform potential models. Use your potential model to direct your visualizations.
• Expect problems in your data.
http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf

Slide 27

Slide 27 text

“This is the cheapest and most informative stage of data mining.”
Nigel Goddard – DME Visualization
http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf

Slide 28

Slide 28 text

Data Exploration – Visualization Tools
• Column/bar: Large change
• Line, curve: Small change, long periods
• Histogram: Frequency distribution
https://nces.ed.gov/nceskids/help/user_guide/graph/whentouse.asp

Slide 29

Slide 29 text

Data Preparation CRISP DM Methodology

Slide 30

Slide 30 text

“Which one should I include (or exclude)?”
Data Selection

Slide 31

Slide 31 text

Data Cleaning
Dirty Data
• Missing value
• Incomplete
• Outdated
• Duplication
• Outlier
Remember: Expect problems in your data.
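
A minimal sketch of three of the cleaning steps above (duplication, missing value, outlier), using only the Python standard library. The records, the median imputation, and the 1.5 × IQR rule are illustrative assumptions, not something the slides prescribe; the right treatment depends on the data and the business problem.

```python
from statistics import median, quantiles

# Hypothetical records: (customer_id, age); None marks a missing value.
records = [
    (1, 34), (2, None), (3, 29), (3, 29),   # customer 3 appears twice
    (4, 41), (5, 38), (6, 31), (7, 36),
    (8, 45), (9, 120),                      # 120 is a suspicious age
]

# 1. Duplication: keep the first occurrence of each exact record.
deduped = list(dict.fromkeys(records))

# 2. Missing value: impute with the median of the observed ages (an assumption).
observed = [age for _, age in deduped if age is not None]
fill = median(observed)
cleaned = [(cid, fill if age is None else age) for cid, age in deduped]

# 3. Outlier: flag ages more than 1.5 * IQR outside the quartiles.
q1, _, q3 = quantiles(observed, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [(cid, age) for cid, age in cleaned if not low <= age <= high]

print(cleaned)
print(outliers)
```

Flagged outliers are worth inspecting by hand before dropping them: an anomaly may be a data-entry error or a genuinely interesting case.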

Slide 32

Slide 32 text

Data Construction
Feature engineering – derived attributes, e.g.:
• year from timestamp
• quarter from timestamp
• BMI from weight and height
• Log(x) for skewed data (e.g. house price)
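
The derived attributes listed above can be sketched in a few lines of standard-library Python. The function name `derive_features`, its parameters, and the sample values are hypothetical and chosen only for illustration; with Pandas, the same fields would typically come from a datetime column.

```python
import math
from datetime import datetime

def derive_features(timestamp, weight_kg, height_m, house_price):
    """Return derived attributes for one hypothetical record."""
    ts = datetime.fromisoformat(timestamp)
    return {
        "year": ts.year,
        # Months 1-3 map to quarter 1, ..., months 10-12 map to quarter 4.
        "quarter": (ts.month - 1) // 3 + 1,
        # BMI = weight in kilograms divided by squared height in metres.
        "bmi": weight_kg / height_m ** 2,
        # Natural log compresses a right-skewed positive value such as price.
        "log_price": math.log(house_price),
    }

features = derive_features("2015-02-14T19:00:00", 70.0, 1.75, 250_000.0)
print(features)
```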

Slide 33

Slide 33 text

Data Splitting Two kinds of data splitting: Training-Validation-Testing Cross Validation

Slide 34

Slide 34 text

Data Splitting – Training-Validation-Testing
Training
• Construct classifier
Validation
• Pick algorithm
• Knob settings (tree depth, k in kNN, C in SVM)
Testing
• Estimate future error rate
Split randomly to avoid bias
http://www.inf.ed.ac.uk/teaching/courses/iaml/slides/eval-2x2.pdf
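
A minimal sketch of a random three-way split, assuming a 60/20/20 ratio and a fixed seed for reproducibility. The function name and both fractions are illustrative choices, not values from the slides.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle rows, then cut them into training, validation, and test lists."""
    shuffled = list(rows)
    # Shuffling first avoids bias from any ordering in the original data.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

The test portion should be touched only once, after the algorithm and its knob settings have been chosen on the validation portion; otherwise the estimated future error rate becomes optimistic.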

Slide 35

Slide 35 text

Data Splitting – Cross Validation
Every point is used for both training and testing, but never at the same time
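
One way to see this property is to generate the folds by hand. The sketch below assumes k-fold cross validation over contiguous index blocks; `k_fold_indices` is a hypothetical helper, and in practice the indices would usually be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each index is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each index lands in exactly one test fold and in the training set of every other fold, so no point is used for training and testing within the same round.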

Slide 36

Slide 36 text

Dimensionality Reduction: Principal Component Analysis vs Linear Discriminant Analysis

Slide 37

Slide 37 text

Modeling CRISP DM Methodology

Slide 38

Slide 38 text

Machine Learning
• Classification
• Regression
• Ranking
• Clustering

Slide 39

Slide 39 text

Model Selection Regression Technique Generalization bound Linear regression Kernel ridge regression Support vector regression Lasso

Slide 40

Slide 40 text

“Which one should I choose? Should I use all of them?”

Slide 41

Slide 41 text

It depends on…

Slide 42

Slide 42 text

Model Selection
Assumptions
• The predictors are linearly independent
• The error is a random variable with a mean of zero conditional on the explanatory variables
• The sample is representative of the population for the inference prediction
Interpretability
• The understandability of why the model is true or how the model is induced from https://chenhaot.com/pubs/mldg-interpretability.pdf

Slide 43

Slide 43 text

Beware of Overfitting! http://pingax.com/wp-content/uploads/2014/05/underfitting-overfitting.png

Slide 44

Slide 44 text

Model Assessment
Regression
• (R)MSE
• Mean Absolute Error
• Correlation Coefficient
Classification
• Accuracy
• Precision
• Recall
• F-score
Descriptive
• Std. Error
• p-value
• Confidence Interval
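
As a rough illustration of the regression and classification measures above, the sketch below computes them from scratch on tiny made-up label lists. The function names and data are assumptions for the example, and the F-score shown is the balanced F1 variant.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, and mean absolute error."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}

def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Guard against division by zero when nothing is predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

reg = regression_metrics([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
clf = classification_metrics([1, 0, 1, 1, 0], [1, 1, 0, 1, 0])
print(reg)
print(clf)
```

Accuracy alone can mislead on imbalanced classes, which is one reason precision, recall, and F-score are listed alongside it.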

Slide 45

Slide 45 text

Evaluation CRISP DM Methodology

Slide 46

Slide 46 text

“Does my model solve the problem? What is the impact? Is it novel? How useful is the solution?”

Slide 47

Slide 47 text

Deployment CRISP DM Methodology

Slide 48

Slide 48 text

The Tasks
• Plan deployment
• Plan monitoring and maintenance
• Produce final report
• Review project

Slide 49

Slide 49 text

Tools & Resources
• Text mining: NLTK, spaCy, OpenNLP
• Query expansion & clustering: Carrot2, Weka
• Data mining & machine learning: Weka, scikit-learn
• Language: R, Python, Julia, Java, MATLAB, Mathematica, Haskell, Scala
• Python lib: Pandas, SciPy, NumPy, scikit-learn
• Infrastructure: AWS, Hadoop, Google Cloud, Azure, Apache Spark
• Visualization: D3.js
• Community: Big Data & Open Data Indonesia

Slide 50

Slide 50 text

“Thank you!”
Data Mining 101 – Python-ID Meetup February 2015
Okiriza Wibisono - @okiriza
Ali Akbar Septiandri - @aliakbars