“The process of collecting, searching through, and analyzing a large amount of data in a database, as to discover patterns or relationships.” Data Mining - dictionary.reference.com
Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
Situation Assessment – Risks and Contingencies: Contingency Plan; Financial; Organizational; Business http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
Situation Assessment – Terminology Write down related terminology http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf http://www.partnersmn.com/wp-content/uploads/2010/08/5b8567b2b4e2d1cfd1a31b2b8a0ecebc1.jpg
Data Exploration – Visualization Heuristics Visualize fast. Visualize reactively. Go for high information 2D visualizations. Select data subsets to visualize. http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf
Data Exploration – Visualization Heuristics Never let anomalies pass you by. Dig deeper. Use your visualizations to inform potential models. Use your potential model to direct your visualizations. Expect problems in your data. http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf
“This is the cheapest and most informative stage of data mining.” Nigel Goddard – DME Visualization http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf
Data Exploration – Visualization Tools Column/bar: large change; Line, curve: small change, long periods; Histogram: frequency distribution https://nces.ed.gov/nceskids/help/user_guide/graph/whentouse.asp
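As a rough illustration of the frequency distribution that a histogram displays, the sketch below bins values into equal-width ranges and prints a text histogram. The function name `frequency_distribution`, the sample `ages`, and the bin width of 10 are assumptions made for this example, not part of the original slides.

```python
from collections import Counter

def frequency_distribution(values, bin_width):
    """Count how many values fall into each equal-width bin, keyed by bin start."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Made-up sample data, for illustration only.
ages = [21, 23, 25, 31, 34, 38, 39, 42, 47, 58]
bin_width = 10
freq = frequency_distribution(ages, bin_width)

# One row of '#' marks per bin: a histogram turned on its side.
for start, count in freq.items():
    print(f"{start}-{start + bin_width - 1}: {'#' * count}")
```

A plotting library would normally draw this, but the counting step is the same whichever tool renders the bars.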
Data Construction Feature engineering – derived attributes, e.g.: year from timestamp; quarter from timestamp; BMI from weight and height; log(x) for skewed data (e.g. house price)
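The derived attributes listed above could be computed along these lines. This is a minimal sketch using only the Python standard library; the function names, the ISO timestamp format, the metric units (kilograms and metres) for BMI, and the sample values are all assumptions chosen for illustration.

```python
import math
from datetime import datetime

def year_and_quarter(timestamp):
    """Derive the year and calendar quarter (1-4) from an ISO-format timestamp."""
    ts = datetime.fromisoformat(timestamp)
    return ts.year, (ts.month - 1) // 3 + 1

def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2

def log_transform(x):
    """Natural log, which compresses the long right tail of skewed positive data."""
    return math.log(x)

# Hypothetical sample values.
year, quarter = year_and_quarter("2014-08-15T10:30:00")
person_bmi = bmi(70.0, 1.75)
log_price = log_transform(250000.0)
```

Note that `math.log` raises `ValueError` for zero or negative inputs, so a log transform of this kind only suits strictly positive attributes such as prices.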
Data Splitting – Training-Validation-Testing Training: • Construct classifier Validation: • Pick algorithm • Knob settings (tree depth, k in kNN, C in SVM) Testing: • Estimate future error rate Split randomly to avoid bias. http://www.inf.ed.ac.uk/teaching/courses/iaml/slides/eval-2x2.pdf
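A random three-way split might be sketched as follows. The function name `train_val_test_split`, the 60/20/20 proportions, and the fixed seed are assumptions for this example; the slides do not prescribe particular fractions.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the rows, then cut them into training, validation, and test sets."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical data set: 100 row identifiers.
train, val, test = train_val_test_split(range(100))
```

Shuffling before cutting is what makes the split random: if the rows were sorted (for example by date or by class), taking contiguous slices without shuffling would bias each subset.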
Model Selection Assumptions: The predictors are linearly independent. The error is a random variable with a mean of zero conditional on the explanatory variables. The sample is representative of the population for the inference prediction. Interpretability: The understandability of why the model is true or how the model is induced from https://chenhaot.com/pubs/mldg-interpretability.pdf
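The first assumption above, that the predictors are linearly independent, can be checked by comparing the rank of the predictor matrix with its number of columns. The sketch below uses plain Gaussian elimination from the standard library; the function names, the tolerance `tol`, and the sample matrices are assumptions for illustration, and a numerical library would normally be used on real data.

```python
def matrix_rank(matrix, tol=1e-9):
    """Rank of a matrix (list of rows) via Gaussian elimination with partial pivoting."""
    rows = [[float(x) for x in r] for r in matrix]
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        # Choose the remaining row with the largest entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(rows[r][col]), default=None)
        if pivot is None or abs(rows[pivot][col]) < tol:
            continue  # no usable pivot: this column depends on earlier ones
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        # Zero out this column in every row below the pivot row.
        for r in range(rank + 1, n_rows):
            factor = rows[r][col] / rows[rank][col]
            for c in range(col, n_cols):
                rows[r][c] -= factor * rows[rank][c]
        rank += 1
    return rank

def predictors_independent(matrix):
    """True when every predictor column adds information (full column rank)."""
    return bool(matrix) and matrix_rank(matrix) == len(matrix[0])

# Hypothetical predictor matrices: one row per observation, one column per predictor.
independent = [[1, 2], [2, 1], [3, 5]]
collinear = [[1, 2], [2, 4], [3, 6]]  # second column is twice the first
```

Here `predictors_independent(independent)` is true and `predictors_independent(collinear)` is false, because the second matrix has rank 1 with two columns.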