
GIDS 2014 - Learning from Data

Machine learning for busy software professionals. This session is targeted at folks who are curious about machine learning and want to get the gist by looking at examples rather than dry theory. It is a crisp presentation that takes various datasets and uses a bunch of tools. The intention is to share a way to comprehend, at a high level, what is involved in machine learning. Since the field is vast, the session focuses on applied machine learning, with demos using Excel, R, Scikit and others. You will walk out knowing what it means to create a model using simple algorithms and how to evaluate that model. The idea is to simplify the topic and create enough interest that attendees can follow up on it on their own using their favourite tool.

Govind Kanshi

April 25, 2014

Transcript

  1. Learning from Data
    Busy Professional’s guide to machine learning
    @govindk
    http://govindkanshi.wordpress.com


  2. Agenda
    • What we know
    • What we do not know
    • Process
    • What to measure
    • Challenge with Model
    • Challenge with Data
    • Resources
    • Software
    • Books


  3. What we know
    • Reports made from data
    • KPIs made of data
    • Dashboards made of data
    • They all measure known metrics and answer known questions


  4. What we do not know
    • Will this person turn delinquent in x years, based on their profile
    (age/income/background…)?
    • Which kind of process or machine will fail?
    • Which people/things are similar to each other – find me a pattern
    • Prevent people from being readmitted to hospital
    • Why? Because we do not know the question in advance, and
    databases/applications do not have out-of-the-box functionality for it.


  5. We are already using applied ML results
    • Mail gets de-spammed
    • Kinect recognizes our gestures
    • Facebook recognizes our photos
    • Siri/Cortana recognize our voice commands
    • Watson uses some of it
    • Search uses a lot of it
    • Recommendations are right there in our face


  6. So then
    • Learn from data
    • How
    • Create a model of the data
    • Test the model for error and use it


  7. Unsupervised
    • Clustering
    • Customer segmentation
    • Topic identification
    • A number of algorithms (see the sketch below):
    • Hierarchical (distance as the measure – generally Euclidean)
    • Agglomerative (start with n groups and keep merging them)
    • Single link (merge 2 at a time) vs. divisive (start with a single cluster and break it down)
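    The deck demos clustering in R and scikit; below is a minimal sketch in
    Python with scikit-learn. The toy "people" data is made up for
    illustration and is not from the deck.

    ```python
    # Agglomerative (hierarchical) clustering with Euclidean distance:
    # start with n singleton groups and keep merging the closest ones.
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    # Each row is one (made-up) person: [height_cm, spice_preference]
    X = np.array([[150, 8], [155, 9], [180, 2], [178, 3], [165, 5], [168, 6]])

    model = AgglomerativeClustering(n_clusters=2, linkage="single")  # single link
    labels = model.fit_predict(X)
    print(labels)  # which cluster each person landed in, e.g. [0 0 1 1 0 0]
    ```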


  8. Simple way
    • Group folks on
    • Height
    • What you eat
    • Where you are from (state)
    • Next time a new person comes in – let us predict their group


  9. Demos
    • USArrests Data
    • Wine Data


  10. Challenges and next steps
    • How many groups/clusters? (see the sketch below)
    • How many mis-groupings? (evaluation)
    • Associating topics – and after clustering, what next?
    • Once clusters are formed, someone can name them
    • Now run supervised methods on the data to learn more
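    A common way to attack "how many clusters" is to score a few candidate
    values of k; here is a sketch using k-means and the silhouette score
    (both scikit-learn; the synthetic two-blob data is mine, not the deck's).

    ```python
    # Pick k by silhouette score: higher means tighter, better-separated clusters.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # 2 blobs

    for k in range(2, 6):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, round(silhouette_score(X, labels), 3))  # k=2 should score highest
    ```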


  11. Supervised learning
    • Given a label L for attributes (a1, a2, a3, …)
    • Learn a model which can predict the label from the attributes


  12. Simple way to understand Classification
    • Let us say we are labelled North Indian or South Indian
    • How?
    • Attributes (language, food, movie language, music …)
    • Basically, learning the link between
    • observed data X and
    • a variable y, usually called the target or label.


  13. Supervised
    • Data
    • One dataset for training, which has labels
    • One dataset for testing
    • Examples
    • Classification (spam, order data, disease data, Kinect gestures)
    • binary vs. multiclass
    • Regression (sales)
    • Ranking
    • Search
    • Predictive maintenance
    • Recommendation
    • Netflix (the Netflix Prize competition = SVD-style matrix factorization)


  14. Demos
    • Trees
    • DecisionTree – Python (show train/test split and validation; see the sketch below)
    • Decision tree – R
    • BigML (network dependent)
    • Challenge –
    • splits consider one input at a time
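    A minimal stand-in for the Python decision-tree demo, assuming
    scikit-learn and its bundled iris data (the deck's actual demo data may
    differ).

    ```python
    # Train a decision tree on one split, validate on the held-out split.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    clf = DecisionTreeClassifier(max_depth=3, random_state=42)
    clf.fit(X_train, y_train)                           # learn from the training split
    print("test accuracy:", clf.score(X_test, y_test))  # evaluate on unseen data
    ```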


  15. A few more terms to overcome data issues
    • Bagging – (used with tree models) (variance reduction)
    • Train an ensemble of models from bootstrap* samples
    • Take a vote amongst the models
    • The class predicted by the majority of models wins
    • Take an average if the outputs are scores or probabilities
    • * Bootstrap – denotes a different random sample of the dataset
    • Boosting (bias reduction)
    • Like bagging, but penalizes & learns from misclassifications
    • Challenge of assigning "weights" to misclassified instances to penalize them
    • Start with equal weights and keep increasing the weight of misclassified
    instances until the error comes down (see the sketch below)
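    A sketch of both ideas with scikit-learn (the dataset choice is mine):
    BaggingClassifier votes over trees grown on bootstrap samples, while
    AdaBoostClassifier reweights misclassified instances each round.

    ```python
    # Bagging vs. boosting, compared with 5-fold cross-validation.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
    boost = AdaBoostClassifier(n_estimators=50, random_state=0)  # stumps by default

    print("bagging :", cross_val_score(bag, X, y, cv=5).mean())
    print("boosting:", cross_val_score(boost, X, y, cv=5).mean())
    ```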


  16. Demo
    • RandomForest
    • Each tree is trained on a bootstrap sample of n training examples out of N; at each
    decision node it randomly selects m input features from the total M input features
    (m ≈ sqrt(M)) and picks the best split among them. Finally, the trees in the forest
    vote for the result. (See the sketch below.)
    • Evaluation
    • Apply a loss function to the margins (penalize misclassification, reward positive ones)
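    A minimal sketch with scikit-learn's bundled wine data as a stand-in for
    the deck's wine demo.

    ```python
    # Random forest: many trees, each trained on a bootstrap sample, each split
    # chosen among a random subset of ~sqrt(M) features; the trees then vote.
    from sklearn.datasets import load_wine
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_wine(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
    rf.fit(X_train, y_train)
    print("test accuracy:", rf.score(X_test, y_test))
    ```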


  17. Regression
    • Explains the relationship between two variables (dependent vs.
    independent)
    • Simple linear: y = W0 + W1*x1 + W2*x2 + …
    • Estimate the weights to predict y (see the sketch below)
    • Multivariate
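    A sketch of estimating the weights, assuming scikit-learn; the true
    weights in the synthetic data below are made up so the fit can be checked.

    ```python
    # Recover W0, W1, W2 of y = W0 + W1*x1 + W2*x2 + noise from data.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))                        # columns are x1, x2
    y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 0.1, 200)

    model = LinearRegression().fit(X, y)
    print("W0:", model.intercept_)   # ~3.0
    print("W1, W2:", model.coef_)    # ~[2.0, -1.5]
    ```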


  18. Demos
    • Excel
    • SimpleLinear -R
    • RandomForest – Wine
    • Evaluate by applying a loss function to the residuals


  19. What to measure
    • Data
    • Cross-validation (see the sketch below)
    • n-fold cross-validation
    • Leave-one-out validation
    • Hold-out
    • End of the day: how much data is enough, and is there bias in the data (only certain kinds of labels)?
    • Model results
    • Contingency table (false negatives & false positives are bad)
    • ROC & AUC (coverage curve) (true positive rate vs. false positive rate)
    • Precision/Recall (from the search world)
    • F-measure
    • Lift (not interested in accuracy on the entire dataset; want it for the top 5%/10% of the dataset)
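    A sketch of n-fold and leave-one-out cross-validation with scikit-learn
    (the model and dataset are stand-ins, not from the deck).

    ```python
    # Same model, two validation schemes.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000)

    print("5-fold CV    :", cross_val_score(model, X, y, cv=5).mean())
    print("leave-one-out:", cross_val_score(model, X, y, cv=LeaveOneOut()).mean())
    ```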


  20. Is the model working right?
                    Predicted +ve   Predicted -ve   Total
    Actual +ve      40              15              55
    Actual -ve      5               40              45
    Total           45              55              100
    Precision = 40/45
    Recall = 40/55
    F-measure (harmonic mean) = 2/((1/precision) + (1/recall))
    Accuracy = (TP(40) + TN(40)) / (40+15+5+40) = 0.80 (worked below)
    How much accuracy is enough?
    Lift – how much better than random guessing
    Lift and accuracy are not correlated
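    The slide's numbers, worked through in plain Python:

    ```python
    # Confusion-matrix arithmetic from the table above.
    TP, FN = 40, 15   # actual +ve row: 40 + 15 = 55
    FP, TN = 5, 40    # actual -ve row: 5 + 40 = 45

    precision = TP / (TP + FP)                        # 40/45 ~ 0.889
    recall    = TP / (TP + FN)                        # 40/55 ~ 0.727
    f_measure = 2 / ((1 / precision) + (1 / recall))  # harmonic mean ~ 0.800
    accuracy  = (TP + TN) / (TP + FN + FP + TN)       # 80/100 = 0.80
    print(precision, recall, f_measure, accuracy)
    ```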


  21. Challenge with Model
    • Overfitting
    • Trade off bias against variance
    • Use regularization
    • L1 (Lasso)
    • L2 (Ridge)
    • If time permits, show the effect of alpha (see the sketch below)
    • Look for “overfitting model”, “bias and variance”
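    A sketch of the alpha effect with scikit-learn's Ridge and Lasso on
    synthetic data (the data is made up): as alpha grows, Ridge shrinks the
    weights while Lasso drives the useless ones exactly to zero.

    ```python
    # Stronger regularization (larger alpha) shrinks the learned weights.
    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))   # only the first two features matter
    y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, 100)

    for alpha in (0.01, 0.1, 1.0):
        ridge = Ridge(alpha=alpha).fit(X, y)
        lasso = Lasso(alpha=alpha).fit(X, y)
        print(alpha, np.round(ridge.coef_, 2), np.round(lasso.coef_, 2))
    ```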


  22. Challenge with Data
    • Categorical, ordinal, quantitative
    • Measures – mean, median, variance, std deviation, range, shape (skewness)
    • Always observe the data to get a “feel”/smell for it
    • Discretize/threshold (convert a quantitative feature)
    • Missing feature(s) –
    • What do you do? Impute with the median or average (see the sketch below)
    • Data encoding
    • Create new features from existing ones vs. encode them a different way
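    A sketch of median imputation and one-hot encoding, assuming pandas (the
    column names and values are made up).

    ```python
    import pandas as pd

    df = pd.DataFrame({
        "income": [30_000, None, 55_000, 42_000],  # quantitative, one missing
        "state":  ["KA", "TN", "KA", "MH"],        # categorical
    })

    df["income"] = df["income"].fillna(df["income"].median())  # impute the median
    df = pd.get_dummies(df, columns=["state"])                 # one-hot encode
    print(df)
    ```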


  23. Feature engineering
    • Feature selection
    • Intuition, testing correlation
    • Subset selection (start small and grow) based on some error function
    • Feature extraction
    • New k dimensions, as combinations of the older d dimensions
    • Linear
    • PCA (find directions of maximal variance by projecting – sensitive to outliers; see the sketch below)
    • LDA (supervised method for dimensionality reduction for classification)
    • FA (Factor Analysis), Multidimensional Scaling (distances between points)
    • IsoMap (geodesic distance) and Locally Linear Embedding (LLE)
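    A minimal PCA sketch on scikit-learn's bundled wine data (a stand-in),
    projecting the original d = 13 features down to k = 2 components.

    ```python
    # Project onto the two directions of maximal variance.
    from sklearn.datasets import load_wine
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X, _ = load_wine(return_X_y=True)
    X = StandardScaler().fit_transform(X)    # PCA is sensitive to feature scale

    pca = PCA(n_components=2)
    X2 = pca.fit_transform(X)
    print(X2.shape)                          # (178, 2)
    print(pca.explained_variance_ratio_)     # variance captured per component
    ```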


  24. What we could not cover
    • Mechanisms
    • Reinforcement learning (punishments/rewards to learn better)
    • Algorithm types
    • Perceptron (backpropagation, SOM, …)
    • SVM
    • LDA and friends for the unstructured world
    • Regression (OLS, logistic, stepwise, MARS)
    • Regularization (ridge/lasso)
    • Trees (GBM, C4.5, ID3…)
    • Bayesian
    • Kernel (radial)
    • Deep learning (DBN, Boltzmann…)
    • Clustering (Expectation Maximization)
    • Recommendation
    • Probability (distributions) & linear algebra
    • Constraint solving and optimization (Solver, OpenSolver…)


  25. Tools
    • R
    • Scikit
    • Theano
    • Weka
    • KNIME
    • Recommender (.net….)
    • DataTau
    • BigML
    • WiseIO
    • Skytree
    • SAS/SPSS
    • YHatr


  26. Books
    • Bishop
    • Alpaydin
    • John Foreman
    • PyMC – search query: “Bayesian-Methods-for-Hackers”
    • Scikit –
    • jakevdp – “scikit jake 2014 tutorial”
    • Olivier – “scikit olivier grisel tutorial”
    • Recommender (http://mymedialite.net/) – Zeno Gantner


  27. What you will be doing
    • Data
    • Touch/feel it (visualize), breathe it in
    • Cleaning, scaling/normalization
    • Selecting
    • Algorithm (choose by the task)
    • Classification
    • Regression
    • Ranking (recommendation, search results)
    • Amongst candidates
    • Evaluate algorithms against each other & refine/calibrate
    • AUC, ROC, RMSE etc.


  28. If time & net permit: Yhatr demo
    • Because you need to deploy, test & use the model
    • Yhatr provides good hosting (theirs, or host your own)


  29. Thanks for your time
    • Please fill the evaluation form
    • See you next time


  30. Reference
