
GIDS 2014 - Learning from 2014

Machine learning for busy software professionals. This session is targeted at folks who are curious about machine learning and want to get the gist by looking at examples rather than dry theory. It is a crisp presentation that works through various datasets with a bunch of tools. The intention is to share a way to comprehend, at a high level, what is involved in machine learning. Since the ground is very vast, this session focuses on applied machine learning, with demos using Excel, R, scikit-learn and others. You will walk out knowing what it means to create a model using simple algorithms and to evaluate that model. The idea is to simplify the topic and create enough interest that attendees can follow up on the topic on their own using their favourite tool.

Govind Kanshi

April 25, 2014

Transcript

  1. Agenda
     • What we know
     • What we do not know
     • Process
     • What to measure
     • Challenge with Model
     • Challenge with Data
     • Resources
     • Software
     • Books
  2. What we know
     • Reports made from data
     • KPIs made of data
     • Dashboards made of data
     • They all measure known metrics and answer known questions
  3. What we do not know
     • Will this person turn delinquent in x years based on their profile (age/income/background…)?
     • Which kind of process or machine will fail?
     • Which people/things are similar to each other – find me a pattern
     • How to prevent people from being readmitted to hospital
     • Why? Because we do not know the question, and databases/applications do not have such functionality out of the box (OOB)
  4. We are already using applied ML results
     • Mail gets despammed
     • Kinect recognizes our gestures
     • Facebook recognizes our photos
     • Siri/Cortana recognize our voice commands
     • Watson used some of it
     • Search uses many of these techniques
     • Recommendations are right in your face
  5. So then
     • Learn from data
     • How?
       • Create a model of the data
       • Test the model for error, then use it
  6. Unsupervised
     • Clustering
       • Customer segmentation
       • Topic identification
     • A number of algorithms
       • Hierarchical (distance as the measure – generally Euclidean)
       • Agglomerative (start with n groups and keep merging them) – see the sketch below
       • Single link (two at a time) vs. divisive (start with a single cluster and break it down)
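
     A minimal sketch of agglomerative clustering in scikit-learn (my illustration, not from the deck; the Iris data and the choice of three clusters are assumptions):

        # Hypothetical example: hierarchical clustering on Iris.
        from sklearn.datasets import load_iris
        from sklearn.cluster import AgglomerativeClustering

        X = load_iris().data
        # Ward linkage on Euclidean distance: start from n singleton
        # clusters and repeatedly merge the two closest groups.
        model = AgglomerativeClustering(n_clusters=3, linkage="ward")
        labels = model.fit_predict(X)
        print(labels[:10])  # cluster id assigned to the first ten rows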
  7. Simple way
     • Group folks on
       • Height
       • What you eat
       • Where you are from (state)
     • Next time a new person comes in – let us predict
  8. Challenges and next steps
     • How many groups/clusters?
     • How many mis-groupings? (evaluation)
     • Associating topics – and after clustering, what?
     • Once clusters are formed, someone can name them
     • Now run supervised methods on the data to learn more
  9. Supervised learning
     • Given a label L for attributes (a1, a2, a3, …)
     • Learn the model which can predict the label from the attributes
  10. Simple way to understand classification
      • Let us say we are labelled north Indian or south Indian
      • How?
      • Attributes (language, food, movie language, music, …)
      • Basically, learning the link between
        • observed data X and
        • a variable y, usually called the target or labels (see the sketch below)
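
      A minimal sketch of that X → y link (the attributes, encoding and labels below are invented for illustration):

         # Toy data: two encoded attributes (cuisine id, language id)
         # with a regional label for each row.
         from sklearn.neighbors import KNeighborsClassifier

         X = [[0, 0], [0, 1], [1, 1], [1, 0]]      # observed data X
         y = ["north", "north", "south", "south"]  # target/labels y
         clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
         print(clf.predict([[0, 1]]))              # -> ['north']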
  11. Supervised
      • Data
        • One dataset for training, which has the labels
        • One dataset for testing
      • Examples
        • Classification (spam, order data, disease data, Kinect gestures) – binary vs. multiclass
        • Regression (sales)
        • Ranking
        • Search
        • Predictive maintenance
        • Recommendation (Netflix – the Netflix competition popularized SVD)
  12. Demos
      • Trees
        • Decision tree – Python (show train and test, validation; sketch below)
        • Decision tree – R
        • BigML (network-dependent)
      • Challenge – one input at a time
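
      A sketch of what the Python decision-tree demo might look like (the Iris data and the 70/30 split are assumptions, not the exact demo code):

         from sklearn.datasets import load_iris
         from sklearn.model_selection import train_test_split
         from sklearn.tree import DecisionTreeClassifier

         X, y = load_iris(return_X_y=True)
         # Hold out 30% of the rows for testing/validation.
         X_train, X_test, y_train, y_test = train_test_split(
             X, y, test_size=0.3, random_state=0)
         tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
         print("train accuracy:", tree.score(X_train, y_train))
         print("test accuracy :", tree.score(X_test, y_test))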
  13. A few more terms to overcome data issues
      • Bagging (used with tree models; reduces variance)
        • Train an ensemble of models on bootstrap* samples
        • Take a vote among the models
        • The class predicted by the majority of the models wins
        • Take an average if the outputs are scores or probabilities
        • *Bootstrap denotes a different random sample of the dataset
      • Boosting (reduces bias)
        • Like bagging, but it penalizes and learns from misclassification
        • The challenge is assigning "weights" to misclassified instances to penalize them
        • Start with equal weights and re-weight each round until the error comes down
      (A sketch of both follows.)
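
      A minimal sketch of both ideas with scikit-learn ensembles (the dataset and parameters are illustrative assumptions):

         from sklearn.datasets import load_breast_cancer
         from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
         from sklearn.model_selection import train_test_split
         from sklearn.tree import DecisionTreeClassifier

         X, y = load_breast_cancer(return_X_y=True)
         X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

         # Bagging: many trees on bootstrap samples, then a majority vote.
         bag = BaggingClassifier(DecisionTreeClassifier(),
                                 n_estimators=50, random_state=0)
         # Boosting: each round re-weights the misclassified instances.
         boost = AdaBoostClassifier(n_estimators=50, random_state=0)
         for name, model in [("bagging", bag), ("boosting", boost)]:
             print(name, model.fit(X_tr, y_tr).score(X_te, y_te))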
  14. Demo
      • RandomForest (sketch below)
        • Each tree is trained on n rows drawn from the N training rows
        • At each decision node it randomly selects m of the M input features (m ≈ √M) and learns the split from those
        • Finally, each tree in the forest votes for the result
      • Evaluation
        • Apply a loss function to the margins (penalize misclassification, reward positives)
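
      A minimal random-forest sketch in scikit-learn (the digits data and parameters are assumptions):

         from sklearn.datasets import load_digits
         from sklearn.ensemble import RandomForestClassifier
         from sklearn.model_selection import train_test_split

         X, y = load_digits(return_X_y=True)
         X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
         # max_features="sqrt" tries m ~ sqrt(M) candidate features at
         # each split; each tree sees a bootstrap sample and the forest
         # takes a vote over all trees.
         rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                     random_state=0).fit(X_tr, y_tr)
         print("test accuracy:", rf.score(X_te, y_te))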
  15. Regression
      • Explains the relationship between variables (dependent vs. independent)
      • Linear: y = w0 + w1·x1 + w2·x2 + …
      • Estimate the weights w to predict y (sketch below)
      • Simple (one input) vs. multivariate (several)
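
      A minimal sketch of estimating the weights (the synthetic data below, with true weights 2, 3 and -1, is invented for illustration):

         import numpy as np
         from sklearn.linear_model import LinearRegression

         rng = np.random.RandomState(0)
         X = rng.rand(100, 2)                          # x1, x2
         y = 2 + 3 * X[:, 0] - 1 * X[:, 1] + 0.1 * rng.randn(100)

         model = LinearRegression().fit(X, y)
         print("w0:", model.intercept_)   # should land near 2
         print("w1, w2:", model.coef_)    # should land near [3, -1]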
  16. Demos
      • Excel
      • Simple linear – R
      • RandomForest – Wine
      • Evaluate by applying a loss function to the residuals (sketch below)
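
      A minimal sketch of that residual check (synthetic data invented for illustration; squared error is one common loss):

         import numpy as np
         from sklearn.linear_model import LinearRegression

         rng = np.random.RandomState(1)
         X = rng.rand(50, 1)
         y = 4 * X[:, 0] + 0.2 * rng.randn(50)

         model = LinearRegression().fit(X, y)
         residuals = y - model.predict(X)   # observed minus predicted
         print("mean squared error:", np.mean(residuals ** 2))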
  17. What to measure
      • Data
        • Cross-validation (sketch below)
          • n-fold cross-validation
          • Leave-one-out validation
          • Hold-out
        • At the end of the day: how much data is enough, and is there bias in the data (only certain kinds of labels)?
      • Model results
        • Contingency table (false negatives & false positives are bad)
        • ROC & AUC (coverage curve) (true positives vs. false positives)
        • Precision/recall (from the search world)
        • F-measure
        • Lift (not interested in accuracy on the entire dataset; want it for the top 5% or 10% of the dataset)
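
      A minimal sketch of n-fold and leave-one-out cross-validation in scikit-learn (data and model are illustrative assumptions):

         from sklearn.datasets import load_iris
         from sklearn.model_selection import LeaveOneOut, cross_val_score
         from sklearn.tree import DecisionTreeClassifier

         X, y = load_iris(return_X_y=True)
         clf = DecisionTreeClassifier(max_depth=3)

         # 5-fold: five train/test splits, five accuracy scores.
         print(cross_val_score(clf, X, y, cv=5))
         # Leave-one-out: n splits, each holding out a single row.
         print(cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())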
  18. Is the model working right?

                        Predicted +ve   Predicted -ve   Total
         Actual +ve        40 (TP)         15 (FN)        55
         Actual -ve         5 (FP)         40 (TN)        45
         Total              45              55            100

      • Precision = TP / (TP + FP) = 40/45
      • Recall = TP / (TP + FN) = 40/55
      • F-measure (harmonic mean) = 2 / ((1/precision) + (1/recall))
      • Accuracy = (TP + TN) / (TP + FN + FP + TN) = (40 + 40) / 100 = 0.80
      • How much accuracy is enough?
      • Lift – how much better than random guessing; lift and accuracy are not correlated
      (A sketch reproducing these numbers follows.)
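
      A minimal sketch that reproduces the table above with scikit-learn metrics (the label vectors are constructed to match the counts):

         from sklearn.metrics import (accuracy_score, confusion_matrix,
                                      f1_score, precision_score,
                                      recall_score)

         # 55 actual positives (40 caught, 15 missed) and 45 actual
         # negatives (5 false alarms, 40 correct); 1 = +ve, 0 = -ve.
         y_true = [1] * 55 + [0] * 45
         y_pred = [1] * 40 + [0] * 15 + [1] * 5 + [0] * 40

         print(confusion_matrix(y_true, y_pred))  # [[40 5], [15 40]]
         print("precision:", precision_score(y_true, y_pred))  # 40/45
         print("recall   :", recall_score(y_true, y_pred))     # 40/55
         print("F-measure:", f1_score(y_true, y_pred))
         print("accuracy :", accuracy_score(y_true, y_pred))   # 0.80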
  19. Challenge with Model
      • Overfitting
      • Avoid bias and keep the variance low
      • Use regularization
        • L1 (Lasso)
        • L2 (Ridge)
      • If time permits, show the effect of alpha (sketch below)
      • Look up "overfitting model" and "bias and variance"
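
      A minimal sketch of the alpha effect (synthetic data invented for illustration; only the first of ten features actually matters):

         import numpy as np
         from sklearn.linear_model import Lasso, Ridge

         rng = np.random.RandomState(0)
         X = rng.randn(100, 10)
         y = 5 * X[:, 0] + rng.randn(100)

         for alpha in (0.01, 1.0, 100.0):
             ridge = Ridge(alpha=alpha).fit(X, y)  # L2 shrinks weights
             lasso = Lasso(alpha=alpha).fit(X, y)  # L1 zeroes weights out
             print(alpha,
                   np.round(ridge.coef_[:3], 2),
                   np.round(lasso.coef_[:3], 2))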
  20. Challenge with Data
      • Feature types: categorical, ordinal, quantitative
      • Measures – mean, median, variance, standard deviation, range, shape (skewness)
      • Always observe the data to get a "feel"/smell for it
      • Discretize/threshold (convert a quantitative feature)
      • Missing feature(s) – what do you do? Impute the median or the average
      • Data encoding – create new features from existing ones vs. encode them a different way (sketch below)
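
      A minimal pandas sketch of median imputation, discretization and encoding (the toy frame is invented for illustration):

         import numpy as np
         import pandas as pd

         df = pd.DataFrame({"age": [23, 31, np.nan, 45],
                            "city": ["BLR", "DEL", "BLR", "MUM"]})

         df["age"] = df["age"].fillna(df["age"].median())  # impute median
         df["age_band"] = pd.cut(df["age"], bins=[0, 30, 60],
                                 labels=["young", "older"])  # discretize
         df = pd.get_dummies(df, columns=["city"])  # encode the category
         print(df)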
  21. Feature engineering
      • Feature selection
        • Intuition, testing correlation
        • Subset selection (start small and grow) based on some error function
      • Feature extraction
        • New k dimensions as combinations of the older d dimensions
        • Linear
          • PCA (project to find the variance – also explains the impact of outliers; sketch below)
          • LDA (a supervised method for dimensionality reduction for classification)
          • FA (factor analysis), multidimensional scaling (distances between points)
        • IsoMap (geodesic distance) and Locally Linear Embedding (LLE)
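
      A minimal PCA sketch (the Iris data and k = 2 are illustrative assumptions):

         from sklearn.datasets import load_iris
         from sklearn.decomposition import PCA

         X = load_iris().data                  # d = 4 original dimensions
         pca = PCA(n_components=2).fit(X)      # k = 2 new dimensions
         X2 = pca.transform(X)                 # linear combos of the four
         print(pca.explained_variance_ratio_)  # variance per component
         print(X2.shape)                       # (150, 2)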
  22. What we could not cover
      • Mechanisms
        • Reinforcement learning (punishments/rewards to learn better)
      • Algorithm types
        • Perceptron (backpropagation, SOM, …)
        • SVM
        • LDA and friends for the unstructured world
        • Regression (OLS, logistic, stepwise, MARS)
        • Regularization (Ridge/Lasso)
        • Trees (GBM, C4.5, ID3, …)
        • Bayesian
        • Kernel (radial)
        • Deep learning (DBN, Boltzmann, …)
        • Clustering (Expectation Maximization)
        • Recommendation
      • Probability (distributions) & linear algebra
      • Constraint solving and optimization (Solver, OpenSolver, …)
  23. Tools
      • R
      • Scikit
      • Theano
      • Weka
      • KNIME
      • Recommender (.net, …)
      • DataTau
      • BigML
      • WiseIO
      • Skytree
      • SAS/SPSS
      • Yhatr
  24. Books
      • Bishop
      • Alpaydin
      • John Foreman
      • PyMC – search query: "Bayesian-Methods-for-Hackers"
      • Scikit
        • jakevdp – "scikit jake 2014 tutorial"
        • Olivier Grisel – "scikit olivier grisel tutorial"
      • Recommender (http://mymedialite.net/) – Zeno Gantner
  25. What you will be doing
      • Data
        • Touch/feel it (visualize), breathe it in
        • Cleaning, scaling/normalization
        • Selecting
      • Algorithm (choose by the task)
        • Classification
        • Regression
        • Ranking (recommendation, search results)
      • Amongst them, evaluate algorithms against each other & refine/calibrate
        • AUC, ROC, RMSE, etc.
  26. If time & the network permit: Yhatr demo
      • Because you need to deploy, test & use the model
      • Yhatr provides good hosting (use theirs or host your own)