Simple rules for building robust machine learning models

Slide 1

Slide 1 text

SIMPLE RULES FOR BUILDING ROBUST MACHINE LEARNING MODELS WITH EXAMPLES IN R
Kyriakos Chatzidimitriou
Research Fellow, Aristotle University of Thessaloniki
PhD in Electrical and Computer Engineering
AMA Call, Early Career and Engagement IG

Slide 2

Slide 2 text

ABOUT ME
• Born in 1978
• 1997-2003, Diploma in Electrical and Computer Engineering, AUTH, GREECE
• 2003-2004, Worked as a developer
• 2004-2006, MSc in Computer Science, CSU, USA
• 2006-2007, Greek Army
• 2007-2012, PhD, AUTH, GREECE
  • Reinforcement learning and evolutionary computing mechanisms for autonomous agents
• 2013-Now, Research Fellow, ECE, AUTH
• 2017-Now, co-founder, manager and full stack developer of Cyclopt P.C.
  • Spin-off company of AUTH focusing on software analytics

Slide 3

Slide 3 text

GENERAL CAREER ADVICE
• Life is hard and full of problems, so there is no point in meaningless suffering :)
• To be happy, among the problems you get to choose, pick those you like solving.
• By putting in 10K hours or more…
• …you will be too good to be ignored, and you get there by focusing on deep work and on difficult problems.
• This creates a positive feedback loop, where good things happen.

Slide 4

Slide 4 text

WHAT IS ML (SUPERVISED)

Slide 5

Slide 5 text

WHAT AM I WORKING ON, ML-WISE
• Deep website aesthetics
• AutoML
• Continuous Implicit Authentication
• Formatting or linting errors

Slide 6

Slide 6 text

SIMPLE RULE 1: SPLIT YOUR DATA IN THREE

Slide 7

Slide 7 text

GREEK SONG • «Δεν υπάρχει ευτυχία, που να κόβεται στα τρία…» – “There is no happiness split in three…” • Not true for ML

Slide 8

Slide 8 text

THE THREE SETS
• Training set: data on which the learning algorithm runs
• Validation set: used for making decisions (tuning parameters, selecting features, choosing model complexity)
• Test set: used only for evaluating performance
• Anything else is data snooping

Slide 9

Slide 9 text

DO NOT TAKE ANY DECISIONS ON THE TEST SET - Do not use it for selecting ANYTHING! - Rely on the validation score only

Slide 10

Slide 10 text

VISUAL EXAMPLES OF SPLITS
[Figure: the whole dataset split into training, validation and test sets in three ways: a 60-20-20 split, 10-fold CV (folds 1F to 10F, plus a test set) and LOOCV (plus a test set)]

Slide 11

Slide 11 text

R EXAMPLE
library(e1071)      # naiveBayes
library(MLmetrics)  # Accuracy (assumed source of this helper)

k <- 5
results <- numeric(k)
# Hold out roughly 20% of the rows as the test set
ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.8, 0.2))
trainval <- iris[ind == 1,]
test <- iris[ind != 1,]
# Assign each remaining row to one of k cross-validation folds
cv <- sample(rep(1:k, length.out=nrow(trainval)))
for(i in 1:k) {
  trainData <- trainval[cv != i,]
  valData <- trainval[cv == i,]
  model <- naiveBayes(Species ~ ., data=trainData, laplace=0)
  pred <- predict(model, valData)
  results[i] <- Accuracy(pred, valData$Species)
}
print(mean(results))
# after finding the best laplace
finalmodel <- naiveBayes(Species ~ ., data=trainval, laplace=0)
pred <- predict(finalmodel, test)
print(Accuracy(pred, test$Species))
Result: validation accuracy 0.95 vs. test accuracy 0.91

Slide 12

Slide 12 text

SIMPLE RULE 2: SPLIT YOUR DATA IN THREE, CORRECTLY

Slide 13

Slide 13 text

1948 US ELECTIONS
• 1948, Truman vs. Dewey
• A newspaper ran a phone poll the day before the election
• In those days phone owners were mostly Dewey supporters, so the sample did not reflect the electorate

Slide 14

Slide 14 text

RULE
• Choose validation and test sets to reflect the data you expect to see in the future
• Ideally, performance on the validation and test sets should be the same
• Example: say validation set performance is superb and test set performance is so-so
  • If both come from the same distribution:
    • You have overfitted the validation set
  • If they come from different distributions:
    • You have overfitted the validation set, or
    • The test set is harder, or
    • The test set is different

Slide 15

Slide 15 text

EXAMPLE OF STRATIFIED CV IN R
library(caret)  # createFolds
library(plyr)   # ddply

iris$Species <- as.numeric(iris$Species)
# Stratified folds: the class mix is balanced across folds
folds <- createFolds(iris$Species, list=FALSE)
iris$folds <- folds
ddply(iris, 'folds', summarise, prop=mean(Species))
# Non-stratified folds for comparison
non_strat_folds <- sample(rep(1:10, length.out=nrow(iris)))
iris$non_strat_folds <- non_strat_folds
ddply(iris, 'non_strat_folds', summarise, prop=mean(Species))
Things will be (much) worse if the class distribution is more skewed.

Slide 16

Slide 16 text

SIMPLE RULE 3: DATASET SIZE IS IMPORTANT

Slide 17

Slide 17 text

SIZE HEURISTICS
• #1 Good validation set sizes are between 1000 and 10K
• #2 For the training set, have at least 10x the VC-dimension
  • For a neural network the VC-dimension is roughly equal to the number of weights
• #3 A popular heuristic for the test set size is 30%, less for large problems
• #4 If you have more data, put them in the validation set to reduce overfitting
• #5 The validation set should be large enough to detect differences between algorithms
  • To distinguish classifier A with 90% accuracy from classifier B with 90.1% accuracy, 100 validation examples will not do.
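Heuristic #5 can be checked with a rough calculation. The sketch below assumes the accuracy estimate behaves like a binomial proportion, so its standard error is sqrt(p(1-p)/n); the helper name accuracy_se is made up for this illustration. It ignores that A and B are scored on the same examples, so read it as an order-of-magnitude argument only.

```r
# Standard error of an accuracy estimate measured on n validation examples
# (binomial approximation, an assumption of this sketch)
accuracy_se <- function(p, n) sqrt(p * (1 - p) / n)

sizes <- c(100, 1000, 10000, 1e6)
se <- accuracy_se(0.90, sizes)
print(data.frame(n = sizes, std_error = round(se, 4)))

# Validation size at which one standard error shrinks to the
# 0.1 percentage-point gap between classifiers A and B
n_needed <- 0.90 * (1 - 0.90) / 0.001^2
print(n_needed)
```

With 100 examples the standard error is about 3 percentage points, thirty times the gap we are trying to detect.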

Slide 18

Slide 18 text

FINANCIAL IMPACT OF 0.1%
                Before        After
Searches        10,000,000    10,000,000
CTR             1%            1.1%
Visitors        100,000       110,000
Conversion      1%            1.1%
Purchases       1,000         1,210
Price           $100          $100
Revenue         $100,000      $121,000 (+$21,000)
Bid Prediction, Adaptive Content
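The arithmetic of the funnel above can be reproduced in a few lines. The function name funnel_revenue is hypothetical; the numbers are the ones on the slide.

```r
# Revenue of a search -> click -> purchase funnel
funnel_revenue <- function(searches, ctr, conversion, price) {
  visitors  <- searches * ctr         # searches that turn into visits
  purchases <- visitors * conversion  # visits that turn into purchases
  purchases * price
}

before <- funnel_revenue(1e7, 0.010, 0.010, 100)
after  <- funnel_revenue(1e7, 0.011, 0.011, 100)
print(c(before = before, after = after, gain = after - before))
```

Because the two 0.1-point improvements multiply, revenue grows by 21%, not 10%.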

Slide 19

Slide 19 text

LEARNING CURVES AS A TOOL

Slide 20

Slide 20 text

SIMPLE RULE 4: CHOOSE ONE METRIC

Slide 21

Slide 21 text

DIFFERENT METRIC FOR DIFFERENT NEEDS
• A single metric allows faster iterations and focus
• Are your classes balanced? Use accuracy
• Are your classes imbalanced? Use the F1-score
• Are you doing multilabel classification? Use for example macro-averaged accuracy
  • $B_{macro} = \frac{1}{q} \sum_{\lambda=1}^{q} B(TP_\lambda, FP_\lambda, TN_\lambda, FN_\lambda)$
  • B is a binary evaluation metric like $Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$
• The application dictates the metric
  • Continuous Implicit Authentication: Equal Error Rate
  • Combines two metrics: False Acceptance Rate and False Rejection Rate
  • We care both about keeping impostors out and about letting legitimate users in
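A minimal base-R sketch of the metrics above, on a made-up three-class toy example (the labels and predictions are invented for illustration). It derives per-class TP/FP/TN/FN from a confusion matrix, then computes the F1-score and the macro-averaged accuracy.

```r
# Toy labels and predictions (invented for this illustration)
actual    <- factor(c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c"))
predicted <- factor(c("a", "a", "b", "b", "b", "c", "c", "c", "a", "c"),
                    levels = levels(actual))

# Confusion matrix: rows = predicted class, columns = actual class
cm <- table(predicted, actual)

# One-vs-rest counts for every class
tp <- diag(cm)
fp <- rowSums(cm) - tp
fn <- colSums(cm) - tp
tn <- sum(cm) - tp - fp - fn

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

# Macro-averaging: apply the binary metric per class, then average
class_accuracy <- (tp + tn) / (tp + fp + tn + fn)
macro_accuracy <- mean(class_accuracy)
macro_f1       <- mean(f1)

print(round(c(macro_accuracy = macro_accuracy, macro_f1 = macro_f1), 3))
```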

Slide 22

Slide 22 text

SIMPLE RULE 5: ALWAYS DO YOUR EDA

Slide 23

Slide 23 text

THE QUESTION TO ASK IN EXPLORATORY DATA ANALYSIS
• Definition: Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations.
• Do you see what you expect to see?

Slide 24

Slide 24 text

R COMMANDS
data <- read.csv(file="winequality-red.csv", header=TRUE, sep=";")
head(data)           # did I read it OK?
str(data)            # am I satisfied with the data types?
dim(data)            # dataset size
summary(data)        # summary statistics, missing values?
table(data$quality)  # distribution of the class variable

Slide 25

Slide 25 text

corrplot(W, method="circle", tl.col="black", tl.srt=45)
install.packages("corrplot")
(W is presumably the correlation matrix of the data, e.g. cor(data))
- Check if correlations make sense.
- Decide on dropping variables that are uncorrelated with the class.

Slide 26

Slide 26 text

BOX-PLOTS
library(reshape2)   # melt
library(ggplot2)
library(gridExtra)  # grid.arrange

g <- list()
j <- 1
long <- melt(data)  # one row per (variable, value) pair
for(i in names(data)) {
  subdata <- long[long$variable == i,]
  g[[j]] <- ggplot(data=subdata, aes(x=variable, y=value)) + geom_boxplot()
  j <- j + 1
}
grid.arrange(grobs=g, nrow=2)
- Check for outliers

Slide 27

Slide 27 text

DENSITY PLOTS
library(ggplot2)
library(gridExtra)  # grid.arrange

g <- list()
j <- 1
for(i in names(data)) {
  print(i)
  p <- ggplot(data=data, aes_string(x=i)) + geom_density()
  g[[j]] <- p
  j <- j + 1
}
grid.arrange(grobs=g, nrow=2)
- Check whether each variable is normally distributed or right (positively) skewed

Slide 28

Slide 28 text

SIMPLE RULE 6: BE CAREFUL WITH DATA PREPROCESSING AS WELL

Slide 29

Slide 29 text

IMPUTING
• Imputation: the process of replacing missing data with substituted values
• Calculate statistics on the training data only, e.g. the mean
• Use this training mean to replace missing data in both the validation and the test sets
• Same for normalization or standardization
• Normalization is sensitive to outliers
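A small sketch of the imputation rule above, on invented data. The column name alcohol and the helper impute are placeholders; the point is that the mean comes from the training set and is reused unchanged on the other two sets.

```r
# Toy data with missing values (values invented for this illustration)
train <- data.frame(alcohol = c(9.4, 9.8, NA, 10.0, 11.2))
val   <- data.frame(alcohol = c(NA, 10.5))
test  <- data.frame(alcohol = c(9.0, NA))

# The statistic is computed on the training set only
train_mean <- mean(train$alcohol, na.rm = TRUE)

# Replace every NA in x with the given value
impute <- function(x, value) {
  x[is.na(x)] <- value
  x
}

train$alcohol <- impute(train$alcohol, train_mean)
val$alcohol   <- impute(val$alcohol, train_mean)   # not the validation mean
test$alcohol  <- impute(test$alcohol, train_mean)  # not the test mean

print(train_mean)
print(val)
```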

Slide 30

Slide 30 text

PREPROCESSING EXAMPLES IN R
# 60-20-20 split into training, validation and test sets
ind <- sample(3, nrow(data), replace=TRUE, prob=c(0.6, 0.2, 0.2))
trainData <- data[ind==1,]
valData <- data[ind==2,]
testData <- data[ind==3,]
# Min and max of each feature, computed on the training set only
trainMaxs <- apply(trainData[,1:11], 2, max)
trainMins <- apply(trainData[,1:11], 2, min)
# Min-max normalization: (x - min) / (max - min)
normTrainData <- sweep(sweep(trainData[,1:11], 2, trainMins, "-"), 2, (trainMaxs - trainMins), "/")
summary(normTrainData)

Slide 31

Slide 31 text

PREPROCESSING EXAMPLES IN R
# Normalize the validation set with the training mins and maxs
normValData <- sweep(sweep(valData[,1:11], 2, trainMins, "-"), 2, (trainMaxs - trainMins), "/")
Not an issue if the dataset is big and the sampling is done correctly.

Slide 32

Slide 32 text

SIMPLE RULE 7: DON’T BE UNLUCKY

Slide 33

Slide 33 text

BE KNOWLEDGEABLE
• Aka how randomness affects results…
• To avoid being unlucky, repeat the 10-fold cross-validation 10 times with different random splits and average the averages to get a more precise estimate
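A sketch of 10 times repeated 10-fold cross-validation on iris. It uses a decision tree from rpart as a stand-in classifier (an assumption of this sketch; any model would do) and plain mean(pred == truth) for accuracy, so no extra packages are needed beyond rpart, which ships with standard R installations.

```r
library(rpart)  # recommended package bundled with standard R installations

set.seed(2019)
repeats <- 10
k <- 10
repeat_means <- numeric(repeats)

for (r in 1:repeats) {
  # A fresh random fold assignment for every repetition
  folds <- sample(rep(1:k, length.out = nrow(iris)))
  fold_acc <- numeric(k)
  for (i in 1:k) {
    train <- iris[folds != i, ]
    val   <- iris[folds == i, ]
    model <- rpart(Species ~ ., data = train, method = "class")
    pred  <- predict(model, val, type = "class")
    fold_acc[i] <- mean(pred == val$Species)
  }
  repeat_means[r] <- mean(fold_acc)
}

print(mean(repeat_means))  # the average of the averages
print(sd(repeat_means))    # spread caused purely by the random splits
```

The standard deviation across repetitions shows how much a single 10-fold run can move just because of the split.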

Slide 34

Slide 34 text

EXAMPLE IN R
library(e1071)      # naiveBayes
library(MLmetrics)  # Accuracy (assumed source of this helper)

results <- numeric(100)
for(i in 1:100) {
  # A new random 90-10 split in every iteration
  ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.9, 0.1))
  trainData <- iris[ind==1,]
  valData <- iris[ind==2,]
  model <- naiveBayes(Species ~ ., data=trainData)
  pred <- predict(model, valData)
  results[i] <- Accuracy(pred, valData$Species)
}
• Even in this simple dataset and scenario, 55 of the 100 splits gave a perfect score in one run.
• With simple 10-fold cross-validation I could have gotten 100% validation accuracy.
• In one run I got 70%: a 30% difference based on luck.

Slide 35

Slide 35 text

SIMPLE RULE 8: TESTING

Slide 36

Slide 36 text

TEST WHICH MODEL IS SUPERIOR
Depends on what you are doing:
• If you work on a single dataset and you are in industry, you probably go with the model that has the best metric on the validation data, backed by the metric on the test data
• If you are doing research, you can add statistical testing
• If you are building ML algorithms and comparing different algorithms on a whole lot of datasets, check J. Demsar’s 2006 JMLR paper (more than 7K citations)

Slide 37

Slide 37 text

CHOOSING BETWEEN TWO
• X and Y models, 10-fold CV
• For a given significance level, check whether the observed difference between the two models is statistically significant
• Decide on a significance level: 5% or 1%
• Use the Wilcoxon signed-rank test on the paired per-fold scores
• Other tests rely on more assumptions, which hold only for large samples

Slide 38

Slide 38 text

R EXAMPLE
library(e1071)      # naiveBayes
library(rpart)      # rpart
library(MLmetrics)  # Accuracy (assumed source of this helper)

resultsMA <- numeric(10)
resultsMB <- numeric(10)
cv <- sample(rep(1:10, nrow(iris)/10))
for(i in 1:10) {
  # Train on nine folds, validate on the held-out fold
  trainData <- iris[cv != i,]
  valData <- iris[cv == i,]
  model <- naiveBayes(Species ~ ., data=trainData)
  pred <- predict(model, valData)
  resultsMA[i] <- Accuracy(pred, valData$Species)
  ctree <- rpart(Species ~ ., data=trainData, method="class", minsplit=1, minbucket=1, cp=-1)
  pred <- predict(ctree, valData, type="class")
  resultsMB[i] <- Accuracy(pred, valData$Species)
}
# Both models are scored on the same folds, so the test is paired
wilcox.test(resultsMA, resultsMB, paired=TRUE)
If the p-value is less than the significance level, the difference is statistically significant.

Slide 39

Slide 39 text

SIMPLE RULE 9: TIME IS MONEY

Slide 40

Slide 40 text

TIME IS MONEY
• Before running the whole experiment, play with a small(er) dataset
• What should this data be?
  • Representative!!!
• Check the whole pipeline, end-to-end
• GPU instances

Slide 41

Slide 41 text

SIMPLE RULE 10: KNOW THY DATA

Slide 42

Slide 42 text

DO I HAVE ENOUGH DATA? • Learning curves…. • Else augment • How can one augment?
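A learning curve can be sketched in base R. The example below is an illustration on synthetic data (a noisy sine, an assumption of this sketch), with a degree-5 polynomial regression as a stand-in model: it fits on growing training subsets and records training and validation error. If the validation error is still dropping at the largest size, more data is likely to help.

```r
set.seed(1)
n <- 500
x <- runif(n, -1, 1)
y <- sin(pi * x) + rnorm(n, sd = 0.3)  # synthetic data, for illustration only
data <- data.frame(x = x, y = y)

pool <- data[1:400, ]    # training examples to draw from
val  <- data[401:500, ]  # fixed validation set

sizes <- c(10, 20, 50, 100, 200, 400)
lc <- t(sapply(sizes, function(m) {
  train <- pool[1:m, ]
  model <- lm(y ~ poly(x, 5), data = train)
  c(n_train   = m,
    train_mse = mean(residuals(model)^2),
    val_mse   = mean((val$y - predict(model, val))^2))
}))

print(round(lc, 3))
# plot(lc[, "n_train"], lc[, "val_mse"], type = "b")  # optional visual check
```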

Slide 43

Slide 43 text

IMAGE AUGMENTATION https://github.com/aleju/imgaug

Slide 44

Slide 44 text

SMOTE TO OVERSAMPLE MINORITY CLASS https://www.researchgate.net/figure/Graphical-representation-of-the-SMOTE-algorithm-a-SMOTE-starts-from-a-set-of-positive_fig2_317489171
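The core idea of SMOTE fits in a few lines of base R: pick a minority sample, pick one of its k nearest minority neighbours, and create a synthetic point somewhere on the line between them. This is a simplified sketch (the function name smote_sketch is made up, and it is not the API of any SMOTE package); it uses ten setosa rows of iris as a pretend minority class.

```r
# Simplified SMOTE: interpolate between a minority sample and one of its
# k nearest minority neighbours (numeric features only)
smote_sketch <- function(minority, n_new, k = 5) {
  minority <- as.matrix(minority)
  d <- as.matrix(dist(minority))  # pairwise Euclidean distances
  diag(d) <- Inf                  # a point is not its own neighbour
  k <- min(k, nrow(minority) - 1)
  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(minority),
                      dimnames = list(NULL, colnames(minority)))
  for (s in seq_len(n_new)) {
    i <- sample(nrow(minority), 1)          # random minority sample
    neighbours <- order(d[i, ])[seq_len(k)] # its k nearest neighbours
    j <- neighbours[sample(length(neighbours), 1)]
    gap <- runif(1)                         # position along the segment
    synthetic[s, ] <- minority[i, ] + gap * (minority[j, ] - minority[i, ])
  }
  synthetic
}

set.seed(7)
minority <- iris[iris$Species == "setosa", 1:4][1:10, ]
new_points <- smote_sketch(minority, n_new = 20)
print(head(round(new_points, 2)))
```

Every synthetic point lies between two real minority points, so it stays inside the region the minority class already occupies.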

Slide 45

Slide 45 text

MORE ANNOTATIONS

Slide 46

Slide 46 text

SIMPLE RULE 11: DECIDE ON YOUR GOAL

Slide 47

Slide 47 text

IS IT INTERPRETABILITY OR PERFORMANCE?
• Decide what you are striving for.
• (Multi)-Collinearity
  • X1 = a * X2 + b
  • Many different values of the feature coefficients could predict Y equally well
• Variance Inflation Factor (VIF) test
  • 1, no collinearity
  • >10, indication of collinearity
• Discussed in: http://kyrcha.info/2019/03/22/on-collinearity-and-feature-selection
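The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. A base-R sketch follows; it uses the built-in mtcars data as a stand-in for the autompg dataset of the slides, and the helper name vif_base is hypothetical (packages such as car offer a ready-made vif).

```r
# VIF per feature: regress each feature on the remaining ones
vif_base <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    fit <- lm(reformulate(others, response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

# mtcars is a stand-in dataset for this illustration
predictors <- mtcars[, c("disp", "hp", "wt", "drat")]
vifs <- vif_base(predictors)
print(round(vifs, 2))
```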

Slide 48

Slide 48 text

R EXAMPLE Miles per gallon prediction, autompg dataset

Slide 49

Slide 49 text

VIF

Slide 50

Slide 50 text

REMOVE DISPL.

Slide 51

Slide 51 text

VIF AGAIN

Slide 52

Slide 52 text

COMPARISON

Slide 53

Slide 53 text

REMOVE ALL VIF>10

Slide 54

Slide 54 text

RIDGE REGRESSION
Regularization gives preference to one solution over the others.
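How ridge picks one solution can be seen from its closed form, beta = (X'X + lambda I)^-1 X'y, which shrinks the coefficients as lambda grows. A base-R sketch, again with mtcars as a stand-in dataset and a hypothetical helper ridge_fit (packages such as glmnet or MASS provide production implementations):

```r
# Closed-form ridge regression on standardized features
ridge_fit <- function(X, y, lambda) {
  X <- scale(X)               # so the penalty treats all features equally
  y_centered <- y - mean(y)   # removes the need for an intercept
  p <- ncol(X)
  beta <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y_centered))
  drop(beta)
}

X <- as.matrix(mtcars[, c("disp", "hp", "wt", "cyl")])  # correlated features
y <- mtcars$mpg

b_ols   <- ridge_fit(X, y, lambda = 0)   # ordinary least squares
b_ridge <- ridge_fit(X, y, lambda = 10)  # penalized solution

print(round(rbind(ols = b_ols, ridge = b_ridge), 3))
```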

Slide 55

Slide 55 text

RESULTS

Slide 56

Slide 56 text

SIMPLE RULE 12: START BY CHOOSING THE CORRECT MODEL FOR YOUR PROBLEM

Slide 57

Slide 57 text

RANDOM FORESTS
• Nice algorithm, works on a lot of datasets (Fernandez-Delgado et al., JMLR, 2014)
• Few important parameters to tune
• Handles multiclass problems (unlike for example SVMs)
• Can handle a mixture of features and scales

Slide 58

Slide 58 text

SVM
- Nice algorithm, works on a lot of datasets (Fernandez-Delgado et al., JMLR, 2014)
- Robust theory behind it
- Good for binary classification and 1-class classification
  - We use it in Continuous Implicit Authentication
- Can handle sparse data

Slide 59

Slide 59 text

GRADIENT BOOSTING MACHINES
• Focuses on difficult samples that are hard to learn
• If you have outliers, it will boost them to be the most important points
  • So make sure your outliers are important cases and not errors
• Is more of a black box, even though it is tree-based
• Needs more tuning
• Easy to overfit
• Mostly better results than RF

Slide 60

Slide 60 text

DEEP LEARNING • Choose if you have lots of data and computational resources • Don’t have to throw away anything. Solves the problem end-to-end.

Slide 61

Slide 61 text

SIDENOTE: BIAS - VARIANCE

Slide 62

Slide 62 text

BIAS VS. VARIANCE
• Bias: the algorithm’s error rate on the training set. It is caused by erroneous assumptions in the learning algorithm.
• Variance: the difference in error rate between the training set and the validation set. It is caused by overfitting to the training data and accounting for small fluctuations.
• Learning from Data slides: http://work.caltech.edu/telecourse.html
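With the working definitions above (bias = training error, variance = validation error minus training error), both can be read off directly from a fitted model. The sketch below is an illustration on synthetic data (a noisy sine, an assumption of this example) with polynomial regressions of increasing degree; the helper names mse and diagnose are made up.

```r
set.seed(123)
n <- 60
x <- runif(n, -1, 1)
y <- sin(pi * x) + rnorm(n, sd = 0.3)  # synthetic data, for illustration only
train <- data.frame(x = x[1:40],  y = y[1:40])
val   <- data.frame(x = x[41:60], y = y[41:60])

mse <- function(model, data) mean((data$y - predict(model, data))^2)

# Bias and variance as defined on the slide, for a polynomial of given degree
diagnose <- function(degree) {
  model <- lm(y ~ poly(x, degree), data = train)
  train_err <- mse(model, train)
  val_err   <- mse(model, val)
  c(degree = degree, bias = train_err, variance = val_err - train_err)
}

report <- t(sapply(c(1, 3, 12), diagnose))
print(round(report, 3))
```

A straight line (degree 1) cannot follow the sine, so its training error stays high; a more flexible model lowers the training error, and the gap to the validation error shows how much of that gain is overfitting.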

Slide 63

Slide 63 text

EXAMPLE

Slide 64

Slide 64 text

SIMPLE RULE 13: BECOME A KNOWLEDGEABLE TRADER

Slide 65

Slide 65 text

BIAS VARIANCE TRADE-OFF HEURISTICS • #1 High bias => Increase model size (usually with regularization to mitigate high variance) • #2 High variance => add training data (usually with a big model to handle them)

Slide 66

Slide 66 text

TRADE FOR BIAS
• Will reduce (avoidable) bias
  • Increase model size (more neurons/layers/trees/depth etc.)
  • Add more helpful features
  • Reduce/remove regularization (L2/L1/dropout)
• Indifferent
  • Add more training data

Slide 67

Slide 67 text

TRADE FOR VARIANCE
• Reduce variance
  • Add more training data
  • Add regularization
  • Early stopping (NN)
  • Remove features
  • Decrease model size (prefer regularization)
    • Usually use a big model to handle the training data and then add regularization
  • Add more helpful features

Slide 68

Slide 68 text

SIMPLE RULE 14: FINISH OFF WITH AN ENSEMBLE

Slide 69

Slide 69 text

ENSEMBLE TECHNIQUES
- By now you’ve built a ton of models
- Bagging: RF
- Boosting: AdaBoost, GBT
- Voting/Averaging
- Stacking
[Figure: training data feeds several classifiers, whose predictions are combined into a final prediction]
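The simplest of the techniques above, majority voting, can be sketched in base R. The predictions below are invented, and majority_vote is a hypothetical helper name.

```r
# Majority vote over the predictions of several models.
# predictions: character matrix, one row per sample, one column per model.
# Ties go to the label that comes first alphabetically.
majority_vote <- function(predictions) {
  apply(predictions, 1, function(votes) names(which.max(table(votes))))
}

# Invented predictions of three models on four samples
preds <- cbind(
  model_a = c("setosa", "versicolor", "virginica", "virginica"),
  model_b = c("setosa", "virginica",  "virginica", "versicolor"),
  model_c = c("setosa", "versicolor", "versicolor", "virginica")
)

final <- majority_vote(preds)
print(final)
```

For regression, or for classifiers that output probabilities, the same idea becomes averaging, e.g. rowMeans over the per-model outputs.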

Slide 70

Slide 70 text

SIMPLE RULE 15: TUNE HYPERPARAMETERS…BUT TO A POINT

Slide 71

Slide 71 text

TUNE THE MOST INFLUENTIAL PARAMETERS
• There is performance to be gained by parameter tuning (Bagnall and Cawley 2017)
• Tons of parameters, we can’t tune them all
• Understand how they influence training + read relevant papers/walkthroughs
• Random forests (Fernandez-Delgado et al., JMLR, 2014)
  • mtry: number of variables randomly sampled as candidates at each split
• SVM (Fernandez-Delgado et al., JMLR, 2014)
  • Tune the regularization and the kernel spread

Slide 72

Slide 72 text

SIMPLE RULE 16: START A WATERFALL-LIKE PROCESS

Slide 73

Slide 73 text

THE PROCESS
• Study the problem
• EDA
• Define optimization strategy (validation, test sets and metric)
• Feature Engineering
• Modelling
• Ensembling
• Error Analysis

Slide 74

Slide 74 text

GENERAL RULE

Slide 75

Slide 75 text

THE BASIC RECIPE (BY ANDREW NG) http://t.co/1Rn6q35Qf2

Slide 76

Slide 76 text

THANK YOU For further AMA questions open an issue at: https://github.com/kyrcha/ama

Slide 77

Slide 77 text

FURTHER READING
• Personal Experiences
• Various resources over the internet and the years
• ML Yearning: https://www.mlyearning.org/
• Machine Learning from Data course: http://work.caltech.edu/telecourse.html
• Practical Machine Learning with H2O book