Slide 1

Slide 1 text

Learning What We Don’t Care About: Regularization with Sacrificial Functions
Gregory Ditzler (with Sean Miller, Michael Valenzuela, and Jerzy Rozenblit)
Assistant Professor, Department of Electrical & Computer Engineering, The University of Arizona
[email protected] | uamlda.github.io
April 27, 2018
ECE523: Engineering Applications of Machine Learning and Data Analytics

Slide 2

Slide 2 text

Overview of Research
- Feature selection: Neyman-Pearson, MABS+FS, parallelism, adversarial FS, misc.
- Model optimization
- Compressive sensing
- Learning ensembles: concept drift, partial information
- Applications: human health, environmental, cyber, software

Slide 3

Slide 3 text

Supervised Machine Learning in a Nutshell
Training data D := {(x_i, y_i)}_{i=1}^n are fed to a machine learning model with free parameters θ; the deployed model produces predictions ŷ = h(x) on test data D_test := {(x_i, y_i)}_{i=1}^n.
[Figure: different losses to minimize — L(h, f) as a function of the margin h · f for the log, 0-1, hinge, and modified Huber losses.]
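The four loss curves on the slide can be reproduced numerically. A minimal sketch; the exact scaling of the log loss is not stated on the slide, so here it is normalized to pass through 1 at zero margin, a common convention:

```python
import numpy as np

def zero_one_loss(z):
    # 0-1 loss on the margin z = h * f: 1 when the prediction disagrees with the label
    # (ties at z = 0 are counted as errors here)
    return (z <= 0).astype(float)

def hinge_loss(z):
    return np.maximum(0.0, 1.0 - z)

def log_loss(z):
    # logistic loss in base 2 so it equals 1 at z = 0 (an assumed normalization)
    return np.log2(1.0 + np.exp(-z))

def modified_huber_loss(z):
    # quadratically smoothed hinge: max(0, 1 - z)^2 for z >= -1, -4z otherwise
    return np.where(z >= -1.0, np.maximum(0.0, 1.0 - z) ** 2, -4.0 * z)

z = np.linspace(-3, 3, 7)
for name, fn in [("0-1", zero_one_loss), ("hinge", hinge_loss),
                 ("log", log_loss), ("modified Huber", modified_huber_loss)]:
    print(name, np.round(fn(z), 3))
```

All four are upper bounds or surrogates for the 0-1 loss, which is why they appear together on the slide's plot.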

Slide 4

Slide 4 text

Motivation
- Parameter optimization is often a time-consuming task, and the difficulty increases when a large number of parameters must be selected.
- We need to find free parameters for a classification algorithm on a given data set, and the selection of parameters has a large effect on the algorithm's performance on future testing data.
- Given a classification algorithm, which objective function should be optimized, and which parameters need to be considered?
G. Ditzler, S. Miller, and J. Rozenblit, “Learning What We Don’t Care About: Regularization with Sacrificial Functions,” under revision in Information Sciences, 2018.

Slide 5

Slide 5 text

Overview
Plan of Attack for the Next 50 Minutes
1. Motivation for Model Selection
2. Anti-Training with Sacrificial Functions
3. Experiments
4. Applications & Conclusion

Slide 6

Slide 6 text

Where It All Started: Barcelona, Spain, 2010

Slide 10

Slide 10 text

Fast Forward to 2016 → DARPA: Data-Driven Discovery of Models (D3M)
30,000-foot perspective
- All primitives have free parameters that require a fair amount of knowledge to choose.
- Which objectives are good? How can they be optimized? Single-objective or multi-objective?
- How do we check whether a model is over-training? For a model h ∈ H trained on a sample D, over-training occurs when there exists h′ ∈ H such that err_train(h) < err_train(h′) and err_true(h) > err_true(h′), where err_train(·) and err_true(·) are the errors measured on the training sample and the true distribution, respectively.
Key Technologies to be Developed in the Course of the Program
- Automated composition of complex models: techniques will be developed for automatically selecting model primitives and for composing selected primitives into complex modeling pipelines based on user-specified data and outcome(s) of interest.
- Evaluation on real-world problems that will progressively get harder.

Slide 11

Slide 11 text

Measuring Error (Bishop, 2006)
[Figures: the error between the prediction y(x_n, w) and the target t_n at each x_n, and the error function E(z) plotted for z ∈ [−2, 2].]
E(w) = (1/2) Σ_{n=1}^{N} (y(x_n, w) − t_n)²

Slide 16

Slide 16 text

Overfitting (Bishop, 2006)
y(x, w) = w_0 + w_1 x + w_2 x² + … + w_M x^M = Σ_{j=0}^{M} w_j x^j
[Figure: polynomial fits of order M = 0, 1, 3, and 9 to noisy targets t over x ∈ [0, 1].]
E(w) = (1/2) Σ_{n=1}^{N} (y(x_n, w) − t_n)²,  E_RMS = √(2E(w*)/N)

Slide 17

Slide 17 text

Overfitting (Bishop, 2006)
y(x, w) = w_0 + w_1 x + w_2 x² + … + w_M x^M = Σ_{j=0}^{M} w_j x^j
[Figure: E_RMS on the training and test sets as a function of the polynomial order M ∈ {0, …, 9}; the training error keeps decreasing while the test error grows at large M.]
E(w) = (1/2) Σ_{n=1}^{N} (y(x_n, w) − t_n)²,  E_RMS = √(2E(w*)/N)
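Bishop's polynomial experiment behind these plots can be sketched as follows. The sin(2πx) target, the noise level, and the sample size are assumptions matching the textbook setup, not values stated on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

# Bishop-style toy data: t = sin(2*pi*x) + Gaussian noise (assumed setup)
N = 10
x_train = np.linspace(0, 1, N)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.25, N)
x_test = np.linspace(0, 1, 100)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.25, 100)

def fit_poly(x, t, M):
    # least-squares fit of y(x, w) = sum_j w_j x^j
    X = np.vander(x, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, t, rcond=None)
    return w

def e_rms(x, t, w):
    # E_RMS = sqrt(2 E(w*) / N) with E(w) = (1/2) sum_n (y(x_n, w) - t_n)^2
    X = np.vander(x, len(w), increasing=True)
    resid = X @ w - t
    E = 0.5 * resid @ resid
    return np.sqrt(2 * E / len(t))

for M in (0, 1, 3, 9):
    w = fit_poly(x_train, t_train, M)
    print(M, round(e_rms(x_train, t_train, w), 3), round(e_rms(x_test, t_test, w), 3))
```

With 10 training points, the degree-9 polynomial interpolates the data, so its training E_RMS collapses toward zero even as it fits the noise.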

Slide 19

Slide 19 text

Choose Parameters and Optimize Different Objectives
Example: Logistic Regression with ℓ1 Regularization
The free parameters θ are chosen, and then the model is evaluated by a function f to measure generalization. For example, a logistic regression classifier with ℓ1 regularization has θ as the regularization coefficient:
min_{w∈R^p} ζ(w) = min_{w∈R^p} E[ξ(−y wᵀx)] + θ‖w‖₁
where ξ(z) = log(1 + exp(z)) and w are the weights of the logistic regression model.
Setting Up the Problem
- What if I want to optimize θ for an objective σ(θ), such that I have a “small” classification error and a large F-score?
- We typically measure classification error, F-score, sensitivity, specificity, and AUC at test time, not ζ(θ)!
- Presented this way, the problem is clearly noisy, non-convex, and a black box.
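A minimal sketch of minimizing ζ(w) with plain (sub)gradient descent; the synthetic data, learning rate, and descent scheme are illustrative assumptions (the slide does not prescribe a solver, and proximal methods would handle the ℓ1 term more cleanly):

```python
import numpy as np

def zeta(w, X, y, theta):
    # zeta(w) = E[xi(-y * w^T x)] + theta * ||w||_1, with xi(z) = log(1 + exp(z))
    margins = -y * (X @ w)
    return np.mean(np.logaddexp(0.0, margins)) + theta * np.abs(w).sum()

def fit(X, y, theta, lr=0.1, iters=2000):
    # subgradient descent: grad of the logistic term plus theta * sign(w) for the L1 term
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(y * (X @ w)))          # sigmoid(-y * w^T x)
        grad = -(X * (y * p)[:, None]).mean(axis=0) + theta * np.sign(w)
        w -= lr * grad
    return w

# Hypothetical data: labels driven by the first two features only
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=200))
w = fit(X, y, theta=0.05)
print(np.round(w, 3), round(zeta(w, X, y, 0.05), 3))
```

The point of the slide is that θ itself is not learned here; it must be chosen by some outer search, which is the problem ATSF addresses.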

Slide 23

Slide 23 text

Anti-Training with Sacrificial Functions
NFL and Anti-Training with Sacrificial Data
The probabilistic formulation of the No Free Lunch theorem is
Σ_{f∈F} P(D_y | f, Λ, n) = Σ_{f∈F+} P(D_y | f, Λ, n) + Σ_{f∈F0} P(D_y | f, Λ, n) + Σ_{f∈F−} P(D_y | f, Λ, n)
where Λ is an optimization algorithm, D_y is the data set of corresponding outputs, f is a function to be optimized, and n is the number of samples in the data set. This definition can be interpreted as
Σ_{f∈F} P(D_y | f, Λ1, n) = Σ_{f∈F} P(D_y | f, Λ2, n)
Anti-training can be viewed as a generalization of meta-learning that exploits the consequences of the NFL theorem: it tailors learning and optimization algorithms to problem distributions (i.e., data sets).
Bottleneck with Sacrificial Data
- Generating data to obtain measurements from F− is not trivial, and how much sacrificial data do you need to generate?
- The real problem: executing 10-fold cross-validation can take up to 20 weeks for a single data set!

Slide 24

Slide 24 text

Overview of Anti-Training with Sacrificial Functions
Input: model M with parameters θ
1. Choose suitable F+ and F−; the choice of F− is meant to act as a form of regularization.
2. Select the type of optimization problem (i.e., single- or multi-objective).
3. Optimize the objective(s) for M with parameters θ.
Output: θ
Figure: High-level pseudocode for model optimization with anti-training sacrificial functions and meta-learning.

Slide 27

Slide 27 text

Optimization Formalization
Objective Function Design
The first step in our framework is to form the optimization problem. The theory of anti-training defines three partitions (F+, F0, F−) of the functions that can be optimized under the NFL theorem. Ignoring F0, we partition the optimization task over F+ ∪ F−.
Single Objective: Optimize with Simulated Annealing
θ* = argmin_{θ∈Θ} [ Σ_{f∈F+} f(θ, D) + β Σ_{f′∈F−} f′(θ, D) ]
where β ≥ 0 and f ≥ 0 for all f ∈ F+ ∪ F−.
Multi-Objective: Optimize with a Genetic Algorithm (e.g., NSGA-II)
θ* = argmin_{θ∈Θ} ( {f(θ, D)}_{f∈F+}, {f′(θ, D)}_{f′∈F−} )
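The single-objective form can be sketched with a simple simulated-annealing loop. The two objective stubs standing in for F+ and F−, the cooling schedule, and the Gaussian proposal are all hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stubs: a beneficial objective with optimum at theta = (1, -1),
# and a sacrificial stand-in for |1/2 - err| that prefers theta[0] = 0.
F_plus = [lambda th: (th[0] - 1.0) ** 2 + (th[1] + 1.0) ** 2]
F_minus = [lambda th: abs(0.5 - 1.0 / (1.0 + np.exp(-th[0])))]

def atsf_objective(theta, beta=1.0):
    # single-objective form: sum over F+ plus beta times sum over F-
    return sum(f(theta) for f in F_plus) + beta * sum(f(theta) for f in F_minus)

def anneal(objective, theta0, iters=5000, T0=1.0):
    # simulated annealing over the free parameters theta with a linear cooling schedule
    theta = np.asarray(theta0, float)
    best, f_cur = theta.copy(), objective(theta)
    f_best = f_cur
    for k in range(iters):
        T = T0 * (1.0 - k / iters) + 1e-9
        cand = theta + rng.normal(0, 0.3, size=theta.shape)
        f_cand = objective(cand)
        # accept downhill moves always, uphill moves with Metropolis probability
        if f_cand < f_cur or rng.random() < np.exp((f_cur - f_cand) / T):
            theta, f_cur = cand, f_cand
        if f_cur < f_best:
            best, f_best = theta.copy(), f_cur
    return best, f_best

theta_star, f_star = anneal(atsf_objective, [0.0, 0.0])
print(np.round(theta_star, 2), round(f_star, 4))
```

Note how the sacrificial term pulls θ away from the unconstrained optimum of F+, which is exactly its regularizing role.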

Slide 30

Slide 30 text

Designing Beneficial and Sacrificial Functions
Beneficial Functions
The goal of F+ is to have a class of functions on which the classifier should perform well for a data set D. These functions are easier to design, since many experiments already use such functions for their benchmarks; examples for F+ include error, AUC, sensitivity, and specificity.
Sacrificial Functions
The goal of F− is to have a class of functions on which the classifier should not perform well for a data set D. An error measurement from a classifier trained/tested with randomly labeled data should be approximately a random guess; a measurement close to 0 or 1 could suggest overfitting. E.g., a model should not perform well on:
f−_sp = |1/2 − spec(θ, g(D))|,  f−_se = |1/2 − sens(θ, g(D))|,  f−_err = |1/2 − err(θ, g(D))|
where g(D) denotes D with randomly assigned labels.
Why the form of f−_sp, f−_se, and f−_err? Because they drop directly into argmin_{θ∈Θ} ( {f+(θ, D)}_{f+∈F+}, {f−(θ, D)}_{f−∈F−} ).
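A sketch of f−_err: evaluating a fixed, hypothetical predictor on copies of D whose labels were randomly permuted (playing the role of g(D)). A well-behaved model scores near chance there, so the sacrificial value stays near 0; retraining on the relabeled data, as the slide describes, is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(3)

def error_rate(y_true, y_pred):
    return np.mean(y_true != y_pred)

def sacrificial_error(predict, X, y, n_shuffles=50):
    # f_err^- = |1/2 - err(theta, g(D))|: g randomly relabels D, so a model that
    # is not overfitting should be near chance (error ~ 1/2) on g(D).
    vals = []
    for _ in range(n_shuffles):
        y_shuf = rng.permutation(y)
        vals.append(abs(0.5 - error_rate(y_shuf, predict(X))))
    return np.mean(vals)

# Toy illustration: a hypothetical predictor that thresholds the first feature.
X = rng.normal(size=(400, 3))
y = np.sign(X[:, 0])
predict = lambda X: np.sign(X[:, 0])

print(round(sacrificial_error(predict, X, y), 3))   # small value = near chance on g(D)
```

A large value of f−_err would mean the model tracks the permuted labels too well (or too poorly), either of which signals something pathological to minimize away.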

Slide 31

Slide 31 text

Multi-Objective Optimization (MOO)
Pareto Optimality & Evolutionary Algorithms
- The goal in MOO is to simultaneously optimize two or more conflicting objective functions.
- An analytical solution that produces the entire Pareto front may be intractable.
- Evolutionary algorithms (EAs) can search for multiple Pareto solutions concurrently.
- NSGA-II is an elitist EA that reduces computational complexity and provides diversity preservation, which ensures a good spread of solutions along the generated Pareto front.
[Figure: two conflicting objectives f1(x) and f2(x) with a tradeoff region between their minima, and the resulting Pareto front in (f1, f2) space.]
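The Pareto-front idea underlying NSGA-II can be illustrated with a brute-force non-dominated filter (NSGA-II's fast non-dominated sort and crowding-distance machinery are not reproduced here). The conflicting objectives f1(x) = x² and f2(x) = (x − 1)² are hypothetical, in the spirit of the slide's figure:

```python
import numpy as np

def dominates(a, b):
    # a dominates b (minimization) if a is no worse in every objective
    # and strictly better in at least one
    a, b = np.asarray(a), np.asarray(b)
    return bool(np.all(a <= b) and np.any(a < b))

def pareto_front(points):
    # brute-force extraction of the non-dominated set; O(n^2), fine for a sketch
    pts = [np.asarray(p, float) for p in points]
    return [i for i, p in enumerate(pts)
            if not any(dominates(q, p) for j, q in enumerate(pts) if j != i)]

# Two conflicting objectives whose tradeoff region is x in [0, 1]
xs = np.linspace(-0.5, 1.5, 41)
F = np.column_stack([xs ** 2, (xs - 1.0) ** 2])
front = pareto_front(F)
print(len(front), xs[front].min(), xs[front].max())   # Pareto-optimal x lie in [0, 1]
```

Every x outside [0, 1] is dominated (both objectives can be improved by moving toward the interval), mirroring the "tradeoff region between the green lines" in the slide.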

Slide 32

Slide 32 text

ATSF: Optimizing Your Classifier as a Black Box
[Diagram: data D := {(x_i, y_i)}_{i=1}^n and a classifier M enter the ATSF model-optimization block, which evaluates beneficial objectives F+ (error, F-score, sensitivity, MCC) and sacrificial objectives F− (SF: error, SF: F-score) and outputs θ.]

Slide 41

Slide 41 text

Benchmark Comparison
Parameter Optimization Strategies
The support vector machine (SVM) was selected as the base classifier to evaluate the impact of anti-training with sacrificial functions. We optimize over the regularizer and the kernel bandwidth (RBF kernel).
- None: F+ = {err} and F− = ∅
- F1: F+ = {err} and F− = {f−_err}
- F2: F+ = {1 − sen, 1 − spe, err} and F− = {f−_err}
- F3: F+ = {1 − sen, 1 − spe, err} and F− = {f−_sp, f−_se, f−_err}
- Fmin: the parameter search technique suggested in Matlab's documentation
- TPOT: searches through an entire pipeline of different preprocessing stages, then optimizes different classifiers
Experimental Protocol
- The data set is randomly partitioned into training (80%) and testing (20%) data.
- The training data are used to optimize the parameters of the SVM with the meta-search algorithm; an SVM is then learned from the training data.
- The error and F-score are measured on the testing data. The F-score is computed for each class, and we report the average.
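The protocol's 80/20 split and per-class (macro-averaged) F-score can be sketched as follows; the helper names and toy labels are illustrative, not from the paper:

```python
import numpy as np

def f_score_per_class(y_true, y_pred, cls):
    # one-vs-rest F-score for a single class label
    tp = np.sum((y_pred == cls) & (y_true == cls))
    fp = np.sum((y_pred == cls) & (y_true != cls))
    fn = np.sum((y_pred != cls) & (y_true == cls))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def macro_f_score(y_true, y_pred):
    # F-score computed for each class, then averaged, as in the protocol
    return np.mean([f_score_per_class(y_true, y_pred, c) for c in np.unique(y_true)])

# 80/20 random partition of n hypothetical examples
rng = np.random.default_rng(4)
n = 100
idx = rng.permutation(n)
train, test = idx[:int(0.8 * n)], idx[int(0.8 * n):]
print(len(train), len(test))

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 0, 1, 0])
print(macro_f_score(y_true, y_true), round(macro_f_score(y_true, y_pred), 3))
```

Averaging the per-class F-scores keeps minority classes from being washed out, which matters on the imbalanced benchmark sets in the tables that follow.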

Slide 43

Slide 43 text

Statistical Analysis for Significance Testing
Comparing Multiple Classifiers Across Multiple Data Sets
- We follow Demšar's recommendations for comparing multiple classifiers across multiple data sets.
- The algorithms are ranked from 1 to k according to their error rates on each data set, and the Friedman test checks the hypothesis that the ranks are uniformly distributed.
- If the Friedman test rejects, the post-hoc Nemenyi test checks for statistical differences using multiple comparisons. The average ranks for algorithms A and B are used to calculate
z(A, B) = (R_A − R_B) / √(k(k + 1)/(6N))
where R_A and R_B are the average ranks of algorithms A and B, k is the number of algorithms tested, and N is the number of data sets; z(A, B) is approximately standard normal.
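The Nemenyi statistic is straightforward to compute. Plugging in the average ranks from the classification-error table (k = 5 methods without TPOT, N = 30 data sets) gives the F3-vs-Fmin comparison:

```python
import math

def nemenyi_z(R_A, R_B, k, N):
    # z(A, B) = (R_A - R_B) / sqrt(k (k + 1) / (6 N)); approximately standard normal
    return (R_A - R_B) / math.sqrt(k * (k + 1) / (6.0 * N))

# Average ranks from the error table: F3 = 2.1, Fmin = 4.1
print(round(nemenyi_z(2.1, 4.1, 5, 30), 3))   # about -4.899: F3 ranks far better than Fmin
```

A large negative z means the row algorithm (here F3) ranks much better than the column algorithm; the resulting one-sided p-value is then compared against the corrected threshold α/4.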

Slide 44

Slide 44 text

Figure of Merit: Classification Error
Columns: None | F1 | F2 | F3 | Fmin | TPOT (rank in parentheses); each data set listed as name (samples, features).
blood (748, 4): 23.4 (4) | 23.07 (3) | 21.73 (1) | 22.87 (2) | 26.6 (5) | 38.25
breast-cancer-wisc-diag (569, 30): 4.39 (5) | 3.42 (4) | 3.33 (3) | 2.89 (2) | 2.63 (1) | 3.33
breast-cancer-wisc-prog (198, 33): 23.5 (2) | 25 (4) | 23.25 (1) | 24 (3) | 28.25 (5) | 32.43
breast-cancer-wisc (699, 9): 4.93 (5) | 4.93 (4) | 4.5 (2) | 4.71 (3) | 3.93 (1) | 2.63
breast-cancer (286, 9): 33.28 (5) | 27.24 (1.5) | 27.24 (1.5) | 27.41 (3) | 32.41 (4) | 38.61
congressional-voting (435, 16): 39.2 (4) | 38.86 (3) | 36.82 (2) | 36.14 (1) | 43.64 (5) | 45.06
conn-bench (208, 60): 13.1 (3) | 12.38 (1) | 12.86 (2) | 13.81 (4) | 17.14 (5) | 17.12
credit-approval (690, 15): 15.47 (4) | 15.4 (3) | 14.82 (2) | 14.17 (1) | 16.47 (5) | 13.84
cylinder-bands (512, 35): 21.17 (2.5) | 21.17 (2.5) | 21.55 (4) | 20 (1) | 28.83 (5) | 22.2
echocardiogram (131, 10): 29.63 (5) | 24.44 (2) | 26.67 (4) | 25.19 (3) | 21.11 (1) | 26.67
haberman-survival (306, 3): 28.06 (5) | 24.03 (3) | 23.55 (2) | 22.42 (1) | 26.77 (4) | 42.31
heart-hungarian (294, 12): 23.9 (5) | 17.29 (1) | 19.15 (3) | 18.98 (2) | 19.66 (4) | 21.19
hepatitis (155, 19): 16.56 (3) | 17.5 (4) | 16.25 (2) | 15 (1) | 21.56 (5) | 26.9
ionosphere (351, 33): 6.62 (3) | 6.9 (4) | 5.77 (2) | 4.79 (1) | 10.02 (5) | 6.76
mammographic (961, 5): 20.88 (5) | 20.62 (4) | 16.89 (1) | 19.79 (3) | 18.5 (2) | 18.19
molec-biol-promoter (106, 57): 15 (3.5) | 13.64 (1) | 15 (3.5) | 14.55 (2) | 24.24 (5) | 10.19
musk-1 (476, 166): 9.17 (4) | 8.33 (1) | 8.75 (2) | 8.85 (3) | 9.58 (5) | 9.37
oocytes merluccius (1022, 41): 16.44 (4) | 15.95 (3) | 15.41 (2) | 15.32 (1) | 20.87 (5) | 22.75
oocytes trisopterus (912, 25): 15.3 (4) | 14.15 (2) | 14.1 (1) | 14.21 (3) | 17.05 (5) | 19.02
ozone (2536, 72): 3.25 (2) | 3.17 (1) | 3.29 (4) | 3.27 (3) | 21.63 (5) | 27.58
parkinsons (195, 22): 8.5 (1) | 11.75 (4) | 10.75 (2.5) | 10.75 (2.5) | 12.25 (5) | 9.87
pima (768, 8): 28.18 (5) | 26.04 (2) | 26.23 (3) | 25.97 (1) | 26.75 (4) | 28.24
planning (182, 12): 25.95 (1.5) | 27.3 (3) | 27.57 (4) | 25.95 (1.5) | 29.46 (5) | 49.57
ringnorm (7400, 20): 1.73 (3) | 1.78 (4) | 1.65 (2) | 1.63 (1) | 2.11 (5) | 1.47
spectf (80, 44): 24.12 (4) | 18.82 (1) | 21.18 (2.5) | 21.18 (2.5) | 28.82 (5) | 24.01
statlog-australian-credit (690, 14): 33.45 (2) | 34.6 (5) | 33.81 (3) | 34.24 (4) | 32.59 (1) | 47.14
statlog-german-credit (1000, 24): 27.66 (4) | 27.61 (3) | 27.21 (2) | 26.92 (1) | 27.86 (5) | 32.11
statlog-heart (270, 13): 22.73 (5) | 21.27 (4) | 20.18 (2) | 20.91 (3) | 17.45 (1) | 18.05
titanic (2201, 3): 27.94 (5) | 26.49 (3) | 25.71 (2) | 24.97 (1) | 27.1 (4) | 29.85
vertebral (310, 6): 16.67 (2.5) | 16.67 (2.5) | 16.83 (4) | 15.4 (1) | 19.68 (5) | 17.13
Average rank (without TPOT): 3.7 | 2.8 | 2.4 | 2.1 | 4.1 | –
Average rank (with TPOT): 4.0 | 3.1 | 2.6 | 2.3 | 4.5 | 4.5

Slide 45

Slide 45 text

Figure of Merit: F-score
Columns: None | F1 | F2 | F3 | Fmin | TPOT (rank in parentheses); each data set listed as name (samples, features).
blood (748, 4): 47.4 (3) | 43.76 (5) | 53.37 (2) | 44.59 (4) | 56.44 (1) | 62.27
breast-cancer-wisc-diag (569, 30): 95.31 (5) | 96.31 (4) | 96.42 (3) | 96.9 (2) | 97.14 (1) | 96.86
breast-cancer-wisc-prog (198, 33): 56.66 (5) | 59.98 (3) | 65.58 (2) | 66.33 (1) | 58.36 (4) | 67.05
breast-cancer-wisc (699, 9): 94.57 (5) | 94.58 (4) | 95.06 (2) | 94.86 (3) | 95.64 (1) | 97.01
breast-cancer (286, 9): 49.91 (2) | 45.56 (4) | 49.53 (3) | 45.52 (5) | 53.44 (1) | 60.56
congressional-voting (435, 16): 46.43 (3) | 47.92 (1) | 42.47 (5) | 43.27 (4) | 46.58 (2) | 49.08
conn-bench (208, 60): 86.07 (3) | 86.89 (1) | 86.32 (2) | 85.35 (4) | 82.4 (5) | 82.65
credit-approval (690, 15): 84.27 (4) | 84.31 (3) | 84.93 (2) | 85.66 (1) | 83.29 (5) | 85.88
cylinder-bands (512, 35): 76.85 (3) | 77.13 (2) | 76.75 (4) | 78.4 (1) | 66.89 (5) | 78.46
echocardiogram (131, 10): 65.9 (5) | 71.03 (1) | 68.72 (4) | 70.65 (3) | 70.98 (2) | 74.25
haberman-survival (306, 3): 54.38 (3) | 47.67 (5) | 53.57 (4) | 57.11 (2) | 57.16 (1) | 57.66
heart-hungarian (294, 12): 74.09 (5) | 81.34 (1) | 79.49 (3) | 79.71 (2) | 76.45 (4) | 78.62
hepatitis (155, 19): 60.73 (5) | 70.7 (2) | 72.63 (1) | 70.65 (3) | 61.24 (4) | 71.25
ionosphere (351, 33): 92.5 (3) | 92.13 (4) | 93.58 (2) | 94.72 (1) | 83.7 (5) | 93.46
mammographic (961, 5): 77.78 (4) | 77.56 (5) | 82.98 (1) | 79.41 (3) | 81.4 (2) | 81.76
molec-biol-promoter (106, 57): 84.64 (3) | 85.96 (1) | 84.63 (4) | 85.06 (2) | 73.98 (5) | 89.99
musk-1 (476, 166): 90.49 (4) | 91.37 (1) | 90.92 (2) | 90.82 (3) | 88.07 (5) | 90.74
oocytes merluccius (1022, 41): 82.21 (4) | 82.84 (3) | 83.35 (2) | 83.49 (1) | 74.35 (5) | 78.14
oocytes trisopterus (912, 25): 84.2 (4) | 85.38 (2) | 85.42 (1) | 85.3 (3) | 82.49 (5) | 81.26
ozone (2536, 72): 51.45 (4) | 54.33 (3) | 60.44 (1) | 58.3 (2) | 46.42 (5) | 50.88
parkinsons (195, 22): 87.66 (1) | 84.43 (4) | 85.89 (2) | 85.77 (3) | 83.84 (5) | 90.57
pima (768, 8): 69.58 (4) | 72.08 (1) | 71.69 (3) | 72 (2) | 66.12 (5) | 72.12
planning (182, 12): 47.08 (5) | 48.6 (4) | 51.87 (2) | 52.35 (1) | 48.99 (3) | 47.48
ringnorm (7400, 20): 98.27 (3) | 98.22 (4) | 98.34 (2) | 98.37 (1) | 97.88 (5) | 98.53
spectf (80, 44): 74.44 (4) | 79.57 (1) | 77.39 (2) | 76.98 (3) | 69.75 (5) | 75.27
statlog-australian-credit (690, 14): 40.11 (5) | 45.64 (2) | 42.72 (3) | 47.3 (1) | 40.61 (4) | 52.44
statlog-german-credit (1000, 24): 63.86 (5) | 68.26 (4) | 68.88 (2) | 68.68 (3) | 69.58 (1) | 67.35
statlog-heart (270, 13): 76.99 (5) | 78.38 (4) | 79.33 (2) | 78.57 (3) | 81.67 (1) | 82.18
titanic (2201, 3): 54.91 (5) | 60.65 (3) | 63.46 (2) | 63.92 (1) | 58.23 (4) | 71.32
vertebral (310, 6): 81.73 (4) | 82.07 (2) | 81.92 (3) | 83.3 (1) | 78.56 (5) | 81.96
Average rank (without TPOT): 3.9 | 2.8 | 2.4 | 2.3 | 3.5 | –
Average rank (with TPOT): 4.8 | 3.5 | 3.0 | 2.9 | 4.4 | 2.4

Slide 46

Slide 46 text

Statistical Analysis: Classification Error and F-Score

Table: Nemenyi pairwise comparison for statistically significant improvements in classification error and F-score. Each entry can be interpreted as a p-value. If the p-value is less than α/4, then the row algorithm outperforms the column algorithm with statistical significance.

Classification Error
None: – | 0.9973 | 0.9998 | 0.9999 | 0.8364
F1: 2.75×10−3 | – | 0.8155 | 0.8896 | 0.0362
F2: 1.19×10−4 | 0.1846 | – | 0.6280 | 3.52×10−3
F3: 3.15×10−5 | 0.1103 | 0.3719 | – | 1.25×10−3
Fmin: 0.1636 | 0.9638 | 0.9965 | 0.9987 | –
(columns: None, F1, F2, F3, Fmin)

F-score
None: – | 0.9876 | 0.9993 | 0.9999 | 0.1846
F1: 0.0124 | – | 0.8261 | 0.9638 | 8.34×10−4
F2: 7.25×10−4 | 0.1739 | – | 0.8044 | 2.22×10−5
F3: 2.65×10−5 | 0.0362 | 0.1956 | – | 3.91×10−7
Fmin: 0.8155 | 0.9992 | 0.9999 | 1.0000 | –
(columns: None, F1, F2, F3, Fmin)
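As background, tables like these are produced by the standard Friedman/Nemenyi rank-based procedure: average the per-dataset ranks of each algorithm, then compare pairs of average ranks against a normal approximation. The sketch below is an illustration of that procedure with stdlib tools only; it is not guaranteed to reproduce the exact p-values above, which depend on the unrounded experimental ranks.

```python
import math

def average_ranks(scores):
    """scores: list of per-dataset score lists (lower is better).
    Returns each algorithm's average rank; rank 1 = best, ties averaged."""
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j])
        i = 0
        while i < k:
            j = i
            # extend over a run of tied scores
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            tied_rank = (i + j) / 2.0 + 1.0  # average of ranks i+1 .. j+1
            for t in range(i, j + 1):
                totals[order[t]] += tied_rank
            i = j + 1
    return [t / len(scores) for t in totals]

def pairwise_rank_pvalue(r_i, r_j, k, n):
    """One-sided p-value that the algorithm with average rank r_i
    outperforms the one with r_j, over k algorithms and n datasets.
    Small p-value => row (i) beats column (j)."""
    se = math.sqrt(k * (k + 1) / (6.0 * n))  # std. error of a rank difference
    z = (r_i - r_j) / se
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # normal CDF of z
```

For instance, with k = 5 objective configurations and n = 30 datasets (the setting of the F-score table), the standard error of a rank difference is sqrt(30/180) ≈ 0.41, so a rank gap of about 1.6 already yields a very small one-sided p-value.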

Slide 47

Slide 47 text

Statistical Analysis for Significance Testing

Table: Nemenyi pairwise comparison for statistically significant improvements in evaluation time.

None: – | 0.47249 | 0.12038 | 0.33942 | 1 | 0.99644
F1: 0.52751 | – | 0.13477 | 0.36503 | 1 | 0.99711
F2: 0.87962 | 0.86523 | – | 0.7761 | 1 | 0.99994
F3: 0.66058 | 0.63497 | 0.2239 | – | 1 | 0.99905
Fmin: 9.61×10−7 | 6.81×10−7 | 1.47×10−9 | 1.13×10−7 | – | 0.01921
TPOT: 0.004 | 0.003 | 5.56×10−5 | 0.0009 | 0.98078 | –
(columns: None, F1, F2, F3, Fmin, TPOT)

Slide 48

Slide 48 text

Conclusions

Summary of Anti-Training with Sacrificial Functions
Including sacrificial objectives in the optimization task tends to improve generalization.
MOO provides a front of solutions rather than a single solution, which gives the end user more information.

Slide 49

Slide 49 text

Conclusions

Summary of Anti-Training with Sacrificial Functions
Including sacrificial objectives in the optimization task tends to improve generalization.
MOO provides a front of solutions rather than a single solution, which gives the end user more information.

Current + Future Work
Recent experiments with decision trees, k-NN, and logistic regression confirm the findings with the support vector machine.
Optimizing with MOO is still very time consuming; one approach would be heuristics to speed up the parameter optimization.
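The "front of solutions" above is the Pareto (non-dominated) front of candidate parameter settings. As an illustration, the sketch below extracts the non-dominated subset from a set of objective tuples (e.g., validation error and a sacrificial objective, both minimized); the example points are made up for illustration.

```python
def pareto_front(points):
    """Return the non-dominated subset of objective tuples (minimizing all).

    A point p is dominated if some other point q is <= p in every objective
    and differs from p (hence strictly better in at least one objective).
    """
    front = []
    for p in points:
        dominated = any(
            q != p and all(q[i] <= p[i] for i in range(len(p)))
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

# hypothetical (validation error, sacrificial objective) pairs
candidates = [(1, 5), (2, 4), (3, 3), (2, 6), (4, 4)]
front = pareto_front(candidates)  # (2, 6) and (4, 4) are dominated
```

The end user then picks a single setting from the front according to the trade-off they care about, rather than being handed one black-box optimum.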

Slide 50

Slide 50 text

References

G. Ditzler, M. Valenzuela and J. Rozenblit, "Learning What We Don't Care About: Regularization with Sacrificial Functions," in preparation, 2017.
J. Ethridge, G. Ditzler, and R. Polikar, "Optimal ν-SVM parameter estimation using multi-objective evolutionary algorithms," IEEE Congress on Evolutionary Computation, 2010.
M. Valenzuela and J. Rozenblit, "Learning Using Anti-Training with Sacrificial Data," Journal of Machine Learning Research, 17(24):1–42, 2016.

Slide 51

Slide 51 text

That is all folks!