
Anti-Training with Sacrificial Functions


Gregory Ditzler

April 27, 2018

Transcript

1. Learning What We Don't Care About: Regularization with Sacrificial Functions
   Gregory Ditzler (with Sean Miller, Michael Valenzuela, and Jerzy Rozenblit), Assistant Professor, The University of Arizona, Department of Electrical & Computer Engineering. [email protected], uamlda.github.io. April 27, 2018. ECE523: Engineering Applications of Machine Learning and Data Analytics.
2. Overview of Research
   [Diagram of research areas: Feature Selection (Neyman-Pearson, MABS+FS, Parallelism, Adversarial FS, Misc.); Model Optimization; Compressive Sensing; Applications (Human Health, Environmental, Cyber, Software); Learning (Ensembles, Concept Drift, Partial Information).]
3. Supervised Machine Learning in a Nutshell
   Data feed a machine learning model: $\mathcal{D} := \{(x_i, y_i)\}_{i=1}^{n}$, with predictions $\hat{y} = h(x)$. At deployment, the model predicts on $\mathcal{D}_{\text{test}} := \{(x_i, y_i)\}_{i=1}^{n}$. The model has free parameters $\theta$.
   [Figure: different losses to minimize, $L(h, f)$ plotted against the margin $h \cdot f$ for the log, 0-1, hinge, and modified Huber losses.]
4. Motivation
   Parameter optimization is often a time-consuming task, and the difficulty increases when a large number of parameters must be selected. We need to find free parameters for a classification algorithm on a database, and the selection of parameters has a large effect on the algorithm's performance on future testing databases. Given a classification algorithm, what objective function should be optimized and what parameters need to be considered?
   G. Ditzler, S. Miller, and J. Rozenblit, "Learning What We Don't Care About: Regularization with Sacrificial Functions," under revision in Information Sciences, 2018.
5. Overview
   Plan of attack for the next 50 minutes:
   1. Motivation for Model Selection
   2. Anti-Training with Sacrificial Functions
   3. Experiments
   4. Applications & Conclusion
6–7. Where it all started: Barcelona, Spain, 2010
8–10. Fast Forward to 2016 → DARPA: Data-Driven Discovery of Models (D3M)
   Key technologies to be developed in the course of the program: automated composition of complex models. Techniques will be developed for automatically selecting model primitives and for composing selected primitives into complex modeling pipelines based on user-specified data and outcome(s) of interest. Evaluation is on real-world problems that get progressively harder.
   The 30,000-foot perspective: all primitives have free parameters that require a fair amount of knowledge to choose. Which objectives are good? How can they be optimized? Single-objective or multi-objective? How do we check whether a model is over-training?
   Over-training on a data set is defined as follows. For a model $h$ trained on a sample $\mathcal{D}$, over-training occurs when $\exists h' \in \mathcal{H}$ such that the following inequalities hold:
   $\text{err}_{\text{train}}(h) < \text{err}_{\text{train}}(h') \quad \text{and} \quad \text{err}_{\text{true}}(h) > \text{err}_{\text{true}}(h')$
   where $\text{err}_{\text{train}}(\cdot)$ and $\text{err}_{\text{true}}(\cdot)$ are the errors measured on the training sample and the true distribution, respectively.
11–12. Measuring Error (Bishop, 2006)
   [Figures: a fitted curve $y(x_n, w)$ with target $t_n$ at input $x_n$, and the error contribution $E(z)$ as a function of the residual $z$.]
   $E(w) = \frac{1}{2}\sum_{n=1}^{N} \left(y(x_n, w) - t_n\right)^2$
13–17. Overfitting
   $y(x, w) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j$
   [Figures (Bishop, 2006): polynomial fits of order $M = 0, 1, 3, 9$ to the same data, and $E_{\mathrm{RMS}}$ on the training and test sets as a function of $M$.]
   $E(w) = \frac{1}{2}\sum_{n=1}^{N} \left(y(x_n, w) - t_n\right)^2, \qquad E_{\mathrm{RMS}} = \sqrt{2E(w^*)/N}$
18–19. Choose Parameters and Optimize Different Objectives
   Example: logistic regression with $\ell_1$ regularization. The free parameters $\theta$ are chosen, then the model is evaluated by a function $f$ that measures generalization. For example, a logistic regression classifier with $\ell_1$ regularization has $\theta$ = regularization strength:
   $\min_{w \in \mathbb{R}^p} \zeta(w) = \min_{w \in \mathbb{R}^p} \mathbb{E}\left[\xi(-y w^\top x)\right] + \theta \|w\|_1$
   where $\xi(z) = \log(1 + \exp(z))$ and $w$ are the weights of the logistic regression model.
   Setting up the problem: what if I want to optimize $\theta$ for an objective $\sigma(\theta)$ such that I have a "small" classification error and a large F-score? At test time we typically measure classification error, F-score, sensitivity, specificity, and AUC, not $\zeta(\theta)$! Framed this way, the problem is clearly noisy, non-convex, and a black box.
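As a concrete instance, a minimal scikit-learn sketch of the example above; mapping $\theta$ onto the inverse of scikit-learn's C parameter is an assumption about notation, and the synthetic data are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# theta ~ 1/C in scikit-learn's parameterization: larger theta,
# stronger l1 penalty on the weights w.
for theta in (0.01, 0.1, 1.0, 10.0):
    clf = LogisticRegression(penalty="l1", C=1.0 / theta, solver="liblinear")
    err = 1.0 - cross_val_score(clf, X, y, cv=5).mean()
    print(f"theta={theta}: CV error ~ {err:.3f}")
```

Note that the loop scores $\theta$ by cross-validated error, not by the training objective $\zeta(\theta)$, which is exactly the mismatch the slide points out.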
20–23. Anti-Training with Sacrificial Functions: NFL and Anti-Training with Sacrificial Data
   The probabilistic formulation of the No Free Lunch theorem is
   $\sum_{f \in \mathcal{F}} P(D_y \mid f, \Lambda, n) = \sum_{f \in \mathcal{F}^+} P(D_y \mid f, \Lambda, n) + \sum_{f \in \mathcal{F}^0} P(D_y \mid f, \Lambda, n) + \sum_{f \in \mathcal{F}^-} P(D_y \mid f, \Lambda, n)$
   where $\Lambda$ is an optimization algorithm, $D_y$ is the data set of corresponding outputs, $f$ is a function to be optimized, and $n$ is the number of samples in the data set. This definition can be interpreted as
   $\sum_{f \in \mathcal{F}} P(D_y \mid f, \Lambda_1, n) = \sum_{f \in \mathcal{F}} P(D_y \mid f, \Lambda_2, n)$
   Anti-training can be viewed as a generalization of meta-learning that exploits the consequences of the NFL theorem: it tailors learning and optimization algorithms to problem distributions (i.e., data sets).
   Bottleneck with sacrificial data: generating data to obtain measurements from $\mathcal{F}^-$ is not trivial, and how much sacrificial data do you need to generate? The real problem: running 10-fold cross-validation can take up to 20 weeks for one data set!
24. Overview of Anti-Training with Sacrificial Functions
   Input: model $M$ with parameters $\theta$.
   1. Choose suitable $\mathcal{F}^+$ and $\mathcal{F}^-$. The choice of $\mathcal{F}^-$ is meant to act as a form of regularization.
   2. Select the type of optimization problem (single- or multi-objective).
   3. Optimize the objective(s) for $M$ with parameters $\theta$.
   Output: $\theta$
   Figure: high-level pseudo-code for model optimization with anti-training sacrificial functions and meta-learning.
25–27. Optimization Formalization: Objective Function Design
   The first step in our framework is to form the optimization problem. The theory of anti-training defines three partitions $(\mathcal{F}^+, \mathcal{F}^0, \mathcal{F}^-)$ of the functions that can be optimized in the NFL. Ignoring $\mathcal{F}^0$, we partition the optimization task over $\mathcal{F}^+ \cup \mathcal{F}^-$.
   Single objective (optimize with simulated annealing):
   $\theta^* = \arg\min_{\theta \in \Theta} \left\{ \sum_{f \in \mathcal{F}^+} f(\theta, \mathcal{D}) + \beta \sum_{f' \in \mathcal{F}^-} f'(\theta, \mathcal{D}) \right\}$
   where $\beta \ge 0$ and $f \ge 0$ for all $f \in \mathcal{F}^+ \cup \mathcal{F}^-$.
   Multi-objective (optimize with a genetic algorithm, e.g., NSGA-II):
   $\theta^* = \arg\min_{\theta \in \Theta} \left( \{f(\theta, \mathcal{D})\}_{f \in \mathcal{F}^+}, \; \{f'(\theta, \mathcal{D})\}_{f' \in \mathcal{F}^-} \right)$
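A minimal sketch of the single-objective (scalarized) form, assuming SciPy's dual_annealing as the simulated-annealing solver and an RBF-kernel SVM as in the benchmarks later in the deck; the search bounds, $\beta = 1$, and the synthetic data are illustrative assumptions, and $f^-_{\text{err}}$ is measured on randomly relabeled data as described on the next slides.

```python
import numpy as np
from scipy.optimize import dual_annealing
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=1)
rng = np.random.default_rng(1)
y_shuffled = rng.permutation(y)  # g(D): randomly relabeled outputs

def objective(log_theta, beta=1.0):
    # theta = (regularizer C, kernel bandwidth gamma), searched in log-space.
    C, gamma = np.exp(log_theta)
    clf = SVC(C=C, gamma=gamma)
    # F+ term: cross-validated error on the real labels.
    f_plus = 1.0 - cross_val_score(clf, X, y, cv=3).mean()
    # F- term: error on g(D) should sit near chance (0.5 for balanced binary).
    err_sac = 1.0 - cross_val_score(clf, X, y_shuffled, cv=3).mean()
    f_minus = abs(0.5 - err_sac)
    return f_plus + beta * f_minus

res = dual_annealing(objective, bounds=[(-5, 5), (-5, 5)], maxiter=50, seed=1)
print("theta* (C, gamma) =", np.exp(res.x))
```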
28–30. Designing Beneficial and Sacrificial Functions
   Beneficial functions: the goal of $\mathcal{F}^+$ is to have a class of functions on which the classifier should perform well for a data set $\mathcal{D}$. These are the easier ones to design, since many experiments already use such functions in their benchmarks. Examples for $\mathcal{F}^+$ include error, AUC, sensitivity, and specificity.
   Sacrificial functions: the goal of $\mathcal{F}^-$ is to have a class of functions on which the classifier should not perform well for a data set $\mathcal{D}$. An error measured on a classifier trained/tested with randomly labeled data should be approximately a random guess; a measurement close to 0 or 1 could suggest overfitting. E.g., a model should not perform well on
   $f^-_{\mathrm{sp}} = \left|\tfrac{1}{2} - \mathrm{spec}(\theta, g(\mathcal{D}))\right|, \quad f^-_{\mathrm{se}} = \left|\tfrac{1}{2} - \mathrm{sens}(\theta, g(\mathcal{D}))\right|, \quad f^-_{\mathrm{err}} = \left|\tfrac{1}{2} - \mathrm{err}(\theta, g(\mathcal{D}))\right|$
   where $g(\mathcal{D})$ randomly relabels $\mathcal{D}$.
   Why this form for $f^-_{\mathrm{sp}}$, $f^-_{\mathrm{se}}$, and $f^-_{\mathrm{err}}$? Because they enter the minimization $\arg\min_{\theta \in \Theta} \left( \{f^+(\theta, \mathcal{D})\}_{f^+ \in \mathcal{F}^+}, \{f^-(\theta, \mathcal{D})\}_{f^- \in \mathcal{F}^-} \right)$, so each $f^-$ is smallest when performance on the relabeled data is at chance.
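A sketch of how the three sacrificial measurements might be computed, assuming $g(\mathcal{D})$ permutes the labels and binary classes; whether the sacrificial score is measured by resubstitution (as here) or by cross-validation on the relabeled data is a design choice the slides leave open.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

def sacrificial_scores(clf, X, y, rng):
    """Return (f_sp^-, f_se^-, f_err^-) measured on relabeled data g(D).

    Binary labels assumed; resubstitution measurement on g(D), so a model
    flexible enough to memorize random labels scores near |1/2 - 0| = 1/2.
    """
    y_g = rng.permutation(y)                 # g(D): shuffle the labels
    y_hat = clf.fit(X, y_g).predict(X)
    tn, fp, fn, tp = confusion_matrix(y_g, y_hat).ravel()
    spec = tn / (tn + fp)                    # specificity on g(D)
    sens = tp / (tp + fn)                    # sensitivity on g(D)
    err = (fp + fn) / len(y_g)               # error on g(D)
    return abs(0.5 - spec), abs(0.5 - sens), abs(0.5 - err)

X, y = make_classification(n_samples=300, n_features=10, random_state=2)
print(sacrificial_scores(SVC(gamma=1.0), X, y, np.random.default_rng(2)))
```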
31. Multi-Objective Optimization (MOO): Pareto Optimality & Evolutionary Algorithms
   The goal in MOO is to simultaneously optimize two or more conflicting objective functions. An analytical solution that produces the entire Pareto front may be intractable. Evolutionary algorithms (EAs) can search for multiple Pareto solutions concurrently. NSGA-II is an elitist EA that reduces computational complexity and adds diversity preservation; diversity preservation ensures the EA maintains a good spread of solutions along the generated Pareto front.
   [Figure: two conflicting objectives $f_1(x)$ and $f_2(x)$ with a trade-off region between their minima, and the corresponding Pareto front plotted in the $(f_1, f_2)$ plane.]
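For orientation, a minimal NSGA-II sketch assuming the third-party pymoo library (pip install pymoo); the two toy objectives (Schaffer's classic problem) stand in for the $\mathcal{F}^+/\mathcal{F}^-$ objectives, which would replace _evaluate in the ATSF setting.

```python
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.core.problem import ElementwiseProblem
from pymoo.optimize import minimize

class TwoConflicting(ElementwiseProblem):
    """Two conflicting objectives: no single x minimizes both."""

    def __init__(self):
        super().__init__(n_var=1, n_obj=2, xl=-2.0, xu=4.0)

    def _evaluate(self, x, out, *args, **kwargs):
        f1 = x[0] ** 2              # pulls x toward 0
        f2 = (x[0] - 2.0) ** 2      # pulls x toward 2
        out["F"] = [f1, f2]

res = minimize(TwoConflicting(), NSGA2(pop_size=40), ("n_gen", 50), seed=1)
print(res.F[:5])                    # sample points along the Pareto front
```

NSGA-II returns the whole front at once, which is the "front of solutions" advantage the conclusions return to.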
32. ATSF: Optimizing Your Classifier as a Black Box
   [Diagram: data $\mathcal{D} := \{(x_i, y_i)\}_{i=1}^{n}$ feed ATSF model optimization, which wraps classifier $M$ as a black box and outputs $\theta$. $\mathcal{F}^+$: error, F-score, sensitivity, MCC. $\mathcal{F}^-$: sacrificial error, sacrificial F-score.]
33–41. Benchmark Comparison: Parameter Optimization Strategies
   The support vector machine was selected as the base classifier to evaluate the impact of anti-training with sacrificial functions. We optimize over the regularizer and kernel bandwidth (RBF kernel).
   None: $\mathcal{F}^+ = \{\mathrm{err}\}$ and $\mathcal{F}^- = \emptyset$
   F1: $\mathcal{F}^+ = \{\mathrm{err}\}$ and $\mathcal{F}^- = \{f^-_{\mathrm{err}}\}$
   F2: $\mathcal{F}^+ = \{1-\mathrm{sen}, 1-\mathrm{spe}, \mathrm{err}\}$ and $\mathcal{F}^- = \{f^-_{\mathrm{err}}\}$
   F3: $\mathcal{F}^+ = \{1-\mathrm{sen}, 1-\mathrm{spe}, \mathrm{err}\}$ and $\mathcal{F}^- = \{f^-_{\mathrm{sp}}, f^-_{\mathrm{se}}, f^-_{\mathrm{err}}\}$
   Fmin: the parameter search technique suggested in Matlab's documentation
   TPOT: searches through an entire pipeline of different preprocessing stages, then optimizes different classifiers
   Experimental protocol (a code sketch follows below):
   1. The data set is randomly partitioned into training (80%) and testing (20%) data.
   2. The training data are used to optimize the parameters of the SVM with the meta-search algorithm; an SVM is then learned from the training data.
   3. The error and F-score are measured on the testing data. The F-score is computed for each class, and we report the average.
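The protocol sketch referenced above, in scikit-learn terms; the grid search here stands in for the ATSF meta-search (None/F1/F2/F3), and the synthetic data and grid values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 1) 80/20 train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# 2) Search over the regularizer C and RBF bandwidth gamma on the training
#    data only; ATSF's meta-search would replace this grid search.
grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5).fit(X_tr, y_tr)

# 3) Report the test error and the per-class-averaged (macro) F-score.
y_hat = search.best_estimator_.predict(X_te)
print("error  :", (y_hat != y_te).mean())
print("F-score:", f1_score(y_te, y_hat, average="macro"))
```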
42–43. Statistical Analysis for Significance Testing: Comparing Multiple Classifiers Across Multiple Data Sets
   We follow Demšar's recommendations for comparing multiple classifiers across multiple data sets. The algorithms are ranked from 1 to $k$ according to their error rates on each data set, and the Friedman test checks the hypothesis that the ranks are uniformly distributed.
   If the Friedman test rejects, the post-hoc Nemenyi test checks for statistical differences using multiple comparisons. The average ranks of algorithms A and B are used to calculate
   $z(A, B) = \dfrac{\bar{R}_A - \bar{R}_B}{\sqrt{\dfrac{k(k+1)}{6N}}}$
   where $\bar{R}_A$ and $\bar{R}_B$ are the average ranks of algorithms A and B, $k$ is the number of algorithms tested, and $N$ is the number of data sets. $z(A, B)$ is approximately standard normal.
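A small worked sketch of this pipeline with SciPy; the error matrix is fabricated toy data for illustration only, not the paper's results.

```python
import numpy as np
from scipy.stats import friedmanchisquare, norm, rankdata

# Toy error rates: rows = data sets (N), columns = algorithms (k).
errors = np.array([[0.23, 0.22, 0.21, 0.26],
                   [0.05, 0.04, 0.03, 0.04],
                   [0.30, 0.27, 0.27, 0.32],
                   [0.15, 0.15, 0.14, 0.17]])
N, k = errors.shape

# Friedman test over the k algorithms (one sample per algorithm).
print(friedmanchisquare(*errors.T))

ranks = np.apply_along_axis(rankdata, 1, errors)  # ranks 1..k per data set
R = ranks.mean(axis=0)                            # average rank per algorithm

def nemenyi_z(a, b):
    """z(A, B) = (R_A - R_B) / sqrt(k(k+1) / (6N)); z > 0 means a ranks worse."""
    return (R[a] - R[b]) / np.sqrt(k * (k + 1) / (6 * N))

z = nemenyi_z(3, 2)
print("z =", z, "one-sided p =", norm.sf(z))
```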
44. Figure of Merit: Classification Error (%; rank in parentheses)
   Data Set | Samples | Features | None | F1 | F2 | F3 | Fmin | TPOT
   blood | 748 | 4 | 23.4 (4) | 23.07 (3) | 21.73 (1) | 22.87 (2) | 26.6 (5) | 38.25
   breast-cancer-wisc-diag | 569 | 30 | 4.39 (5) | 3.42 (4) | 3.33 (3) | 2.89 (2) | 2.63 (1) | 3.33
   breast-cancer-wisc-prog | 198 | 33 | 23.5 (2) | 25 (4) | 23.25 (1) | 24 (3) | 28.25 (5) | 32.43
   breast-cancer-wisc | 699 | 9 | 4.93 (5) | 4.93 (4) | 4.5 (2) | 4.71 (3) | 3.93 (1) | 2.63
   breast-cancer | 286 | 9 | 33.28 (5) | 27.24 (1.5) | 27.24 (1.5) | 27.41 (3) | 32.41 (4) | 38.61
   congressional-voting | 435 | 16 | 39.2 (4) | 38.86 (3) | 36.82 (2) | 36.14 (1) | 43.64 (5) | 45.06
   conn-bench | 208 | 60 | 13.1 (3) | 12.38 (1) | 12.86 (2) | 13.81 (4) | 17.14 (5) | 17.12
   credit-approval | 690 | 15 | 15.47 (4) | 15.4 (3) | 14.82 (2) | 14.17 (1) | 16.47 (5) | 13.84
   cylinder-bands | 512 | 35 | 21.17 (2.5) | 21.17 (2.5) | 21.55 (4) | 20 (1) | 28.83 (5) | 22.2
   echocardiogram | 131 | 10 | 29.63 (5) | 24.44 (2) | 26.67 (4) | 25.19 (3) | 21.11 (1) | 26.67
   haberman-survival | 306 | 3 | 28.06 (5) | 24.03 (3) | 23.55 (2) | 22.42 (1) | 26.77 (4) | 42.31
   heart-hungarian | 294 | 12 | 23.9 (5) | 17.29 (1) | 19.15 (3) | 18.98 (2) | 19.66 (4) | 21.19
   hepatitis | 155 | 19 | 16.56 (3) | 17.5 (4) | 16.25 (2) | 15 (1) | 21.56 (5) | 26.9
   ionosphere | 351 | 33 | 6.62 (3) | 6.9 (4) | 5.77 (2) | 4.79 (1) | 10.02 (5) | 6.76
   mammographic | 961 | 5 | 20.88 (5) | 20.62 (4) | 16.89 (1) | 19.79 (3) | 18.5 (2) | 18.19
   molec-biol-promoter | 106 | 57 | 15 (3.5) | 13.64 (1) | 15 (3.5) | 14.55 (2) | 24.24 (5) | 10.19
   musk-1 | 476 | 166 | 9.17 (4) | 8.33 (1) | 8.75 (2) | 8.85 (3) | 9.58 (5) | 9.37
   oocytes merluccius | 1022 | 41 | 16.44 (4) | 15.95 (3) | 15.41 (2) | 15.32 (1) | 20.87 (5) | 22.75
   oocytes trisopterus | 912 | 25 | 15.3 (4) | 14.15 (2) | 14.1 (1) | 14.21 (3) | 17.05 (5) | 19.02
   ozone | 2536 | 72 | 3.25 (2) | 3.17 (1) | 3.29 (4) | 3.27 (3) | 21.63 (5) | 27.58
   parkinsons | 195 | 22 | 8.5 (1) | 11.75 (4) | 10.75 (2.5) | 10.75 (2.5) | 12.25 (5) | 9.87
   pima | 768 | 8 | 28.18 (5) | 26.04 (2) | 26.23 (3) | 25.97 (1) | 26.75 (4) | 28.24
   planning | 182 | 12 | 25.95 (1.5) | 27.3 (3) | 27.57 (4) | 25.95 (1.5) | 29.46 (5) | 49.57
   ringnorm | 7400 | 20 | 1.73 (3) | 1.78 (4) | 1.65 (2) | 1.63 (1) | 2.11 (5) | 1.47
   spectf | 80 | 44 | 24.12 (4) | 18.82 (1) | 21.18 (2.5) | 21.18 (2.5) | 28.82 (5) | 24.01
   statlog-australian-credit | 690 | 14 | 33.45 (2) | 34.6 (5) | 33.81 (3) | 34.24 (4) | 32.59 (1) | 47.14
   statlog-german-credit | 1000 | 24 | 27.66 (4) | 27.61 (3) | 27.21 (2) | 26.92 (1) | 27.86 (5) | 32.11
   statlog-heart | 270 | 13 | 22.73 (5) | 21.27 (4) | 20.18 (2) | 20.91 (3) | 17.45 (1) | 18.05
   titanic | 2201 | 3 | 27.94 (5) | 26.49 (3) | 25.71 (2) | 24.97 (1) | 27.1 (4) | 29.85
   vertebral | 310 | 6 | 16.67 (2.5) | 16.67 (2.5) | 16.83 (4) | 15.4 (1) | 19.68 (5) | 17.13
   average rank (without TPOT) | – | – | 3.7 | 2.8 | 2.4 | 2.1 | 4.1 | –
   average rank (with TPOT) | – | – | 4.0 | 3.1 | 2.6 | 2.3 | 4.5 | 4.5
45. Figure of Merit: F-score (%; rank in parentheses)
   Data Set | Samples | Features | None | F1 | F2 | F3 | Fmin | TPOT
   blood | 748 | 4 | 47.4 (3) | 43.76 (5) | 53.37 (2) | 44.59 (4) | 56.44 (1) | 62.27
   breast-cancer-wisc-diag | 569 | 30 | 95.31 (5) | 96.31 (4) | 96.42 (3) | 96.9 (2) | 97.14 (1) | 96.86
   breast-cancer-wisc-prog | 198 | 33 | 56.66 (5) | 59.98 (3) | 65.58 (2) | 66.33 (1) | 58.36 (4) | 67.05
   breast-cancer-wisc | 699 | 9 | 94.57 (5) | 94.58 (4) | 95.06 (2) | 94.86 (3) | 95.64 (1) | 97.01
   breast-cancer | 286 | 9 | 49.91 (2) | 45.56 (4) | 49.53 (3) | 45.52 (5) | 53.44 (1) | 60.56
   congressional-voting | 435 | 16 | 46.43 (3) | 47.92 (1) | 42.47 (5) | 43.27 (4) | 46.58 (2) | 49.08
   conn-bench | 208 | 60 | 86.07 (3) | 86.89 (1) | 86.32 (2) | 85.35 (4) | 82.4 (5) | 82.65
   credit-approval | 690 | 15 | 84.27 (4) | 84.31 (3) | 84.93 (2) | 85.66 (1) | 83.29 (5) | 85.88
   cylinder-bands | 512 | 35 | 76.85 (3) | 77.13 (2) | 76.75 (4) | 78.4 (1) | 66.89 (5) | 78.46
   echocardiogram | 131 | 10 | 65.9 (5) | 71.03 (1) | 68.72 (4) | 70.65 (3) | 70.98 (2) | 74.25
   haberman-survival | 306 | 3 | 54.38 (3) | 47.67 (5) | 53.57 (4) | 57.11 (2) | 57.16 (1) | 57.66
   heart-hungarian | 294 | 12 | 74.09 (5) | 81.34 (1) | 79.49 (3) | 79.71 (2) | 76.45 (4) | 78.62
   hepatitis | 155 | 19 | 60.73 (5) | 70.7 (2) | 72.63 (1) | 70.65 (3) | 61.24 (4) | 71.25
   ionosphere | 351 | 33 | 92.5 (3) | 92.13 (4) | 93.58 (2) | 94.72 (1) | 83.7 (5) | 93.46
   mammographic | 961 | 5 | 77.78 (4) | 77.56 (5) | 82.98 (1) | 79.41 (3) | 81.4 (2) | 81.76
   molec-biol-promoter | 106 | 57 | 84.64 (3) | 85.96 (1) | 84.63 (4) | 85.06 (2) | 73.98 (5) | 89.99
   musk-1 | 476 | 166 | 90.49 (4) | 91.37 (1) | 90.92 (2) | 90.82 (3) | 88.07 (5) | 90.74
   oocytes merluccius | 1022 | 41 | 82.21 (4) | 82.84 (3) | 83.35 (2) | 83.49 (1) | 74.35 (5) | 78.14
   oocytes trisopterus | 912 | 25 | 84.2 (4) | 85.38 (2) | 85.42 (1) | 85.3 (3) | 82.49 (5) | 81.26
   ozone | 2536 | 72 | 51.45 (4) | 54.33 (3) | 60.44 (1) | 58.3 (2) | 46.42 (5) | 50.88
   parkinsons | 195 | 22 | 87.66 (1) | 84.43 (4) | 85.89 (2) | 85.77 (3) | 83.84 (5) | 90.57
   pima | 768 | 8 | 69.58 (4) | 72.08 (1) | 71.69 (3) | 72 (2) | 66.12 (5) | 72.12
   planning | 182 | 12 | 47.08 (5) | 48.6 (4) | 51.87 (2) | 52.35 (1) | 48.99 (3) | 47.48
   ringnorm | 7400 | 20 | 98.27 (3) | 98.22 (4) | 98.34 (2) | 98.37 (1) | 97.88 (5) | 98.53
   spectf | 80 | 44 | 74.44 (4) | 79.57 (1) | 77.39 (2) | 76.98 (3) | 69.75 (5) | 75.27
   statlog-australian-credit | 690 | 14 | 40.11 (5) | 45.64 (2) | 42.72 (3) | 47.3 (1) | 40.61 (4) | 52.44
   statlog-german-credit | 1000 | 24 | 63.86 (5) | 68.26 (4) | 68.88 (2) | 68.68 (3) | 69.58 (1) | 67.35
   statlog-heart | 270 | 13 | 76.99 (5) | 78.38 (4) | 79.33 (2) | 78.57 (3) | 81.67 (1) | 82.18
   titanic | 2201 | 3 | 54.91 (5) | 60.65 (3) | 63.46 (2) | 63.92 (1) | 58.23 (4) | 71.32
   vertebral | 310 | 6 | 81.73 (4) | 82.07 (2) | 81.92 (3) | 83.3 (1) | 78.56 (5) | 81.96
   average rank (without TPOT) | – | – | 3.9 | 2.8 | 2.4 | 2.3 | 3.5 | –
   average rank (with TPOT) | – | – | 4.8 | 3.5 | 3.0 | 2.9 | 4.4 | 2.4
46. Statistical Analysis: Classification Error and F-Score
   Table: Nemenyi pairwise comparisons for statistically significant improvements in classification error and F-score. Each entry can be interpreted as a p-value; if the p-value is less than α/4, the row algorithm outperforms the column algorithm with statistical significance.
   Classification Error:
   | None | F1 | F2 | F3 | Fmin
   None | – | 0.9973 | 0.9998 | 0.9999 | 0.8364
   F1 | 2.75×10^−3 | – | 0.8155 | 0.8896 | 0.0362
   F2 | 1.19×10^−4 | 0.1846 | – | 0.6280 | 3.52×10^−3
   F3 | 3.15×10^−5 | 0.1103 | 0.3719 | – | 1.25×10^−3
   Fmin | 0.1636 | 0.9638 | 0.9965 | 0.9987 | –
   F-score:
   | None | F1 | F2 | F3 | Fmin
   None | – | 0.9876 | 0.9993 | 0.9999 | 0.1846
   F1 | 0.0124 | – | 0.8261 | 0.9638 | 8.34×10^−4
   F2 | 7.25×10^−4 | 0.1739 | – | 0.8044 | 2.22×10^−5
   F3 | 2.65×10^−5 | 0.0362 | 0.1956 | – | 3.91×10^−7
   Fmin | 0.8155 | 0.9992 | 0.9999 | 1.0000 | –
47. Statistical Analysis for Significance Testing
   Table: Nemenyi pairwise comparisons for statistically significant improvements in evaluation time.
   | None | F1 | F2 | F3 | Fmin | TPOT
   None | – | 0.47249 | 0.12038 | 0.33942 | 1 | 0.99644
   F1 | 0.52751 | – | 0.13477 | 0.36503 | 1 | 0.99711
   F2 | 0.87962 | 0.86523 | – | 0.7761 | 1 | 0.99994
   F3 | 0.66058 | 0.63497 | 0.2239 | – | 1 | 0.99905
   Fmin | 9.61×10^−7 | 6.81×10^−7 | 1.47×10^−9 | 1.13×10^−7 | – | 0.01921
   TPOT | 0.004 | 0.003 | 5.56×10^−5 | 0.0009 | 0.98078 | –
48–49. Conclusions
   Summary of anti-training with sacrificial functions: including sacrificial objectives in the optimization task tends to improve generalization. MOO provides a front of solutions rather than a single solution, which gives more information to the end user.
   Current + future work: recent experiments with decision trees, k-NN, and logistic regression confirm the findings with the support vector machine. Solving the MOO is still very time-consuming; one approach would be to implement heuristics to speed up the parameter optimization.
50. References
   G. Ditzler, M. Valenzuela, and J. Rozenblit, "Learning What We Don't Care About: Regularization with Sacrificial Functions," in preparation, 2017.
   J. Ethridge, G. Ditzler, and R. Polikar, "Optimal ν-SVM parameter estimation using multi-objective evolutionary algorithms," IEEE Congress on Evolutionary Computation, 2010.
   M. Valenzuela and J. Rozenblit, "Learning Using Anti-Training with Sacrificial Data," Journal of Machine Learning Research, 17(24):1–42, 2016.
51. That is all, folks!