Slide 1

Non-parametric Statistical Tests

Eskandar Alaa ([email protected])
Alireza Nourian ([email protected])

Slide 2

Parametric Tests

- Assumptions
  - Independence
  - Normality
  - Homoscedasticity (homogeneity of variance)
- e.g. the t-test

Slide 3

Test Case

- Test problems
  - 5 unimodal functions
  - 20 multimodal functions
- Algorithms
  - PSO
  - IPOP-CMA-ES
  - CHC
  - SSGA
  - SS-arit & SS-BLX
  - DE-Exp & DE-Bin
  - SaDE

Slide 4

Average error in benchmark functions

- All the algorithms have been run 50 times on each test function.
- Each run stops either when the error obtained is less than 10⁻⁸ or when the maximum number of evaluations (100,000) is reached.

Slide 5

Comparisons

- Pairwise comparison (1×1)
- Multiple comparisons with a control method (1×N)
- Multiple comparisons among all methods (N×N)

Slide 6

Sign test (Pairwise)

- H0: both algorithms beat each other an equal number of times
  - Number of wins ~ N(n/2, √n/2)
- H1: otherwise
- z specifies the rejection boundary
- Example
  - Over n = 25 problems (Table 4)
    - α = 0.05 ⇒ 18 wins rejects H0
    - α = 0.1 ⇒ 17 wins rejects H0
  - SaDE vs. PSO
    - 20 wins and 5 losses for SaDE ⇒ improvement with 0.95 confidence
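The decision rule above reduces to a one-line critical-count check. A minimal sketch; the values 1.96 and 1.645 are the standard two-sided normal critical z values for α = 0.05 and α = 0.1 (they are not stated on the slide):

```python
import math

def sign_test_min_wins(n, z):
    """Smallest number of wins that rejects H0 under the normal
    approximation: wins ~ N(n/2, sqrt(n)/2)."""
    return math.ceil(n / 2 + z * math.sqrt(n) / 2)

# Over n = 25 problems, reproducing the slide's critical counts:
print(sign_test_min_wins(25, 1.96))   # alpha = 0.05 -> 18
print(sign_test_min_wins(25, 1.645))  # alpha = 0.10 -> 17
```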

Slide 7

T-Test (Parametric)

- H0: the two sets of data are not significantly different from each other
- The test statistic (the difference of the sets) follows a t-distribution
- t-distribution
  - The distribution of the location of the true mean relative to the sample mean, divided by the sample standard deviation (i.e., of their difference)
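For contrast with the non-parametric alternatives on the next slides, the paired t statistic is just the mean difference scaled by its standard error. A minimal sketch; the sample data are invented for illustration:

```python
import math
import statistics

def paired_t_statistic(a, b):
    """t statistic of the paired t-test: the mean of the paired
    differences divided by its standard error; compared against a
    t-distribution with n - 1 degrees of freedom."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(n))

print(round(paired_t_statistic([2, 4, 6, 8], [1, 2, 3, 4]), 3))  # 3.873
```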

Slide 8

Wilcoxon test (Pairwise)

- Analogous to the paired t-test, without the normality assumption
- Do the two samples represent different performances?
  - We only sample the real performance of each algorithm
- H0: T = min(R⁺, R⁻) follows the Wilcoxon distribution
  - R⁺ = Σ_{dᵢ>0} rank(dᵢ) + ½ Σ_{dᵢ=0} rank(dᵢ)
  - R⁻ = Σ_{dᵢ<0} rank(dᵢ) + ½ Σ_{dᵢ=0} rank(dᵢ)
- Example
  - SaDE vs. PSO
    - R⁺ = 261, R⁻ = 64 ⇒ p-value = 0.00673

  PSO        SaDE       Difference   Rank
  1.23e-04   8.42e-09    1.23e-04    1
  2.60e-02   8.21e-09    2.59e-02    2
  2.49e+00   8.09e-09    2.49e+00    3
  4.10e+02   8.64e-09    4.09e+02    4
  5.10e+02   1.74e+03   -1.23e+03    5
  5.17e+04   6.56e+03    4.52e+04    6

  R⁺ = 16, R⁻ = 5
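The rank sums in the six-function table can be reproduced directly. A minimal sketch; it assigns integer ranks by absolute difference and does not average-rank ties in |dᵢ| (none occur here):

```python
def wilcoxon_rank_sums(diffs):
    """Compute R+ and R- from a list of paired differences.
    Ranks are assigned by absolute value; a zero difference splits
    its rank evenly between R+ and R-."""
    ranked = sorted(enumerate(diffs), key=lambda t: abs(t[1]))
    r_plus = r_minus = 0.0
    for rank, (_, d) in enumerate(ranked, start=1):
        if d > 0:
            r_plus += rank
        elif d < 0:
            r_minus += rank
        else:
            r_plus += rank / 2
            r_minus += rank / 2
    return r_plus, r_minus

# Differences (PSO - SaDE) for the six functions in the table
diffs = [1.23e-04, 2.59e-02, 2.49e+00, 4.09e+02, -1.23e+03, 4.52e+04]
print(wilcoxon_rank_sums(diffs))  # (16.0, 5.0)
```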

Slide 9

Multiple comparisons

- Chaining pairwise comparisons inflates the error:

  p = P(Reject H0 | H0 true)
    = 1 − P(Accept H0 | H0 true)
    = 1 − P(Accept Aᵢ = A₀, i = 1..k−1 | H0 true)
    = 1 − ∏_{i=1}^{k−1} P(Accept Aᵢ = A₀ | H0 true)
    = 1 − (1 − α)^{k−1}

- e.g. α = 0.05, k = 9 ⇒ p-value = 0.34 (terrible!)
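The final expression above is easy to evaluate, assuming the k − 1 pairwise tests are independent:

```python
def family_wise_error(alpha, k):
    """Probability of at least one false rejection when k - 1
    independent pairwise tests are each run at level alpha."""
    return 1 - (1 - alpha) ** (k - 1)

print(round(family_wise_error(0.05, 9), 2))  # 0.34
```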

Slide 10

Multiple sign test

- Tests the performance difference between the control algorithm and each of the others
- H0: P(xᵢ,ⱼ − xᵢ,₁ ≥ 0) = P(xᵢ,ⱼ − xᵢ,₁ ≤ 0) = 1/2
- rⱼ ≤ Rⱼ rejects H0
  - rⱼ: the number of differences xᵢ,ⱼ − xᵢ,₁ that carry the less frequently occurring sign
  - Rⱼ: critical value from the table for the Multiple Comparison Sign test
- Example
  - k = 9 and n = 25 ⇒ Rⱼ = 5
  - SaDE outperforms PSO and CHC
    - Only this result!
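Counting rⱼ is mechanical. A minimal sketch; the sample data are invented, and the critical value Rⱼ still has to be looked up in the table:

```python
def less_frequent_sign_count(control, other):
    """r_j: the number of differences x_ij - x_i1 that carry the less
    frequently occurring sign. Reject H0 for this algorithm when
    r_j <= R_j (critical value from the sign-test table)."""
    diffs = [o - c for c, o in zip(control, other)]
    pos = sum(1 for d in diffs if d > 0)
    neg = sum(1 for d in diffs if d < 0)
    return min(pos, neg)

# 4 positive differences vs. 1 negative -> r_j = 1
print(less_frequent_sign_count([1, 1, 1, 1, 1], [2, 2, 0, 2, 2]))  # 1
```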

Slide 11

Friedman test (1×N)

- H0: the medians of the algorithms are equal
- χ²_F = (12n / (k(k+1))) · [ Σⱼ Rⱼ² − k(k+1)²/4 ]
- A value of χ²_F in the tail of χ²_{k−1} rejects H0
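The statistic above can be sketched from a raw results matrix. A minimal sketch, assuming lower error is better; comparing χ²_F against the χ²_{k−1} tail is left to a table or a stats library:

```python
def friedman_statistic(results):
    """Friedman chi-square for results[i][j] = error of algorithm j
    on problem i (lower is better). Ties within a problem get
    average ranks."""
    n, k = len(results), len(results[0])
    rank_sums = [0.0] * k
    for row in results:
        order = sorted(range(k), key=lambda j: row[j])
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1          # average rank for a tie group
            for m in range(i, j + 1):
                ranks[order[m]] = avg
            i = j + 1
        for jdx in range(k):
            rank_sums[jdx] += ranks[jdx]
    avg_ranks = [s / n for s in rank_sums]
    chi2 = (12 * n / (k * (k + 1))) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4
    )
    return chi2, avg_ranks

print(friedman_statistic([[1, 2, 3], [1, 2, 3]]))  # (4.0, [1.0, 2.0, 3.0])
```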

Slide 12

Friedman Aligned Rank test

- Addresses a weakness of the Friedman test on small problem sets
- Aligned observation: the cell value minus the mean of its problem (a measure of location); all kn aligned observations are then ranked together
- H0: the medians of the algorithms are equal
- T = (k−1) · [ Σⱼ R̂ⱼ² − (kn²/4)(kn+1)² ] / { kn(kn+1)(2kn+1)/6 − (1/k) Σᵢ R̂ᵢ² }
  - R̂ⱼ: rank total of algorithm j; R̂ᵢ: rank total of problem i
- A value of T in the tail of χ²_{k−1} rejects H0
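The alignment step and the statistic can be sketched together. A minimal sketch, assuming lower error is better and no ties among the aligned values:

```python
def aligned_friedman_statistic(results):
    """Friedman aligned-rank T for results[i][j] = error of algorithm j
    on problem i. Aligned value = cell minus its problem's mean; all
    k*n aligned values are ranked together (no tie handling here)."""
    n, k = len(results), len(results[0])
    aligned = [[results[i][j] - sum(results[i]) / k for j in range(k)]
               for i in range(n)]
    flat = sorted((aligned[i][j], i, j) for i in range(n) for j in range(k))
    rank = {}
    for pos, (_, i, j) in enumerate(flat, start=1):
        rank[(i, j)] = pos
    col = [sum(rank[(i, j)] for i in range(n)) for j in range(k)]  # rank totals per algorithm
    row = [sum(rank[(i, j)] for j in range(k)) for i in range(n)]  # rank totals per problem
    kn = k * n
    num = (k - 1) * (sum(c * c for c in col) - (k * n * n / 4) * (kn + 1) ** 2)
    den = kn * (kn + 1) * (2 * kn + 1) / 6 - sum(r * r for r in row) / k
    return num / den

print(aligned_friedman_statistic([[1, 2], [3, 5]]))  # 1.6
```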

Slide 13

Quade test

- The Friedman test considers all problems to be equal in terms of importance
- Each problem is ranked by its range: the difference between the largest and the smallest observations within that problem
  - The problem with the minimum range gets rank 1
- The problem rank Qᵢ serves as a weight:
  - Wᵢ,ⱼ = Qᵢ · rᵢ,ⱼ
  - Sᵢ,ⱼ = Qᵢ · [ rᵢ,ⱼ − (k+1)/2 ]
- T₃ = (n−1) · B / (A₂ − B), where A₂ = Σᵢ,ⱼ Sᵢ,ⱼ² and B = (1/n) Σⱼ Sⱼ²
- Under H0, T₃ follows the F-distribution with k−1 and (k−1)(n−1) degrees of freedom
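The weighting scheme can be sketched end to end. A minimal sketch, assuming lower error is better and no ties within a problem or among the problem ranges:

```python
def quade_statistic(results):
    """Quade T3 for results[i][j] = error of algorithm j on problem i.
    Assumes no ties, for brevity; T3 ~ F(k-1, (k-1)(n-1)) under H0."""
    n, k = len(results), len(results[0])
    # Within-problem ranks r[i][j] (1 = best, i.e. smallest error)
    r = []
    for row in results:
        order = sorted(range(k), key=lambda j: row[j])
        ranks = [0] * k
        for pos, j in enumerate(order, start=1):
            ranks[j] = pos
        r.append(ranks)
    # Problem weights Q_i: rank of each problem's range (1 = smallest)
    ranges = [max(row) - min(row) for row in results]
    q = [0] * n
    for pos, i in enumerate(sorted(range(n), key=lambda i: ranges[i]), start=1):
        q[i] = pos
    # S_ij = Q_i * (r_ij - (k + 1) / 2)
    s = [[q[i] * (r[i][j] - (k + 1) / 2) for j in range(k)] for i in range(n)]
    a2 = sum(v * v for row in s for v in row)
    col = [sum(s[i][j] for i in range(n)) for j in range(k)]
    b = sum(c * c for c in col) / n
    return (n - 1) * b / (a2 - b)

print(quade_statistic([[1, 2, 3], [2, 4, 6]]))  # 9.0
```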

Slide 14

Friedman tests of the algorithms: rank means

Slide 15

Post-hoc procedures

- A post-hoc test yields a p-value that determines the degree of rejection of each hypothesis
- Unadjusted p-values do not prevent the erroneous rejection of a true null hypothesis as comparisons accumulate
- Family-Wise Error Rate
  - FWER ≤ 1 − (1 − α)^{k−1}
  - e.g. α = 0.05, k = 9 ⇒ p-value = 0.34

Slide 16

Friedman tests: unadjusted p-values after post-hoc

- Friedman
  - z = (Rᵢ − Rⱼ) / √( k(k+1) / (6n) )
- Aligned
  - z = (R̂ᵢ − R̂ⱼ) / √( k(n+1) / 6 )
- Quade
  - z = (Tᵢ − Tⱼ) / √( k(k+1)(2n+1)(k−1) / (18n(n+1)) )
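The Friedman variant of the z statistic above is a one-liner over the mean ranks; the example values for Rᵢ, Rⱼ, k, and n are invented:

```python
import math

def friedman_posthoc_z(r_i, r_j, k, n):
    """z statistic comparing two mean Friedman ranks R_i and R_j
    over n problems and k algorithms; its p-value comes from the
    standard normal distribution."""
    return (r_i - r_j) / math.sqrt(k * (k + 1) / (6 * n))

print(friedman_posthoc_z(3, 1, 3, 8))  # 4.0
```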

Slide 17

Adjustments

- One-step: Bonferroni
  - APVᵢ = min{ m·pᵢ, 1 }, with m = k − 1
- Step-down: Holm
  - APVᵢ = min{ vᵢ, 1 }, vᵢ = max{ (m − j + 1)·pⱼ : 1 ≤ j ≤ i }
- Step-up: Hochberg
  - APVᵢ = max{ (m − j + 1)·pⱼ : i ≤ j ≤ m }
- Two-step: Li
  - APVᵢ = pᵢ / (pᵢ + 1 − pₘ), with pₘ the largest p-value
- …
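Holm's step-down scheme is the easiest of these to sketch. A minimal sketch over m hypotheses (on the slide, m = k − 1 comparisons against the control); the input p-values are invented:

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values (APVs): sort the p-values
    ascending, scale the j-th smallest by (m - j), enforce
    monotonicity with a running maximum, and cap at 1."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for j, i in enumerate(order):
        running_max = max(running_max, (m - j) * p_values[i])
        adjusted[i] = min(running_max, 1.0)
    return adjusted

print([round(v, 4) for v in holm_adjust([0.01, 0.04, 0.03])])  # [0.03, 0.06, 0.06]
```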

Slide 18

Contrast Estimation

- Z_{u,v}: the median of the performance differences between algorithms u and v
- m_u: the mean of Z_{u,v} over all v
- The contrast M_u − M_v is estimated by m_u − m_v
- Example
  - SaDE is the best
  - CHC is the worst
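The estimator above can be sketched directly from a results matrix. A minimal sketch, assuming results[i][j] holds the error of algorithm j on problem i; the sample data are invented:

```python
import statistics

def contrast_estimates(results):
    """Contrast estimation: Z[u][v] is the median over problems of the
    difference between algorithms u and v; m_u is the mean of row u;
    the contrast M_u - M_v is estimated by m_u - m_v."""
    n, k = len(results), len(results[0])
    z = [[statistics.median(results[i][u] - results[i][v] for i in range(n))
          for v in range(k)] for u in range(k)]
    m = [sum(z[u]) / k for u in range(k)]
    return [[m[u] - m[v] for v in range(k)] for u in range(k)]

c = contrast_estimates([[1, 3], [2, 6]])
print(c[0][1], c[1][0])  # -3.0 3.0  (algorithm 0 beats algorithm 1)
```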

Slide 19

Multiple comparisons (N×N)

- Not all combinations of true and false hypotheses are possible
  - e.g. "M1 better than M2, M1 same as M3, M2 same as M3" cannot all hold at once
- Adjustments
  - Shaffer's static
    - APVᵢ = min{ vᵢ, 1 }, vᵢ = max{ tⱼ·pⱼ : 1 ≤ j ≤ i }
    - tⱼ: the maximum number of hypotheses that can be true given that j − 1 hypotheses are false
  - Bergmann-Hommel
    - APVᵢ = min{ vᵢ, 1 }, vᵢ = max{ |I| · min{ pⱼ : j ∈ I } : I exhaustive, i ∈ I }
    - Finds all elementary hypotheses that cannot be rejected
    - Exhaustive set: a set of hypotheses that could all be true simultaneously

Slide 20

Recommendations

- The number of algorithms used in a multiple-comparison procedure must be lower than the number of case problems
  - Except for the Wilcoxon test
- …

Slide 21

Conclusions

- We can do better than just the average!
- How many comparisons are you looking for?
  - Pairwise comparison
  - Multiple comparison
- Does the level of significance matter to you?
  - Sign test
  - Rank test
- Problem difficulty
  - Quade test
- Taking relative algorithm comparisons into account
  - Post-hoc adjustments

Slide 22

Reference

- J. Derrac, S. García, D. Molina, and F. Herrera, "A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms," Swarm and Evolutionary Computation, vol. 1, no. 1, pp. 3–18, Mar. 2011.