
Non-parametric Statistical Tests

Alireza Nourian

April 13, 2013

  1. Test Case
     - Test problems: 5 unimodal functions, 20 multimodal functions
     - Algorithms: PSO, IPOP-CMA-ES, CHC, SSGA, SS-arit & SS-BLX, DE-Exp & DE-Bin, SaDE
  2. Average error on benchmark functions
     - Every algorithm was run 50 times on each test function.
     - Each run stops either when the error falls below 10^-8 or when the maximum number of evaluations (100,000) is reached.
  3. Comparisons
     - Pairwise comparison (1×1)
     - Multiple comparisons with a control method (1×N)
     - Multiple comparisons among all methods (N×N)
  4. Sign test (pairwise)
     - H0: both algorithms beat each other equally often
       - Number of wins ~ $N(n/2, \sqrt{n}/2)$
     - H1: otherwise
     - $z$ specifies the rejection boundary
     - Example: over 25 problems (Table 4)
       - $\alpha = 0.05$ ⇒ 18 wins rejects H0
       - $\alpha = 0.1$ ⇒ 17 wins rejects H0
     - SaDE vs. PSO: 20 wins and 5 losses for SaDE ⇒ improvement with 0.95 confidence
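The sign-test decision above can be checked with an exact binomial computation. A minimal sketch with the standard library only (the function name is my own, not from the referenced tutorial):

```python
from math import comb

def sign_test_p(wins: int, n: int) -> float:
    """Exact one-sided binomial p-value for the sign test: the
    probability of observing at least `wins` wins out of `n`
    comparisons when both algorithms are equally good (p = 1/2)."""
    return sum(comb(n, i) for i in range(wins, n + 1)) / 2 ** n

# SaDE vs. PSO from the slide: 20 wins out of 25 problems
p = sign_test_p(20, 25)
print(p)  # ≈ 0.002, well below alpha = 0.05, so H0 is rejected
```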
  5. T-test (parametric)
     - H0: the two sets of data are not significantly different from each other
     - The test statistic (on the differences between the sets) follows the t-distribution
     - t-distribution: the distribution of the location of the true mean relative to the sample mean, scaled by the sample standard deviation of the differences
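For contrast with the non-parametric tests that follow, the paired t statistic fits in a few lines. A sketch using only the standard library (names are my own):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(a: list[float], b: list[float]) -> float:
    """Paired t statistic: the mean of the pairwise differences divided
    by their standard error; under H0 it follows the t-distribution
    with n - 1 degrees of freedom."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / sqrt(len(d)))
```

The parametric assumption is visible in the formula itself: the statistic is only t-distributed if the differences are (approximately) normal.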
  6. Wilcoxon test (pairwise)
     - Analogous to the paired t-test, without the normality assumption
     - Do two samples represent different performances?
     - We only sample real algorithm performance
     - H0: $\min(R^+, R^-)$ follows the Wilcoxon distribution
     - $R^+ = \sum_{d_i > 0} \mathrm{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \mathrm{rank}(d_i)$
     - $R^- = \sum_{d_i < 0} \mathrm{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \mathrm{rank}(d_i)$
     - Example: SaDE vs. PSO over the 25 problems: $R^+ = 261$, $R^- = 64$ ⇒ p-value = 0.00673
     - Illustration on six of the problems:

       PSO        SaDE       Difference   Rank
       1.23e-04   8.42e-09    1.23e-04     1
       2.60e-02   8.21e-09    2.59e-02     2
       2.49e+00   8.09e-09    2.49e+00     3
       4.10e+02   8.64e-09    4.09e+02     4
       5.10e+02   1.74e+03   -1.23e+03     5
       5.17e+04   6.56e+03    4.52e+04     6

       $R^+ = 16$, $R^- = 5$
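The R+ / R- computation on the six-problem table can be sketched directly; this version omits average ranks for tied absolute differences, which a full implementation would need (function name is my own):

```python
def wilcoxon_ranks(diffs: list[float]) -> tuple[float, float]:
    """R+ and R- for the Wilcoxon signed-rank test: rank the absolute
    differences from smallest to largest, then sum the ranks of the
    positive and negative differences separately (zeros split evenly).
    No tie handling in this sketch."""
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    r_plus = r_minus = 0.0
    for rank, i in enumerate(order, start=1):
        if diffs[i] > 0:
            r_plus += rank
        elif diffs[i] < 0:
            r_minus += rank
        else:
            r_plus += rank / 2
            r_minus += rank / 2
    return r_plus, r_minus

# The six PSO - SaDE differences from the slide's table
diffs = [1.23e-04, 2.59e-02, 2.49e+00, 4.09e+02, -1.23e+03, 4.52e+04]
print(wilcoxon_ranks(diffs))  # (16.0, 5.0), matching the slide
```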
  7. Multiple comparisons
     - Multiple pairwise comparisons accumulate error:
       - $P(\text{accept } H_0^i) = 1 - \alpha$ for each comparison $i = 1 \ldots k-1$
       - $P(\text{accept all } H_0^i) = (1 - \alpha)^{k-1}$
       - $FWER = 1 - (1 - \alpha)^{k-1}$
     - e.g. $\alpha = 0.05$, $k = 9$ ⇒ FWER = 0.34 (terrible!)
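The slide's numeric example follows directly from the formula (a trivial sketch, assuming the k - 1 comparisons are independent):

```python
def fwer(alpha: float, k: int) -> float:
    """Family-wise error rate accumulated over k - 1 independent
    pairwise comparisons, each at significance level alpha."""
    return 1 - (1 - alpha) ** (k - 1)

print(round(fwer(0.05, 9), 2))  # 0.34 — the slide's example
```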
  8. Multiple sign test
     - Performance difference between a control algorithm and the others
     - H0: $P(x_{i,j} - x_{i,1} \ge 0) = P(x_{i,j} - x_{i,1} \le 0) = \frac{1}{2}$
     - $r_j \le R_j$ rejects H0
     - $r_j$: the number of differences $x_{i,j} - x_{i,1}$ carrying the less frequently occurring sign
     - $R_j$: critical value from the Multiple Comparison Sign test table
     - Example: $k = 9$ and $n = 25$ ⇒ $R_j = 5$
       - SaDE outperforms PSO and CHC — and only those
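Computing r_j is just a sign count; a minimal sketch (my own naming; zero differences are skipped here, though conventions for them vary):

```python
def less_frequent_sign_count(control: list[float], other: list[float]) -> int:
    """r_j for the multiple sign test: count the signs of the
    differences x_ij - x_i1 and return how many carry the less
    frequently occurring sign (zero differences skipped)."""
    diffs = [o - c for c, o in zip(control, other) if o != c]
    pos = sum(1 for d in diffs if d > 0)
    return min(pos, len(diffs) - pos)
```

The result is then compared against the tabulated critical value R_j (5 for k = 9, n = 25 in the slide's example); r_j ≤ R_j rejects H0.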
  9. Friedman test (1×N)
     - H0: the medians of the algorithms are equal
     - $\chi_F^2 = \frac{12n}{k(k+1)} \left[ \sum_j R_j^2 - \frac{k(k+1)^2}{4} \right]$, where $R_j$ is the average rank of algorithm $j$ over the $n$ problems
     - $\chi_F^2 > \chi_{k-1}^2$ rejects H0
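The Friedman statistic above can be sketched from a precomputed ranks matrix (n problems × k algorithms; this sketch assumes ranks are already assigned, with average ranks for ties):

```python
def friedman_stat(ranks: list[list[float]]) -> float:
    """Friedman chi-square from a ranks matrix (n problems x k algorithms):
    chi2_F = 12n / (k(k+1)) * [sum_j Rj^2 - k(k+1)^2 / 4],
    where Rj is the average rank of algorithm j over the n problems."""
    n, k = len(ranks), len(ranks[0])
    avg = [sum(row[j] for row in ranks) / n for j in range(k)]
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in avg) - k * (k + 1) ** 2 / 4)

# Three problems, three algorithms, identical ranking every time:
print(friedman_stat([[1.0, 2.0, 3.0]] * 3))  # 6.0 > 5.99 (chi2, 2 df, alpha=0.05)
```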
  10. Friedman Aligned Rank test
     - Addresses the Friedman test's weakness on small problem sets
     - Align: subtract from each cell the mean of its problem (its location), then rank all $kn$ aligned values together
     - H0: the medians of the algorithms are equal
     - $T = \dfrac{(k-1)\left[\sum_j \hat{R}_j^2 - \frac{kn^2}{4}(kn+1)^2\right]}{\frac{kn(kn+1)(2kn+1)}{6} - \frac{1}{k}\sum_i \hat{R}_i^2}$, where $\hat{R}_j$ and $\hat{R}_i$ are the aligned-rank totals of algorithm $j$ and problem $i$
     - $T > \chi_{k-1}^2$ rejects H0
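The alignment step can be sketched as follows: subtract each problem's mean, then rank all n·k aligned values jointly, with average ranks for ties (my own function name):

```python
def aligned_ranks(x: list[list[float]]) -> list[list[float]]:
    """Aligned observations for the Friedman aligned-rank test:
    subtract each problem's (row's) mean from every cell, then rank
    all n*k aligned values together (1 = smallest), giving tied
    values their average rank."""
    n, k = len(x), len(x[0])
    aligned = [v - sum(row) / k for row in x for v in row]
    order = sorted(range(n * k), key=lambda i: aligned[i])
    ranks = [0.0] * (n * k)
    i = 0
    while i < n * k:
        j = i
        while j + 1 < n * k and aligned[order[j + 1]] == aligned[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for m in range(i, j + 1):
            ranks[order[m]] = avg
        i = j + 1
    return [ranks[r * k:(r + 1) * k] for r in range(n)]

print(aligned_ranks([[1, 2], [3, 5]]))  # [[2.0, 3.0], [1.0, 4.0]]
```

Row and column totals of this matrix are the R̂ values plugged into the T statistic above.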
  11. Quade test
     - The Friedman test treats all problems as equally important
     - Problem rank $Q_i$: problems are ranked by sample range (the largest minus the smallest observation within the problem); rank 1 goes to the minimum range
     - Problem-weighted ranks: $W_{i,j} = Q_i \, r_{i,j}$, $S_{i,j} = Q_i \left( r_{i,j} - \frac{k+1}{2} \right)$
     - $T_3 = \frac{(n-1)\,B}{A - B}$, with $A = \sum_{i,j} S_{i,j}^2$ and $B = \frac{1}{n} \sum_j \left( \sum_i S_{i,j} \right)^2$
     - H0: $T_3 \sim F_{k-1,\,(k-1)(n-1)}$
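A sketch of the T3 computation from within-problem ranks and sample ranges (my own naming; ties in the ranges and the degenerate case A = B, where T3 is conventionally taken as infinite, are not handled):

```python
def quade_stat(ranks_within: list[list[float]], ranges: list[float]) -> float:
    """Quade T3 from within-problem ranks r_ij and per-problem sample
    ranges: problems are ranked by range into Q_i,
    S_ij = Q_i * (r_ij - (k+1)/2), A = sum of S_ij^2,
    B = (1/n) * sum over columns of (column sum of S)^2,
    T3 = (n-1) * B / (A - B)."""
    n, k = len(ranks_within), len(ranks_within[0])
    q = [0] * n
    for pos, i in enumerate(sorted(range(n), key=lambda i: ranges[i]), start=1):
        q[i] = pos  # rank 1 = smallest range
    s = [[q[i] * (ranks_within[i][j] - (k + 1) / 2) for j in range(k)]
         for i in range(n)]
    a = sum(v * v for row in s for v in row)
    b = sum(sum(s[i][j] for i in range(n)) ** 2 for j in range(k)) / n
    return (n - 1) * b / (a - b)
```

The statistic weights agreement on high-range (hard-to-separate) problems more heavily, which is exactly the slide's point about problem importance.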
  12. Post-hoc procedures
     - A post-hoc test yields a p-value that quantifies the degree of rejection of each hypothesis
     - Unadjusted p-values ignore the accumulated family-wise error rate:
       - $FWER \le 1 - (1 - \alpha)^{k-1}$
       - e.g. $\alpha = 0.05$, $k = 9$ ⇒ FWER = 0.34
     - Adjusted p-values prevent rejecting true null hypotheses by chance
  13. Unadjusted p-values after the post-hoc tests
     - Friedman: $z = \dfrac{R_i - R_j}{\sqrt{k(k+1)/(6n)}}$
     - Aligned: $z = \dfrac{\hat{R}_i - \hat{R}_j}{\sqrt{k(n+1)/6}}$
     - Quade: $z = \dfrac{T_i - T_j}{\sqrt{k(k+1)(2n+1)(k-1)/\left(18n(n+1)\right)}}$
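The Friedman variant of these z statistics, plus the conversion to a two-sided normal p-value, can be sketched as (function names are my own):

```python
from math import erfc, sqrt

def friedman_posthoc_z(avg_ranks: list[float], i: int, j: int, n: int) -> float:
    """Unadjusted z statistic comparing algorithms i and j after a
    Friedman test: z = (Ri - Rj) / sqrt(k(k+1) / (6n)), where Ri, Rj
    are average ranks over the n problems."""
    k = len(avg_ranks)
    return (avg_ranks[i] - avg_ranks[j]) / sqrt(k * (k + 1) / (6 * n))

def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard-normal z statistic."""
    return erfc(abs(z) / sqrt(2))
```

These p-values are "unadjusted": they still need one of the corrections on the next slide before being compared to alpha.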
  14. Adjustments
     - One-step, Bonferroni: $p^{adj}_i = \min(v, 1)$, $v = (k-1)\,p_i$
     - Step-down, Holm: $p^{adj}_i = \min(v, 1)$, $v = \max\{(k-j)\,p_j : 1 \le j \le i\}$
     - Step-up, Hochberg: $p^{adj}_i = \max\{(k-j)\,p_j : (k-1) \ge j \ge i\}$
     - Two-step, Li: $p^{adj}_i = \dfrac{p_i}{p_i + 1 - p_{k-1}}$
     - …
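The Holm step-down procedure is easy to sketch: sort the p-values ascending, scale the j-th smallest by (k - j), and take the running maximum (my own function name; with k - 1 hypotheses the multipliers are k - 1, k - 2, …, 1):

```python
def holm_adjust(pvalues: list[float]) -> list[float]:
    """Holm step-down adjusted p-values for m = k - 1 hypotheses:
    sort ascending, multiply the j-th smallest by (m - j + 1),
    enforce monotonicity with a running maximum, cap at 1."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running = 0.0
    for step, i in enumerate(order):  # multiplier m - step = k - j
        running = max(running, (m - step) * pvalues[i])
        adjusted[i] = min(running, 1.0)
    return adjusted
```

For example, holm_adjust([0.01, 0.04, 0.03]) scales the smallest p-value by 3 and lets the other two inherit the running maximum.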
  15. Contrast Estimation
     - $Z_{u,v}$: the median of the performance differences between algorithms $u$ and $v$
     - $m_u$: the mean of $Z_{u,v}$ over all $v$
     - Estimator of the contrast: $M_u - M_v = m_u - m_v$
     - Example: SaDE best, CHC worst
  16. Multiple comparisons (N×N)
     - Not all combinations of true and false hypotheses are possible
       - e.g. "M1 better than M2", "M1 same as M3" and "M2 same as M3" cannot all hold
     - Adjustments
       - Shaffer's static: $p^{adj}_i = \min(v, 1)$, $v = \max\{t_j\,p_j : 1 \le j \le i\}$
         - $t_j$: the maximum number of hypotheses that can be true given that $j-1$ hypotheses are false
       - Bergmann–Hommel: $p^{adj}_i = \min(v, 1)$, $v = \max\{|I| \cdot \min\{p_j : j \in I\} : I \text{ exhaustive}, i \in I\}$
         - Finds all elementary hypotheses that cannot be rejected
         - Exhaustive set: a set of hypotheses that could all be true simultaneously
  17. Recommendations
     - In multiple-comparison procedures, the number of algorithms must be lower than the number of case problems
       - except for the Wilcoxon test
     - …
  18. Conclusions
     - We can do better than just averaging!
     - How many comparisons are you looking for?
       - Pairwise comparison
       - Multiple comparison
     - Do you care about the level of significance?
       - Sign test
       - Rank test
     - Problem difficulty ⇒ Quade test
     - Taking relative algorithm comparisons into account ⇒ post-hoc adjustments
  19. Reference
     - J. Derrac, S. García, D. Molina, and F. Herrera, "A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms," Swarm and Evolutionary Computation, vol. 1, no. 1, pp. 3–18, Mar. 2011.