
Non-parametric Statistical Tests

Alireza Nourian

April 13, 2013

  1. Test Case
     - Test problems: 5 unimodal functions, 20 multimodal functions
     - Algorithms: PSO, IPOP-CMA-ES, CHC, SSGA, SS-arit & SS-BLX, DE-Exp & DE-Bin, SaDE
  2. Average error on benchmark functions
     - Every algorithm was run 50 times on each test function.
     - Each run stops either when the error falls below 10^-8 or when the maximum number of evaluations (100,000) is reached.
  3. Comparisons
     - Pairwise comparison (1×1)
     - Multiple comparisons with a control method (1×N)
     - Multiple comparisons among all methods (N×N)
  4. Sign test (pairwise)
     - H0: both algorithms beat each other equally often
       - Number of wins ~ $N(n/2, \sqrt{n}/2)$
     - H1: otherwise
     - $z$ specifies the rejection boundary
     - Example: over 25 problems (Table 4)
       - $\alpha = 0.05$ ⇒ 18 wins rejects H0
       - $\alpha = 0.1$ ⇒ 17 wins rejects H0
     - SaDE vs. PSO: 20 wins and 5 losses for SaDE ⇒ improvement with 0.95 confidence
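The sign-test decision above can be checked with an exact binomial computation. A minimal sketch with the standard library only (the function name is my own, not from the referenced tutorial):

```python
from math import comb

def sign_test_p(wins: int, n: int) -> float:
    """Exact one-sided binomial p-value for the sign test: the
    probability of observing at least `wins` wins out of `n`
    comparisons when both algorithms are equally good (p = 1/2)."""
    return sum(comb(n, i) for i in range(wins, n + 1)) / 2 ** n

# SaDE vs. PSO from the slide: 20 wins out of 25 problems
p = sign_test_p(20, 25)
print(p)  # ≈ 0.002, well below alpha = 0.05, so H0 is rejected
```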
  5. T-test (parametric)
     - H0: the two sets of data are not significantly different from each other
     - The test statistic (on the differences between the sets) follows the t-distribution
     - t-distribution: the distribution of the location of the true mean relative to the sample mean, scaled by the sample standard deviation of the differences
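For contrast with the non-parametric tests that follow, the paired t statistic fits in a few lines. A sketch using only the standard library (names are my own):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(a: list[float], b: list[float]) -> float:
    """Paired t statistic: the mean of the pairwise differences divided
    by their standard error; under H0 it follows the t-distribution
    with n - 1 degrees of freedom."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / sqrt(len(d)))
```

The parametric assumption is visible in the formula itself: the statistic is only t-distributed if the differences are (approximately) normal.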
  6. Wilcoxon test (pairwise)
     - Analogous to the paired t-test, without the normality assumption
     - Do two samples represent different performances?
     - We only sample real algorithm performance
     - H0: $\min(R^+, R^-)$ follows the Wilcoxon distribution
     - $R^+ = \sum_{d_i > 0} \mathrm{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \mathrm{rank}(d_i)$
     - $R^- = \sum_{d_i < 0} \mathrm{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \mathrm{rank}(d_i)$
     - Example: SaDE vs. PSO over the 25 problems: $R^+ = 261$, $R^- = 64$ ⇒ p-value = 0.00673
     - Illustration on six of the problems:

       PSO        SaDE       Difference   Rank
       1.23e-04   8.42e-09    1.23e-04     1
       2.60e-02   8.21e-09    2.59e-02     2
       2.49e+00   8.09e-09    2.49e+00     3
       4.10e+02   8.64e-09    4.09e+02     4
       5.10e+02   1.74e+03   -1.23e+03     5
       5.17e+04   6.56e+03    4.52e+04     6

       $R^+ = 16$, $R^- = 5$
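The R+ / R- computation on the six-problem table can be sketched directly; this version omits average ranks for tied absolute differences, which a full implementation would need (function name is my own):

```python
def wilcoxon_ranks(diffs: list[float]) -> tuple[float, float]:
    """R+ and R- for the Wilcoxon signed-rank test: rank the absolute
    differences from smallest to largest, then sum the ranks of the
    positive and negative differences separately (zeros split evenly).
    No tie handling in this sketch."""
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    r_plus = r_minus = 0.0
    for rank, i in enumerate(order, start=1):
        if diffs[i] > 0:
            r_plus += rank
        elif diffs[i] < 0:
            r_minus += rank
        else:
            r_plus += rank / 2
            r_minus += rank / 2
    return r_plus, r_minus

# The six PSO - SaDE differences from the slide's table
diffs = [1.23e-04, 2.59e-02, 2.49e+00, 4.09e+02, -1.23e+03, 4.52e+04]
print(wilcoxon_ranks(diffs))  # (16.0, 5.0), matching the slide
```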
  7. Multiple comparisons
     - Multiple pairwise comparisons accumulate error:
       - $P(\text{accept } H_0^i) = 1 - \alpha$ for each comparison $i = 1 \ldots k-1$
       - $P(\text{accept all } H_0^i) = (1 - \alpha)^{k-1}$
       - $FWER = 1 - (1 - \alpha)^{k-1}$
     - e.g. $\alpha = 0.05$, $k = 9$ ⇒ FWER = 0.34 (terrible!)
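The slide's numeric example follows directly from the formula (a trivial sketch, assuming the k - 1 comparisons are independent):

```python
def fwer(alpha: float, k: int) -> float:
    """Family-wise error rate accumulated over k - 1 independent
    pairwise comparisons, each at significance level alpha."""
    return 1 - (1 - alpha) ** (k - 1)

print(round(fwer(0.05, 9), 2))  # 0.34 — the slide's example
```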
  8. Multiple sign test
     - Performance difference between a control algorithm and the others
     - H0: $P(x_{i,j} - x_{i,1} \ge 0) = P(x_{i,j} - x_{i,1} \le 0) = \frac{1}{2}$
     - $r_j \le R_j$ rejects H0
     - $r_j$: the number of differences $x_{i,j} - x_{i,1}$ carrying the less frequently occurring sign
     - $R_j$: critical value from the Multiple Comparison Sign test table
     - Example: $k = 9$ and $n = 25$ ⇒ $R_j = 5$
       - SaDE outperforms PSO and CHC — and only those
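Computing r_j is just a sign count; a minimal sketch (my own naming; zero differences are skipped here, though conventions for them vary):

```python
def less_frequent_sign_count(control: list[float], other: list[float]) -> int:
    """r_j for the multiple sign test: count the signs of the
    differences x_ij - x_i1 and return how many carry the less
    frequently occurring sign (zero differences skipped)."""
    diffs = [o - c for c, o in zip(control, other) if o != c]
    pos = sum(1 for d in diffs if d > 0)
    return min(pos, len(diffs) - pos)
```

The result is then compared against the tabulated critical value R_j (5 for k = 9, n = 25 in the slide's example); r_j ≤ R_j rejects H0.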
  9. Friedman test (1×N)
     - H0: the medians of the algorithms are equal
     - $\chi_F^2 = \frac{12n}{k(k+1)} \left[ \sum_j R_j^2 - \frac{k(k+1)^2}{4} \right]$, where $R_j$ is the average rank of algorithm $j$ over the $n$ problems
     - $\chi_F^2 > \chi_{k-1}^2$ rejects H0
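The Friedman statistic above can be sketched from a precomputed ranks matrix (n problems × k algorithms; this sketch assumes ranks are already assigned, with average ranks for ties):

```python
def friedman_stat(ranks: list[list[float]]) -> float:
    """Friedman chi-square from a ranks matrix (n problems x k algorithms):
    chi2_F = 12n / (k(k+1)) * [sum_j Rj^2 - k(k+1)^2 / 4],
    where Rj is the average rank of algorithm j over the n problems."""
    n, k = len(ranks), len(ranks[0])
    avg = [sum(row[j] for row in ranks) / n for j in range(k)]
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in avg) - k * (k + 1) ** 2 / 4)

# Three problems, three algorithms, identical ranking every time:
print(friedman_stat([[1.0, 2.0, 3.0]] * 3))  # 6.0 > 5.99 (chi2, 2 df, alpha=0.05)
```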
  10. Friedman Aligned Rank test
     - Addresses the Friedman test's weakness on small problem sets
     - Align: subtract from each cell the mean of its problem (its location), then rank all $kn$ aligned values together
     - H0: the medians of the algorithms are equal
     - $T = \dfrac{(k-1)\left[\sum_j \hat{R}_j^2 - \frac{kn^2}{4}(kn+1)^2\right]}{\frac{kn(kn+1)(2kn+1)}{6} - \frac{1}{k}\sum_i \hat{R}_i^2}$, where $\hat{R}_j$ and $\hat{R}_i$ are the aligned-rank totals of algorithm $j$ and problem $i$
     - $T > \chi_{k-1}^2$ rejects H0
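The alignment step can be sketched as follows: subtract each problem's mean, then rank all n·k aligned values jointly, with average ranks for ties (my own function name):

```python
def aligned_ranks(x: list[list[float]]) -> list[list[float]]:
    """Aligned observations for the Friedman aligned-rank test:
    subtract each problem's (row's) mean from every cell, then rank
    all n*k aligned values together (1 = smallest), giving tied
    values their average rank."""
    n, k = len(x), len(x[0])
    aligned = [v - sum(row) / k for row in x for v in row]
    order = sorted(range(n * k), key=lambda i: aligned[i])
    ranks = [0.0] * (n * k)
    i = 0
    while i < n * k:
        j = i
        while j + 1 < n * k and aligned[order[j + 1]] == aligned[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for m in range(i, j + 1):
            ranks[order[m]] = avg
        i = j + 1
    return [ranks[r * k:(r + 1) * k] for r in range(n)]

print(aligned_ranks([[1, 2], [3, 5]]))  # [[2.0, 3.0], [1.0, 4.0]]
```

Row and column totals of this matrix are the R̂ values plugged into the T statistic above.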
  11. Quade test
     - The Friedman test treats all problems as equally important
     - Problem rank $Q_i$: problems are ranked by sample range (the largest minus the smallest observation within the problem); rank 1 goes to the minimum range
     - Problem-weighted ranks: $W_{i,j} = Q_i \, r_{i,j}$, $S_{i,j} = Q_i \left( r_{i,j} - \frac{k+1}{2} \right)$
     - $T_3 = \frac{(n-1)\,B}{A - B}$, with $A = \sum_{i,j} S_{i,j}^2$ and $B = \frac{1}{n} \sum_j \left( \sum_i S_{i,j} \right)^2$
     - H0: $T_3 \sim F_{k-1,\,(k-1)(n-1)}$
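A sketch of the T3 computation from within-problem ranks and sample ranges (my own naming; ties in the ranges and the degenerate case A = B, where T3 is conventionally taken as infinite, are not handled):

```python
def quade_stat(ranks_within: list[list[float]], ranges: list[float]) -> float:
    """Quade T3 from within-problem ranks r_ij and per-problem sample
    ranges: problems are ranked by range into Q_i,
    S_ij = Q_i * (r_ij - (k+1)/2), A = sum of S_ij^2,
    B = (1/n) * sum over columns of (column sum of S)^2,
    T3 = (n-1) * B / (A - B)."""
    n, k = len(ranks_within), len(ranks_within[0])
    q = [0] * n
    for pos, i in enumerate(sorted(range(n), key=lambda i: ranges[i]), start=1):
        q[i] = pos  # rank 1 = smallest range
    s = [[q[i] * (ranks_within[i][j] - (k + 1) / 2) for j in range(k)]
         for i in range(n)]
    a = sum(v * v for row in s for v in row)
    b = sum(sum(s[i][j] for i in range(n)) ** 2 for j in range(k)) / n
    return (n - 1) * b / (a - b)
```

The statistic weights agreement on high-range (hard-to-separate) problems more heavily, which is exactly the slide's point about problem importance.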
  12. Post-hoc procedures
     - A post-hoc test yields a p-value that quantifies the degree of rejection of each hypothesis
     - Unadjusted p-values ignore the accumulated family-wise error rate:
       - $FWER \le 1 - (1 - \alpha)^{k-1}$
       - e.g. $\alpha = 0.05$, $k = 9$ ⇒ FWER = 0.34
     - Adjusted p-values prevent rejecting true null hypotheses by chance
  13. Unadjusted p-values after the post-hoc tests
     - Friedman: $z = \dfrac{R_i - R_j}{\sqrt{k(k+1)/(6n)}}$
     - Aligned: $z = \dfrac{\hat{R}_i - \hat{R}_j}{\sqrt{k(n+1)/6}}$
     - Quade: $z = \dfrac{T_i - T_j}{\sqrt{k(k+1)(2n+1)(k-1)/\left(18n(n+1)\right)}}$
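The Friedman variant of these z statistics, plus the conversion to a two-sided normal p-value, can be sketched as (function names are my own):

```python
from math import erfc, sqrt

def friedman_posthoc_z(avg_ranks: list[float], i: int, j: int, n: int) -> float:
    """Unadjusted z statistic comparing algorithms i and j after a
    Friedman test: z = (Ri - Rj) / sqrt(k(k+1) / (6n)), where Ri, Rj
    are average ranks over the n problems."""
    k = len(avg_ranks)
    return (avg_ranks[i] - avg_ranks[j]) / sqrt(k * (k + 1) / (6 * n))

def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard-normal z statistic."""
    return erfc(abs(z) / sqrt(2))
```

These p-values are "unadjusted": they still need one of the corrections on the next slide before being compared to alpha.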
  14. Adjustments
     - One-step, Bonferroni: $p^{adj}_i = \min(v, 1)$, $v = (k-1)\,p_i$
     - Step-down, Holm: $p^{adj}_i = \min(v, 1)$, $v = \max\{(k-j)\,p_j : 1 \le j \le i\}$
     - Step-up, Hochberg: $p^{adj}_i = \max\{(k-j)\,p_j : (k-1) \ge j \ge i\}$
     - Two-step, Li: $p^{adj}_i = \dfrac{p_i}{p_i + 1 - p_{k-1}}$
     - …
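The Holm step-down procedure is easy to sketch: sort the p-values ascending, scale the j-th smallest by (k - j), and take the running maximum (my own function name; with k - 1 hypotheses the multipliers are k - 1, k - 2, …, 1):

```python
def holm_adjust(pvalues: list[float]) -> list[float]:
    """Holm step-down adjusted p-values for m = k - 1 hypotheses:
    sort ascending, multiply the j-th smallest by (m - j + 1),
    enforce monotonicity with a running maximum, cap at 1."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running = 0.0
    for step, i in enumerate(order):  # multiplier m - step = k - j
        running = max(running, (m - step) * pvalues[i])
        adjusted[i] = min(running, 1.0)
    return adjusted
```

For example, holm_adjust([0.01, 0.04, 0.03]) scales the smallest p-value by 3 and lets the other two inherit the running maximum.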
  15. Contrast Estimation
     - $Z_{u,v}$: the median of the performance differences between algorithms $u$ and $v$
     - $m_u$: the mean of $Z_{u,v}$ over all $v$
     - Estimator of the contrast: $M_u - M_v = m_u - m_v$
     - Example: SaDE best, CHC worst
  16. Multiple comparisons (N×N)
     - Not all combinations of true and false hypotheses are possible
       - e.g. "M1 better than M2", "M1 same as M3" and "M2 same as M3" cannot all hold
     - Adjustments
       - Shaffer's static: $p^{adj}_i = \min(v, 1)$, $v = \max\{t_j\,p_j : 1 \le j \le i\}$
         - $t_j$: the maximum number of hypotheses that can be true given that $j-1$ hypotheses are false
       - Bergmann–Hommel: $p^{adj}_i = \min(v, 1)$, $v = \max\{|I| \cdot \min\{p_j : j \in I\} : I \text{ exhaustive}, i \in I\}$
         - Finds all elementary hypotheses that cannot be rejected
         - Exhaustive set: a set of hypotheses that could all be true simultaneously
  17. Recommendations
     - In multiple-comparison procedures, the number of algorithms must be lower than the number of case problems
       - except for the Wilcoxon test
     - …
  18. Conclusions
     - We can do better than just averaging!
     - How many comparisons are you looking for?
       - Pairwise comparison
       - Multiple comparison
     - Do you care about the level of significance?
       - Sign test
       - Rank test
     - Problem difficulty ⇒ Quade test
     - Taking relative algorithm comparisons into account ⇒ post-hoc adjustments
  19. Reference
     - J. Derrac, S. García, D. Molina, and F. Herrera, "A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms," Swarm and Evolutionary Computation, vol. 1, no. 1, pp. 3–18, Mar. 2011.