
Procedures and Scores for MSLA forecasting


A brief overview of the Procedures and Scores to evaluate and compare the performance of MSLA forecasts

Nicolas Fauchereau

February 20, 2014


Transcript

  1. Outline
•  Reminder on the Dataset
•  Procedures
   –  What is overfitting?
   –  Cross-validation
   –  Development vs evaluation set
•  Metrics
   –  Regression
   –  Classification
•  Learning Algorithms
   –  Regression
   –  Classification
   –  A short note on Ensemble Learning
  2. Dataset
•  Guam, Honolulu, Kwajalein, Malakal, Moturiki, Naha, Pago Pago and Rarotonga
•  1980-2010 (31 years)

Months without data, per station and calendar month (the underlying records run from January 1 1979 through December 31 2011 and are in mm):

Station       J   F   M   A   M   J   J   A   S   O   N   D   Station Total
Guam **       4   3   3   4   3   5   5   3   3   2   2   2   39
Honolulu      0   1   1   1   0   0   0   0   0   1   0   0   4
Kwajalein     1   1   1   0   1   0   0   1   0   0   2   0   7
Malakal       0   0   0   0   2   2   1   1   1   2   0   0   9
Moturiki      3   3   0   0   1   1   2   3   1   1   0   1   16
Naha          0   0   0   0   0   0   1   2   0   0   0   0   3
Pago Pago     3   1   1   0   2   2   3   1   2   2   1   2   20
Rarotonga     4   2   2   1   0   2   2   1   2   1   1   1   19
Month Total   15  11  8   6   9   12  14  12  9   9   6   6

** Guam is missing 14 consecutive months from Dec 1997 through Jan 1999
  3. Procedures

What is overfitting?

A model is 'overfitting' when it learns the data it has been exposed to 'by heart' but is not able to generalize to yet unseen data.

Scores (e.g. skill scores) obtained over the data that has been used to train the model then tell us (almost) nothing about the actual performance of the model in production …

Cross-validation

A way to work around that is to train the model over a subset of the available data (the training set), calculate the train score, and test the model (i.e. calculate the test score) over the remainder of the data (the test set).

Cross-validation consists in repeating this operation several times using successive splits of the original dataset into training and test sets, and calculating a summary statistic (usually the average) of the train and test scores over the iterations.

Several splitting strategies can be used:

•  Random split: a given percentage of the data is selected at random (with replacement)
•  K-folds: the dataset is divided into K exhaustive splits; each split is used in turn as the test set, while the remaining K-1 splits are used as the training set
•  Stratified K-folds: for classification mainly. The folds are made so that the class distribution is approximately the same in each fold (e.g. the relative frequency of each class is preserved)
•  Leave One Out (LOO) does what it says: it is like K-fold with K equal to the number of observations
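As an illustration, a minimal cross-validation sketch using scikit-learn; the random data, the shapes and the plain linear regression are placeholders, not the actual MSLA predictors or forecast models:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Placeholder data: 120 monthly samples, 5 predictors (purely synthetic)
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 5))
y = rng.normal(size=120)

model = LinearRegression()

# K-fold: the dataset is divided into K folds, each fold serving once as the test set
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
kfold_scores = cross_val_score(model, X, y, cv=kfold, scoring="neg_mean_squared_error")

# Leave One Out: like K-fold with K equal to the number of observations
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="neg_mean_squared_error")

print("5-fold mean test MSE:", -kfold_scores.mean())
print("LOO mean test MSE:", -loo_scores.mean())
```

Swapping `KFold` for `LeaveOneOut` (or `StratifiedKFold` for classification) is all it takes to change the splitting strategy.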
  4. Procedures

What you really want is to have both train and test scores high!

[Figures: a validation curve (train/test score as a function of γ for an SVM) and a learning curve (train/test score as a function of training set size; not applicable for LOO)]
  5. Procedures

The case for a development vs evaluation set

Even cross-validated test scores can give an unrealistic estimate of the ability of the model to generalize if the distribution changes with time (non-stationarity).

This non-stationarity, however, is something that we could expect in the context of climate change / sea-level rise.

It can be partly taken into account by including a 'trend' predictor.

But I would argue to:

1.  Hold out ~2 years of data (2009-2010) as an evaluation set.
2.  Continue evaluating the models in real time to detect any trend in the models' performance.
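A minimal sketch of such a hold-out split, assuming the monthly series live in a pandas DataFrame; the column name, the synthetic values and the exact cut-off date are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series over 1980-2010, standing in for the station data
index = pd.date_range("1980-01-01", "2010-12-01", freq="MS")
data = pd.DataFrame({"msla": np.random.default_rng(0).normal(size=len(index))},
                    index=index)

# Development set: everything up to 2008, used for cross-validation / model selection
development = data.loc[:"2008-12-31"]
# Evaluation set: the last ~2 years (2009-2010), held out and looked at only once
evaluation = data.loc["2009-01-01":]

print(len(development), "months for development,", len(evaluation), "months for evaluation")
```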
  6. Metrics

What do we want?

–  A skill score: how much better than a 'null' forecast (random, climatology) are we doing?
–  Some requirements:
   •  For regression: penalize large errors
   •  For classification (probabilities attached to each of 5 classes): take into account the ranking of the probabilities (the categories are ordinal)
  7. Metrics – Regression: the Mean Square Error (MSE)

The MSE is the average squared difference between the forecast and observation pairs:

$\mathrm{MSE} = \frac{1}{n}\sum_{k=1}^{n}(y_k - o_k)^2$

Often the RMSE is given, calculated as the square root of the MSE.

To translate either the MAE or the MSE into a skill measure, one can simply calculate the MAE or MSE of a climatology-based forecast. This skill score, for the MSE, is the Mean Square Skill Score:

$\mathrm{MSSS} = 1 - \frac{\mathrm{MSE}}{\mathrm{MSE}_{\mathrm{clim}}}$
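A small sketch of both quantities; the helper names are made up, and the 'climatology' forecast is assumed here to be simply the mean of the observations:

```python
import numpy as np

def mse(forecast, observed):
    """Mean Square Error: average squared forecast-observation difference."""
    forecast, observed = np.asarray(forecast, dtype=float), np.asarray(observed, dtype=float)
    return np.mean((forecast - observed) ** 2)

def msss(forecast, observed, climatology=None):
    """Mean Square Skill Score: 1 - MSE / MSE of a climatology-based forecast."""
    observed = np.asarray(observed, dtype=float)
    if climatology is None:
        # Simplest 'null' forecast: the mean of the observations
        climatology = np.full_like(observed, observed.mean())
    return 1.0 - mse(forecast, observed) / mse(climatology, observed)

# Toy example (hypothetical values)
obs = np.array([10.0, 12.0, 8.0, 15.0, 11.0])
fcst = np.array([11.0, 11.0, 9.0, 14.0, 12.0])
print("MSE:", mse(fcst, obs))
print("MSSS vs climatology:", msss(fcst, obs))
```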
  8. Metrics – Classification: the Ranked Probability Skill Score (RPSS)

In simple words, the Ranked Probability Score (RPS) is the squared difference between the cumulative probabilities of the forecast (over 5 categories) and the cumulative probabilities of the observations (with 1 for the observed category):

$\mathrm{RPS} = \sum_{m=1}^{5}\left(\sum_{j=1}^{m} y_j - \sum_{j=1}^{m} o_j\right)^2$

Advantages:
•  Takes into account uncertainty in the forecast
   –  i.e. 'flat' (non-committal) forecasts are penalized more
•  Takes into account the ordinal nature of the predictand's categories
   –  i.e. for an 'above' observed category, a 'well-below' forecast results in a lower score than a 'normal' forecast

This is the equation for a single forecast-observation pair; for a collection of forecasts over a given time period, one simply averages the RPS values over the forecast-observation pairs:

$\overline{\mathrm{RPS}} = \frac{1}{n}\sum_{k=1}^{n}\mathrm{RPS}_k$

The skill score can be computed as usual as:

$\mathrm{RPSS} = 1 - \frac{\overline{\mathrm{RPS}}}{\overline{\mathrm{RPS}}_{\mathrm{clim}}}$

Some references:
Daniel S. Wilks: Statistical Methods in the Atmospheric Sciences. International Geophysics Series, Academic Press, 627 p. See in particular the chapter on forecast verification, pp. 255-335.
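A small sketch of the RPS and RPSS over 5 ordered categories; the function names, the equiprobable climatology and the toy probabilities are illustrative assumptions:

```python
import numpy as np

def rps(forecast_probs, observed_category):
    """Ranked Probability Score for a single forecast-observation pair.

    forecast_probs: probabilities over the ordered categories (e.g. 5 classes)
    observed_category: index of the observed category (0-based)
    """
    forecast_probs = np.asarray(forecast_probs, dtype=float)
    observed = np.zeros_like(forecast_probs)
    observed[observed_category] = 1.0
    # Squared differences between cumulative forecast and observed probabilities
    return np.sum((np.cumsum(forecast_probs) - np.cumsum(observed)) ** 2)

def rpss(forecasts, observations, climatology_probs):
    """RPSS = 1 - mean(RPS) / mean(RPS of a climatology-based forecast)."""
    rps_fcst = np.mean([rps(f, o) for f, o in zip(forecasts, observations)])
    rps_clim = np.mean([rps(climatology_probs, o) for o in observations])
    return 1.0 - rps_fcst / rps_clim

# Toy example with 5 equiprobable climatological categories (hypothetical values)
clim = np.full(5, 0.2)
fcsts = [[0.1, 0.2, 0.4, 0.2, 0.1], [0.05, 0.1, 0.2, 0.3, 0.35]]
obs = [2, 4]
print("RPSS:", rpss(fcsts, obs, clim))
```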
  9. The learning algorithms
•  Regression:
   –  Multiple Linear Regression
      •  Pros
         –  Easy to implement and interpret
      •  Cons
         –  Linear
         –  Predictors must be independent (uncorrelated, e.g. EOFs; see the sketch below)
   –  Multivariate Adaptive Regression Splines (MARS)
      •  Pros
         –  (linear) combination of non-linear (hinge) functions
         –  Represents interactions between features (predictors)
         –  Pruning ~= feature selection
      •  Cons
         –  Harder to interpret
   –  Neural Networks
      •  Pros
         –  Sounds good!
      •  Cons
         –  Black boxes
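To illustrate the 'uncorrelated predictors' requirement for Multiple Linear Regression, here is a sketch that chains a PCA (standing in for an EOF decomposition) with a linear regression in scikit-learn; the data and the number of retained components are placeholders:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Placeholder data: 120 samples, 10 deliberately correlated predictors
rng = np.random.default_rng(2)
X = rng.normal(size=(120, 10))
X[:, 5:] += X[:, :5]          # introduce correlation between predictors
y = rng.normal(size=120)

# PCA plays the role of an EOF decomposition: the retained components are
# mutually uncorrelated, which suits Multiple Linear Regression
model = make_pipeline(PCA(n_components=4), LinearRegression())
model.fit(X, y)
print("R^2 on the training data:", model.score(X, y))
```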
  10. The learning algorithms
•  Classification:
   –  Logistic Regression
      •  Pros
         –  Simple, fast
      •  Cons
         –  Extension to multi-class (not binary) problems
   –  Support Vector Machines
      •  Pros
         –  One of the best 'out-of-the-box' classifiers
         –  Kernels: very flexible
      •  Cons
         –  Beware of overfitting
         –  Parametrization (optimizing C and γ) can be computationally intensive (grid search; see the sketch below)
   –  Linear Discriminant Analysis (Rashed?)
      •  Pros
         –  Simple, fast
      •  Cons
         –  Linear, less flexible
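A minimal sketch of the C / γ grid search mentioned above, using scikit-learn's GridSearchCV with stratified K-fold cross-validation; the random data and the grid values are placeholders:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Placeholder data: 200 samples, 5 predictors, 5 ordinal classes (purely synthetic)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 5, size=200)

# Grid search over C and gamma, scored by stratified K-fold cross-validation
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid,
                      cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best cross-validated score:", search.best_score_)
```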
  11. The learning algorithms

Ensemble Learning

In short, Ensemble Learning techniques use multiple models to obtain better predictive performance than could be obtained from any of the constituent models.

Bootstrap aggregating, often abbreviated as bagging, involves having each model in the ensemble vote with equal weight. In order to promote model variance, bagging trains each model in the ensemble using a randomly drawn subset of the training set.

The Random Forest algorithm combines random decision trees with bagging to achieve very high classification accuracy.

Often the winning algorithm, no matter what the domain, in Machine Learning competitions (see https://www.kaggle.com/competitions)
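A minimal Random Forest sketch with scikit-learn, just to make the bagging idea concrete; the random data and the number of trees are placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data: 200 samples, 5 predictors, 5 classes (purely synthetic)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 5, size=200)

# Each tree is trained on a bootstrap sample of the training data (bagging);
# the forest combines the trees' votes by averaging their class probabilities
forest = RandomForestClassifier(n_estimators=500, bootstrap=True, random_state=1)
scores = cross_val_score(forest, X, y, cv=5)
print("Cross-validated accuracy:", scores.mean())
```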