
Building a Collaborative Ensemble to Forecast Influenza in the US

Nicholas G Reich
August 29, 2017

Slides for presentation at CDC, August 2017.

Transcript

  1. Building a Collaborative Ensemble to Forecast Influenza in the US

     Nicholas G. Reich, for the FluSight Network Collaborative Ensemble
     August 2017, CDC FluSight meeting
     https://flusightnetwork.github.io/cdc-flusight-ensemble/
  2. Contributors so far

     Evan Ray, Abhinav Tushar, Nicholas Reich; Logan Brooks, Roni
     Rosenfeld, and others; Teresa Yamana, Jeff Shaman, and others
     (Shaman Group @ Columbia)
  3. To what extent can we improve flu prediction by combining our
     forecasts?
  4. US + 10 HHS Regions - all forecasts

     Team             Avg Log Score   exp{avg log score}
     Delphi-Epicast       -0.796            0.451
     Delphi-Stat          -0.824            0.438
     UnwghtAvg            -0.845            0.430
     CU2                  -0.862            0.422
     LANL                 -0.864            0.422
     CU3                  -0.867            0.420
     CU1                  -0.869            0.420
     CU4                  -0.883            0.413
     HumNat               -0.940            0.391
     KOT-Dev              -0.963            0.382
     KOT-Stable           -0.989            0.372
     PSI                  -1.155            0.315
     Yale2                -1.215            0.297
     Yale1                -1.224            0.294
     KBSI                 -1.266            0.282
     ICS                  -1.318            0.268
     Hist-Avg             -1.439            0.237
     ISU                  -1.572            0.208
     TeamC-UCSF1          -1.685            0.185
     NEU                  -1.872            0.154
     HumNat2              -1.893            0.151
     TeamA-JL1            -2.325            0.098

     In 2016-2017, the CDC implemented a simple average model. It did
     really well. Was that luck? (n=1) Can we make a strong, scientific
     case for this, or another, ensemble approach?
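The "exp{avg log score}" column is just the exponentiated average log score, which equals the geometric mean of the probabilities a model assigned to the eventually observed outcomes. A minimal sketch of that arithmetic; the probabilities below are invented for illustration and do not come from the table:

```python
import math

# Hypothetical probabilities a model assigned to the eventually observed
# outcome, one per forecast (these numbers are made up for illustration).
probs_assigned_to_truth = [0.45, 0.30, 0.55, 0.40]

log_scores = [math.log(p) for p in probs_assigned_to_truth]
avg_log_score = sum(log_scores) / len(log_scores)

# exp{avg log score} is the geometric mean of the probabilities assigned
# to the truth -- an interpretable "average probability" for the model.
geo_mean = math.exp(avg_log_score)
print(round(avg_log_score, 3), round(geo_mean, 3))
```

A model that always put probability 0.43 on the truth would land near the middle of the table above.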
  5. Can we identify an evidence-based "optimal" method for combining
     influenza forecasts?
  6. Ensemble project timeline (Mar '17 - May '18)

     FluSightNetwork established (Mar '17); collaborative ensemble
     guidelines* advertised; first submission deadline; collaborative
     ensemble "experiments"; first CDC submissions due; final CDC
     submissions (May '18).

     *tinyurl.com/flusight2017
  7. Guidelines overview

     • To participate in the collaborative ensemble, a team must provide a
       "common development set" of forecasts for the 2010/2011-2016/2017
       seasons (minimum of 3 seasons).
     • Essentially, a set of full-season forecast files, as provided to
       CDC for the real challenge.
     • Must only use data available "in real-time". E.g., a file submitted
       with "EW10" in the filename can use data available up through
       11:59pm of Monday of EW12 (this reflects the "real-time" reporting
       process of CDC data).

     (Diagram: epidemic-week timeline EW09-EW13, marking the "1-step",
     "2-step", and "3-step" forecast horizons.)

     We have made this process "simpler" with some data and code available
     on GitHub.
  8. Ensemble overview

     • Using forecasts from the "common development set", we will train
       several different ensembles and compare their prospective
       performance.
     • Goal: have a "collaborative ensemble" model that is submitted
       weekly to the CDC in the 2017/2018 season.
     • Challenges/open questions:
       1. What ensemble methods should we consider?
       2. What training/testing regimen should we use?
       3. How do we formally compare two or more models?
       4. How can we incorporate models that have short "histories" of
          performance?
  9. Model stacking

     (Figure from Ray and Reich, under review: predictive densities over
     binned peak-incidence wILI from three component models: KDE, KCDE,
     and SARIMA. Panel A shows the original predictions; panel B weights
     them by πm = 0.2, 0.5, and 0.3; panel C shows the resulting stacked
     prediction.)

     Not the only way to make an ensemble of predictive densities, but a
     nice and simple place to start.
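The stacking step in panels B and C is a weighted sum of bin probabilities. A minimal sketch, assuming four made-up wILI bins; only the weights πm = 0.2, 0.5, 0.3 come from the figure, and the bin probabilities are invented:

```python
# Stack three predictive densities over binned wILI, as in panels A-C:
# weight each component model's bin probabilities, then sum across models.
# Bin probabilities are invented; weights match the figure's pi_m values.
components = {
    "KDE":    [0.10, 0.30, 0.40, 0.20],
    "KCDE":   [0.05, 0.25, 0.50, 0.20],
    "SARIMA": [0.20, 0.40, 0.30, 0.10],
}
weights = {"KDE": 0.2, "KCDE": 0.5, "SARIMA": 0.3}

n_bins = 4
stacked = [
    sum(weights[m] * components[m][b] for m in components)
    for b in range(n_bins)
]

# Because the weights sum to 1, the stacked density is still a proper
# distribution: its bins sum to 1.
assert abs(sum(stacked) - 1.0) < 1e-12
print([round(p, 3) for p in stacked])  # → [0.105, 0.305, 0.42, 0.17]
```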
  10. What ensembles to consider?

  11. Draft model list for discussion

     A. Equal weights for each model.
     B. Weights estimated per model.
     C. Weights estimated per model and region.
     D. Weights estimated per model and target.
     E. Weights estimated per model, target, and region.
     F. Weights estimated per model, target, and time-of-season.
     G. Weights estimated per model, target, and recent performance.

     Governing principle: start simple! A-E could be estimated "easily"
     using the degenerate EM algorithm. F and G might require fancier
     methods.
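For variant B (one weight per model), the degenerate EM update is simple enough to sketch. Assuming we have, for each model m and forecast i, the probability f[m][i] that the model assigned to the observed outcome, the weights are re-estimated as average responsibilities. The function name and toy numbers below are mine, not from the project:

```python
def degenerate_em_weights(f, n_iter=500):
    """Estimate mixture weights by EM, given f[m][i] = probability that
    model m assigned to the observed outcome of forecast i.
    A sketch of the degenerate-EM idea; interface is an assumption."""
    n_models, n_obs = len(f), len(f[0])
    w = [1.0 / n_models] * n_models
    for _ in range(n_iter):
        new_w = [0.0] * n_models
        for i in range(n_obs):
            # E-step: responsibility of model m for observation i.
            denom = sum(w[m] * f[m][i] for m in range(n_models))
            for m in range(n_models):
                new_w[m] += w[m] * f[m][i] / denom
        # M-step: each weight is the model's average responsibility.
        w = [nw / n_obs for nw in new_w]
    return w

# Toy example: model 0 puts more probability on the truth in two of the
# three forecasts, so it should end up with the larger weight.
f = [[0.5, 0.1, 0.6],
     [0.2, 0.4, 0.1]]
w = degenerate_em_weights(f)
assert w[0] > w[1] and abs(sum(w) - 1.0) < 1e-9
```

Variants C-E simply run this same update separately within each region and/or target stratum.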
  12. What training/testing regimen to use?

  13. Proposed Regimen A: cross-validation and testing

     Seasons: 2010/2011 through 2016/2017.

     • Training phase: leave-one-season-out cross-validation; two models
       advance to testing.*
     • Testing phase: weights fit prospectively for each testing year;
       compare the two models.

     *Perhaps more than two models advance, if no significant difference
      is detected between models?
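The leave-one-season-out loop can be sketched as follows; `fit_weights` and `mean_log_score` are hypothetical stand-ins for weight estimation and held-out evaluation, not functions from the project's codebase:

```python
# Skeleton of leave-one-season-out cross-validation over the seven
# training seasons. The two helper functions are hypothetical stubs.
seasons = ["2010/2011", "2011/2012", "2012/2013", "2013/2014",
           "2014/2015", "2015/2016", "2016/2017"]

def fit_weights(train_seasons):
    ...  # estimate ensemble weights using only the held-in seasons

def mean_log_score(weights, season):
    ...  # evaluate the weighted ensemble on the held-out season

cv_scores = {}
for held_out in seasons:
    # Each season is held out exactly once; weights never see it.
    train = [s for s in seasons if s != held_out]
    assert held_out not in train and len(train) == len(seasons) - 1
    weights = fit_weights(train)
    cv_scores[held_out] = mean_log_score(weights, held_out)
```

Averaging `cv_scores` across seasons gives one cross-validated score per candidate ensemble, which is the basis for deciding which models advance.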
  14. Proposed Regimen B: cross-validation only

     Seasons: 2010/2011 through 2016/2017.

     • Prospective training phase: weights fit prospectively at the start
       of each season.
     • Performance of all models compared across 6 seasons.
  15. How can we make formal model comparisons? And, what comparisons do
      we want to make?
  16. (Figure from Ray and Reich, under review: pairwise comparisons
      among component models and ensemble models. A positive value means
      the row model performs better than the column model;
      permutation-based p-values.)
  17. (Figure from Ray and Reich, under review: component models and
      ensemble models.)

      "A bad prediction can be worse than no prediction at all."
  18. Observations

     • Research has shown that one added value of ensembles lies in
       improving the consistency of forecasts: there is value in
       comparing average performance AND a measure of consistency (e.g.,
       worst-case performance).
     • Sample size and statistical power for comparisons are a major
       issue, especially between models with small differences.
     • Developing "valid" hypothesis tests is challenging because of the
       non-normal distribution of log scores and the presence of
       correlation in performance across weeks, seasons, and regions.
  19. How can we incorporate models that don't have as long a history?
  20. Trusting "new" models

     • New data sources may provide very valuable new information that
       should be incorporated into forecasts.
     • Each season can be very different.
     • How much should decision-makers trust/weigh a model that has
       performed very well for the last two years against models that
       have been consistently good for 5-10 years?
  21. Incorporating a "new" model

     One proposed solution:
     1. Calculate the Euclidean distance between the vector of log scores
        from the new model and that of each existing model.
     2. Assign the new model the weight of the existing model that is
        closest to it.
     3. Standardize the weights to sum to 1.
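These three steps can be sketched directly. All of the numbers below are invented for illustration; in practice the log-score vectors and weights would come from the common development set:

```python
import math

# Step 1-3 of the proposed rule for a "new" model: find the existing
# model with the closest log-score vector (Euclidean distance), borrow
# its weight, then renormalize. All values here are made up.
existing = {
    "A": {"weight": 0.6, "log_scores": [-0.8, -0.9, -0.7]},
    "B": {"weight": 0.4, "log_scores": [-1.2, -1.5, -1.1]},
}
new_log_scores = [-0.9, -1.0, -0.8]  # the new model behaves like model A

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Step 1: distance from the new model to each existing model.
closest = min(existing,
              key=lambda m: euclid(existing[m]["log_scores"],
                                   new_log_scores))

# Step 2: the new model inherits the closest model's weight.
weights = {m: existing[m]["weight"] for m in existing}
weights["new"] = weights[closest]

# Step 3: standardize so the weights again sum to 1.
total = sum(weights.values())
weights = {m: w / total for m, w in weights.items()}
```

Here the new model is closest to "A", inherits its weight of 0.6, and after renormalization the three models carry weights 0.375, 0.25, and 0.375.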
  22. Current project status

  23. Currently we have 15 models from 3 groups.

  24. https://flusightnetwork.github.io/cdc-flusight-ensemble/

  25. Open Discussion

     • How many additional teams will provide the necessary training
       forecasts?
     • How difficult would it be to get more training seasons?
     • Do we use a single ensemble for all targets/regions? Or do we
       choose target- or region-specific ensembles?
     • What are the deliverables?
       - Collaborative submission to CDC
       - Report to CDC documenting the ensemble comparison
       - Group publication at the end of the 2017/2018 season
     • What is our plan for implementing all of this?

     Request to join the group at
     https://groups.google.com/d/forum/flusightnetwork