Nicholas G. Reich, for the FluSight Network Collaborative Ensemble
August 2017, CDC FluSight meeting
https://flusightnetwork.github.io/cdc-flusight-ensemble/
[Timeline] July: collaborative ensemble guidelines* advertised. September: first submission deadline. November: first CDC submissions due. May 2018: final CDC submissions. The collaborative ensemble "experiments" fall between the first submission deadline and the first CDC submissions.
*tinyurl.com/flusight2017
Each team must provide a "common development set" of forecasts for the 2010/2011 through 2016/2017 seasons (minimum of 3 seasons).
• Essentially, a set of full-season forecast files, as provided to CDC for the real challenge.
• Must only use data available "in real-time". E.g., a file submitted with "EW10" in the filename can use data available up through 11:59pm on Monday of EW12 (this reflects the "real-time" reporting process of CDC data).
• We have made this process "simpler" with some data and code available on GitHub.
[Diagram relating the epidemic week in a file's name (e.g. EW09, EW10) to "1-step", "2-step", and "3-step" target weeks (EW11, EW12, EW13).]
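To make the real-time rule concrete, here is a small Python sketch (the function name and date handling are illustrative assumptions, not the code on GitHub) that computes the latest usable data timestamp from the Monday of the epidemic week in a file's name:

```python
from datetime import date, datetime, time, timedelta

def data_cutoff(monday_of_filename_ew: date) -> datetime:
    """Latest usable data timestamp for a forecast file under the real-time rule.

    A file labeled EW<k> may use data available through 11:59 pm on the Monday
    of EW<k+2>, i.e. two weeks after the Monday of the week in its filename.
    """
    return datetime.combine(monday_of_filename_ew + timedelta(weeks=2), time(23, 59))

# Example: EW10 of 2016 began on Sunday 2016-03-06, so its Monday is 2016-03-07
print(data_cutoff(date(2016, 3, 7)))  # -> 2016-03-21 23:59:00
```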
We will train several different ensembles and compare their prospective performance.
• Goal: have a "collaborative ensemble" model that is submitted weekly to the CDC in the 2017/2018 season.
• Challenges/open questions:
  1. What ensemble methods should we consider?
  2. What training/testing regimen should we use?
  3. How do we formally compare two or more models?
  4. How can we incorporate models that have short "histories" of performance?
A. Equal weights for each model.
B. Weights estimated per model.
C. Weights estimated per model and region.
D. Weights estimated per model and target.
E. Weights estimated per model, target, and region.
F. Weights estimated per model, target, and time-of-season.
G. Weights estimated per model, target, and recent performance.
Governing principle: start simple! A through E could be estimated "easily" using the degenerate EM algorithm; F and G might require fancier methods.
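To illustrate how schemes B through E could be fit, here is a minimal Python sketch of the degenerate EM algorithm for mixture weights; it assumes a matrix of the probabilities each component model assigned to the eventually observed outcomes (names and data layout are hypothetical, not the group's implementation):

```python
import numpy as np

def fit_weights_degenerate_em(p, n_iter=1000, tol=1e-8):
    """Estimate ensemble mixture weights with the degenerate EM algorithm.

    p : array of shape (n_models, n_forecasts); p[m, i] is the probability
        model m assigned to the eventually observed outcome of forecast i.
    Returns a weight vector of length n_models that sums to 1.
    """
    n_models, _ = p.shape
    w = np.full(n_models, 1.0 / n_models)           # start from equal weights
    for _ in range(n_iter):
        # E-step: posterior probability that each forecast "belongs" to each model
        weighted = w[:, None] * p
        z = weighted / weighted.sum(axis=0, keepdims=True)
        # M-step: new weight = average membership across forecasts
        w_new = z.mean(axis=1)
        if np.max(np.abs(w_new - w)) < tol:
            return w_new
        w = w_new
    return w

# Toy example: 3 models scored on 5 forecasts (made-up probabilities)
rng = np.random.default_rng(0)
print(fit_weights_degenerate_em(rng.uniform(0.01, 0.9, size=(3, 5))))
```

Schemes C through E would run the same update separately within each region and/or target stratum; scheme A needs no estimation at all.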
[Design diagram: seasons 2013/2014 through 2016/2017 shown, split into a training phase and a testing phase.]
Training phase: leave-one-season-out cross-validation; two models advance to testing.*
Testing phase: weights fit prospectively for each testing year; the two models are compared.
*Perhaps more than two models advance, if no significant difference is detected between the models.
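A sketch of the leave-one-season-out loop under this design, assuming scored forecasts live in a table with a "season" column and one column per model holding the probability assigned to the observed outcome (the toy data and equal-weights baseline below are invented for illustration):

```python
import numpy as np
import pandas as pd

def mean_log_score(weights, p_obs):
    """Mean log score of the weighted mixture; p_obs has one column per model."""
    mixture = p_obs.to_numpy() @ np.asarray(weights)
    return float(np.log(mixture).mean())

def loso_cv(scores, fit_weights, models):
    """Leave-one-season-out cross-validation of one weighting scheme.

    fit_weights : callable mapping a (forecasts x models) DataFrame of
        observed-outcome probabilities to a weight vector.
    """
    results = {}
    for season in sorted(scores['season'].unique()):
        train = scores[scores['season'] != season]   # fit on all other seasons
        test = scores[scores['season'] == season]    # evaluate on the held-out season
        w = fit_weights(train[models])
        results[season] = mean_log_score(w, test[models])
    return results

def equal_weights(df):
    # scheme A baseline: every model gets the same weight
    return np.full(df.shape[1], 1.0 / df.shape[1])

# Toy data: two seasons, two models, made-up probabilities of the observed bins
toy = pd.DataFrame({
    'season': ['2014/2015'] * 3 + ['2015/2016'] * 3,
    'modelA': [0.30, 0.25, 0.40, 0.20, 0.35, 0.30],
    'modelB': [0.10, 0.45, 0.35, 0.25, 0.15, 0.50],
})
print(loso_cv(toy, equal_weights, ['modelA', 'modelB']))
```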
[Alternative design diagram: seasons 2014/2015 through 2016/2017 shown.]
Prospective training phase: weights are fit prospectively at the start of each season, and the performance of all models is compared across 6 seasons.
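The fully prospective design could be sketched the same way, re-fitting weights using only seasons that come before each test season (this reuses mean_log_score from the sketch above; season labels such as '2014/2015' sort chronologically as strings):

```python
def prospective_eval(scores, fit_weights, models, first_test_season):
    """Forward-chaining evaluation: weights are re-fit at the start of each
    test season using only strictly earlier seasons."""
    results = {}
    for season in sorted(scores['season'].unique()):
        if season < first_test_season:
            continue                                  # not yet in the evaluation phase
        train = scores[scores['season'] < season]     # earlier seasons only
        test = scores[scores['season'] == season]
        w = fit_weights(train[models])
        results[season] = mean_log_score(w, test[models])
    return results
```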
The value of ensembles lies in improving the consistency of forecasts: there is value in comparing average performance AND a measure of consistency (e.g., worst-case performance).
• Sample size and statistical power for comparisons are a major issue, especially between models with small differences.
• Developing "valid" hypothesis tests is challenging because of the non-normal distribution of log scores and the presence of correlation in performance across weeks, seasons, and regions.
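One possible way to respect that correlation structure (an option for discussion, not a method the group has adopted) is to collapse each season into a single mean log-score difference between two models and run a paired sign-flip test at the season level:

```python
import numpy as np

def season_sign_flip_test(diff_by_season, n_perm=10_000, seed=1):
    """Paired sign-flip test on per-season mean log-score differences (A minus B).

    Collapsing each season to one number sidesteps the correlation across weeks
    and regions within a season, at the cost of very few effective observations.
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(diff_by_season, dtype=float)
    observed = diff.mean()
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diff.size))
    null = (signs * diff).mean(axis=1)                # null: no systematic difference
    # two-sided p-value with the usual +1 correction
    return (1 + np.sum(np.abs(null) >= np.abs(observed))) / (1 + n_perm)

# Example with made-up per-season differences for 6 seasons
print(season_sign_flip_test([0.05, 0.12, -0.02, 0.08, 0.01, 0.04]))
```

With only six or seven seasons the null distribution has at most 2^6 or 2^7 distinct sign patterns, which makes the statistical-power concern above explicit.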
Models with short histories may provide valuable and new information that should be incorporated into forecasts.
• Each season can be very different.
• How much should decision-makers trust/weigh a model that has performed very well for the last two years against models that have been consistently good for 5-10 years?
One idea for weighting a new model:
1. Compute the Euclidean distance between the vector of log scores from the new model and that of each existing model.
2. Assign the new model the weight of the existing model that is closest to it.
3. Standardize the weights to sum to 1.
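A minimal sketch of this three-step rule, assuming log scores for the new model and each existing model are available over a common set of forecasts (function, argument, and model names are hypothetical):

```python
import numpy as np

def weight_for_new_model(existing_weights, existing_log_scores, new_log_scores,
                         new_name='new_model'):
    """Give a new model the weight of its nearest neighbor in log-score space.

    existing_weights    : dict {model: weight}, summing to 1.
    existing_log_scores : dict {model: 1-D array of log scores} over a common
                          set of forecasts that the new model has also scored.
    new_log_scores      : 1-D array of the new model's log scores on that set.
    """
    new = np.asarray(new_log_scores, dtype=float)
    # 1. Euclidean distance between the new model's log-score vector and each existing model's
    dist = {m: np.linalg.norm(np.asarray(s, dtype=float) - new)
            for m, s in existing_log_scores.items()}
    nearest = min(dist, key=dist.get)
    # 2. Assign the new model the nearest existing model's weight
    weights = dict(existing_weights)
    weights[new_name] = existing_weights[nearest]
    # 3. Standardize so all weights again sum to 1
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}

# Toy example: the new model's scores most resemble modelB's, so it inherits modelB's weight
print(weight_for_new_model(
    {'modelA': 0.6, 'modelB': 0.4},
    {'modelA': [-1.2, -0.8, -1.5], 'modelB': [-2.0, -1.9, -2.4]},
    [-2.1, -1.8, -2.3]))
```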
• … training forecasts?
• How difficult to get more training seasons?
• Do we use a single ensemble for all targets/regions, or do we choose target- or region-specific ensembles?
• What are the deliverables?
  - Collaborative submission to CDC
  - Report to CDC documenting the ensemble comparison
  - Group publication at the end of the 2017/2018 season
• What is our plan for implementing all of this?
Request to join the group at https://groups.google.com/d/forum/flusightnetwork