Mathematical modelling of disease / epidemics is very old
Daniel Bernoulli (1760) - studies inoculation against smallpox
John Snow (1855) - cholera epidemic in London tied to a water pump
Ronald Ross (1902) - Nobel Prize in Medicine for work on malaria
Kermack and McKendrick (1927-1933) - basic epidemic (mathematical) model
Forecasting is also old, but not that old
US CDC Flu Challenge began in 2013
CDC’s Influenza Division has collaborated each flu season with external researchers on flu forecasting. CDC has provided forecasting teams data, relevant public health forecasting targets, and forecast accuracy metrics while teams submit their forecasts, which are based on a variety of methods and data sources, each week.
The Covid-19 Pandemic
CDC pivoted their Flu Challenge to Covid-19 in June 2020
Similar efforts in Germany and then Europe
Nothing similar for Canada
CDC now has Flu and Covid simultaneously
Why the Forecast Hubs?
Collect public forecasts in a standard format and visualize
Used internally by CDC
Turns out, most individual teams’ forecasts are … not great
An `epi_df` object, 129,638 x 5 with metadata:
* geo_type = state
* time_type = day
* as_of = 2021-12-01
# A tibble: 129,638 × 5
geo_value time_value version percent_cli case_rate_7d_av
* <chr> <date> <date> <dbl> <dbl>
1 ca 2020-06-01 2020-06-02 NA 6.63
2 ca 2020-06-01 2020-06-06 2.14 6.63
3 ca 2020-06-01 2020-06-07 2.14 6.63
4 ca 2020-06-01 2020-06-08 2.14 6.63
5 ca 2020-06-01 2020-06-09 2.11 6.63
6 ca 2020-06-01 2020-06-10 2.13 6.63
7 ca 2020-06-01 2020-06-11 2.20 6.63
8 ca 2020-06-01 2020-06-12 2.23 6.63
9 ca 2020-06-01 2020-06-13 2.22 6.63
10 ca 2020-06-01 2020-06-14 2.21 6.63
# ℹ 129,628 more rows
Finalized data
Counts are revised as time proceeds
Want to know the final value
Often not available until weeks/months later
Forecasting
At time \(t\), predict the final value for time \(t+h\), \(h > 0\)
Nowcasting
At time \(t\), predict the final value for time \(t\)
Backcasting
At time \(t\), predict the final value for time \(t-h\), \(h < 0\)
Aside on Nowcasting
When Epi’s say “nowcasting”, they typically mean “estimate the time-varying instantaneous reproduction number, \(R_t\)”
My group built {rtestim} doing for this nonparametrically
See also {epinow2} and {epinowcast} for flexible (but slow) Bayesian methods based on compartmental models
Mathematical setup
Suppose today is time \(t\)
Let \(y_i\) denote a series of interest observed at times \(i=1,\ldots, t\).
Let \(0< h_1 < \cdots < h_k\) be a collection of forecast horizons.
Our goal
Produce point forecasts for the finalized values of \(y_{t+h_1}, \ldots, y_{t+h_k}\).
Accompany these with prediction intervals
We also have access to \(p\) other time series \(x_{ij},\; i=1,\ldots,t, \; j = 1,\ldots,p\)
All may be subject to revisions.
2. Standard forecasting models
Models used for COVID-19 hospitalizations
Quick overview of types
SIR / Compartmental model
Deep Learning / ML
Time series
BPagano - SIR
CMU - Time Series
Colorado (4 versions) - SEIR, different scenarios
Georgia Tech - Deep learning…
IHME (rarely) - SEIR
Johns Hopkins (3 in Applied Physics) - SEIR, Time series, ensemble
Johns Hopkins (other, Infection Dynamics) - SEIR
Dean Karlen (Physicist at UVic) - Compartmental
MOBS-Gleam - Agent based mobility model
USC - SEIR
UVA - Ensemble of VAR, LSTM, SEIR
Prolix - “Offsets obtained by correlations, best linear approximation of reproduction rates (using vaccination approximation) by least euclidean distance, and linear prediction.”
Focus: time series type
Pros:
“easy” to implement
trivial to incorporate other features (waste water, google searches, auxilliary indicators)
massive variety of models and engines (linear models, random forests, boosting, etc.)
can be very fast to fit
don’t need to know epidemiological parameters (delay distributions, contact rates)
pivot to categorical forecasts (classification) or similar
can easily borrow information across locations
Cons:
doesn’t directly use epi dynamics
can’t access things like \(R_t\)
no mechanism to interrogate scenarios
Personal hot takes
🐶 Contrast between scenario modelling and forecasting
🐶 You really need to address the revision behaviour
👍 For typical season (influenza) SIR-type gets the dynamics
👎 Novel pathogens make SIR-type extremely difficult
👎 SIR-type requires lots of hand-tuning, and slow to fit
👎 Very difficult for Epi’s at STLTs to implement SIR from scratch
👍 I’m a statistician, so Time Series is much easier for me
Comparison of real-time performance
Deep learning isn’t worth it.
Hospitalization due to Covid; all US states, weekly, from January 2021 to December 2022
3. Building a time series forecaster
Philosophy of forecasting (general predictive modelling)
We should build up modular components
Be able to add/remove layers of complexity sequentially
Preprocessor: do things to the data before model training
Trainer: train a model on data, resulting in an object
Forecast: make forecasts, using an estimated model object
Postprocessor: do things to the predictions before returning
Examples of preprocessing
EDA type stuff
Making locations commensurate (per capita scaling)
# A preprocessing "recipe" that turns raw data into features / responser <-epi_recipe(edf) |>step_epi_lag(case_rate, lag =c(0, 1, 2, 3, 7, 14)) |>step_epi_lag(death_rate, lag =c(0, 7, 14)) |>step_epi_ahead(death_rate, ahead =14) |>step_epi_naomit()# A postprocessing routine describing what to do to the predictionsf <-frosting() |>layer_predict() |>layer_threshold(.pred, lower =0) |># predictions / intervals should be non-negativelayer_add_target_date() |>layer_add_forecast_date()# Bundle up the preprocessor, training engine, and postprocessor# We use quantile regressionewf <-epi_workflow(r, quantile_reg(quantile_levels =c(.1, .5, .9)), f)# Fit it to data (we could fit this to ANY data that has the same format)trained_ewf <- ewf |>fit(edf)# examines the recipe to determine what we need to make the predictionlatest <-get_test_data(r, edf)# we could make predictions using the same model on ANY test datapreds <- trained_ewf |>predict(new_data = latest)
Visualize a result for 1 forecast date, 2 locations