Mathematical modelling of disease / epidemics is very old

Daniel Bernoulli (1760) - studies inoculation against smallpox

John Snow (1855) - cholera epidemic in London tied to a water pump

Ronald Ross (1902) - Nobel Prize in Medicine for work on malaria

Kermack and McKendrick (1927-1933) - basic epidemic (mathematical) model

Forecasting is also old, but not that old

US CDC Flu Challenge began in 2013

Each flu season, CDC’s Influenza Division has collaborated with external researchers on flu forecasting. CDC provides forecasting teams with data, relevant public health forecasting targets, and forecast accuracy metrics, while the teams submit weekly forecasts based on a variety of methods and data sources.

The Covid-19 Pandemic

CDC pivoted their Flu Challenge to Covid-19 in June 2020

Similar efforts in Germany and then Europe

Nothing similar for Canada

CDC now has Flu and Covid simultaneously

Why the Forecast Hubs?

Collect public forecasts in a standard format and visualize

Used internally by CDC

Turns out, most individual teams’ forecasts are … not great

An `epi_df` object, 129,638 x 5 with metadata:
* geo_type  = state
* time_type = day
* as_of     = 2021-12-01

# A tibble: 129,638 × 5
   geo_value time_value version    percent_cli case_rate_7d_av
 * <chr>     <date>     <date>           <dbl>           <dbl>
 1 ca        2020-06-01 2020-06-02       NA               6.63
 2 ca        2020-06-01 2020-06-06        2.14            6.63
 3 ca        2020-06-01 2020-06-07        2.14            6.63
 4 ca        2020-06-01 2020-06-08        2.14            6.63
 5 ca        2020-06-01 2020-06-09        2.11            6.63
 6 ca        2020-06-01 2020-06-10        2.13            6.63
 7 ca        2020-06-01 2020-06-11        2.20            6.63
 8 ca        2020-06-01 2020-06-12        2.23            6.63
 9 ca        2020-06-01 2020-06-13        2.22            6.63
10 ca        2020-06-01 2020-06-14        2.21            6.63
# ℹ 129,628 more rows
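As a hedged sketch (assuming the {epiprocess} package and a hypothetical tibble `raw` with exactly these columns; argument names may differ slightly across package versions), an object like this can be built with `as_epi_df()`:

```r
library(epiprocess)
library(dplyr)

# `raw` is a hypothetical tibble with columns:
#   geo_value, time_value, version, percent_cli, case_rate_7d_av
edf <- raw |>
  as_epi_df(
    geo_type  = "state",
    time_type = "day",
    as_of     = as.Date("2021-12-01") # the vintage of this snapshot
  )
```

The `as_of` field records which data vintage the snapshot represents, which matters below when we discuss revisions.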

Finalized data

Counts are revised as time proceeds

Want to know the final value

Often not available until weeks/months later
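One hedged way to handle revisions (assuming {epiprocess} and a hypothetical data frame `versioned_rows` holding every reported version of every observation) is to store all vintages in an archive and replay what was known on a given date:

```r
library(epiprocess)

# Hypothetical: all versions of the data, one row per
# (geo_value, time_value, version) triple
arch <- as_epi_archive(versioned_rows)

# The data exactly as it appeared on 2020-08-01;
# revisions reported after that date are excluded
snapshot <- epix_as_of(arch, as.Date("2020-08-01"))
```

This is the mechanism that lets you honestly evaluate a forecaster: train it only on data that would have been available at forecast time.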

Forecasting

At time \(t\), predict the final value for time \(t+h\), \(h > 0\)

Nowcasting

At time \(t\), predict the final value for time \(t\)

Backcasting

At time \(t\), predict the final value for time \(t-h\), \(h > 0\)

Aside on Nowcasting

When Epis say “nowcasting”, they typically mean “estimate the time-varying instantaneous reproduction number, \(R_t\)”

My group built {rtestim} to do this nonparametrically

See also {EpiNow2} and {epinowcast} for flexible (but slow) Bayesian methods based on compartmental models

Mathematical setup

Suppose today is time \(t\)

Let \(y_i\) denote a series of interest observed at times \(i=1,\ldots, t\).

Let \(0< h_1 < \cdots < h_k\) be a collection of forecast horizons.

Our goal

Produce point forecasts for the finalized values of \(y_{t+h_1}, \ldots, y_{t+h_k}\).

Accompany these with prediction intervals

We also have access to \(p\) other time series \(x_{ij},\; i=1,\ldots,t, \; j = 1,\ldots,p\)

All may be subject to revisions.
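Ignoring revisions and the auxiliary series for a moment, a minimal base-R sketch of this setup (all names here are hypothetical) is direct forecasting: regress \(y_{i+h}\) on recent lags of \(y\), then predict at the latest time, with a prediction interval from `predict.lm()`:

```r
# Direct forecaster sketch: predict y_{t+h} from lags of y (base R only).
# `y` is a numeric vector observed at times 1..t; `h` is a single horizon.
direct_forecast <- function(y, h, n_lags = 3, level = 0.9) {
  t_now <- length(y)
  # Training rows: response y_{i+h}; features y_i, y_{i-1}, ..., y_{i-n_lags+1}
  idx <- seq(n_lags, t_now - h)
  df <- data.frame(resp = y[idx + h])
  for (l in 0:(n_lags - 1)) df[[paste0("lag", l)]] <- y[idx - l]
  fit <- lm(resp ~ ., data = df)
  # Features at the current time t, for the point forecast and interval
  newx <- as.data.frame(as.list(setNames(
    y[t_now - 0:(n_lags - 1)], paste0("lag", 0:(n_lags - 1))
  )))
  predict(fit, newx, interval = "prediction", level = level)
}
```

Real forecasters layer on the \(x_{ij}\), per-capita scaling, and revision handling, but the lag-then-regress skeleton is the same.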

2. Standard forecasting models

Models used for COVID-19 hospitalizations

Quick overview of types

SIR / Compartmental model

Deep Learning / ML

Time series

BPagano - SIR

CMU - Time Series

Colorado (4 versions) - SEIR, different scenarios

Georgia Tech - Deep learning…

IHME (rarely) - SEIR

Johns Hopkins (3 in Applied Physics) - SEIR, Time series, ensemble

Johns Hopkins (other, Infection Dynamics) - SEIR

Dean Karlen (Physicist at UVic) - Compartmental

MOBS-Gleam - Agent based mobility model

USC - SEIR

UVA - Ensemble of VAR, LSTM, SEIR

Prolix - “Offsets obtained by correlations, best linear approximation of reproduction rates (using vaccination approximation) by least euclidean distance, and linear prediction.”

Focus: time series type

Pros:

“easy” to implement

trivial to incorporate other features (wastewater, Google searches, auxiliary indicators)

massive variety of models and engines (linear models, random forests, boosting, etc.)

can be very fast to fit

don’t need to know epidemiological parameters (delay distributions, contact rates)

pivot to categorical forecasts (classification) or similar

can easily borrow information across locations

Cons:

doesn’t directly use epi dynamics

can’t access things like \(R_t\)

no mechanism to interrogate scenarios

Personal hot takes

🐶 Contrast between scenario modelling and forecasting

🐶 You really need to address the revision behaviour

👍 For typical season (influenza) SIR-type gets the dynamics

👎 Novel pathogens make SIR-type extremely difficult

👎 SIR-type requires lots of hand-tuning, and is slow to fit

👎 Very difficult for Epis at STLTs (state, tribal, local, and territorial health agencies) to implement SIR from scratch

👍 I’m a statistician, so Time Series is much easier for me

Comparison of real-time performance

Deep learning isn’t worth it.

Hospitalizations due to Covid; all US states, weekly, January 2021 to December 2022

3. Building a time series forecaster

Philosophy of forecasting (general predictive modelling)

We should build up modular components

Be able to add/remove layers of complexity sequentially

Preprocessor: do things to the data before model training

Trainer: train a model on data, resulting in an object

Forecast: make forecasts, using an estimated model object

Postprocessor: do things to the predictions before returning

Examples of preprocessing

EDA type stuff

Making locations commensurate (per capita scaling)

# A preprocessing "recipe" that turns raw data into features / response
r <- epi_recipe(edf) |>
  step_epi_lag(case_rate, lag = c(0, 1, 2, 3, 7, 14)) |>
  step_epi_lag(death_rate, lag = c(0, 7, 14)) |>
  step_epi_ahead(death_rate, ahead = 14) |>
  step_epi_naomit()

# A postprocessing routine describing what to do to the predictions
f <- frosting() |>
  layer_predict() |>
  layer_threshold(.pred, lower = 0) |> # predictions / intervals should be non-negative
  layer_add_target_date() |>
  layer_add_forecast_date()

# Bundle up the preprocessor, training engine, and postprocessor
# We use quantile regression
ewf <- epi_workflow(r, quantile_reg(quantile_levels = c(.1, .5, .9)), f)

# Fit it to data (we could fit this to ANY data that has the same format)
trained_ewf <- ewf |> fit(edf)

# examine the recipe to determine what we need to make the prediction
latest <- get_test_data(r, edf)

# we could make predictions using the same model on ANY test data
preds <- trained_ewf |> predict(new_data = latest)

Visualize a result for 1 forecast date, 2 locations