CANSSI Prairies — Epi Forecasting Workshop 2025
Lecture 4: Forecasting
(with some review)
Assume we observe a predictor \(x_i\) and an outcome \(y_i\) for \(i = 1, \dots, n\).
Linear regression supposes
\[\mathbb{E}[y_i\ \vert\ x_i] \doteq \beta_0 + \beta_1 x_i,\quad i=1,\dots,n.\]
In R, run lm(y ~ x), where y is the vector of responses and x the vector of predictors.
With multiple predictors, linear regression becomes
\[ \begin{aligned} \mathbb{E}[y_i\ \vert\ \mathbf{x}_i] &\doteq \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}\\ &= \mathbf{x}^\mathsf{T}_i\beta, & i=1,\dots,n. \end{aligned} \]
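In R this is just as easy (a minimal sketch; the data frame df and its columns are hypothetical):

```r
# Sketch: multiple linear regression with lm(); placeholder data.
set.seed(1)
df <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
fit <- lm(y ~ ., data = df)  # "." means: use all other columns as predictors
coef(fit)                    # estimates of beta_0, beta_1, beta_2
```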
\[\mathbb{E}[y_t\ \vert\ x_{t-k}] \doteq \beta + \beta_0 x_{t-k}\]
\[\mathbb{E}[y_{t+k}\ \vert\ x_t] \doteq \beta + \beta_0 x_t\]
\[\mathbb{E}[y_{t+h}\ \vert\ \mathbf{y}_{\leq t}, \mathbf{x}_{\leq t}] \doteq \phi + \sum_{i=0}^p \phi_i y_{t-i} + \sum_{j=0}^q \beta_j x_{t-j}\]
\[\mathbb{E}[y_{t+h}\ \vert\ \mathbf{y}_{\leq t}, \mathbf{x}_{\leq t}] \doteq \phi + \phi_0 y_{t} + \phi_1 y_{t-7} + \phi_2 y_{t-14} + \beta_0 x_{t} + \beta_1 x_{t-7} + \beta_2 x_{t-14}\]
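As a concrete illustration, this last specification can be fit with lm() on manually constructed lags (a minimal sketch on placeholder data; all names below are hypothetical):

```r
# Sketch: direct h-step-ahead regression with lags 0, 7, 14 of y and x.
lagk <- function(v, k) if (k == 0) v else c(rep(NA, k), head(v, -k))
h <- 28
dat <- data.frame(y = rnorm(200), x = rnorm(200))  # placeholder data
feats <- data.frame(
  y_ahead = c(tail(dat$y, -h), rep(NA, h)),        # y_{t+h}
  y_lag0 = dat$y, y_lag7 = lagk(dat$y, 7), y_lag14 = lagk(dat$y, 14),
  x_lag0 = dat$x, x_lag7 = lagk(dat$x, 7), x_lag14 = lagk(dat$x, 14)
)
fit <- lm(y_ahead ~ ., data = feats)  # lm() drops the NA rows automatically
```

This "direct" strategy fits a separate model for each horizon \(h\), rather than iterating one-step-ahead forecasts.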
The first two here are classic and standard; the second two are more recent.
None are particularly well-suited for epi forecasting out-of-the-box (ask me about them if you're curious).
I’ll briefly mention ARIMA as it’s closest to the setup so far.
Exogenous predictors can be included ({fable} includes them in a different manner; they are effectively subject to the same lags as the AR and MA terms).
How do we measure forecast accuracy?
\[\textrm{MSE} = \frac{1}{N} \sum_{t=1}^N (\tilde y_{t}- \hat y_{t})^2\]
\[\textrm{MAE} = \frac{1}{N} \sum_{t=1}^N |\tilde y_{t}- \hat y_{t}|\]
MAE gives less importance to extreme errors than MSE.
MSE is not on the same scale as the data (squared units); use RMSE instead.
Drawback: both metrics are scale-dependent, so they are not universally interpretable.
(For example, if \(y\) captures height, MSE and MAE will vary depending on whether we measure in feet or meters.)
Fixing scale-dependence:
\[\textrm{MAPE} = 100 \times \frac{1}{N} \sum_{t=1}^N \left|\frac{\tilde y_{t}- \hat y_{t}}{\tilde y_{t}}\right|\]
\[\textrm{MASE} = 100 \times \frac{\frac{1}{N} \sum_{t=1}^N |\tilde y_{t}- \hat y_{t}|} {\frac{1}{N-1} \sum_{t=2}^N |\tilde y_{t}- y_{t-1}|}\]
MASE is universally interpretable (not scale-dependent)
and avoids MAPE's division-by-zero pitfall (unless a first difference is 0…)
Heuristic description: normalize the error of our forecasts by that of a naive method which always predicts the last observation.
| | MAE | RMSE | MAPE | MASE |
|---|---|---|---|---|
| yhat1 | 2.873 | 4.024 | 43.140 | 66.100 |
| yhat2 | 5.382 | 9.689 | 36.083 | 123.817 |
Given a set of predictive intervals \((l_t^{\alpha_1}, u_t^{\alpha_1}), \dots, (l_t^{\alpha_K}, u_t^{\alpha_K}),\quad t=1,\dots,N\)
\[\textrm{Coverage}(\alpha) = \frac{1}{N} \sum_{t=1}^N \mathbf{1}(l_t^{\alpha} \leq \tilde y_{t} \leq u_t^\alpha)\]
\[\textrm{IS}(\alpha) = \frac{1}{N} \sum_{t=1}^N \left[ \alpha(u_t^\alpha - l_t^\alpha) + 2(l_t^\alpha - \tilde y_t)_+ + 2(\tilde y_t - u_t^\alpha)_+ \right]\]
\[\textrm{WIS} = \sum_{k=1}^K\textrm{IS}(\alpha_k)\]
MSE <- function(obs, pred) mean((obs - pred)^2)
MAE <- function(obs, pred) mean(abs(obs - pred))
MAPE <- function(obs, pred) 100 * mean(abs(obs - pred) / obs)
MASE <- function(obs, pred) 100 * MAE(obs, pred) / mean(abs(diff(obs)), na.rm = TRUE)
Coverage <- function(obs, ql, qu) mean(obs >= ql & obs <= qu)
IS <- function(obs, ql, qu, alpha) mean(alpha * (qu - ql) + 2 * pmax(ql - obs, 0) + 2 * pmax(obs - qu, 0))
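WIS can then be assembled by summing the interval scores across levels. A minimal sketch (the endpoint matrices ql and qu, one column per level, are hypothetical inputs):

```r
# Sketch: WIS as a sum of interval scores over the K levels alpha_k.
# ql, qu: N x K matrices of lower/upper endpoints (hypothetical);
# alphas: vector of the K interval levels.
WIS <- function(obs, ql, qu, alphas) {
  sum(vapply(seq_along(alphas),
             function(k) IS(obs, ql[, k], qu[, k], alphas[k]),
             numeric(1)))
}
```

Given an error metric, we want to estimate the prediction error under that metric.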
The easiest, but worst, estimate of the prediction error is the training error:
\[\text{TrainMSE} = \frac{1}{N-h} \sum_{t = 1}^{N-h} (\hat y_{t+h|N} - y_{t+h})^2\]
To compute the split-sample error:
Split the data into a training set (up to time \(t_0\)) and a test set (after \(t_0\))
Fit the model to the training data only
Make predictions for the test set
Compute the error metric on the test set only
Important
Split-sample estimates of prediction error don’t mimic real forecasting.
We would refit with new data.
Therefore, split-sample is pessimistic if the relation between outcome and predictors changes over time.
Then, the split-sample MSE is
\[\text{SplitMSE} = \frac{1}{N-h-t_0} \sum_{t = t_0}^{N-h} (\hat y_{t+h|t_0} - y_{t+h})^2\]
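A minimal sketch of this computation on toy data (the AR-style setup below is illustrative, not from the lecture):

```r
# Sketch: split-sample error with a simple lag-0 lm() on toy data.
set.seed(1)
y  <- as.numeric(arima.sim(list(ar = 0.7), n = 200))
h  <- 7; t0 <- 150; N <- length(y)
df <- data.frame(y_ahead = c(tail(y, -h), rep(NA, h)),  # y_{t+h}
                 y_lag0  = y)                           # y_t
# Fit using only responses observed by t0 (y_{t+h} known means t <= t0 - h)
fit  <- lm(y_ahead ~ y_lag0, data = df[1:(t0 - h), ])
idx  <- t0:(N - h)                                      # test origins
pred <- predict(fit, newdata = df[idx, ])
SplitMSE <- mean((pred - df$y_ahead[idx])^2)
```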
Re-estimate once new data are available
To get \(h\)-step ahead predictions, for each time \(t = t_0, t_0+1, \dots\),
Estimate the model using data up to time \(t\)
Make a prediction for \(t+h\)
Record the prediction error
\[\textrm{CVMSE} = \frac{1}{N-h-t_0} \sum_{t = t_0}^{N-h} (\hat y_{t+h|t} - y_{t+h})^2\]
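Continuing the toy sketch above (reusing df, h, t0, and N), the time-series CV loop refits at every forecast origin:

```r
# Sketch: refit the model at each forecast origin t, then predict t + h.
errs <- vapply(t0:(N - h), function(t) {
  fit_t  <- lm(y_ahead ~ y_lag0, data = df[1:(t - h), ])  # data through time t
  pred_t <- predict(fit_t, newdata = df[t, , drop = FALSE])
  (pred_t - df$y_ahead[t])^2
}, numeric(1))
CVMSE <- mean(errs)
```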
But that point about transformations is VERY important. And often overlooked.
For example: scale() each of the predictors, using standard R operations.
data(Chicago, package = "modeldata")
Chicago <- select(Chicago, c(ridership, temp, humidity, percip))
chicago_train <- Chicago |>
slice_head(prop = .8) |>
mutate(across(everything(), scale))
mod <- lm(ridership ~ ., data = chicago_train)
chicago_test <- Chicago |>
slice_tail(prop = .2) |>
mutate(across(everything(), scale))
preds <- predict(mod, chicago_test)
We would then unscale preds to get them in the original units.
This is all wrong: the test set was scaled with its own means and standard deviations, not those estimated from the training data.
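To make the fix concrete, here is a hand-rolled sketch (an illustration, not how {tidymodels} implements it): estimate the centers and scales on the training split only, and reuse those same constants for the test split and for un-scaling predictions.

```r
# Sketch: all preprocessing constants come from the training data,
# then are applied unchanged to the test data (and inverted on predictions).
train_raw <- Chicago |> slice_head(prop = .8)
test_raw  <- Chicago |> slice_tail(prop = .2)
ctr <- sapply(train_raw, mean)                      # training means
scl <- sapply(train_raw, sd)                        # training SDs
train_sc <- as.data.frame(scale(train_raw, center = ctr, scale = scl))
test_sc  <- as.data.frame(scale(test_raw,  center = ctr, scale = scl))
mod2  <- lm(ridership ~ ., data = train_sc)
preds <- predict(mod2, test_sc) * scl["ridership"] + ctr["ridership"]
```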
The {tidymodels} suite of packages is intended to handle this situation correctly.
It’s written by programmers at Posit (the people behind {tidyverse})
It doesn’t work for panel data.
That’s what we need for Epidemiological Time Series
We’ve been working with their team to develop this functionality.
{tidymodels} gives a unified interface across models (glm() vs lm() vs glmnet()).
{epipredict} provides basic and easy-to-use "canned" forecasters:
These are supposed to work easily
Handle lots of cases we’ve already seen
We’ll start here
{epipredict} provides extensions of {tidymodels} to panel data.
arx_forecaster() on the training set:
t0_date <- as.Date('2021-04-01')
train <- ca |> filter(time_value <= t0_date)
test <- ca |> filter(time_value > t0_date)
ca_arx <- arx_forecaster(
epi_data = train |> as_epi_df(),
outcome = "deaths",
predictors = c("cases", "deaths"),
trainer = linear_reg(),
args_list = arx_args_list(lags = 0, ahead = 28, quantile_levels = c(0.1, 0.9))
)
arx_forecaster() output contains:
the trained workflow ($epi_workflow)
the predictions ($predictions)
══ Epi Workflow [trained] ══════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
Postprocessor: Frosting
── Preprocessor ────────────────────────────────────────────────────────────────
7 Recipe steps.
1. step_epi_lag()
2. step_epi_lag()
3. step_epi_ahead()
4. step_naomit()
5. step_naomit()
6. step_training_window()
7. check_enough_data()
── Model ───────────────────────────────────────────────────────────────────────
Call:
stats::lm(formula = ..y ~ ., data = data)
Coefficients:
(Intercept) lag_0_cases lag_0_deaths
21.181007 0.008929 0.321781
── Postprocessor ───────────────────────────────────────────────────────────────
5 Frosting layers.
1. layer_predict()
2. layer_residual_quantiles()
3. layer_add_forecast_date()
4. layer_add_target_date()
5. layer_threshold()
# A tibble: 1 × 5
geo_value .pred .pred_distn forecast_date target_date
<chr> <dbl> <qtls(3)> <date> <date>
1 ca 105. [105] 2021-04-01 2021-04-29
Note
.pred_distn is actually a “distribution”, parameterized by its quantiles
arx_forecaster() estimates the quantiles in a different way than lm()
arx_forecaster() estimates a model on the training set
Outputs only the prediction for time \(t_0+h\)
To get predictions for the test set:
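The table below was produced by applying the trained workflow to the held-out data; a hedged sketch of the call (test and ca_arx come from the code above; the exact predict() behavior may vary across {epipredict} versions):

```r
# Sketch: reuse the trained workflow to predict at every test time value.
test_preds <- predict(ca_arx$epi_workflow, new_data = test |> as_epi_df())
```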
An `epi_df` object, 706 x 6 with metadata:
* geo_type = state
* time_type = day
* as_of = 2025-04-21 14:26:40.214048
# A tibble: 706 × 6
geo_value time_value .pred .pred_distn forecast_date target_date
<chr> <date> <dbl> <qtls(3)> <date> <date>
1 ca 2021-04-02 113. [113] 2021-04-01 2021-04-29
2 ca 2021-04-03 64.3 [64.3] 2021-04-01 2021-04-29
3 ca 2021-04-04 50.9 [50.9] 2021-04-01 2021-04-29
4 ca 2021-04-05 61.2 [61.2] 2021-04-01 2021-04-29
5 ca 2021-04-06 71.9 [71.9] 2021-04-01 2021-04-29
6 ca 2021-04-07 78.1 [78.1] 2021-04-01 2021-04-29
7 ca 2021-04-08 89.2 [89.2] 2021-04-01 2021-04-29
8 ca 2021-04-09 124. [124] 2021-04-01 2021-04-29
9 ca 2021-04-10 101. [101] 2021-04-01 2021-04-29
10 ca 2021-04-11 64.7 [64.7] 2021-04-01 2021-04-29
# ℹ 696 more rows
We can iterate arx_forecaster() with epi_slide(), or slide over an epi_archive with epix_slide():
fc_time_values <- seq(from = t0_date, to = as.Date("2023-02-09"), by = "1 week")
h <- 28 # horizon
w <- 120 + h # trailing window length
pred_arx <- ca_archive |> epix_slide(
~ arx_forecaster(epi_data = .x,
outcome = "deaths",
predictors = c("cases", "deaths"),
trainer = linear_reg(),
args_list = arx_args_list(lags = 0, ahead = h, quantile_levels = c(0.1, 0.9))
) |> pluck("predictions") |> pivot_quantiles_wider(.pred_distn),
.before = w,
.versions = fc_time_values
)
Fit the model on a window of data of length \(w\)
Advantage: the fitted model can adapt if the relationship between outcome and predictors changes over time
Note (window length)
Setting \(w = 120 + h\) actually only uses \(N=120\) observations
It filters to data within the window, then performs leads/lags
To make this explicit: for a horizon \(h\), we need to go back \(h\) days to see which predictors align with the target
Note (all past)
To fit on all past data up to the forecast date, use
epix_slide(..., .before = Inf)
# A tibble: 98 × 8
version geo_value .pred forecast_date target_date `0.1` `0.5` `0.9`
* <date> <chr> <dbl> <date> <date> <dbl> <dbl> <dbl>
1 2021-04-01 ca 0.396 2021-03-31 2021-04-28 0.192 0.396 0.599
2 2021-04-08 ca 0.395 2021-04-07 2021-05-05 0.197 0.395 0.594
3 2021-04-15 ca 0.403 2021-04-14 2021-05-12 0.211 0.403 0.595
4 2021-04-22 ca 0.312 2021-04-21 2021-05-19 0.142 0.312 0.482
5 2021-04-29 ca 0.261 2021-04-28 2021-05-26 0.0879 0.261 0.433
6 2021-05-06 ca 0.209 2021-05-05 2021-06-02 0.0238 0.209 0.394
7 2021-05-13 ca 0.158 2021-05-12 2021-06-09 0 0.158 0.345
8 2021-05-20 ca 0.118 2021-05-19 2021-06-16 0 0.118 0.296
9 2021-05-27 ca 0.0775 2021-05-26 2021-06-23 0 0.0775 0.239
10 2021-06-03 ca 0.0552 2021-06-02 2021-06-30 0 0.0552 0.137
# ℹ 88 more rows
| MAE | MASE | Coverage 80 |
|---|---|---|
| 0.08 | 218.86 | 0.43 |
arx_forecaster() is customized through arx_args_list():
arx_args_list(
lags = c(0L, 7L, 14L),
ahead = 7L,
n_training = Inf,
forecast_date = NULL,
target_date = NULL,
adjust_latency = c("none", "extend_ahead", "extend_lags", "locf"),
warn_latency = TRUE,
quantile_levels = c(0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95),
symmetrize = TRUE,
nonneg = TRUE,
quantile_by_key = character(0L),
check_enough_data_n = NULL,
check_enough_data_epi_keys = NULL,
...
)
| MAE | MASE | Coverage 80 |
|---|---|---|
| 0.06 | 169.72 | 0.52 |
Using arx_forecaster() at several horizons:
h <- c(7, 14, 21, 28)
forecast_times <- seq(from = t0_date, to = as.Date("2023-02-23"), by = "1 month")
pred_h_days_ahead <- function(epi_archive, ahead = 7) {
epi_archive |>
epix_slide(~ arx_forecaster(epi_data = .x,
outcome = "deaths",
predictors = c("deaths", "doctor_visits"),
trainer = linear_reg(),
args_list = arx_args_list(lags = 0, ahead = ahead, quantile_levels = c(0.1, 0.9))
)|> pluck("predictions") |> pivot_quantiles_wider(.pred_distn),
.before = w,
.versions = forecast_times
)
}
forecasts <- bind_rows(map(h, ~ pred_h_days_ahead(ca_archive_dv, ahead = .x)))
Modify trainer to use a model that is not lm() (the default):
trainer = rand_forest()
Any {parsnip} model works; see the list.
{epipredict} has a number of custom engines as well.
pred_arx_rf <- ca_archive_dv |>
epix_slide(
~ arx_forecaster(epi_data = .x,
outcome = "deaths",
predictors = c("deaths", "doctor_visits"),
trainer = parsnip::rand_forest(mode = "regression"), # defaults to ranger
args_list = arx_args_list(lags = 0, ahead = 28, quantile_levels = c(0.1, 0.9))
) |> pluck("predictions") |> pivot_quantiles_wider(.pred_distn),
.before = w,
.versions = fc_time_values
)
| MAE | MASE | Coverage 80 |
|---|---|---|
| 0.08 | 226.43 | 0.09 |
Specify parsnip::rand_forest(mode = "regression", engine = "grf_quantiles")
Results with grf_quantiles():
| MAE | MASE | Coverage 80 |
|---|---|---|
| 0.09 | 248.24 | 0.39 |
By default, {epipredict} takes quantiles of training residuals to form its prediction intervals.
Compared to the parametric intervals from lm(), this is more flexible.
But taking quantiles of training residuals can be problematic if the model is overfit.
Quantile regression provides an alternative, wherein we estimate these quantiles directly
Technically, grf_quantiles() was using Quantile Loss with Random Forests
Now we directly target conditional quantiles of the outcome over time.
Estimating tail quantiles requires more data, so quantile regression:
is unsuitable for settings with a small training set (e.g., a trailing window on one state)
can benefit from combination with geo-pooling (much more data to train on)
We will now pool data from all US states to make predictions.
We also switch to (linear) quantile regression (medians and intervals) via quantreg::rq():
pred_arx_geo_pool <- usa_archive_dv |> epix_slide(
~ arx_forecaster(epi_data = .x,
outcome = "deaths",
predictors = c("deaths", "doctor_visits"),
trainer = quantile_reg(),
args_list = arx_args_list(ahead = 28, quantile_levels = c(0.1, 0.9))
) |> pluck("predictions") |> pivot_quantiles_wider(.pred_distn),
.before = w,
.versions = fc_time_values
)
Note: geo-pooling is the default in {epipredict}
Geo-pooled predictions tend to be more stable
Generally with wider intervals (and better coverage)
Meanwhile, predictions from state-specific models tend to be more volatile
The extent to which this occurs differs based on the horizon.
So far, we performed manual pre-processing,
and then relied on a canned forecaster
to automatically perform more pre-processing, training, predicting, and post-processing.
What if we want more direct control over each individual step?
arx_forecaster(), decomposed (roughly):
# A preprocessing "recipe" that turns raw data into features / response
rec <- epi_recipe(ca) |>
step_epi_lag(cases, lag = c(0, 7, 14)) |>
step_epi_lag(deaths, lag = c(0, 7, 14)) |>
step_epi_ahead(deaths, ahead = 28) |>
step_epi_naomit()
# Training engine
eng <- quantile_reg(quantile_levels = c(.1, .5, .9))
# A post-processing routine describing what to do to the predictions
frost <- frosting() |>
layer_predict() |>
layer_threshold(.pred, lower = 0) |> # predictions / intervals should be non-negative
layer_add_target_date() |>
layer_add_forecast_date()
# Bundle up the preprocessor, training engine, and postprocessor
# We use quantile regression
ewf <- epi_workflow(rec, eng, frost)
# Fit it to data (we could fit this to ANY data that has the same format)
trained_ewf <- fit(ewf, data = ca)
# Make predictions from the end of our training data
# we could have made predictions using the same model on ANY test data
fcasts <- forecast(trained_ewf)
A production-style forecaster:
prod_forecaster <- function(epidata, ahead) {
Logit <- function(x, a = 0.01) log((x + a) / (1 - x + a))
Sigmd <- function(y, a = 0.01) (exp(y) * (1 + a) - a) / (1 + exp(y))
quantile_levels <- c(0.025, 0.1, 0.25, 0.5, 0.75, 0.9, 0.975)
forecast_date <- attr(epidata, "metadata")$as_of
target_date <- forecast_date + ahead - 7
r <- epi_recipe(epidata) |>
step_adjust_latency(recipes::all_outcomes(), fixed_forecast_date = forecast_date, method = "extend_ahead") |>
recipes::step_mutate(flu_pct_positive = Logit(flu_pct_positive / 100)) |>
step_epi_lag(flu_pct_positive, lag = c(0, 7, 14)) |>
step_epi_ahead(flu_pct_positive, ahead = ahead) |>
step_climate(flu_pct_positive, forecast_ahead = ahead / 7, time_type = "epiweek", window_size = 3L) |>
step_training_window(n_recent = 12) |>
step_epi_naomit()
e <- quantile_reg(quantile_levels = quantile_levels)
f <- frosting() |> layer_predict() |>
layer_quantile_distn(quantile_levels = quantile_levels) |>
layer_point_from_distn() |>
layer_add_forecast_date(forecast_date = forecast_date) |>
layer_add_target_date()
ewf <- epi_workflow(r, e, f)
trained_ewf <- ewf |> fit(epidata)
preds <- forecast(trained_ewf) |>
pivot_quantiles_wider(.pred_distn) |>
mutate(across(c(.pred, `0.025`:`0.975`), ~ Sigmd(.x) * 100)) |>
select(-time_value)
preds
}
Instead of choosing one model, we can combine the predictions from multiple base models.
untrained: combine base models, agnostic to past performance
trained: weight base models, accounting for past performance
Simplest untrained method: simple average of base model forecasts
\[ \hat{y}^{\text{avg}}_{t+h|t} = \frac{1}{p} \sum_{j=1}^p \hat{y}^j_{t+h|t} \]
A more robust option: simple median of base model forecasts
\[ \hat{y}^{\text{med}}_{t+h|t} = \mathrm{median}\Big\{ \hat{y}^j_{t+h|t} : j = 1,\dots,p \Big\} \]
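A minimal sketch of both untrained combiners (the forecast matrix fmat below is hypothetical toy data, one column per base model):

```r
# Sketch: untrained ensembles from an N x p matrix of base-model forecasts.
set.seed(1)
fmat <- cbind(m1 = rnorm(10, 100, 5), m2 = rnorm(10, 100, 5), m3 = rnorm(10, 100, 5))
ens_avg <- rowMeans(fmat)          # simple average
ens_med <- apply(fmat, 1, median)  # simple median (more robust)
```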
ensemble should have accuracy competitive with best individual constituent model
ensemble should have greater robustness than any individual constituent model
Typically these are hard to accomplish simultaneously.
\[ \begin{aligned} &\mathop{\mathrm{minimize}}_{w \in \mathbb{R}^p} \sum_{s=t_0+1}^t \bigg( y_s - \sum_{j=1}^p w_j \cdot \hat{y}^j_{s|s-h} \bigg)^2 \\ &\text{subject to} \quad \sum_{j=1}^p w_j = 1, \;\;\text{and} \;\; w_j \geq 0, \; j=1,\dots,p \end{aligned} \]
then use
\[ \hat{y}^{\text{stack}}_{t+h|t} = \sum_{j=1}^p \hat{w}^t_j \cdot \hat{y}^j_{t+h|t} \]
The stacking optimization problem uses forward-looking predictions
(as in time series cross-validation)
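The constrained least-squares problem above is a small quadratic program. One hedged way to solve it in R is with {quadprog} (the past-forecast matrix fmat and outcomes yobs below are hypothetical toy inputs):

```r
# Sketch: stacking weights via quadprog::solve.QP, which minimizes
# (1/2) w' D w - d' w  subject to  A' w >= b (first meq rows as equalities).
library(quadprog)
set.seed(1)
yobs <- rnorm(50, 100, 10)                                     # toy outcomes
fmat <- cbind(yobs + rnorm(50, 0, 5), yobs + rnorm(50, 2, 8))  # toy forecasts
p   <- ncol(fmat)
sol <- solve.QP(
  Dmat = crossprod(fmat) + 1e-8 * diag(p),  # D = F'F (jittered for stability)
  dvec = drop(crossprod(fmat, yobs)),       # d = F'y, so objective = least squares
  Amat = cbind(rep(1, p), diag(p)),         # sum(w) = 1, then w_j >= 0
  bvec = c(1, rep(0, p)),
  meq  = 1                                  # first constraint is an equality
)
w_hat <- sol$solution                       # stacking weights
```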
Prediction intervals often have empirical coverage \(\ll\) nominal coverage
Recalibration methods adjust the intervals so that empirical coverage matches the nominal level.
The idea: produce calibrated prediction intervals from base forecasts and scores.
In the simplest case, we can take the score to be absolute error of point forecasts:
\[e_t = |y_t - \hat y_{t|t-1}|\]
Let \(\hat q_{t}^{1-\alpha}\) be a predicted level \(1-\alpha\) quantile of the distribution of \(e_t\)
Define \(I_{t|t-1}^{1-\alpha} = [\hat{y}_{t|t-1} - \hat{q}_t^{1-\alpha}, \; \hat{y}_{t|t-1} + \hat{q}_t^{1-\alpha}]\).
Note that \(e_t \leq \hat{q}_t^{1-\alpha} \iff y_t \in I_{t|t-1}^{1-\alpha}\)
Therefore we have reduced the problem of producing prediction intervals \(I_{t|t-1}^{1-\alpha}\) to one of tracking a quantile of \(e_t\)
\[\hat q_{t+1}^{1-\alpha} = \begin{cases} \hat q_{t}^{1-\alpha} + \eta(1-\alpha) & \text{if } y_t\notin I_{t|t-1}^{1-\alpha} \\ \hat q_{t}^{1-\alpha} - \eta\alpha & \text{if } y_t\in I_{t|t-1}^{1-\alpha} \end{cases}\]
In words: after a miss, widen the next interval (increase the quantile estimate); after a hit, shrink it slightly.
This method has the following guarantee:
\[ \Bigg| \frac{1}{T} \sum_{t=t_0+1}^{t_0+T} 1 \big\{ y_t \in I_{t|t-1}^{1-\alpha} \big\} - (1-\alpha) \Bigg| \leq \frac{b/\eta + 1}{T} \]
where \(b\) is a bound on the errors (largest error possible/observable).
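A minimal sketch of this update as an online loop (the score sequence e is a hypothetical input, e.g. absolute errors of point forecasts):

```r
# Sketch: online quantile tracking for the scores e_t.
# A miss (e_t above the current quantile) pushes it up by eta * (1 - alpha);
# a hit pushes it down by eta * alpha, matching the update rule above.
track_quantile <- function(e, alpha = 0.2, eta = 0.1, q0 = 0) {
  q <- numeric(length(e) + 1)
  q[1] <- q0
  for (t in seq_along(e)) {
    miss <- e[t] > q[t]                    # equivalent to y_t outside I_{t|t-1}
    q[t + 1] <- q[t] + if (miss) eta * (1 - alpha) else -eta * alpha
  }
  q                                        # trajectory of hat q_t^{1-alpha}
}
```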