{epiprocess} & {epipredict}

R packages to ramp up forecasting systems


Daniel J. McDonald, Ryan J. Tibshirani, Logan C. Brooks

and CMU’s Delphi Group

Stanford STATS/BIODS 352 — 12 April 2023

Background

  • The Covid-19 pandemic required forecasting systems to be implemented quickly.

  • Basic processing (outlier detection, reporting issues, geographic granularity) was implemented independently in parallel, and was error prone

  • Data revisions complicate evaluation

  • Simple models often outperformed complicated ones

  • Custom software not easily adapted / improved by other groups

  • Hard for public health actors to borrow / customize community techniques

{epiprocess}

Basic processing operations and data structures

  • General EDA for “panel data”
  • Calculate rolling statistics
  • Fill / impute gaps
  • Examine correlations
  • Store revision history smartly
  • Inspect revision patterns
  • Find / correct outliers

{epiprocess} Data Structures

epi_df: snapshot of a data set

  • a tibble with two required columns, geo_value and time_value
  • arbitrary additional columns containing “measured” values, called “signals”
  • additional “keys” that index subsets (age_group, ethnicity, etc.)

epi_df

Represents a snapshot that contains the most up-to-date values of the signal variables, as of a given time.

{epiprocess} Data Structures

epi_archive: collection of epi_dfs

  • full version history of a data set
  • acts like a bunch of epi_dfs
  • but stored “compactly”
  • Allows you to do things you would do on an epi_df but based on the data that “would have been available at the time”
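
To make the "as it would have been available at the time" idea concrete, here is a minimal base-R sketch. It does not use the {epiprocess} API: the `history` table, its values, and the helper `snapshot_as_of()` are all hypothetical, and the real epi_archive may store and query versions differently.

```r
# Toy version history: one row per (geo_value, time_value, version).
# All values are made up for illustration.
history <- data.frame(
  geo_value  = c("ca", "ca", "ca", "ca"),
  time_value = as.Date(c("2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02")),
  version    = as.Date(c("2021-01-02", "2021-01-05", "2021-01-03", "2021-01-06")),
  cases      = c(100, 120, 90, 95)
)

# Hypothetical helper: the data as it would have looked on `as_of`.
snapshot_as_of <- function(history, as_of) {
  # Keep only the versions that had been published by `as_of`.
  known <- history[history$version <= as_of, ]
  # Sort so the most recent version of each observation comes last ...
  known <- known[order(known$geo_value, known$time_value, known$version), ]
  # ... then keep only that last row per (geo_value, time_value).
  is_latest <- !duplicated(known[, c("geo_value", "time_value")], fromLast = TRUE)
  snap <- known[is_latest, c("geo_value", "time_value", "cases")]
  rownames(snap) <- NULL
  snap
}

snap_early <- snapshot_as_of(history, as.Date("2021-01-04"))  # before revisions
snap_late  <- snapshot_as_of(history, as.Date("2021-01-07"))  # after revisions
```

Storing only the rows that changed between versions, rather than a full copy of every snapshot, is one way to read "compactly" here.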

Revisions

Epidemiology data gets revised frequently. (Happens in Economics as well.)

  • We may want to use the data “as it looked in the past”
  • or we may want to examine “the history of revisions”.

Revision patterns

Simple sliding computations

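library(dplyr)
library(epiprocess)
# 14-day trailing average of cases, computed separately for each location
dav14 <- jhu_csse_daily_subset %>%
  group_by(geo_value) %>%
  epi_slide(cases_14dav = mean(cases), n = 14)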
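
For intuition about what a trailing average computes, here is a small base-R sketch on a single made-up series. The helper `trailing_mean()` is hypothetical and is not part of {epiprocess}; how `epi_slide()` treats incomplete windows at the start of a series may differ from what is assumed here.

```r
# Hypothetical helper: mean of the current value and up to (n - 1) preceding
# values. Early positions average however many values are available.
trailing_mean <- function(x, n) {
  vapply(seq_along(x), function(i) {
    mean(x[max(1, i - n + 1):i])
  }, numeric(1))
}

cases <- c(2, 4, 6, 8, 10)                  # made-up daily counts, one location
cases_3dav <- trailing_mean(cases, n = 3)   # short window so it is easy to check
```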

{epipredict}

Framework for building custom forecasters from modular components.

  1. Preprocessor: do things to the data before model training
  2. Trainer: train a model on data, resulting in an object
  3. Predictor: make predictions, using a fitted model object
  4. Postprocessor: do things to the predictions before returning
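
The four stages can be sketched as plain R functions chained together. This is only an illustration of the division of labor, not how {epipredict} is implemented: the function names and the toy series are assumptions.

```r
# Toy series for a single location (made-up values).
y <- c(5, 7, 6, 9, 12, 11, 15, 14, 18, 20)

# 1. Preprocessor: pair each value with the value one step later.
preprocess <- function(y) {
  n <- length(y)
  data.frame(lag0 = y[-n], target = y[-1])
}

# 2. Trainer: fit a linear model, returning a fitted object.
train <- function(features) lm(target ~ lag0, data = features)

# 3. Predictor: apply the fitted object to the most recent observation.
predict_next <- function(fit, y) {
  unname(predict(fit, newdata = data.frame(lag0 = y[length(y)])))
}

# 4. Postprocessor: counts cannot be negative, so clamp at zero.
postprocess <- function(pred) pmax(pred, 0)

forecast <- postprocess(predict_next(train(preprocess(y)), y))
```

Because each stage only depends on the output of the previous one, any stage can be swapped (a different model in step 2, say) without touching the others.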

A very specialized plug-in to {tidymodels}

Making dumb (but useful!) forecasts in epidemiology

  • Suppose we want to predict new hospitalizations \(y\), \(h\) days ahead, at many locations \(j\).

  • We’re going to make a new forecast each week.

Flatline forecaster

For each location, predict \[\hat{y}_{j,i+h} = y_{j,i}\]
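
A minimal base-R sketch of the flatline rule, on made-up data. The column names `forecast_date`, `target_date`, and `.pred` are chosen for illustration, and `h = 7` is an arbitrary horizon.

```r
# Made-up hospitalization counts, one row per location and day.
hosp <- data.frame(
  geo_value  = rep(c("ca", "ny"), each = 3),
  time_value = rep(as.Date("2023-04-01") + 0:2, times = 2),
  y          = c(10, 12, 11, 30, 28, 27)
)
h <- 7  # forecast horizon in days (arbitrary)

# Latest observation per location ...
hosp <- hosp[order(hosp$geo_value, hosp$time_value), ]
latest <- hosp[!duplicated(hosp$geo_value, fromLast = TRUE), ]

# ... carried forward unchanged to the target date.
flatline <- data.frame(
  geo_value     = latest$geo_value,
  forecast_date = latest$time_value,
  target_date   = latest$time_value + h,
  .pred         = latest$y
)
```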

AR forecaster

Use an AR model with a covariate \(x\), for example: \[\hat{y}_{j,i+h} = \mu + a_0 y_{j,i} + a_7 y_{j,i-7} + b_0 x_{j,i} + b_7 x_{j,i-7}\]
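
A rough base-R sketch of fitting such a model by least squares for a single location: regress the outcome a fixed number of days ahead (here `h = 7`, an arbitrary choice) on lags 0 and 7 of the outcome and of a covariate. The data are simulated, and the helpers `lag_by()` and `lead_by()` are hypothetical; with many locations, one option is to stack each location's rows before fitting.

```r
set.seed(1)
n <- 120
x <- 10 + as.numeric(arima.sim(model = list(ar = 0.8), n = n))  # made-up covariate
y <- 2 + 0.5 * x + rnorm(n, sd = 0.2)                           # made-up outcome
h <- 7                                                          # days ahead

# Hypothetical helpers: value of v at index i - k / i + k (NA where unavailable).
lag_by  <- function(v, k) c(rep(NA, k), v[seq_len(length(v) - k)])
lead_by <- function(v, k) c(v[-seq_len(k)], rep(NA, k))

# One row per day i: response y[i + h], features at days i and i - 7.
train <- data.frame(
  target = lead_by(y, h),
  y0 = y, y7 = lag_by(y, 7),
  x0 = x, x7 = lag_by(x, 7)
)

# lm() drops rows with missing values by default, so the first 7 and
# last h rows are excluded from the fit.
fit <- lm(target ~ y0 + y7 + x0 + x7, data = train)

# Forecast from the most recent day, whose features are fully observed.
latest <- train[n, c("y0", "y7", "x0", "x7")]
y_hat <- unname(predict(fit, newdata = latest))
```

The intercept plays the role of \(\mu\), and the coefficients on `y0`, `y7`, `x0`, `x7` correspond to \(a_0, a_7, b_0, b_7\).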

{epipredict}

A forecasting framework

  • Flatline forecaster
  • AR-type models
  • Backtest using the versioned data
  • Easily create features
  • Quickly pivot to new tasks
  • Highly customizable for advanced users

{epipredict}

Canned forecasters that work out of the box.

You can do a limited amount of customization.

We currently provide:

  • Baseline flat-line forecaster
  • Autoregressive-type forecaster
  • Autoregressive-type classifier

Basic autoregressive forecaster

  • Predict death_rate, 1 week ahead, with 0, 7, 14 day lags of cases and deaths.
  • Use lm for estimation. Also create “intervals”.

library(epipredict)
jhu <- case_death_rate_subset # grab some built-in data
canned <- arx_forecaster(
  epi_data = jhu, 
  outcome = "death_rate", 
  predictors = c("case_rate", "death_rate")
)

The output is a model object that could be reused in the future, along with the predictions for 7 days from now.

Adjust lots of built-in options

rf <- arx_forecaster(
  epi_data = jhu, 
  outcome = "death_rate", 
  # "fb-survey" stands in for an extra signal; it must exist as a column of epi_data
  predictors = c("case_rate", "death_rate", "fb-survey"),
  trainer = parsnip::rand_forest(mode = "regression"), # use ranger
  args_list = arx_args_list(
    ahead = 14, # 2-week horizon
    lags = list(c(0:4, 7, 14), c(0, 7, 14), c(0:7, 14)), # bunch of lags
    levels = c(0.01, 0.025, 1:19/20, 0.975, 0.99), # 23 ForecastHub quantiles
    quantile_by_key = "geo_value" # vary q-forecasts by location
  )
)

Do (almost) anything manually

# A preprocessing "recipe" that turns raw data into features / response
r <- epi_recipe(jhu) %>%
  step_epi_lag(case_rate, lag = c(0, 1, 2, 3, 7, 14)) %>%
  step_epi_lag(death_rate, lag = c(0, 7, 14)) %>%
  step_epi_ahead(death_rate, ahead = 14) %>%
  step_epi_naomit()

# A postprocessing routine describing what to do to the predictions
f <- frosting() %>%
  layer_predict() %>%
  layer_threshold(.pred, lower = 0) %>% # predictions/intervals should be non-negative
  layer_add_target_date(target_date = max(jhu$time_value) + 14) %>%
  layer_add_forecast_date(forecast_date = max(jhu$time_value))

# Bundle up the preprocessor, training engine, and postprocessor
# We use quantile regression
ewf <- epi_workflow(r, quantile_reg(tau = c(.1, .5, .9)), f)

# Fit it to data (we could fit this to ANY data that has the same format)
trained_ewf <- ewf %>% fit(jhu)

# examines the recipe to determine what we need to make the prediction
latest <- get_test_data(r, jhu)

# we could make predictions using the same model on ANY test data
preds <- trained_ewf %>% predict(new_data = latest)

Pivot to some online examples

Long book on {epipredict}, in progress…

Packages are under active development

Thanks:

  • The whole CMU Delphi Team (across many institutions)
  • Optum/UnitedHealthcare, Change Healthcare
  • Google, Facebook, Amazon Web Services
  • Quidel, SafeGraph, Qualtrics
  • Centers for Disease Control and Prevention
  • Council of State and Territorial Epidemiologists