flowchart LR
A("{epidatr}") --> C("{epiprocess}")
B("{epidatpy}") --> C
D("{covidcast}") --> C
C --> E("{epipredict}")
{epiprocess} and {epipredict}R packages for signal processing and forecastingCDC Flu Site Visit — 19 July 2023
flowchart LR
A("{epidatr}") --> C("{epiprocess}")
B("{epidatpy}") --> C
D("{covidcast}") --> C
C --> E("{epipredict}")
{epiprocess}Tools for signal processing
epi_df: snapshot of a data setgeo_value and time_value.age_group, ethnicity, etc.)epi_df
Represents a snapshot that contains the most up-to-date values of the signal variables, as of a given time.
epi_df: snapshot of a data setcase_death_rate_subset package:epipredict R Documentation
Subset of JHU daily state cases and deaths
Description:
This data source of confirmed COVID-19 cases and deaths is based
on reports made available by the Center for Systems Science and
Engineering at Johns Hopkins University. This example data ranges
from Dec 31, 2020 to Dec 31, 2021, and includes all states.
Usage:
case_death_rate_subset
Format:
A tibble with 20,496 rows and 4 variables:
geo_value the geographic value associated with each row of
measurements.
time_value the time value associated with each row of
measurements.
case_rate 7-day average signal of number of new confirmed COVID-19
cases per 100,000 population, daily
death_rate 7-day average signal of number of new confirmed deaths
due to COVID-19 per 100,000 population, daily
Source:
This object contains a modified part of the COVID-19 Data
Repository by the Center for Systems Science and Engineering
(CSSE) at Johns Hopkins University as republished in the COVIDcast
Epidata API. This data set is licensed under the terms of the
Creative Commons Attribution 4.0 International license by the
Johns Hopkins University on behalf of its Center for Systems
Science in Engineering. Copyright Johns Hopkins University 2020.
Modifications:
• From the COVIDcast Epidata API: These signals are taken
directly from the JHU CSSE COVID-19 GitHub repository without
changes. The 7-day average signals are computed by Delphi by
calculating moving averages of the preceding 7 days, so the
signal for June 7 is the average of the underlying data for
June 1 through 7, inclusive.
An `epi_df` object, 20,496 x 4 with metadata:
* geo_type = state
* time_type = day
* as_of = 2022-05-31 12:08:25.791826
# A tibble: 20,496 × 4
geo_value time_value case_rate death_rate
* <chr> <date> <dbl> <dbl>
1 ak 2020-12-31 35.9 0.158
2 al 2020-12-31 65.1 0.438
3 ar 2020-12-31 66.0 1.27
4 as 2020-12-31 0 0
5 az 2020-12-31 76.8 1.10
6 ca 2020-12-31 96.0 0.751
7 co 2020-12-31 35.8 0.649
8 ct 2020-12-31 52.1 0.819
9 dc 2020-12-31 31.0 0.601
10 de 2020-12-31 65.2 0.807
# ℹ 20,486 more rows
epi_dfepi_dfepi_archive: collection of epi_dfsepi_dfs — but stored compactlyepi_df but using only data that would have been available at the timeRevisions
Epidemiology data gets revised frequently. (Happens in Economics as well.)
archive_cases_dv_subset package:epiprocess R Documentation
Subset of daily doctor visits and cases in archive format
Description:
This data source is based on information about outpatient visits,
provided to us by health system partners, and also contains
confirmed COVID-19 cases based on reports made available by the
Center for Systems Science and Engineering at Johns Hopkins
University. This example data ranges from June 1, 2020 to Dec 1,
2021, and is also limited to California, Florida, Texas, and New
York.
Usage:
archive_cases_dv_subset
Format:
An 'epi_archive' data format. The data table DT has 129,638 rows
and 5 columns:
geo_value the geographic value associated with each row of
measurements.
time_value the time value associated with each row of
measurements.
version the time value specifying the version for each row of
measurements.
percent_cli percentage of doctor’s visits with CLI (COVID-like
illness) computed from medical insurance claims
case_rate_7d_av 7-day average signal of number of new confirmed
deaths due to COVID-19 per 100,000 population, daily
Source:
This object contains a modified part of the COVID-19 Data
Repository by the Center for Systems Science and Engineering
(CSSE) at Johns Hopkins University as republished in the COVIDcast
Epidata API. This data set is licensed under the terms of the
Creative Commons Attribution 4.0 International license by Johns
Hopkins University on behalf of its Center for Systems Science in
Engineering. Copyright Johns Hopkins University 2020.
Modifications:
• From the COVIDcast Doctor Visits API: The signal
'percent_cli' is taken directly from the API without changes.
• From the COVIDcast Epidata API: 'case_rate_7d_av' signal was
computed by Delphi from the original JHU-CSSE data by
calculating moving averages of the preceding 7 days, so the
signal for June 7 is the average of the underlying data for
June 1 through 7, inclusive.
• Furthermore, the data is a subset of the full dataset, the
signal names slightly altered, and formatted into a tibble.
How stale is my data over space and time?
There are (lots) of possible definitions.
time_value available for some location, as of some version
NAs and skipped time_values are latent
NA values are latent
pos_first_true <- function(x) match(TRUE, rev(x), length(x) + 1) - 1
na0_latency <- function(dat, group_key, ref_time_value) {
group_by(dat, geo_value) %>%
complete(time_value = full_seq(time_value, period = 1L)) %>%
mutate(not_na0 = !(is.na(admissions) | admissions == 0L)) %>%
summarize(
nominal_latency = as.integer(ref_time_value - max(time_value)),
na0_latency = pos_first_true(not_na0) + nominal_latency
)
}
hhs_flu_hosp_archive %>%
epix_slide(na0_latency, before = 27, names_sep = NULL) %>%
rename(version = time_value){epipredict}A forecasting framework
We should build up modular components
Be able to add layers sequentially
A very specialized plug-in to {tidymodels}
{epipredict}But, you can adjust a lot of options
But, you can adjust a lot of options
On day \(t\)
We want to predict
We’re going to make a new forecast each week.
But, you can adjust a lot of options
For each location, predict \[\hat{y}_{j,\ t+h} = y_{j,\ t}\]
But, you can adjust a lot of options
Use an AR model with an extra feature, e.g.: \[\hat{y}_{j,\ t+h} = \mu + a_0 y_{j,\ t} + a_7 y_{j,\ t-7} + b_0 x_{j,\ t} + b_7 x_{j,\ t-7}\]
death_rate, 1 week ahead, with 0,7,14 day lags of cases and deaths.lm for estimation. Also create intervals.The output is a model object that could be reused in the future, along with the predictions for 7 days from now.
rf <- arx_forecaster(
epi_data = edf,
outcome = "death_rate",
predictors = c("case_rate", "death_rate", "fb-survey"),
trainer = parsnip::rand_forest(mode = "regression"), # use {ranger}
args_list = arx_args_list(
ahead = 14, # 2-week horizon
lags = list(
case_rate = c(0:4, 7, 14),
death_rate = c(0, 7, 14),
`fb-survey` = c(0:7, 14)
),
levels = c(0.01, 0.025, 1:19/20, 0.975, 0.99), # 23 ForecastHub quantiles
quantile_by_key = "geo_value" # vary noise model by location
)
)# A preprocessing "recipe" that turns raw data into features / response
r <- epi_recipe(edf) %>%
step_epi_lag(case_rate, lag = c(0, 1, 2, 3, 7, 14)) %>%
step_epi_lag(death_rate, lag = c(0, 7, 14)) %>%
step_epi_ahead(death_rate, ahead = 14) %>%
step_epi_naomit()
# A postprocessing routine describing what to do to the predictions
f <- frosting() %>%
layer_predict() %>%
layer_threshold(.pred, lower = 0) %>% # predictions / intervals should be non-negative
layer_add_target_date(target_date = max(edf$time_value) + 14) %>%
layer_add_forecast_date(forecast_date = max(edf$time_value))
# Bundle up the preprocessor, training engine, and postprocessor
# We use quantile regression
ewf <- epi_workflow(r, quantile_reg(tau = c(.1, .5, .9)), f)
# Fit it to data (we could fit this to ANY data that has the same format)
trained_ewf <- ewf %>% fit(edf)
# examines the recipe to determine what we need to make the prediction
latest <- get_test_data(r, edf)
# we could make predictions using the same model on ANY test data
preds <- trained_ewf %>% predict(new_data = latest)fd <- as.Date("2021-11-30")
geos <- c("ut", "ca")
tedf <- edf %>% filter(time_value >= fd)
# use most recent 3 months for training
edf <- edf %>% filter(time_value < fd, time_value >= fd - 90L)
rec <- epi_recipe(edf) %>%
step_epi_lag(case_rate, lag = c(0, 7, 14, 21)) %>%
step_epi_lag(death_rate, lag = c(0, 7, 14)) %>%
step_epi_ahead(death_rate, ahead = 1:28)
f <- frosting() %>%
layer_predict() %>%
layer_unnest(.pred) %>%
layer_naomit(distn) %>%
layer_add_forecast_date() %>%
layer_threshold(distn)
ee <- smooth_quantile_reg(
tau = c(.1, .25, .5, .75, .9),
outcome_locations = 1:28,
degree = 3L
)
ewf <- epi_workflow(rec, ee, f)
the_fit <- ewf %>% fit(edf)
latest <- get_test_data(rec, edf, fill_locf = TRUE)
preds <- predict(the_fit, new_data = latest) %>%
mutate(forecast_date = fd, target_date = fd + ahead) %>%
select(geo_value, target_date, distn) %>%
pivot_quantiles(distn) %>%
filter(geo_value %in% geos)
ggplot(preds) +
geom_ribbon(aes(target_date, ymin = `0.1`, ymax = `0.9`),
fill = "cornflowerblue", alpha = .8) +
geom_ribbon(aes(target_date, ymin = `0.25`, ymax = `0.75`),
fill = "#00488E", alpha = .8) +
geom_line(data = bind_rows(tedf, edf) %>% filter(geo_value %in% geos),
aes(time_value, death_rate), size = 1.5) +
geom_line(aes(target_date, `0.5`), color = "orange", size = 1.5) +
geom_vline(xintercept = fd) +
facet_wrap(~geo_value) +
theme_bw(16) +
scale_x_date(name = "", date_labels = "%b %Y") +
ylab("Deaths per 100K inhabitants")
Packages for signal processing and forecasting