library(rtestim)
library(ggplot2)
library(dplyr)
library(nnet)
library(forcats)
library(tidyr)
theme_set(theme_bw())
This package accommodates 3 different ways of specifying a delay distribution in the renewal equation, and this vignette illustrates these in turn. The renewal equation specifies that the expected incidence at time is a weighted sum of previous incidence, multiplied by : calculated by convolving the preceding days of new infections with the discretized generation (or serial) interval distribution of length . It is the allowable distributions that are the focus.
Discretization
First, it is important to recognize that the sequence is usually a discretization of a probability density function, most frequently gamma or Weibull. This is because observed incidence happens at discrete time points like days or weeks, while time is continuous. Using the default (parametric) delay distributions requires calculating a discrete approximation.
Default (parametric) delay distribution
By default, estimate_rt()
uses a gamma distribution
parameterized by the shape
and scale
.
This density has pdf
where
is the indicator function. The mean of this distribution is
,
and the variance is
.
Both
and
must be greater than 0. The figure below shows a few examples densities
from this family.
The default is the orange curve shown above. Given an incidence
sequence of length
,
internally, delay_calculator()
creates
by
and then renormalizing with
w / sum(w)
to ensure that it sums to 1. Here,
is the cumulative distribution function for Gamma (though similar works
for any continuous distribution). Note that this formula assumes that
the probability of a 0 delay is 0 and that the probability of a delay of
is really
When using this default, the delay distribution is necessarily
assumed to be the same for all
,
and we compute the convolution of w
and I
with
the Fast Fourier Transform. Finally, an adjustment is made to handle the
first observation. We define
.
Just to illustrate this behaviour, we show the results for the
default setting on the included cancovid
data.
can_default <- estimate_rt(cancovid$incident_cases, x = cancovid$date, nsol = 20L)
plot(can_default) + coord_cartesian(ylim = c(0.5, 2))
Constant, non-parametric delay distribution
If we don’t believe that the Gamma distribution closely approximates the serial interval distribution, then we can specify our own distribution manually. For example, Backer et al., Table S1 gives observed serial intervals for the Omicron (SGTF) and Delta (non-SGTF) COVID-19 variants during 2 weeks in 2021 in the Netherlands. For the sake of illustration, we aggregate these together and use this as our (constant) delay distribution.
# Data from Backer et al.
delay <- read.csv("backer.csv") %>%
filter(delay > 0) %>%
select(-delay)
delay <- rowSums(delay)
delay <- delay / sum(delay)
This distribution looks something like a gamma, but it has finite support. We can easily use it instead.
can_nonpar <- estimate_rt(
cancovid$incident_cases,
x = cancovid$date,
delay_distn = delay,
nsol = 20L)
plot(can_nonpar) + coord_cartesian(ylim = c(0.5, 2))
The result is much less dramatic than the previous version. This is likely because the distribution is much more concentrated near short delays. We can see this by examining the two CDFs.
Time-varying delays
Finally, we also allow time-varying delay distributions. This is accomplished with a matrix. This requires a bit of work, but is not too challenging. For example, to create the correct matrix using the Baker et al. delays, the necessary code is the following.
# library(Matrix)
n <- nrow(cancovid)
backer_delay <- c(0, delay, rep(0, n - length(delay) - 1))
delay_mat <- matrix(0, n, n)
delay_mat[1,1] <- 1
for (iter in 2:n) delay_mat[iter, 1:iter] <- rev(backer_delay[1:iter])
delay_mat <- drop0(as(delay_mat, "CsparseMatrix")) # make it sparse, not necessary
delay_mat <- delay_mat / rowSums(delay_mat) # renormalize
We could then simply pass delay_mat
to
estimate_rt(..., delay_distn = delay_mat)
. The result would
be the same as that shown above.
Variant-specific delays
For a more illustrative version of the previous example, we revisit the Canadian data, but combine it with some additional information.
We use variant circulation information from CoVaRR-Net’s Duotang notebook.
# Run on 19 April 2024
duotang <- read_tsv("https://github.com/CoVaRR-NET/duotang/raw/main/data_needed/virusseq.metadata.csv.gz")
columnlist <- c(
"fasta_header_name", "province", "host_gender", "host_age_bin",
"sample_collection_date", "sample_collected_by",
"purpose_of_sampling", "purpose_of_sequencing", "lineage",
"raw_lineage", "gisaid_accession", "isolate"
)
unknown.str <- c(
"Undeclared", "Not Provided", "Restricted Access", "Missing",
"Not Applicable", "", "NA", "unknow"
)
duotang <- duotang %>%
rename(province = geo_loc_name_state_province_territory) %>%
select(all_of(columnlist))
meta <- duotang %>%
mutate(
week = cut(sample_collection_date, "week"),
month = gsub("-..$", "", as.character(cut(sample_collection_date, "month")))
)
source("https://github.com/CoVaRR-NET/duotang/raw/main/scripts/scanlineages.R")
meta <- meta %>%
mutate(gisaid_accession = str_replace(gisaid_accession, "EPI_ISL_", "")) %>%
rename(GID = gisaid_accession) %>%
rowwise() %>%
mutate(raw_lineage = ifelse(
grepl("^X", raw_lineage),
str_replace_all(paste0(
realtorawlineage(substr(
raw_lineage, 1, str_locate(raw_lineage, "\\.") - 1
)),
".",
substr(raw_lineage, str_locate(raw_lineage, "\\.") + 1, nchar(raw_lineage))
), "[\r\n]", ""),
raw_lineage
)) %>%
ungroup()
dico <- makepangolindico() # rebuild the lineage dictionary so the correct names gets assigned for XBB descedants not named XBB
VOCVOI <- read_csv("https://raw.githubusercontent.com/CoVaRR-NET/duotang/main/resources/vocvoi.csv")
meta$pango_group <- create.pango.group(VOCVOI, meta)
meta <- select(meta, province, week, pango_group) %>%
mutate(week = as.Date(week))
counts <- group_by(meta, province, week, pango_group) %>%
count() %>%
ungroup() %>%
arrange(province, week, pango_group)
can_counts <- group_by(meta, week, pango_group) %>%
count() %>%
ungroup() %>%
arrange(week, pango_group) %>%
mutate(province = "Canada")
counts <- bind_rows(can_counts, counts)
saveRDS(counts, "duotang-counts.rds")
We smooth the raw data using multinomial logistic regression on a third order orthogonal polynomial to produce the following estimated variant proportions in Canada.
Using the estimated proportions, we label each date with the dominant variant at the time (and restrict ourselves to the time period of the included case data).
Creating delay distributions
We use the delay distribution data from the meta analysis in Xu, X., Wu, Y., Kummer, A.G. et al.. The raw data for their analysis is available on GitHub. We use a simple version of their procedure, taking the median calculations across studies to find the mean and standard deviation of the delay for each variant separately. Then we convert these to the and parameters for the gamma distribution.
data_raw <- readxl::read_excel("xu-etal-DATA_RAW.xlsx") |>
select(type, para, n = Sample_size, mean, sd, se, median) |>
filter(!is.na(type)) |>
mutate(across(-c(type, para), as.numeric))
bonehead_meta <- data_raw |>
group_by(type, para) |>
mutate(
no_n = all(is.na(n)),
n = case_when(!is.na(n) ~ n, no_n ~ 1, TRUE ~ median(n, na.rm = TRUE))
) |>
ungroup() |>
mutate(
mean = case_when(!is.na(mean) ~ mean, TRUE ~ median),
sd = case_when(!is.na(sd) ~ sd, TRUE ~ se * sqrt(n))
) |>
group_by(type, para) |>
summarise(
mean = median(mean, na.rm = TRUE),
sd = median(sd, na.rm = TRUE),
.groups = "drop"
)
## There's only one Beta and only IP.
## We use the corresponding sd for Alpha IP,
## and duplicate Alpha for GT / ST
Beta_IP <- bonehead_meta |> filter(type == "Beta")
Beta_IP$sd = bonehead_meta |>
filter(type == "Alpha", para == "IP") |> pull(sd)
Beta <- bind_rows(
Beta_IP,
bonehead_meta |> filter(type == "Alpha", para != "IP") |> mutate(type = "Beta")
)
delay_dstns_byvar <- bonehead_meta |>
filter(type != "Beta") |>
bind_rows(Beta) |>
arrange(type, para) |>
mutate(shape = mean^2 / sd^2, scale = mean / shape)
saveRDS(delay_dstns_byvar, "delay-distns-byvar.rds")
We’ll use just the Serial Intervals (SI), though the incubation periods (IP), and generation times (GT) are also available. Below, we visualize the estimated delays for the 4 variants that were most prevalent in Canada.
Finally, we use these delays during the period they were most prevalent to estimate Rt. First, we build the delay matrix.
library(Matrix)
n <- nrow(cancovid)
delay_mat <- matrix(0, n, n)
delay_mat[1,1] <- 1
for (iter in 2:n) {
current_var <- can_pred_class$var[iter]
current_pars <- delay_dstns_can |> filter(type == current_var)
delay <- discretize_gamma(0:(iter - 1), current_pars$shape, current_pars$scale)
delay_mat[iter, 1:iter] <- rev(delay)
}
delay_mat <- drop0(as(delay_mat, "CsparseMatrix")) # make it sparse, not necessary
delay_mat <- delay_mat / rowSums(delay_mat) # renormalize
Finally, we use this time-varying delay matrix to estimate Rt.
can_tvar <- estimate_rt(
cancovid$incident_cases,
x = cancovid$date,
delay_distn = delay_mat,
nsol = 20L)
plot(can_tvar) + coord_cartesian(ylim = c(0.5, 2))