Skip to contents

This package accommodates 3 different ways of specifying a delay distribution in the renewal equation, and this vignette illustrates these in turn. The renewal equation specifies that the expected incidence at time tt is a weighted sum of previous incidence, multiplied by RtR_t: 𝔼[ItI1,,It1]=Rta=1twaIta,\begin{equation} \mathbb{E}[I_t \mid I_1,\ldots,I_{t-1}] = R_t \sum_{a=1}^t w_a I_{t-a}, \end{equation} calculated by convolving the preceding aa days of new infections with the discretized generation (or serial) interval distribution ww of length tt. It is the allowable distributions {w}a=1n\{w\}_{a=1}^n that are the focus.

Discretization

First, it is important to recognize that the sequence {w}a=1n\{w\}_{a=1}^n is usually a discretization of a probability density function, most frequently gamma or Weibull. This is because observed incidence happens at discrete time points like days or weeks, while time is continuous. Using the default (parametric) delay distributions requires calculating a discrete approximation.

Default (parametric) delay distribution

By default, estimate_rt() uses a gamma distribution parameterized by the shape kk and scale θ\theta. This density has pdf fW(w)=1Γ(k)θkwk1ew/θI(w>0), f_W(w) = \frac{1}{\Gamma(k)\theta^k} w^{k-1} e^{-w/\theta} I(w > 0), where II is the indicator function. The mean of this distribution is kθk\theta, and the variance is kθ2k\theta^2. Both kk and θ\theta must be greater than 0. The figure below shows a few examples densities from this family.

The default is the orange curve shown above. Given an incidence sequence of length nn, internally, delay_calculator() creates {w}a=1n\{w\}_{a=1}^n by wa=F(a)F(a1), w_a = F(a) - F(a - 1), and then renormalizing with w / sum(w) to ensure that it sums to 1. Here, FF is the cumulative distribution function for Gamma (though similar works for any continuous distribution). Note that this formula assumes that the probability of a 0 delay is 0 and that the probability of a delay of aa is really a1afW(w)dw=F(a)F(a1). \int_{a-1}^a f_W(w) \, dw = F(a) - F(a-1).

When using this default, the delay distribution is necessarily assumed to be the same for all tt, and we compute the convolution of w and I with the Fast Fourier Transform. Finally, an adjustment is made to handle the first observation. We define I0=I1I_0 = I_1.

Just to illustrate this behaviour, we show the results for the default setting on the included cancovid data.

can_default <- estimate_rt(cancovid$incident_cases, x = cancovid$date, nsol = 20L)
plot(can_default) + coord_cartesian(ylim = c(0.5, 2))

Constant, non-parametric delay distribution

If we don’t believe that the Gamma distribution closely approximates the serial interval distribution, then we can specify our own distribution manually. For example, Backer et al., Table S1 gives observed serial intervals for the Omicron (SGTF) and Delta (non-SGTF) COVID-19 variants during 2 weeks in 2021 in the Netherlands. For the sake of illustration, we aggregate these together and use this as our (constant) delay distribution.

# Data from Backer et al.
delay <- read.csv("backer.csv") %>%
  filter(delay > 0) %>%
  select(-delay)
delay <- rowSums(delay)
delay <- delay / sum(delay)

This distribution looks something like a gamma, but it has finite support. We can easily use it instead.

can_nonpar <- estimate_rt(
  cancovid$incident_cases, 
  x = cancovid$date, 
  delay_distn = delay,
  nsol = 20L)
plot(can_nonpar) + coord_cartesian(ylim = c(0.5, 2))

The result is much less dramatic than the previous version. This is likely because the distribution is much more concentrated near short delays. We can see this by examining the two CDFs.

Time-varying delays

Finally, we also allow time-varying delay distributions. This is accomplished with a matrix. This requires a bit of work, but is not too challenging. For example, to create the correct matrix using the Baker et al. delays, the necessary code is the following.

# library(Matrix)
n <- nrow(cancovid)
backer_delay <- c(0, delay, rep(0, n - length(delay) - 1))
delay_mat <- matrix(0, n, n)
delay_mat[1,1] <- 1
for (iter in 2:n) delay_mat[iter, 1:iter] <- rev(backer_delay[1:iter])
delay_mat <- drop0(as(delay_mat, "CsparseMatrix")) # make it sparse, not necessary
delay_mat <- delay_mat / rowSums(delay_mat) # renormalize

We could then simply pass delay_mat to estimate_rt(..., delay_distn = delay_mat). The result would be the same as that shown above.

Variant-specific delays

For a more illustrative version of the previous example, we revisit the Canadian data, but combine it with some additional information.

We use variant circulation information from CoVaRR-Net’s Duotang notebook.

# Run on 19 April 2024
duotang <- read_tsv("https://github.com/CoVaRR-NET/duotang/raw/main/data_needed/virusseq.metadata.csv.gz")
columnlist <- c(
  "fasta_header_name", "province", "host_gender", "host_age_bin",
  "sample_collection_date", "sample_collected_by",
  "purpose_of_sampling", "purpose_of_sequencing", "lineage",
  "raw_lineage", "gisaid_accession", "isolate"
)
unknown.str <- c(
  "Undeclared", "Not Provided", "Restricted Access", "Missing",
  "Not Applicable", "", "NA", "unknow"
)
duotang <- duotang %>%
  rename(province = geo_loc_name_state_province_territory) %>%
  select(all_of(columnlist))

meta <- duotang %>%
  mutate(
    week = cut(sample_collection_date, "week"),
    month = gsub("-..$", "", as.character(cut(sample_collection_date, "month")))
  )
source("https://github.com/CoVaRR-NET/duotang/raw/main/scripts/scanlineages.R")
meta <- meta %>%
  mutate(gisaid_accession = str_replace(gisaid_accession, "EPI_ISL_", "")) %>%
  rename(GID = gisaid_accession) %>%
  rowwise() %>%
  mutate(raw_lineage = ifelse(
    grepl("^X", raw_lineage),
    str_replace_all(paste0(
      realtorawlineage(substr(
        raw_lineage, 1, str_locate(raw_lineage, "\\.") - 1
      )),
      ".",
      substr(raw_lineage, str_locate(raw_lineage, "\\.") + 1, nchar(raw_lineage))
    ), "[\r\n]", ""),
    raw_lineage
  )) %>%
  ungroup()
dico <- makepangolindico() # rebuild the lineage dictionary so the correct names gets assigned for XBB descedants not named XBB

VOCVOI <- read_csv("https://raw.githubusercontent.com/CoVaRR-NET/duotang/main/resources/vocvoi.csv")
meta$pango_group <- create.pango.group(VOCVOI, meta)
meta <- select(meta, province, week, pango_group) %>%
  mutate(week = as.Date(week))

counts <- group_by(meta, province, week, pango_group) %>%
  count() %>%
  ungroup() %>%
  arrange(province, week, pango_group)
can_counts <- group_by(meta, week, pango_group) %>%
  count() %>%
  ungroup() %>%
  arrange(week, pango_group) %>%
  mutate(province = "Canada")
counts <- bind_rows(can_counts, counts)
saveRDS(counts, "duotang-counts.rds")

We smooth the raw data using multinomial logistic regression on a third order orthogonal polynomial to produce the following estimated variant proportions in Canada.

Using the estimated proportions, we label each date with the dominant variant at the time (and restrict ourselves to the time period of the included case data).

Creating delay distributions

We use the delay distribution data from the meta analysis in Xu, X., Wu, Y., Kummer, A.G. et al.. The raw data for their analysis is available on GitHub. We use a simple version of their procedure, taking the median calculations across studies to find the mean and standard deviation of the delay for each variant separately. Then we convert these to the kk and θ\theta parameters for the gamma distribution.

data_raw <- readxl::read_excel("xu-etal-DATA_RAW.xlsx") |>
  select(type, para, n = Sample_size, mean, sd, se, median) |>
  filter(!is.na(type)) |>
  mutate(across(-c(type, para), as.numeric)) 

bonehead_meta <- data_raw |>
  group_by(type, para) |>
  mutate(
    no_n = all(is.na(n)),
    n = case_when(!is.na(n) ~ n, no_n ~ 1, TRUE ~ median(n, na.rm = TRUE))
  ) |>
  ungroup() |>
  mutate(
    mean = case_when(!is.na(mean) ~ mean, TRUE ~ median),
    sd = case_when(!is.na(sd) ~ sd, TRUE ~ se * sqrt(n))
  ) |>
  group_by(type, para) |>
  summarise(
    mean = median(mean, na.rm = TRUE),
    sd = median(sd, na.rm = TRUE),
    .groups = "drop"
  )
## There's only one Beta and only IP. 
## We use the corresponding sd for Alpha IP, 
## and duplicate Alpha for GT / ST

Beta_IP <- bonehead_meta |> filter(type == "Beta")
Beta_IP$sd = bonehead_meta |> 
  filter(type == "Alpha", para == "IP") |> pull(sd)
Beta <- bind_rows(
  Beta_IP,
  bonehead_meta |> filter(type == "Alpha", para != "IP") |> mutate(type = "Beta")
)

delay_dstns_byvar <- bonehead_meta |>
  filter(type != "Beta") |>
  bind_rows(Beta) |>
  arrange(type, para) |>
  mutate(shape = mean^2 / sd^2, scale = mean / shape)
saveRDS(delay_dstns_byvar, "delay-distns-byvar.rds")

We’ll use just the Serial Intervals (SI), though the incubation periods (IP), and generation times (GT) are also available. Below, we visualize the estimated delays for the 4 variants that were most prevalent in Canada.

Finally, we use these delays during the period they were most prevalent to estimate Rt. First, we build the delay matrix.

library(Matrix)
n <- nrow(cancovid)
delay_mat <- matrix(0, n, n)
delay_mat[1,1] <- 1
for (iter in 2:n) {
  current_var <- can_pred_class$var[iter]
  current_pars <- delay_dstns_can |> filter(type == current_var)
  delay <- discretize_gamma(0:(iter - 1), current_pars$shape, current_pars$scale)
  delay_mat[iter, 1:iter] <- rev(delay)
}
delay_mat <- drop0(as(delay_mat, "CsparseMatrix")) # make it sparse, not necessary
delay_mat <- delay_mat / rowSums(delay_mat) # renormalize

Finally, we use this time-varying delay matrix to estimate Rt.

can_tvar <- estimate_rt(
  cancovid$incident_cases, 
  x = cancovid$date, 
  delay_distn = delay_mat,
  nsol = 20L)
plot(can_tvar) + coord_cartesian(ylim = c(0.5, 2))