Epidemiological case definitions in R

07 August 2021

Introduction

In epidemiological analyses, an exact distinction between temporal events is not always possible. Therefore, static but reasonable cut-offs are used to distinguish one case of an event from another. This is an important aspect of most case definitions, for example, when distinguishing repeat or recurrent infections in a patient from the first occurrence of that infection.

Scripting such case definitions in R can be challenging. The episodes() and partitions() functions in diyar provide a convenient and flexible solution to this. They link events into a temporal sequence, creating a unique group identifier with useful information about each group. These identifiers can then be used for record deduplication or further analyses.

Overview

The group identifiers created by episodes() and partitions() are called episodes (epid class) and panes (pane class) respectively. episodes() creates three types of episodes: "fixed", "rolling" and "recursive".

In diyar, a fixed episode is created by linking an index event to other events occurring within a specified period from it. This results in a "Case" (index event) and related duplicate events ("Duplicate_C").

In a rolling episode, the process is repeated using another event from the existing episode as the reference event. This results in a "Recurrent" event and additional duplicate ("Duplicate_R") events. Here, this repetition is referred to as a recurrence. Unless specified otherwise, recurrences continue until there are no more events within the period of recurrence. When this happens, the chain of recurrence ends and so does the episode. A recursive episode is similar to a rolling episode except that every event in the existing episode is used as a reference event.

In contrast, a pane is created by separating events into set periods in time. The events in a pane have no relationship with each other, other than occurring in the same period or numeric interval. See the figure and example below.

# Events
event_dt <- seq(from = as.Date("2021-01-01"), to = as.Date("2021-01-11"), by = 1)
s_data <- data.frame(date = event_dt)
# Attribute 1 - Source of infection
attr_1 <- c("BSI", "UTI", "RTI", "RTI", "BSI", "BSI", "BSI", "RTI", "RTI", "BSI", "RTI")
# Attribute 2 - Location 
attr_2 <- c("Ward 1", "Ward 1", "Ward 3", "Ward 3", "Ward 2", "Ward 2", 
            "Ward 1", "Ward 1", "Ward 3","Ward 3", "Ward 2")
s_data$attr <- attr_1
# Fixed episodes
s_data$ep1 <- episodes(event_dt, case_length = 5, episode_type = "fixed")
# Rolling episodes
s_data$ep2 <- episodes(event_dt, case_length = 5, episode_type = "rolling",
                       group_stats = TRUE, data_source = attr_1)
# Recursive episodes
s_data$ep3 <- episodes(event_dt, case_length = 5, episode_type = "recursive")
# Panes
s_data$pn1 <- partitions(event_dt, length.out = 2, separate = TRUE)

# Identifiers
s_data
#>          date attr     ep1                              ep2     ep3      pn1
#> 1  2021-01-01  BSI E.1 (C) E.1 2021-01-01 -> 2021-01-11 (C) E.1 (C) PN.1 (I)
#> 2  2021-01-02  UTI E.1 (D) E.1 2021-01-01 -> 2021-01-11 (D) E.1 (D) PN.1 (D)
#> 3  2021-01-03  RTI E.1 (D) E.1 2021-01-01 -> 2021-01-11 (D) E.1 (D) PN.1 (D)
#> 4  2021-01-04  RTI E.1 (D) E.1 2021-01-01 -> 2021-01-11 (D) E.1 (D) PN.1 (D)
#> 5  2021-01-05  BSI E.1 (D) E.1 2021-01-01 -> 2021-01-11 (D) E.1 (D) PN.1 (D)
#> 6  2021-01-06  BSI E.1 (D) E.1 2021-01-01 -> 2021-01-11 (D) E.1 (D) PN.6 (I)
#> 7  2021-01-07  BSI E.7 (C) E.1 2021-01-01 -> 2021-01-11 (R) E.1 (R) PN.6 (D)
#> 8  2021-01-08  RTI E.7 (D) E.1 2021-01-01 -> 2021-01-11 (D) E.1 (D) PN.6 (D)
#> 9  2021-01-09  RTI E.7 (D) E.1 2021-01-01 -> 2021-01-11 (D) E.1 (D) PN.6 (D)
#> 10 2021-01-10  BSI E.7 (D) E.1 2021-01-01 -> 2021-01-11 (D) E.1 (D) PN.6 (D)
#> 11 2021-01-11  RTI E.7 (D) E.1 2021-01-01 -> 2021-01-11 (D) E.1 (D) PN.6 (D)

Each type of identifier has as.data.frame and as.list methods for easy access to its components.

# Components of an episode identifier
as.data.frame(s_data$ep2)
#>    epid sn    wind_nm     case_nm dist_wind_index dist_epid_index epid_total
#> 1     1  1       Case        Case          0 days          0 days         11
#> 2     1  2       Case Duplicate_C          1 days          1 days         11
#> 3     1  3       Case Duplicate_C          2 days          2 days         11
#> 4     1  4       Case Duplicate_C          3 days          3 days         11
#> 5     1  5       Case Duplicate_C          4 days          4 days         11
#> 6     1  6       Case Duplicate_C          5 days          5 days         11
#> 7     1  7 Recurrence   Recurrent          1 days          6 days         11
#> 8     1  8 Recurrence Duplicate_R          2 days          7 days         11
#> 9     1  9 Recurrence Duplicate_R          3 days          8 days         11
#> 10    1 10 Recurrence Duplicate_R          4 days          9 days         11
#> 11    1 11 Recurrence Duplicate_R          5 days         10 days         11
#>    iteration wind_id1 epid_start   epid_end epid_length epid_dataset
#> 1          1        1 2021-01-01 2021-01-11     10 days  BSI,RTI,UTI
#> 2          1        1 2021-01-01 2021-01-11     10 days  BSI,RTI,UTI
#> 3          1        1 2021-01-01 2021-01-11     10 days  BSI,RTI,UTI
#> 4          1        1 2021-01-01 2021-01-11     10 days  BSI,RTI,UTI
#> 5          1        1 2021-01-01 2021-01-11     10 days  BSI,RTI,UTI
#> 6          0        1 2021-01-01 2021-01-11     10 days  BSI,RTI,UTI
#> 7          0        6 2021-01-01 2021-01-11     10 days  BSI,RTI,UTI
#> 8          0        6 2021-01-01 2021-01-11     10 days  BSI,RTI,UTI
#> 9          0        6 2021-01-01 2021-01-11     10 days  BSI,RTI,UTI
#> 10         0        6 2021-01-01 2021-01-11     10 days  BSI,RTI,UTI
#> 11         0        6 2021-01-01 2021-01-11     10 days  BSI,RTI,UTI

Figure 1 gives a visual representation of the difference between these identifiers.

Figure 1: Episodes and panes
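
The difference between fixed and rolling episodes can also be sketched in base R. The function below is a simplified illustration only, not diyar's implementation. track_episodes() is a hypothetical helper which assumes that events are already sorted and that the last linked event is the reference event for a recurrence. It ignores strata, sub-criteria and recursive episodes. Each episode is labelled with the position of its index event.

```r
# Hypothetical helper for illustration only; not part of diyar
track_episodes <- function(dates, case_length, rolling = FALSE,
                           recurrence_length = case_length) {
  d <- as.numeric(dates)
  stopifnot(!is.unsorted(d))  # assumption: events are sorted in time
  n <- length(d)
  epid <- integer(n)
  i <- 1L
  while (i <= n) {
    # Events within `case_length` of the index event join its episode
    last <- max(which(d <= d[i] + case_length))
    if (rolling) {
      # Repeat from the last linked event until no later event
      # falls within `recurrence_length` of it
      repeat {
        nxt <- max(which(d <= d[last] + recurrence_length))
        if (nxt == last) break
        last <- nxt
      }
    }
    epid[i:last] <- i  # label the episode by its index event
    i <- last + 1L     # the next unlinked event becomes an index event
  }
  epid
}

event_dt <- seq(from = as.Date("2021-01-01"), to = as.Date("2021-01-11"), by = 1)
fixed_ids <- track_episodes(event_dt, case_length = 5)
rolling_ids <- track_episodes(event_dt, case_length = 5, rolling = TRUE)
```

For the 11 daily events used in this section, fixed_ids splits the events into two groups (events 1 to 6 and events 7 to 11) while rolling_ids places all events in one group, which is the same grouping as ep1 and ep2 above.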

Implementation

The main considerations in a case definition are accounted for in these functions using a flexible and modular approach. Therefore, most considerations can be addressed independently or in a compounding manner. These considerations are summarised below.

Matching criteria based on an event’s attribute

Additional matching criteria (separate from temporal links) can be implemented by the strata, case_sub_criteria and recurrence_sub_criteria arguments. strata introduces a blocking attribute which forces separate episodes and panes for different subsets of the dataset.

The figure and example below show how the strata argument is used.

# Matching clinical criteria
ep1 <- episodes(event_dt, strata = attr_1, case_length = 5)
# Matching geographical criteria
ep2 <- episodes(event_dt, strata = attr_2, case_length = 5)

Figure 2: Using a strata to specify additional criteria for linked events
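
Conceptually, a blocking attribute means that events are tracked independently within each subset. The base R sketch below illustrates this idea under simplifying assumptions. fixed_ids() is a hypothetical helper that only handles fixed episodes on sorted event times, and the resulting labels are illustrative rather than those produced by episodes().

```r
# Hypothetical helper: fixed episodes for one stratum of sorted event times
fixed_ids <- function(d, case_length) {
  epid <- integer(length(d))
  i <- 1L
  while (i <= length(d)) {
    last <- max(which(d <= d[i] + case_length))
    epid[i:last] <- i
    i <- last + 1L
  }
  epid
}

day <- 1:11  # day of each event, counted from 2021-01-01
attr_1 <- c("BSI", "UTI", "RTI", "RTI", "BSI", "BSI", "BSI", "RTI", "RTI", "BSI", "RTI")

# Track episodes separately within each source of infection
within_id <- ave(day, attr_1, FUN = function(d) fixed_ids(d, case_length = 5))
stratified <- paste(attr_1, within_id, sep = ".")
```

In this sketch, the UTI event forms its own episode, while the BSI and RTI events each form two episodes because events of one type are never linked to events of another.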

In contrast, the case_sub_criteria and recurrence_sub_criteria arguments apply a set of matching criteria for attributes associated with the events being compared. These arguments take a sub_criteria object. sub_criteria objects and how they are used are described in greater detail in vignette("links"). In summary, they contain a set of atomic vectors as attributes, a set of corresponding logical tests for each attribute and another set of logical tests for the equivalence of values in each attribute. The evaluation of a sub_criteria is recursive and so allows for nested conditions.

The figure and example below show how the case_sub_criteria and recurrence_sub_criteria arguments are used.

# Attribute 3 - Patient sex
attr_3 <- c(rep("Female", 9), "Male", "Female")

# Sub-criteria 1 - Matching source of infection OR patient location
sub_cri_1 <- sub_criteria(attr_1, attr_2, operator = "or")
# Sub-criteria 2 - Matching source of infection AND patient location
sub_cri_2 <- sub_criteria(attr_1, attr_2, operator = "and")
# Sub-criteria 3 - (Matching source of infection AND patient location) OR (Matching patient sex)
sub_cri_3 <- sub_criteria(sub_cri_2, attr_3, operator = "or")
# Sub-criteria 4 - (Matching source of infection AND patient location) AND (Matching patient sex)
sub_cri_4 <- sub_criteria(sub_cri_2, attr_3, operator = "and")

ep3 <- episodes(event_dt, case_length = 5, case_sub_criteria = sub_cri_1)
ep4 <- episodes(event_dt, case_length = 5, case_sub_criteria = sub_cri_2)
ep5 <- episodes(event_dt, case_length = 5, case_sub_criteria = sub_cri_3)
ep6 <- episodes(event_dt, case_length = 5, case_sub_criteria = sub_cri_4)

Figure 3: Using a sub_criteria to specify additional criteria for linked events
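
The nested conditions above can be read as ordinary logical tests. The base R sketch below shows which events would satisfy each condition when compared with the first event. It assumes the default test is simple equality of values and is only an illustration of the logic, not of how episodes() evaluates a sub_criteria internally.

```r
attr_1 <- c("BSI", "UTI", "RTI", "RTI", "BSI", "BSI", "BSI", "RTI", "RTI", "BSI", "RTI")
attr_2 <- c("Ward 1", "Ward 1", "Ward 3", "Ward 3", "Ward 2", "Ward 2",
            "Ward 1", "Ward 1", "Ward 3", "Ward 3", "Ward 2")
attr_3 <- c(rep("Female", 9), "Male", "Female")

ref <- 1  # compare every event with the first event

same_source <- attr_1 == attr_1[ref]
same_ward <- attr_2 == attr_2[ref]
same_sex <- attr_3 == attr_3[ref]

match_1 <- same_source | same_ward  # source OR location
match_2 <- same_source & same_ward  # source AND location
match_3 <- match_2 | same_sex       # (source AND location) OR sex
match_4 <- match_2 & same_sex       # (source AND location) AND sex
```

Here, which(match_2) returns events 1 and 7, the only "BSI" events in "Ward 1", while match_3 is TRUE for every event except the tenth, the only event from a male patient.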

Using a sub_criteria incurs additional processing time. Therefore, it should be reserved for situations where a blocking attribute would not suffice or where more complex matching criteria are required. The figure and example below illustrate such cases.

# record id
rd_id <- seq_along(attr_1)

# Condition 1 - Each episode must include RTI events
cri_funx_1 <- function(x, y){
  splts <- split(x$attr, y$rd_id)
  splts_lgk <- lapply(splts, function(x){
    "RTI" %in% x
  })
  splts_lgk <- unlist(splts_lgk)
  splts_lgk[match(y$rd_id, names(splts))]
}

# Condition 2 - Each episode must include >=3 different sources of infection
cri_funx_2 <- function(x, y){
  splts <- split(x$attr, y$rd_id)
  splts_lgk <- lapply(splts, function(x){
    length(x[!duplicated(x)]) >= 3
  })
  splts_lgk <- unlist(splts_lgk)
  splts_lgk[match(y$rd_id, names(splts))]
}

# Equivalence - Logical test for matching attributes
eqv_funx <- function(x, y){
  x$rd_id == y$rd_id
}

# Sub-criteria 
sub_cri_5 <- sub_criteria(list(attr = attr_1, rd_id = rd_id), match_funcs = cri_funx_1,
                          equal_funcs = eqv_funx)

sub_cri_6 <- sub_criteria(list(attr = attr_1, rd_id = rd_id), match_funcs = cri_funx_2,
                          equal_funcs = eqv_funx)

ep7 <- episodes(event_dt, case_length = 2, episode_type = "fixed",
                case_sub_criteria = sub_cri_5)

ep8 <- episodes(event_dt, case_length = 2, episode_type = "fixed",
                case_sub_criteria = sub_cri_6)

Figure 4: Using case_sub_criteria to specify complex criteria for linked events

Separating events into sections of time

This is best handled by partitions(). See the examples below.

# Group events into 2 equal parts over the strata's duration
pn2 <- partitions(event_dt, length.out = 2, separate = TRUE)

# Group events into 3-day sequences over the strata's duration
pn3 <- partitions(event_dt, by = 3, separate = TRUE)

# Group events that occurred in a specified period of time
pn4 <- partitions(event_dt, window = number_line(event_dt[4], event_dt[7]))

# Group events from separate periods into one pane
pn5 <- partitions(event_dt, length.out = 2, separate = FALSE)

Figure 5: Using partitions
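
The idea of separating events into set periods can be sketched in base R with findInterval(). The example below is an illustration under stated assumptions, not the method used by partitions(). It splits the observed date range into two equal intervals and treats an event on a shared boundary as belonging to the later interval, which is consistent with the pn1 output shown in the overview.

```r
event_dt <- seq(from = as.Date("2021-01-01"), to = as.Date("2021-01-11"), by = 1)
d <- as.numeric(event_dt)

# Two equal intervals over the full duration (three break points)
breaks <- seq(min(d), max(d), length.out = 3)
pane_2 <- findInterval(d, breaks, rightmost.closed = TRUE)

# Events inside one specified window (4th to 7th event dates, inclusive)
in_window <- d >= d[4] & d <= d[7]
```

With these inputs, pane_2 places events 1 to 5 in the first interval and events 6 to 11 in the second, and in_window is TRUE for events 4 to 7 only.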

Selecting index events and the direction of episode tracking

The from_last argument specifies the direction of episode tracking, while custom_sort specifies a custom preference for selecting index events. The combination of both allows users to choose which event or type of events should be used as the index event. See the examples below.

# Preference for selecting index events
c_sort <- c(rep(2, 5), 1, rep(2, 5))
# Episodes are 6 days (5-day difference) after the earliest event 
ep9 <- episodes(event_dt, case_length = 5, episodes_max = 1)
# Episodes are 6 days (5-day difference) before the most recent event 
ep10 <- episodes(event_dt, case_length = 5, episodes_max = 1, from_last = TRUE)
# Episodes are 6 days (5-day difference) after the 6th event 
ep11 <- episodes(event_dt, case_length = 5, custom_sort = c_sort, episodes_max = 1)
# Episodes are 6 days (5-day difference) before or after the 6th event 
ep12 <- episodes(event_dt, case_length = number_line(-5, 5), custom_sort = c_sort,
                 episodes_max = 1)

Figure 6: Selecting index events when tracking episodes
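
The selection of an index event can be thought of as an ordering problem. The base R sketch below assumes that a lower custom_sort value means a higher preference, as implied by c_sort above, and that ties are broken by date. It illustrates the concept only and does not reproduce how episodes() is implemented.

```r
event_dt <- seq(from = as.Date("2021-01-01"), to = as.Date("2021-01-11"), by = 1)
c_sort <- c(rep(2, 5), 1, rep(2, 5))

# Earliest event first (default direction)
idx_first <- order(event_dt)[1]
# Most recent event first (as with from_last = TRUE)
idx_last <- order(event_dt, decreasing = TRUE)[1]
# Lowest custom sort value first, ties broken by date
idx_custom <- order(c_sort, event_dt)[1]

# Days between each event and the custom index event
gap <- as.numeric(event_dt - event_dt[idx_custom])
after_only <- which(gap >= 0 & gap <= 5)  # as with case_length = 5
both_sides <- which(abs(gap) <= 5)        # as with case_length = number_line(-5, 5)
```

Here the 6th event is selected as the index event. after_only contains events 6 to 11, while both_sides contains all 11 events because every event is within 5 days of the 6th.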

Recurrence

The episode_type argument can be used to request rolling or recursive episodes, both of which permit recurrence. reference_event specifies which of the events in the existing episode is considered the reference event for the next recurrence. case_for_recurrence determines whether the initial occurrence of the event and subsequent recurrences are treated in the same way, i.e. do recurrent events trigger an initial occurrence of their own?

# Episodes are 4 days (3-day difference) after the earliest event with
# repeat occurrence within 4 days of the last event considered recurrences not duplicates
ep13 <- episodes(event_dt, case_length = 3, episode_type = "rolling")
# Episodes are 4 days (3-day difference) after the earliest event with
# repeat occurrence within 7 days of the last event considered recurrences not duplicates
ep14 <- episodes(event_dt, case_length = 3,  recurrence_length = 6, episode_type = "rolling")
# Episodes are 3 days (2-day difference) after the earliest event with
# repeat occurrence within 6 days of the first event considered recurrences not duplicates
ep15 <- episodes(event_dt, case_length = 2,  recurrence_length = 5, 
                episode_type = "rolling", reference_event = "first_record")
# Episodes are 2 days (1-day difference) after the earliest event with
# repeat occurrence within 4 days of the last event considered recurrences not duplicates and
# the possibility of each repeat occurrence spawning a new occurrence as if it was the initial case
ep16 <- episodes(event_dt, case_length = 1,  recurrence_length = 3, 
                episode_type = "rolling", case_for_recurrence = TRUE)
# Episodes are 2 days (1-day difference) after the earliest event with
# repeat occurrence within 4 days of the last event considered recurrences not duplicates and
# can't recur more than twice
ep17 <- episodes(event_dt, case_length = 1,  recurrence_length = 3, 
                episode_type = "rolling", rolls_max = 2)
# Episodes are 2 days (1-day difference) after the earliest event with
# repeat occurrence within 4 days of the last event considered recurrences not duplicates and
# can't recur more than once and the selection of index events is recursive
ep18 <- episodes(event_dt, case_length = 1,  recurrence_length = 3, 
                episode_type = "recursive", rolls_max = 1)

Figure 7: Recurrence of the index event