The clustra package was built to cluster medical trajectories (time series) related by an intervention time. This vignette is intended to reproduce the results of an associated paper “clustra: A multi-platform k-means clustering algorithm for analysis of longitudinal trajectories in large electronic health records data.”
To run the paper version, set chunk = figs
and
full_data = TRUE
in the first code chunk below. It takes
about 70 minutes with mc = 5
on a Quad-Core I7, and
requires Internet access to get the full data from GitHub. The default
for CRAN submission is limited to a sample of about 1/8 of the full data
set to satisfy CRAN limits on package data size and runs a smaller
subset of the code chunks (takes about 40 seconds).
Additional plots to explain the methodology and examine the data can
be produced by setting chunk = all
in the chunk below. This
will run over 2 hours with the same system configuration.
Below is the code that controls which code chunks are evaluated and how many cores to use.
knitr::opts_knit$set(
collapse = TRUE,
comment = "#>"
)
cran = list(LibsData = TRUE, plot_true = TRUE, plot_raw = TRUE, run5 = TRUE, run5_v2 = FALSE, iter_plots = TRUE, plot5_side_by_side = FALSE, fig1 = FALSE, comp_true = FALSE, run10 = FALSE, plot10 = FALSE, run2 = FALSE, rand2 = FALSE, silplot = FALSE, silplot2 = FALSE, rand_plot = FALSE, hclust40 = FALSE, SAS_R_data = FALSE, Rand_SAS_R = FALSE, true_group = FALSE)
figs = list(LibsData = TRUE, plot_true = FALSE, plot_raw = FALSE, run5 = TRUE, run5_v2 = TRUE, iter_plots = FALSE, plot5_side_by_side = FALSE, fig1 = TRUE, comp_true = TRUE, run10 = TRUE, plot10 = TRUE, run2 = TRUE, rand2 = TRUE, silplot = TRUE, silplot2 = FALSE, rand_plot = TRUE, hclust40 = FALSE, SAS_R_data = TRUE, Rand_SAS_R = TRUE, true_group = TRUE)
all = list(LibsData = TRUE, plot_true = TRUE, plot_raw = TRUE, run5 = TRUE, run5_v2 = TRUE, iter_plots = TRUE, plot5_side_by_side = TRUE, fig1 = TRUE, comp_true = TRUE, run10 = TRUE, plot10 = TRUE, run2 = TRUE, rand2 = TRUE, silplot = TRUE, silplot2 = TRUE, rand_plot = TRUE, hclust40 = TRUE, SAS_R_data = TRUE, Rand_SAS_R = TRUE, true_group = TRUE)
chunk = cran # chose which version of the vignette to run
full_data = FALSE # choose if full 80,000 id data or sample of 10,000 ids
mc = 2
# If running on a Unix or a Mac platform, set to number of available cores.
# This vignette will experience notable speedup up to about 10 cores.
# On Intel chips with hyperthreading, up to 2x available cores can be used.
if (.Platform$OS.type == "windows") mc = 1
data.table::setDTthreads(1) # manage data.table threads (increase with full_data)
We start by loading the clustra
package and getting the
data. We also set the output path for plots to tempdir()
,
which should be changed to a user directory if they are needed beyond
the Knit output.
library(clustra)
start_time = deltime()
print(paste("Number of cores being used: mc parameter =", mc))
## [1] "Number of cores being used: mc parameter = 2"
repo = "https://github.com/MVP-CHAMPION/"
repo_sas = paste0(repo, "clustra-SAS/raw/main/")
if(full_data) {
url = paste0(repo_sas, "bp_data/simulated_data_27June2023.csv.gz")
data = data.table::fread(url)
} else {
data = bp10k
}
data$true_group = data$group
head(data)
## id group time response true_group
## 1: 18 3 -21 176.9072 3
## 2: 18 3 22 154.3234 3
## 3: 18 3 324 107.2318 3
## 4: 18 3 353 130.8007 3
## 5: 18 3 373 137.3623 3
## 6: 18 3 391 130.8484 3
plot_path = paste0(tempdir(), "/") # output dir for plots
cat("Resulting plots are saved in", plot_path, "directory\n")
## Resulting plots are saved in /var/folders/1t/x5sdn3gs653fr7k6vz6jq3l80000gn/T//RtmpHUZluf/ directory
Select a few random id
s and print their scatterplots
colored by true_group
. This only displays 9 individuals.
The data was generated based on five clusters. As the data used to
simulate this synthetic dataset is extremely noisy, it would be
difficult to ascertain any patterns by looking at a few subjects.
Now plot a 20,000 sample of the raw data. Use the true or simulated group assignment for coloring. As we can see, there is a lot of noise and it would be difficult to identify the clusters of trajectories by looking at the individual points, alone. A greater density of points is seen at time 0, where all trajectories have a data point.
Next, cluster the trajectories. Set k=5
, spline max
degrees of freedom to 30, and maximum iterations to 20. Initial cluster
assignments are the default “random”. We print out full information for
the iterations, including the elapsed time.
set.seed(12345)
cl5 = clustra(data, k = 5, maxdf = 30, conv = c(20, 0.5), mccores = mc, verbose = TRUE)
## Start counts 2001 2004 2016 2010 1969 Starts time: 0
## 1 (M-123)-0.57 (E-12345)-0.47 Changes: 7797 Counts: 1257 820 3529 1099 3295 Deviance: 46903674
## 2 (M-123)-0.31 (E-12345)-0.26 Changes: 2413 Counts: 1373 1774 2712 1450 2691 Deviance: 34739540
## 3 (M-123)-0.26 (E-12345)-0.46 Changes: 1405 Counts: 1384 2495 2156 1549 2416 Deviance: 33236379
## 4 (M-123)-0.31 (E-12345)-0.41 Changes: 938 Counts: 1407 2925 1814 1621 2233 Deviance: 32517385
## 5 (M-123)-0.31 (E-12345)-0.33 Changes: 645 Counts: 1466 3080 1633 1762 2059 Deviance: 32187814
## 6 (M-123)-0.26 (E-12345)-0.41 Changes: 741 Counts: 1536 3031 1549 2011 1873 Deviance: 32013822
## 7 (M-123)-0.31 (E-12345)-0.32 Changes: 929 Counts: 1624 2932 1486 2299 1659 Deviance: 31769678
## 8 (M-123)-0.26 (E-12345)-0.4 Changes: 898 Counts: 1728 2851 1439 2497 1485 Deviance: 31317772
## 9 (M-123)-0.26 (E-12345)-0.25 Changes: 517 Counts: 1857 2846 1401 2490 1406 Deviance: 30949472
## 10 (M-123)-0.26 (E-12345)-0.33 Changes: 220 Counts: 1923 2859 1386 2459 1373 Deviance: 30831685
## 11 (M-123)-0.26 (E-12345)-0.34 Changes: 82 Counts: 1952 2869 1382 2434 1363 Deviance: 30811168
## 12 (M-123)-0.31 (E-12345)-0.32 Changes: 37 Counts: 1971 2876 1380 2415 1358 Deviance: 30807849
## AIC:424.11 BIC:1627.1 edf:119.97 Total time: -8.15 converged
# What happens if we cut the iterations at 5 maximum?
set.seed(12345)
cl5.v2 = clustra(data, k = 5, maxdf = 30, conv = c(5, 0.5), mccores = mc)
Setting the option verbose = 2
produces plots of the
cluster spline centers at every iteration. We also pass the
ylim
parameter because the yaxis extent of the data is much
wider than the models.
set.seed(12345)
clustra(data, k = 5, maxdf = 30, conv = c(10, 0.5), mccores = mc, verbose = 2, ylim = c(110, 170))
## Start counts 2001 2004 2016 2010 1969 Starts time: 0
## 1 (M-123)-0.3 (E-12345)-0.33 Changes: 7797 Counts: 1257 820 3529 1099 3295 Deviance: 46903674
##
## 2 (M-123)-0.33 (E-12345)-0.26 Changes: 2413 Counts: 1373 1774 2712 1450 2691 Deviance: 34739540
##
## 3 (M-123)-0.34 (E-12345)-0.25 Changes: 1405 Counts: 1384 2495 2156 1549 2416 Deviance: 33236379
##
## 4 (M-123)-0.28 (E-12345)-0.26 Changes: 938 Counts: 1407 2925 1814 1621 2233 Deviance: 32517385
##
## 5 (M-123)-0.33 (E-12345)-0.32 Changes: 645 Counts: 1466 3080 1633 1762 2059 Deviance: 32187814
##
## 6 (M-123)-0.29 (E-12345)-0.47 Changes: 741 Counts: 1536 3031 1549 2011 1873 Deviance: 32013822
##
## 7 (M-123)-0.28 (E-12345)-0.33 Changes: 929 Counts: 1624 2932 1486 2299 1659 Deviance: 31769678
##
## 8 (M-123)-0.34 (E-12345)-0.26 Changes: 898 Counts: 1728 2851 1439 2497 1485 Deviance: 31317772
##
## 9 (M-123)-0.28 (E-12345)-0.26 Changes: 517 Counts: 1857 2846 1401 2490 1406 Deviance: 30949472