`CIPerm`

We show that our package replicates the analyses in:

- Nguyen, M.D. (2009), “Nonparametric Inference using Randomization
and Permutation Reference Distribution and their Monte-Carlo
Approximation” [unpublished MS thesis; Mara Tableman, advisor], Portland
State University.
*Dissertations and Theses*. Paper 5927. DOI:10.15760/etd.7798.

Note that our R function arguments and outputs are structured differently than the similarly-named R functions in Nguyen (2009), but the results are equivalent.

We also compare our output against related results in:

- Ernst, M.D. (2004), “Permutation Methods: A Basis for Exact
Inference,”
*Statistical Science*, vol. 19, no. 4, 676-685, DOI:10.1214/088342304000000396.

Following Ernst (2004) and Nguyen (2009), we use the phrase
“permutation methods” to include both *randomization* tests
(where we assume random assignment of 2 treatments) and
*permutation* tests (where we assume random sampling from 2
populations). In the simple settings in this R package, the
randomization and permutation test mechanics are identical, though their
interpretations may differ.

Consider a two-sample testing setup, where \(X_1, \ldots, X_m\) are the responses from the control group and \(Y_1, \ldots, Y_n\) are the responses from the treatment group. Imagine that lower responses are better (e.g., illness recovery times).

With a standard one-sided randomization test or permutation test, we can test \(H_0: \mu_Y - \mu_X \geq 0\) against \(H_A: \mu_Y - \mu_X < 0\).

We could also test for a specific value of \(\mu_Y - \mu_X\), e.g. \(H_0: \mu_Y - \mu_X \geq \Delta\) against \(H_A: \mu_Y - \mu_X < \Delta\). If we assume a constant additive effect, we could carry this out by replacing all the \(Y_j\) values with \(Y_{j,\Delta} = Y_j-\Delta\) and then running the usual randomization test for a difference of 0.
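The shift-then-test idea can be sketched in a few lines. This is a language-neutral illustration in Python, not the package's code; the data are from Table 1 and all variable names are ours.

```python
from itertools import combinations

# Sketch only (not the package's code): test H0: mu_Y - mu_X >= Delta by
# shifting the treatment values and running the ordinary one-sided test.
x = [19, 22, 25, 26]                       # control group, m = 4
y = [23, 33, 40]                           # treatment group, n = 3
delta = 5.0                                # hypothesized difference

y_shift = [v - delta for v in y]           # Y_{j,Delta} = Y_j - Delta
pooled = x + y_shift
m, n = len(x), len(y)

t_obs = sum(y_shift) / n - sum(x) / m      # observed statistic on shifted data
count = total = 0
for idx in combinations(range(m + n), n):  # all C(7,3) = 35 relabelings
    treat = [pooled[i] for i in idx]
    ctrl = [pooled[i] for i in range(m + n) if i not in idx]
    t = sum(treat) / n - sum(ctrl) / m
    count += t <= t_obs                    # lower tail, since H_A is "< Delta"
    total += 1
p_value = count / total                    # reject H0 when this is <= alpha
```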

Finally, we can invert the test to get a confidence interval. If we repeat the procedure above for many different values of \(\Delta\) and record the values where the test fails to reject at level \(\alpha\), we will have a \(1-\alpha\) one-tailed confidence interval for \(\Delta\).

Doing this naively would require running many new permutations at each of many \(\Delta\) values. But Nguyen (2009) derives a shortcut that gives this confidence interval directly from a single set of permutations.

Each permutation involves swapping \(k\) labels: we treat \(k\) of the \(X_i\) as if they were \(Y_j\) and vice versa. Let \(t_{k,d}\) be the difference in group means for the \(d^{th}\) of the permutations with \(k\) swaps. For instance, on the original data we have \[t_0 = \frac{\sum_{j=1}^n Y_j}{n} - \frac{\sum_{i=1}^m X_i}{m}\] and other permutations have the form \[t_{k,d} = \frac{\sum_{i=1}^k X_i + \sum_{j=k+1}^n Y_j}{n} - \frac{\sum_{j=1}^k Y_j + \sum_{i=k+1}^m X_i}{m}\]

Now if \(t_{k,d,\Delta}\) is the same statistic but computed on the dataset where \(Y_{j,\Delta}\) replaces \(Y_j\), then Nguyen (2009) shows that \[t_{k,d,\Delta} = t_{k,d} - \Delta + k\left(\frac{1}{n}+\frac{1}{m}\right)\Delta\] So if we keep track of \(k\) for each permutation, then one run of the randomization test is enough – we don’t need to carry out new runs for new \(\Delta\) values.
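As a quick numerical check, we can verify this identity on one permutation of the Table 1 data. This Python sketch is ours, not the package's code; it uses the text's convention \(t = \bar{Y} - \bar{X}\).

```python
# Verify Nguyen's shortcut on one permutation of the Table 1 data.
# Convention here: t = mean(treatment side) - mean(control side).
x = [19, 22, 25, 26]              # control, m = 4
y = [23, 33, 40]                  # treatment, n = 3
m, n = len(x), len(y)
delta = 5.0

# One permutation with k = 2 swaps: x[0], x[1] trade places with y[0], y[1]
k = 2
ctrl = [y[0], y[1], x[2], x[3]]
treat = [x[0], x[1], y[2]]
t_kd = sum(treat) / n - sum(ctrl) / m

# Direct route: shift every original y value by -delta, then recompute
y_s = [v - delta for v in y]
ctrl_s = [y_s[0], y_s[1], x[2], x[3]]
treat_s = [x[0], x[1], y_s[2]]
t_direct = sum(treat_s) / n - sum(ctrl_s) / m

# Shortcut: adjust t_kd without touching the data again
t_shortcut = t_kd - delta + k * (1/n + 1/m) * delta

assert abs(t_direct - t_shortcut) < 1e-12
```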

Next, we don’t actually have to calculate \(t_{k,d,\Delta}\) values directly—only the \(t_{k,d}\) values we would usually calculate for a standard test of no difference in means, and the values of \(k\) for each permutation.

Let \(w_{k,d} = \frac{t_0-t_{k,d}}{k\left(\frac{1}{n}+\frac{1}{m}\right)}\). Nguyen (2009) shows that if we compute the \(w_{k,d}\) for each permutation and sort them, then their upper \(\alpha\) quantile \(w_{(\alpha)}\) gives the \(1-\alpha\) one-sided CI upper bound for \(\Delta\): \[\hat{P}[\Delta \in (-\infty, w_{(\alpha)})] = 1-\alpha\] To get a two-sided CI for \(\Delta\), we can do this for both tails at half the nominal alpha.
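To make the mechanics concrete, here is a Python sketch (ours, not the package's code) that computes the \(w_{k,d}\) for every combination of the Table 1 data. Note that it follows the sign convention of the `diffmean` column in the table below (group 1 mean minus group 2 mean), so the resulting \(w\) values match the `wkd` column there.

```python
from itertools import combinations

# Table 1 data; group 1 (x) has m = 4 values, group 2 (y) has n = 3
x = [19, 22, 25, 26]
y = [23, 33, 40]
m, n = len(x), len(y)
pooled = x + y

t0 = sum(x) / m - sum(y) / n                # diffmean on the original data
ws = []
for idx in combinations(range(m + n), n):   # which pooled slots form group 2
    g2 = [pooled[i] for i in idx]
    g1 = [pooled[i] for i in range(m + n) if i not in idx]
    k = sum(1 for i in idx if i < m)        # number of swapped labels
    if k == 0:
        continue                            # identity permutation: w undefined
    t_kd = sum(g1) / m - sum(g2) / n
    ws.append((t0 - t_kd) / (k * (1/n + 1/m)))

ws.sort()
# With alpha = 2/35, the one-sided upper bound is the 2nd-largest w
upper = ws[-2]
```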

In our R package, `dset()` sets up the usual two-sample randomization or permutation test, but also records \(k\) and \(w_{k,d}\) for each permutation. Then `cint()` finds the appropriate quantiles of \(w_{k,d}\) to return the confidence interval.

We also track a few other test statistics, allowing `pval()` to return p-values for a difference in means, a difference in medians, or the Wilcoxon rank sum test.

Input the data from Table 1:

```
library(CIPerm)
x <- c(19, 22, 25, 26)
y <- c(23, 33, 40)
```

Replicate Table 2 along with the three extra columns from Table 3:

```
demo <- dset(x, y, returnData = TRUE)
knitr::kable(demo, digits = 2)
```

ID | 1 | 2 | 3 | 4 | 5 | 6 | 7 | diffmean | sum1 | diffmedian | wilsum | k | wkd | w.i |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 19 | 22 | 25 | 26 | 23 | 33 | 40 | -9.00 | 92 | -9.5 | 12 | 0 | NaN | NaN |
2 | 19 | 22 | 25 | 23 | 26 | 33 | 40 | -10.75 | 89 | -10.5 | 10 | 1 | 3.00 | -21.00 |
3 | 19 | 22 | 25 | 33 | 26 | 23 | 40 | -4.92 | 99 | -2.5 | 13 | 1 | -7.00 | -18.00 |
4 | 19 | 22 | 25 | 40 | 26 | 23 | 33 | -0.83 | 106 | -2.5 | 14 | 1 | -14.00 | -16.00 |
5 | 19 | 22 | 26 | 23 | 25 | 33 | 40 | -10.17 | 90 | -10.5 | 11 | 1 | 2.00 | -15.00 |
6 | 19 | 22 | 26 | 33 | 25 | 23 | 40 | -4.33 | 100 | -1.0 | 14 | 1 | -8.00 | -14.50 |
7 | 19 | 22 | 26 | 40 | 25 | 23 | 33 | -0.25 | 107 | -1.0 | 15 | 1 | -15.00 | -14.00 |
8 | 19 | 22 | 23 | 33 | 25 | 26 | 40 | -6.08 | 97 | -3.5 | 12 | 2 | -2.50 | -14.00 |
9 | 19 | 22 | 23 | 40 | 25 | 26 | 33 | -2.00 | 104 | -3.5 | 13 | 2 | -6.00 | -14.00 |
10 | 19 | 22 | 33 | 40 | 25 | 26 | 23 | 3.83 | 114 | 2.5 | 16 | 2 | -11.00 | -13.00 |
11 | 19 | 25 | 26 | 23 | 22 | 33 | 40 | -8.42 | 93 | -9.0 | 13 | 1 | -1.00 | -12.50 |
12 | 19 | 25 | 26 | 33 | 22 | 23 | 40 | -2.58 | 103 | 2.5 | 16 | 1 | -11.00 | -11.00 |
13 | 19 | 25 | 26 | 40 | 22 | 23 | 33 | 1.50 | 110 | 2.5 | 17 | 1 | -18.00 | -11.00 |
14 | 19 | 25 | 23 | 33 | 22 | 26 | 40 | -4.33 | 100 | -2.0 | 14 | 2 | -4.00 | -11.00 |
15 | 19 | 25 | 23 | 40 | 22 | 26 | 33 | -0.25 | 107 | -2.0 | 15 | 2 | -7.50 | -10.00 |
16 | 19 | 25 | 33 | 40 | 22 | 26 | 23 | 5.58 | 117 | 6.0 | 18 | 2 | -12.50 | -9.67 |
17 | 19 | 26 | 23 | 33 | 22 | 25 | 40 | -3.75 | 101 | -0.5 | 15 | 2 | -4.50 | -9.50 |
18 | 19 | 26 | 23 | 40 | 22 | 25 | 33 | 0.33 | 108 | -0.5 | 16 | 2 | -8.00 | -9.00 |
19 | 19 | 26 | 33 | 40 | 22 | 25 | 23 | 6.17 | 118 | 6.5 | 19 | 2 | -13.00 | -8.67 |
20 | 19 | 23 | 33 | 40 | 22 | 25 | 26 | 4.42 | 115 | 3.0 | 17 | 3 | -7.67 | -8.00 |
21 | 22 | 25 | 26 | 23 | 19 | 33 | 40 | -6.67 | 96 | -9.0 | 14 | 1 | -4.00 | -8.00 |
22 | 22 | 25 | 26 | 33 | 19 | 23 | 40 | -0.83 | 106 | 2.5 | 17 | 1 | -14.00 | -7.67 |
23 | 22 | 25 | 26 | 40 | 19 | 23 | 33 | 3.25 | 113 | 2.5 | 18 | 1 | -21.00 | -7.50 |
24 | 22 | 25 | 23 | 33 | 19 | 26 | 40 | -2.58 | 103 | -2.0 | 15 | 2 | -5.50 | -7.50 |
25 | 22 | 25 | 23 | 40 | 19 | 26 | 33 | 1.50 | 110 | -2.0 | 16 | 2 | -9.00 | -7.00 |
26 | 22 | 25 | 33 | 40 | 19 | 26 | 23 | 7.33 | 120 | 6.0 | 19 | 2 | -14.00 | -6.00 |
27 | 22 | 26 | 23 | 33 | 19 | 25 | 40 | -2.00 | 104 | -0.5 | 16 | 2 | -6.00 | -6.00 |
28 | 22 | 26 | 23 | 40 | 19 | 25 | 33 | 2.08 | 111 | -0.5 | 17 | 2 | -9.50 | -5.50 |
29 | 22 | 26 | 33 | 40 | 19 | 25 | 23 | 7.92 | 121 | 6.5 | 20 | 2 | -14.50 | -4.50 |
30 | 22 | 23 | 33 | 40 | 19 | 25 | 26 | 6.17 | 118 | 3.0 | 18 | 3 | -8.67 | -4.00 |
31 | 25 | 26 | 23 | 33 | 19 | 22 | 40 | -0.25 | 107 | 3.5 | 18 | 2 | -7.50 | -4.00 |
32 | 25 | 26 | 23 | 40 | 19 | 22 | 33 | 3.83 | 114 | 3.5 | 19 | 2 | -11.00 | -2.50 |
33 | 25 | 26 | 33 | 40 | 19 | 22 | 23 | 9.67 | 124 | 7.5 | 22 | 2 | -16.00 | -1.00 |
34 | 25 | 23 | 33 | 40 | 19 | 22 | 26 | 7.92 | 121 | 7.0 | 20 | 3 | -9.67 | 2.00 |
35 | 26 | 23 | 33 | 40 | 19 | 22 | 25 | 8.50 | 122 | 7.5 | 21 | 3 | -10.00 | 3.00 |

Replicate the left-tailed p-values reported on pages 5-6.

The first three should be `0.08571429`, and the last should be `0.1142857`.

```
# Difference in means
pval(demo, tail = "Left", value = "m")
#> [1] 0.08571429
# Sum of treatment group
pval(demo, tail = "Left", value = "s")
#> [1] 0.08571429
# Difference in medians
pval(demo, tail = "Left", value = "d")
#> [1] 0.08571429
# Wilcoxon rank sum statistic
pval(demo, tail = "Left", value = "w")
#> [1] 0.1142857
```

Replicate the confidence interval (left-tailed, with confidence level
\((1 - \frac{2}{35})\times 100\% \approx
94.3\%\)) reported on page 11.

This should be `c(-Inf, 2)`.

```
cint(demo, conf.level = 1-2/35, tail = "Left")
#> Achieved conf. level: 1-(2/35)
#> $conf.int
#> [1] -Inf 2
#>
#> $conf.level.achieved
#> [1] 0.9428571
```

Input and process the data from Table 4:

```
wl1 <- c(1.72, 1.64, 1.74, 1.70, 1.82, 1.82, 1.90, 1.82, 2.08)
wl2 <- c(1.78, 1.86, 1.96, 2.00, 2.00, 1.96)
wl <- dset(wl1, wl2)

al1 <- c(1.24, 1.38, 1.36, 1.40, 1.38, 1.48, 1.38, 1.54, 1.56)
al2 <- c(1.14, 1.20, 1.30, 1.26, 1.28, 1.18)
al <- dset(al1, al2)
```

Replicate the two-sided p-values from page 14, using the
sum-of-group-1 test statistic.

**NOTE:** Nguyen reports the p-values as `0.07172827` for wing lengths and `0.002197802` for antennae lengths. This matches our results.

However, Ernst (2004) reports the p-values as `0.0719` and `0.0022`, in which case the first does not match our results after rounding, though they are very close.

```
pval(wl, tail = "Two", value = "s")
#> [1] 0.07172827
pval(al, tail = "Two", value = "s")
#> [1] 0.002197802
```

Replicate the two-sided ~95% CIs for the mean difference from page
15.

**NOTE:** Nguyen reports these as 94.90% CIs: `c(-0.246, 0.01)` for wing lengths and `c(0.087, 0.285)` for antennae lengths.

However, Ernst (2004) reports a 94.90% CI for wing lengths as `c(-0.250, 0.010)` and a 94.94% (NOT 94.90%!) CI for antennae lengths as `c(0.087, 0.286)`.

Neither of these is an exact match for our own results below, though
both are very close.

```
cint(wl, conf.level = .95, tail = "Two")
#> Achieved conf. level: 1-2*(126/5005)
#> $conf.int
#> [1] -0.2466667 0.0100000
#>
#> $conf.level.achieved
#> [1] 0.9496503
cint(al, conf.level = .95, tail = "Two")
#> Achieved conf. level: 1-2*(126/5005)
#> $conf.int
#> [1] 0.08666667 0.28500000
#>
#> $conf.level.achieved
#> [1] 0.9496503
```

The confidence levels in Ernst (2004) and Nguyen (2009) might differ from ours due to typos or rounding, since it is not possible to achieve exactly 94.90% or 94.94% confidence with this method. There are \({9+6 \choose 9} = 5005\) combinations. The closest confidence levels would be achieved if we used the 126th, 127th, or 128th lowest and highest \(w_{k,d}\) values to get our two-sided CI endpoints. In those cases, the respective confidence levels would be \(1-2\times(126/5005)\approx 0.94965\), \(1-2\times(127/5005)\approx 0.94925\), or \(1-2\times(128/5005)\approx 0.94885\).
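The achievable confidence levels are easy to check directly (a Python sketch of the arithmetic above):

```python
from math import comb

# Achievable two-sided confidence levels near 95% with group sizes 9 and 6
N = comb(9 + 6, 9)                        # 5005 equally likely combinations
# Using the r-th lowest/highest w values as endpoints gives level 1 - 2*(r/N)
levels = {r: 1 - 2 * r / N for r in (126, 127, 128)}
```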

Also, Ernst (2004) may have different CI endpoints because he used an
entirely different, less computationally-efficient method to calculate
CIs, based on Garthwaite (1996), “Confidence intervals from
randomization tests,” *Biometrics* 52, 1387-1393, DOI:10.2307/2532852.

With a modern computer, it takes little time to run all \({9+6 \choose 9} = 5005\) possible combinations. But for larger datasets, where there are far more possible combinations, it can make sense to take a smaller Monte Carlo sample from all possible permutations.
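The Monte Carlo idea amounts to sampling relabelings instead of enumerating them all. A Python sketch (the data are `wl1` and `wl2` from Table 4; variable names and seed are ours, and this is not how the package itself is implemented):

```python
import random
from math import comb

# Sketch of the Monte Carlo approach: sample relabelings rather than enumerate
pooled = [1.72, 1.64, 1.74, 1.70, 1.82, 1.82, 1.90, 1.82, 2.08,   # wl1, m = 9
          1.78, 1.86, 1.96, 2.00, 2.00, 1.96]                     # wl2, n = 6
m, n = 9, 6
nmc = 999                                       # number of Monte Carlo draws

random.seed(1)                                  # for a reproducible sketch
draws = []
for _ in range(nmc):
    idx = set(random.sample(range(m + n), n))   # random treatment relabeling
    treat = [pooled[i] for i in idx]
    ctrl = [pooled[i] for i in range(m + n) if i not in idx]
    draws.append(sum(treat) / n - sum(ctrl) / m)

# 'draws' now approximates the full permutation distribution of
# comb(15, 9) = 5005 statistics using only nmc of them
```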

In our `dset()` function, the `nmc` argument lets us choose the number of Monte Carlo draws.

As in Nguyen (2009), we repeat the above analysis, but this time we use 999 Monte Carlo draws instead of all 5005 possible combinations. Due to the randomness inherent in Monte Carlo sampling, these results will not match the exact results above nor Nguyen (2009)’s own Monte Carlo results; but they should be fairly close.

```
wl <- dset(wl1, wl2, nmc = 999)
al <- dset(al1, al2, nmc = 999)
pval(wl, tail = "Two", value = "s")
#> [1] 0.08308308
pval(al, tail = "Two", value = "s")
#> [1] 0.002002002
cint(wl, conf.level = .95, tail = "Two")
#> Achieved conf. level: 1-2*(25/999)
#> $conf.int
#> [1] -0.25 0.02
#>
#> $conf.level.achieved
#> [1] 0.9499499
cint(al, conf.level = .95, tail = "Two")
#> Achieved conf. level: 1-2*(25/999)
#> $conf.int
#> [1] 0.08666667 0.28000000
#>
#> $conf.level.achieved
#> [1] 0.9499499
```