DHARMa: residual diagnostics for hierarchical (multi-level/mixed) regression models

Florian Hartig, Theoretical Ecology, University of Regensburg website

2024-10-17

Abstract

The ‘DHARMa’ package uses a simulation-based approach to create readily interpretable scaled (quantile) residuals for fitted generalized linear (mixed) models. Currently supported are linear and generalized linear (mixed) models from ‘lme4’ (classes ‘lmerMod’, ‘glmerMod’), ‘glmmTMB’, ‘GLMMadaptive’ and ‘spaMM’; phylogenetic linear models from ‘phylolm’ (classes ‘phylolm’ and ‘phyloglm’); generalized additive models (‘gam’ from ‘mgcv’); ‘glm’ (including ‘negbin’ from ‘MASS’, but excluding quasi-distributions) and ‘lm’ model classes. Moreover, externally created simulations, e.g. posterior predictive simulations from Bayesian software such as ‘JAGS’, ‘STAN’, or ‘BUGS’ can be processed as well. The resulting residuals are standardized to values between 0 and 1 and can be interpreted as intuitively as residuals from a linear regression. The package also provides a number of plot and test functions for typical model misspecification problems, such as over/underdispersion, zero-inflation, and residual spatial, temporal and phylogenetic autocorrelation.

Motivation

The interpretation of conventional residuals for generalized linear (mixed) and other hierarchical statistical models is often problematic. As an example, here the result of conventional Deviance, Pearson and raw residuals for two Poisson GLMMs, one that is lacking a quadratic effect, and one that fits the data perfectly. Could you tell which is the correct model?

Just for completeness - it was the second one. But don’t get too excited if you got it right. Probably you were just lucky - I can’t really tell a difference. But even so, would you have added a quadratic effect, instead of adding an overdispersion correction? The point here is that misspecifications in GL(M)Ms cannot reliably be diagnosed with standard residual plots, and thus GLMMs are often not as thoroughly checked as they should.

One reason why GL(M)Ms residuals are harder to interpret is that the expected distribution of the data (aka predictive distribution) changes with the fitted values. Reweighting with the expected dispersion, as done in Pearson residuals, or using deviance residuals, helps to some extent, but it does not lead to visually homogenous residuals, even if the model is correctly specified. As a result, standard residual plots, when interpreted in the same way as for linear models, seem to show all kind of problems, such as non-normality, heteroscedasticity, even if the model is correctly specified. Questions on the R mailing lists and forums show that practitioners are regularly confused about whether such patterns in GL(M)M residuals are a problem or not.

But even experienced statistical analysts currently have few options to diagnose misspecification problems in GLMMs. In my experience, the current standard practice is to eyeball the residual plots for major misspecifications, potentially have a look at the random effect distribution, and then run a test for overdispersion, which is usually positive, after which the model is modified towards an overdispersed / zero-inflated distribution. This approach, however, has a number of drawbacks, notably:

DHARMa aims at solving these problems by creating readily interpretable residuals for generalized linear (mixed) models that are standardized to values between 0 and 1, and that can be interpreted as intuitively as residuals for the linear model. This is achieved by a simulation-based approach, similar to the Bayesian p-value or the parametric bootstrap, that transforms the residuals to a standardized scale. The basic steps are:

  1. Simulate new response data from the fitted model for each observation.

  2. For each observation, calculate the empirical cumulative density function for the simulated observations, which describes the possible values (and their probability) at the predictor combination of the observed value, assuming the fitted model is correct.

  3. The residual is then defined as the value of the empirical density function at the value of the observed data, so a residual of 0 means that all simulated values are larger than the observed value, and a residual of 0.5 means half of the simulated values are larger than the observed value.

These steps are visualized in the following figure

The key advantage of this definition is that the so-defined residuals always have the same, known distribution, independent of the model that is fit, if the model is correctly specified. To see this, note that, if the observed data was created from the same data-generating process that we simulate from, all values of the cumulative distribution should appear with equal probability. That means we expect the distribution of the residuals to be flat, regardless of the model structure (Poisson, binomial, random effects and so on).

I currently prepare a more exact statistical justification for the approach in an accompanying paper, but if you must provide a reference in the meantime, I would suggest citing

p.s.: DHARMa stands for “Diagnostics for HierArchical Regression Models” - which, strictly speaking, would make DHARM. But in German, Darm means intestines; plus, the meaning of DHARMa in Hinduism makes the current abbreviation so much more suitable for a package that tests whether your model is in harmony with your data:

From Wikipedia, 28/08/16: In Hinduism, dharma signifies behaviours that are considered to be in accord with rta, the order that makes life and universe possible, and includes duties, rights, laws, conduct, virtues and ‘right way of living’.

Workflow in DHARMa

Installing, loading and citing the package

If you haven’t installed the package yet, either run

install.packages("DHARMa")

Or follow the instructions on https://github.com/florianhartig/DHARMa to install a development version.

Loading and citation

library(DHARMa)
citation("DHARMa")
## Para citar o pacote 'DHARMa' em publicações use:
## 
##   Hartig F (2024). _DHARMa: Residual Diagnostics for Hierarchical
##   (Multi-Level / Mixed) Regression Models_. R package version 0.4.7,
##   <http://florianhartig.github.io/DHARMa/>.
## 
## Uma entrada BibTeX para usuários(as) de LaTeX é
## 
##   @Manual{,
##     title = {DHARMa: Residual Diagnostics for Hierarchical (Multi-Level / Mixed) Regression Models},
##     author = {Florian Hartig},
##     year = {2024},
##     note = {R package version 0.4.7},
##     url = {http://florianhartig.github.io/DHARMa/},
##   }

Calculating scaled residuals

Let’s assume we have a fitted model that is supported by DHARMa.

testData = createData(sampleSize = 250)
fittedModel <- glmer(observedResponse ~ Environment1 + (1|group) , 
                     family = "poisson", data = testData)

Most functions in DHARMa can be calculated directly on the fitted model object. For example, if you are only interested in testing for dispersion problems, you could run

testDispersion(fittedModel)

In this case, the randomized quantile residuals are calculated on the fly inside the function call. If you work in this way, however, residual calculation will be repeated by every test / plot you call, and this can take a while. It is therefore highly recommended to first calculate the residuals once, using the simulateResiduals() function

simulationOutput <- simulateResiduals(fittedModel = fittedModel, plot = F)

which calculates randomized quantile residuals according to the algorithm discussed above. The function returns an object of class DHARMa, containing the simulations and the scaled residuals, which can later be passed on to all other plots and test functions. When specifying the optional argument plot = T, the standard DHARMa residual plot is displayed directly. The interpretation of the plot will be discussed below. Using the simulateResiduals function has the added benefit that you can modify the way in which residuals are calculated. For example, you may want to change the number of simulations, or the REs to condition on. See ?simulateResiduals and section “simulation options” below for details.

The calculated (scaled) residuals can be plotted and tested via a number of DHARMa functions (see below), or accessed directly via

residuals(simulationOutput)

To interpret the residuals, remember that a scaled residual value of 0.5 means that half of the simulated data are higher than the observed value, and half of them lower. A value of 0.99 would mean that nearly all simulated data are lower than the observed value. The minimum/maximum values for the residuals are 0 and 1. For a correctly specified model we would expect asymptotically

Note: the uniform distribution is the only differences to “conventional” residuals as calculated for a linear regression. If you cannot get used to this, you can transform the uniform distribution to another distribution, for example normal, via

residuals(simulationOutput, quantileFunction = qnorm, outlierValues = c(-7,7))

These normal residuals will behave exactly like the residuals of a linear regression. However, for reasons of a) numeric stability with low number of simulations, which makes it neccessary to descide on which value outliers are to be transformed and b) my conviction that it is much easier to visually detect deviations from uniformity than normality, DHARMa checks all residuals in the uniform space, and I would personally advice against using the transformation.

Plotting the scaled residuals

The main plot function for the calculated DHARMa object produced by simulateResiduals() is the plot.DHARMa() function

plot(simulationOutput)

The function creates two plots, which can also be called separately, and provide extended explanations / examples in the help

plotQQunif(simulationOutput) # left plot in plot.DHARMa()
plotResiduals(simulationOutput) # right plot in plot.DHARMa()

To provide a visual aid in detecting deviations from uniformity in y-direction, the plot function calculates an (optional default) quantile regression, which compares the empirical 0.25, 0.5 and 0.75 quantiles in y direction (red solid lines) with the theoretical 0.25, 0.5 and 0.75 quantiles (dashed black line), and provides a p-value for the deviation from the expected quantile. The significance of the deviation to the expected quantiles is tested and displayed visually, and can be additionally extracted with the testQuantiles function.

By default, plotResiduals plots against predicted values. However, you can also use it to plot residuals against a specific other predictors (highly recommend).

plotResiduals(simulationOutput, form = YOURPREDICTOR)

If the predictor is a factor, or if there is just a small number of observations on the x axis, plotResiduals will plot a box plot with additional tests instead of a scatter plot.

plotResiduals(simulationOutput, form = testData$group)

See ?plotResiduas for details, but very shortly: under H0 (perfect model), we would expect those boxes to range homogeneously from 0.25-0.75. To see whether there are deviations from this expecation, the plot calculates a test for uniformity per box, and a test for homogeneity of variances between boxes. A positive test will be highlighted in red.

Goodness-of-fit tests on the scaled residuals

To support the visual inspection of the residuals, the DHARMa package provides a number of specialized goodness-of-fit tests on the simulated residuals:

See the help of the functions and further comments below for a more detailed description.

Simulation options

There are a few important technical details regarding how the simulations are performed, in particular regarding the treatments of random effects and integer responses. It is strongly recommended to read the help of

?simulateResiduals

Refit

simulationOutput <- simulateResiduals(fittedModel = fittedModel, refit = T)
  • if refit = F (default), new datasets are simulated from the fitted model, and residuals are calculated by comparing the observed data to the new data

  • if refit = T, a parametric bootstrap is performed, meaning that the model is refit to all new datasets, and residuals are created by comparing observed residuals against refitted residuals

The second option is much much slower, and also seemed to have lower power in some tests I ran. It is therefore not recommended for standard residual diagnostics! I only recommend using it if you know what you are doing, and have particular reasons, for example if you estimate that the tested model is biased. A bias could, for example, arise in small data situations, or when estimating models with shrinkage estimators that include a purposeful bias, such as ridge/lasso, random effects or the splines in GAMs. My idea was then that simulated data would not fit to the observations, but that residuals for model fits on simulated data would have the same patterns/bias than model fits on the observed data.

Note also that refit = T can sometimes run into numerical problems, if the fitted model does not converge on the newly simulated data.

Conditinal vs. unconditinal simulations

The second option is the treatment of the stochastic hierarchy. In a hierarchical model, several layers of stochasticity are placed on top of each other. Specifically, in a GLMM, we have a lower level stochastic process (random effect), whose result enters into a higher level (e.g. Poisson distribution). For other hierarchical models, such as state-space models, similar considerations apply, but the hierarchy can be more complex. When simulating, we have to decide if we want to re-simulate all stochastic levels, or only a subset of those. For example, in a GLMM, it is common to only simulate the last stochastic level (e.g. Poisson) conditional on the fitted random effects, meaning that the random effects are set on the fitted values.

For controlling how many levels should be re-simulated, the simulateResidual function allows to pass on parameters to the simulate function of the fitted model object. Please refer to the help of the different simulate functions (e.g. ?simulate.merMod) for details. For merMod (lme4) model objects, the relevant parameters are “use.u”, and “re.form”, as, e.g., in

simulationOutput <- simulateResiduals(fittedModel = fittedModel, n = 250, use.u = T)

If the model is correctly specified and the fitting procedure is unbiased (disclaimer: GLMM estimators are not always unbiased), the simulated residuals should be flat regardless how many hierarchical levels we re-simulate. The most thorough procedure would be therefore to test all possible options. If testing only one option, I would recommend to re-simulate all levels, because this essentially tests the model structure as a whole. This is the default setting in the DHARMa package. A potential drawback is that re-simulating the random effects creates more variability, which may reduce power for detecting problems in the upper-level stochastic processes, in particular overdispersion (see section on dispersion tests below).

Note: Although unconditional residuals implicitly also test the normal distribution of the REs, it is probably not a bad idea to look addditionally check for normality of the RE distribution. As this is not based on quantile residuals, there is no special DHARMa function for this, so you should just extract the REs, and then run e.g. a Shapiro test.

Integer treatment / randomization

A third option is the treatment of integer responses. The background of this option is that, for integer-valued variables, some additional steps are necessary to make sure that the residual distribution becomes flat (essentially, we have to smoothen away the integer nature of the data). The idea is explained in

  • Dunn, K. P., and Smyth, G. K. (1996). Randomized quantile residuals. Journal of Computational and Graphical Statistics 5, 1-10.

DHARMa currently implements two procedures for randomization. The default procedure will randomize automatically. The second option requires knowledge about whether the model is integer-valued, which is usually implemented automatically. See ?simulateResiduals for details. Usually, these options should simply be kept at their defaults.

Calculating residuals for groups or subsets

In many situations, it can be useful to look at residuals per group, e.g. to see how much the model over / underpredicts per plot, year or subject. To do this, use the recalculateResiduals() function, together with a grouping variable (group) or a subsetting variable (sel), which can also be used in combination.

simulationOutput = recalculateResiduals(simulationOutput, group = testData$group)

Note, however, that you will have to change the selection of variables that you provide to plots and tests (e.g. in plotResiduals or testSpatialAutocorrelation) accordingly when you group or subset residuals.

Reproducibility notes, random seed and random state

As DHARMa uses simulations to calculate the residuals, a naive implementation of the algorithm would mean that residuals would look slightly different each time a DHARMa calculation is executed. This might both be confusing and bear the danger that a user would run the simulation several times and take the result that looks better (which would amount to multiple testing / p-hacking). By default, DHARMa therefore fixes the random seed to the same value every time a simulation is run, and afterwards restores the random state to the old value. This means that you will get exactly the same residual plot each time. If you want to avoid this behavior, for example for simulation experiments on DHARMa, use seed = NULL -> no seed set, but random state will be restored, or seed = F -> no seed set, and random state will not be restored. Whether or not you fix the seed, the setting for the random seed and the random state are stored in

simulationOutput$randomState

If you want to reproduce simulations for such a run, set the variable .Random.seed by hand, and simulate with seed = NULL.

Moreover (general advice), to ensure reproducibility, it’s advisable to add a set.seed() at the beginning, and a session.info() at the end of your script. The latter will list the version number of R and all loaded packages.

Interpreting residuals and recognizing misspecification problems

General remarks on interperting residual patterns and tests

So far, all shown DHARMa results were calculated for a correctly specified model, resulting in “perfect” residual plots and diagnostics. In this section, we discuss how to recognize and interpret diagnostics that indicate a misspecified model. Before going into the details, note, however, that

  1. No residual pattern does not “prove” that the model is correct: The fact that none of the DHARMa tests indicate a problem does not “prove” that the model is correctly specified. For any model, there are likely a large number of structural problems that do not create a pattern in the DHARMa diagnostics. In good old Popper fashion, you should interpret no residual problems as your working hypothesis not being rejected in that particular test, which increases confidence in the model, but does not constitute a conclusive proof. So, keep your skepticism alive, and if you find the results fishy, keep searching and testing.

  2. Once a residual effect is statistically significant, look at the magnitude to decide if there is a problem: It is crucial to note that significance is NOT a measure of the strength of the residual pattern, it is a measure of the signal/noise ratio, i.e. whether you are sure there is a pattern at all. Significance in hypothesis tests depends on at least 2 ingredients: the strength of the signal and the number of data points. If you have a lot of data points, residual diagnostics will nearly inevitably become significant, because having a perfectly fitting model is very unlikely. That, however, doesn’t necessarily mean that you need to change your model. The p-values confirm that there is a deviation from your null hypothesis. It is, however, in your discretion to decide whether this deviation is worth worrying about. For example, if you see a dispersion parameter of 1.01, I would not worry, even if the dispersion test is significant. A significant value of 5, however, is clearly a reason to move to a model that accounts for overdispersion.

  3. A residual pattern does not indicate that the model is unusable: While a significant pattern in the residuals indicates with good reliability that the observed data did likely not originate from the fitted model, this doesn’t necessarily imply that the inferential results from this wrong model are not usable. There are many situations in statistics where it is common practice to work with “wrong models”. For example, many statistical models use shrinkage estimators, which purposefully bias parameter estimates to certain values. Random effects are a special case of this. If DHARMa residuals for these estimators are calculated, they will often show a slight pattern in the residuals even if the model is correctly specified, and tests for this can get significant for large sample sizes. For this reason, DHARMa is excluding RE estimates in the predictions when plotting res ~ pred. Another example is data that is missing at random (MAR). Since it is known that this phenomenon does not create a bias on the fixed effects estimates, it is common practice to fit these data with standard mixed models. Nevertheless, DHARMa recognizes that the observed data looks different from what would be expected from the model assumptions, and flags the model as problematic (see here).

Important conclusion: DHARMa only flags a difference between the observed and expected data - the user has to decide whether this difference is actually a problem for the analysis!

Recognizing over/underdispersion

GL(M)Ms often display over/underdispersion, which means that residual variance is larger/smaller than expected under the fitted model. This phenomenon is most common for GLM families with constant (fixed) dispersion, in particular for Poisson and binomial models, but it can also occur in GLM families that adjust the variance (such as the beta or negative binomial) when distribution assumptions are violated. A few general rules of thumb about dealing with dispersion problems:

Residual patterns of over/underdispersion

This this is how overdispersion looks like in the DHARMa residuals. Note that we get more residuals around 0 and 1, which means that more residuals are in the tail of distribution than would be expected under the fitted model.

testData = createData(sampleSize = 200, overdispersion = 1.5, family = poisson())
fittedModel <- glm(observedResponse ~  Environment1 , family = "poisson", 
                   data = testData)

simulationOutput <- simulateResiduals(fittedModel = fittedModel)
plot(simulationOutput)

If you see this pattern, note that overdispersion is often caused by model misfit. Thus, before moving to a GLM with variable dispersion (for count data this would typically be a negative binomial), you should check your model for misfit, e.g. by plotting residuals against all predictors using plotResiduals().

Next, this is an example of underdispersion. Here, we get too many residuals around 0.5, which means that we are not getting as many residuals as we would expect in the tail of the distribution than expected from the fitted model.

testData = createData(sampleSize = 500, intercept=0, fixedEffects = 2,
                      overdispersion = 0, family = poisson(),
                      roundPoissonVariance = 0.001, randomEffectVariance = 0)
fittedModel <- glmer(observedResponse ~ Environment1 + (1|group) , 
                     family = "poisson", data = testData)

simulationOutput <- simulateResiduals(fittedModel = fittedModel)
plot(simulationOutput)

If you see this pattern, note that a common reason for underdispersion is overfitting, i.e. your model is too complex. Other possible explanations to check for include zero-inflation (best to check by comparing to a ZIP model, but see also DHARMa::testZeroInflation), non-independence of the data (e.g. temporal autocorrelation, check via DHARMa:: testTemporalAutocorrelation) that your predictors can use to overfit, or that your data-generating process is simply not a Poisson process.

From a technical side, underdispersion is not as concerning as over dispersion, as it will usually bias p-values to the conservative side, but if your goal is to get a good power, you may want to consider a simpler model. If that is not helping, you can move to a distribution for underdispersed count data (e.g. Conway-Maxwell-Poisson, generalized Poisson).

Formal tests for over/underdispersion

Although, as discussed above, over/underdispersion will show up in the residuals, and it’s possible to detect it with the testUniformity function, simulations show that this test is less powerful than more targeted tests. DHARMa contains several overdispersion tests that compare the dispersion of simulated residuals to the observed residuals.

  1. default: a non-parametric test that compares the variance of the simulated residuals to the observed residuals (default), which has some analogy to the variance test implemented in aer::dispersiontest
  2. PearsonChisq: alternatively, DHARMa implements the Pearson-chi2 test that is popular in the literature, suggested in the glmm Wiki, and implemented in some other R packages such as performance::check_overdispersion
  3. refit if residual simulations are done via refit, DHARMa will compare the the Pearson residuals of the re-fitted simulations to the original Pearson residuals. This is essentially a nonparametric version of test 2.

All of these tests are included in the testDispersion function, see ?testDispersion for details.

testDispersion(simulationOutput)

## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## dispersion = 0.067211, p-value < 2.2e-16
## alternative hypothesis: two.sided

IMPORTANT INFO: we have made extensive simulations, which have shown that the various tests have certain advantages and disadvantages. The basic results are that:

  • The most powerful and reliable test is option 3, but this costs a lot of time and is not available for all regression packages, as it requires that Pearson residuals are available
  • Option 2, the parametric Pearson-chi2 is fast if Pearson residuals are available, but based on a naive expectation of df (counts RE as 1 df) and the test statistic is thus biased towards underdispersion for mixed models. Similar to the df approximation, Bias increasing with the number of RE levels. When testing only for overdispersion (alternative = “greater”), this makes the test more conservative, but it also costs power.
  • The DHARMa default option 1 is fast, nearly unbiased (i.e. you can test under and overdispersion), and only slightly less powerful as test 3, PROVIDED that simulations are made conditional on the fitted REs. Note that the latter is not the DHARMa default, so you have to actively request conditional simulations, e.g. for lme4 by specifying re.form = NULL. Power compared to the parametric Pearson-chi2 test depends on the number of RE levels, it will be more powerful for typical number of RE levels.

As support for these statements, here results of the simulation, which compares the uniform (KS) test with the standard simuation-based test (conditional and unconditional) and the Pearson-chi2 test (two-sided and greater) for an n=200 Poisson GLMM with 30 RE levels.

Thus, my current recommendation is: for most users, use the default DHARMa test, but create simulations conditionally.

Zero-inflation / k-inflation or deficits

A common special case of overdispersion is zero-inflation, which is the situation when more zeros appear in the observation than expected under the fitted model. Zero-inflation requires special correction steps. More generally, we can also have too few zeros, or too much or too few of any other values. We’ll discuss that at the end of this section.

Residual patterns

Here an example of a typical zero-inflated count dataset, plotted against the environmental predictor

testData = createData(sampleSize = 500, intercept = 2, fixedEffects = c(1),
                      overdispersion = 0, family = poisson(), 
                      quadraticFixedEffects = c(-3), randomEffectVariance = 0,
                      pZeroInflation = 0.6)

par(mfrow = c(1,2))
plot(testData$Environment1, testData$observedResponse, xlab = "Envrionmental Predictor", 
     ylab = "Response")
hist(testData$observedResponse, xlab = "Response", main = "")