Checking that models adequately represent data is an essential component of applied statistical inference. Ecologists increasingly use hierarchical Bayesian statistical models in their research. The appeal of this modeling paradigm is undeniable, as researchers can build and fit models that embody complex ecological processes while simultaneously accounting for observation error. However, ecologists tend to be less focused on checking model assumptions and assessing potential lack of fit when applying Bayesian methods than when applying more traditional modes of inference such as maximum likelihood. There are also multiple ways of assessing the fit of Bayesian models, each of which has strengths and weaknesses. For instance, Bayesian *P* values are relatively easy to compute, but are well known to be conservative, producing *P* values biased toward 0.5. Alternatively, approaches such as prior predictive checks, cross‐validation probability integral transforms, and pivot discrepancy measures may produce more accurate characterizations of goodness‐of‐fit but are less familiar to ecologists. In addition, a suite of visual and targeted diagnostics can be used to examine violations of different model assumptions and lack of fit at different levels of the modeling hierarchy, and to check for residual temporal or spatial autocorrelation. In this review, we synthesize existing literature to guide ecologists through the many available options for Bayesian model checking. We illustrate methods and procedures with several ecological case studies including (1) analysis of simulated spatiotemporal count data, (2) N‐mixture models for estimating abundance of sea otters from an aircraft, and (3) hidden Markov modeling to describe attendance patterns of California sea lion mothers on a rookery.
We find that commonly used procedures based on posterior predictive *P* values detect extreme model inadequacy, but often do not detect more subtle cases of lack of fit. Tests based on cross‐validation and pivot discrepancy measures (including the “sampled predictive *P* value”) appear to be better suited to model checking and to have better overall statistical performance. We conclude that model checking is necessary to ensure that scientific inference is well founded. As an essential component of scientific discovery, it should accompany most Bayesian analyses presented in the literature.

Ecologists increasingly use Bayesian methods to analyze complex hierarchical models for natural systems (Hobbs and Hooten ). There are clear advantages of adopting a Bayesian mode of inference, as one can entertain models that were previously intractable using common modes of statistical inference (e.g., maximum likelihood). Ecologists use Bayesian inference to fit rich classes of models to their data sets, allowing them to separate measurement error from process error, and to model features such as temporal or spatial autocorrelation, individual level random effects, and hidden states (Link et al. , Clark and Bjørnstad , Cressie et al. ). Applying Bayesian calculus also results in posterior probability distributions for parameters of interest; used together with posterior model probabilities, these can provide the basis for mathematically coherent decision and risk analysis (Link and Barker , Berger , Williams and Hooten ).

Ultimately, the reliability of inference from a fitted model (Bayesian or otherwise) depends on how well the model approximates reality. There are multiple ways of assessing a model's performance in representing the system being studied. A first step is often to examine diagnostics that compare observed data to model output to pinpoint if and where any systematic differences occur. This process, which we term *model checking*, is a critical part of statistical inference because it helps diagnose assumption violations and illuminate places where a model might be amended to more faithfully represent gathered data. Following this step, one might proceed to compare the performance of alternative models embodying different hypotheses using any number of model comparison or out‐of‐sample predictive performance metrics (see Hooten and Hobbs for a review) to gauge the support for alternative hypotheses or optimize predictive ability (Fig. ).

Non‐Bayesian statistical software often includes a suite of goodness‐of‐fit diagnostics that examine different types of lack of fit (Table ). For instance, when fitting generalized linear (McCullagh and Nelder ) or additive (Wood ) models in the R programming environment (R Development Core Team ), one can easily access diagnostics such as quantile‐quantile, residual, and leverage plots. These diagnostics allow one to assess the assumed probability model, to examine whether there is evidence of heteroskedasticity, and to pinpoint outliers. Likewise, in capture–recapture analysis, there are established procedures for assessing overall fit and departures from specific model assumptions that are coded in user‐friendly software such as U‐CARE (Choquet et al. ). Results of such goodness‐of‐fit tests are routinely reported when publishing analyses in the ecological literature.

The implicit requirement that one conduct model checking exercises is not often adhered to when reporting results of Bayesian analyses. For instance, a search of *Ecology* articles published in 2014 indicated that only 25% of articles employing Bayesian analysis on real data sets reported any model checking or goodness‐of‐fit testing (Fig. ). There are several reasons why Bayesian model checking (hereafter, BMC) is uncommon. First, it likely has to do with inertia; the lack of precedent in ecological literature may lead some authors looking for templates on how to publish Bayesian analyses to conclude that model checking is unnecessary. Second, when researchers seek to publish new statistical methods, applications may be presented more as proof‐of‐concept exhibits than as definitive analyses that can stand up to scrutiny on their own. In such studies, topics like goodness‐of‐fit and model checking are often reserved for future research, presumably in journals with less impact. Third, all of the articles we examined did a commendable job in reporting convergence diagnostics to support their contention that MCMC chains had reached their stationary distribution. Perhaps there is a mistaken belief among authors and reviewers that convergence to a stationary distribution, combined with a lack of prior sensitivity, implies that a model fits the data. In reality, convergence diagnostics such as trace plots only allow us to check the algorithm for fitting the model, not the model itself. Finally, it may just be a case of fatigue: it takes considerable effort to envision and code complex hierarchical models of ecological systems, and the extra step of model checking may seem burdensome. Regardless of the reason for not reporting BMC, it is concerning because poorly specified models can lead to incorrect scientific inference.

If we accept the premise that Bayesian models should be routinely checked for compatibility with data, a logical next question is how best to conduct such checks. Unfortunately, there is no single best answer. Most texts in ecology (e.g., King et al. , Link and Barker , Kéry and Schaub ) focus on posterior predictive checks, as pioneered by Guttman (), Rubin (), Rubin et al. (), and Gelman et al. () (among others). These procedures are also the main focus of popular Bayesian analysis texts (e.g., Cressie and Wikle , Gelman et al. ) and are based on the intuitive notion that data simulated from the posterior distribution should be similar to the data one is analyzing. However, “Bayesian *P* values” generated from these tests tend to be conservative (biased toward 0.5) because the data are used twice (once to fit the model and once to test the model; Bayarri and Berger , Robins et al. ). Depending on the data, the conservatism of Bayesian *P* values can be considerable (Zhang ) and can be accompanied by an inability to detect lack of fit (Yuan and Johnson , Zhang ). By contrast, other less familiar approaches (such as prior predictive checks, sampled posterior *P* values, cross‐validated probability integral transforms, and pivot discrepancy measures) may produce more accurate characterizations of model fit.

In this monograph, we collate relevant statistical literature with the goal of providing ecologists with a practical guide to BMC. We start by defining a consistent notation that we use throughout the paper. Next, we inventory a number of BMC procedures, providing pros and cons for each approach. We illustrate BMC with several examples; code to implement these examples is available in an accompanying R package, HierarchicalGOF (Conn et al. ). In the first example, we use simulation to study the properties of a variety of BMC procedures applied to spatial models for count data. In the second example, we apply BMC procedures to check the closure assumption (i.e., that the population being sampled is closed with respect to births, deaths, and movement) of N‐mixture models, using both simulated data and data from northern sea otters (*Enhydra lutris kenyoni*) in Glacier Bay, Alaska, USA. Finally, we apply BMC to examine attendance patterns of California sea lions (CSL; *Zalophus californianus*) using capture‐recapture data from a rookery on San Miguel Island, California, USA. We conclude with several recommendations on how model checking results should be presented in the ecological literature.

Before describing specific model checking procedures, we first establish common notation. Bayesian inference seeks to describe the posterior distribution, [**θ** | **y**], of model parameters **θ** given data **y**.

The posterior distribution is often written as

[**θ** | **y**] = [**y** | **θ**][**θ**] / [**y**]

where [**y** | **θ**] is the data model (likelihood), [**θ**] is the prior distribution, and [**y**] is the marginal distribution of the data.

In describing different model checking procedures, we often refer to data simulated under an assumed model. We use **y**_{i}^{rep} to denote the *i*th simulated (replicate) data set under the model that is being checked. In some situations, we may indicate that the data set was simulated using a specific parameter vector, **θ**_{i}; in this case, we denote the simulated data set as **y**_{i}^{rep}(**θ**_{i}). Finally, we use *T*(**y**, **θ**) to denote a discrepancy function that measures disagreement between a data set **y** and a model with parameters **θ**.

Our goal in this section is to review relevant BMC procedures for typical models in ecology, with the requirement that such procedures be accessible to statistically minded ecologists. As such, we omit several approaches that have good statistical properties but have been criticized (e.g., Johnson , Zhang ) as too computationally intensive, conceptually difficult, or problem specific. For instance, we omit consideration of double‐sampling methods that may increase the computational burden of a Bayesian analysis by an order of magnitude (Johnson ), including “partial posterior” and “conditional predictive” *P* values (e.g., Bayarri and Berger , Robins et al. , Bayarri and Castellanos ). A brief summary of the model checking procedures we consider is provided in Table ; we now describe each of these approaches in greater depth.

PIT, probability integral transform.

For each method, we indicate (1) whether it tends to be "conservative" (i.e., the nominal α overstates the actual type I error rate), (2) whether all levels of the modeling hierarchy can be evaluated ("all levels"), (3) whether out‐of‐sample data are used to assess lack of fit ("out of sample"), and (4) the computing cost ("cost").

Box () argued that the hypothetico‐deductive process of scientific learning can be embodied through successive rounds of model formulation and testing. According to his view, models are built to represent current theory and an investigator's knowledge of the system under study; data are then collected to evaluate how well the existing theory (i.e., model) matches up with reality. If necessary, the model under consideration can be amended, and the process repeats itself.

From a Bayesian standpoint, such successive rounds of *estimation* and *criticism* can be embodied through posterior inference and model checking, respectively (Box ). If one views a model, complete with its assumptions and prior beliefs, as a working model of reality, then data simulated under the model should look similar to data gathered in the real world. This notion can be formalized through a prior predictive check, where replicate data **y**^{rep} are simulated from the prior predictive distribution [**y**^{rep}] = ∫ [**y**^{rep} | **θ**][**θ**] d**θ** and compared to the observed data **y** via a discrepancy function (Appendix S1: Algorithm 1).
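To make the procedure concrete, here is a minimal Python sketch (the paper's own code is in R; the Gamma prior and variance discrepancy used here are hypothetical choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

def prior_predictive_pvalue(y, n_sim=2000):
    """Prior predictive check for y_i ~ Poisson(lambda) with a
    (hypothetical, informative) prior lambda ~ Gamma(shape=20, scale=0.5).
    Discrepancy T(y) = sample variance, targeting overdispersion."""
    T_obs = np.var(y)
    T_rep = np.empty(n_sim)
    for i in range(n_sim):
        lam = rng.gamma(shape=20.0, scale=0.5)   # draw from the prior
        y_rep = rng.poisson(lam, size=len(y))    # prior predictive data set
        T_rep[i] = np.var(y_rep)
    # P value: proportion of simulated discrepancies at least as large as observed
    return np.mean(T_rep >= T_obs)

y_obs = rng.poisson(10.0, size=100)              # data consistent with the prior
p = prior_predictive_pvalue(y_obs)

y_over = rng.negative_binomial(2, 1.0 / 6.0, size=100)  # mean 10, variance ~60
p_over = prior_predictive_pvalue(y_over)         # should flag overdispersion
```

The check only works when the prior genuinely predicts the plausible range of the data; with a vague prior, the prior predictive distribution is so diffuse that almost any data set looks consistent with it.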

When the prior distribution [**θ**] is proper (i.e., integrates to 1.0), *P* values from prior predictive checks are uniformly distributed under the null model (Bayarri and Berger ). The main problem with this approach is that prior distributions must be able to predict the likely range of data values; therefore, they require substantial expert opinion or data from previous studies. In our experience, when Bayesian inference is employed in ecological applications, this is not often the case. Still, prior predictive checks may be useful for Bayesian models that serve as an embodiment of current theory about a study system (e.g., population or ecosystem dynamics models). Alternatively, a subset of data (test data) can be withheld when fitting a model, and the posterior distribution [**θ** | **y**] can be substituted for [**θ**] when simulating replicate data; in that case, the withheld test data are compared to predictions from a model fitted to the remaining (training) data.

Prior predictive checks appear to have found little use in applied Bayesian analysis (but see Dey et al. ), at least in the original form proposed by Box (). However, they are important as historical precursors of modern‐day approaches to Bayesian model checking. Further, several researchers have recently used discrepancy measures calculated on prior predictive data sets to help calibrate posterior predictive (e.g., Hjort et al. ) or joint pivot discrepancy (Johnson ) *P* values so that they have a uniform null distribution. These calibration exercises are not conceptually difficult but do have a high computational burden (Yuan and Johnson ). The properties (e.g., type I error probabilities, power) of *P* values produced with these methods also depend critically on the similarity of the real world data‐generating process with the prior distributions used for calibration (Zhang ).

Posterior predictive checks are the dominant form of Bayesian model checking advanced in statistical texts read by ecologists (e.g., King et al. , Link and Barker , Kéry and Schaub , Gelman et al. ). Although sample size was small (*n* = 25), a survey of recent *Ecology* volumes indicated that posterior predictive checks are also the dominant form of BMC being reported in ecological literature (Fig. ). Posterior predictive checks are based on the intuition that data simulated under a fitted model should be comparable to the real‐world data the model was fitted to. If observed data differ from simulated data in a systematic fashion (e.g., excess zeros, increased skew, increased variance, lower kurtosis), it indicates that model assumptions are not being met.

Posterior predictive checks can be used to look at differences between observed and simulated data graphically, or can be used to calculate "Bayesian *P* values" (Appendix S1: Algorithm 2). Bayesian *P* values necessarily involve application of a discrepancy function, *T*(**y**, **θ**), which quantifies disagreement between data and model. The Bayesian *P* value is then the posterior probability that the discrepancy computed on replicate data meets or exceeds the discrepancy computed on the observed data, Pr(*T*(**y**^{rep}, **θ**) ≥ *T*(**y**, **θ**) | **y**).

Posterior predictive checks are straightforward to implement. Unfortunately, Bayesian *P* values based on these checks tend to be conservative in the sense that the distribution of *P* values calculated under a null model (i.e., when the data generating model and estimation model are the same) is often dome shaped instead of the uniform distribution expected of frequentist *P* values (Robins et al. ). This feature arises because data are used twice: once to approximate the posterior distribution and to simulate the reference distribution for the discrepancy measure, and a second time to calculate the tail probability (Bayarri and Berger ). As such, the power of posterior predictive Bayesian *P* values to detect significant differences in the discrepancy measure is low. Evidently, the degree of conservatism can vary across data, models, and discrepancy functions, making it difficult to interpret or compare Bayesian *P* values across models. In an extreme example, Zhang () found that posterior predictive *P* values almost never rejected a model, even when the model used to fit the data differed considerably from the model used to generate it.
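The double use of data can be made concrete with a Python sketch (a hypothetical conjugate Gamma-Poisson model is used here so posterior draws are exact rather than from MCMC; the χ² discrepancy matches Table ):

```python
import numpy as np

rng = np.random.default_rng(1)

def chisq(y, lam):
    """Chi-squared (Pearson) discrepancy T(y, theta) for a Poisson model."""
    return np.sum((y - lam) ** 2 / lam)

def bayes_pvalue(y, n_draws=1000, a0=0.01, b0=0.01):
    """Posterior predictive P value for y_i ~ Poisson(lambda) with a
    conjugate prior lambda ~ Gamma(a0, b0) (rate parameterization)."""
    a_post, b_post = a0 + y.sum(), b0 + len(y)
    exceed = 0
    for _ in range(n_draws):
        lam = rng.gamma(a_post, 1.0 / b_post)   # posterior draw of theta
        y_rep = rng.poisson(lam, size=len(y))   # replicate data from [y|theta]
        if chisq(y_rep, lam) >= chisq(y, lam):  # T(y_rep, theta) vs T(y, theta)
            exceed += 1
    return exceed / n_draws

p_good = bayes_pvalue(rng.poisson(5.0, size=50))               # well specified
p_bad = bayes_pvalue(rng.negative_binomial(2, 0.25, size=50))  # overdispersed
```

Note that the same `y` enters twice: once through the posterior (`a_post`, `b_post`) and once in the observed discrepancy, which is the source of the conservatism; only gross misspecification (as in `p_bad`) reliably pushes the *P* value toward zero.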

Another possible criticism of posterior predictive checks is that they rely solely on properties of simulated and observed data. Given that a lack of fit is observed, it may be difficult to diagnose where misspecification has occurred within the modeling hierarchy (e.g., priors, mean structure, choice of error distribution). Further, a poorly specified mean structure (e.g., missing important covariates) may still result in reasonable fit if the model is made sufficiently flexible (e.g., via random effects or covariance).

These cautions do not imply that posterior predictive checks are devoid of value. Indeed, given that tests are conservative, small (e.g., <0.05) or very large (e.g., >0.95) *P* values strongly suggest lack of fit. Further, graphical displays (see Graphical techniques) and targeted discrepancies (Table ) may help pinpoint common assumption violations (e.g., lack of independence, zero inflation, overdispersion). However, it is often less clear how to interpret *P* values and discrepancies that indicate no (or minor) lack of fit. In these cases, it seems necessary to conduct simulation‐based exercises to determine the range of *P* values that should be regarded as extreme, and to possibly calibrate the observed *P* value with those obtained in simulation exercises (e.g., Dey et al. , Hjort et al. ).

Some practical suggestions may help to reduce the degree of conservatism of posterior predictive *P* values. Lunn et al. () suggest that the level of conservatism depends on the discrepancy function used; discrepancy functions that are solely a function of simulated and observed data (e.g., proportion of zeros, distribution of quantiles) may be less conservative than those that also depend on model parameters (e.g., summed Pearson residuals). Similarly, Marshall and Spiegelhalter () suggest reducing the impact of the double use of data by iteratively simulating random effects when generating posterior predictions for each data point, a procedure they term a “mixed predictive check” (also called “ghosting”). For instance, rather than basing a posterior prediction directly on random effect realizations available from MCMC sampling, we could instead simulate random effects from a leave‐one‐out distribution. For an example of this latter approach, see Spatial regression simulations.

Posterior predictive checks rely on multiple draws from a posterior distribution. Alternatively, one can simulate a single parameter vector **θ*** from the posterior, simulate a single replicate data set from [**y** | **θ***], and calculate a *P* value in the same manner. This choice may seem strange because the resulting *P* value varies depending upon which posterior sample is drawn; however, such "sampled posterior" *P* values are guaranteed to at least have an asymptotic uniform distribution under the null (Gosselin ). Sampled posterior *P* values can also be calculated using pivotal discrepancy measures (PDMs), reducing computational burden (i.e., eliminating the requirement that replicate data sets be generated). We describe an example of this approach in *Spatial regression simulations*.
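A sampled posterior *P* value can be sketched in Python as follows (again using a hypothetical conjugate Gamma-Poisson model as a stand-in for MCMC output):

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.poisson(5.0, size=50)

# Conjugate Gamma posterior (hypothetical stand-in for MCMC samples)
a_post, b_post = 0.01 + y.sum(), 0.01 + len(y)

# Step 1: a SINGLE draw of theta from the posterior
lam_star = rng.gamma(a_post, 1.0 / b_post)

# Step 2: a classical Monte Carlo P value conditional on that one draw
T_obs = np.sum((y - lam_star) ** 2 / lam_star)
T_rep = np.array([
    np.sum((rng.poisson(lam_star, size=len(y)) - lam_star) ** 2 / lam_star)
    for _ in range(2000)
])
sampled_p = np.mean(T_rep >= T_obs)
```

Conditioning on a single draw breaks the feedback between fitted parameters and the reference distribution, which is what restores the (asymptotic) uniformity of the *P* value under the null.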

In addition to their limited power to detect lack of fit, posterior predictive *P* values are limited to examining systematic differences between observed data and data simulated under a hypothesized model. As such, there is little ability to examine lack of fit at higher levels of the modeling hierarchy. One approach to conducting goodness‐of‐fit tests at multiple levels of the model is to use discrepancy functions based on pivotal quantities (Johnson , Yuan and Johnson ). Pivotal quantities are random variables that can be functions of data, parameters, or both, and that have known probability distributions that do not depend on the parameters (e.g., Casella and Berger : section 9.2.2). For instance, if *y* ˜ Normal(μ, σ²), then *z* = (*y* − μ)/σ has a standard normal distribution; *z* is a pivotal quantity in that it has a known distribution independent of μ or σ.

This suggests a potential strategy for assessing goodness‐of‐fit. For instance, in a Bayesian regression model **y** ˜ Normal(**Xβ**, σ²**I**), where **X** represents a design matrix, **β** is a vector of regression coefficients, and **I** is an identity matrix, we might keep track of the standardized residuals *z*_{ij} = (*y*_{i} − **x**_{i}′**β**_{j})/σ_{j} for *j* ∈ 1, 2, … , *n* samples from the posterior distribution (i.e., drawing each (**β**_{j}, σ_{j}) pair from [**θ** | **y**]). Systematic departures of *z*_{ij} from the theoretical standard normal distribution then indicate lack of fit.
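A Python sketch of this strategy (the "posterior draws" here are illustrative jitter around a least-squares fit; a real analysis would track *z*_{ij} across MCMC samples):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])
sigma_true = 0.5
y = X @ beta_true + rng.normal(scale=sigma_true, size=n)

# Stand-in for posterior draws of (beta, sigma): jitter around the
# least-squares fit (an MCMC sampler would supply these in practice)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
z_all = []
for j in range(100):
    beta_j = beta_hat + rng.normal(scale=0.02, size=2)
    sigma_j = sigma_true * np.exp(rng.normal(scale=0.02))
    z_all.append((y - X @ beta_j) / sigma_j)   # pivotal quantities z_ij
z_all = np.concatenate(z_all)

# Under a well-specified model the z_ij are approximately Normal(0, 1),
# so roughly 5% should fall outside (-1.96, 1.96)
tail_frac = np.mean(np.abs(z_all) > 1.96)
```

The diagnostic is the comparison of the pooled *z*_{ij} to the standard normal reference (e.g., via a quantile-quantile plot or the tail frequency computed above).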

The advantage of using PDMs is that the reference distribution is known and does not necessarily involve simulation of replicated data sets, **y**^{rep}. In practice, there are several difficulties with using pivotal quantities as discrepancy measures in BMC. First, as with the sampled predictive *P* value, *P* values using PDMs are only guaranteed to be uniform under the null if calculated with respect to a single posterior parameter draw; PDMs computed across *i* ∈ 1, 2, … , *n* samples from the posterior distribution are not independent because they depend on the same observed data, **y** (Johnson ). As with the Bayesian *P* value, aggregating PDMs over all posterior samples therefore tends to produce conservative tests.

A second problem is that, to apply these techniques, one must first define a pivotal quantity and ascertain its reference distribution. Normality assessment is relatively straightforward using standardized residuals (e.g., Eq. ), but pivotal quantities are not necessarily available for other distributions (e.g., Poisson). However, Yuan and Johnson (), building upon work of Johnson (), proposed an algorithm based on cumulative distribution functions (CDFs) that can apply to any distribution, and at any level of a hierarchical model (Appendix S1: Algorithm 3). For continuous distributions, this algorithm works by defining a quantity *w*_{ij} = *g*(*y*_{ij}, **θ**) (this can simply be *w*_{ij} = *y*_{ij}) with a known CDF, *F*. Then, according to the probability integral transformation, *F*(*w*) will be uniformly distributed if the modeled distribution function is appropriate. Similarly, for discrete distributions, we can apply a randomization scheme (Smith , Yuan and Johnson ) to transform discrete variables into continuously distributed uniform variates. For example, when *y*_{ij} has integer‐valued support, we can define *u*_{ij} = *F*(*w*_{ij} − 1) + *v*_{ij} *f*(*w*_{ij}), where *v*_{ij} is a continuous uniform random variable on (0,1) and *F*() and *f*() are the cumulative mass and probability mass functions associated with [**y** | **θ**]; the *u*_{ij} are then Uniform(0,1) under a correctly specified model.
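The randomization *u* = *F*(*w* − 1) + *v f*(*w*) is easy to verify numerically; a Python sketch for Poisson data (the rate λ = 4 is an arbitrary illustrative value):

```python
import numpy as np

rng = np.random.default_rng(11)
lam = 4.0
y = rng.poisson(lam, size=5000)   # discrete data to be transformed

# Poisson pmf/cdf over the observed support, computed on the log scale
kmax = y.max()
log_fact = np.cumsum(np.concatenate([[0.0], np.log(np.arange(1, kmax + 1))]))
pmf = np.exp(np.arange(kmax + 1) * np.log(lam) - lam - log_fact)
cdf = np.cumsum(pmf)

# Randomization: u = F(y - 1) + v * f(y), with v ~ Uniform(0, 1)
F_ym1 = np.where(y > 0, cdf[y - 1], 0.0)
v = rng.uniform(size=y.size)
u = F_ym1 + v * pmf[y]
# If the Poisson model is correct, u is continuously Uniform(0, 1)
```

If the assumed distribution were wrong (e.g., the data were actually overdispersed), the resulting *u* values would depart visibly from uniformity.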

We have written the PDM algorithm in terms of the data distribution [**y** | **θ**], but it can equally be applied at higher levels of a hierarchical model (e.g., to latent random effects or other intermediate stages).

Cross‐validation consists of leaving out one or more data points, conducting an analysis, and checking how model predictions match up with actual observations. This process is often repeated sequentially for different partitions of the data. It is most often used to examine the relative predictive performance of different models (i.e., for model selection; Arlot and Celisse ). However, one can also use cross‐validation to examine model fit and determine outliers. The primary advantage of conducting tests in this fashion is that there is no duplicate use of data as with posterior predictive tests or those based on joint PDMs. However, cross‐validation can be computationally intensive (sometimes prohibitively so) for complicated Bayesian hierarchical models.

One approach to checking models using cross‐validation is the cross‐validated probability integral transform (PIT) test, which has long been exploited to examine the adequacy of probabilistic forecasts (e.g., Dawid , Frühwirth‐Schnatter , Gneiting et al. , Czado et al. ). These tests work by simulating data at a set of times or locations, and evaluating the CDF of the predictions at the realized data (where realized data are not used to fit the model). This can be accomplished in a sequential fashion for time series data, or by withholding data (as with leave‐one‐out cross‐validation). In either case, if the distribution of the CDF values at the realized data diverges from a Uniform(0,1) distribution, it is indicative of model deficiency. In particular, a U shape suggests an underdispersed model, a dome shape suggests an overdispersed model, and skew (i.e., a mean not centered at 0.5) suggests bias. Congdon () provided an algorithm for computing PIT diagnostic histograms for both continuous and discrete data in Bayesian applications (see Appendix S1: Algorithm 4).
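These diagnostic shapes can be reproduced with simulated forecasts; a small Python sketch (the Normal forecast distributions below are hypothetical stand-ins for model predictions):

```python
import math
import numpy as np

rng = np.random.default_rng(5)
# Standard normal CDF (vectorized over arrays)
Phi = np.vectorize(lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0))))

# Forecast distribution is Normal(0, 1); the PIT value of observation y is Phi(y)
pit_good = Phi(rng.normal(0.0, 1.0, size=4000))   # data match the forecast
pit_under = Phi(rng.normal(0.0, 2.0, size=4000))  # forecast underdispersed: U shape
pit_bias = Phi(rng.normal(1.0, 1.0, size=4000))   # forecast biased low: skewed PIT

var_good = np.var(pit_good)    # ~1/12, as for Uniform(0, 1)
var_under = np.var(pit_under)  # inflated relative to 1/12 (mass near 0 and 1)
mean_bias = np.mean(pit_bias)  # > 0.5 (mass shifted toward 1)
```

Histograms of these three PIT vectors reproduce the flat, U-shaped, and skewed patterns described above.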

Cross‐validation can also be useful for diagnosing outliers in spatial modeling applications. For instance, Stern and Cressie () and Marshall and Spiegelhalter () use it to identify regions that have inconsistent behavior relative to the model. Such outliers can indicate that the model does not sufficiently explain variation in responses, that there are legitimate “hot spots” worthy of additional investigation (Marshall and Spiegelhalter ), or both.

For certain types of data and models, it is possible to approximate leave‐one‐out cross‐validation tests with a single sample from the posterior distribution. For instance, in random effects models, importance weighting and resampling can be used to approximate the leave‐one‐out distribution (Stern and Cressie , Qiu et al. ). Similarly, Marshall and Spiegelhalter () use a “ghosting” procedure to resample random effects and thereby approximate the leave‐one‐out distribution. When applicable, such approaches have well known properties (i.e., a uniform distribution of *P* values under the null; Qiu et al. ).

Lunn et al. () suggest several informal tests based on distributions of Pearson and deviance residuals. These tests are necessarily informal in Bayesian applications because residuals all depend on **θ** and are thus not truly independent, as required for unbiased application of goodness‐of‐fit tests. Nevertheless, several rules of thumb can be used to screen residuals for obvious assumption violations. For example, standardized Pearson residuals for continuous data should be approximately standard normal under a well‐specified model, so absolute values much greater than two should be rare.

For time series, spatial, and spatiotemporal models, failure to account for autocorrelation can result in bias and overstated precision (Lichstein et al. ). For this reason, it is important to look for evidence of residual spatiotemporal autocorrelation in analyses where data have a spatiotemporal index. There are a variety of metrics to quantify autocorrelation, depending upon the ecological question and types of data available (e.g., Perry et al. ). For Bayesian regression models, one versatile approach is to compute a posterior density associated with a statistic such as Moran's *I* (Moran ) or Getis‐Ord G* (Getis and Ord ) applied to residuals. For example, calculating Moran's *I* for each posterior sample *j*, using the residuals **y** − *E*(**y** | **θ**_{j}), yields a posterior distribution for Moran's *I*; if this distribution is concentrated away from the value expected under independence, residual spatial autocorrelation is indicated.
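A Python sketch of this idea (the inverse-distance weight matrix and the residual vectors below are illustrative assumptions, not the paper's specification):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 100
coords = rng.uniform(0.0, 10.0, size=(n, 2))

# Row-standardized inverse-distance spatial weights (one common choice)
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
d[np.diag_indices(n)] = np.inf          # no self-neighbors
W = 1.0 / d
W /= W.sum(axis=1, keepdims=True)

def morans_I(resid, W):
    """Moran's I for one residual vector."""
    r = resid - resid.mean()
    return len(r) * (r @ W @ r) / (W.sum() * (r @ r))

# "Posterior distribution" of Moran's I: evaluate on the residuals from each
# posterior sample (independent noise stands in for those residuals here)
I_iid = np.array([morans_I(rng.normal(size=n), W) for _ in range(200)])

# A spatially smooth residual surface yields a clearly positive Moran's I
I_smooth = morans_I(np.sin(coords[:, 0] / 2.0) + 0.1 * rng.normal(size=n), W)
```

Under independence, Moran's *I* is centered near −1/(*n* − 1); a posterior distribution of *I* concentrated well above that value signals residual spatial structure.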

Many of the previously described tests require discrepancy functions, and it may be difficult to formulate such functions for different types of lack of fit (e.g., Table ). Displaying model checking information graphically may lead to more rapid intuition about where models do or do not fit the data. Alternative plots can be made for each type of model checking procedure (e.g., posterior predictive checks, sampled predictive checks, or even PDMs). For instance, Ver Hoef and Frost () plotted posterior predictive χ^{2} discrepancy values for different sites where harbor seal counts had been performed. Models accounting for overdispersion clearly resulted in improved fit at a majority of sites. The consistency of predictions was clear in this case, whereas a single *P* value would not effectively communicate where and how predictions were inaccurate.

Gelman et al. () argued that residual and binned residual plots are instructive for revealing patterns of model misspecification. In spatial problems, maps of residuals can be helpful in detecting whether lack of fit is spatially clustered. The types of plots that are possible are many and varied, so it is difficult to provide a comprehensive list in this space. However, we illustrate several types of diagnostic plots in the following examples.

We conduct all subsequent analyses using a combination of R (R Development Core Team ) and JAGS (Plummer ). We used R to simulate data and to conduct model testing procedures; JAGS was used to conduct MCMC inference and produce posterior predictions. We developed an R package, HierarchicalGOF, that contains all of our code. This package is publicly available and has been archived on Zenodo (Conn et al. ; package available online). The code is predominantly model‐specific; however, it can be used as a template for ecologists conducting their own model checking exercises.

We examined alternative model checking procedures for spatially explicit regression models applied to simulated count data. Such models are often used to describe variation in animal or plant abundance over space and time, and can be used to map abundance distributions or examine trends in abundance (e.g., Sauer and Link , Conn et al. ). A common question when modeling count data is whether there is overdispersion relative to the commonly chosen Poisson distribution. In ecological data, several sources of overdispersion are often present, including a greater number of zero counts than expected under the Poisson (zero inflation; Agarwal et al. ), and heavier tails than predicted by the Poisson (Potts and Elith , Ver Hoef and Boveng ). Another important question is whether there is residual spatial autocorrelation that needs to be taken into account for proper inference (Legendre , Lichstein et al. ).

In this simulation study, we generated count data under a Poisson distribution where the true mean response is a function of a hypothetical covariate, spatially autocorrelated error, and additional Gaussian noise. Data simulated in this manner arise from a spatially autocorrelated Poisson‐normal mixture, and can be expected to be overdispersed relative to the Poisson, in much the same way as a negative binomial distribution (a Poisson‐gamma mixture). We then examined the effectiveness of alternative model checking procedures for diagnosing incorrect model specification, such as when spatial independence is assumed. We also studied properties of model checking procedures when the correct estimation model is specified.

For a total of 1,000 simulation replicates, this study consisted of the following steps:

1. Generate *n* = 200 points at random in a square study area.

2. Generate a hypothetical covariate **x** at these points using a Matérn cluster process.

3. Generate the expected counts **μ** = exp(**Xβ** + **η** + **ε**), where **X** is an (*n* × 2) design matrix, **β** are regression coefficients, **η** are spatially autocorrelated random effects (see Appendix S2), and **ε** are iid Gaussian errors. The first column of **X** is a vector of all 1s, and the second column consists of **x**.

4. Generate counts *y*_{i}|μ_{i} ˜ Poisson(μ_{i}) at each of the *i* ∈ {1, 2, … , 200} points.

5. Fit each of the following estimation models to the simulated counts.

6. Apply a suite of model checking procedures to each fitted model.

Pois0: Poisson model with no overdispersion

PoisMix: Poisson‐normal mixture with independent Gaussian errors, but no spatially autocorrelated random effects

PoisMixSp: The data‐generating model, consisting of a Poisson‐normal mixture with both independent and spatially autocorrelated errors induced by a predictive process (cf. Banerjee et al. )

A depiction of the data‐generating algorithm (i.e., steps 1–4) is provided in Fig. ; mathematical details of this procedure, together with a description of Bayesian analysis methods used in step 5 are provided in Appendix S2. We now describe model checking procedures (step 6) in greater detail.

For each data set and statistical model, we calculated several posterior predictive *P* values with different discrepancy measures. These included χ^{2}, Freeman‐Tukey, and deviance‐based omnibus *P* values, as well as directed *P* values examining tail probabilities (Table ). Tail probabilities were examined by comparing the 95% quantile of simulated and estimated data.
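The discrepancy functions used here can be written compactly; a Python sketch (the simulated data sets below are illustrative, and the negative binomial stands in for overdispersed counts):

```python
import numpy as np

rng = np.random.default_rng(2)

def chisq(y, mu):
    """Chi-squared (Pearson) omnibus discrepancy for count data."""
    return np.sum((y - mu) ** 2 / mu)

def freeman_tukey(y, mu):
    """Freeman-Tukey omnibus discrepancy: sum_i (sqrt(y_i) - sqrt(mu_i))^2."""
    return np.sum((np.sqrt(y) - np.sqrt(mu)) ** 2)

def tail_quantile(y):
    """Directed discrepancy targeting the upper tail: the 95% quantile."""
    return np.quantile(y, 0.95)

mu = np.full(200, 6.0)
y_pois = rng.poisson(mu)                           # consistent with the model
y_over = rng.negative_binomial(2, 0.25, size=200)  # same mean (6), variance 24

ft_pois, ft_over = freeman_tukey(y_pois, mu), freeman_tukey(y_over, mu)
q_pois, q_over = tail_quantile(y_pois), tail_quantile(y_over)
```

In a posterior predictive check, each discrepancy would be computed for both the observed data and each replicate data set, with the *P* value given by the proportion of replicates whose discrepancy meets or exceeds the observed value.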

For the Pois0 model, calculation of posterior predictive *P* values was straightforward; posterior predictions (**y**^{rep}) were simulated from a Poisson distribution, with an expectation that depends on posterior samples of [**β** | **y**]. For the other two models (i.e., PoisMix and PoisMixSp), it was less obvious how best to calculate posterior predictions. For instance, we identified at least three ways to simulate replicated data, depending on which latent random effects are conditioned on (as opposed to re‐simulated) when generating predictions (see Appendix S2).

To calculate some of the omnibus discrepancy checks (Table ), one must also specify a method for calculating the expectation, *E*(*y*_{i}|**θ**). As with posterior predictions, this calculation depends on what one admits to being a parameter (e.g., are the latent **ν** variables part of the parameter set, **θ**?). We opted to start with the lowest level parameters possible. For instance, for PoisMix we calculated the expectation relative to the parameter set **θ** ≡ {**β**, τ_{ε}}, so that the expectation is the lognormal mean *E*(*y*_{i}|**θ**) = exp(**x**_{i}′**β** + 1/(2τ_{ε})); for PoisMixSp, **θ** ≡ {**β**, τ_{ε}, τ_{η}}, so that the expectation also incorporates the spatial variance component.
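To make the mechanics concrete, the following sketch computes posterior predictive *P* values with χ² and Freeman‐Tukey discrepancies for a deliberately simple Poisson model with a conjugate gamma posterior; the models in the study are richer, and all numbers here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative posterior predictive P values for a correctly specified
# Poisson model: y ~ Poisson(lambda), lambda | y ~ Gamma (conjugate).
y = rng.poisson(5.0, size=100)                                   # observed counts
lam = rng.gamma(y.sum() + 0.1, 1.0 / (y.size + 0.1), size=2000)  # posterior draws

def chisq(data, mu):
    # chi-square omnibus discrepancy
    return np.sum((data - mu) ** 2 / mu)

def freeman_tukey(data, mu):
    # Freeman-Tukey discrepancy; less sensitive to small expected values
    return np.sum((np.sqrt(data) - np.sqrt(mu)) ** 2)

p_vals = {}
for name, T in [("chisq", chisq), ("freeman_tukey", freeman_tukey)]:
    exceed = 0
    for lam_j in lam:
        y_rep = rng.poisson(lam_j, size=y.size)      # replicated data
        exceed += T(y_rep, lam_j) >= T(y, lam_j)     # compare discrepancies
    p_vals[name] = exceed / lam.size
```

Under a correctly specified model, these *P* values tend to cluster near 0.5 rather than being uniformly distributed, which is the conservatism discussed in the results below.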

We used Algorithm 3 (Appendix S1) to conduct PDM tests on each simulated data set and model type. For all models, we assessed fit of the Poisson stage; for the PoisMix and PoisMixSp models, we also applied PDM tests on the Gaussian stage (see, e.g., Fig. ). These tests produce a collection of *P* values for each fitted model; one for each posterior parameter sample (i.e., one for each MCMC iteration). We used the median *P* value from this collection to summarize overall PDM goodness‐of‐fit.

In addition to the median *P* value from applying PDM tests, we also sampled a single PDM *P* value at random from each MCMC run. This *P* value was used as the sampled predictive *P* value for each fitted model.
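The distinction between the median summary and the sampled summary can be sketched in a few lines; the per‐iteration *P* values here are simulated stand‐ins for the output of a PDM test, not real results.

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-in for the collection of per-iteration PDM P values (one per MCMC
# draw); in practice these come from the pivot discrepancy algorithm.
pdm_pvals = rng.uniform(size=5000)

median_p = np.median(pdm_pvals)    # summary used for the overall PDM test
sampled_p = rng.choice(pdm_pvals)  # one random draw: the sampled predictive P value
```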

We used a cross‐validation procedure to estimate an omnibus *P* value for the PoisMix model, but did not attempt to apply it to the Pois0 or PoisMixSp models owing to high computational cost. To improve computational efficiency, we modified Algorithm 4 (Appendix S1) to use *k*‐fold cross‐validation instead of leave‐one‐out cross‐validation. For each simulated data set, we partitioned data into *k* = 40 “folds” of *m* = 5 observations each. We then fit the PoisMix model to each unique combination of 39 of these folds, systematically leaving out a single fold for testing (each observation was left out of the analysis exactly once). We then calculated an empirical CDF value for each omitted observation *i* as

Here, *u*_{i} = Σ_{j} 1(*y*_{i}^{rep,j} ≤ *y*_{i})/*J*, where *y*_{i}^{rep,j} is a posterior prediction of the held‐out observation *i* at MCMC sample *j* (*j* = 1, … , *J*). The binary indicator function 1(·) takes the value one when its argument is true and zero otherwise.

According to PIT theory, the *u*_{i} values should be uniformly distributed on (0, 1) if the model being tested does a reasonable job of predicting the data. For each simulated data set, we used a χ^{2} test (with 10 equally spaced bins) to test for uniformity; the associated *P* value was used as an omnibus cross‐validation *P* value.
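The cross‐validation PIT check can be sketched as follows. Model refitting is skipped here: held‐out predictions are drawn from the true data‐generating distribution, so the *u*_{i} should be approximately uniform. The randomized PIT adjustment for discrete data is an added assumption for this sketch, not a procedure described above.

```python
import numpy as np

rng = np.random.default_rng(3)

n, k, m = 200, 40, 5                  # k folds of m observations each
y = rng.poisson(5.0, size=n)

u = np.empty(n)
folds = np.arange(n).reshape(k, m)
for fold in folds:
    # In a real analysis, refit the model without this fold; here we simply
    # simulate posterior predictions from the data-generating distribution.
    y_rep = rng.poisson(5.0, size=(1000, m))
    # Randomized PIT for discrete data: draw u_i between P(rep < y_i)
    # and P(rep <= y_i).
    below = (y_rep < y[fold]).mean(axis=0)
    at_or_below = (y_rep <= y[fold]).mean(axis=0)
    u[fold] = rng.uniform(below, at_or_below)

# Chi-square test statistic for uniformity over 10 equally spaced bins;
# compare to a chi-square distribution with 9 degrees of freedom.
observed, _ = np.histogram(u, bins=10, range=(0.0, 1.0))
expected = n / 10
chi2 = np.sum((observed - expected) ** 2 / expected)
```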

To test for residual spatial autocorrelation, we calculated a posterior distribution for the Moran's *I* statistic on residuals for each model fitted to simulated data. For each sample *j* from the posterior distribution (e.g., for each MCMC sample), Moran's *I* was calculated using the residuals **y** − *E*(**y**|**θ**^{(j)}).
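A per‐draw Moran's *I* computation might look like the following sketch, with inverse‐distance weights and iid Gaussian stand‐ins for the posterior residual draws; under no spatial autocorrelation, the statistic concentrates near its null expectation of −1/(*n* − 1).

```python
import numpy as np

rng = np.random.default_rng(5)

n = 100
coords = rng.uniform(0.0, 1.0, size=(n, 2))
d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
w = np.where(d > 0, 1.0 / np.maximum(d, 1e-9), 0.0)   # zero on the diagonal

def morans_i(z, w):
    # Moran's I: (n / sum(w)) * (z' W z) / (z' z), with z centered
    z = z - z.mean()
    return (len(z) / w.sum()) * (z @ w @ z) / (z @ z)

# Stand-in posterior draws of residuals y - E(y | theta_j): iid noise here,
# so the posterior distribution of Moran's I should sit near -1/(n-1).
i_draws = np.array([morans_i(rng.normal(size=n), w) for _ in range(200)])
```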

Posterior predictive *P* values were extremely conservative, with *P* values highly clustered near 0.5 under the null case where the data generating model and estimation model were the same (Fig. 7). In contrast, an unbiased test should generate an approximately uniform distribution of *P* values under the null. Tests using the median *P* value associated with PDMs were also conservative, as were mixed predictive checks and those calculated relative to posterior Moran's *I* statistics. At least in this example, the mixed predictive check actually appeared slightly more conservative than posterior predictive checks. Posterior predictive checks that depended on parameters in the discrepancy function (e.g., χ^{2}, deviance‐based discrepancies) appeared to be slightly more conservative than those that depended solely on observed and simulated data properties (e.g., the “tail” discrepancy comparing upper quantiles). In fact, the only *P* values that appeared to have good nominal properties were sampled predictive *P* values and cross‐validation *P* values. We did not explicitly quantify null properties of cross‐validation *P* values, but these should be uniform under the null because the data used to fit and test the model were truly independent in this case.

For the Pois0 model, the mean directed posterior predictive *P* value examining tail probabilities was 0.09 over all simulated data sets; the means of all other *P* values (posterior predictive and otherwise) were <0.01. As such, all model checking procedures had high power to appropriately detect the inadequacy of the basic Poisson model. Examining a representative plot of over‐ and under‐predictions shows the inadequacy of the Pois0 model: overdispersion is clearly present, and residuals are spatially clustered (Fig. ).

For the PoisMix model, only the cross‐validation test, the Moran *I* test, and tests based on PDMs of the Gaussian portion of the model had any power to detect model inadequacy (Fig. ). Of these, the sampled predictive *P* value had higher power than the *P* value based on the median PDM. The remaining model checking approaches (notably including those based on posterior predictive checks) had no power to detect model inadequacy (Fig. ).

N‐mixture models are a class of hierarchical models that use count data collected from repeated visits to multiple sites to estimate abundance in the presence of an unknown detection probability (Royle ). That is, counts *y*_{ij} are collected during sampling visits *j* = 1, … , *J*, at sites *i* = 1, … , *n*, and are assumed to be independent binomial random variables, conditional on constant abundance *N*_{i} and detection probability *p*; *y*_{ij} ~ Binomial(*N*_{i}, *p*). Additionally, *N*_{i} is assumed to be an independent random variable with a specified probability mass function (e.g., Poisson with expectation λ_{i}). The closure assumption that *N*_{ij} = *N*_{i} ∀ *j* is critical for accurate estimates of *N*_{i} and *p* (Barker et al. ). In practice, this assumption implies that a population at site *i* is closed with respect to births, deaths, immigration, and emigration, for all replicate temporal surveys at the site. Violation of this assumption can lead to non‐identifiability of the *N* and *p* parameters, or worse, posterior distributions that converge, but result in *N*_{i} being biased high and *p* being biased low (Kéry and Royle ; Appendix S3).

The appropriateness of the closure assumption has often been determined by judgment of the investigators, who assess whether time between replicate surveys is short relative to the dynamics of the system, and whether individual movement is small, compared to the size of sample plots (e.g., Efford and Dawson , but see Dail and Madsen for a frequentist test of this assumption using a model selection approach). As an alternative, we consider the utility of BMC to assess the closure assumption for N‐mixture models. We first consider a brief simulated example where truth is known. We then examine real data consisting of counts of sea otters from aerial photographs taken in Glacier Bay National Park, southeastern Alaska. For additional model checking examples and other violations of assumptions of the N‐mixture model, including zero‐inflation, extra‐Poisson dispersion, extra‐binomial dispersion, unmodeled site covariates, and unmodeled detection covariates, see Kéry and Royle (: section 6.8).

We examined the most common form of N‐mixture model for ecological data, in which the detection probability *p*_{i} and the expected abundance λ_{i} depend on covariates **w**_{i} and **x**_{i}, respectively. We used Eq. to simulate data, with one additional step to induce violation of the closure assumption. We examined a series of eight cases where the closure assumption was increasingly violated, by allowing *N*_{i,j} to differ from *N*_{i,j−1} by up to a proportion *c* for *j* = 2, … , *J*, with *c* = {0, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35}; thus, *c* can be interpreted as the maximum proportion of the population that could move in or out of a site between *j* − 1 and *j*. When *c* equals zero, *N*_{i,j} = *N*_{i,j−1}, and thus *N*_{i,j} = *N*_{i}, and the closure assumption is met. For all values of *c*, we set **β** = (4, 1)′ and **α** = (1, −1)′, for *i* = 1, … , *n* = 300 sites and *j* = 1, … , *J* = 5 visits. The covariate matrices **X** and **W** each had dimensions 300 × 2, where the first column was all ones, and the second column was generated by sampling from a Bernoulli distribution with probability 0.5 for all *i*. We then fit Eq. to the generated data using an MCMC algorithm written in R. Using the fitted model, we assessed the effectiveness of posterior predictive and sampled predictive *P* values for diagnosing violation of the closure assumption. When *c* = 0, the model used to generate the data was the same as the model used to fit the data, and our model checking procedures should indicate no lack of model fit. In all other cases, the closure assumption was violated, with the degree of violation proportional to the value of *c*. Annotated R code, results, and figures from the simulation are provided in Appendix S3.
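The closure‐violation data generation can be sketched as follows. The coefficient values and covariate scheme mirror the text, but the exact perturbation rule (a uniform integer displacement of at most *cN*) is an assumption of this sketch, as the paper's own R code (Appendix S3) is the authoritative version.

```python
import numpy as np

rng = np.random.default_rng(11)

# Counts from repeat visits with abundance allowed to drift by at most a
# proportion c between visits (closure violated when c > 0).
n, J = 300, 5
beta = np.array([4.0, 1.0])                      # log-linear abundance coefs
alpha = np.array([1.0, -1.0])                    # logit-linear detection coefs
x = rng.binomial(1, 0.5, size=n)                 # Bernoulli(0.5) covariates
w = rng.binomial(1, 0.5, size=n)

lam = np.exp(beta[0] + beta[1] * x)              # expected abundance
p = 1.0 / (1.0 + np.exp(-(alpha[0] + alpha[1] * w)))   # detection probability

def simulate_counts(c):
    N = np.empty((n, J), dtype=int)
    N[:, 0] = rng.poisson(lam)
    for j in range(1, J):
        # Up to a proportion c of the population moves in or out between visits
        # (assumed rule: uniform integer displacement within +/- c*N).
        bound = (c * N[:, j - 1]).astype(int)
        N[:, j] = np.maximum(N[:, j - 1] + rng.integers(-bound, bound + 1), 0)
    return rng.binomial(N, p[:, None])           # binomial thinning by detection

y_closed = simulate_counts(0.0)                  # closure met
y_open = simulate_counts(0.35)                   # strong closure violation
```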

When the closure assumption was met (*c* = 0), the estimated posterior distributions recovered true parameter values well, as expected (Table , Appendix S3). The posterior predictive *P* value was 0.48, and the sampled predictive *P* value was 0.27, suggesting no lack of model fit from either model checking procedure (Table ).

The notation *c* represents the maximum proportion of the population that could move in or out of a site between *j* − 1 and *j*, *P* value is the posterior predictive *P* value using a χ^{2} goodness‐of‐fit statistic, sppv is the sampled predictive *P* value using the sum of variance test statistic, Abundance is the mean of the marginal posterior distribution for total abundance at the 300 sites, the 95% CRI are the 95% credible intervals, GR is the multivariate Gelman‐Rubin convergence diagnostic, and ESS is the effective sample size of 1,000,000 Markov chain Monte Carlo (MCMC) iterations.

When the closure assumption was violated (i.e., *c* > 0), MCMC chains appeared to converge (Appendix S3), and convergence was often supported by Gelman‐Rubin diagnostics (Table ). However, abundance was always overestimated when the closure assumption was violated, and the true abundance value used to simulate the data was always outside the estimated 95% credible intervals (Table ). The posterior predictive *P* values did not suggest lack of model fit when *c* < 0.10, and suggested lack of model fit otherwise (Table ). The sampled predictive *P* value correctly identified violation of the closure assumption (assuming a type I error rate of 0.05) for all values of *c* > 0 in this simulation (Table ). The effective sample sizes of the MCMC chains were small owing to posterior correlation between abundance and detection probability in the N‐mixture model (Table ). Mean abundance estimates erroneously increased with increasing violation of the closure assumption, and 95% credible intervals failed to cover the true abundance value when as little as 5% of the population was allowed to move in or out of a site between surveys.

We note that assessing the closure assumption of N‐mixture models using posterior predictive *P* values and sampled predictive *P* values may be challenging in some areas of the parameter space, because the biased parameter estimates obtained from fitting data from an open population can produce replicated data that closely resemble the observed data; moreover, parameter combinations for which *N*_{i} (or λ_{i}) are not identifiable can also lead to data that are indistinguishable from data generated under an N‐mixture model (Barker et al. ). Thus, model checking is an important step in evaluating a model, but is not a replacement for proper study design.

Williams et al. () describe a framework for using aerial photograph data to fit N‐mixture models, where photographs are taken such that a subset of images overlap in space. The subset of overlapping images provides temporal replication of counts of individuals at spatial locations that can be used to estimate *p* in the N‐mixture modeling framework. To assess the utility of their approach, Williams et al. () conducted an aerial survey in Glacier Bay National Park, southeastern Alaska. During the survey, they identified groups of sea otters at the surface of the ocean and then flew over the groups multiple times, capturing an image of each group during each pass. In their study, a primary observer operated the camera, and a secondary observer watched the groups of sea otters to assess whether the closure assumption of N‐mixture models was met, that is, whether sea otters dispersed out of, or into, the footprint of the photograph among temporal replicates. According to observer notes, 20 of the 21 groups of sea otters that were photographed multiple times did not appear to violate the closure assumption. For analysis, Williams et al. () omitted the one site that appeared to violate the closure assumption. Here, we use Bayesian model checking as a formal method for assessing the closure assumption for two data sets that are used to fit the N‐mixture model. The first data set is the complete set of 21 observations initially collected for Williams et al. (). The second data set is the data provided in Table of Williams et al. (), which omits the problematic site. The full data set is provided in the R package HierarchicalGOF (Conn et al. ). As in our N‐mixture model simulation study above, we used Bayesian *P* values and sampled predictive *P* values to check our model.
We used each data set to fit the model in Eq. . The Bayesian *P* value for the full data set (21 sites) was 0.048 and the sampled predictive *P* value was 0.059, suggesting potential lack of model fit. The Bayesian *P* value for the restricted data set used in Williams et al. () was 0.563 and the sampled predictive *P* value was 0.823, suggesting no lack of model fit. Thus, model checking procedures can provide a formal method for examining the closure assumption of N‐mixture models for our example, and corroborate the auxiliary information collected by the observers. We note that successful identification of violation of the closure assumption in the sea otter case should not be taken as evidence that such violations will be readily detected in other cases.

In this example, we present another assessment of goodness‐of‐fit for a model that is quickly becoming popular within the ecological community, the hidden Markov model (HMM; Zucchini and MacDonald ). HMMs are a general class of models for time series data that describe the dynamics of a process in terms of potentially unobservable (latent) states that generate observable data according to state‐dependent distributions. Using HMMs, ecologists can construct models that make inference about biologically relevant “states” (e.g., infection status, foraging/not foraging) even when data consist solely of cues (e.g., field observations, locations of satellite tags).

One implicit (and seldom tested) assumption of HMMs is that the amount of time spent within a state (the *residence time*) is geometrically distributed. The geometric distribution implies a strictly decreasing distribution of residence times, which may not be realistic for certain ecological time series. For instance, if a hidden state corresponds to “foraging,” one might expect a dome‐shaped distribution of residence times.

In this section, we use BMC to assess the assumption of geometrically distributed residence times in HMMs applied to California sea lion (CSL) rookery attendance patterns. We do this by comparing the fit of a Bayesian HMM with that of an alternative Bayesian hidden *semi*‐Markov model (HSMM) that allows more flexible residence time distributions.

The HMM is formed by considering a time series of categorical variables, *Z*_{1}, … , *Z*_{T}, that represent the hidden states. For each *t*, *Z*_{t} ∈ {1, … , *S*}, where *S* is the number of latent states. The *Z*_{t} process follows a Markov chain with transition matrix **Γ**_{t} in which the (*j*, *k*) entry is Γ_{tjk} = [*Z*_{t} = *k* | *Z*_{t−1} = *j*]. The state process is hidden (at least partially), so the researcher is only able to make observation *y*_{t} with distribution [*y*_{t}|*Z*_{t}], and observations are independent given the hidden states. For *n* independent individual replications, the complete likelihood is the product over individuals of [*Z*_{1}]∏_{t}[*Z*_{t}|*Z*_{t−1}]∏_{t}[*y*_{t}|*Z*_{t}, **ψ**_{t}], where **ψ**_{t} is a parameter vector for the observation process. For Bayesian inference within an MCMC algorithm, we make use of the forward algorithm (see Zucchini and MacDonald ) to integrate over the missing state process and evaluate the integrated likelihood [**y**|**θ**].
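The forward algorithm referenced here can be sketched in a few lines for a discrete observation model; the scaling step is a standard device to avoid numerical underflow, and the two‐state example values below are illustrative, not the CSL parameters.

```python
import numpy as np

def hmm_loglik(y, delta, Gamma, obs_probs):
    """Integrated HMM log-likelihood via the (scaled) forward algorithm.
    y: observed categories; delta: initial state distribution (S,);
    Gamma: transition matrix (S, S); obs_probs[s, v] = [y_t = v | Z_t = s]."""
    alpha = delta * obs_probs[:, y[0]]
    c = alpha.sum()
    loglik = np.log(c)
    alpha = alpha / c                        # rescale to avoid underflow
    for t in range(1, len(y)):
        alpha = (alpha @ Gamma) * obs_probs[:, y[t]]
        c = alpha.sum()
        loglik += np.log(c)
        alpha = alpha / c
    return loglik

# Toy two-state example with Bernoulli detections (illustrative values).
delta = np.array([1.0, 0.0])                 # start in state 1
Gamma = np.array([[0.8, 0.2],
                  [0.0, 1.0]])               # state 2 is absorbing here
psi = np.array([0.2, 0.7])                   # detection probability by state
obs_probs = np.column_stack([1.0 - psi, psi])
y = np.array([0, 1, 1, 0, 1])
loglik = hmm_loglik(y, delta, Gamma, obs_probs)
```

The recursion sums over all hidden state paths in O(*TS*²) operations, so the likelihood can be evaluated without sampling the latent states themselves.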

The CSL data are composed of a time series, or capture history, of 66 females on San Miguel Island, California, over the course of 2 months (61 d) during the pupping season. It was noted whether or not a previously marked CSL female was seen on a particular day (i.e., *y*_{it} = 1 or 0, respectively, for *i* = 1, … , 66 and *t* = 1, … , 61). The probability of observing a particular CSL female on a given day depends on her unobserved reproductive state: (1) pre‐birth, (2) neonatal, (3) at‐sea foraging, and (4) on‐land nursing. The detection probability for CSL females in the pre‐birth state is likely to be low because, without a pup, they are not attached to the rookery and can come and go as they please. In the neonatal state, the female remains on shore for approximately 5–7 d to nurse the newborn pup. After this period, the female begins foraging trips in which she feeds at sea for several days and then returns to nurse the pup. While a CSL female is at sea, her detection probability is 0. Females that have just given birth, or that are returning from a foraging trip, will be tending to their pups and are more available to be detected.

To make inference on the attendance patterns of the CSL we used an HMM with a state transition matrix **Γ** in which the diagonal entries γ_{k} give the probability of remaining in state *k*, and the off‐diagonal entries allow movement only to the next state in the reproductive schedule.

This allows the process to pass from each state to the next in the reproductive schedule, with alternating (3) at‐sea and (4) on‐land states. Conditioning on the reproductive state, the observation model is a Bernoulli detection process with state‐specific probabilities ψ(1) = ψ_{1}, ψ(3) = 0, and ψ(2) = ψ(4) = ψ_{2}. The parameters ψ_{1} and ψ_{2} represent pre‐birth and after‐birth detection probability, respectively.

To assess model fit, we used the Freeman‐Tukey fit statistic *T* = Σ_{t}(√*d*_{t} − √*E*[*d*_{t}])², where *d*_{t} is the number of observed detections on occasion *t* and *E*[*d*_{t}] is the expected number of detections given by the HMM. The Freeman‐Tukey statistic is less sensitive to small expected values than other discrepancy functions (e.g., χ^{2}), which is important in this example because the expected number of detections is small in early summer. For day *t*, the expected number of detections is obtained by propagating the initial state distribution **δ** = (1, 0, 0, 0)′ (as all animals start in the pre‐birth state) through the transition matrix and applying the detection vector **ψ** = (ψ_{1}, ψ_{2}, 0, ψ_{2})′.
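The expected daily detection total and the Freeman‐Tukey statistic can be sketched as follows. The transition matrix, detection values, and simulated daily totals below are illustrative stand‐ins (a time‐constant **Γ** with the 1→2→3↔4 schedule), not the fitted CSL parameters.

```python
import numpy as np

n_animals, T_days = 66, 61
delta = np.array([1.0, 0.0, 0.0, 0.0])          # all start pre-birth
psi = np.array([0.2, 0.8, 0.0, 0.8])            # detection by state (illustrative)
Gamma = np.array([[0.9, 0.1, 0.0, 0.0],         # assumed reproductive schedule:
                  [0.0, 0.8, 0.2, 0.0],         # 1 -> 2 -> 3 <-> 4
                  [0.0, 0.0, 0.7, 0.3],
                  [0.0, 0.0, 0.4, 0.6]])

# Expected detections each day: 66 * delta' Gamma^(t-1) psi
state_dist = delta.copy()
expected = np.empty(T_days)
for t in range(T_days):
    expected[t] = n_animals * state_dist @ psi
    state_dist = state_dist @ Gamma             # propagate one day forward

rng = np.random.default_rng(9)
d_obs = rng.poisson(expected)                   # stand-in observed daily totals

# Freeman-Tukey discrepancy between observed and expected daily totals
ft = np.sum((np.sqrt(d_obs) - np.sqrt(expected)) ** 2)
```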

Two versions of the HMM were fit to the data: one in which ψ_{1} and ψ_{2} were constant through time, and one in which they were allowed to vary with each occasion (shared additive time effect). For the variable‐time ψ models, detection was parameterized as logit(ψ_{lt}) = logit(ψ_{l}) + ε_{t} for *l* = 1, 2, *t* = 1, … , 61, with ε_{1} = 0 for identifiability. Among the prior distributions used in this analysis, a Laplace prior for the ε_{t} was used to shrink unnecessary deviations toward zero.

A collapsed MCMC sampler, which used the forward algorithm to calculate the integrated likelihood [**y**|**θ**], was used to draw samples from the posterior distribution.

The Markov assumption of the latent state process implies that, after landing in state *k*, the amount of time spent there is geometrically distributed with parameter 1 − γ_{k}. Further, this implies that the most common (i.e., modal) amount of time spent is one time step. As γ_{k} approaches 1, this distribution flattens out, but retains a mode of 1. An alternative model that relaxes this assumption is the HSMM. In the HSMM, the residence time is explicitly modeled and, at the end of the residence period, a transition is made to another state according to a conditional state transition matrix.
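The contrast between the two residence‐time distributions can be seen directly from their probability mass functions; the γ and λ values below are illustrative, not estimates from the CSL analysis.

```python
import math
import numpy as np

# Geometric residence times (HMM): mode is always 1, regardless of gamma.
def geometric_pmf(r, gamma):
    return (1.0 - gamma) * gamma ** (r - 1)

# Shifted Poisson residence times (HSMM): residence time minus 1 is
# Poisson(lam), so the mode can sit at a biologically plausible value.
def shifted_poisson_pmf(r, lam):
    k = int(r) - 1
    return math.exp(-lam) * lam ** k / math.factorial(k)

r = np.arange(1, 16)
geom = geometric_pmf(r, 0.8)                               # gamma_k = 0.8
spois = np.array([shifted_poisson_pmf(ri, 5.0) for ri in r])

geom_mode = int(r[np.argmax(geom)])                        # always 1
spois_mode = int(r[np.argmax(spois)])                      # near 1 + lam
```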

In terms of the CSL analysis, the off‐diagonal elements of the HSMM transition matrix occur at the same locations as in the HMM but are all equal to 1 because after the residence time has expired, the animal immediately moves to the next stage in the reproductive schedule (alternating between at sea and on‐land at the end). The residence time was modeled using a shifted Poisson (λ_{k}) distribution; that is, residence time minus 1 is Poisson distributed. We set prior distributions for residence time parameters as [log λ_{k}] ∝ 1. Prior distributions for the detection parameters remained the same as before. Using the “HSMM as HMM” technique of Langrock and Zucchini (), we sampled the posterior distributions using the same MCMC algorithm as in the HMM case.

The *P* value for the Freeman‐Tukey fit statistic under the constant‐time HSMM was 0.09; this was an improvement over the HMM models, but still low enough to cause concern. However, for the time‐varying ψ HSMM, the *P* value was 0.82, indicating a substantial improvement in fit. By reducing the probability that an animal would transition from the pre‐birth to the birth state immediately after the start of the study, the HSMM was able to accommodate an average residence time similar to that of the HMM without maintaining a mode of 1 (Fig. ), producing a more biologically realistic model.

Ecologists increasingly use hierarchical Bayesian models to analyze their data. Such models are powerful, allowing researchers to represent complex, and often dynamic, ecological processes. Under the Bayesian calculus, ecologists can partition observation error from process error, produce detailed predictions, and properly carry through uncertainty when making inferences. The ability to build complex models is exciting, but does not absolve us of the need to check whether models fit our data. If anything, complicated models should be subject to *more* scrutiny than simple models, as there are more places where things can go wrong.

One way to ensure a model fits the data is simply to build a sufficiently flexible model. To take an extreme example, a saturated model (one with a separate parameter for each datum) fits the data perfectly. No one would actually do this in practice; science proceeds by establishing generalities, and there is no generality implicit in such a model. Further, such a model provides no basis for predicting future outcomes. Indeed, models with high complexity can fit the data well, but may have poorer predictive ability and inferential value than a model of lower complexity (Burnham and Anderson , Hooten and Hobbs ).

When unsure of the desirable level of complexity or number of predictive covariates to include in a model, one approach is to fit a number of different models and to average among the models according to some criterion (e.g., Green , Hoeting et al. , Link and Barker ). Still, unless one conducts model checking exercises, there is no assurance that *any* of the models fit the data. Further, there are costs to model averaging, especially in Bayesian applications where considerable effort is needed to implement an appropriate algorithm. In such cases, it may make more sense to iterate on a single model (Ver Hoef and Boveng ), and thus, model checking becomes even more important.

We have described a wide variety of Bayesian model checking procedures with the aim of providing ecologists an overview of possible approaches, including their strengths and limitations. Our intention is not to be prescriptive, but to guide ecologists toward making an appropriate choice. For instance, using simulation, we showed that the popular posterior predictive *P* value (and several other metrics) can have a smaller than nominal α level, so that our ability to “reject” the null hypothesis that data arose from the model is lower than the nominal level suggests. In the spatial regression example, the Bayesian *P* value often failed to reject models without spatial structure even when data were simulated with considerable spatial autocorrelation. This conservatism is due to the double use of data, which are used both to fit the model and to calculate a tail probability. However, as shown in the sea otter and California sea lion examples, the posterior predictive *P* value can be useful in diagnosing obvious cases of lack of fit and in producing more biologically realistic models. Other choices, such as those based on cross‐validation, have better statistical properties and would be preferable on theoretical grounds, but may be more difficult to implement. Regardless of the approach(es) chosen, ecologists can start incorporating BMC as a standard part of their analysis workflow (e.g., Fig. ). As in the case of “*P* hacking” (Head et al. ), care should be taken to choose appropriate goodness‐of‐fit measures without first peeking at results (i.e., one should not employ multiple discrepancy measures and then report only those that indicate adequate fit).

In ecology, simplistic processes are rare: we often expect heterogeneity among individuals, patchy responses, and variation that is partially unexplained by gathered covariates. Therein lies an apparent contradiction: we expect lack of fit in our models, but still want to minimize biases attributable to poor modeling assumptions. From our perspective, the goal of model checking should not be to develop a model that fits the data perfectly, but rather to probe models for assumption violations that result in systematic errors. For instance, employing an underdispersed model (e.g., the Poisson instead of the negative binomial or the normal instead of the *t* distribution) will often lead to estimates that are too precise and to predictions that are less extreme than real world observations. Basing inference on such a model could have real world implications if used to inform environmental policy, as it would tend to make decision makers overly confident in their projections. In the case of basic science, such over‐confidence can have real ramifications for confirmation or refutation of existing theory. It is therefore vital that we do a better job of conducting and reporting the results of model checks when publishing ecological research.

So far, we have viewed Bayesian model checking primarily as a confirmatory procedure used to validate (or more precisely, invalidate) modeling assumptions. However, model checking can be an important tool for scientific discovery. If a model fails to fit the data, the inquisitive ecologist will want to ask “why?” In some cases, deviations in model predictions from observations may actually suggest alternative scientific hypotheses worthy of additional investigation. We thus urge ecologists to view model checking not as a box to check off on their way to publication, but as an important tool to learn from their data and hasten the process of scientific discovery.

We thank B. Brost, A. Ellison, T. Ergon, J. Ver Hoef, and an anonymous reviewer for comments on previous versions of our manuscript, and J. Laake for initial ideas on modeling detectability when estimating CSL attendance patterns. The findings and conclusions in the paper of the NOAA authors do not necessarily represent the views of the reviewers or the National Marine Fisheries Service, NOAA. Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.

Data associated with this study are available from Zenodo: