

**Handling Editor** Miguel Acevedo

Large‐scale, long‐term biodiversity monitoring is essential for conservation, land management and identifying threats to biodiversity. Such comprehensive datasets increasingly include multispecies surveys that capture information‐rich co‐occurrence data, enabling community‐level analyses (Iknayan et al., 2014; Ovaskainen et al., 2017). However, multispecies surveys are prone to various types of errors, including false absences, where a species is present but not detected (Dorazio & Royle, 2005), and misidentification, where a species is encountered but its identity is not correctly recorded (Miller et al., 2011).

Certain classes of occupancy models account for observation error in biodiversity surveys used to understand species distributions, track population changes and describe mechanisms underlying population and community dynamics (MacKenzie et al., 2002). Latent presence/absence states are modelled explicitly, with an observation model that accounts for the details of the detection process, including the potential for false negatives (non‐detections at occupied sites) and false positives (detections at unoccupied sites; Chambert et al., 2015; Miller et al., 2012; Royle & Link, 2006; Wright et al., 2020). Disregarding false positives in biodiversity monitoring data can bias estimates of ecologically important quantities such as demographic rates, occurrence and species richness (Chambert et al., 2015, 2018; McClintock et al., 2010).

Multi‐species surveys are also subject to errors in species identification by imperfect classifiers. Imperfect classifiers include citizen scientists (e.g. the North American Breeding Bird Survey; Sauer et al., 2017), technicians trained in local taxonomy (e.g. invertebrate trapping by NEON; Hoekman et al., 2017), automated methods (e.g. bat acoustic recording software; Wright et al., 2020) or convolutional neural networks used with camera trap data (Tabak et al., 2019). Previous methods assume an imperfect classifier produces species‐level classifications, but in practice, particularly with human observers, we may end up with extraspecific classifications, including ‘unknown’, morphospecies designations (i.e. individuals that cannot be taxonomically identified so are grouped by morphology) and taxonomic identifications coarser than species.

If species observations are prone to misclassification, then samples with verified species identities might be used to estimate misclassification probabilities. We refer to this situation as ‘semi‐supervised’: true species identities are known for some but not all individuals. However, leveraging these partially observed, individual‐level validation data for the rest of the dataset presents a methodological challenge. Previous multi‐species occupancy models that accommodate misclassification have used site‐level validation data, where the occupancy state of a species is known only at a site and not at the individual sample level (Chambert et al., 2018), or multinomial models with site‐level covariates that aggregate individual samples (Wright et al., 2020). An individual, sample‐level approach can help resolve non‐species (e.g. morphospecies) identities to the true species identity.

Misclassified species identities can be dealt with using one of two contrasting approaches. The first is a simple two‐step approach in which (a) a classifier is used to assign species identities to each individual (creating one complete synthetic dataset from classifier output, for which species identities are treated as known or are verified using an unambiguous classification method), and (b) the synthetic dataset is then analysed using a downstream model (e.g. an occupancy model). This two‐stage approach uses no information about occupancy or encounter rates in the first stage. An alternative approach is to model the classification process and the ecological process simultaneously in a single joint model. A joint model directly uses classifier output as data, relating the observation process to underlying, imperfectly observed, ecological states in one step. Such an approach can simultaneously account for uncertainty in species identities and use information about occupancy and encounter rates to inform species identity estimates (Wright et al., 2020). However, there remains the practical question of how much value is added by a joint model versus a two‐stage approach. A priori, we expect that a joint model should produce better estimates of true species identities by directly modelling the link between ecological states and the observation process, but this has not yet been tested.

Here we present an individual‐level, semi‐supervised, dynamic occupancy model that accounts for species non‐detection and misclassification. Our Bayesian approach extends the classification‐occupancy model of Wright et al. (2020) to (a) accommodate extinction and colonization dynamics, (b) allow for additional uncertain morphospecies designations in the imperfect species classifications and (c) make use of labelled samples with known species identities in a semi‐supervised setting. Furthermore, we compare the classification performance (i.e. accuracy and precision of posterior draws) of a joint classification‐occupancy model to a reduced classification‐only model that discards information about occupancy and encounter rate on a withheld test set. We demonstrate our model using simulations and with an empirical case study of the carabid beetle (Carabidae) community at the National Ecological Observatory Network (NEON) Niwot Ridge Mountain Research Station (NIWO), west of Boulder, CO, USA.

Consider data collected at sites $i=1,\dots ,N$, according to a robust design (Hoekman et al., 2017) where each site is visited $j=1,\dots ,J$ times within primary periods $t=1,\dots ,T$, where the occupancy states are assumed to be static within primary periods. We are interested in occupancy states (true presence or absence) and encounter rates (observed frequency) for species $k=1,\dots ,K.$

We assume that the objective of classification is to use the resulting data in an ecological model describing species occurrence over space and time. Let ${z}_{i,k,t}$ be the binary occurrence state for species $k=1,\dots ,K$ at site $i$ and during time $t$. Sites are either occupied (${z}_{i,k,t}=1$) or not $\left({z}_{i,k,t}=0\right)$. We assume that the occupancy states arise as Bernoulli random variables:
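The display equation was not preserved in this version; consistent with the definitions above, it would read:

```latex
z_{i,k,t} \sim \text{Bernoulli}\left(\psi_{i,k,t}\right)
```

where $\psi_{i,k,t}$ denotes the occupancy probability for species $k$ at site $i$ in period $t$.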

On any particular sampling occasion $j$ at site $i$ in period $t$, we encounter ${L}_{i,j,k,t}$ individuals with encounter rate ${\lambda}_{i,j,k,t}$. We assume that the number of encounters is a Poisson random variable: ${L}_{i,j,k,t}\sim \text{Poisson}\left({z}_{i,k,t}{\lambda}_{i,j,k,t}\right)$. Thus ${L}_{i,j,k,t}=0$ indicates non‐detection, arising either because the species does not occupy the site (i.e. ${z}_{i,k,t}=0$) or because the species occupies the site but no individuals were encountered (an event with probability ${e}^{-{\lambda}_{i,j,k,t}}$ under the Poisson model). In a setting with misclassification, the number of encountered individuals ${L}_{i,j,k,t}$ is not observed directly because of uncertainty in the true species identities of encountered individuals. We do, however, observe the total number of individuals across all species encountered on any particular occasion: ${L}_{i,j,.,t}={\sum}_{k=1}^{K}{L}_{i,j,k,t}.$ The properties of sums of Poisson random variables allow us to model these observed totals as:
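The display equation is missing here; by Poisson superposition it would take the form:

```latex
L_{i,j,\cdot,t} \sim \text{Poisson}\left(\sum_{k=1}^{K} z_{i,k,t}\,\lambda_{i,j,k,t}\right)
```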

In addition to observing the total number of encountered individuals on an occasion, ${L}_{i,j,.,t}$, we assume that we also obtain imperfect species classifications for each encountered individual. In cases where individuals have been encountered (${L}_{i,j,.,t}>0$), we obtain imperfect classifications of individuals $l=1,\dots ,{L}_{i,j,.,t}$ and model these as arising from a categorical distribution with a species‐specific probability vector:
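The display equation is not preserved in this version; writing $y_{i,j,l,t}$ for the imperfect classification of individual $l$ (our notation, introduced here for illustration), it would read:

```latex
y_{i,j,l,t} \sim \text{Categorical}\left(\boldsymbol{\theta}_{k\left[i,j,l,t\right]}\right)
```

where $\boldsymbol{\theta}_{k}$ is the classification probability vector for true species $k$ and $k\left[i,j,l,t\right]$ indexes the true species of individual $l$.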

True species identities are modelled as:
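The display equation is missing here; given the Poisson encounter model above, the true identities conditional on the observed total would follow a categorical distribution with probabilities proportional to $z\lambda$ (a reconstruction consistent with the surrounding definitions, not necessarily the authors' exact notation):

```latex
k\left[i,j,l,t\right] \sim \text{Categorical}\left(\boldsymbol{\pi}_{i,j,t}\right),
\qquad
\pi_{i,j,k,t} = \frac{z_{i,k,t}\,\lambda_{i,j,k,t}}{\sum_{k'=1}^{K} z_{i,k',t}\,\lambda_{i,j,k',t}}
```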

If ground‐truthed species identity data are available for some individuals, then $k\left[i,j,l,t\right]$ is partly observed and this model can be used in a semi‐supervised setting. In an unsupervised setting, this individual‐level formulation is a disaggregated version of the single‐season multinomial model of Wright et al. (2020; Appendix S1). An aggregated version would pool the counts so that no identifying information is attributed to any single sample. The disaggregated, or sample‐level, approach facilitates the treatment of non‐species (e.g. morphospecies) identities and allows covariates to be included at the individual/observation level, which could improve the estimation of classification probabilities and possibly ecological parameters. For example, sample confidence (i.e. the observer's confidence in the species classification of an individual) is a sample‐level covariate (like the observation‐level covariate, sample quality, in Augustine et al. (2020)) that might be correlated with how samples are prioritized for verification. Samples in the NEON dataset are prioritized for verification in this way, but sample confidence is not recorded.

In some settings, the imperfect classifier might assign more classes than there are unique species so that the vector ${\mathit{\theta}}_{k}$ has more than $K$ elements. For example, if an imperfect classifier is unable to identify a set of species, they may classify those individuals as ‘unknown’ or as a unique morphospecies associated with a given sampling occasion. Thus, it is possible for individuals to be classified into $\tilde{K}\ge K$ classes, where $\tilde{K}$ is the sum of the number of species and the total number of morphospecies designations. In such cases, the matrix $\mathbf{\Theta}=\left({\mathit{\theta}}_{1}^{\prime},\dots ,{\mathit{\theta}}_{K}^{\prime}\right)$ can be rectangular, with the first $K$ columns corresponding to the classification probabilities for species $1,\dots ,K$, and the remaining columns corresponding to classification probabilities for non‐species (e.g. morphospecies) classes:
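The display of the rectangular matrix was not preserved; it would take the form (our layout, consistent with the text):

```latex
\boldsymbol{\Theta} =
\begin{pmatrix}
\theta_{1,1} & \cdots & \theta_{1,K} & \theta_{1,K+1} & \cdots & \theta_{1,\tilde{K}} \\
\vdots       & \ddots & \vdots       & \vdots         &        & \vdots \\
\theta_{K,1} & \cdots & \theta_{K,K} & \theta_{K,K+1} & \cdots & \theta_{K,\tilde{K}}
\end{pmatrix}
```

with each row summing to one: the first $K$ columns are the probabilities of classifying true species $k$ as each species, and columns $K+1,\dots,\tilde{K}$ are the probabilities of assigning a non‐species (e.g. morphospecies) class.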

We fit our model to the carabid pitfall trap sampling data collected by NEON at NIWO during 2015–2019 (National Ecological Observatory Network, 2021). Carabids are a ubiquitous and speciose family of ground‐dwelling invertebrates that are commonly collected by passive sampling methods, like pitfall traps, as described in Hoekman et al. (2017). A well‐studied sentinel group, carabids make an excellent study system for assessing community occupancy rates and classification accuracy. Collecting and identifying carabids is resource‐intensive, but NEON lowers this barrier to entry by providing a public carabid dataset with three levels of classification (parataxonomist, expert taxonomist, then DNA barcoding). Although NEON processes carabid samples at the domain level (sampling locations within the same ecoregion; Hoekman et al., 2017), we focus our analysis on one NEON sampling location, NIWO, to assess occupancy across co‐occurring species. We use the 2015–2019 dataset since carabid sampling started in 2015 at NIWO and expert classification data were not yet fully available for 2020 at the time of analysis due to data latency (National Ecological Observatory Network, 2021). NIWO is a site in the southern Rocky Mountains, spanning subalpine conifer forest and alpine tundra.

We outline the relevant data collection protocol here, but Hoekman et al. (2017) offer more details regarding NEON's carabid pitfall trap data product. The sampling design at every NEON sampling location consists of 10 permanent sites with four pitfall traps per site. Traps are sampled and reset biweekly during the growing season, with a range of 5–7 collections per year at NIWO. In 2018, one site was permanently relocated to ensure sampling was allocated proportionally to the NLCD cover types represented (NEON help desk, personal communication). Variables in our model are defined at the site level.

All carabid samples are classified by a parataxonomist, and a subset are sent to an expert taxonomist for verification (Figure 1; Hoekman et al., 2017). Species classification by parataxonomists is considered imperfect. Identification by an expert taxonomist is treated as confirmation data but is limited due to budget constraints. We confirmed the accuracy of the expert taxonomist classifications by verifying that, for the samples we used, all individuals sent for DNA barcoding by NEON matched the expert taxonomist's identification. In the few cases where the expert taxonomist could not identify a specimen to species level, we use their genus‐level classification for the validation dataset.

Our dataset contains 5,865 individual specimens, 1,910 of which were identified by an expert taxonomist, and 62 classes assigned by the parataxonomist, 23 of which are morphospecies. Morphospecies identifications are unique to each NEON sampling location and year. We fit our model using all individuals and used no environmental covariates. Having both parataxonomist and expert taxonomist classifications complicates the use of NEON's carabid pitfall trap data (Figure 1). Only one study to date has been published using the NEON carabid pitfall trap data (Egli et al., 2020), and it analyses only the subset of individuals that have expert taxonomist classifications. The classification‐only model to which we compare our joint model can be fit only to this thinned dataset of verified samples, resulting in a loss of information.

We used informative priors for the species classification probability vectors ${\mathit{\theta}}_{1},\dots ,{\mathit{\theta}}_{K}$ that placed higher probability density on the correct species classification. In the case of the NEON beetle data, this is reasonable given the training that parataxonomists receive in beetle identification. Because the elements of each ${\mathit{\theta}}_{k}$ vector must sum to one, and each element is bounded between 0 and 1, we used a Dirichlet prior: ${\mathit{\theta}}_{k}\sim \text{Dirichlet}\left({\mathit{\alpha}}_{k}\right)$. We chose the Dirichlet concentration values ${\mathit{\alpha}}_{k}$ by comparing draws from the Dirichlet prior distribution to our prior intuition about imperfect classifier accuracy, assuming a 65% chance that a species is correctly classified and some small probability that it is assigned to each other specific class. Additionally, with smaller differences between the values of the Dirichlet prior, the model appeared to be non‐identifiable. This prior is particularly informative for components of $\mathbf{\Theta}$ that have few observations (e.g. fewer than 80 observations on the diagonal, or fewer than 2 off‐diagonal; see Table S2.1), such that the prior may be more informative than the data.
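This style of prior elicitation can be sketched in a few lines of pure Python (the authors used JAGS/R; the helper names and the total concentration of 20 are illustrative assumptions, chosen only so that the prior mean correct‐classification probability is 0.65):

```python
import random

def dirichlet(alpha, rng):
    # One draw from Dirichlet(alpha) via normalized Gamma draws
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def concentration(k_true, n_classes, diag_mass=0.65, total=20.0):
    """Hypothetical concentration vector for species k_true: the prior
    mean puts diag_mass on the correct class and spreads the remainder
    evenly over the other classes (total=20.0 is an illustrative choice)."""
    off = total * (1.0 - diag_mass) / (n_classes - 1)
    alpha = [off] * n_classes
    alpha[k_true] = total * diag_mass
    return alpha

rng = random.Random(1)
alpha = concentration(k_true=0, n_classes=5)
draws = [dirichlet(alpha, rng) for _ in range(4000)]
mean_correct = sum(d[0] for d in draws) / len(draws)
# the Monte Carlo prior mean of the correct-classification
# probability sits near the target of 0.65
```

Comparing such draws against intuition about classifier accuracy mirrors the elicitation the authors describe, without committing to their exact $\alpha_k$ values.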

We used multivariate normal priors at the species and site level for initial occupancy, persistence, colonization and encounter rates. Correlated priors allow information sharing among parameters (Figure S3.1). The motivation for this stemmed from a prior expectation that these parameters could be related. For example, species with a higher encounter rate might be more likely to occur initially, persist or colonize new sites. Similar arguments could be made about relationships among site‐level parameters. Each species is associated with a vector $\boldsymbol{\epsilon}_{k}$ of length 4, where ${\epsilon}_{k,1}$, ${\epsilon}_{k,2}$, ${\epsilon}_{k,3}$ and ${\epsilon}_{k,4}$ are species‐specific adjustments on each of the four ecological parameters. The multivariate normal priors have a mean of zero and an unknown covariance matrix: $\boldsymbol{\epsilon}_{k}\sim \text{Normal}\left(\mathbf{0},\boldsymbol{\Sigma}^{\left(\alpha \right)}\right)$ with an Inverse‐Wishart prior on $\boldsymbol{\Sigma}^{\left(\alpha \right)}$. Similarly, site‐specific adjustments $\boldsymbol{\epsilon}_{i}$ were drawn from a different multivariate normal prior. These adjustments were added together on a transformed scale to compute initial occupancy, persistence, colonization and encounter rates, for example, $\text{logit}\left({\psi}_{i,k,1}\right)={\epsilon}_{i,1}+{\epsilon}_{k,1}$. A full model specification for the case study is available in Appendix S4 (Plummer et al., 2003).
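Written out for all four rates (using $\phi$ for persistence and $\gamma$ for colonization, our notation; the log link for the encounter rate is an assumption, since a rate must be positive), the linear predictors would be:

```latex
\begin{aligned}
\text{logit}\left(\psi_{i,k,1}\right) &= \epsilon_{i,1} + \epsilon_{k,1}, &
\text{logit}\left(\phi_{i,k,t}\right) &= \epsilon_{i,2} + \epsilon_{k,2}, \\
\text{logit}\left(\gamma_{i,k,t}\right) &= \epsilon_{i,3} + \epsilon_{k,3}, &
\log\left(\lambda_{i,j,k,t}\right) &= \epsilon_{i,4} + \epsilon_{k,4}.
\end{aligned}
```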

To evaluate how the occupancy and encounter rate components of the full model informed classification probability estimates, we developed a reduced model that discards all information about occupancy and abundance, using just the true and imperfect species classifications to estimate the classification matrix $\mathbf{\Theta}$. This reduced model uses the same $\mathbf{\Theta}$ prior as the joint model for consistency. Additionally, imperfect classifications are modelled as arising from a categorical distribution with a species‐specific probability vector, as in the joint model. However, since occupancy and encounter rate information are discarded, true species classifications are not modelled and instead rely entirely on data. Thus, the reduced model is limited to the subset of data with verification. The comparison between full and reduced models reveals the extent to which occupancy and encounter rates inform classification probabilities. If there are no differences in the estimates of classification probabilities, then a two‐stage model which first models misclassification and then passes the posterior of species classification on as a prior for an occupancy model should perform as well as the joint model in which the classification model is integrated with the occupancy model. However, we do not directly investigate the comparison in occupancy and encounter rate estimation between the joint and two‐stage models.
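Because the reduced model sees only fully verified (true, imperfect) label pairs, its Dirichlet prior is conjugate to the categorical likelihood and each row of $\mathbf{\Theta}$ has a closed‐form posterior. A minimal pure‐Python sketch (using a flat prior with `alpha0=1.0` purely for illustration, not the informative prior of the paper):

```python
from collections import Counter

def reduced_theta_posterior(pairs, n_classes, alpha0=1.0):
    """Classification-only model sketch: with verified (true, imperfect)
    label pairs and a Dirichlet prior on each row of Theta, the posterior
    is Dirichlet(alpha + confusion counts); return each row's posterior
    mean, one row per true species."""
    counts = Counter(pairs)  # (true_k, imperfect_class) -> count
    post_mean = []
    for k in range(n_classes):
        alpha = [alpha0 + counts[(k, c)] for c in range(n_classes)]
        s = sum(alpha)
        post_mean.append([a / s for a in alpha])
    return post_mean

# two classes; species 0 is misclassified as class 1 a fifth of the time
pairs = [(0, 0)] * 8 + [(0, 1)] * 2 + [(1, 1)] * 9 + [(1, 0)]
theta = reduced_theta_posterior(pairs, n_classes=2)
# theta[0] == [0.75, 0.25]: (1 + 8) / 12 and (1 + 2) / 12
```

The joint model, by contrast, has no such closed form: unverified samples contribute through the latent identities $k\left[i,j,l,t\right]$, which is why it must be fit by MCMC.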

All models were fit using JAGS via dclone in R v4.0.2 (Plummer et al., 2003; R Core Team, 2020; Sólymos, 2010), with four chains of 20,000 iterations (Appendix S4). Output diagnostics were checked visually using traceplots and by verifying that $\widehat{R}<1.1$ for all estimates, and results were visualized with ggplot2 (Wickham, 2016).

We conducted a simulation study to examine the general behaviour of the model. We simulated 15,000 datasets with two species ($K$ = 2) and three imperfect classifications ($\tilde{K}$ = 3) while varying the fraction of verified samples. We expected model performance to decline as this fraction decreased.

Each dataset represents a single season of sampling with three surveys at each of 30 sites. The idea behind the three imperfect classifications is that two are the true species and one is a morphospecies (e.g. Species A, Species B and Morphospecies 1 in Figure 1). Every dataset was simulated using the priors from the case study's joint classification‐occupancy model (Appendices S3 and S4). The simulations use different Dirichlet concentration ($\alpha $) values, and thus a different $\mathbf{\Theta}$ prior, than the model used in the case study to accommodate a two‐species dataset. We use a smaller dataset in the simulations than in the case study, with fewer species and a single season rather than multiple seasons. This simplification was motivated by practical computational constraints, and we chose a similar number of species to that used in the simulations of Wright et al. (2020).
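The generative scheme above can be sketched in pure Python (parameter values, seed and function names are illustrative assumptions, not the paper's simulation settings, which draw parameters from the priors):

```python
import math
import random

def poisson(rng, mu):
    # Knuth's method; adequate for the small rates used here
    limit, k, p = math.exp(-mu), 0, 1.0
    while p > limit:
        k += 1
        p *= rng.random()
    return k - 1

def simulate(n_sites=30, n_surveys=3, psi=(0.6, 0.4), lam=(2.0, 1.0),
             theta=((0.75, 0.05, 0.20),   # species A -> A, B, morpho
                    (0.05, 0.70, 0.25)),  # species B -> A, B, morpho
             seed=7):
    """Single-season sketch of the simulation design: K = 2 species and
    K~ = 3 imperfect classes (both species plus one morphospecies)."""
    rng = random.Random(seed)
    samples = []  # (site, survey, true_species, imperfect_class)
    for i in range(n_sites):
        # latent occupancy state for each species at site i
        z = [1 if rng.random() < psi[k] else 0 for k in range(2)]
        for j in range(n_surveys):
            for k in range(2):
                # Poisson encounters, thinned to zero at unoccupied sites
                count = poisson(rng, lam[k]) if z[k] else 0
                for _ in range(count):
                    # imperfect classification of each individual
                    y = rng.choices(range(3), weights=theta[k])[0]
                    samples.append((i, j, k, y))
    return samples

samples = simulate()
```

Each tuple retains the true species alongside the imperfect class, so any fraction of the labels can later be withheld to mimic partial verification.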

For a given simulated dataset, we know the true species classification for every sample. We fit both the joint classification‐occupancy and classification‐only models to each dataset 100 times, each time with a different fraction of samples verified. The 100 fractions for each dataset are randomly generated between 0.01 and 0.99. While this range covers wide variability in verification effort, the results are sensitive to the choice of $\mathbf{\Theta}$ prior; in particular, the lower fractions would otherwise face estimation challenges (i.e. issues with parameter identifiability or MCMC convergence). If a dataset's 100 iterations are arranged from the largest fraction of verified samples to the smallest, then once a sample's true species identity is removed in one iteration, it stays removed for all subsequent iterations. We subset the samples this way to better isolate the effect of the validation fraction on model results without the confounding effects of sample selection.
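This nested hold‐out scheme amounts to shuffling the sample indices once and keeping a prefix of that fixed order for each fraction (a sketch; the function name and seed are our own):

```python
import random

def nested_verified_subsets(n_samples, fractions, seed=0):
    """Shuffle sample indices once, then keep a prefix of that fixed
    order for each verification fraction, so an identity removed at one
    fraction stays removed at every smaller fraction."""
    rng = random.Random(seed)
    order = list(range(n_samples))
    rng.shuffle(order)
    return {f: set(order[: round(f * n_samples)]) for f in fractions}

subsets = nested_verified_subsets(1000, fractions=[0.9, 0.5, 0.1])
# the verified sets are nested: subsets[0.1] <= subsets[0.5] <= subsets[0.9]
```

Nesting the subsets this way ensures that differences across fractions reflect verification effort rather than which particular samples happened to be verified.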

To evaluate model predictive performance, we first calculated validation metrics for the classification performance of the models by comparing estimated species identities to true species identities for samples whose identities were withheld. However, we could not calculate validation metrics for the classification‐only model, since it cannot be fit unless every sample has a verified species identity. For these metrics, we compare the true species identity, which is known in the simulations, to the posterior median estimated identity; that is, each sample's estimated identity is the species on which the median of its categorical posterior distribution (see Observation model: classification) falls, computed for each of the 1,200 chains for each iteration of every dataset. We evaluate how well the simulations recovered ecological parameters by measuring the coverage, the difference between estimated and true parameter values, and the width of the 95% credible interval (CI) for encounter rate ($\lambda $), occupancy ($\psi $) and classification probabilities (${\mathrm{\Theta}}_{\text{full}}$) for the full model (joint classification‐occupancy) and classification probabilities (${\mathrm{\Theta}}_{\text{reduced}}$) for the reduced model (classification‐only).

Occupancy estimates varied across species and through time (Figure S2.2). No occupancy model was fit for the reduced, two‐stage model, so there are no occupancy estimates against which to compare the full model's results. The joint occupancy model was designed to allow correlation between parameters across sites and species. Occupancy, growth and turnover rates also varied through time. Sites with high encounter rates tended to have low initial occupancy and colonization probabilities and high persistence probabilities (Figure S3.1). Furthermore, sites with high colonization rates tended to have high initial occupancy probabilities and low persistence probabilities. At the species level, we saw positive correlations among many of the model components; in particular, species' encounter rate was positively correlated with species' initial occupancy, persistence and colonization rates (Figure S3.1). Species varied in how reliably the imperfect classifier identified them, from species that were common and consistently identified correctly (e.g. *Calathus advena*) to species that were never identified by the parataxonomist (e.g. *Dicheirotrichus mannerheimii*) and were identified only by the expert taxonomist.

The model yielded high probabilities of classification along the diagonal of the $\mathbf{\Theta}$ confusion matrix where the true and imperfect identifications match (Figure 2). The parataxonomists, or imperfect classifiers in the NEON dataset, were trained in beetle taxonomy, so we built the model to favour the imperfect classifier by giving more weight in the $\mathbf{\Theta}$ prior to diagonal values, making morphospecies classifications less probable. However, some species were just as or more likely to be identified as a morphospecies by the imperfect classifier than as the correct species. For example, the parataxonomist was more likely to classify *Pterostichus* (*Hypherpes*) sp. as morphospecies D13.2016.MorphBT than as the true species (see white arrows in Figure 2). However, no species had more than 3% probability (median) of being classified as another species (i.e. our model results indicate that the parataxonomist is most likely to identify a species either correctly or as a morphospecies). Samples with morphospecies classifications make up a sizeable portion of the carabid dataset, 812 out of the 5,865 total individuals identified by the imperfect classifier.

To evaluate the value added by informing the classification model with occupancy and encounter rates, we compared the full model to a reduced classification‐only model that discards all information about occupancy and abundance. The models also differ in their access to data: the reduced model can use only the subset of samples with validation data, whereas the full model allows the entire partially validated dataset to inform classification. Most ${\theta}_{k}$ probability vectors do not differ between the full and reduced model results. However, we see differences for a few species where there is less overlap in $\theta $ posteriors between the full and reduced models (e.g. Theta[*P*. (*Hypherpes*) sp., *P*. (*Hypherpes*) sp.] and Theta[*P. restrictus*, *P. restrictus*], Figure 3). These differences are most notable for the abundant species, for which the full model yielded higher correct classification probabilities. Furthermore, the reduced model has larger 95% CI widths than the full model for many $\mathbf{\Theta}$ indices (Figure 4). Thus, we find that a joint classification‐occupancy model outperforms a two‐stage model (classification, then occupancy) due both to the improvements in the modelling framework and to the joint model's expanded access to the full dataset. Since the models being compared are fit to different sets of empirical data, we cannot quantify bias, which is an important consideration in model comparison.

We evaluated the performance of the full model's species classification by withholding some verified samples and assessing how well the model recovered those true species identities. Validation metrics cannot be calculated for the reduced model, since it cannot be fit if true species identities are missing. We withheld a randomly selected 20% (382 individuals) of the true species identities (1,910 total). For the withheld samples, the classification accuracy was 89.9%, the precision (for each imperfect classification, the number correctly matched over the total labelled with that classification) was 80.6%, and the recall (for each species, the number correctly matched over the total samples of that species) was 51.9%. Validation metric macro‐averages are listed in Table 1.
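As a minimal illustration of these metrics (pure Python, with hypothetical labels rather than the paper's data):

```python
from collections import Counter

def macro_metrics(true_labels, pred_labels):
    """Accuracy plus macro-averaged precision (per predicted class) and
    recall (per true species), matching the definitions in the text."""
    classes = sorted(set(true_labels) | set(pred_labels))
    tp = Counter()
    pred_n = Counter(pred_labels)
    true_n = Counter(true_labels)
    for t, p in zip(true_labels, pred_labels):
        if t == p:
            tp[t] += 1
    acc = sum(tp.values()) / len(true_labels)
    prec = [tp[c] / pred_n[c] for c in classes if pred_n[c]]
    rec = [tp[c] / true_n[c] for c in classes if true_n[c]]
    return acc, sum(prec) / len(prec), sum(rec) / len(rec)

acc, prec, rec = macro_metrics(["a", "a", "b", "b", "c"],
                               ["a", "b", "b", "b", "c"])
# acc = 0.8; macro precision = 8/9; macro recall = 5/6
```

Recall can lag accuracy, as in our results, when rare species are seldom matched even though common ones dominate the overall accuracy.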

We found that model performance improved when fit to datasets with larger fractions of verified samples; however, the gains in model performance were modest. We illustrate model performance through model validation (Figure S5.1), coverage (Figure S5.3), the difference between estimated and true parameter values (Figure S5.2), and the width of 95% credible intervals (Figure S5.4). In all of these metrics, we see that the gains in model performance are small or non‐existent as the fraction of samples verified increases. We also found that the joint classification‐occupancy model outperformed the classification‐only model; specifically, the joint model had higher precision (Figure 5). These results are specific to the dataset scenario specified for these simulations (e.g. 2 species, 1 morphospecies), and the results would likely change for datasets with higher species richness.

We developed a statistical approach that can be applied to datasets with imperfect observations that enhances multispecies classification by leveraging occupancy dynamics. This approach builds on recent work that integrates classification into occupancy models (Devarajan et al., 2020, and references therein) by evaluating the advantage of a joint classification‐occupancy model, which allows imperfect classification categories to outnumber species; leverages individual‐level confirmation data in a semi‐supervised setting; and allows for the option to include covariates at the level of the individual/observation. Our probabilistic framework can be generalized to modelling abundance or other latent state variables that are estimated from multivariate count data with imperfect detection and classification (Chambert et al., 2016; Conn et al., 2013), as long as the data are classified at the individual level. For example, the $z*\lambda $ component of the encounter and observation models could alternatively be written as $p*N$ to model abundance. While analyses targeting species richness may be shielded to a certain extent from imperfect classification (Egli et al., 2020), any population‐ or community‐level analysis with taxonomic specificity requires an understanding of classification uncertainty in the data. Our model provides a coherent statistical framework for ecological estimation in the presence of classification uncertainty.
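Under the abundance extension sketched above (with latent abundance $N_{i,k,t}$ and detection probability $p_{i,j,k,t}$, our notation, not defined in the original), the encounter model might be replaced by a binomial N‐mixture form:

```latex
L_{i,j,k,t} \sim \text{Binomial}\left(N_{i,k,t},\, p_{i,j,k,t}\right),
\qquad
\mathbb{E}\left[L_{i,j,k,t}\right] = p_{i,j,k,t}\,N_{i,k,t},
```

so that $p\,N$ plays the role that $z\,\lambda$ plays in the occupancy formulation.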

False‐positive and false‐negative species classifications are inevitable in any field collection, arising from misclassification or non‐detection driven by time and budget constraints or imperfect classifiers (Hoekman et al., 2017; McClintock et al., 2010; Miller et al., 2012; Royle & Link, 2006). Misclassification may be caused by a number of extrinsic factors, including site‐ and survey‐level covariates or observer error; we focus on the latter in the case study. Accounting for false identifications is important to reduce bias in occupancy dynamics estimated from multispecies biodiversity monitoring datasets (Chambert et al., 2015; McClintock et al., 2010; Miller et al., 2011, 2015) or in multi‐state capture–recapture models (Pradel, 2005). Alternative models that account for false positives may consider data from only the focal species (Chambert et al., 2015) or from binary observations (Chambert et al., 2017). Like Wright et al. (2020), we use available counts from an imperfect classifier (Figure 1). However, we use all species detected, no matter how rare. By using a rectangular classification matrix that allows the propagation of taxonomic uncertainty in multispecies datasets where imperfect classifications outnumber species (e.g. ‘unknown’, morphospecies or identifications coarser than species; Figure 2), we remove a limitation of previous occupancy modelling methods (Chambert et al., 2018; Wright et al., 2020). For example, although the model priors favour imperfect classifier accuracy, the model found likely species matches for a couple of morphospecies that were abundant in the data (e.g. D13.2015.MorphO, D13.2016.MorphBT; Figure 2).

Our model is semi‐supervised and makes use of data at the individual level. Whereas alternative models use data pooled at the site or visit level (Chambert et al., 2018; Wright et al., 2020), our model leverages the rich, individual‐level information to reveal which species are commonly mistaken by the imperfect classifier and how often, allowing species counts to inform classification (Chambert et al., 2017). Verified individuals can be used as partially observed occupancy data in our semi‐supervised model. Our model could be expanded to include observation‐level covariates (e.g. sample confidence), an extension that our individual‐level occupancy‐detection model can accommodate but a count‐detection model cannot. The case study in Wright et al. (2020) classifies bat calls to species at the site and visit level. All calls are classified by an automated, imperfect classifier, and only the subset of calls classified manually by an error‐free classifier is used to fit their model, though the authors provide a model extension for when detections cannot be confirmed. Wright et al.'s (2020) model is analogous to our joint model if the data were aggregated to the site and visit level.

Ours is the first model to consider how occupancy and encounter rates contribute to improving species classification (although Augustine et al. (2020) developed a capture–recapture model for improving individual classifications). We found that our joint classification‐occupancy model outperformed a classification‐only model that disregarded occupancy dynamics and could use only a supervised subset of the data in estimating imperfect classification. Specifically, the joint model yielded more precise estimates (Figures 4 and 5). Accuracy of classification estimates was not assessed for either model because true misclassification probabilities are not known for the case study data. While there was broad agreement between the confusion matrix $\mathbf{\Theta}$ estimates of the two models, the full model had higher probability estimates for abundant species (Figure S2.1).

The simulations explore how model behaviour changes with the fraction of samples verified. Error‐free classification of samples is often costly and time‐intensive, so understanding the added value of additional verification is useful for researchers conducting species surveys on a budget. We found that a higher fraction of verified samples yielded better model performance, though the difference was modest (Appendix S5). For example, classification accuracy for a dataset with 90% of its samples verified exceeded that for a dataset with almost none of its samples verified by roughly 5% (Figure S5.1). This small difference is likely explained by the informative Dirichlet prior paired with a dataset containing few species. While the validation metrics (Figure S5.1) show some improvement with more samples verified, the trend is largely absent from the ecological parameter estimates from the simulations (Figures S5.2–S5.4). This modest difference in model performance also highlights the trade‐off between budgeting for more sample verification to meet desired research goals and verifying fewer samples while still achieving reasonable, though not optimal, model performance. The simulations presented are for a two‐species case, and model performance may differ in scenarios with more species. Researchers who can anticipate how many species a survey will encounter and how many samples they can afford to verify can adapt these simulations to decide how many samples to have verified.
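The verified‐fraction experiment can be sketched with a toy, two‐species version of the idea (illustrative only, not the paper's simulation code): verified samples contribute exact counts to a conjugate Dirichlet update of one confusion‐matrix row, and an informative prior favouring classifier accuracy dominates when few samples are verified.

```python
import random

def posterior_mean_accuracy(n_samples, frac_verified, theta_row, prior, seed=1):
    """Posterior mean of the correct-classification probability for one
    confusion-matrix row, updated only by the verified subset of samples."""
    rng = random.Random(seed)
    counts = list(prior)  # Dirichlet concentration parameters
    for _ in range(n_samples):
        label = 0 if rng.random() < theta_row[0] else 1
        if rng.random() < frac_verified:  # only verified samples add counts
            counts[label] += 1
    return counts[0] / sum(counts)

true_row = [0.9, 0.1]   # true P(label | true species A)
prior = [9.0, 1.0]      # informative prior favouring classifier accuracy

acc_low = posterior_mean_accuracy(2000, 0.05, true_row, prior)   # ~5% verified
acc_high = posterior_mean_accuracy(2000, 0.90, true_row, prior)  # ~90% verified
```

Because the prior mean (0.9) already matches the true accuracy here, both estimates land close to 0.9, mirroring the modest improvement from extra verification reported above; with a misinformed prior the gap between the two would widen.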

Our model has a number of limitations. For one, the joint classification‐occupancy model takes more computing power to fit than the classification‐only model. For the NEON data used in the case study, the imperfect classifier agreed closely with verified classifications, and we reflected this in the model by using informed Dirichlet priors that assume high classifier accuracy. Model results may differ for datasets with a less reliable imperfect classifier, and the extent to which classifier accuracy affects model behaviour could be addressed through simulation. The model also assumes that samples are selected at random for verification, yet the dataset used in the case study prioritized verification of samples that the imperfect classifier could not identify to species level (National Ecological Observatory Network, 2021). Preferentially verifying samples with low classification confidence violates this assumption and biases the confusion matrix parameter estimates, likely causing misclassification probabilities to be overestimated. In other cases, such as when data collection methodology changes in long‐term surveys, simulations could inform which types of samples should be prioritized for verification to yield desired model results.
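The verification‐bias mechanism is easy to demonstrate with a small, purely illustrative simulation (not drawn from the paper): if hard, low‐confidence samples are both more error‐prone and preferentially verified, the verified subset over‐represents errors and the apparent misclassification rate exceeds the true rate. The rates and proportions below are invented for illustration.

```python
import random

def verified_misclass_rate(prefer_hard, n=20000, seed=2):
    """Misclassification rate observed among verified samples under either
    random verification or verification that prioritizes hard samples."""
    rng = random.Random(seed)
    verified_errors = verified_total = 0
    for _ in range(n):
        hard = rng.random() < 0.2              # 20% of samples are hard
        p_error = 0.4 if hard else 0.05        # hard samples err more often
        wrong = rng.random() < p_error
        p_verify = (0.9 if hard else 0.1) if prefer_hard else 0.3
        if rng.random() < p_verify:
            verified_total += 1
            verified_errors += wrong
    return verified_errors / verified_total

rate_random = verified_misclass_rate(prefer_hard=False)  # ~ true rate (0.12)
rate_biased = verified_misclass_rate(prefer_hard=True)   # inflated by selection
```

Under these made‐up settings the true error rate is 0.2 × 0.4 + 0.8 × 0.05 = 0.12, while prioritized verification recovers a substantially higher rate, matching the overestimation concern raised above.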

We tried various iterations of the model before arriving at the final semi‐supervised, individual‐level model. While an aggregated‐data approach theoretically takes less processing time, we found that models fit to aggregated data either failed to converge or struggled with identifiability, yielding multimodal posteriors for $\theta$. Changing the Dirichlet priors to favour imperfect classifier accuracy helped but did not eliminate the problem. In the case study, we investigate differences in classification performance between the full and reduced models but do not directly compare their performance in estimating ecological parameters. Future work could more explicitly address false positives by informing the $\theta$ priors with species commonly misidentified by the imperfect classifier or by inducing sparsity in the $\theta$ matrix by setting certain priors to 0.
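As a small sketch of the sparsity idea (illustrative, with made‐up counts): under a conjugate Dirichlet update, a $\theta$ entry whose prior concentration is fixed at 0, and which accrues no verified counts, has a posterior mean of exactly 0, encoding a structurally impossible confusion.

```python
def posterior_mean_row(prior, counts):
    """Posterior mean of one Dirichlet-multinomial confusion-matrix row."""
    post = [a + c for a, c in zip(prior, counts)]
    total = sum(post)
    return [p / total for p in post]

prior = [5.0, 1.0, 0.0]   # third entry: confusion ruled out a priori
counts = [40, 6, 0]       # verified counts consistent with the structural zero
row = posterior_mean_row(prior, counts)
```

The zeroed entry stays at 0 in the posterior mean, concentrating mass on the plausible confusions; nonzero entries are updated as usual.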

Large‐scale, long‐term biodiversity surveys are critical to inform land management and conservation policy (Hughes et al., 2017) and require accurate species classifications to achieve conservation objectives. Our probabilistic approach models species occupancy while accounting for both imperfect detection and imperfect classification. Accounting for misclassification is even more important when making inferences about temporal transitions (i.e. extinction, colonization) than about occupancy, because misclassification in either of the focal time periods ($t$ or $t+1$) can produce bias. Innovations in occupancy models are rapidly being made to accommodate an expanding variety of study systems and experimental designs (Bailey et al., 2014), and most of these approaches rely on observation‐level, ground‐truthed verification data for model training and validation. The proposed model can be extended to enhance occupancy inferences from citizen science surveys (Sauer et al., 2017); long‐term, multi‐PI surveys where methodologies may vary through time or across sites; automated imperfect classifiers (e.g. machine learning algorithms) applied to large volumes of open data (e.g. satellite or airborne remote sensing, camera traps); or research scenarios with limited data provenance, where propagating classification uncertainty is important. Future work could explore through simulation how model inputs affect behaviour (e.g. the weights of the informed Dirichlet prior, imperfect classifier accuracy or sample prioritization for verification) and use observation‐level covariates to enhance estimation of classification probabilities and ecological parameters.
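The claim that misclassification matters more for transitions than for occupancy can be illustrated with a back‐of‐envelope simulation (invented parameter values, not from the paper): with perfect detection of truly present species but a small false‐positive rate in each period, naive colonization and extinction rates computed from observed states are inflated because an error in either $t$ or $t+1$ corrupts the observed transition.

```python
import random

def naive_transition_rates(psi=0.5, gamma=0.1, eps=0.2, p_fp=0.05,
                           n_sites=50000, seed=3):
    """Naive colonization/extinction rates from observed states when
    unoccupied sites yield a false positive with probability p_fp."""
    rng = random.Random(seed)
    col_num = col_den = ext_num = ext_den = 0
    for _ in range(n_sites):
        z1 = rng.random() < psi                               # true state, t
        z2 = (rng.random() >= eps) if z1 else (rng.random() < gamma)  # t+1
        y1 = z1 or rng.random() < p_fp   # observed: perfect detection + FPs
        y2 = z2 or rng.random() < p_fp
        if y1:
            ext_den += 1
            ext_num += (not y2)
        else:
            col_den += 1
            col_num += y2
    return col_num / col_den, ext_num / ext_den

naive_gamma, naive_eps = naive_transition_rates()
```

With these settings the naive colonization rate lands near $\gamma + (1-\gamma)\,p_{fp} \approx 0.145$ rather than the true 0.1, and extinction is similarly inflated, while setting `p_fp=0` recovers both true rates.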

The National Ecological Observatory Network is a program sponsored by the National Science Foundation and operated under cooperative agreement by Battelle. This material is based in part upon work supported by the National Science Foundation through the NEON Program. We thank G. Vagle for taking part in the conception of this project and J. Coulombe for her graphical design assistance. The work was supported by the CU Boulder Grand Challenge investment in Earth Lab. AIS was supported as a GRA at Earth Lab for work on this project. Any use of trade, product or firm names is for descriptive purposes only and does not imply endorsement by the U.S. Government.

The authors have no conflict of interest to declare.

A.I.S., C.L.T. and M.B.J. conceived the project idea; A.I.S., J.A.R. and M.B.J. designed the methodology; A.I.S. curated the data; A.I.S. and M.B.J. analysed the data; A.I.S. and M.B.J. led the writing of the manuscript. All authors contributed critically to the drafts and gave final approval for publication.

The peer review history for this article is available at

Carabid data are publicly accessible through the NEON data portal at