Probabilistic forecasts are commonly used to communicate uncertainty in the occurrence of hydrometeorological events. Although probabilistic forecasting is common, conventional methods for assessing the reliability of these forecasts are approximate. Among the most common of these methods, the decomposed Brier Score and the Reliability Diagram treat an observed string of events as samples from multiple Binomial distributions, but this is an approximation of the forecast verification setting, leading to unnecessary loss of information. This article suggests testing the hypothesis of reliability via the Poisson‐Binomial distribution, a generalization of the Binomial distribution that provides an exact model of the probabilistic event forecast verification setting. Further, a two‐stage approach to reliability assessment is suggested to identify forecast errors related to both bias and overly or insufficiently sharp forecasts. This methodology is shown to more effectively distinguish between reliable and unreliable forecasts, leading to more robust probabilistic forecast verification.

Hydrometeorological events (e.g., precipitation occurrence, droughts, floods) are often forecasted as probabilities, representing a forecaster's certainty that a given event will occur [*Murphy et al*., ; *Madadgar and Moradkhani*, ; *Wetterhall et al*., ; *Yan and Moradkhani*, ]. Such probabilistic forecasts are motivated by the presence of uncertainties in land surface and atmospheric processes, which undermine the ability to precisely predict future event occurrences [*Slingo and Palmer*, ; *DeChant and Moradkhani*, ]. Since forecasters do not have complete knowledge of future events, hydrologists and meteorologists alike have recognized the benefits of communicating uncertainty in their forecasts [*Hamill*, ; *Pappenberger et al*., ]. This is evidenced by the wealth of operational probabilistic forecasting systems [*Buizza et al*., ; *Demargne et al*., ; *Park et al*., ; *Saha et al*., ] and probabilistic forecasting research initiatives [*Schaake et al*., ]. By issuing probabilistic forecasts, the end user is notified of the imperfect nature of the forecast, and therefore should only rely on a forecasted event occurring with the designated probability [*Joslyn and Savelli*, ; *Gigerenzer et al*., ]. Further, this communication of forecast uncertainty can improve risk management when resources are in danger, assuming that the forecasts accurately represent the uncertainty of an event occurring [*Carriquiry and Osgood*, ]. This necessitates detailed examination of forecast quality to ensure effective management of risk.

Two characteristics indicate the quality of a probabilistic forecast: reliability and sharpness. Reliability, also termed calibration, refers to the accuracy of the forecasted probability in conveying the true probability of an event occurring [*Christensen et al*., ]. For example, an event that is forecasted with a probability of 50% should occur in 50% of instances. Sharpness, in contrast, is the level of certainty in the forecast, where greater sharpness indicates a reduction in uncertainty, which may be measured by the forecast variance or entropy [*Machete*, ]. A sharper forecast will have a tendency to generate probabilities approaching zero or one, with a perfectly sharp forecast only generating values of zero or one (deterministic forecast). With both the reliability and sharpness components of a forecast being important, it becomes necessary to have a multiobjective verification system for full assessment of forecast quality.

Multiobjectivity in forecast verification may be achieved through either a continuous function or a rule‐based comparison. Continuous functions used for assessing probabilistic event forecasts should be strictly proper [*Bröcker*, ; *Christensen et al*., ; *Gneiting and Raftery*, ], with typical examples being quadratic, spherical, or logarithmic functions [*Bickel*, ]. Of these functions, the quadratic is particularly common, and is often referred to as the Brier Score (BS) [*Brier*, ]. The BS is a smooth, strictly proper function, providing a statistically sound method for comparing competing forecasts, but it has a complex relationship between sharpness and reliability [*Mason*, ]. Alternatively, a rule‐based approach may exhibit more control over the interaction between reliability and sharpness. This study takes the perspective that reliability should be held paramount, and therefore follows the paradigm “maximizing sharpness subject to calibration,” as stated in *Gneiting et al*. []. Within this paradigm, reliability of a forecast is a requisite condition for acceptability [*Mitchell and Wallis*, ]. Although sharper forecasts are desired, it is imperative to ensure that sharpness is not a factor when comparing an unreliable forecast to a reliable forecast. Through this framework, it is essential that reliability assessment be accurate, motivating a detailed look at the typical methods for reliability evaluation. The remainder of this manuscript will examine reliability assessment in probabilistic event forecasting, with the intention of assuring maximum accuracy when assessing reliability.

Assume that some forecast methodology, *f*, using some information, *D _{t}*, estimates the probability of an event,

$$p_t = f(D_t) \quad (1)$$

where $p_t$ is the forecasted probability of the event occurring at time *t*. Likewise, assume that an observation, *O _{t}*, is available at each forecast time, which may be either 0 or 1, with 1 indicating event occurrence and 0 indicating event nonoccurrence. This is the typical verification setting, where the forecasted probabilities and observed event occurrences compose all available information. With this information, the forecaster will attempt to determine if the forecast is a reliable predictor of the event of interest.

A probabilistic forecast is deemed reliable if the forecasted event probabilities are statistically indistinguishable from the true event probabilities [*Annan and Hargreaves*, ]. Note that the term “true probability” used here refers to the probability that properly represents the uncertainty in the forecast. Reliability assessment therefore becomes an examination of the similarity between the forecasted and true probabilities. Although the true probabilities are not directly available in the verification setting, the forecaster may assume that the observations provide information about them. A prudent approach is to view the observations as random binary variables, each drawn according to the true event probability, making the observations representative of that probability. Since the forecaster must evaluate the similarity between the forecasted and true probabilities, and the observations are assumed to be drawn with the true probability, the problem may be inverted by quantifying the probability that the observations were drawn based on the forecasted probabilities. This will be referred to as the probability of reliability.

Drawing a random binary variable based on a forecasted probability is modeled by the Bernoulli distribution. In order to estimate the probability of reliability, each forecast should be viewed as a Bernoulli trial, with the probability of each observed outcome given by equation (2).

$$P(O_t = o_t \mid p_t) = p_t^{\,o_t} \left( 1 - p_t \right)^{1 - o_t} \quad (2)$$

Equation (2) provides a means to estimate the probability of a single observation of the event, assuming that the forecasted probability is equal to the true probability. Although equation (2) allows the forecaster to estimate the probability of each observation being drawn with the forecasted probability, the forecaster will be required to estimate the probability of a set of forecasts and observations occurring simultaneously in order to have sufficient information for robust reliability assessment. A first step is estimating the probability of the specific sequence of observations ($o_1, \ldots, o_T$) being drawn with the corresponding forecasted probabilities ($p_1, \ldots, p_T$), as shown in equation (3).

$$P(O_1 = o_1, \ldots, O_T = o_T) = \prod_{t=1}^{T} p_t^{\,o_t} \left( 1 - p_t \right)^{1 - o_t} \quad (3)$$

While equation (3) provides the forecaster with the probability of the specific forecast and observation sequence, this probability will become infinitesimal for a large number of forecast and observation pairs. It is suggested here that the probability of reliability should be formulated into a probability distribution, which may be achieved by viewing the observations as random variables. When viewing the observations as random variables, all permutations of *K* events occurring in *T* trials must be considered, where *K* is the total number of observed event occurrences, estimated according to equation (4).

$$K = \sum_{t=1}^{T} O_t \quad (4)$$

This necessitates the summation over all permutations of *K* observed occurrences in *T* trials, estimating the probability that *K* events may occur. Within this setting, the Poisson‐Binomial distribution [*Hodges and Le Cam*, ; *Hong*, ] estimates the probability of reliability exactly, and the corresponding Probability Mass Function (PMF) is shown in equation (5).

$$P(K = k) = \sum_{A \in S} \prod_{t \in A} p_t \prod_{t \in A^c} \left( 1 - p_t \right) \quad (5)$$

In equation (5), *S* is the set of all the permutations of *K* event occurrences in *T* trials that satisfy equation (4), *A* represents a specific permutation drawn from *S*, and $A^c$ is the complement of *A*, containing the trials in which the event does not occur.
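To make equation (5) concrete, the permutation sum can be evaluated by direct enumeration for a small number of forecasts. The sketch below is an illustration written for this section, not code from an existing package; the function name and interface are assumptions.

```python
from itertools import combinations

def poisson_binomial_pmf_bruteforce(k, probs):
    """P(K = k) by direct enumeration of equation (5): sum, over every
    permutation A of k occurrences in T trials, of the product of event
    probabilities over A and nonoccurrence probabilities over A's
    complement. Only feasible for small T, since |S| = C(T, k)."""
    T = len(probs)
    total = 0.0
    for A in combinations(range(T), k):   # a specific permutation A in S
        occ = set(A)
        prod = 1.0
        for t in range(T):                # A contributes p_t; A^c, (1 - p_t)
            prod *= probs[t] if t in occ else 1.0 - probs[t]
        total += prod
    return total

# Example: probability of exactly one occurrence in three forecasts
print(poisson_binomial_pmf_bruteforce(1, [0.2, 0.5, 0.9]))  # 0.41
```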

In this article, it is suggested that reliability assessment should take a rejectionist approach, where a forecaster hypothesizes that the forecast is reliable (null hypothesis) and attempts to disprove that hypothesis. If the forecaster cannot provide sufficient evidence that the true probabilities differ from the forecasted probabilities, then the hypothesis of reliability cannot be rejected. Verification with this methodology is regularly performed for continuous predictands, typically with the use of the chi‐squared test [*Jolliffe and Primo*, 2008], but is rare among forecasts of dichotomous hydrometeorological events.

Such a hypothesis test may be performed with the Poisson‐Binomial distribution, but requires the formulation of the Cumulative Distribution Function (CDF). The CDF of the Poisson‐Binomial distribution is estimated according to equation (6), where each term in the summation is the PMF of equation (5).

$$P(K \le k) = \sum_{j=0}^{k} P(K = j) \quad (6)$$

In order to perform this hypothesis test, a significance level will need to be selected for rejecting the null hypothesis, which will be 0.05 throughout this article. More specifically, if the CDF value of the observed number of event occurrences falls within the lower or upper 0.025 tail of the Poisson‐Binomial distribution, the hypothesis of reliability is rejected.

Direct estimation of equation (6) is computationally infeasible for any useful sample size, due to the large number of permutations of the observed events [*Hong*, ]. In order to overcome this issue, it is possible to use the Discrete Fourier Transform and the Characteristic Function, as demonstrated by *Hong* [], to solve the Poisson‐Binomial CDF at any practically relevant sample size. This provides an exact solution for the Poisson‐Binomial CDF, thus allowing for precise hypothesis testing.
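A sketch of this approach is given below. It follows the general recipe of evaluating the Characteristic Function of *K* at the *T* + 1 Fourier frequencies and inverting with a Discrete Fourier Transform to recover the PMF, as described by *Hong* []; the implementation details, function names, and the two-sided test at the end are assumptions of this sketch rather than the reference code.

```python
import numpy as np

def poisson_binomial_cdf(k, probs):
    """Poisson-Binomial CDF via the characteristic function and the DFT.
    phi(l) = prod_t (1 - p_t + p_t * exp(2j*pi*l / (T + 1))) is the
    characteristic function of K at frequency l; a DFT over l = 0..T
    recovers P(K = 0), ..., P(K = T) up to floating-point error."""
    p = np.asarray(probs, dtype=float)
    T = p.size
    l = np.arange(T + 1)
    z = np.exp(2j * np.pi * l / (T + 1))           # Fourier frequencies
    phi = np.prod(1.0 - p[:, None] + p[:, None] * z, axis=0)
    pmf = np.clip(np.real(np.fft.fft(phi)) / (T + 1), 0.0, 1.0)
    return np.cumsum(pmf)[k]

def reject_reliability(obs, probs, alpha=0.05):
    """Two-sided test of the hypothesis of reliability: reject when the
    observed event count falls in either alpha/2 tail of the
    Poisson-Binomial distribution implied by the forecast probabilities."""
    K = int(np.sum(obs))
    lower = poisson_binomial_cdf(K, probs)                    # P(K <= K_obs)
    upper = 1.0 - (poisson_binomial_cdf(K - 1, probs) if K > 0 else 0.0)
    return min(lower, upper) <= alpha / 2.0
```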

The Poisson‐Binomial distribution is absent from the hydrometeorological literature, and only approximations are present for probabilistic event forecast verification. All conventional reliability metrics are based on the Binomial distribution, which is a specific case of the Poisson‐Binomial distribution in which all forecasted probabilities are equal. Use of the Binomial distribution is therefore an approximation in the probabilistic verification setting, leading to a loss of statistical power, with the exception of reliance on climatology, where the historical frequency of the event is used for forecasting. The Binomial CDF is much simpler than the Poisson‐Binomial CDF, as shown in equation (7), and has therefore been an attractive alternative for general use.

$$P(K \le k) = \sum_{j=0}^{k} \binom{T}{j} \bar{p}^{\,j} \left( 1 - \bar{p} \right)^{T - j} \quad (7)$$

In equation (7), $\bar{p}$ is the mean of the forecasted probabilities, and the binomial coefficient $\binom{T}{j}$ counts the permutations of *j* event occurrences in *T* trials.

The Binomial CDF provides a simplified function for estimating the probability of reliability, but this will become increasingly approximate as the variability in forecasted probabilities increases. In order to reduce these errors, it has become common to group similarly valued forecasts, referred to as binning. Although binning is utilized to reduce error in the Binomial distribution, it has the added benefit of identifying complex types of unreliability, and therefore may also be necessary when using the more appropriate Poisson‐Binomial distribution. This binning approach divides the possible range of probabilities ([0, 1]) into *B* groups, which are typically evenly spaced. According to equation (8), each of the probabilities within the limits of bin *b* is assigned to that bin, where *b* is the selected bin number and $\mathcal{T}_b$ is the set of time steps assigned to bin *b*, which contains $T_b$ forecast and observation pairs.

$$\mathcal{T}_b = \left\{ t : \frac{b - 1}{B} < p_t \le \frac{b}{B} \right\}, \quad b = 1, \ldots, B \quad (8)$$

Along with the binned probabilities, the observations must be binned as well, as shown in equation (9), and the total number of observed occurrences within each bin is estimated according to equation (10), which is the application of equation (4) to multiple bins.

$$\mathcal{O}_b = \left\{ O_t : t \in \mathcal{T}_b \right\} \quad (9)$$

$$K_b = \sum_{t \in \mathcal{T}_b} O_t \quad (10)$$

In order to evaluate the Binomial distribution at each bin, the bin-averaged forecast probability ($\bar{p}_b$) is estimated according to equation (11).

$$\bar{p}_b = \frac{1}{T_b} \sum_{t \in \mathcal{T}_b} p_t \quad (11)$$

With this bin-averaged probability, the Binomial CDF may be evaluated at each bin according to equation (12).

$$P(K_b \le k_b) = \sum_{j=0}^{k_b} \binom{T_b}{j} \bar{p}_b^{\,j} \left( 1 - \bar{p}_b \right)^{T_b - j} \quad (12)$$

By binning the forecasted probabilities, the forecast verification problem is broken up into multiple separate problems, where the bin‐averaged probability becomes increasingly representative of the set of probabilistic forecasts with decreasing bin size.
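A minimal sketch of the binning procedure in equations (8)-(12) follows; the bin edges, names, and the use of `scipy.stats.binom` are choices made for this illustration.

```python
import numpy as np
from scipy.stats import binom

def binned_binomial_cdf(probs, obs, B):
    """Assign forecasts to B evenly spaced bins over [0, 1] (equation (8)),
    count occurrences per bin (equations (9) and (10)), average the forecast
    probability per bin (equation (11)), and evaluate the Binomial CDF at
    each bin (equation (12))."""
    probs, obs = np.asarray(probs, float), np.asarray(obs, int)
    cdfs = []
    for b in range(1, B + 1):
        in_bin = (probs > (b - 1) / B) & (probs <= b / B)
        T_b = int(in_bin.sum())
        if T_b == 0:
            cdfs.append(None)            # empty bin: nothing to evaluate
            continue
        K_b = int(obs[in_bin].sum())     # observed occurrences in bin b
        p_bar_b = probs[in_bin].mean()   # bin-averaged forecast probability
        cdfs.append(binom.cdf(K_b, T_b, p_bar_b))
    return cdfs
```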

Rather than directly estimating the Binomial CDF, meteorologists and hydrologists commonly use approximations. The most common verification methods are the BS and the Reliability Diagram. The original form of the BS is presented in equation (13), estimating the mean square error (MSE) of the forecasted probabilities and corresponding observations. As mentioned before, a perfect BS requires both perfect reliability and sharpness.

$$BS = \frac{1}{T} \sum_{t=1}^{T} \left( p_t - O_t \right)^2 \quad (13)$$

In order to assess reliability directly, the BS must be decomposed [*Murphy*, ]. Decomposition of the BS requires binning forecasted probabilities and observations, allowing for the comparison of bin-averaged forecasted probabilities ($\bar{p}_b$) with bin observed frequencies ($\bar{o}_b$), as shown in equation (14) [*Stephenson et al*., 2007].

$$BS = \frac{1}{T} \sum_{b=1}^{B} T_b \left( \bar{p}_b - \bar{o}_b \right)^2 + \frac{1}{T} \sum_{b=1}^{B} \sum_{t \in \mathcal{T}_b} \left[ \left( p_t - \bar{p}_b \right)^2 + \left( O_t - \bar{o}_b \right)^2 - 2 \left( p_t - \bar{p}_b \right) \left( O_t - \bar{o}_b \right) \right] \quad (14)$$

In equation (14), the first summation is the reliability component of the BS (BS_{R}), which is minimized with a perfectly reliable forecast. The second summation in equation (14) contains the within-bin variances of the observation and forecast and the within-bin covariance of the observation and forecast, and is minimized with perfect sharpness. Through equation (14), the BS_{R} may be directly estimated as the MSE of the bin-averaged forecast probabilities and the bin observation frequencies. For the remainder of the article, the error in the bin-averaged forecast probabilities ($\bar{p}_b - \bar{o}_b$) will be referred to as the probabilistic residual. As the sample size increases, the BS_{R} approaches perfect estimation of the probability of reliability from the Binomial distribution [*Feller*, ], with the exception that the BS_{R} is inversely related to the probability from the Binomial distribution. Although the BS_{R} will approach the exact solution to the Binomial distribution as *B* approaches infinity, there will be some error due to this approximation at any practical number of bins.
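For reference, the BS of equation (13) and its reliability component from equation (14) can be computed as below; this is a minimal sketch with illustrative names, reusing the binning scheme of equation (8).

```python
import numpy as np

def brier_score(probs, obs):
    """Equation (13): the MSE of forecast probabilities and observations."""
    probs, obs = np.asarray(probs, float), np.asarray(obs, float)
    return np.mean((probs - obs) ** 2)

def brier_reliability(probs, obs, B=6):
    """BS_R, the first summation of equation (14): the sample-weighted MSE
    of bin-averaged forecast probabilities and bin observation frequencies
    (i.e., the squared probabilistic residuals)."""
    probs, obs = np.asarray(probs, float), np.asarray(obs, float)
    T, bs_r = probs.size, 0.0
    for b in range(1, B + 1):
        in_bin = (probs > (b - 1) / B) & (probs <= b / B)
        T_b = in_bin.sum()
        if T_b > 0:
            bs_r += T_b * (probs[in_bin].mean() - obs[in_bin].mean()) ** 2
    return bs_r / T
```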

The Reliability Diagram provides a means for graphical comparison of the probabilistic residuals, allowing for visual assessment of forecast performance. In this diagram, the bin observed frequencies ($\bar{o}_b$) are plotted against the bin-averaged forecast probabilities ($\bar{p}_b$), with a perfectly reliable forecast falling along the one-to-one line. *Bröcker and Smith* [] translated the Reliability Diagram into probability space using the Binomial CDF. This provides a more accurate assessment of reliability from the Reliability Diagram.

The use of the BS_{R} and Reliability Diagram provides simple means for assessing the reliability of probabilistic hydrometeorological event forecasts, but these simplifications have drawbacks. First, these methods are approximations of the Binomial distribution, except in the case described in *Bröcker and Smith* []. As approximations, it is not clear to what extent these methods degrade the assessment of forecast reliability. Second, both methods are based on the Binomial distribution, which is itself limiting: verification becomes a balance between having sufficiently small within-bin forecast variance to reduce approximation errors, and enough observations in each bin to draw meaningful conclusions. A certain number of bins may be necessary to fully assess reliability, but the number of bins required to reduce approximation errors in the Binomial distribution is potentially greater than the number required for reliability assessment with the Poisson‐Binomial distribution. Further discussion of the necessary number of bins for the different distributions is provided in sections 6 and 7. Finally, thresholds for hypothesis testing with the BS_{R} cannot be derived theoretically, and therefore the BS_{R} cannot precisely distinguish between reliable and unreliable forecasts. Although the BS_{R} provides a useful method for comparing the probability of reliability, it is restrictive from the rejectionist perspective. Due to the problems highlighted above, it is necessary to examine the impacts conventional verification tools have on reliability assessment. Such an examination was performed with numerical experiments, as described in sections 5 and 6.

Multiple synthetic probabilistic forecasting experiments were performed to examine the performance of conventional reliability assessment in comparison to the Poisson‐Binomial distribution. Within these experiments, three forecast cases were implemented to examine the effects of varying degrees of forecast sharpness. The first case is presented in equation (15), where the forecasts are sampled from the standard uniform distribution.

$$p_t = u_t, \quad u_t \sim U(0, 1) \quad (15)$$

Based on case 1, a second case creates forecasts with probabilities tending toward zero, as shown in equation (16).

$$p_t = u_t^{\,x} \quad (16)$$

For the generation of the forecasts for case 2, the exponent *x* will be set to a value of 2 throughout the experiments presented in section 6.1, but will range from 1 to 1.25 in the experiments presented in section 6.2. A third case is generated according to equation (17), creating a “U”-shaped distribution by mirroring case 2 about a probability of 0.5.

$$p_t = \begin{cases} u_t^{\,x} & \text{with probability } 0.5 \\ 1 - u_t^{\,x} & \text{with probability } 0.5 \end{cases} \quad (17)$$

Case 3 is the sharpest of all the forecasting cases, and is therefore the best case, assuming that all forecasts are reliable. Note that cases 2 and 3 reduce to case 1 when *x* = 1.

Histograms of these forecasts are provided in Figure 1. From Figure 1, it is clear that case 1 makes every probability equally likely to be forecasted, case 2 has a tendency to forecast toward 0, and case 3 tends toward both 0 and 1.

In section 6.1, the different verification methods will be examined under reliable forecasting conditions. This requires sampling the observations according to the forecasted probabilities, thus ensuring that the forecasted probabilities are the true probabilities. The sampling of observations is shown in equation (18), where $v_t$ is an independent standard uniform random draw at each time step.

$$O_t = \begin{cases} 1 & v_t \le p_t \\ 0 & v_t > p_t \end{cases} \quad (18)$$

Further experiments are performed to determine the ability of the verification methods to reject unreliable forecasts. In order to perform this analysis, the exponent (*x*) in case 2 ranges from 1 to 1.25, and the corresponding values are used for case 3. These new cases (cases 2 and 3 with *x* values ranging from 1 to 1.25) are then compared to observations drawn with probabilities according to case 1 (equation (15)), such that the forecasts become increasingly unreliable as *x* departs from 1.
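The three forecast cases and the observation sampling of equation (18) can be reproduced with a few lines; the mirrored form of case 3 follows the reconstruction in equation (17), and the seed and function names are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_forecasts(case, T, x=2.0):
    """Equations (15)-(17): uniform (case 1), skewed toward zero (case 2),
    and 'U'-shaped (case 3, mirroring case 2 about 0.5) forecasts."""
    u = rng.uniform(size=T)
    if case == 1:
        return u
    if case == 2:
        return u ** x
    side = rng.uniform(size=T) < 0.5   # choose a tail for each forecast
    return np.where(side, u ** x, 1.0 - u ** x)

def sample_observations(true_probs):
    """Equation (18): draw binary observations with the true probabilities."""
    return (rng.uniform(size=true_probs.size) <= true_probs).astype(int)

# Reliable setting (section 6.1): observations drawn from the forecast itself
p = make_forecasts(3, T=365)
o = sample_observations(p)
```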

A first examination of the errors related to conventional metrics requires a comparison of the Binomial and Poisson‐Binomial distributions. This is presented in Figure 2, where the Binomial and Poisson‐Binomial probability distributions are shown for each forecast case, with the use of a single bin. A first observation from this figure is that the Binomial distribution is wider than the Poisson‐Binomial distribution for every case. This increased width of the Binomial distribution is expected, as the variance of the Binomial distribution will always be greater than that of the Poisson‐Binomial distribution, except in the case where all probabilities are equal (climatology). This is proven in Appendix A.

Figure 2 also shows that the difference between the Binomial and Poisson‐Binomial distributions increases as forecast sharpness increases, which is supported by the derivation in Appendix A. A wider distribution suggests that simplifying the verification problem, through the use of the Binomial distribution, reduces one's ability to reject the hypothesis of reliability, thus increasing the possibility of type II errors. This error is largest in Case 3, which is the sharpest case. Given that each of the three forecast cases is reliable, Case 3 should be selected, as it provides a reliable forecast with the most certainty. In the event that all cases are unreliable, however, Case 3 is the most likely to be erroneously deemed reliable, as its sharpness widens the Binomial distribution the most. Overall, the single bin analysis shows that use of the Binomial distribution reduces statistical power.

Due to the loss of information caused by simplifying the problem with the Binomial distribution, the binning approach may be used to reduce the effects of forecast variability. In order to assess the effects of binning forecasts, Figure 3 shows the width of the 95% confidence interval for each distribution as a function of the number of bins, where the total width of the confidence interval, summed across all bins, is presented. This figure demonstrates the rapid growth of the 95% confidence interval with an increasing number of bins. Since the grouping process reduces the sample size at each bin, the 95% confidence interval is widened, causing an aggregate effect on the overall determination of reliability. By binning similarly valued forecasts, one vastly reduces the ability to distinguish between reliable and unreliable forecasts, further increasing the chance of type II errors. This loss of information due to binning is especially concerning in the case of hydrometeorological extremes (i.e., floods, droughts, heat waves), which are, by definition, low probability events, making it essential to efficiently use information from every observation. Overall, it is important for forecasts to be verified with as few bins as possible, increasing the effective sample size, thus maximizing one's ability to reject unreliable forecasts.

A further observation from Figure 3 is that forecast sharpness affects the magnitude of approximation errors in the Binomial distribution, even with a large number of bins. It is expected that errors in the Binomial CDF, in comparison to the Poisson‐Binomial CDF, will decrease with an increasing number of forecast bins, as each bin becomes more representative of its members. This is evidenced in Case 1, where the Binomial CDF approaches the Poisson‐Binomial CDF with an increasing number of bins. Alternatively, the Binomial CDF in Case 2 and Case 3 has persistent error even with 10 bins. This result suggests that a large number of bins may be necessary for errors associated with the Binomial CDF to be considered negligible.

Further analysis of the effects of varying the number of bins is performed with respect to the BS_{R} in Figure 4. In this figure, the variability in reliability scores between the three cases is compared with increasing numbers of bins, through 100 replicates of each forecast case. It is expected that the difference between the distributions of BS_{R} values will decrease with an increasing number of bins, due to reduced approximation errors in the Binomial distribution. Since the probabilities within each bin become more homogeneous with an increasing number of bins, the BS_{R} becomes more consistent across varying levels of sharpness. The results here show that the BS_{R} requires around six bins to remove these approximation errors. Although Figure 4 indicates that the within-bin variance is becoming negligible (equation (14)), note that the distribution of reliability values is widening, indicating the loss of information with an increasing number of bins. As was found in Figure 3, the increasing number of bins reduces the statistical power of any verification metric.

A comparison of the BS_{R}, the Binomial distribution, and the Poisson‐Binomial distribution for identifying unreliable forecasts is presented in Figure 5, where the observations are drawn from case 1, but the forecasts are created with cases 2 and 3 with increasing *x* (equations (16) and (17)). The analysis of the Binomial and Poisson‐Binomial distributions in Figure 5 uses a single bin approach, whereas the BS_{R} uses six bins, based on the analysis of Figure 4. In Figure 5, the fraction of 100 forecast replicates that are rejected, with a significance of 95%, is shown with respect to *x*, where the threshold for the BS_{R} was estimated from the results presented in Figure 4 (the threshold for the BS_{R} is set to 0.0043). For case 2, it is clear that the fraction of forecast replicates rejected with the Poisson‐Binomial distribution increases more rapidly than with the Binomial distribution or the BS_{R}. This indicates that the Poisson‐Binomial distribution has the greatest statistical power, the Binomial distribution has a small loss of information, and the BS_{R} has a greater loss of information than the Binomial distribution. This result shows that the Poisson‐Binomial distribution is very effective in rejecting unreliable forecasts that are improperly skewed, and therefore biased, but the results are much different for case 3. In this case, the Poisson‐Binomial and Binomial distributions are largely unable to reject case 3 with an *x* value of 1.25. Alternatively, the BS_{R} approaches a rejection rate of 0.5 with an *x* value of 1.25. This indicates that a multibin approach is required to reject some unreliable forecasts. Although a single bin verification framework minimizes the width of the 95% significance interval, this will only be useful if the forecast is significantly biased, as in case 2. Alternatively, if the forecast is unbiased, yet still unreliable, as in case 3, the errors will go unnoticed without examining separate bins. Further exploration of this scenario is performed with the Reliability Diagram.

The Reliability Diagram for the scenarios explored in Figure 5 is presented in Figure 6. In this figure, the top row shows the median Reliability Diagram of all 100 replicates for increasing skew (case 2), with associated 95% significance intervals from the Binomial distribution, and the second row shows the rejection rate for each bin. Likewise, the bottom two rows show the same information for case 3. Each Reliability Diagram uses six bins, following the analysis presented in Figure 4. With respect to case 2, the median Reliability Diagram steadily approaches the upper limit of the significance interval at the lower bins with increasing *x* values. This translates into increasingly frequent exceedance of the significance interval for these bins, as shown in the second row of Figure 6. Note that this frequency increases at a rate similar to that of the BS_{R}, which indicates a similar level of statistical power. With respect to case 3, the Reliability Diagram shows increasing deviations at the outer probabilities with increasing *x*, but remains reliable at the medial probabilities. These deviations at the outer probabilities occur at a similar rate on both sides, keeping the forecast unbiased. Although it is clear that this forecast is unreliable from the multibin perspective, single bin analysis is unable to diagnose these errors. Therefore, it is necessary to use a multibin approach when examining the reliability of event forecasts. This motivates the development of a new framework for testing the hypothesis of reliability.

In order to overcome the inability of the single bin analysis to effectively reject forecasts with unbiased, yet unreliable probabilistic residuals, a multibin verification framework must be developed. Since the multibin approach was shown to reduce statistical power, a two-stage approach is proposed: (1) use a single bin analysis to maximize the ability to reject biased probabilistic forecasts, and (2) use a multibin approach to assess unbiased, yet unreliable, probabilistic forecasts. Within this framework, a few considerations must be made. First, the significance level applied to each individual test must be reduced, so that the overall significance level across all tests remains at the desired value.

If any of the *B*+1 hypothesis tests (the single bin test and the *B* bin-level tests) rejects the null hypothesis, then the hypothesis of reliability is rejected at the overall significance level.
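A sketch of the two-stage procedure is given below, reusing the `reject_reliability` helper from the earlier Poisson-Binomial sketch. The per-test significance adjustment shown (a Šidák-type correction spreading the overall level across the *B* + 1 tests) is one plausible choice and an assumption of this illustration, as are the bin edges, which follow equation (8).

```python
import numpy as np
# assumes reject_reliability(obs, probs, alpha) from the earlier sketch

def two_stage_reliability_test(probs, obs, B=2, alpha=0.05):
    """Two-stage hypothesis test of reliability: stage 1 applies the single
    bin Poisson-Binomial test to catch biased forecasts; stage 2 repeats the
    test within B bins to catch unbiased, yet unreliable, forecasts."""
    probs, obs = np.asarray(probs, float), np.asarray(obs, int)
    a_test = 1.0 - (1.0 - alpha) ** (1.0 / (B + 1))  # per-test level (assumed)
    # Stage 1: single bin analysis over all forecasts
    if reject_reliability(obs, probs, alpha=a_test):
        return True
    # Stage 2: multibin analysis, bins assigned per equation (8)
    for b in range(1, B + 1):
        in_bin = (probs > (b - 1) / B) & (probs <= b / B)
        if in_bin.any() and reject_reliability(obs[in_bin], probs[in_bin],
                                               alpha=a_test):
            return True
    return False
```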

The multibin stage of the analysis will require the forecaster to determine the appropriate number of bins for verification. A first note is that only even numbers of bins should be considered, as an odd number of bins will have a bin centered around 0.5, which will be sensitive to bias in the forecasted probabilities, and therefore will be unlikely to provide additional information beyond the single bin analysis. In addition, the forecaster should consider the nature of the probabilistic forecast errors when performing the analysis, which requires a discussion of the generation of probabilistic event forecasts.

Probabilistic event forecasts will typically be created from probabilistic forecasts of continuous variables (e.g., precipitation, streamflow, soil moisture). This necessitates forecasting a continuous probability density. From this density, the forecasted event probability will be the portion of the continuous forecast density exceeding some predefined threshold. Given that the forecast is unbiased, yet unreliable, the most common problem will be continuous forecast densities that have improper variance, leading to an event forecast that is overly certain or uncertain. Such a scenario can be assessed with only two bins, centered at probabilities of 0.25 and 0.75. In the event that the underlying continuous forecast density is unbiased and has proper variance, yet has improperly set higher moments (e.g., skew and kurtosis), the two-bin analysis will be unable to reject the hypothesis of reliability. Although this situation poses a potential problem for two-bin analysis, the combination of unbiased forecasts with properly set variance, in conjunction with improper higher-order moments, is expected to be rare. Beyond this assumption of rarity, identifying unreliable forecasts with errors in higher-order moments will require a greater number of bins. With this increase in the required number of bins, the necessary number of observations to reject the null hypothesis will grow rapidly. Due to this increase in the required number of observations, an analysis was performed to determine the minimum number of observations that must be available to warrant analysis with different numbers of bins.

In this analysis, the minimum number of observations necessary to reject the hypothesis of reliability for different numbers of bins was estimated. A function was developed that calculates the observation frequency for each bin ($\bar{o}_b$) corresponding to the maximum probabilistic residuals that an unbiased, yet unreliable, forecast can produce.

With the maximum probabilistic residuals provided by this function, the required number of observations in each bin, *N*, may be estimated. *N* is increased in increments of one until the maximum probabilistic residual falls outside the 95% significance interval of the Binomial distribution, giving the minimum number of observations at which the hypothesis of reliability can be rejected. This minimum grows rapidly from *B* = 2 to *B* = 4 to *B* = 6. Due to the rarity of scenarios in which more than two bins are warranted, and the rapid growth in the minimum number of observations required to reject the hypothesis of reliability, this study proposes that two bins are prudent for the majority of cases.
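The sample-size search can be illustrated as follows; the worst-case residual supplied to the search is a placeholder argument, since the function generating it is not reproduced here, and all names are assumptions of this sketch.

```python
from scipy.stats import binom

def min_observations_to_reject(p_bar, max_residual, alpha=0.05, n_max=10000):
    """Smallest per-bin sample size N at which a bin-averaged forecast
    probability p_bar, paired with a worst-case observed frequency of
    p_bar + max_residual, falls outside the two-sided Binomial significance
    interval; N is increased in increments of one until the test rejects."""
    target = p_bar + max_residual           # worst-case observed frequency
    for N in range(1, n_max + 1):
        K = round(target * N)               # nearest attainable event count
        lower = binom.cdf(K, N, p_bar)
        upper = 1.0 - (binom.cdf(K - 1, N, p_bar) if K > 0 else 0.0)
        if min(lower, upper) <= alpha / 2.0:
            return N
    return None                             # not rejectable within n_max

# Example: a bin centered at 0.25 with a worst-case residual of 0.1
print(min_observations_to_reject(0.25, 0.1))
```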

The proposed verification framework is compared to the BS_{R} (with six bins), the Poisson‐Binomial distribution, and the Binomial distribution in Figure 7. This figure presents results in the same format as Figure 5, to ensure consistency in the analysis. From Figure 7, it is clear that the proposed methodology (green line) is comparable to the single bin analysis of the Poisson‐Binomial distribution (blue line) for case 2 (solid lines), indicating minimal loss of information when adding a second verification stage. The minor loss of information that does occur is due to the required reduction of the significance level of the single bin test. The proposed technique still outperforms both the BS_{R} and the Binomial distribution, indicating that it is an effective means of rejecting biased probabilistic residuals.

With respect to case 3, the proposed method shows the ability to reject the unreliable forecasts. As expected, the rejection rate increases as the forecasts become increasingly unreliable. Further, unreliable forecasts are rejected more rapidly with the proposed method than with the BS_{R}, which indicates that this method provides more statistical power than the BS_{R}. As the BS_{R} and Reliability Diagram were found to reject unreliable forecasts at a similar rate, it can be concluded that the proposed methodology is more effective than the Reliability Diagram in rejecting unreliable forecasts as well. One caveat is that visualization with a Reliability Diagram is useful in diagnosing the form of forecast errors (i.e., overly sharp or insufficiently sharp forecasts), and therefore this methodology will never entirely replace the Reliability Diagram for examining the cause of forecast errors.

In order to assess the utility of the proposed verification framework on real forecasts, a case study with National Weather Service (NWS) 12 h probability of precipitation forecasts was performed. Probability of precipitation is regularly forecasted by the National Weather Service throughout the United States. These data are archived in the National Digital Forecast Database (NDFD), which may be accessed through the National Operational Model Archive & Distribution System (NOMADS).

The NWS probability of precipitation forecasts have been well studied [*Bickel et al*., ], and were found to be unreliable, as a whole, throughout the US. Since the forecasts are known to be unreliable, the aim in this section is to compare the ability of the proposed two-stage verification method, the Reliability Diagram, and the BS_{R} to reject the forecasts. In order to compare the statistical power of these techniques, the number of observations required (ranging from 10 to 365) to reject the hypothesis of reliability, for both the proposed two-stage approach and the Reliability Diagram, is compared in Figure 8. In this figure, the horizontal axis shows the number of observations required by the Reliability Diagram to reject the hypothesis of reliability, with 95% significance, the vertical axis shows this information for the proposed approach, and the black line is the one-to-one line. Note that all but five points lie below the one-to-one line, indicating that for 211 locations, the Reliability Diagram requires more verifying observations than the proposed two-stage approach to determine that the forecast is unreliable. This indicates that the proposed approach has more statistical power than the Reliability Diagram, allowing for rejection of unreliable forecasts with fewer forecast and observation pairs. Further, this suggests that the assumption of two bins being sufficient in the multibin stage of the proposed approach is valid for this application.

In order to compare the BS_{R} and the proposed two-stage approach, Figure 9 shows the histogram of BS_{R} values for reliable forecasts (top plot) and for unreliable forecasts (bottom plot). The reliable forecasts in Figure 9 are sampled from each of the 216 forecast sets for which the proposed two-stage approach is unable to reject the hypothesis of reliability. For unreliable forecasts, all forecasts for which the proposed two-stage approach was capable of rejecting the hypothesis of reliability were examined. From the two histograms in Figure 9, it is observed that unreliable forecasts have a higher occurrence of large BS_{R} values, which is expected. Although the unreliable forecasts tend to display larger BS_{R} values than the reliable forecasts, many of the unreliable forecasts have very low BS_{R} values, indicating that the BS_{R} may not always be capable of distinguishing between reliable and unreliable forecasts. Given that the BS_{R} is an approximation of the six bin approach used in the Reliability Diagram, it is expected that the BS_{R} will be less powerful than the Reliability Diagram, and therefore less powerful than the proposed two-stage approach. Overall, this real forecast verification experiment suggests that the proposed two-stage approach is the strictest criterion for determining forecast reliability, supporting the findings from the numerical experiments presented in section 8.

Probabilistic forecasting of events has become an important tool for representing uncertainty in hydrometeorological applications, allowing forecasters to communicate the certainty of an event occurring. Assuming that these forecasted probabilities are reliable, the end user of that forecast can effectively manage the risk of that event occurring. This necessitates verification that the forecast is reliable, to ensure that event mitigation decisions are based on correct information. This has motivated the exploration of reliability assessment in this study.

From a theoretical standpoint, this article showed that the Poisson‐Binomial distribution is an exact model of the probabilistic verification setting. Although the Poisson‐Binomial distribution is ideal for assessing reliability, it is absent from the hydrometeorological forecast verification literature. Conventional verification tools are based on the Binomial distribution, as an approximation of the Poisson‐Binomial distribution. Beyond the Binomial approximation, these tools make further approximations to develop single valued scores (BS_{R}) and diagrams (Reliability Diagram). This creates two layers of approximations, which have the potential to create errors in reliability assessment. Quantifying the errors resulting from these approximations is a central focus in this article.

The approximation of the Poisson‐Binomial distribution, via the Binomial distribution, was found to be a balance between the number of bins and forecast variability. As forecast variability increases, the necessary number of bins increases, but this increasing number of bins leads to a loss of information. By breaking up the verification problem into multiple separate bins, the sample size in each bin is reduced, leading to a loss of statistical power in rejecting unreliable forecasts. Beyond the underlying Binomial approximation, the BS_{R} was found to further reduce the ability to reject unreliable forecasts. Being based on the binning approach, the BS_{R} has an upper limit of accuracy equal to the Binomial distribution, but imposes a normal approximation of the Binomial distribution, which further reduces the statistical power at any practical number of bins. In addition, thresholds of acceptability (significance levels) for the BS_{R} have no analytical solution, and therefore require sampling to estimate for any number of bins and sample size. Accurate estimation of BS_{R} thresholds is possible in the numerical experiments, but will be difficult for real forecasts. Similarly, the Reliability Diagram is an approximation of the Binomial distribution, except in the case discussed in *Bröcker and Smith* []. These approximations generally reduce the ability to differentiate between reliable and unreliable forecasts.

This article presented experiments that support the hypothesis that the Poisson‐Binomial distribution maximizes the forecaster's ability to reject unreliable forecasts. The exception to this conclusion was a forecast that is unreliable, yet unbiased. Although the single bin Poisson‐Binomial distribution maximizes the ability to reject biased forecasts, a single bin is insufficient when the unreliable forecast is unbiased. Solving this problem requires a multibin approach, motivating the development of a new verification framework. A two-stage verification framework was proposed, where a single bin analysis is used to maximize the ability to reject biased forecasts, followed by a two-bin approach to reject unbiased, yet unreliable forecasts. Results in section 8 suggest that the proposed framework is effective in identifying both biased and unbiased unreliable forecasts. Further, an examination of a real probabilistic forecast, the NWS 12 h probability of precipitation forecasts, supported the finding that the two-stage approach to reliability assessment, via the Poisson‐Binomial distribution, is more powerful in determining reliability than the BS_{R} and the Reliability Diagram. One caveat is that this method could benefit from further testing in more real data experiments, as the single real case study examined may not be representative of all forecasts. Although more testing is suggested to confirm these findings, the two-stage approach, via the Poisson‐Binomial distribution, was found to be the most statistically powerful of all verification methodologies examined, and is therefore suggested for use when assessing the reliability of probabilistic event forecasts.

The variance of the Poisson‐Binomial distribution is $\sigma_{PB}^2 = \sum_{t=1}^{T} p_t (1 - p_t)$, and the variance of the Binomial distribution is provided in equation (A1).

$$\sigma_{B}^2 = T \bar{p} \left( 1 - \bar{p} \right) \quad (A1)$$

This study suggested that the Poisson‐Binomial distribution will have more statistical power than the Binomial distribution, except when all forecasted probabilities are equal, and therefore the inequality in equation (A2) must be proven.

$$T \bar{p} \left( 1 - \bar{p} \right) \ge \sum_{t=1}^{T} p_t \left( 1 - p_t \right) \quad (A2)$$

Equation (A2) may then be expanded to equation (A3).

$$T \bar{p} - T \bar{p}^2 \ge \sum_{t=1}^{T} p_t - \sum_{t=1}^{T} p_t^2 \quad (A3)$$

By definition, $T \bar{p} = \sum_{t=1}^{T} p_t$, so the leading terms on each side of equation (A3) cancel, reducing the inequality to equation (A4).

$$\sum_{t=1}^{T} p_t^2 \ge T \bar{p}^2 \quad (A4)$$

At this point, the left-hand side of this equation may be expanded according to equation (A5), as there will be a set of residuals $\epsilon_t = p_t - \bar{p}$ that, by construction, sum to zero.

$$\sum_{t=1}^{T} p_t^2 = \sum_{t=1}^{T} \left( \bar{p} + \epsilon_t \right)^2 = T \bar{p}^2 + 2 \bar{p} \sum_{t=1}^{T} \epsilon_t + \sum_{t=1}^{T} \epsilon_t^2 = T \bar{p}^2 + \sum_{t=1}^{T} \epsilon_t^2 \quad (A5)$$

Equation (A6) can be found by substituting the right-hand side of equation (A5) into equation (A4) and subtracting $T \bar{p}^2$ from both sides.

$$\sum_{t=1}^{T} \epsilon_t^2 \ge 0 \quad (A6)$$

Equation (A6) will only reach equality in the event that all $\epsilon_t = 0$, that is, when all forecasted probabilities are equal. In all other cases, the variance of the Binomial distribution is strictly greater than that of the Poisson‐Binomial distribution, confirming that the Binomial approximation widens the distribution and reduces statistical power.
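The inequality can also be checked numerically; the snippet below is an illustration written for this appendix, not part of the original proof.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(size=1000)            # arbitrary forecast probabilities
T, p_bar = p.size, p.mean()

var_pb = np.sum(p * (1 - p))          # Poisson-Binomial variance
var_b = T * p_bar * (1 - p_bar)       # Binomial variance, equation (A1)
assert var_b >= var_pb                # the inequality of equation (A2)
# The gap equals the sum of squared residuals, per equations (A5) and (A6)
print(var_b - var_pb, np.sum((p - p_bar) ** 2))
```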

The authors would like to acknowledge the financial support provided by NOAA-MAPP grant NA11OAR4310140 and NOAA-CSTAR grant NA11NWS4680002. All data were gathered from the National Digital Forecast Database (NDFD), which was accessed through the National Operational Model Archive & Distribution System (NOMADS).