We investigate the ability of hydrological multimodel ensemble predictions to enhance the skill of streamflow forecasts at short‐ to medium‐range timescales. To generate the multimodel ensembles, we implement a new statistical postprocessor, namely, quantile regression‐Bayesian model averaging (QR‐BMA). QR‐BMA uses quantile regression to bias correct the ensemble streamflow forecasts from the individual models and Bayesian model averaging to optimally combine their probability density functions. Additionally, we use an information‐theoretic measure, namely, conditional mutual information, to quantify the skill enhancements from the multimodel forecasts. We generate ensemble streamflow forecasts at lead times from 1 to 7 days using three hydrological models: (i) Antecedent Precipitation Index‐Continuous, (ii) Hydrology Laboratory‐Research Distributed Hydrologic Model, and (iii) Weather Research and Forecasting Hydrological modeling system. As forcing to the hydrological models, we use weather ensemble forecasts from the National Centers for Environmental Prediction 11‐member Global Ensemble Forecast System Reforecast version 2. The forecasting experiments are performed for four nested basins of the North Branch Susquehanna River, USA. We find that after bias correcting the streamflow forecasts from each model, their skill performance becomes comparable. We find that the multimodel ensemble forecasts have higher skill than the best single‐model forecasts. Furthermore, the skill enhancements obtained by the multimodel ensemble forecasts are found to be dominated by model diversity, rather than by increased ensemble size alone. This result, obtained using conditional mutual information, indicates that each hydrological model contributes additional information to enhance forecast skill. Overall, our results highlight the benefits of hydrological multimodel forecasting for improving streamflow predictions.

Multimodel forecasting is a well‐established technique in atmospheric science (Bosart, ; Gyakum, ; Krishnamurti, ; Sanders, ; Weisheimer et al., ), which consists of using the outputs from several models to make and improve predictions about future events (Fritsch et al., ). The motivation for multimodel forecasting is that for a complex system, such as the atmosphere or a river basin, composed of multiple processes interacting nonlinearly and with limited observability, predictions solely based on the outputs from a single model will be prone to errors and biases (Fritsch et al., ). Indeed, early experiments comparing blended forecasts from different weather models against single‐model predictions demonstrated the ability of multimodel predictions to improve the skill and reduce the errors of weather forecasts (Bosart, ; Gyakum, ; Sanders, ; Thompson, ; Winkler et al., ). This was found to be the case both for forecasts issued by humans (Sanders, , ) and for those from numerical models (Bosart, ; Fraedrich & Leslie, ; Fraedrich & Smith, ; Fritsch et al., ; Gyakum, ; Krishnamurti et al., , ; Sanders, ).

Initial meteorological multimodel experiments accounted for model‐related uncertainties but not for uncertainties in the initial states. To account for the latter, multimodel ensembles were introduced, where multiple ensemble members from individual models are generated for the same lead time and geographic area by perturbing the models' initial states (Hamill & Colucci, ; Stensrud et al., ; Toth & Kalnay, ). An illustrative example of a recent, successful multimodel framework is the North American Multimodel Ensemble experiment for subseasonal to seasonal timescales (Bastola et al., ; Becker et al., ; Kirtman et al., ). Indeed, most of the established operational systems across the globe for short‐ to medium‐range weather forecasting are multimodel, multiphysics ensemble systems (Buizza et al., ; Du et al., ; Hamill et al., ; Palmer et al., ). In contrast, hydrological multimodel ensemble prediction systems (HMEPS) have not been widely implemented and remain an underexplored area of research. To our knowledge, there is currently no operational HMEPS in the world, despite their success in weather (Hagedorn et al., ; Hamill et al., ) and climate forecasting (Bastola et al., ; Becker et al., ; Kirtman et al., ).

HMEPS can be classified into the following three general categories, depending on whether multiple weather and/or hydrological models are used: (i) a single hydrological model forced by outputs from multiple numerical weather prediction (NWP) models (Thirel et al., , ), (ii) multiple hydrological models forced by outputs from a single NWP model (Randrianasolo et al., ), and (iii) multiple hydrological models forced by outputs from multiple NWP models (Velázquez et al., ). As is the case in meteorology, hydrological multimodel outputs can be deterministic or probabilistic, depending on how many ensembles are generated from each model and the manner in which they are generated (Davolio et al., ). It is important to note that although hydrological multimodel approaches have been investigated before (Ajami et al., ; Duan et al., ; Vrugt & Robinson, ), the vast majority of those studies have been performed in simulation mode (i.e., by forcing the hydrological models with observed weather variables), as opposed to forecasting mode. Simulation studies may provide useful information about near‐real‐time hydrological forecasting conditions. However, at medium‐range timescales (≥ 3 days), where weather uncertainties tend to be as important as, or more dominant than, hydrological uncertainties, hydrological simulations provide considerably less information about forecast behavior (Sharma et al., ; Siddique & Mejia, ).

One of the earliest attempts at hydrological multimodel prediction is that of Shamseldin and O'Connor (). They combined streamflow simulations from different rainfall‐runoff models by assigning different weights to the models based on their performance during historical runs. Since then, several simulation studies have been performed to address the potential of hydrological multimodel approaches to improve understanding and prediction of hydrological variables (Ajami et al., ; Bohn et al., ; Duan et al., ; Georgakakos et al., ; Regonda et al., ; Vrugt & Robinson, ). In hydrological forecasting, recent implementations of the multimodel approach have been focused on seasonal or longer timescales (Nohara et al., ; Yuan & Wood, ), while very few studies are available at short‐ to medium‐range timescales (Hopson & Webster, ; Velázquez et al., ). Furthermore, a shortcoming of the latter studies has been the use of similar hydrological models to generate the multimodel forecasts. For example, Hopson and Webster () as well as Velázquez et al. () used similar spatially lumped or semidistributed hydrological models for their respective multimodel experiments.

To maximize the benefits from a multimodel approach, it is critical to use dissimilar models (Thompson, ), a property that is referred to as model diversity (DelSole et al., ). In hydrological science, different model types are available that could be used to fulfill model diversity, for example, spatially lumped, spatially distributed, process‐based, or land‐surface models (Reed et al., ; Smith et al., ). These different types of models tend to differ markedly in their spatial discretization, physical parameterizations, and numerical schemes (Kollet et al., ), potentially making them good candidates for multimodel forecasting. Another important concern with the multimodel approach is that of distinguishing whether any gains in skill from the multimodel are due to model diversity itself or are related to increases in the ensemble size. Recently, an information‐theoretic measure, namely, conditional mutual information (*CMI*), was proposed to address this issue in climate forecasts (DelSole et al., ). *CMI* is implemented here for the first time with hydrological multimodel forecasts.

Any multimodel forecast requires some type of statistical technique (with simple averaging being the most basic approach; DelSole, ; DelSole et al., ) or postprocessor (Duan et al., ; Fraley et al., ; Gneiting et al., ; Raftery et al., ) to optimally combine the ensemble forecasts from the individual models. Multimodel postprocessing is typically employed to accomplish several objectives: (i) reduce systematic biases in the outputs from each model, (ii) assign each model a weight that measures its contribution to the final multimodel forecast, and (iii) quantify the overall forecast uncertainty. Although a number of multimodel postprocessors have been developed and implemented for dealing with hydrological simulations (Duan et al., ; Hsu et al., ; Madadgar & Moradkhani, ; Najafi et al., ; Shamseldin et al., ; Steinschneider et al., ; Vrugt & Robinson, ; Xiong et al., ), few have been applied in a forecasting context (Hopson & Webster, ). In this study, we implement a new quantile regression‐Bayesian model averaging (QR‐BMA) postprocessor. The postprocessor uses QR to bias correct the streamflow forecasts from the individual models (Sharma et al., ) and BMA to optimally combine their probability density functions (pdfs; Duan et al., ; Vrugt & Robinson, ). QR‐BMA takes advantage of the proven effectiveness and simplicity of QR to remove systematic biases (Gomez et al., ; Sharma et al., ) and of BMA to produce optimal weights (Duan et al., ; Liang et al., ).

Our primary goal with this study is to understand the ability of hydrological multimodel ensemble predictions to improve the skill of streamflow forecasts at short‐ to medium‐range timescales. With this goal, we seek to answer the following two main questions: Are multimodel ensemble streamflow forecasts more skillful than single‐model forecasts? Are any skill improvements from the multimodel ensemble streamflow forecasts dominated by model diversity or the addition of new ensemble members (i.e., increasing ensemble size)? Answering the latter is relevant to operational forecasting because generating many ensemble members in real time is often not feasible or realistic and may not be as effective if skill enhancements are dominated by model diversity. The paper is structured as follows: we first describe our methodology and the experimental setup, then present the main results and their implications, and lastly summarize our conclusions.

The proposed postprocessor uses QR to bias correct the ensemble forecasts from individual models and BMA to combine the bias‐corrected forecasts. We begin by briefly revisiting the BMA technique. BMA generates an overall forecast pdf by taking a weighted average of the conditional pdfs associated with the individual model forecasts. Letting Δ be the forecasted variable, *D* the training data, and *M* = [*M*_{1}, *M*_{2}, …, *M*_{K}] the independent predictions from a total of *K* hydrological models, the pdf of the BMA probabilistic prediction of Δ can be expressed by the law of total probability as

$$P(\Delta \mid D) = \sum_{k=1}^{K} P(\Delta \mid M_{k})\, P(M_{k} \mid D),$$

where *P*(Δ| *M*_{k}) is the posterior distribution of Δ given the model prediction *M*_{k} and *P*(*M*_{k}| *D*) is the posterior probability of model *M*_{k} being the best one given the training data *D*. *P*(*M*_{k}| *D*) reflects the performance of model *M*_{k} in predicting the forecast variable during the training period.

The posterior model probabilities are nonnegative and add up to one (Raftery et al., ), such that

$$\sum_{k=1}^{K} P(M_{k} \mid D) = 1.$$

Thus, *P*(*M*_{k}| *D*) can be viewed as the model weight, *w*_{k}, reflecting an individual model's relative contribution to predictive skill over the training period. The BMA pdf is therefore a weighted average of the conditional pdfs associated with each of the individual model forecasts, weighted by their posterior model probabilities. Since model predictions are time variant, letting *t* be the forecast lead time, the BMA pdf can be written as

$$P(\Delta \mid D, t) = \sum_{k=1}^{K} w_{k,t}\, P(\Delta \mid M_{k,t}),$$

where *w*_{k,t} is the weight of model *k* at lead time *t*.

The efficient application of BMA requires bias correcting the ensemble forecasts from the individual models and optimizing their weights.

To implement QR, the bias‐corrected ensemble forecasts from each model *k* and forecast lead time *t* are expressed as a linear function of the raw forecasts *f*_{k,t}, with a separate regression for each quantile *τ*, defined as

$$\hat{f}^{\,\tau}_{k,t} = a^{\tau}_{k,t} + b^{\tau}_{k,t}\, f_{k,t}.$$

In the above equation, *a*^{τ}_{k,t} and *b*^{τ}_{k,t} are the regression parameters for model *k* and quantile interval *τ* at time *t*. The parameters associated with each model are determined separately by minimizing the sum of the residuals from a training data set as follows:

$$\left(a^{\tau}_{k,t},\, b^{\tau}_{k,t}\right) = \underset{a,\,b}{\arg\min} \sum_{j=1}^{J} \rho_{\tau}\!\left(o_{j,t} - a - b\, f_{j,k,t}\right),$$

where *o*_{j,t} and *f*_{j,k,t} are the *j*th paired observation and forecast samples from a total of *J* samples at lead time *t*, and *ρ*_{τ}(.) is the check function for the *τ*th quantile at time *t*, defined as

$$\rho_{\tau}(u) = u\left(\tau - I\{u < 0\}\right),$$

where *I*{.} is the indicator function and *τ* ∈ [0, 1]. The resulting minimization problem is solved using linear programming via the interior point method (Koenker, ). Note that the *τ* values were chosen to cover the domain [0, 1] sufficiently well, so that the lead time‐specific error estimate is a continuous distribution. Specifically, the number of *τ* values was based on the number of ensemble members required by a particular forecasting experiment, and the values were chosen to vary uniformly between 0.06 and 0.96.
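As a concrete illustration of this step, the following minimal Python sketch fits the linear quantile model by directly minimizing the check‐function (pinball) loss on synthetic forecast‐observation pairs. The function names, synthetic data, and the use of a general‐purpose optimizer (rather than the interior point linear programming method cited above) are illustrative choices, not the implementation used in this study.

```python
import numpy as np
from scipy.optimize import minimize

def pinball_loss(params, f, o, tau):
    """Sum of check-function residuals for a linear quantile model o ~ a + b*f."""
    a, b = params
    u = o - (a + b * f)
    return np.sum(u * (tau - (u < 0)))

def fit_quantile(f, o, tau):
    """Estimate the intercept and slope of the tau-th conditional quantile
    by minimizing the pinball loss with a derivative-free optimizer."""
    res = minimize(pinball_loss, x0=np.array([0.0, 1.0]),
                   args=(f, o, tau), method="Nelder-Mead")
    return res.x

# Synthetic training pairs: raw forecasts f and observations o
rng = np.random.default_rng(42)
f = rng.uniform(0.0, 10.0, 500)
o = 2.0 + 1.5 * f + rng.normal(0.0, 1.0, 500)

a_hat, b_hat = fit_quantile(f, o, tau=0.5)
corrected = a_hat + b_hat * f  # bias-corrected forecasts at the median
```

In practice one such regression is fit per model, lead time, and quantile, yielding a discrete approximation of the bias‐corrected forecast distribution.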

After bias correcting the single‐model forecasts using the QR equations above, the posterior distribution of each model is assumed Gaussian. Thus, before estimating the BMA weights, both the observations and bias‐corrected forecasts are transformed into standard normal deviates using the normal quantile transformation (NQT; Krzysztofowicz, ). The NQT matches the empirical cumulative distribution function (cdf) of the marginal distribution to the standard normal distribution such that

$$z_{k,t} = G^{-1}\!\left(cdf\!\left(\hat{f}_{k,t}\right)\right),$$

where *cdf*(.) is the cdf of the bias‐corrected forecasts from model *k* at time *t*; *G* is the standard normal distribution and *G*^{−1} its inverse; and *z*_{k,t} is the resulting normal deviate for model *k* at time *t*. When applying the NQT, extrapolation is used to model the tails of the forecast distribution for those cases where a sampled data point in normal space falls outside the range of the training data maxima or minima. For the upper tail, a hyperbolic distribution (Journel & Huijbregts, ) is used while linear extrapolation is used for the lower tail.
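The NQT itself can be sketched in a few lines of Python. This is a minimal, assumption‐laden version (empirical cdf via Weibull plotting positions, no tail extrapolation); the function name and plotting‐position choice are illustrative.

```python
import numpy as np
from scipy.stats import norm, rankdata

def nqt(x):
    """Normal quantile transform: map each sample to a standard-normal deviate
    through its empirical CDF, using Weibull plotting positions r/(n+1)."""
    p = rankdata(x) / (len(x) + 1.0)  # non-exceedance probabilities in (0, 1)
    return norm.ppf(p)                # z = G^{-1}(cdf(x))

rng = np.random.default_rng(0)
flows = rng.lognormal(mean=2.0, sigma=1.0, size=1000)  # skewed, streamflow-like
z = nqt(flows)  # approximately standard normal
```

The plotting positions keep probabilities strictly inside (0, 1), so the inverse Gaussian cdf never returns infinities; handling values outside the training range would require the tail extrapolation described above.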

Lastly, to determine the BMA probabilistic prediction, the weight *w*_{k,t} and variance *σ*^{2}_{k,t} of each model *k* at the forecast lead time *t* are estimated using the log likelihood function. Setting the parameter vector *θ* = {*w*_{k,t}, *σ*^{2}_{k,t}; *k* = 1, …, *K*}, the log likelihood function at the forecast lead time *t* is approximated as

$$\ell(\theta) = \sum_{j=1}^{J} \log \left[ \sum_{k=1}^{K} w_{k,t}\; g\!\left(\Delta_{j} \mid z_{j,k,t},\, \sigma^{2}_{k,t}\right) \right],$$

where *g*(.) denotes a Gaussian pdf centered on the bias‐corrected forecast in normal space. The optimal values of *θ* are determined using the expectation maximization (EM) optimization algorithm (Bilmes, ). The steps required to implement the EM algorithm are provided in Appendix A. Finally, discrete ensembles are sampled from the postprocessed predictive distribution using the equidistant quantiles sampling approach (Schefzik et al., ).
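The EM steps referenced above (detailed in Appendix A) follow the standard Gaussian mixture pattern and can be sketched as follows. This is a simplified illustration, not the study's implementation: the kernels are centered directly on each model's forecasts, the synthetic data and function names are invented for the example, and numerical safeguards are minimal.

```python
import numpy as np
from scipy.stats import norm

def bma_em(F, o, n_iter=200):
    """EM estimation of BMA weights and variances for Gaussian kernels centred
    on each model's bias-corrected forecasts. F has shape (K models, J cases)."""
    K, J = F.shape
    w = np.full(K, 1.0 / K)                # start from equal weights
    s2 = np.var(o - F, axis=1) + 1e-6      # initial per-model error variances
    for _ in range(n_iter):
        # E-step: responsibility of each model's kernel for each training case
        dens = w[:, None] * norm.pdf(o[None, :], loc=F, scale=np.sqrt(s2)[:, None])
        z = dens / dens.sum(axis=0, keepdims=True)
        # M-step: update weights and variances from the responsibilities
        w = z.mean(axis=1)
        s2 = (z * (o[None, :] - F) ** 2).sum(axis=1) / z.sum(axis=1) + 1e-12
    return w, s2

rng = np.random.default_rng(1)
J = 400
truth = rng.gamma(2.0, 3.0, J)                   # synthetic observations
F = np.vstack([truth + rng.normal(0, 0.5, J),    # accurate model
               truth + rng.normal(0, 3.0, J)])   # noisier model
w, s2 = bma_em(F, truth)  # the accurate model should receive the larger weight
```

As expected, the model with the smaller forecast errors is rewarded with the larger posterior weight and the smaller kernel variance.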

Our proposed QR‐BMA approach consists of implementing the above equations in sequence. To apply QR‐BMA, we used a leave‐one‐out approach where part of the forecast data set was used to train QR‐BMA and the rest to verify the multimodel ensemble forecasts. We applied QR‐BMA at each forecast lead time *t* of interest for selected forecast locations. As part of our forecast experiments, we generated both single‐model and multimodel ensemble forecasts. The single‐model streamflow forecasts were generated by forcing each hydrological model with the GEFSRv2 weather ensemble forecasts, while the multimodel forecasts were generated using the QR‐BMA technique to optimally combine the single‐model forecasts. The single‐model forecasts were postprocessed using QR, following the same leave‐one‐out approach used with QR‐BMA. Note that QR‐BMA was applied here independently at each lead time; thus, it is suitable for generating forecasts when predictions are needed for a single lead time.

*CMI* is used as a measure of skill improvement following the approach by DelSole et al. (). The approach makes it possible to distinguish whether multimodel skill improvements are dominated by model diversity (i.e., additional information provided by the different models) or increased ensemble size (i.e., the addition of new ensemble members). To present the *CMI* measure, we first introduce three related information‐theoretic measures: entropy, conditional entropy, and mutual information (*MI*).

In the case of a continuous random variable (e.g., the streamflow forecasts *F* with pdf *P*(*f*), where uppercase is used to denote the random variable and lowercase its realizations), the average amount of information required to describe *F* is given by the entropy Η(*F*), defined as

$$H(F) = -\int P(f)\, \log P(f)\, df.$$

Entropy measures the uncertainty of *F* (Cover & Thomas, ). The entropy of a random variable conditional upon the knowledge of another can be defined by the conditional entropy. The conditional entropy between the streamflow observations *O* and forecasts *F* can be calculated using the chain rule:

$$H(O \mid F) = H(O, F) - H(F).$$

Combining the entropy and conditional entropy, the *MI* between the streamflow observations and the forecasts, *MI*(*O*; *F*), is given by (Cover & Thomas, )

$$MI(O; F) = \iint P(o, f)\, \log \frac{P(o, f)}{P(o)\, P(f)}\, do\, df,$$

where *P*(*o*, *f*) is the joint pdf of *O* and *F*, with marginal pdfs *P*(*o*) and *P*(*f*), respectively. *MI* is an elegant and powerful measure of the amount of information that one random variable contains about another. It is nonnegative and equal to zero if and only if *O* and *F* are independent of each other. *MI* has several important benefits: it is a domain‐independent measure, such that the information provided is relatively insensitive to the size of data sets and outliers, unaffected by systematic errors, and invariant to any nonlinear transformations of the variables (Cover & Thomas, ; Kinney & Atwal, ).
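For jointly Gaussian variables, the integral above collapses to a closed form in the correlation, MI = −½ ln(1 − ρ²), which is the form used throughout the skill analysis below. A small sketch (illustrative names and synthetic data) checks this against a sample estimate:

```python
import numpy as np

def gaussian_mi(rho):
    """MI (in nats) between two jointly Gaussian variables with correlation rho."""
    return -0.5 * np.log(1.0 - rho ** 2)

# Sample-based check: simulate observations o and forecasts f with corr = 0.8
rng = np.random.default_rng(7)
o = rng.normal(size=5000)
f = 0.8 * o + np.sqrt(1.0 - 0.8 ** 2) * rng.normal(size=5000)
rho_hat = np.corrcoef(o, f)[0, 1]
mi_hat = gaussian_mi(rho_hat)  # close to gaussian_mi(0.8) ~ 0.51 nats
```

Because MI is invariant to monotonic transformations, the same value is obtained whether the calculation is done in streamflow space or after the NQT, provided the joint distribution is Gaussian in the transformed space.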

In the case of multimodel combinations, where *F*_{1} represents the single‐model ensemble mean and *F*_{2} represents the multimodel mean of the remaining models, the *CMI* between *O* and *F*_{2}, conditioning out *F*_{1}, is given by

$$CMI(O; F_{2} \mid F_{1}) = MI(O; (F_{1}, F_{2})) - MI(O; F_{1}),$$

where *MI*(*O*; (*F*_{1}, *F*_{2})) measures the degree of dependence between the observation and the joint variability of the forecasts *F*_{1} and *F*_{2}. Accordingly, *CMI* quantifies the additional decrease in uncertainty obtained when the multimodel forecast mean of the other models is added to a single‐model forecast. When the distributions are Gaussian, the *CMI* reduces to a simple function of partial correlation as follows (Sedghi & Jonckheere, ):

$$CMI(O; F_{2} \mid F_{1}) = -\tfrac{1}{2} \log\!\left(1 - \rho^{2}_{O2 \mid 1}\right),$$

where *ρ*_{O2 ∣ 1} denotes the partial correlation between *O* and *F*_{2} conditioned on *F*_{1}. The partial correlation is related to the pairwise correlations by (Abdi, )

$$\rho_{O2 \mid 1} = \frac{\rho_{O2} - \rho_{O1}\, \rho_{12}}{\sqrt{\left(1 - \rho^{2}_{O1}\right)\left(1 - \rho^{2}_{12}\right)}},$$

where *ρ*_{O1} and *ρ*_{O2} are the correlation skills of *F*_{1} and *F*_{2}, respectively, and *ρ*_{12} is the correlation between *F*_{1} and *F*_{2}. Hereafter, the subscript 1 denotes single‐model forecasts, and the subscript 2 denotes either single‐model forecasts or multimodel forecasts, depending on whether one is assessing the skill of single‐model or multimodel forecasts.
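The Gaussian *CMI* computation therefore reduces to a few lines given the three pairwise correlations. The following sketch (illustrative function names, example correlation values invented for the demonstration) shows the mechanics:

```python
import numpy as np

def partial_corr(r_o1, r_o2, r_12):
    """Partial correlation between O and F2 conditioned on F1,
    computed from the three pairwise correlations."""
    return (r_o2 - r_o1 * r_12) / np.sqrt((1.0 - r_o1 ** 2) * (1.0 - r_12 ** 2))

def cmi_gaussian(r_o1, r_o2, r_12):
    """CMI(O; F2 | F1) in nats under joint Gaussianity."""
    rho = partial_corr(r_o1, r_o2, r_12)
    return -0.5 * np.log(1.0 - rho ** 2)

# Example: a multimodel mean F2 (correlation 0.7 with observations) adds
# information beyond a single-model forecast F1 (correlation 0.6),
# with the two forecasts themselves correlated at 0.5
cmi = cmi_gaussian(0.6, 0.7, 0.5)
```

Note that when the second forecast carries no information beyond the first (ρ_{O2} = ρ_{O1}ρ_{12}), the partial correlation and hence the *CMI* vanish, which is the diagnostic property exploited in the experiments below.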

To further understand any skill enhancements provided by a multimodel forecast, the streamflow forecasts and observations can be partitioned into a conditional mean, called the signal variable *α*, and a deviation about the conditional mean, called the noise variable *β*. As shown by DelSole et al. (), when all the ensemble members are drawn from the same model and the forecasts are ensemble means of sizes *E*_{1} and *E*_{2}, the pairwise correlations entering the partial correlation become

$$\rho_{O1} = \rho_{\alpha O} \sqrt{\frac{E_{1}\, SNR}{1 + E_{1}\, SNR}}, \qquad \rho_{O2} = \rho_{\alpha O} \sqrt{\frac{E_{2}\, SNR}{1 + E_{2}\, SNR}}, \qquad \rho_{12} = \sqrt{\frac{E_{1}\, SNR}{1 + E_{1}\, SNR}} \sqrt{\frac{E_{2}\, SNR}{1 + E_{2}\, SNR}},$$

where *SNR* is defined as the ratio of signal variance to noise variance and *ρ*_{αO} is the correlation between the signal variable and the streamflow observation. The resulting partial correlation is nonzero when a predictable signal exists (i.e., *SNR* ≠ 0), forecast skill exists (*ρ*_{αO} ≠ 0), and the ensemble sizes are finite. Since forecast skill cannot exceed predictability skill,

$$\rho_{\alpha O} \leq \sqrt{\frac{SNR}{1 + SNR}}.$$

The above inequality implies that an upper bound on *ρ*_{αO} results in an upper bound on the partial correlation. Thus, an upper bound on the skill improvement due to adding new ensemble members from the same model can be estimated by combining the previous expressions and taking the limit *SNR* → ∞:

$$\rho_{O2 \mid 1} \leq \sqrt{\frac{E_{2}}{\left(E_{1} + 1\right)\left(E_{1} + E_{2}\right)}}.$$

Thus, any skill enhancement, as measured by the *CMI*, that exceeds this upper bound is dominated by the addition of new predictable signals (DelSole et al., ).
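Assuming the closed‐form limit sqrt(E₂ / ((E₁ + 1)(E₁ + E₂))), as reconstructed here from the DelSole et al. framework, the bound can be evaluated numerically for the ensemble sizes used in the experiments below (the helper name is illustrative):

```python
import numpy as np

def size_bound(e1, e2):
    """SNR -> infinity upper bound on the partial correlation achievable by
    adding e2 same-model members to an e1-member ensemble mean
    (hypothetical helper implementing the bound reconstructed in the text)."""
    return np.sqrt(e2 / ((e1 + 1.0) * (e1 + e2)))

b_small = size_bound(3, 6)    # 9-member experiments:  sqrt(1/6) ~ 0.41
b_large = size_bound(11, 22)  # 33-member experiments: sqrt(1/18) ~ 0.24
```

The bound shrinks as the baseline ensemble grows: the larger the single‐model ensemble already is, the less room remains for skill gains from same‐model members, so exceedances of the bound point more strongly to model diversity.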

We computed *CMI* using the Gaussian expressions above, together with the streamflow ensemble forecasts and observations. We used the ensemble‐size bound to obtain an upper limit for the skill improvement due to increased ensemble size. Any improvements beyond this upper bound, we attributed to the addition of new signals or model diversity. In these computations, the subscript 1 refers to the single‐model forecasts *F*_{1} that one is trying to improve, and the subscript 2 refers to the multimodel forecasts *F*_{2} or, in the case of a single‐model experiment, to the addition of new members from the same model. *CMI* was computed for each individual model and multimodel combination at every lead time of interest for selected forecast locations. Before computing *CMI*, both the streamflow observations and forecasts were transformed into Gaussian space using the NQT.

To implement *CMI*, three different experiments were performed: (i) 9‐member single model, (ii) 9‐member multimodel, and (iii) 33‐member multimodel. The 9‐member single‐model experiment consists of a 3‐member single‐model forecast (*F*_{1}) combined with a 6‐member ensemble from the same model (*F*_{2}). Note that this 6‐member ensemble may be treated as a proxy for adding members from hydrological models with very similar structures. This experiment was repeated for each of the models used. In the 9‐member multimodel experiment, a 3‐member single‐model ensemble from one of the models (*F*_{1}) was combined with a 6‐member multimodel ensemble obtained using the remaining two other models (*F*_{2}). This 6‐member multimodel ensemble was generated as follows: (i) Three raw members from each of the remaining two models were randomly selected, and (ii) the selected members were combined using the QR‐BMA postprocessor to generate a 6‐member multimodel ensemble. Note that the models contribute equal numbers of members only in terms of the raw forecast members sampled from each model. Additionally, in both the 9‐member single‐model and 9‐member multimodel experiments, the values of *E*_{1} and *E*_{2} are 3 and 6, respectively. The last experiment, 33‐member multimodel, was the same as the 9‐member multimodel experiment but using instead 33 members. That is, an 11‐member single‐model ensemble from one of the models (*F*_{1}) was combined with a 22‐member multimodel ensemble obtained by postprocessing the remaining two other models (*F*_{2}). For the *CMI* experiments, raw single‐model forecasts were used for *F*_{1} to emulate basic operational conditions. The *CMI* values for the different experiments were computed by first randomly selecting raw ensemble members from each hydrological model.
This process of randomly selecting raw forecasts from each model was repeated several times for each *CMI* value, so that the reported *CMI* value is the average from multiple realizations.

Additionally, we estimated *CMI* in streamflow space using the approach discussed by Meyer (). The approach relies on the Miller‐Madow asymptotic bias‐corrected empirical estimator for entropy estimation (Meyer, ; Miller, ) and an equal frequency binning algorithm for data discretization (Meyer, ). This approach does not require transforming streamflow into Gaussian space but has the drawback that an exact upper bound, akin to equation , is not available. The *CMI* in streamflow space was computed using the same experimental conditions described before for *CMI* in Gaussian space.
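The two ingredients of this streamflow‐space estimator, equal‐frequency binning and the Miller‐Madow bias correction, can be sketched in Python as follows. This is a minimal illustration with invented function names, not the implementation of the Meyer () estimator used in the study:

```python
import numpy as np

def equal_freq_bins(x, n_bins):
    """Discretize x into (roughly) equal-frequency bins using sample quantiles."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    return np.searchsorted(edges, x)

def entropy_mm(labels):
    """Miller-Madow bias-corrected entropy (nats) of a discrete sample:
    the plug-in estimate plus (m - 1) / (2n), where m is the number of
    occupied bins and n the sample size."""
    n = len(labels)
    counts = np.bincount(labels)
    p = counts[counts > 0] / n
    m = np.count_nonzero(counts)
    return -np.sum(p * np.log(p)) + (m - 1) / (2.0 * n)

labels = np.tile(np.arange(4), 250)  # 1,000 samples, uniform over 4 bins
h = entropy_mm(labels)               # ln(4) plus a small bias correction
```

Joint and conditional entropies built from such discretized variables then combine, via the chain rule, into a *CMI* estimate that needs no Gaussian assumption, at the cost of losing the exact ensemble‐size bound.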

Besides using *CMI* to measure skill improvements, we used the mean Continuous Ranked Probability Skill Score (*CRPSS*; Hersbach, ), since this is a commonly used verification metric to assess the quality of ensemble forecasts (Brown et al., ). The *CRPSS* is derived from the Continuous Ranked Probability Score (*CRPS*). The *CRPS* evaluates the overall accuracy of a probabilistic forecast by estimating the quadratic distance between the forecast's cdf and the corresponding observation. The *CRPS* is defined as

$$CRPS = \int_{-\infty}^{\infty} \left[ F(q) - H\{q \geq o\} \right]^{2} dq,$$

where *F*(*q*) is the forecast cdf and *H*{*q* ≥ *o*} is the Heaviside step function, which equals 1 when the threshold *q* meets or exceeds the observation *o* and 0 otherwise.

To evaluate the skill of the forecasting system relative to a reference system, the associated skill score, or *CRPSS*, is computed as

$$CRPSS = 1 - \frac{\overline{CRPS}}{\overline{CRPS}_{ref}},$$

where the *CRPS* is averaged across *n* pairs of forecasts and observations to calculate the mean *CRPS* of the main forecast system, $\overline{CRPS}$, and of the reference system, $\overline{CRPS}_{ref}$. The *CRPSS* ranges over (−∞, 1]. Positive *CRPSS* values indicate the main forecasting system has higher skill than the reference forecasting system, with 1 indicating perfect skill. In this study, we used sampled climatology as the reference forecasting system. Similar to our implementation of *CMI*, the *CRPSS* was computed for both single‐model and multimodel ensemble streamflow forecasts at each lead time of interest for selected forecast locations. Confidence intervals for the *CRPSS* were determined using the stationary block bootstrap technique (Politis & Romano, ). Note that the *CRPSS* represents a quantitative measure of the overall forecast skill relative to the reference system (i.e., sampled climatology), whereas the *CMI* represents the skill improvement or enhancement provided by the multimodel forecasts. Thus, the *CMI* and *CRPSS* are not directly comparable against each other. Our proposed multimodel forecasting approach is summarized in Figure .
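For a finite ensemble, one common empirical estimator of the *CRPS* is the energy form, mean |xᵢ − o| − ½ mean |xᵢ − xⱼ|, which is equivalent to the integral definition for the ensemble's empirical cdf. The following sketch (illustrative names; not the Hersbach decomposition used operationally) shows this estimator and the skill score:

```python
import numpy as np

def crps_ensemble(members, obs):
    """Empirical CRPS (energy form) of one ensemble forecast against a scalar
    observation: mean |x_i - o| - 0.5 * mean |x_i - x_j|."""
    x = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(x - obs))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2

def crpss(mean_crps_main, mean_crps_ref):
    """Skill score of the main system relative to a reference system
    (e.g., sampled climatology)."""
    return 1.0 - mean_crps_main / mean_crps_ref

perfect = crps_ensemble([5.0, 5.0, 5.0], 5.0)  # ensemble collapsed on the obs
off = crps_ensemble([7.0], 5.0)                # one member: CRPS = |7 - 5|
```

In practice the per‐case *CRPS* values are averaged over all forecast‐observation pairs for the main and reference systems before forming the skill score.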

The North Branch Susquehanna River (NBSR) basin in the U.S. Middle Atlantic Region was selected as the study area (Figure ; Nelson, ). Severe weather and flooding hazards are an important concern in the NBSR, for example, the City of Binghamton, New York, has been affected by multiple damaging flood events over recent years (Gitro et al., ; Jessup & DeGaetano, ). In the NBSR, four different U.S. Geological Survey (USGS) daily gauge stations were selected as the forecast locations (Figure ). The selected locations are the Otselic River at Cincinnatus (USGS gauge 01510000), Chenango River at Chenango Forks (USGS gauge 01512500), Susquehanna River at Conklin (USGS gauge 01503000), and Susquehanna River at Waverly (USGS gauge 01515000). These forecast locations represent a system of nested subbasins with drainage areas ranging from ~381 to 12,362 km^{2}. A summary of the main characteristics of the selected gauge locations is provided in Table .

NOAA's latest global, medium‐range ensemble reforecast data set, the Global Ensemble Forecast System Reforecast version 2 (GEFSRv2), was used as forcing to the hydrological models.

Four main observational data sets were used: multisensor precipitation estimates (MPEs), gridded near‐surface air temperature, phase 2 of the North American Land Data Assimilation System (NLDAS‐2) forcing data, and daily streamflow observations. The MPEs were obtained from the Middle Atlantic River Forecast Center (MARFC). Similar to the NCEP stage IV MPEs (Moore et al., ; Prat & Nelson, ), the MARFC MPE product combines radar estimated precipitation with in situ gauge measurements to create a continuous time series of hourly, gridded precipitation observations. The gridded near‐surface air temperature data were produced by the MARFC using multiple observation networks, including the meteorological terminal aviation routine weather report (METAR), USGS stations, and the National Weather Service Cooperative Observer Program (Siddique & Mejia, ). Additionally, we used NLDAS‐2 data for near‐surface air temperature, specific humidity, surface pressure, downward longwave and shortwave radiation, and u‐v components of wind speed. The spatial resolution of the NLDAS‐2 data is 1/8th‐degree grid spacing while the temporal resolution is hourly. Further details about the NLDAS‐2 data can be found elsewhere (Mitchell et al., ). To calibrate the hydrological models and verify the streamflow simulations and forecasts, daily streamflow observations for the selected gauged locations were obtained from the USGS. In total, 6 years (2004–2009) of hydrometeorological observations were used. Table summarizes the observational data sets.

To generate the multimodel forecasts, we used the following three hydrological models: Antecedent Precipitation Index (API)‐Continuous (Moreda et al., ), NOAA's Hydrology Laboratory‐Research Distributed Hydrologic Model (HL‐RDHM; Koren et al., ), and the Weather Research and Forecasting Hydrological (WRF‐Hydro) modeling system (Gochis et al., ). We selected these three hydrological models because they are relevant to operational forecasting in the United States and represent varying levels of model structural complexity as well as different spatial resolutions and parameterizations. The selected models collectively represent a sufficiently diverse set of models favorable for multimodel forecasting. The description of each model and the details about the configuration, calibration, and performance of the models in simulation mode are provided in Text S1 in the supporting information. The parameters selected for calibration, and the parameters' feasible ranges and calibrated values for the HL‐RDHM and WRF‐Hydro models are summarized in Table S1.

The models were used to simulate and forecast flows over the entire period of analysis (years 2004–2009) at the selected gauge locations (Figure ) but were verified for the warm season only (May–October). We focused on the warm season because flood events are more prevalent in our study area during these months. The simulated flows were obtained by forcing the hydrological models with meteorological observations. The streamflow simulations were verified against daily observed flows for the entire period of analysis, warm season only (years 2004–2009, May–October). The HL‐RDHM simulations were performed for the period 2004–2009, with the year 2003 used as warm‐up. To calibrate HL‐RDHM, we first manually adjusted the a priori parameter fields through a multiplying factor; once the manual changes did not yield noticeable improvements in model performance, the multiplying factors were fine‐tuned using the stepwise line search algorithm (Kuzmin, ; Kuzmin et al., ). Out of all the HL‐RDHM adjusted parameters, the most sensitive parameters were found to be the upper and lower soil zones transport and storage parameters, as well as the stream routing parameters. The WRF‐Hydro simulations were performed for the period 2004–2009, with the first year used as warm‐up. To calibrate WRF‐Hydro, we implemented a stepwise manual adjustment approach (Yucel et al., ); that is, once a parameter value was calibrated, its value was kept fixed during the calibration of subsequent parameters. Out of all the adjusted parameters, the most sensitive parameters were the soil, groundwater, and runoff parameters. After manually calibrating the WRF‐Hydro parameters, the most sensitive parameter values were fine‐tuned using an optimization algorithm, namely, dynamically dimensioned search (Tolson & Shoemaker, ). The API‐Continuous model was previously calibrated by the MARFC for operational forecasting purposes using a manual approach.

Figure summarizes the models' performance in simulation mode using the Pearson's correlation coefficient, *R*; Nash‐Sutcliffe efficiency, *NSE*; and percent bias, *PB*, between the simulated and observed streamflows at daily resolution for the entire analysis period. The overall performance of the models was satisfactory (Figures a and b). API and HL‐RDHM exhibited comparable performance while WRF‐Hydro tended to underperform relative to API and HL‐RDHM. The performance of the models is discussed further in section .

To perform our forecast experiments, we generated and verified the following three different data sets of ensemble streamflow forecasts: (i) raw single model, (ii) postprocessed single model, and (iii) multimodel. The raw single‐model data set consisted of ensemble streamflow forecasts from each hydrological model without postprocessing. The postprocessed single‐model data set was generated by using QR to postprocess the raw ensemble streamflow forecasts from each hydrological model. Lastly, the multimodel data set was generated by optimally combining the ensemble forecasts from the different hydrological models using QR‐BMA. As part of the multimodel data set, we also generated an equal‐weight multimodel forecast by using the same weight, 1/*K*, to combine the models rather than the optimal weights from QR‐BMA. Additionally, for both the single‐model and multimodel forecast data sets, we varied the number of ensemble members used (9 to 33 members) to perform different experiments.

All the forecast data sets were verified across lead times of 1 to 7 days using 6 years of data (2004–2009) for the warm season only (May–October). To postprocess and verify both the single‐model and multimodel ensemble streamflow forecasts, a leave‐one‐out approach was implemented: 4 years of forecast data (the training period) were used to train the postprocessor and the remaining 2 years to verify the forecasts. This was repeated until all 6 years of forecast data had been postprocessed and verified independently of the training period. API and HL‐RDHM generated 6‐hourly streamflow forecasts, and WRF‐Hydro 3‐hourly forecasts; these subdaily forecasts were averaged over 24 hr to obtain the mean daily flow. The mean daily ensemble streamflow forecasts were verified against mean daily streamflow observations at the selected gauge locations.
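The verification metric used below, the *CRPSS*, can be sketched as follows. This is an illustrative implementation based on the standard empirical-CRPS identity, not the study's verification code; the toy data, function names, and ensemble sizes are assumptions.

```python
import numpy as np

def crps_ensemble(members, obs):
    """Empirical CRPS for one ensemble forecast against a scalar observation.

    Uses the identity CRPS = E|X - y| - 0.5 * E|X - X'|, where X, X' are
    independent draws from the ensemble distribution.
    """
    members = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(members - obs))
    term2 = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return term1 - term2

def crpss(fc_members, ref_members, obs_series):
    """Skill relative to a reference (e.g., sampled climatology):
    CRPSS = 1 - mean(CRPS_forecast) / mean(CRPS_reference)."""
    crps_fc = np.mean([crps_ensemble(f, y) for f, y in zip(fc_members, obs_series)])
    crps_ref = np.mean([crps_ensemble(r, y) for r, y in zip(ref_members, obs_series)])
    return 1.0 - crps_fc / crps_ref

# Toy example: a sharp, well-centered forecast vs a wide climatological reference.
rng = np.random.default_rng(0)
obs = rng.normal(10.0, 1.0, size=200)                     # "observed" daily flows
fc = obs[:, None] + rng.normal(0.0, 1.0, size=(200, 11))  # 11-member forecast
clim = rng.normal(10.0, 3.0, size=(200, 11))              # sampled climatology
print(round(crpss(fc, clim, obs), 2))                     # positive => skillful
```

A perfect deterministic ensemble yields CRPS of 0, and a forecast no better than the reference yields a CRPSS of 0 or below, matching the interpretation used throughout the results.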

In terms of the *CRPSS* (relative to sampled climatology), the raw single‐model ensemble streamflow forecasts remain skillful across lead times (1–7 days) and basins (Figures a–d), with the exception of WRF‐Hydro, which has slightly negative *CRPSS* values at the longer lead times (6–7 days). In Figures a–d, the *CRPSS* values tend overall to decline with increasing lead time, as might be expected since the weather uncertainties tend to grow and become more dominant in determining forecast skill as the lead time progresses (Siddique & Mejia, ). There is also a slight tendency for the *CRPSS* values to exhibit spatial scale dependency: the *CRPSS* values for each model tend to increase from the smallest (Figure a) to the largest (Figure d) basin across lead times. This tendency is, however, rather weak throughout all of our forecasts, and it is somewhat more apparent for the API and HL‐RDHM forecasts than for WRF‐Hydro (Figures a–d).

Across all lead times and basins (Figures a–d), the *CRPSS* values vary approximately from −0.15 (WRF‐Hydro at the day 7 lead time; Figure d) to 0.6 (API at the day 1 lead time; Figure d). Contrasting the hydrological models, the performance of API and HL‐RDHM is comparable, with the exception of CNON6 (Figure b), where API outperforms HL‐RDHM. This is due to HL‐RDHM having an unusually high percent simulation bias of −14.3% for CNON6, relative to API, whose simulation bias is −5.8%. The performance of the models in forecasting mode tends to mimic their performance in simulation mode (Figure ). That is, API tends to perform better than HL‐RDHM, and, in turn, both of these models tend to outperform WRF‐Hydro. Deviations from this tendency, however, do emerge. For example, WRF‐Hydro has forecasting skill similar to that of HL‐RDHM at the day 1 lead time in CINN6 (Figure a), even though in this basin HL‐RDHM performs better than WRF‐Hydro in simulation mode. Similarly, API performs slightly better than HL‐RDHM in forecasting mode at the later lead times (>4 days) in CINN6 (Figure a), but HL‐RDHM shows better performance in simulation mode. Thus, the results obtained in simulation mode do not always translate into similar performance in forecasting mode. This is not surprising given the nonlinear relationship between hydrological processes and weather forcings, and it reinforces the need to verify hydrological models in both simulation and forecasting mode to gain a more complete understanding of model behavior.

The underperformance of WRF‐Hydro, in both simulation and forecasting mode, in comparison to API and HL‐RDHM may be due to several factors. One factor is likely the additional model complexity of WRF‐Hydro; that is, WRF‐Hydro requires more forcing inputs and parameters to be specified than the other two models. For example, in terms of forcings, HL‐RDHM requires only precipitation and near‐surface air temperature, whereas WRF‐Hydro requires seven different forcings. It is possible that biases in the NLDAS‐2 or GEFSRv2 forcings used here to configure the WRF‐Hydro simulations and forecasts, respectively, could be affecting its performance. However, we evaluated the effect of each individual forcing on the *CRPSS* values of the WRF‐Hydro streamflow forecasts (results not shown) and found that precipitation was the most dominant forcing. At least in forecasting mode, the additional forcings used by WRF‐Hydro do not seem to have a strong influence on its forecast skill. The relatively low performance of WRF‐Hydro could also be due to restrictions in its ability to represent physical processes because of a priori constraints on model parameter values, which neglect the large uncertainty in parameter estimates and the large impact that parameters have on model predictions.

The determination of model parameter values for WRF‐Hydro is another factor likely affecting its performance. Although we calibrated selected WRF‐Hydro parameter values (see Table S1), both manually and numerically, there is generally less community knowledge about and experience with WRF‐Hydro than with API and HL‐RDHM; the latter two have been in use for much longer (e.g., Anderson et al., ; Koren et al., ; Moreda et al., ; Reed et al., ). In the future, a more in‐depth sensitivity analysis of the WRF‐Hydro model parameters could be beneficial. Nonetheless, the performance of WRF‐Hydro in this study is comparable to that previously reported in the literature (Givati et al., ; Kerandi et al., ; Naabil et al., ; Salas et al., ; Silver et al., ; Yucel et al., ).

We used QR to postprocess the raw single‐model ensemble streamflow forecasts. Using the *CRPSS* (relative to sampled climatology) to assess the forecast skill (Figures e–h), we found that the postprocessed single‐model ensemble streamflow forecasts show, overall, skill improvements relative to the raw forecasts. The relative improvements are most noticeable for WRF‐Hydro. For example, at WVYN6 (Figure d), the raw WRF‐Hydro forecasts have a *CRPSS* value of ~0.27 at the day 1 lead time, and that value increases to ~0.6 after postprocessing (Figure h). However, since the hydrological models are calibrated with data sets used for cross‐validating the postprocessor, the absolute *CRPSS* values for the postprocessed forecasts are not representative of real‐time conditions.

Interestingly, the *CRPSS* values for the postprocessed single‐model forecasts reveal that after postprocessing, the models have comparable skill across lead times and basins (Figures e–h), perhaps with the exception of CNON6 (Figure f), where API tends to outperform the other models. This indicates that the streamflow forecasts are influenced by systematic biases and that, in this case, those biases are stronger in WRF‐Hydro than in the other models. Such streamflow forecast biases result from the combined effect of biases in the weather forcings and the hydrological models. With regard to the former, precipitation forecasts from the GEFSRv2 are characterized by an underforecasting bias in our study region (Sharma et al., ; Siddique et al., ), particularly at the longer lead times. This underforecasting bias affects all of our hydrological model forecasts, so it is unlikely to be the cause of the strong biases seen in the WRF‐Hydro forecasts.

Hydrological model biases appear to have a strong effect on the performance of WRF‐Hydro, given the relatively mild skill gains from postprocessing for the API and HL‐RHDM models and the larger gains for WRF‐Hydro (Figures e–h). Nonetheless, the QR postprocessor is able in this case to handle those biases. This suggests that models with simple structure (e.g., API, which is spatially lumped and has fewer parameters) may benefit less from postprocessing while models with complex structure (e.g., WRF‐Hydro, which is spatially distributed and has more parameters) may be good candidates for postprocessing. It is also possible that systematic biases in the WRF‐Hydro could be reduced through improved parameter sensitivity analysis and calibration, as opposed to statistical postprocessing.
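The QR step behind these postprocessed forecasts fits, for each lead time and quantile level τ, a regression that minimizes the pinball (check) loss. The sketch below illustrates the idea only: a coarse grid search stands in for the linear-programming solvers used in practice, and the data and names are assumptions, not the study's setup.

```python
import numpy as np

def pinball_loss(tau, y, y_hat):
    """Check (pinball) loss minimized by quantile regression at level tau."""
    e = y - y_hat
    return np.mean(np.maximum(tau * e, (tau - 1.0) * e))

def fit_linear_qr(x, y, tau, slopes, intercepts):
    """Brute-force linear quantile regression: pick (a, b) on a grid that
    minimizes the pinball loss of y ~ a + b * x at quantile level tau."""
    best, best_loss = None, np.inf
    for b in slopes:
        for a in intercepts:
            loss = pinball_loss(tau, y, a + b * x)
            if loss < best_loss:
                best, best_loss = (a, b), loss
    return best

# Toy example: raw forecasts underestimate flow by ~20% (multiplicative bias).
rng = np.random.default_rng(1)
x = rng.uniform(5.0, 50.0, size=500)           # raw forecast member
y = 1.25 * x + rng.normal(0.0, 1.0, size=500)  # "observed" flow
a, b = fit_linear_qr(x, y, tau=0.5,
                     slopes=np.linspace(0.8, 1.6, 81),
                     intercepts=np.linspace(-2.0, 2.0, 41))
print(round(b, 2))  # slope near 1.25 corrects the conditional median
```

Applying the fitted relations across a set of quantile levels maps each raw ensemble member to a bias-corrected value, which is the sense in which QR removes the systematic biases discussed above.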

Another interesting outcome from the postprocessed single‐model results is that the ranking of the models, in terms of the *CRPSS*, varies depending on the lead time and basin. For example, both HL‐RDHM and WRF‐Hydro tend to slightly outperform API at the day 1 lead time in Figure e, but API outperforms both models at the later lead times (>6 days) in Figures f–h. This is important because it indicates that there is no single model that consistently outperforms the other models. In other words, it is not possible, at least in terms of the *CRPSS*, to choose one model as the best in all cases. This suggests that it may be possible to maximize forecast skill across lead times and basins by optimally combining the outputs from the different models, as opposed to relying on a single model. It shows that multimodel forecasting may be a viable option to enhance streamflow predictions.

We now use the *CRPSS* to examine the ability of multimodel forecasts to improve streamflow predictions. For this, the *CRPSS* is again plotted against the forecast lead time for the selected basins (Figure ). In Figure , the following three different multimodel forecasting experiments are shown: (i) equal weight, (ii) 9 members, and (iii) 33 members. For the equal‐weight experiment, the same weight, 1/*K*, was used to combine the predictive distribution of the streamflow forecasts from each hydrological model; that is, instead of the optimal weights from QR‐BMA, equal weights were used to form a 9‐member multimodel forecast. For the 9‐member and 33‐member experiments, we used 3 and 11 raw members per model, respectively, to obtain a multimodel forecast with QR‐BMA, which was used to optimize the weights. Additionally, the reference system used to compute the *CRPSS* values in Figure consists of the postprocessed ensemble streamflow forecasts from API, as opposed to sampled climatology. We selected API as the reference system since it is currently the regional operational model used to generate streamflow forecasts in our study area.
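In both the equal-weight and optimal-weight experiments, the multimodel predictive density is a weighted mixture of the single-model predictive densities; only the weights differ. A minimal sketch of that combination step, assuming Gaussian single-model densities and illustrative moments and weights (not values from the study):

```python
import math

def gauss_pdf(q, m, s):
    """Gaussian density with mean m and standard deviation s."""
    return math.exp(-0.5 * ((q - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def mixture_pdf(q, means, sigmas, weights):
    """Multimodel predictive density: weighted sum of single-model densities."""
    return sum(w * gauss_pdf(q, m, s) for w, m, s in zip(weights, means, sigmas))

def mixture_mean(means, weights):
    """Mean of the combined forecast is the weight-averaged model mean."""
    return sum(w * m for w, m in zip(weights, means))

K = 3
means, sigmas = [10.0, 12.0, 9.0], [1.0, 1.5, 2.0]  # per-model predictive moments
equal = [1.0 / K] * K                                # equal-weight experiment
bma = [0.6, 0.3, 0.1]                                # e.g., QR-BMA optimal weights
print(round(mixture_mean(means, equal), 2))          # -> 10.33
print(round(mixture_mean(means, bma), 2))            # -> 10.5
```

Because the weights sum to one, the mixture is itself a valid density; shifting weight toward the better-performing model moves the combined forecast toward that model's predictive distribution, which is what the BMA weights accomplish.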

We found that the 33‐member multimodel forecasts result in higher *CRPSS* values than API across lead times and basins (Figure ). The 9‐member multimodel forecasts perform similarly to the 33‐member forecasts, but in a few cases (e.g., Figure c at the day 5 lead time) the 9‐member forecasts result in lower (negative) *CRPSS* values than API. The equal‐weight experiment is only able to improve the *CRPSS* values at the initial lead times (<3 or 4 days; Figure ), while at the later lead times its *CRPSS* values are lower than API. CNON6 offers an interesting case to further compare the single‐model and multimodel forecasts. In the single‐model forecasts for CNON6 (Figure f), API tends to clearly outperform the other models. Despite the better performance of API alone, the multimodel forecasts are still able to improve the skill for CNON6 relative to the performance of API, with the largest improvement being ~0.16 at the day 7 lead time for the 33‐member experiment.

The BMA weights associated with the multimodel forecasts (see Table S2) tend to reflect the performance of the postprocessed forecasts for the individual models in Figure . For example, the API at CNON6 consistently gets a higher weight than the other models, particularly at the longer lead times, while WRF‐Hydro at CNON6, CKLN6, and WVYN6 has relatively low BMA weights at the later lead times. Additionally, the weights show that even when the performance of one of the models is dominant, the remaining models may still contribute to improving the multimodel forecasts. This is the case for CNON6 at the later lead times (e.g., days 6 and 7 in Table S2), where despite the higher weights for API, the HL‐RDHM and WRF‐Hydro are still assigned some weight.

In sum, the multimodel forecasts reveal skill improvements relative to API, which may be considered here the best performing model in terms of the overall simulation and raw forecast results; the optimal weights from QR‐BMA result in more skillful multimodel forecasts than equal weights, particularly at the later lead times (>3 days); and increasing the ensemble size of the multimodel forecasts results in relatively mild skill gains. We also computed reliability diagrams, as determined by Brown et al. (), for the single‐model and 9‐member multimodel forecasts (see Figures S2 and S3). The reliability diagrams show that the multimodel forecasts tend, for the most part, to display better reliability than the single‐model forecasts.

Several studies have investigated the source of improvements (skill gains) from multimodel forecasts (Hagedorn et al., ; Weigel et al., , ). Those studies have found that multimodel forecasts can improve predictions by error cancellation and by correcting deficiencies (underdispersion) in the ensemble spread of the single models. These sources of skill gain are mainly statistical; this way of understanding the benefits of multimodel forecasts does not consider whether a particular model contributes additional information to the forecasts. Considering the latter is important to justify adding any new model to an existing forecasting system. Another way to assess the source of improvements from multimodel forecasts, one that accounts for the contribution of model information (signal as opposed to noise), is through *CMI*, which we pursue next.

We used *CMI* to determine whether the skill improvements from the multimodel forecasts are dominated by model diversity or increased ensemble size alone. To this end, *CMI* was computed using equations and , together with the ensemble mean forecast, at lead times of 1–7 days for the selected basins (Figure ). In Figure , the following three different experiments are shown: (i) 9‐member single model (Figures a–c), (ii) 9‐member multimodel (Figures d–f), and (iii) 33‐member multimodel (Figures g–i). The experiments are described in subsection .

For the first experiment, we used equations and to obtain a theoretical upper bound for *CMI*. This theoretical bound represents the potential skill gain from ensemble size alone; we found that it is in this case equal to 0.090. Figures a–c show that, indeed, the empirical *CMI* values for the 9‐member single‐model forecasts tend to be less than or around 0.090 for all three models across lead times and basins. The 9‐member single‐model *CMI* values tend to be greater for API than for HL‐RDHM and WRF‐Hydro. This indicates that the less complex model, API, is able to maximize the skill gains from ensemble size alone. For example, in terms of the *CRPSS*, the raw single‐model forecasts from API and HL‐RDHM have comparable skill in the case of CKLN6 (Figure c) and WVYN6 (Figure d). In contrast, the 9‐member single‐model *CMI* values tend to be greater for API than for HL‐RDHM in both cases, CKLN6 and WVYN6 (Figures a and b), particularly at the longer lead times. This ability of API to maximize the benefits from ensemble size alone may be due to API being more sensitive than the other models to the weather forcing. Also, in Figures a–c, the tendency is for the *CMI* values to increase somewhat with the lead time for all the basins; this is more apparent for API and HL‐RDHM than for WRF‐Hydro.
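To illustrate the Gaussian-space *CMI* calculation used in these experiments: for jointly Gaussian variables, conditional mutual information reduces to a function of the partial correlation. The sketch below assumes that simplification and uses synthetic data; it is not the study's estimator, and all names are illustrative.

```python
import numpy as np

def gaussian_cmi(x, y, z):
    """CMI I(X; Y | Z) under a joint-Gaussian assumption:
    regress X and Y on Z, correlate the residuals (partial correlation r),
    and use I = -0.5 * ln(1 - r^2), in nats."""
    Z = np.column_stack([np.ones_like(z), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    r = np.corrcoef(rx, ry)[0, 1]
    return -0.5 * np.log(1.0 - r ** 2)

# Toy example: the observation depends on two ensemble-mean forecasts; the
# second model carries information beyond the first, so the CMI of the
# observation and model 2, given model 1, is well above zero.
rng = np.random.default_rng(2)
mean1 = rng.normal(size=2000)                # ensemble-mean forecast, model 1
mean2 = 0.5 * mean1 + rng.normal(size=2000)  # model 2: only partly redundant
obs = mean1 + mean2 + rng.normal(size=2000)
print(gaussian_cmi(obs, mean2, mean1) > 0.1)  # -> True
```

A positive *CMI* in this sense means the second forecast reduces uncertainty about the observation even after the first forecast is known, which is the notion of added model information used in the analysis above.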

Contrasting the *CMI* values between the 9‐member single‐model (Figures a–c) and 9‐member multimodel (Figures d–f) experiment, it is apparent that the multimodel forecasts have substantially greater *CMI* values than the single‐model forecasts across lead times and basins. This indicates that any of the single‐model forecasts (API, HL‐RDHM, or WRF‐Hydro) can be improved by combining them with forecasts from the other models. Indeed, this improvement is dominated by model diversity rather than increased ensemble size alone. Although the multimodel forecasts show skill gains at all the lead times, the tendency is for the *CMI* values to increase with the lead time, suggesting that the multimodel forecasts may be particularly useful for improving medium‐range streamflow forecasts.

To further examine the hypothesis that improvements in *CMI* are dominated by model diversity rather than the ensemble size alone, the *CMI* values from the 9‐member multimodel experiment (Figures d–f) can be compared against the values from the 33‐member multimodel experiment (Figures g–i). From this comparison, it is seen that the *CMI* values for these two experiments are, overall, very similar across lead times and basins. This further supports that incorporating additional information by adding new models plays an important role in enhancing the skill of the multimodel forecasts. The results in Figure indicate that hydrological multimodel forecasting can be a viable approach to improve streamflow forecasts at short‐ and medium‐range timescales. They suggest that model diversity is a relevant consideration when trying to enhance the skill of streamflow forecasts. Although this is the case here for forecast skill, one would like in the future to examine whether these results apply to other attributes of forecast quality. In particular, metrics that are more responsive to the ensemble size than the adopted *CMI* formalism, which was based on the ensemble mean, could be tried.

We also tested the effect on the *CMI* values of using postprocessed single‐model forecasts, as opposed to raw forecasts. Thus, we calculated *CMI* (results not shown) for each basin and lead time using the QR postprocessed single‐model forecasts, that is, the experiments in Figure were repeated using the postprocessed single‐model forecasts. We found that as was the case with the raw forecasts, the *CMI* values for the multimodel combinations exceeded the theoretical upper bound of 0.090 and the *CMI* values remained very similar after increasing the ensemble size, that is, between the 9‐member and 33‐member multimodel experiments. Thus, the ability of model diversity to enhance the skill of the streamflow forecasts is independent of whether raw or postprocessed single‐model forecasts are used.

Additionally, the *CMI* values for all the different experiments in Figure were recomputed (results not shown) in streamflow space using the approach by Meyer (). Although a theoretical upper bound is not available for this approach, the *CMI* values in streamflow space for the multimodel forecasts tended to be noticeably greater than the values for the single‐model forecasts for most lead times. Moreover, differences in the *CMI* values between the 9‐member and 33‐member multimodel forecasts were only marginal. Thus, the results for the experiments in Figure using *CMI* values computed in both real (streamflow) and Gaussian space, overall, exhibited similar trends. This is again indicative of the ability of model diversity to enhance forecast skill beyond the improvements achievable by ensemble size alone.

In this study, we generated single‐model ensemble streamflow forecasts at short‐ to medium‐range lead times (1–7 days) from three different hydrological models: API, HL‐RDHM, and WRF‐Hydro. These models were selected because they represent different types of hydrological models with varying structures and parameterizations: API is a spatially lumped model; HL‐RDHM is a conceptual, spatially distributed hydrological model; and WRF‐Hydro is a land surface model. By forcing each hydrological model with GEFSRv2 data, single‐model ensemble streamflow forecasts were generated for four nested basins of the US NBSR basin for the warm seasons (May–October) of 2004–2009. The single‐model forecasts were used to generate multimodel forecasts using a new statistical postprocessor, namely, QR‐BMA. QR‐BMA first uses QR to correct systematic biases in the single‐model forecasts and, in a subsequent step, BMA to optimally combine the predictive distributions from the models. To further understand the performance and behavior of the multimodel forecasts, we performed different ensemble streamflow forecast experiments by varying the number of ensemble members, models, and weights used to create the multimodel forecasts.

From the forecast experiments performed, we found that the raw single‐model ensemble streamflow forecasts from both API and HL‐RDHM tended to outperform, in terms of the CRPSS, the forecasts from WRF‐Hydro across lead times and basins. However, after postprocessing the raw single‐model forecasts using QR, we found that the CRPSS performance of the individual models was mostly comparable across lead times and basins. In terms of the multimodel ensemble streamflow forecasts, we found that the implementation of QR‐BMA tended to improve the skill of the forecasts relative to the performance of API, which can be considered here the best performing model in terms of the raw single‐model forecasts. Additionally, we compared the forecasts from QR‐BMA against an equal‐weight experiment, where each model was assigned the same weight, and found that the optimal‐weight forecasts from QR‐BMA outperform the equal‐weight forecasts. The latter was particularly evident at the later lead times (>3 days).

Lastly, we used *CMI* to distinguish the source of the improvements in the multimodel forecasts. Although the adopted *CMI* formalism does not capture all aspects of ensemble forecasts, it allows a robust analysis of whether the skill enhancement from multimodel forecasts is dominated by model diversity or is only due to the reduction of noise associated with the ensemble size. We found that skill enhancements across lead times and basins are largely dominated by model diversity and that increasing the ensemble size has only a small influence on the *CMI* values. This is important because it indicates that in an operational setting the combination of different hydrological models, as opposed to only increasing the ensemble size of a single model, may be an effective approach to improve forecast skill. It also highlights that there is no single model that can be considered best in all forecasting cases; instead, the benefits or strengths of different models can be combined to produce the best forecast. Importantly, the benefits from using different models are, in this case, due not only to the noise reduction associated with the ensemble size but also to the ability of each model to contribute additional information to the forecasts.

We describe here the steps followed to implement the EM algorithm. The description uses the variables and notation previously defined in subsection . To implement the EM algorithm, the latent variable $z_{k,i}$ is introduced, which takes a value of 1 if the $k$th model ensemble is the best prediction at time step $i$ and a value of 0 otherwise. The EM algorithm starts with an initial weight and variance for each model set to

$$w_k = \frac{1}{K} \quad \text{and} \quad \sigma_k^2 = \frac{1}{T}\sum_{i=1}^{T}\left(q_i - f_{k,i}\right)^2,$$

where $T$ is the length of the training period extending over the time steps $i \in [1, T]$. After initializing the weight and variance for each model, the EM algorithm alternates iteratively between an expectation and a maximization step until a convergence criterion is satisfied. In the expectation step, the latent variable is estimated as

$$\hat{z}_{k,i} = \frac{w_k\, g\!\left(q_i \mid f_{k,i}, \sigma_k^2\right)}{\sum_{l=1}^{K} w_l\, g\!\left(q_i \mid f_{l,i}, \sigma_l^2\right)},$$

where $g(\cdot)$ denotes the Gaussian density and the weights and variances take their current values.

In the subsequent maximization step, the values of the weight and variance are updated using the current estimate of $\hat{z}_{k,i}$:

$$w_k = \frac{1}{T}\sum_{i=1}^{T}\hat{z}_{k,i} \quad \text{and} \quad \sigma_k^2 = \frac{\sum_{i=1}^{T}\hat{z}_{k,i}\left(q_i - f_{k,i}\right)^2}{\sum_{i=1}^{T}\hat{z}_{k,i}}.$$

The log likelihood function in equation is then recomputed using the updated weight and variance as

$$l\!\left(\theta_{\text{Iter}}\right) = \sum_{i=1}^{T}\log\!\left[\sum_{k=1}^{K} w_k\, g\!\left(q_i \mid f_{k,i}, \sigma_k^2\right)\right].$$

The expectation and maximization steps are iterated until the improvement in the log likelihood is less than some predefined tolerance, that is, $\left|\, l\!\left(\theta_{\text{Iter}}\right) - l\!\left(\theta_{\text{Iter}-1}\right)\right| < tol$, in this case $tol = 10^{-6}$.

We are thankful to the Editor, Martyn Clark, Associate Editor, Jonathan J. Gourley, and three anonymous reviewers for their comments and suggestions, which helped to improve the overall quality of the manuscript. We acknowledge the funding support provided by the NOAA/NWS through Award NA14NWS4680012 and the computational support provided by the Institute for CyberScience at The Pennsylvania State University. Daily streamflow observation data for the selected forecast stations can be obtained from the USGS (