The predictability of spring onset is assessed using an index of its interannual variability (the “extended spring index” or SI‐x) and output from the North American Multimodel Ensemble reforecast experiment. The input data used to compute SI‐x were treated with a daily joint bias correction approach, and the SI‐x outputs computed from the North American Multimodel Ensemble were postprocessed using an ensemble model output statistics approach—nonhomogeneous Gaussian regression. This approach was also used to quantify the effects of training period length and ensemble size on forecast skill. The lead time for predicting the timing of spring onset is found to be from 10 to 60 days, with the higher end of this range located along a narrow band between 35°N and 45°N in the eastern United States. Using continuous ranked probability scores and skill score (SS) thresholds, this study demonstrates that the ranges of positive predictability of SI‐x fall into two categories: 10–40 and 40–60 days. Using higher skill thresholds (SS equal to 0.1 and 0.2), predictability is confined to a lower range of roughly 10–30 days. Joint bias correction improves the predictive skill for SI‐x relative to the untreated input data set, and with nonhomogeneous Gaussian regression a positive change in the SS is noted in regions where the skill with joint bias correction shows evidence of improvement. These findings suggest that the start of spring might be predictable on intraseasonal time horizons, which in turn could be useful for farmers, growers, and stakeholders making decisions on these time scales.

Variations in the timing of spring onset affect ecosystems, forest fires, drought, pollen, and agriculture (Ault et al., ; Westerling et al., ). Given its importance to human and ecological health, there is a pressing need to characterize the potential predictability of spring onset on seasonal time horizons. In principle, such forecasts could be issued alongside seasonal predictions of more traditional variables like precipitation and temperature (Kirtman et al., ; Mo & Lettenmaier, ; Saha et al., ). However, the predictability of such seasonal transitions has not yet been widely explored.

Forecasting seasonal transitions can extend the usability of forecasts on seasonal time horizons. Characterizing such transitions requires systematic indices that are consistent through space and time, such as the “extended” spring index (SI‐x) of Schwartz et al. () and Ault et al. (). Development of this particular index relied on previous efforts that established a strong relationship between blooming of plants and the spring onset (Cayan et al., ; Schwartz et al., ; Schwartz & Marotz, ) and also linked the interannual variability of spring onset to large‐scale atmospheric patterns and ocean forcing as noted in sea surface temperature (Ault et al., ).

Here we evaluate the potential predictability of spring onset as characterized by the SI‐x (Ault et al., ; Schwartz et al., ). We focus on SI‐x because it integrates temporal and spatial atmospheric patterns of variability across synoptic to intraseasonal scales. As such, the SI‐x serves as a proxy for spring onset across North America, and predicting the timing of this seasonal transition may be critical for anticipating warm‐season events at long lead times. That is, an early spring would lead to different ecological and agricultural risks in summer than a late spring because an early start to the growing season could favor invasive species or certain plant and human pathogens (Monahan et al., ). Specifically, we are interested in quantifying the lead times on which SI‐x can be predicted. In addition, a state‐of‐the‐art ensemble postprocessing technique—nonhomogeneous Gaussian regression (NGR)—is used to assess whether the multimodel ensemble outperforms ensembles from individual models and whether longer reforecast training periods improve postprocessing capacity by enhancing prediction skill.

The SI‐x used in this study was originally developed in Schwartz and Marotz () and Schwartz et al. () and then updated for continental‐scale coverage in Schwartz et al. (). Briefly, it is a temperature‐based index that identifies the day of year (DOY) when key early‐spring phenological events are likely to occur. Its only time‐varying inputs are daily minimum and maximum temperatures, meaning that it can be applied over a wide range of temperate climates to yield a consistent metric of the start of spring at each location across space and over many years. Additional details on the assumptions and limitations are documented elsewhere (e.g., Ault et al., ), and the code for computing the SI‐x is widely available through GitHub. The observational inputs are daily maximum (*T*_{max}) and minimum (*T*_{min}) temperatures at 1° lat/lon spatial resolution.

Forecasts of daily maximum and minimum temperature are obtained from the North American Multimodel Ensemble (NMME) Phase 2 data set (Kirtman et al., ), which includes multiple models and multiple ensemble members from individual models over the period 1981 to present.

We quantify the skill of postprocessed NMME model predictions by comparing them to both climatology and uncorrected model output. To perform this evaluation, we apply two objective metrics that measure forecast skill improvement against a reference prediction: the reduction of variance skill score (SSclim) and the continuous ranked probability score (CRPS; Matheson & Winkler, ) skill score (SScrps). Both of these skill scores are variations of the generalized skill score (SS), defined as the departure of the forecast accuracy metric (*A*) from the reference metric (*A*_{ref}), normalized by the departure of *A*_{ref} from the perfect forecast (*A*_{perf}; Wilks, ):

SS = (*A* − *A*_{ref}) / (*A*_{perf} − *A*_{ref})

The SSclim uses as its accuracy metric *A* the mean squared error (MSE) between the observed (*o*_{k}) and forecasted (*y*_{k}) data. The reference metric *A*_{ref} is the MSE of the climatology (MSE_{clim}), and *A*_{perf} is zero, as a perfect forecast has zero MSE, so that the score reduces to SSclim = 1 − MSE/MSE_{clim}.
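The reduction‐of‐variance skill score can be sketched in a few lines; the following is a minimal illustration with hypothetical SI‐x onset days (the function name and values are ours, not from the study):

```python
def skill_score_clim(forecasts, observations):
    """Reduction-of-variance skill score: SSclim = 1 - MSE / MSE_clim.
    The reference is a climatological forecast (the mean of the observations);
    A_perf = 0 because a perfect forecast has zero MSE."""
    n = len(observations)
    clim = sum(observations) / n
    mse = sum((f - o) ** 2 for f, o in zip(forecasts, observations)) / n
    mse_clim = sum((clim - o) ** 2 for o in observations) / n
    return 1.0 - mse / mse_clim

# Hypothetical SI-x onset days (DOY) at one grid point over 5 years.
obs = [100.0, 104.0, 96.0, 110.0, 90.0]
fcst = [101.0, 103.0, 95.0, 108.0, 92.0]
print(round(skill_score_clim(fcst, obs), 3))  # → 0.953, i.e., beats climatology
```

A forecast that tracks the observations more closely than their long‐term mean yields SSclim > 0; a perfect forecast yields exactly 1.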

The SScrps is defined analogously, with the CRPS as the accuracy metric,

CRPS = ∫_{−∞}^{∞} [*F*(*y*) − *F*_{o}(*y*)]^{2} d*y*,

where *F*(*y*) is the continuous cumulative distribution function (CDF) of the predictand *y*. The term *F*_{o} is the cumulative probability step function, defined as 0 for *y* below the observed value and 1 otherwise.

As SI‐x follows an approximately Gaussian distribution with mean *μ* and variance *σ*^{2} (e.g., Ault et al., ), the CRPS for a given observation *o* can be calculated in closed form as

CRPS = *σ* {*z* [2Φ(*z*) − 1] + 2*ϕ*(*z*) − 1/√π},  with *z* = (*o* − *μ*)/*σ*,

where Φ( ) and *ϕ*( ) are the CDF and PDF, respectively, of the standard Gaussian distribution. This closed form is used when the CRPS evaluates NGR‐based forecasts, for which the predictive distribution is Gaussian. Alternatively, we employ an ensemble version of the CRPS, which operates on the full discrete ensemble. The ensemble CRPS (eCRPS) is based on the alternative formulation (Gneiting & Raftery, )

CRPS(*F*, *y*) = *E*_{F} |*X* − *y*| − (1/2) *E*_{F} |*X* − *X*′|,

where *E*_{F} denotes statistical expectation with respect to the predictive distribution *F*(*x*), and *X* and *X*′ are independent realizations from *F*(*x*). Substituting sample averages from the forecast ensemble for the expectations (Ferro et al., ; Van Schaeybroeck & Vannitsem, ) yields

eCRPS_{t} = (1/*m*) Σ_{j=1}^{m} |*x*_{t, j} − *y*_{t}| − (1/(2*m*^{2})) Σ_{j=1}^{m} Σ_{k=1}^{m} |*x*_{t, j} − *x*_{t, k}|.

For our application, *x*_{t, j} and *x*_{t, k} are raw ensemble members, *m* is the total number of members (50), and *y*_{t} is the observation.
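Both CRPS formulations above can be written compactly; a minimal pure‐Python sketch (function names are ours, not from the paper):

```python
import math

def crps_gaussian(mu, sigma, o):
    """Closed-form CRPS for a Gaussian forecast N(mu, sigma^2) at observation o."""
    z = (o - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # Phi(z)
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # phi(z)
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

def ecrps(members, y):
    """Ensemble CRPS: mean |x_j - y| minus half the mean pairwise member spread."""
    m = len(members)
    term1 = sum(abs(x - y) for x in members) / m
    term2 = sum(abs(xj - xk) for xj in members for xk in members) / (2.0 * m * m)
    return term1 - term2
```

For a degenerate ensemble equal to the observation both scores vanish, and for a large Gaussian‐distributed ensemble `ecrps` approaches `crps_gaussian` evaluated with the ensemble mean and standard deviation.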

A joint bias correction (JBC) technique (e.g., Thrasher et al., ) is applied to remove systematic model errors in both *T*_{max} and *T*_{min} temperature while preserving their covariance. This correction is required because SI‐x is sensitive to the covariance of *T*_{max} and *T*_{min}, and bias correcting variables individually can generate physically unrealistic outcomes (Thrasher et al., ). As temperature variations tend to be normally distributed, we define the joint distribution of daily maximum and minimum temperatures to be bivariate Gaussian (Wilks, ), which is motivated by the high correlation between daily *T*_{max} and *T*_{min}. After fitting the parameters of the joint distribution to gridded observations and NMME temperatures, we follow a quantile remapping approach similar to the one described in Li et al. (, their Figure 1). First, we estimate the quantile of a *T*_{min} value in the forecast CDF and then match this value to the same quantile in the (marginal) observational CDF; a bias corrected value for *T*_{min} is therefore obtained by identifying the appropriate observed *T*_{min} value for that quantile. Next, to bias‐correct *T*_{max}, we condition its CDF on *T*_{min} and then associate conditioned quantiles of simulated *T*_{max} values with observational ones. This procedure yields bias‐corrected values of *T*_{min} and *T*_{max} for every grid point for every day of each year and preserves the covariance structure of *T*_{min} and *T*_{max} in the observations.
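The conditional quantile‐remapping step can be sketched under the stated bivariate‐Gaussian assumption; the function name and parameter layout below are illustrative, not taken from Thrasher et al. or Li et al.:

```python
from statistics import NormalDist

def jbc_pair(tmin_f, tmax_f, fcst_stats, obs_stats):
    """Joint bias correction of one (Tmin, Tmax) forecast pair, assuming both
    the forecast and observed joint distributions are bivariate Gaussian.
    *_stats = (mu_min, sd_min, mu_max, sd_max, rho); hypothetical layout."""
    mu_nf, sd_nf, mu_xf, sd_xf, rho_f = fcst_stats
    mu_no, sd_no, mu_xo, sd_xo, rho_o = obs_stats

    # 1) Marginal quantile mapping for Tmin: find its forecast quantile,
    #    then read off the observed value at the same quantile.
    q_min = NormalDist(mu_nf, sd_nf).cdf(tmin_f)
    tmin_bc = NormalDist(mu_no, sd_no).inv_cdf(q_min)

    # 2) Conditional quantile mapping for Tmax given Tmin. For a bivariate
    #    Gaussian, the conditional mean shifts with Tmin and the conditional
    #    variance shrinks by (1 - rho^2).
    cond_f = NormalDist(mu_xf + rho_f * sd_xf / sd_nf * (tmin_f - mu_nf),
                        sd_xf * (1.0 - rho_f ** 2) ** 0.5)
    q_max = cond_f.cdf(tmax_f)
    cond_o = NormalDist(mu_xo + rho_o * sd_xo / sd_no * (tmin_bc - mu_no),
                        sd_xo * (1.0 - rho_o ** 2) ** 0.5)
    tmax_bc = cond_o.inv_cdf(q_max)
    return tmin_bc, tmax_bc
```

If the forecast and observed distributions coincide the mapping is the identity, and a pure mean bias in the forecast is removed exactly, while the conditioning in step 2 preserves the observed *T*_{min}–*T*_{max} covariance structure.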

In addition to biases, ensemble forecasts have dispersion errors arising from initial‐condition sensitivity and model structural error, among other sources (Wilks, ). However, multimodel ensemble forecasts are amenable to estimating forecast‐uncertainty distributions, which can be used to calibrate these ensembles probabilistically. Here we use the nonhomogeneous Gaussian regression‐ensemble model output statistics (NGR‐EMOS) method (Gneiting et al., ) to postprocess the NMME direct model output forecasts in order to improve SI‐x forecast skill. Under this approach, the forecast‐uncertainty distribution is assumed to be Gaussian, as indicated in equation , which describes the cumulative probability that a future observation *V* will be less than a forecast quantile *q*:

P(*V* ≤ *q*) = Φ[(*q* − (*a* + *b*_{1}*x̄*_{1} + *b*_{2}*x̄*_{2} + *b*_{3}*x̄*_{3} + *b*_{4}*x̄*_{4} + *b*_{5}*x̄*_{5})) / (*c* + *d* *s*^{2})^{1/2}],

where Φ[ ] indicates the evaluation of the standard Gaussian cumulative distribution function, *x̄*_{1}, …, *x̄*_{5} are the ensemble means of the five individual models, *s*^{2} is the ensemble variance, and the parameters *a*, *b*_{1}, *b*_{2}, *b*_{3}, *b*_{4}, *b*_{5}, *c*, and *d* define the adjusted mean and spread. These parameters are estimated by minimizing the CRPS over the *n* training‐period samples.

In this study, we fit the NGR‐EMOS using four training period lengths (15, 20, 25, and 30 years) and five ensemble sizes (10, 20, 30, 40, and 50 members). The training data are kept out of sample (observations to be forecasted are excluded from the training data), and for each configuration the training years and ensemble members used in the fit are drawn randomly, without repetition, from the full set available.
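The NGR‐EMOS fitting step can be sketched in simplified form, using a single predictor (the multimodel ensemble mean) in place of the five per‐model terms, and a coarse grid search standing in for full CRPS minimization; all names and grids are illustrative:

```python
import math

def crps_gaussian(mu, sigma, o):
    """Closed-form CRPS for a Gaussian forecast N(mu, sigma^2)."""
    z = (o - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

def fit_ngr(xbar, s2, y, c_grid=(0.05, 0.25, 1.0, 4.0), d_grid=(0.0, 0.5, 1.0)):
    """Simplified NGR-EMOS fit: predictive distribution N(a + b*xbar, c + d*s2).
    a, b come from ordinary least squares on the training pairs; c, d from
    minimizing the mean CRPS over the training sample (a coarse stand-in for
    full CRPS minimization over all parameters)."""
    n = len(y)
    mx, my = sum(xbar) / n, sum(y) / n
    b = sum((x - mx) * (v - my) for x, v in zip(xbar, y)) / \
        sum((x - mx) ** 2 for x in xbar)
    a = my - b * mx
    best = None
    for c in c_grid:
        for d in d_grid:
            score = sum(crps_gaussian(a + b * x, math.sqrt(c + d * v), o)
                        for x, v, o in zip(xbar, s2, y)) / n
            if best is None or score < best[0]:
                best = (score, c, d)
    return a, b, best[1], best[2]
```

On training data with no residual error, the grid search selects the smallest available predictive variance, since the Gaussian CRPS at a perfectly centered forecast grows linearly with the spread.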

Using mean DOY values, the SI‐x leaf index computed from the NMME is comparable to the observational pattern (Figure ). As in previous studies (e.g., Ault et al., ; Cayan et al., ), a prominent north‐south gradient is present in both the observed data and the NMME ensemble mean. The NMME mean and standard deviation are computed over the multimodel ensemble without distinguishing individual model‐specific distributions. Greater spatial heterogeneity is observed along the western Intermountain Region, which is not fully reproduced by the NMME mean (bottom panel of Figure ). The standard deviations of the SI‐x in Figure show less agreement between the observed pattern and the model simulations. Although the simulations capture the maximum interannual variance in the Pacific Northwest and in the southeastern United States, the NMME overestimates this variability by up to 4 days in the Intermountain Region, west of the Rockies (bottom panel of Figure ). Thus, model biases are more apparent in the standard deviation than in the mean.

We first evaluate the skill of the NMME models in predicting SI‐x without any statistical correction. The ensemble CRPS, computed using equation , shows high values in the Intermountain Region (for January and February) and in the northeastern part of the domain (for March and April; Figure ). High CRPS values are associated with poor model performance, whereas CRPS values are low (indicating good skill) in low‐elevation terrain. However, the raw ensemble forecasts do not outperform the CRPS climatology reference (CRPS [clim]; Figure ), whose maximum value is on the order of 3.5 CRPS units. This climatology reference shows a coast‐to‐coast band around 35°N, which is consistent with the difference in variance between observations and NMME models (Figure ). This result shows that the raw ensemble forecasts are of limited utility, as they exhibit negative skill with respect to climatology (Figure S1).

The previous results did not include the NGR‐EMOS approach, so we applied it to the SI‐x output to evaluate the effects of correcting mean and dispersion errors. We illustrate the multimodel ensemble error dispersion at one grid point for different model initialization times (Figure S2). The temporal evolution from early to late initialization (January to March) reveals a reduction of the error dispersion among the ensemble realizations, information that the NGR‐EMOS can exploit when correcting the final output. Indeed, the NGR‐EMOS analysis (Figure ) reveals an improvement over the raw ensemble data (Figure ) for all initializations. Although correcting data in the Intermountain Region remains a challenge, the lower CRPS values show that the NGR‐EMOS postprocessed forecasts improve on the raw direct model output.

Figure shows the SScrps for the entire set of forecasts, starting on 1 January, 1 February, 1 March, and 1 April for the period from 1981 to 2012. The CRPS [clim] used to compute SScrps is shown in Figure , and an alternative SScrps field (using the untreated NMME CRPS as reference) is shown in Figure to illustrate the added value of the NGR‐EMOS over the untreated data set. A comparison relative to climatology, however, is the fairer metric and hence is used here. In January (Figure ), several regions with improvement of at least 10% are observed in the southeast, the Pacific Northwest, the northeast, and the southwest. Results improve as the seasons progress, as should be expected because the initialization dates approach the onset day. In February, regions along the southern states improve, as indicated by the area with 30–50% positive change (Figure ), which reflects the improvement provided by NGR‐EMOS. The major feature in February is the negative skill, on the order of 20%, in the Intermountain Region, as can be inferred from the analysis of the standard deviation anomaly (Figure ; bottom). In March, the region of improvement expands and migrates north, consistent with what was shown for January. Similar results are observed for the initializations starting in April. A region with positive SScrps change in all months is located below 40°N—Missouri, Illinois, Ohio, Kentucky, West Virginia, and Virginia—where the major improvement is observed during January, February, and March. This is likely because the region coincides with the maximum variability of the SI‐x standard deviation (Figure ), near 85°W, 35°N. This suggests that NGR‐EMOS is able to add value by enhancing good SI‐x individual forecast members in the NMME.

The SScrps evaluates the forecast skill of SI‐x as a percentage of the reference climatology (Figure ). However, to determine how many days in advance SI‐x computed from NMME forecasts can estimate spring onset, the *time dependence* of the skill score needs to be characterized. Figure shows how this additional metric is constructed for a region in the Great Plains (100°W–90°W, 40°N–50°N). First, the SScrps values for every initialization (1 January through 1 May) are calculated using the climatological reference (Figure ). Second, the dependence of SScrps on time is constructed from the different model initialization dates, allowing us to compute the SI‐x predictability range for a given SScrps level of improvement. Logically, predictive skill increases as the initialization date approaches the target date. In the worst case, we expect a forecast at least as good as climatology, that is, SScrps = 0.0. A SScrps value of 0.2 therefore represents a 20% improvement over the reference climatology, and similarly SScrps = 0.10 corresponds to a 10% improvement. In Figure , the solid line is a fitted second‐order polynomial, SScrps = 0.0363*x*^{2} − 0.1310*x* + 0.1599, with *x* in units of months. Thus, using SScrps = 0.20, we estimate *x* = 3.89 months, or day 86 (DOY), from the fitted SScrps time variation (dashed line). In this example, a SI‐x predictability range of about 20 days is obtained from the fitted SScrps versus time relationship, as the climatology for the region is DOY 104. Values of the same order for the SI‐x predictability range are obtained with the alternative SSclim metric (Table ).
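The lead‐time estimate in this example follows from solving the fitted quadratic for the target skill level; a quick check using the coefficients quoted above (the function name is ours):

```python
import math

# Fitted curve for the Great Plains example:
# SScrps(x) = 0.0363 x^2 - 0.1310 x + 0.1599, with x the initialization time in months.
A, B, C = 0.0363, -0.1310, 0.1599

def lead_for_skill(target_ss):
    """Initialization time x (months) at which the fitted SScrps reaches
    target_ss, taking the larger root of the quadratic (skill increases as
    the initialization approaches the onset date)."""
    disc = B * B - 4.0 * A * (C - target_ss)
    return (-B + math.sqrt(disc)) / (2.0 * A)

print(round(lead_for_skill(0.20), 2))  # → 3.89 months, as quoted in the text
```

The same root‐finding applies for any other skill threshold, e.g., SScrps = 0.10, to trace how the predictability range shrinks as the required improvement grows.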

The observed 20‐day forecast skill is in the range of models such as the Climate Forecast System version 2 (Saha et al., ), which might justify using SS equal to 20% as a meaningful benchmark. This 20‐day predictability range is the one that climate services could potentially use, as it represents a 20% improvement (SScrps = 0.20) over a climatological forecast. Because the SScrps characterizes each model's performance, different models have different SScrps values and predictability ranges for the same region, and this information can be used to weight a final product or to eliminate some models in an optimal operational forecast. For example, this weighting is objectively achieved by the *b* parameters in equation , which indicate the contribution of the five models to the best postprocessed SI‐x forecast, with high values identifying the best models and near‐zero values suggesting less useful ones (Table ).

The SI‐x predictability for the continental United States is in the range of 10–60 days for the NGR‐EMOS NMME (Figure a). The SScrps threshold used here is 0.0, that is, forecasts at least as skillful as climatology. We extended the analysis to SScrps = 0.1 (Figure b) and 0.2 (Figure c), which reduces both the temporal range and the geographic extent of high forecast skill. This SI‐x predictability, in the range of 10–60 days for SScrps = 0.0, is confirmed by the behavior of the individual models (Figure S3). The scale bar groups the forecast skill range into low (10–40 days) and high (40–60 days) to highlight the results on intraseasonal and seasonal scales, with the goal of identifying them but without assessing the sources of the better results in these ranges. The relatively low range of 10–40 days is characteristic of the northern Great Plains and part of the Intermountain Region, as suggested by the analysis of the mean and variance shown previously. The high range of 40–60 days appears north of 45°N and marginally in the Intermountain Region. These bands reflect the region of minimum variability in the observed standard deviation (Figure ). The SI‐x predictability range shows a north‐south gradient, a typical characteristic of the SI‐x climatology that reflects the seasonal march from winter to summer.

The multimodel ensemble NMME NGR‐EMOS (Figure ) agrees well with the individual model ensembles (Figure S3), which portray differences in SI‐x forecast skill when applying the TT‐JBC approach. As expected, the spatial pattern of predictability differs among models. Although the Goddard Earth Observing System Model, Version 5 shows the lowest range of predictability in the Intermountain Region, it shows better improvement after applying the TT‐JBC, which is also true for CanCM3 and CanCM4 near 45°N. The results with the TT‐JBC are consistent with the biased temperature (Figure S3; left panel), and in addition they show that the TT‐JBC adds value in regions that already have considerable forecast skill. The improvement occurs mainly in the Canadian (CanCM3 and CanCM4) and the National Oceanic and Atmospheric Administration (Goddard Earth Observing System Model, Version 5) models. As abrupt warming events in the SI‐x calculation are modeled with daily maximum and minimum temperature (Schwartz & Marotz, ), the JBC applied to both temperatures can influence the final corrected SI‐x. Therefore, the bias correction applied to the individual models improves the forecast skill; however, it does not outperform the NGR‐EMOS (Figure ).

Using a multimodel ensemble NGR‐EMOS (Figure ), the results for the five models can be summarized in two major points. First, there is signal in the range of intraseasonal variability (10–60 days) in the NMME models when compared to climatology (SScrps = 0.0), meaning the multimodel ensemble outperforms climatology up to 2 months before the beginning of spring. These changes are localized in two regions: the “corn belt” along 40°N (Nebraska, Iowa, Minnesota, and Illinois) and the Intermountain Region. Second, when using higher thresholds (SScrps = {0.1, 0.2}), this range is reduced by 10 days (with some exceptions in small localized regions), with a smaller reduction in the Great Plains. Thus, a large range is still found in the vicinity of the Corn Belt region, which looks promising for potential agriculture‐related applications.

In addition, for different training periods and numbers of ensemble members (Figure ), the CRPS shows two important aspects to consider when applying the NGR‐EMOS to SI‐x related products: (1) a long training period significantly increases the predictability score (e.g., from 15 to 30 years; top panel of Figure ) and (2) a large number of ensemble members only marginally improves the CRPS skill score (e.g., from 10 to 20 members; bottom panel of Figure ). Although the forecast skill improves significantly where skill is low, it does not improve much where skill is already high. For example, the initialization in January (1 month) shows a smooth transition from 1.7 with 10 members to 1.5 with 20 ensemble members. When the skill is good (e.g., initialization in March, at 3 months), increasing the number of ensemble members does not add much value to the forecast skill.

A spatial description of the SS after applying the NGR‐EMOS reveals a significant improvement in the Corn Belt region (Figure ). It portrays the positive effect of NGR‐EMOS for the four initializations (January–April) using the SScrps. When we compare the model ensemble NMME mean with the NGR‐EMOS, the improvement is on the order of 50 percentage points (from 10% to 80% SS), and the spatial extent of this improvement expands significantly relative to the untreated results. For example, in February and March the Corn Belt region sees an important improvement, which is verified by the similar results obtained with two other EMOS methods: logistic regression and Gaussian ensemble dressing (results not shown). Therefore, EMOS adds significant value to the SI‐x forecast products at all initialization stages.

This study assesses the seasonal predictability of spring onset using an index previously calibrated with plant phenology and temperature variability (SI‐x; Ault et al., , and references therein). A set of NMME models was treated with a daily JBC approach and an ensemble model output statistics approach. Our findings show that the untreated input data are of limited use, as they exhibit negative skill relative to climatology. The selected training period length and ensemble size also affect SI‐x forecast skill: long training periods and a large number of ensemble members improve the SI‐x predictability skill score. Because SI‐x integrates temporal variations in the atmosphere at a continental scale, it helps us identify regions where maximum skill occurs over North America. This study provides insight into how reliable climate‐based information helps to evaluate the lead times at which spring onset can be forecast skillfully.

The results presented here show that the best predictability for spring onset is in the range from 10 to 60 days, located along a narrow band between 35°N and 45°N. Using a forecast threshold of SScrps = 0.0, the range of predictability falls into two categories: 10–40 and 40–60 days. Using higher thresholds (SScrps = 0.1 and 0.2), predictability shows a lower range with values around 10–30 days (Figure ). The 40–60 day time horizon is notable, as it extends well beyond the 10‐day barrier inherent to most meteorological forecasts. It is, however, broadly consistent with Koster et al. (), who found some skill in air temperature predictions on similar time scales, though the motivation and metrics of that study were different from ours. The region with the best skill is in the core of the continent along 40°N, where the major variability of the SI‐x is observed. This region is relevant because of its vicinity to the Corn Belt states, which have great impacts on the local and global economy, and because it is where early and late spring variability is significant (Shubert et al., ). Becker et al. () also show that the NMME performs well in the central United States, which further supports our interpretation, although the regions with better skill found in this study are narrow and localized.

Future work could include assessment of the atmospheric processes linked to early versus late spring onset. The dominant driver is potentially the Pacific jet stream transition from winter into spring, because of its impact on western North America. Indices have been constructed that characterize the position, structure, and strength of the Pacific jet stream (Newman & Sardeshmukh, ) as it migrates north, splits, and weakens each spring. The timing of this breakdown, which typically occurs between mid‐March and mid‐April, can therefore be characterized on the intraseasonal range. The range of predictability found in this study supports the existence of driving mechanisms at this scale that might be orchestrating these ranges of predictability skill.

Finally, our findings suggest that there is potential spring onset forecast skill in NMME products, but sophisticated postprocessing is necessary to achieve that potential. We show how the skill of NMME models in forecasting spring onset in North America is improved with two postprocessing techniques—the JBC and the nonhomogeneous Gaussian regression EMOS. The JBC outperforms the biased‐temperature SI‐x product, and the improvement mainly occurs in the Canadian and National Oceanic and Atmospheric Administration models; however, it does not outperform the multimodel ensemble NGR‐EMOS. Using NGR‐EMOS, a significant positive change in the SS is noted in regions where the skill of the raw NMME ensemble data is low. The consensus of both techniques shows that regions of better predictability can be expanded (e.g., the Corn Belt region). Therefore, adding these corrections would be important for any future operational use.

This material is based on work supported partially by USDA grant 1010630 (project NYC‐124439); additional funding was provided by NSF grant 1702697. The authors thank the North American Multimodel Ensemble (NMME) project for providing the data set. The NMME project is supported by NOAA, NSF, NASA, and DOE. NMME data were obtained from