This is an open access article under the terms of the Creative Commons Attribution‐NonCommercial‐NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non‐commercial and no modifications or adaptations are made.

The Disturbance storm time (Dst) index has been widely used as a proxy for the ring current intensity, and therefore as a measure of geomagnetic activity. It is derived from measurements by four ground magnetometers in the geomagnetic equatorial region. We present a new model for predicting Dst with a lead time between 1 and 6 hr. The model is first developed using a Gated Recurrent Unit (GRU) network that is trained on solar wind parameters. The uncertainty of the Dst model is then estimated with the Accurate and Reliable Uncertainty Estimate (ACCRUE) method (Camporeale & Carè, 2021).

Geomagnetic storms pose one of the most severe space weather risks to our space‐borne and ground‐based electronic instruments, such as global navigation satellite systems and radio transmission systems. The Disturbance storm time (Dst) index is one of the most widely used geomagnetic storm indicators. This study presents an innovative multi‐fidelity boosted neural network method to forecast Dst 1–6 hr ahead. The new method improves the performance of the predictions by estimating their uncertainties.

A new multi‐hour ahead Dst prediction model developed from solar wind observations using Gated Recurrent Unit networks is proposed

The uncertainty of the proposed Dst model is estimated by applying the Accurate and Reliable Uncertainty Estimate method

A multi‐fidelity method is developed to boost the performance of the model

The Disturbance storm time (Dst) is a geomagnetic index related to the perturbation of the geomagnetic field at low latitudes (Burton et al., 1975; Rostoker, 1972). Currently, Dst is defined by using geomagnetic field measurements from four equatorial ground magnetometers: Hermanus, Honolulu, San Juan and Kakioka (Sugiura & Kamei, 1991). Dst has been widely used for monitoring geomagnetic storms, which pose one of the most severe space weather risks to our space‐borne and ground‐based electronic instruments, such as global navigation satellite systems and radio transmission systems (Z. Li et al., 2021; Wan et al., 2021).

A short‐term prediction of Dst is produced operationally at the NOAA Space Weather Prediction Center by means of a physics‐based model (the Space Weather Modeling Framework developed at the University of Michigan; Tóth et al., 2005). A longer lead‐time operational Dst forecast is provided by Space Environment Technologies, using the Anemomilos model (Tobiska et al., 2013). Dst is also used as an essential input for forecasting thermosphere mass density and ionospheric parameters, and to parameterize several empirical models. The following is a non‐exhaustive list of models that use Dst as one of their inputs: O'Brien and Moldwin (2003) empirically estimate the location of the plasmapause; Agapitov et al. (2015) derive a statistical model for the lower band chorus distribution; S. Li et al. (2016) estimate the ionospheric global electron content storm‐time response; Boardsen et al. (2000) derive an empirical model of the high‐latitude magnetopause; and Zhao et al. (2018) derive a model of radiation belt electron pitch‐angle distributions.

A large amount of literature has been devoted to Dst prediction, notably using empirical and machine learning techniques (Camporeale, 2019). Lundstedt et al. (2002) first implemented a multi‐layer perceptron neural network to forecast Dst 1 hr ahead using interplanetary magnetic field (IMF) data. Several researchers presented models to extend the Dst forecast up to 6 hr in advance (Bala & Reiff, 2012; Lazzús et al., 2017; Saiz et al., 2008). A Gaussian Process model was introduced by Chandorkar et al. (2017) and combined with a long short‐term memory (LSTM) architecture in Gruet et al. (2018) to provide probabilistic predictions up to 6 hr in advance. An ensemble learning algorithm was applied by Xu et al. (2020) for the same purpose. Laperre et al. (2020) evaluated the performance of an LSTM model that uses a Dynamic Time Warping metric as the cost function.

In this study, we first train a model using a machine learning (ML) technique called Gated Recurrent Unit (GRU), which is a flavor of recurrent neural network (RNN; Hu & Zhang, 2018), to forecast Dst during strong storm periods (Dst < −100 nT) 1–6 hr ahead. The corresponding uncertainties associated with the predictions, which we refer to as ΔDst, are then estimated by using the ACCRUE method (Camporeale & Carè, 2021), also based on a GRU network. The multi‐fidelity boosting method proposed here works as follows. The accurate estimate of the uncertainty, ΔDst, for a given trained model, can inform us about the input conditions under which the model does not perform well. Hence, we can identify a subset of the original training set that can be used for training a different, independent model. Such a strategy can be iterated a number of times. The final result is a collection of models, each working very well for a specific subset of input conditions (hence the built‐in multi‐fidelity). The crucial point, though, is that since each Dst model comes with its own estimate of uncertainty, that information can be used as a weighting factor when optimally combining (in a linear fashion) a large number of models.

The paper is divided as follows. Section 2 introduces the data used for this study, the criteria to define storm events, and the corresponding time periods covered. The methodology, including the designed uncertainty quantification‐based machine learning architecture and developed multi‐fidelity boosting method, is also described. Section 3 presents the results of the developed model, and discusses the advantages of the proposed model. Finally, in Section 4, we draw conclusions and make final remarks about future directions.

Previous studies have shown that various solar wind parameters have some predictive power with respect to future Dst values. Gruet et al. (2018) selected the proton density *n*, the solar wind velocity *V*, IMF |*B*|, and IMF *B*_{z}. The same variables are considered in this study. They convey information both about incoming high speed streams and Coronal Mass Ejections that directly influence Dst. In addition, we use the clock and dipole tilt angles, as defined in Weimer (2013). All variables are shown in Table 1. The models will be trained by using this variable set defined between *t* − *delay* − 6 and *t* − *delay*, where *t* is the time stamp of a prediction, and *delay* ranges between 1 and 6 hr.
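The windowing described above can be sketched as follows. This is an illustrative helper, not the authors' exact pipeline; the array names, the hourly cadence, and the 6-hr span follow the description in the text.

```python
import numpy as np

def build_windows(features, dst, delay, span=6):
    """Build (X, y) pairs: X covers the interval t - delay - span to
    t - delay at 1-hr cadence, y is Dst at time t.  `features` is an
    (n_hours, n_vars) array of the Table 1 variables; `dst` is the
    matching (n_hours,) Dst series.  Shapes and names are illustrative."""
    X, y = [], []
    for t in range(delay + span, len(dst)):
        X.append(features[t - delay - span : t - delay])  # 6-hr history window
        y.append(dst[t])
    return np.asarray(X), np.asarray(y)
```

With hourly data, each training sample is thus a 6 x n_vars matrix of past solar wind conditions paired with a single future Dst value.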

*Note*. The most recent 6 hr of each variable is used for training.

The historical Dst index is available at 1‐hr cadence from the NASA OMNI database. Figure 1 displays the Dst index in the period 1999–2017. The model is trained and tested on storm events with a Dst peak smaller than −100 nT, shown by magenta crosses in Figure 1. Overall, 66 such storm periods are selected for this study.

Consistently defining a storm time window is a difficult task, since it usually includes a pre‐storm period, a main phase and a recovery phase (Gonzalez et al., 1994). In this study, we define a storm event by looking for the nearest positive Dst values immediately before and after each negative peak, and then extending the time window by a 24‐hr buffer. An example of such a definition of a storm event is shown in Figure 2, where the Dst peak is observed on 20 November 2003. The storm period is defined as ranging between 07 November 2003 and 30 November 2003. With this procedure we make sure that the negative Dst peaks do not always occur at the same time within the chosen storm‐time window; hence, the neural network cannot simply memorize the timing of the peak. The average duration of the selected storm events is approximately 15 days. All 66 selected storms are listed in Table 2.
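The event definition above can be summarized in a few lines. This is a simplified sketch on an hourly index array; the 24-hr buffer and the search for the nearest positive Dst values follow the text, while the function name and edge handling are our own.

```python
import numpy as np

def storm_window(dst, peak_idx, buffer_hr=24):
    """Given hourly Dst and the index of a negative peak (< -100 nT),
    return (start, end) indices of the storm window: the nearest hours
    with positive Dst before and after the peak, each extended by a
    24-hr buffer.  Simplified sketch of the selection described above."""
    before = np.where(dst[:peak_idx] > 0)[0]
    after = np.where(dst[peak_idx:] > 0)[0] + peak_idx
    start = (before[-1] if before.size else 0) - buffer_hr
    end = (after[0] if after.size else len(dst) - 1) + buffer_hr
    return max(start, 0), min(end, len(dst) - 1)
```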

Gated Recurrent Unit (GRU) networks are among the most widely used recurrent neural networks. Similar to LSTM, GRU was proposed as a solution to the short‐term memory and vanishing‐gradient problems (Mikolov et al., 2010). In most scenarios, the performance of GRU is on par with LSTM, while being computationally more efficient thanks to a less complex structure (Kaiser & Sutskever, 2015). The architecture of GRU is shown in Figure 3. All variables in Table 1 at a certain time *t* are used to form *x*_{t}; *X* is a time series with a 6‐hr span and a 1‐hr time step, and *Y* is the corresponding Dst at a fixed time delay of 1–6 hr.
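To make the gating structure concrete, the standard GRU update, reset, and candidate equations can be written out directly. The following numpy sketch uses small random toy weights, not the trained model; it only illustrates how a 6-hr input window is folded into a hidden state.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, W, U, b):
    """One GRU step.  W, U, b hold the update (z), reset (r), and
    candidate (c) parameters as dicts of arrays with shapes
    W[k]: (n_hid, n_in), U[k]: (n_hid, n_hid), b[k]: (n_hid,)."""
    z = sigmoid(W["z"] @ x + U["z"] @ h + b["z"])         # update gate
    r = sigmoid(W["r"] @ x + U["r"] @ h + b["r"])         # reset gate
    c = np.tanh(W["c"] @ x + U["c"] @ (r * h) + b["c"])   # candidate state
    return (1.0 - z) * h + z * c                          # new hidden state

def gru_forward(X, n_hid):
    """Run a GRU over a (span, n_vars) window, e.g. the 6-hr input of
    Table 1, with illustrative random weights (not trained)."""
    rng = np.random.default_rng(0)
    n_in = X.shape[1]
    W = {k: 0.1 * rng.standard_normal((n_hid, n_in)) for k in "zrc"}
    U = {k: 0.1 * rng.standard_normal((n_hid, n_hid)) for k in "zrc"}
    b = {k: np.zeros(n_hid) for k in "zrc"}
    h = np.zeros(n_hid)
    for x in X:
        h = gru_step(x, h, W, U, b)
    return h
```

In practice a deep learning framework handles these recurrences; a final dense layer (omitted here) maps the hidden state to the scalar Dst prediction.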

The ACCRUE method was proposed by Camporeale and Carè (2021) for assigning uncertainties to single‐point predictions generated by a deterministic model that outputs a continuous variable. The problem of estimating the optimal uncertainty is set up as an optimization problem, where the ACCRUE score is defined as a weighted combination of the continuous ranked probability score (CRPS) and the Reliability Score (RS). In ΔDst modeling, the ACCRUE score is used as the cost function for the same GRU architecture utilized for the corresponding Dst model. The main assumption of the ACCRUE method is that the residuals of the underlying deterministic model (i.e., the difference between the model output and the ground truth) are normally distributed. The model output is taken as the mean of a Gaussian distribution, and its uncertainty is assigned by estimating the corresponding standard deviation. Note that the ACCRUE model is trained in a self‐supervised regression mode, because the target (i.e., the standard deviation) is unknown, and it enters the minimized loss function as a non‐trivial, yet analytical, function.
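Under the Gaussian assumption, the CRPS ingredient of the ACCRUE score has a well-known closed form. The sketch below evaluates it for one prediction; the Reliability Score term and the weighting between the two are omitted, so this is only one piece of the full ACCRUE cost function.

```python
import math

def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS of a Gaussian N(mu, sigma^2) against observation y:
    CRPS = sigma * [z (2 Phi(z) - 1) + 2 phi(z) - 1 / sqrt(pi)],
    with z = (y - mu) / sigma.  Only one ingredient of the ACCRUE score;
    the Reliability Score term is omitted in this sketch."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)       # phi(z)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))              # Phi(z)
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))
```

Averaged over the training set, this score is minimized with respect to the predicted sigma, which is how the standard deviation enters the loss as an analytical function despite having no ground-truth target.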

If the ACCRUE model has successfully converged, the distribution of the so‐called z‐scores, that is, the algebraic errors divided by the corresponding standard deviations, collected over the whole training set, should also be normal. A graphical technique called the quantile‐quantile plot (Q‐Q plot) (Wilk & Gnanadesikan, 1968) can then be applied to investigate whether the ACCRUE results indeed return a normal distribution for the z‐scores.
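A Q-Q check of this kind needs only the sorted z-scores and matching standard-normal quantiles. The helper below, using only the Python standard library, is a minimal stand-in for a full Q-Q plot; the plotting-position convention (i + 0.5)/n is one common choice, not necessarily the one used in the paper.

```python
import statistics

def qq_points(z_scores):
    """Pair empirical quantiles of the z-scores with standard-normal
    quantiles.  Points lying near the diagonal indicate that the
    ACCRUE-estimated uncertainties are well calibrated."""
    zs = sorted(z_scores)
    n = len(zs)
    nd = statistics.NormalDist()
    theo = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]  # plotting positions
    return list(zip(theo, zs))
```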

The multi‐fidelity boosting strategy is used to develop a collection of models, each trained on a subset of the original training set. For a given model, the ACCRUE method is applied in order to estimate the associated uncertainties (trained on the errors evaluated over the whole training set) (Camporeale & Carè, 2021). All training samples are then sorted according to a chosen accuracy criterion. The best 10% of samples are discarded in subsequent iterations. Meanwhile, a new subset is defined by using the worst 50% of samples as the next training set, which is then used to train an independent model. Such a strategy can be iterated a number of times to form an ensemble of multi‐fidelity models. Finally, by exploiting the knowledge about the uncertainties associated with each model, the final predictions, which are expected to outperform the predictions of each single model, are defined as a weighted combination of each ensemble member. Three criteria are tested in this study. They are readily defined for each sample in the training set: the absolute error, the standard deviation as estimated from ACCRUE (*σ*), and their z‐score, that is, the absolute error divided by *σ*. For predictions *N* hours ahead, we use the final model previously developed for *N* − 1 hr as our baseline model, from which the iterative method starts, as shown in Figure 5.
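One boosting iteration reduces to an index-level selection over the per-sample criterion values. The sketch below follows the percentages stated above; the function name and the choice to return both index sets are ours.

```python
import numpy as np

def next_training_subset(criterion):
    """One boosting iteration: rank training samples by a per-sample
    criterion (e.g. z-score = |error| / sigma).  The worst 50% become
    the next training set; the best 10% are discarded from the pool
    used in subsequent iterations."""
    order = np.argsort(criterion)       # ascending: best-predicted first
    n = len(order)
    pool = order[n // 10 :]             # drop best 10% from future iterations
    worst_half = order[-(n // 2):]      # worst 50% -> next training set
    return worst_half, pool
```

Iterating this selection yields the ensemble of models, each specialized on the input conditions that previous members handled poorly.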

An example of the proposed multi‐fidelity boosting method for a given storm event is shown in Figure 4. Here, “model 0” is a simple persistence model. It is clear that “model 1” (obtained after one iteration of the boosting method) performs better during the pre‐storm and recovery phases, while “model 0” performs better during the main phase of the storm. The combination of these two models can outperform each of them if their uncertainties are well estimated. The final predictions are given by Equation 1, where *pred*_{f} and *pred*_{i} denote the final predictions and the predictions from the *i*th model, and *σ*_{i} is the uncertainty corresponding to the *i*th model.
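Since Equation 1 is not reproduced in this excerpt, the sketch below assumes the standard inverse-variance form, which is the usual optimal linear combination when each member's uncertainty is a Gaussian standard deviation; the paper's Equation 1 defines the exact weighting.

```python
import numpy as np

def combine_predictions(preds, sigmas):
    """Weight each ensemble member's prediction by the inverse of its
    ACCRUE-estimated variance.  Inverse-variance weighting is assumed
    here as a plausible instance of the weighted combination described
    in the text."""
    preds = np.asarray(preds, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    w = 1.0 / sigmas**2
    return np.sum(w * preds, axis=0) / np.sum(w, axis=0)
```

A member with small estimated uncertainty dominates the combination, which is exactly the mechanism by which the persistence model wins during the main phase and the boosted model during the recovery phase.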

The accuracy of the models developed with the different criteria is shown in Table 3, where each column reports the RMSE value calculated on a different portion of the validation set, conditioned on the observed value of Dst (we use four intervals: (−*∞*, −100], (−100, −50], (−50, *∞*), and all). 20% of all samples are reserved for this validation and are not used for training. The model developed with the z‐score criterion significantly outperforms the other two within all Dst ranges; hence, the z‐score is adopted in the following study. All training samples are separated into three subsets for training, early stopping, and the aforementioned criterion validation. The training set is shuffled before training in order to make sure it is homogeneously distributed.

*Note*. Most accurate method is marked in bold.

In order to precisely assess the accuracy of a model, it is important that the performance metrics are computed on a test set independent from the training set, so‐called unseen data. In this way, we ensure that the machine learning algorithm actually learns meaningful patterns and does not merely memorize the training data. A “leave one out” technique is adopted here. That is a K‐fold cross validation taken to its logical extreme, with K equal to N, the number of selected storm cases. That means that the proposed model is trained on all the data except for one storm window, and a prediction is made for that left‐out storm. The procedure is repeated N times. Finally, the metrics are computed as averages over the N models. In this study, each of the 66 storm windows reported in Table 2 constitutes a fold. Root‐mean‐square error (RMSE) is used as the main metric to assess the accuracy of the proposed model. The continuous Dst prediction can also be transformed into a binary label upon defining a threshold, that is, −100 nT in this study. In this way we can use standard metrics for binary classification such as the True Skill Statistic (TSS) and the Matthews correlation coefficient (MCC) (Camporeale et al., 2020). The MCC is a reliable statistical rate that produces a high score only if the prediction performs well in all four confusion‐matrix categories, True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN), proportionally to both the size of the positive class and the size of the negative class in the data set (Baldi et al., 2000). The TSS combines the hit rate and the false alarm rate (sensitivity plus specificity minus one) and should be as close as possible to 1. These metrics have shown advantages over the F1 score and accuracy in binary classification evaluation (Chicco & Jurman, 2020).
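Both skill scores follow directly from the confusion-matrix counts. The two helpers below implement the standard definitions; only the zero-denominator convention for MCC is our own choice.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts.
    Returns 0.0 when the denominator vanishes (a common convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def tss(tp, fp, tn, fn):
    """True Skill Statistic: sensitivity + specificity - 1,
    i.e. hit rate minus false alarm rate."""
    return tp / (tp + fn) + tn / (tn + fp) - 1.0
```

A perfect forecast scores 1 on both metrics, while a no-skill forecast scores 0, regardless of how imbalanced the storm/non-storm classes are.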

In this section, the proposed boosting model is first compared against a single GRU model. The 2003 and 2022 Halloween storm events are selected to exhibit the advantages of this method. It should be noted that there are two storm events around 2003 Halloween (October 30 and November 20). The latter one is chosen because it has a larger negative peak (hereinafter we call it “the 2003‐Halloween storm”). Note that this baseline model is trained with a random initialization and without performing a hyper‐parameter optimization.

Figure 6 shows a statistical analysis of the proposed model for 1–6 hr time delays. In each panel, each dot represents the mean RMSE computed over all test samples with Dst smaller than a given Dst threshold (on the horizontal axis). One line shows the average RMSE of the GRU results, with the gray bar giving the corresponding RMSE standard deviation; a second line and the purple bar show the corresponding average RMSE and standard deviation for the developed boosting method.

It is clear that the boosting method significantly outperforms a single GRU method for all time delays. The corresponding Q‐Q plots are shown in Figure 7. If the z‐scores estimated by the ACCRUE method were distributed as a perfect Gaussian, the Q‐Q lines (in blue) would perfectly overlap the diagonal line (in orange). In Figure 7, the Q‐Q lines agree well with the diagonal lines in the training and validation sets. The agreement is slightly worse in the test set, because the distribution of the test set might differ from that of the training and validation sets.

In this section we analyze the model results for two case studies: (a) The 2003‐Halloween storm, the biggest storm in the past 20 years, from 2003‐11‐19 to 2003‐11‐23; and, (b) The 2022‐Halloween storm, a recent storm with Dst peak <−100 nT, between 2022‐11‐06 and 2022‐11‐08. Results for all other storms are reported in the link

There are two reasons for the under‐prediction of the model in Figure 8 during the main phase of the storm. One has to do with the Dst residuals not being exactly normally distributed. Their distribution is skewed and that is not accounted for in the proposed multi‐fidelity boosted method. A second reason has to do with the fact that the storm under consideration is very rare and not appropriately represented in the training set.

It is clear that the GRU model better predicts the timing of the peak, especially the transition between the main phase and the recovery phase, while the peak Dst is consistently better predicted by the persistence model. This demonstrates that the multi‐fidelity boosting model can take advantage of both models, and further supports the findings presented in Figure 6.

Figure 9 is similar to Figure 8, except it shows the 2022‐Halloween event. We can see that the proposed model, similar to the one for the 2003‐Halloween storm, also better predicts Dst during the main phase of this 2022‐Halloween storm than the baseline model. However, the time shift of the peak is much more significant here. This is because the model emphasizes storm events by putting more weights on larger negative Dst. A website (

We have developed a multi‐fidelity boosting model to predict Dst during geomagnetic storm periods, 1–6 hr ahead. Sixty‐six storm events were selected from a long‐span (20‐year) historical data set, between 2000‐01‐01 and 2020‐01‐01. One of the crucial points of this work is that an innovative multi‐fidelity boosting method is developed to enhance a simpler prediction algorithm (here based on GRU networks), especially during the “super” storms defined in Section 2. We have shown that the proposed model provides a good RMSE (8.22 nT) for predicting Dst up to 3 hr ahead during strong storm periods. The prediction worsens with longer lead time (RMSE of 13.54 nT at 6 hr) because of the increasing errors of the baseline model. We have also discussed how the model performs during the 2003 and 2022 Halloween storms, as case studies. The Dst peak is well captured by the developed model.

This project has been developed with support from the National Aeronautics and Space Administration under Grants 80NSSC20K1580 and 80NSSC20K1275.

We thank OMNIWeb for providing the solar wind data (