This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.

In recent years, ensemble modeling has been widely employed in space weather to estimate uncertainties in forecasts. Here we focus on the ensemble modeling of Coronal Mass Ejection (CME) arrival times and arrival velocities using a drag‐based model, which is well‐suited for this purpose due to its simplicity and low computational cost. Although ensemble techniques have previously been applied to the drag‐based model, it is still not clear how to best determine distributions for its input parameters, namely the drag parameter and the solar wind speed. The aim of this work is to evaluate statistical distributions for these model parameters starting from a list of past CME‐ICME events. We employ LASCO coronagraph observations to measure initial CME position and speed, and in situ data to associate them with an arrival date and arrival speed. For each event we ran a statistical procedure to invert the model equations, producing parameter distributions as output. Our results indicate that the distributions employed in previous works were appropriately selected, even though they were based on restricted samples and heuristic considerations. On the other hand, possible refinements to the current method are also identified, such as the dependence of the drag parameter distribution on the CME being accelerated or decelerated by the solar wind, which deserve further investigation.

Coronal Mass Ejections (CMEs), consisting of huge expulsions of plasma and magnetic field from the solar corona, are important for space weather. Among several forecasting techniques, the drag‐based model is widely used to compute CME transit time and impact speed, by describing the CME propagation in interplanetary space as that of a solid body moving in an external fluid. In recent years, this model has been improved via a new approach in which statistical distributions of the input quantities are introduced to evaluate uncertainties of the resulting forecasts. Unfortunately, such distributions for the model parameters are still not very well known from experimental observations and it is hard to obtain them from theoretical models. In this work, we built an empirical method to evaluate such statistical distributions using a list of past CME‐ICME events. New findings that emerged from this analysis, such as a dependence of the drag parameter on whether the interplanetary coronal mass ejection is accelerated or decelerated, deserve further investigation.

A new CME‐ICME database is created from observations from multiple sources

Statistical distributions for the model parameters are determined using the database and the drag‐based model equations, taking uncertainties into account

The probabilistic drag‐based model is updated with the new distributions and validated with a different Coronal Mass Ejections database

Coronal Mass Ejections (CMEs; P. Chen, 2011; J. Chen, 2017; Webb & Howard, 2012) are expulsions of plasma and magnetic field from the Sun. Their interplanetary counterparts, ICMEs (Kilpua et al., 2017), are among the main drivers of space weather (SWx; Pulkkinen, 2007; Schwenn, 2006) with impact on the whole heliosphere, and they are responsible for the strongest variations in the near‐Earth solar wind conditions (e.g., Buzulukova, 2017; Schwenn et al., 2005; Tsurutani et al., 1988). These variations trigger a number of effects on space‐borne and ground‐based technologies, either directly or via major geomagnetic storms.

The accurate prediction of the arrival and characteristics of ICMEs at Earth, and more recently elsewhere in the heliosphere, is a necessity to minimize the impact on the existing and future assets, and has always been a primary goal of the SWx forecasting (see e.g., Berrilli et al., 2017; Daglis, 2001; Iwai et al., 2019; Schrijver & Siscoe, 2010; Veettil et al., 2019). Note that in this work the terms CME and ICME refer to the plasma and magnetic field structure expelled from the Sun, without the shock that precedes it.

Forecasting the Time of Arrival (ToA) and Speed of Arrival (SoA) of an ICME more than an hour ahead is nevertheless a complicated task, since it requires understanding the propagation of a poorly determined plasma and magnetic field structure into an essentially undetermined interplanetary environment. A recent review (Vourlidas et al., 2019) assessed the current state of ICME ToA and SoA forecasting algorithms by surveying the recent literature. While there is quite a scatter in the results and perhaps a bias correlated to the sample size used, the authors report that the most recent forecasting methods have a mean absolute error (MAE) close to 10 hr, which is similar for empirical models, simplified physics or full MHD models (such as ENLIL or EUHFORIA; Odstrcil, 2003; Pomoell & Poedts, 2018), and machine learning based approaches (see Camporeale, 2019, for a list of such approaches). Vourlidas et al. (2019) conclude that a number of factors (physical, observational and modeling) are limiting the performance of all approaches, and that the ToA accuracy in particular is limited by the quality of the currently available data. These complications arise from the difficulty of remotely evaluating the properties of the CME at launch with the present‐day instrumentation and from the impossibility of properly characterizing the status of the inner heliosphere. Therefore, in order for the forecast to be useful, it should cope with this lack of information and provide an estimate of its intrinsic uncertainty (Owens et al., 2020). This can be achieved by empirical methods through the use of statistical relationships established between past CME measured parameters and ICME characteristics (Gopalswamy et al., 2001; Kilpua et al., 2012), or by numerical MHD‐based models through ensembles of runs that model the same ICME with different initial conditions representing the inherent uncertainties (Cash et al., 2015; Emmons et al., 2013; Mays et al., 2015).

The main shortcoming of the statistical approach is that it “treats all events the same and neglects the contextual information and knowledge that certain situations are inherently more predictable than others” (Owens et al., 2020). The other approach (i.e., treat every CME as a different case) has to cope with a relatively large parameter space to explore, and the relatively long time needed for the computation of each simulation run (of the order of tens of minutes on high‐performance systems).

A possible solution is to adopt simplified, kinematic models, for instance, assuming a simplified solar wind propagation, a simple CME geometry, and a hydrodynamic‐like ICME‐solar wind interaction. Among these models, the drag‐based model (DBM; Cargill, 2004; Vršnak et al., 2013) is among the most used, and it can be run in large ensembles in a few seconds on an average laptop (Amerstorfer et al., 2018; Dumbović et al., 2018; Kay & Gopalswamy, 2018; Napoletano et al., 2018). The DBM requires CME properties as input: the CME launch time, initial speed, direction and angular width.

To model the interaction of the ICME with the background solar wind the DBM needs only the solar wind speed and the value of the drag parameter *γ*, which determines the interaction between the solar wind and the CME. The quantities used to describe the CME are retrieved from observations that have associated measurement errors. They can be used in ensemble models assuming Gaussian probability distribution functions (PDFs). The solar wind speed and the drag parameter are instead drawn from *a‐priori* PDFs, modeled from empirical PDFs built from past data sets of CME and associated ICME characteristics measured at liftoff and at Lagrange point L1, respectively. The outputs of the DBM ensemble model are the PDFs of ToA and SoA at a target location. From these, we can estimate the most probable ToA and SoA, and their associated prediction uncertainties (e.g., Del Moro et al., 2019; Piersanti et al., 2020).

In recent years there have been several interesting developments of DBM‐based tools to forecast ICME arrival time and speed. When compared against other forecasting methods, as in Vourlidas et al. (2019), their performance is comparable to, and sometimes even better than, that of MHD‐based methods (Vršnak et al., 2014). Augmenting the results presented in Table 1 of Vourlidas et al. (2019) with Table 2 of Dumbovic et al. (2021), we can estimate that the typical MAE of the DBM methods is of the order of 10 hr and the typical error on the SoA is around 50 km/s. It should be stressed that the performance of each model has been computed on a different ICME data set, which limits the value of this comparison. A community effort to create a common benchmark is underway (e.g., Verbeke et al., 2019), but it is not yet a widespread standard.

Recently, Kay et al. (2020) used the DBM ensemble approach to answer an implicit question in Vourlidas et al. (2019): “How much should we improve our knowledge of the parameters to improve the ToA predictions beyond the present limit MAE of about 10 hr?” Their conclusion is that the most critical parameter is the CME speed at liftoff, and efforts should be spent to achieve a better and more homogeneous determination of the CME speed via coronagraphic imaging or other means. Kay et al. (2020) also explored the sensitivity of the ToA to the *γ* parameter, following Dumbović et al. (2018), who postulated a symmetric distribution of values (*γ* = 0.1 ± 0.05 × 10^{−7} km^{−1}). However, Napoletano et al. (2018) (hereafter Paper I) built an empirical *γ* PDF which is asymmetrical and not compatible with a Gaussian shape. Since the *γ* parameter incorporates much of the physics of the ICME‐wind interaction and its precise value is poorly understood (Kay et al., 2020), we think it deserves further investigation. In particular, the empirical *γ* PDF in Paper I was built from a limited database of ICMEs, and is therefore the quantity that most needs a more robust definition. A better assessment of the values of *γ* may be obtained, in principle, with detailed knowledge of the ICME kinematics. In order to evaluate the DBM input parameters, Žic et al. (2015) proposed to apply a least squares fitting procedure to the ICME interplanetary tracking derived from STEREO coronagraphic and heliospheric image observations. In addition, Rollett et al. (2016) developed a similar approach and employed an improved model for the geometrical shape of the CME front, showing that the extrapolation of the CME dynamics based on real‐time tracking can further reduce the mean error of the predicted arrival time to 6.4 ± 5.3 hr and of the impact speed to 16 ± 53 km/s.
Although promising, such an approach is currently not feasible for real‐time forecasting, as it requires a dedicated heliospheric observatory for real‐time ICME tracking. Therefore model parameters have to be constrained by two single observations (at liftoff and Earth), and a better assessment of these parameters depends on the availability of more data and an improved database. The aim of this paper is to use a large number of ICMEs to build a new empirical PDF for the model parameters and to find a suitable functional form to model it.

The paper is structured as follows. In Section 2 we describe the data and the methods employed to build the ICME database used to obtain the new *γ* PDF. Section 3 contains the technical description of the methods used to retrieve the *γ* PDF. Section 4 contains the actual results and a validation of those results against the ICME list from Paouris and Mavromichalaki (2017a). In Section 5 we discuss the results and provide a synopsis of the findings of this paper. The ICME catalog built for this analysis, together with a tool for the data set visualization and the software employed for the ensemble simulation through the probabilistic drag based model, is available from

As mentioned above, the DBM needs input values to output a ToA and a SoA at a target location. For the purpose of this work, we compare observed ToA and SoA at Earth (or L1) position against those computed by the probabilistic drag‐based model (PDBM) presented in Paper I. We therefore need a database associating Earth ToA and SoA of an ICME to the kinematic characteristics of the corresponding CME. In particular, we need measures of the position *r*_{0} and the speed *v*_{0} at time *t*_{0} of the CME front, and the solar wind speed (see A3 in Appendix A).

Also, to obtain a homogeneous database for a consistent approach, we re‐computed some of the CME kinematic properties (CME leading edge initial speed and acceleration) with a standardized method. Last, to have an assessment of the method as close as possible to the actual operational performance, we made use of the same methods and algorithms presently implemented in the real‐time ICME ToA forecast running at

Our analysis uses a database connecting the kinematic parameters of the CME at launch time and the information about the arrival time and speed of the related ICME. The databases already available are not suited for our analysis, but we can merge and complement the information from three different sources available online to create a new database for this purpose.

We start from the list of near‐Earth interplanetary coronal mass ejections (ICMEs) compiled by Richardson and Cane (2010), hereafter R&C, 2010, who maintain the catalog using data from the OMNI database (Goddard Space Flight Center, GSFC,

We make use also of information about the CME liftoff retrieved from the SOHO LASCO CME catalog (

We associate each of the remaining 247 ICME in the R&C list with an entry in the CDAW SOHO/LASCO CME catalog having the same onset times, *R*_{Sun}), computed from LASCO C2/C3 coronagraph images. These values represent the position of the CME front on the plane‐of‐sky (hereafter POS), and need to be de‐projected to obtain the true radial distance and speed of the CME. The equation to de‐project the position *r*_{0} is from (Gopalswamy et al., 2010). It implies a model for the CME shape (i.e., the CME front expansion is considered completely radial, as in Model A of Figure 9 in Schwenn et al., 2005) and requires the CME angular width and the location of the CME source on the solar disk.

The details about the algorithms to associate a CME to its source on the solar disk and to compute its de‐projected speed at *r*_{0} = 20*R*_{Sun} and the associated error are described in Appendix A.

An interesting new geometrical technique for de‐projecting the CME speed has been very recently published (Paouris, Vourlidas, et al., 2021). We foresee the possibility of using such a method in future work. After these procedures our data set was reduced to 214 CME‐ICME pairs. In Figure 1 we report the histogram of the yearly number of CMEs comprising this data set, which spans about two solar cycles (23 and 24). As is known (e.g., Lamy et al., 2019; Webb & Howard, 2012), the number of CMEs depends on the solar cycle phase, and the total number of CMEs per cycle is clearly different for cycles 23 and 24. In Figure 2 we show a summary of some quantities reported in Table 1, and in particular we focus on those variables that will be used for the inversion procedure (see Section 3.2).

We employ the drag‐based model (DBM; Cargill, 2004; Vršnak et al., 2013) to forecast arrival time and impact speed of a CME at Earth. This model assumes that, from a certain distance from the Sun, the CME dynamics is governed only by its interaction with the ambient solar wind. By employing a fluid dynamic analogy, it is assumed that the force depends on the square of the relative velocity of the CME to the ambient solar wind flow, so that the equation for the CME radial acceleration reads:

$$\ddot{r} = -\gamma(r)\,\bigl(\dot{r} - w(r)\bigr)\,\bigl|\dot{r} - w(r)\bigr| \tag{1}$$

where *γ*(*r*) is the so‐called drag parameter, representing the interaction efficiency between the CME and the solar wind, *w*(*r*) is the solar wind speed, and *r* is the distance from the Sun. A reasonable approximation beyond 20 solar radii is that of constant *γ* and *w* throughout the whole ICME propagation (Cargill, 2004; Vršnak et al., 2013). We point out that this is a significant assumption, as in reality the ambient solar wind speed and the mechanisms of interaction between the solar wind and the CME structure are not constant. We refer the reader to, for example, Temmer et al. (2012), Rollett et al. (2014), and Žic et al. (2015) for studies investigating CMEs evolving in different drag regimes and variable solar wind speed, or to Piersanti et al. (2020) for an example of the probabilistic approach to the drag‐based model applied with a variable solar wind. Under such assumptions, Equation 1 can be solved analytically (Vršnak et al., 2013) for the heliospheric distance and the ICME speed as functions of time:

$$r(t) = \pm\frac{1}{\gamma}\ln\bigl[1 \pm \gamma\,(v_0 - w)\,t\bigr] + w\,t + r_0 \tag{2}$$

$$v(t) = \frac{v_0 - w}{1 \pm \gamma\,(v_0 - w)\,t} + w \tag{3}$$

where the ± sign depends on the sign of *v*_{0} − *w*: the + sign is taken for decelerated CMEs (*v*_{0} − *w* > 0), while the − sign holds for accelerated ones (*v*_{0} − *w* < 0), so that the argument of the logarithm is always 1 + *γ*|*v*_{0} − *w*|*t*. Given the initial conditions *r*_{0}, *v*_{0} and model parameters *γ* and *w*, these equations can be employed to compute the ToA and the SoA at a target located at a chosen heliocentric distance.
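As an illustration, the analytical DBM solution with constant *γ* and *w* can be evaluated in a few lines of Python. The sketch below is not the authors' implementation: the function names, the solar radius value and the bisection search for the transit time are assumptions made here. The sign convention is handled through |*v*_{0} − *w*|, so that the CME speed always relaxes toward the solar wind speed.

```python
import math

R_SUN_KM = 695_700.0  # solar radius in km (assumed value)


def dbm_state(t, r0, v0, gamma, w):
    """Return (r, v) at time t [s] for constant gamma [km^-1] and w [km/s].

    r0 is in km, v0 in km/s.
    """
    dv = v0 - w
    if dv == 0.0:
        return r0 + w * t, w
    sign = 1.0 if dv > 0 else -1.0        # +1 decelerated, -1 accelerated
    denom = 1.0 + gamma * abs(dv) * t     # always >= 1
    r = sign / gamma * math.log(denom) + w * t + r0
    v = dv / denom + w
    return r, v


def dbm_arrival(r0, v0, gamma, w, r1, t_max=30 * 86400.0, tol=1.0):
    """Transit time [s] and arrival speed [km/s] at distance r1 [km].

    r(t) grows monotonically, so r(t) = r1 is solved by bisection.
    """
    lo, hi = 0.0, t_max
    if dbm_state(hi, r0, v0, gamma, w)[0] < r1:
        raise ValueError("target not reached within t_max")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if dbm_state(mid, r0, v0, gamma, w)[0] < r1:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    return t, dbm_state(t, r0, v0, gamma, w)[1]


if __name__ == "__main__":
    # Hypothetical fast CME: 1000 km/s at 20 solar radii, 400 km/s wind.
    T, v1 = dbm_arrival(20 * R_SUN_KM, 1000.0, 0.2e-7, 400.0, 215 * R_SUN_KM)
    print(f"transit time {T / 3600:.1f} hr, arrival speed {v1:.0f} km/s")
```

The input values in the usage example are purely illustrative and are not taken from the database.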

The DBM Equations 2 and 3 can be used to forecast the travel time *T* and the impact speed *v*_{1} of an ICME at a given position *r*_{1}. Conversely, if *T* and *v*_{1} are known, these equations can be inverted leaving *γ* and *w* as unknown values, as in Vršnak et al. (2013):

$$(r_1 - r_0 - w\,T)\,(v_0 - v_1) = (v_0 - w)\,(v_1 - w)\,T\,\ln\frac{v_0 - w}{v_1 - w} \tag{4}$$

$$\gamma = \pm\frac{v_0 - v_1}{(v_0 - w)\,(v_1 - w)\,T} \tag{5}$$

where the sign in Equation 5 is chosen so that *γ* > 0. Equation 4 is implicit in *w*, but can be solved numerically and its solution is then employed to compute *γ* through the second one. Therefore, for a given *T*, *v*_{0}, *v*_{1}, we have *γ* = *f*(*w*). This dependence can be seen in the joint PDFs presented below, where for each CME there is a ridge of joint *w*, *γ* values.
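A minimal sketch of this inversion is given below. It is an illustration rather than the procedure actually used for the database: the implicit relation for *w* is rewritten as the distance travelled in time *T* by a drag‐based solution that starts at *v*_{0} and ends at *v*_{1}, and the root is searched by bisection. The bracketing interval (`w_min`, `w_max`) and all function names are assumptions made here.

```python
import math


def travelled_distance(w, v0, v1, T):
    """Distance [km] covered in time T [s] by a DBM solution going from v0 to v1.

    Obtained by eliminating gamma from the position and speed solutions.
    """
    a, b = v0 - w, v1 - w
    return w * T + T * a * b / (v0 - v1) * math.log(a / b)


def invert_dbm(r0, r1, v0, v1, T, w_min=100.0, w_max=1500.0, tol=1e-6):
    """Return (gamma [km^-1], w [km/s]), or None if no root lies in the bracket."""
    if v0 == v1:
        return None
    eps = 1e-6
    if v0 > v1:            # decelerated CME: the wind must be slower than v1
        lo, hi = w_min, v1 - eps
    else:                  # accelerated CME: the wind must be faster than v1
        lo, hi = v1 + eps, w_max
    if lo >= hi:
        return None
    target = r1 - r0
    f_lo = travelled_distance(lo, v0, v1, T) - target
    f_hi = travelled_distance(hi, v0, v1, T) - target
    if f_lo * f_hi > 0:    # no sign change: inputs incompatible with the model
        return None
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = travelled_distance(mid, v0, v1, T) - target
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    w = 0.5 * (lo + hi)
    gamma = abs(v0 - v1) / ((v0 - w) * (v1 - w) * T)
    return gamma, w
```

Returning `None` when the bracket shows no sign change mirrors the situation in which the measured values admit no solution for *w*.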

In principle, using the entries of the database presented in Section 2, we could compute the *γ* and *w* values for every ICME and add them to the database. In practice, due to the intrinsic errors of the *v*_{0}, *v*_{1} and *T* parameters, or due to the fact that sometimes the DBM simply does not accurately represent the ICME motion (e.g., when *w* = constant is not a viable approximation), Equations 4 and 5 have no solution for 75 out of the 214 events of our database. We therefore used an inversion method that takes into consideration the experimental uncertainty of the input quantities during the inversion procedure. The two parameters *r*_{0} and *r*_{1} have no associated errors, since we set them at 20 and 215 solar radii, respectively. For *v*_{0}, *v*_{1} and *T*, we modeled the PDFs with normal distributions centered at the respective measured or estimated values and with standard deviations corresponding to their uncertainties. For *v*_{0}, the mean is defined by the CME de‐projected speed, and the standard deviation *σ* by its associated error (rows 13 and 14 in Table 1, respectively). For *v*_{1}, the mean is defined by the ICME measured arrival speed (row 5 in Table 1), and *σ* by an assumed measurement error of 10%. For *T*, the mean is defined by the difference between the measured arrival time at Earth and the estimated passage at 20 *R*_{Sun} (row 6 in Table 1), and *σ* by the error on the liftoff time obtained from the de‐projection procedure (row 7 in Table 1).

From the average values and their PDFs, the inversion method generates *N* random samples of [*r*_{0}, *r*_{1}, *v*_{0}, *v*_{1}, *T*] per ICME and feeds those to Equations 4 and 5. We run this inversion procedure with *N* = 5,000 for each of the 214 events in the ICME database. In about 50% of all the generated cases the solutions *γ* and *w* do not exist due to incompatibility of the randomly generated input values, that is, such values do not allow for a solution of the implicit Equation 4.

Also, we take into consideration for the following analysis only those solutions where 10^{−8} km^{−1} < *γ* < 10^{−6} km^{−1}. We choose this range of magnitudes considering Equation 2 of Vršnak et al. (2013) and following the same order of magnitude reasoning therein, which poses a limit on the realistic values of *γ*.
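The sampling and filtering steps described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the code used for the analysis: the helper `solve_gamma_w`, the search bracket for *w* and the event values in the usage example are hypothetical, while the Gaussian sampling, the 10% error on *v*_{1}, *N* = 5,000 and the accepted *γ* range follow the text.

```python
import math
import random

R_SUN_KM = 695_700.0                     # assumed solar radius in km
DIST_KM = (215 - 20) * R_SUN_KM          # fixed r1 - r0


def solve_gamma_w(dist, v0, v1, T, w_lo=100.0, w_hi=1500.0, eps=1e-6):
    """Solve the inverted DBM equations by bisection; None if no solution."""
    def excess(w):
        a, b = v0 - w, v1 - w
        return w * T + T * a * b / (v0 - v1) * math.log(a / b) - dist

    if v0 == v1:
        return None
    lo, hi = (w_lo, v1 - eps) if v0 > v1 else (v1 + eps, w_hi)
    if lo >= hi or excess(lo) * excess(hi) > 0:
        return None
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if excess(lo) * excess(mid) <= 0:
            hi = mid
        else:
            lo = mid
    w = 0.5 * (lo + hi)
    return abs(v0 - v1) / ((v0 - w) * (v1 - w) * T), w


def monte_carlo_inversion(v0, sig_v0, v1, T, sig_T, n=5000, seed=0):
    """Return the accepted (gamma, w) pairs for one CME-ICME event."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n):
        v0_s = rng.gauss(v0, sig_v0)
        v1_s = rng.gauss(v1, 0.10 * v1)   # assumed 10% error on arrival speed
        T_s = rng.gauss(T, sig_T)
        if min(v0_s, v1_s, T_s) <= 0:
            continue
        sol = solve_gamma_w(DIST_KM, v0_s, v1_s, T_s)
        if sol is None:                   # sampled inputs admit no solution
            continue
        gamma, w = sol
        if 1e-8 < gamma < 1e-6:           # keep only realistic drag values
            accepted.append((gamma, w))
    return accepted


if __name__ == "__main__":
    # Hypothetical event: speeds in km/s, times in seconds.
    pairs = monte_carlo_inversion(1000.0, 100.0, 480.0, 2.35e5, 7200.0)
    print(len(pairs), "accepted inversions out of 5000")
```

Collecting the accepted pairs of all events yields the sample from which joint and marginal distributions of *γ* and *w* can be built.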

The inversion procedure was successful for 210 out of 214 events, thus providing a statistical distribution for the solar wind speed *w* and drag parameter *γ*. The whole sample consists of 519,857 inversions. In the upper‐left panel of Figure 3, we show the joint distribution *γ* – *w*. As already stated in Section 3.2, in the joint distribution we can still identify a few *w*–*γ* ridges generated by single CMEs, but the plane is populated enough to extract the properties of these PDFs and compare them with the PDFs used in Paper I. Also, as a consequence of the random extraction of the initial speed and solar wind, we can draw two more joint PDFs, separating the accelerated Δ*v* = *v*_{0}–*w* < 0 (18,416 extractions) and decelerated Δ*v* > 0 (501,441 extractions) ICMEs. These two joint PDFs *γ* − Δ*v* are shown respectively in the central‐left and lower‐left panels of Figure 3.

From the joint distributions shown in Figure 3, we can extract the marginal distributions for the drag parameter *γ*. We compare this empirical PDF with the reference lognormal function used in Paper I to model the *γ* PDF. The plot in the upper panel of Figure 3 shows that the histogram retrieved from the whole data set has a shape which is not compatible with the reference PDF (in blue). The fit of the histogram with a lognormal function (red dashed) retrieves *μ* = −0.83 and *σ* = 1.26, against the *μ* = −0.70 and *σ* = 1.01 of the reference PDF. It is worth noting that we also tried fitting other function types to the PDFs, in particular exponentials, power laws, and truncated power laws, but none of those gave a better fit than the lognormal.
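For readers who wish to reproduce a lognormal description of a *γ* sample, a simple sketch follows. Note that it swaps in a different technique from the histogram fit described above: *μ* and *σ* are estimated as the mean and standard deviation of ln *γ*, which is the maximum‐likelihood estimate for an untruncated lognormal. Expressing *γ* in units of 10^{−7} km^{−1} is an assumption made here, as are the function names and the synthetic sample.

```python
import math
import random
import statistics


def lognormal_pdf(x, mu, sigma):
    """Lognormal probability density at x > 0."""
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


def fit_lognormal(samples):
    """Estimate (mu, sigma) from the logarithms of positive samples."""
    logs = [math.log(s) for s in samples]
    return statistics.fmean(logs), statistics.pstdev(logs)


if __name__ == "__main__":
    # Synthetic gamma values (assumed units of 1e-7 km^-1), for illustration only.
    rng = random.Random(1)
    gammas = [rng.lognormvariate(-0.83, 1.26) for _ in range(20000)]
    mu, sigma = fit_lognormal(gammas)
    print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")
```

Because the accepted *γ* values are restricted to a finite range, a histogram fit and this estimate need not coincide exactly on real data.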

As above, we can divide our data set into accelerated and decelerated ICMEs (central‐left and lower‐left panels of Figure 3). The two histograms look quite different and are fitted by lognormal functions (red dashed) with significantly different parameters. In particular, from the fit to the histogram from the decelerated ICMEs, we retrieve *μ* = −0.85 and *σ* = 1.25. Since this histogram contains more than 96% of the total samples, its parameters are close to those obtained from the whole data set, but also more compatible with the values of Paper I.

From the fit to the histogram from the accelerated ICMEs, we retrieve *μ* = 0.40 and *σ* = 1.18. The accelerated CMEs are represented by a lognormal distribution with a significantly higher mean value.

From these results, we define new *γ* parameter distribution functions for accelerated and decelerated cases. Similarly, we can extract the marginal distribution of the solar wind values obtained from all the inversion procedures (black line in Figure 4). We fitted this distribution with the sum of two Gaussian functions (blue and red lines), whose parameters are reported in Table 2. It is straightforward to interpret these two Gaussian functions in terms of slow and fast solar wind. We therefore redefine the parameters to generate the PDFs for the solar wind employed in Paper I with the new values reported in Table 2.
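The two‐Gaussian description of the solar wind speed can be sketched as a mixture model. The numerical parameters below are placeholders chosen only for illustration and are not the values of Table 2; the function names are likewise assumptions.

```python
import math
import random

# Placeholder parameters (weight, mean [km/s], sigma [km/s]); NOT the Table 2 values.
SLOW_WIND = (0.7, 370.0, 45.0)
FAST_WIND = (0.3, 580.0, 60.0)


def gaussian(x, mean, sigma):
    """Normal probability density."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def wind_pdf(w, components=(SLOW_WIND, FAST_WIND)):
    """Sum of weighted Gaussians evaluated at wind speed w."""
    return sum(weight * gaussian(w, mean, sigma)
               for weight, mean, sigma in components)


def sample_wind(rng, components=(SLOW_WIND, FAST_WIND)):
    """Draw one wind speed: pick a component by weight, then sample it."""
    weights = [c[0] for c in components]
    _, mean, sigma = rng.choices(components, weights=weights, k=1)[0]
    return rng.gauss(mean, sigma)


if __name__ == "__main__":
    rng = random.Random(0)
    speeds = [sample_wind(rng) for _ in range(5)]
    print([round(s) for s in speeds])
```

Sampling by first choosing the slow or fast component keeps the interpretation of each draw as belonging to one of the two wind regimes.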

In order to test the new solar wind and drag parameter distributions and compare the performance of the PDBM with that obtained with the old distributions (Paper I), we used our ICME list to compare the predictions with the observed values. The algorithm maps a large sample of initial conditions to the corresponding transit time and arrival speed through the DBM Equations 2 and 3, which also makes it possible to assess the forecast uncertainty from the output distributions. For each event, we start by randomly extracting an initial speed *v*_{0} and a value for the solar wind speed *w* from their respective normal distributions. If *v*_{0} < *w* (*v*_{0} > *w*) the event is an accelerating (decelerating) CME, and a value for the drag parameter *γ* is randomly generated from the corresponding log‐normal distribution from Section 4.2. The computed CME transit time *T*_{c} and impact speed at 1 AU *v*_{c} are then obtained through the DBM equations with this set of initial conditions and parameters. For each event, a statistical distribution for ToA and SoA results from repeating this procedure a large number of times. The mean values 〈*T*_{c}〉 and 〈*v*_{c}〉 of such distributions, together with their standard deviations, are taken as representative of the forecast for each event.
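The ensemble procedure just described can be sketched as follows. This is a minimal illustration, not the released PDBM software. The lognormal parameters are the fitted values quoted in the text (*μ* = 0.40, *σ* = 1.18 for accelerated and *μ* = −0.85, *σ* = 1.25 for decelerated cases), under the assumption that *γ* is expressed in units of 10^{−7} km^{−1}; the single‐Gaussian wind passed by the caller, the function names and the event values are hypothetical.

```python
import math
import random
import statistics

R_SUN_KM = 695_700.0                  # assumed solar radius in km
DIST_KM = (215 - 20) * R_SUN_KM       # from 20 to 215 solar radii
GAMMA_UNIT = 1e-7                     # km^-1, assumed unit of the lognormal fits
LOGNORMAL = {"accelerated": (0.40, 1.18), "decelerated": (-0.85, 1.25)}


def travelled(t, v0, gamma, w):
    """Distance [km] covered after t seconds for constant gamma and w."""
    dv = v0 - w
    if dv == 0.0:
        return w * t
    return math.copysign(1.0, dv) / gamma * math.log1p(gamma * abs(dv) * t) + w * t


def arrival(dist, v0, gamma, w):
    """Transit time [s] and arrival speed [km/s], by bisection on the distance."""
    lo, hi = dist / max(v0, w), dist / min(v0, w)   # speed stays between v0 and w
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if travelled(mid, v0, gamma, w) < dist:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    return t, (v0 - w) / (1.0 + gamma * abs(v0 - w) * t) + w


def pdbm_forecast(v0, sig_v0, w_mean, w_sig, n=2000, seed=0):
    """Return ((mean, std) of transit time [hr], (mean, std) of speed [km/s])."""
    rng = random.Random(seed)
    times, speeds = [], []
    for _ in range(n):
        v0_s, w_s = rng.gauss(v0, sig_v0), rng.gauss(w_mean, w_sig)
        if v0_s <= 0 or w_s <= 0:
            continue
        regime = "accelerated" if v0_s < w_s else "decelerated"
        gamma = GAMMA_UNIT * rng.lognormvariate(*LOGNORMAL[regime])
        t, v = arrival(DIST_KM, v0_s, gamma, w_s)
        times.append(t / 3600.0)
        speeds.append(v)
    return ((statistics.fmean(times), statistics.stdev(times)),
            (statistics.fmean(speeds), statistics.stdev(speeds)))


if __name__ == "__main__":
    (t_mean, t_std), (v_mean, v_std) = pdbm_forecast(1000.0, 100.0, 400.0, 50.0)
    print(f"ToA {t_mean:.1f} +/- {t_std:.1f} hr, SoA {v_mean:.0f} +/- {v_std:.0f} km/s")
```

The mean and standard deviation of the two output samples play the role of the forecast and its uncertainty for the event.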

Figure 5 shows the histograms of the difference between transit time and arrival speed comparing the old PDBM and the new one on this set of events. Relevant performance indicators are collected in Table 3. Figure 6 shows the plots of the PDBM prediction versus the observed values of the ICME transit time and impact speed. For 74% of events, the observed ToA falls within the standard deviation of the predicted one, and for the arrival speed this occurs for 90% of the events. This may be an indication of a possible overestimation of the error on the forecasted ToA, related to the errors on the input velocities.
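The kind of indicators used in this comparison can be computed as in the sketch below. The mean error, the mean absolute error and the fraction of events whose observed value falls within one standard deviation of the prediction are shown; whether these match the exact definitions adopted in Table 3 is an assumption, and the function name and sample numbers are hypothetical.

```python
def forecast_scores(observed, predicted, sigma):
    """Mean error, mean absolute error and 1-sigma coverage of a forecast set."""
    errors = [p - o for o, p in zip(observed, predicted)]
    n = len(errors)
    return {
        "ME": sum(errors) / n,
        "MAE": sum(abs(e) for e in errors) / n,
        "within_1sigma": sum(abs(e) <= s for e, s in zip(errors, sigma)) / n,
    }


if __name__ == "__main__":
    # Hypothetical transit times in hours, for illustration only.
    obs = [50.0, 60.0, 70.0, 80.0]
    pred = [55.0, 58.0, 90.0, 80.0]
    std = [10.0, 10.0, 10.0, 10.0]
    print(forecast_scores(obs, pred, std))
```

For Gaussian forecast errors with well‐calibrated uncertainties, the one‐sigma coverage would be expected to be close to 68%, which gives a reference against which the reported percentages can be read.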

To test the validity of the new PDBM distributions on a list of past ICME events which is independent of the one described in Section 2, we employed a list of 100 events from the Paouris and Mavromichalaki (2017a) list, obtained after excluding 92 ICMEs common to the two databases. Figure 7 shows the histograms of the difference between transit time and arrival speed comparing the old PDBM and the new one on this set of events. Relevant performance indicators are collected in Table 4. Figure 8 shows the plots of the PDBM prediction versus the observed values of the ICME transit time and impact speed. For 65% of events, the observed ToA falls within the standard deviation of the predicted one, while for the arrival speed this occurs for 78% of the events.

We compiled a list of CME‐ICME pairs with a reliable association between their remote observations and in situ signatures. To evaluate the CME initial speed and launch time, we applied a polynomial fit procedure to coronagraph data. We then employed an algorithm to establish the most suitable type of solar wind accompanying the CME. We built this database by using the automated methods for CME characterization that are used for a CME detection and forecast service. Therefore, the uncertainties reported in the database should be representative of the errors in real‐time use.

We employed this database to retrieve the parameters *γ* and *w* in the DBM equations. In this procedure, we took into account the uncertainty on the observations, mapping the input PDFs into the PDFs for the model parameters. The robust statistics, granted both by the larger ICME database with altogether 214 ICMEs and by the Monte‐Carlo‐like inversion method, produced a joint PDF populated enough to allow us to verify the PDFs proposed in Napoletano et al. (2018).

The empirical solar wind *w* PDF has been modeled using the sum of two Gaussian functions, and given their parameters we interpreted these as representative of the slow and fast solar wind distributions.

Similarly, we verified that a Lognormal function is suitable for fitting the empirical PDF for the drag parameter *γ*. Although the Lognormal is a long‐tailed function, we can define the average value ^{−1}. Also, the fit parameters appear to be quite close to those of Paper I. It is worth noting here that the Drag Based Ensemble Model (DBEM) implementation by Dumbović et al. (2018) uses as *γ* PDF a Gaussian function with *μ* = 1.0 × 10^{−8} km^{−1} and *σ* = 0.5 × 10^{−8} km^{−1}, that is 10 times smaller than the value we found. Similarly to the present study, in a more recent paper (Calogovic et al., 2021), the same group applied a reverse modeling procedure with the DBEM aimed at finding optimal values for the DBM parameters. Their results showed that, to lower the MAE and ME of the predictions, the drag parameter needs a median value about three times larger (*γ* = 0.32 × 10^{−7} km^{−1}) and a more extended range of values (*σ* = 0.7 × 10^{−8} km^{−1}) than those used in the previous version of the DBEM. Interestingly, in the study of Rollett et al. (2016), over the list of 21 ICMEs tracked by heliospheric imagers, the average value of the fitted drag parameter is also generally larger than that of the previous models (≈0.4 × 10^{−7} km^{−1}, even excluding 4 cases which yielded unrealistically high values for *γ*), still following the same trend. Furthermore, a recent paper by Paouris, Čalogović, et al. (2021) found that the DBEM needs a value of the drag parameter between 2.1 and 4.8 times larger to model a set of CMEs. The large statistics of the inverted DBM parameters also allowed us to try to separate the PDFs for those CMEs which are accelerated or decelerated by the solar wind. Although the statistics for the accelerated CMEs is much reduced, we found evidence that the *γ* PDFs are significantly different. In particular, the accelerated CMEs seem to experience, on average, a larger value of *γ* and to equalize their speed to that of the solar wind more rapidly.

We therefore introduce a refinement of the PDBM with respect to that presented in Paper I, using different *γ* PDFs for accelerated and decelerated ICMEs. It is worth explaining why we chose to look for this separation. As the drag‐based model describes the CME propagation taking as a reference the motion of a solid body in a fluid stream, a different value for the drag parameter may be expected if such a body does not present the same shape to a fluid coming from the rear (accelerated CME) and to a fluid coming from the front (decelerated CME). We suggest that this may be the case, as ICMEs are typically depicted as curved flux tubes. Our finding that the accelerated CMEs experience a higher drag than the decelerated ones is in accordance with such a picture, in which we expect a higher drag due to the fluid piling up at the rear of accelerated ICMEs, and a lower drag for decelerated ones, which undergo a smoother solar wind flow on their edges. Interestingly, some results from Vršnak et al. (2008) may lead to similar conclusions. In that work, the authors investigated several relationships between the CME dynamics and the CME mass and found a correlation between the latter and the initial speed, concluding that since slower CMEs tend to have lower mass, larger values of *γ* are expected, as the drag parameter is related to the inverse of the CME mass (refer to the equations in Vršnak et al., 2013).

The updated method shows an improvement on predicted ToA average, both in the test against the initial database (Section 4.3) and in the validation against Paouris' CME list (Section 4.4), although the performances obtained with the old and new PDBM implementations are comparable and well within the error bars.

It is worth noting that the performance of our model is in agreement with the results of Vourlidas et al. (2019), who investigated the relation between the size of the database and the ToA MAE for several forecasting methods. Apparently, a ToA MAE in the range 10–15 hr is the current limit on the performance obtained by almost all the methods for ICME forecasting, including numerical models. Our interpretation is that this limit is set by both the lack of knowledge about the actual state of the interplanetary medium and the large uncertainties on the CME initial properties.

This triggers two considerations. First, as long as the input data have such large errors, there will probably be little gain in tuning the model performance without taking those input errors into consideration. This, of course, also affects the new PDBM implementation we are proposing in this work. Second, it is important to test and validate the forecast procedures (especially those suited for real‐time implementation) using a standard database of CMEs, in order to allow the comparison of the performance of models under the very same conditions. To this purpose, we think that data sets such as the one presented in this work and that of Paouris and Mavromichalaki (2017a) will be of benefit to the CME modeling community when comparing the performance of different models and methods.

Lastly, it is worth stressing that the joint PDFs in Figure 3 show a non‐linear correlation between *γ* and *w*: while we did not make use of this information in our work, the use of a joint PDF for the parameter extraction would reduce the parameter space by one degree of freedom. This approach deserves further investigation, and it will be treated in future work.

Within the Appendix we provide more information about the methods employed to build the CME database.

To find the most probable CME source on the Sun, we employed a source finding algorithm that makes use of HEK (Heliophysics Event Knowledgebase; Hurlburt et al., 2010) to query for solar features (Active Regions, Solar Flares, Filament Eruptions) that may have been the source of the CME and that lie within an area *A* and a time span Δ*t* compatible with the CME launch parameters. The time span Δ*t* is defined by an estimate of the time and duration of the CME liftoff obtained from LASCO images. The search area *A* is the whole solar sector defined by the CME POS angle and the angular width *W* of the CME if *W* < 180° (normal and partial halo CMEs). In the case 180° ≤ *W* < 270° (half halo CME), the same sector is limited to 800″ from the disk center. In the case *W* ≥ 270° (full halo CME), *A* is the central part of the solar disk, within 600″ from the disk center. If multiple possible sources are retrieved by the query, the position of the CME source is the weighted average of the retrieved feature positions, with active regions weighted by the larger of 0.1 and their associated *R* value (Schrijver, 2007), flares weighted by 25, and filament eruptions weighted by 500. In those cases where no potential feature is retrieved by the query, the default source position is computed as the intersection between the CME POS angle vector and a circle centered on the disk center, with radius *R**, where *R** = *R*_{Sun} in case the CME width is smaller than 90°. To the CME foot‐point position we associate an error given by the larger of 5″ and the standard deviation of the weighted average described above. We employed this method and these weighting values since they are those in use in the real‐time CME detection and propagation services SWERTO (Berrilli et al., 2017) and IPS (Veettil et al., 2019) and have been set after an extensive test on a number of known CME‐ICME counterparts.
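The weighted averaging of candidate sources described above can be sketched as follows. This is an illustration under stated assumptions rather than the operational code: the weights and the 5″ floor are those quoted in the text, while the tuple layout, the function names, and the use of the weighted radial scatter as the "standard deviation" are our assumptions. The HEK query itself is not shown.

```python
import math

FLARE_WEIGHT = 25.0        # from the text
FILAMENT_WEIGHT = 500.0    # from the text
MIN_AR_WEIGHT = 0.1        # lower bound on the active-region weight (text)
MIN_ERROR_ARCSEC = 5.0     # floor on the foot-point error (text)

def feature_weight(kind, r_value=None):
    """Weight of one candidate feature; r_value is the Schrijver (2007) R value."""
    if kind == "active_region":
        return max(MIN_AR_WEIGHT, r_value or 0.0)
    if kind == "flare":
        return FLARE_WEIGHT
    if kind == "filament_eruption":
        return FILAMENT_WEIGHT
    raise ValueError(f"unknown feature type: {kind}")

def weighted_source(features):
    """features: list of (kind, x_arcsec, y_arcsec, r_value_or_None).

    Returns (x, y, error) in arcsec. The error is the larger of 5 arcsec and
    the weighted RMS distance from the mean (ASSUMED definition of the spread).
    """
    if not features:
        raise ValueError("no candidate features: use the default source position")
    weights = [feature_weight(kind, r) for kind, _, _, r in features]
    total = sum(weights)
    x = sum(w * f[1] for w, f in zip(weights, features)) / total
    y = sum(w * f[2] for w, f in zip(weights, features)) / total
    var = sum(w * ((f[1] - x) ** 2 + (f[2] - y) ** 2)
              for w, f in zip(weights, features)) / total
    return x, y, max(MIN_ERROR_ARCSEC, math.sqrt(var))

if __name__ == "__main__":
    candidates = [
        ("flare", 0.0, 0.0, None),
        ("filament_eruption", 105.0, 0.0, None),
    ]
    print(weighted_source(candidates))
```

With these weights a single filament eruption dominates any number of nearby flares or active regions, so the averaged position stays close to the filament whenever one is retrieved.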

After the identification of the most likely CME source region, it is necessary to compute the radial speed *v*_{r} from the measured POS speed of the CME front. The procedure is based on Equation 1 in Gopalswamy et al. (2010), assuming a cone model for the CME shape (see Figure 1 in the same reference). The main difference is that we use the de‐projection coefficient to obtain the de‐projected position *R*_{r} from the POS position of the CME front, instead of directly de‐projecting the speed. By using this equation we can also compute the error *dR*_{r} associated with the radial position.

Once we have the data to plot a time‐distance relation in the de‐projected framework, we can obtain the radial speed by performing a linear or quadratic fit of the *r*(*t*) relationship described by these data. The standard option is to fit a quadratic relationship, but when the number of measured positions of the CME available for the fit is less than 9, a linear fit is used. Examples of quadratic fits are presented in Figure A1. Note that with a quadratic fit, the CME's acceleration is assumed to be constant and its speed is assumed to be a linear function of time. Because of this approximation, different velocities can be obtained depending on which part of the *r*(*t*) data is used for the fit, as illustrated in the figure. In this paper, for robustness, we have always used all available data points.

With the parameters from the fit, we are able to compute several quantities of interest for the CME liftoff, namely: the time when the CME front reaches the distance of 20 *R*_{Sun} and its associated error; the CME *v*_{r} at 20 *R*_{Sun} and its associated error; and the possible residual acceleration of the CME front at 20 *R*_{Sun}. These values are part of the CME liftoff characteristics in the database: start date, de‐projected speed, de‐projected speed error, acceleration. Since the error on the arrival date is negligible, the error on the Start Date is reported as transit time error.
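The fitting step and the evaluation at 20 *R*_{Sun} can be illustrated with a short Python sketch. It assumes times in seconds from an arbitrary reference epoch and de‐projected heights in solar radii; the function names are hypothetical, the fit is unweighted, and the propagation of the position errors into the fitted quantities is omitted for brevity.

```python
import numpy as np
from numpy.polynomial import Polynomial

RSUN_KM = 6.957e5              # solar radius in km
MIN_POINTS_QUADRATIC = 9       # below this number of points a linear fit is used

def fit_height_time(t_s, r_rsun):
    """Fit r(t) with a quadratic (>= 9 points) or a line (< 9 points)."""
    t = np.asarray(t_s, dtype=float)
    r = np.asarray(r_rsun, dtype=float)
    degree = 2 if t.size >= MIN_POINTS_QUADRATIC else 1
    # .convert() returns the polynomial in the unscaled time variable.
    return Polynomial.fit(t, r, degree).convert()

def state_at_height(poly, height_rsun=20.0):
    """Return (time [s], speed [km/s], acceleration [km/s^2]) at the given height."""
    speed = poly.deriv()
    accel = speed.deriv()
    roots = (poly - height_rsun).roots()
    # Keep the real crossing at which the front is moving outward.
    outward = [z.real for z in roots if abs(z.imag) < 1e-9 and speed(z.real) > 0]
    if not outward:
        raise ValueError("fitted front never reaches the requested height")
    t_cross = outward[0]
    return t_cross, speed(t_cross) * RSUN_KM, accel(t_cross) * RSUN_KM

if __name__ == "__main__":
    t = np.arange(0.0, 20000.0, 2000.0)          # 10 synthetic samples
    r = 3.0 + 1.0e-3 * t + 1.0e-8 * t**2         # synthetic heights in R_Sun
    t20, v20, a20 = state_at_height(fit_height_time(t, r))
    print(f"t(20 Rsun)={t20:.0f} s, v={v20:.0f} km/s, a={a20 * 1e3:.2f} m/s^2")
```

For a linear fit the acceleration returned is zero. For a quadratic fit with two real crossings, only one has positive speed, which is why the outward‐moving root is selected.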

In order to propagate the ICME with an appropriate solar wind speed, for each event we have to hypothesize whether the ICME interacted with a stream of slow (*S*) or fast (*F*) solar wind. It is well known that coronal holes are sources of fast solar wind streams (Krieger et al., 1973; Nolte et al., 1976); we therefore implemented an algorithm that discriminates the solar wind type by verifying whether the CME source region is close to a coronal hole. The algorithm queries the HEK (Heliophysics Event Knowledgebase) catalog for all the Coronal Holes present on the solar disk.

The queried time range runs from 4 hr before the estimated CME launch time to the CME launch time (considering the error).

As a consequence, we associate the event with fast (slow) solar wind if the CME source coordinates are close to (far from) any Coronal Hole retrieved by the query.
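A minimal Python sketch of this classification is given below. Several ingredients are assumptions made only for illustration: the text does not state the proximity criterion, so the separation threshold `max_sep_deg` is a free, hypothetical parameter; positions are taken as heliographic (longitude, latitude) pairs in degrees; and "considering the error" is interpreted as widening the window by the launch time error at both ends. The HEK query is not shown.

```python
import math
from datetime import datetime, timedelta

def query_window(launch_time, launch_error, lookback=timedelta(hours=4)):
    """Time range for the coronal-hole query (ASSUMED handling of the error)."""
    return launch_time - launch_error - lookback, launch_time + launch_error

def angular_separation_deg(lon1, lat1, lon2, lat2):
    """Great-circle separation between two (lon, lat) points, in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    a = math.sin((p2 - p1) / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

def solar_wind_type(source, coronal_holes, max_sep_deg):
    """Return 'F' if any coronal hole lies within max_sep_deg of the source, else 'S'.

    max_sep_deg is a HYPOTHETICAL threshold; the text does not specify a value.
    """
    close = any(angular_separation_deg(*source, *ch) <= max_sep_deg
                for ch in coronal_holes)
    return "F" if close else "S"

if __name__ == "__main__":
    start, end = query_window(datetime(2012, 7, 12, 16, 0), timedelta(minutes=30))
    print(start, end)
    print(solar_wind_type((0.0, -15.0), [(10.0, -20.0)], max_sep_deg=15.0))
```

An event with no coronal hole returned by the query is classified as slow wind by construction, since `any` over an empty list is false.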

This work is in part supported by the ESCAPE project (the European Science Cluster of Astronomy & Particle Physics ESFRI Research Infrastructures) that has received funding from the European Union Horizon 2020 research and innovation program under the Grant Agreement no. 824064. We also acknowledge support for our research by the project “CEI6: Circumterrestrial Environment: Impact of Sun‐Earth Interaction” funded by the MIUR Progetti di Ricerca di Rilevante Interesse Nazionale (PRIN) Bando 2017‐grant 2017APKP7T. We made use of the Near‐Earth Interplanetary Coronal Mass Ejections Since January 1996 catalog compiled by Ian Richardson and Hilary Cane and available at

The ICME catalog built for the analysis in Section 3, together with a tool for the data visualization and the module employed for running the PDBM simulations, can be downloaded from