We use the framework of Physics‐Informed Neural Network (PINN) to solve the inverse problem associated with the Fokker‐Planck equation for radiation belts' electron transport, using 4 years of Van Allen Probes data. Traditionally, reduced models have employed a diffusion equation based on the quasilinear approximation. We show that the dynamics of “killer electrons” is described more accurately by a drift‐diffusion equation, and that drift is as important as diffusion for nearly‐equatorially trapped ∼1 MeV electrons in the inner part of the belt. Moreover, we present a recipe for gleaning physical insight from solving the ill‐posed inverse problem of inferring model coefficients from data using PINNs. Furthermore, we derive a parameterization for the diffusion and drift coefficients as a function of *L* only, which is both simpler and more accurate than earlier models. Finally, we use the PINN technique to develop an automatic event identification method that allows identifying times at which the radial transport assumption is inadequate to describe all the physics of interest.

We analyze the relative importance between drift and diffusion, with varying L, geomagnetic activity, and phase space density values

We derive a simple and interpretable parameterization of drift and diffusion coefficients as functions of L only

We use the PINN framework to automatically identify events for which the one‐dimensional radial approximation does not hold

The mechanisms that regulate the acceleration, transport, and loss of energetic particles in the Earth's radiation belts have long been investigated, both from the standpoint of fundamental research and for practical space weather applications (Horne et al., 2005). In this region, so‐called “killer electrons” can be accelerated to relativistic energies in just a few days, or even minutes, posing a dangerous threat to satellites (Horne, 2007). The radiation belts are composed of a collisionless, tenuous plasma that obeys Maxwell's equations and whose distribution can be described by the first‐principle Vlasov equation. However, due to the massive temporal and spatial separation of the leading physical processes, the customary approach to study radiation belt electrons is to use a model reduction known as quasi‐linear theory, introduced in the seminal study of Kennel and Engelmann (1966), and soon adopted in radiation belt physics (Lyons et al., 1972; Summers et al., 1998). The motion of charged particles in a dipolar magnetic field can be decomposed into three quasi‐periodic orbits and corresponding adiabatic invariants. In the quasi‐linear procedure, one can expand particle orbits around their unperturbed trajectories in the Vlasov‐Maxwell equations and derive a diffusion equation in adiabatic invariant space (Schulz & Lanzerotti, 1974). The scattering due to resonant wave‐particle interactions violates the conservation of adiabatic invariants and it is responsible for most of the particle dynamics (since collisions are absent in this tenuous plasma environment). These effects can be described by the diffusion coefficients, hence dramatically reducing the complexity of the model. 
Furthermore, given the different timescales associated with the three adiabatic invariants, one can decouple the diffusion in the radial direction from that in energy and pitch angle, ending up with a one‐dimensional diffusion equation, valid for particles at constant values of the first and second adiabatic invariants. Alternatively, one can describe the time evolution of the particles' Phase Space Density (PSD) as a stochastic process due to small random changes in the variables, which leads to the one‐dimensional Fokker‐Planck equation (Chandrasekhar, 1943):

$$\frac{\partial f}{\partial t} = -\frac{\partial}{\partial \Phi}\left(C_{\Phi} f\right) + \frac{1}{2}\frac{\partial^{2}}{\partial \Phi^{2}}\left(D_{\Phi} f\right) \tag{1}$$

where *f* is the particles' PSD, Φ is the third adiabatic invariant (magnetic flux enclosed by a drift shell), *t* is time, and Equation 1 is understood to be valid for constant values of the first and second adiabatic invariants. The drift and diffusion coefficients (*C*_{Φ} and *D*_{Φ}, respectively) have the physical meaning of mean displacement and mean square displacement per unit time. Typically, Equation 1 is further simplified by assuming a simple relationship between *C*_{Φ} and *D*_{Φ}, which can be derived in the case of a dipole field (Fälthammar, 1966) or in the absence of sources or sinks (Roederer & Zhang, 2016): *C*_{Φ} = 1/2(*∂D*_{Φ}/*∂*Φ), so that, upon transforming Φ to the normalized equatorial radial distance *L*, we get the familiar expression:

$$\frac{\partial f}{\partial t} = L^{2}\frac{\partial}{\partial L}\left(\frac{D_{LL}}{L^{2}}\frac{\partial f}{\partial L}\right) \tag{2}$$

Equation 2 has constituted the backbone of a large part of radiation belt research for the past 60 years, and even though it is now understood that energy and pitch angle diffusion are crucial ingredients for an accurate description of electron dynamics (Shprits et al., 2009; Thorne, 2010; Tu et al., 2013; Xiao et al., 2010), the relative importance of radial diffusion is still vigorously debated (Lejosne & Kollmann, 2020). The radial diffusion coefficient *D*_{LL} can be calculated from first principles (Fälthammar, 1965; Liu et al., 2016), as well as for event‐specific cases using in‐situ or ground‐based magnetic and electric field or particle measurements (Cunningham, 2016; L.‐F. Li et al., 2020; Ripoll et al., 2016; Schulz & Lanzerotti, 1974; Tu et al., 2012), keeping in mind the several assumptions built into the quasi‐linear approximation (Camporeale, 2015a). However, its specification requires detailed knowledge of the power spectrum and/or the distribution of Ultra Low Frequency (ULF) waves that are resonant with electrons (Dimitrakoudis et al., 2015; Ozeke et al., 2012). Hence, most of the research focus has been centered on finding an efficient and accurate empirical parameterization of the diffusion coefficient *D*_{LL}, possibly as a function of quantities that are available in real‐time. The parameterizations most used in the literature use the geomagnetic index *Kp* as the main driver. The model by Brautigam and Albert (2000) (henceforth BA) is possibly the most widely used parameterization of *D*_{LL} as a simple function of *Kp* and *L*. More recent works include Ali et al. (2016), Drozdov et al. (2020), Lejosne (2019), Ozeke et al. (2014), and Wang et al. (2020). A Bayesian approach that accounts for the possible sources of uncertainty has been presented in Sarma et al. (2020).

Here, we approach the problem of defining and parameterizing the coefficients of the radial diffusion equation from a purely data‐driven standpoint and, for the first time, using machine learning techniques. Since Equation 2 does not account for any injection or loss due to non‐diffusive processes, it is customary to add a source/loss term in the form *f*/*τ*. When *τ* is a general function of *L* and *t*, that term is general enough to account for all processes that are not included in the diffusive term. In practice, because we want to be able to distinguish losses (for instance due to particles falling into the loss‐cone) from sources (for instance due to scattering in energy and pitch angle), we split the loss/source term as:

$$\frac{\partial f}{\partial t} = L^{2}\frac{\partial}{\partial L}\left(\frac{D_{LL}}{L^{2}}\frac{\partial f}{\partial L}\right) - \frac{f}{\tau} + \frac{f}{S} \tag{3}$$

where *τ* and *S* are defined as positive, and have the units of time. Equation 3, however, is not solvable as an inverse problem, being strongly ill‐posed: there is no unique solution and, in fact, a trivial solution is one where *D*_{LL} = 0 and all the rate of change in *f* is accounted for by the source/loss terms. A possible way to alleviate such ill‐posedness is to enforce a given parameterization on the coefficients. Lanzerotti et al. (1970) is one of the earliest works to propose a parameterization for the diffusion coefficient and the electron lifetime (in particular, by assuming that they are both time‐independent and functions of *L*). A data assimilation technique (Kalman filter) has been used in Koller et al. (2007) to identify sources and losses associated with the BA diffusion coefficient. A Bayesian approach to determine the optimal coefficients (but still within a given functional form) has successfully been followed by Sarma et al. (2020); however, those approaches inevitably restrict the functional form of the free parameters and possibly miss more general and insightful solutions. Here, we follow a different strategy to alleviate the problem of ill‐posedness.
We generalize Equation 2 to an advection‐diffusion Fokker‐Planck equation of the form:

$$\frac{\partial f}{\partial t} = L^{2}\frac{\partial}{\partial L}\left(\frac{D_{LL}}{L^{2}}\frac{\partial f}{\partial L}\right) - \frac{\partial (C f)}{\partial L} \tag{4}$$

with *C*(*L*, *t*) a positive‐definite drift coefficient. The positiveness of *C* imposes a constraint on the solution, yet still allows the drift term to effectively act as either a source or a loss term with respect to the diffusive term (i.e., it can be either positive or negative, depending on the sign of the derivative). In other words, we seek a solution of the Fokker‐Planck equation in drift‐diffusion form (Equation 1), without assuming any relationship between the drift and diffusion coefficients, since in general *C*_{Φ} ≠ 1/2(*∂D*_{Φ}/*∂*Φ) (Allanson et al., 2022; Lemons, 2012). The additional drift term is physically related to rapid particle injections into the inner magnetosphere, which have often been observed by satellites and which result from rapid advective transport rather than a diffusion process (see e.g., Bortnik et al. [2008]; Z. Li et al. [2021]).

To solve this inverse problem, we use a Physics‐Informed Neural Network (PINN; Raissi et al., 2019), which derives *f*, *D*_{LL}, and *C* as general smooth functions of *L* and *t*, by enforcing both consistency with data and a small residual of the drift‐diffusion Equation 4. We use 3 years of Van Allen Probes data (which we consider ‘noiseless’) in the inverse problem. The procedure approximates the phase space density *f* by means of a neural network (learning from the observed data), and learns *D*_{LL} and *C* as the optimal coefficients that solve Equation 4 for the approximated *f*. We emphasize that all of the physics of interest and the particle dynamics are encoded in those coefficients, whose analysis then becomes extremely insightful.

We compare our results with the following benchmarks: the Brautigam and Albert (2000), the Ali et al. (2016) and the Ozeke et al. (2014) parameterizations for the diffusion coefficients. For each of these, we use the formula presented in Y. Shprits et al. (2005) for the electron lifetime *τ* (widely used in the literature; Drozdov et al. [2017]). The forward model is computationally very cheap and it is solved with the finite difference method presented in Welling et al. (2012) (slightly adjusted by substituting the advection term *∂*(*Cf*)/*∂L* in lieu of the loss term *f*/*τ*).

This work has several goals. First, we present the first‐ever application of the PINN framework to solve an inverse problem and derive the optimal coefficients for the radial transport problem using real spacecraft observations. Although PINN is gaining increasing attention in all fields of applied mathematics and engineering, its potential in space physics is still not fully realized (Bortnik & Camporeale, 2021). Second, we showcase some examples of data mining approaches that can deepen our physical understanding and possibly unveil new processes. We emphasize that all of the physics of interest and the particle dynamics are encoded in the drift and diffusion coefficients, whose analysis is extremely insightful. We regard that as a fine example of data‐driven knowledge discovery, which is one of the ultimate goals of using machine learning in physics (Camporeale, 2019). Third, we perform data‐driven discovery of the physics which is missing in the traditional quasi‐linear diffusion equation, routinely used to study electrons in the radiation belts. We show that the drift term is often comparable with the diffusion one, and we analyze in detail their relative importance, with varying *L*, geomagnetic activity, and phase space density values. Fourth, we derive what is possibly the simplest and most interpretable parameterization of drift and diffusion coefficients as functions of *L* only, that is still able to capture most of the dynamics. We show that this parameterization is competitive and often outperforms less interpretable parameterizations presented in the literature. Eventually, we achieve one of the most important and long‐standing goals of scientific machine learning: we use a general but opaque ML technique (PINN) to solve an inverse problem and we discover that the free parameters of our Fokker‐Planck equation (diffusion and drift coefficients) can be well approximated by a simple, interpretable formula. 
That is, we perform data‐driven, ML‐aided model order reduction. Finally, we use the PINN solution for an automatic event identification task, namely to identify events for which the one‐dimensional radial approximation does not hold, requiring other physical mechanisms, such as energy and pitch‐angle resonant interactions.

Equation 4 is solved by means of an unconditionally stable, second‐order accurate Crank‐Nicolson scheme discussed in Welling et al. (2012). For completeness, we report the numerical discretization here:

where *n* and *j* index the discretization in time and space, with steps Δ*t* = 1 hr and Δ*L* = 0.05, respectively. Observations are linearly interpolated to the computational grid both at *L* = 2.0 and *L* = 5.5, to be used as time‐dependent boundary conditions, and at the initial time for all values of *L*, to be used as initial conditions.
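As an illustration of how a forward solve of this kind can be organized, the following is a minimal sketch of one Crank‐Nicolson step for a drift‐diffusion equation of the assumed form ∂f/∂t = L² ∂/∂L(D_LL/L² ∂f/∂L) − ∂(Cf)/∂L, with Dirichlet boundary values. It is not the discretization of Welling et al. (2012): the centered treatment of the drift term, the averaging of D_LL/L² at half grid points, and the function names `thomas` and `crank_nicolson_step` are all assumptions made for this example, and the coefficients are held fixed over the step.

```python
def thomas(sub, diag, sup, rhs):
    """Solve a tridiagonal linear system with the Thomas algorithm."""
    n = len(rhs)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = sup[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        m = diag[i] - sub[i] * c[i - 1]
        c[i] = sup[i] / m
        d[i] = (rhs[i] - sub[i] * d[i - 1]) / m
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def crank_nicolson_step(f, L, D, C, dt, f_left, f_right):
    """Advance df/dt = L^2 d/dL(D/L^2 df/dL) - d(C f)/dL by one step dt.

    f, L, D, C are lists on a uniform L grid (D and C in units consistent
    with dt); f_left and f_right are the boundary values at the new time.
    """
    n, dL = len(f), L[1] - L[0]
    # Bands of the spatial operator: (A f)_j = lo*f[j-1] + di*f[j] + up*f[j+1]
    lo, di, up = [0.0] * n, [0.0] * n, [0.0] * n
    for j in range(1, n - 1):
        d_minus = 0.5 * (D[j - 1] / L[j - 1] ** 2 + D[j] / L[j] ** 2)  # at j-1/2
        d_plus = 0.5 * (D[j] / L[j] ** 2 + D[j + 1] / L[j + 1] ** 2)   # at j+1/2
        k = L[j] ** 2 / dL ** 2
        lo[j] = k * d_minus + C[j - 1] / (2 * dL)  # diffusion + centered drift
        di[j] = -k * (d_minus + d_plus)
        up[j] = k * d_plus - C[j + 1] / (2 * dL)
    # Crank-Nicolson: (I - dt/2 A) f_new = (I + dt/2 A) f_old
    sub = [-0.5 * dt * v for v in lo]
    diag = [1.0 - 0.5 * dt * v for v in di]
    sup = [-0.5 * dt * v for v in up]
    rhs = list(f)
    for j in range(1, n - 1):
        rhs[j] = f[j] + 0.5 * dt * (
            lo[j] * f[j - 1] + di[j] * f[j] + up[j] * f[j + 1]
        )
    # Boundary rows (identity) simply impose the prescribed values.
    rhs[0], rhs[-1] = f_left, f_right
    return thomas(sub, diag, sup, rhs)
```

In a full forward run, this step would be repeated over the time grid, with `f_left` and `f_right` taken from the interpolated observations at the two boundaries and `D`, `C` updated at each step.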

Physics‐informed Neural Networks (PINN) are a framework for solving forward and inverse problems involving nonlinear partial differential equations (Raissi et al., 2019). The theoretical foundation of PINNs lies in the well‐known universal approximation property of neural networks (Hornik et al., 1989) that essentially allows neural networks to accurately approximate a large class of continuous functions. The basic idea of PINNs is rather simple, and it exploits the fact that the output of a neural network is a continuous and differentiable function (almost everywhere). Moreover, PINNs take advantage of the ability of modern neural network libraries to automatically calculate exact derivatives with respect to the input variables, by applying the chain rule of differentiation (this is known as *autodiff* in machine learning jargon; Géron [2019]). Hence, each term in a partial differential equation (PDE) can be calculated exactly on a set of collocation points within the domain, and the PDE itself can be used as a penalization term in the loss function minimized by the neural network. Upon convergence, a PINN outputs a function that approximately solves the PDE and matches the given data on the points where it has been trained.

An interesting feature of PINNs that we use in this work is their ability to solve inverse problems in a mesh‐free fashion and with a minimal set of assumptions. However, the possibility of finding general forms for the free parameters of a PDE has the potential drawback that the converged solution may not be unique. We approach this issue by employing an ensemble method, namely by solving the inverse problem several times and averaging the top 5 solutions. Because the solution *f* spans several orders of magnitude in the *L* domain, we perform the transformation *f* = *e*^{g} and solve for *g*:

$$\frac{\partial g}{\partial t} = L^{2}\frac{\partial}{\partial L}\left(\frac{D_{LL}}{L^{2}}\frac{\partial g}{\partial L}\right) + D_{LL}\left(\frac{\partial g}{\partial L}\right)^{2} - \frac{\partial C}{\partial L} - C\frac{\partial g}{\partial L} \tag{6}$$

The PINN is designed as a combination of three coupled neural networks, each taking a point in (*L*, *t*) as input and outputting the value of *f*, *D*_{LL}, and *C* at that point, respectively. Those three outputs are then combined in the loss function, which is the sum of the mean square error with respect to the observations (in logarithm), and the mean square of the residual of Equation 6. Boundary conditions (at *L* = 2 and *L* = 5.5) are enforced by neglecting the residual term in the loss function on those points (i.e., the function *f* is forced to converge to the boundary values). The neural network architectures are standard and have been selected by progressively increasing their complexity and monitoring changes in the converged values of the loss function until a plateau was observed. Other hyper‐parameters were not optimized. The networks use a tanh activation function in all the layers. The network that outputs the solution *f* uses six inner layers with [30, 20, 20, 20, 20, 20] neurons, while the two networks outputting the coefficients *D*_{LL} and *C* have three inner layers with [30, 20, 10] neurons. To perform the optimization we use a combination of the Adam optimizer (Kingma & Ba, 2014) and the BFGS (Broyden‐Fletcher‐Goldfarb‐Shanno) method (Zhu et al., 1997), both within the TensorFlow framework (Abadi et al., 2016).
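The structure of the loss described above can be sketched in a few lines. The example assumes that the drift‐diffusion equation reads ∂f/∂t = L² ∂/∂L(D_LL/L² ∂f/∂L) − ∂(Cf)/∂L and that *f* = *e*^{g}, so that the residual is written for *g*. The function names `residual_g` and `pinn_loss` are hypothetical. In an actual PINN the derivatives passed to `residual_g` would be obtained by automatic differentiation of the three networks rather than supplied by hand.

```python
def residual_g(g_t, g_L, g_LL, D, D_L, C, C_L, L):
    """Pointwise residual of the equation for g = ln(f).

    Assumed form: g_t = L^2 d/dL(D/L^2 g_L) + D g_L^2 - C_L - C g_L,
    with the first term expanded as D_L g_L + D g_LL - 2 D g_L / L.
    Subscripts denote partial derivatives at one (L, t) collocation point.
    """
    diffusion = D_L * g_L + D * g_LL - 2.0 * D * g_L / L + D * g_L ** 2
    drift = C_L + C * g_L
    return g_t - (diffusion - drift)


def pinn_loss(g_pred, g_obs, residuals):
    """Mean square data misfit (in log space) plus mean square PDE residual."""
    data = sum((p - o) ** 2 for p, o in zip(g_pred, g_obs)) / len(g_obs)
    physics = sum(r ** 2 for r in residuals) / len(residuals)
    return data + physics
```

The two terms are summed with equal weight here, following the description in the text. Leaving the boundary points out of `residuals` mirrors the way the boundary conditions are enforced.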

We use observations from the Magnetic Electron Ion Spectrometer (MagEIS) instruments aboard the Van Allen Probes spacecraft (Blake et al., 2013). Van Allen Probes is a NASA twin satellite mission that was active for 7 years, since its launch on 30 August 2012. Its primary mission was to address how populations of high‐energy charged particles are created, lost, and dynamically evolve within Earth's magnetic trapping region (Fox & Burch, 2014). Due to the unprecedented quality and quantity of data collected, Van Allen Probes have marked a golden era for radiation belt studies (W. Li & Hudson, 2019). Here, we limit our study to electrons with first adiabatic invariant *μ* = 700 MeV/G and second adiabatic invariant *K* = 0.1 *R*_{E} G^{0.5}, which corresponds to approximately 1 MeV electron energies in the heart of the radiation belt. We used the TS05 magnetic field model (Tsyganenko & Sitnov, 2005) to calculate the adiabatic invariants (specifically, we have used the IRBEM library publicly available at

Figure 1 shows the PSD (log scale) of the whole dataset as a function of *L*. The vertical dashed line divides the dataset into a training set (70% of the whole dataset, from 01‐Nov‐2013 to 30‐Oct‐2016) and a test set (30% of the whole dataset, from 01‐Nov‐2016 to 30‐Sep‐2017). This period was specifically chosen to be on the declining phase of the solar cycle which is historically associated with frequent and intense geomagnetic activity. One can notice that the dataset is sparse both in time and space since it essentially follows the highly elliptical trajectory of the satellites.

Our quantity of interest, the phase space density *f*, changes by several orders of magnitude between *L* = 2 and *L* = 5.5. Hence, it is not straightforward to design a single metric for model performance. A thorough analysis of several metrics often used in radiation belt modeling can be found in Liemohn et al. (2021) and Morley et al. (2018). Here, we are interested in studying the model accuracy at given values of *L*, rather than averaging over the whole domain. We define and use three different errors, each quantifying a different aspect of model performance. Following Morley et al. (2018), we characterize accuracy by defining the *percentage symmetric accuracy* *ζ* as:

where the two *f* quantities are the ground‐truth values taken from observations and the corresponding values produced by a model, respectively. *P*_{k} represents the *k*‐th percentile (i.e., *P*_{50} is the median) calculated over all values at fixed *L*. This represents a generalization of the median symmetric accuracy (Morley, 2016) to quantiles other than the median, which allows error bars to be estimated (since *ζ*_{k} is monotonically increasing with increasing *k*). The second metric we employ characterizes bias and is called the *symmetric signed percentage bias* SSPB, again generalized from the definition in Morley et al. (2018):

Note that, because the absolute value is taken after calculating the percentile, SSPB is not ordered when considering different percentiles *P*_{k} (hence it does not allow error bars to be estimated). Finally, we define the error *ɛ* as the median value at fixed *L* of the absolute error of the logarithmic phase space density. That is:
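The three metrics described in this section can be sketched as follows. The sketch assumes the conventions of Morley et al. (2018), namely natural logarithms of the ratio model/observation for *ζ* and SSPB, and it assumes base‐10 logarithms for *ɛ*. The helper `percentile` (linear interpolation between sorted samples) and the function names `zeta`, `sspb`, and `eps` are choices made for this example, not taken from the source. All inputs are the values at one fixed *L*.

```python
import math


def percentile(values, k):
    """k-th percentile (0-100) with linear interpolation between samples."""
    s = sorted(values)
    pos = (len(s) - 1) * k / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def zeta(f_obs, f_model, k=50):
    """Percentage symmetric accuracy: absolute value taken BEFORE the percentile."""
    q = [abs(math.log(m / o)) for o, m in zip(f_obs, f_model)]
    return 100.0 * (math.exp(percentile(q, k)) - 1.0)


def sspb(f_obs, f_model, k=50):
    """Symmetric signed percentage bias: absolute value taken AFTER the percentile."""
    p = percentile([math.log(m / o) for o, m in zip(f_obs, f_model)], k)
    return 100.0 * math.copysign(1.0, p) * (math.exp(abs(p)) - 1.0)


def eps(f_obs, f_model):
    """Median absolute error of log10(f)."""
    err = [abs(math.log10(o) - math.log10(m)) for o, m in zip(f_obs, f_model)]
    return percentile(err, 50)
```

With these conventions, a model that overestimates every observation by a factor of two gives `zeta` and `sspb` close to 100, while a model that underestimates by the same factor gives the same `zeta` and an `sspb` close to -100. Because `zeta` applies the percentile to non‐negative quantities, it increases monotonically with `k`, which is what makes it usable for error bars.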

We benchmark our results against several parameterizations of the diffusion coefficient: the BA model (Brautigam & Albert, 2000), Ozeke et al. (2014), and Ali et al. (2016), which are all functions of *L* and the geomagnetic index *Kp* (Rostoker, 1972) only. Their formulas are:

The following definition of electron lifetime is employed (Drozdov et al., 2017) for the BA and Ozeke et al. parameterizations:

where *L*_{pp} is the plasmapause location, empirically estimated with the formula in Carpenter and Anderson (1992):

$$L_{pp} = 5.6 - 0.46\,Kp_{max}$$

with *Kp*_{max} being the maximum value of the geomagnetic index *Kp* over the previous 24 hr. The parameterization of Ali et al. (2016) does not use a loss term.

In this work we initialize neural networks following the so‐called Xavier initialization (Glorot & Bengio, 2010), that is, by drawing the initial weights of the neurons in layer *l* from a normal distribution with zero mean and variance *σ*^{2} = 1/*N*^{l}, with *N*^{l} the number of neurons in layer *l*. The biases are set equal to zero. This initialization is particularly effective when using tanh activation functions. We have trained the PINN described above for 20 different random initializations of the underlying neural networks, each time for 100,000 epochs (we note that some of the networks might have converged with a smaller number of iterations). Moreover, we have verified that the results described in the following do not substantially depend on the number of PINNs trained, i.e., the results are well converged. The best five solutions in terms of the error *ɛ*(*L*), Equation 9, computed on the training set are shown in Figure 2 as black lines. Blue, magenta, and yellow lines denote the BA, Ozeke et al., and Ali et al. solutions, respectively. Not surprisingly, the PINN solutions consistently outperform those three benchmark solutions. However, it is interesting that the simple approach of averaging the best five diffusion and drift coefficients yields a result that also outperforms the benchmarks and indeed is very close to each of the five ensemble members. The error of the PINN ensemble mean is shown in Figure 2 as a red line. This is not a trivial result, because from Equation 6 one can see that averaging the coefficients *D*_{LL} and *C* does not yield a solution that is the average of the ensemble members' solutions. Figures 3 and 4 (top panels) show, respectively, the best five realizations of the diffusion coefficient *D*_{LL} and the corresponding drift coefficient *C* as heat maps in logarithmic scale and as a function of time (horizontal axis) and *L* (vertical axis).
The bottom panels show the error of each realization with respect to the coefficient calculated by averaging the best five coefficients. Such an average of the best five is shown in Figure 5 for *D*_{LL} (left) and *C* (right).
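The initialization and the ensemble averaging described above can be sketched as follows. The example takes the variance 1/*N*^{l} literally, with *N*^{l} the number of neurons in the layer being initialized. How the *L*‐dependent error *ɛ*(*L*) is reduced to a single ranking score per run is not specified in the text, so the sketch simply assumes that one scalar error per run is available. The function names `xavier_layer` and `ensemble_mean` are hypothetical.

```python
import random


def xavier_layer(n_inputs, n_neurons, rng):
    """Weights drawn from N(0, 1/n_neurons) and zero biases for one dense layer."""
    sigma = (1.0 / n_neurons) ** 0.5
    weights = [
        [rng.gauss(0.0, sigma) for _ in range(n_neurons)]
        for _ in range(n_inputs)
    ]
    biases = [0.0] * n_neurons
    return weights, biases


def ensemble_mean(runs, keep=5):
    """Average the coefficient values of the `keep` runs with the smallest error.

    runs: list of (error, coefficients) pairs, where coefficients is a list
    of values (e.g. D_LL or C sampled on a fixed (L, t) grid).
    """
    best = sorted(runs, key=lambda run: run[0])[:keep]
    n = len(best[0][1])
    return [sum(run[1][i] for run in best) / len(best) for i in range(n)]
```

For the setup in the text, `runs` would hold 20 entries, one per random initialization, and `ensemble_mean` would be applied separately to the *D*_{LL} and *C* fields.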

Here we perform a statistical analysis of the optimal coefficients derived with PINN on the training set. First, we show in Figure 6 the distribution of the PSD *f* as a function of *L*. The heat map shows the counts in each bin, normalized to the largest number for a constant value of *L*. The statistics are computed over about 25,000 hr, spanning 3 years of data (01‐Nov‐2013 to 30‐Oct‐2016). One can notice that three regimes naturally appear: one for *L* ≲ 3.2, where *f* is approximately constant at levels of 10^{−10}; one for 3.2 < *L* ≲ 4.5, where *f* rapidly increases and has a large spread covering the range 10^{−10} < *f* < 10^{−4}; and a third regime at larger *L*, where the *L*‐dependence is again flattened, even though the spread in values remains relatively large. Figure 7 (left panel) shows the distribution of the diffusion coefficient *D*_{LL} as a function of *L*. The gray area represents the interval between the 25th and 75th percentile (for a given *L*), and the orange line denotes the median. One can notice that the spread increases when moving further away from the coordinate *L* ∼ 3.2. Also, the slope of the distribution undergoes several regimes. For reference, we overlay the curves *L*^{10} (yellow) and *L*^{20} (magenta). The former is adopted in the BA parameterization and is consistent with the distribution of *D*_{LL} for small *L*, while for large *L* the latter *L*‐dependence seems more appropriate. A more detailed examination of this distribution is shown in the right panel of Figure 7. Here, we have ranked column‐wise (i.e., for constant *L*) the number of counts in each bin (the bins are uniformly spaced in log_{10}*D*_{LL} and *L*). The heat map shows the top 20 ranks, with black signifying the top rank (i.e., bins with the largest number of counts at constant *L*) and white the lowest rank (20 or above).
In this way, we are able to distinguish different *trajectories* for *D*_{LL}, and in particular a bifurcation of values, particularly at large *L*. The same bifurcation is even more prominent in the distribution of *C*, shown with the same format in Figure 8, where one can notice two different regimes being approximately separated at *L* ∼ 3.5. Interestingly, for *L* > 3.5, *C* can vary by one or two orders of magnitude.

The presence of (at least) two distinct regimes confirms that the physics of interest is different within and outside the plasmapause. Here we do not explicitly model the plasmapause location (see, e.g., Chu et al. [2017]; Guo et al. [2021]; Malaspina et al. [2020]), hence the change in the distributions slopes between *L* = 3 and *L* = 3.5 should be attributed to a statistically averaged plasmapause location. The spread in the coefficients is harder to interpret physically, although certainly driven by variations in the boundary conditions at *L* = 2 and *L* = 5.5. We note that one of the important aspects of PINN‐based insight discovery is identifying regions in parameter space that are poorly constrained or carry greater error, as specific areas that require better understanding and further investigation. Finally, Figure 9 shows the ranked joint distribution of *D*_{LL} (horizontal axis) and *C* (vertical axis). Both quantities are on a logarithmic scale. While there seems to be an almost linear dependence between the two coefficients for relatively small values (≲10^{−2}), several branches appear for large values, possibly indicating different physical regimes.

In order to understand the relative importance of the diffusion and drift terms in Equation 4, we define their ratio *r* as:

Figure 10 shows the distribution of *r* (in logarithmic scale, vertical axis) as a function of *L* (horizontal axis). The distribution is normalized to the maximum value of counts per *L*‐value. The black solid line at log_{10}*r* = 0 indicates an equal balance between drift and diffusion, and the region below that line represents a stronger diffusion than drift. One can notice that in the inner magnetosphere (*L* ≲ 4) the two terms are approximately balanced, while diffusion plays a larger role with increasing *L* in the outer belt. Figure 10 can be interpreted in the sense of local versus global losses, where the former are captured by the drift term and the latter by the diffusion term. Typically, local diffusion at *μ* = 700 MeV/G is controlled by hiss and chorus waves, and radial diffusion becomes very low at lower L‐shells. On the other hand, hiss waves will more likely be a cause of local losses at low L‐shells, providing a steady decay time, shorter than the one due to radial diffusion. It is important to notice that this picture might change for lower *μ* values, which is something that can be explored in the future using this technique.

We further analyze the relative contribution of the drift and diffusion terms by studying the ratio *r* as a function of log_{10}*f* and *L*, and for different levels of geomagnetic activity, represented by the Auroral Electrojet index AE, in Figure 11 (left panel: *AE* < 100, middle panel: 100 ≤ *AE* < 300, right panel: *AE* ≥ 300). Interestingly, at low *L* drift is more dominant than diffusion for larger values of PSD. Also, the range of *L* in which diffusion is dominant shifts slightly to smaller *L* with increasing geomagnetic activity. This analysis unambiguously shows an unexpectedly large contribution of non‐diffusive drift to the time evolution of the phase space density.

As explained above, in this approach electron losses and sources are not included explicitly, and the last two terms of Equation 3 (*f*/*τ* and *f*/*S*) are replaced by a drift term. However, *effective* lifetimes associated with losses and sources can be derived at each point in time and space by calculating *f*/(*∂Cf*(*L*, *t*)/*∂L*) and defining this quantity as *τ* when it is positive, and −*S* when it is negative. Notice that both *τ* and *S* are positive and have the units of days. Their distribution is shown in Figure 12, as functions of *L* (logarithmic vertical scale). Here, the different shades of gray denote the area covered by [1‐99], [10‐90], and [25‐75] percentiles at a constant *L* value. Once again, a distinguishing feature is the existence of two regimes: for small *L* both lifetimes are very large (i.e., the corresponding loss/source terms *f*/*τ* and *f*/*S* are negligible), but their value decreases substantially with increasing *L* until they plateau at large *L*. It is interesting that the range of values taken by *τ* (i.e., the gray area) also increases significantly with larger *L*, to the point that at *L* = 5, *τ* can range approximately three orders of magnitude. In the left panel of Figure 12, the black line denotes the parameterization by Shprits et al. (2005) used in the BA and Ozeke et al. models. The underestimation of *τ* at small L might be the cause of the large errors for low *L* in those models (see Figure 2).
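The computation of effective lifetimes described above can be sketched as follows. The example evaluates ∂(Cf)/∂L with centered finite differences on a grid in *L* at a single time, which is an assumption made for illustration (with a PINN, this derivative could instead be obtained by automatic differentiation). The function name `effective_lifetimes` is hypothetical, and the returned values carry whatever time unit is implied by the units of *C*.

```python
import math


def effective_lifetimes(f, C, L):
    """Return a list of (tau, S) pairs at the interior points of the L grid.

    The ratio f / d(C f)/dL is assigned to tau (loss) when positive and,
    with the sign flipped, to S (source) when negative; the other entry
    of the pair is None.
    """
    flux = [c * v for c, v in zip(C, f)]
    out = []
    for j in range(1, len(L) - 1):
        dflux = (flux[j + 1] - flux[j - 1]) / (L[j + 1] - L[j - 1])
        if dflux == 0.0:
            # No net drift contribution: neither a loss nor a source.
            out.append((math.inf, math.inf))
            continue
        ratio = f[j] / dflux
        out.append((ratio, None) if ratio > 0 else (None, -ratio))
    return out
```

For instance, a profile with `C * f` increasing in *L* yields a loss lifetime `tau`, while a profile with `C * f` decreasing in *L* yields a source timescale `S`.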

Several mechanisms that locally enhance the phase space density have been investigated in the literature (Boyd et al., 2018; Hudson et al., 2020; Jaynes et al., 2015). Figure 13 shows the source term *S* over the whole training set, in space (vertical axis) and time (horizontal axis). The interesting feature is that local injection of phase space density can sporadically extend to low values of *L*, down to *L* ∼ 3.5–4. Although in the majority of cases the timescale associated with such injections is of the order of tens or hundreds of days, there are cases where *S* ∼ 1 day, hence comparable with the timescale of local diffusion and losses.

The PINN method described above derives *D*_{LL} and *C* as generic functions of time and space (*t*, *L*), spanning the whole training period. In order to understand the relationship between the diffusion and drift coefficients and their physical drivers, here we perform a feature selection analysis. This analysis can be used, in later work, to inform machine learning models that seek to generate *D*_{LL} and *C* as a function of past known quantities, for space weather forecasting purposes. Feature selection is an extensive topic in the machine learning literature (see, e.g., J. Li et al. [2018]). Here, we use the *backward elimination* technique based on generalized linear models, which we briefly describe in the following. First, we define a minimal set of features, based on our physical intuition: since the radiation belt is ultimately driven by the solar wind variability, we include solar wind quantities that are well known to be drivers of geomagnetic activity (Kilpua et al., 2015; Wing et al., 2016), observed at L1 (the first Lagrangian point) and propagated in time to the magnetosphere bow‐shock. Those solar wind quantities are taken from the hourly NASA OMNI dataset and interpolated to the times at which the coefficients are defined. Table 1 lists the 12 features initially considered. Note that all the features are evaluated at the same time as the coefficients *D*_{LL} and *C*, except for Σ*PSD*, which is averaged over the 10 prior hours. A generalized linear model is built using all combinations of those features up to quadratic order (a total of 91 terms for *C* and 78 terms for *D*_{LL}, including the intercept). The linear model naturally provides the standardized coefficient (so‐called t‐Statistic or Z‐score) for each term, defined as the ratio between the coefficient calculated for that term by solving a least‐squares problem and its estimated standard deviation. 
A large value of the standardized coefficient is evidence against the hypothesis that the coefficient is zero (null hypothesis). In the backward elimination procedure, we iteratively eliminate the term with the smallest Z‐score (in absolute value) and train a new model with all the remaining terms, until only one term is left. This provides us with a ranking of the features. Figure 14 illustrates the top 10 features for *D*_{LL} and *C*, respectively, as a function of the coefficient of determination *R*^{2} (note the different vertical scales in the left and right panels). The features are ranked from left to right in order of decreasing importance, and the reported *R*^{2} refers to a model that uses all the features listed to the left (i.e., adding one at a time). The red dashed line represents the largest *R*^{2} achieved when all the features are included. In order to add robustness to the procedure, each model is trained on a randomly selected 80% of the data in the training set.
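The backward elimination procedure can be illustrated with a minimal sketch. It is not the code used in this study: the synthetic data, the feature names (`x1`, `x2`, `x3`), and the helper functions `t_statistics` and `backward_elimination` are hypothetical, and the sketch assumes an ordinary least‐squares fit in which the intercept is always retained.

```python
import numpy as np

def t_statistics(X, y):
    """Ordinary least-squares fit; return the t-statistic of each column of X."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ beta
    sigma2 = residual @ residual / (n - p)      # unbiased residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)       # covariance of the estimates
    return beta / np.sqrt(np.diag(cov))

def backward_elimination(X, y, names):
    """Rank the terms (columns of X) from most to least important.

    Column 0 is assumed to be the intercept and is never removed
    (an assumption of this sketch).
    """
    active = list(range(1, X.shape[1]))
    dropped = []
    while len(active) > 1:
        t = t_statistics(X[:, [0] + active], y)
        # Remove the term whose |t| is smallest, then refit on the rest.
        weakest = active[int(np.argmin(np.abs(t[1:])))]
        active.remove(weakest)
        dropped.append(weakest)
    # The last surviving term is the most important one.
    return [names[j] for j in active + dropped[::-1]]

# Synthetic illustration: y depends strongly on x1, weakly on x2, not on x3.
rng = np.random.default_rng(0)
n = 500
x1, x2, x3 = rng.standard_normal((3, n))
y = 3.0 * x1 + 0.5 * x2 + 0.1 * rng.standard_normal(n)
names = ["intercept", "x1", "x2", "x3", "x1*x2"]
X = np.column_stack([np.ones(n), x1, x2, x3, x1 * x2])
ranking = backward_elimination(X, y, names)
print(ranking)
```

In this toy problem the two informative terms are ranked first, ahead of the uninformative ones. Repeating the fit on random 80% subsets of the data, as done in the text, would add a further check on the stability of the ranking.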

It is interesting to notice that the solar wind features have lower rankings than features that use PSD and boundary conditions. In other words, (at least for the values of adiabatic invariants *μ* and *K* under consideration) the solar wind information contained in the PSD and the boundary condition is more informative for *D*_{LL} and *C* than using the solar wind directly. However, one should notice that the reported ranking might be sensitive to the specific algorithm used (in our case backward elimination), since many features are strongly correlated with each other. A more comprehensive study of the most efficient time lag between solar wind and diffusion and drift coefficients, following the methodology of Wing et al. (2016), is under way.

Eventually, in the grand scheme of scientific machine learning, one would like to use advanced but often opaque techniques (such as PINN) to extract physical insight from the data, with the final goal of exploiting such new insights to advance our knowledge and possibly derive new interpretable models. In a sense, that follows Occam's razor argument that suggests that one should seek the most parsimonious yet accurate model. Here, we close the circle of our inquiry by deriving what is possibly the simplest parameterization of *D*_{LL} and *C*. The feature selection procedure (Figure 14) demonstrates that most of the variance in both *D*_{LL} and *C* can be attributed to changes in *L*. In other words, *L* is the best unique predictor for the coefficients, and therefore we aim to describe them as a function of *L* only, by fitting the PINN‐derived values of *D*_{LL} and *C* with a cubic interpolator, shown with black lines in the left panels of Figures 7 and 8, respectively. Not surprisingly, the cubic interpolator is a good approximation of the median values. The derived formulas for the cubic fit are the following:

In order to assess the goodness of this approximation, we use it in a forward model solution (see section 2) and we compare the results with two benchmarks: a solution derived from the BA diffusion coefficients (Brautigam & Albert, 2000), and another derived by using the diffusion coefficients proposed in Ozeke et al. (2014) (a comparison against the Ali et al. [2016] model is not shown since it was found to yield too large errors [Drozdov et al., 2021]). For both cases we solve Equation 2 with the addition of a loss term (−*f*/*τ*), parameterized as in Gu et al. (2012) and Orlova et al. (2016), since the inclusion of such a term is standard practice to account for wave‐particle scattering due to hiss and chorus waves, and it is known to improve accuracy (see section Metrics and Benchmarks). In Figure 15 we show (left) the percentage symmetric accuracy *ζ*, Equation 7, and (right) the symmetric signed percentage bias SSPB, Equation 8 (see Section 2), calculated over the whole test set (1 year of data), as a function of *L*. Blue, red, and black lines denote the results from the baselines by BA and Ozeke et al., and by using the PINN‐derived cubic fit, respectively. In the left panel of Figure 15, the solid squares denote the median values *ζ*_{50} and the error bars are calculated as the spread between *ζ*_{25} and *ζ*_{75}. In the right panel, positive values are shown with solid lines and negative values with dashed lines. One can notice that the simple cubic approximation of Equation 11 yields results comparable or superior to the ones obtained with more sophisticated models (all errors, by definition, go to zero at the boundary). Figure 16 presents a measure of spread for the results obtained with the cubic parameterization (left), the Ozeke et al. (middle), and the BA (right) models. 
In each panel, the light and dark blue areas denote respectively the [10‐90] and [25‐75] percentiles for the logarithm of the PSD (horizontal axis) obtained from Van Allen Probes data (over the 1 year test set), while the cyan solid line represents the median, as a function of *L* (vertical axis). With the same format, light and dark grey areas represent the [10‐90] and [25‐75] percentiles for the output of the respective models, while the black solid line represents the median. One can notice that the results obtained with the PINN‐derived cubic fit are statistically more consistent with the data, for all values of *L*. The Pearson correlation coefficient is reported in each panel and denotes the linear correlation between (logarithm of) model outputs and Van Allen Probes data.
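For readers who wish to reproduce the two error measures, the following is a minimal sketch. It assumes the commonly used definitions based on the log accuracy ratio ln(model/observation), namely *ζ* = 100 (exp(median|ln *Q*|) − 1) and SSPB = 100 sgn(median(ln *Q*)) (exp(|median(ln *Q*)|) − 1); Equations 7 and 8 remain the reference definitions, and the function names and sample values below are hypothetical.

```python
import math
from statistics import median

def log_accuracy_ratio(model, obs):
    """Natural logarithm of model/observation for each pair of positive values."""
    return [math.log(m / o) for m, o in zip(model, obs)]

def symmetric_accuracy(model, obs):
    """Median symmetric accuracy, in percent (assumed definition)."""
    q = log_accuracy_ratio(model, obs)
    return 100.0 * (math.exp(median([abs(v) for v in q])) - 1.0)

def sspb(model, obs):
    """Symmetric signed percentage bias, in percent (assumed definition)."""
    m = median(log_accuracy_ratio(model, obs))
    return 100.0 * math.copysign(1.0, m) * (math.exp(abs(m)) - 1.0)

# Hypothetical PSD values: one model overestimates by a factor of 2,
# the other underestimates by the same factor.
obs = [1.0, 2.0, 4.0]
over = [2.0, 4.0, 8.0]
under = [0.5, 1.0, 2.0]

zeta_over, zeta_under = symmetric_accuracy(over, obs), symmetric_accuracy(under, obs)
bias_over, bias_under = sspb(over, obs), sspb(under, obs)
print(zeta_over, zeta_under, bias_over, bias_under)
```

With these definitions, a factor‐of‐2 overestimate and a factor‐of‐2 underestimate are penalized equally by *ζ* (about 100% in both cases), while the SSPB has the same magnitude and opposite sign, which is why the two measures are reported side by side.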

Finally, we present in Figure 17 the PSD resulting from the forward models using the three different parameterizations (BA in red, Ozeke et al. in yellow and PINN‐derived cubic fit in purple), compared against the Van Allen Probes data (blue), for the whole period covered in the test set. The top and bottom panels are for *L* = 5 and *L* = 4, respectively. In all cases, the simulations have initial and boundary conditions taken from the data. For *L* = 5, the PSD resulting from the new parameterization presented here is consistently more accurate than the two baseline models, which tend to underestimate the phase space density. At *L* = 4 none of the three models is particularly accurate, although the PINN‐derived parameterization is often orders of magnitude closer to the observations than the other two models. Note that logarithmic scales are used on the vertical axis. All the results are for the first adiabatic invariant *μ* = 700 MeV/G and the second adiabatic invariant *K* = 0.1 *R*_{E} G^{0.5}.

One of the by‐products of the PINN approach outlined in this paper is the possibility of studying how consistent the observational data are with the solution of the underlying PDE. As mentioned in the Introduction, the derivation of the radial diffusion Equation 4 is based on several assumptions, one of which is the conservation of the first and second adiabatic invariants. Breaking those invariants can cause local diffusion in energy and pitch‐angle (Camporeale, 2015; Tu et al., 2013). By investigating how small the residual of the PDE is on the domain, one can easily identify times when any of the quasi‐linear assumptions do not hold, and hence Equation 4 cannot capture some of the physical mechanisms that generate the data. Figure 18 shows the residual of Equation 4 plotted as a heat map over the whole training period. For ease of visualization, it has been normalized to its maximum value, and the color scale is capped at a value of 0.3. The red dashed lines at the bottom of the figure represent times at which the residual contains values in the 99th percentile of its distribution. The list of these ‘events’ is reported in Table 2. Most of these periods are associated with moderate or strong geomagnetic storms, dropout events, or flux enhancements, and have already been studied in the literature. When that is the case, some references that explicitly analyze data from that period are cited in the last column. For other events, we have not found previous studies in the literature, and we encourage the community to analyze them.
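The thresholding step can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used to build Table 2: the residual array is synthetic, the function `flag_events` is hypothetical, and consecutive flagged time steps are simply merged into one interval.

```python
import numpy as np

def flag_events(residual, percentile=99.0):
    """Flag time steps whose PDE residual reaches the given percentile.

    residual : array of shape (n_times, n_L), the residual on the (t, L) grid.
    Returns the threshold and a list of (first, last) time-index intervals.
    """
    r = np.abs(residual)
    threshold = np.percentile(r, percentile)
    flagged = (r >= threshold).any(axis=1)   # any L at this time above threshold
    events, start = [], None
    for i, hit in enumerate(flagged):
        if hit and start is None:
            start = i                        # an event begins
        elif not hit and start is not None:
            events.append((start, i - 1))    # the event ended at the previous step
            start = None
    if start is not None:                    # event still open at the end
        events.append((start, len(flagged) - 1))
    return threshold, events

# Synthetic residual: small background noise plus two injected "events".
rng = np.random.default_rng(1)
residual = 0.01 * rng.standard_normal((1000, 20))
residual[300:310, :] += 1.0
residual[700:703, :] -= 2.0

threshold, events = flag_events(residual)
print(threshold, events)
```

In practice one might also require a minimum duration, or merge intervals separated by short gaps, before compiling a list of events; such choices are not specified in the text and are left open here.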

*Note.* The last column indicates references in case that event has been studied in the literature.

The process of understanding the mechanisms underlying a physical process, and the ability to describe such mechanisms with the elegant and succinct formalism of partial differential equations (PDEs) lies at the core of scientific discovery. However, the way in which scientists extract information from experiments and observations (*data*) and encode that information into PDEs has seen dramatic changes over the last decade, when methods originating in machine learning have started playing an increasingly important role. Currently, there is a rich literature on data‐driven discovery of PDEs (see, e.g., Berg & Nyström [2019]; Boullé et al. [2021]; Long et al. [2018]; Raissi [2018]; Rudy et al. [2017]; Udrescu & Tegmark [2020]; Xu et al. [2019]; Zhang & Lin [2018]). The published methods can be loosely divided into two classes. On one hand, one can create a large dictionary of terms that contain algebraic, differential, and integral operators and search the space of all (or many) combinations of those terms for the optimal PDE that describes the data (i.e., the PDE whose solution is an acceptable approximation of the data). Two seminal examples of this approach are Rudy et al. (2017) (using sparse regression) and Udrescu and Tegmark (2020) (using symbolic regression). On the other hand, one can restrict the search for the optimal PDE to a specific class of functionals, thus setting up the problem of PDE discovery as an inverse problem, where the time and space dependence of free parameters (such as for instance, drift and diffusion coefficients) needs to be learned. Physics‐Informed Neural Network, introduced in the study by Raissi et al. (2019), falls in this category, and it is the approach used in this paper. 
Here, we have presented a framework that solves the problem of finding the optimal coefficients for a Fokker‐Planck equation (inverse problem) with a Physics‐Informed Neural Network, applied to the study of energetic radiation belt electrons, and using for the first time real satellite observations (Van Allen Probes). This approach opens several possible avenues for future investigations. In this study, we have showcased several of them.

Specifically, we have investigated the possibility that the time evolution of the Phase Space Density of electrons in the Earth's radiation belt could be described by the combination of (and the competition between) a diffusion and a drift term. It was found that the data is more consistent with the inclusion of a non‐diffusive drift mechanism and it was discovered that the phase space distribution is an important parameter in determining the coefficients. These findings challenge several decades of literature that have exclusively focused on diffusive processes.

The data‐driven approach enabled by PINN allows one to unambiguously test this hypothesis, by determining the optimal drift and diffusion coefficients that, used in Equation 4, result in the solution most consistent with observations. Interestingly, we have shown that, at least for the values of first and second adiabatic invariants considered here, drift and diffusion are competing for *L* ≃ 4, while diffusion becomes increasingly dominant for larger values of *L*. As powerful as it is, the PINN method does not solve the issue of ill‐posedness of the inverse problem. Namely, there is no guarantee about the uniqueness of the solution. Indeed, we have verified that different realizations of the coefficients are possible and equally valid. Interestingly enough, we have also verified that not only do the best five coefficients used in this study yield solutions that have comparable errors with respect to the data, but the average of the coefficients (analyzed in detail in Figures 5–11) also yields a similar level of error.

Furthermore, the optimal diffusion and drift coefficients can be exploited in three ways. First, one can data‐mine them in order to learn their dependence on physical parameters and the statistical behavior of their profile (Figures 3–9). Second, one can re‐derive effective loss and source terms, and study their behavior in space and time (Figures 12 and 13). In this way, we have discovered fast sporadic injections of PSD at *L* ∼ 3.5–4 that might occur on a ∼1 day timescale (Figure 13). The analysis has also highlighted a deficiency in modeling the loss term *τ* at low *L* in previous works (Figure 12). Third, we have used the PINN‐discovered coefficients *D*_{LL} and *C* and their learned dependence on *L* to build a simple and interpretable model that yields an excellent approximation (and forecast) of the PSD (Figure 15), with no free parameters other than the boundary conditions. In our opinion, this step represents the state of the art in scientific machine learning, where a simple, analytical, interpretable expression for physical parameters has been discovered by way of using a powerful, yet opaque, ML method such as PINN.

Finally, we have shown a simple way of performing automatic event identification, that is, of identifying time intervals when the underlying diffusive approximation is not valid (Figure 18). This can be due to a number of physical effects, including non‐resonant interactions (Camporeale, 2015; Camporeale & Zimbardo, 2015), large‐amplitude waves (Bortnik et al., 2008), pitch‐angle and energy scattering (Tu et al., 2013), and others. Interestingly, some of the identified events (reported in Table 2) have been well studied in the literature, while others have not and thus deserve further investigation.

Future steps include extending the present study to a range of first and second adiabatic invariants and eventually to the less approximated diffusion equation in energy and pitch‐angle (requiring the specification of a diffusion tensor that includes cross terms, thus increasing the dimensionality of the problem; see, e.g., Albert & Young, 2005; Camporeale et al., 2013a, 2013b), and estimating the uncertainties associated either with the derived coefficients or directly with the PSD solution of the Fokker‐Planck equation (Camporeale & Carè, 2021; Chen et al., 2020).

This material is based upon work supported by the National Aeronautics and Space Administration under Grant 80NSSC20K1580 “SWQU: Ensemble Learning for Accurate and Reliable Uncertainty Quantification” issued through the Space Weather with Quantified Uncertainty (SWQU) program. A.D. was supported by NASA Grant 80NSSC18K0663. J.B. acknowledges subgrant 1559841 to the University of California, Los Angeles, from the University of Colorado Boulder under NASA Prime Grant agreement 80NSSC20K1580, the Defense Advanced Research Projects Agency under U.S. Department of the Interior award D19AC00009, and NASA/SWO2R Grant 80NSSC19K0239. E.C. is partially supported by NASA grants 80NSSC20K1275 and 80NSSC21K1555.

Software and data are publicly available on