The Formosa Satellite‐7/Constellation Observing System for Meteorology, Ionosphere, and Climate‐2 (FORMOSAT‐7/COSMIC‐2) Global Navigation Satellite System radio occultation (RO) payload can provide global observations of slant total electron content (sTEC) with unprecedentedly high spatial and temporal resolution. Recently, a new ionospheric data assimilation system, the Community Gridpoint Statistical Interpolation (GSI) Ionosphere, has been constructed with the National Oceanic and Atmospheric Administration GSI Ensemble Square Root Filter and the Global Ionosphere Plasmasphere and the Thermosphere Ionosphere Electrodynamic General Circulation Model. This paper demonstrates the capability of the GSI Ionosphere to improve the ionospheric specification and makes a quantitative assessment of the impact of FORMOSAT‐7/COSMIC‐2 RO data on that specification. Ionospheric observing system simulation experiments are conducted to calibrate key Ensemble Square Root Filter parameters that control detrimental effects of the sampling errors, particularly on the ensemble‐based estimation of the correlation between observations and model states, in order to yield high‐quality assimilation analysis. Results from the observing system simulation experiments show that (1) an ensemble size larger than 70 is recommended for assimilation of RO sTEC data with the GSI Ionosphere and (2) localizing the impact of observations around the tangent points in the horizontal direction with a length scale of 5,000 km is effective in improving assimilation analysis quality. Assimilation of sTEC data from FORMOSAT‐7/COSMIC‐2 can considerably improve the global ionospheric specification through the application of the GSI Ionosphere. The GSI Ionosphere can provide instantaneous global pictures of ionospheric variability and thereby help characterize, and deepen our understanding of, the observed day‐to‐day variability of the ionosphere.

As exemplified by the success of numerical weather prediction, data assimilation has attracted great attention as a promising approach to integrating geospace observing capabilities with a numerical model of the ionosphere to improve the specification and forecasting of ionospheric weather. Data assimilation is a powerful technique that can optimally combine observations with a numerical model to help initialize model states and estimate inadequately specified model parameters.

Total electron content (TEC) is one of the most valued data types for ionospheric data assimilation. From the phase and pseudorange measurements of the Global Navigation Satellite System (GNSS) signals received by ground‐based and satellite‐based GNSS receivers, the electron density integrated along the radio path between GNSS satellites and receivers through the ionosphere and plasmasphere can be calculated and is referred to as slant TEC (sTEC). Compared with ground‐based TEC, satellite‐based TEC data are evenly distributed over ocean and land areas. A radio occultation (RO) event occurs when the Earth is located between GNSS and low Earth orbit satellites, while the GNSS raypath between GNSS transmitters and receivers passes through the ionosphere and plasmasphere. Otherwise, the antenna at low Earth orbit receives the signal that travels only through the plasmasphere. In this study, we focus on RO sTEC observations, in particular those expected to be obtained from the Formosa Satellite Mission 7/Constellation Observing System for Meteorology, Ionosphere and Climate 2 (FORMOSAT‐7/COSMIC‐2) mission. Following the success of the FORMOSAT‐3/COSMIC mission, the follow‐on FORMOSAT‐7/COSMIC‐2 mission was originally planned to consist of six low‐inclination‐angle (24°–28.5°) orbit satellites and six high‐inclination‐angle (72°) orbit satellites. The six low‐inclination‐angle orbit satellites will be launched in 2018. Although the mission of six high‐inclination‐angle orbit satellites has been cancelled, it is expected that an alternative source of high‐latitude data will very likely be available from commercial constellations. The main payload, the TriG GNSS RO System, is capable of profiling the ionosphere with great accuracy. The FORMOSAT‐7/COSMIC‐2 low‐inclination‐angle satellites are expected to provide dense spatial and temporal coverage of high‐quality sTEC observations evenly distributed over low latitudes and midlatitudes (Yue, Schreiner, Kuo, et al., ; Yue, Schreiner, Pedatella, et al., ).

In the past decade, considerable efforts have been made to assimilate TEC data into numerical ionospheric models (e.g., Chartier et al., ; Chen et al., ; Scherliess et al., ; Schunk et al., ; Wang et al., ). One of the most widely recognized ionospheric data assimilation systems is the Global Assimilative Ionospheric Model (GAIM) developed by a joint effort of University of Southern California and Jet Propulsion Laboratory (USC‐JPL GAIM). The USC‐JPL GAIM can assimilate multiple types of data into a numerical ionospheric model. The model covers the altitude range from 100 to 1,500 km. The plasma density in this model is calculated along the geomagnetic field lines, and the thermospheric states and electric fields along with other drivers such as solar extreme ultraviolet (EUV), on the other hand, are parameterized or specified by empirical models. In the USC‐JPL GAIM, two data assimilation schemes are used to update different variables. The four‐dimensional variation data assimilation method is used to estimate model drivers, and a band‐limited Kalman Filter is used to estimate the plasma density model state (Hajj et al., ). Hajj et al. () have successfully assimilated ground‐based TEC data into the USC‐JPL GAIM by using this band‐limited Kalman Filter. Komjathy et al. () have assimilated both ground‐based Global Positioning System (GPS) TEC and the FORMOSAT‐3/COSMIC TEC data into the USC‐JPL GAIM. Their results show assimilating satellite‐based TEC data helps improve vertical electron density specifications.

The Global Assimilation of Ionospheric Measurement (GAIM) developed by Utah State University (USU GAIM) is another well‐recognized ionospheric data assimilation system. This system has two different data assimilation approaches, the USU GAIM‐GM (Scherliess et al., ; Schunk et al., , ) and the USU GAIM‐FP (Scherliess, Thompson, & Schunk, ; Schunk et al., ). The model used in the USU GAIM‐GM is the Ionosphere Forecast Model (IFM), and the data assimilation scheme is Gauss‐Markov Kalman filter. The IFM covers the altitude range from 90 to 1,400 km, and the plasma density is calculated along the geomagnetic field lines. In IFM, thermospheric compositions, temperatures, and winds as well as electric field and precipitation patterns are specified by empirical models (Scherliess et al., ; Schunk et al., ). On the other hand, the model used in the USU GAIM‐FP is the Ionosphere‐Plasmasphere Model (IPM) that covers the altitude range from 90 to 30,000 km. As for the IFM, the IPM needs thermospheric state variables and other external drivers to be specified by empirical models but can be adjusted by the data assimilation system. In addition, the electron and ion temperature from empirical models are necessary because energy equations are not solved in the IPM. The data assimilation scheme used in the USU GAIM‐FP is ensemble Kalman filter (Scherliess, Thompson, & Schunk, ; Schunk et al., ). Both the USU GAIM‐GM and the USU GAIM‐FP can assimilate TEC data into models adequately (Scherliess et al., ; Scherliess, Thompson, & Schunk, ; Schunk et al., ).

Recently, another data assimilation system has been built for the National Center for Atmospheric Research (NCAR) Thermosphere Ionosphere Electrodynamic General Circulation Model (TIE‐GCM) (Richmond, Ridley, & Roble, ) using the Ensemble Square Root Filter (EnSRF) implemented in Data Assimilation Research Testbed (DART) (Anderson, , ). The TIE‐GCM is a three‐dimensional, physical‐based model that can self‐consistently simulate the coupled processes of ionosphere and thermosphere in hydrostatic pressure coordinates. The DART is a flexible ensemble data assimilation software framework with various options for filtering methods. Unlike the GAIM systems, the DART/TIE‐GCM has been designed to take a full advantage of thermosphere‐ionosphere coupling in both analysis and forecast steps of EnSRF (e.g., Hsu et al., ; Lee et al., ; Matsuo & Araujo‐Pradere, ; Matsuo, Lee, & Anderson, ). The DART/TIE‐GCM has also been used to assimilate ground‐based TEC data successfully during ionospheric storm periods (e.g., Chartier et al., ; Chen et al., ). Because the upper boundary of the TIE‐GCM is too low to represent contributions of the topside ionosphere and plasmasphere electron density to sTEC adequately, Chen et al. () and Chartier et al. () had to extrapolate electron densities above the model upper boundary to assimilate vertical TEC into the DART/TIE‐GCM. Vertical TEC is the vertical integration of the electron density in the direction perpendicular to the ground and can be geometrically converted from sTEC under some assumptions. To avoid the errors introduced by vertical extrapolation of the model electron densities, it is desirable to use a model that includes the ionosphere and plasmasphere for assimilation of sTEC data.

In order to take advantage of ionosphere‐thermosphere coupling in data assimilation and to overcome limitations of earlier work with the DART/TIE‐GCM, this study presents a new approach for assimilation of satellite‐based RO sTEC data. The study focuses on assessing the impact of sTEC observations from the upcoming FORMOSAT‐7/COSMIC‐2 low‐inclination satellite constellation on the low‐latitude and midlatitude ionospheric specification.

In this study, a coupled model of the thermosphere, ionosphere, and plasmasphere, developed as a result of collaboration between NCAR and the National Oceanic and Atmospheric Administration (NOAA), has been incorporated into an ensemble‐based data assimilation scheme in order to build a new ionospheric data assimilation system. This scheme is part of the Gridpoint Statistical Interpolation (GSI) data assimilation system used operationally for numerical weather prediction at NOAA. The new system is hereafter referred to as the GSI Ionosphere.

The model used in this study is a fully coupled model of the Global Ionosphere Plasmasphere (GIP) and the TIE‐GCM (Pedatella et al., ). While the GIP simulates the ionosphere and plasmasphere processes, the TIE‐GCM solves for the thermospheric processes including the electrodynamic processes. In the following, we refer to this coupled model as the GIP/TIE‐GCM.

The GIP is developed from the ionosphere and plasmasphere part of the Coupled Thermosphere‐Ionosphere‐Plasmasphere Model (Millward et al., ). The GIP solves the continuity, momentum, and energy equations for plasma along geomagnetic field lines prescribed according to the International Geomagnetic Reference Field using the apex coordinate system (Richmond, ). In the GIP, the distributions of atomic oxygen and hydrogen ion densities are determined with consideration of the transport and diffusion processes. Other ion species are solved for by using atomic oxygen ion densities from the flux‐tube solver and assuming chemical equilibrium, that is, a balance between production and loss. Therefore, both atomic oxygen and hydrogen ions are primary prognostic model state variables that are dynamically evolved from the previous model time step to the next in the GIP. The GIP model consists of two model domains: the low‐latitude and midlatitude regions and the high‐latitude region. The GIP fluxtubes at a given magnetic longitude are distributed with respect to *L* shell. The boundary between the low‐latitude and midlatitude regions and the high‐latitude region is fixed at *L* = 4. For the low‐latitude and midlatitude regions, the GIP solves for the plasma along closed fluxtubes that move perpendicular to the magnetic field (*B*) in the magnetic meridional/vertical direction by *E* × *B* drift and parallel to *B* by ambipolar diffusion (Millward et al., ). The altitude range of the low‐latitude and midlatitude parts of the GIP is approximately from 90 to 19,000 km, which covers most of the FORMOSAT‐7/COSMIC‐2 RO GNSS raypath that traverses the ionosphere and plasmasphere. On the other hand, the open fluxtubes in the high‐latitude region are cut off at around 10,000 km in altitude, and therefore, the altitude range of the GIP high‐latitude region is approximately from 100 to 10,000 km.

As mentioned above, the TIE‐GCM solves for the thermospheric states, including electrodynamics, in the fixed pressure coordinates (Richmond, Ridley, & Roble, ). The horizontal resolution of the TIE‐GCM version used for this study is 5° × 5°, and the vertical resolution is two levels per scale height. The altitude of the lower boundary of the TIE‐GCM is approximately 97 km, and the upper boundary ranges from 400 to 700 km depending on solar activity levels. By using the GIP/TIE‐GCM, we are able to account for ionosphere‐thermosphere coupling in the process of data assimilation. The main drivers of the GIP/TIE‐GCM include F10.7 index (F107), cross‐tail potential drop (CP), auroral hemispheric power (HP), and atmospheric tides. F107 represents the solar EUV level that determines the photoionization rates, photo‐dissociation rates, and heating rates of the neutral and ionized species in the model. HP and CP indexes represent the magnitude of auroral particle precipitation and the ionospheric convective electric fields imposed from the magnetosphere. The atmospheric tides control the lower boundary conditions of the model.

The data assimilation system under consideration is composed of an analysis step and a forecast step. In the analysis step, a selected set of the model state variables are updated through assimilation of observations. In the forecast step, updated state variables are fed back to the model and used as initial conditions to forecast the future state. In a data assimilation system, cycling of these two steps is carried out over an extended period. This is the so‐called data assimilation cycle.

In this study, the atomic oxygen ion density and electron density on the model grid are selected to be estimated and updated during the analysis step. The electron density is an observed variable, but in the GIP/TIE‐GCM, the electron density is recomputed as a sum of the atomic and molecular ion species at each model time step. In fact, the atomic oxygen ion is one of the main prognostic model state variables and the dominant ion species in the F region, whose number density is largely equal to the electron number density. The data assimilation scheme used in the analysis step in this study is the EnSRF developed by Whitaker and Hamill () implemented in NOAA's GSI data assimilation system.

The EnSRF can be presented as a modification to the traditional Kalman filter (Kalman, ) and to the ensemble Kalman filter (Evensen, ). Following the standard notation used in atmospheric data assimilation (e.g., Ide et al., ), let *x*^{a} and *x*^{b} be *m*‐dimensional vectors of the updated state variables and forecast state variables, respectively, and *y*^{o} be a *p*‐dimensional vector of observational variables. **P**^{a} and **P**^{b} represent the *m* × *m* analysis error covariance matrix and forecast error covariance matrix, respectively, and **R** denotes the *p* × *p* observational error covariance matrix. The Kalman gain matrix and the gain used to update deviations are denoted as **K** and **K̃**, respectively, and **H** is the *p* × *m* observation operator matrix. Following the formulation presented in Whitaker and Hamill (), a prime denotes the deviation from the ensemble mean and an overbar denotes the ensemble mean. In all ensemble‐based Kalman filters, including the EnSRF, sample estimates of **P**^{b} do not need to be explicitly computed and stored. Instead, the terms **P**^{b}**H**^{T} and **HP**^{b}**H**^{T} are computed from the model ensemble as shown below. Under the assumption that observational errors are not correlated, observations can be assimilated sequentially one by one. In this serial application of the analysis update, **HP**^{b}**H**^{T} and **R** become scalars, and **K** and **K̃** reduce to *m*‐dimensional vectors. This makes the filter implementation computationally more efficient. For a set of *N* model ensemble members and one observation (*p* = 1), the state and covariance update equations in the EnSRF, following Whitaker and Hamill (), take the form

$$\bar{x}^{a} = \bar{x}^{b} + \mathbf{K}\left(y^{o} - \mathbf{H}\bar{x}^{b}\right),$$

$$x'^{a}_{n} = x'^{b}_{n} - \tilde{\mathbf{K}}\,\mathbf{H}x'^{b}_{n},$$

$$\mathbf{K} = \left(\boldsymbol{\rho}^{b} \circ \mathbf{P}^{b}\mathbf{H}^{T}\right)\left(\boldsymbol{\rho}^{o} \circ \mathbf{H}\mathbf{P}^{b}\mathbf{H}^{T} + \mathbf{R}\right)^{-1},$$

$$\tilde{\mathbf{K}} = \left(1 + \sqrt{\frac{\mathbf{R}}{\mathbf{H}\mathbf{P}^{b}\mathbf{H}^{T} + \mathbf{R}}}\right)^{-1}\mathbf{K},$$

$$\mathbf{P}^{b}\mathbf{H}^{T} = \frac{1}{N-1}\sum_{n=1}^{N} x'^{b}_{n}\left(\mathbf{H}x'^{b}_{n}\right)^{T}, \qquad \mathbf{H}\mathbf{P}^{b}\mathbf{H}^{T} = \frac{1}{N-1}\sum_{n=1}^{N} \left(\mathbf{H}x'^{b}_{n}\right)\left(\mathbf{H}x'^{b}_{n}\right)^{T},$$

where *n* is an index for the ensemble member (*n* = 1, …, *N*), **ρ**^{b} and **ρ**^{o} are *m* × *p* and *p* × *p* matrices of the covariance localization function, *ρ* (explained later), and **∘** denotes element‐wise multiplication (i.e., the Schur product). The ensemble means of the updated and forecast state variables are given as

$$\bar{x}^{a} = \frac{1}{N}\sum_{n=1}^{N} x^{a}_{n}, \qquad \bar{x}^{b} = \frac{1}{N}\sum_{n=1}^{N} x^{b}_{n}.$$

**H** represents an operation that computes sTEC values from electron density values on the model grid and will be discussed in detail in the next subsection. In this kind of filter implementation, the sampling errors, which originate from the use of a finite‐size model ensemble (*N* ≪ *m*) and detrimentally impact the estimation of **P**^{b}**H**^{T} and **HP**^{b}**H**^{T}, need to be addressed. As shown later, this detrimental impact can be mitigated by covariance localization. In the following section, the *prior* refers to the probability distribution of the model forecast ensemble before being updated by data assimilation, and the *posterior* refers to the probability distribution of the model ensemble after the update in the analysis step.
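The serial update for a single observation can be illustrated with a minimal Python/NumPy sketch. It assumes the standard Whitaker and Hamill square root form with a scalar observation error variance; the function name `ensrf_single_obs_update` and its arguments are hypothetical and are not taken from the GSI code.

```python
import numpy as np

def ensrf_single_obs_update(X, hx, y_obs, r_var, rho):
    """Update an ensemble with one scalar observation (serial EnSRF sketch).

    X     : (N, m) array, forecast ensemble (one member per row)
    hx    : (N,) array, observation-space value H x_n of each member
    y_obs : scalar observation
    r_var : scalar observation error variance R
    rho   : (m,) array of localization factors in [0, 1]
    """
    N = X.shape[0]
    x_mean = X.mean(axis=0)
    hx_mean = hx.mean()
    Xp = X - x_mean              # state deviations x'_n
    hxp = hx - hx_mean           # observation-space deviations H x'_n

    pbht = Xp.T @ hxp / (N - 1)  # sample P^b H^T, shape (m,)
    hpbht = hxp @ hxp / (N - 1)  # sample H P^b H^T, scalar

    K = rho * pbht / (hpbht + r_var)                        # localized gain
    alpha = 1.0 / (1.0 + np.sqrt(r_var / (hpbht + r_var)))  # reduction factor
    K_tilde = alpha * K                                     # gain for deviations

    x_mean_a = x_mean + K * (y_obs - hx_mean)   # mean update
    Xp_a = Xp - np.outer(hxp, K_tilde)          # deviation update
    return x_mean_a + Xp_a
```

In a full serial filter the observation-space values of the not-yet-assimilated observations would be updated in the same way before the next observation is processed; that bookkeeping is omitted here.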

Commonly used auxiliary methods for adjusting **P**^{b} to correct the issues associated with sampling errors include covariance inflation (e.g., Anderson & Anderson, ) and covariance localization (e.g., Hamill et al., ; Houtekamer & Mitchell, ). An underdispersed model ensemble leads to insufficient variance in **P**^{a}, which in turn causes filter divergence. Covariance inflation artificially inflates the sample variance of the model ensemble by effectively pushing each ensemble member away from the ensemble mean. The GSI‐EnSRF uses the relaxation‐to‐prior‐spread method (Whitaker & Hamill, ), which inflates the posterior variance by multiplying each ensemble member's perturbation by an inflation factor, *γ*:

$$\gamma = w\,\frac{\sigma_{b} - \sigma_{a}}{\sigma_{a}} + 1,$$

where *σ*_{a} is the posterior standard deviation, *σ*_{b} is the prior standard deviation, and *w* is the weighting factor for inflation. If *w* = 1, the posterior variance is the same as the prior variance. If *w* = 0, there is no inflation.
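The inflation step can be sketched in Python as follows. The sketch assumes the commonly used relaxation‐to‐prior‐spread form *γ* = 1 + *w*(*σ*_{b} − *σ*_{a})/*σ*_{a}, which reproduces the two limiting cases stated above; the function name `rtps_inflate` is hypothetical.

```python
import numpy as np

def rtps_inflate(X_prior, X_post, w):
    """Inflate posterior deviations toward the prior spread.

    X_prior, X_post : (N, m) arrays of prior and posterior ensembles
    w               : weighting factor, 0 (no inflation) to 1 (prior spread)
    """
    sigma_b = X_prior.std(axis=0, ddof=1)   # prior standard deviation
    sigma_a = X_post.std(axis=0, ddof=1)    # posterior standard deviation
    # Guard against division by zero where the posterior spread vanishes.
    safe = np.where(sigma_a > 0, sigma_a, 1.0)
    gamma = np.where(sigma_a > 0, 1.0 + w * (sigma_b - sigma_a) / safe, 1.0)
    mean_a = X_post.mean(axis=0)
    return mean_a + gamma * (X_post - mean_a)
```

With *w* = 0.9, for example, each posterior deviation is stretched so that the posterior spread recovers 90% of the spread lost in the analysis step.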

On the other hand, the correlation estimated from a small model ensemble often contains spurious values, especially at large lag distances. To suppress this spurious correlation in **P**^{b}, Houtekamer and Mitchell () first introduced a cutoff distance to limit the impact of an observation on the state update beyond a certain distance. This is referred to as localization of the covariance. In the EnSRF, covariance localization is achieved by multiplying the sample covariance (or regression coefficient) between an observation and a state variable on the model grid by a localization factor that is determined by a tapering (or localization) function. The localization function is essentially a correlation function, or a distance‐dependent function whose value decreases from one to zero with increasing distance. By using covariance localization, the impact of a given observation on the state update can be limited to the vicinity of the observation location.

This study adopts the Gaspari and Cohn (GC) function (Gaspari & Cohn, ), which is widely used in atmospheric data assimilation and is denoted as *τ* here, to taper the ensemble‐based covariance. The GC function is parameterized by a localization length scale, *L*, that determines the distance beyond which the correlation becomes zero. In the GSI‐EnSRF, the localization factor, *ρ*, is equal to a vertical localization function, *ρ*_{v}, multiplied by a horizontal localization function, *ρ*_{h}, in the spherical Cartesian coordinates. The localization factor is given as

$$\rho = \rho_{h}\,\rho_{v} = \tau\left(r_{h}, L_{h}\right)\,\tau\left(r_{v}, L_{v}\right),$$

where *r*_{h} is the horizontal distance between an observation location and a model grid point and *r*_{v} is the difference in log‐scale pressure levels between an observation location and a given model grid level. In other words, *r*_{v} is the absolute altitude difference between an observation height and a model grid level given in terms of the scale height. Both the vertical and horizontal localization functions are specified by the GC function with a vertical localization length scale, *L*_{v}, and a horizontal localization length scale, *L*_{h}, respectively.
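As an illustration, the GC taper and the combined localization factor can be written in a few lines of Python. This is a sketch using the fifth‐order piecewise rational function of Gaspari and Cohn, under the assumption (consistent with the description above) that *L* is the distance at which the taper reaches zero; the function names are hypothetical and not those of the GSI code.

```python
import numpy as np

def gaspari_cohn(r, L):
    """Fifth-order piecewise rational GC taper that is 1 at r = 0 and 0 for r >= L."""
    # Scale so that the compact support ends at z = 2, i.e., at r = L.
    z = 2.0 * np.abs(np.atleast_1d(np.asarray(r, dtype=float))) / L
    tau = np.zeros_like(z)
    inner = z <= 1.0
    outer = (z > 1.0) & (z < 2.0)
    zi = z[inner]
    tau[inner] = (-0.25 * zi**5 + 0.5 * zi**4 + 0.625 * zi**3
                  - (5.0 / 3.0) * zi**2 + 1.0)
    zo = z[outer]
    tau[outer] = (zo**5 / 12.0 - 0.5 * zo**4 + 0.625 * zo**3
                  + (5.0 / 3.0) * zo**2 - 5.0 * zo + 4.0 - 2.0 / (3.0 * zo))
    return tau

def localization_factor(r_h, r_v, L_h, L_v):
    """Product of horizontal and vertical GC tapers (rho = rho_h * rho_v)."""
    return gaspari_cohn(r_h, L_h) * gaspari_cohn(r_v, L_v)
```

For example, `localization_factor(2500.0, 0.0, 5000.0, 3.0)` evaluates the factor for a grid point 2,500 km away horizontally from the tangent point at the same pressure level, with *L*_{h} = 5,000 km and *L*_{v} = 3 scale heights.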

It is difficult to define a location of the RO sTEC observation because sTEC is a nonlocal quantity. Therefore, the tangent point of each raypath is adopted as an observation location for the sake of implementing the covariance localization. A tangent point is the point along a given raypath that is the closest to the Earth under the straight‐line propagation. In the *F* region of the ionosphere, electron densities around a tangent point usually account for a large proportion of the electron densities integrated to sTEC. Since the RO raypath for a given sTEC traverses a large distance through the ionosphere and plasmasphere, sTEC observations contain information about the plasma densities over a large spatial model domain. It is important to note that the covariance between a given sTEC observation and model state variables is still inhomogeneous and anisotropic, even after the GC function is applied to taper the sample covariance to localize the impact of observations around the tangent point.
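The tangent point of a straight‐line raypath can be located as sketched below. The example assumes a spherical Earth of radius 6,371 km and Earth‐centered Cartesian positions in kilometers; both the function name `tangent_point` and these simplifications are assumptions made for illustration.

```python
import numpy as np

R_EARTH_KM = 6371.0  # assumed mean Earth radius (spherical Earth)

def tangent_point(tx, rx):
    """Return the point on the segment rx -> tx closest to the Earth's center
    and its altitude above the assumed spherical surface (km)."""
    tx = np.asarray(tx, dtype=float)
    rx = np.asarray(rx, dtype=float)
    d = tx - rx
    # Minimize |rx + t d|^2 over t, then restrict to the segment [0, 1].
    t = -np.dot(rx, d) / np.dot(d, d)
    t = np.clip(t, 0.0, 1.0)
    p = rx + t * d
    return p, np.linalg.norm(p) - R_EARTH_KM
```

The horizontal distance *r*_{h} used in the localization would then be measured from this point to each model grid point.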

Generally speaking, the smaller the ensemble size, the larger the sampling errors. Covariance localization and inflation are used to rectify the issues that arise from spurious correlations due to the sampling errors. This paper focuses on the impact of both the covariance localization and the ensemble size on the quality of the EnSRF assimilation analysis. By comparing results from a number of observing system simulation experiments (OSSEs), the most effective ensemble size and covariance localization length scales for assimilating sTEC data into the GIP/TIE‐GCM using the EnSRF are determined in this study.

Since the observations are usually not co‐located with the model grid points and the observed variable is often different from the model state variable, the model state variables need to be converted to the observed variables by using a forward (observation) operator, **H**. In this study, the observation is RO sTEC; hence, the sTEC value needs to be computed by integrating electron densities on the GIP/TIE‐GCM grid along the RO raypath to obtain the sTEC value predicted by the model (**H x**).
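A simplified version of such a forward operator is sketched below in Python. It assumes straight‐line propagation, a midpoint rule with segments of about 20 km, and a user‐supplied callable `ne_func` that stands in for the interpolation of model electron density (in m⁻³) to an arbitrary position; all names are hypothetical and the actual operator in the GSI Ionosphere may differ.

```python
import numpy as np

def slant_tec(tx, rx, ne_func, step_km=20.0):
    """Integrate electron density along a straight raypath, returning sTEC in TECU.

    tx, rx  : transmitter and receiver positions in km (Cartesian)
    ne_func : callable mapping a position (km) to electron density (m^-3)
    """
    tx = np.asarray(tx, dtype=float)
    rx = np.asarray(rx, dtype=float)
    length_km = np.linalg.norm(tx - rx)
    n_seg = max(int(np.ceil(length_km / step_km)), 1)
    # Midpoints of equal-length segments along the ray.
    s = (np.arange(n_seg) + 0.5) / n_seg
    pts = rx + s[:, None] * (tx - rx)
    ne = np.array([ne_func(p) for p in pts])
    ds_m = length_km / n_seg * 1.0e3          # segment length in meters
    return ne.sum() * ds_m / 1.0e16           # 1 TECU = 1e16 electrons m^-2
```

Applying the same function to the nature run and to each ensemble member gives the synthetic observation (before noise is added) and the ensemble of predicted observations, respectively.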

OSSEs are one of the most widely used approaches to evaluating the potential impact of given observing systems before they are developed or deployed (Hoffman & Atlas, ). In OSSEs, synthetic observations are simulated from the “true” state provided by a numerical model, often referred to as the nature run (NR), with the expected coverage, resolution, and accuracy of the observing systems. The usual data assimilation experiments are then carried out using the synthetically generated observations. Note that the model ensemble members used in the data assimilation experiments are different from the NR. By verifying the data assimilation results against the NR, the impact of assimilating observations from a hypothetical observing system on the specification and forecasting of the geophysical system can be assessed.

The specific purpose of the OSSEs here is to assess the ability of the FORMOSAT‐7/COSMIC‐2 observing system to improve the low‐latitude and midlatitude ionospheric specification and forecasting. In this study, a number of OSSEs with different covariance localization length scales and ensemble sizes are conducted. All OSSEs are conducted under low solar activity, geomagnetically quiet, and solstice conditions, from 00:00 UT to 12:00 UT on 1 January. Synthetic sTEC observations are assimilated hourly into the GIP/TIE‐GCM as described below.

The model ensemble is generated by perturbing three main model drivers: F107, HP, and CP, according to a Gaussian distribution with the mean value of F107, HP, and CP set to 120 Solar Flux Unit (SFU), 16 GW, and 45 kV, and with the standard deviation set to 15 SFU, 2 GW, and 10 kV, respectively. When drawing the ensemble samples of these drivers from the respective Gaussian distribution, we assume that the F107 index is independent of both HP and CP, but HP is correlated to CP. Figure shows the histogram of the model driver ensembles along with the Gaussian distribution function from which the ensemble members are randomly drawn. In addition, the NR is executed by running the GIP/TIE‐GCM under higher solar and geomagnetic conditions than those for the ensemble mean. Specifically, the plasma density distributions of NR are generated with the F107, HP, and CP values of 140 SFU, 18 GW, and 55 kV, respectively.
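The driver perturbation described above can be sketched as follows. The means and standard deviations are those given in the text, but the HP–CP correlation coefficient is not stated, so the default `hp_cp_corr=0.5` is purely an assumed placeholder; the function name is likewise hypothetical.

```python
import numpy as np

def draw_driver_ensemble(n_members, hp_cp_corr=0.5, seed=0):
    """Draw F107 (SFU), HP (GW), and CP (kV) samples for the model ensemble.

    hp_cp_corr is an ASSUMED correlation between HP and CP; the value used
    in the study is not given in the text.
    """
    rng = np.random.default_rng(seed)
    # F107 is drawn independently of HP and CP.
    f107 = rng.normal(120.0, 15.0, n_members)
    mean = np.array([16.0, 45.0])             # HP, CP means
    sd = np.array([2.0, 10.0])                # HP, CP standard deviations
    off_diag = hp_cp_corr * sd[0] * sd[1]
    cov = np.array([[sd[0] ** 2, off_diag],
                    [off_diag, sd[1] ** 2]])
    hp_cp = rng.multivariate_normal(mean, cov, n_members)
    return f107, hp_cp[:, 0], hp_cp[:, 1]
```

Calling `draw_driver_ensemble(70)` would give one set of drivers per member for a 70‐member experiment.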

Before data assimilation cycling, the model ensembles need to be spun up in order to allow enough time for each model ensemble member to reach the state that is dynamically balanced with perturbed drivers and to obtain the model ensemble with an adequate spread. In the spin‐up period, the stand‐alone TIE‐GCM model is run for 23 days. After that, the thermospheric state obtained from a long integration of the stand‐alone TIE‐GCM is used as initial conditions to advance the GIP/TIE‐GCM for another 5 days. Note that the NR is spun up in the same manner and that all model drivers are fixed during the spin‐up and data assimilation cycling periods.

The synthetic RO sTEC observations along the raypaths between GPS and GLONASS satellites and the FORMOSAT‐7/COSMIC‐2 low‐inclination satellites are generated as follows. Using the same observation operator described above, the electron densities from the NR on the model grid are first interpolated to points along a raypath divided into 20 km segments and then integrated along the raypath. After that, observational errors drawn from a zero‐mean Gaussian distribution with a standard deviation of 3 TEC units are added. The sampling rate used for each RO event is 1 Hz. Roughly 300 to 400 FORMOSAT‐7/COSMIC‐2 RO events that amount to 200,000 sTEC data are assimilated into the GIP/TIE‐GCM at each data assimilation cycle.

The first set of OSSEs is conducted to determine the impact of the ensemble size on the EnSRF performance and the quality of the assimilation analysis. EnSRFs with three different model ensemble sizes, 40, 70, and 100, are executed with an identical covariance localization setting. In the second set of OSSEs, EnSRFs with an ensemble size selected based on the first set of OSSEs are run to further study the impact of covariance localization. In the covariance localization scheme, GC functions with four different horizontal localization length scales (500, 1,000, 5,000, and 10,000 km) and four different vertical localization length scales (0.5, 1, 3, and 7 ln(mb)) are employed. For comparison, filtering experiments without the horizontal and/or vertical covariance localization are additionally executed. Considering that the horizontal resolution of the TIE‐GCM is 5° × 5° and that the average length of a raypath through the ionosphere is roughly 7,000 km, the horizontal localization length scale for the GC function investigated here ranges from 500 to 10,000 km. The vertical localization length scales need to be given in terms of scale height in the GSI‐EnSRF. Since the TIE‐GCM vertical resolution is two levels per scale height and the total number of TIE‐GCM interface levels is 29, the range of vertical localization length scales explored for the GC function is from 0.5 to 7 scale heights.

Because the vertical localization length scale is specified in terms of scale height, the covariance localization is anisotropic in terms of geometric height (km). The cut‐off distance of the observation impact is farther away from an observation in the upward direction than the downward direction. For instance, the impact of observation located around the F2 peak is more aggressively localized in the bottomside ionosphere than in the topside ionosphere, which is favorable considering different physical mechanisms determining the *F* region and *E* region plasma density distributions.

Based on the accuracy of the TriG GNSS RO System, the observation error is set to 3 TEC units for all data in the OSSEs. Moreover, the weighting factor in the covariance inflation equation, *w*, is set to 0.9 based on the experiments shown in Figure S1 in the supporting information, which compare two OSSEs with different values of the weighting factor, *w* = 0.1 and *w* = 0.9. There is no significant difference, but the OSSE with the larger weighting factor performs slightly better. Therefore, covariance inflation with *w* = 0.9 is applied in all the OSSEs presented in this study.

The comparison of the prior and posterior ensemble distributions to the NR is presented in terms of the root‐mean‐square difference (RMSD) between the ensemble mean and the NR of atomic oxygen ion density, computed over the geomagnetic low‐latitude and midlatitude regions from 200 to 500 km altitude, where FORMOSAT‐7/COSMIC‐2 low‐inclination RO data have the greatest influence on the assimilation analysis. In addition to the OSSEs described in the previous section, a control ensemble forecast experiment is executed, with no data assimilation, using the same perturbed model driver parameters used to initialize the model ensemble for the OSSEs. The RMSD between the forecast ensemble mean and the NR is computed in the same manner as for the posterior and prior ensembles.
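The verification metric can be sketched in Python as follows. The sketch assumes that the ensemble and the NR have already been restricted to the verification region (low and middle geomagnetic latitudes, 200 to 500 km) and flattened to one dimension per member; the function names are hypothetical.

```python
import numpy as np

def rmsd(ensemble, truth):
    """RMSD between the ensemble mean and the nature run.

    ensemble : (N, k) array of member values at k verification points
    truth    : (k,) array of nature-run values at the same points
    """
    ens_mean = np.mean(ensemble, axis=0)
    return float(np.sqrt(np.mean((ens_mean - truth) ** 2)))

def rmsd_ratio(assim_ensemble, control_ensemble, truth):
    """Ratio of the assimilation RMSD to the control-forecast RMSD."""
    return rmsd(assim_ensemble, truth) / rmsd(control_ensemble, truth)
```

A ratio below one indicates that assimilation has brought the ensemble mean closer to the NR than the control ensemble forecast without data assimilation.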

If the ionospheric data assimilation of sTEC by the EnSRF is successful, the RMSD should become smaller after the analysis step, suggesting a greater proximity of the estimated model state to the NR from which observations are sampled. The posterior ensemble spread should become smaller than the prior ensemble spread, reflecting the uncertainty reduction in the state estimation. In the forecast step, the RMSD of the OSSE is likely to increase toward the level of the RMSD of the control ensemble forecast experiment because model drivers are not altered by assimilation and the same perturbed model drivers are used in both sets of ensemble simulations. The ensemble spread should also grow larger during the forecast step of the EnSRF to reflect an increased degree of uncertainty in the state estimation. Through successive applications of the analysis and forecast steps, the RMSD should overall continue to decrease. Figure displays how the ensemble mean and each ensemble member typically vary, in terms of the global mean atomic oxygen ion density in the geomagnetic low‐latitude and midlatitude regions from 200 to 500 km altitude, over the course of the entire data assimilation experiment. At the update step, all the ensemble members (grey lines) and the ensemble mean (black line) shift closer to the NR (red line), and the ensemble spread becomes smaller, representing the uncertainty reduction after incorporating the observation information into the model ensemble. After that, the ensemble members diverge away from the NR and the ensemble spread grows larger during the forecast steps, representing the increase in uncertainty.

Figure shows the RMSD of the OSSE results obtained from the EnSRF with 40, 70, and 100 ensemble members. In these experiments, the GC localization function is used to localize the covariance in the horizontal direction with a length scale of 5,000 km. No localization is applied in the vertical direction. The RMSD of the control ensemble forecast experiment with 100 ensemble members is also shown in Figure . Note that the performance of the control ensemble forecast experiment does not differ much among the ensemble sizes of 40, 70, and 100.

The RMSD of these three OSSEs is generally smaller than that of the control ensemble experiment over the entire 12‐hr data assimilation experiment. This indicates that the EnSRF can improve the ionospheric specification by bringing the model ensemble closer to the NR through assimilation of sTEC observations. The most significant improvement occurs in the first assimilation cycle. This is because the ionospheric plasma density of the NR is biased high in comparison to the ensemble mean, and the first update step is particularly effective in making a gross correction of the global atomic oxygen ion density distribution. This behavior is explored further later with respect to the choice of the covariance localization parameters. During the forecast step, the RMSD decreases for about 30 min and then increases toward the RMSD value of the control ensemble forecast experiment. This behavior will be further discussed in the next section.

As suggested by the RMSD, the performance of the EnSRF improves with an increasing number of ensemble members, with the 100‐member EnSRF performing best among the three filters. The same conclusion holds for the comparison of 40‐, 70‐, and 100‐member EnSRFs with different settings of covariance localization (see Figures S2, S3, and S4). Compared with the 100‐member EnSRF, the EnSRF with 70 ensemble members results in a larger RMSD at the beginning of the data assimilation experiment, but the RMSD gradually reduces over time. At the end of the 12‐hour data assimilation experiment, the ratio of RMSD to that of the control experiment is 0.3261 and 0.3215 for 70‐ and 100‐member EnSRFs, respectively. The performance of these two filters is similar.

Unlike the 70‐ and 100‐member EnSRFs, the 40‐member EnSRF's performance is inconsistent, and some of the GIP/TIE‐GCM ensemble simulations become numerically unstable during forecast steps. At the beginning, the behaviors of RMSD for the 40‐ and 70‐member EnSRFs are similar, but the posterior RMSD for the 40‐member EnSRF becomes larger than the prior RMSD after the fifth update step at 04:00 UT, which implies poor performance of the EnSRF. Under certain localization settings, even worse performance has been observed (see Figures S2, S3, and S4). In Figure , differences of the atomic oxygen density between the prior and posterior ensemble mean and NR at 330 km altitude at 04:00 UT are shown for the 40‐, 70‐, and 100‐member EnSRF. Positive values indicate a positive bias, meaning that the atomic oxygen ion density of the ensemble mean is larger than that of the NR, and vice versa for negative values. Since the error covariance estimated by the 40‐member ensemble is not accurate enough, the posterior biases become larger than the prior biases in some regions. Although OSSEs with the 70‐ and 100‐member EnSRFs also have some problems in low‐ and midlatitudes of postnoon and premidnight regions, the magnitude of biases decreases with an increasing ensemble size. As a result, when forecasting, the GIP/TIE‐GCM is more stable if the model state is initialized by the EnSRF with an ensemble size of 70 or higher. At the end of the whole data assimilation experiment at 12:00 UT, the ratio of RMSD to that of the control experiment is 0.5021 for the 40‐member EnSRF, which is fairly large in comparison to the 70‐ and 100‐member EnSRFs.
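The sensitivity to ensemble size can be illustrated with a generic sampling experiment that is independent of the GIP/TIE‐GCM. The sketch below draws pairs of truly uncorrelated Gaussian variables and reports the average magnitude of the sample correlation obtained with 40, 70, and 100 members. Any nonzero value is pure sampling error of the kind that covariance localization is meant to suppress. The function names, the number of trials, and the seed are arbitrary choices made for this illustration.

```python
import math
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def mean_spurious_correlation(n_members, n_trials=2000, seed=0):
    """Average |correlation| between two independent variables."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n_members)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n_members)]
        total += abs(pearson(xs, ys))
    return total / n_trials

# The true correlation is zero, so every value here is sampling noise.
spurious = {n: mean_spurious_correlation(n) for n in (40, 70, 100)}
for n, value in spurious.items():
    print(n, round(value, 3))
```

The spurious correlation shrinks roughly as the inverse square root of the ensemble size, so the gain from 70 to 100 members is smaller than the gain from 40 to 70, which is qualitatively consistent with the similar performance of the 70‐ and 100‐member filters.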

Figures a–d display the prior covariance between a given sTEC observation and the atomic oxygen densities on the model grid in the 40‐ and 70‐member EnSRFs, without and with the covariance localization. The raypath of this sTEC observation appears bent in these panels because it is displayed in the longitude‐latitude‐altitude coordinates. The tangent point of this raypath is located in the dayside EIA region at 350 km in altitude. Figure shows the analysis increment, for the same cases, along with the prior mean atomic oxygen ion density distribution as the grey scale background contour. Note that the analysis increment refers to
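To make the roles of the prior covariance and the localization weight concrete, the following is a minimal sketch of a serial square root filter update for a single scalar observation, in the general form of an ensemble square root filter. It is not the GSI implementation. The analysis increment is taken here to be the posterior mean minus the prior mean, which is an assumption, and `ensrf_single_obs_update`, `hx`, and `loc` are hypothetical names. `hx` holds each member's simulated observation (for example, a model sTEC along the raypath), and `loc` holds one localization weight per grid point.

```python
import math

def ensrf_single_obs_update(states, hx, y_obs, obs_var, loc):
    """Update an ensemble with one scalar observation.

    states : list of member state vectors (lists of floats)
    hx     : simulated observation for each member
    loc    : localization weight for each grid point
    Returns (posterior members, increment of the ensemble mean).
    """
    n = len(states)
    n_grid = len(states[0])
    x_mean = [sum(m[j] for m in states) / n for j in range(n_grid)]
    hx_mean = sum(hx) / n
    hx_pert = [h - hx_mean for h in hx]
    hx_var = sum(p * p for p in hx_pert) / (n - 1)
    # Ensemble covariance between each grid point and the observation.
    cov = [
        sum((states[i][j] - x_mean[j]) * hx_pert[i] for i in range(n)) / (n - 1)
        for j in range(n_grid)
    ]
    # Localized gain; alpha rescales the perturbation update.
    gain = [loc[j] * cov[j] / (hx_var + obs_var) for j in range(n_grid)]
    alpha = 1.0 / (1.0 + math.sqrt(obs_var / (hx_var + obs_var)))
    increment = [g * (y_obs - hx_mean) for g in gain]
    posterior = [
        [
            x_mean[j] + increment[j]
            + (states[i][j] - x_mean[j]) - alpha * gain[j] * hx_pert[i]
            for j in range(n_grid)
        ]
        for i in range(n)
    ]
    return posterior, increment

# Toy case: the observation measures grid point 0 directly, and the
# localization weight removes its influence on grid point 1.
states = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
hx = [member[0] for member in states]
posterior, increment = ensrf_single_obs_update(states, hx, 3.0, 1.0, [1.0, 0.0])
print(increment)
```

With a zero localization weight at the second grid point, its ensemble is left untouched even though its sample covariance with the observation is large, which is the effect that the localized panels are meant to show.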

Figure shows the RMSD of the posterior ensemble from OSSEs with the EnSRF with and without covariance localization at 00:00, 02:00, 04:00, 06:00, 08:00, 10:00, and 12:00 UT. Note that the RMSD is computed over the same region as for Figure . For the vertical localization, the GC function with four different vertical localization length scales, including 0.5, 1, 3, and 7 scale heights, is applied. For the horizontal localization, the GC function with four different horizontal localization length scales, including 500, 1,000, 5,000, and 10,000 km, is applied. The ensemble size is 70 for all OSSEs shown here. The RMSD over entire data assimilation cycles can be found in Figures S5–S9. The RMSD of the OSSEs that use the GC function with the smallest horizontal and vertical localization length scales is considerably larger than that of the other OSSEs. This suggests the need for careful tuning of the covariance localization parameters.

The RMSD is reduced dramatically at the first update step, especially if the covariance is not localized or is localized with the GC function with a large length scale in both the horizontal and vertical directions. As mentioned earlier, this is because the gross correction of the prior ensemble, here biased to be higher, is more effective with no localization of the covariance. In comparison, for the same set of observations, such a reduction is less dramatic for OSSEs with covariance localization with the GC function with a smaller length scale, but there is a steady reduction of RMSD over many assimilation cycles. At the end of the data assimilation experiment at 12:00 UT, the EnSRF with covariance localization with a length scale of 5,000 km in the horizontal direction and with no vertical localization leads to the smallest RMSD.

The choice of the localization length scale in the horizontal direction affects the assimilation analysis considerably, as suggested by the RMSD magnitude. The EnSRF with localization with the GC function with a length scale of 5,000 to 10,000 km in the horizontal direction leads to the smallest error regardless of the choice of vertical localization length scale. Compared with the horizontal direction, the impact of the vertical localization length scale appears to be minor. In general, the use of larger vertical localization length scales results in a smaller RMSD.

As shown in Figure , larger analysis biases in regions such as the low‐ and midlatitudes of postnoon and premidnight regions still need to be reduced by applying the covariance localization with a certain length scale. To further improve specification of the ionosphere in the EIA and boundary of high‐ and midlatitude regions, a localization function that is estimated specifically for the sTEC data using a method proposed by Anderson and Lei () might be helpful in the future.

In summary, the second set of OSSEs demonstrates that, if the GC function is used to localize the covariance in the EnSRF, the most appropriate range of the horizontal localization length scale for sTEC data assimilation is from 5,000 km to 10,000 km. No vertical localization appears to be the most effective. Figure shows the atomic oxygen ion density at the end of OSSE at 12:00 UT along with the mean of the control ensemble simulation and the NR. In this OSSE, the 70‐member EnSRF is used with the GC function with a length scale of 5,000 km to localize the impact of observation in the horizontal direction. A visual inspection of these panels shows that the assimilation analysis shown in Figure a is closer to the NR shown in Figure c in comparison to the control simulation shown in Figure b.

As shown in Figure , during almost all forecast steps of the EnSRFs, the RMSD grows smaller for about 30 min before starting to grow larger as expected. This peculiar behavior, here referred to as the “U‐shape” RMSD, suggests that the GIP/TIE‐GCM ensemble mean continues shifting toward the NR whose driver setting is slightly higher than the ensemble mean setting during the forecast step. As illustrated in Figure , the RMSD is expected to continuously grow during a forecast step.

To understand this better, the model state is examined in detail during the forecast step of the fourth data assimilation cycle when the U‐shape RMSD is the most evident from 03:00 to 04:00 UT (see Figure ). Figure shows the RMSD computed along each magnetic field line from 200 to 500 km altitude at every 12 min from 03:00 to 04:00 UT. The OSSE used to compute these RMSD maps was obtained with the 70‐member EnSRF. The high RMSD region appears roughly from 180° to 300° longitude that corresponds to 15:00 to 23:00 LT (postnoon to premidnight) at 03:00 UT. An apparently large RMSD region in the midgeomagnetic latitude at around 200° longitude (marked by a cyan box in Figure ) and another large RMSD region in low‐geomagnetic latitude at around 300° longitude (marked by an orange box) are referred to as regions A and B, respectively. In the course of the forecast step, the RMSD in the region A becomes smaller, while the RMSD in the region B increases. The error reduction and increase in these regions are responsible for the U‐shape RMSD computed over a large model domain.
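The correspondence between longitude and local time quoted above follows from the usual 15° per hour relation. The short sketch below reproduces it, assuming that the longitudes are geographic and that mean solar local time is meant. The function name is hypothetical.

```python
def local_time_hours(ut_hours, longitude_deg):
    """Mean solar local time in hours for a given UT and east longitude."""
    return (ut_hours + longitude_deg / 15.0) % 24.0

# At 03:00 UT, 180 and 300 degrees longitude map to 15:00 and 23:00 LT.
print(local_time_hours(3.0, 180.0))  # 15.0
print(local_time_hours(3.0, 300.0))  # 23.0
```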

Figure shows the distribution of atomic oxygen ion density at 330 km altitude, and the difference from the NR is shown in Figure in the same format as in Figure where the positive values mean that the ensemble mean is larger than the NR. Before the assimilation update at 03:00 UT, the ensemble mean is significantly larger than the NR in the postnoon to premidnight area in the low‐geomagnetic latitude. The EnSRF corrects the density globally but overcorrects in the postnoon to premidnight area as indicated by negative values in the posterior bias map shown in Figure . Over the course of the forecast step, these negative biases become smaller, bringing the midgeomagnetic latitude ionosphere state closer to the NR, while the positive biases start appearing again. The regions of positive and negative biases agree with the regions with large errors in Figure . At the end of current data assimilation cycle, the ensemble mean again becomes larger than that of the NR.

From postnoon to premidnight, the photoionization production process becomes weaker and the loss process through recombination becomes more dominant. As shown in Figure , the overall atomic oxygen ion densities in the OSSE are smaller than those in the NR at the beginning of the forecast step and then become larger. This implies that the loss rate of atomic oxygen ion density in the OSSE is slower than that of the NR.

Because synthetic sTEC observations are sampled from the NR with a higher level of the solar EUV flux, the *F* region peak density in the OSSE is placed in a part of the atmosphere with a lower molecular concentration in comparison to the NR when the EnSRF brings both the peak density and peak height up. This results in a smaller loss rate of the atomic oxygen ion in the OSSE than in the NR through recombination with molecular species and leads to a positive bias during postnoon and premidnight that appears at the end of the forecast step. Examining the OSSE results shown in Figures , , and in more detail, 62% of the magnetic fluxtubes located from 180° to 300° longitude experience an increase in both peak density and peak height by the assimilation update at 03:00 UT. Figure shows the locations of foot points of these fluxtubes, which largely overlap with regions A and B. The loss rate of the OSSE in the *F* region is smaller than that of the NR in the majority (83%) of those fluxtubes whose peak density height is raised by the correction.

In summary, the U‐shape RMSD results from limitations of the assimilation method rather than the intrinsic dynamical behaviors of the thermosphere and ionosphere. The analysis update of the atomic oxygen ion density in the local postnoon and premidnight regions is inadequate, resulting in a negative bias from the NR as shown in Figure . Because the neutral composition is unaffected by the assimilation update, the EnSRF increases the peak density and raises the peak height to an atmospheric region with a lower abundance of molecular species. As a result, the loss rate in the OSSE is smaller than that in the NR. These limitations should be overcome in the future by updating the thermospheric compositions in the analysis step, as has been done in Hsu et al. (), and by improving the quality of the assimilation analysis with the help of a nonparametric covariance localization function estimated for a specific observing system (e.g., Anderson & Lei, ).

A few issues with the current localization scheme need to be addressed in a future study. First, the bending angle of raypaths that travel through the ionosphere and plasmasphere is very small, so the raypath could be considered a straight line between GNSS (GPS and GLONASS) satellites and FORMOSAT‐7/COSMIC‐2 low‐inclination satellites in ordinary Cartesian coordinates. It is ideal to adaptively localize the covariance along the raypath of a given sTEC observation. On the other hand, currently in the GSI, the horizontal distance is computed in the spherical Cartesian coordinate, which precludes a true representation of distance. This discrepancy causes an incorrect adjustment in covariance localization. Second, the raypath travels over a large horizontal distance in the ionosphere but is confined vertically. Although the vertical localization is expected to improve the quality of the assimilation analysis, an aggressive vertical localization in the OSSEs results in discontinuous ion/electron density profiles that in turn introduce an undesirable imbalance in dynamical and chemical processes in the forecast step. A more comprehensive investigation of the covariance localization scheme for ensemble data assimilation of sTEC observations is needed to solve the issues raised above in the future.
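As an illustration of the first point, the distance from a model grid point to a straight‐line raypath can be evaluated directly in Cartesian coordinates. The sketch below is hypothetical and does not reflect how the GSI computes distances. It assumes a spherical Earth with a mean radius of 6,371 km, and the raypath end points are a toy geometry rather than actual satellite positions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth, mean radius

def to_cartesian(lat_deg, lon_deg, alt_km):
    """Convert latitude, longitude, and altitude to Earth-centered x, y, z."""
    r = EARTH_RADIUS_KM + alt_km
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (r * math.cos(lat) * math.cos(lon),
            r * math.cos(lat) * math.sin(lon),
            r * math.sin(lat))

def distance_to_raypath(point, start, end):
    """Shortest distance from a point to the segment from start to end."""
    seg = [e - s for s, e in zip(start, end)]
    rel = [p - s for s, p in zip(start, point)]
    seg_len2 = sum(c * c for c in seg)
    # Fractional position of the closest point, clamped to the segment.
    t = sum(a * b for a, b in zip(rel, seg)) / seg_len2
    t = max(0.0, min(1.0, t))
    closest = [s + t * c for s, c in zip(start, seg)]
    return math.dist(point, closest)

# Toy geometry: both end points at 800 km altitude on the equator,
# 40 degrees of longitude apart, so the tangent point lies at longitude 0.
start = to_cartesian(0.0, -20.0, 800.0)
end = to_cartesian(0.0, 20.0, 800.0)
tangent_alt = (EARTH_RADIUS_KM + 800.0) * math.cos(math.radians(20.0)) - EARTH_RADIUS_KM
grid_point = to_cartesian(0.0, 0.0, tangent_alt + 100.0)
print(distance_to_raypath(grid_point, start, end))
```

A distance of this kind could in principle be passed to a localization function in place of the horizontal separation from the tangent point, although whether that improves the analysis is an open question.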

Data assimilation is a powerful technique that can be used not only for monitoring the ionospheric weather but also for gaining a better understanding of various ionospheric phenomena. By systematically contrasting various observations and a model through the process of data assimilation, we are able to identify our lack of understanding of fundamental physical processes described in the first‐principle model. Although this study focuses on assimilating the FORMOSAT‐7/COSMIC‐2 low‐inclination RO sTEC data into the GSI Ionosphere system with the practical aim of improving the ionospheric specification and forecasting, this technique will also be helpful for addressing science questions, for instance, regarding day‐to‐day variability of ionosphere by providing an instantaneous global picture of the ionosphere.

The GSI Ionosphere is an ionospheric data assimilation system that is constructed with the NOAA GSI‐EnSRF and GIP/TIE‐GCM. The impact of sTEC on the low‐ and midlatitude ionosphere specification has been investigated through a comparative analysis of OSSEs. By using the GIP/TIE‐GCM in conjunction with the EnSRF, the data assimilation analysis is produced with the benefit of a self‐consistent coupling of the ionosphere and plasmasphere with the thermosphere in the forecast steps. The EnSRF is an ensemble‐based data assimilation scheme, and detrimental effects of the sampling errors caused by the use of a finite number of ensemble members need to be rectified in order to construct a stable and effective filtering system and to yield high‐quality data assimilation analysis. A number of OSSEs are carried out, with different ensemble sizes and different covariance localization scales, to determine the most suitable EnSRF parameters for sTEC data assimilation with the GIP/TIE‐GCM.

Primary findings are summarized as follows:

In the future, the EnSRF performance can be improved further by taking the following measures. Nonparametric localization functions that are designed specifically for sTEC data with a consideration of the RO raypath geometry, instead of a parametric function such as the GC, are desirable. In addition, updating the thermospheric compositions during the analysis step is considered essential to extend the utility of the FORMOSAT‐7/COSMIC‐2 RO data to further improve the ionospheric specification using the GSI‐EnSRF.

This study is supported by the NOAA Space Weather Prediction Center and by the following grants: Taiwan Ministry of Science and Technology grant MOST 105‐2119‐M‐008‐020, National Space Organization grant NSPO‐S‐104083, NASA award NNX14AI17G, and AFOSR grant FA9550‐15‐1‐0308. The authors thank the NCAR COSMIC office for their great help with FORMOSAT‐7/COSMIC‐2 RO data. The authors would like to acknowledge the high‐performance computing resource and support provided on Yellowstone by NCAR's Computational and Information Systems Laboratory, sponsored by the National Science Foundation (ark:/85065/d7wd3xhc). We are also grateful for helpful guidance from Jeff Whitaker and Lili Lei. This work is presented at the 2017 International Team meeting for Ionospheric Space Weather Studied by RO and Ground‐based GPS TEC Observations, which is led by Jann‐Yenq Liu and supported by International Space Science Institute, Bern, Switzerland. The main data assimilation result presented in this paper is publicly available from