This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.

We evaluate Linear Inverse Models (LIMs) trained on last millennium model data to predict Arctic sea‐ice concentration, thickness, and other atmospheric and oceanic variables on monthly timescales. We find that more than 500 years of training data and 100 years of validation data are needed to reliably estimate LIM forecast skill. The best LIM has skill up to 8 months lead time and outperforms an autoregressive model of order one (AR1) forecast at all locations, with particularly large outperformance near the ice edge. However, for out‐of‐sample validation tests using data from different model simulations and reanalysis products, the LIM underperforms an AR1 model due to differences in the location of the sea‐ice edge from the training data. We present a metric for predicting LIM forecast skill, based on the spatial correlation of the variance in the training and validation data sets.

Rapid changes in sea‐ice concentration in recent decades have produced new navigational challenges and hazards in the region, elevating the importance of seasonal sea‐ice forecasts. Arctic sea ice has been shown to contain inherent predictability on seasonal timescales, yet current predictions generally show poor skill. Here, we employ a statistical technique referred to as Linear Inverse Modeling, which uses linearized dynamical modes estimated from a training data set to predict sea‐ice conditions on monthly timescales. We find a Linear Inverse Model is able to outperform a baseline statistical model throughout the Arctic when initialized on the data derived from the same model.

A Linear Inverse Model is evaluated using last millennium model simulations for Arctic climate prediction

The Linear Inverse Model successfully predicts Arctic conditions when the same model simulation is used for training and validation

Linear Inverse Model forecast skill is proportional to the spatial correlation of variance in the validation and training data

Large climatic changes observed in the Arctic in recent decades have made prediction of Arctic sea ice on monthly to seasonal timescales increasingly relevant (Jung et al., 2016). Younger, thinner ice in the Arctic (e.g., R. Lindsay & Schweiger, 2015; Maslanik et al., 2007) has allowed for new shipping routes (e.g., Smith & Stephenson, 2013) and a growing prevalence of industrial and tourist (Hall & Saarinen, 2010) activities in the region. Furthermore, fast changing sea‐ice conditions present hazards for local communities as conditions become increasingly challenging to anticipate and predict (Eicken, 2013).

Arctic sea ice has been shown to contain inherent predictability with persistence of sea‐ice concentration (SIC) up to 5 months (Blanchard‐Wrigglesworth, Armour, et al., 2011; Lemke et al., 1980; R. W. Lindsay et al., 2008), and strong area–thickness coupling provides predictability for even longer timescales (Blanchard‐Wrigglesworth, Bitz, et al., 2011). Blanchard‐Wrigglesworth and Bushuk (2019) find that predictability in models is likely robust within a constant climate mean state, but these relationships might not be stationary in a warming climate (Bonan & Blanchard‐Wrigglesworth, 2020; Holland & Stroeve, 2011; Holland et al., 2019). Beyond persistence, other major contributors to Arctic sea‐ice predictability include dynamical advection of sea‐ice anomalies by mean Arctic circulation patterns (Guemas et al., 2016), atmospheric temperature variability (Olonscheck et al., 2019), and ocean heat flux (Bitz et al., 2005; Bushuk et al., 2019; Yeager et al., 2015).

There has been a growing effort to predict Arctic sea ice on subseasonal to seasonal timescales, and the Sea Ice Outlook (SIO, Stroeve et al., 2014) represents a substantial sea‐ice‐research community effort to develop and improve Arctic sea‐ice prediction. Starting in 2008, the SIO has accepted forecasts of September sea‐ice conditions from groups around the world, and though much progress has been made, there remains considerable room for improvement. Stroeve et al. (2014) evaluated the skill of SIO forecasts from 2008 to 2013 and found that, overall, the skill, regardless of method, was generally poor when sea‐ice conditions depart from the long‐term trend. Moreover, statistical models tend to outperform dynamical models (Stroeve et al., 2014) likely due in part to uncertainty in model physics (Blanchard‐Wrigglesworth et al., 2015) and initial conditions (Blanchard‐Wrigglesworth et al., 2017).

Most statistical, data‐intensive approaches for predicting Arctic sea ice involve predicting pan‐Arctic or regional quantities using linear regression techniques (e.g., Drobot et al., 2006; R. W. Lindsay et al., 2008; Petty et al., 2017; Tivy et al., 2007). R. W. Lindsay et al. (2008) combine observed atmospheric indices and sea‐ice coverage with ocean and ice thickness fields derived from the Pan‐Arctic Ice‐Ocean Modeling and Assimilation System (PIOMAS) in order to develop an empirical linear model. They find substantially more skill than previous linear regression approaches that include only SIC and atmospheric variables (Drobot et al., 2006). More recently, other data‐driven techniques have been applied to Arctic sea‐ice prediction on seasonal timescales. For example, Andersson et al. (2021) use convolutional neural networks trained on both climate model and observational data to predict SIC fields up to 6 months ahead, outperforming a dynamical model for seasonal forecasts. Hogg et al. (2020) apply a Koopman Mode Decomposition to satellite observations of SIC to make future predictions of SIC in both hemispheres.

Here, we employ a Linear Inverse Model (LIM) to predict Arctic sea ice on monthly timescales. A LIM is ideal for this application as it uses linearized dynamical modes estimated from a training data set, harnessing the main contributors to Arctic predictability. The LIM framework assumes stationary statistics and that fluctuations about a mean state can be modeled by linear dynamics plus stochastic noise. Yuan et al. (2016) used a linear Markov model trained on sea‐ice, oceanic, and atmospheric variables from reanalysis data to predict Arctic SIC on monthly timescales, similar to the LIM approach here; however, they did not detrend the data and only use 35 years (1979–2013) for training and validation. LIMs have become a useful tool for predicting and probing the dynamics of the tropical atmospheric and oceanic variables on seasonal timescales (e.g., Alexander et al., 2008; Cavanaugh et al., 2015; Dias et al., 2019; Henderson et al., 2020; Huddart et al., 2017; Newman et al., 2009; Penland, 1996; Penland & Matrosova, 1998; Penland & Sardeshmukh, 1995; Shin et al., 2021; Winkler et al., 2001). Perkins and Hakim (2020) built a multivariate LIM for forecasting global climate states on annual timescales. We build on this previous work with a focus on predicting Arctic sea‐ice and other climate fields on monthly timescales.

We focus on answering two main questions: (a) over what time period is a LIM useful for predicting Arctic sea‐ice and climate conditions on monthly timescales? (b) What conditions are required for the LIM to make skillful predictions? To address the first question, in Section 3.1, we use last millennium simulations to train a LIM and assess the sample size needed for robust estimates of LIM skill. Given the absence of strong external forcing during the last millennium, and large availability of data, we are able to optimize the parameters of the LIM for predicting Arctic sea‐ice coverage and assess its skill. We then address the second question in Section 3.2 by initializing the LIM with data from different model simulations and reanalysis data and find that the LIM fails to beat an autoregressive model of order one (AR1) forecast due to differences in the sea ice from the training data. We present a metric for predicting when the stationary statistics assumed for the LIM are not satisfied.

A LIM is an empirically determined linearization of a statistically stationary dynamical system about its mean state (Penland, 1996; Penland & Matrosova, 1994; Penland & Sardeshmukh, 1995). The tendency of a multivariate state vector **x** can be represented as

$$\frac{d\mathbf{x}}{dt} = \mathbf{L}\mathbf{x} + \boldsymbol{\zeta}, \tag{1}$$

where **L** is the deterministic dynamical operator that propagates the state in time and *ζ* represents the unpredictable dynamics as white noise forcing that is uncorrelated in time but has state–space correlations. Integrating Equation 1 with respect to time gives the forecast from time *t* to *t* + *τ*,

$$\mathbf{x}(t+\tau) = \exp(\mathbf{L}\tau)\,\mathbf{x}(t) + \mathbf{n}, \tag{2}$$

where **n** is a random error vector from the integration of the white noise (*ζ*). Since **L** is a matrix, it has an eigenvalue decomposition such that **Lu**_{m} = *λ*_{m}**u**_{m}, where **u**_{m} are the eigenmodes of **L** and *λ*_{m} are the corresponding eigenvalues. Stationary statistics require that the eigenmodes of **L** (**u**_{m}) are all damped (eigenvalues have negative real parts). Since the eigenmodes are not orthogonal, interference results in transient growth despite the decay of each eigenmode (e.g., Farrell, 1982).

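We use standard inverse methods to determine **L** empirically as described in Penland (1989) by defining a matrix **G**_{τ} = exp(**L***τ*), which is determined through *τ*‐lag covariances,

$$\mathbf{G}_{\tau} = \mathbf{C}_{\tau}\mathbf{C}_{o}^{-1}, \tag{3}$$

where **C**_{τ} = 〈**x**(*t* + *τ*)**x**^{T}(*t*)〉 represents the sample covariance of **x** for a time lag of *τ* and **C**_{o} = 〈**x**(0)**x**^{T}(0)〉. Assuming that the state and the error are uncorrelated, we can use Equation 2 to propagate the state covariance:

$$\mathbf{C}(t+\tau) = \mathbf{G}_{\tau}\mathbf{C}(t)\mathbf{G}_{\tau}^{T} + \mathbf{N}_{\tau}, \tag{4}$$

where **C**(*t*) = 〈**x**(*t*)**x**^{T}(*t*)〉 and **N**_{τ} = 〈**nn**^{T}〉 is the covariance of the error.

From Equations 3 and 4 we also have, for stationary statistics (**C**(*t* + *τ*) = **C**(*t*) = **C**_{o}),

$$\mathbf{N}_{\tau} = \mathbf{C}_{o} - \mathbf{G}_{\tau}\mathbf{C}_{o}\mathbf{G}_{\tau}^{T} = \mathbf{C}_{o} - \mathbf{C}_{\tau}\mathbf{C}_{o}^{-1}\mathbf{C}_{\tau}^{T}. \tag{5}$$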

Given **G**_{τ} and **N**_{τ}, we solve for the state **x** at time *τ* using Equation 2 and the covariance using Equation 4. Note that when validating the LIM mean forecast, we take the expectation of Equation 2, and the noise term vanishes.
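
The estimation and forecast steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code used in this study: the function names `fit_lim` and `lim_mean_forecast` are hypothetical, the propagator is estimated with the standard lag‐covariance relation **G**_{τ} = **C**_{τ}**C**_{o}^{−1}, and the two‐variable synthetic system stands in for the EOF‐truncated climate state.

```python
import numpy as np


def fit_lim(x, lag=1):
    """Estimate the propagator G_tau and noise covariance N_tau.

    x is a (time, state) array of anomalies (mean already removed).
    """
    x0 = x[:-lag]                       # x(t)
    x1 = x[lag:]                        # x(t + tau)
    n = x0.shape[0]
    c0 = x0.T @ x0 / (n - 1)            # C_o = <x(t) x^T(t)>
    ctau = x1.T @ x0 / (n - 1)          # C_tau = <x(t + tau) x^T(t)>
    g = ctau @ np.linalg.pinv(c0)       # G_tau = C_tau C_o^{-1}
    noise = c0 - g @ c0 @ g.T           # N_tau = C_o - G_tau C_o G_tau^T
    return g, noise


def lim_mean_forecast(g, x_init, steps):
    """Mean forecast after `steps` applications of G_tau.

    The noise term has zero expectation, so it drops out of the mean.
    """
    return np.linalg.matrix_power(g, steps) @ x_init


# Synthetic damped linear system driven by white noise (illustrative only).
rng = np.random.default_rng(0)
g_true = np.array([[0.8, 0.2], [-0.1, 0.7]])
x = np.zeros((20000, 2))
for t in range(1, x.shape[0]):
    x[t] = g_true @ x[t - 1] + rng.standard_normal(2)
x -= x.mean(axis=0)

g_est, n_est = fit_lim(x, lag=1)
x_two_steps = lim_mean_forecast(g_est, x[-1], steps=2)
```

With a long enough record, `g_est` approaches `g_true` and `n_est` approaches the identity noise covariance used to generate the data; with short records the estimates degrade, which is one way to see why sample size matters for skill estimates.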

As mentioned in Section 2.1, the LIM assumes that the statistics are stationary about a mean state and that the operator **L** is damped (its eigenvalues have negative real parts), so we linearly detrend the training data and remove the climatological mean for each variable. Given that the trend in time differs across the seasonal cycle (particularly for Arctic sea‐ice variables, e.g., Serreze et al., 2007), we remove the linear trend and mean for each month individually at each grid point. We limit the domain to north of 40°N in order to optimize for Arctic prediction.
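
A minimal sketch of the month‐by‐month detrending is given below. It assumes the field is stored as a (months, grid points) array whose first record is a January; the function name `detrend_by_month` and the synthetic field are hypothetical and serve only to illustrate the procedure.

```python
import numpy as np


def detrend_by_month(field):
    """Remove a separate linear trend and mean for each calendar month.

    field: (n_months, n_points) array, first record assumed to be January.
    Returns anomalies with the same shape.
    """
    anom = np.empty_like(field, dtype=float)
    for m in range(12):
        y = field[m::12]                              # all records of month m
        t = np.arange(y.shape[0], dtype=float)
        design = np.column_stack([t, np.ones_like(t)])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        anom[m::12] = y - design @ coef               # residual about the fit
    return anom


# Synthetic field: seasonal cycle plus a trend that differs by month.
rng = np.random.default_rng(1)
n_years, n_points = 40, 3
month = np.arange(n_years * 12) % 12
year = np.arange(n_years * 12) // 12
cycle = 10.0 * np.cos(2.0 * np.pi * month / 12.0)
trend = -0.05 * (1.0 + month / 12.0) * year
field = (cycle + trend)[:, None] + rng.standard_normal((n_years * 12, n_points))

anom = detrend_by_month(field)
```

Because the fit includes an intercept, the anomalies for each calendar month have zero mean and zero linear trend at every grid point.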

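Given the large number of degrees of freedom in the state vector, we truncate the state using an area‐weighted empirical orthogonal function (EOF) decomposition before calculating the lagged covariance. The leading 50 EOFs for each variable are retained and the resulting state projection is normalized by the square root of the total temporal variance such that the variance over the truncated state for each variable sums to 1. Once truncated and normalized, all LIM variables are stacked in a single state vector,

$$\mathbf{x} = \begin{bmatrix} \mathbf{x}_{\mathrm{TAS}} & \mathbf{x}_{\mathrm{PSL}} & \mathbf{x}_{\mathrm{ZG500}} & \mathbf{x}_{\mathrm{SST}} & \mathbf{x}_{\mathrm{SIT}} & \mathbf{x}_{\mathrm{SIC}} \end{bmatrix}^{T},$$

which is used to calculate **G**_{τ} from the *τ*‐lag covariances (Section 2.1) for *τ* = 1 month.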
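
The truncation, normalization, and stacking steps can be sketched as follows. This is an illustration under assumptions rather than the processing used here: it assumes a regular latitude–longitude grid so that square‐root cosine‐latitude weights approximate area weighting, and the names `eof_truncate` and `back_project` are hypothetical.

```python
import numpy as np


def eof_truncate(anom, lat_deg, n_eofs=50):
    """Project anomalies onto leading area-weighted EOFs and normalize.

    anom: (n_time, n_points) anomalies; lat_deg: (n_points,) latitudes.
    Returns normalized PCs, EOF patterns, weights, and the scale factor.
    """
    weights = np.sqrt(np.cos(np.deg2rad(lat_deg)))    # assumed area weights
    u, s, vt = np.linalg.svd(anom * weights, full_matrices=False)
    k = min(n_eofs, s.size)
    pcs = u[:, :k] * s[:k]                            # (n_time, k)
    eofs = vt[:k]                                     # (k, n_points)
    # Scale so the temporal variance summed over retained PCs equals 1.
    scale = np.sqrt(pcs.var(axis=0, ddof=1).sum())
    return pcs / scale, eofs, weights, scale


def back_project(pcs_norm, eofs, weights, scale):
    """Map normalized PCs back to grid space (undo scaling and weighting)."""
    return (pcs_norm * scale) @ eofs / weights


# Two synthetic variables on the same grid (illustrative only).
rng = np.random.default_rng(2)
lat = np.linspace(40.0, 88.0, 30)
sic = rng.standard_normal((240, 30))
sst = rng.standard_normal((240, 30))
sic -= sic.mean(axis=0)
sst -= sst.mean(axis=0)

pcs_sic, eofs_sic, w_sic, scale_sic = eof_truncate(sic, lat, n_eofs=5)
pcs_sst, *_ = eof_truncate(sst, lat, n_eofs=5)
state = np.hstack([pcs_sic, pcs_sst])                 # stacked LIM state
```

Normalizing each variable to unit total variance before stacking keeps variables with large physical units from dominating the lag covariances.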

Arctic sea‐ice coverage exhibits a large seasonal cycle; thus, we anticipate the need to potentially build a LIM for different months separately. To investigate this, we train LIMs for single‐month transitions (e.g., forecasting only February from January) as well as LIMs trained on all months (with seasonal cycle and trends removed). We find that single‐month LIMs are prone to have positive eigenvalues for **L**, indicating that the stationary assumptions of the LIM are not met. We hypothesize that this is due to a change in the location of variability for sea ice as the location of the sea‐ice edge changes; that is, the statistics are not stationary. In order for the LIM to produce anomalies in a future month for locations where there is not much variability in the initial month, positive eigenvalues result. As a result, we use the all‐month LIMs for the rest of the paper.
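
Whether a trained LIM satisfies the damping requirement can be checked directly from the propagator, since an eigenvalue *γ* of **G**_{τ} = exp(**L***τ*) corresponds to an eigenvalue of **L** with real part ln|*γ*|/*τ*. The sketch below is illustrative only; the function names and the two example matrices are hypothetical.

```python
import numpy as np


def lim_decay_rates(g, tau=1.0):
    """Real parts of the eigenvalues of L, given G_tau = exp(L * tau)."""
    return np.log(np.abs(np.linalg.eigvals(g))) / tau


def is_damped(g):
    """True if every eigenmode decays, i.e. all |eig(G_tau)| < 1."""
    return bool(np.all(np.abs(np.linalg.eigvals(g)) < 1.0))


g_damped = np.array([[0.8, 0.2], [-0.1, 0.7]])    # all modes decay
g_growing = np.array([[1.1, 0.0], [0.0, 0.5]])    # one mode grows

rates_damped = lim_decay_rates(g_damped)
rates_growing = lim_decay_rates(g_growing)
```

A positive entry in the returned rates flags a growing eigenmode, which is the symptom reported above for the single‐month LIMs.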

We perform both *in*‐ and *out‐of‐sample* validation, which refer to the time period used for validation relative to the training period. For in‐sample validation, the time period used for validation is also used to train the LIM, and for out‐of‐sample validation, the validation time period is not used in training. We will use *intramodel* validation to indicate when data originating from the same model run is used for both training and validation of the LIM and *cross‐model* validation to indicate when the LIM is trained and initialized using data originating from different model simulations and reanalysis data.

For training the LIM, we use monthly averaged data from the Community Earth System Model version 1 (CESM1) Last Millennium Ensemble (LME, Otto‐Bliesner et al., 2016). We train and validate our LIM on a single ensemble member spanning years 850–1850 CE. Model fields used to test their effect on LIM forecast skill are: 2‐m air temperature (TAS), sea level pressure (PSL), 500 hPa geopotential height (ZG500), sea surface temperature (SST), sea‐ice thickness (SIT), and SIC. These variables were selected because they have all been associated with sea‐ice variability, and we hypothesize that including each of these variables may contribute positively to the LIM’s ability to predict SIC. Specifically, Arctic sea ice has been shown to exhibit strong coupling with surface air temperature, particularly on longer timescales (e.g., Armour et al., 2011; Gregory et al., 2002; Mahlstein & Knutti, 2012; Olonscheck et al., 2019). PSL is associated with surface winds that can drive sea‐ice motion and variability (e.g., Rigor et al., 2002) and, similarly, ZG500 is associated with the large‐scale dynamical state of the atmosphere which can also contribute to sea‐ice variability. Furthermore, persistence in SIC is strongly influenced by both SSTs and SIT (e.g., Blanchard‐Wrigglesworth, Armour, et al., 2011; Bushuk et al., 2015).

We also initialize and train the LIM using other models and data products simulating various time periods in order to test the sensitivity of the LIM to different mean states and covariance structures across variables. During the last millennium, we use monthly averaged data from both the Community Climate System Model version 4 (CCSM4, Landrum et al., 2013) and Max Planck Institute (MPI) for meteorology last millennium simulations (Jungclaus et al., 2012), which were run from 850 to 1850 CE. Both of these simulations are part of the Coupled Model Intercomparison Project, phase 5 (Taylor et al., 2012), Paleoclimate Modeling Intercomparison Project phase 3. During the historical period, we use monthly averaged data from the Coupled Model Intercomparison Project, phase 6 (CMIP6, Eyring et al., 2016) Community Earth System Model version 2 (CESM2, Danabasoglu et al., 2020), MPI Earth System Model version 1.2 (low resolution [Mauritsen et al., 2019]), and Geophysical Fluid Dynamics Laboratory Earth System Model version 4.1 (GFDL, Dunne et al., 2020) simulations, which were run from 1850 to 2014 CE. We also validate the LIM using the European Center for Medium‐Range Weather Forecasts reanalysis product (ERA5, Hersbach et al., 2020), which spans 1979–2020 CE (ERA5 does not include a SIT variable). For simulations of the future, we use one ensemble member from the CESM1 Large Ensemble (Kay et al., 2015) which simulates 1920–2100 CE. All variables from all sources are regridded onto the native ocean and atmosphere grids from the CESM1 LME simulations.

To validate our predictions, we use the squared correlation coefficient (*R*^{2} value), coefficient of efficiency (CE), and root mean squared error (RMSE). As defined below, the correlation coefficient (*R*) measures relative phasing of two time series,

$$R = \frac{\sum_{i=1}^{N}(x_{i}-\bar{x})(v_{i}-\bar{v})}{\sqrt{\sum_{i=1}^{N}(x_{i}-\bar{x})^{2}}\,\sqrt{\sum_{i=1}^{N}(v_{i}-\bar{v})^{2}}},$$

where *v* is the verification data, *x* is the state being evaluated (the forecasted value), and overbars denote means over the *N* forecasts. The square of *R* describes the percentage of the variance in *v* that is linearly explained by *x*. The CE (Nash & Sutcliffe, 1970), like the correlation coefficient, measures the relative phasing of two data sets, but it is also sensitive to bias in the mean and variance:

$$\mathrm{CE} = 1 - \frac{\sum_{i=1}^{N}(v_{i}-x_{i})^{2}}{\sum_{i=1}^{N}(v_{i}-\bar{v})^{2}}.$$

To assess the pan‐Arctic skill of SIC, which is reported as the percentage of a grid cell covered in sea ice, we use total Arctic RMSE. For forecasts of other quantities such as SIT, we use Arctic mean RMSE. To calculate the RMSE we use

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_{i}-v_{i})^{2}},$$

where *i* represents a given forecast, *N* is the number of forecasts, *v* is the verification data, and *x* is the forecast being evaluated. The RMSE is calculated for each grid cell. For SIC, the RMSE is multiplied by the grid cell area and the sum is taken across the ocean domain (north of 40°N, land not included), which we refer to as total Arctic RMSE. For other variables, the area‐weighted Arctic mean RMSE is taken across the Arctic (north of 40°N), which we refer to as the Arctic mean RMSE. LIM forecasts are projected back from EOF space and compared with the verification data (not truncated) in latitude–longitude space.
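
The three verification metrics and the two spatial aggregations can be sketched as below. This assumes the standard definitions of the correlation coefficient, the Nash–Sutcliffe CE, and the RMSE; the function names are hypothetical, and SIC is assumed to be expressed as a fraction so that multiplying by cell area gives an area.

```python
import numpy as np


def r_squared(x, v):
    """Squared Pearson correlation between forecast x and verification v."""
    return np.corrcoef(x, v)[0, 1] ** 2


def coefficient_of_efficiency(x, v):
    """Nash-Sutcliffe CE: 1 minus error variance over verification variance."""
    return 1.0 - np.sum((v - x) ** 2) / np.sum((v - v.mean()) ** 2)


def rmse(x, v, axis=0):
    """Root mean squared error over the forecast dimension."""
    return np.sqrt(np.mean((x - v) ** 2, axis=axis))


def total_arctic_rmse(x, v, cell_area):
    """Sum over grid cells of per-cell RMSE times cell area (used for SIC)."""
    return np.sum(rmse(x, v, axis=0) * cell_area)


def arctic_mean_rmse(x, v, cell_area):
    """Area-weighted mean of per-cell RMSE (used for other variables)."""
    return np.average(rmse(x, v, axis=0), weights=cell_area)


# A forecast with perfect phasing but a constant bias of +1.
v = np.array([1.0, 2.0, 3.0, 4.0])
x = v + 1.0
r2 = r_squared(x, v)
ce = coefficient_of_efficiency(x, v)
err = rmse(x, v)
```

In this toy case the correlation is perfect while the CE is well below 1, which illustrates how the CE penalizes bias that the correlation ignores.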

Given that the LIM assumes stationary statistics, we expect the LIM to produce skillful forecasts during the last millennium when there is less anthropogenic forcing than the Instrumental Era or future projections. We anticipate that the LIM will also perform well for intramodel validation given that the covariances in the LIM match those in the target state. In Section 3.1, we will use intramodel experiments to optimize the LIM parameters, and in Section 3.2, we will use cross‐model validation to quantify under what time periods the LIM produces skillful forecasts. Forecasts from an AR1 model, trained on the EOF truncated data, serve as a baseline reference for comparison with the LIM forecasts. Here, an AR1 forecast (*W*) for a given month *t* is defined as *W*_{t} = *α*_{1}*W*_{t−1}, where *α*_{1} is the 1 month lag correlation of the system.
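
A minimal sketch of the AR1 baseline is given below. It assumes the model is applied independently to each element of the EOF‐truncated state and that a forecast at a lead of several months is obtained by applying the 1 month relation repeatedly; the function names `fit_ar1` and `ar1_forecast` are hypothetical.

```python
import numpy as np


def fit_ar1(pcs):
    """Lag-1 correlation (alpha_1) for each column of a (time, state) array."""
    x0 = pcs[:-1] - pcs[:-1].mean(axis=0)
    x1 = pcs[1:] - pcs[1:].mean(axis=0)
    num = np.sum(x0 * x1, axis=0)
    den = np.sqrt(np.sum(x0 ** 2, axis=0) * np.sum(x1 ** 2, axis=0))
    return num / den


def ar1_forecast(alpha, w_init, lead):
    """Apply W_t = alpha_1 * W_{t-1} `lead` times to the initial state."""
    return alpha ** lead * w_init


# Two synthetic AR1 series with known lag-1 correlations (illustrative only).
rng = np.random.default_rng(3)
alpha_true = np.array([0.9, 0.3])
w = np.zeros((20000, 2))
for t in range(1, w.shape[0]):
    w[t] = alpha_true * w[t - 1] + rng.standard_normal(2)

alpha = fit_ar1(w)
w_three_months = ar1_forecast(alpha, w[-1], lead=3)
```

Unlike the LIM, this baseline has no coupling between state elements, so any skill the LIM adds over it reflects information carried by the off‐diagonal structure of the propagator.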

To start, we train and validate the LIM during the last millennium using intramodel validation with a CESM1 LME simulation, which provides ample data to test the sensitivity of the LIM to various parameters. In particular, we test the sensitivity of LIM performance to the following parameters: number of training and validation years, number of EOFs included in the truncation, and variables included in training.

To investigate the number of training and validation years necessary for converged LIM performance metrics, we perform two experiments. First, we fix the training period from 850 to 1650 CE (800 years) and vary the length of the validation segments from 10 to 200 years between 1651 and 1850 CE. All possible nonoverlapping validation segments of each length are used. Next, we fix the validation period from 1751 to 1850 CE (100 years) and vary the number of training years from 100 to 900 years, using all possible nonoverlapping segments between 850 and 1750 CE. For 1 month forecasts, the LIM total Arctic RMSE asymptotes at large sample size to 1.5 × 10^{6} km^{2} as compared to 1.6 × 10^{6} km^{2} for the AR1 forecasts (Figure S1 in Supporting Information S1). This indicates that more than 500 years of monthly training data and 100 years of validation data are needed to produce reliable skill metrics. For the remaining experiments, we will train the LIM between 850 and 1650 CE unless otherwise noted.
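
The nonoverlapping segments used in these experiments can be enumerated as in the sketch below. It assumes segments are laid end to end from the start of the period and that any leftover years at the end are dropped; the function name is hypothetical.

```python
def nonoverlapping_segments(start, end, length):
    """List (first_year, last_year) segments of `length` years in [start, end].

    Segments are contiguous from `start`; leftover years at the end are dropped.
    """
    return [(s, s + length - 1) for s in range(start, end - length + 2, length)]


validation_100 = nonoverlapping_segments(1651, 1850, 100)
validation_10 = nonoverlapping_segments(1651, 1850, 10)
training_300 = nonoverlapping_segments(850, 1750, 300)
```

Short segment lengths yield many independent skill estimates whose spread indicates sampling uncertainty, whereas the longest lengths yield only one or two estimates.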

Next, we vary the number of EOFs included in the truncation during the training procedure (Section 2.4) from 5 to 250 for each variable and perform both in‐sample (851–1050 CE) and out‐of‐sample (1651–1850 CE) validation. The percent of variance explained as a function of the number of EOFs retained for each of the six variables in the CESM1 LME simulation data is shown in Figure S2 in Supporting Information S1. For all experiments with 10 or more EOFs for each variable, the LIM out‐of‐sample forecasts outperform an AR1 forecast for lead times up to around 5–6 months (not shown). While LIM forecast error decreases monotonically with the number of EOFs for the in‐sample experiments, skill for the out‐of‐sample experiments levels off around 40–80 EOFs depending on lead time (Figure S3 in Supporting Information S1). This indicates overfitting for larger numbers of EOFs, and to avoid that we truncate all subsequent experiments to 50 EOFs per variable.

We now investigate how including different variables contributes to LIM predictability of Arctic SIC at different lead times. We consider the role of atmospheric (TAS, PSL, and ZG500), oceanic (SST), and SIT on SIC predictions. Nine different LIMs are trained using: only SIC, SIC plus TAS, SIC plus PSL, SIC plus ZG500, SIC plus SST, SIC plus SIT, SIC plus SST plus SIT, SIC plus TAS plus SST plus SIT, and a LIM trained using all six variables. For each LIM, we use 800 training years (851–1650 CE), 200 validation years (1651–1850 CE), and truncate to 50 EOFs per variable. Moreover, we evaluate LIM performance relative to a LIM trained on SIC alone (Figure 1). Most of the SIC forecast skill comes from SST and SIT variables (dark blue dashed line in Figure 1), as well as TAS (purple dashed line in Figure 1); however, all variables contribute favorably to forecast skill of SIC. TAS and SST contribute the most to SIC forecast skill on 1–3 months lead times, while SIT along with TAS and SST contribute the largest skill increase on 4–6 months lead times. PSL and ZG500 show very similar contributions to the forecast skill in SIC at all lead times, though adding both variables increases forecast performance more than adding either variable individually (not shown). Given these results, we include all six variables in LIM training except when validating on reanalysis data, which does not include SIT.

We assess out‐of‐sample performance of the LIM by training on all months between 850 and 1650 CE and validating between 1651 and 1850 CE using the CESM1 LME simulation. The LIM shows skill above an AR1 forecast for all variables at 1 month lead times (Figure S4 in Supporting Information S1). For mass fields (PSL and ZG500), such skill is short‐lived and disappears beyond 2 months lead time. For sea‐ice variables (SIT and SIC) and TAS, the LIM outperforms an AR1 forecast through 8 months lead time, and for SST through 7 months lead time. Generally, the shorter‐lived predictability of PSL and ZG500 makes physical sense given the relatively short‐lived nature of pressure anomalies in the atmosphere relative to SST anomalies, which tend to be more persistent. The longer lived predictability of SIT is also consistent with previous work showing that SIT anomalies persist across seasons (Balan‐Sarojini et al., 2021) and up to a year (Blanchard‐Wrigglesworth, Armour, et al., 2011).

In terms of the spatial distribution of skill, we find positive CE values everywhere except for small regions near the ice edge where sea‐ice variability is only present during a small number of months in the validation data set (Figure S5 in Supporting Information S1). RMSE is generally greatest near the sea‐ice edge, with largest values near the Fram Strait into the Barents Sea. Figure 2 shows the difference in spatial skill of 1‐month forecasts between the LIM and AR1 forecasts. The LIM outperforms an AR1 forecast in nearly all regions, increasing the correlation and CE value by up to approximately 0.24. RMSE is also reduced or remains the same everywhere with a maximum reduction of approximately 3.8%. To place this value in context, we note that the maximum mean monthly standard deviation in SIC across all grid cells in the CESM1 LME simulation (1850–2005 CE) is approximately 30%. The reduction in RMSE, relative to this standard deviation, is highest (0.28 sigma) in the Pacific sector, Hudson Bay, and Kara Sea (see Figure S6 in Supporting Information S1).

These intramodel, out‐of‐sample results indicate that the LIM can predict well not only SIC on seasonal timescales but also other important oceanic and atmospheric variables of the Arctic climate state. We have optimized the LIM to predict SIC, but it maintains skill in other variables for long lead times (up to 8 months, Figure S4 in Supporting Information S1). Furthermore, the LIM is able to predict SIC on 1 month lead times throughout the Arctic region, showing better skill than an AR1 forecast everywhere. These results indicate that the LIM is a useful tool for efficient emulation of coupled model simulations.

To investigate our second research question and determine under what conditions the LIM works well for predicting Arctic sea‐ice conditions, we perform cross‐model validation. For these experiments, we initialize the LIM using data that originate from a different model than the LIM training data (CESM1 LME simulation from 850 to 1650 CE). Generally, we find that the LIM fails to outperform an AR1 forecast (trained on the same model data as the LIM) when initialized with models outside CESM1 and outside the mean state conditions of the last millennium. Forecast skill as a function of lead time for the cross‐model experiments is shown in Figure S7 in Supporting Information S1. For all cross‐model validation experiments, the LIM outperforms climatology on 1 month timescales and up to 4 months when using CESM1 historical simulations. Only experiments performed using historical or last millennium simulations from CCSM4 and CESM1 outperform an AR1 forecast (Figure S7 in Supporting Information S1).

To summarize and evaluate LIM performance when trained using CESM1 LME and validated using different data sets, we compare the variability in the validation data relative to the training data set. Specifically, we calculate the spatial correlation between the training and validation patterns of variability to represent spatial differences. Figure 3 shows this spatial correlation versus the total Arctic RMSE for different LIM forecasts relative to an AR1 forecast of SIC. The vertical axis in Figure 3 (and Figure S7 in Supporting Information S1) indicates that the LIM is able to outperform an AR1 forecast when initialized with CCSM4 last millennium simulations (navy blue), but not with MPI last millennium simulations (brown). Similarly, when initialized with CMIP6 historical simulations (1851–2014 CE) from MPI, GFDL, and CESM2 as well as a future scenario from the CESM1 LE, the LIM was unable to outperform an AR1 forecast. Finally, we validated the LIM using ERA5 reanalysis data from 1980 to 2020 CE, and this LIM also failed to outperform an AR1 forecast (SIT not included).

Figure 3 also indicates that, overall, this spatial correlation is a good predictor of LIM performance. Thus, the locations of the variability in the training and validation data sets have to be sufficiently colocated in order for the LIM to outperform an AR1 forecast. Applying least squares regression to the data shown in Figure 3 yields an *R*^{2}‐value of 0.77. The *x*‐intercept of this best fit line is 0.66, indicating that the spatial correlation between the training and validation data variance must exceed that value for the LIM to be likely to outperform an AR1 forecast. A similar analysis was also done for SIT (see Figure S8 in Supporting Information S1) and the *x*‐intercept of the best fit line is 0.55 (*R*^{2}‐value of 0.72). These results confirm that LIMs are most useful when applied to time periods exhibiting stationary statistics and that the LIM inherits model biases from the training data set.
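
The skill‐prediction metric can be sketched as follows: compute the temporal variance at each grid cell in the training and validation data, correlate the two variance maps in space, and then regress the LIM‐minus‐AR1 error difference on that correlation to find where the fitted line crosses zero. This is an illustration under assumptions, not the analysis code used here: the optional area weighting, the function names, and the synthetic inputs are all hypothetical.

```python
import numpy as np


def variance_pattern_correlation(train, valid, weights=None):
    """Spatial correlation between temporal-variance maps of two data sets.

    train, valid: (n_time, n_points) anomaly arrays on the same grid.
    weights: optional (n_points,) area weights (assumed; defaults to uniform).
    """
    var_t = train.var(axis=0, ddof=1)
    var_v = valid.var(axis=0, ddof=1)
    w = np.ones_like(var_t) if weights is None else np.asarray(weights, float)
    w = w / w.sum()
    dt = var_t - np.sum(w * var_t)
    dv = var_v - np.sum(w * var_v)
    return np.sum(w * dt * dv) / np.sqrt(np.sum(w * dt**2) * np.sum(w * dv**2))


def skill_threshold(correlations, error_difference):
    """x-intercept of the least squares line through (correlation, difference)."""
    slope, intercept = np.polyfit(correlations, error_difference, 1)
    return -intercept / slope


# Synthetic example: variability concentrated in different grid cells.
rng = np.random.default_rng(4)
amplitude = np.linspace(0.5, 2.0, 20)
train = rng.standard_normal((500, 20)) * amplitude
valid_same = 2.0 * train                                      # same pattern
valid_moved = rng.standard_normal((500, 20)) * amplitude[::-1]  # pattern reversed

corr_same = variance_pattern_correlation(train, valid_same)
corr_moved = variance_pattern_correlation(train, valid_moved)

# Made-up points lying exactly on a line that crosses zero at 0.5.
corr_points = np.array([0.2, 0.4, 0.6, 0.8])
threshold = skill_threshold(corr_points, corr_points - 0.5)
```

Rescaling the validation variance leaves the metric unchanged, while moving the variability to different grid cells lowers it, which mirrors the sensitivity to the location of the sea‐ice edge described above.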

Overall, we find that a LIM trained using a CESM1 LME simulation performs well as an emulator, yielding skillful forecasts of Arctic sea ice and other climate variables to 8 months lead, when validated on data from the model used for training, including out‐of‐sample validation in time. We find that at least 500 years of monthly data are needed for training and 100 or more years of monthly validation data to reliably quantify skill. We also find that TAS and SST contribute most to Arctic SIC prediction on 1–3 months timescales, while SIT becomes more important on 4–6 months timescales.

When validated on out‐of‐sample intramodel data, the LIM performs well not only for SIC prediction but for all of the six variables included. The predictions outperform an AR1 forecast for all variables except for PSL and ZG500 for lead times beyond 2 months. This highlights the ability of the LIM to capture a broader picture of the Arctic system despite being optimized for predicting sea‐ice coverage. The LIM outperforms an AR1 forecast throughout the Arctic and shows the most skill in regions near the ice edge with the most variability in SIC.

We test the ability of the LIM to predict conditions when initialized with other last millennium simulations, historical simulations, reanalysis data, and a future scenario. Although the LIMs produce skillful forecasts to 1–4 months lead time, none outperform an AR1 forecast (except the CCSM4 last millennium simulation). This result is likely due in part to model bias and in part to changes in climate statistics depending on the scenario of interest. Previous work has found a large range of persistence estimates across models and a general overestimation of persistence in models compared to observations (Blanchard‐Wrigglesworth & Bushuk, 2019; Giesse et al., 2021), which may contribute to the lack of skill we see during cross‐model validation. We find that the spatial correlation between the variability (in time) of the training versus validation data can be used as a good predictor (*R*^{2}‐value of 0.77) of whether the LIM will outperform an AR1 forecast, with a spatial correlation greater than 0.66 between the training and validation needed for LIM forecasts to outperform AR1.

Given this criterion, one can pursue training and validation data sets that best fit a period of interest. If a LIM were to be trained for present‐day forecasting, more training data would be needed than are available from satellite and reanalysis products. Furthermore, a training data set’s variability would need to correlate well with satellite observations. The training data set we use here, CESM1 LME, has a spatial correlation of 0.57 with satellite observations between 1980 and 2015 CE, which does not pass the threshold determined here. This value drops to 0.42 between 2000 and 2015 CE.

As described here, the LIM is best used as a model emulator as it seems to inherit model biases from the training data set. However, given that the LIM can predict a diverse set of variables across the climate system at minimal computational cost, it could also be a useful tool for probing coupled model dynamics in the Arctic.

The authors thank Kinya Toride and Lindsey Taylor for their helpful comments and suggestions regarding Linear Inverse Modeling. MKB was supported by a NSF Graduate Research Fellowship award DGE‐1762114 and the University of Washington CICOES Graduate Student Fellowship. GJH was supported by NOAA through award NA20NWS4680053 and EB‐W by NSF through award OPP‐1751363. GJH and EB‐W were also supported by NSF through Grant OPP‐2213988.

All data used to train and validate models for this work were derived from other sources. CESM1 LME simulations are available through Otto‐Bliesner et al. (2016), CCSM4 last millennium simulations through Landrum et al. (2013), and MPI last millennium simulations from Jungclaus et al. (2012). Historical simulations from CESM2 are available through Danabasoglu et al. (2020), MPI historical simulations through Mauritsen et al. (2019), and GFDL historical simulations from Dunne et al. (2020). ERA5 reanalysis data are available from Hersbach et al. (2020) and CESM1 LE simulations through Kay et al. (2015).