Forecasting distribution shifts under novel environmental conditions is a major task for ecologists and conservationists. Researchers forecast distribution shifts using several tools including: predicting from an empirical relationship between a summary of distribution (population centroid) and annual time series (“annual regression,”

Many ecosystems worldwide are showing rapid responses to changing environmental conditions. Distribution shifts often indicate the intensity of environmental impacts, both because distribution shifts can directly cause changes in community stability (Theobald, Breckheimer, & HilleRisLambers, ) and because distribution is an integrated measure of changing demographic rates (Merow et al., ). Rapid distribution shifts have been documented for birds (Hitch & Leberg, ), plants (Kelly & Goulden, ) and fishes (Pinsky, Worm, Fogarty, Sarmiento, & Levin, ).

Distribution shifts due to environmental change can be particularly rapid in marine ecosystems. For example, the Gulf of Maine has shown rapid increases in water temperature during the past decade, and these increases have changed the spatial distribution and phenology of lobster fisheries thereby causing economic hardships in 2012 (Mills et al., ). Similarly, the Northwest Atlantic as a whole has shown rapid warming, and this has driven a rapid northward shift in abundance for yellowtail and summer flounder, in turn causing a lagged shift in fishing effort for these species (Pinsky & Fogarty, ). Rapid changes such as these are less frequent in ecological systems where slow changes in habitat or long‐lived life‐history strategies can underlie a lagged effect of climate on local population densities (Schurr et al., ).

Forecasting future shifts in distribution is a unifying challenge for ecologists and conservationists across many taxa. Forecasting future distribution shift is important to predict likely changes in competitive and facultative interactions among species (Schliep et al., ) or the invasion success of newly introduced species (Ramírez‐Albores, Bustamante, & Badano, ). In marine species, forecasting distribution shifts is particularly important when renegotiating fishery access rights (Pinsky & Fogarty, ), or when forecasting potential impacts of changing climate on fishery economic potential (Cheung et al., ).

Recent research has recommended forecasting distribution shifts using mechanistic models that represent the impact of environmental changes and biotic interactions on individual demographic rates (Mellin et al., ; Swab, Regan, Matthies, Becker, & Bruun, ; Trainor, Schmitz, Ivan, & Shenk, ; Zurell, ). However, many taxa (e.g., marine fishes, insects and fungi communities, among others) often do not have sufficient individual‐specific demographic information to parameterize stage‐based models of environmental and biological mechanisms (e.g., examples in Parmesan, ). In marine fishes, for example, researchers have instead forecasted distribution shifts using two types of models: (a) annual regressions (AR) of distribution shift, or (b) habitat‐envelope (HE) models. Annual regressions of distribution generally involve calculating an annual summary statistic for species distribution, for example, the centroid of a population in each year. This centroid is then regressed against multiple annual covariates, for example, total abundance, average bottom temperature or fishing mortality rates (Adams et al., ; Mueter & Litzow, ; Nye, Link, Hare, & Overholtz, ; Spencer, ), and this empirical relationship is used to forecast distribution shift under alternative future values of these covariates (Hare, Alexander, Fogarty, Williams, & Scott, ). By contrast, habitat‐envelope models involve estimating or pre‐specifying the habitat suitability function for a given species based on temperature, demersal habitat or other local habitat conditions (Cheung, Lam, & Pauly, ; Kaschner, Watson, Trites, & Pauly, ). Changes in habitat suitability can then be forecasted given projected changes in environmental conditions, and changed habitat suitability can be used to forecast changes in the centroid of a population's distribution (Cheung et al., ). 
Recent research has also developed spatio‐temporal models to estimate distribution shifts and attribute shifts to different causal mechanisms, and these models can include both annual covariates (e.g., average environmental conditions) and local covariates affecting habitat suitability (Thorson, Ianelli, & Kotwicki, ; Tredennick et al., ).

Despite this increased interest in forecasting species distribution, there has been relatively little research comparing predictive skill for alternative potential models used to forecast distribution shifts. Skill testing is a common approach to compare models and benchmark progress in many other types of environmental forecasting and has been widely used, for example, to validate strategic forecasts of abundance and productivity using an ecosystem model for the Northwest Atlantic (Olsen et al., ), explore improvements in hurricane forecasts over time (Vitart, ), assess skill for decadal forecasts of butterfly distribution in Finland (Eskildsen et al., ), or identify when regression models are useful to forecast fish distribution for the Northwest Atlantic (Kleisner et al., ). Skill testing has seen increasing use for seasonal forecasts of fish distribution shift, for example, 2‐month forecasts for the spatial distribution of tunas in the Great Australian Bight (Eveson, Hobday, Hartog, Spillman, & Rough, ), or 4‐ to 8‐month forecasts of sardine distribution in the US California Current (Kaplan, Williams, Bond, Hermann, & Siedlecki, ). However, skill testing has not been widely applied to compare performance for multiple models to forecast shifts in species distribution occurring over time‐scales between seasonal and climate forecasts (1–5 years, termed “short‐term” in the following).

In this paper, I use retrospective skill testing to compare the performance of “annual regression” (AR) and “habitat‐envelope” (HE) forecasts of distribution shift with an alternative vector‐autoregressive spatio‐temporal (VAST) model. Following common practice in fisheries, I measure distribution as the centroid of population biomass rather than measuring changes in occupied habitat. The VAST and HE models involve fitting a delta‐generalized linear mixed model (delta‐GLMM) to local samples of population biomass while accounting for spatial, temporal and spatio‐temporal correlations in biomass density. AR, HE and VAST forecasts are generated by fitting to biomass‐sampling data at 370 stations for each of 20 marine species in the Eastern Bering Sea from 1982 to 2015. For each species, I fit data for a subset of years, predict shifts in centre of gravity (COG) occurring after the last fitted year and compare the prediction with observed shifts over those subsequent years. Finally, I summarize predictive skill by calculating the variance explained by each model's forecast relative to forecasting that the population does not move in the future (termed a “persistence forecast”). I also determine whether the predicted uncertainty for each model (termed the “forecast interval”) is too narrow, too wide or has appropriate width. These two metrics of predictive skill measure whether forecasts are accurate and generate useful estimates of forecast uncertainty. I use this example to illustrate the importance of retrospective skill testing for measuring model performance when planning future model developments or choosing among alternative forecast models.

I seek to evaluate predictive skill for three potential models for forecasting distribution shifts without invoking detailed demographic mechanism: annual‐regression models (AR), habitat‐envelope models (HE) and a vector‐autoregressive spatio‐temporal (VAST) model with or without temperature as a habitat covariate. These models forecast distribution based on changes in temperature, and I here assume that models have perfect information about future temperature; results therefore represent an “optimistic” picture of forecast skill relative to cases when future temperature must itself be forecasted. I first discuss each of these in detail (see Table for summary). I then describe two measures of forecast skill: reduction in mean squared error (MSE) relative to a null “persistence” forecast, and predictive interval coverage. Finally, I describe the data set used in this skill testing experiment.

Several authors have previously analysed environmental impacts on distribution shifts by regressing a measure of population location against annual covariates, without proposing any explicit model for how the environment impacts local population densities (Hare et al., ; Nye et al., ; Pinsky et al., ). This approach will presumably perform better than the habitat‐envelope or spatio‐temporal models whenever the impact of habitat covariates on local densities is difficult to specify correctly a priori.

To implement this approach, I first develop a statistic *Y*(*t*) representing the centroid of the population, measured as distance north of the equator and calculated as the “abundance‐and‐area weighted average” (AAWA) of sampled locations (see Supporting Information Appendix S1). I then regress *Y*(*t*) against average bottom temperature $\bar{C}(t)$ in each year *t*:

$$Y(t) = \alpha + \delta \bar{C}(t) + e(t)$$

where *α* and *δ* represent the intercept and slope for a linear regression of *Y*(*t*) on average temperature, and $e(t)$ is residual error. I then forecast the future centroid *Y*(*t*_{forecast}) given average bottom temperature in the forecast year.
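The annual‐regression forecast can be illustrated with a minimal Python sketch, assuming an ordinary least‐squares fit of the centroid on average annual bottom temperature. The function names and the numeric values are hypothetical illustrations and are not the code or survey data used in this analysis.

```python
def fit_annual_regression(temps, centroids):
    """Least-squares intercept and slope for centroid regressed on temperature."""
    n = len(temps)
    mean_t = sum(temps) / n
    mean_y = sum(centroids) / n
    sxx = sum((t - mean_t) ** 2 for t in temps)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(temps, centroids))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return intercept, slope


def forecast_centroid(intercept, slope, future_temp):
    """Forecast the centroid for a year with known average temperature."""
    return intercept + slope * future_temp


# Made-up values: average bottom temperature (degrees C) and centroid (km north)
temps = [1.0, 2.0, 3.0, 4.0]
centroids = [6400.0, 6420.0, 6440.0, 6460.0]

alpha, delta = fit_annual_regression(temps, centroids)
forecast = forecast_centroid(alpha, delta, 5.0)
```

Because the forecast depends only on one annual covariate, this approach cannot represent year‐to‐year persistence in spatial pattern.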

Habitat‐envelope (a.k.a. species distribution/density) models typically use available data or expert opinion to define habitat suitability as a function of habitat variables, where future distribution can then be forecasted given alternative values for habitat variables (Cheung et al., ; Eskildsen et al., ; Eveson et al., ).

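I use a delta‐generalized linear model (Lo, Jacobson, & Squire, ) to separately predict occupied habitat via changes in encounter probability *p* as well as biomass density within occupied habitat via positive catch rates *r*. This delta model involves specifying a probability distribution for sampled biomass *b*_{i} for each sample *i* given predicted encounter probability *p*_{i} and positive catch rates *r*_{i}:

$$\Pr(b_i = B) = \begin{cases} 1 - p_i & \text{if } B = 0 \\ p_i \times g(B \mid r_i, \sigma^2) & \text{if } B > 0 \end{cases}$$

where $g(B \mid r_i, \sigma^2)$ is the probability density function assumed for positive catches, with expected value *r*_{i} and dispersion $\sigma^2$.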

Specifically, I use a Poisson‐link delta model (Thorson, ) that accomplishes this by modelling both the density of individuals *n*(*s*,* t*) and the average biomass per individual *w*(*s*,* t*) at each location *s* and year *t*. Encounter probability *p*_{i} is defined in the Poisson‐link delta model given the assumption that individuals are randomly distributed in the vicinity of sampling, such that the probability of encountering at least one individual follows:

$$p_i = 1 - \exp\left(-a_i \times n(s_i, t_i)\right)$$

where *a*_{i} is the area sampled for sample *i* (in this case, the area swept by each operation of bottom trawl sampling), and *s*_{i} and *t*_{i} are the location and year of that sample. Positive catch rate *r*_{i} is defined from the definition of biomass density, *d*(*s*,* t*) = *r*(*s*,* t*)*p*(*s*,* t*) = *n*(*s*,* t*)*w*(*s*,* t*):

$$r_i = \frac{a_i \times n(s_i, t_i)}{p_i} \times w(s_i, t_i)$$
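The link between numbers density, average weight and the two delta‐model components can be sketched as follows. This sketch assumes that encounter probability takes the Poisson form 1 − exp(−*a* × *n*) and that expected catch equals area swept times biomass density; the function name and input values are hypothetical.

```python
import math


def poisson_link(numbers_density, mean_weight, area_swept):
    """Return (encounter probability, positive catch rate) for one sample.

    Assumes individuals are randomly (Poisson) distributed, so the chance of
    catching at least one is 1 - exp(-area * density).
    """
    expected_count = area_swept * numbers_density
    p = 1.0 - math.exp(-expected_count)
    # Positive catch rate is chosen so that p * r equals expected biomass caught
    r = expected_count * mean_weight / p
    return p, r


# Made-up inputs: 2 individuals per unit area, 0.5 kg each, 0.01 units swept
p, r = poisson_link(2.0, 0.5, 0.01)
expected_biomass = 0.01 * 2.0 * 0.5
```

By construction the product of encounter probability and positive catch rate recovers expected biomass, so the two components cannot imply contradictory densities.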

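Finally, I define numbers density and average weight as quadratic functions of bottom temperature using a log‐link for each:

$$\log n(s,t) = \beta_n(t) + \gamma_{n1} C(s,t) + \gamma_{n2} C(s,t)^2$$

$$\log w(s,t) = \beta_w(t) + \gamma_{w1} C(s,t) + \gamma_{w2} C(s,t)^2$$

where *C*(*s*,* t*) is temperature (in degrees Celsius) at each location *s* and year *t*, and *γ*_{n1}, *γ*_{n2}, *γ*_{w1} and *γ*_{w2} represent the estimated log‐linear and log‐quadratic impact of temperature on numbers density and average weight. I model annual intercepts *β*_{n}(*t*) and *β*_{w}(*t*) as a random‐walk process:

$$\beta_n(t+1) \sim \text{Normal}\left(\beta_n(t), \sigma_{\beta n}^2\right), \quad \beta_w(t+1) \sim \text{Normal}\left(\beta_w(t), \sigma_{\beta w}^2\right)$$

where the coefficients (*γ*_{n1}, *γ*_{w1}, *γ*_{n2}, *γ*_{w2}, *β*_{n}, *β*_{w}) are treated as random effects.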
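A quadratic response on the log scale implies a dome‐shaped response only when the quadratic coefficient is negative, in which case the temperature of peak density follows from setting the derivative to zero. The sketch below illustrates this under that assumption; the function names and coefficient values are hypothetical and are not estimates from this study.

```python
import math


def log_response(intercept, gamma1, gamma2, temperature):
    """Log-scale quadratic response to temperature."""
    return intercept + gamma1 * temperature + gamma2 * temperature ** 2


def optimal_temperature(gamma1, gamma2):
    """Temperature of peak response, or None when curvature is not negative."""
    if gamma2 >= 0:
        return None
    return -gamma1 / (2.0 * gamma2)


# Made-up coefficients with negative curvature
optimum = optimal_temperature(1.0, -0.125)
peak = math.exp(log_response(0.0, 1.0, -0.125, optimum))
cooler = math.exp(log_response(0.0, 1.0, -0.125, optimum - 2.0))
warmer = math.exp(log_response(0.0, 1.0, -0.125, optimum + 2.0))
```

With a positive quadratic coefficient there is no interior optimum, and the fitted response increases towards one or both extremes of observed temperature.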

This model is fitted to data for all years up to the last fitted year, and is then used to predict density *d*(*s*,* t*) for all locations during forecast years, given temperature *C*(*s*,* t*_{forecast}) during forecast years. Finally, density predictions are used to calculate the centroid of the population's distribution, *Z*(*t*):

$$Z(t) = \frac{\sum_{s} z(s)\, a(s)\, d(s,t)}{\sum_{s} a(s)\, d(s,t)}$$

where *a*(*s*) is the area associated with each modelled location *s*, and *z*(*s*) is a measure of location. In the following, I predict poleward movement and define *z*(*s*) as the distance of location *s* from the equator (in kilometres).
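The centroid calculation is a weighted average of location, with weights given by the product of area and predicted density at each modelled location. A minimal sketch, with a hypothetical function name and made‐up inputs:

```python
def centre_of_gravity(densities, areas, locations):
    """Area-and-density weighted average of location (e.g., km from equator)."""
    weights = [a * d for a, d in zip(areas, densities)]
    weighted_sum = sum(z * w for z, w in zip(locations, weights))
    return weighted_sum / sum(weights)


# Two equal-area cells; the northern cell holds three times the density
cog = centre_of_gravity(
    densities=[1.0, 3.0],
    areas=[1.0, 1.0],
    locations=[6000.0, 6400.0],
)
```

The result lies three quarters of the way towards the northern cell, reflecting its share of total biomass.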

This habitat‐envelope model attributes all variation in spatial distribution to a quadratic effect of bottom temperature on log‐density and neglects any residual spatial correlation. However, extensive previous research suggests that predictive performance and statistical interpretation are degraded when spatial models neglect spatial autocorrelation in residuals (Bahn & McGill, ; Dormann et al., ; Thorson, Ianelli, et al., ). I therefore apply an alternative spatio‐temporal model that can incorporate quadratic temperature effects as well as residual spatial patterns (spatial patterns that are constant among all years) and spatio‐temporal patterns (spatial patterns that vary among years). This “semiparametric” model has a habitat‐envelope model as its deterministic skeleton, but is also able to identify areas with higher or lower density than expected, and to estimate how quickly these areas with higher/lower density revert to their expected density when forecasting forward in time (Kai, Thorson, Piner, & Maunder, ; Thorson, Ianelli, et al., ).

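Specifically, I apply a vector‐autoregressive spatio‐temporal (VAST) model that has been used extensively elsewhere (Thorson, ; Thorson & Barnett, ), but which has not been used for short‐term forecasts of spatial distribution. This model again uses a Poisson‐link delta model (Equations ) and a random‐walk process for annual intercepts (Equation ). However, the spatio‐temporal model involves changing the linear predictor for log‐numbers density *n*(*s*,* t*) and log‐average weight *w*(*s*,* t*):

$$\log n(s,t) = \beta_n(t) + \gamma_{n1} C(s,t) + \gamma_{n2} C(s,t)^2 + \omega_n(s) + \varepsilon_n(s,t)$$

$$\log w(s,t) = \beta_w(t) + \gamma_{w1} C(s,t) + \gamma_{w2} C(s,t)^2 + \omega_w(s) + \varepsilon_w(s,t)$$

where $\omega_n(s)$ and $\omega_w(s)$ represent spatial variation, and $\varepsilon_n(s,t)$ and $\varepsilon_w(s,t)$ represent spatio‐temporal variation, for *n* and *w*. Spatial errors follow a Gaussian random field and are treated as random effects:

$$\boldsymbol{\omega}_n \sim \text{MVN}\left(\mathbf{0}, \sigma_{\omega n}^2 \mathbf{R}_n\right), \quad \boldsymbol{\omega}_w \sim \text{MVN}\left(\mathbf{0}, \sigma_{\omega w}^2 \mathbf{R}_w\right)$$

where **R**_{n} and **R**_{w} follow a Matérn correlation function where I estimate geometric anisotropy and a separate decorrelation rate for **R**_{n} and **R**_{w} (see Supporting Information Appendix S2 for details). I use a “predictive process” approximation to simplify computation for spatial and spatio‐temporal variation at 100 “knots” in a stochastic partial differential equation approximation to the Matérn correlation function (Lindgren, Rue, & Lindström, ). Spatio‐temporal variation is specified similarly, but also follows an autoregressive process across years:

$$\boldsymbol{\varepsilon}_n(t+1) \sim \text{MVN}\left(\rho_n \boldsymbol{\varepsilon}_n(t), \sigma_{\varepsilon n}^2 \mathbf{R}_n\right), \quad \boldsymbol{\varepsilon}_w(t+1) \sim \text{MVN}\left(\rho_w \boldsymbol{\varepsilon}_w(t), \sigma_{\varepsilon w}^2 \mathbf{R}_w\right)$$

where *ρ*_{n} and *ρ*_{w} are estimated parameters that govern the rate at which areas with higher/lower density than expected revert to the average spatial distribution during the forecast period.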
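The role of the autocorrelation parameters during the forecast period can be illustrated with a minimal sketch. It assumes a first‐order autoregressive process, for which the expected deviation *k* years ahead is the last estimated deviation multiplied by the autocorrelation raised to the power *k*. The function name is hypothetical, and the value 0.86 is used only as an example magnitude.

```python
def expected_deviation(last_deviation, rho, years_ahead):
    """Expected spatio-temporal deviation under a first-order autoregression."""
    return last_deviation * rho ** years_ahead


# A location one log-unit above its expected density in the last fitted year
forecasts = [expected_deviation(1.0, 0.86, k) for k in (1, 2, 3)]
```

Values of the autocorrelation near 1 imply that hotspots persist through the forecast period, whereas values near 0 imply that forecasts revert almost immediately to the average spatial distribution.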

I specifically explore two versions of this VAST model. The first version excludes temperature effects (i.e., *γ*_{n1} = *γ*_{n2} = *γ*_{w1} = *γ*_{w2} = 0), while the second estimates temperature effects; the two models are otherwise identical. The former model explains differences in biomass density *d*(*s*,* t*) purely via spatial and spatio‐temporal residual terms, and I therefore refer to it as a “non‐parametric” forecast model. The latter includes both non‐parametric components and a parametric effect of temperature. I therefore follow previous studies in calling it a “semiparametric” forecast model (Shelton, Thorson, Ward, & Feist, ; Sugeno & Munch, ; Thorson, Ono, & Munch, ). I fit this model using package VAST (^{−6}. This final step is implemented to ensure that the function maximizer is very tightly converged. I then confirm that parameters are estimable by confirming that the Hessian matrix is positive definite at the maximum‐likelihood estimates. Parameter estimation using VAST takes approximately 10–20 min for each species and retrospective run, so the analysis was feasible without the use of high‐performance computing tools.

I have defined four methods to forecast poleward movement of fish populations given available sampling data: habitat‐envelope models, a spatio‐temporal model with or without temperature, and an annual‐regression estimator. I compare performance of these four estimators when fitting to data for the 20 most abundant (in terms of sampled biomass) fish and crab (decapod) species in the bottom trawl survey operated by the Alaska Fisheries Science Center on the continental shelf of the Eastern Bering Sea from 1982 to 2015. This annual survey has used a fixed‐station design with over 370 samples per year and consistent gear over this period (Lauth & Conner, ), and I download data from the AFSC website (*FishData* (

For each species, I then fit each forecast model 21 times, that is, using data for all years 1982–2015, using data 1982–2014 and predicting 2015, using data 1982–2013 and predicting 2014–2015, …, and using data for 1982–1995 and predicting 1996–2015. This involves fitting four estimators in 21 retrospective models using data for 20 species (1,680 model fits total). For each model fit, I then record forecasts of the population centroid 1, 2 or 3 years ahead of the last fitted year. I also record the predicted standard error for this forecast, where asymptotic standard errors are calculated using a generalization of the delta method for empirical Bayes models (Kass & Steffey, ).
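The retrospective design can be enumerated explicitly to confirm the number of model fits. In this sketch the function name is hypothetical, and each tuple records the first fitted year, the last fitted year and the years left out for prediction.

```python
def retrospective_peels(first_year, last_year, earliest_final_year):
    """List (first fitted year, last fitted year, forecast years) per peel."""
    peels = []
    for final_year in range(last_year, earliest_final_year - 1, -1):
        forecast_years = list(range(final_year + 1, last_year + 1))
        peels.append((first_year, final_year, forecast_years))
    return peels


peels = retrospective_peels(1982, 2015, 1995)
n_estimators, n_species = 4, 20
total_fits = len(peels) * n_estimators * n_species
# Peels that leave at least one year available for comparison
peels_with_forecasts = [p for p in peels if p[2]]
```

The fit to all years (1982–2015) has no held‐out years, so only 20 of the 21 peels per species contribute forecasts that can be compared with later observations.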

I follow Tommasi et al. () in arguing that a well‐performing estimator will have two characteristics: (a) It will outperform a “persistence” forecast; and (b) it will estimate uncertainty in a useful manner.

I evaluate each forecast model by comparing performance when predicting future changes in COG against the observed change in COG calculated using the abundance‐and‐area weighted average (AAWA) estimator. I evaluate skill when forecasting future changes in distribution because distribution shifts are likely to be disruptive to place‐based management measures and fishing activities. I do not know the true centroid of the population in any year, but the AAWA estimator is independent of each forecast model and therefore represents a fair and unbiased metric against which to compare each forecast model. I specifically calculate changes in AAWA, $\Delta Y_{AAWA}(t_{final}, \Delta t)$, occurring Δ*t* years after each final year *t*_{final}:

$$\Delta Y_{AAWA}(t_{final}, \Delta t) = Y_{AAWA}(t_{final} + \Delta t) - Y_{AAWA}(t_{final})$$

where Δ*t* ∊ {1, 2, 3}. I also calculate change in centroid over Δ*t* forecast years for each estimator when fitting to data through year *t*_{final}. For example, given the forecasted centroid $Z_{HE}(t_{forecast} \mid t_{final})$ in year *t*_{forecast} using data through *t*_{final} and the habitat‐envelope model:

$$\Delta Z_{HE}(t_{final}, \Delta t) = Z_{HE}(t_{final} + \Delta t \mid t_{final}) - Z_{HE}(t_{final} \mid t_{final})$$

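The persistence forecast will predict that *Y*_{persistence} (*t*_{final} + Δ*t*) = *Y*_{persistence} (*t*_{final}) and therefore that Δ*Y*_{persistence} (Δ*t*) = 0. I therefore calculate the predictive error for each estimator relative to the persistence forecast as:

$$V(\Delta t) = \frac{\sum_{t_{final}} \left(\Delta Z(t_{final}, \Delta t) - \Delta Y_{AAWA}(t_{final}, \Delta t)\right)^2}{\sum_{t_{final}} \left(\Delta Y_{persistence}(\Delta t) - \Delta Y_{AAWA}(t_{final}, \Delta t)\right)^2}$$

where $\Delta Z(t_{final}, \Delta t)$ is the change in centroid forecasted by a given estimator Δ*t* years after final year *t*_{final}, and $\Delta Y_{AAWA}(t_{final}, \Delta t)$ is the corresponding observed change in the AAWA estimator. I then calculate the variance explained as *R*^{2}(Δ*t*) = 1–*V*(Δ*t*). A model performing as well as the persistence forecast will have *V*(Δ*t*) = 1 and *R*^{2}(Δ*t*) = 0. A model with *R*^{2}(Δ*t*) > 0 outperforms the persistence forecast, whereas a model with *R*^{2}(Δ*t*) < 0 has degraded performance relative to a persistence forecast. Hypothetically, a model with, for example, *R*^{2}(Δ*t*) = 0.2 has explained 20% of the variance in future changes in the observed abundance‐and‐area weighted centroid.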
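The skill metric can be sketched in a few lines, assuming that relative error is the ratio of summed squared forecast errors to the summed squared errors of a persistence forecast (which always predicts zero change). The function names and the shifts below are hypothetical illustrations rather than results.

```python
def relative_error(forecast_shifts, observed_shifts):
    """Squared error of a forecast relative to a zero-change forecast."""
    model_sse = sum((f - o) ** 2 for f, o in zip(forecast_shifts, observed_shifts))
    persistence_sse = sum(o ** 2 for o in observed_shifts)
    return model_sse / persistence_sse


def variance_explained(forecast_shifts, observed_shifts):
    """Skill relative to persistence; zero means no better than persistence."""
    return 1.0 - relative_error(forecast_shifts, observed_shifts)


# Made-up shifts in centroid (km); forecasts capture half of each observed shift
observed = [10.0, -20.0, 30.0]
forecast = [5.0, -10.0, 15.0]
skill = variance_explained(forecast, observed)
persistence_skill = variance_explained([0.0, 0.0, 0.0], observed)
wrong_sign_skill = variance_explained([-10.0, 20.0, -30.0], observed)
```

Forecasts that capture half of each observed shift yield a skill of 0.75 in this example, while forecasts with the wrong sign yield negative skill, which is the pattern described for estimators that perform worse than persistence.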

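I also evaluate model performance by asking which model can accurately estimate the uncertainty of its forecasts. I therefore calculate the quantile of each observed shift within its forecast distribution, $\Phi(\Delta Y_{AAWA}; \Delta Z, SE)$, where Φ(*x*;* μ*,*σ*) is the cumulative distribution function for a normal distribution with mean *μ* and standard deviation *σ* evaluated at location *x*, and where the forecast standard error $SE$ is the predicted standard error recorded for each forecast.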
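Forecast‐interval performance can be checked by locating each observed shift within a normal forecast distribution and counting how often it falls inside a central interval. The sketch below assumes normally distributed forecast errors; the function names and numeric values are hypothetical.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def interval_coverage(observed, forecasts, standard_errors, level=0.9):
    """Proportion of observations inside the central forecast interval."""
    lower = (1.0 - level) / 2.0
    upper = 1.0 - lower
    quantiles = [
        normal_cdf(o, f, se)
        for o, f, se in zip(observed, forecasts, standard_errors)
    ]
    inside = sum(1 for q in quantiles if lower <= q <= upper)
    return inside / len(quantiles)


# Made-up shifts (km): the third observation lies six standard errors away
coverage = interval_coverage(
    observed=[12.0, -5.0, 40.0],
    forecasts=[10.0, 0.0, 10.0],
    standard_errors=[5.0, 5.0, 5.0],
)
```

Coverage well below the nominal level indicates forecast intervals that are too narrow, and coverage well above it indicates intervals that are too wide.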

Results show that the majority of species show an optimal temperature within the range observed in the Eastern Bering Sea, such that changes in temperature could drive changes in spatial preferences. Specifically, the habitat‐envelope model and VAST model with temperature both involve a quadratic effect of temperature on log‐biomass density. Comparing across species, both models for the majority of species show decreasing biomass density as temperature goes to either the highest or lowest values observed in the Eastern Bering Sea (Figure ). Specifically, there are 40 estimated “temperature responses” (affecting numbers density and average biomass for each of 20 species) for each model estimating a quadratic effect of temperature, and a negative curvature is observed in 28 temperature effects for the habitat‐envelope model and 26 effects for the VAST model with temperature (see Figure ). The VAST model often estimates a broader temperature tolerance than the habitat‐envelope model, for example, for arrowtooth flounder (*Atheresthes stomias*, Pleuronectidae) where the temperature response has nearly twice the width for the VAST model as for the habitat‐envelope model for both numbers density and average‐weight components of the delta model.

The VAST model with temperature also shows positive autocorrelation among years for spatio‐temporal residuals for all species (median across species: ρ_{n} = 0.86 and ρ_{w} = 0.65, Table ). This indicates that areas with higher/lower density than expected are likely to exhibit higher/lower density for several years sequentially. As a consequence, any year with a centre of gravity that is more skewed towards the pole/equator than its average value will often be followed by another year with a distribution similarly skewed towards the pole/equator.

Comparing forecasts of distribution for Alaska pollock (*Gadus chalcogrammus*, Gadidae) using all four estimators shows that the habitat‐envelope and annual‐regression estimators capture neither the northward shift in the abundance‐and‐area weighted average (AAWA) estimator nor the large interannual variation in this estimator (Figure ). For example, neither HE nor AR estimators capture the nearly 200 km poleward movement from 1995 to 1998. By contrast, the spatio‐temporal estimators with or without temperature show similar performance, and both capture the observed northward shift for this species when fitted to data for 1982–2015. Both also generate forecast intervals that are much wider than the habitat‐envelope and annual‐regression estimators, and these wide forecast intervals appear to be necessary to contain estimates of centre of gravity arising when fitting to subsequent data. For example, the spatio‐temporal estimator with temperature using data for 1982–1995 predicts a rapid increase in confidence interval width when predicting COG for 1996–1998. This rapid increase in confidence interval width is appropriate, given that 1998 has the most northward distribution during the entire period 1982–2015 and is still somewhat outside the estimated forecast interval for this year.

The spatio‐temporal model without temperature forecasts an exponential decay towards the average COG during the forecast period (i.e., after the last year of fitted data). The average COG for *G. chalcogrammus* has shifted northward over 100 km from 1982 to 2015, so the later forecasts decay towards an average level that is more poleward than fits to earlier years (e.g., the purple forecast lines in 2015 are more northward than the blue forecast lines; Figure ). By contrast, the spatio‐temporal model with temperature generates forecasts that combine elements of both the VAST model without temperature and the habitat‐envelope model. Specifically, the VAST model with temperature generates forecasts that follow an exponential decay towards the average COG of the historical period, but that also show evidence of interannual variability predicted by temperature patterns in each year (e.g., the purple forecast lines for the VAST model without temperature are less variable than the purple lines for the VAST model with temperature). For example, when fitting to data from 1982 to 1995, the COG forecast in 1997–1998 is higher than the forecast for 1999–2000, and this pattern is also seen in the habitat‐envelope model. However, differences in forecast between VAST with and without temperature are small relative to the difference between VAST and the habitat‐envelope model.

I next evaluate predictive skill when forecasting poleward movement 1, 2 or 3 years after the last year of fitted data for Alaska pollock. I first compare forecasted poleward movement using each estimator against the observed poleward movement using the abundance‐and‐area weighted estimator (Figure ). This comparison shows that all models predict a poleward movement 1, 2 and 3 years forward using data 1982–1995, and this prediction is consistent with subsequent observations. However, the forecast interval is much too narrow for both the habitat‐envelope and annual‐regression estimators, and is more appropriate for the VAST forecasts with or without temperature. The VAST forecast also appears to be more accurate than either habitat‐envelope or annual‐regression forecasts. For example, habitat envelope and annual regression predict a poleward shift when forecasting 1 year forward and fitting to data through 1999, whereas both VAST models forecast a southward shift and the southward shift is in fact observed (Figure ). The improved forecast accuracy is clearly shown when calculating the correlation between forecasted and observed shifts, which is >0.5 for 1‐, 2‐ or 3‐year forecasts using VAST with or without temperature, but is lower for habitat‐envelope (<0.28) or annual‐regression (<0.39) forecasts (Figure ).

The correlation between forecasted and observed poleward movement is much greater for VAST models (either with or without temperature) than either habitat‐envelope or annual‐regression estimates across all 20 species, whether using 1‐, 2‐ or 3‐year forecasts (Figure ). The variance explained by the habitat‐envelope estimator for the median species is actually negative (Table ), indicating that this forecast method has lower skill than a persistence forecast. Variance explained by the HE model is highly negative for species where it estimates a strong temperature response (e.g., see Figure : yellowfin sole (*Limanda aspera*, Pleuronectidae), shortfin eelpout (*Lycodes brevipes*, Zoarcidae) and wattled eelpout (*Lycodes palearis*, Zoarcidae)), but where the forecasted variance in distribution has little correlation with observed distribution shifts (Supporting Information Figures S1 and S2). The median variance explained is low but positive for the annual‐regression estimator (0.02–0.06) and higher for both VAST models (0.08–0.25), and for these models the variance explained is higher for 3‐year forecasts (0.20–0.25) relative to 2‐year (0.14–0.16) or 1‐year forecasts (0.08). Forecast models generally explain a larger portion of variance for 3‐year forecasts because the error of the persistence forecast increases faster than the error of VAST forecasts (Supporting Information Appendix S3: Figure S3). The median variance explained is generally <25%, although individual species have higher variance explained (e.g., 69% for 3‐year forecasts of Pacific cod (*Gadus macrocephalus,* Gadidae)), and VAST and annual‐regression forecasts have greater skill than the persistence forecast. 
VAST has very similar performance with or without temperature, likely because the inclusion of a quadratic temperature response reduces the standard deviation of residual spatio‐temporal variation by only 4%–6% on average (Supporting Information Appendix S3: Table S1), and predictions of change in COG are very similar between these two models (Supporting Information Figure S1). The habitat‐envelope and annual‐regression estimators perform poorly in terms of forecast interval coverage (Figure ), where the observed poleward shift falls outside the 90% forecast interval for the majority of 1‐, 2‐, or 3‐year forecasts. By contrast, the VAST models generate forecast intervals that contain the true value an appropriate proportion of the time, indicating that forecast intervals for this model are a useful measure of predictive uncertainty.

Retrospective skill testing involves fitting ecological models to historical data, forecasting future changes and comparing forecasts with subsequent observations. I have demonstrated the potential role of retrospective skill testing for models forecasting distribution shifts by comparing two conventional estimators (habitat envelope, HE, and annual regression, AR) with a recently developed vector‐autoregressive spatio‐temporal (VAST) estimator. Using real‐world data from 20 marine species in the Eastern Bering Sea and comparing performance with 400 retrospective analyses (20 forecast years for each species), I have shown that the conventional HE estimator has lower forecast skill than a simple “persistence” estimator. By contrast, the annual‐regression estimator generally explains a small (2%–6%) portion of variance in poleward movement, and the spatio‐temporal estimators explain several times more variance (8%–25%). The spatio‐temporal estimator is also the only method to estimate forecast intervals with appropriate width. I therefore conclude that the spatio‐temporal estimator has suitable forecast skill, both in terms of accuracy and estimated uncertainty.

The VAST model performs better than the HE model for two main reasons. First, it accounts for temporally autocorrelated residuals in species distribution via the estimated autocorrelation parameters (*ρ*_{n} and *ρ*_{w}). The importance of accounting for autocorrelated residuals when fitting species distribution models (Bahn & McGill, ; Dormann et al., ) or when forecasting population dynamics (Ives, Dennis, Cottingham, & Carpenter, ; Johnson et al., ) has been demonstrated previously, but this approach has not been widely used when forecasting with species distribution models. Secondly, the spatio‐temporal model estimates the temporal variance of local changes in species density during the modelled period and uses this estimated variance to generate reasonable forecast intervals for future changes. As the explanatory power of covariates increases, the variance of residual spatial and spatio‐temporal variation will decrease and the performance of the HE and VAST models will converge. However, the variance of residual spatial and spatio‐temporal variation remains large for all 20 species after accounting for a quadratic effect of bottom temperature (Supporting Information Appendix S3: Table S1). I therefore recommend using a spatio‐temporal model as a testing ground for determining the relative explanatory power of covariates and residual environmental drivers, and recommend future research exploring additional covariates that could better explain historical variation in density.

Regarding the Eastern Bering Sea case‐study presented here, results are consistent with prior analyses of this system in several important ways. First, results show substantial variability in northward trends in COG for demersal species in the Eastern Bering Sea (see Supporting Information Figure S1), including instances where species are moving southward, or where a northward shift in distribution either is or is not predicted by a positive response to increasing regional temperatures (Mueter & Litzow, ) or decreasing cold pool area (Kotwicki & Lauth, ). Unlike previous analyses, however, the VAST model explored here estimates autocorrelated variability in spatio‐temporal patterns in population density (*ρ*_{n} and *ρ*_{w} in Table ). This underlies autocorrelated variability in COG, and the latter can complicate interpretation of the correlation between COG and environmental covariates either when restricting analysis to a small subset of years (e.g., 2004–2009 in Hollowed et al. ) or when testing significance in a linear regression (e.g., Mueter & Litzow, ). Finally, this study did not explore several mechanisms that have been invoked to explain distribution shifts in this region, for example, density‐dependent habitat selection (Spencer, ; Thorson, Rindorf, Gao, Hanselman, & Winker, ) or age‐specific temperature responses (Barbeaux & Hollowed, ; Thorson, Ianelli, et al., ). I therefore recommend future work exploring whether density dependence or age‐structure can improve forecast skill in this or other regions.

Environmental management typically requires predictions over seasonal (<1 year), short‐term (1–5 year), medium‐term (5–50 year) and long‐term (>50 year) planning horizons, and models may be more or less skilful at forecasts for each time‐scale. Seasonal forecasts of spatial distribution for fished species can be useful for fishers to modify fleet behaviour, timing, and fishing techniques, and have been useful for prawn aquaculture in northeast Queensland, tuna fishing in the Great Australian Bight, and lobster fishing in the Gulf of Maine (see Tommasi et al. for a review). Predictive skill for seasonal forecasts has been extensively tested, for example, showing improved predictions of sardine distribution in the California Current relative to a persistence forecast (Kaplan et al., ). Although there are sufficient data to conduct retrospective skill testing over seasonal or short‐term forecasts, there is surprisingly little research using skill testing to compare performance among alternative models. I therefore recommend skill testing to either identify which models to use for different planning horizons, or how to weight predictions from multiple models when using an ensemble model for forecasting (e.g., Anderson et al., ).

Finally, this study highlights the need for increased skill testing for models used for short‐ and long‐term forecasting of distribution shifts. For example, the decreased performance of a habitat‐envelope model relative to a persistence forecast will likely surprise many readers. However, the poor performance of a habitat‐envelope model is in line with recent research showing that historical distribution shifts are poorly explained by local or regional temperature for Alaska pollock (Thorson, Ianelli, et al., ) and other groundfish in the Eastern Bering Sea (Kotwicki & Lauth, ). I recommend using retrospective analyses as a testing ground to improve probabilistic methods for forecasting short‐term distribution shifts, and in particular exploring multispecies models for spatio‐temporal dynamics (Latimer, Banerjee, Sang, Mosher, & Silander, ; Thorson & Barnett, ; Thorson, Rindorf, et al. ; Warton et al., ). In cases with strong species interactions or similar responses to shared environmental drivers, the information available in multispecies data sets may improve predictability for future dynamics (Ovaskainen et al., ; Schliep et al., ; Thorson, Munch, & Swain, ; Trainor et al., ). I therefore hypothesize that multispecies spatio‐temporal models will improve short‐term forecasts of the northward distribution shift for commercial fishes in this region relative to the single‐species forecast models explored here. Finally, I recommend further research using forecasting models to predict distribution based on future emission scenarios, for example, for use in future climate forecasts (Hollowed et al., ), but suggest that these methods should be skill‐tested prior to use for this task.

I thank K. Kristensen for developing the Template Model Builder software, without which the vector‐autoregressive spatio‐temporal model would not be feasible to fit. I also thank the many scientists who collected data in the Eastern Bering Sea shelf bottom trawl survey used here, which is an invaluable resource for exploring distribution shifts in marine species. Finally, I thank S. Kotwicki, J. Hastie, M. McClure and two anonymous reviewers for helpful comments on an earlier draft and grant #15‐027 from the NOAA Habitat Assessment Improvement Plan that supported documentation for a precursor to package VAST.

All data used are publicly available from the Alaska Fisheries Science Center at