PLOS ONE | Research Article
doi: 10.1371/journal.pone.0291906

Indexing and partitioning the spatial linear model for large data sets

Jay M. Ver Hoef^{1}* (ORCID 0000-0003-4302-6895; Conceptualization, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing – original draft, Writing – review & editing), Michael Dumelle^{2} (ORCID 0000-0002-3393-5529; Formal analysis, Methodology, Software, Writing – review & editing), Matt Higham^{3} (Formal analysis, Methodology, Software, Writing – review & editing), Erin E. Peterson^{4} (Data curation, Methodology, Software, Writing – review & editing), Daniel J. Isaak^{5} (Data curation, Methodology, Writing – review & editing)

1 Marine Mammal Laboratory, NOAA-NMFS Alaska Fisheries Science Center, Seattle, WA, United States of America
2 United States Environmental Protection Agency, Corvallis, Oregon, United States of America
3 Department of Mathematics, Computer Science, and Statistics, St. Lawrence University, Canton, New York, United States of America
4 Australian Research Council Centre of Excellence in Mathematical and Statistical Frontiers (ACEMS), Queensland University of Technology, Brisbane, Queensland, Australia
5 Rocky Mountain Research Station, U.S. Forest Service, Boise, ID, United States of America

Editor: Mohamed R. Abonazel, Cairo University, Egypt
The authors have declared that no competing interests exist.
* E-mail: jay.verhoef@noaa.gov

This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.
Abstract

We consider four main goals when fitting spatial linear models: 1) estimating covariance parameters, 2) estimating fixed effects, 3) kriging (making point predictions), and 4) block-kriging (predicting the average value over a region). Each of these goals can present different challenges when analyzing large spatial data sets. Current research uses a variety of methods, including spatial basis functions (reduced rank) and covariance tapering, to achieve these goals. However, spatial indexing, which is closely related to composite likelihood, offers some advantages. We develop a simple framework for all four goals listed above by using indexing to create a block covariance structure and nearest-neighbor predictions while maintaining a coherent linear model. We show exact inference for fixed effects under this block covariance construction. Spatial indexing is very fast, and simulations are used to validate methods and compare to another popular method. We study various sample designs for indexing, and our simulations showed that indexing that leads to spatially compact partitions is best over a range of sample sizes, autocorrelation values, and generating processes. Partitions can be kept small, on the order of 50 samples per partition. We use nearest-neighbors for kriging and block kriging, finding that 50 nearest-neighbors is sufficient. In all cases, confidence intervals for fixed effects, and prediction intervals for (block) kriging, have appropriate coverage. Some advantages of spatial indexing are that it is available for any valid covariance matrix, can take advantage of parallel computing, and easily extends to non-Euclidean topologies, such as stream networks. We use stream networks to show how spatial indexing can achieve all four goals, listed above, for very large data sets, in a matter of minutes, rather than days, for an example data set.
Funding: JVH: The project received financial support through Interagency Agreement DW-13-92434601-0 from the U.S. Environmental Protection Agency (EPA), and through Interagency Agreement 81603 from the Bonneville Power Administration (BPA), with the National Marine Fisheries Service, NOAA. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability: The SPIN method has been implemented in the spmodel R package, https://cran.r-project.org/web/packages/spmodel/index.html. The example data can be downloaded from the GitHub repository, https://github.com/jayverhoef/midColumbiaLSN.git.

Introduction
The general linear model, including regression and analysis of variance (ANOVA), is still a mainstay in statistics,
$$Y = X\beta + \varepsilon \qquad (1)$$
where Y is an n × 1 vector of response random variables, X is the design matrix with covariates (fixed explanatory variables, containing any combination of continuous, binary, or categorical variables), β is a vector of parameters, and ε is a vector of zero-mean random variables, which are classically assumed to be uncorrelated, var(ε) = σ^{2}I. The spatial linear model is a version of Eq (1) where var(ε) = V, and V is a patterned covariance matrix that is modeled using spatial relationships. Generally, spatial relationships are of two types: spatially-continuous point-referenced data, often called geostatistics, and finite sets of neighbor-based data, often called lattice or areal data [1]. For geostatistical data, we associate random variables in Eq (1) with their spatial locations by denoting the random variable as Y(s_{i}); i = 1, …, n, and ε(s_{i}); i = 1, …, n, where s_{i} is a vector of spatial coordinates for the ith point, and the i, jth element of V is cov(ε(s_{i}), ε(s_{j})). Table 1 provides a list of all of the main notation used in this article.
Table 1. Commonly-used symbols and their meanings in this paper.

N_j: 0-1 matrix that subsets y to the neighborhood of the jth prediction location
Ĉ: an estimator of var(β̂_bd)
The main goals from a geostatistical linear model are to 1) estimate V, 2) estimate β, 3) make predictions at unsampled Y(s_j), where j = n + 1, …, N, from a set of spatial locations without observations, and 4) for some region B, make a prediction of the average value Y(B) = ∫_B Y(s)ds/|B|, where |B| is the area of B. Estimation and prediction both require O(n²) storage for V and O(n³) operations for V⁻¹ [2], which, for massive data sets, is computationally expensive and may be prohibitive. Our overall objective is to use spatial indexing ideas to make all four goals attainable for very large spatial data sets. We maintain the moment-based approach of classical geostatistics, which is distribution free, and we work to maintain a coherent model of stationarity and a single set of parameter estimates.
Quick review of the spatial linear model
When the outcome of the random variable Y(s_{i}) is observed, we denote it y(s_{i}), which are contained in the vector y. These observed data are used first to estimate the autocorrelation parameters in V, which we will denote as θ. In general, V can have n(n + 1)/2 parameters, but use of distance to describe spatial relationships typically reduces this to just 3 or 4 parameters. An example of how V depends on θ is given by the exponential autocorrelation model, where the i, jth element of V is
$$\mathrm{cov}[\varepsilon(s_i), \varepsilon(s_j)] = \tau^2 \exp(-d_{i,j}/\rho) + \eta^2 I(d_{i,j} = 0) \qquad (2)$$
where θ = (τ², η², ρ)′, d_{i,j} is the Euclidean distance between s_i and s_j, and I(·) is an indicator function, equal to 1 if its argument is true and 0 otherwise. The parameter η² is often called the “nugget effect,” τ² is called the “partial sill,” and ρ is called the “range” parameter. In Eq (2), the variances are constant (stationary): when d_{i,j} = 0, the variance is τ² + η², which we denote σ². Many other examples of autocorrelation models are given in [1, 3].
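As a concrete illustration of Eq (2), the following Python/NumPy sketch (our own naming; the paper's implementation is the spmodel R package) builds V for a set of random locations and checks that it is a valid covariance matrix:

```python
import numpy as np

def exp_cov(coords, tau2, eta2, rho):
    """V from Eq (2): tau2 * exp(-d_ij / rho) + eta2 * I(d_ij = 0)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    return tau2 * np.exp(-d / rho) + eta2 * (d == 0.0)

rng = np.random.default_rng(0)
coords = rng.uniform(size=(100, 2))   # 100 random locations in the unit square
V = exp_cov(coords, tau2=10.0, eta2=0.1, rho=0.5)
# the diagonal is the stationary variance sigma^2 = tau2 + eta2 = 10.1
```

The positive-definiteness of V (all eigenvalues positive) is what makes Eq (2) a valid autocovariance model for any configuration of locations.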
We will use restricted maximum likelihood (REML) [4, 5] to estimate parameters of V. REML is less biased than full maximum likelihood [6]. REML estimates of covariance parameters are obtained by minimizing
$$L(\theta \mid y) = \log|V_\theta| + r_\theta' V_\theta^{-1} r_\theta + \log|X' V_\theta^{-1} X| + c \qquad (3)$$
for θ, where V_θ depends on the spatial autocorrelation parameters θ, r_θ = y − Xβ̂_θ, β̂_θ = (X′V_θ⁻¹X)⁻¹X′V_θ⁻¹y, and c is a constant that does not depend on θ. It has been shown [7, 8] that Eq (3) forms unbiased estimating equations for covariance parameters, so Gaussian data are not strictly necessary. After Eq (3) has been minimized for θ, these estimates, call them θ̂, are used in the autocorrelation model, e.g. Eq (2), for all of the covariance values to create V̂. This is the first use of the data y. The usual frequentist method for geostatistics, with a long tradition [9], “uses the data twice” [10]. Now V̂, along with a second use of the data, is used to estimate regression coefficients or make predictions at unsampled locations. By plugging V̂ into the well-known best-linear-unbiased estimate (BLUE) of β for Eq (1), we obtain the empirical best-linear-unbiased estimate (EBLUE), e.g. [11],
$$\hat{\beta} = (X' \hat{V}^{-1} X)^{-1} X' \hat{V}^{-1} y \qquad (4)$$
The estimated variance of Eq (4) is
$$\widehat{\mathrm{var}}(\hat{\beta}) = (X' \hat{V}^{-1} X)^{-1} \qquad (5)$$
Let a single unobserved location be denoted s_{0}, with a covariate vector of x_{0} (containing the same covariates and length as a row of X). Then empirical best-linear-unbiased prediction (EBLUP) [12] at an unobserved location is
$$\hat{Y}(s_0) = x_0' \hat{\beta} + \hat{c}_0' \hat{V}^{-1} (y - X\hat{\beta}) \qquad (6)$$
where ĉ_0 ≡ ĉov(ε, ε(s_0)), using the same autocorrelation model, e.g. Eq (2), and estimated parameters, θ̂, that were used to develop V̂. Note that if we condition on V̂ as fixed, then Eq (6) is a linear combination of y, and can also be written as η_0′y when Eq (4) is substituted for β̂. The prediction Eq (6) can be seen as the conditional expectation of Y(s_0)|y with plug-in values for β, V, and c. The estimated variance of EBLUP is
$$\widehat{\mathrm{var}}(\hat{Y}(s_0)) = \hat{\sigma}_0^2 - \hat{c}_0' \hat{V}^{-1} \hat{c}_0 + (x_0 - X' \hat{V}^{-1} \hat{c}_0)' (X' \hat{V}^{-1} X)^{-1} (x_0 - X' \hat{V}^{-1} \hat{c}_0) \qquad (7)$$
where σ̂_0² is the estimated variance of Y(s_0) using the same covariance model as V̂ [12].
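The estimation and prediction steps of Eqs (3)–(7) can be sketched in a few lines of Python/NumPy. This is an illustrative dense-matrix implementation under the exponential model of Eq (2), with our own function names; it is not the paper's software, and in practice `reml_objective` would be handed to a numerical optimizer with positivity constraints on θ:

```python
import numpy as np

def exp_cov_mat(coords, tau2, eta2, rho):
    """Eq (2) evaluated for all pairs of locations."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    return tau2 * np.exp(-d / rho) + eta2 * (d == 0.0)

def reml_objective(theta, y, X, coords):
    """Eq (3), twice the negative restricted log-likelihood up to the
    constant c, for theta = (tau2, eta2, rho)."""
    V = exp_cov_mat(coords, *theta)
    Vinv = np.linalg.inv(V)
    XtVX = X.T @ Vinv @ X
    r = y - X @ np.linalg.solve(XtVX, X.T @ Vinv @ y)   # r_theta
    return (np.linalg.slogdet(V)[1] + r @ Vinv @ r
            + np.linalg.slogdet(XtVX)[1])

def eblue(X, V, y):
    """Eqs (4)-(5): GLS estimate of beta and its estimated covariance."""
    Vinv = np.linalg.inv(V)
    cov_beta = np.linalg.inv(X.T @ Vinv @ X)
    return cov_beta @ X.T @ Vinv @ y, cov_beta

def eblup(x0, c0, sigma2_0, X, V, y):
    """Eqs (6)-(7): universal kriging prediction and variance at s0."""
    Vinv = np.linalg.inv(V)
    beta, cov_beta = eblue(X, V, y)
    pred = x0 @ beta + c0 @ Vinv @ (y - X @ beta)
    w = x0 - X.T @ Vinv @ c0
    return pred, sigma2_0 - c0 @ Vinv @ c0 + w @ cov_beta @ w
```

Two standard properties make useful sanity checks: Eq (3) is invariant to adding a mean shift Xb to the data, and with a zero nugget the predictor of Eq (6) interpolates exactly at observed locations.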
Spatial methods for big data
Here, we give a brief overview of the most popular methods currently used for large spatial data sets. There are various ways to classify such methods. For our purposes, there are two broad approaches. One is to adopt a Gaussian Process (GP) model for the data and then approximate the GP. The other is to model locally, essentially creating smaller data sets and using existing models.
There are several good reviews on methods for approximating the GP [13–16]. These methods include low rank ideas such as radial smoothing [17–19], fixed rank kriging [20–23], predictive processes [24, 25], and multiresolution Gaussian processes [26, 27]. Other approaches include covariance tapering [28–30], stochastic partial differential equations [31, 32], and factoring the GP into a series of conditional distributions [33, 34], which was extended to nearest neighbor Gaussian processes [35–38] and other sparse matrix improvements [39–41]. The reduced rank methods are very attractive, and allow models for situations where distances are non-Euclidean (for a review and example, see [42]), as well as fast computation.
Modeling locally involves an attempt to maintain classical geostatistical models by creating subsets of the data, using existing methods on subsets, and then making inference from subsets. For example, [43, 44] created local data sets in a spatial moving window, and then estimated variograms and used ordinary kriging within those windows. This idea allows for nonstationary variances but forces an unnatural asymmetric autocorrelation because the range parameter changes as the window moves. Nor does it estimate a single β; rather, there is a different β for every point in space. Another early idea was to create a composite likelihood by taking products of subset-likelihoods and optimizing for autocorrelation parameters θ [45], and then θ̂ can be held fixed when predicting in local windows. However, this does not solve the problem of estimating a single β.
More recently, two broad approaches have been developed for modeling locally. One is a ‘divide and conquer’ approach, which is similar to [45]. Here, it is permissible to re-use data in subsets, or not use some data at all [46–48], with an overview provided by [49]. Another approach is a simple partition of the data into groups, where partitions are generally spatially compact [50–53]. This is sensible for estimating covariance parameters and will provide an unbiased estimate of β; however, the estimated variance of β̂ will not be correct. Continuity corrections for predictions are provided, but predictions may not be efficient near partition boundaries.
A blocked structure for the covariance matrix based on spatially-compact groupings was proposed by [54], who then formulated a hybrid likelihood based on blocks of different sizes. The method that we feature is most similar to [54], but we show that there is no need for a hybrid likelihood, and that our approach is different from composite likelihood. Our spatial indexing approach is very simple, extends easily to random effects, and accommodates virtually any covariance matrix that can be constructed. We also show how to obtain the exact covariance matrix of the estimated fixed effects without any need for computational derivatives or numerical approximations.
Motivating example
One of the attractive features of the method that we propose is that it will work with any valid covariance matrix. To motivate our methods, consider a stream network (Fig 1a). This is the Mid-Columbia River basin, located along part of the border between the states of Washington and Oregon, USA, with a small part of the network in Idaho as well (Fig 1b). The stream network consists of 28,613 stream segments. Temperature loggers were placed at 9,521 locations on the stream network, indicated by purple dots in Fig 1a. A close-up of the stream network, indicated by the dark rectangle in Fig 1a, is given as Fig 1c, where we also show a systematic placement of prediction locations with orange dots. There are 60,099 prediction locations that will serve as the basis for point predictions. The response variable is an average of daily maximum temperatures in August from 1993 to 2011. Explanatory variables obtained for both observations and prediction sites included elevation at temperature logger site, slope of stream segment at site, percentage of upstream watershed composed of lakes or reservoirs, proportion of upstream watershed composed of glacial ice surfaces, mean annual precipitation in watershed upstream of sensor, the northing coordinate, base-flow index values, upstream drainage area, a canopy value encompassing the sensor, mean August air temperature from a gridded climate model, mean August stream discharge, and occurrence of sensor in tailwater downstream from a large dam (see [55] for more details).
Fig 1. Study area for the motivating example.
(a) A stream network from the mid-Columbia River basin, where purple points show 9521 sample locations that measured mean water temperature during August. (b) Most of the stream network is located in Washington and Oregon in the United States. (c) A close-up of the black rectangle in (a). The orange points are prediction locations.
These data were previously analyzed in [55] with geostatistical models specific to stream networks [11, 56]. The models were constructed as spatial moving averages, e.g., [57, 58], also called process convolutions, e.g., [59, 60]. Two basic covariance matrices are constructed, and then summed. In one, random variables were constructed by integrating a kernel over a white noise process strictly upstream of a site, which are termed “tail-up” models. In the other construction, random variables were created by integrating a kernel over a white noise process strictly downstream of a site, which are termed “tail-down” models. Both types of models allow analytical derivation of autocovariance functions, with different properties. For tail-up models, sites remain independent so long as they are not connected by water flow from an upstream site to a downstream site. This is true even if two sites are very close spatially, but each on a different branch just upstream of a junction. Tail-down models are more typical as they allow spatial dependence that is generally a function of distance along the stream, but autocorrelation will still be different for two pairs of sites that are an equal distance apart, when one pair is connected by flow, and the other is not.
When considering big data, such as those in Fig 1, we considered the methods described in the previous section. The basis-function/reduced rank approaches would be difficult for stream networks because an inspection of Fig 1 reveals that we would need thousands of basis functions in order to cover all headwater stream segments and run the basis functions downstream only. A separate set of basis functions would be needed that ran upstream, and then weighting would be required to split the basis functions at all stream junctions. In fact, all of the GP model approximation methods would require modifying a covariance structure that has already been developed specifically for stream networks. The spatial indexing method that we propose below is much simpler, requiring no modification to the covariance structure, and, as we will demonstrate, proved to be adequate, not only for stream networks, but more generally.
Objectives
In what is to follow, we will use spatial indexing, leading to covariance matrix partitioning and local predictions. We will use the acronym SPIN, for SPatial INdexing, as the collection of methods for covariance parameter estimation, fixed effects estimation, and point and block prediction. Our objective is to show how each of these inferences can be made computationally faster with SPIN, and still provide unbiased results with valid confidence/prediction intervals.
This article uses several acronyms. Table 2 provides a handy reference to the meaning of all acronyms used here.
Table 2. Acronyms used in this paper.

ANOVA: analysis of variance
BLUE: best linear unbiased estimation
CI90: coverage rates for 90% confidence intervals
COMP: spatially compact partitioning
COPE: covariance parameter estimation
EBLUE: empirical best-linear-unbiased estimation
EBLUP: empirical best-linear-unbiased prediction
FEFE: fixed-effects parameter estimation
GEOSTAT: geostatistical simulation method
GP: Gaussian process
MIXD: mix of random and spatially compact partitioning
NNGP: nearest-neighbor Gaussian processes method
PI90: coverage rates for 90% prediction intervals
RAND: random partitioning
REML: restricted maximum likelihood
RMSE: root mean-squared error
RMSPE: root mean-squared-prediction error
SPIN: computationally-fast inference methods using spatial indexing
SUMSINE: simulation method based on random sine waves
Methods
The main advantage of the SPIN method is due to the way the covariance matrix is indexed and partitioned to allow for faster evaluation of the REML equations, Eq (3), whose optimization is iterative, requiring many evaluations involving the inverse of the covariance matrix. This optimization provides estimation of the covariance parameters, which we describe next.
Estimation of covariance parameters
Consider the covariance matrix to be used in Eqs (4) and (6). First, we index the data to create a covariance matrix with P partitions based on the indexes {i; i = 1, …, P},
$$V = \begin{pmatrix} V_{1,1} & V_{1,2} & \cdots & V_{1,P} \\ V_{2,1} & V_{2,2} & \cdots & V_{2,P} \\ \vdots & \vdots & \ddots & \vdots \\ V_{P,1} & V_{P,2} & \cdots & V_{P,P} \end{pmatrix} \qquad (8)$$
In a similar way, imagine a corresponding indexing and partition of the spatial linear model as,
$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_P \end{pmatrix} = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_P \end{pmatrix} \beta + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_P \end{pmatrix} \qquad (9)$$
Now, for the purposes of estimating covariance parameters, we maximize the REML equations based on a covariance matrix,
$$V_{part} = \begin{pmatrix} V_{1,1} & 0 & \cdots & 0 \\ 0 & V_{2,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & V_{P,P} \end{pmatrix} \qquad (10)$$
rather than Eq (8). The computational advantage of using Eq (10) in Eq (3) is that we only need to invert matrices of size V_{i,i} for all i, and, because we have large amounts of data, we assume that {V_{i,i}} are sufficient for estimating covariance parameters. If the size of V_{i,i} is fixed, then the computational burden grows linearly with n. Also, Eq (10) in Eq (3) allows for use of parallel computing because each V_{i,i} can be inverted independently.
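A sketch of evaluating Eq (3) with V_part, inverting only the P small diagonal blocks (our own naming and a simplified dense implementation, not the paper's software). When V is truly block diagonal, this block-by-block evaluation reproduces the full-matrix computation exactly, which makes a convenient numerical check:

```python
import numpy as np

def reml_objective_partitioned(blocks_V, blocks_X, blocks_y):
    """Eq (3) evaluated with the block-diagonal V_part of Eq (10): only the
    P small blocks V_ii are inverted, and each term could run in parallel."""
    inv_blocks = [np.linalg.inv(Vii) for Vii in blocks_V]
    logdetV = sum(np.linalg.slogdet(Vii)[1] for Vii in blocks_V)
    Txx = sum(Xi.T @ Vi @ Xi for Vi, Xi in zip(inv_blocks, blocks_X))
    txy = sum(Xi.T @ Vi @ yi
              for Vi, Xi, yi in zip(inv_blocks, blocks_X, blocks_y))
    beta = np.linalg.solve(Txx, txy)             # pooled beta_hat_theta
    quad = sum((yi - Xi @ beta) @ Vi @ (yi - Xi @ beta)
               for Vi, Xi, yi in zip(inv_blocks, blocks_X, blocks_y))
    return logdetV + quad + np.linalg.slogdet(Txx)[1]
```

With partition size held fixed, the loop over blocks grows linearly in n, and the per-block inverses are independent of one another.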
Note that we are not concerned with the variance of θ^, which is generally true in classical geostatistics. Rather, θ contains nuisance parameters that require estimation in order to estimate fixed effects and make predictions. Because data are massive, we can afford to lose some efficiency in estimating the covariance parameters. For example, sample sizes ≥ 125 are generally recommended for estimating the covariance matrix for geostatistical data [61]. REML is for the most part unbiased. If we have thousands of samples, and if we imagine partitioning the spatial locations into data sets (in ways that we describe later), then using Eq (10) in Eq (3) is, essentially, using REML many times to obtain a pooled estimate of θ^.
Partitioning the covariance matrix is most closely related to the ideas of quasi-likelihood [62], composite likelihood [45], and divide and conquer [63]. However, for REML, they are not exactly equivalent. From Eq (3), the term log|X′V_θ⁻¹X| under composite likelihood, ∏_{i=1}^P L(θ|y_i), results in
$$\sum_{i=1}^{P} \log|X_i' V_{i,i}^{-1} X_i|$$
while using V_{part} results in
$$\log\left|\sum_{i=1}^{P} X_i' V_{i,i}^{-1} X_i\right|$$
An advantage of spatial indexing, when compared to composite likelihood, can be seen when X contains columns with many zeros, such as may occur for categorical explanatory variables. Then, partitioning X may result in an X_i that has columns of all zeros, which presents a problem when computing log|X_i′V_{i,i}⁻¹X_i| for composite likelihood, but not when using V_part.
The SPIN indexing can also allow for faster inversion of the covariance matrix when estimating fixed effects, but that requires some new results to obtain the proper standard errors of the estimated fixed effects, which we describe next.
Estimation of β
The generalized least squares estimate of β was given in Eq (4). Although the inverse V⁻¹ occurs only once (as compared to repeatedly when optimizing the REML equations), it will still be computationally prohibitive if a data set has thousands of samples. Under the partitioned model Eq (9), with the covariance matrix of Eq (10), the estimate Eq (4) becomes
$$\hat{\beta}_{bd} = T_{xx}^{-1} t_{xy} \qquad (11)$$
where T_{xx} = ∑_{i=1}^P X_i′V̂_{i,i}⁻¹X_i and t_{xy} = ∑_{i=1}^P X_i′V̂_{i,i}⁻¹y_i. This is a “pooled estimator” of β across the partitions, and it should be a good estimator of β at a much reduced computational cost. It will also be convenient to show that Eq (11) is linear in y, by noting that
$$\hat{\beta}_{bd} = \left[ T_{xx}^{-1} X_1' \hat{V}_{1,1}^{-1} \;\middle|\; T_{xx}^{-1} X_2' \hat{V}_{2,2}^{-1} \;\middle|\; \cdots \;\middle|\; T_{xx}^{-1} X_P' \hat{V}_{P,P}^{-1} \right] \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_P \end{pmatrix} = Qy \qquad (12)$$
To estimate the variance of β^bd we cannot ignore the correlation between the partitions, so we consider the full covariance matrix Eq (8). If we compute the covariance matrix for Eq (11) under the full covariance matrix Eq (8), we obtain
$$\widehat{\mathrm{var}}(\hat{\beta}_{bd}) = T_{xx}^{-1} + T_{xx}^{-1} W_{xx} T_{xx}^{-1} \qquad (13)$$
where W_{xx} = ∑_{i=1}^{P−1} ∑_{j=i+1}^{P} [X_i′V_{i,i}⁻¹V_{i,j}V_{j,j}⁻¹X_j + (X_i′V_{i,i}⁻¹V_{i,j}V_{j,j}⁻¹X_j)′]. Note that while we set parts of V to zero, as in Eq (10), in order to estimate θ and β, we computed the variance of β̂_bd using the full V in Eq (8). Using a plug-in estimator, whereby θ is replaced by θ̂, no further inverses of any V_{i,j} are required if all V_{i,i}⁻¹ are stored as part of the REML optimization. There is only a single additional inverse required, which is R × R, where R is the rank of the design matrix X, and it is already computed for T_{xx}⁻¹ in Eq (11). Also note that if we simply substituted Eq (10) into Eq (5), we would obtain only T_{xx}⁻¹ as the variance of β̂_bd. In Eq (13), T_{xx}⁻¹W_{xx}T_{xx}⁻¹ is the adjustment required for correlation among the partitions in a pooled estimate β̂_bd. Partitioning the spatial linear model allows computation via Eq (11), while going back to the full model to develop Eq (13), which is a new result. This can be contrasted to the approaches for variance estimation of fixed effects using pseudo-likelihood, composite likelihood, and divide and conquer found in the earlier literature review.
Eq (13) is quite fast, and the number of inverses V_{i,i}⁻¹ that must be computed grows linearly (that is, if the observed sample size is 2n, then there are twice as many inverses as for a sample of size n, if we hold partition size fixed). Also note that all of these inverses may already be computed as part of the REML estimation of θ. However, Eq (13) is quadratic in pure matrix computations due to the double sum in W_{xx}. These can be made parallel, but may take too long for more than about 100,000 samples. One alternative is to use the empirical variation in β̂_i = (X_i′V̂_{i,i}⁻¹X_i)⁻¹X_i′V̂_{i,i}⁻¹y_i, where the ith matrix calculations are already needed for Eq (11) and β̂_i can be simply computed and stored. Then, let
$$\widehat{\mathrm{var}}_{alt1}(\hat{\beta}_{bd}) = \frac{1}{P(P-1)} \sum_{i=1}^{P} (\hat{\beta}_i - \hat{\beta}_{bd})(\hat{\beta}_i - \hat{\beta}_{bd})' \qquad (14)$$
which has been used before for partitioned data, e.g. [64]. A second alternative is to pool the estimated variances of each β̂_i, which are var̂(β̂_i) = (X_i′V̂_{i,i}⁻¹X_i)⁻¹, to obtain
$$\widehat{\mathrm{var}}_{alt2}(\hat{\beta}_{bd}) = \frac{1}{P^2} \sum_{i=1}^{P} \widehat{\mathrm{var}}(\hat{\beta}_i) \qquad (15)$$
where the first P in the denominator is for averaging the individual estimates var̂(β̂_i), and the second P reflects the reduction in variance due to averaging the β̂_i. Eqs (13)–(15) are tested and compared below using simulations.
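The identity behind Eq (13) can be checked numerically: since β̂_bd = Qy, its variance under the full V of Eq (8) is QVQ′, which equals T_xx⁻¹ + T_xx⁻¹W_xxT_xx⁻¹. A small sketch (a generic positive-definite V is assumed, and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
sizes = [6, 8, 5]          # three partitions
P, p = len(sizes), 2
n = sum(sizes)
edges = np.cumsum([0] + sizes)

# a generic full (cross-correlated) covariance matrix V, as in Eq (8)
A = rng.normal(size=(n, n))
V = A @ A.T + n * np.eye(n)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Xi = [X[edges[i]:edges[i + 1]] for i in range(P)]
Vinv_ii = [np.linalg.inv(V[edges[i]:edges[i + 1], edges[i]:edges[i + 1]])
           for i in range(P)]

# Eq (11): pooled pieces; Eq (12): the matrix Q with beta_bd = Q y
Txx = sum(Xi[i].T @ Vinv_ii[i] @ Xi[i] for i in range(P))
Txx_inv = np.linalg.inv(Txx)
Q = np.hstack([Txx_inv @ Xi[i].T @ Vinv_ii[i] for i in range(P)])

# Eq (13): cross-partition correction W_xx
Wxx = np.zeros((p, p))
for i in range(P):
    for j in range(i + 1, P):
        Vij = V[edges[i]:edges[i + 1], edges[j]:edges[j + 1]]
        M = Xi[i].T @ Vinv_ii[i] @ Vij @ Vinv_ii[j] @ Xi[j]
        Wxx += M + M.T
var_bd = Txx_inv + Txx_inv @ Wxx @ Txx_inv   # equals Q V Q'
```

The diagonal blocks of QVQ′ contribute T_xx⁻¹T_xxT_xx⁻¹ = T_xx⁻¹, and the off-diagonal blocks contribute T_xx⁻¹W_xxT_xx⁻¹, which is the decomposition in Eq (13).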
Point prediction
The predictor of Y(s_0) was given in Eq (6). As with estimation, the inverse V⁻¹ occurs only once (as compared to repeatedly when optimizing to obtain the REML estimates), but it will still be computationally prohibitive if the data set has tens of thousands of samples. Under the partitioned model Eq (9), which assumes zero correlation among partitions as in Eq (10), the predictor from Eq (6) becomes
$$\hat{Y}(s_0) = x_0' \hat{\beta}_{bd} + t_{cy} - t_{xc}' \hat{\beta}_{bd} \qquad (16)$$
where β̂_bd is obtained from Eq (11), t_{cy} = ∑_{i=1}^P ĉ_i′V_{i,i}⁻¹y_i, t_{xc} = ∑_{i=1}^P X_i′V_{i,i}⁻¹ĉ_i, and ĉ_i = ĉov(Y(s_0), y_i), using the same autocorrelation model and parameters as for V̂. Even though the predictor is developed under the block diagonal matrix Eq (10), the true prediction variance can be computed under Eq (8), as we did for estimation. However, the performance of these predictors turned out to be quite poor.
We recommend point predictions based on local data instead, which is an old idea, e.g. [43], and has already been implemented in software for some time, e.g. [10]. The local data may be in the form of a spatial limitation, such as a radius around the prediction point, or by using a fixed number of nearest neighbors. For example, the R [65] package nabor [66] finds nearest neighbors among hundreds of thousands of samples very quickly. Our method will be to use a single set of global covariance parameters as estimated under the covariance matrix partition Eq (10), and then predict with a fixed number of nearest neighbors. We will investigate the effect due to the number of nearest neighbors through simulation.
A purely local predictor lacks model coherency, as discussed in the literature review section. We use a single θ̂ for covariance, but there is still the issue of β̂. As seen in Eq (6), estimation of β is implicit in the prediction equations. If y_j ⊂ y are the data in the neighborhood of prediction location s_j, then using Eq (6) with the local y_j implicitly adopts a varying coefficient model for β̂, making it also local, so call it β̂_j; it will vary for each prediction location s_j. A further issue arises if there are categorical covariates. It is possible that a level of the covariate is not present in the local neighborhood, so some care is needed to collapse any columns in the design matrix that are all zeros. These are some of the issues that call into question the “coherency” of a model when predicting locally.
Instead, as for estimating the covariance parameters, we will assume that the goal is to have a single global estimate of β. Then we take as our predictor for the jth prediction location,
$$\hat{Y}_\ell(s_j) = x_j' \hat{\beta}_{bd} + \hat{c}_j' \hat{V}_j^{-1} (y_j - X_j \hat{\beta}_{bd}) \qquad (17)$$
where X_j and V̂_j are the design and covariance matrices, respectively, for the same neighborhood as y_j, x_j is the vector of covariates at prediction location j, ĉ_j = ĉov(Y(s_j), y_j) (using the same autocorrelation model and parameters as for V̂_j), and β̂_bd was given in Eq (11). It will be convenient for block kriging to note that, conditioning on V̂_j as fixed, Eq (17) can be written as a linear combination of y, call it λ_j′y, similar to η_0′y as mentioned after Eq (6). Suppose there are m neighbors around s_j, so y_j is m × 1. Let y_j = N_jy, where N_j is an m × n matrix of zeros and ones that subsets the n × 1 vector of all data to only those in the neighborhood. Then
$$\hat{Y}_\ell(s_j) = x_j' Q y + \hat{c}_j' \hat{V}_j^{-1} N_j y - \hat{c}_j' \hat{V}_j^{-1} X_j Q y = \lambda_j' y \qquad (18)$$
where Q was defined in Eq (12).
Let Ĉ be an estimator of var(β̂_bd) from Eqs (13), (14), or (15). Then the prediction variance of Eq (17) is var(Y(s_j) − Ŷ_ℓ(s_j)) when using the local neighborhood set of data, which is
$$\begin{aligned} \widehat{\mathrm{var}}(\hat{Y}_\ell(s_j)) &= \mathrm{E}_{\hat{\beta}_{bd}}\!\left[ \mathrm{var}_{\{y_j, Y(s_j)\}}\!\left( Y(s_j) - x_j'\hat{\beta}_{bd} - \hat{c}_j'\hat{V}_j^{-1}(y_j - X_j\hat{\beta}_{bd}) \mid \hat{\beta}_{bd} \right) \right] \\ &\quad + \mathrm{var}_{\hat{\beta}_{bd}}\!\left[ \mathrm{E}_{\{y_j, Y(s_j)\}}\!\left( Y(s_j) - x_j'\hat{\beta}_{bd} - \hat{c}_j'\hat{V}_j^{-1}(y_j - X_j\hat{\beta}_{bd}) \mid \hat{\beta}_{bd} \right) \right] \\ &= \hat{\sigma}^2 - \hat{c}_j'\hat{V}_j^{-1}\hat{c}_j + (x_j - X_j'\hat{V}_j^{-1}\hat{c}_j)' \hat{C} (x_j - X_j'\hat{V}_j^{-1}\hat{c}_j) \end{aligned} \qquad (19)$$
where σ̂² is the estimated value of var(Y(s_j)) using θ̂ and the same autocorrelation model that was used for V̂. Eq (19) can be compared to Eq (7).
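A sketch of the local predictor Eq (17) and its variance Eq (19) using the m nearest neighbors, assuming the exponential model of Eq (2) for illustration. The global β̂_bd and Ĉ are assumed to come from Eqs (11) and (13) and are simply passed in; the function name and nearest-neighbor search are ours (the paper uses the nabor R package for fast neighbor searches):

```python
import numpy as np

def local_predict(s0, x0, coords, X, y, beta_bd, C_hat,
                  tau2, eta2, rho, m=50):
    """Eqs (17) and (19): kriging at s0 from the m nearest neighbors,
    with one global beta_bd (Eq 11) and its covariance estimate C_hat
    (Eq 13). Exponential covariance, Eq (2), assumed for illustration."""
    d_all = np.linalg.norm(coords - s0, axis=1)
    nbr = np.argsort(d_all)[:m]                  # indices of the m nearest
    Xj, yj = X[nbr], y[nbr]
    d = np.linalg.norm(coords[nbr][:, None, :] - coords[nbr][None, :, :],
                       axis=2)
    Vj = tau2 * np.exp(-d / rho) + eta2 * np.eye(m)
    c_hat = tau2 * np.exp(-d_all[nbr] / rho)     # cross-covariances to s0
    Vinv = np.linalg.inv(Vj)
    pred = x0 @ beta_bd + c_hat @ Vinv @ (yj - Xj @ beta_bd)
    w = x0 - Xj.T @ Vinv @ c_hat
    pvar = (tau2 + eta2) - c_hat @ Vinv @ c_hat + w @ C_hat @ w
    return pred, pvar
```

Because the neighborhood size m is fixed, the cost per prediction does not grow with n, which is what makes point prediction feasible for tens of thousands of prediction locations.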
Block prediction
None of the literature reviewed earlier considered block prediction, yet it is an important goal in many applications. In fact, the origins of kriging were founded on estimating total gold reserves in the pursuit of mining [9]. The goal of block prediction is to predict the average value over a region, rather than at a point. If that region is a compact set of points denoted as B, then the random quantity is
$$Y(B) = \frac{1}{|B|} \int_B Y(s)\, ds \qquad (20)$$
where |B| = ∫_B 1 ds is the area of B. In practice, we approximate the integral by a dense set of points on a regular grid within B. Call that dense set of points D = {s_j; j = n+1, …, N}, where recall that {s_j; j = 1, …, n} are the observed data. Then the grid-based approximation to Eq (20) is Y_D = (1/N)∑_{j∈D} Y(s_j), with generic predictor
$$\hat{Y}_D = \frac{1}{N} \sum_{j \in D} \hat{Y}(s_j) \qquad (21)$$
We are in the same situation as for prediction of single sites, where we are unable to invert the covariance matrix of all n observed locations when predicting {Ŷ(s_j); j = n+1, n+2, …, N}. Instead, we use the local predictions developed in the previous section, which we average to compute the block prediction. Let the point predictions be a set of random variables denoted {Ŷ_ℓ(s_j); j = n+1, n+2, …, N}. Denote by y_o the vector of random variables at the observed locations, and by y_u the vector of unobserved random variables on the prediction grid D used to approximate the block. Recall that we can write Eq (18) as Ŷ_ℓ(s_j) = λ_j′y_o. We can put all λ_j into a large matrix,
$$W = \begin{pmatrix} \lambda_{n+1}' \\ \lambda_{n+2}' \\ \vdots \\ \lambda_N' \end{pmatrix}_{(N-n) \times n}$$
The average of all predictions, then, is
$$\hat{Y}_D = a' W y_o$$
where a = (1/N, 1/N, …, 1/N)′. Let a_*′ = a′W; then the block prediction a_*′y_o is also linear in y_o.
Let the covariance matrix of the vector (y_o′, y_u′)′ be
$$V = \begin{pmatrix} V_{o,o} & V_{o,u} \\ V_{u,o} & V_{u,u} \end{pmatrix}$$
where V_{o,o} = V in Eq (8). Then, assuming unbiasedness, that is, E(a_*′y_o) = E(a′y_u), which implies a_*′X_oβ = a′X_uβ, where X_o and X_u are the design matrices for the observed and unobserved variables, respectively, the block prediction variance is
$$\mathrm{E}(a_*'y_o - a'y_u)^2 = a_*' V_{o,o} a_* - 2 a_*' V_{o,u} a + a' V_{u,u} a \qquad (22)$$
Although the various parts of V can be very large, the necessary vectors can be created on-the-fly to avoid creating and storing the whole matrix. For example, take the third term in Eq (22). To make the kth element of vector V_{u,u}a, we can create the kth row of V_{u,u}, and then take the inner product with a. This means that only the vector V_{u,u}a must be stored. We then simply take this vector as an inner product with a to obtain a′V_{u,u}a. Also note that computing Eq (21) grows linearly with observed sample size n due to fixing the number of neighbors used for prediction, but Eq (22) grows quadratically, in both n and N, simply due to the matrix dimensions in V_{o,o} and V_{u,u}. We can control the growth of N by choosing the density of the grid approximation, but it may require subsampling of y_{o} if the number of observed data is too large. We often have very precise estimates of block averages, so this may not be too onerous if we have hundreds of thousands of observations.
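The row-at-a-time evaluation described above can be sketched as follows, using the third term a′V_{u,u}a as the example (exponential covariance assumed for illustration; the function name is ours):

```python
import numpy as np

def quad_form_rowwise(coords_u, a, tau2, eta2, rho):
    """a' V_uu a for the exponential model, computed one row of V_uu at a
    time so the (N - n) x (N - n) matrix is never formed or stored."""
    Va = np.empty(len(a))
    for k, sk in enumerate(coords_u):
        d = np.linalg.norm(coords_u - sk, axis=1)   # k-th row of distances
        row = tau2 * np.exp(-d / rho) + eta2 * (d == 0.0)
        Va[k] = row @ a                             # k-th element of V_uu a
    return a @ Va
```

Only one row of V_{u,u} and the accumulating vector V_{u,u}a are ever held in memory, so storage is O(N − n) even though the work remains quadratic in the number of grid points.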
The SPIN method
As we have shown, SPIN is a collection of methods for covariance parameter estimation, fixed effects estimation, and point and block prediction, based on spatial indexing. SPIN, as described above, estimates covariance parameters using REML, given by Eq (3), with a valid autocovariance model [e.g., Eq (2) used in a partitioned covariance matrix, given by Eq (10)]. Using these estimated covariance parameters, we estimate β using Eq (11), with estimated covariance matrix, Eq (13), unless explicitly stating the use of Eqs (14) or (15). For point prediction, we use Eq (17) with estimated variance Eq (19), unless explicitly stating the purely local version for β^ given by Eq (6) with estimated variance Eq (7). For block prediction, we use Eq (21) with Eq (22).
Simulations
To test the validity of SPIN, we simulated n spatial locations randomly within the [0, 1] × [0, 1] unit square to be used as observations, and we created a uniformly-spaced (N − n) = 40 × 40 prediction grid within the unit square.
We simulated data with two methods. The first simulation method created data sets that were not actually very large, using exact geostatistical methods that require the Cholesky decomposition of the covariance matrix. For these simulations, we used the spherical autocovariance model to construct V,
$$\mathrm{cov}[\varepsilon(\mathbf{s}_i), \varepsilon(\mathbf{s}_j)] = \tau^2\left(1 - \frac{3 d_{i,j}}{2\rho} + \frac{d_{i,j}^3}{2\rho^3}\right) I(d_{i,j} < \rho) + \eta^2 I(d_{i,j} = 0)$$
where terms are defined as in Eq (2). To simulate normally-distributed data from N(0, V), let L be the lower triangular matrix such that V = LL′. If the vector z is simulated as independent standard normal variables, then ε = Lz is a simulation from N(0, V). Unfortunately, computing L is an O(n^3) algorithm, on the same order as inverting V, which limits the size of data for simulation. Fig 2a and 2b show two realizations from N(0, V), where the sample size was n = 2000 and the autocovariance model, Eq (23), had τ^2 = 10, ρ = 0.5, and η^2 = 0.1. Each simulation took about 3 seconds. Note that when evaluation of predictions is included, simulations are required at all N spatial locations. We call this the GEOSTAT simulation method. For all simulations, we fixed τ^2 = 10 and η^2 = 0.1, but allowed ρ to vary randomly from a uniform distribution between 0 and 2.
10.1371/journal.pone.0291906.g002Examples of simulated surfaces used to test methods.
(a) and (b) are two different realizations of 2000 values from the GEOSTAT method with a range of 2. (c) and (d) are two realizations of 100,000 values from the SUMSINE method. Bluer values are lower, and yellower areas are higher.
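The GEOSTAT simulation step can be sketched as follows (a minimal Python illustration of Eq (23) and the Cholesky draw; the paper's implementation is in R, and the sample size is reduced here):

```python
import numpy as np

# Illustration only: simulate from N(0, V) via the Cholesky factor V = LL'.
rng = np.random.default_rng(2)
n = 500
s = rng.random((n, 2))                       # random locations in the unit square
tau2, rho, eta2 = 10.0, 0.5, 0.1             # partial sill, range, nugget

# Spherical autocovariance, Eq (23)
d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
V = tau2 * (1.0 - 1.5 * d / rho + 0.5 * (d / rho) ** 3) * (d < rho)
V += eta2 * np.eye(n)                        # nugget when d = 0

L = np.linalg.cholesky(V)                    # the O(n^3) step that limits data size
eps = L @ rng.standard_normal(n)             # one realization from N(0, V)

assert np.isclose(np.diag(V).mean(), tau2 + eta2)
```

The Cholesky factorization is the bottleneck: doubling n multiplies its cost by roughly eight.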
We created another method for simulating spatially patterned data for up to several million records. Let S = [s_{1}, s_{2}] be the 2-column matrix of the spatial coordinates of data, where s_{1} is the first coordinate, and s_{2} is the second coordinate. Let
$$\mathbf{S}^{*} = [\mathbf{s}_1^{*}, \mathbf{s}_2^{*}] = \mathbf{S} \begin{bmatrix} \cos(U_{1,i}\pi) & -\sin(U_{1,i}\pi) \\ \sin(U_{1,i}\pi) & \cos(U_{1,i}\pi) \end{bmatrix}$$
be a random rotation of the coordinate system by radian U_{1,i}π, where U_{1,i} is a uniform random variable. Then let
$$\varepsilon_i = U_{2,i}\left(1 - \frac{i-1}{100}\right)\left[\sin\!\big(i U_{3,i} 2\pi [\mathbf{s}_1^{*} + U_{4,i}\pi]\big) + \sin\!\big(i U_{5,i} 2\pi [\mathbf{s}_2^{*} + U_{6,i}\pi]\big)\right]$$
which is a 2-dimensional sine-wave surface with a random amplitude (due to uniform random variable U_{2,i}), random frequencies on each coordinate (due to uniform random variables U_{3,i} and U_{5,i}), and random shifts on each coordinate (due to uniform random variables U_{4,i} and U_{6,i}). The response variable is then created by taking ε = ∑_{i=1}^{100} ε_i, where expected amplitudes decrease linearly, and expected frequencies increase, with each i. Further, the ε were standardized to zero mean and a variance of 10 for each simulation, and we added a small independent component with variance 0.1 to each location, similar to the nugget effect η^2 for the GEOSTAT method. Fig 2c and 2d show two realizations from the sum of random sine-wave surfaces, where the sample size was 100,000. Each simulation took about 2 seconds. We call this the SUMSINE simulation method.
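The SUMSINE construction can be sketched as follows (Python for illustration; a sum of 100 randomly rotated, shifted, and scaled sine surfaces, then standardized with a small nugget-like component added):

```python
import numpy as np

# Illustration only: sum of 100 random sine-wave surfaces (SUMSINE sketch).
rng = np.random.default_rng(3)
n = 5000
S = rng.random((n, 2))                       # coordinates in the unit square

eps = np.zeros(n)
for i in range(1, 101):
    U1, U2, U3, U4, U5, U6 = rng.random(6)
    c, s_ = np.cos(U1 * np.pi), np.sin(U1 * np.pi)
    Sstar = S @ np.array([[c, -s_], [s_, c]])      # random rotation of coordinates
    s1, s2 = Sstar[:, 0], Sstar[:, 1]
    eps += U2 * (1 - (i - 1) / 100) * (            # amplitude shrinks with i
        np.sin(i * U3 * 2 * np.pi * (s1 + U4 * np.pi))
        + np.sin(i * U5 * 2 * np.pi * (s2 + U6 * np.pi))
    )

# Standardize to mean 0 and variance 10, then add a small nugget-like component.
eps = (eps - eps.mean()) / eps.std() * np.sqrt(10.0)
eps += rng.normal(scale=np.sqrt(0.1), size=n)
```

Because nothing here requires a matrix factorization, the cost is linear in n, which is what allows simulations with millions of records.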
Thus, random errors, ε, for the simulations were based on GEOSTAT or SUMSINE methods. In either case, we created two fixed effects. A covariate, x_{1}(s_{i}), was generated from standard independent normal-distributions at the s_{i} locations. A second spatially-patterned covariate, x_{2}(s_{i}), was created, using the same model, but a different realization, as the random error simulation for ε. Then the response variable was created as,
$$Y(\mathbf{s}_i) = \beta_0 + \beta_1 x_1(\mathbf{s}_i) + \beta_2 x_2(\mathbf{s}_i) + \varepsilon(\mathbf{s}_i)$$
for i = 1, 2, …, up to a specified sample size n, or N (when simulations at prediction sites were needed), with β_0 = β_1 = β_2 = 1.
Evaluation of simulation results
For one summary of performance of fixed effects estimation, we consider the simulation-based estimator of root-mean-squared error,
$$\mathrm{RMSE} = \sqrt{\frac{1}{K} \sum_{k=1}^{K} \left(\hat{\beta}_{p,k} - \beta_p\right)^2}$$
for the kth simulation among K, where β^p,k is the kth simulation estimate for the pth β parameter, and β_{p} is the true parameter used in simulations. We only consider β_{1} and β_{2} in Eq (25). The next simulation-based estimator we consider is 90% confidence interval coverage,
$$\mathrm{CI90} = \frac{1}{K} \sum_{k=1}^{K} I\!\left(\hat{\beta}_{p,k} - 1.645\sqrt{\widehat{\mathrm{var}}(\hat{\beta}_{p,k})} < \beta_p < \hat{\beta}_{p,k} + 1.645\sqrt{\widehat{\mathrm{var}}(\hat{\beta}_{p,k})}\right)$$
To evaluate point prediction we also consider the simulation-based estimator of root-mean-squared prediction error,
$$\mathrm{RMSPE} = \sqrt{\frac{1}{K \times 1600} \sum_{k=1}^{K} \sum_{j=1}^{1600} \left(\hat{Y}_k(\mathbf{s}_j) - y_k(\mathbf{s}_j)\right)^2}$$
where Y^k(sj) is the predicted value at the jth location for the kth simulation and y_{k}(s_{j}) is the realized value at the jth location for the kth simulation. The final summary that we consider is 90% prediction interval coverage,
$$\mathrm{PI90} = \frac{1}{K \times 1600} \sum_{k=1}^{K} \sum_{j=1}^{1600} I\!\left(\hat{Y}_k(\mathbf{s}_j) - 1.645\sqrt{\widehat{\mathrm{var}}(\hat{Y}_k(\mathbf{s}_j))} < y_k(\mathbf{s}_j) < \hat{Y}_k(\mathbf{s}_j) + 1.645\sqrt{\widehat{\mathrm{var}}(\hat{Y}_k(\mathbf{s}_j))}\right)$$
where var^(Y^k(sj)) is an estimator of the prediction variance.
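These summaries are straightforward to compute from arrays of simulation output. A toy sketch (Python for illustration, with synthetic predictions whose true prediction variance is 1, so RMSPE should sit near 1 and PI90 near its nominal 0.90):

```python
import numpy as np

# Toy sketch: synthetic "truth" and predictions with unit prediction variance.
rng = np.random.default_rng(4)
K, J = 200, 1600                             # simulations, prediction-grid points
y = rng.normal(size=(K, J))                  # realized values y_k(s_j)
pred_var = np.ones((K, J))                   # estimated prediction variances
y_hat = y + rng.standard_normal((K, J)) * np.sqrt(pred_var)

rmspe = np.sqrt(np.mean((y_hat - y) ** 2))
half = 1.645 * np.sqrt(pred_var)             # 90% prediction-interval half-width
pi90 = np.mean((y_hat - half < y) & (y < y_hat + half))

assert abs(rmspe - 1.0) < 0.05
assert abs(pi90 - 0.90) < 0.02
```

RMSE and CI90 for the fixed effects follow the same pattern, with β̂_{p,k} in place of Ŷ_k(s_j).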
Effect of partition method
We wanted to test SPIN over a wide range of data. Hence, we simulated 1000 data sets where the simulation method was chosen randomly, with equal probability, between the GEOSTAT and SUMSINE methods. If GEOSTAT was chosen, a random sample size between 1000 and 2000 was generated. If SUMSINE was chosen, a random sample size between 2000 and 10,000 was generated. Thus, throughout the study, the simulations occurred over a wide range of parameters, with two different simulation methods and randomly varying autocorrelation. In all cases, the fitted error models were misspecified, because we fitted an exponential autocovariance model to data generated by the GEOSTAT and SUMSINE methods. This should provide a good test of the robustness of the SPIN method and fairly general conclusions on the effect of partition method.
After simulating the data, we considered three indexing methods: completely random, spatially compact, and a mixed strategy that starts with compact partitions and then randomly reassigns 10% of locations. To create compact data partitions, we used k-means clustering [67] on the spatial coordinates. K-means minimizes within-group variances and maximizes among-group variances, so when applied to spatial coordinates it creates spatially compact partitions. An example of each partition method is given in Fig 3. We created partition sizes that ranged randomly from a target of 25 to 225 locations per group (k-means produces some variation in group size). It is possible to use one partition for covariance estimation and another for estimating fixed effects, so we considered all nine combinations of the three partition methods across the two estimation steps.
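The compact and mixed partitions can be sketched with a plain Lloyd's k-means on the coordinates (Python for illustration; the 10% reassignment follows the MIXD description above):

```python
import numpy as np

# Illustration only: Lloyd's k-means on spatial coordinates gives compact
# partitions (COMP); reassigning 10% of each group at random gives MIXD.
rng = np.random.default_rng(5)
n, target = 1000, 200
coords = rng.random((n, 2))
k = n // target                              # 5 groups of roughly 200

centers = coords[rng.choice(n, size=k, replace=False)]
for _ in range(50):                          # Lloyd iterations
    d2 = ((coords[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    labels = d2.argmin(axis=1)
    for g in range(k):                       # update non-empty cluster centers
        if np.any(labels == g):
            centers[g] = coords[labels == g].mean(axis=0)

mixd = labels.copy()                         # MIXD: move ~10% of each group
for g in range(k):
    idx = np.where(labels == g)[0]
    if len(idx) > 0:
        move = rng.choice(idx, size=max(1, len(idx) // 10), replace=False)
        mixd[move] = rng.integers(0, k, size=len(move))

assert set(labels) <= set(range(k))
```

A random (RAND) partition is simply `rng.integers(0, k, size=n)`.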
10.1371/journal.pone.0291906.g003Illustration of three methods for partitioning data.
Sample size was 1000, and the data were partitioned into 5 groups of 200 each. (a) Random assignment to group. (b) K-means clustering on x- and y-coordinates. (c) K-means on x- and y-coordinates, with 10% randomly re-assigned from each group. Each color represents a different grouping.
Table 3 shows performance summaries for the three partition methods, for both fixed effects estimation and point prediction, over wide-ranging simulations when using SPIN with 50 nearest neighbors for predictions. Whether for fixed effects estimation or prediction, compact partitions were clearly the best option, and random partitioning was the worst. The mixed approach was often close to compact partitioning in performance.
10.1371/journal.pone.0291906.t003Effect of partition method.
COPE  FEFE  RMSE_1  RMSE_2  RMSPE  CI90_1  CI90_2  PI90
RAND  RAND  0.1407  0.4133  6.650  0.8980  0.8540  0.9157
RAND  COMP  0.1244  0.2975  6.649  0.9210  0.8490  0.9157
RAND  MIXD  0.1261  0.3382  6.649  0.9160  0.8500  0.9157
COMP  RAND  0.1416  0.4020  6.406  0.9000  0.9210  0.9053
COMP  COMP  0.1196  0.2858  6.405  0.9170  0.8910  0.9053
COMP  MIXD  0.1214  0.3234  6.405  0.9110  0.9040  0.9052
MIXD  RAND  0.1408  0.4154  6.406  0.8950  0.8900  0.9058
MIXD  COMP  0.1197  0.2886  6.405  0.9150  0.8800  0.9058
MIXD  MIXD  0.1212  0.3300  6.405  0.9100  0.8810  0.9059
Results using 1000 simulations as described in the text. The first column of the table gives the data partition method for covariance parameter estimation (COPE) using REML, which was one of random partitioning (RAND), compact partitioning (COMP), or a mix of compact with 10% randomly distributed (MIXD). The second column gives the data partition method for fixed effects estimation (FEFE), one of RAND, COMP, or MIXD, using the covariance parameters as estimated with the method in the first column. RMSE, RMSPE, CI90, and PI90 are described in the text. RMSE_1 and RMSE_2 are for the first (spatially independent) and second (spatially patterned) covariates, respectively. Similarly, CI90_1 and CI90_2 are for the first and second covariates, respectively.
Effect of partition size
Next, we investigated the effect of partition size. We used only compact partitions, because they performed best, with partition sizes of 25, 50, 100, and 200 for both covariance parameter estimation and fixed effects estimation, and again used 50 nearest neighbors for predictions. We simulated data in the same way as above and used the same performance summaries, here also including the average time, in seconds, for each estimator. The results are shown in Table 4. In general, larger partition sizes had better RMSE for estimating covariance parameters, but the gains were very small after size 50. For fixed effects estimation, a partition size of 50 was often better than 100, and approximately equal to size 200. For prediction, RMSPE decreased as partition size increased. In terms of computing speed, covariance parameter estimation was slower as partition size increased, but fixed effects estimation was faster as partition size increased (because of fewer loops in Eq (13)). Partition sizes of 25 often had poor coverage in terms of both CI90 and PI90, but coverage was good for other partition sizes. Based on Tables 3 and 4, one good overall strategy, for both efficiency and speed, is to use compact partitions of size 50 for covariance parameter estimation and size 200 for fixed effects estimation. Note that when the partition size for fixed effects estimation differs from that for covariance parameter estimation, new inverses of the diagonal blocks in Eq (10) are needed. If the partition sizes are the same, the inverses of the diagonal blocks can be passed from REML to fixed effects estimation, so another good strategy is to use size 50 for both.
10.1371/journal.pone.0291906.t004Effect of partition sizes.
COPE  FEFE  RMSE_1  RMSE_2  RMSPE  CI90_1  CI90_2  PI90   TIME_C  TIME_F
25    25    0.147   0.645   6.77   0.938   0.845   0.932  2.821   3.328
25    50    0.131   0.340   6.77   0.955   0.807   0.932  2.821   1.249
25    100   0.133   0.372   6.77   0.930   0.833   0.932  2.821   0.758
25    200   0.130   0.346   6.77   0.938   0.810   0.932  2.821   0.730
50    25    0.146   0.593   6.14   0.943   0.963   0.909  3.031   3.328
50    50    0.121   0.290   6.13   0.897   0.900   0.909  3.031   1.249
50    100   0.122   0.309   6.13   0.912   0.922   0.908  3.031   0.758
50    200   0.120   0.288   6.13   0.917   0.922   0.909  3.031   0.730
100   25    0.143   0.634   6.13   0.930   0.882   0.906  4.802   3.328
100   50    0.121   0.304   6.13   0.900   0.885   0.907  4.802   1.249
100   100   0.122   0.322   6.13   0.905   0.917   0.906  4.802   0.758
100   200   0.120   0.299   6.13   0.910   0.910   0.906  4.802   0.730
200   25    0.144   0.637   6.13   0.927   0.877   0.905  12.760  3.328
200   50    0.121   0.300   6.13   0.897   0.887   0.905  12.760  1.249
200   100   0.122   0.322   6.13   0.905   0.905   0.905  12.760  0.758
200   200   0.120   0.300   6.13   0.907   0.902   0.905  12.760  0.730
Results are based on 1000 simulations, using the same simulation parameters as in Table 3. The first column of the table gives data partition sizes for the covariance parameter estimation (COPE), and the second column gives data partition size for fixed effects estimation (FEFE), while using covariance parameters as estimated in the first column. The columns RMSE_{1}, RMSE_{2}, RMSPE, CI90_{1}, CI90_{2}, and PI90 are the same as in Table 3. TIME_{C} is the average time, in seconds, for covariance parameter estimation, and TIME_{F} is the average time, in seconds, for fixed effects estimation.
Variance estimation for fixed effects
In the section on estimating β, we described three possible estimators for the covariance matrix of β^bd, with Eq (13) being theoretically correct, and faster alternatives Eqs (14) and (15). The alternative estimators will only be necessary for very large sample sizes, so to test their efficacy we simulated 1000 data sets with random sample sizes, from 10,000 to 100,000, using the SUMSINE method. We then fitted the covariance model, using compact partitions of size 50, and fixed effects, using partition sizes of 25, 50, 100, and 200. We computed the estimated covariance matrix of the fixed effects using Eqs (13)–(15), and evaluated performance based on 90% confidence interval coverage.
Results in Table 5 show that all three estimators, at all block sizes, have confidence interval coverage very close to the nominal 90% when estimating β_{1}, the independent covariate. However, when estimating the spatially-patterned covariate, β_{2}, the theoretical estimator has proper coverage for block sizes 50 and greater, while the two alternative estimators have proper coverage only for block size 50. It is surprising that the results for the alternative estimators are so specific to a particular block size, and these estimators warrant further research.
10.1371/journal.pone.0291906.t005CI90 for β_1 and β_2.
Part. Size  β_1: Eq (13)  β_1: Eq (14)  β_1: Eq (15)  β_2: Eq (13)  β_2: Eq (14)  β_2: Eq (15)
25          0.906         0.914         0.925         0.807         0.283         0.294
50          0.907         0.907         0.921         0.897         0.920         0.898
100         0.905         0.909         0.924         0.913         0.687         0.661
200         0.900         0.896         0.907         0.876         0.686         0.658
Results are based on 1000 simulations, using three different variance estimates, given by their equation numbers. Eq (13) is theoretically correct, while Eq (14) is based on empirical variation in β^ among partitions, and Eq (15) is based on averaging the covariance matrices of β^ among partitions.
Prediction with global estimate of β
In the sections on point and block prediction, we described prediction using both a local estimator of β and the global estimator β^bd. To compare them, and examine the effect of the number of nearest neighbors, we simulated 1000 data sets as described earlier, using compact partitions of size 50 for both covariance and fixed-effects estimation. We then predicted values on the gridded locations with 25, 50, 100, and 200 nearest neighbors.
Results in Table 6 show that prediction with the global estimator β^bd had smaller RMSPE, especially with smaller numbers of nearest neighbors. As expected, predictors had lower RMSPE with more nearest neighbors, but gains were small beyond 50 neighbors. Prediction intervals for both methods had proper coverage. The local estimator of β was faster because it used the local estimator of the covariance of β, while predictions with β^bd required the global covariance estimator, Eq (13), to be used in Eq (19). Higher numbers of nearest neighbors took longer to compute, especially beyond 100. Of course, predictions for the block average had much smaller RMSPE than point predictions. Again, prediction improved with more nearest neighbors, but improvements were small beyond 50. Computing time for block averaging increased with the number of neighbors, especially beyond 100, and took longer than point prediction.
10.1371/journal.pone.0291906.t006Effect of number of nearest neighbors for RMSPE and PI90.
nNN  RMSPE_1  RMSPE_2  PI90_1  PI90_2  RMSPE_3  PI90_3  Time_1  Time_2  Time_3
25   6.62     6.36     0.908   0.907   0.204    0.912   0.6     2.4     6.9
50   6.45     6.33     0.907   0.907   0.201    0.907   1.2     3.0     7.5
100  6.37     6.32     0.907   0.907   0.201    0.904   4.4     6.3     10.5
200  6.34     6.31     0.907   0.907   0.200    0.905   23.9    25.7    29.0
Results are based on 1000 simulations, using the same simulation parameters as in Table 3. The first column of the table gives number of nearest neighbors. Time is average computing time in seconds. The subscript 1 indicates a local estimator of β^ using Eq (6), while subscript 2 indicates global estimator of β^ using Eq (17). The subscript 3 indicates the block predictor, Eq (21).
A comparison of methods
To compare methods, we simulated 1000 data sets using GEOSTAT (partial sill 10, range 0.5, and nugget 0.1), where we fixed the sample size at n = 1000, and the errors were standardized before adding fixed effects. We compared three methods: 1) estimation and prediction using the full covariance matrix for all 1000 points, 2) SPIN with compact blocks of 50 for both covariance and fixed effects parameter estimation, and 50 nearest neighbors for prediction, and 3) nearest-neighbor Gaussian processes (NNGP). NNGP had good performance in [16] and software is readily available in the R package spNNGP [68]. For spNNGP, we used default parameters for the conjugate prior method and a 25 × 25 search grid for phi and alpha, the dimensions of the search grid used in [16]. We stress that we do not claim this to be a definitive comparison among methods, as the developers of NNGP could surely make adjustments to improve performance. Likewise, partition size and number of nearest neighbors for prediction could be adjusted to optimize performance of SPIN for any given simulation or data set. We offer these results to show that, broadly, SPIN and NNGP are comparable, and very fast, with little performance lost in comparison to using the full covariance matrix.
Table 7 shows that RMSE for estimation of the independent covariate, and the spatially-patterned covariate, were approximately equal for SPIN and NNGP, and only slightly worse than the full covariance matrix. RMSPE for SPIN was equal to the full covariance matrix, and both were just slightly better than NNGP. Confidence and prediction intervals for all three methods were very close to the nominal 90%.
10.1371/journal.pone.0291906.t007Comparison of 3 methods for fixed effects estimation and point prediction.
Method  RMSE_1  RMSE_2  RMSPE  CI90_1  CI90_2  PI90   TIME
Full    0.0088  0.0359  0.292  0.893   0.903   0.899  110.2
SPIN    0.0090  0.0380  0.292  0.908   0.913   0.906  3.0
NNGP    0.0090  0.0381  0.294  0.888   0.881   0.905  21.8
Data were simulated from 1000 random locations with a 40 × 40 prediction grid. The first column of the table gives the method, where Full uses the full 1000 × 1000 covariance matrix, SPIN uses spatial partitioning with compact blocks of size 50 and 50 nearest-neighbor prediction points. NNGP uses default parameters from R package for the conjugate prior method with a 25 × 25 search grid on phi and alpha. The columns RMSE_{1}, RMSE_{2}, RMSPE, CI90_{1}, CI90_{2}, and PI90 are the same as in Table 3. TIME is the average time, in seconds, for fixed effects estimation and prediction combined.
Fig 4 shows computing times, using 5 replicate simulations, for each method for up to 100,000 records. Both NNGP and SPIN can use parallel processing, but here we used a single processor to remove any differences due to parallel implementations. Fitting the full covariance matrix with REML, which is iterative, took more than 30 minutes with sample sizes > 2500. Computing time for NNGP is clearly linear with sample size, while for SPIN, it is quadratic when using Eq (13), but linear when using the alternative variance estimators for fixed effects (Eqs 14 and 15). Using the alternative variance estimators, SPIN was about 10 times faster than NNGP, and even with quadratic growth when using Eq (13), SPIN was faster than NNGP for up to 100,000 records.
10.1371/journal.pone.0291906.g004Computing times as a function of sample size for three methods: 1) Full covariance matrix (black line), 2) NNGP (red line), and 3) SPIN (green lines).
For SPIN, the theoretically correct variance estimator (Eq 13) is solid green, while faster alternatives (Eqs 14 and 15) are dashed green.
Application to stream networks
We applied spatial indexing to covariance matrices constructed using stream network models, as described for the motivating example in the Introduction. These are variance component models, with a tail-up component, a tail-down component, and a Euclidean-distance component, each with 2 covariance parameters, along with a nugget effect; thus, there are 7 covariance parameters (4 partial sills and 3 range parameters). A full covariance matrix was developed for these models [69], and we easily adapted it for spatial partitioning. We used compact blocks of size 50 for estimation, and 50 nearest neighbors for predictions. The 4 partial sill estimates were 1.76, 0.40, 2.57, and 0.66 for the tail-up, tail-down, Euclidean-distance, and nugget effects, respectively, indicating that the tail-up and Euclidean-distance components dominated the structure of the overall autocovariance; both had large range parameters. Fitting the covariance parameters took 7.98 minutes, and fitting the fixed effects took an additional 2.15 minutes of computing time (Table 8); the estimates are very similar to results found in [55]. Predictions for 65,099 locations, which took 47 minutes, are shown in Fig 5.
10.1371/journal.pone.0291906.g005Temperature predictions at 65,099 locations for the Mid-Columbia river.
Yellower colors are higher values, while bluer colors are lower values.
10.1371/journal.pone.0291906.t008Fixed effects table for Mid-Columbia river data.
Effect              β^bd     se(β^bd)  z-value   Prob(> |z|)
Intercept           30.9324  5.8816    5.2592    < 0.00001
Elevation^{1}       -4.0312  0.5052    -7.9787   < 0.00001
Slope^{2}           -0.1504  0.0289    -5.2009   < 0.00001
Lakes^{3}           0.5287   0.1003    5.2690    < 0.00001
Precipitation^{4}   -0.0018  0.0004    -4.4639   0.00001
Northing^{5}        -0.6315  0.3002    -2.1038   0.03565
Flow^{6}            -0.1118  0.0217    -5.1429   < 0.00001
Drainage Area^{7}   0.0363   0.0236    1.5400    0.12388
Canopy^{8}          -0.0238  0.0033    -7.1280   < 0.00001
Air Temperature^{9} 0.4538   0.0119    38.2106   < 0.00001
Discharge^{10}      0.0031   0.0140    0.2227    0.82385
The se(β^bd) is the standard error using Eq (13). The z-value is the estimate divided by its standard error. Prob(> |z|) is the probability of getting the fixed effect estimate if it were truly 0, assuming a standard normal distribution.
^{1} Elevation (m/1000) at sensor site
^{2} Slope (100m/m) of stream reach of sensor site
^{3} Percentage of watershed upstream of sensor site composed of lake or reservoir surfaces
^{4} Mean annual precipitation (mm) in watershed upstream of sensor site
^{5} Albers equal area northing coordinate (10km) of sensor site
^{6} Percentage of the base flow to total flow of sensor site
^{7} Drainage area (10,000 km^{2}) upstream of sensor site
^{8} Riparian canopy coverage (%) of 1 km stream reach encompassing a sensor site
^{9} Mean annual August air temperature (°C)
^{10} Mean annual August discharge (m^{3}/sec)
In summary, the original analysis [55] took 10 days of continuous computing time to fit the model and make predictions with a full 9521 × 9521 covariance matrix. Using SPIN, fitting the same model took about 10 minutes, with an additional 47 minutes for predictions. Note that these models take more time than Euclidean distance alone because there are 7 covariance parameters, and the tail-up and tail-down models use stream distance, which takes longer to compute. For this example, we used parallel processing with 8 cores when fitting covariance parameters and fixed effects, and making predictions, which made analyses considerably faster. We did not use block prediction, because that was not a particular goal for this study. However, it is generally important, and has been used for estimating fish abundance [70].
Discussion and conclusions
We have explored spatial partitioning to speed computations for massive data sets. We have provided novel and theoretically correct development of variance estimators for all quantities. We proposed a globally coherent model for covariance and fixed effects estimation, and then used that model for improved predictions, even when those predictions are made locally from nearest neighbors. We also include block kriging in our development, which is largely absent from the literature on spatial methods for big data.
Our simulations showed that, over a range of sample sizes, simulation methods, and amounts of autocorrelation, spatially compact partitions are best. There does not appear to be a need for “large blocks,” as used in [54]. A good overall strategy that combines speed with little loss of precision is 50/50/50: compact partitions of size 50 for both covariance parameter estimation and fixed effects estimation, and 50 nearest neighbors for prediction. This strategy compares very favorably with a default strategy for NNGP.
One benefit of the data indexing is that it extends easily to any geostatistical model with a valid covariance matrix. There is no need to approximate a Gaussian process. We provided one example for stream network models, but other examples include geometric anisotropy, nonstationary models, spatio-temporal models (including those that are nonseparable), etc. Any valid covariance matrix can be indexed and partitioned, offering both faster matrix inversions and parallel computing, while providing valid inferences with proper uncertainty assessment.
We would like to thank Devin Johnson, Brett McClintock, Alan Pearse, and one anonymous reviewer for their reviews. The findings and conclusions in the paper are those of the author(s) and do not necessarily represent the views of the reviewers nor the EPA, BPA, and National Marine Fisheries Service, NOAA. Any use of trade, product, or firm names does not imply an endorsement by the US Government.
ReferencesCressieNAC. SteinML. A modeling approach for large spatial datasets. ChilesJP, DelfinerP. PattersonHD, ThompsonR. Recovery of inter-block information when block sizes are unequal. Patterson H, Thompson R. Maximum likelihood estimation of components of variance. In: Proceedings of the 8th International Biometric Conference. Biometric Society, Washington, DC; 1974. p. 197–207.MardiaKV, MarshallR. Maximum likelihood estimation of models for residual covariance in spatial regression. HeydeCC. A quasi-likelihood approach to the REML estimating equations. CressieN, LahiriSN. Asymptotics for REML estimation of spatial covariance parameters. CressieN. The origins of kriging. Johnston K, Ver Hoef JM, Krivoruchko K, Lucas N. Using ArcGIS Geostatistical Analyst. vol. 300. Esri Redlands, CA; 2001.Ver HoefJM, PetersonE. A moving average approach for spatial statistical models of stream networks (with discussion). ZimmermanDL, CressieN. Mean squared prediction error in the spatial linear model with estimated covariance parameters. Sun Y, Li B, Genton MG. Geostatistics for large datasets. In: Porcu E, Montero JM, Schlather M, editors. Advances and challenges in Space-Time Modelling of Natural Events. Springer; 2012. p. 55–77.BradleyJR, CressieN, ShiT. A comparison of spatial predictors when datasets could be very large. Liu H, Ong YS, Shen X, Cai J. When Gaussian process meets big data: A review of scalable GPs. arXiv preprint arXiv:180701065. 2018;.HeatonMJ, DattaA, FinleyAO, FurrerR, GuinnessJ, GuhaniyogiR, et al. A case study competition among methods for analyzing large spatial data. KammannEE, WandMP. Geoadditive models. RuppertD, WandMP, CarrollRJ. WoodSN, LiZ, ShaddickG, AugustinNH. Generalized additive models for gigadata: modeling the UK black smoke network daily data. Cressie, Noel, Johannesson, G. Spatial prediction for massive datasets. 
In: Mastering the Data Explosion in the Earth and Environmental Sciences: Proceedings of the Australian Academy of Science Elizabeth and Frederick White Conference. Canberra, Australia: Australian Academy of Science; 2006. p. 11.
Cressie N, Johannesson G. Fixed rank kriging for very large spatial data sets.
Kang EL, Cressie N. Bayesian inference for the spatial random effects model.
Katzfuss M, Cressie N. Spatio-temporal smoothing and EM estimation for massive remote-sensing data sets.
Banerjee S, Gelfand AE, Finley AO, Sang H. Gaussian predictive process models for large spatial data sets.
Finley AO, Sang H, Banerjee S, Gelfand AE. Improving the performance of predictive process modeling for large datasets.
Nychka D, Bandyopadhyay S, Hammerling D, Lindgren F, Sain S. A multiresolution Gaussian process model for the analysis of large spatial datasets.
Katzfuss M. A multi-resolution approximation for massive spatial datasets.
Furrer R, Genton MG, Nychka D. Covariance tapering for interpolation of large spatial datasets.
Kaufman CG, Schervish MJ, Nychka DW. Covariance tapering for likelihood-based estimation in large spatial data sets.
Stein ML. Statistical properties of covariance tapers.
Lindgren F, Rue H, Lindström J. An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach.
Bakka H, Rue H, Fuglstad GA, Riebler A, Bolin D, Illian J, et al. Spatial modeling with R-INLA: a review.
Vecchia AV. Estimation and model identification for continuous spatial processes.
Stein ML, Chi Z, Welty LJ. Approximating likelihoods for large spatial data sets.
Datta A, Banerjee S, Finley AO, Gelfand AE. Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets.
Datta A, Banerjee S, Finley AO, Gelfand AE. On nearest-neighbor Gaussian process models for massive spatial data.
Finley AO, Datta A, Cook BC, Morton DC, Andersen HE, Banerjee S. Applying nearest neighbor Gaussian processes to massive spatial data sets: forest canopy height prediction across Tanana Valley, Alaska. arXiv preprint arXiv:170200434. 2017.
Finley AO, Datta A, Cook BD, Morton DC, Andersen HE, Banerjee S. Efficient algorithms for Bayesian nearest neighbor Gaussian processes.
Katzfuss M, Guinness J. A general framework for Vecchia approximations of Gaussian processes. arXiv preprint arXiv:170806302. 2017.
Katzfuss M, Guinness J, Gong W, Zilber D. Vecchia approximations of Gaussian-process predictions. arXiv preprint arXiv:180503309. 2018.
Zilber D, Katzfuss M. Vecchia-Laplace approximations of generalized Gaussian processes for big non-Gaussian spatial data. arXiv preprint arXiv:190607828. 2019.
Ver Hoef JM. Kriging models for linear networks and non-Euclidean distances: cautions and solutions.
Haas TC. Lognormal and moving window methods of estimating acid deposition.
Haas TC. Local prediction of a spatio-temporal process with an application to wet sulfate deposition.
Curriero FC, Lele S. A composite likelihood approach to semivariogram estimation.
Liang F, Cheng Y, Song Q, Park J, Yang P. A resampling-based stochastic approximation method for analysis of large geostatistical data.
Eidsvik J, Shaby BA, Reich BJ, Wheeler M, Niemi J. Estimation and prediction in spatial models with block composite likelihoods.
Barbian MH, Assunção RM. Spatial subsemble estimator for large geostatistical data.
Varin C, Reid N, Firth D. An overview of composite likelihood methods.
Park C, Huang JZ, Ding Y. Domain decomposition approach for fast Gaussian process regression of large spatial data sets.
Park C, Huang JZ. Efficient computation of Gaussian process regression for large spatial data sets by patching local Gaussian processes.
Heaton MJ, Christensen WF, Terres MA. Nonstationary Gaussian process models using spatial hierarchical clustering from finite differences.
Park C, Apley D. Patchwork kriging for large-scale Gaussian process regression.
Caragea P, Smith RL. Approximate likelihoods for spatial processes. Preprint. 2006. https://rls.sites.oasis.unc.edu/postscript/rs/approxlh.pdf
Isaak DJ, Wenger SJ, Peterson EE, Ver Hoef JM, Nagel DE, Luce CH, et al. The NorWeST summer stream temperature model and scenarios for the western U.S.: a crowd-sourced database and new geospatial tools foster a user community and predict broad climate warming of rivers and streams.
Ver Hoef JM, Peterson EE, Theobald D. Spatial statistical models that use flow and stream distance.
Barry RP, Ver Hoef JM. Blackbox kriging: spatial prediction without specifying variogram models.
Ver Hoef JM, Barry RP. Constructing and fitting models for cokriging and multivariable spatial prediction.
Higdon D. A process-convolution approach to modelling temperatures in the North Atlantic Ocean (disc: p191-192).
Higdon D, Swall J, Kern J. Non-stationary spatial modeling. In: Bernardo JM, Berger JO, Dawid AP, Smith AFM, editors. Bayesian Statistics 6—Proceedings of the Sixth Valencia International Meeting. Clarendon Press [Oxford University Press]; 1999. p. 761–768.
Webster R, Oliver MA.
Besag J. Statistical analysis of non-lattice data.
Guha S, Hafen R, Rounds J, Xia J, Li J, Xi B, et al. Large complex data: divide and recombine (D&R) with RHIPE.
Chapman DG, Johnson AM. Estimation of fur seal pup populations by randomized sampling.
R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2020.
Elseberg J, Magnenat S, Siegwart R, Nüchter A. Comparison of nearest-neighbor-search strategies and implementations for efficient shape registration.
MacQueen JB. Some methods for classification and analysis of multivariate observations. In: Le Cam LM, Neyman J, editors. Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. vol. 1. University of California Press; 1967. p. 281–297.
Finley AO, Datta A, Banerjee S. R package for nearest neighbor Gaussian process models. arXiv:200109111 [stat]. 2020.
Ver Hoef JM, Peterson EE, Clifford D, Shah R. SSN: an R package for spatial statistical modeling on stream networks.
Isaak DJ, Ver Hoef JM, Peterson EE, Horan DL, Nagel DE. Scalable population estimates using spatial-stream-network (SSN) models, fish density surveys, and national geospatial database frameworks for streams.
10.1371/journal.pone.0291906.r001
Decision Letter 0
Abonazel, Mohamed R., Academic Editor
© 2023 Mohamed R. Abonazel
This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Submission Version 0
17 May 2023
PONE-D-23-03446
Indexing and Partitioning the Spatial Linear Model for Large Data Sets
PLOS ONE
Dear Dr. Ver Hoef,
Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.
Please submit your revised manuscript by Jul 01 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.
Please include the following items when submitting your revised manuscript:
A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.
If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.
If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.
We look forward to receiving your revised manuscript.
Kind regards,
Mohamed R. Abonazel, Ph.D.
Academic Editor
PLOS ONE
Journal Requirements:
When submitting your revision, we need you to address these additional requirements.
1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf.
2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.
3. Thank you for stating the following in the Acknowledgments Section of your manuscript:
"The project received financial support through Interagency Agreement DW-13-92434601-0 from the U.S. Environmental Protection Agency (EPA), and through Interagency Agreement 81603 from the Bonneville Power Administration (BPA), with the National Marine Fisheries Service, NOAA."
We note that you have provided additional information within the Acknowledgements Section that is not currently declared in your Funding Statement. Please note that funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.
Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:
"JVH: The project received financial support through Interagency Agreement DW-13-92434601-0 from the U.S. Environmental Protection Agency (EPA), and through Interagency Agreement 81603 from the Bonneville Power Administration (BPA), with the National Marine Fisheries Service, NOAA. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."
Please include your amended statements within your cover letter; we will change the online submission form on your behalf.
4. We note that Figures 1 and 5 in your submission contain [map/satellite] images which may be copyrighted. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For these reasons, we cannot publish previously copyrighted maps or satellite images created using proprietary data, such as Google software (Google Maps, Street View, and Earth). For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.
We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:
a. You may seek permission from the original copyright holder of Figures 1 and 5 to publish the content specifically under the CC BY 4.0 license.
We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:
“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”
Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission.
In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”
b. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.
The following resources for replacing copyrighted map figures may be helpful:
USGS National Map Viewer (public domain): http://viewer.nationalmap.gov/viewer/
The Gateway to Astronaut Photography of Earth (public domain): http://eol.jsc.nasa.gov/sseop/clickmap/
Maps at the CIA (public domain): https://www.cia.gov/library/publications/the-world-factbook/index.html and https://www.cia.gov/library/publications/cia-maps-publications/index.html
NASA Earth Observatory (public domain): http://earthobservatory.nasa.gov/
Landsat: http://landsat.visibleearth.nasa.gov/
USGS EROS (Earth Resources Observatory and Science (EROS) Center) (public domain): http://eros.usgs.gov/#
1. Is the manuscript technically sound, and do the data support the conclusions?
The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #1: Yes
**********
2. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: Yes
**********
3. Have the authors made all data underlying the findings in their manuscript fully available?
The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.
Reviewer #1: Yes
**********
4. Is the manuscript presented in an intelligible fashion and written in standard English?
PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.
Reviewer #1: No
**********
5. Review Comments to the Author
Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)
Reviewer #1: Indexing and Partitioning the Spatial Linear Model for Large Data Sets
A few major comments are attached for the improvement of the article.
1. There are too many variables (parameters) in this article; I suggest adding a table of acronyms to define each variable separately.
2. What is Eq (1)? You should write Equation (1).
3. Σ is the standard notation for a sum, but you used it for a different concept. You should avoid using standard notation for different concepts.
4. In the literature review, you cited 13-16, 17-19, 20-22, and 27-29 as groups. You should explain these separately.
5. Figure 1 is not in the place where it is referenced; you should place your figures where you cite them.
6. Where did you get the methodology? Is it standard or your own novelty? You did not cite a single reference there.
7. There are too many short forms in this article. I again suggest creating a new table to define each short form separately. See page 12/26.
8. Tables are not in the places where they are referenced; you should place your tables where you cite them.
9. The literature about partitioning is very limited; I recommend adding more detail about graph theory. I recommend adding the following basics: Notes on the Localization of Generalized Hexagonal Cellular Networks, Mathematics. DOI: 10.3390/math11040844. Verification of some topological indices of Y-junction based nanostructures by M-polynomials. Journal of Mathematics. DOI: 10.1155/2022/8238651. Sharp bounds on partition dimension of hexagonal Mobius ladder. Journal of King Saud University-Science, Dec. 2021. DOI: 10.1016/j.jksus.2021.101779. Metric-based resolvability of polycyclic aromatic hydrocarbons. European Physical Journal Plus. DOI: 10.1140/epjp/s13360-021-01399-8.
10. On page 17/26, line 496, what is "covariance matrice"? A typo.
11. There are many typos in this article; you should recheck it.
12. References are not in the same pattern; see 59 and 60 (page section).
13. See the difference in the journal name of ref 62 compared with the others.
14. The figures are very blurry and hard to understand; kindly revise the quality of the pictures.
15. Try to write the article in an understandable way; you are teaching newcomers, not only publishing articles.
16. Elaborate each section in an understandable way; make sections and enumerate them.
**********
6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.
If you choose “no”, your identity will remain anonymous but your review may still be made public.
Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: No
**********
[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]
While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.
10.1371/journal.pone.0291906.r002
Author response to Decision Letter 0
Submission Version 1
26 Jun 2023
Please see the attached response letter to the editor and reviewer. Thank you very much for your reviews.
Submitted filename: rebuttalLetter.pdf
10.1371/journal.pone.0291906.r003
Decision Letter 1
Abonazel, Mohamed R., Academic Editor
© 2023 Mohamed R. Abonazel
Submission Version 1
2 Aug 2023
PONE-D-23-03446R1
Indexing and Partitioning the Spatial Linear Model for Large Data Sets
PLOS ONE
Dear Dr. Ver Hoef,
Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.
Please submit your revised manuscript by Sep 16 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.
We look forward to receiving your revised manuscript.
Kind regards,
Mohamed R. Abonazel, Ph.D.
Academic Editor
PLOS ONE
Journal Requirements:
Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.
Additional Editor Comments:
The authors are requested to make appropriate modifications to this manuscript as suggested by the reviewer.
Reviewers' comments:
Reviewer's Responses to Questions
Comments to the Author
1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.
Reviewer #1: All comments have been addressed
Reviewer #2: All comments have been addressed
**********
2. Is the manuscript technically sound, and do the data support the conclusions?
Reviewer #1: Partly
Reviewer #2: Yes
**********
3. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: No
Reviewer #2: Yes
**********
4. Have the authors made all data underlying the findings in their manuscript fully available?
Reviewer #1: No
Reviewer #2: Yes
**********
5. Is the manuscript presented in an intelligible fashion and written in standard English?
Reviewer #1: Yes
Reviewer #2: Yes
**********
6. Review Comments to the Author
Reviewer #1: You may accept this article from my side. The authors addressed all the suggested comments from my side. It is quite appropriate to accept this article.
Thank you
Reviewer #2: The authors present an approximate inference method for large spatial datasets by indexing and partitioning the data into blocks and taking these blocks to be independent, thus reducing the computational load. While this approach is not new, the authors present enough variations and innovation to differentiate it from the already published methods. The authors provide simulated examples where the fixed effects in the model were estimated with fair accuracy. The method is applied to a real dataset with success.
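As context for the approach the reviewer summarizes (partitioning the data into blocks and treating blocks as independent), a minimal sketch of a block-independence Gaussian log-likelihood is shown below. This is an illustrative reconstruction, not the authors' implementation: the function names, the exponential covariance, and its range parameter are all hypothetical choices.

```python
import numpy as np

def block_loglik(y, X, beta, cov_fn, coords, labels):
    """Gaussian log-likelihood approximated by treating partition blocks
    as independent, i.e., using a block-diagonal covariance matrix.
    Each block costs O(n_b^3) instead of O(n^3) for the full data."""
    r = y - X @ beta                      # residuals after fixed effects
    total = 0.0
    for b in np.unique(labels):
        idx = labels == b
        S = cov_fn(coords[idx])           # small n_b x n_b covariance block
        _, logdet = np.linalg.slogdet(S)  # stable log-determinant
        rb = r[idx]
        total += -0.5 * (idx.sum() * np.log(2 * np.pi)
                         + logdet
                         + rb @ np.linalg.solve(S, rb))
    return total

def exp_cov(c, sill=1.0, range_par=0.25):
    """Hypothetical exponential covariance with a small nugget for stability."""
    d = np.linalg.norm(c[:, None, :] - c[None, :, :], axis=-1)
    return sill * np.exp(-d / range_par) + 1e-8 * np.eye(len(c))
```

When the true covariance really is block-diagonal (e.g., independent observations), the blockwise sum recovers the exact full-data log-likelihood; otherwise it is an approximation whose quality depends on how the partition cuts across the spatial dependence.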
The article is written well and is easily readable for people who engage in practicing spatial statistical methods. However, I have the following comments about the work:
1. Is the primary objective of this method to efficiently estimate the covariate effects? You do not present covariance parameter estimation results anywhere.
2. I am skeptical about the spatial prediction performance of the method. As per the described data generation method, the magnitude of the response variable Y would be around 40 at most. From Tables 3 and 4, the RMSPE is of similar magnitude. This indicates rather poor performance and since you do not compare these with any other method, it is hard to say if the method is working well or not. Please correct me about the magnitude of Y, if I am wrong.
3. The RMSPE in Table 7 is surprisingly low compared to the numbers in Tables 3, 4, and 6. Are the data generated similarly, with the same scale of Y in both cases? If so, what led to such huge improvements?
4. This is a minor comment, but a spatial range parameter bigger than 1 (or even 0.5) for a field the size of a unit square is unrealistic. For investigative purposes, however, it is fine.
5. Eq (19) has a typo: it should be Vj in the second term, not Vi.
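For readers following comments 2 and 3, root-mean-squared prediction error (RMSPE) over held-out locations is typically computed as below. This is the generic definition, not the authors' code; the point of the reviewer's scale comparison is that RMSPE lives on the same scale as Y itself.

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean squared prediction error: sqrt(mean((y_i - yhat_i)^2)).
    Directly comparable to the magnitude of the response variable."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

If responses range up to roughly 40 and RMSPE is of similar magnitude, the predictor explains little of the variation, which is exactly the concern raised in comment 2.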
**********
7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.
Reviewer #1: No
Reviewer #2: No
**********
10.1371/journal.pone.0291906.r004
Author response to Decision Letter 1
Submission Version 2
24 Aug 2023
Please see the rebuttal letter. Thank you.
Submitted filename: rebuttalLetter2.pdf
10.1371/journal.pone.0291906.r005
Decision Letter 2
Abonazel, Mohamed R., Academic Editor
© 2023 Mohamed R. Abonazel
Submission Version 2
11 Sep 2023
Indexing and Partitioning the Spatial Linear Model for Large Data Sets
PONE-D-23-03446R2
Dear Dr. Ver Hoef,
We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.
Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.
An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.
If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.
Kind regards,
Mohamed R. Abonazel, Ph.D.
Academic Editor
PLOS ONE
Additional Editor Comments (optional):
Reviewers' comments:
Reviewer's Responses to Questions
Comments to the Author
1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.
Reviewer #1: All comments have been addressed
Reviewer #2: All comments have been addressed
**********
2. Is the manuscript technically sound, and do the data support the conclusions?
Reviewer #1: Yes
Reviewer #2: Yes
**********
3. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: Yes
Reviewer #2: Yes
**********
4. Have the authors made all data underlying the findings in their manuscript fully available?
Reviewer #1: Yes
Reviewer #2: Yes
**********
5. Is the manuscript presented in an intelligible fashion and written in standard English?
Reviewer #1: Yes
Reviewer #2: Yes
**********
6. Review Comments to the Author
Reviewer #1: You may accept this article from my side. The authors addressed all the suggested comments from my side. It is quite appropriate to accept this article.
Reviewer #2: The authors have addressed all my concerns. Great work on the article! With the revised tables, the work really shows that it is a good approximate inference method for large spatial datasets.
**********
7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.
Reviewer #1: Yes: Muhammad Azeem
Reviewer #2: No
**********
10.1371/journal.pone.0291906.r006
Acceptance Letter
Abonazel, Mohamed R., Academic Editor
© 2023 Mohamed R. Abonazel
23 Oct 2023
PONE-D-23-03446R2
Indexing and partitioning the spatial linear model for large data sets
Dear Dr. Ver Hoef:
I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.
If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.
If we can help with anything else, please email us at plosone@plos.org.
Thank you for submitting your work to PLOS ONE and supporting open access.