| Title: | Tools for Statistical Inference with Geo-Coded Data |
| Version: | 0.1.0 |
| Description: | Fast computation of Conley (1999) <doi:10.1016/S0304-4076(98)00084-0> spatial heteroskedasticity and autocorrelation consistent (HAC) standard errors for linear regression models with geo-coded data, with a fast C++ implementation by Christensen, Hartman, and Samii (2021) <doi:10.1017/S0020818321000187>. Performance-critical distance calculations, kernel weighting, and variance component accumulation are implemented in C++ via 'Rcpp' and 'RcppArmadillo'. Includes tools for estimating the spatial correlation range from covariograms and correlograms following the bandwidth selection method proposed in Lehner (2026) <doi:10.48550/arXiv.2603.03997>, and diagnostic visualizations for bandwidth selection. |
| Depends: | R (≥ 4.1.0) |
| License: | GPL (≥ 3) |
| URL: | https://github.com/axlehner/SpatialInference |
| BugReports: | https://github.com/axlehner/SpatialInference/issues |
| Encoding: | UTF-8 |
| LazyData: | true |
| RoxygenNote: | 7.3.2 |
| LinkingTo: | Rcpp, RcppArmadillo |
| Imports: | Rcpp, sf, data.table, magrittr, stats, tibble |
| Suggests: | lfe, fixest, dplyr, stringr, spdep, ncf, gstat, sandwich, ggplot2, modelsummary, knitr, rmarkdown, geosphere, testthat (≥ 3.0.0) |
| Config/testthat/edition: | 3 |
| VignetteBuilder: | knitr |
| NeedsCompilation: | yes |
| Packaged: | 2026-03-21 03:46:39 UTC; alexanderlehner |
| Author: | Alexander Lehner |
| Maintainer: | Alexander Lehner <alehner@worldbank.org> |
| Repository: | CRAN |
| Date/Publication: | 2026-03-25 20:50:23 UTC |
SpatialInference: Conley Spatial HAC Standard Errors and Correlation Range Estimation
Description
Fast computation of Conley (1999) spatial heteroskedasticity and autocorrelation consistent (HAC) standard errors for linear regression models with geo-coded data. Performance-critical distance calculations, kernel weighting, and variance component accumulation are implemented in C++ via Rcpp and RcppArmadillo. Includes tools for estimating the spatial correlation range from covariograms and correlograms following the bandwidth selection method proposed in Lehner (2026), and diagnostic visualizations for bandwidth selection.
Author(s)
Maintainer: Alexander Lehner alehner@worldbank.org (ORCID)
References
Lehner, A. (2026). Bandwidth selection for spatial HAC standard errors. arXiv preprint arXiv:2603.03997. doi:10.48550/arXiv.2603.03997
Conley, T. G. (1999). GMM estimation with cross sectional dependence. Journal of Econometrics, 92(1), 1–45. doi:10.1016/S0304-4076(98)00084-0
Eddelbuettel, D. and Francois, R. (2011). Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40(8), 1–18. doi:10.18637/jss.v040.i08
Eddelbuettel, D. and Sanderson, C. (2014). RcppArmadillo: Accelerating R with high-performance C++ linear algebra. Computational Statistics & Data Analysis, 71, 1054–1063. doi:10.1016/j.csda.2013.02.005
See Also
Useful links:
Report bugs at https://github.com/axlehner/SpatialInference/issues
Spatial Variance Component for Balanced Panel
Description
Spatial Variance Component for Balanced Panel
Usage
Bal_XeeXhC(dmat, X, e, n1, k)
Arguments
dmat |
pre-computed distance matrix (n x n) |
X |
model matrix (n x k) |
e |
vector of residuals (length n) |
n1 |
number of observations |
k |
number of regressors |
Value
A k x k matrix representing the spatial X'ee'X component.
Create Distance Matrix
Description
Create Distance Matrix
Usage
DistMat(M, cutoff, kernel = "bartlett", dist_fn = "Haversine")
Arguments
M |
a matrix of locations (n x 2, latitude and longitude) |
cutoff |
the distance cutoff (bandwidth) in km |
kernel |
(string) kernel function (default is bartlett-triangular) |
dist_fn |
(string) distance function (Haversine) |
Value
A symmetric n x n matrix of kernel weights with 1s on the diagonal.
Temporal Variance Component (Time Dimension)
Description
Temporal Variance Component (Time Dimension)
Usage
TimeDist(times, cutoff, X, e, n1, k)
Arguments
times |
numeric vector of time period identifiers |
cutoff |
the lag cutoff (number of time periods) |
X |
model matrix (n x k) |
e |
vector of residuals (length n) |
n1 |
number of observations |
k |
number of regressors |
Value
A k x k matrix representing the temporal X'ee'X component.
Centroids of Contiguous US Counties (2017)
Description
An sf data frame containing the centroids of all 3,108 counties of the
contiguous United States (2017 geographies), along with synthetic
spatially-correlated noise variables for use in examples and vignettes.
Usage
US_counties_centroids
Format
An sf data frame with 3,108 rows and the following columns:
- STATE
Numeric state FIPS code.
- NAME
County name.
- NAMELSAD
County name with legal/statistical area description.
- GISJOIN
Unique ID for joining with IPUMS data (2017 geographies).
- lat
Latitude of the county centroid (WGS84).
- lon
Longitude of the county centroid (WGS84).
- unit
Cross-sectional unit identifier (constant
1for cross-sectional use).- year
Time period identifier (constant
1for cross-sectional use).- noise1
Synthetic spatially-correlated variable (noise 1).
- noise2
Synthetic spatially-correlated variable (noise 2).
- dist
Distance variable (spatially non-stationary example).
- geometry
Point geometry column (NAD83 / EPSG:4269).
Source
IPUMS NHGIS, University of Minnesota, https://www.nhgis.org.
Spatial Variance Component (X'ee'X)
Description
Spatial Variance Component (X'ee'X)
Usage
XeeXhC(M, cutoff, X, e, n1, k, kernel = "bartlett", dist_fn = "Haversine")
Arguments
M |
matrix of locations (n x 2, latitude and longitude) |
cutoff |
the distance cutoff (bandwidth) in km |
X |
model matrix (n x k) |
e |
vector of residuals (length n) |
n1 |
number of observations |
k |
number of regressors |
kernel |
(string) kernel function (default is bartlett-triangular) |
dist_fn |
(string) distance function (Haversine) |
Value
A k x k matrix representing the spatial X'ee'X component.
Spatial Variance Component for Large Datasets
Description
Memory-efficient variant that avoids constructing the full n x n distance matrix.
Usage
XeeXhC_Lg(M, cutoff, X, e, n1, k, kernel = "bartlett", dist_fn = "Haversine")
Arguments
M |
matrix of locations (n x 2, latitude and longitude) |
cutoff |
the distance cutoff (bandwidth) in km |
X |
model matrix (n x k) |
e |
vector of residuals (length n) |
n1 |
number of observations |
k |
number of regressors |
kernel |
(string) kernel function (default is bartlett-triangular) |
dist_fn |
(string) distance function (Haversine) |
Value
A k x k matrix representing the spatial X'ee'X component.
Compute Conley Standard Error for a Single Coefficient
Description
Convenience wrapper around conley_SE() that returns only the spatial
Conley standard error for the first regressor. Useful for quick extraction
of Conley SEs in scripting contexts.
Usage
compute_conley_lfe(lfeobj, cutoff, kernel_choice = "bartlett", ...)
Arguments
lfeobj |
A regression object of class |
cutoff |
Numeric. The spatial bandwidth (cutoff distance in km). |
kernel_choice |
Character string specifying the kernel function.
Default is |
... |
Additional arguments passed to |
Value
A single numeric value: the spatial Conley standard error for the first (or only) regressor.
References
Lehner, A. (2026). Bandwidth selection for spatial HAC standard errors. arXiv preprint arXiv:2603.03997. doi:10.48550/arXiv.2603.03997
Conley, T. G. (1999). GMM estimation with cross sectional dependence. Journal of Econometrics, 92(1), 1–45. doi:10.1016/S0304-4076(98)00084-0
Examples
data(US_counties_centroids)
if (requireNamespace("lfe", quietly = TRUE)) {
reg <- lfe::felm(noise1 ~ noise2 | unit + year | 0 | lat + lon,
data = US_counties_centroids, keepCX = TRUE)
compute_conley_lfe(reg, cutoff = 500)
}
Conley Spatial HAC Variance-Covariance Estimation
Description
Computes Conley (1999) spatial HAC (Heteroskedasticity and Autocorrelation
Consistent) variance-covariance matrices for regression models estimated
with lfe::felm(). Supports cross-sectional spatial correlation,
serial (temporal) correlation, and the combined spatial HAC estimator.
Multiple kernel functions and distance metrics are available.
Usage
conley_SE(
reg,
unit,
time,
lat,
lon,
kernel = "bartlett",
dist_fn = "Haversine",
dist_cutoff = 500,
lag_cutoff = 5,
lat_scale = 111,
verbose = FALSE,
cores = 1,
balanced_pnl = FALSE
)
Arguments
reg |
A regression object of class |
unit |
Character string naming the cross-sectional unit identifier
variable in the |
time |
Character string naming the time period identifier variable
in the |
lat |
Character string naming the latitude variable in the |
lon |
Character string naming the longitude variable in the |
kernel |
Character string specifying the kernel function for spatial
weighting. One of |
dist_fn |
Character string specifying the distance function.
|
dist_cutoff |
Numeric. The spatial bandwidth (cutoff distance in km)
beyond which observations receive zero weight. Default is |
lag_cutoff |
Numeric. The temporal bandwidth (number of time periods)
for serial correlation correction. Default is |
lat_scale |
Numeric. Scaling factor for latitude (km per degree).
Default is |
verbose |
Logical. If |
cores |
Integer. Number of CPU cores for parallel computation via
|
balanced_pnl |
Logical. If |
Value
A named list of three variance-covariance matrices, each of dimension k x k where k is the number of regressors:
- OLS
The standard OLS variance-covariance matrix from the felm object.
- Spatial
The spatial-only Conley VCV, correcting for cross-sectional spatial correlation.
- Spatial_HAC
The full spatial HAC VCV, correcting for both spatial and serial correlation.
References
Christensen, D., Hartman, A. C. and Samii, C. (2021). Legibility and external investment: An institutional natural experiment in Liberia. International Organization, 75(4), 1087–1108. doi:10.1017/S0020818321000187
Conley, T. G. (1999). GMM estimation with cross sectional dependence. Journal of Econometrics, 92(1), 1–45. doi:10.1016/S0304-4076(98)00084-0
Lehner, A. (2026). Bandwidth selection for spatial HAC standard errors. arXiv preprint arXiv:2603.03997. doi:10.48550/arXiv.2603.03997
Newey, W. K. and West, K. D. (1987). A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica, 55(3), 703–708. doi:10.2307/1913610
Examples
data(US_counties_centroids)
if (requireNamespace("lfe", quietly = TRUE)) {
reg <- lfe::felm(noise1 ~ noise2 | unit + year | 0 | lat + lon,
data = US_counties_centroids, keepCX = TRUE)
vcvs <- conley_SE(reg, unit = "unit", time = "year",
lat = "lat", lon = "lon",
kernel = "bartlett", dist_cutoff = 500)
# Spatial standard errors:
sqrt(diag(vcvs$Spatial))
}
Extract Coordinates as Columns from an sf Object
Description
Extracts point coordinates from an sf or sfc object and returns
them as a tibble. This is a lightweight alternative to
sf::st_coordinates() that returns a tibble directly.
Usage
coords_as_columns(x, names = c("x", "y"))
Arguments
x |
An |
names |
Character vector of length 2 specifying the column names
for the x and y coordinates. Default is |
Value
A tibble::tibble() with two columns named according to the
names argument, containing the x and y coordinates.
References
Pebesma, E. (2018). Simple features for R: Standardized support for spatial vector data. The R Journal, 10(1), 439–446. doi:10.32614/RJ-2018-009
Examples
data(US_counties_centroids)
coords <- coords_as_columns(US_counties_centroids)
head(coords)
Covariogram Range Estimation and Visualization
Description
Estimates a covariogram from an sf data frame (either from a single
variable or regression residuals) using gstat::variogram() with
covariogram = TRUE, extracts the zero-crossing as the estimated
correlation range via extract_corr_range(), and produces a diagnostic
plot.
Usage
covgm_range(
df.input,
depvar = "noise1",
indepvar = "noise2",
maxdist = NA,
spacing = NA,
single.variable = FALSE
)
Arguments
df.input |
An |
depvar |
Character string naming the dependent variable. Default
is |
indepvar |
Character string naming the independent variable (used
when |
maxdist |
Numeric. Maximum distance for the covariogram (in metres).
Default is |
spacing |
Numeric. Bin width for the covariogram (in metres). Default
is |
single.variable |
Logical. If |
Value
A ggplot2::ggplot() object showing the covariogram with the
estimated correlation range marked by a vertical red line and annotated
text.
References
Lehner, A. (2026). Bandwidth selection for spatial HAC standard errors. arXiv preprint arXiv:2603.03997. doi:10.48550/arXiv.2603.03997
Pebesma, E. J. (2004). Multivariable geostatistics in S: the gstat package. Computers & Geosciences, 30(7), 683–691. doi:10.1016/j.cageo.2004.03.012
Examples
data(US_counties_centroids)
if (requireNamespace("fixest", quietly = TRUE) &&
requireNamespace("gstat", quietly = TRUE) &&
requireNamespace("ggplot2", quietly = TRUE)) {
covgm_range(US_counties_centroids)
}
Extract Correlation Range from a Correlogram or Covariogram
Description
Identifies the distance at which spatial autocorrelation first crosses
zero, providing an estimate of the spatial correlation range. Works with
correlograms from ncf::correlog() and covariograms from
gstat::variogram() (with covariogram = TRUE).
Usage
extract_corr_range(input, returnzeroifNA = FALSE)
Arguments
input |
A correlogram object from |
returnzeroifNA |
Logical. If |
Details
For correlograms, the function detects the first sign change in the rounded and floored correlation values. For covariograms, it finds the first index where gamma transitions from positive to non-positive. In both cases, the estimated range is the midpoint between the last positive and first non-positive distance bins.
Value
A numeric value representing the estimated correlation range. For covariograms, the unit is km (distance in metres divided by 1000). For correlograms, the unit matches the input distance unit.
References
Lehner, A. (2026). Bandwidth selection for spatial HAC standard errors. arXiv preprint arXiv:2603.03997. doi:10.48550/arXiv.2603.03997
Pebesma, E. J. (2004). Multivariable geostatistics in S: the gstat package. Computers & Geosciences, 30(7), 683–691. doi:10.1016/j.cageo.2004.03.012
Examples
# With a mock gstatVariogram:
mock_vgm <- data.frame(
np = rep(100, 10),
dist = seq(50000, 500000, by = 50000),
gamma = c(5, 3, 2, 1, 0.5, -0.2, -0.5, -0.3, -0.1, -0.05),
dir.hor = 0, dir.ver = 0, id = "var1"
)
class(mock_vgm) <- c("gstatVariogram", "data.frame")
extract_corr_range(mock_vgm)
Custom Glance Method for lm_sac Output
Description
Extracts goodness-of-fit statistics including Moran's I statistics and
correlation ranges from an lm object augmented by lm_sac().
Used by modelsummary::modelsummary() for table formatting.
Usage
## S3 method for class 'lm'
glance_custom(x, ...)
Arguments
x |
An |
... |
Additional arguments (currently unused). |
Value
A one-row data.frame with columns Model, y_mean, y_SD,
nobs, Moran_y, Moran_resid, Range_resid, and Range_y.
Goodness-of-Fit Mapping for modelsummary
Description
A list specifying how goodness-of-fit statistics should be displayed
in modelsummary::modelsummary() output.
Usage
gm.param
Format
An object of class list of length 5.
Weighted Mean Centre (Centre of Gravity)
Description
Computes the mean centre of an sf data frame, optionally weighted by
an attribute variable. The function first extracts polygon or point
centroids using sf::st_centroid(), then calculates the (weighted)
arithmetic mean of the x and y coordinates. This is the two-dimensional
analogue of a weighted mean and corresponds to the "centre of gravity"
or "mean centre" in spatial statistics. A common use case is computing
the population-weighted centroid of a set of administrative units.
Usage
gravity_centroid(df.sf, weight = NA)
Arguments
df.sf |
An |
weight |
Numeric vector of weights with length equal to |
Value
An sfc_POINT object (CRS 4326 / WGS84) representing the
(weighted) mean centre as a single point.
References
Arlinghaus, S. L. (1994). Practical Handbook of Spatial Statistics. CRC Press.
Examples
data(US_counties_centroids)
# Unweighted mean centre (geographic centroid of all county centroids)
gravity_centroid(US_counties_centroids)
# Weighted mean centre (shifted toward areas with higher noise1 values)
gravity_centroid(US_counties_centroids,
weight = abs(US_counties_centroids$noise1) + 1)
Create Spatial Grid Fixed Effects
Description
Overlays a regular rectangular grid on an sf data frame, intersects each
observation with the grid, and returns a factor variable identifying the
grid cell to which each observation belongs. This is useful for constructing
spatial fixed effects in regression models: by including grid_FE() as a
factor variable, the regression absorbs location-specific variation at the
resolution of the chosen grid. Finer grids absorb more spatial variation
but consume more degrees of freedom.
Usage
grid_FE(df.sf, size, distance = FALSE)
Arguments
df.sf |
An |
size |
Numeric. When |
distance |
Logical. If |
Details
The grid is constructed via sf::st_make_grid() and observations are
assigned to cells via sf::st_intersection(). Observations that fall
outside the grid (e.g., in coastal regions where cells do not cover the
full bounding box) are dropped during intersection.
Value
A factor vector with one element per observation that survived the intersection (see Details), where each level is a grid cell ID.
References
Pebesma, E. (2018). Simple features for R: Standardized support for spatial vector data. The R Journal, 10(1), 439–446. doi:10.32614/RJ-2018-009
Examples
data(US_counties_centroids)
# Create a 5 x 5 grid of spatial fixed effects
grid_ids <- grid_FE(US_counties_centroids, size = 5)
table(grid_ids)
# Finer grid (10 x 10)
grid_ids_fine <- grid_FE(US_counties_centroids, size = 10)
length(levels(grid_ids_fine))
Inverse-U Plot of Conley Standard Errors vs. Cutoff Distance
Description
Visualizes the relationship between the spatial bandwidth (cutoff distance) and the resulting Conley standard error. This typically reveals an inverse-U shaped relationship, helping to identify the appropriate bandwidth for spatial HAC estimation.
Usage
inverseu_plot_conleyrange(
df.input,
cutoffrange = NA,
kernel_choice_conley = "epanechnikov",
depvar = "noise1",
indepvar = "noise2",
range_add = FALSE,
...
)
Arguments
df.input |
An |
cutoffrange |
Numeric vector of cutoff distances (in km) to evaluate. |
kernel_choice_conley |
Character string specifying the kernel function.
Default is |
depvar |
Character string naming the dependent variable column.
Default is |
indepvar |
Character string naming the independent variable column.
Default is |
range_add |
Logical. If |
... |
Additional arguments (currently unused). |
Value
A ggplot2::ggplot() object showing Conley SE (y-axis) against
cutoff distance (x-axis), with the HC1 standard error as a grey
dashed horizontal reference line.
References
Lehner, A. (2026). Bandwidth selection for spatial HAC standard errors. arXiv preprint arXiv:2603.03997. doi:10.48550/arXiv.2603.03997
Conley, T. G. (1999). GMM estimation with cross sectional dependence. Journal of Econometrics, 92(1), 1–45. doi:10.1016/S0304-4076(98)00084-0
Examples
data(US_counties_centroids)
if (requireNamespace("lfe", quietly = TRUE) &&
requireNamespace("fixest", quietly = TRUE) &&
requireNamespace("ggplot2", quietly = TRUE)) {
inverseu_plot_conleyrange(US_counties_centroids,
cutoffrange = seq(100, 1000, by = 200))
}
Iterate over observations for spatial or serial correlation
Description
Internal helper function called by conley_SE() to compute the X'ee'X
component for a single time period (spatial) or unit (serial).
Usage
iterateObs(
dt,
Xvars,
d,
k,
sub_index,
type,
cutoff,
balanced_pnl,
verbose,
kernel,
dist_fn
)
Arguments
dt |
A |
Xvars |
Character vector of regressor names. |
d |
Pre-computed distance matrix (used when |
k |
Integer, number of regressors. |
sub_index |
The time period or unit index to subset on. |
type |
Character, either |
cutoff |
Numeric, the distance or lag cutoff. |
balanced_pnl |
Logical, whether the panel is balanced. |
verbose |
Logical, whether to print progress. |
kernel |
Character, the kernel function name. |
dist_fn |
Character, the distance function name. |
Value
A k x k numeric matrix representing the X'ee'X contribution.
Linear Model with Spatial Autocorrelation Diagnostics
Description
Estimates a linear regression via lfe::felm() and augments the output
with Moran's I tests for spatial autocorrelation, optional correlograms
for range estimation, and Conley spatial HAC standard errors. The returned
object has class "custom" prepended, enabling display via
modelsummary::modelsummary() with custom tidy and glance methods.
Usage
lm_sac(
formula.chr,
data.sf,
knn_number = 20,
conley_cutoff = 5,
conley_kernel = "bartlett",
correlograms = FALSE,
...
)
Arguments
formula.chr |
Character string specifying the regression formula in
|
data.sf |
An |
knn_number |
Integer. Number of nearest neighbours for the spatial
weights matrix used in Moran's I tests. Default is |
conley_cutoff |
Numeric. Spatial bandwidth (cutoff distance in km)
for the Conley standard error. Default is |
conley_kernel |
Character string specifying the kernel function.
Default is |
correlograms |
Logical. If |
... |
Additional arguments passed to |
Value
An object of class c("custom", "lm") with additional components:
- spatial_FE
Character, the spatial fixed effect variable name.
- Moran_lmresid
Moran's I test statistic on the OLS residuals, or
NAif the test failed.- Moran_response
Moran's I test statistic on the response variable, or
NAif the test failed.- correlog.range_resid
Estimated correlation range from the residual correlogram (km), or
NAifcorrelograms = FALSE.- correlog.range_response
Estimated correlation range from the response correlogram (km), or
NAifcorrelograms = FALSE.- conley_SE
Numeric vector of Conley spatial standard errors (with 0 for intercept and higher-order FE coefficients).
- conley_SE_flex
Conley SEs using the correlogram-based cutoff, or
NAifcorrelograms = FALSE.
References
Lehner, A. (2026). Bandwidth selection for spatial HAC standard errors. arXiv preprint arXiv:2603.03997. doi:10.48550/arXiv.2603.03997
Conley, T. G. (1999). GMM estimation with cross sectional dependence. Journal of Econometrics, 92(1), 1–45. doi:10.1016/S0304-4076(98)00084-0
Moran, P. A. P. (1950). Notes on continuous stochastic phenomena. Biometrika, 37(1/2), 17–23. doi:10.2307/2332142
Bivand, R. S., Pebesma, E. and Gomez-Rubio, V. (2013). Applied Spatial Data Analysis with R. 2nd ed. Springer.
Examples
data(US_counties_centroids)
if (requireNamespace("lfe", quietly = TRUE) &&
requireNamespace("spdep", quietly = TRUE) &&
requireNamespace("stringr", quietly = TRUE) &&
requireNamespace("dplyr", quietly = TRUE) &&
requireNamespace("sandwich", quietly = TRUE)) {
out <- lm_sac("noise1 ~ noise2 | unit + year | 0 | lat + lon",
US_counties_centroids, conley_cutoff = 500)
out$conley_SE
}
Objects exported from other packages
Description
These objects are imported from other packages. Follow the links below to see their documentation.
Custom Tidy Method for lm_sac Output
Description
Extracts coefficient estimates, HC1 robust standard errors, and Conley
spatial standard errors from an lm object augmented by lm_sac().
Used by modelsummary::modelsummary() for table formatting.
Usage
## S3 method for class 'lm'
tidy_custom(x, ...)
Arguments
x |
An |
... |
Additional arguments (currently unused). |
Value
A data.frame with columns term, estimate, std.error
(HC1 robust), conf.high (Conley SE), and conley (Conley SE).