A tutorial for the geodetector R package

Chengdong Xu, Yue Hou, Jinfeng Wang, Qian Yin (IGSNRR, CAS)

2024-08-20

Geodetector method

Spatial stratified heterogeneity (SSH) means that observations within a stratum are more similar to each other than observations in different strata; land-use types and climate zones are typical examples. SSH is ubiquitous in spatial data. Because it is structure rather than randomness, SSH carries information, and it has served as a window for understanding nature since the time of Aristotle. It also has a practical consequence: a model with global parameters is confounded when the input data are SSH, but the problem dissolves once the strata are identified, because a simple model can then be applied to each stratum separately. Note that “spatial” here can refer either to geographical space or to a space in the mathematical sense.

Geodetector is a novel tool for investigating SSH. It can (1) measure and detect the SSH of a variable Y; (2) test the power of a determinant X of a dependent variable Y, based on the consistency between their spatial distributions; and (3) investigate the interaction of two explanatory variables X1 and X2 on a dependent variable Y. All three tasks are carried out with the geographical detector q-statistic: \[\begin{equation} q=1- \frac{1}{N\sigma^2}\sum_{h=1}^{L}N_h\sigma_h^2 \end{equation}\]

where \(N\) and \(\sigma^2\) stand for the number of units and the variance of Y in the study area, respectively; the population Y is composed of L strata (h = 1, 2, …, L), and \(N_h\) and \(\sigma_h^2\) stand for the number of units and the variance of Y in stratum h, respectively. The strata of Y (red polygons in Figure 1) are a partition of Y, either by Y itself (h(Y) in Figure 1) or by a categorical explanatory variable X (h(X) in Figure 1). If X is a numerical variable, it must be stratified first; the number of strata L might be 2-10 or more, chosen according to prior knowledge or a classification algorithm.
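The formula can be checked by hand on a small example. The following is a minimal sketch in base R, not the package's implementation; the helper names `pop_var` and `q_stat` and the toy data are hypothetical, and population (divide-by-N) variances are assumed so that N·σ² is a sum of squares.

```r
# Population (divide-by-N) variance, so that N * variance is a sum of squares.
pop_var <- function(v) mean((v - mean(v))^2)

# q-statistic of y under the stratification given by `strata`.
q_stat <- function(y, strata) {
  n_h   <- tapply(y, strata, length)    # N_h: number of units per stratum
  var_h <- tapply(y, strata, pop_var)   # sigma_h^2: variance per stratum
  1 - sum(n_h * var_h) / (length(y) * pop_var(y))
}

# Toy data (hypothetical): three strata with clearly different means.
y      <- c(1, 2, 3, 11, 12, 13, 21, 22, 23)
strata <- rep(c("a", "b", "c"), each = 3)

q_abc <- q_stat(y, strata)        # close to 1: strong stratification
q_one <- q_stat(y, rep("a", 9))   # a single stratum gives q = 0
c(q_abc, q_one)
```

Here the within-strata sum of squares is 6 and the total sum of squares is 606, so q = 1 - 6/606, which is about 0.99.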

Figure 1. Principle of geodetector

(Notation: Yi stands for the value of a variable Y at a sample unit i ; h(Y) represents a partition of Y ; h(X) represents a partition of an explanatory variable X. In geodetector, the terms “stratification”, “classification” and “partition” are equivalent.)
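Since a numerical X has to be turned into strata before q can be computed, a short sketch of one possible stratification may help. It uses quantile breaks with base R `cut`; the data, the choice of L = 4 and the quantile rule are all assumptions made for illustration, and other rules (equal intervals, natural breaks, expert thresholds) are equally legitimate.

```r
# Hypothetical numerical explanatory variable.
x <- c(2.1, 3.4, 5.0, 7.7, 8.2, 9.9, 12.5, 15.0, 18.3, 21.0, 25.4, 30.2)

# Quantile breaks for L = 4 strata.
L      <- 4
breaks <- quantile(x, probs = seq(0, 1, length.out = L + 1))

# Integer stratum codes 1..L; include.lowest keeps the minimum in stratum 1.
x_strata <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)

table(x_strata)   # number of units per stratum
```

The resulting integer codes can be stored as a new column of the data.frame and used as a categorical explanatory variable.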

Interpretation of the q value (please refer to Figure 1). The value of q lies in [0, 1].

If Y is stratified by itself, h(Y), then q = 0 indicates that Y is not SSH and q = 1 indicates that Y is perfectly SSH; in between, the value of q gives the degree of SSH of Y.

If Y is stratified by an explanatory variable, h(X), then q = 0 indicates that there is no association between Y and X, and q = 1 indicates that Y is completely determined by X; in between, the q-statistic indicates that X explains 100q% of Y. Note that the q-statistic measures the association between X and Y whether it is linear or nonlinear.
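The point about nonlinear association can be illustrated with a constructed example in which Y depends on X through a symmetric U-shape. This is a sketch on hypothetical data, with a hand-written `q_stat` helper that follows the q formula given earlier rather than calling the package.

```r
# q-statistic written as 1 - (within-strata sum of squares) / (total sum of squares).
q_stat <- function(y, strata) {
  ss <- function(v) sum((v - mean(v))^2)
  1 - sum(tapply(y, strata, ss)) / ss(y)
}

x <- rep(1:5, each = 4)   # five strata, four units each
y <- (x - 3)^2            # U-shaped dependence on x

r <- cor(x, y)            # Pearson correlation: no linear association
q <- q_stat(y, x)         # q with x as the stratification
c(correlation = r, q = q)
```

The Pearson correlation is zero because the relationship is symmetric around x = 3, while q equals 1 because Y is constant within every stratum of X.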

For more details on the Geodetector method, please refer to:

[1] Wang JF, Li XH, Christakos G, Liao YL, Zhang T, Gu X, Zheng XY. Geographical detectors-based health risk assessment and its application in the neural tube defects study of the Heshun Region, China. International Journal of Geographical Information Science, 2010, 24(1): 107-127.

[2] Wang JF, Zhang TL, Fu BJ. A measure of spatial stratified heterogeneity. Ecological Indicators, 2016, 67: 250-256.

[3] Wang JF, Xu CD. Geodetector: Principle and prospective. Acta Geographica Sinica, 2017, 72(1): 116-134.

R package for geodetector

The geodetector package includes five functions: factor_detector, interaction_detector, risk_detector, ecological_detector and geodetector. The first four implement the calculation of the factor detector, interaction detector, risk detector and ecological detector, respectively, and work on table data, e.g. data read from CSV format (Table 1). The last function, geodetector, is an auxiliary function that implements the calculation for map data in shapefile format (Figure 2).

Table 1. Demo data in table format
incidence watershed soiltype elevation
7.20 2 3 6
7.01 2 3 6
6.79 2 3 6
6.73 4 3 6
6.77 4 3 1
6.74 4 3 6

The geodetector package works on data.frame objects. Please check the data type in advance.
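A data type check of this kind takes only a few lines of base R. The sketch below assumes, purely for illustration, that the first rows of Table 1 arrive as a matrix and need to be converted; the object names `m` and `df` are hypothetical.

```r
# Hypothetical input: the first three rows of Table 1 held in a matrix.
m <- matrix(
  c(7.20, 2, 3, 6,
    7.01, 2, 3, 6,
    6.79, 2, 3, 6),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("incidence", "watershed", "soiltype", "elevation"))
)

is.data.frame(m)          # FALSE: a matrix is not a data.frame
df <- as.data.frame(m)    # convert before calling the detector functions
is.data.frame(df)         # TRUE
sapply(df, class)         # class of each column
```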

As a demo, data on neural-tube birth defects (NTD), Y, and suspected risk factors or their proxies, Xs, in villages are provided. They comprise a GIS layer for the health effect and GIS layers for the environmental factors “elevation”, “soil type” and “watershed”.

Figure 2. Demo data in GIS format (a) NTDs prevalence Y, (b) Elevation X1, (c) Soil types X2, (d) Watersheds X3

Install the geodetector package with the install.packages function:

install.packages("geodetector")

Load package:

library(geodetector)

Read data in table format:

data(CollectData)

Data class:

class(CollectData)
## [1] "data.frame"

Field names:

names(CollectData)
## [1] "incidence" "watershed" "soiltype"  "elevation"

1 Factor detector

The factor detector q-statistic measures the SSH of a variable Y, or the determinant power of a covariate X of Y.

The function factor_detector implements the factor detector. In the following demo, the first parameter “incidence” represents the explained variable, the second parameter “elevation” represents the explanatory variable, and the third parameter “CollectData” represents the dataset.

The output of the function includes the q-statistic and the corresponding p-value.

factor_detector("incidence","elevation",CollectData) 
## [[1]]
##           q-statistic    p-value
## elevation   0.6067087 0.04080407

Alternatively, the fields can be given by column index. In the following demo, the first parameter “1” refers to the explained variable in the first column of the dataset, and the second parameter “3” refers to the explanatory variable in the third column of the dataset.

factor_detector(1,3, CollectData)
## [[1]]
##          q-statistic   p-value
## soiltype   0.3857168 0.3632363
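Because index arguments depend on column order, it can be useful to translate between indices and field names before calling a detector. The sketch below uses a plain character vector, `fields`, copied from the names(CollectData) output shown earlier, so it runs without the package; the variable name is hypothetical.

```r
# Field names in column order, as printed by names(CollectData).
fields <- c("incidence", "watershed", "soiltype", "elevation")

fields[3]                    # name of the field in column 3
match("elevation", fields)   # column index of a field name
fields[c(2, 3, 4)]           # names behind the index vector c(2, 3, 4)
```

Column 3 is soiltype, which is why the index call above reports the q-statistic for soiltype.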

If there is more than one explanatory variable, pass them as a vector, either of field names, c(“soiltype”,“watershed”,“elevation”), or of column indices, c(2,3,4). The output shown below follows the index form, in which columns 2, 3 and 4 are watershed, soiltype and elevation.

factor_detector("incidence",c("soiltype","watershed","elevation"),CollectData)

or

factor_detector(1,c(2,3,4), CollectData)
## [[1]]
##           q-statistic      p-value
## watershed   0.6377737 0.0001169914
## 
## [[2]]
##          q-statistic   p-value
## soiltype   0.3857168 0.3632363
## 
## [[3]]
##           q-statistic    p-value
## elevation   0.6067087 0.04080407
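With several factors, a natural next step is to rank them by q and flag those with a small p-value. The sketch below copies the numbers from the output above into a small data.frame instead of calling the package, so it is self-contained; the object names and the 0.05 threshold are assumptions for illustration.

```r
# Values copied from the factor_detector output above.
res <- data.frame(
  q = c(0.6377737, 0.3857168, 0.6067087),
  p = c(0.0001169914, 0.3632363, 0.04080407),
  row.names = c("watershed", "soiltype", "elevation")
)

# Sort by explanatory power and flag p-values below the assumed 0.05 level.
ranked <- res[order(res$q, decreasing = TRUE), ]
ranked$significant <- ranked$p < 0.05
ranked
```

In this demo, watershed and elevation have the largest q values and p-values below 0.05, whereas soiltype does not reach that level.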

2 Interaction detector

The interaction detector reveals whether the risk factors X1 and X2 (and more X) have an interactive influence on a disease Y.

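The function interaction_detector implements the interaction detector. In the following demo, the first parameter “incidence” represents the explained variable, the second parameter c(“soiltype”,“watershed”,“elevation”) represents the explanatory variables, and the third parameter “CollectData” represents the dataset. For each pair of explanatory variables, the output gives the q value of their interaction, followed by the q value of each variable on its own.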

interaction_detector("incidence",c("soiltype","watershed","elevation"),CollectData)
##       [,1]        [,2]        [,3]               
##  [1,] "soiltype"  "watershed" "0.735680548139531"
##  [2,] "soiltype"  "soiltype"  "0.385716842809428"
##  [3,] "watershed" "watershed" "0.637773670070423"
##  [4,] "soiltype"  "elevation" "0.663523698335635"
##  [5,] "soiltype"  "soiltype"  "0.385716842809428"
##  [6,] "elevation" "elevation" "0.606708709727727"
##  [7,] "watershed" "elevation" "0.71359677853471" 
##  [8,] "watershed" "watershed" "0.637773670070423"
##  [9,] "elevation" "elevation" "0.606708709727727"
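To interpret such output, the interaction q is usually compared with the two single-factor q values. The sketch below encodes the interaction typology commonly used in the geodetector literature; it is not part of this package's output, and the function name `classify_interaction` and its labels are hypothetical. The printed values above are quoted strings, which suggests a character matrix, so they are converted with as.numeric first.

```r
# Classify an interaction from the single-factor q values (q1, q2)
# and the q value of the combined stratification (q12).
classify_interaction <- function(q1, q2, q12) {
  if (isTRUE(all.equal(q12, q1 + q2))) {
    "Independent"
  } else if (q12 > q1 + q2) {
    "Enhance, nonlinear"
  } else if (q12 > max(q1, q2)) {
    "Enhance, bi-"
  } else if (q12 > min(q1, q2)) {
    "Weaken, uni-"
  } else {
    "Weaken, nonlinear"
  }
}

# Values copied from rows 1-3 of the output above (soiltype and watershed).
q_pair      <- as.numeric("0.735680548139531")
q_soiltype  <- as.numeric("0.385716842809428")
q_watershed <- as.numeric("0.637773670070423")

type_sw <- classify_interaction(q_soiltype, q_watershed, q_pair)
type_sw
```

For soiltype and watershed, the interaction q (0.736) exceeds both single-factor values but not their sum (1.023), so under this typology the two factors enhance each other.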

3 Risk detector

The risk detector calculates the average value of Y in each stratum of an explanatory variable X, and tests whether the averages of any two strata differ.

The function risk_detector implements the risk detector. In the following demo, the first parameter “incidence” represents the explained variable, the second parameter “soiltype” represents the explanatory variable, and the third parameter “CollectData” represents the dataset.

The result for each explanatory variable is presented in two parts.

The first part gives the average value of the explained variable in each stratum of the explanatory variable.

The second part tests whether the means of each pair of strata differ significantly; if the difference is significant (t-test at a significance level of 0.05), the corresponding value is “TRUE”, otherwise it is “FALSE”.

risk_detector("incidence","soiltype",CollectData)
## [[1]]
## [[1]]$`Risk Detector`
##   soiltype Mean of explained variable
## 1        1                   6.340000
## 2        2                   6.687500
## 3        3                   6.583279
## 4        4                   5.843810
## 5        5                   6.347073
## 
## [[1]]$`Significance t-test:0.05`
##       1     2     3     4     5
## 1 FALSE  TRUE  TRUE  TRUE FALSE
## 2  TRUE FALSE FALSE  TRUE  TRUE
## 3  TRUE FALSE FALSE  TRUE  TRUE
## 4  TRUE  TRUE  TRUE FALSE  TRUE
## 5 FALSE  TRUE  TRUE  TRUE FALSE
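The two parts of this output can be imitated with base R to see what they contain. The sketch below is an illustration on hypothetical data, not the package's code; in particular it uses R's default (Welch) t.test, which may differ from the test variant the package applies.

```r
# Hypothetical data: strata "1" and "2" share a mean, stratum "3" is higher.
y <- c(5.0, 5.2, 4.8, 5.1, 4.9,
       5.1, 4.9, 5.0, 5.2, 4.8,
       8.0, 8.2, 7.8, 8.1, 7.9)
x <- rep(c("1", "2", "3"), each = 5)

# Part 1: mean of the explained variable in each stratum.
means <- tapply(y, x, mean)

# Part 2: pairwise t-tests at the 0.05 level; the diagonal stays FALSE.
lv  <- names(means)
sig <- matrix(FALSE, length(lv), length(lv), dimnames = list(lv, lv))
for (i in lv) for (j in lv) {
  if (i != j) sig[i, j] <- t.test(y[x == i], y[x == j])$p.value < 0.05
}

means
sig
```

As in the package output, the matrix is symmetric and TRUE marks pairs of strata whose means differ significantly.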

Alternatively, the fields can be given by column index. In the following demo, the first parameter “1” refers to the explained variable in the first column of the dataset, and the second parameter “2” refers to the explanatory variable in the second column of the dataset (watershed).

risk_detector(1,2, CollectData)
## [[1]]
## [[1]]$`Risk Detector`
##   watershed Mean of explained variable
## 1         1                   6.167813
## 2         2                   6.813103
## 3         3                   6.474231
## 4         4                   6.728000
## 5         5                   5.910000
## 6         6                   5.845714
## 7         7                   6.494167
## 8         8                   6.360769
## 9         9                   6.579231
## 
## [[1]]$`Significance t-test:0.05`
##       1     2     3     4     5     6     7     8     9
## 1 FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## 2  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## 3  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
## 4  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## 5  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
## 6  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
## 7  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
## 8  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
## 9  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE

If there is more than one explanatory variable, pass them as a vector, either of field names, c(“soiltype”,“watershed”,“elevation”), or of column indices, c(2,3,4). The output shown below follows the index form, in which columns 2, 3 and 4 are watershed, soiltype and elevation.

risk_detector("incidence",c("soiltype","watershed","elevation"),CollectData)

or

risk_detector(1,c(2,3,4), CollectData)
## [[1]]
## [[1]]$`Risk Detector`
##   watershed Mean of explained variable
## 1         1                   6.167813
## 2         2                   6.813103
## 3         3                   6.474231
## 4         4                   6.728000
## 5         5                   5.910000
## 6         6                   5.845714
## 7         7                   6.494167
## 8         8                   6.360769
## 9         9                   6.579231
## 
## [[1]]$`Significance t-test:0.05`
##       1     2     3     4     5     6     7     8     9
## 1 FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## 2  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## 3  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
## 4  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## 5  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
## 6  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
## 7  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
## 8  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
## 9  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE
## 
## 
## [[2]]
## [[2]]$`Risk Detector`
##   soiltype Mean of explained variable
## 1        1                   6.340000
## 2        2                   6.687500
## 3        3                   6.583279
## 4        4                   5.843810
## 5        5                   6.347073
## 
## [[2]]$`Significance t-test:0.05`
##       1     2     3     4     5
## 1 FALSE  TRUE  TRUE  TRUE FALSE
## 2  TRUE FALSE FALSE  TRUE  TRUE
## 3  TRUE FALSE FALSE  TRUE  TRUE
## 4  TRUE  TRUE  TRUE FALSE  TRUE
## 5 FALSE  TRUE  TRUE  TRUE FALSE
## 
## 
## [[3]]
## [[3]]$`Risk Detector`
##   elevation Mean of explained variable
## 1         1                   6.455882
## 2         2                   6.171111
## 3         3                   6.258108
## 4         4                   6.621364
## 5         5                   5.908889
## 6         6                   6.888636
## 7         7                   5.790000
## 
## [[3]]$`Significance t-test:0.05`
##       1     2     3     4     5     6     7
## 1 FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## 2  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
## 3  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
## 4  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## 5  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
## 6  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
## 7  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

4 Ecological detector

The ecological detector tests whether two risk factors X1 and X2 differ significantly in their influence on the spatial distribution of Y.

The function ecological_detector implements the ecological detector. In the following demo, the first parameter “incidence” represents the explained variable, the second parameter c(“soiltype”,“watershed”) represents the explanatory variables, and the third parameter “CollectData” represents the dataset. The function uses the F statistic to test the difference at a significance level of 0.05.

ecological_detector("incidence",c("soiltype","watershed"),CollectData)
## $`Significance.F-test:0.05`
##           soiltype watershed
## soiltype     FALSE      TRUE
## watershed     TRUE     FALSE

If there are more than two variables, the function can be used as follows.

ecological_detector("incidence",c("soiltype","watershed","elevation"),CollectData)
## $`Significance.F-test:0.05`
##           soiltype watershed elevation
## soiltype     FALSE      TRUE      TRUE
## watershed     TRUE     FALSE     FALSE
## elevation     TRUE     FALSE     FALSE

where c(“soiltype”,“watershed”,“elevation”) gives the field names of the explanatory variables.
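The idea behind the F test can be sketched by hand. The example below assumes the form of the statistic described in the geodetector literature, F = [N1 (N2 - 1) SSW1] / [N2 (N1 - 1) SSW2], where SSW is the within-strata sum of squares of Y under each factor; with the same N units under both factors this reduces to a ratio of the two SSW values. The data, the helper `ssw`, and the choice to put the larger SSW in the numerator are assumptions made for illustration, and the package's internal implementation may differ.

```r
# Within-strata sum of squares of y: the sum over strata of N_h * sigma_h^2.
ssw <- function(y, strata) {
  sum(tapply(y, strata, function(v) sum((v - mean(v))^2)))
}

# Hypothetical data: x1 is aligned with the pattern in y, x2 is not.
y  <- c(1, 1.1, 0.9, 1, 5, 5.1, 4.9, 5, 9, 9.1, 8.9, 9)
x1 <- rep(1:3, each = 4)
x2 <- rep(1:3, times = 4)

n      <- length(y)
f_stat <- ssw(y, x2) / ssw(y, x1)              # equal N under both factors
f_crit <- qf(0.95, df1 = n - 1, df2 = n - 1)   # critical value at the 0.05 level

differs <- f_stat > f_crit
differs
```

Here x1 leaves very little within-strata variation while x2 leaves almost all of it, so the ratio is far above the critical value and the two factors are judged to differ.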