Path Coefficient Analysis

Ali Arminian

2024-09-23


1-Introduction

Path coefficient analysis, introduced by Sewall Wright in 1921 under the title “correlation and causation”, is an extended form of multiple regression analysis that decomposes correlation coefficients into direct, indirect, spurious and unanalyzed effects. It is a vital tool for studying cause-effect relationships among (normally distributed) variables. It comes in three forms: simple, sequential and multivariate; in the simple form there is a single dependent (endogenous) variable and one or more independent (exogenous) variables. Sewall Wright, the pioneer of path coefficient analysis, published numerous papers on the subject between 1916 and 1980. The method was initially met with skepticism but was later accepted and became widely used in the social sciences; today it is applied in almost all fields of research. For more information on path coefficient analysis see (Bondari 1990; Wright 1923, 1934, 1960; Li 1975; Wolfle 2003). To become more familiar with general statistical topics, such as descriptive statistics, refer to statistical references such as (Snedecor and Cochran 1980; Bhattacharyya and Johnson 1997; Draper and Smith 1981; Neter, Whitmore, and Wasserman 1992).

2- The Path coefficient model

In a path coefficient analysis, descriptive statistics and Pearson correlation coefficients (double-headed arrows) between variables may be estimated, a task this package performs. Simple or multiple linear regression of the dependent (endogenous) variable(s) on the independent (exogenous) variable(s) may also be carried out, which is likewise done here. In a sequential path coefficient analysis, intervening (endogenous) variables exist and the analyses are performed step by step with this package, whereas in a simple path coefficient analysis a single step suffices; in that case the package also draws the path diagram automatically, while complicated or sequential paths require some extra work, as discussed later in this manual. In a path model, the path coefficients or direct effects (\(P_i\)'s) indicate the direct effect of one variable on another. They are standardized partial regression coefficients (in Wright's terminology) because they are estimated from correlations or from transformed (standardized) data as \(P_i = \beta_i\frac{\sigma_{X_i}}{\sigma_Y}\). The path equations are as follows:

One dependent variable:

\[\begin{aligned} P_1 + P_2r_{12} + P_3r_{13} + \dots + P_nr_{1n} &= r_{Y1} \\ P_1r_{21} + P_2 + P_3r_{23} + \dots + P_nr_{2n} &= r_{Y2} \\ P_1r_{31} + P_2r_{32} + P_3 + \dots + P_nr_{3n} &= r_{Y3} \\ &\;\;\vdots \\ P_1r_{n1} + P_2r_{n2} + P_3r_{n3} + \dots + P_n &= r_{Yn} \end{aligned}\]
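As a minimal sketch outside the package (hypothetical toy data, not Path.Analysis code), this system can be solved directly in base R by inverting the correlation matrix among the predictors:

set.seed(1)
dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
dat$y <- 0.6 * dat$x1 - 0.3 * dat$x2 + rnorm(50)

R   <- cor(dat)                                     # full correlation matrix
Rxx <- R[c("x1", "x2", "x3"), c("x1", "x2", "x3")]  # correlations among predictors
rxy <- R[c("x1", "x2", "x3"), "y"]                  # correlations with the response

P <- solve(Rxx, rxy)  # direct effects P1..P3 solving the normal equations above
P

# The same values are the standardized partial regression coefficients:
coef(lm(scale(y) ~ scale(x1) + scale(x2) + scale(x3), data = dat))[-1]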

Extension to more dependent variables:

Our package can handle this case as well, as explained in detail below. As stated by Bondari (1990), for two dependent variables \(Y_1\) and \(Y_2\):

\[\begin{aligned} Y_1 &= p_1X_1 + p_2X_2 + p_3X_3 + \dots + p_nX_n \\ Y_2 &= p'_1X_1 + p'_2X_2 + p'_3X_3 + \dots + p'_nX_n \end{aligned}\]

where:

\[r_{Y_1Y_2} = p_1p'_1 + p_2p'_2 + p_3p'_3 + \dots + p_np'_n + \sum_{i \neq j} p_ip'_jr_{ij} = \sum_{i,j} p_ip'_jr_{ij}\]
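A hedged numeric check of this decomposition on simulated data (a toy sketch, not package output):

set.seed(2)
X  <- scale(matrix(rnorm(200 * 3), ncol = 3,
                   dimnames = list(NULL, c("X1", "X2", "X3"))))
Y1 <- scale(0.5 * X[, 1] + 0.3 * X[, 2] + rnorm(200, sd = 0.5))
Y2 <- scale(0.2 * X[, 1] - 0.4 * X[, 3] + rnorm(200, sd = 0.5))

Rxx2 <- cor(X)
p1   <- solve(Rxx2, cor(X, Y1))  # path coefficients p of Y1 on X1..X3
p2   <- solve(Rxx2, cor(X, Y2))  # path coefficients p' of Y2 on X1..X3

drop(t(p1) %*% Rxx2 %*% p2)  # sum over i, j of p_i * p'_j * r_ij
cor(Y1, Y2)                  # close to the value above; the small difference is
                             # the sample covariance between the two residual terms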

The relationships above are illustrated in Figures 1 and 2. The simple path diagram:

Fig. 1: A simple path diagram (courtesy of Sewall Wright)
Fig. 2: A multivariate path diagram (courtesy of Bondari, 1990)

The opening part of this vignette (instruction manual) provides a brief introduction to the concepts underpinning path coefficient analysis. The subsequent part showcases two practical demonstrations. In a path coefficient analysis, the Pearson correlation coefficients between dependent variables and their related independent variables are decomposed, as previously mentioned.
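Continuing the first toy sketch above (the hypothetical objects P, Rxx and rxy, not package output), the decomposition can be written as a matrix whose diagonal holds the direct effects and whose off-diagonal entries hold the indirect effects through the other predictors:

decomp <- sweep(Rxx, 2, P, "*")  # entry [i, j] = P_j * r_ij
round(decomp, 3)                 # diagonal = direct, off-diagonal = indirect effects
rowSums(decomp)                  # reproduces the correlations with Y, ...
rxy                              # ... as required by the normal equations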

Our Path.Analysis package can be applied in two cases: simple and sequential path coefficient analysis. If it is not already installed, the package is first installed through:

if(!require('Path.Analysis')){
    install.packages('Path.Analysis')
}
#> Loading required package: Path.Analysis
#> Registered S3 method overwritten by 'GGally':
#>   method from   
#>   +.gg   ggplot2
library('Path.Analysis')

The analyses require the following R packages:

library(car)
library(stats)
library(Hmisc)
library(pastecs)
library(devtools)
library(usethis)
library(testthat)
library(knitr)
library(rmarkdown)
    
## For graphical displays
library(metan)
library(ComplexHeatmap)
library(grDevices)
library(DiagrammeR)

2-1- Simple path coefficient analysis

2-1-1- worked example 1:

The 'dtsimp' data ships in the data folder of the Path.Analysis package. It is the simplest dataset in the package, consisting of a dependent variable called Y and three independent variables called X1, X2 and X3. In the RStudio console, type the following commands and run the analyses:

data(dtsimp)

head(dtsimp[1:3, ])

Correlation between variables:

corr(dtsimp, verbose = FALSE)

Simple linear regression between Y and X1-X3 vars:

reg(dtsimp, 1, verbose = FALSE)

Plot the main path diagram:

matdiag(dtsimp, 1)

#> [[1]]
#>        y    x1    x2    x3
#> y   1.00  0.43 -0.12  0.03
#> x1  0.43  1.00 -0.14  0.08
#> x2 -0.12 -0.14  1.00 -0.08
#> x3  0.03  0.08 -0.08  1.00
#> 
#> n= 105 
#> 
#> 
#> P
#>    y      x1     x2     x3    
#> y         0.0000 0.2226 0.7772
#> x1 0.0000        0.1682 0.4333
#> x2 0.2226 0.1682        0.4329
#> x3 0.7772 0.4333 0.4329       
#> 
#> [[2]]
#> [[2]]$p
#>               y           x1        x2        x3
#> y  0.000000e+00 4.281686e-06 0.2225777 0.7772096
#> x1 4.281686e-06 0.000000e+00 0.1682316 0.4333210
#> x2 2.225777e-01 1.682316e-01 0.0000000 0.4328677
#> x3 7.772096e-01 4.333210e-01 0.4328677 0.0000000
#> 
#> [[2]]$lowCI
#>             y         x1         x2         x3
#> y   1.0000000  0.2616079 -0.3046920 -0.1646039
#> x1  0.2616079  1.0000000 -0.3188570 -0.1161105
#> x2 -0.3046920 -0.3188570  1.0000000 -0.2650856
#> x3 -0.1646039 -0.1161105 -0.2650856  1.0000000
#> 
#> [[2]]$uppCI
#>             y         x1         x2        x3
#> y  1.00000000 0.57567143 0.07331527 0.2184383
#> x1 0.57567143 1.00000000 0.05769229 0.2650146
#> x2 0.07331527 0.05769229 1.00000000 0.1160351
#> x3 0.21843826 0.26501461 0.11603511 1.0000000
#> Warning in summary.lm(mlreg): essentially perfect fit: summary may be
#> unreliable
#> [[1]]
#> 
#> Call:
#> lm(formula = datap[, resp] ~ ., data = datap)
#> 
#> Coefficients:
#> (Intercept)            y           x1           x2           x3  
#>   1.109e-14    1.000e+00    3.295e-18   -1.762e-17    7.056e-17  
#> 
#> 
#> [[2]]
#> 
#> Call:
#> lm(formula = datap[, resp] ~ ., data = datap)
#> 
#> Residuals:
#>        Min         1Q     Median         3Q        Max 
#> -3.397e-15 -1.069e-15 -1.786e-16  9.219e-16  1.232e-14 
#> 
#> Coefficients:
#>               Estimate Std. Error   t value Pr(>|t|)    
#> (Intercept)  1.109e-14  1.644e-15  6.75e+00 9.81e-10 ***
#> y            1.000e+00  1.065e-17  9.39e+16  < 2e-16 ***
#> x1           3.295e-18  1.457e-17  2.26e-01    0.821    
#> x2          -1.762e-17  9.490e-17 -1.86e-01    0.853    
#> x3           7.056e-17  2.392e-16  2.95e-01    0.769    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.845e-15 on 100 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 2.722e+33 on 4 and 100 DF,  p-value: < 2.2e-16

Fig. 3: Diagram of the path coefficient analysis of ‘dtsimp’ sample dataset.

> Note: When the user works with external data, suppose the data are stored on the hard drive at the path Path/to/data in a file called mydata.xls. To load it in the RStudio console, follow these instructions:

library(readxl)  # after installing the readxl package

dtraw <- read_excel("Path/to/data/mydata.xls")

2-1-2- worked example 2:

The next dataset, called 'dtraw', is used in this part. It is also built into Path.Analysis and contains nine variables: one dependent variable called Y and eight independent variables labeled X1 through X8. The data come from a population of the oil crop Camelina in which seed oil content (Y) and the fatty acids C18, C18.1, C18.2, C18.3, C20.0, C20.1, C20.2 and C22.1 (marked as X1-X8) were measured. Type the following commands in the RStudio console and run them:

data(dtraw)

rownames(dtraw) <- dtraw[, 1]

dtraw[, 1] <- NULL

head(dtraw[1:4, ])

The output is as follows:

data(dtraw)
dtraw <- as.data.frame(dtraw)
rownames(dtraw) <- dtraw[, 1]
dtraw[, 1] <- NULL
head(dtraw[1:4, ])
#>         Y   X1    X2    X3    X4   X5    X6   X7   X8
#> DH1 38.58 2.20 15.61 15.05 35.37 1.29 14.16 1.49 3.20
#> DH2 38.73 2.23 15.34 15.56 34.50 1.23 14.46 1.47 3.33
#> DH3 38.87 2.14 16.66 15.41 36.82 1.24 14.06 2.07 3.19
#> DH4 36.72 2.84 14.46 16.42 34.33 1.27 14.07 1.38 3.13

This dataset can be analyzed via the Path.Analysis package as shown below; correlations can also be visualized with the 'corr_plot' function of the 'metan' package, thanks to (Olivoto and Dal'Col Lúcio 2020).

Running the 'corr_plot' function for 'dtsimp':

data(dtsimp)
corr_plot(dtsimp)

Fig. 4: Correlogram of dtsimp dataset, a built-in sample data.

Running the 'matdiag' function for the 'dtraw' dataset, ignoring the first column from the left (the genotype names already moved to the row names):
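The call behind Fig. 5 is presumably the same matdiag() call used for 'dtsimp', now applied to 'dtraw'; this is a sketch, not literal vignette code:

matdiag(dtraw, 1)  # Y (seed oil) is the first column of the prepared dtraw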

Fig. 5: Diagram of the path coefficient analysis of dtraw

The most significant part of the Path.Analysis package is drawing such diagrams, which are produced with the assistance of the DiagrammeR package.

Exercise caution when the Plot window in RStudio is too short. To resolve this issue, position the cursor at the top of the graph window until four-way arrows appear, then drag the top of the plot region upwards towards the variable list. If the figure-region problem originated from this, re-running the code without any modifications will generate the anticipated graph. Also ensure that the outer default margins are correctly sized and that the R plot area labels are not truncated; see https://www.programmingr.com/r-error-messages/r-figure-margins-too-large/.
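For base-graphics devices (the path diagram itself is rendered by DiagrammeR in the Viewer pane), a generic, hedged workaround is to open a larger device and shrink the margins before re-plotting:

dev.new(width = 9, height = 7)   # open a larger graphics window
op <- par(mar = c(4, 4, 2, 1))   # bottom, left, top, right margins (in lines)
# ... re-run the plotting call here ...
par(op)                          # restore the previous margin settings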

When the response (dependent) variable exists among the variables but is not the first column from the left:

data(heart)
desc(heart, 2)
#> $`Descriptive statistics:`
#>      Biking       Heart.disease        Smoking       
#>  Min.   : 1.119   Min.   : 0.5519   Min.   : 0.5259  
#>  1st Qu.:20.205   1st Qu.: 6.5137   1st Qu.: 8.2798  
#>  Median :35.824   Median :10.3853   Median :15.8146  
#>  Mean   :37.788   Mean   :10.1745   Mean   :15.4350  
#>  3rd Qu.:57.853   3rd Qu.:13.7240   3rd Qu.:22.5689  
#>  Max.   :74.907   Max.   :20.4535   Max.   :29.9467  
#> 
#> $`Descriptive statistics:`
#>                    Biking Heart.disease      Smoking
#> nbr.val      4.980000e+02   498.0000000  498.0000000
#> nbr.null     0.000000e+00     0.0000000    0.0000000
#> nbr.na       0.000000e+00     0.0000000    0.0000000
#> min          1.119154e+00     0.5518982    0.5258500
#> max          7.490711e+01    20.4534962   29.9467431
#> range        7.378796e+01    19.9015981   29.4208931
#> sum          1.881863e+04  5066.9199578 7686.6471384
#> median       3.582446e+01    10.3852547   15.8146139
#> mean         3.778841e+01    10.1745381   15.4350344
#> SE.mean      9.626099e-01     0.2048706    0.3714820
#> CI.mean.0.95 1.891286e+00     0.4025192    0.7298687
#> var          4.614556e+02    20.9020349   68.7234260
#> std.dev      2.148152e+01     4.5718743    8.2899593
#> coef.var     5.684684e-01     0.4493447    0.5370872
#> 
#> $`Correlation coefficients:`
#>                    Biking Heart.disease    Smoking
#> Biking         1.00000000    -0.9354555 0.01513618
#> Heart.disease -0.93545547     1.0000000 0.30913098
#> Smoking        0.01513618     0.3091310 1.00000000
# matdiag(heart, 2)

*Please note that the diagram is only produced automatically when there is a single dependent variable together with its related independent (causative) variables. In the dataset, the dependent variable (Y) should be the first variable from the left, and the other variables should be ordered from left to right, as in dtsimp or dtraw. In other words, when the target is a simple path coefficient analysis, you can call the function via matdiag(dtsimp, 1). The package produces the textual outputs (without graphs) under any conditions, even when there are missing data.*

2-2- Sequential path coefficient analysis

2-2-1- worked example:

As mentioned earlier, there are two types of path diagrams or methodologies: simple and multivariate. The multivariate form requires more steps and work, but the relationships between the variables are the same and easy to understand. A sequential path diagram is more complex because it includes intervening variables that need to be accounted for. Let us consider a specific scenario with a dataset; for more information see (Arminian et al. 2008). Assume our data are stored on the hard drive at the path "~path_to_data/" and named 'dtseq.xls'. To load this dataset into the RStudio console, follow these steps:

library(readxl)  # after installing the readxl package

dtseq <- read_excel("~path_to_data/dtseq.xls")

Methods like ‘Pearson’ or ‘Spearman’ can be used to analyze the correlation between variables. A correlogram is a tool that combines scatterplots and histograms, making it possible to examine the relationship between each pair of numeric variables in a matrix. The correlation is visually depicted in scatterplots, while the diagonal of the correlogram showcases the distribution of each variable using a histogram or density plot. (Source: https://python-graph-gallery.com/correlogram/) This analysis can be presented in the form of tables or matrices, which can be generated using the ‘PerformanceAnalytics’ and ‘metan’ packages.
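One hedged way to draw such a correlogram (assuming the PerformanceAnalytics package is installed and that the first column of dtseq holds genotype labels rather than a trait) is:

library(PerformanceAnalytics)
data(dtseq)
chart.Correlation(as.data.frame(dtseq[, -1]),  # drop the (assumed) label column
                  method = "pearson", histogram = TRUE)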

Step 1: YLD vs. FS, DFT, FW, FV:

library(metan)

data(dtseq)

dtseq1 <- dtseq[, c(2, 4, 3, 6, 5)]

head(dtseq1)

matdiag(dtseq1, 1)

#> # A tibble: 6 × 5
#>     YLD    FS   DFT    FW    FV
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 410.   31.6  51.7  8.69 10.0 
#> 2  84.7  38.5  52    8.16  8.33
#> 3 360.   25.3  54.7  7.48  5.65
#> 4 380.   33.9  49   10.0   9.33
#> 5 311.   24.7  50    9.19 10.0 
#> 6 404.   19.1  52.5  9.97  9.87

Fig. 6: Diagram of dtseq1, modified of the dtseq data.

Network diagrams, also known as graphs, visually depict the connections between a group of entities. Each entity is represented as a node or vertex, and the connections between nodes are shown as links or edges (source: https://www.data-to-viz.com/graph/network.html). In R, you can create network plots of the connections between objects using the 'corrr' package. It draws colored links, thin or thick depending on the strength of the correlation, to represent the correlations between objects. Take a look at the graph that illustrates the correlations for 'dtseq1'; it showcases a larger number of variables, making it visually appealing and informative.
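A hedged sketch of such a network plot with the 'corrr' package (the message printed below is the one correlate() produces; the min_cor threshold is an illustrative choice):

library(corrr)
cc <- correlate(dtseq1, method = "pearson")  # tidy correlation data frame
network_plot(cc, min_cor = 0.2)              # draw only links with |r| >= 0.2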

#> Correlation computed with
#> • Method: 'pearson'
#> • Missing treated using: 'pairwise.complete.obs'

Fig. 7: Network plot of the dtraw2.

Fig. 8: Heatmap of the dtraw2 dataset.

Attractive heatmaps

For plotting heatmaps and clustering observations and variables simultaneously, we can use packages such as ComplexHeatmap (Gu Z (2022). "Complex Heatmap Visualization." iMeta. doi:10.1002/imt2.43) and pheatmap. Here we apply the ComplexHeatmap package to cluster the dtraw2 dataset, measured on 35 genotypes of a plant with 9 traits.
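A minimal ComplexHeatmap sketch under the assumption that the trait table is the numeric part of 'dtraw' prepared above (substitute the vignette's 'dtraw2' object where available):

library(ComplexHeatmap)
m <- scale(as.matrix(dtraw))     # z-score the traits before clustering
Heatmap(m,
        name = "z-score",
        cluster_rows = TRUE,     # cluster the genotypes
        cluster_columns = TRUE)  # cluster the traits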

Fig. 9: Complex heatmap plot1 of the dtraw2.

Step 2: FS vs. FLP, DFL:
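The output below follows the same pattern as step 1; a hedged sketch of the commands (column names taken from the printed output, repeated analogously for steps 3 to 6) is:

dtseq2 <- dtseq[, c("FS", "FLP", "DFL")]  # response first, then its predictors
head(dtseq2)
matdiag(dtseq2, 1)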

#> # A tibble: 6 × 3
#>      FS   FLP   DFL
#>   <dbl> <dbl> <dbl>
#> 1  31.6  55.2  16.7
#> 2  38.5  55.7  18.3
#> 3  25.3  49.8  17.6
#> 4  33.9  59.1  17.9
#> 5  24.7  49.5  15.5
#> 6  19.1  67.3  17.3

Fig. 10: Diagram of the path coefficient analysis of the dtseq2 (part of dtseq)

Step 3: DFT vs. FLP, DFL:

#> # A tibble: 6 × 3
#>     DFT   FLP   DFL
#>   <dbl> <dbl> <dbl>
#> 1  51.7  55.2  16.7
#> 2  52    55.7  18.3
#> 3  54.7  49.8  17.6
#> 4  49    59.1  17.9
#> 5  50    49.5  15.5
#> 6  52.5  67.3  17.3

Fig. 11: Diagram of the path coefficient analysis of dtseq3 (part of dtseq)

Step 4: FW vs. FLP, DFL:

#> # A tibble: 6 × 3
#>      FW   FLP   DFL
#>   <dbl> <dbl> <dbl>
#> 1  8.69  55.2  16.7
#> 2  8.16  55.7  18.3
#> 3  7.48  49.8  17.6
#> 4 10.0   59.1  17.9
#> 5  9.19  49.5  15.5
#> 6  9.97  67.3  17.3

Fig. 12: Diagram of the path coefficient analysis of dtseq4 (part of dtseq)

Step 5: FV vs. FLP, DFL:

#> # A tibble: 6 × 3
#>      FV   FLP   DFL
#>   <dbl> <dbl> <dbl>
#> 1 10.0   55.2  16.7
#> 2  8.33  55.7  18.3
#> 3  5.65  49.8  17.6
#> 4  9.33  59.1  17.9
#> 5 10.0   49.5  15.5
#> 6  9.87  67.3  17.3

Fig. 13: Correlation plot of the dtseq5 (part of dtseq)

Step 6: DFL vs. FLP:

#> # A tibble: 6 × 2
#>     DFL   FLP
#>   <dbl> <dbl>
#> 1  16.7  55.2
#> 2  18.3  55.7
#> 3  17.6  49.8
#> 4  17.9  59.1
#> 5  15.5  49.5
#> 6  17.3  67.3

Fig. 14: Network plot of the dtseq6 (part of dtseq).

Multivariate analysis of variance (MANOVA) is used to estimate the SSCP matrices and related quantities. It relies on the Anova() function of the 'car' package, loaded above:

data(dtseqr)
dtseqr <- as.data.frame(dtseqr)
dtseqr[, 1] <- as.factor(dtseqr[, 1])  # Rep
dtseqr[, 2] <- as.factor(dtseqr[, 2])  # Genotypes

f <- lm(cbind(YLD, DFT, FS, FV, FW, DFL, FLP) ~ Rep + Genotypes, dtseqr)

summary(Anova(f))  # all results for MANOVA
#> 
#> Type II MANOVA Tests:
#> 
#> Sum of squares and products for error:
#>           YLD        DFT          FS        FV        FW       DFL        FLP
#> YLD 30872.750 1305.82650 -1284.15150 -402.8620 -420.6260 -213.0040 1418.89700
#> DFT  1305.827  677.12718    86.45078   -1.4006   50.1155  118.7334   69.47155
#> FS  -1284.151   86.45078   254.45997  -23.1646   24.3031   43.1752   41.54635
#> FV   -402.862   -1.40060   -23.16460   28.2584    6.8566   18.5540  -39.72960
#> FW   -420.626   50.11550    24.30310    6.8566   24.0384   22.1664  -42.02800
#> DFL  -213.004  118.73340    43.17520   18.5540   22.1664   56.4220  -35.56200
#> FLP  1418.897   69.47155    41.54635  -39.7296  -42.0280  -35.5620  271.59110
#> 
#> ------------------------------------------
#>  
#> Term: Rep 
#> 
#> Sum of squares and products for the hypothesis:
#>           YLD        DFT         FS      FV       FW    DFL       FLP
#> YLD  327.6100  195.02750  124.79950  10.860 -53.2140  7.240 -336.8410
#> DFT  195.0275  116.10063   74.29363   6.465 -31.6785  4.310 -200.5228
#> FS   124.7995   74.29363   47.54103   4.137 -20.2713  2.758 -128.3159
#> FV    10.8600    6.46500    4.13700   0.360  -1.7640  0.240  -11.1660
#> FW   -53.2140  -31.67850  -20.27130  -1.764   8.6436 -1.176   54.7134
#> DFL    7.2400    4.31000    2.75800   0.240  -1.1760  0.160   -7.4440
#> FLP -336.8410 -200.52275 -128.31595 -11.166  54.7134 -7.444  346.3321
#> 
#> Multivariate Tests: Rep
#>                  Df test stat  approx F num Df den Df     Pr(>F)    
#> Pillai            2   0.98891   1.25750     14     18    0.31903    
#> Wilks             2   0.01109   9.70834     14     16 2.5259e-05 ***
#> Hotelling-Lawley  2  89.15113  44.57557     14     14 3.7404e-09 ***
#> Roy               2  89.15113 114.62289      7      9 4.5158e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> ------------------------------------------
#>  
#> Term: Genotypes 
#> 
#> Sum of squares and products for the hypothesis:
#>            YLD        DFT           FS         FV         FW         DFL
#> YLD 697553.096 5033.04937 -11241.57825 4343.21250 2541.13650 -2906.41425
#> DFT   5033.049  116.01356    -91.18987    7.00395   -1.07955   -11.41207
#> FS  -11241.578  -91.18987    926.14245    3.20730  -24.60870   134.06535
#> FV    4343.213    7.00395      3.20730   83.14020   37.53540   -13.49070
#> FW    2541.136   -1.07955    -24.60870   37.53540   23.42280    -5.21070
#> DFL  -2906.414  -11.41207    134.06535  -13.49070   -5.21070    54.84165
#> FLP   5863.115   22.55888    -56.60295  126.96570   90.78030    72.84285
#>            FLP
#> YLD 5863.11525
#> DFT   22.55888
#> FS   -56.60295
#> FV   126.96570
#> FW    90.78030
#> DFL   72.84285
#> FLP  769.10145
#> 
#> Multivariate Tests: Genotypes
#>                  Df test stat approx F num Df   den Df     Pr(>F)    
#> Pillai            7     3.896    2.510     49 98.00000 5.5418e-05 ***
#> Wilks             7     0.000   16.018     49 45.03719 < 2.22e-16 ***
#> Hotelling-Lawley  7  4971.966  637.803     49 44.00000 < 2.22e-16 ***
#> Roy               7  4949.541 9899.081      7 14.00000 < 2.22e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Anova(f)$SSPE    # individual printing SSCP matrix of error

Anova(f)$SSPE[4:5, 4:5] # SSCP matrix of error for the two dependent variables, i.e. FV and FW
#>         FV      FW
#> FV 28.2584  6.8566
#> FW  6.8566 24.0384

After performing the multivariate path coefficient analysis, it is necessary to estimate the correlation coefficient between the residuals (here FV and FW are the final dependent variables) as follows. To do this, the error SSCP matrix from the MANOVA is needed.

ru1u2 <- Anova(f)$SSPE[4, 5]/(  sqrt(Anova(f)$SSPE[4, 4])*sqrt(Anova(f)$SSPE[5, 5]))
cat("\nCorrelation coefficient between residuals is:\n", ru1u2)
#> 
#> Correlation coefficient between residuals is:
#>  0.2630766

After analyzing the sequential path step by step, it is now time to create a sequential path diagram, which may include a multivariate path diagram. To do this, one can use Graphviz via the DiagrammeR package. If a specific section of the sequential model is considered a multivariate path, one can draw a multivariate path diagram (Arminian et al. 2008) and use the correlation coefficient between the residuals (as estimated above) as follows:
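A hedged DiagrammeR/Graphviz sketch of the skeleton of such a sequential diagram (node layout only; the estimated path coefficients and residuals from steps 1 to 6 would be added as edge labels):

library(DiagrammeR)
grViz("
digraph sequential_path {
  rankdir = LR
  node [shape = box, fontname = Helvetica]
  FLP -> DFL                                    // step 6
  FLP -> FS;  DFL -> FS                         // step 2
  FLP -> DFT; DFL -> DFT                        // step 3
  FLP -> FW;  DFL -> FW                         // step 4
  FLP -> FV;  DFL -> FV                         // step 5
  FS -> YLD; DFT -> YLD; FW -> YLD; FV -> YLD   // step 1
}
")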

Fig. 15: Sequential univariate path diagram. Note that a residual term, estimated in each of steps 1 to 6 above, can be added to each endogenous variable.

For the full set of color names, node/edge attributes, and other DiagrammeR/Graphviz options that can be used in the diagrams, see manuals and guides such as https://rich-iannone.github.io/DiagrammeR/articles/graphviz-mermaid.html (also: vignettes/graphviz-mermaid.Rmd).

Fig. 16: The sequential multivariate path diagram

Notice: Users may also consider the 'lavaan' package in R, the simple 'PATHSAS' code written by Cramer et al. (Cramer, Wehner, and Donaghy 1999), and the 'semPlot' package (for plotting 'lavaan' and 'OpenMx' models) as additional tools for conducting path analyses and SEM (Structural Equation Modeling).

References

Arminian, A, MS Kang, M Kozak, S Houshmand, and P Mathews. 2008. “MULTPATH: A Comprehensive Minitab Program for Computing Path Coefficients and Multiple Regression for Multivariate Analyses.” Journal of Crop Improvement 22 (1): 82–120.
Bhattacharyya, GK, and RA Johnson. 1997. Statistical Concepts and Methods. John Wiley; Sons, New York.
Bondari, K. 1990. Path Analysis in Agricultural Research. https://doi.org/10.4148/2475-7772.1439.
Cramer, CS, TC Wehner, and SB Donaghy. 1999. “PATHSAS: A SAS Computer Program for Path Coefficient Analysis of Quantitative Data.” Journal of Heredity 90 (1): 260–62. https://doi.org/10.1093/jhered/90.1.260.
Draper, N, and H Smith. 1981. Applied Regression Analysis. John Wiley & Sons, New York.
Li, CC. 1975. Path Analysis: A Primer. Boxwood Pr.
Neter, J, GA Whitmore, and W Wasserman. 1992. Applied Statistics. Allyn & Bacon, Incorporated.
Olivoto, T, and A Dal’Col Lúcio. 2020. “Metan: An r Package for Multi‐environment Trial Analysis.” Methods in Ecology and Evolution 11 (6): 783–89. https://doi.org/10.1111/2041-210X.13384.
Snedecor, GW, and WG Cochran. 1980. Statistical Methods. 7th edition. Iowa State University Press.
Wolfle, LM. 2003. “The Introduction of Path Analysis to the Social Sciences, and Some Emergent Themes: An Annotated Bibliography.” Structural Equation Modeling 10 (1): 1–34.
Wright, S. 1923. “The Theory of Path Coefficients: A Reply to Niles’s Criticism.” Genetics 8 (3): 239.
———. 1934. “The Method of Path Coefficients.” The Annals of Mathematical Statistics 5 (3): 161–215.
———. 1960. “Path Coefficients and Path Regressions: Alternative or Complementary Concepts?” Biometrics 16 (2): 189–202.