heterocop: an R package for Gaussian copula semi-parametric inference for heterogeneous data

Ekaterina Tomilina

2024-11-04

This package enables the user to quantify dependencies between mixed (continuous, discrete, binary) variables in the framework of a Gaussian copula model by estimating the correlation matrix of the copula (Tomilina, Mazo, and Jaffrézic (2024)).

Context

When working with \(d\) mixed variables \(X_1, ..., X_d\), it can be complicated to infer a correlation network as well-known statistical methods do not work in case of mixed data. Indeed, most classical network inference methods such as gLasso or Gaussian graphical models rely on the assumption that the data follow a Gaussian distribution. Recently, the Non-paranormal distribution was introduced for network inference for continuous, non-Gaussian data (Liu (2009)). It consists in a transformation of the cumulative distribution functions via a Gaussian copula and provides results for non-Gaussian but continuous variables. We propose an extension of this model to the case of mixed variables.

The Model

Let \(X_1, ..., X_d\) be mixed variables (continuous or discrete). Let \(F_1, ..., F_d\) denote their marginal CDFs, \(\Phi^{-1}\) the inverse of the standard normal CDF and \(\Phi_\Sigma\) the Gaussian CDF of correlation matrix \(\Sigma\). We assume that the multivariate CDF of the vector \((X_1,\dots,X_d)\) is given by: \[F(X_1, ..., X_d)=C_\Sigma(F_1(X_1), ..., F_d(X_d)):=\Phi_\Sigma(\Phi^{-1}(F_1(X_1)), ..., \Phi^{-1}(F_d(X_d))).\]

Estimation

In order to estimate the correlation matrix of the copula, the rho_estim function uses a semiparametric pairwise maximum likelihood estimator. It returns the estimated correlation matrix of the copula and takes as arguments the data set and the variable types in a vector. In the example below, we have used a subset of the ICGC data set (Zhang, Bajari, and D. (2019)) which contains 5 RNA-seq, 5 protein and 5 mutation variables. We have specified the variable types, where a “C” stands for “continuous” and a “D” for “discrete”.

data(icgc_data)
R <- rho_estim(icgc_data,c(rep("C",10),rep("D",5)))

A \(6\times6\) subset of the obtained copula correlation matrix is represented below.

ACACA AKT1S1 ANLN ANXA1 AR ACACA_P
ACACA 1.00 -0.13 0.09 -0.32 0.36 0.58
AKT1S1 -0.13 1.00 -0.13 -0.16 -0.19 0.11
ANLN 0.09 -0.13 1.00 0.13 -0.36 -0.07
ANXA1 -0.32 -0.16 0.13 1.00 -0.36 -0.22
AR 0.36 -0.19 -0.36 -0.36 1.00 0.14
ACACA_P 0.58 0.11 -0.07 -0.22 0.14 1.00

Graphical representation

The cor_network_graph function enables to visualize the obtained network. It takes as arguments the dataset, the correlation matrix, the threshold and a legend.

cor_network_graph(R,TS=0.3,legend=c(rep("RNAseq",5),rep("Proteins",5),rep("Mutations",5)))

Simulation

Our package is also able to simulate data distributed according to the Gaussian copula model. Two functions enable us to generate two types of correlation matrices: block-wise and sparse. The diag_block_matrix function enables the user to get a block-wise correlation matrix. It takes as arguments a vector containing the block sizes and a vector containing the coefficients of each block. An example is shown below.

R <- diag_block_matrix(c(3,2),c(0.4,0.8))
1.0 0.4 0.4 0.0 0.0
0.4 1.0 0.4 0.0 0.0
0.4 0.4 1.0 0.0 0.0
0.0 0.0 0.0 1.0 0.8
0.0 0.0 0.0 0.8 1.0

The matrix_gen function enables the user to generate a sparse correlation matrix of initial sparsity parameter \(\gamma\), which has to be specified. It is based on the Cholesky decomposition where a lower triangular matrix of sparsity \(\gamma\) is generated before being multiplied by its transpose in order to obtain the final matrix. Note that the initial parameter is not equal to the final parameter, which is also returned by the function. In the example below, the first element of the list is the resulting matrix, and the second element of the list is the final sparsity parameter.

R <- matrix_gen(5,0.81)
1.00 0 0.54 0.00 0
0.00 1 0.00 0.00 0
0.54 0 1.00 0.44 0
0.00 0 0.44 1.00 0
0.00 0 0.00 0.00 1
sparsity = 0.64

The CopulaSim function enables the user to generate a data set which CDF can be expressed as a Gaussian copula of correlation matrix R (to be specified). In the example below, we first generate a block diagonal correlation matrix R and then generate the data set. Then, CopulaSim takes as arguments the number of observations, the correlation matrix of the copula, a vector containing the probability distributions and their parameters, the number of repetitions of each distribution, and enables the user to randomize their order. It returns a list of three elements: the data frame containing the generated data, the correlation matrix, and the permutation realized on the rows and columns of R order after randomization.

R <- diag_block_matrix(c(3,5,2),c(0.7,0.3,0.5))
CopulaSim(5,R,c(rep("qnorm(0,1)",5),rep("qexp(0.5)",3),rep("qbinom(4,0.8)",2)),random=TRUE)
#> [[1]]
#>           X1          X2         X3          X4         X5       X6       X7
#> 1  2.1424056  0.53822367 -0.4159329  1.18702567 -0.9237198 3.154596 1.523011
#> 2  1.2196475  0.48552614  0.6316280 -0.18404186  1.4569011 1.467397 1.369171
#> 3 -1.2899258  0.06995913  0.2045720 -0.22193038  1.4856449 1.445284 1.898268
#> 4  0.4477222 -0.33675583 -0.7747620  1.61988927  0.6950959 2.817395 1.552058
#> 5 -0.1923970 -1.35185464 -1.0391505 -0.05757026 -1.5594221 2.613962 0.540430
#>         X8 X9 X10
#> 1 4.365789  4   4
#> 2 2.513484  4   3
#> 3 5.235144  4   3
#> 4 4.356205  3   4
#> 5 3.676731  4   3
#> 
#> [[2]]
#>       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#>  [1,]  1.0  0.3  0.0  0.0  0.3  0.0  0.3  0.3  0.0   0.0
#>  [2,]  0.3  1.0  0.0  0.0  0.3  0.0  0.3  0.3  0.0   0.0
#>  [3,]  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.7   0.7
#>  [4,]  0.0  0.0  0.0  1.0  0.0  0.5  0.0  0.0  0.0   0.0
#>  [5,]  0.3  0.3  0.0  0.0  1.0  0.0  0.3  0.3  0.0   0.0
#>  [6,]  0.0  0.0  0.0  0.5  0.0  1.0  0.0  0.0  0.0   0.0
#>  [7,]  0.3  0.3  0.0  0.0  0.3  0.0  1.0  0.3  0.0   0.0
#>  [8,]  0.3  0.3  0.0  0.0  0.3  0.0  0.3  1.0  0.0   0.0
#>  [9,]  0.0  0.0  0.7  0.0  0.0  0.0  0.0  0.0  1.0   0.7
#> [10,]  0.0  0.0  0.7  0.0  0.0  0.0  0.0  0.0  0.7   1.0
#> 
#> [[3]]
#>  [1] 10  9  4  2  8  1  3  5  6  7

Additionally, the gauss_gen function, which is used in CopulaSim, generates the latent Gaussian variables linked by the correlation matrix R. Its only arguments are the correlation matrix R and the number of observations.

latent_data <- gauss_gen(R,10)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
0.74 0.54 0.75 0.40 0.32 0.14 0.46 0.78 0.26 0.26
0.44 0.43 0.40 0.39 0.51 0.38 0.62 0.10 0.56 0.15
0.54 0.30 0.14 0.49 0.33 0.26 0.46 0.09 0.61 0.89
0.83 0.66 0.55 0.00 0.05 0.12 0.38 0.31 0.11 0.10
0.51 0.50 0.81 0.13 0.01 0.07 0.03 0.26 0.29 0.15
0.76 0.38 0.35 0.05 0.44 0.10 0.07 0.15 0.78 0.41
0.52 0.57 0.52 0.86 0.55 0.18 0.74 0.46 0.63 0.51
0.89 0.80 0.56 0.71 0.80 0.69 0.55 0.87 0.24 0.02
0.84 0.68 0.94 0.23 0.53 0.09 0.71 0.71 0.28 0.92
0.50 0.54 0.40 0.60 0.15 0.86 0.58 0.43 0.10 0.58

References

Liu. 2009. “The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs.” Journal of Machine Learning Research.

Tomilina, Mazo, and Jaffrézic. 2024. “Mixed Copula-Based Models for Semi-Parametric Network Inference: An Application to Multi-Omics Data.”

Zhang, J., R. Bajari, and Andric D. 2019. “The International Cancer Genome Consortium Data Portal.” Nat Biotechnol. https://doi.org/doi:10.1038/s41587-019-0055-9.