Getting Started with the PCMBase R-package

Venelin Mitov

2019-03-15

Data

The input data for the phylogenetic comparative methods covered in the PCMBase package consists of a phylogenetic tree of \(N\) tips and a \(k\times N\) matrix, \(X\) of observed trait-values, where \(k\) is the number of traits. The matrix \(X\) can contain NAs corresponding to missing measurements and NaN’s corresponding to non-existing traits for some of the tips.

Models

A model is defined by a set of parameters and a rule stating how these parameters should be fit to the data. For example, a multivariate phylogenetic Ornstein-Uhlenbeck mixed model has the following parameters:

The rule defining how the above OU-parameters should be fit to the data is defined by the multivariate OU Gaussian distribution, as described in (insert reference); Currently, PCMBase supports multivariate Gaussian models satisfying the following two conditions:

  1. the mean of the underlying stochastic process at the end of a time interval on a branch of the tree depends linearly on the ancestral value at the beginning of the branch;
  2. the variance-covariance matrix of the stochastic process after a time interval does not depend on the ancestral value.

In PCMBase, models are specified as S3 objects, i.e. ordinary R-lists with a class attribute. The base S3 class of all models is called "PCM", which is inherited by more specific model-classes. Let us create a BM PCM for two traits:

modelBM <- PCM(model = "BM", k = 2)

Printing the model object shows a short verbal description, the S3-class, the number of traits, k, the number of numerical parameters of the model, p, the model regimes and the current values of the parameters for each regime (more on regimes in the next sub-section):

modelBM
## Brownian motion model
## S3 class: BM, GaussianPCM, PCM; k=2; p=8; regimes: 1. Parameters/sub-models:
## X0 (VectorParameter, _Global, numeric; trait values at the root):
## [1] 0 0
## Sigma_x (MatrixParameter, _UpperTriangularWithDiagonal, _WithNonNegativeDiagonal; Choleski factor of the unit-time variance rate):
## , , 1
## 
##      [,1] [,2]
## [1,]    0    0
## [2,]    0    0
## 
## Sigmae_x (MatrixParameter, _UpperTriangularWithDiagonal, _WithNonNegativeDiagonal; Choleski factor of the non-heritable variance or the variance of the measurement error):
## , , 1
## 
##      [,1] [,2]
## [1,]    0    0
## [2,]    0    0
## 
## 

One may wonder why in the above description, p = 8 instead of 10 (see also ?PCMParamCount). The reason is that both, the matrix Sigma and the matrix Sigmae, are symmetric matrices and their matching off-diagonal elements are counted only one time.

Model regimes

Model regimes are different models associated with different parts of the phylogenetic tree. This is a powerful concept allowing to model different evolutionary modes on different lineages on the tree. Let us create a 2-trait BM model with two regimes called a and b:

modelBM.ab <- PCM("BM", k = 2, regimes = c("a", "b"))
modelBM.ab
## Brownian motion model
## S3 class: BM, GaussianPCM, PCM; k=2; p=14; regimes: a, b. Parameters/sub-models:
## X0 (VectorParameter, _Global, numeric; trait values at the root):
## [1] 0 0
## Sigma_x (MatrixParameter, _UpperTriangularWithDiagonal, _WithNonNegativeDiagonal; Choleski factor of the unit-time variance rate):
## , , a
## 
##      [,1] [,2]
## [1,]    0    0
## [2,]    0    0
## 
## , , b
## 
##      [,1] [,2]
## [1,]    0    0
## [2,]    0    0
## 
## Sigmae_x (MatrixParameter, _UpperTriangularWithDiagonal, _WithNonNegativeDiagonal; Choleski factor of the non-heritable variance or the variance of the measurement error):
## , , a
## 
##      [,1] [,2]
## [1,]    0    0
## [2,]    0    0
## 
## , , b
## 
##      [,1] [,2]
## [1,]    0    0
## [2,]    0    0
## 
## 

Now, we can set some different values for the parameters of the model we’ve just created. First, let us specify an initial value vector different from the default 0-vector:

modelBM.ab$X0[] <- c(5, 2)

X0 is defined as a parameter with S3 class class(modelBM.ab$X0). This specifies that X0 is global vector parameter shared by all model regimes. This is also the reason, why the number of parameters is not the double of the number of parameters in the first model:

PCMParamCount(modelBM)
## [1] 8
PCMParamCount(modelBM.ab)
## [1] 14

The other parameters, Sigma and Sigmae are local for each regime:

# in regime 'a' the traits evolve according to two independent BM processes (starting from the global vecto X0).
modelBM.ab$Sigma_x[,, "a"] <- rbind(c(1.6, 0),
                                  c(0, 2.4))
modelBM.ab$Sigmae_x[,, "a"] <- rbind(c(.1, 0),
                                   c(0, .4))
# in regime 'b' there is a correlation between the traits
modelBM.ab$Sigma_x[,, "b"] <- rbind(c(1.6, .8),
                                  c(.8, 2.4))
modelBM.ab$Sigmae_x[,, "b"] <- rbind(c(.1, 0),
                                   c(0, .4))

The above way of setting values for model parameters, while human readable, is not handy during model fitting procedures, such as likelihood maximization. Thus, there is another way to set (or get) the model parameter values from a numerical vector:

param <- double(PCMParamCount(modelBM.ab))

# load the current model parameters into param
PCMParamLoadOrStore(modelBM.ab, param, offset=0, load=FALSE)
## [1] 14
print(param)
##  [1] 5.0 2.0 1.6 0.0 2.4 1.6 0.8 2.4 0.1 0.0 0.4 0.1 0.0 0.4
# modify slightly the model parameters
param2 <- jitter(param)

print(param2)
##  [1]  5.011142116  2.008981272  1.602330599  0.016798115  2.417186893
##  [6]  1.601842169  0.785598809  2.383546738  0.115406305 -0.018173201
## [11]  0.391706960  0.096997701  0.004235176  0.401911263
# set the new parameter vector
PCMParamLoadOrStore(modelBM.ab, param2, offset = 0, load=TRUE)
## [1] 14
print(modelBM.ab)
## Brownian motion model
## S3 class: BM, GaussianPCM, PCM; k=2; p=14; regimes: a, b. Parameters/sub-models:
## X0 (VectorParameter, _Global, numeric; trait values at the root):
## [1] 5.011142 2.008981
## Sigma_x (MatrixParameter, _UpperTriangularWithDiagonal, _WithNonNegativeDiagonal; Choleski factor of the unit-time variance rate):
## , , a
## 
##          [,1]       [,2]
## [1,] 1.602331 0.01679812
## [2,] 0.000000 2.41718689
## 
## , , b
## 
##          [,1]      [,2]
## [1,] 1.601842 0.7855988
## [2,] 0.800000 2.3835467
## 
## Sigmae_x (MatrixParameter, _UpperTriangularWithDiagonal, _WithNonNegativeDiagonal; Choleski factor of the non-heritable variance or the variance of the measurement error):
## , , a
## 
##           [,1]       [,2]
## [1,] 0.1154063 -0.0181732
## [2,] 0.0000000  0.3917070
## 
## , , b
## 
##           [,1]        [,2]
## [1,] 0.0969977 0.004235176
## [2,] 0.0000000 0.401911263
## 
## 

Simulating data on a phylogenetic tree

The first functionality of the PCMBase package is to provide an easy way to simulate multiple trait data on a tree under a given (possibly multiple regime) PCM.

For this example, first we simulate a birth death tree with two parts “a” and “b” using the phytools R-package:

# make results reproducible
set.seed(2)

# number of regimes
R <- 2

# number of extant tips
N <- 100

tree.a <- PCMTree(rtree(n=N))
PCMTreeSetLabels(tree.a)
PCMTreeSetPartRegimes(tree.a, part.regime = c(`101` = "a"), setPartition = TRUE)

lstDesc <- PCMTreeListDescendants(tree.a)
splitNode <- names(lstDesc)[which(sapply(lstDesc, length) > N/2 & sapply(lstDesc, length) < 2*N/3)][1]

tree.ab <- PCMTreeInsertSingletons(
  tree.a, nodes = as.integer(splitNode), 
  positions = PCMTreeGetBranchLength(tree.a, as.integer(splitNode))/2)
PCMTreeSetPartRegimes(
  tree.ab,
  part.regime = structure(c("a", "b"), names = as.character(c(N+1, splitNode))), 
  setPartition = TRUE)

# Currently this is causing a failure of the pkgdown::build_site(), so we use the 
# ape's tree-plotting function
# PCMTreePlot(tree.ab)

palette <- PCMColorPalette(2, c("a", "b"))
plot(tree.ab, show.tip.label=FALSE, 
     edge.color = palette[PCMTreeGetRegimesForEdges(tree.ab)])

Now we can simulate data on the tree using the modelBM.ab$X0 as a starting value:

traits <- PCMSim(tree.ab, modelBM.ab, modelBM.ab$X0)

Calculating likelihoods

Calculating a model likelihood for a given tree and data is the other key functionality of the PCMBase package.

PCMLik(traits, tree.ab, modelBM.ab)
## [1] -409.4394
## attr(,"X0")
## [1] 5.011142 2.008981
## attr(,"class")
## [1] "VectorParameter" "_Global"         "numeric"        
## attr(,"description")
## [1] "trait values at the root"

For faster and repeated likelihood evaluation, I recommend creating a likelihood function for a given data, tree and model object. Passing this function object to optim would save the need for pre-processing the data and tree at every likelihood evaluation.

# a function of a numerical parameter vector:
likFun <- PCMCreateLikelihood(traits, tree.ab, modelBM.ab)

likFun(param2)
## [1] -409.4394
## attr(,"X0")
## [1] 5.011142 2.008981
## attr(,"class")
## [1] "VectorParameter" "_Global"         "numeric"        
## attr(,"description")
## [1] "trait values at the root"