Markov Chain Analysis on Phylogenetic Trees

The markophylo package allows for estimating substitution rates using a user-specified, i.e., hypothesis driven, substitution rate matrix in a continuous-time markov chain model on phylogenies. The package is quite efficient, with the workhorse functions written in C++ (using the Rcpp and RcppArmadillo packages), and can analyze data on discrete characters like the following (among others):

The package contains five simulated data sets. These data illustrate the kinds of models that the package is capable of fitting to discrete character data recorded for multiple taxa (for which a phylogeny is supplied by the user). The markophylo::estimaterates function is flexible, capable of capturing the following processes/behaviour:

Contact

If you have any questions or requests for a feature, drop me a line at utkarshdang@cunet.carleton.ca.

Installation

CRAN

The following commands installs markophylo binaries from the Comprehensive R Archive Network (CRAN):

install.packages("markophylo", dependencies = TRUE, repos = "https://cran.r-project.org")

Source package from CRAN

install.packages(“markophylo”, dependencies = TRUE, repos = “https://cran.r-project.org”, type = “source”)

Source package from file

Installing markophylo from the source distribution after downloading the package source from https://cran.r-project.org/package=markophylo/ can be done using the following commands:

OS X Yosemite

install.packages(c("Rcpp","RcppArmadillo","ape","phangorn",
"numDeriv","knitr"), repos = "https://cran.r-project.org")

install.packages("markophylo_1.0.2.tar.gz", repos = NULL, type = "source")

Make sure the version number of markophylo reflects the latest version on CRAN.

If “-lgfortran” and/or “-lquadmath” errors are encountered on a OS X system, you may need a gfortrain version. Previously, it used to be to download gfortran-4.8.2-darwin13.tar.bz2 file into /usr/local but the file version may have changed. This issue has been known to occur in the past with RcppArmadillo. This is also mentioned in the Rcpp FAQs vignette on https://cran.r-project.org/package=Rcpp/index.html.

Windows

Installing an R package from source that requires compilation on Windows requires prior installation of Rtools from https://cran.r-project.org/bin/windows/Rtools/. Rtools contains MinGW compilers needed for building packages requiring compilation of Fortran, C or C++ code. The PATH variable should be allowed to be modified during installation of Rtools. If the above is not allowed, following the installation of Rtools, the PATH variable must be set to include “RTools/bin” and “Rtools/gcc-x.y.z/bin”, where “x.y.z” refers to the version number of gcc. Then,

install.packages(c("Rcpp","RcppArmadillo","ape","phangorn",
"numDeriv","knitr"), repos = "https://cran.r-project.org")

install.packages("markophylo_1.0.2.tar.gz", repos = NULL, type = "source")

Make sure the version number of markophylo reflects the latest version on CRAN.

Illustration 1

Fitting a simple model and evaluating the results.

Load the package and data simdata1.

library(markophylo)
data(simdata1)

Now, inspect the tree and the substitution rate matrix used to simulate data in simdata1.

ape::plot.phylo(simdata1$tree, edge.width = 2, show.tip.label = FALSE, no.margin = TRUE)
ape::nodelabels(frame = "circle", cex = 0.7)
ape::tiplabels(frame = "circle", cex = 0.7)

print(simdata1$Q)

##      [,1] [,2]
## [1,] -1.0  1.0
## [2,]  0.6 -0.6

In this example, Q is a standard substitution rate matrix used to simulate two character states data for the tips of the phylogeny. Here, equal character frequencies were assumed at the root during simulation. Confirming the number of character states and inspecting the character frequencies at the tips of the phylogeny is straightforward.

print(table(simdata1$data))

## 
##     1     2 
##  8814 11186

Now, using the function estimaterates from the markophylo package, we will estimate a substitution rate matrix from this data. The modelmat argument represents the number of unique rates to be fit using a 2-dimensional matrix, with the diagonals left as NA (see section on creating modelmat below). The model can be fitted in one of two ways:

model1 <- estimaterates(usertree = simdata1$tree, userphyl = simdata1$data, 
                        alphabet = c(1, 2), rootprob = "equal", 
                        modelmat = matrix(c(NA, 1, 2, NA), 2, 2))
print(model1)

## Call:
## estimaterates(usertree = simdata1$tree, userphyl = simdata1$data, 
##     alphabet = c(1, 2), modelmat = matrix(c(NA, 1, 2, NA), 2, 
##         2), rootprob = "equal")
## 
##  4 taxa with 5000 discrete character patterns were fitted.
## -----------------------------------
## Groups of nodes with the same rates:
## [[1]]
## [1] 1 2 3 4 5 6 7
## 
## -----------------------------------
## Parameter Index Matrix
##    1  2
## 1 NA  2
## 2  1 NA
## -----------------------------------
## wop 
## $rates
## , , 1
## 
##           [,1]
## [1,] 0.5680754
## [2,] 1.0111462
## 
## 
## $se
## $se$rates
## , , 1
## 
##            [,1]
## [1,] 0.02315191
## [2,] 0.03119170
## 
## 
## 
## -----------------------------------
## Log-likelihood for model wop : -10055.02 
## AIC           for model wop : -20114.03 
## BIC           for model wop : -20127.07 
## -----------------------------------
## Time taken: 0.106 seconds.

model1 <- estimaterates(usertree = simdata1$tree, userphyl = simdata1$data, 
                        alphabet = c(1, 2), rootprob = "equal", 
                        modelmat = "ARD")
print(model1)

The second code block is using the “ARD” option for modelmat which automatically uses a two state substitution rate matrix here with two rates to be estimated. The output from printing model1 (which is an object of class ‘markophylo’) is self-explanatory. The estimated rates (given as an array in $rates) and the standard errors (both indexed according to the parameter index matrix) are output. The log-likelihood at the maximum likelihood estimates along with the values of the model selection criteria (Akaike information criterion and Bayesian information criterion) are also provided. Because we did not specify a priori any clades to be fit with their own unique rates, these estimates are homogenous over the tree. This is also reflected in the one component list output under “Groups of nodes with the same rates”. Note that the model1 object is of class markophylo.

print(class(model1))

## [1] "markophylo"

Among other things, model1 contains the number of unique patterns observed that can be retreived using model1$w. As expected, for binary discrete characters for four taxa, there are 2^4 unique patterns. These patterns can also be calculated by providing the data set directly to the patterns function.

	Number of Times Pattern Observed
1 1 1 1	1146
1 1 1 2	71
1 1 2 1	85
1 1 2 2	89
1 2 1 1	244
1 2 1 2	33
1 2 2 1	23
1 2 2 2	516
2 1 1 1	471
2 1 1 2	31
2 1 2 1	44
2 1 2 2	233
2 2 1 1	140
2 2 1 2	75
2 2 2 1	73
2 2 2 2	1726

Correcting for unobservable patterns

It is often very important to correct for sampling bias (aka acquisition or ascertainment bias). This could be important in many diverse scenarios and has been corrected for in many previous methodologies. For example, this correction has been used in the following cases:

Analysis of restriction site data (Felsenstein, 1992).
Only characters that are non-constant were recorded in the data (Lewis, 2001).
Ancient genes that are lost from all taxa of interest cannot be observed at the tips (Hao and Golding, 2006).
A gene family does not appear in the COG database unless it occurs in at least three genomes (Cohen and Pupko, 2010).

In such a case, certain character patterns for the taxa of interest are not available in the data. For example, let’s see what happens if we estimate the rates of interest on the data used above in Illustration 1 that was filtered to remove all non-variable character patterns.

filterall1 <- which(apply(simdata1$data, MARGIN = 1, FUN = 
                            function(x) isTRUE(all.equal(as.vector(x), c(1, 1, 1, 1)))))
filterall2 <- which(apply(simdata1$data, MARGIN = 1, FUN = 
                            function(x) isTRUE(all.equal(as.vector(x), c(2, 2, 2, 2)))))
filteredsimdata1 <- simdata1$data[-c(filterall1, filterall2), ]
model1_f <- estimaterates(usertree = simdata1$tree, userphyl = filteredsimdata1,
                          alphabet = c(1, 2), rootprob = "equal", 
                          modelmat = "ARD")
print(model1_f)

## Call:
## estimaterates(usertree = simdata1$tree, userphyl = filteredsimdata1, 
##     alphabet = c(1, 2), modelmat = "ARD", rootprob = "equal")
## 
##  4 taxa with 2128 discrete character patterns were fitted.
## -----------------------------------
## Groups of nodes with the same rates:
## [[1]]
## [1] 1 2 3 4 5 6 7
## 
## -----------------------------------
## Parameter Index Matrix
##    1  2
## 1 NA  2
## 2  1 NA
## -----------------------------------
## wop 
## $rates
## , , 1
## 
##          [,1]
## [1,] 3.127061
## [2,] 3.196739
## 
## 
## $se
## $se$rates
## , , 1
## 
##           [,1]
## [1,] 0.1191886
## [2,] 0.1197627
## 
## 
## 
## -----------------------------------
## Log-likelihood for model wop : -5546.87 
## AIC           for model wop : -11097.74 
## BIC           for model wop : -11109.07 
## -----------------------------------
## Time taken: 0.059 seconds.

Clearly, there is a big difference as compared to the output from earlier. However, it is easy to correct for such sampling bias using the estimaterates function by providing a matrix with each row representing a unique unobservable pattern in the data. Here, a matrix with 2 rows, the first row all ones, and the second row all twos is supplied to the unobserved argument.

model1_f_corrected <- estimaterates(usertree = simdata1$tree, userphyl = filteredsimdata1, 
                                    unobserved = matrix(c(1, 1, 1, 1, 2, 2, 2, 2), nrow = 2, 
                                                        byrow = TRUE), alphabet = c(1, 2), 
                        rootprob = "equal", modelmat = "ARD")
print(model1_f_corrected)

## Call:
## estimaterates(usertree = simdata1$tree, userphyl = filteredsimdata1, 
##     unobserved = matrix(c(1, 1, 1, 1, 2, 2, 2, 2), nrow = 2, 
##         byrow = TRUE), alphabet = c(1, 2), modelmat = "ARD", 
##     rootprob = "equal")
## 
##  4 taxa with 2128 discrete character patterns were fitted.
## -----------------------------------
## Groups of nodes with the same rates:
## [[1]]
## [1] 1 2 3 4 5 6 7
## 
## -----------------------------------
## Parameter Index Matrix
##    1  2
## 1 NA  2
## 2  1 NA
## -----------------------------------
## wop 
## $rates
## , , 1
## 
##           [,1]
## [1,] 0.6090545
## [2,] 0.9718825
## 
## 
## $se
## $se$rates
## , , 1
## 
##            [,1]
## [1,] 0.06027376
## [2,] 0.09918309
## 
## 
## 
## -----------------------------------
## Log-likelihood for model wop : -4712.739 
## AIC           for model wop : -9429.478 
## BIC           for model wop : -9440.804 
## -----------------------------------
## Time taken: 0.058 seconds.

##Illustration 2: branch rate heterogeneity

Here, a more complicated model is fit with certain clades following rates separate from the rest of the tree. While simulating the data found in simdata2, the clade with node 7 as its most recent common ancestor (MRCA) was constrained to have twice the substitution rates as the rest of the branches in the tree.

data(simdata2)
print(simdata2$Q)

##      [,1] [,2]
## [1,] -1.0  1.0
## [2,]  0.6 -0.6

print(table(simdata2$data))

## 
##     1     2 
##  8714 11286

While fitting the model, we hypothesize that the clade with node 7 as its most recent common ancestor (MRCA) follows substitution rates unique from the rest of the tree.

model2 <- estimaterates(usertree = simdata2$tree, userphyl = simdata2$data, 
                        alphabet = c(1, 2), bgtype = "ancestornodes", bg = c(7),
                        rootprob = "equal", modelmat = "ARD")
print(model2)

## Call:
## estimaterates(usertree = simdata2$tree, userphyl = simdata2$data, 
##     alphabet = c(1, 2), modelmat = "ARD", bgtype = "ancestornodes", 
##     bg = c(7), rootprob = "equal")
## 
##  4 taxa with 5000 discrete character patterns were fitted.
## -----------------------------------
## Groups of nodes with the same rates:
## [[1]]
## [1] 7 3 4
## 
## [[2]]
## [1] 1 2 5 6 7
## 
## -----------------------------------
## Parameter Index Matrix
##    1  2
## 1 NA  2
## 2  1 NA
## -----------------------------------
## wop 
## $rates
## , , 1
## 
##          [,1]      [,2]
## [1,] 1.125722 0.5586981
## [2,] 2.001023 1.0061919
## 
## 
## $se
## $se$rates
## , , 1
## 
##           [,1]       [,2]
## [1,] 0.0720312 0.02677063
## [2,] 0.1055549 0.03522507
## 
## 
## 
## -----------------------------------
## Log-likelihood for model wop : -10893.48 
## AIC           for model wop : -21794.97 
## BIC           for model wop : -21821.03 
## -----------------------------------
## Time taken: 0.161 seconds.

This could also have been accomplished by providing a list with each component representing the node labels, i.e., group of branches, that follow the same rate(s). The node labels can be seen with

ape::plot.phylo(simdata2$tree, edge.width = 2, show.tip.label = FALSE, no.margin = TRUE)
ape::nodelabels(frame = "circle", cex = 0.7)
ape::tiplabels(frame = "circle", cex = 0.7)

Now,

model2_2 <- estimaterates(usertree = simdata2$tree, userphyl = simdata2$data, 
                        alphabet = c(1, 2), bgtype = "listofnodes", bg = list(c(3,4,7),c(1,2,5,6,7)),
                        rootprob = "equal", modelmat = "ARD")
print(model2_2)

## Call:
## estimaterates(usertree = simdata2$tree, userphyl = simdata2$data, 
##     alphabet = c(1, 2), modelmat = "ARD", bgtype = "listofnodes", 
##     bg = list(c(3, 4, 7), c(1, 2, 5, 6, 7)), rootprob = "equal")
## 
##  4 taxa with 5000 discrete character patterns were fitted.
## -----------------------------------
## Groups of nodes with the same rates:
## [[1]]
## [1] 3 4 7
## 
## [[2]]
## [1] 1 2 5 6 7
## 
## -----------------------------------
## Parameter Index Matrix
##    1  2
## 1 NA  2
## 2  1 NA
## -----------------------------------
## wop 
## $rates
## , , 1
## 
##          [,1]      [,2]
## [1,] 1.125722 0.5586981
## [2,] 2.001023 1.0061919
## 
## 
## $se
## $se$rates
## , , 1
## 
##           [,1]       [,2]
## [1,] 0.0720312 0.02677063
## [2,] 0.1055549 0.03522507
## 
## 
## 
## -----------------------------------
## Log-likelihood for model wop : -10893.48 
## AIC           for model wop : -21794.97 
## BIC           for model wop : -21821.03 
## -----------------------------------
## Time taken: 0.162 seconds.

In the output, groups of nodes that delimit the branches that follow the same rates (within groups) are displayed as two vector components of an array. The estimated rates then follow the ordering of these components. Generally, the output rates in the first column are the estimated rates (as indexed by the parameter index matrix) for the first group, the second column for the second group, and so on. To get a visual confirmation of the different branches following their own unique rates, the following command can be used:

plottree(model2, colors=c("blue", "darkgreen"), edge.width = 2, show.tip.label = FALSE, 
         no.margin = TRUE)
ape::nodelabels(frame = "circle", cex = 0.7)
ape::tiplabels(frame = "circle", cex = 0.7)

##Illustration 3: partition models.

Here, a partition analysis is carried out. Nucleotide data was simulated such that the first half of sites followed substitution rates different from the other half of sites. Data was simulated in the two partitions with rates $0.33$ and $0.99$. While specifying the model, a list of two components then needs to be supplied to the function. To specify the partition with the two matrices, list(c(1:2500), c(2501:5000)) can be used. Note also that our alphabet here consists of the four nucleotide bases.

data(simdata3)
print(dim(simdata3$data))

## [1] 5000    4

print(table(simdata3$data))

## 
##    a    c    g    t 
## 5092 5048 4973 4887

model3 <- estimaterates(usertree = simdata3$tree, userphyl = simdata3$data, 
                        alphabet = c("a", "c", "g", "t"), rootprob = "equal", 
                        partition = list(c(1:2500), c(2501:5000)), 
                        modelmat = "ER")
print(model3)

## Call:
## estimaterates(usertree = simdata3$tree, userphyl = simdata3$data, 
##     alphabet = c("a", "c", "g", "t"), modelmat = "ER", partition = list(c(1:2500), 
##         c(2501:5000)), rootprob = "equal")
## 
##  4 taxa with 5000 discrete character patterns were fitted.
## -----------------------------------
## Groups of nodes with the same rates:
## [[1]]
## [1] 1 2 3 4 5 6 7
## 
## -----------------------------------
## Parameter Index Matrix
##    a  c  g  t
## a NA  1  1  1
## c  1 NA  1  1
## g  1  1 NA  1
## t  1  1  1 NA
## -----------------------------------
## wop 
## $rates
## , , 1
## 
##           [,1]
## [1,] 0.3451549
## 
## , , 2
## 
##           [,1]
## [1,] 0.9869837
## 
## 
## $se
## $se$rates
## , , 1
## 
##             [,1]
## [1,] 0.009154739
## 
## , , 2
## 
##            [,1]
## [1,] 0.02401694
## 
## 
## 
## -----------------------------------
## Log-likelihood for model wop : -21659.38 
## AIC           for model wop : -43322.75 
## BIC           for model wop : -43335.79 
## -----------------------------------
## Time taken: 0.35 seconds.

##Illustration 4: rate variation using the gamma distribution and estimating root probabilities.

Here, an analysis with gamma rate variation is carried out. Data was simulated using a gamma rate distribution with the shape parameter set to $2$. The overall rate was $0.33$ while the prior root probabilities used for the simulation were (0.15, 0.35, 0.25, 0.25)'. While specifying the model, the “discgamma” option is used for the ratevar argument with 4 categories specified in nocat. This specifies a discrete gamma approximation that approximates the integral (that would be needed in the likelihood if the continuous gamma distribution was used) at each site by numerical integration. Moreover, instead of assuming equal or stationary root frequencies, or providing user-specified root probabilities, the root probabilities are estimated using maximum likelihood.

data(simdata4)
print(table(simdata4$data))

## 
##    a    c    g    t 
## 2929 7025 5007 5039

model4 <- estimaterates(usertree = simdata4$tree, userphyl = simdata4$data, 
                        alphabet = c("a", "c", "g", "t"), rootprob = "maxlik",
                        ratevar = "discgamma", nocat = 4, 
                        modelmat = "ER")
print(model4)

## Call:
## estimaterates(usertree = simdata4$tree, userphyl = simdata4$data, 
##     alphabet = c("a", "c", "g", "t"), modelmat = "ER", ratevar = "discgamma", 
##     nocat = 4, rootprob = "maxlik")
## 
##  4 taxa with 5000 discrete character patterns were fitted.
## -----------------------------------
## Groups of nodes with the same rates:
## [[1]]
## [1] 1 2 3 4 5 6 7
## 
## -----------------------------------
## Parameter Index Matrix
##    a  c  g  t
## a NA  1  1  1
## c  1 NA  1  1
## g  1  1 NA  1
## t  1  1  1 NA
## -----------------------------------
## wop 
## $rates
## , , 1
## 
##           [,1]
## [1,] 0.3322271
## 
## 
## $alpha
## [1] 1.995144
## 
## $alpharates
##           [,1]      [,2]     [,3]     [,4]
## [1,] 0.3184436 0.6828814 1.108946 1.889729
## 
## $iroot
## [1] 0.1151663 0.3896658 0.2472584 0.2479095
## 
## $se
## $se$rates
## , , 1
## 
##             [,1]
## [1,] 0.008529137
## 
## 
## $se$alpha
## [1] 0.3138994
## 
## $se$iroot
## [1] 0.005504221 0.008271889 0.007391626
## 
## 
## -----------------------------------
## Log-likelihood for model wop : -17763.33 
## AIC           for model wop : -35536.65 
## BIC           for model wop : -35569.24 
## -----------------------------------
## Time taken: 2.107 seconds.

Correcting for unobservable patterns

Here, let’s see what happens if we estimate the rates of interest on a filtered version of the data used above in Illustration 4. First, the data is filtered to remove all patterns where only “a” was recorded or “g” was recorded more than three times (i.e., three or four times). Note that this arbitrary filtering is done simply to illustrate the ease of adjusting for unobservable data patterns.

filteralla <- which(apply(simdata4$data, MARGIN = 1, FUN = 
                            function(x) isTRUE(all.equal(as.vector(x), 
                                                         c("a", "a", "a", "a")))))
filterg3 <- which(apply(simdata4$data, MARGIN = 1, FUN = 
                          function(x) table(x)["g"] >= 3) )
filteredsimdata4 <- simdata4$data[-c(filteralla, filterg3), ]
dim(simdata4$data)

## [1] 5000    4

dim(filteredsimdata4)

## [1] 3588    4

Now, construct a matrix with all possible combinations (4^4) of the states (alphabet) for four taxa. Using this matrix of all patterns, figure out the set of unique unobservable patterns.

alphabet <- c("a", "c", "g", "t")
allpatt <- expand.grid(alphabet, alphabet, alphabet, alphabet) #all possible combinations
unob_patt_index_1 <- which(apply(allpatt, MARGIN = 1, FUN = function(x) table(x)["g"] >= 3) )
unob_patt_index_2 <- which(apply(allpatt, MARGIN = 1, FUN = function(x) 
  isTRUE(all.equal(as.vector(x), c("a", "a", "a", "a"))) ) )
unob_patt_indices <- sort(union(unob_patt_index_1, unob_patt_index_2)) #Ordered indices.
unob_patt <- allpatt[unob_patt_indices, ] #matrix of unique patterns

Finally, this matrix of unique patterns can be passed to the unobserved argument.

model4_f_corrected <- estimaterates(usertree = simdata4$tree, userphyl = filteredsimdata4, 
                                    unobserved = unob_patt,
                        alphabet = c("a", "c", "g", "t"), rootprob = "maxlik", 
                        ratevar = "discgamma", nocat = 4, 
                        modelmat = "ER")
print(model4_f_corrected)

## Call:
## estimaterates(usertree = simdata4$tree, userphyl = filteredsimdata4, 
##     unobserved = unob_patt, alphabet = c("a", "c", "g", "t"), 
##     modelmat = "ER", ratevar = "discgamma", nocat = 4, rootprob = "maxlik")
## 
##  4 taxa with 3588 discrete character patterns were fitted.
## -----------------------------------
## Groups of nodes with the same rates:
## [[1]]
## [1] 1 2 3 4 5 6 7
## 
## -----------------------------------
## Parameter Index Matrix
##    a  c  g  t
## a NA  1  1  1
## c  1 NA  1  1
## g  1  1 NA  1
## t  1  1  1 NA
## -----------------------------------
## wop 
## $rates
## , , 1
## 
##           [,1]
## [1,] 0.3433851
## 
## 
## $alpha
## [1] 1.980072
## 
## $alpharates
##           [,1]      [,2]     [,3]     [,4]
## [1,] 0.3165043 0.6813801 1.108845 1.893271
## 
## $iroot
## [1] 0.09291073 0.40803866 0.23921650 0.25983411
## 
## $se
## $se$rates
## , , 1
## 
##            [,1]
## [1,] 0.01159982
## 
## 
## $se$alpha
## [1] 0.3408542
## 
## $se$iroot
## [1] 0.01029943 0.01671436 0.02612932
## 
## 
## -----------------------------------
## Log-likelihood for model wop : -12267.8 
## AIC           for model wop : -24545.6 
## BIC           for model wop : -24576.53 
## -----------------------------------
## Time taken: 1.906 seconds.

##Illustration 5: partition models with separate rate variation in each partition.

Here, simulated data akin to codon data is analyzed. The data can be organized into triplets, with the first two positions following the same substitution rates different from the third position. The data was simulated under the assumption of gamma rate variation with the shape parameters set to $2$ and $0.75$ for the set of first two positions and last position, respectively. The overall rate was $0.33$. The ancestral sequence had the following character frequencies: (0.15, 0.35, 0.25, 0.25)'. If a common gamma distribution over partitions is assumed, “discgamma” should be supplied as the value for argument ratevar. Here, for the model to be fit, a different gamma rate distribution is assumed in each partition using “partitionspecificgamma” as the value for argument ratevar. Hence, a list of two components needs to be specified for argument partition. This can easily be done, for example, the following can be used to specify partitions for $15$ sites in total with the first (second) component being a vector of the first two (last) positions of triplets.

list((1:15)[-seq(3, 15, by = 3)], seq(3, 15, by = 3) )

## [[1]]
##  [1]  1  2  4  5  7  8 10 11 13 14
## 
## [[2]]
## [1]  3  6  9 12 15

Moreover, the root probabilities are estimated using maximum likelihood:

data(simdata5)
print(dim(simdata5$data))

## [1] 6000    4

print(table(simdata5$data))

## 
##    a    c    g    t 
## 3664 8274 6206 5856

model5 <- estimaterates(usertree = simdata5$tree, userphyl = simdata5$data, 
                        alphabet = c("a", "c", "g", "t"), rootprob = "maxlik", 
                        partition = list((1:6000)[-seq(3, 6000, by = 3)], 
                                         seq(3, 6000, by = 3) ),
                        ratevar = "partitionspecificgamma", nocat = 4, 
                        modelmat = "ER")
print(model5)

## Call:
## estimaterates(usertree = simdata5$tree, userphyl = simdata5$data, 
##     alphabet = c("a", "c", "g", "t"), modelmat = "ER", partition = list((1:6000)[-seq(3, 
##         6000, by = 3)], seq(3, 6000, by = 3)), ratevar = "partitionspecificgamma", 
##     nocat = 4, rootprob = "maxlik")
## 
##  4 taxa with 6000 discrete character patterns were fitted.
## -----------------------------------
## Groups of nodes with the same rates:
## [[1]]
## [1] 1 2 3 4 5 6 7
## 
## -----------------------------------
## Parameter Index Matrix
##    a  c  g  t
## a NA  1  1  1
## c  1 NA  1  1
## g  1  1 NA  1
## t  1  1  1 NA
## -----------------------------------
## wop 
## $rates
## , , 1
## 
##           [,1]
## [1,] 0.3222489
## 
## , , 2
## 
##           [,1]
## [1,] 0.3289605
## 
## 
## $alpha
## [1] 1.6269658 0.5764479
## 
## $alpharates
##            [,1]      [,2]      [,3]     [,4]
## [1,] 0.26602923 0.6398519 1.1043454 1.989773
## [2,] 0.04506375 0.3311370 0.9676053 2.656194
## 
## $iroot
## [1] 0.1219808 0.3829441 0.2560083 0.2390669
## 
## $se
## $se$rates
## , , 1
## 
##             [,1]
## [1,] 0.009451127
## 
## , , 2
## 
##            [,1]
## [1,] 0.01669982
## 
## 
## $se$alpha
## [1] 0.24756756 0.06857752
## 
## $se$iroot
## [1] 0.005096066 0.007472415 0.006748037
## 
## 
## -----------------------------------
## Log-likelihood for model wop : -20718.19 
## AIC           for model wop : -41450.38 
## BIC           for model wop : -41497.27 
## -----------------------------------
## Time taken: 7.685 seconds.

##Illustration 6: importing data and fitting a birth death model. Import some simulated data and tree from the web. There are 5000 discrete character patterns (four states in total: 1, 2, 3, and 4) for ten taxa. The data was simulated using a birth death model with rates 1 and 2 and equal base frequencies.

mp_data <- read.table("https://raw.githubusercontent.com/ujdang/miscellaneous/master/mp_example_data", header = TRUE)
mp_tree <- ape::read.tree("https://raw.githubusercontent.com/ujdang/miscellaneous/master/mp_example_tree")
mp_model <- estimaterates(usertree = mp_tree, userphyl = mp_data,
                        alphabet = c(1, 2, 3, 4), rootprob = "equal",
                        modelmat = "BD")
print(mp_model)

## Call:
## estimaterates(usertree = mp_tree, userphyl = mp_data, alphabet = c(1, 
##     2, 3, 4), modelmat = "BD", rootprob = "equal")
## 
##  10 taxa with 4999 discrete character patterns were fitted.
## -----------------------------------
## Groups of nodes with the same rates:
## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
## 
## -----------------------------------
## Parameter Index Matrix
##    1  2  3  4
## 1 NA  2  0  0
## 2  1 NA  2  0
## 3  0  1 NA  2
## 4  0  0  1 NA
## -----------------------------------
## wop 
## $rates
## , , 1
## 
##           [,1]
## [1,] 0.9785944
## [2,] 1.9719975
## 
## 
## $se
## $se$rates
## , , 1
## 
##            [,1]
## [1,] 0.01491092
## [2,] 0.02187034
## 
## 
## 
## -----------------------------------
## Log-likelihood for model wop : -50189.78 
## AIC           for model wop : -100383.6 
## BIC           for model wop : -100396.6 
## -----------------------------------
## Time taken: 4.942 seconds.

Of course, we could also have estimated the root probabilities easily as seen before.

mp_model_2 <- estimaterates(usertree = mp_tree, userphyl = mp_data,
                        alphabet = c(1, 2, 3, 4), rootprob = "maxlik",
                        modelmat = "BD")
print(mp_model_2)

## Call:
## estimaterates(usertree = mp_tree, userphyl = mp_data, alphabet = c(1, 
##     2, 3, 4), modelmat = "BD", rootprob = "maxlik")
## 
##  10 taxa with 4999 discrete character patterns were fitted.
## -----------------------------------
## Groups of nodes with the same rates:
## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
## 
## -----------------------------------
## Parameter Index Matrix
##    1  2  3  4
## 1 NA  2  0  0
## 2  1 NA  2  0
## 3  0  1 NA  2
## 4  0  0  1 NA
## -----------------------------------
## wop 
## $rates
## , , 1
## 
##           [,1]
## [1,] 0.9784229
## [2,] 1.9720672
## 
## 
## $iroot
## [1] 0.2504584 0.2459896 0.2544246 0.2491274
## 
## $se
## $se$rates
## , , 1
## 
##            [,1]
## [1,] 0.01513158
## [2,] 0.02227382
## 
## 
## $se$iroot
## [1] 0.007568207 0.008733148 0.008764533
## 
## 
## -----------------------------------
## Log-likelihood for model wop : -50189.62 
## AIC           for model wop : -100389.2 
## BIC           for model wop : -100421.8 
## -----------------------------------
## Time taken: 17.719 seconds.

Notice the slightly better log-likelihood (at the MLEs) value for the latter model. However, in this case, both the BIC and the AIC (higher is better here) indicate that the former model is the better fitting (more parsimonious) model.

Specifying custom modelmat

Can be one of two options: pre-built or user-specified. For the pre-built matrices, a four state matrix is used for the following examples. The dimensionality of the matrix is automatically inferred from the alphabet option.

“ER” for an equal rates matrix, e.g.,

matrix(c(NA, 1, 1, 1, 1, NA, 1, 1, 1, 1, NA, 1, 1, 1, 1, NA
), nrow = 4, ncol = 4)

##      [,1] [,2] [,3] [,4]
## [1,]   NA    1    1    1
## [2,]    1   NA    1    1
## [3,]    1    1   NA    1
## [4,]    1    1    1   NA

“SYM” for a symmetric matrix, e.g.,

matrix(c(NA, 1, 2, 3, 1, NA, 4, 5, 2, 4, NA, 6, 3, 5, 6, NA
), 4,4)

##      [,1] [,2] [,3] [,4]
## [1,]   NA    1    2    3
## [2,]    1   NA    4    5
## [3,]    2    4   NA    6
## [4,]    3    5    6   NA

“ARD” for an all rates different matrix, e.g.,

matrix(c(NA, 1, 2, 3, 4, NA, 5, 6, 7, 8, NA, 9, 10, 11, 12, 
NA), 4, 4)

##      [,1] [,2] [,3] [,4]
## [1,]   NA    4    7   10
## [2,]    1   NA    8   11
## [3,]    2    5   NA   12
## [4,]    3    6    9   NA

“GTR” for a general time reversible model. This sets reversible to be TRUE, rootprob to be “maxlik”, and the rate matrix supplied to the function is symmetric. The estimated rate matrix can be written as the product of a symmetric matrix multiplied by a diagonal matrix (consisting of the likelihood estimated root probabilities).
“BD” for a standard birth death matrix, e.g.,

matrix(c(NA, 1, 0, 0, 2, NA, 1, 0, 0, 2, NA, 1, 0, 0, 2, NA
), 4, 4)

##      [,1] [,2] [,3] [,4]
## [1,]   NA    2    0    0
## [2,]    1   NA    2    0
## [3,]    0    1   NA    2
## [4,]    0    0    1   NA

“BDER” for a standard birth death matrix, e.g.,

matrix(c(NA, 1, 0, 0, 1, NA, 1, 0, 0, 1, NA, 1, 0, 0, 1, NA
), 4, 4)

##      [,1] [,2] [,3] [,4]
## [1,]   NA    1    0    0
## [2,]    1   NA    1    0
## [3,]    0    1   NA    1
## [4,]    0    0    1   NA

“BDSYM” for a standard birth death matrix, e.g.,

matrix(c(NA, 1, 0, 0, 1, NA, 2, 0, 0, 2, NA, 3, 0, 0, 3, NA
), 4, 4)

##      [,1] [,2] [,3] [,4]
## [1,]   NA    1    0    0
## [2,]    1   NA    2    0
## [3,]    0    2   NA    3
## [4,]    0    0    3   NA

“BDARD” for an all rates different birth death matrix, e.g.,

matrix(c(NA, 1, 0, 0, 4, NA, 2, 0, 0, 5, NA, 3, 0, 0, 6, NA
), 4, 4)

##      [,1] [,2] [,3] [,4]
## [1,]   NA    4    0    0
## [2,]    1   NA    5    0
## [3,]    0    2   NA    6
## [4,]    0    0    3   NA

A square matrix can also be input that consists of integer values denoting the indices of the substitution rates to be estimated. Any arbitrary square matrix can be supplied where the zero entries denote substitutions that are not permitted, and the unique non-zero integer entries denote the rates that must be estimated.

In the following example inspired by gene copy loss, genes are lost (going from 1 copy to none, or 2 copies to 1) at the same instantaneous rate characterized by the index 1. On the other hand, losing both copies instantaneously is characterized by its own separate rate indexed by 2. Note that gene gain is not allowed here in this hypothetical example.

m <- matrix(c(NA, 1, 2, 0, NA, 1, 0, 0, NA), 3, 3)
rownames(m) <- colnames(m) <- c("Absent", "1 copy", "2 copies")
print(m)

##          Absent 1 copy 2 copies
## Absent       NA      0        0
## 1 copy        1     NA        0
## 2 copies      2      1       NA

Birth-death matrices can also be easily supplied by using the modelmat options (“BD”,“BDER”,“BDSYM”, and “BDARD”) discussed above. However, they are also easily built. In the following, a transition from character state 2 (row) to character state 1 (column) happens at a unique instantaneous rate (the death rate) denoted by the matrix entry 1. Similarly, the transition from character states 3 to 2, 4 to 3, or 5 to 4 are all characterized by the same instantaneous rate. Similarly, a transition from character state 1 to character state 2 (or 2 to 3, or 3 to 4, or 4 to 5) happens at a unique instantaneous rate (the birth rate).

m <- matrix(0, 5, 5)
diag(m) <- NA
diag(m[-1, ]) <- 1
diag(m[, -1]) <- 2
print(m)

##      [,1] [,2] [,3] [,4] [,5]
## [1,]   NA    2    0    0    0
## [2,]    1   NA    2    0    0
## [3,]    0    1   NA    2    0
## [4,]    0    0    1   NA    2
## [5,]    0    0    0    1   NA

References:

Felsenstein, J. (1992). Phylogenies from restriction sites: A maximum-likelihood approach. Evolution, 159–173.
Lewis, P. O. (2001). A likelihood approach to estimating phylogeny from discrete morphological character data. Systematic Biology, 50, 913–925.
Hao, W., & Golding, G. B. (2006). The fate of laterally transferred genes: life in the fast lane to adaptation or death. Genome Research, 16, 636–643.
Cohen, O., & Pupko, T. (2010). Inference and characterization of horizontally transferred gene families using stochastic mapping. Molecular Biology and Evolution, 27, 703–713.

Markov Chain Analysis on Phylogenetic Trees

Utkarsh J. Dang and G. Brian Golding

2023-09-19

Contact

Installation

CRAN

Source package from CRAN

Source package from file

OS X Yosemite

Windows

Illustration 1

Fitting a simple model and evaluating the results.

Correcting for unobservable patterns

Correcting for unobservable patterns

Specifying custom modelmat