In this vignette, we explain how one can estimate the marginal likelihood using the LoRaD method (Wang et al. 2023).

Two DNA sequences, each of length 200 sites, were simulated under the K80 substitution model (Kimura 1980). This model has two parameters:

**transition/transversion rate ratio**, \(\kappa\): the ratio of the instantaneous rate of transition-type substitutions (A \(\leftrightarrow\)G, C\(\leftrightarrow\)T) to the instantaneous rate of transversion-type substitutions (A\(\leftrightarrow\)C, A\(\leftrightarrow\)T, G\(\leftrightarrow\)C, G\(\leftrightarrow\)T).**edge length**, \(v\) : the evolutionary distance between the two sequences measured in expected number of substitutions per site.

A continuous-time, 4-state Markov model was used for both simulation and analysis. The instantaneous rate matrix \({\bf Q}\) that defines the K80 substitution model is

\[\begin{align} {\bf Q} &= \left[ \begin{array}{cccc} -\beta(\kappa + 2) & \beta & \kappa \beta & \beta \\ \beta & -\beta(\kappa + 2) & \beta & \kappa \beta \\ \kappa \beta & \beta & -\beta(\kappa + 2) & \beta \\ \beta & \kappa \beta & \beta & -\beta(\kappa + 2) \\ \end{array} \right], \end{align}\]where the order of (e.g. DNA) states for both rows and columns is A, C, G, T. The matrix of transition probabilities can be obtained by exponentiating the product of \({\bf Q}\) and time \(t\):

\[\begin{align} {\bf P} &= e^{{\bf Q}t}. \end{align}\]The product of the base rate \(\beta\) and time \(t\) can be determined from the edge length parameter \(v\) and transition/transversion rate ratio parameter \(\kappa\). The edge length is the product of the overall substitution rate (\(0.25 (8 \beta + 4 \kappa \beta)\)) and time (\(t\)), yielding

\[\begin{align} v &= 2 \beta t + \kappa \beta t \\ \beta t &= \frac{v}{2 + \kappa} \end{align}\]The transition probabilities may be used to simulate data for one of the two sequences given the other sequence. The state for each site in the starting sequence is drawn from the stationary distribution \(\pi_A=\pi_C=\pi_G=\pi_T=0.25\) implied by \({\bf Q}\).

Given simulated sequence data \({\bf
D}\), the joint posterior distribution \(p(v, \kappa|{\bf D})\) was sampled using
MCMC, with likelihoods computed using the K80 model and using
Exponential priors with mean equal to 50 for both \(v\) and \(\kappa\). The posterior sample (10,000
points sampled 100 iterations apart) after a burn-in of 1000 iterations
was saved tab-delimited text file *k80-samples.txt*. In the file
there are four columns, iter (MCMC-iteration), log-kernel (log posterior
kernel), edgelen (sampled edge length), and kappa (sampled kappa
parameter).

In this example we will compute an estimate of the marginal likelihood under this model using the LoRaD method.

First, load the lorad package.

Read in the file containing the posterior sample.

Next, we must create a named vector to associate each column name in
`k80samples`

with a column specification. Here are the
possible column specifications:

**iteration**: The MCMC iteration**posterior**: The values in this column are components of the posterior (log scale)**nonnegative**: The parameter has non-negative support \([0,\infty)\)**unconstrained**: The parameter has support \((-\infty,\infty)\)**ignore**: The column should be ignored

All columns labeled **posterior** will be summed to
create the log posterior kernel (sum of log likelihood and log joint
prior).

Here are the column specifications appropriate for this example.

Column | Specification |
---|---|

iter | iteration |

log.kernel | posterior |

edgelen | positive |

kappa | positive |

```
# Create a named vector holding the column specifications
colspec <- c("iter"="iteration", "log.kernel"="posterior", "edgelen"="positive", "kappa"="positive")
```

The LoRaD requires all parameters to be unconstrained in their
support. Thus, the `positive`

column specification for both
`edgelen`

and `kappa`

results in a log
transformation prior to application of the LoRaD method.

To run the LoRaD function we need to specify the **training
fraction**, the **training sample selection
method**, and the **coverage fraction**. The
training fraction and the coverage fraction must be between 0 and 1. The
training sample selection method can be *left* (the first part of
the sample), *right* (the second part of the sample), or
*random* (a random part of the sample).

For this example we used 0.5 for the training fraction, 0.1 for the coverage fraction, and “left” for the training sample selection method.

```
results <- lorad_estimate(k80samples, colspec, 0.5, "left", 0.1)
lorad_summary(results)
#> This is lorad 0.0.1.0
#> Parameter sample comprises 10000 sampled points
#> Each sample point is a vector of 2 parameter values
#>
#> Training fraction is 0.5
#> Coverage specified is 0.1
#>
#> Partitioning samples into training and estimation:
#> Sample size is 10000
#> Training sample size is 5000
#> Estimation sample size 5000
#>
#> Processing training sample...
#> Lowest radial distance is 0.464997772
#> Log Delta -2.278161285
#>
#> Processing estimation sample...
#> Number of samples used is 510
#> Nominal coverage specified is 0.100000
#> Realized coverage is 0.102000
#> Log marginal likelihood is -460.822385
#>
```

For comparison, the two log marginal likelihood estimates reported by Wang et al. (2023) were -460.82239 (LoRaD method) and -460.86154 (Steppingstone method).

Kimura, Motoo. 1980. “A Simple Method for Estimating Evolutionary
Rates of Base Substitutions Through Comparative Studies of Nucleotide
Sequences.” *Journal of Molecular Evolution* 16 (2):
111–20. https://doi.org/10.1007/BF01731581.

Wang, Yu-Bo, Analisa Milkey, Aolan Li, Ming-Hui Chen, Lynn Kuo, and Paul
O. Lewis. 2023. “LoRaD: Marginal Likelihood Estimation with Haste
(but No Waste).” *Systematic Biology* 72 (3): 639–48. https://doi.org/10.1093/sysbio/syad007.