The TestNH Manual

This is the manual of the TestNH package, version 3.0.0.

1 Introduction
- 1.1 Description of the programs
- 1.2 How to run the programs
2 Substitution mapping with MapNH
- 2.1 Data import
- 2.2 Substitution mapping

1 Introduction

TestNH is a package for studying the non-homogeneous process of sequence evolution. It is built on the Bio++ libraries, and uses the command line syntax common to the Bio++ Program Suite (BppSuite). This manual therefore links to the corresponding sections of the BppSuite manual where needed, and only describes options specific to the TestNH package.

Note that several detailed examples are provided along with the source code of the program, and can serve as good starting points. This manual intends to provide an exhaustive description of the options used in these examples.

1.1 Description of the programs

The TestNH package contains one program:

‘mapnh’: maps substitutions onto a phylogenetic tree, and counts types of substitutions per branch of the tree. The resulting counts are used as input of a clustering procedure to output groups of branches with similar substitution processes.

1.2 How to run the programs

All programs in the TestNH package follow the ‘bppSuite’ syntax. They are command line driven, and take as input options with the form ‘name’=‘value’. These options can be gathered into a file, and loaded using param=optionfile. Please refer to the Bio++ Program Suite manual for more details, including the use of variables, priority of option values, etc.
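As a minimal sketch of the syntax described above, the following Python helper formats a set of options as name=value lines and builds the matching command line. The helper names and the file name mapnh.bpp are illustrative assumptions; only the name=value form and the param=optionfile argument come from this manual.

```python
def format_options(options):
    """Return the text of an option file: one name=value line per option."""
    return "".join(f"{name}={value}\n" for name, value in options.items())


def mapnh_command(option_file):
    """Build the argument list that loads options from a file (param=optionfile)."""
    return ["mapnh", f"param={option_file}"]


# Options taken from this manual; the values are only examples.
options = {
    "map.type": "TsTv",
    "output.counts": "PerType(file=mapping_per_type)",
}

option_text = format_options(options)
command = mapnh_command("mapnh.bpp")  # hypothetical option file name

print(option_text, end="")
print(" ".join(command))
```

Writing option_text to the chosen file and passing command to subprocess.run would then launch the program, provided mapnh is installed and the remaining input options are set.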

2 Substitution mapping with MapNH

2.1 Data import

MapNH takes as input a sequence alignment, as described in the bppSuite manual (Sequences in Bio++ Program Suite manual). It then performs substitution mapping to count substitutions for each site of the alignment and each branch of a phylogenetic tree, input using the bppSuite syntax (Tree in Bio++ Program Suite manual).

2.2 Substitution mapping

The substitution mapping procedure requires a model of sequence evolution. As the procedure is robust to the type of model used, a Jukes-Cantor model is used by default. It is however recommended to use a less coarse model whenever possible (particularly for large alphabets like codon alphabets). All non-mixed models available in bppSuite are supported (Model in Bio++ Program Suite manual). A homogeneous model (like GTR for nucleotides, JTT92 for proteins and YN98 for codons) is usually a good start. Non-homogeneous models are also supported, mainly for a posteriori validation of mapping robustness (to be used with PartNH for instance).

MapNH can perform several types of substitution mapping, which determine which types of substitutions are counted and used for clustering branches. This is specified with the option:

map.type = {register described}

A description of the register to use.

The types of substitutions to map are:

All

Maps all n(n-1) possible substitutions, where n is the alphabet size. This option should only be used for small alphabets like DNA or RNA, as it uses a large amount of memory and dilutes the information.

Total

Counts the total number of substitutions.

Selected

Maps substitutions as defined in a list. This list is built as:

substitution.list = (Ts:A->G, G->A, C->T, T->C) 
(Tv: A->C, A->T, T->A, C->A, G->C, G->T, C->G, T->G)

Each group of substitutions is delimited by parentheses. The group name, if any, is given at the start of the group and followed by ":". Substitutions are separated by ",", and each substitution is written with a "->" symbol.

GC {alphabet=nucleotides or codons}

Maps two types of substitutions: ‘AT to GC’ and ‘GC to AT’. With a codon alphabet, only synonymous substitutions are considered (otherwise see also the SW option). This option takes an optional argument specifying whether the counts should be corrected for nonstationarity: GC(stationarity=no) (yes by default) will normalize the counts by the ancestral frequencies of the corresponding node.

TsTv {alphabet=nucleotides or codons}

Counts transitions (type 1) and transversions (type 2).

SW {alphabet=nucleotides or codons}

Counts substitutions between or within GC vs AT Watson-Crick bonds, i.e. whether the bond is strong (GC pair) or weak (AT pair). The type numbers are 1: S->S, 2: S->W, 3: W->S, 4: W->W.

DnDs {alphabet=codons}

Counts synonymous (type 1) and nonsynonymous (type 2) substitutions.

IntraAA {alphabet=codons}

Intra amino-acid substitutions (type following the AA alphabetic order).

InterAA {alphabet=codons}

Inter amino-acid substitutions (in both directions).

KrKc {alphabet=proteins or codons}

Counts conservative (type 1) or non-conservative (type 2) substitutions.

Combination(reg1={map.type}, reg2={map.type}, ...)

Counts combinations of substitution types.
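The registers above can be illustrated with a short Python sketch. It parses the substitution.list syntax of the Selected register and assigns the type numbers given for TsTv and SW on nucleotides. This is an illustration of the conventions described in this manual, not the code used by MapNH; the function names are hypothetical, and numbering unnamed groups by position is an assumption.

```python
import re

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
STRONG = {"G", "C"}  # bases forming a strong (GC) pair


def parse_substitution_list(text):
    """Parse '(name: X->Y, ...)(...)' into {name: [(source, target), ...]}."""
    groups = {}
    for index, body in enumerate(re.findall(r"\(([^()]*)\)", text), start=1):
        name, separator, rest = body.partition(":")
        if not separator:
            # Unnamed group: fall back to its position (assumption).
            name, rest = str(index), body
        pairs = []
        for item in rest.split(","):
            if item.strip():
                source, target = item.split("->")
                pairs.append((source.strip(), target.strip()))
        groups[name.strip()] = pairs
    return groups


def tstv_type(source, target):
    """Return 1 for a transition and 2 for a transversion."""
    return 1 if (source, target) in TRANSITIONS else 2


def sw_type(source, target):
    """Return 1: S->S, 2: S->W, 3: W->S, 4: W->W."""
    types = {(True, True): 1, (True, False): 2,
             (False, True): 3, (False, False): 4}
    return types[(source in STRONG, target in STRONG)]


EXAMPLE = """(Ts:A->G, G->A, C->T, T->C)
(Tv: A->C, A->T, T->A, C->A, G->C, G->T, C->G, T->G)"""

groups = parse_substitution_list(EXAMPLE)
for name, pairs in groups.items():
    print(name, len(pairs), sorted({tstv_type(s, t) for s, t in pairs}))
```

The two groups of the example list together contain 12 substitutions, which is the n(n-1) count of the All register for a four-letter alphabet.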

Optionally, these counts can be weighted by a weight or a distance assigned to substitutions:

map.type=Total(distance=Diff(index1=Charge))

The options are:

weight: Each count between states i and j is multiplied by a given weight (which can be negative).
distance: Each substitution rate between states i and j is multiplied by a given distance (which can be negative).

The difference between the two options is that weights are applied after the counts have been computed, and therefore do not take into account the intermediate states between the first and last states. Distances, on the contrary, are used inside the computation, and apply to the intermediate states.

These statistics are described in the ‘bppSuite’ documentation. See https://pbil.univ-lyon1.fr/bpp-doc/bppsuite/bppsuite.html#Index2

output.counts={output type}

Describes the type of outputs. There are several types:

Per type, as several newick trees with counts as branch lengths for each type. Counts are summed over all sites.
Per branch and per site, as a table with one row per site and one column per branch. Counts are summed over all types.
Per type and per site, as a table with one row per site and one column per type. Counts are summed over all branches.
Per site, per branch, per type, as several table files, one per type.

The corresponding options are:

PerType(file = {path}): With the prefix name for all counts tree files. Tree file for counts of type 1 will be named ‘prefix1’, for type 2 ‘prefix2’ and so on.
PerTypePerBranch(file = {path}, format = {Newick|NHX|Tsv}): If format=‘Tsv’, counts are written in a tsv file, otherwise they are written in specified tree format, with the prefix name for all counts tree files. Tree file for counts of type 1 will be named ‘prefix1’, for type 2 ‘prefix2’ and so on.
PerBranchPerSite(file = {path}): The file path indicates where the table should be stored.
PerSitePerType(file = {path}): The file path indicates where the table should be stored.
PerBranchPerSitePerType(file = {path}): With the prefix name for all table files. Table file for counts of type 1 will be named ‘prefix1’, for type 2 ‘prefix2’ and so on.

The distinct outputs can be combined as a list, for instance:

output.counts=PerType(file=mapping_per_type),\
              PerBranchPerSite(file=mapping_per_site.txt)

Counts can be normalized by the counts that would have been obtained under another model, on the same history as the one described by the main model. For this, use the option:

nullProcessParams = {list{<chars>=<values>}}

to assign a list of parameter values that define the normalization model from the main one. The ’*’ wildcard can be used, as in *theta* for all the parameters whose name contains theta.

For example, to normalize by the counts performed by a neutral model, in YN98 modeling (typically for dN/dS):

nullProcessParams = YN98.omega*=1
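The effect of the wildcard can be sketched in Python with shell-style pattern matching. This assumes that the ’*’ wildcard behaves like the one in Python's fnmatch module, which this manual does not state; the parameter names below are hypothetical and only serve as an illustration.

```python
from fnmatch import fnmatchcase


def matching_parameters(pattern, names):
    """Return the parameter names matched by a '*' wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Hypothetical parameter names, for illustration only.
names = ["YN98.kappa", "YN98.omega", "YN98.1_Full.theta", "YN98.1_Full.theta1"]

print(matching_parameters("*theta*", names))
print(matching_parameters("YN98.omega*", names))
```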

To obtain raw counts and normalizations separately, use the splitNorm=True option in the output.counts options:

output.counts = PerBranchPerType(file=$(REP)/$(DATA).counts_,\
                splitNorm=True)

In this case, an additional file with suffix _norm is output for normalizations, while regular output contains raw counts.

Based on these counts, MapNH can perform a global test to assess whether there is heterogeneity between branches:

test.global = {boolean}

Tells whether global tests should be performed. If yes, two tests are performed: a chi-square test on a contingency table, and a multinomial test. Note that both tests are indicative only, as the assumptions made for computing the p-values may be incorrect.
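For reference, the textbook Pearson chi-square statistic on a branch-by-type contingency table can be sketched as follows. This is a generic illustration under the assumption that rows are branches and columns are substitution types; it is not necessarily the exact computation performed by MapNH, and the counts in the example are made up.

```python
def chi_square_statistic(table):
    """Return (statistic, degrees of freedom) for a table of counts.

    Rows are branches and columns are substitution types (assumption).
    """
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if all branches shared the same type proportions.
            expected = row_totals[i] * column_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom


# Made-up counts for two branches and two substitution types.
counts = [[10.0, 20.0],
          [20.0, 10.0]]
statistic, degrees_of_freedom = chi_square_statistic(counts)
print(statistic, degrees_of_freedom)
```

Counts obtained by substitution mapping are generally not integers, which is one reason why p-values derived from such a statistic should be read as indicative only.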

manageUnresolved= {Zero|One|Average}

Describes how unresolved characters are managed in counts:

Zero : all counts towards those characters are omitted (default);
One : all unresolved characters are considered as present as normal characters (i.e. sum of counts towards these characters);
Average : the counts towards unresolved characters are averaged (i.e. mean of counts towards these characters).
