nhPhyML - Non-homogeneous Maximum Likelihood tree reconstruction.

nhPhyML is a program built to compute phylogenetic trees under the non-stationary, non-homogeneous model of DNA sequence evolution of Galtier and Gouy (1998).
As such, it provides estimates of the G+C contents of ancient sequences.
It uses the algorithmic structure of PhyML (Guindon and Gascuel, 2003) adapted to the rooted and irreversible case (many thanks to Stéphane Guindon for providing PhyML source code).
The program can also be downloaded from the PhyML homepage, under "PHYML unofficial versions".
It is provided as is and should be used with appropriate care.
Please report any bugs to my address: bastien.boussau@univ-lyon1.fr.
It has been tested on Unix and Linux systems.
For more information about my work, see my lab page.

If you are not interested in reconstructing a phylogenetic tree, but rather in studying the pattern of substitutions along a fixed topology and reconstructing ancestral sequences, I suggest you take a look at Bio++ and the BppSuite suite of software.

nhPhyML needs a rooted tree (the root is never moved during the tree space search) and a sequence file in phylip format. Note that if the tree is not rooted, the program crashes with a "Seg Fault".
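Since an unrooted starting tree makes the program crash, it can be worth checking the tree file before a run. The snippet below is only a rough sketch and is not part of nhPhyML: it assumes that "rooted" means the outermost node of the newick string has exactly two subtrees, and it does not handle quoted labels or bracketed comments. The function names are made up for the example.

```python
def count_root_children(newick: str) -> int:
    """Count the subtrees attached directly to the outermost node of a newick string."""
    text = newick.strip()
    start = text.find("(")
    if start == -1:
        raise ValueError("no opening parenthesis found: not a newick tree")
    depth = 0
    children = 1
    for char in text[start:]:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth == 0:
                break  # closed the outermost node, stop scanning
        elif char == "," and depth == 1:
            children += 1  # a comma at depth 1 separates two root subtrees
    if depth != 0:
        raise ValueError("unbalanced parentheses in newick string")
    return children


def looks_rooted(newick: str) -> bool:
    """Assumption: a rooted tree has exactly two subtrees at its outermost node."""
    return count_root_children(newick) == 2


if __name__ == "__main__":
    print(looks_rooted("((A:0.1,B:0.2):0.05,(C:0.1,D:0.1):0.05);"))
    print(looks_rooted("(A:0.1,B:0.2,(C:0.1,D:0.1):0.05);"))
```

A tree with three subtrees at the top level (the usual unrooted representation) is reported as not rooted.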

Phylip-like interface

Go to the nhPhyML installation directory and type:

./nhPhyml

You are then presented with a phylip-like (and PhyML-like) interface which asks for self-explanatory information, such as the number of rate categories for the gamma distribution or whether the transition/transversion ratio should be optimized.

Command line

You can also run nhPhyML directly from the command line:

./nhPhyml -sequences=SequenceFile -tree=TreeFile -format=i -positions=123 -tstv=e -rates=8 -alpha=e -topology=e -outseqs=y -eqfreq=lim -numeqfreq=5 -treefile=Treefile

where:

SequenceFile is the sequence file in phylip format,

TreeFile is the starting tree file in bracketed (newick) format,

-format=i specifies the phylip interleaved format (use s for the phylip sequential format),

-positions=123 means that all the positions in the codons are used (could also be 1, 12, 13, 2, 23, or 3),

-tstv=e tells the program that the transition/transversion ratio needs to be estimated (put a value otherwise),

-rates=8 is the number of rate categories for the discretized gamma distribution,

-alpha=e means that the gamma distribution parameter alpha is to be estimated (put a value otherwise),

-topology=e means that we want to optimize both the topology and the branch lengths; putting k (for keep) optimizes only the branch lengths while keeping the topology,

-eqfreq=lim means we want to use the nhPhyML-Discrete version of nhPhyML, in which each branch "chooses" among a limited set of G+C equilibrium frequencies.
In the default version, specified by -eqfreq=unlim, the G+C equilibrium frequency is optimized separately for each branch. This results in some convergence problems (the true topology is found less often than with -eqfreq=lim).

-numeqfreq=5 : if you use a limited set of equilibrium frequencies (-eqfreq=lim), you need to specify how many equilibrium frequencies you want to use. This number is important: too small, and the process of evolution might not be modelled correctly; too large, and the tree space exploration becomes inefficient. Please refer to the article for more details. Recent addition (20/07/2012): you can now specify lower and upper bounds for the limited set of equilibrium frequencies, for instance with the options -eqfreqlow=0.2 -eqfrequpp=0.8.
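To make the -positions option listed above more concrete, here is a small illustration of what selecting codon positions means. It is not taken from the nhPhyML source. It assumes that the alignment is in frame, that is, that the first column is a first codon position. The function name is hypothetical.

```python
def select_codon_positions(sequence: str, positions: str = "123") -> str:
    """Keep only the requested codon positions of an in-frame sequence.

    `positions` follows the README notation: "1", "12", "13", "2", "23", "3" or "123".
    """
    wanted = {int(p) for p in positions}
    if not wanted or not wanted <= {1, 2, 3}:
        raise ValueError("positions must be a combination of 1, 2 and 3")
    # Column i (0-based) is codon position (i % 3) + 1 under the in-frame assumption.
    return "".join(base for i, base in enumerate(sequence) if (i % 3) + 1 in wanted)


if __name__ == "__main__":
    print(select_codon_positions("ATGCCGTAA", "3"))
    print(select_codon_positions("ATGCCGTAA", "12"))
```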

Only the sequence file and the tree file are mandatory. Default values for the other parameters are:
-format=i -positions=123 -tstv=e -rates=1 -topology=e -outseqs=n -eqfreq=unlim
With a single rate category (-rates=1), no alpha parameter is used.
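If you launch many runs from a script, the command line can be assembled programmatically. The following is a minimal sketch, not an official wrapper: it only uses option names documented in this file, the function name build_command is made up, and the check that -eqfreq=lim comes with -numeqfreq simply mirrors the description above.

```python
def build_command(sequence_file, tree_file, executable="./nhPhyml", **options):
    """Build an nhPhyml argument list from keyword options (e.g. rates=8, eqfreq="lim")."""
    opts = {name: str(value) for name, value in options.items()}
    # The README states that a limited set of frequencies needs its size specified.
    if opts.get("eqfreq") == "lim" and "numeqfreq" not in opts:
        raise ValueError("-eqfreq=lim requires -numeqfreq to be specified")
    args = [executable, f"-sequences={sequence_file}", f"-tree={tree_file}"]
    args += [f"-{name}={value}" for name, value in opts.items()]
    return args


if __name__ == "__main__":
    cmd = build_command("seqs.phy", "start.tree",
                        rates=8, alpha="e", eqfreq="lim", numeqfreq=5)
    print(" ".join(cmd))
```

The resulting list can be passed to Python's subprocess.run to start the program. Options that are left out fall back to the defaults listed above.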

The following options are less central but might be useful to some users.

-precision=0.0001 : sets the precision to 0.0001. When optimizing a parameter, if the likelihood difference between the former value of the parameter and its new value is below the precision value, the maximum is considered to be found and the optimization stops. The default value is 0.000001, which is fairly strict. Increasing this value decreases the computational time.
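The stopping rule behind -precision can be illustrated on a toy problem. The sketch below is not the optimizer used by nhPhyML. It is a plain gradient ascent on a made-up one-parameter function, and the names maximize, rate and h are assumptions of the example. It only shows how a larger precision value ends the search earlier.

```python
def maximize(log_lik, start, precision=1e-6, rate=0.1, h=1e-5, max_iter=10000):
    """Climb log_lik from `start`; stop when one step gains less than `precision`."""
    x = start
    current = log_lik(x)
    for iteration in range(1, max_iter + 1):
        # Central-difference estimate of the slope at x.
        slope = (log_lik(x + h) - log_lik(x - h)) / (2 * h)
        candidate = x + rate * slope
        new = log_lik(candidate)
        if new - current < precision:
            # Gain below the precision: the maximum is considered found.
            if new > current:
                x, current = candidate, new
            return x, current, iteration
        x, current = candidate, new
    return x, current, max_iter


def toy_log_lik(x):
    """Made-up log-likelihood with its maximum at x = 2."""
    return -(x - 2.0) ** 2


if __name__ == "__main__":
    for precision in (1e-2, 1e-6):
        x, value, iterations = maximize(toy_log_lik, start=0.0, precision=precision)
        print(precision, iterations, x)
```

With the looser precision the loop stops after fewer iterations, but further away from the maximum.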

-quick=y : the program skips the final optimization of the parameters. Hence the topology obtained is the most likely one found by the program, but parameters such as branch lengths are not fully optimized.

-gcvar=y : the root G+C content is not optimized; instead, a range of values is tried, and for each of these values the likelihood is maximized by optimizing the free parameters. This gives an idea of the variance of the root G+C content estimate. Moreover, the resulting LNF file can be used as input to CONSEL (Shimodaira and Hasegawa, 2001) to define a confidence interval with e.g. the AU test.

-gclow=0.50 : sets the lower limit of the root G+C contents to be tried to 0.50. Values tried are 0.51, 0.52, 0.53... until the upper limit is met.

-gcupp=0.70 : sets the upper limit of the root G+C contents to be tried to 0.70.
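As a rough illustration, the values scanned with -gcvar=y can be listed as follows. This sketch takes the description above literally: a step of 0.01, starting just above the lower limit. Whether the two limits themselves are evaluated is an assumption here (the upper one is included, the lower one is not), and the function name is hypothetical.

```python
def root_gc_grid(gclow=0.50, gcupp=0.70):
    """List the root G+C values tried between gclow and gcupp, in steps of 0.01."""
    # Work in integer hundredths to avoid accumulating floating-point error.
    low = round(gclow * 100)
    upp = round(gcupp * 100)
    return [i / 100 for i in range(low + 1, upp + 1)]


if __name__ == "__main__":
    print(root_gc_grid(0.50, 0.70))
```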

-outseqs=y means that we want to get the ancestral sequence at the root node (put n otherwise). This feature has not been tested and should not be used without extreme caution. When not all codon positions are used, the ancestral sequence cannot be reconstructed.

5 or 6 files are produced in the directory containing the "SequenceFile" :

- SequenceFile_nhPhyml.lk contains 2 lines : the first one gives the final likelihood of the output tree, the second one gives the final estimate of the root G+C content.

- SequenceFile_nhPhyml.out provides general information about the phylogenetic reconstruction, such as the input files, the options, the number of rate categories used, the final likelihood and how long the run took.

- SequenceFile_nhPhymlEq.tree is the final tree, on which the equilibrium G+C content of each branch is displayed in place of bootstrap values.

- SequenceFile_nhPhymlGC.tree is the same final tree, on which the G+C content at each node, except the root, is displayed in place of bootstrap values. For the G+C content of the root, please check the second line of SequenceFile_nhPhyml.lk.

- SequenceFile_nhPhyml.seq contains a text representation of the tree displaying the labels of all the nodes (numbers or sequence names). Then the present sequences are displayed, in a fasta-like format, together with the root sequence.

- SequenceFile_nhPhyml.lnf is a simplified PAML-like (Yang, 1997) file displaying the site likelihoods. It can be used as input to CONSEL (Shimodaira and Hasegawa, 2001) to compare various trees.
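The output files listed above can be collected from a script. The sketch below is only an illustration under two assumptions: that the suffixes are appended to the full name of the sequence file, and that each of the two lines of the .lk file contains the value as its last number (the exact layout of the lines is not described here). The function names are made up.

```python
import re
from pathlib import Path

# Suffixes as listed in this README.
SUFFIXES = ["_nhPhyml.lk", "_nhPhyml.out", "_nhPhymlEq.tree",
            "_nhPhymlGC.tree", "_nhPhyml.seq", "_nhPhyml.lnf"]

NUMBER = re.compile(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?")


def expected_outputs(sequence_file):
    """Return the output paths expected next to the sequence file."""
    path = Path(sequence_file)
    return [path.with_name(path.name + suffix) for suffix in SUFFIXES]


def last_number(line):
    """Extract the last number found on a line (assumed to be the value)."""
    found = NUMBER.findall(line)
    if not found:
        raise ValueError(f"no number found in line: {line!r}")
    return float(found[-1])


def read_lk(lk_path):
    """Read (final likelihood, root G+C content) from a .lk file."""
    lines = [l for l in Path(lk_path).read_text().splitlines() if l.strip()]
    if len(lines) < 2:
        raise ValueError("expected two lines in the .lk file")
    return last_number(lines[0]), last_number(lines[1])


if __name__ == "__main__":
    for output in expected_outputs("seqs.phy"):
        print(output)
```

Since 5 or 6 files are produced, a script should check which of these paths actually exist before reading them.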

Galtier N, Gouy M (1998). "Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis", Mol Biol Evol. 1998 Jul;15(7):871-9.

Guindon S, Gascuel O (2003). "A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood", Syst Biol. 2003 Oct;52(5):696-704.

Shimodaira H, Hasegawa M (2001). "CONSEL: for assessing the confidence of phylogenetic tree selection", Bioinformatics. 2001 Dec;17(12):1246-7.

Yang Z (1997). "PAML: a program package for phylogenetic analysis by maximum likelihood", Comput Appl Biosci. 1997 Oct;13(5):555-6.