nhPhyML is a program built to compute phylogenetic trees under the non-stationary, non-homogeneous model of DNA sequence evolution of Galtier and Gouy (1998). As such, it provides estimates of the G+C contents of ancient sequences. It uses the algorithmic structure of PhyML (Guindon and Gascuel, 2003), adapted to the rooted and irreversible case (many thanks to Stéphane Guindon for providing the PhyML source code).
The program can also be downloaded from the PhyML homepage
, under "PHYML
It is provided as is and should be used with appropriate care.
Please report any bug to my address :
It has been tested on Unix and Linux systems.
For more information about my work, see my
If you are interested not in reconstructing a phylogenetic tree but in studying the pattern of substitutions along a fixed topology and reconstructing ancestral sequences, I suggest you take a look at Bio++ and the BppSuite suite of software.
You can either download an executable file for Linux or download an archive containing all the source code. Example files can also be downloaded here.
Installation on Linux:
- If you have downloaded the executable file:
go to its directory,
make sure the file is executable (otherwise type chmod +x nhPhyml) and
- If you have downloaded the source code :
To install it, you
first need to extract it :
tar zxvf nhPhyml.tgz
Then, simply go
to the directory named "nhPhyml" and type :
There should now be an executable named "nhPhyml".
Installation on Mac OS X (thanks to Cedric Simillion for finding out how to install nhPhyML on Mac OS X)
- cd into the nhPhyml directory after extracting the archive
- remove the nhPhyml binary and all .o files that came with the archive
- open the Makefile in a text editor and remove the -static option from the CFLAGS line
- typing "make" now produces a working binary for OS X.
nhPhyml needs a rooted tree (the root is never moved during the tree space search) and a sequence file in phylip format. Note that if the tree is not rooted, the program crashes with a segmentation fault ("Seg Fault").
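Since an unrooted input tree leads to a crash, it can be worth checking the tree file before launching a run. The following is a minimal sketch in Python, not part of nhPhyML. It assumes that "rooted" means the outermost pair of parentheses encloses exactly two subtrees (an unrooted tree is usually written with three), and the function names are hypothetical.

```python
def count_root_children(newick: str) -> int:
    """Count the subtrees attached directly to the outermost parentheses."""
    text = newick.strip().rstrip(";").strip()
    if not text.startswith("("):
        raise ValueError("expected a parenthesised Newick tree")
    depth = 0
    children = 1
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        elif char == "," and depth == 1:
            # A comma at depth 1 separates two children of the root.
            children += 1
    if depth != 0:
        raise ValueError("unbalanced parentheses")
    return children


def looks_rooted(newick: str) -> bool:
    """Assumption: a rooted tree has exactly two subtrees at the top level."""
    return count_root_children(newick) == 2


if __name__ == "__main__":
    print(looks_rooted("((A:0.1,B:0.2):0.05,(C:0.1,D:0.3):0.05);"))
    print(looks_rooted("(A:0.1,B:0.2,(C:0.1,D:0.3):0.05);"))
```

The check ignores quoted labels containing commas or parentheses, which a full Newick parser would have to handle.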
Go to its
installation directory and type :
You are then presented with a phylip-like (and PhyML-like) interface which asks for self-explanatory information, such as the number of rate categories for the gamma law or whether the transition/transversion rate should be optimized.
You can also use
nhPhyml directly from the command line using :
-sequences=SequenceFile -tree=TreeFile -format=i -positions=123 -tstv=e
-rates=8 -alpha=e -topology=e -outseqs=y -eqfreq=lim -numeqfreq=5
SequenceFile is the sequence file in phylip format,
TreeFile is the starting tree file in bracketed (newick) format,
-format=i specifies the phylip interleaved format (can also be s for the phylip sequential format),
-positions=123 means that the user wants to use all the positions in the codons (could also be 1, 12, 13, 2, 23, or 3),
-tstv=e tells the program that the transition/transversion rate needs to be estimated (put a value otherwise),
-rates=8 is the number of rate categories for the discretized gamma distribution,
-alpha=e means that the gamma distribution parameter alpha is to be estimated (put a value otherwise),
-topology=e means that we want to optimize the topology and the branch lengths; putting k (for keep) would only optimize branch lengths while keeping the topology,
-eqfreq=lim means we want to use the nhPhyML-Discrete version of nhPhyML, in which each branch has the "choice" between a limited set of G+C equilibrium frequencies. In the default version, specified by -eqfreq=unlim, the G+C equilibrium frequency is optimized for each branch. This results in some convergence problems (the true topology is found less often than with -eqfreq=lim).
-numeqfreq=5 : in case you use a limited set of equilibrium frequencies (-eqfreq=lim), you need to specify the number of equilibrium frequencies you want to use. This number is important: too small, and the process of evolution might not be modelled correctly; too big, and the tree space exploration gets inefficient. Please refer to the article for more details. Recent addition (20/07/2012): you can now specify lower and upper values for the limited set of equilibrium frequencies. Use options -eqfreqlow=0.2 -eqfrequpp=0.8 for instance.
Only the sequence file and the tree file are mandatory. Default values for the other parameters are:
-positions=123 -tstv=e -rates=1 -topology=e -outseqs=n -eqfreq=unlim
When there is only one rate of evolution, no alpha is used.
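For scripted analyses, the command line described above can be assembled programmatically. The sketch below is only an illustration in Python. The option names are those listed in this document, but the helper `build_command`, the `./nhPhyml` invocation path and the validation rules are assumptions rather than part of nhPhyML.

```python
# Option names taken from this document; anything else is rejected.
KNOWN_OPTIONS = {
    "format", "positions", "tstv", "rates", "alpha", "topology",
    "outseqs", "eqfreq", "numeqfreq", "eqfreqlow", "eqfrequpp",
    "quick", "gcvar",
}


def build_command(sequences, tree, executable="./nhPhyml", **options):
    """Return an argument list of the form -name=value for nhPhyml.

    Only the sequence file and the tree file are mandatory; options that
    are not passed are left to the program's own defaults.
    """
    unknown = set(options) - KNOWN_OPTIONS
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    # The text says the number of frequencies must be given with -eqfreq=lim.
    if options.get("eqfreq") == "lim" and "numeqfreq" not in options:
        raise ValueError("-eqfreq=lim requires -numeqfreq")
    args = [executable, f"-sequences={sequences}", f"-tree={tree}"]
    for name, value in options.items():
        args.append(f"-{name}={value}")
    return args


if __name__ == "__main__":
    command = build_command(
        "SequenceFile", "TreeFile",
        format="i", rates=8, alpha="e", eqfreq="lim", numeqfreq=5,
    )
    print(" ".join(command))
```

The resulting list can be passed to `subprocess.run` as is, which avoids quoting problems with file names containing spaces.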
Those options are
less central but might be useful to some users.
sets the precision to 0.0001. When optimizing parameters, if the
likelihood difference between the former value of the parameter and the
new value of the parameter is below the precision value, the maximum is
considered to be found, and the optimization stops. By default this
value is 0.000001, fairly low. Increasing this value decreases the
-quick=y : the
program does not make a final optimization of the parameters. Hence the
topology obtained is the most likely found by the program, but the
parameters such as branch length are not correctly optimized.
-gcvar=y : the
root G+C content is not optimized, but a range of values are tried, and
for each of these values the likelihood is maximized by optimizing the
free parameters. This way one can have an idea of the variance of the
root G+C content estimate. Moreover, the resulting LNF file can be used
as input to Consel (Shimodaira, 2001) to define a confidence interval
with e.g. the AU test.
sets the lower limit of the root G+C contents to be tried to .50.
Values tried are .51, .52, .53... until the upper limit is met.
sets the upper limit of the root G+C contents to be tried to .70.
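To get a feel for how many optimizations a -gcvar=y run implies, the grid of root G+C values can be enumerated. This Python sketch follows the description above literally (values start one step above the lower limit). It assumes a fixed step of 0.01 and an inclusive upper limit, neither of which is confirmed here, and the function name is hypothetical.

```python
def candidate_root_gc(lower, upper, step_percent=1):
    """List root G+C values tried between two limits.

    Works in whole percent to avoid floating-point drift.
    Assumptions: the first value is lower + step, the upper limit is included.
    """
    low = round(lower * 100)
    high = round(upper * 100)
    if low >= high:
        raise ValueError("lower limit must be below upper limit")
    return [p / 100 for p in range(low + step_percent, high + 1, step_percent)]


if __name__ == "__main__":
    grid = candidate_root_gc(0.50, 0.70)
    print(len(grid), grid[0], grid[-1])
```

With limits of .50 and .70 this gives 20 values, so the free parameters are optimized 20 times under these assumptions.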
-outseqs=y means that we want to get the ancestral sequence at the root node (n otherwise). This feature has not been tested and should not be used without extreme caution. When the user does not want to use all the positions in codons, the ancestral sequence cannot be reconstructed.
Five or six files are produced in the directory containing the "SequenceFile":
SequenceFile_nhPhyml.lk contains two lines: the first one displays the final likelihood of the output tree, the second one gives the final estimate of the root G+C content.
SequenceFile_nhPhyml.out provides general information concerning the phylogenetic reconstruction, such as the input files, the options, how many rate categories were used, the final likelihood and how long the run took.
SequenceFile_nhPhymlEq.tree is the final tree, on which the equilibrium G+C contents of each branch are displayed in place of bootstrap values.
SequenceFile_nhPhymlGC.tree is the same final tree, on which the G+C contents at each node are displayed in place of bootstrap values, except at the root. For the G+C content of the root, please check SequenceFile_nhPhyml.lk, second line.
SequenceFile_nhPhyml.seq contains a text representation of the tree displaying the labels of all the nodes (numbers or sequence names). Then the present sequences are displayed, in a fasta-like format, together with the root sequence.
SequenceFile_nhPhyml.lnf is a simplified PAML-like (Yang, 1997) file displaying the site likelihoods. It can be used as input to CONSEL (Shimodaira, 2001) to compare various trees.
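The two-line .lk file is convenient for collecting results across runs. A minimal Python sketch for reading it follows. The exact layout of each line is not specified here, so the code assumes that the value of interest is the last number on its line, and `parse_lk` is a hypothetical helper.

```python
import re

# Matches plain or scientific-notation numbers such as -12345.678 or 1e-05.
NUMBER = re.compile(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?")


def last_number(line):
    """Return the last number found on a line (assumed to be the value)."""
    found = NUMBER.findall(line)
    if not found:
        raise ValueError(f"no number found in line: {line!r}")
    return float(found[-1])


def parse_lk(text):
    """Return (final likelihood, root G+C estimate) from the .lk content."""
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("expected two lines: likelihood, then root G+C")
    return last_number(lines[0]), last_number(lines[1])


if __name__ == "__main__":
    # Made-up content for illustration only, not real program output.
    print(parse_lk("-12345.678\n0.4321\n"))
```

In practice the text would come from something like `open("SequenceFile_nhPhyml.lk").read()`.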
The tree space exploration is done as in PhyML v.2.2, by Nearest Neighbor Interchanges (NNIs). These topological rearrangements are local, and do not permit testing topologies distant from the input one, especially when the number of sequences is large. Therefore, it is recommended that you run the program using many different input trees. The resulting trees can be compared using CONSEL with the help of the LNF files.
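Running from many starting trees is easy to script. Because the output files are named after the sequence file and written next to it, successive runs on the same sequence file would presumably overwrite each other. The Python sketch below therefore plans one working directory per starting tree, each meant to hold its own copy of the sequence file. The directory layout, the `./nhPhyml` path and the helper name are assumptions, and nothing is executed.

```python
from pathlib import Path


def plan_runs(sequence_file, start_trees, work_root="runs",
              executable="./nhPhyml", extra_args=()):
    """Plan one nhPhyml run per starting tree.

    Returns (run_directory, command) pairs. The caller is expected to
    create each directory and copy the sequence file into it before
    launching the command.
    """
    sequence_name = Path(sequence_file).name
    plans = []
    for index, tree in enumerate(start_trees, start=1):
        run_dir = Path(work_root) / f"start_{index}"
        local_copy = (run_dir / sequence_name).as_posix()
        command = [executable, f"-sequences={local_copy}",
                   f"-tree={tree}", *extra_args]
        plans.append((run_dir, command))
    return plans


if __name__ == "__main__":
    for run_dir, command in plan_runs("data/seqs.phy",
                                      ["nj.tree", "random1.tree"],
                                      extra_args=["-rates=8", "-alpha=e"]):
        print(run_dir.as_posix(), " ".join(command))
```

Each run then leaves its own .lnf file in its directory, ready to be compared with CONSEL.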
Please cite the following article when using nhPhyML :
Boussau B, Gouy M (2006). "Efficient Likelihood Computations with Non-Reversible Models of Evolution", Syst Biol. 2006, 55(5):756-68.
It would also be good to cite the articles by Galtier and Gouy (1998), as the model used in nhPhyML comes from this work, and by Guindon and Gascuel (2003), as much of the nhPhyML code comes from PhyML.
Galtier N, Gouy M (1998). "Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis", Mol Biol Evol. 1998 Jul;15(7):871-9.
Guindon S, Gascuel O (2003). "A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood", Syst Biol. 2003 Oct;52(5):696-704.
Shimodaira H, Hasegawa M (2001). "CONSEL: for assessing the confidence of phylogenetic tree selection", Bioinformatics. 2001 Dec;17(12):1246-7.
Yang Z (1997). "PAML: a program package for phylogenetic analysis by maximum likelihood", Computer Applications in BioSciences 13:555-556.