PHYLDOG is a program that simultaneously builds gene and species trees when gene families have undergone duplications and losses. It can analyze thousands of gene families in dozens of genomes simultaneously, and was presented in an article in Genome Research.
Trees and parameters are estimated in the maximum likelihood framework, by maximizing the probability of alignments given the species tree, the gene trees and the parameters of duplication and loss.
It is built on the Bio++ C++ libraries, and uses MPI.
It is provided as is and should be used with appropriate care.
Please report any bug to one of my e-mail addresses: bastien.boussau /AT/ univ-lyon1.fr, or boussau /AT/ berkeley.edu.
It has been tested on Linux and OSX systems.
For more information about my work, see my personal page.
Download an archive containing all the source code. Example files can also be downloaded here.
To compile and install PHYLDOG, two libraries are required:
- Bio++ (bpp-core, bpp-seq and bpp-phyl)
- Boost with MPI enabled
Once both are installed, PHYLDOG itself can be compiled.
The sections below describe how to install the two libraries, and then how to compile PHYLDOG.
The latest available Bio++ libraries should be installed, preferably from the Git repositories. First, the files must be downloaded, and then compiled and installed in the directory INSTALLATION_DIRECTORY_BIOPP (in the following, please replace "INSTALLATION_DIRECTORY_BIOPP" with the path to the directory where you want the Bio++ libraries to be installed). The Bio++ installation described here uses cmake, as it is the most efficient procedure for installing these libraries.
In a terminal:
git clone http://biopp.univ-montp2.fr/git/bpp-core.git
git clone http://biopp.univ-montp2.fr/git/bpp-seq.git
git clone http://biopp.univ-montp2.fr/git/bpp-phyl.git
- Compilation of the libraries (beware of the order, it needs to be bpp-core first, then bpp-seq, then bpp-phyl):
(cd bpp-core && cmake . -DCMAKE_INSTALL_PREFIX=INSTALLATION_DIRECTORY_BIOPP && make install)
(cd bpp-seq && cmake . -DCMAKE_INSTALL_PREFIX=INSTALLATION_DIRECTORY_BIOPP && make install)
(cd bpp-phyl && cmake . -DCMAKE_INSTALL_PREFIX=INSTALLATION_DIRECTORY_BIOPP && make install)
The Bio++ libraries should now be installed in the directory INSTALLATION_DIRECTORY_BIOPP.
The Boost files must be downloaded, and then compiled and installed in the directory INSTALLATION_DIRECTORY_BOOST (replace this with the path where you want the libraries to be installed). Both header files (in an "include" directory) and compiled libraries (in a "lib" directory) will be installed in INSTALLATION_DIRECTORY_BOOST. In the following lines, INCLUDE_DIRECTORY is the full path to the "include" directory (INSTALLATION_DIRECTORY_BOOST/include), and LIB_DIRECTORY is the full path to the "lib" directory (INSTALLATION_DIRECTORY_BOOST/lib).
The Boost libraries can be downloaded from this link:
In a terminal:
tar zxvf boost_1_47_0.tar.gz
./bootstrap.sh --libdir=LIB_DIRECTORY --includedir=INCLUDE_DIRECTORY --prefix=INSTALLATION_DIRECTORY_BOOST --with-libraries=mpi
The Boost libraries should now be installed.
You should be able to use cmake to compile PHYLDOG.
Hopefully, running cmake in a terminal should work if the libraries have been installed in default directories on your system.
Alternatively, if you installed the Bio++ or Boost libraries in non-default directories, you may need to give the paths to those directories to cmake, as follows:
cmake . -DCMAKE_PREFIX_PATH=INSTALLATION_DIRECTORY_BIOPP -DBOOST_ROOT=INSTALLATION_DIRECTORY_BOOST
where INSTALLATION_DIRECTORY_BIOPP is the path to the folder that contains both the Bio++ libraries and the Bio++ header files, as introduced above when installing the Bio++ libraries. Similarly, INSTALLATION_DIRECTORY_BOOST is the directory introduced above where Boost was installed.
Alternatively, you can directly edit this Makefile_template by changing 4 lines, and then entering "make -f Makefile_template phyldog" in the terminal.
The 4 lines to change in the Makefile_template are at the beginning of the file.
After "BOOST_INCLUDE = ", you should enter the complete path to the Boost "include" directory (i.e. INCLUDE_DIRECTORY above).
After "BIOPP_INCLUDE = ", you should enter the complete path to the Bio++ "include" directory (i.e. INSTALLATION_DIRECTORY_BIOPP/include).
After "BIOPP_LIBRARIES = ", you should enter the complete path to the Bio++ "lib" directory (i.e. INSTALLATION_DIRECTORY_BIOPP/lib).
After "BOOST_LIBRARIES = ", you should enter the complete path to the Boost "lib" directory (i.e. LIB_DIRECTORY above).
Once these 4 lines have been corrected to correspond to the proper values for your system, typing "make -f Makefile_template phyldog" in the terminal should compile the program, and produce an executable file "phyldog", ready to run.
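The four edits to the Makefile_template can also be scripted. The following is a minimal Python sketch, assuming that each of the four variables is defined on its own line of the form "NAME = value"; the real layout of Makefile_template may differ. The function name fill_makefile_template, the example paths and the "CC" line in the sample text are hypothetical and only serve the illustration.

```python
import re

def fill_makefile_template(text, boost_prefix, biopp_prefix):
    """Return Makefile text with the four path variables set.

    Assumption: each variable is defined on a single line of the form
    "NAME = value"; lines that do not match are left untouched.
    """
    values = {
        "BOOST_INCLUDE": boost_prefix + "/include",
        "BOOST_LIBRARIES": boost_prefix + "/lib",
        "BIOPP_INCLUDE": biopp_prefix + "/include",
        "BIOPP_LIBRARIES": biopp_prefix + "/lib",
    }
    out = []
    for line in text.splitlines():
        match = re.match(r"^(\w+)\s*=", line)
        if match and match.group(1) in values:
            name = match.group(1)
            line = name + " = " + values[name]
        out.append(line)
    return "\n".join(out) + "\n"

# Hypothetical stand-in for the beginning of Makefile_template.
template = (
    "BOOST_INCLUDE = \n"
    "BIOPP_INCLUDE = \n"
    "BIOPP_LIBRARIES = \n"
    "BOOST_LIBRARIES = \n"
    "CC = mpic++\n"
)
filled = fill_makefile_template(template, "/opt/boost", "/opt/biopp")
print(filled)
```

In practice the text would be read from Makefile_template and written back before running "make -f Makefile_template phyldog".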
To use it, type:
mpirun -np NUM_PROCESSES phyldog param=GeneralOptions.opt
where the file GeneralOptions.opt is presented below, and NUM_PROCESSES is the number of processes you want PHYLDOG to use. Using more processes than there are gene families brings no benefit.
should show a partial list of options that need to be given to the program.
HOW TO RUN PHYLDOG
phyldog takes as input a series of files. These files form a hierarchy, as shown in Figure 1.
One file (GeneralOptions.opt) contains general options that are used by all processors on which the job runs. Then there is one file per gene family, where gene family-specific options are given.
This file containing the general options can be given as an argument to phyldog, as follows:
mpirun -np NUM_PROCESSORS phyldog param=GeneralOptions.opt
where NUM_PROCESSORS is the number of processors to be used by phyldog, and GeneralOptions.opt is the file containing the general options.
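Because using more processes than there are gene families is useless, a launcher script may want to cap the number passed to mpirun. The following Python sketch shows one way to do this; the function name build_mpirun_command and the numbers in the example are hypothetical, and the command line itself is the one given above.

```python
def build_mpirun_command(requested_processes, n_gene_families,
                         options_file="GeneralOptions.opt"):
    """Build the mpirun argument list, capping the process count.

    The cap follows the remark that using more processes than there
    are gene families is useless.
    """
    if requested_processes < 1 or n_gene_families < 1:
        raise ValueError("need at least one process and one gene family")
    n = min(requested_processes, n_gene_families)
    return ["mpirun", "-np", str(n), "phyldog", "param=" + options_file]

# Hypothetical example: 64 processes requested, 20 gene families.
command = build_mpirun_command(64, 20)
print(" ".join(command))
```

The returned list can be passed to subprocess.run to start the job.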
The GeneralOptions.opt file contains a list of options as follows:
first_parameter = value1
second_parameter = value2
This follows the syntax used for programs of the bppsuite series, as explained on the following webpage:
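To inspect or check option files outside of PHYLDOG, the "name = value" syntax can be read with a few lines of Python. This is a simplified sketch, not the Bio++ parser: it assumes one option per line, "#" starting a comment, and $(NAME) referring to an option defined earlier in the same file. The function name parse_options is hypothetical.

```python
import re

def parse_options(text):
    """Parse 'name = value' lines into a dict.

    Simplifications (assumptions, not the Bio++ parser): '#' starts a
    comment, one option per line, and $(NAME) refers to an option
    defined earlier in the same text.
    """
    options = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        # Split on the first '=' only: values such as GTR(a=1.0) contain '='.
        name, value = line.split("=", 1)
        value = re.sub(r"\$\((\w+)\)",
                       lambda m: options.get(m.group(1), m.group(0)),
                       value.strip())
        options[name.strip()] = value
    return options

example = """
PATH = /home/user/path_to_data_files/   # data directory
species.tree.file = $(PATH)InputSpeciesTree.tree
model = GTR(a=1.17322, b=0.27717)
"""
opts = parse_options(example)
print(opts["species.tree.file"])
```

Unknown variables are left as written, so a missing definition is easy to spot in the parsed values.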
THE GeneralOptions.opt FILE
GeneralOptions.opt may contain the following list of options:
- PATH= /home/user/path_to_data_files/ #path to the directory where the input files are read and the output files are written.
- init.species.tree=user #whether the starting species tree is given in a file ("user"), or should be "random", or is to be reconstructed using a fast algorithm ("mrp")
- species.tree.file=$(PATH)InputSpeciesTree.tree #gives the path to the input species tree, in newick format.
- species.names.file=$(PATH)SpeciesNames.txt # gives the species names to be considered for species tree reconstruction. One species name per line, without spaces.
- starting.tree.file=$(PATH)start #the starting species tree is saved in the file start, in the directory $(PATH)
- output.tree.file=$(PATH)output.sptree #the end species tree is saved in the file output.sptree, in the directory $(PATH)
- output.temporary.tree.file=$(PATH)CurrentSpeciesTree.tree #the species tree obtained is output to this file if the job has not finished but was closed because of time constraints.
- genelist.file=$(PATH)listeGene #Contains a list of gene family-specific options
- output.duplications.tree.file=$(PATH)SpeciesTreeDuplications.tree # Species tree where branch lengths represent total numbers of duplications
- output.losses.tree.file=$(PATH)SpeciesTreeLosses.tree # Species tree where branch lengths represent total numbers of losses
- output.numbered.tree.file=$(PATH)SpeciesTreeNumbered.tree # Species tree with nodes numbered
- optimization.topology=no # whether the species tree topology should be optimized (yes) or not (no)
- branch.expected.numbers.optimization= average_then_branchwise # whether the duplication and loss parameters should be optimized separately for each branch ("branchwise"), optimized and averaged over all branches ("average"), or not optimized ("no"). The option "average_then_branchwise" is a good compromise between speed and accuracy. WARNING: do not try to estimate these parameters if few families (e.g. 1-100) are used, as the estimated parameters would be very inaccurate.
- genome.coverage.file= $(PATH)GenomeCoverage # File giving the expected completeness of the genomes under study, in percent.
- spr.limit=5 # For SPR moves on the species tree, gives the maximum distance between the position of the pruned subtree and its regrafting position.
- time.limit=23 # Time limit for the job: beyond 23 hours, the job stops. Should be useful if the job is limited to less than 24 hours
- current.step=0 # This option is useful to restart a job that has been stopped due to time.limit.
- output.file.suffix=_extension # An extension that will be added to all output files.
- alternate.topology.likelihoods=$(PATH)alternateLks.txt # A file where the likelihoods of alternate topologies encountered during the NNI search on the species tree are saved.
- species.duplication.tree.file=previousDuplicationTree.nwk # A file containing a species tree with branch lengths representing duplication parameters. These duplication parameters are then used as starting values for the algorithm. You don't have to give this option if you don't have good estimates for these parameters.
- species.loss.tree.file=previousLossTree.nwk # A file containing a species tree with branch lengths representing loss parameters. These loss parameters are then used as starting values for the algorithm. You don't have to give this option if you don't have good estimates for these parameters.
This GeneralOptions.opt file contains options specific to the search for the best species tree. However, the options included in this file are also read by client processors in charge of gene families. Therefore, it is possible to include options that apply to the gene tree search for all gene families.
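Several of the options listed above accept only a few values, so a quick consistency check before submitting a long job can save time. The sketch below is an illustration only: the allowed values are those mentioned in the comments above (PHYLDOG may accept others), the options are assumed to be already loaded into a dictionary, and the function name check_general_options is hypothetical.

```python
# Values taken from the option descriptions above; this table is an
# assumption about what is accepted, not an exhaustive specification.
ALLOWED_VALUES = {
    "init.species.tree": {"user", "random", "mrp"},
    "optimization.topology": {"yes", "no"},
    "branch.expected.numbers.optimization": {
        "average", "branchwise", "average_then_branchwise", "no"},
}

def check_general_options(options):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    for name, allowed in ALLOWED_VALUES.items():
        value = options.get(name)
        if value is not None and value not in allowed:
            problems.append(name + ": unexpected value '" + value + "'")
    if (options.get("init.species.tree") == "user"
            and "species.tree.file" not in options):
        problems.append("init.species.tree=user requires species.tree.file")
    if "genelist.file" not in options:
        problems.append("genelist.file is missing")
    return problems

good = {"init.species.tree": "user",
        "species.tree.file": "InputSpeciesTree.tree",
        "genelist.file": "listeGene",
        "optimization.topology": "no"}
bad = {"init.species.tree": "fast", "optimization.topology": "no"}
print(check_general_options(good))
print(check_general_options(bad))
```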
THE FILE LISTING GENE FAMILY-SPECIFIC OPTION FILES
Gene-family specific options need to be given in additional files. The list of these files is given in "genelist.file".
"genelist.file" should look like this:
The second way to list the option files contains an additional element of information, which is the "complexity" of a gene family. This "complexity" is used at the beginning of the algorithm to distribute the load evenly across the different computers. So far, it is still unclear what this complexity should be: probably some function of the number of sequences and the number of sites, although the number of duplications and losses in a gene tree also has a great impact on running time. For lack of a better metric, I generally use the number of sequences in the gene family.
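As an illustration of the complexity measure, the Python sketch below counts the sequences in a FASTA alignment and then spreads families over workers with a simple greedy rule (largest family first, to the least loaded worker). The greedy rule is only meant to show why a complexity value helps balancing; it is not claimed to be the scheme PHYLDOG uses internally. The function names and the complexity values in the example are hypothetical.

```python
import heapq

def count_sequences(fasta_text):
    """Number of sequences in a FASTA alignment (lines starting with '>')."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))

def balance_loads(complexities, n_workers):
    """Greedy assignment: largest family first, to the least loaded worker.

    Illustration only; this is not claimed to be PHYLDOG's own scheme.
    Returns {worker: (total_load, [family names])}.
    """
    heap = [(0, worker, []) for worker in range(n_workers)]
    heapq.heapify(heap)
    ordered = sorted(complexities.items(),
                     key=lambda item: (-item[1], item[0]))
    for family, cost in ordered:
        load, worker, families = heapq.heappop(heap)
        families.append(family)
        heapq.heappush(heap, (load + cost, worker, families))
    return {worker: (load, families) for load, worker, families in heap}

n_seq = count_sequences(">seq_a\nACGT\n>seq_b\nACGA\n")
# Hypothetical complexities (numbers of sequences per family).
loads = balance_loads(
    {"family_1": 12, "family_2": 7, "family_3": 5, "family_4": 4}, 2)
print(n_seq, loads)
```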
GENE FAMILY-SPECIFIC OPTION FILES
Inside these option files, gene-family-specific options should be set. These options again follow the bppsuite syntax. A large number of these options are documented in the bppsuite help:
An example of these options is given below. I have grouped the options into broad categories, such as "data files", "algorithm options", "model options", and "optimization options". (We assume we are looking at the contents of the file "family_1.option".)
First, data files:
- PATH= /home/user/path_to_data_files/ #path to the directory where the input files are read and the output files are written.
- DATA= family_1 # Variable used to give the name of the data files.
- alphabet= DNA # Could also be "RNA", "Protein", or Codon. Please see the bppsuite help for more details.
- taxaseq.file=$(PATH)$(DATA).link # File giving the link between species and sequence names (more on this below)
- input.sequence.file=$(PATH)$(DATA).fasta # file giving the input sequence alignment for the gene family
- input.sequence.format=Fasta # Format of the sequence alignment. Could be Fasta, Phylip, Clustal, Mase, Nexus... Please see the bppsuite help for more details.
- output.reconciled.tree.file=$(PATH)$(DATA)_Reconciled.tree # File where to store the output improved and reconciled gene tree, in NHX format. Duplication and speciation nodes are annotated, with the tag "Ev=D" or "Ev=S" respectively.
- output.duplications.tree.file=$(PATH)$(DATA)_Duplications.tree # File where the species tree topology is saved, annotated with numbers of duplications for this gene family.
- output.losses.tree.file=$(PATH)$(DATA)_Losses.tree # File where the species tree topology is saved, annotated with numbers of losses for this gene family.
- output.numbered.tree.file=$(PATH)$(DATA)_Numbered.tree # File where the species tree topology is saved, annotated with node indices.
- input.sequence.sites_to_use=all # tells whether we should use all sites in the alignment or not. Could be "all", "nogap", or "complete". Please see the bppsuite help for more details.
- input.sequence.max_gap_allowed=100% # Maximum proportion of gaps tolerated for including a site in the analysis.
- init.gene.tree=user # Starting gene tree. Could be "user", "bionj" or "phyml". "user" requires that a user-input tree is given with the "gene.tree.file" option, whereas the options "bionj" and "phyml" have phyldog use these algorithms to create starting gene trees.
- gene.tree.file=$(PATH)$(DATA).tree # File containing the input starting gene tree in newick format. Useful if "init.gene.tree=user".
- output.starting.gene.tree.file=$(PATH)$(DATA)_starting.tree # File where the starting gene tree is saved.
Then, algorithm options:
- rearrangement.gene.tree= nni # Type of rearrangement: "nni" or "spr". "nni" is much faster but less exhaustive than "spr". If the species tree topology is fixed, we advise "spr", which provides better gene trees. Otherwise, we advise "nni".
- SPR.limit.gene.tree= 4 # For SPR moves on the gene tree, gives the maximum distance between the position of the pruned subtree and its regrafting position.
- reset.gene.trees= yes # yes by default. If yes, gene trees are reset to starting gene tree topologies before each gene tree improvement. This guarantees that we start from a gene tree that is good according to sequences only. Putting "no" instead results in a faster algorithm, but this option has not been tested thoroughly.
Then, model options:
- model=GTR(a=1.17322, b=0.27717, c=0.279888, d=0.41831, e=0.344783, initFreqs=observed, initFreqs.observedPseudoCount=1) # options of the model used. Should match the alphabet. Please see the bppsuite help for more details.
- rate_distribution=Invariant(dist=Gamma(n=4,alpha=1.0), p=0.1) # Rate heterogeneity option. Here we assume a gamma distribution with 4 categories, plus a category of invariant sites, to model rate heterogeneity among sites.
- optimization.ignore_parameter=InvariantMixed.dist_Gamma.alpha, InvariantMixed.p, GTR.a, GTR.b, GTR.c, GTR.d, GTR.e, GTR.theta, GTR.theta1, GTR.theta2 # We choose not to optimize these 10 parameters in order to save computing time, as we have provided reasonable input values. However, in cases where good input values are not available, it may be wise to leave this field empty and optimize these parameters.
Finally, optimization options:
- optimization.topology=yes # We choose to optimize the topology.
- optimization.tolerance=0.01 # We have a large optimization tolerance to speed up the computations.
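With thousands of gene families, the per-family option files are usually generated by a script rather than written by hand. The Python sketch below fills a template with the family name; it uses a subset of the options shown above, chosen for illustration, and starts from a "bionj" tree so that no input gene tree file is needed. Whether PHYLDOG requires further options is not covered here, and the names FAMILY_TEMPLATE and family_option_text are hypothetical.

```python
# A subset of the options shown above; $(PATH) and $(DATA) are left for
# the option parser to expand, only {path} and {family} are filled here.
FAMILY_TEMPLATE = """PATH={path}
DATA={family}
alphabet=DNA
taxaseq.file=$(PATH)$(DATA).link
input.sequence.file=$(PATH)$(DATA).fasta
input.sequence.format=Fasta
init.gene.tree=bionj
output.reconciled.tree.file=$(PATH)$(DATA)_Reconciled.tree
rearrangement.gene.tree=nni
model=GTR(a=1.17322, b=0.27717, c=0.279888, d=0.41831, e=0.344783, initFreqs=observed, initFreqs.observedPseudoCount=1)
rate_distribution=Invariant(dist=Gamma(n=4,alpha=1.0), p=0.1)
optimization.topology=yes
optimization.tolerance=0.01
"""

def family_option_text(path, family):
    """Return the option file contents for one gene family."""
    return FAMILY_TEMPLATE.format(path=path, family=family)

texts = {family: family_option_text("/home/user/path_to_data_files/", family)
         for family in ["family_1", "family_2"]}
print(texts["family_2"])
```

Each text can then be written to its own file, for instance "family_2.option", in the data directory.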
THE FILE GIVING THE LINK BETWEEN SPECIES NAMES AND SEQUENCE NAMES
The "taxaseq.file" file contains the link between species name and gene name.
This means that sequences ENSPPYP00000018557 and ENSPPYP00000018560 correspond to the species Pongo_pygmaeus for instance.
The species names should correspond to the names used in "species.names.file", an option that should be found in GeneralOptions.opt, or alternatively should correspond to the names in the input species tree ("species.tree.file").
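Mismatches between species names, sequence names and the alignment are a common source of errors, so it can be worth checking them before a run. The Python sketch below assumes the link file has already been read into a dictionary mapping each species name to its sequence names (the file format itself is not parsed here). The function name check_links, the species "Homo_sapiens" and the sequence name "ENSP00000000001" are hypothetical and used only for the example.

```python
def check_links(species_to_sequences, known_species, alignment_names):
    """Compare a species->sequences mapping with species and alignment names.

    Returns three sorted lists: species absent from known_species,
    alignment sequences linked to no species, and linked sequences
    absent from the alignment.
    """
    unknown_species = sorted(set(species_to_sequences) - set(known_species))
    linked = {seq for seqs in species_to_sequences.values() for seq in seqs}
    unlinked = sorted(set(alignment_names) - linked)
    not_in_alignment = sorted(linked - set(alignment_names))
    return unknown_species, unlinked, not_in_alignment

links = {"Pongo_pygmaeus": ["ENSPPYP00000018557", "ENSPPYP00000018560"]}
species = ["Pongo_pygmaeus", "Homo_sapiens"]
alignment = ["ENSPPYP00000018557", "ENSPPYP00000018560", "ENSP00000000001"]
report = check_links(links, species, alignment)
print(report)
```

Here the third alignment sequence is reported because no species is linked to it.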
SUMMARY: FILES NEEDED BY PHYLDOG
- a file GeneralOptions.opt
- a file giving the input species tree, or alternatively a file listing the species names. The list of species names is required if the option for the input species tree is "random"; otherwise it can be used to limit the set of species included in the study.
- a file giving a list of gene family-specific option files
- one file per gene family describing the options for this gene family
- one alignment file per gene family
- one file giving the link between species name and sequence name, one per gene family
Files that can be provided but that are not absolutely necessary:
- a file giving genome coverage per species (otherwise phyldog assumes genomes are 100% complete)
- a file giving the duplication parameters for the species tree (otherwise phyldog will estimate these)
- a file giving the loss parameters for the species tree (otherwise phyldog will estimate these)
- one tree file per gene family (otherwise phyldog can estimate a starting gene tree using bionj or phyml-like algorithms)
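The checklist above can be turned into a small pre-flight test. The Python sketch below assumes that the general options and the per-family options have already been loaded into dictionaries, with $(PATH)-style variables expanded, and reports which of the required input files do not exist. It only covers the cases described in this document ("user" and "random" starting species trees, "user" starting gene trees); the function name missing_files is hypothetical.

```python
import os
import tempfile

def missing_files(general_options, family_options_list):
    """Return the paths named by file options that do not exist on disk."""
    wanted = [general_options.get("genelist.file")]
    if general_options.get("init.species.tree") == "user":
        wanted.append(general_options.get("species.tree.file"))
    elif general_options.get("init.species.tree") == "random":
        wanted.append(general_options.get("species.names.file"))
    for family in family_options_list:
        wanted.append(family.get("taxaseq.file"))
        wanted.append(family.get("input.sequence.file"))
        if family.get("init.gene.tree") == "user":
            wanted.append(family.get("gene.tree.file"))
    return [path for path in wanted
            if path is not None and not os.path.exists(path)]

# Example in a temporary directory where only the species tree exists.
with tempfile.TemporaryDirectory() as tmp:
    tree = os.path.join(tmp, "InputSpeciesTree.tree")
    open(tree, "w").close()
    general = {"init.species.tree": "user",
               "species.tree.file": tree,
               "genelist.file": os.path.join(tmp, "listeGene")}
    family = {"taxaseq.file": os.path.join(tmp, "family_1.link"),
              "input.sequence.file": os.path.join(tmp, "family_1.fasta"),
              "init.gene.tree": "bionj"}
    missing = missing_files(general, [family])
    missing_names = [os.path.basename(path) for path in missing]
    print(missing_names)
```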
Please cite the following article when using PHYLDOG: