Author: Bastien Boussau
Maintainer: Thomas Bigot
bastien.boussau@univ-lyon1.fr
Laboratoire de Biométrie et Biologie Evolutive, UMR CNRS 5558
After reading this page, you can find extra information on the project page.
Development version: 2.0beta (git, branch master).
Phyldog is a program made to simultaneously build gene and species trees when gene families have undergone duplications and losses. It can analyze thousands of gene families in dozens of genomes simultaneously, and was presented in an article in Genome Research.
Trees and parameters are estimated in the maximum likelihood framework, by maximizing the probability of alignments given the species tree, the gene trees and the parameters of duplication and loss.
It is built on the Bio++, Phylogenetic Likelihood Library (PLL) and BOOST C++ libraries, and uses MPI for parallelization.
It is provided as is and should be used with appropriate care. Please report any bug to my e-mail address: bastien.boussau@univ-lyon1.fr. It has been tested on Linux and OSX systems. For more information about my work, see my personal page or this one.
To let you try Phyldog without installing it, we provide Docker and virtual machine images.
Docker bundles everything needed to run an application into an image. It is a kind of lightweight virtual machine that does not contain a whole Linux system. To use it, install Docker on your system and run these commands:
docker pull thomasbigot/phyldog:latest
docker run -i -t thomasbigot/phyldog:latest /bin/bash
Note that, depending on your system, you may need to run these commands with sudo, or to belong to the docker group.
You can download a ready-to-use virtual machine. Once downloaded, import it using a virtualization tool such as VirtualBox. Here are some instructions:
- Download the ancsuite-latest.ova file on your computer;
- In the virtualization tool, go to File / Import a Virtual Application. Then, select the downloaded file;
- To change the keyboard layout, go to Application menu (upper left of the screen): Settings / Keyboard. Go to Layout, uncheck Use system defaults and set your own layout;
- In Application menu / Settings / Display, you can also change the screen size.
You can find a terminal emulator in the bottom panel of the screen. Here are the directories you will find in your personal directory:
- ExampleData contains an example data set used in the tutorial below.
You can find a tutorial here explaining how to use Phyldog.
If you do not want to compile the program and its dependencies, you can use our experimental nightly static builds at this address:
ftp://www.prabi.fr/pub/ancestrome/phyldog/nightly-builds/+latest/
They are designed to run on a recent 64-bit Linux machine.
Otherwise, full installation instructions are available on our wiki. For the moment, you need git to install phyldog.
To use it, type:
mpirun -np NUM_PROCESSES phyldog param=GeneralOptions.opt
where the file GeneralOptions.opt is presented below, and NUM_PROCESSES corresponds to the number of processes you want to use for Phyldog. Using more processes than there are gene families is useless.
Just typing:
./phyldog
should show a partial list of options that need to be given to the program.
You can download example files.
phyldog takes as input a series of files. These files form a hierarchy, as shown in Figure 1.
One file (GeneralOptions.opt) contains general options that are used by all processors on which the job runs. Then there is one file per gene family, where gene family-specific options are given.
The file containing the general options is given as an argument to phyldog, as follows:
mpirun -np NUM_PROCESSORS phyldog param=GeneralOptions.opt
where NUM_PROCESSORS is the number of processors to be used by phyldog, and GeneralOptions.opt is the file containing the general options.
The file contains a list of options as follows:
first_parameter = value1
second_parameter = value2
This follows the syntax used by programs of the bppsuite series, as explained in the bppsuite documentation.
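As a rough illustration of the `name = value` syntax above, here is a minimal Python sketch that reads such lines into a dictionary. It is not the parser used by Phyldog or bppsuite. The function name `parse_options` is hypothetical, and the handling of `#` comments and of `$(NAME)` references to earlier options is an assumption about the bppsuite syntax that should be checked against the bppsuite documentation.

```python
import re


def parse_options(text):
    """Parse 'name = value' lines into a dict (illustrative sketch only)."""
    options = {}
    for raw_line in text.splitlines():
        # Assumption: '#' starts a comment that runs to the end of the line.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        if "=" not in line:
            raise ValueError(f"not a 'name = value' line: {raw_line!r}")
        name, value = (part.strip() for part in line.split("=", 1))
        # Assumption: $(NAME) refers to an option defined earlier in the file.
        # References to unknown names are left untouched.
        value = re.sub(
            r"\$\(([\w.]+)\)",
            lambda match: options.get(match.group(1), match.group(0)),
            value,
        )
        options[name] = value
    return options


if __name__ == "__main__":
    example = """
    PATH = /data/run1/
    first_parameter = value1
    second_parameter = $(PATH)value2
    """
    for name, value in parse_options(example).items():
        print(name, "=", value)
```

A helper like this can be handy for checking an option file for typos before launching a long run; the values in the example are placeholders.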
GeneralOptions.opt file
GeneralOptions.opt may contain the following list of options:
- [...] (user), or should be random, or is to be reconstructed using a fast algorithm (mrp)
- [...] $(PATH)
- [...] output.sptree, in the directory $(PATH)
- [...] (yes) or not (no)
- [...] (branchwise) or optimized and averaged over all branches (average) or not optimized (no). The option average_then_branchwise is a good compromise between speed and accuracy. WARNING: do not try to estimate these parameters if few families (e.g. 1-100) are used. The estimated parameters would be very inaccurate.
This GeneralOptions.opt file contains options specific to the search for the best species tree. However, the options included in this file are also read by the client processors in charge of gene families. Therefore, it is possible to include options that apply to the gene tree search for all gene families.
Gene-family-specific options need to be given in additional files. The list of these files is given in genelist.file. This file should look like this:
family_1.option
family_2.option
...
or
family_1.option:10
family_2.option:30
...
The second form adds one piece of information: the “complexity” of each gene family. This “complexity” is used at the beginning of the algorithm to balance the load across the different computers. So far, it is still unclear what this complexity should be: presumably some function of the number of sequences and the number of sites, but the number of duplications and losses in a gene tree also has a great impact on running time. I generally use the number of sequences in the gene family for lack of a better metric.
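The list with complexities can be generated automatically. The following Python sketch uses the number of sequences in each alignment as the complexity, as suggested above. It assumes the alignments are in Fasta format, and the helper names (`count_fasta_sequences`, `gene_list_lines`) and the example file names are hypothetical.

```python
import tempfile
from pathlib import Path


def count_fasta_sequences(fasta_path):
    """Count Fasta records by counting header lines starting with '>'."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def gene_list_lines(families):
    """Build 'option_file:complexity' lines.

    families is an iterable of (option_file, alignment_file) pairs; the
    complexity is the number of sequences in the alignment.
    """
    return [
        f"{option_file}:{count_fasta_sequences(alignment_file)}"
        for option_file, alignment_file in families
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical alignment with three sequences, written for the demo.
        alignment = Path(tmp) / "family_1.fasta"
        alignment.write_text(">seq1\nACGT\n>seq2\nACGA\n>seq3\nACGG\n")
        for line in gene_list_lines([("family_1.option", alignment)]):
            print(line)  # family_1.option:3
```

The resulting lines can then be written to the file named by genelist.file.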
Inside these option files, gene-family-specific options should be set. These options again follow the bppsuite syntax. A large number of these options are documented in the bppsuite help.
An example of these options is given below. I have grouped the options into broad categories, such as data files, algorithm options, model options, and optimization options. We assume we are looking at the contents of the file family_1.option.
- [...] RNA, Protein, or Codon. Please see the bppsuite help for more details.
- [...] Fasta, Phylip, Clustal, Mase, Nexus… Please see the bppsuite help for more details.
- [...] all, nogap, or complete. Please see the bppsuite help for more details.
- [...] user, bionj or phyml. user requires that a user-input tree is given with the gene.tree.file option, whereas the options bionj and phyml have phyldog use these algorithms to create starting gene trees.
- [...] init.gene.tree=user.
- [...] nni or spr. nni is much faster but less exhaustive than spr. If the species tree topology is fixed, we advise spr, which provides better gene trees. Otherwise, we advise nni.
- [...] yes by default. If yes, gene trees are reset to starting gene tree topologies before each gene tree improvement. This guarantees that we start from a gene tree that is good according to sequences only. Putting no instead results in a faster algorithm, but this option has not been tested thoroughly.
The taxaseq.file file contains the link between species name and gene name.
For instance:
Oryctolagus_cuniculus:ENSOCUP00000017695
Dipodomys_ordii:ENSDORP00000011323
Sus_scrofa:ENSSSCP00000001753
Pongo_pygmaeus:ENSPPYP00000018557;ENSPPYP00000018560
...
This means, for instance, that sequences ENSPPYP00000018557 and ENSPPYP00000018560 correspond to the species Pongo_pygmaeus.
The species names should correspond to the names used in species.names.file, an option that should be found in GeneralOptions.opt, or alternatively should correspond to the names in the input species tree (input.species.tree).
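To illustrate the format of this file, here is a small Python sketch that reads `species:seq1;seq2` lines into a dictionary and reports species names that are missing from a reference list of species. The function names `parse_taxaseq` and `unknown_species` are hypothetical, and this is only a consistency check one might run before launching Phyldog, not part of the program itself.

```python
def parse_taxaseq(text):
    """Map each species name to the list of its sequence names."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Species name comes before the first ':'; sequences are ';'-separated.
        species, _, sequences = line.partition(":")
        names = [name for name in sequences.split(";") if name]
        mapping.setdefault(species, []).extend(names)
    return mapping


def unknown_species(mapping, species_names):
    """Return species used in the mapping but absent from species_names."""
    return sorted(set(mapping) - set(species_names))


if __name__ == "__main__":
    example = (
        "Oryctolagus_cuniculus:ENSOCUP00000017695\n"
        "Dipodomys_ordii:ENSDORP00000011323\n"
        "Sus_scrofa:ENSSSCP00000001753\n"
        "Pongo_pygmaeus:ENSPPYP00000018557;ENSPPYP00000018560\n"
    )
    mapping = parse_taxaseq(example)
    print(mapping["Pongo_pygmaeus"])
    # Hypothetical species list that lacks Pongo_pygmaeus.
    known = ["Oryctolagus_cuniculus", "Dipodomys_ordii", "Sus_scrofa"]
    print(unknown_species(mapping, known))
```

An empty list from `unknown_species` means every species in the file appears in the reference list.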
- [...] GeneralOptions.opt [...] random, this file is necessary). Otherwise this file can be used to limit the list of species to include in the study.
Files that can be provided but that are not absolutely necessary:
Please cite the following article when using PHYLDOG:
Genome-scale coestimation of species and gene trees. Boussau B, Szollosi GJ, Duret L, Gouy M, Tannier E, Daubin V.
Genome Research. 2013 Feb;23(2):323-30. doi: 10.1101/gr.141978.112. Epub 2012 Nov 6.