Contents

Introduction

Phyldog is a program made to simultaneously build gene and species trees when gene families have undergone duplications and losses. It can analyze thousands of gene families in dozens of genomes simultaneously, and was presented in an article in Genome Research. Trees and parameters are estimated in the maximum likelihood framework, by maximizing the probability of alignments given the species tree, the gene trees and the parameters of duplication and loss.

It is built on the Bio++, Phylogenetic Likelihood Library (PLL) and BOOST C++ libraries, and uses MPI for parallelization.

It is provided as is and should be used with appropriate care. Please report any bug to my e-mail addresses : bastien.boussau@univ-lyon1.fr. It has been tested on Linux and OSX systems. For more information about my work, see my personal page or this one.

Try Phyldog, without installing it

If you just want to give Phyldog a try without installing it, you can download a ready-to-use virtual machine (can also be downloaded from here). Once downloaded, import it using a virtualization tool such as Virtual Box for instance. Here are some intructions:

You can find a terminal emulator at the bottom panel of the screen. Here are the directories you will find in your personal directory :

Tutorial

You can find a tutorial here explaining how to use Phyldog.

Installation

For installation instructions, please refer to our wiki. For the moment, you can install phyldog with git.

Usage

To use it, type:

mpirun -np NUM_PROCESSES phyldog param=GeneralOptions.opt

Where the file GeneralOptions.opt is presented below, and NUM_PROCESSES corresponds to the number of processes you want to use for Phyldog. Using more processes than there are gene families is useless.

Just typing: ./phyldog should show a partial list of options that need to be given to the program.

Example Data

You can download example files.

How to run Phyldog

phyldog takes as input a series of files. These files form a hierarchy, as shown in Figure 1 (which you can click to enlarge).

Organization of files

One file (GeneralOptions.opt) contains general options that are used by all processors on which the job runs. Then there is one file per gene family, where gene family-specific options are given. This file containing the general options can be given as an argument to phyldog, as follows:

mpirun -np NUM_PROCESSORS phyldog param=GeneralOptions.opt

where NUM_PROCESSORS is the number of processors to be used by phyldog, and GeneralOptions.opt is the file containing the general options.

The file contains a list of options as follows:

first_parameter   = value1
second_parameter  = value2

This follows the syntax used for programs of the bppsuite series, as explained on the bppsuite documentation.

The GeneralOptions.opt file

GeneralOptions.opt may contain the following list of options:

PATH=/home/user/path_to_data_files/
# path to the directory where the input files are, and the output files are left.
init.species.tree=user
# whether the starting species tree is given in a file (user), or should be random, or is to be reconstructed using a fast algorithm (mrp)
species.tree.file=$(PATH)InputSpeciesTree.tree
# gives the path to the input species tree, in newick format.
species.names.file=$(PATH)SpeciesNames.txt
# gives the species names to be considered for species tree reconstruction. One species name per line, without spaces.
starting.tree.file=$(PATH)start
# the starting species tree is saved in the file start, in the directory $(PATH)
output.tree.file=$(PATH)output.sptree
# the end species tree is saved in the file output.sptree, in the directory $(PATH)
output.temporary.tree.file=$(PATH)CurrentSpeciesTree.tree
# the species tree obtained is output to this file if the job has not finished but was closed because of time constraints.
genelist.file=$(PATH)listeGene
# Contains a list of gene family-specific options
output.duplications.tree.file=$(PATH)SpeciesTreeDuplications.tree
# Species tree where branch lengths represent total numbers of duplications
output.losses.tree.file=$(PATH)SpeciesTreeLosses.tree
# Species tree where branch lengths represent total numbers of losses
output.numbered.tree.file=$(PATH)SpeciesTreeLosses.tree
# Species tree with nodes numbered
optimization.topology=no
# whether the species tree topology should be optimized (yes) or not (no)
branch.expected.numbers.optimization=average_then_branchwise
# whether the branch-wise parameters of duplications or losses should be optimized and branchwise (branchwise) or optimized and averaged over all branches (average) or not optimized (no). The option average_then_branchwise is a good compromise between speed and accuracy. WARNING: do not try to estimate these parameters if few families (e.g. 1-100) are used. The estimated parameters would be very inaccurate.
genome.coverage.file=$(PATH)GenomeCoverage
# File giving the expected completeness of the genomes under study, in percents.
spr.limit=5
# For SPR moves on the species tree, gives the maximum distance between the position of the pruned subtree and its regrafting position.
time.limit=23
# Time limit for the job: beyond 23 hours, the job stops. Should be useful if the job is limited to less than 24 hours
current.step=0
# This option is useful to restart a job that has been stopped due to time.limit.
output.file.suffix=_extension
# An extension that will be added to all output files.
alternate.topology.likelihoods=$(PATH)alternateLks.txt
# A file where the likelihoods of alternate topologies encountered during the NNI search on the species tree are saved.
species.duplication.tree.file=previousDuplicationTree.nwk
# A file containing a species tree with branch lengths representing duplication parameters. These duplication parameters are then used as starting values for the algorithm. You don't have to give this option if you don't have good estimates for these parameters.
species.loss.tree.file=previousDuplicationTree.nwk
# A file containing a species tree with branch lengths representing loss parameters. These loss parameters are then used as starting values for the algorithm. You don't have to give this option if you don't have good estimates for these parameters.

This GeneralOptions.opt file contains options specific to the search for the best species tree. However, the options included in this file are also read by client processors in charge of gene families. Therefore, it is possible to include options that apply to the gene tree search for all gene families.

The File Listing Gene Family-Specific Option Files

Gene-family specific options need to be given in additional files. The list of these files is given in genelist.file This file should look like this:

family_1.option
family_2.option
...

or

family_1.option:10
family_2.option:30
...

The second way to list the option files contains an additional element of information, which is the “complexity” of a gene family. This “complexity” is used at the beginning of the algorithm to distribute equally the loads on the different computers. So far, it is still unclear what this complexity should be (some function of the number of sequences and the number of sites, but the number of duplications and losses in a gene tree also has a great impact on time complexity). I generally use the number of sequences in the gene family for lack of a better metric.

Gene Family-Specific Option Files

Inside these option files, gene-family-specific options should be set. These options again follow the bppsuite syntax. A large number of these options are documented in the bppsuite help.

An example of these options is given below. I have ordered the options in great categories, such as data files, algorithm options, model options, and optimization options. We assume we are looking at the contents of the file family_1.option).

First, data files:

PATH=home/user/path_to_data_files/
#path to the directory where the input files are, and the output files are left.
DATA=family_1
# Variable used to give the name of the data files.
alphabet=DNA
# Could also be RNA, Protein, or Codon. Please see the bppsuite help for more details.
taxaseq.file=$(PATH)$(DATA).link
# File giving the link between species and sequence names (more on this below)
input.sequence.file=$(PATH)$(DATA).fasta
# file giving the input sequence alignment for the gene family
input.sequence.format=Fasta
# Format of the sequence alignment. Could be Fasta, Phylip, Clustal, Mase, Nexus… Please see the bppsuite help for more details.
output.reconciled.tree.file=$(PATH)$(DATA)_Reconciled.tree
# File where to store the output improved and reconciled gene tree, in NHX format. Duplication and speciation nodes are annotated, with the tag “Ev=D” or “Ev=S” respectively.
output.duplications.tree.file=$(PATH)$(DATA)_Duplications.tree
# File where the species tree topology is saved, annotated with numbers of duplications for this gene family.
output.losses.tree.file=$(PATH)$(DATA)_Losses.tree
# File where the species tree topology is saved, annotated with numbers of losses for this gene family.
output.numbered.tree.file=$(PATH)$(DATA)_Numbered.tree
# File where the species tree topology is saved, annotated with node indices.
input.sequence.sites_to_use=all
# tells whether we should use all sites in the alignment or not. Could be all, nogap, or complete. Please see the bppsuite help for more details.
input.sequence.max_gap_allowed=100%
# Maximum number of gaps tolerated for including a site in the analysis.
init.gene.tree=user
# Starting gene tree. Could be user, bionj or phyml. user requires that a user-input tree is given with the gene.tree.file option, whereas the options bionj and phyml have phyldog use these algorithms to create starting gene trees.
gene.tree.file=$(PATH)$(DATA).tree
# File containing the input starting gene tree in newick format. Useful if init.gene.tree=user.
output.starting.gene.tree.file=$(PATH)$(DATA)_starting.tree
# File where the starting gene tree is saved.

Then, algorithm options:

rearrangement.gene.tree=nni
# Type of rearrangement: nni or spr. nni is much faster but less exhaustive than spr. If the species tree topology is fixed, we advise spr, which provides better gene trees. Otherwise, we advise the use of nnis.
SPR.limit.gene.tree=4
# For SPR moves on the gene tree, gives the maximum distance between the position of the pruned subtree and its regrafting position.
reset.gene.trees=yes
# yes by default. If yes, gene trees are reset to starting gene tree topologies before each gene tree improvement. This offers the guarantees that we start from a gene tree that is good according to sequences only. Putting no instead results in a faster algorithm, but this option has not been tested thoroughly.

Then, model options:

model=GTR(a=1.17322, b=0.27717, c=0.279888, d=0.41831, e=0.344783, initFreqs=observed, initFreqs.observedPseudoCount=1)
# options of the model used. Should match the alphabet. Please see the bppsuite help for more details.
rate_distribution=Invariant(dist=Gamma(n=4,alpha=1.0), p=0.1)
# Rate heterogeneity option. Here we assume a gamma law with 4 categories and a category of invariants to model rate heterogeneity among sites.
optimization.ignore_parameter=InvariantMixed.dist_Gamma.alpha, InvariantMixed.p, GTR.a, GTR.b, GTR.c, GTR.d, GTR.e, GTR.theta, GTR.theta1, GTR.theta2
# We choose not to optimize these 10 parameters in order to save computing time, as we have provided reasonable input values. However, in cases where good input values are not available, it may be wise to leave this field empty and optimize these parameters.

Finally, optimization options:

optimization.topology=yes
# We choose to optimize the topology.
optimization.tolerance=0.01
# We have a large optimization tolerance to speed up the computations.
optimization.method_DB.nstep=0
optimization.topology.numfirst=false
optimization.topology.tolerance.before=100
optimization.topology.tolerance.during=100

The File Giving the Link Between Species Names and Sequence Names

The taxaseq.file file contains the link between species name and gene name.
For instance:

Oryctolagus_cuniculus:ENSOCUP00000017695
Dipodomys_ordii:ENSDORP00000011323
Sus_scrofa:ENSSSCP00000001753
Pongo_pygmaeus:ENSPPYP00000018557;ENSPPYP00000018560
...

This means that sequences ENSPPYP00000018557 and ENSPPYP00000018560 correspond to the species Pongo_pygmaeus for instance.

The species names should correspond to the names used in species.names.file, an option that should be found in GeneralOptions.opt, or alternatively should correspond to the names in the input species tree (input.species.tree).

Summary: Files Needed by Phyldog

  1. a file GeneralOptions.opt
  2. a file giving the input species tree, or alternatively a list of species names (if the option for the input species tree is random, this file is necessary). Otherwise this file can be used to limit the list of species to include in the study.
  3. a file giving a list of gene family-specific option files
  4. one file per gene family describing the options for this gene family
  5. one alignment file per gene family
  6. one file giving the link between species name and sequence name, one per gene family

Files that can be provided but that are not absolutely necessary:

  1. a file giving genome coverage per species (otherwise phyldog assumes genomes are 100% complete)
  2. a file giving the duplication parameters for the species tree (otherwise phyldog will estimate these)
  3. a file giving the loss parameters for the species tree (otherwise phyldog will estimate these)
  4. one tree file per gene family (otherwise phyldog can estimate a starting gene tree using bionj or phyml-like algorithms)

Citation

Please cite the following article when using PHYLDOG :
Genome-scale coestimation of species and gene trees. Boussau B, Szollosi GJ, Duret L, Gouy M, Tannier E, Daubin V. Genome Research. 2013 Feb;23(2):323-30. doi: 10.1101/gr.141978.112. Epub 2012 Nov 6.

Bastien Boussau, PhD
bastien.boussau@univ-lyon1.fr
Laboratoire de Biométrie et Biologie Evolutive, UMR CNRS 5558

Valid XHTML 1.1