PhEVER | Viruses

What is PhEVER?

PhEVER is a database of homologous gene families that provides information for understanding virus/host co-evolution. Each family comes with a pre-computed multiple alignment and phylogenetic tree.

How can I use PhEVER?

PhEVER allows you to select sets of homologous genes among viral species as well as between viral species and some eukaryotes. Multiple alignments and phylogenetic trees are available for each family. This makes it a particularly useful tool for comparative sequence analysis, phylogeny and molecular evolution studies in the context of co-evolution.

The user manual will provide you with helpful information on how to query PhEVER from the web interface.

For more details on how to query PhEVER, please take a look at the search page.

Which organisms are present in PhEVER?

The current release integrates extensive data from up-to-date completely sequenced genomes spanning a wide taxonomic range (2426 non-redundant viral genomes, 1296 non-redundant bacterial genomes, 44 eukaryotic genomes from plants to human). PhEVER is built from the following genomic data sources:

RefSeq Viral:
- complete data (i.e. 2426 viral genomes from all 5 Baltimore classes). Retrieved in May 2010.
Ensembl:
- Aedes aegypti
- Anopheles gambiae
- Bos taurus
- Caenorhabditis elegans
- Danio rerio
- Drosophila melanogaster
- Gallus gallus
- Homo sapiens
- Mus musculus
Genome Reviews:
- all fully sequenced Bacteria.
- all fully sequenced Archaea.
- the following Eukaryota:
  - Dictyostelium discoideum AX4
  - Leishmania major strain Friedlin
  - Leishmania braziliensis
  - Caenorhabditis briggsae
  - Ashbya gossypii ATCC 10895
  - Saccharomyces cerevisiae
  - Candida glabrata CBS 138
  - Kluyveromyces lactis NRRL Y-1140
  - Pichia stipitis CBS 6054
  - Debaryomyces hansenii CBS767
  - Yarrowia lipolytica CLIB122
  - Candida dubliniensis CD36
  - Aspergillus niger CBS 513.88
  - Aspergillus oryzae
  - Penicillium chrysogenum Wisconsin 54-1255
  - Aspergillus nidulans FGSC A4
  - Aspergillus fumigatus AF293
  - Schizosaccharomyces pombe
  - Cryptococcus neoformans var. neoformans JEC21
  - Cryptococcus neoformans var. neoformans B-3501A
  - Ustilago maydis 521
  - Encephalitozoon cuniculi GB-M1
  - Plasmodium falciparum 3D7
  - Plasmodium knowlesi strain H
  - Plasmodium vivax
  - Theileria annulata
  - Toxoplasma gondii RH
  - Cryptosporidium parvum Iowa II
  - Paramecium tetraurelia
  - Guillardia theta
  - Hemiselmis andersenii
  - Arabidopsis thaliana
  - Oryza sativa Japonica group
  - Ostreococcus lucimarinus CCE9901
  - Ostreococcus tauri

Data are modified and re-annotated: sequence names are modified according to the organism, taxonomy fields are corrected when they are inconsistent or inaccurate, and gene family, GC content, internal intron, 3'UTR and 5'UTR information is added to the annotations.
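As an illustration of one of the added annotations, GC content can be computed as sketched below. This is only a minimal example: the function name `gc_content` is hypothetical, and ignoring ambiguous symbols such as N is an assumption, since the exact rule used by PhEVER is not described here.

```python
def gc_content(seq: str) -> float:
    """Return the GC fraction (0.0 to 1.0) of a nucleotide sequence.

    Assumption: symbols other than A, C, G and T (for example N)
    are ignored rather than counted in the denominator.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # An empty or fully ambiguous sequence has no defined GC content;
    # this sketch returns 0.0 in that case.
    return gc / total if total else 0.0


if __name__ == "__main__":
    print(gc_content("ATGCGC"))
```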

How are PhEVER families built?

PhEVER is made of two databases: PhEVER(dna) which contains the nucleotide sequences, and PhEVER(aa) which contains the protein sequences.

The clustering of PhEVER sequences into families follows this procedure:

Protein sequences of PhEVER are generated by translating the CDS of PhEVER(dna) and using associated cross-references to generate the annotations.
To build the families, we perform a similarity search of all the proteins against each other with BLASTP2. For this purpose, we use the BLOSUM62 similarity matrix and an E-value threshold of 10^-4. Low-complexity sequences are filtered with SEG. The results are then processed as follows:

For each pair of sequences, High-scoring Segment Pairs (HSPs) that are not compatible with a global alignment are removed.
Two sequences in a pair are included in the same family if:
- The remaining HSPs cover at least 60% of the proteins' length (and at least 100 aa).
- Their identity is greater than or equal to 35% (two amino acids are considered similar if their BLOSUM62 similarity score is positive).
We use simple transitive links to build our families: if a pair of sequences A + B and a pair of sequences B + C both fulfill the conditions listed above, then A, B and C are placed in the same family, even if the pair A + C does not fulfill these conditions.
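The pair criteria and the transitive grouping described above can be sketched as follows. This is a minimal illustration, not the code used to build PhEVER: the `PairHit` record, the `passes` and `build_families` functions, and the example values are all hypothetical. The sketch also assumes that the 60% coverage must hold for both proteins of the pair, and it uses a union-find structure as one simple way to follow transitive links.

```python
from dataclasses import dataclass

MIN_COVERAGE = 0.60   # fraction of each protein covered by the retained HSPs
MIN_COVERED_AA = 100  # minimum covered length, in residues
MIN_IDENTITY = 35.0   # percent


@dataclass
class PairHit:
    """Hypothetical summary of one sequence pair after HSP filtering."""
    a: str
    b: str
    len_a: int
    len_b: int
    covered_a: int   # residues of a covered by the retained HSPs
    covered_b: int   # residues of b covered by the retained HSPs
    identity: float  # percent


def passes(hit: PairHit) -> bool:
    """Apply the coverage, length and identity conditions to one pair."""
    coverage_ok = (hit.covered_a >= MIN_COVERAGE * hit.len_a
                   and hit.covered_b >= MIN_COVERAGE * hit.len_b)
    length_ok = min(hit.covered_a, hit.covered_b) >= MIN_COVERED_AA
    return coverage_ok and length_ok and hit.identity >= MIN_IDENTITY


def build_families(hits):
    """Group sequences into families by following transitive links."""
    parent = {}

    def find(x):
        # Register unseen sequences as their own root, then walk up
        # to the root with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for hit in hits:
        root_a, root_b = find(hit.a), find(hit.b)
        if passes(hit) and root_a != root_b:
            parent[root_a] = root_b  # merge the two families

    families = {}
    for seq in list(parent):
        families.setdefault(find(seq), set()).add(seq)
    return sorted(sorted(members) for members in families.values())


example_hits = [
    PairHit("A", "B", 200, 210, 150, 150, 40.0),  # passes
    PairHit("B", "C", 210, 220, 160, 160, 36.0),  # passes
    PairHit("A", "C", 200, 220, 90, 90, 30.0),    # fails on its own
    PairHit("D", "A", 120, 200, 50, 50, 80.0),    # fails: coverage too low
]

if __name__ == "__main__":
    print(build_families(example_hits))  # [['A', 'B', 'C'], ['D']]
```

In this example A and C end up in the same family through B, although the pair A + C does not meet the conditions, while D remains alone.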

The current release was built using SiLiX.

For each family, a complete alignment of all sequences is then computed using MUSCLE and a phylogenetic tree is estimated with PhyML.