Bioinformatics Workshop

Phylogeny practical

October 2001
The workshop will mainly use two programs:

A multiple sequence alignment editor: seaview

A phylogeny program used for the parsimony and neighbor-joining (NJ) methods: phylo_win
These programs are available on your workstation.

You will have to save several example data files to your disk by using the "save as..." option of your browser. Preferably use the .mase extension for sequence alignments.

A) Starter: Universal phylogeny

File 28sfrags.mase contains a set of prealigned rRNA sequences from the large (LSU) and the small (SSU) subunits.
Visualize the set of "reliably" aligned sites called "all sequences".
Build the universal phylogeny.
Bootstrap it.
Try using the "transversion-only" evolutionary distance.
Is the position of the Euglena chloroplast sequence what you would expect?
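To make the "transversion-only" idea concrete, here is a minimal Python sketch that computes the raw proportion of transversions (a purine facing a pyrimidine) between two aligned sequences. It is an illustration only: the function name is hypothetical, sites with gaps or ambiguous characters are simply skipped, and the distance actually computed by phylo_win may apply a correction that is not shown here.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}


def transversion_proportion(seq1, seq2):
    """Proportion of compared sites where a purine faces a pyrimidine.

    Sites where either sequence has a gap or an ambiguous character
    are ignored (an assumption made for this sketch).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        a_known = a in PURINES or a in PYRIMIDINES
        b_known = b in PURINES or b in PYRIMIDINES
        if not (a_known and b_known):
            continue
        compared += 1
        # A transversion changes the chemical class of the base.
        if (a in PURINES) != (b in PURINES):
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return transversions / compared


# Toy sequences, invented for illustration: 5 comparable sites,
# of which 2 are transversions (G/U and A/C).
print(transversion_proportion("ACGU-A", "GCUUAC"))
```

Transitions (A/G or C/U changes) are ignored by this measure, which is why it is less affected by saturation between very distant sequences.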

B) A 250-million-year-old bacterium: is it possible?

Vreeland et al. have published the isolation of a 250-million-year-old bacterium from a salt crystal.
Their data are reproduced in permians.mase, a file of aligned bacterial 16S rRNA sequences.
Compare the parsimony method with the distance + NJ method. What important information do the branch lengths convey in the NJ analysis?
What do you think of their conclusions?
The results of Vreeland et al. have been severely criticized by Graur and Pupko, who concluded that the isolated bacterium is most probably recent in age.

C) The evolutionary origins of HIV-1 and HIV-2 viruses among primate viruses (SIV)

Gao et al. have published (Nature 397:436) a phylogenetic analysis of the pol gene of the HIV-1 and HIV-2 viruses and of their simian homologs (SIV). File hivpol.mase contains publicly available protein sequences with which you can attempt to reproduce their results. Sequence FIV/Oma (Feline Immunodeficiency Virus) is used as an outgroup for the analysis. File hivpol-dna.mase contains the same alignment at the DNA level.
Identify which simian species are at the origin of the HIV-1 and HIV-2 viruses.
Where possible, run analyses on the Ka (nonsynonymous) and Ks (synonymous) distances.
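The distinction behind Ka and Ks can be sketched in Python by classifying each differing codon pair in a DNA alignment as synonymous (same amino acid) or nonsynonymous (different amino acid). This is a crude count for illustration only: real Ka and Ks estimators also normalize by the number of synonymous and nonsynonymous sites and correct for multiple substitutions, and nothing here is claimed to match what phylo_win computes. The function name and the toy sequences are hypothetical; the table is the standard genetic code.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def count_codon_differences(dna1, dna2):
    """Return (synonymous, nonsynonymous) counts of differing codon pairs.

    Both sequences must be aligned in frame. Codons containing gaps or
    ambiguous bases are skipped (an assumption made for this sketch).
    """
    if len(dna1) != len(dna2):
        raise ValueError("sequences must be aligned to the same length")
    synonymous = 0
    nonsynonymous = 0
    for pos in range(0, len(dna1) - 2, 3):
        codon1 = dna1[pos:pos + 3].upper().replace("U", "T")
        codon2 = dna2[pos:pos + 3].upper().replace("U", "T")
        if codon1 not in CODON_TABLE or codon2 not in CODON_TABLE:
            continue
        if codon1 == codon2:
            continue
        if CODON_TABLE[codon1] == CODON_TABLE[codon2]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


# Toy in-frame alignment: GCT/GCC both encode Ala (synonymous),
# AAA (Lys) versus AGA (Arg) is nonsynonymous, the gap codon is skipped.
print(count_codon_differences("ATGGCTAAA---", "ATGGCCAGA---"))
```

Synonymous changes saturate quickly between distant sequences, which is one reason the exercise says "where possible".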

File hivpol.pdf contains the (nearly) complete article by Gao et al. (there is a problem with its first page!).

D) Test the OPV (Oral Polio Vaccine) hypothesis

It has been hypothesized that HIV-1 was transmitted to humans by oral polio vaccine preparations used in Africa in the 1950s. The data in file frag12s.mase are the raw data from the article by Blancou et al., which refutes this hypothesis.

These are unaligned sequences from a ~140 bp fragment of the 12S mitochondrial rRNA. First, add homologous sequences from genera Homo, Pan, Gorilla, Cercopithecus, Erythrocebus, Cercocebus, Macaca. Next, align all these sequences. Finally, reproduce the Blancou et al. analysis.

Berry et al. have independently published a refutation of the same OPV hypothesis. Unfortunately, their data do not seem to be publicly available.

If you have trouble finding homologous sequences, file 12sprimates.mase provides suitable material.

E) A case with the "Long Branch Attraction" artefact

File microsplsu.mase contains an alignment of several large subunit (LSU) rRNAs from eukaryotes and archaea. It also contains the LSU rRNA sequence from Encephalitozoon cuniculi, a member of the microsporidia, a protist group whose phylogenetic position is highly debated.
Observe how much shorter the microsporidian sequence is than the other eukaryotic sequences.
Notice that LSU rRNAs from distant eukaryotes cannot be aligned over their full length, but only across some more conserved regions. Try to define a set of "reliably aligned" sequence sites.
What evolutionary origin is predicted for microsporidia when these sequences are studied with distances such as K2P (Kimura two-parameter) and with the NJ method?
Nevertheless, Peyretaillade et al. have published (NAR 26:3513) an analysis of these data with and without accounting for variable evolutionary rates among sites of the LSU rRNA molecule. Notice how the phylogenetic inferences differ for the same data set.
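For reference, the K2P distance used in this exercise can be sketched in Python. With P the proportion of transitions and Q the proportion of transversions between two aligned sequences, the Kimura two-parameter distance is d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q). The sketch below assumes equal rates at all sites (it does not include the among-site rate variation discussed above), skips sites with gaps or ambiguous characters, and uses a hypothetical function name.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        a = "T" if a == "U" else a
        b = "T" if b == "U" else b
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous characters
        compared += 1
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1      # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1    # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    term1 = 1 - 2 * p - q
    term2 = 1 - 2 * q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("sequences too divergent: K2P distance undefined")
    return -0.5 * math.log(term1) - 0.25 * math.log(term2)


# Toy sequences: one transition in ten sites. The corrected distance
# is slightly larger than the observed proportion of differences (0.1).
print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))
```

The correction grows rapidly as sequences diverge, so long branches are the ones most affected by how well the distance model fits the data.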
File rpb1.mase contains an alignment of RPB1 protein sequences (RNA polymerase II, large subunit) from several eukaryotes and two microsporidia, Vairimorpha necatrix and Nosema locustae. Sequences RPC1_YEAST and RPA1_Yeast form the outgroup.
Hirt et al. (PNAS 96:580) have analyzed these data with a maximum likelihood method (program ProtML).
Analyze these data using protein distances + NJ + bootstrap.
Define sets of "reliably" aligned sequence sites.
Visualize ProtML precomputed trees for these data (set of sites = choix2) with and without outgroup.
Compare the positions of the long-branch lineages.
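The bootstrap step requested in this section rests on resampling alignment columns with replacement. The Python sketch below shows only that resampling step, under the assumption that the alignment is held as a dictionary of equal-length strings. The function name and the toy sequences are hypothetical, and building a tree from each pseudo-replicate is left to the phylogeny program.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap pseudo-replicate of an alignment.

    alignment: dict mapping sequence name -> aligned sequence string.
    rng: a random.Random instance (passed in so runs can be reproduced).
    The same randomly drawn columns are used for every sequence.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_sites = lengths.pop()
    # Draw n_sites column indices with replacement.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }


# Toy protein alignment, invented for illustration.
toy = {"seq1": "MKVLAT", "seq2": "MRVLGT", "seq3": "MKILAS"}
replicate = bootstrap_alignment(toy, random.Random(1))
for name, seq in replicate.items():
    print(name, seq)
```

Repeating this many times and recording how often each group of sequences reappears in the resulting trees gives the bootstrap support values.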

F) Orthology and paralogy among bacterial protein genes

File gltb.mase contains a multiple alignment of large (or alpha) subunits of the enzyme glutamate synthase from several bacteria. The gltB gene encodes this protein in Escherichia coli. The homologous yeast sequence, which has the homolog of the small subunit fused to its C-terminus, is also present.
Some bacterial genomes, e.g., Synechocystis sp., contain two genes from this family, while most species contain only one gene.
Compute the tree representing the evolutionary history of this family.
Do you think the Escherichia coli and the Bacillus subtilis gltB genes are orthologous or paralogous genes?
Explain why this example illustrates a situation where orthology determination by reciprocal best BLAST hit fails.
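To make the reciprocal best hit criterion concrete, here is a minimal Python sketch. It assumes similarity scores are already available as dictionaries keyed by (query, subject) pairs. The function names, gene names and scores are all hypothetical, and no BLAST run is involved. The toy case shows one way the criterion can mislead: if an ancient duplication produced two copies and each genome kept a different one, the two surviving genes are each other's best hit although they are paralogs. Whether this is what happens with gltB is left to the exercise.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict mapping (query, subject) -> similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted(
        (a, b) for a, b in forward.items() if backward.get(b) == a
    )


# Hypothetical scores: genome A kept only copy 1, genome B only copy 2.
a_vs_b = {("A_copy1", "B_copy2"): 310}
b_vs_a = {("B_copy2", "A_copy1"): 305}
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Only a tree that includes genomes carrying both copies can reveal that such a pair descends from a duplication rather than from a speciation event.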