Database Building


Sequences

The database is built using sequences taken from SWISS-PROT and its annex TrEMBL. The reasons that led us to choose these collections are threefold:

Families

To build the families, we perform a similarity search of all the proteins against each other with BLASTP2, using the BLOSUM62 similarity matrix and an E-value threshold of 10^-4. Low-complexity sequences are filtered with SEG. The results are then processed as follows:
  1. For each pair of sequences, Homologous Segment Pairs (HSPs) that are not compatible with a global alignment are removed (see example).

  2. Two sequences in a pair are included in the same family if:

  3. We use simple transitive links to build our families. If a pair of sequences A and B and a pair of sequences B and C both fulfill the conditions listed above, then A, B and C are placed in the same family, even if the pair A and C does not fulfill these conditions.

  4. Once families of complete protein sequences have been built, partial sequences (longer than 100 AA or at least 50% of the length of the complete proteins) are included in the classification. A partial sequence matching a complete protein is included in its family if:

  5. Short partial sequences (less than 100 AA and less than 50% of the length of the complete proteins) are not included in the classification.
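
The transitive linking described in step 3 amounts to single-linkage clustering. The sketch below is one possible way to express it with a union-find structure; it is not the database's actual code. The function name `build_families` and the input `linked_pairs` are hypothetical, and the pairs are assumed to have already passed the pairwise conditions of step 2.

```python
from collections import defaultdict


def build_families(sequences, linked_pairs):
    """Group sequences into families by following transitive links.

    `linked_pairs` is assumed to contain only the pairs that already
    fulfill the pairwise conditions (hypothetical input format).
    """
    parent = {s: s for s in sequences}

    def find(x):
        # Follow parent pointers to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in linked_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families = defaultdict(set)
    for s in sequences:
        families[find(s)].add(s)
    # Sort members and families so the result is deterministic.
    return sorted((sorted(m) for m in families.values()), key=lambda m: m[0])


# A-B and B-C are linked, so A, B and C end up together
# even though the pair A-C was never linked directly.
families = build_families(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
```

A known side effect of single linkage is that one permissive link is enough to merge two otherwise distinct groups, which is why the pairwise conditions matter.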

Alignments

For each family, protein sequences are aligned using CLUSTALW 1.7. All default parameters are used, except that the "Fast/Approximate" option is selected for pairwise alignments.
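
As an illustration only, the following sketch builds a command line for such a run. It assumes the `-INFILE`, `-ALIGN`, `-QUICKTREE` and `-OUTFILE` options as documented for later ClustalW releases (1.8x and 2.x), where `-QUICKTREE` selects the fast pairwise algorithm; whether version 1.7 accepts exactly these names should be checked against its own help output. The executable name and file names are hypothetical.

```python
import shlex


def clustalw_command(infile, outfile, executable="clustalw"):
    """Return an argument list for a multiple alignment with fast pairwise steps.

    Option names are taken from later ClustalW releases (assumption);
    every other parameter is left at its default.
    """
    return [
        executable,
        f"-INFILE={infile}",
        "-ALIGN",        # perform a full multiple alignment
        "-QUICKTREE",    # fast/approximate pairwise alignments
        f"-OUTFILE={outfile}",
    ]


# Hypothetical file names for one family.
cmd = clustalw_command("family_0001.fasta", "family_0001.aln")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where ClustalW is installed.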

Phylogenetic trees

The distance used to build the phylogenetic trees is the observed divergence. When the distance matrix is complete, phylogenetic trees are computed with BIONJ. When the matrix is incomplete (i.e. when the family contains partial sequences that do not overlap with each other), we use the Triangle method (Alain Guenoche, unpublished).
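
To make the notion of an incomplete matrix concrete, here is a minimal sketch of the observed divergence computed from aligned sequences. It assumes that columns containing a gap in either sequence are skipped and that the gap character is `-`; these choices, the function names and the toy sequences are illustrative assumptions, not details taken from the database.

```python
def observed_divergence(seq_a, seq_b, gap="-"):
    """Proportion of differing sites among columns where both sequences have a residue.

    Returns None when the two aligned sequences share no column,
    which is what leaves a hole in the distance matrix.
    """
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # skip columns with a gap (assumed convention)
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        return None
    return differences / compared


def distance_matrix(aligned):
    """Pairwise distances for a dict mapping names to aligned sequences."""
    names = sorted(aligned)
    return {
        (x, y): observed_divergence(aligned[x], aligned[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }


def is_complete(matrix):
    """True when every pair of sequences has a defined distance."""
    return all(d is not None for d in matrix.values())


# Toy alignment: two partial sequences that do not overlap with each other.
aligned = {
    "full": "MKTAYIAK",
    "nterm": "MKSA----",
    "cterm": "----YIGK",
}
matrix = distance_matrix(aligned)
```

Here the pair of partial sequences has no defined distance, so `is_complete(matrix)` is false; this is the situation in which a method that tolerates missing entries is needed in place of BIONJ.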

