The database is built using sequences taken from SWISS-PROT and its annex
TrEMBL. The reasons that led us to choose these collections are threefold:
- The SWISS-PROT + TrEMBL set is exhaustive and non-redundant.
- The annotations are of high quality compared to the other general
sequence database systems.
- Almost all entries are cross-referenced with their corresponding
nucleotide sequences in GenBank / EMBL / DDBJ.
In SWISS-PROT + TrEMBL, homologous genes from different organisms (or paralogous
genes within a genome) that code for the same protein are described by a single entry.
In HOVERGEN, each gene should have its own entry so that all existing homologous genes can be retrieved.
For this reason, the SWISS-PROT + TrEMBL entries that correspond to several genes are duplicated before being used to
compute the families. These duplicated entries are
given a new name. The duplicated sequences are flagged by an
annotation in their CC (comment) lines.
Example: the sequence 143G_MOUSE3 carries the following annotation:
CC -!- modified from 143G_HUMAN.
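As a minimal sketch, such an annotation could be detected with a regular expression. The function name `duplicated_from` is hypothetical, and the pattern assumes that the comment line always has exactly the form shown in the example above; it is not taken from the database tools themselves.

```python
import re

# Pattern inferred from the single example line above (an assumption).
_MODIFIED_FROM = re.compile(r"^CC\s+-!-\s+modified from\s+(\S+?)\.?\s*$")


def duplicated_from(line):
    """Return the source entry name if `line` flags a duplicated entry, else None."""
    match = _MODIFIED_FROM.match(line)
    return match.group(1) if match else None


print(duplicated_from("CC -!- modified from 143G_HUMAN."))
```

Any other CC line, or a line with a different layout, simply yields `None`.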
To build the families we perform a similarity search of all the proteins
against each other with
For this purpose, we use the BLOSUM62 similarity matrix and a threshold of
10^-4 for E-values. Low-complexity sequences
are filtered with SEG.
The results are then processed as follows:
- For each pair of sequences, Homologous Segment Pairs (HSPs) that are not
compatible with a global alignment are removed (see example).
- Two sequences in a pair are included in the same family if:
- The remaining HSPs cover at least 80% of the proteins' length.
- Their similarity is greater than or equal to 50% (two amino acids are
considered similar if their BLOSUM62 similarity score is positive).
- Both sequences are complete.
- We use simple transitive links to build our families. If a pair of
sequences A + B and a pair of sequences B + C both fulfill the conditions
listed above, then A, B and C are placed in the same family, even if
the pair A + C does not fulfill these conditions.
- Once families of complete protein sequences have been built, partial
sequences (longer than 100 AA or at least 50% of the length of the complete
proteins) are included in the classification. A partial sequence matching
a complete protein is included in its family if:
- The remaining HSPs cover at least 80% of the partial protein length.
- Their similarity is greater than or equal to 50%.
- Short partial sequences (less than 100 AA and less than 50% of the length of
the complete proteins) are not included in the classification.
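The transitive family-building step for complete sequences can be sketched with a union-find structure, which is one common way to implement this kind of single-linkage grouping. This is an illustration only, not the pipeline's actual code: the function name `build_families`, the input tuple layout, and the reading of the 80% rule as applying to both proteins of a pair are all assumptions.

```python
def build_families(sequences, pairs, min_coverage=0.80, min_similarity=0.50):
    """Group complete sequences into families through transitive links.

    sequences: iterable of names of complete proteins.
    pairs: iterable of (name_a, name_b, coverage_a, coverage_b, similarity),
           all expressed as fractions in [0, 1] (hypothetical layout).
           Every name used in `pairs` must also appear in `sequences`.
    """
    parent = {name: name for name in sequences}

    def find(name):
        # Walk up to the family representative, shortening the path on the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b, cov_a, cov_b, sim in pairs:
        # Assumption: the coverage threshold must hold for both proteins.
        if min(cov_a, cov_b) >= min_coverage and sim >= min_similarity:
            parent[find(a)] = find(b)

    families = {}
    for name in parent:
        families.setdefault(find(name), set()).add(name)
    return sorted(sorted(members) for members in families.values())


pairs = [
    ("A", "B", 0.90, 0.85, 0.60),  # linked
    ("B", "C", 0.95, 0.90, 0.55),  # linked
    ("A", "C", 0.40, 0.50, 0.30),  # fails, but A and C are joined through B
    ("C", "D", 0.90, 0.60, 0.70),  # fails on coverage
]
print(build_families("ABCD", pairs))  # [['A', 'B', 'C'], ['D']]
```

Partial sequences would be attached in a second pass, using the separate criteria listed for them, once these families exist.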
For each family, protein sequences are aligned using
CLUSTALW 1.7. All
default parameters are used, except that the "Fast/Approximate" option is
preferred for pairwise alignments.
The distance used to build the phylogenetic trees is the observed divergence.
When the distance matrix is complete, phylogenetic trees are computed
When the matrix is incomplete (i.e. when the family contains partial
sequences that do not overlap with each other), we use the
Triangle method (Alain Guenoche, unpublished).
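To illustrate why non-overlapping partial sequences leave holes in the distance matrix, here is a minimal sketch of an observed-divergence computation on two aligned sequences. The function name `observed_divergence` is hypothetical, and ignoring every column in which either sequence has a gap is an assumption; the text above does not specify how gaps are handled.

```python
def observed_divergence(seq_a, seq_b, gap="-"):
    """Fraction of differing residues over columns where both sequences have one.

    Returns None when the two aligned sequences share no such column,
    i.e. when their distance is undefined and the matrix is incomplete.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = differing = 0
    for res_a, res_b in zip(seq_a, seq_b):
        if res_a == gap or res_b == gap:
            continue  # assumption: gapped columns are skipped
        compared += 1
        if res_a != res_b:
            differing += 1
    return differing / compared if compared else None


print(observed_divergence("MKTA--", "MRTAYL"))  # 0.25 (1 difference in 4 columns)
print(observed_divergence("MKT---", "---AYL"))  # None (no overlapping column)
```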