Tree Viewer application
-
Access to trees of homologous gene families from HOVERGEN, HOGENOM, HOMOLENS, etc.
- Tree viewer: retrieves and displays trees from a sequence or family identifier
Query HOGENOM using web applications
-
Access to several ACNUC databases: EMBL, GenBank, SwissProt, HOVERGEN, HOGENOM, HOMOLENS, etc.
- WWW-Query: retrieves sequences or families by combining several criteria
- Search families by cross taxa: retrieves families according to complex taxonomic criteria
Query HOGENOM proteins
-
You may enter any word (sequence name, keyword, species, ...).
Check the box if you want to report exact matches only.
- Search for protein sequences
- Search for protein families
Query HOGENOM nucleotide sequences
-
You may enter any word (sequence name, keyword, species, ...).
Check the box if you want to report exact matches only.
- Search for CDS sequences
- Search for CDS families
Query HOGENOM using BLAST
-
You may BLAST your sequence against several databases at PBIL.
- Protein BLAST: retrieves protein sequences in HOGENOM
Query HOGENOM using HoSeqI
-
You may search for the HOGENOM family that is closest to your sequence. The associated alignment and phylogenetic tree are generated automatically.
- HoSeqI: retrieves a protein family in HOGENOM
Orthologs search
- You can retrieve orthologous and paralogous genes with the FamFetch application. This is a powerful tool that lets you query the phylogenetic trees database with a complex user-built tree motif including duplication and speciation events. Alternatively, you can use a command-line version of FamFetch. If you are only interested in animal genomes, we recommend the HOMOLENS database, for which precalculated orthologous animal genes are available.
Access to HOGENOM
-
You can query the database on this page or through several access methods:
- the PBIL web applications: WWW-Query, Quick-Search, Cross-Taxa
- the FamFetch client-server application, allowing orthologous and paralogous genes selection
- the QueryWin and Raa_Query, two client-server applications (via sockets)
- the seqinR R package (via sockets)
Contents
-
Organisms
HOGENOM is built from several genomic data sources:
- Genome Reviews: Bacteria and archaea from the EBI (and a few eukarya, such as yeast). We selected 381 genomes.
- Microbial Genomes: Bacteria from the NCBI. We selected 94 genomes.
- Ensembl: Animals from the EBI. We selected 11 genomes:
- Anopheles gambiae
- Caenorhabditis elegans
- Ciona intestinalis
- Ciona savignyi
- Danio rerio
- Drosophila melanogaster
- Gallus gallus
- Homo sapiens
- Mus musculus
- Tetraodon nigroviridis
- Xenopus tropicalis
- EBI eukaryotic complete genomes other than Ensembl: 13 genomes:
- Ashbya gossypii atcc 10895
- Aspergillus fumigatus af293
- Candida glabrata cbs 138
- Cryptococcus neoformans var. neoformans b-3501a
- Debaryomyces hansenii cbs767
- Dictyostelium discoideum
- Guillardia theta
- Kluyveromyces lactis nrrl y-1140
- Leishmania major
- Plasmodium falciparum
- Schizosaccharomyces pombe 972h-
- Trypanosoma brucei
- Yarrowia lipolytica clib122
- Other sources: 14 genomes:
- Apis mellifera (bee) from NHGRI, USDA, NIH and HBGP
- Arabidopsis thaliana from NCBI
- Branchiostoma floridae (amphioxus) from JGI - proteins only
- Cenarchaeum symbiosum from Genbank
- Chlamydomonas reinhardtii (algae) from JGI - proteins only
- Frankia sp. EAN1PEC (*) from Genbank
- Kineococcus radiotolerans from Genbank
- Kuenenia stuttgartensis from NCBI
- Oryza sativa from NCBI
- Ostreococcus tauri (algae) from JGI - proteins only
- Paramecium tetraurelia from Genoscope
- Populus trichocarpa (poplar) from JGI - proteins only
- Strongylocentrotus purpuratus (sea urchin) from NCBI
- Tetrahymena thermophila from TIGR - proteins only
Sequences, Families, Alignments, Phylogenetic trees
- Number of proteins: 2,142,639
- Number of CDS: 2,128,552
- Number of genomic sequences: 135,105
- Number of families (at least 2 sequences): 147,586
- Number of orphans: 397,545 (18%)
- Number of protein sequences associated with a family: 1,742,390 (81%)
Families size distribution
Sequences        Families
2:10             128,686
10:50            14,752
50:100           1,828
100:500          1,925
500:2000         356
more than 2000   39
Alignments for 147,547 families containing 2 to 2000 sequences have been calculated with MUSCLE. 147,537 alignments were successfully generated; 10 failed.
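The figures in the size distribution can be cross-checked against the totals quoted in this section. The short sketch below only re-adds the published counts; the variable names are illustrative and not part of HOGENOM.

```python
# Family counts per size class, copied from the size distribution above.
size_distribution = {
    "2:10": 128_686,
    "10:50": 14_752,
    "50:100": 1_828,
    "100:500": 1_925,
    "500:2000": 356,
    "more than 2000": 39,
}

# All families with at least 2 sequences.
total_families = sum(size_distribution.values())

# Alignments are reported only for families of 2 to 2000 sequences,
# so the largest class is excluded here.
alignable_families = total_families - size_distribution["more than 2000"]

print(total_families)      # 147586
print(alignable_families)  # 147547
```

Both values agree with the number of families and the number of alignments reported above.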
Phylogenetic trees for 76,262 families containing 3 to 2000 sequences have been calculated with PHYML V3.0 (SH-like branch supports, substitution model = JTT, estimated proportion of invariable sites, 4 categories, estimated gamma, initial tree with BIONJ, "NNI" and "SPR" topology exploration) on conserved blocks of the alignments selected with GBLOCKS.
Calculation progress
Updated March 30, 2008:
- Alignments of 2-2000 sequences: 147,537 done of 147,547 (100% done)
- Trees of 3-2000 sequences: 76,122 done of 76,262 (100% done)
- Trees of 4-2000 sequences: 55,179 done of 55,322 (100% done)
- Trees of 500-2000 sequences: 300 done of 356 (84% done)
- Trees of 3 sequences: 20,938 done of 20,940 (100% done)
- Family annotations on proteins: done
- Family annotations on genomes: done
- GC% content annotations on genomes: done
- Improvement of family name definition: done
HOGENOM3 vs HOGENOM4
- Among the 950,216 protein sequences found in HOGENOM 3, 782,767 sequences (82%) are still present in HOGENOM 4.
Server mirroring
-
You don't need to install the server itself to have HOGENOM running on your computer, as the client is enough for that purpose. On the other hand, you may want to set up your own server to speed up your database access and to offer that service to potential users in your geographic area.
Installation instructions can be found at http://pbil.univ-lyon1.fr/databases/acnuc/localinstall.html
The whole database is available from our FTP server at URL: ftp://pbil.univ-lyon1.fr/pub/hogenom/ Note that it is much more efficient to use a dedicated FTP client to download the database rather than an Internet Web browser.
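As an illustration of scripted access with a dedicated FTP client, here is a minimal Python sketch that lists the files in the directory given above, using the standard `ftplib` module. It assumes the server accepts anonymous logins and is still reachable at this URL; the function names are hypothetical.

```python
from ftplib import FTP
from urllib.parse import urlparse

# URL taken from the text above.
HOGENOM_FTP_URL = "ftp://pbil.univ-lyon1.fr/pub/hogenom/"


def split_ftp_url(url):
    """Split an ftp:// URL into (host, directory path)."""
    parts = urlparse(url)
    if parts.scheme != "ftp":
        raise ValueError(f"not an FTP URL: {url}")
    return parts.hostname, parts.path


def list_release_files(url=HOGENOM_FTP_URL, timeout=30):
    """Return the file names found in the remote directory.

    Assumption: anonymous login is permitted by the server.
    """
    host, path = split_ftp_url(url)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()      # anonymous login
        ftp.cwd(path)
        return ftp.nlst()
```

Calling `list_release_files()` opens a network connection, so nothing is contacted until the function is actually called.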
Family Building
-
To build the families, we perform a similarity search of all the proteins against each other with BLASTP2.
For this purpose, we use the BLOSUM62 similarity matrix and an E-value threshold of 10^-4.
Low-complexity sequences are filtered with SEG. The results are then processed as follows:
- For each pair of sequences, Homologous Segment Pairs (HSPs) that are not compatible with a global alignment are removed
- Two sequences in a pair are included in the same family if:
- The remaining HSPs cover at least 80% of the proteins' length.
- Their similarity is greater than or equal to 50% (two amino acids are considered similar if their BLOSUM62 similarity score is positive)
- We use simple transitive links to build our families. If a pair of sequences A + B and a pair of sequences B + C fulfill the conditions listed above, then A, B and C are placed in the same family, even if the pair A + C does not fulfill these conditions.
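The transitive grouping described above amounts to single-linkage clustering. The following Python sketch illustrates the idea with a union-find structure; it is not the HOGENOM pipeline itself. It assumes the coverage and similarity of each pair have already been computed from the filtered HSPs and are given as fractions, with one coverage value per pair (for instance the smaller of the two proteins' coverages). All names are hypothetical.

```python
from collections import defaultdict


def pair_is_linked(coverage, similarity, min_coverage=0.80, min_similarity=0.50):
    """True if a pair meets both thresholds quoted in the text."""
    return coverage >= min_coverage and similarity >= min_similarity


def build_families(pairs):
    """Group sequences by transitive (single-linkage) links.

    pairs: iterable of (seq_a, seq_b, coverage, similarity) tuples.
    Returns a sorted list of families, each a sorted list of names.
    """
    parent = {}

    def find(x):
        # Register unseen sequences as their own family root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, coverage, similarity in pairs:
        root_a, root_b = find(a), find(b)
        if pair_is_linked(coverage, similarity) and root_a != root_b:
            parent[root_a] = root_b  # merge the two families

    families = defaultdict(set)
    for seq in list(parent):
        families[find(seq)].add(seq)
    return sorted(sorted(members) for members in families.values())


example_pairs = [
    ("A", "B", 0.92, 0.61),  # linked
    ("B", "C", 0.85, 0.55),  # linked
    ("A", "C", 0.60, 0.40),  # fails both thresholds
    ("D", "E", 0.50, 0.70),  # fails the coverage threshold
]
print(build_families(example_pairs))  # [['A', 'B', 'C'], ['D'], ['E']]
```

A, B and C end up in one family even though the pair A + C fails the thresholds, while D and E remain singletons.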
Sequence Annotations
-
Family annotation
Protein sequences: for each entry, we add a line in the CC field giving the number of the family the sequence belongs to:
CC -!- GENE_FAMILY: HBG017522.
Genome sequences: for each coding sequence, we add a qualifier giving the number of the family the gene belongs to:
FT /gene_family="HBG017522"
This number is incorporated in the keywords associated with the corresponding entry in the ACNUC database structure. It is therefore possible to retrieve all the sequences of a family with this number using the Query retrieval system or its online version, WWW-Query.
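If you work with downloaded flat files rather than the query tools, the family identifier can be pulled from these annotation lines with a regular expression. The sketch below is a hedged example: the exact column layout of the flat files is an assumption, so the patterns tolerate any amount of whitespace, and the function name is hypothetical.

```python
import re

# Patterns for the two annotation forms shown above.
CC_FAMILY = re.compile(r"^CC\s+-!-\s+GENE_FAMILY:\s*(\w+)")
FT_FAMILY = re.compile(r'^FT\s+/gene_family="(\w+)"')


def family_ids(lines):
    """Return the family identifiers found in a sequence of flat-file lines."""
    ids = []
    for line in lines:
        match = CC_FAMILY.match(line) or FT_FAMILY.match(line)
        if match:
            ids.append(match.group(1))
    return ids


sample = [
    "CC   -!- GENE_FAMILY: HBG017522.",
    'FT                   /gene_family="HBG017522"',
    "FT   3'ncr           2278..2368",
]
print(family_ids(sample))  # ['HBG017522', 'HBG017522']
```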
GC content and intron information annotations
We include in the genomic sequences the GC content of each coding sequence:
FT /%(C+G)="CG<35%"
FT /note="C+G content in third codon positions = 31.4 % "
It is thus possible to select sequences according to their GC content.
We also include in genomic sequences descriptions of non-coding regions:
- INT_INT: internal introns (i.e. within CDS)
- 5'INT: introns in 5'UTR
- 3'INT: introns in 3'UTR
- 5'NCR: 5' non-coding region
- 3'NCR: 3' non-coding region
FT 3'ncr 2278..2368
These subsequences can be selected and extracted from the database in the same way as CDS, using WWW-Query (see Help).
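For reference, the "C+G content in third codon positions" value reported in the /note qualifier can be recomputed from an extracted CDS. The sketch below shows one straightforward way to do so. How HOGENOM treats the stop codon, ambiguous bases or partial codons is not stated here, so the choices made in this function (stop codon included, trailing partial codon ignored) are assumptions.

```python
def gc3_percent(cds):
    """Percentage of G or C among the third codon positions of a CDS.

    Assumptions: the sequence starts in frame, the stop codon (if
    present) is counted, and a trailing partial codon is ignored.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3      # drop a trailing partial codon
    third_positions = seq[2:usable:3]     # every third base, starting at index 2
    if not third_positions:
        raise ValueError("CDS is shorter than one codon")
    gc_count = sum(base in "GC" for base in third_positions)
    return 100.0 * gc_count / len(third_positions)


# Codons ATG GCC AAA TAG have third bases G, C, A, G: 3 of 4 are G or C.
print(gc3_percent("ATGGCCAAATAG"))  # 75.0
```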
Contact and reference
- If you encounter problems when installing or using HOGENOM, please contact Laurent Duret or Simon Penel. We also welcome any comments or suggestions on the database and/or its interface.
Acknowledgements
- Calculations have been done at the IN2P3 Computing Center.
Licence
-
HOGENOM Database
Copyright 2005 CNRS
Authors: Laurent Duret, Manolo Gouy, Simon Penel, Guy Perriere
This database is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
A copy of the GNU General Public License is available at ftp://pbil.univ-lyon1.fr/pub/hogenom and http://www.gnu.org/licenses/.
References
-
If you use families from HOMOLENS or HOGENOM, please cite:
Penel S, Arigon AM, Dufayard JF, Sertier AS, Daubin V, Duret L, Gouy M and Perrière G (2009)
"Databases of homologous gene families for comparative genomics" BMC Bioinformatics, 10 (Suppl 6):S3
