ACNUC glossary

Alphabetically sorted terms : acnuc - gcgacnuc, acnuctaxo, division, flat file, genetic code, help file, index file, label, tree


acnuc - gcgacnuc : two environment variables used by all ACNUC programs. They give the names of the directories where the index files (acnuc) and the flat files (gcgacnuc) are located.
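As a minimal sketch, a script that works alongside the ACNUC programs could resolve both directories as follows. The function name acnuc_dirs is hypothetical; only the variable names acnuc and gcgacnuc come from this glossary.

```python
import os

def acnuc_dirs(env=None):
    """Return (index_dir, flat_dir) read from the acnuc and gcgacnuc variables.

    `env` defaults to the process environment; passing a dict makes testing easy.
    """
    env = os.environ if env is None else env
    # Report every missing variable at once rather than failing on the first.
    missing = [name for name in ("acnuc", "gcgacnuc") if not env.get(name)]
    if missing:
        raise RuntimeError("environment variable(s) not set: " + ", ".join(missing))
    return env["acnuc"], env["gcgacnuc"]
```

Calling acnuc_dirs() with no argument reads the real environment and raises an error naming whichever variable is unset.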

acnuctaxo : an environment variable that defines the directory where the NCBI taxonomy files (nodes.dmp and names.dmp) are located. Programs acnucgener and readncbitaxo read these two files. These files come from ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz.

division : each flat file is called a database division. When running management programs, a division is generally referred to by its filename without the extension (e.g. gbbct1). All division names are stored in index file SMJYT.
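The naming rule above can be sketched in a few lines. The helper name division_name and the example path are hypothetical; the sketch only assumes that the division name is the flat file's base name with its final extension removed.

```python
import os

def division_name(flat_file_path):
    """Derive a division name from a flat file path by dropping directory and extension."""
    base = os.path.basename(flat_file_path)   # e.g. "gbbct1.seq"
    name, _extension = os.path.splitext(base)  # e.g. ("gbbct1", ".seq")
    return name
```

For example, division_name("/data/genbank/gbbct1.seq") gives "gbbct1".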

flat file : acnuc databases index sequence and annotation files that can be flat files, that is, plain text files as distributed by the database creators. GenBank flat files are named gbxxx.seq; EMBL and SwissProt flat files are named xxx.dat. Flat files can currently exceed 4 GB in size because two 4-byte integers are devoted to storing an address within a file. All acnuc programs except compressnewdiv access flat files in read-only mode. Flat files are always located in the directory named by the environment variable gcgacnuc.
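To illustrate why two 4-byte integers are enough to address files larger than 4 GB, here is a sketch that splits a byte offset into two 32-bit words and joins them again. The high-word/low-word layout and the function names are assumptions made for illustration; this glossary does not say how acnuc actually arranges the two integers.

```python
MASK32 = 0xFFFFFFFF  # largest value that fits in one unsigned 4-byte integer

def split_address(offset):
    """Split a file offset into (high, low) unsigned 32-bit words (assumed layout)."""
    if not 0 <= offset <= 0xFFFFFFFFFFFFFFFF:
        raise ValueError("offset does not fit in two 4-byte integers")
    return offset >> 32, offset & MASK32

def join_address(high, low):
    """Rebuild the file offset from its two 32-bit words."""
    return (high << 32) | low
```

A single 4-byte integer tops out just below 4 GB, whereas an offset of 5 GB splits into the pair (1, 1073741824) and joins back to the same value.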

genetic code : nucleotide sequence databases use a number of variant genetic codes to properly translate CDSs into protein sequences. These genetic codes, defined by NCBI and distributed together with the species classification, are each identified by two numerical IDs, one given by NCBI and one defined by acnuc. The genetic code used by each species is stored in acnuc in the species labels.
List of defined genetic codes (* : stop codon)

NCBI code ID   acnuc code ID   Differences from universal code
 1              0              Universal genetic code
 3              1              CUN=T AUA=M UGA=W
 2              2              AGR=* AUA=M UGA=W
 4              3              UGA=W
 5              4              AUA=M UGA=W AGR=S
12              5              CUG=S
 6              6              UAR=Q
10              7              UGA=C
 9              8              UGA=W AGR=S AAA=N
13              9              UGA=W AGR=G AUA=M
14             10              UGA=W AGR=S UAA=Y AAA=N
15             11              UAG=Q
11             12              NUG=AUN=M when initiation codon
16             13              UAG=L
21             14              AUA=M UGA=W AGR=S AAA=N
22             15              UAG=L UCA=*
23             16              UUA=*
24             17              UGA=W AGA=S AGG=K
25             18              UGA=G
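
Because each genetic code carries two IDs, converting between them is a simple table lookup. The sketch below copies the ID pairs from the list above into a dictionary; the names NCBI_TO_ACNUC and ACNUC_TO_NCBI are hypothetical and are not part of any acnuc program.

```python
# NCBI code ID -> acnuc code ID, copied from the list of defined genetic codes.
NCBI_TO_ACNUC = {
    1: 0, 3: 1, 2: 2, 4: 3, 5: 4, 12: 5, 6: 6, 10: 7, 9: 8, 13: 9,
    14: 10, 15: 11, 11: 12, 16: 13, 21: 14, 22: 15, 23: 16, 24: 17, 25: 18,
}

# Reverse lookup: acnuc code ID -> NCBI code ID.
ACNUC_TO_NCBI = {acnuc_id: ncbi_id for ncbi_id, acnuc_id in NCBI_TO_ACNUC.items()}
```

For instance, NCBI code 12 (CUG=S) is acnuc code 5, and acnuc code 12 is NCBI code 11.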

help file : the text files HELP and HELP_WIN contain online help for the query and query_win programs. They also contain summary information: the database name, the release number, and the total numbers of sequences, references and residues. Both files are located in the $acnuc directory.

index file : acnuc databases are made up of a series of index files (see physical structure) that allow efficient access to sequence files according to various retrieval criteria. The index files are ACCESS, AUTHOR, BIBLIO, EXTRACT (not for protein databases), KEYWORDS, LOCUS, LONGL, MERES (optional, only serves to allow a quicker launch), SHORTL, SPECIES, SUBSEQ, TAXIDS (optional, implements retrieval by taxon ID), TAXTREE (optional, contains all species tree information) and TEXT. Acnuc index files are always located in the directory named by the environment variable acnuc.

label : species, keywords, journal codes and type names can each optionally have a label, which is a descriptive character string stored in index file TEXT.
For species, the label can also store genetic code information: the label begins with mtgc:#| or with gc:#| to give the number of the mitochondrial or nuclear genetic code, respectively, of the species. A species label can also store NCBI's taxon ID: the label then starts with id:#| .
For gene family databases such as Hovergen and Hobacgen, the label can also store taxonomic level information such as [species] or [suborder].
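
As an illustration of the species label prefixes described above, the following sketch extracts the mtgc, gc and id fields from a label string. It assumes that # stands for a decimal number and that several prefixes may be chained at the start of one label; this glossary does not confirm the chaining, and the function name and sample labels are hypothetical.

```python
import re

# One leading field of the form "mtgc:<n>|", "gc:<n>|" or "id:<n>|".
_PREFIX = re.compile(r"^(mtgc|gc|id):(\d+)\|")

def parse_species_label(label):
    """Return (fields, remaining_text) for a species label.

    `fields` maps "mtgc", "gc" or "id" to an integer; the remaining text is
    whatever follows the recognised prefixes.
    """
    fields = {}
    rest = label
    while True:
        match = _PREFIX.match(rest)
        if match is None:
            break
        fields[match.group(1)] = int(match.group(2))
        rest = rest[match.end():]
    return fields, rest
```

With this sketch, the label "gc:3|some text" yields the fields {"gc": 3} and the remaining text "some text", while a label without any prefix is returned unchanged with an empty field dictionary.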

tree : species (or, more generally, taxon names) and keywords are organized in two trees in acnuc databases. As a result, selecting a tree node selects all sequences attached to all nodes placed below it in the tree. For taxon names, the tree structure is extensive (nearly all nodes are properly placed in the tree) and follows NCBI's classification of species. For keywords, the tree structure is very sparse (most keywords are at the top of the tree, with nothing below), but it is still useful for organizing some logically related keywords.
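
The selection behaviour of a tree can be sketched as a simple traversal. The data layout here (one dictionary of child nodes, one dictionary of attached sequences) and all names are hypothetical and do not reflect how acnuc index files store the trees. The sketch also assumes that sequences attached to the selected node itself are included.

```python
def select_sequences(children, attached, node):
    """Collect sequences attached to `node` and to every node below it.

    children: dict mapping a node name to a list of its child node names.
    attached: dict mapping a node name to the sequence names attached to it.
    """
    selected = set()
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against a node reachable by more than one path
        seen.add(current)
        selected.update(attached.get(current, ()))
        stack.extend(children.get(current, ()))
    return selected

# Hypothetical toy taxonomy and sequence names.
children = {"Primates": ["Homo", "Pan"], "Homo": ["Homo sapiens"]}
attached = {"Homo sapiens": ["SEQ1", "SEQ2"], "Pan": ["SEQ3"], "Primates": ["SEQ4"]}
```

Selecting "Primates" in this toy tree returns all four sequences, whereas selecting "Homo" returns only the two attached to "Homo sapiens".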