ACNUC glossary

Alphabetically sorted terms : acnuc - gcgacnuc, division, flat file, genetic code, help file, index file, label, tree


acnuc - gcgacnuc : two environment variables used by all ACNUC programs to give the names of the directories where index files and flat files, respectively, are located.

division : each flat file is called a database division. When running management programs, divisions are generally referred to by the filename without its extension (e.g. gbbct1). Division names are all stored in index file SMJYT.
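
As a minimal sketch of the naming rule above, the division name can be derived from a flat file name by dropping the extension. The function name division_name and the directory in the example are hypothetical; they are not part of acnuc.

```python
from pathlib import Path


def division_name(flat_file):
    """Return the division name: the flat file name without its extension."""
    return Path(flat_file).stem


# A GenBank flat file gbbct1.seq corresponds to division gbbct1.
print(division_name("/some/dir/gbbct1.seq"))
```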

flat file : acnuc databases index sequence and annotation files, which can be flat files, that is, plain text files as distributed by the database creators. GenBank flat files are named gbxxx.seq; EMBL and SwissProt flat files are named xxx.dat. Currently, flat files cannot exceed 4 GB in size, so that an unsigned 4-byte integer is enough to hold any address within the file. All acnuc programs except compressnewdiv access flat files in read-only mode. Flat files are always located in the directory whose name is given by environment variable gcgacnuc.
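
The two constraints above (location given by gcgacnuc, addresses that fit in 4 bytes) can be illustrated with a short sketch. This is not acnuc code: the helper names are hypothetical, and the byte order used when packing an address is an assumption made only for the illustration.

```python
import os
import struct
from pathlib import Path

# Largest file size whose every byte offset fits in an unsigned 4-byte integer.
MAX_FLAT_FILE_SIZE = 2 ** 32


def flat_file_path(filename, env=os.environ):
    """Build the path of a flat file inside the directory named by gcgacnuc."""
    return Path(env["gcgacnuc"]) / filename


def fits_4_byte_addressing(size_in_bytes):
    """True if every offset in a file of this size fits in 4 unsigned bytes."""
    return 0 <= size_in_bytes <= MAX_FLAT_FILE_SIZE


def pack_address(offset):
    """Pack an offset as an unsigned 4-byte integer (native byte order assumed).

    struct.error is raised when the offset does not fit in 4 bytes.
    """
    return struct.pack("=I", offset)
```

For example, flat_file_path("gbbct1.seq", {"gcgacnuc": "/db/flat"}) gives /db/flat/gbbct1.seq, and pack_address(2 ** 32) fails because that offset would need a fifth byte.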

genetic code : nucleotide sequence databases use a number of variant genetic codes to properly translate CDS into protein sequences. These genetic codes, defined by NCBI and distributed together with the species classification, are identified by two numerical ids, one given by NCBI and one defined by acnuc. Which genetic code is used by which species is stored in acnuc in species labels.
List of defined genetic codes (* : stop codons)

ncbi code ID   acnuc code ID   differences from universal code
1              0               Universal genetic code
3              1               CUN=T AUA=M UGA=W
2              2               AGR=* AUA=M UGA=W
4              3               UGA=W
5              4               AUA=M UGA=W AGR=S
12             5               CUG=S
6              6               UAR=Q
10             7               UGA=C
9              8               UGA=W AGR=S AAA=N
13             9               UGA=W AGR=G AUA=M
14             10              UGA=W AGR=S UAA=Y AAA=N
15             11              UAG=Q
11             12              NUG=AUN=M when initiation codon
16             13              UAG=L
21             14              AUA=M UGA=W AGR=S AAA=N
22             15              UAG=L UCA=*
23             16              UUA=*
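
Because each genetic code carries two ids, converting between them is a simple lookup. The sketch below copies the id pairs from the list above into a Python dictionary; the variable names are illustrative only and do not come from acnuc.

```python
# NCBI code ID -> acnuc code ID, copied from the list of defined genetic codes.
NCBI_TO_ACNUC = {
    1: 0, 3: 1, 2: 2, 4: 3, 5: 4, 12: 5, 6: 6, 10: 7, 9: 8,
    13: 9, 14: 10, 15: 11, 11: 12, 16: 13, 21: 14, 22: 15, 23: 16,
}

# Reverse lookup: acnuc code ID -> NCBI code ID.
ACNUC_TO_NCBI = {acnuc_id: ncbi_id for ncbi_id, acnuc_id in NCBI_TO_ACNUC.items()}

print(NCBI_TO_ACNUC[11])  # acnuc id of NCBI code 11
print(ACNUC_TO_NCBI[5])   # NCBI id of acnuc code 5
```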

help file : text files HELP and HELP_WIN contain on-line help information for the query and query_win programs. They also contain summary information: database name, release number, and total numbers of sequences, references and residues. Both files are located in the $acnuc directory.

index file : acnuc databases are made up of a series of index files that allow efficient access to sequence files according to various retrieval criteria. Index files are ACCESS, AUTHOR, BIBLIO, EXTRACT (not for protein databases), KEYWORDS, LOCUS, LONGL, MERES (optional, serves only to allow a quicker launch), SHORTL, SPECIES, SUBSEQ, TEXT. Acnuc index files are always located in the directory whose name is given by environment variable acnuc.

label : species, keywords, journal codes and type names can each have an optional label, which is a descriptive 60-char string stored in index file TEXT. For species, the label also stores genetic code information: the label begins with mtgc:#| or with gc:#| to give the number of the mitochondrial or nuclear genetic code, respectively, of this species. For gene family databases such as Hovergen and Hobacgen, the label can also store taxonomic level information such as [species] or [suborder].
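
The genetic code prefix of a species label can be read with a small parser. This is a sketch under assumptions, not acnuc's own code: it assumes the prefix is exactly mtgc:N| or gc:N| with N a decimal number, and it tolerates both prefixes appearing one after the other, which the description above does not confirm. It also does not decide whether N is the NCBI or the acnuc code id. The function name is hypothetical.

```python
import re

# One leading field of the form "mtgc:N|" or "gc:N|".
_PREFIX = re.compile(r"(mtgc|gc):(\d+)\|")


def parse_species_label(label):
    """Split a species label into its genetic code fields and remaining text.

    Returns (codes, text) where codes maps "mtgc" and/or "gc" to a number.
    """
    codes = {}
    pos = 0
    while True:
        match = _PREFIX.match(label, pos)
        if match is None:
            break
        codes[match.group(1)] = int(match.group(2))
        pos = match.end()
    return codes, label[pos:].strip()


print(parse_species_label("mtgc:2|some descriptive text"))
```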

tree : species names (or, more generally, taxon names) and keywords are organized in two trees in acnuc databases. As a result, selecting from a tree node selects all sequences attached to all nodes placed below it in the tree. For taxon names, the tree structure is extensive (nearly all nodes are properly placed in the tree) and follows NCBI's classification of species. For keywords, the tree structure is very sparse (most keywords are at the top of the tree, with nothing below); it nevertheless proves useful to organize some logically related keywords.
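
The selection rule for trees can be sketched as a traversal that collects sequences from a node and everything below it. The data layout (two dictionaries), the function name and the sequence names are hypothetical and say nothing about how acnuc stores its trees; the sketch also assumes that sequences attached to the selected node itself are included.

```python
def select_from_node(node, children, attached):
    """Collect the sequences attached to a node and to all nodes below it.

    children: maps a node name to the list of its direct child nodes.
    attached: maps a node name to the set of sequences attached to it.
    """
    selected = set()
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        selected.update(attached.get(current, ()))
        stack.extend(children.get(current, ()))
    return selected


# Toy taxon tree with made-up sequence names.
children = {"Primates": ["Homo", "Pan"], "Homo": ["Homo sapiens"]}
attached = {
    "Primates": {"SEQ4"},
    "Pan": {"SEQ3"},
    "Homo sapiens": {"SEQ1", "SEQ2"},
}
print(sorted(select_from_node("Primates", children, attached)))
```

Selecting from "Homo" in this toy tree returns only the sequences attached to "Homo sapiens", while selecting from "Primates" returns every sequence in the tree.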