ACNUC physical structure

An ACNUC database is made of a series of index files (ACCESS, AUTHOR, BIBLIO, EXTRACT, KEYWORDS, LOCUS, LONGL, SHORTL, SMJYT, SPECIES, SUBSEQ, TEXT, and the optional MERES, TAXIDS and TAXTREE) that allow efficient access to sequences and annotations through a variety of selection criteria. Sequences and annotations are stored in flat files (e.g., fun.dat, gbbct1.seq) created by the database producers (e.g., EMBL, GenBank, SwissProt); ACNUC accesses these files in strictly read-only mode.


Each index file is a series of fixed-length records made of several fields. Unless indicated otherwise, each field is a 4-byte unsigned integer. Records are referred to by their number, or rank, counting from 1. Binary integer values are either all big-endian or all little-endian.
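The addressing rule above can be sketched in Python. This is only an illustration: the function names are hypothetical, and it assumes that records are stored back to back with no padding, so that the record of rank n starts at byte (n - 1) * record length.

```python
import struct

def record_offset(rank: int, record_length: int) -> int:
    """Byte offset of a record, given its 1-based rank (assumes no padding)."""
    if rank < 1:
        raise ValueError("record ranks count from 1")
    return (rank - 1) * record_length

def unpack_ints(raw: bytes, big_endian: bool) -> tuple:
    """Decode a run of 4-byte unsigned integers in the chosen byte order."""
    count = len(raw) // 4
    order = ">" if big_endian else "<"
    return struct.unpack(f"{order}{count}I", raw[:count * 4])
```

The byte order must be known before any field is decoded; how it is recorded or detected is not covered by this sketch.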

The first record of all index files (except MERES, TAXIDS and TAXTREE) follows this structure:
total |sorting state|end_sorted|


Parameters:
L_MNEMO = length of sequence names (variable in new format, fixed to 16 in old one)
WIDTH_SP = length of species names (variable in new format, fixed to 40 in old one)
WIDTH_KW = length of keywords (variable in new format, fixed to 40 in old one)
WIDTH_AUT = length of author names (variable in new format, fixed to 20 in old one)
WIDTH_BIB = length of BIBLIO names (variable in new format, fixed to 40 in old one)
WIDTH_SMJ = length of code in file SMJYT (variable in new format, fixed to 20 in old one)
SUBINLNG = number of SUBSEQ pointers in LONGL records (variable in new format, fixed to 63 in old one)
ACC_LENGTH = value ≥8 read at run-time when the database is opened
lrtxt = length of records of TEXT file (variable in new format, fixed to 60 in old one)
hsub, hkwsp = parameters that control the hashing of sequence names, species names and keywords


SUBSEQ one record for each parent sequence or subsequence
name |length|  type  |pext  P:≤0 , S:>0 | plkey  | plinf        |    phase     |  h   |
     |      |to SMJYT| P: subseq list   | SHORTL |P: LOCUS      |100*code+frame|SUBSEQ|
                     | S: to EXTRACT    |        |S: feat start |       
SUBSEQ records can be sorted (by the programs newordalphab and sortsubseq). When they are, parent sequences are in alphabetical order, and subsequences follow the order in which they appear in the annotations.
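As an illustration of the SUBSEQ layout, the following Python sketch decodes one record. It rests on several assumptions that are not stated in this document: fields are packed in table order with no padding, the name is ASCII padded on the right with spaces, and L_MNEMO defaults to 16 (the old-format value). Because the table gives pext ≤ 0 for parent sequences, the integer fields are decoded here as signed values. The class and function names are hypothetical.

```python
import struct
from typing import NamedTuple

class SubseqRecord(NamedTuple):
    name: str
    length: int
    type_: int   # pointer to SMJYT
    pext: int    # parent: <= 0 (subseq list); subsequence: > 0 (to EXTRACT)
    plkey: int   # pointer to SHORTL
    plinf: int   # parent: to LOCUS; subsequence: feature start
    phase: int   # 100*code + frame
    h: int       # pointer to SUBSEQ

    @property
    def is_parent(self) -> bool:
        return self.pext <= 0

    @property
    def code_and_frame(self) -> tuple:
        # Split phase = 100*code + frame into (code, frame).
        return divmod(self.phase, 100)

def parse_subseq(raw: bytes, l_mnemo: int = 16,
                 big_endian: bool = False) -> SubseqRecord:
    """Decode one SUBSEQ record (assumed layout: name, then 7 integers)."""
    order = ">" if big_endian else "<"
    name = raw[:l_mnemo].decode("ascii").rstrip(" \0")
    fields = struct.unpack_from(f"{order}7i", raw, l_mnemo)
    return SubseqRecord(name, *fields)
```

Under these assumptions a record occupies L_MNEMO + 28 bytes.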
LOCUS one record for each parent sequence
sub   |pnuc|pinf| pnuc2 |pinf2 |spec     |host   |plref |molec|placc |stat | org | div |date|
SUBSEQ|    |    |       |      |N:SPECIES|SPECIES|SHORTL|SMJYT|SHORTL|SMJYT|SMJYT|     |    |
                               |P:SHORTL |

KEYWORDS and SPECIES one record for each keyword or taxon
name|libel|plsub| desc | syno   |    h   |plhost|
    |TEXT |LONGL|SHORTL|KEYWORDS|KEYWORDS|
                       |SPECIES |SPECIES |LONGL |
The last field, plhost, exists in SPECIES and is absent from KEYWORDS.
BIBLIO one record for each reference
name|plsub |plaut |  j  |  y  |
    |SHORTL|SHORTL|SMJYT|SMJYT|

AUTHOR one record for each author name
name|plref |
    |SHORTL|

ACCESS one record for each accession number
name|plsub |
    |SHORTL|

SMJYT one record for each status, molecule, journal, year, type, organelle, division, and item of database structure information
name|plong|libel|
    |LONGL|TEXT |

EXTRACT (for nucleotide databases only) one record for each exon of each subsequence
mere  |deb|fin| next  |
SUBSEQ|   |   |EXTRACT|

TEXT one lrtxt-character record for each label of a species, keyword, or SMJYT
   label  |
For species, labels may also contain the genetic codes that apply to the species, NCBI's taxon id, and the name of the taxonomic level (e.g., order, family).
LONGL one record for each group of SUBINLNG elements of a long list
sub[0],sub[1],...,sub[SUBINLNG-1] |next |
     SUBSEQ,...                   |LONGL|
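A long list can be read by following the next pointers from one LONGL record to the following one. The Python sketch below shows the idea under two assumptions that this document does not state: a next value of 0 ends the chain, and a value of 0 in sub[] marks an unused slot. The read_longl callable is hypothetical; it stands for whatever code fetches the LONGL record of a given rank.

```python
from typing import Callable, List, Sequence, Tuple

# (sub[0..SUBINLNG-1], next) as decoded from one LONGL record
LonglRecord = Tuple[Sequence[int], int]

def read_long_list(first: int,
                   read_longl: Callable[[int], LonglRecord]) -> List[int]:
    """Collect the SUBSEQ ranks of a long list starting at LONGL rank `first`."""
    members: List[int] = []
    visited = set()
    rank = first
    while rank != 0:                      # assumption: 0 terminates the chain
        if rank in visited:
            raise ValueError(f"LONGL chain loops back to record {rank}")
        visited.add(rank)
        subs, next_rank = read_longl(rank)
        members.extend(s for s in subs if s != 0)  # assumption: 0 = empty slot
        rank = next_rank
    return members
```

The visited set only guards against a corrupted file in which the chain would loop forever.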

SHORTL in most cases, one record for each element of a short list
val | next |
    |SHORTL|
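The (val, next) pairs form a singly linked list inside the file. A minimal Python sketch of reading one short list from an in-memory copy of SHORTL follows. It assumes 8-byte records with no padding, little-endian integers, and a next value of 0 at the end of a list; none of these is stated here, and the function name is hypothetical.

```python
import struct
from typing import List

SHORTL_RECORD = struct.Struct("<2I")  # (val, next); little-endian assumed

def read_short_list(image: bytes, first: int) -> List[int]:
    """Return the values of the short list that starts at SHORTL rank `first`."""
    values: List[int] = []
    visited = set()
    rank = first
    while rank != 0:                      # assumption: 0 terminates the list
        if rank in visited:
            raise ValueError(f"SHORTL list loops back to record {rank}")
        visited.add(rank)
        offset = (rank - 1) * SHORTL_RECORD.size   # ranks count from 1
        val, next_rank = SHORTL_RECORD.unpack_from(image, offset)
        values.append(val)
        rank = next_rank
    return values
```

Elements of one list need not be adjacent in the file, which is why the sketch follows pointers instead of scanning consecutive records.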

MERES an optional index file that allows faster opening of the database.
lenw binary integer values that encode the bitlist of all parent sequences of the database.
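A bitlist of this kind can be turned back into record ranks as sketched below in Python. The bit numbering is an assumption, not something this document specifies: bit k of word w, counting from the least significant bit, is taken to stand for rank 32*w + k + 1. The function name is hypothetical.

```python
import struct
from typing import List

def decode_bitlist(raw: bytes, big_endian: bool = False) -> List[int]:
    """List the ranks whose bit is set in a bitlist of 4-byte words."""
    order = ">" if big_endian else "<"
    count = len(raw) // 4
    words = struct.unpack(f"{order}{count}I", raw[:count * 4])
    ranks: List[int] = []
    for w, word in enumerate(words):
        for bit in range(32):
            if (word >> bit) & 1:
                # assumption: least significant bit first, ranks count from 1
                ranks.append(32 * w + bit + 1)
    return ranks
```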


TAXIDS an optional index file that implements the TID= retrieval criterion
count | ... count integer values ... |
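The layout shown above is a count followed by that many integers. A minimal Python sketch of reading it, assuming 4-byte unsigned values like those of the other index files, is given below; it does not interpret what each value means, and the function name is hypothetical.

```python
import struct
from typing import List

def parse_taxids(raw: bytes, big_endian: bool = False) -> List[int]:
    """Read `count`, then return the `count` integer values that follow it."""
    order = ">" if big_endian else "<"
    if len(raw) < 4:
        raise ValueError("file too short to hold the count")
    (count,) = struct.unpack_from(f"{order}I", raw, 0)
    if len(raw) < 4 * (count + 1):
        raise ValueError("file shorter than its count announces")
    return list(struct.unpack_from(f"{order}{count}I", raw, 4))
```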


TAXTREE an optional index file containing all the information of the species tree; it is used to speed up the loadtaxonomy function of the remote ACNUC server.
This ASCII file contains one line per taxon name, of the form: rank&parent&count&"name"&"label"
where