ACNUC physical structure

An ACNUC database is made of a series of index files (ACCESS, AUTHOR, BIBLIO, EXTRACT, KEYWORDS, LOCUS, LONGL, SHORTL, SMJYT, SPECIES, SUBSEQ, TEXT) that allow efficient access to sequences and annotations through a variety of selection criteria. Sequences and annotations are stored in flat files (e.g., fun.dat, gbbct1.seq) created by the database producers (e.g., EMBL, GenBank, SwissProt); ACNUC accesses these files in strictly read-only mode.

One-page summary of structure.

Index files are made of a series of fixed-length records containing several fields that are 4-byte unsigned integer values except when indicated. Records are referred to by their number or rank, counting from 1.

The first record of all index files follows this structure :
total |sorting state|end_sorted|

total: number of last written record in index file.
sorting state: a 6-char string; "SORTED" indicates that records are alphabetically sorted and "1/2SOR" that they are only partially sorted; anything else means the file may not be sorted.
end_sorted: used only when sorting state is "1/2SOR"; gives the rank of the last alphabetically sorted record.
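
The header record described above could be decoded along the following lines. This is only a sketch: the function name parse_header is hypothetical, and both the byte order (little-endian here) and the assumption that the three fields are packed back to back from offset 0 are not stated in this document and should be checked against a real index file.

```python
import struct


def parse_header(record: bytes, byteorder: str = "<") -> dict:
    """Decode the first record of an index file.

    ASSUMPTIONS (not from the format description): fields are packed
    without padding starting at offset 0, and integers use `byteorder`.
    """
    # total (4-byte unsigned), sorting state (6 chars), end_sorted (4-byte unsigned)
    total, state, end_sorted = struct.unpack_from(byteorder + "I6sI", record, 0)
    state = state.decode("ascii", errors="replace")
    if state == "SORTED":
        sorted_upto = total        # every written record is sorted
    elif state == "1/2SOR":
        sorted_upto = end_sorted   # sorted only up to this rank
    else:
        sorted_upto = 0            # file may not be sorted at all
    return {"total": total, "state": state, "sorted_upto": sorted_upto}


# Build a fake header in memory and decode it again.
example = struct.pack("<I6sI", 1200, b"1/2SOR", 950)
header = parse_header(example)
```

A reader that wants to binary-search by name would restrict the search to ranks up to sorted_upto and scan the remaining records linearly.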

Constants:
L_MNEMO = 16
WIDTH_KS = 40
SUBINLNG = 63
ACC_LENGTH = value ≥8 read at run-time when the database is opened

SUBSEQ one record for each parent or sub-sequence

name |length|  type  |pext  P:≤0 , S:>0 | plkey  | plinf        |    phase     |  h   |
     |      |to SMJYT| P: subseq list   | SHORTL |P: LOCUS      |100*code+frame|SUBSEQ|
                     | S: to EXTRACT    |        |S: feat start |

name : padded by spaces to L_MNEMO uppercase characters; subsequences are named by adding a dot and an extension to their parent's name.
length : when 0, indicates a deleted record; deleted records appear in the list starting at record #3 of file LONGL.
type : to SMJYT, for seq type.
pext : its sign determines if parent (P, when ≤0) or sub-sequence (S, when >0);
- > 0 : to EXTRACT for start of chain of corresponding exons
- = 0 : this is a parent sequence without subsequence
- < 0 : the opposite of pext (-pext) is to LONGL for start of long list of subsequences.
plkey : to SHORTL, for list of attached keywords.
plinf : if Parent, to LOCUS for corresponding record; if Subsequence, to SHORTL for start of annotations.
phase : for protein-coding subseqs, combination of genetic code and reading frame (0,1,2) information according to 100*code+frame, or 0.
h : next element of chain of SUBSEQ records with same hashing value, 0 at end of chain.
SUBSEQ records can be sorted (by programs newordalphab and sortsubseq), and if so, they are alphabetically sorted at the parent sequence level and by order of appearance in annotations at the subsequence level.
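
As an illustration of how the SUBSEQ fields combine, here is a minimal sketch that unpacks one record and interprets pext and phase. The function name parse_subseq, the little-endian byte order, the packed field layout and the sample name ECOLAC.LACZ are assumptions made for the example, not facts taken from this document.

```python
import struct

L_MNEMO = 16  # from the Constants section


def parse_subseq(record: bytes, byteorder: str = "<") -> dict:
    """Decode one SUBSEQ record (ASSUMED packed layout and byte order)."""
    # name, then 7 four-byte fields; pext is unpacked as signed ("i")
    # because its sign separates parents from subsequences.
    fmt = f"{byteorder}{L_MNEMO}sIIiIIII"
    name, length, seq_type, pext, plkey, plinf, phase, h = struct.unpack_from(fmt, record)
    info = {
        "name": name.decode("ascii").rstrip(" "),
        "deleted": length == 0,      # length 0 marks a deleted record
        "length": length,
        "type": seq_type,            # rank in SMJYT
        "is_parent": pext <= 0,
        "plkey": plkey,              # rank in SHORTL
        "plinf": plinf,              # LOCUS rank (parent) or SHORTL rank (subsequence)
        "h": h,                      # next SUBSEQ record with same hashing value
    }
    if pext > 0:
        info["extract_start"] = pext        # first exon, in EXTRACT
    elif pext < 0:
        info["subseq_list_start"] = -pext   # long list of subsequences, in LONGL
    # phase = 100*code + frame
    info["genetic_code"], info["frame"] = divmod(phase, 100)
    return info


# Hypothetical subsequence record: genetic code 11, reading frame 1.
raw = struct.pack("<16sIIiIIII", b"ECOLAC.LACZ".ljust(16), 3075, 5, 42, 7, 9, 1101, 0)
subseq = parse_subseq(raw)
```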

LOCUS one record for each parent sequence

sub   |pnuc|pinf| bef |next |spec     |host   |plref |molec|placc |stat | org | div |date|
SUBSEQ|    |    |LOCUS|LOCUS|N:SPECIES|SPECIES|SHORTL|SMJYT|SHORTL|SMJYT|SMJYT|     |    |
                            |P:SHORTL |

sub : to SUBSEQ for corresponding record; 0 for a deleted record.
pnuc, pinf : address in flat file of rank div of start of sequence (pnuc) and annotations (pinf); using unsigned integers allows addressing up to 4 GB.
bef, next : implement the SEGMENT structure of the GenBank format; unused with EMBL.
spec : to SPECIES for corresponding species in nucleotide db (N); to SHORTL for list of species in protein db (P).
host : (implemented in EMBL only) to SPECIES for organism host to the sequence.
plref : to SHORTL for list of attached references.
molec : to SMJYT for molecule.
placc : to SHORTL for list of attached accession numbers.
stat : to SMJYT for status.
org : to SMJYT for organelle.
div : rank of flat file (see SMJYT) where the sequence appears.
date : 16-char field following format mm/dd/yymm/dd/yy for date of seq entry in database.

KEYWORDS and SPECIES one record for each keyword or taxon

name|libel|plsub| desc | syno   |    h   |plhost|
    |TEXT |LONGL|SHORTL|KEYWORDS|KEYWORDS|
                       |SPECIES |SPECIES |LONGL |

The last field, plhost, exists in SPECIES and is absent from KEYWORDS.

name : uppercase only; padded by spaces to WIDTH_KS characters; set to "xxx...xxx" when deleted.
libel : 0, or to TEXT for a 60-char label; in SPECIES, this label may indicate the genetic codes adequate for the species, and may also contain the taxonomic level (e.g., genus, order, family) of the taxon.
plsub : to LONGL for list of attached sequences (only parent seqs for species, any seq type for keywords).
desc : to SHORTL for list of descendants in tree structure; the absolute value of the first element of this list is the rank of the corresponding record in KEYWORDS/SPECIES; the sign of this number is negative iff there are sequences associated with this record; the other elements of the list are the "desc" values of the records of descendants in the tree; desc = 0 for synonyms.
syno : to KEYWORDS/SPECIES to implement keyword or species synonymy, or 0 if none; synonymous keywords/species are linked in a looped chain; exactly one member of this chain has a negative syno value: it is the major keyword/species and the only one with nonzero plsub and desc; the other members of the chain have a positive syno value; |syno| is the rank in KEYWORDS/SPECIES of the next synonym.
h : next element of chain of KEYWORDS/SPECIES records with same hashing value, 0 at end of chain.
plhost : to LONGL for list of seqs that have this species as host (used in EMBL, e.g., to relate viral or plasmid sequences to their host).
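
The looped synonym chain can be followed as in this sketch, which works on an in-memory mapping from record rank to syno value rather than on the index file itself. The function name major_synonym and the sample ranks are hypothetical.

```python
def major_synonym(syno: dict, rank: int) -> int:
    """Return the rank of the major keyword/species for record `rank`.

    `syno` maps a record rank to its syno field. A value of 0 means the
    record has no synonym; a negative value marks the major record.
    """
    if syno[rank] == 0:
        return rank                      # no synonymy: the record is its own major
    current = rank
    for _ in range(len(syno)):           # bound the walk in case the chain is corrupted
        value = syno[current]
        if value < 0:
            return current               # the only member with a negative syno
        if value == 0:
            break                        # a chain member must never hold 0
        current = value                  # |syno| is the rank of the next synonym
    raise ValueError("synonym chain has no major record")


# Records 2, 3 and 4 are synonyms; record 3 is the major one (syno < 0).
chain = {1: 0, 2: 3, 3: -4, 4: 2}
```

Starting from any member of the loop leads to the same major record, which is the one whose plsub and desc fields should then be consulted.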

BIBLIO one record for each reference

name|plsub |plaut |  j  |  y  |
    |SHORTL|SHORTL|SMJYT|SMJYT|

name : uppercase only; padded by spaces to 40 characters; its form depends on the kind of citation:
- journal citations appear as JournalCode/volume_number/first_page
- book citations as BOOK/year/first_author
- thesis citations as THESIS/year/first_author
- patent citations as PATENT/number
- other citations as UNPUBL/year/first_author
plsub, plaut : to SHORTL for lists of attached sequences and authors, respectively.
j, y : to SMJYT records for corresponding journal and publication year, respectively.
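
The naming conventions for BIBLIO records can be taken apart with a few string operations. The sketch below assumes that no journal code is spelled BOOK, THESIS, PATENT or UNPUBL; the function name parse_biblio_name and the sample citations are made up for illustration.

```python
def parse_biblio_name(name: str) -> dict:
    """Split a BIBLIO name field into its components."""
    head, _, rest = name.rstrip(" ").partition("/")
    if head == "PATENT":
        return {"kind": "patent", "number": rest}
    first, _, second = rest.partition("/")
    if head in ("BOOK", "THESIS", "UNPUBL"):
        return {"kind": head.lower(), "year": first, "first_author": second}
    # Anything else is read as JournalCode/volume_number/first_page.
    return {"kind": "journal", "journal_code": head, "volume": first, "first_page": second}


# Names are stored padded with spaces to 40 characters.
journal_ref = parse_biblio_name("JMB/215/403".ljust(40))
thesis_ref = parse_biblio_name("THESIS/1994/DUPONT".ljust(40))
```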

AUTHOR one record for each author name

name|plref | fut  |
    |SHORTL|unused|

name : uppercase only; padded by spaces to 20 characters; last name only, no initials.
plref : to SHORTL for list of references attached to this author.

ACCESS one record for each accession number

name|plsub |
    |SHORTL|

name : padded by spaces to ACC_LENGTH characters.
plsub : to SHORTL for list of parent seqs attached to this accession number.

SMJYT one record for each status, molecule, journal, year, type, organelle, division, and db structure information

name|plong|libel|
    |LONGL|TEXT |

name : padded by spaces to 20 characters; first 2 characters identify the nature of the object : status("00"), molecule("01"), journal("02"), year("03"), type("04"), organelle("05"), division("06"), and db structure information("07"); uppercase only except for "06".
plong : 0 or to LONGL for list of seqs attached to this object.
libel : 0, or to TEXT for a 60-char label.
More information
- names starting with "06" can be "06FLTfname" or "06GCGfname" and indicate whether sequences and annotations are in flat or in GCG-structured files, and give the name of corresponding files (extension excluded; e.g., 06FLTgbbct1 for flat file gbbct1.seq).
- the labels of "06" records are of the form "rank:xx" and give the rank of the corresponding division, counting from 0.
- the length of accession numbers is given by one record named "07ACCESSION" and with label such as "Length of accession numbers = 13".
- presence of one record named "07HASHING_ALGORITHM" and with label such as "Java algorithm" indicates that the Java hashing algorithm is used for species, keywords and seq names; absence of such record means a previous algorithm is used.
- presence of one record named "07BIG_ANNOTS" indicates that annotations and sequences are addressed by a combination of 2 fields : field div of LOCUS gives the rank of the division the seq belongs to, fields pinf and pnuc of LOCUS give the offsets within the division where annotations and the sequence begin, respectively; absence of such record is no longer supported.
- presence of one record named "07ALLOW_PUNCTUATION" indicates that sequence data is interspersed with punctuation data in flat files (special for databases of rRNA sequences).
- presence of one record named "07NOCHANGESUBSEQNAME" indicates that feature qualifiers /gene=, /standard_name=, /nomgen= will not be used to construct the extension part of subsequence names.
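
The "06" and "07ACCESSION" conventions above can be exploited as in the following sketch, which takes SMJYT records already decoded into (name, label) pairs. The function names, the second division gbbct2 and the 01DNA record are hypothetical; only the name and label formats come from the description above.

```python
def division_files(smjyt) -> dict:
    """Map division rank -> (storage kind, file name without extension)."""
    divisions = {}
    for name, label in smjyt:
        name = name.rstrip(" ")
        if name.startswith(("06FLT", "06GCG")):
            rank = int(label.split(":", 1)[1])    # label looks like "rank:xx"
            divisions[rank] = (name[2:5], name[5:])
    return divisions


def accession_length(smjyt):
    """Return the accession number length, or None if no such record exists."""
    for name, label in smjyt:
        if name.rstrip(" ") == "07ACCESSION":
            # label looks like "Length of accession numbers = 13"
            return int(label.rsplit("=", 1)[1])
    return None


# Names are padded to 20 characters, labels to 60.
records = [
    ("06FLTgbbct1".ljust(20), "rank:0".ljust(60)),
    ("06FLTgbbct2".ljust(20), "rank:1".ljust(60)),
    ("07ACCESSION".ljust(20), "Length of accession numbers = 13".ljust(60)),
    ("01DNA".ljust(20), "".ljust(60)),
]
files = division_files(records)
acc_len = accession_length(records)
```

With such a table, field div of a LOCUS record selects the file, and fields pinf and pnuc give the offsets to seek to inside it.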

EXTRACT (for nucleotide databases only) one record for each exon of each subsequence

mere  |deb|fin|pnuc| next  |
SUBSEQ|   |   |    |EXTRACT|

mere : to SUBSEQ for rank of parent seq containing this exon.
deb, fin : endpoints in parent sequence of the exon.
pnuc : address in flat file of start of parent sequence containing this exon.
next : to next exon of same sub-sequence, or 0 if no more.
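
The exons of a subsequence are reached by starting at field pext of its SUBSEQ record and following the next fields. A minimal sketch on in-memory data, with a hypothetical function name and made-up ranks and coordinates:

```python
def exons_of(extract: dict, start: int) -> list:
    """Collect (mere, deb, fin) for each exon of one subsequence.

    `extract` maps an EXTRACT record rank to its fields
    (mere, deb, fin, pnuc, next); `start` is the positive pext value
    of the subsequence's SUBSEQ record.
    """
    exons = []
    seen = set()
    rank = start
    while rank != 0:                 # next = 0 ends the chain
        if rank in seen:
            raise ValueError("cycle in EXTRACT chain")
        seen.add(rank)
        mere, deb, fin, _pnuc, next_rank = extract[rank]
        exons.append((mere, deb, fin))
        rank = next_rank
    return exons


# Two exons located in the parent sequence of SUBSEQ rank 12.
table = {5: (12, 100, 250, 4096, 9), 9: (12, 400, 520, 4096, 0)}
exon_list = exons_of(table, 5)
```

The sketch deliberately returns the endpoints untouched, since this description does not say how deb and fin are ordered for exons on the complementary strand.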

TEXT one 60-character record for each label of a species, keyword, or SMJYT

   label  |

In the case of species, labels may contain information about the correct genetic codes for this species and about the name of the taxonomic level (e.g., order, family).

LONGL one record for each group of SUBINLNG elements of a long list

sub[0],sub[1],...,sub[SUBINLNG-1] |next |
     SUBSEQ,...                   |LONGL|

sub[i] : 0, or an element of the long list that is always a SUBSEQ record number.
next : 0, or rank of another LONGL record containing other elements of the list.
existing long lists [from: field holding start of list] :
- parent seqs attached to a species (from: field plsub of SPECIES)
- parent seqs attached to a host (from: field plhost of SPECIES)
- seqs attached to an SMJYT element (from: field plong of SMJYT)
- seqs attached to a keyword (from: field plsub of KEYWORDS)
- sub-seqs of a parent sequence (from: opposite of field pext of SUBSEQ)
- all parent seqs in the database (must start at record # 2)
- all records deleted from file SUBSEQ (must start at record # 3)
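
A long list is read by walking LONGL records from the starting rank and skipping empty (0) slots. With SUBINLNG = 63 elements plus the next field, a record holds 64 four-byte values, i.e. 256 bytes. The sketch below assumes little-endian integers; the function names and sample ranks are hypothetical.

```python
import struct

SUBINLNG = 63  # from the Constants section


def pack_longl(subs, next_rank, byteorder="<") -> bytes:
    """Build a LONGL record for testing (helper, not part of ACNUC)."""
    padded = list(subs) + [0] * (SUBINLNG - len(subs))
    return struct.pack(f"{byteorder}{SUBINLNG + 1}I", *padded, next_rank)


def read_long_list(longl: dict, start: int, byteorder: str = "<") -> list:
    """Return all SUBSEQ ranks of the long list starting at record `start`.

    `longl` maps a record rank to the raw bytes of that record.
    """
    members = []
    seen = set()
    rank = start
    while rank != 0:                               # next = 0 ends the list
        if rank in seen:
            raise ValueError("cycle in LONGL chain")
        seen.add(rank)
        *subs, rank = struct.unpack_from(f"{byteorder}{SUBINLNG + 1}I", longl[rank])
        members.extend(s for s in subs if s != 0)  # 0 marks an unused slot
    return members


# A list spread over records 4 and 7, with one empty slot in record 4.
longl_file = {4: pack_longl([3, 0, 8], 7), 7: pack_longl([11], 0)}
members = read_long_list(longl_file, 4)
```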

SHORTL mostly, one record for each element of a short list

val | next |
    |SHORTL|

val : an element of the short list; it may be a signed integer or an unsigned record number.
next : 0, or rank of another SHORTL record containing another element of the list.
existing short lists [from: field holding start of list; val: nature of list elements] :
- sequences attached to a reference (from: field plsub of BIBLIO; val: SUBSEQ record #)
- references attached to a sequence (from: field plref of LOCUS; val: BIBLIO record #)
- authors attached to a reference (from: field plaut of BIBLIO; val: AUTHOR record #)
- references attached to an author (from: field plref of AUTHOR; val: BIBLIO record #)
- sequences attached to an accession number (from: field plsub of ACCESS; val: SUBSEQ record #)
- keywords attached to a sequence (from: field plkey of SUBSEQ; val: KEYWORDS record #)
- accession numbers attached to a sequence (from: field placc of LOCUS; val: ACCESS record #)
- annotation line of a subsequence (from: field plinf of SUBSEQ; val: offset in flat file; these lists have one element exactly)
- descending nodes of a species or of a keyword (from: field desc of KEYWORDS/SPECIES; val: SHORTL record #)
- species attached to a sequence (for protein databases only) (from: field spec of LOCUS; val: SPECIES record #)
File SHORTL also contains data used for hashing of seq names, species and keywords. Hashing is controlled by two positive, odd integer parameters, hsub (for seq names) and hkwsp (for species and keywords). Record #2 of file SHORTL stores the (negative) values (- hsub) and (- hkwsp) in fields val and next, respectively.

Starting from record #3, there are (hsub+1)/2 records containing hsub values for seq name hashing, then (hkwsp+1)/2 records containing hkwsp values for keyword hashing, and then (hkwsp+1)/2 records containing hkwsp values for species hashing. True short lists begin after these records, that is, at number (hsub+1)/2 + (hkwsp+1) + 2 + 1.

Each one of the hsub values stored in SHORTL starting at record #3 is the record number of the start of the chain of SUBSEQ records that share a common hashing value from the range [1,hsub]. Similarly, further data are starts of chains of keywords, and later of species, that share a common hashing value from the range [1,hkwsp]. Hashing values are computed by functions hashmn (seq names) and hasnum (keywords and species).
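
The record arithmetic of the hashing area can be checked with a short sketch. The function names hash_layout and hash_slot are hypothetical. In addition, hash_slot assumes that hashing values fill field val and then field next of successive records (value 1 in val of the first record, value 2 in its next, and so on); the description above only implies that two values are stored per record, so this ordering should be verified.

```python
def hash_layout(hsub: int, hkwsp: int) -> dict:
    """Compute where each hashing table starts in file SHORTL."""
    for value in (hsub, hkwsp):
        if value <= 0 or value % 2 == 0:
            raise ValueError("hsub and hkwsp must be positive odd integers")
    n_sub = (hsub + 1) // 2      # records holding seq name chain starts
    n_kwsp = (hkwsp + 1) // 2    # records per keyword or species table
    first_sub = 3                # tables begin at record #3
    first_kw = first_sub + n_sub
    first_sp = first_kw + n_kwsp
    return {
        "seq_names": first_sub,
        "keywords": first_kw,
        "species": first_sp,
        "first_short_list": first_sp + n_kwsp,
    }


def hash_slot(first_record: int, hash_value: int) -> tuple:
    """Locate a chain start: (record rank, field name).

    ASSUMPTION: values are stored two per record, val first, then next.
    """
    index = hash_value - 1       # hashing values count from 1
    return first_record + index // 2, ("val", "next")[index % 2]


# Record #2 stores -hsub and -hkwsp; here val = -5 and next = -3.
layout = hash_layout(hsub=5, hkwsp=3)
```

For hsub = 5 and hkwsp = 3, the first true short list falls at record 10, which agrees with (hsub+1)/2 + (hkwsp+1) + 2 + 1 = 3 + 4 + 3.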