ACNUC management

Acnuc management programs in alphabetical order (acnuc & gcgacnuc environment variables identify the database): check, compressnewdiv, connectindex, crenewdiv, crenewrelnum, flattoaddress, gbemgener, initf, listlostfeat, listtoaddress, modacclength, modhconst, modnet, namesindiv, nbrfgenerdiv, newordalphab, ordnet, processft, proctaxdump, raadbstatus, readncbitaxo, setcode, smjytload, sortsubseq, supold, suppr_unused, swgener, swabsv, test_all_codes, updatehelp, voyage, wwwspecies.

Acnuc management programs in functional groups (at LBBE, all are in directory ~banques/bin):

Add / remove sequences to/from an acnuc database.
- initf creates an empty acnuc database.
- gbemgener adds sequences to an acnuc database of nucleotide sequences.
- swgener adds sequences to an acnuc database of protein sequences (SwissProt format).
- nbrfgenerdiv adds sequences to an acnuc database of protein sequences (PIR codata format).
- supold removes sequences from an acnuc database.
- processft scans the features table of already-indexed sequences and creates missing subsequences.
Deal with biological classification of species, keywords tree, and genetic codes.
- proctaxdump produces a series of files with taxonomic information from files maintained by the NCBI.
- readncbitaxo reproduces in an acnuc database a given classification of species, typically NCBI's.
- wwwspecies prepares a file containing the acnuc species tree formatted for use by the acnuc web species browser.
- modnet interactive program to edit the tree structure of species or of keywords.
- setcode assigns genetic code numbers to CDS subsequences.
- test_all_codes lists all genetic codes defined in acnuc.
Maintain clean, coherent, efficient acnuc index files.
- connectindex maintains coherence between a set of flat files and a set of index files.
- newordalphab optimizes access to acnuc index files by rewriting all index files.
- suppr_unused removes all unused references, authors, acc nos, species, keywords, or records from file SMJYT.
- updatehelp updates the summary information giving sequence and residue totals of a database.
- sortsubseq alphabetically sorts sequence names and accession nos.
- compressnewdiv removes unused bytes in a series of division files.
- ordnet reorders species names or keywords compatibly with their tree structure.
- modhconst changes the hashing constants of an acnuc database.
- modacclength changes the maximum length of accession numbers of an acnuc database.
Miscellaneous.
- listtoaddress computes division names & file offsets of a series of sequences.
- flattoaddress computes division names & file offsets of a series of flat files.
- smjytload adds, deletes, or modifies an element of index file SMJYT.
- voyage interactive program to examine the content of any record of any acnuc index file.
- namesindiv computes the list of names of sequences that belong to given acnuc division files.
- listlostfeat scans the features table of all seqs of a database and detects missing subsequences.
- crenewrelnum creates a new RELEASE # keyword in the tree of keywords.
- crenewdiv creates a new division.
- raadbstatus signals when a remotely accessible db is (un)available
- swabsv needed to transfer index files from a SPARC to a PC or ALPHA.
- series of check programs: detect inconsistencies within index files.

initf creates an empty acnuc database.
This is the only acnuc program that does not use the acnuc environment variable. It produces a series of empty index files in the current directory.
usage:
```
initf  db_type [gcg] [punctuation] [hsub=xx] [hkwsp=xx] [acc=xx] [halgo=java|old] [standardextonly] div=div_name
	or
initf -h           (to get a usage message)
where
```
db_type : genbank or embl or swissprot or nbrf, according to the need
gcg : use this option to create a database that indexes GCG files
punctuation : use this option so the created database allows punctuation to appear in sequence data
hsub=xx : use this to set the value of the seq name hashing constant hsub (default=9973, use a prime number of the same magnitude as the total number of seqs in the to-be-built database)
hkwsp=xx : use this to set the value of the species and keyword name hashing constant hkwsp (default=1999, use a prime number of the same magnitude as the total number of keywords in the to-be-built database)
acc=xx : sets the max length of accession numbers (default 8)
halgo=java|old : sets what algorithm will be used for hashing names of sequences, species and keywords (default: java)
standardextonly : use this option so the created database does not use /gene= and /standard_name= feature qualifiers to construct subsequence name extensions.
div=div_name : name of one of the divisions of the future database.
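Because hsub and hkwsp should be prime numbers of about the same size as the expected numbers of sequences and of keywords, a small helper can pick them before initf is run. The following is a minimal Python sketch, not part of acnuc: the function names, the division name gbnew, and the expected counts are illustrative assumptions.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test, adequate for values of this size."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    factor = 3
    while factor * factor <= n:
        if n % factor == 0:
            return False
        factor += 2
    return True


def next_prime(n: int) -> int:
    """Smallest prime greater than or equal to n."""
    candidate = max(n, 2)
    while not is_prime(candidate):
        candidate += 1
    return candidate


def initf_command(db_type: str, div_name: str, n_seqs: int, n_keywords: int) -> str:
    """Build an initf command line using the options documented above."""
    hsub = next_prime(n_seqs)        # seq name hashing constant
    hkwsp = next_prime(n_keywords)   # species and keyword hashing constant
    return f"initf {db_type} hsub={hsub} hkwsp={hkwsp} div={div_name}"


# Hypothetical sizes: about one million sequences and fifty thousand keywords.
print(initf_command("genbank", "gbnew", 1_000_000, 50_000))
```

The same helper can supply the values given to modhconst when an existing database has outgrown its hashing constants.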
gbemgener adds sequences to an acnuc database of nucleotide sequences.
Usage:
```
gbemgener a address_file [-mmap index ... ]
	or
gbemgener d division_name [-mmap index ... ]
where
```
address_file : name of a file typically created by connectindex or by listtoaddress containing the names, divisions and file offsets of sequences to enter
division_name : name of a division (e.g., gbnew for file gbnew.seq); all sequences in this division will be indexed, except those already indexed with the same date; existing sequences with an earlier date are removed and then re-indexed.
index : indicates an index file to be processed entirely in virtual memory; one of ksub, kloc, kkey, kspec, kshrt, klng, ksmj, kaut, kacc, kbib; can be repeated as in: -mmap ksub -mmap kshrt -mmap kkey ; for large databases, it is recommended to use the -mmap option at least with each of ksub, kshrt, kkey, kspec, klng.
Program gbemgener creates subsequences for those items in sequence features that are known to the database as types. By default, the known types are CDS, tRNA, rRNA, scRNA, snRNA, misc_RNA. Use program smjytload to define additional types if desired.
Customized processing of feature qualifiers is possible. Defined qualifiers can be detected and a keyword can be created from the qualifier or its value and attached to the subsequence corresponding to the feature entry. This is done by creating in the $acnuc directory a plain text file called custom_qualifier_policy that describes the desired custom feature qualifier processing. Follow this model (case is not significant):
```
	Qualifier = GENE_FAMILY              
	Use_Value = True                     
	Parent_Keyword = GENE FAMILIES        

	qualifier = GENE_EXPRESSION
	use_value = TRUE
	parent_keyword = GENE EXPRESSIONS
```
Groups of lines deal with distinct qualifiers. The qualifier line begins a group and names the feature qualifier that requires custom processing (e.g., presence of /GENE_FAMILY in qualifiers). The use_value line says True if the value of the qualifier is used to define the keyword (e.g., keyword HBG00234 is used when /GENE_FAMILY="HBG00234" appears). By default the qualifier itself is used as a keyword. The parent_keyword line names a keyword under which to place the keyword in the tree of keywords (e.g., HBG00234 will be placed under GENE FAMILIES). By default the keyword is placed at the top of the tree. The standard output of gbemgener describes what custom processing is used.
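The grouping rules of custom_qualifier_policy can be sketched in Python as follows. This is only an illustration of the file layout described above, not the parser used by gbemgener: the class and function names are hypothetical, and normalizing names to uppercase is an assumption made to reflect that case is not significant.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QualifierPolicy:
    qualifier: str
    use_value: bool = False    # default: the qualifier itself is the keyword
    parent_keyword: str = ""   # default: keyword placed at the top of the tree


def parse_policy(text: str) -> List[QualifierPolicy]:
    """Read 'name = value' lines; each 'qualifier' line starts a new group."""
    policies: List[QualifierPolicy] = []
    for raw in text.splitlines():
        if "=" not in raw:
            continue  # blank or unrecognized line
        key, value = (part.strip() for part in raw.split("=", 1))
        key = key.lower()
        if key == "qualifier":
            policies.append(QualifierPolicy(qualifier=value.upper()))
        elif not policies:
            continue  # setting found before any qualifier line: ignored here
        elif key == "use_value":
            policies[-1].use_value = value.lower() == "true"
        elif key == "parent_keyword":
            policies[-1].parent_keyword = value.upper()
    return policies


def keyword_for(policy: QualifierPolicy, qualifier_value: str) -> str:
    """Keyword attached to the subsequence for one occurrence of the qualifier."""
    return qualifier_value.upper() if policy.use_value else policy.qualifier


EXAMPLE = """
    Qualifier = GENE_FAMILY
    Use_Value = True
    Parent_Keyword = GENE FAMILIES

    qualifier = GENE_EXPRESSION
    use_value = TRUE
    parent_keyword = GENE EXPRESSIONS
"""

for policy in parse_policy(EXAMPLE):
    print(policy)
```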
Program gbemgener creates species names found in SOURCE and OS/OC annotation records and, for new species only, uses the classification information therein to place the new species in the tree. However, gbemgener does not update the acnuc species tree when the classification of a pre-existing species changes. For this reason, program readncbitaxo is useful to maintain coherence between the NCBI classification of species, the acnuc species tree, and sequence annotations.

swgener adds sequences to an acnuc database of protein sequences (SwissProt format).
Usage:

swgener a address_file [-mmap index ... ]
	or
swgener d division_name [-mmap index ... ]
where arguments are as for gbemgener

nbrfgenerdiv adds sequences to an acnuc database of protein sequences (PIR codata format).
This program will become obsolete given the fusion of the PIR and SwissProt databases into UNIPROT.
Usage:
nbrfgenerdiv
Name of address file?   address_file_name
Date de la release? (format 12/31/89)   rel_date
where
address_file_name : name of file of seq names and file offsets typically created by connectindex.
rel_date : date used only for seqs lacking date info in their annotation.
processft scans the features table of already-indexed sequences and creates missing subsequences.
Usage: processft name_file
where
name_file: file of sequence names, one per line, typically created by connectindex or by listlostfeat.
Some situations arise where use of program gbemgener fails to correctly create all subsequences that should arise from sequence feature tables. One such case arises when a subsequence declared in the features table of seq. A is JOINed to a fragment of seq. B and when seq. B, but not seq. A, is updated, in the sense that its date is changed. Program connectindex detects the date change, so seq. B is removed (by supold) and re-indexed (by gbemgener), but gbemgener is not in a position to re-create the subsequence because it does not scan A's features table that defines the subseqs. File xxx.lost, created by connectindex, contains the name of seq. A, so running program processft with this file completes the database update by re-creating the subseq.
Another case is when a subsequence-associated feature table entry is added to a sequence without changing its date. Program connectindex does not detect this kind of change. The solution is to run listlostfeat, which detects all missing subsequences of an acnuc database, and then to run processft on its output to create these missing subsequences.
listlostfeat scans the features table of all seqs of a database and detects missing subsequences.
This program, run without arguments, reads the features table of all seqs of an acnuc database and detects missing sub-sequences. For each such case, it writes on its standard output the name of the parent sequence and the feature entry corresponding to the missing subsequence. If sent to a file, this output data is suitable to be used as argument for the processft program.
connectindex maintains coherence between a set of flat files and a set of index files.
This program can be used in 3 modes:
update mode : Connects an existing set of index files to an updated set of flat files and identifies changed, new, and disappeared sequences. Typically used to prepare acnuc indexing of a new release of flat files.
install mode : Connects a set of index files to a set of flat files and hides access to sequences present in index files but not in flat files. Typically used after copying index files from a distribution to ensure their coherence with local flat files.
scan mode : Does install mode on a given series of flat files rather than on all flat files.
Scan mode is obtained by running the program with arguments:
connectindex -number division_name ...
where number is the number of following division names
The other two modes are obtained by running the program without arguments and replying to a program dialog.
In update mode, the dialog replies are
```
u
f   or    g     (for flat or GCG-formatted division files, respectively)
base_name       (base name of a series of output files to be created by the program)
number          (number of divisions in the acnuc database)
xxx             (names of these divisions on successive lines, without extension)
```
In update mode, five output files are created. File disparu.mne lists names of sequences present in indices but absent from flat files. File base_name.1 lists new seqs (present in flat, absent in indices). File base_name.2 lists modified seqs (seq date or length or subsequences differ between indices and flat files). File base_name.lost lists names of seqs to be processed later by program processft because their features table changed. File base_name.address gives division names and file offsets of all new or changed sequences; it is to be used as an argument of programs gbemgener or swgener.
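The update-mode replies and the resulting file names can be prepared programmatically. The Python sketch below only assembles the reply text and lists the expected output files. The function names and the example division names are hypothetical, and feeding the replies to connectindex on its standard input is an assumption that should be checked against the local installation.

```python
from typing import List


def update_dialog(base_name: str, divisions: List[str], gcg: bool = False) -> str:
    """Replies for connectindex update mode, one per line, in dialog order."""
    replies = [
        "u",                   # update mode
        "g" if gcg else "f",   # GCG-formatted or flat division files
        base_name,             # base name of the output files
        str(len(divisions)),   # number of divisions in the database
        *divisions,            # division names, without extension
    ]
    return "\n".join(replies) + "\n"


def expected_outputs(base_name: str) -> List[str]:
    """The five files created by an update-mode run."""
    return [
        "disparu.mne",             # seqs in indices but absent from flat files
        f"{base_name}.1",          # new seqs
        f"{base_name}.2",          # modified seqs
        f"{base_name}.lost",       # seqs to pass later to processft
        f"{base_name}.address",    # argument for gbemgener or swgener
    ]


print(update_dialog("upd", ["gbbct1", "gbnew"]), end="")
print(expected_outputs("upd"))

# Assumption: connectindex reads its dialog replies from standard input.
# import subprocess
# subprocess.run(["connectindex"], input=update_dialog("upd", ["gbbct1", "gbnew"]),
#                text=True, check=True)
```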
In install mode, the dialog replies are
```
i
f   or    g     (for flat or GCG-formatted division files, respectively)
y   or    n     (if y an additional dialog item is needed)
	new_div_name (only if previous reply was y, a new division with this name 
	             is created in index files)
number          (number of divisions in the acnuc database)
xxx             (names of these divisions on successive lines, without extension)
```
newordalphab optimizes access to acnuc index files by rewriting all index files.
This program duplicates all index files in the directory pointed to by the acnuc environment variable under names xxx.NEW, and then deletes all old index files and renames the new files.
```
usage
newordalphab
             ...wait for termination with message "Normal end" on stdout.
```
listtoaddress computes division names & file offsets of a series of sequences.
usage:
```
listtoaddress names_file output_file
where
```
names_file : file of names of seqs to be processed.
output_file : file with division names and file offsets of these sequences.
Typical usage is to re-index a series of sequences by doing :
```
listtoaddress mylist.names mylist.address
supold mylist.names -mmap
gbemgener a mylist.address -mmap ksub -mmap kshrt
```
flattoaddress computes division names & file offsets of a series of flat files.
usage:
```
flattoaddress outfname flatfname...
where
```
outfname : name of output file with division names & file offsets of all entries present in flat files
flatfname : name(s) of input flat files containing sequence entries
Typical usage is to index a series of flat files by doing :
```
flattoaddress new.address flat1.dat flat2.dat
crenewdiv flat1
crenewdiv flat2
gbemgener a new.address
```
crenewdiv creates a new division
usage: crenewdiv division_name
supold removes sequences from an acnuc database.
usage:
```
supold names_file [-mmap ]
where
```
names_file : file of names of seqs to be removed, one per line
-mmap : this option lets the program work faster for large numbers of sequences
smjytload adds, deletes, or modifies an element of index file SMJYT.
The acnuc index file SMJYT contains one record for each name of molecule, journal, publication year, sequence type. It contains also the names of the division files of the database, not processed by this program.
smjytload is an interactive program for creating, renaming, or deleting such names. It can also modify the label of these names.
smjytload is useful to create new sequence types, so that the corresponding subsequences are created by program gbemgener. Each type has a code and a label. Its code is the feature name, converted to uppercase (e.g., CDS, EXON, INTRON). Its label must begin with ".XX" where XX are the two letters used to construct subsequence names (e.g., .PE for CDS to get xxxx.PE1 as a subseq name); the rest of the label may describe the type.
smjytload is also useful to correct journal codes (remove duplicates for example).
compressnewdiv removes unused bytes in a series of divisions.
When an acnuc database is updated daily, new sequences are added at the end of division files dedicated to holding them (for example, gbnew.seq). Such new sequences may be modified again later, so new versions of them appear further down the divisions of new sequences, and previous versions are no longer indexed.
compressnewdiv reads a series of division files (typically only those holding daily updates), compresses them in place by removing their unindexed portions, and updates pointers to all data that changed place in these files.
Usage:
```
compressnewdiv division_name... 
where
```
division_name : one or several names of division files to be compressed in place
modacclength changes the maximum length of accession numbers of an acnuc database.
Usage: modacclength length
where length is ≥ 10 and ≥ current max length of acc nos.
Program voyage gives the current maximum length of accession numbers.
modhconst changes the hashing constants of an acnuc database.
Usage:
modhconst [hsub=new_value] [hkwsp=new_value]
Access by sequence name in a large database will be faster if constant hsub is a prime number of the same magnitude as the total number of seqs in the database. The same holds for constant hkwsp and the total number of keywords.
readncbitaxo reproduces in an acnuc database a given classification of species, typically NCBI's.
The program reads file id.report in the current directory, which is expected to contain a classification of species in a specific format once used by NCBI, and reproduces it entirely in the current acnuc database. There are two exceptions: species that exist in acnuc but not in the input file remain unchanged, and species of the input file that are absent from acnuc are not created.
The program creates a log file (id.log in the current directory) describing the input classification, the current acnuc classification, and all operations done to transform the second into the first.
There are two optional arguments to this program:
-partial : instructs the program not to delete synonyms existing in the current acnuc classification but not in the input classification
-niveau : instructs the program to use taxonomic level information of the input classification as node label (used by databases such as Hovergen).
The input classification file, id.report, typically produced by program proctaxdump, contains information about taxon names and classification, names of taxonomic levels, synonyms, common names (expressed as species label in acnuc), and genetic codes for corresponding genomic (ncbi_gc) or mitochondrial (mt_ncbi_gc) sequences. The format for this is as in this example:
```
 1.cellular organisms
 2..Bacteria  [superkingdom]
      synonym: Bacteria
      ncbi_gc: 11
 3...Cyanobacteria [phylum]
       common name: cyanophytes
       synonym: Cyanophyceae
       synonym: Cyanophycota
 4....Chroococcales [order]
 5.....Aphanocapsa [genus]
 6......Aphanocapsa sp. [species]
 6......Aphanocapsa feldmani [species]
 5.....Aphanothece [genus]
 6......Aphanothece sacrum [species]
 6......Aphanothece naegelii [species]
 5.....Microcystis [genus]
 6......Microcystis aeruginosa [species]
 7.......Microcystis aeruginosa UAM254
```
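The layout of the example above can be read back into a tree with a few lines of Python. This is a sketch inferred from the example only, not the reader used by readncbitaxo: it assumes that each taxon line starts with its depth followed by dots, that an optional taxonomic level appears in square brackets, and that indented "name: value" lines attach to the preceding taxon. The class and function names are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Taxon line: depth, dots, name, optional [level].
NODE_RE = re.compile(r"^\s*(\d+)(\.+)(.*?)\s*(?:\[([^\]]+)\])?\s*$")
# Attribute line attached to the preceding taxon.
ATTR_RE = re.compile(r"^\s+(synonym|common name|ncbi_gc|mt_ncbi_gc):\s*(.*?)\s*$")


@dataclass
class Taxon:
    name: str
    depth: int
    level: Optional[str] = None
    attributes: Dict[str, List[str]] = field(default_factory=dict)
    children: List["Taxon"] = field(default_factory=list)


def parse_report(text: str) -> List[Taxon]:
    roots: List[Taxon] = []
    stack: List[Taxon] = []  # ancestors of the line being read
    for line in text.splitlines():
        match = NODE_RE.match(line)
        if match:
            depth = int(match.group(1))
            node = Taxon(name=match.group(3), depth=depth, level=match.group(4))
            del stack[max(depth - 1, 0):]  # drop taxa that are not ancestors
            (stack[-1].children if stack else roots).append(node)
            stack.append(node)
            continue
        match = ATTR_RE.match(line)
        if match and stack:
            stack[-1].attributes.setdefault(match.group(1), []).append(match.group(2))
    return roots


SAMPLE = """
 1.cellular organisms
 2..Bacteria  [superkingdom]
      synonym: Bacteria
      ncbi_gc: 11
 3...Cyanobacteria [phylum]
       common name: cyanophytes
       synonym: Cyanophyceae
       synonym: Cyanophycota
 4....Chroococcales [order]
 5.....Microcystis [genus]
 6......Microcystis aeruginosa [species]
 7.......Microcystis aeruginosa UAM254
"""

roots = parse_report(SAMPLE)
bacteria = roots[0].children[0]
print(bacteria.name, bacteria.level, bacteria.attributes)
```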
proctaxdump produces a series of files with taxonomic information from files maintained by the NCBI.
The NCBI biological classification of species is distributed as ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz. From that archive file, files nodes.dmp and names.dmp can be extracted. Proctaxdump reads these two files and produces three output files:
- id.report typically used by program readncbitaxo to reproduce in acnuc the ncbi classification
- ncbicodes.out that summarizes the genetic code information present in the input files
- setcode.dialog formatted as input for the setcode program.
In the LBBE setup, file id.report is created nightly in directory /acnucdb/genbank/taxman and used as input classification information for all acnuc databases through program readncbitaxo.
wwwspecies prepares a file containing the acnuc species tree formatted for use by the acnuc web species browser.
This file is written on the standard output of the program.
setcode assigns genetic code numbers to CDS subsequences.
Setcode is a dialog-based program that repeatedly asks for taxon name, genetic code id, and boolean mitochondrial information, and assigns this genetic code id to all subsequences of type CDS (and possibly of organelle MITOCHONDRION) from that taxon or taxa below in the tree.
The dialog is
taxon name or stop ( the program stops if stop)
y or n (for mitochondrial or genomic genetic code info, respect.)
acnuc-genetic-code-id (an acnuc-defined genetic code id)
[loop back to asking taxon name]
Procedure setcodegenbank.com runs the setcode program with the setcode.dialog information. It thus applies the genetic code information present in the NCBI classification to all of an acnuc database.
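A reply file for the setcode dialog can be assembled as in the Python sketch below. It follows only the dialog order listed above; the exact layout of the setcode.dialog file written by proctaxdump is not described here, so the function name, the taxon names, and the genetic code ids used in the example are hypothetical.

```python
from typing import Iterable, Tuple


def setcode_dialog(assignments: Iterable[Tuple[str, bool, int]]) -> str:
    """Build dialog replies: taxon name, y/n for mitochondrial, code id; then stop."""
    lines = []
    for taxon, mitochondrial, code_id in assignments:
        lines.append(taxon)
        lines.append("y" if mitochondrial else "n")
        lines.append(str(code_id))  # an acnuc-defined id, not an NCBI id
    lines.append("stop")            # ends the dialog loop
    return "\n".join(lines) + "\n"


# Hypothetical assignments, for illustration only.
print(setcode_dialog([("VERTEBRATA", True, 2), ("BACTERIA", False, 0)]), end="")
```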
The flow that gives a CDS subsequence its correct genetic code in acnuc is as follows. Program readncbitaxo writes in acnuc the genetic code information given in file id.report as part of the label of any leaf node or any sequence-bearing node. Program gbemgener uses this information to assign the adequate genetic code number to any CDS subsequence it creates. But this flow fails when gbemgener creates a new species and associated subsequences, because the genetic code information is not yet available to the program. Program setcode is thus useful to enforce coherent genetic code information throughout an acnuc database.
test_all_codes lists all genetic codes defined in acnuc.
This lists on stdout all genetic codes defined in acnuc in a format that allows comparison with NCBI's gencode.dmp file. The output also gives both NCBI's and acnuc's genetic code ids. One can then detect if new genetic codes appeared in NCBI and define them in acnuc.

modnet interactive program to edit the tree structure of species or of keywords.
A series of operations can be done :

 0  Orientation towards Species or Keywords
 1  Creation of a node
 2  Modification of the name and/or the label of a node
 3  Creation of a branch
 4  Move of a branch
 5  State of a node 
 6  Delete a node or a synonym 
 7  Browse the tree 
 8  Create synonyms
 9  List isolated or unused nodes and detect tree loops
10  Modify the order of descendants of a node
11  Remove all unused nodes

modnet can be used to correct a few branches or nodes in the classification of species. Program readncbitaxo is to be used for more extensive changes.
modnet is the main way to organize a series of keywords as a tree.

voyage interactive program to examine the content of any record of any acnuc index file.
Voyage is a utility program that helps debug acnuc programs.
namesindiv computes the list of names of sequences that belong to given acnuc division files.
Usage: namesindiv outfname divname ...
where
outfname: name of an output file to be filled with seq names, one per line
divname : one or several names of acnuc division files (e.g., gbbct2 est_fun)
suppr_unused removes all unused references, authors, acc nos, species, keywords, or records from file SMJYT.
Usage: suppr_unused oper_id
where
oper_id: one of bib aut acc spec key smj to specify references, authors, acc nos, species, keywords, or records from file SMJYT, respectively.
Operation bib should be done before operation aut to be efficient.
When dealing with species or keywords, nodes whose descendants in the tree are all unused nodes are also deleted.
ordnet reorders species names or keywords compatibly with their tree structure.
Usage: ordnet oper_id
where
oper_id: s or k to specify species or keywords, respectively.
This program allows high-ranking taxa to appear before low-ranking ones in the index file of species names, which makes the output of browsing the species tree cleaner. The same applies to the keywords index file.
updatehelp updates the summary information giving sequence and residue totals of a database.
Usage: updatehelp [ -noupdate ]
The program computes the total number of sequences, subsequences, references, and nucleotides or amino acids in the current acnuc database, and writes this information at the top of on-line help files HELP and HELP_WIN.
The program also writes the date of the day the program is run, unless run with the -noupdate argument.
sortsubseq alphabetically sorts sequence names and accession nos.
This program, used without argument, is useful during the procedure of daily indexing after the gbemgener/swgener run to have again all sequences and accession numbers alphabetically sorted by name. Subsequences are sorted in the order of their appearance in the features table.
crenewrelnum creates a new RELEASE # keyword in the tree of keywords.
Looks for the first descendant of RELEASE NUMBERS, which should be of the form RELEASE #, and creates a new keyword with the number incremented by one. Useful for the GenBank format, after indexing a full database release and before starting daily updates, so that new sequences are associated with the release number of the next full release. Not needed with the EMBL format because release numbers are read from annotations rather than guessed by gbemgener.
swabsv converts index files so they can be used on computers with little-endian architecture, such as PCs and Alphas, after having been built on a big-endian computer such as a SPARC or a PowerPC.
This program must be run once, with the directory containing the index files as the current directory, after write access to these files has been allowed. If a second run is attempted, the message "Index files have already been swapped, that's good." appears and no change is made to the index files.
This program is not needed when all operations (creation, usage) on index files are done on computers with the same endianness.
This program will do harm if run on a big-endian computer and applied to big-endian index files.
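The effect of an endianness conversion can be illustrated in Python on a buffer of 32-bit integers. This sketch is not the swabsv algorithm: the record layout of acnuc index files is not described here, and the function name and the assumption that the data consist only of 4-byte signed integers are illustrative.

```python
import struct


def swap32(data: bytes) -> bytes:
    """Re-encode big-endian 32-bit integers as little-endian."""
    count, remainder = divmod(len(data), 4)
    if remainder:
        raise ValueError("length must be a multiple of 4 bytes")
    values = struct.unpack(f">{count}i", data)   # read as big-endian
    return struct.pack(f"<{count}i", *values)    # write as little-endian


big_endian = struct.pack(">2i", 1, 258)
little_endian = swap32(big_endian)
print(big_endian.hex(), "->", little_endian.hex())

# Swapping already-swapped bytes restores the original byte order, so a
# second conversion would make the data unreadable again on the target host.
print(swap32(little_endian) == big_endian)
```

This is one way to see why the conversion must be applied exactly once, and only to files whose byte order differs from that of the computer that will use them.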
raadbstatus signals when a remotely accessible database becomes unavailable because under update, or available again.
Usage: raadbstatus -f knowndbfile -p namedpipe -n dbname { on | off }
knowndbfile: name of file with list of remotely accessible acnuc databases (environment variable raalist gives this name)
namedpipe: name of pipe to communicate with the racnucd daemon (environment variable raadisable gives this name)
dbname: name of database, taken from first column of knowndbfile
on | off: use off to make db unavailable, on to make it available
series of check programs: detect inconsistencies within index files.
A series of programs that help detect several sorts of inconsistencies within index files, for example, a link from a sequence to a keyword that is not matched by a corresponding link from keyword to sequence. These programs are :
- checkacc : acc no <==> seq links
- checkarbre : tree structure in species and keyword index files
- checkaut : author <==> reference links
- checkbc : coherence between SQ / SUMMARY annotation lines and sequence data
- checkbib : sequence <==> reference links
- checkhash : integrity of hashing of sequence, species and keyword names
- checkinfnucpointers : integrity of pointers to annotations and sequences
- checkkw : sequence <==> keyword links
- checklng : integrity of all data in LONGL index file
- checkmefi : parent-sequence <==> subsequence links
- checksmj : sequence <==> SMJYT links
- checkspec : sequence <==> species links
- checksyno : integrity of synonymy data in species and keyword trees
Each program runs on all of the database and writes a description of any detected inconsistency on stdout.