Acnuc management programs in alphabetical order (acnuc & gcgacnuc environment variables identify the database): acnucgener, check, compressnewdiv, connectindex, crenewdiv, crenewrelnum, cretaxtree, flattoaddress, initf, listlostfeat, listtoaddress, modkeylength, modhconst, modnet, namesindiv, nbrfgenerdiv, newordalphab, ordnet, processft, raadbstatus, readncbitaxo, setcode, smjytload, sortsubseq, supold, suppr_unused, swabacnuc, test_all_codes, updatehelp, voyage, wwwspecies.
Acnuc management programs in functional groups : (in LBBE all are in directory ~banques/debian-bin)
usage:
initf db_type [gcg] [punctuation] [hsub=xx] [hkwsp=xx] [acc=xx] [halgo=java|old] [standardextonly] [protein_idext] [sub=x] [key=x] [spec=x] [aut=x] [bib=x] [smj=x] [txt=x] [lng=x]
or
initf -h (to get a usage message)
where
db_type : genbank or embl or swissprot or nbrf, according to the need
acnucgener a address_file [-mmap index ... ]
or
acnucgener d division_name [-mmap index ... ]
where
address_file : name of a file, typically created by connectindex or by listtoaddress, containing the names, divisions and file offsets of sequences to enter
Program acnucgener creates subsequences for those items in sequence features that are known by the database as types. By default, the known types are CDS, tRNA, rRNA, scRNA, snRNA, misc_RNA. Use program smjytload to define additional types if desired. The /transl_table feature qualifier and the genetic code information from the NCBI classification (see below) are used to assign variant genetic codes to CDS subsequences. Qualifier values found in /EC_number=, /evidence=, /gene=, /product=, /protein_id=, /standard_name= are added as keywords of the subsequence. Qualifier values found in /gene= and /standard_name= are by default used to define the extension of the subsequence name; this behavior is turned off if entry 07NOCHANGESUBSEQNAME exists in file SMJYT. Alternatively, subsequence names can be made from the value of /protein_id= qualifiers if entry 07PROTEIN_IDSUBSEQNAME exists in file SMJYT.
Program acnucgener takes species names first from the /organism= and next from the /db_xref="taxon:###" qualifiers of the source entry of the features table. This rule is reversed (taxon:### first and /organism next) if entry 07PRIORITY_TO_TAXID is present in index file SMJYT (this entry can be created/deleted with program smjytload). If these qualifiers are absent, it uses the ORGANISM or OS records. Program acnucgener reads the full NCBI classification given in files names.dmp + nodes.dmp from directory $acnuctaxo to classify new species. If these files are not found or do not classify the species name, acnucgener uses the information of ORGANISM/OC lines to classify it. However, acnucgener does not update the acnuc species tree when the classification of a pre-existing species changes. For this reason, program readncbitaxo is useful to reflect changes of the NCBI classification of species in the acnuc species tree.
Customized processing of feature qualifiers is possible. Defined qualifiers can be detected and a keyword can be created from the qualifier or its value and attached to the subsequence corresponding to the feature entry. This is obtained by creating in the $acnuc directory a plain text file called custom_qualifier_policy that describes the desired custom feature qualifier processing. Follow this model (case is not significant) :
Qualifier = GENE_FAMILY
Use_Value = True
Parent_Keyword = GENE FAMILIES
qualifier = GENE_EXPRESSION
use_value = TRUE
parent_keyword = GENE EXPRESSIONS
Groups of lines deal with distinct qualifiers. The qualifier line begins a group and names the feature qualifier that requires custom processing (e.g., presence of /GENE_FAMILY in qualifiers). The use_value line says True if the value of the qualifier is used to define the keyword (e.g., keyword HBG00234 is used when /GENE_FAMILY="HBG00234" appears). By default the qualifier itself is used as a keyword. The parent_keyword line names a keyword under which to place the keyword in the tree of keywords (e.g., HBG00234 will be placed under GENE FAMILIES). By default the keyword is at the top of the tree. The standard output of acnucgener describes what custom processing is used.
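The rules described above can be illustrated with a minimal Python sketch that reads a custom_qualifier_policy text and derives the keyword for one qualifier. This is not acnucgener's own parser; the function names parse_qualifier_policy and keyword_for are hypothetical, and the sketch assumes one "name = value" setting per line.

```python
def parse_qualifier_policy(text):
    """Parse 'name = value' lines into one dict per qualifier group (illustrative only)."""
    groups = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        key = key.lower()  # case is not significant
        if key == "qualifier":
            # A qualifier line begins a new group; defaults follow the text above.
            groups.append({"qualifier": value.upper(),
                           "use_value": False,
                           "parent_keyword": None})
        elif not groups:
            continue  # assumption: settings before any qualifier line are ignored
        elif key == "use_value":
            groups[-1]["use_value"] = value.lower() == "true"
        elif key == "parent_keyword":
            groups[-1]["parent_keyword"] = value
    return groups


def keyword_for(policy, qualifier, value):
    """Return (keyword, parent_keyword) for a qualifier, or None if no rule covers it."""
    for group in policy:
        if group["qualifier"] == qualifier.upper():
            keyword = value if group["use_value"] else group["qualifier"]
            return keyword, group["parent_keyword"]
    return None


EXAMPLE_POLICY = """\
Qualifier = GENE_FAMILY
Use_Value = True
Parent_Keyword = GENE FAMILIES
qualifier = GENE_EXPRESSION
use_value = TRUE
parent_keyword = GENE EXPRESSIONS
"""

policy = parse_qualifier_policy(EXAMPLE_POLICY)
print(keyword_for(policy, "gene_family", "HBG00234"))  # ('HBG00234', 'GENE FAMILIES')
```

A parent_keyword of None in this sketch stands for the default case, where the keyword sits at the top of the keyword tree.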
Some situations arise where use of program acnucgener fails to correctly create all subsequences that should arise from sequence feature tables. One such case arises when a subsequence declared in the features table of seq. A is JOINed to a fragment of seq. B and when seq. B, but not seq. A, is updated, in the sense that its date is changed. Program connectindex detects the date change, so seq. B is removed (by supold) and re-indexed (by acnucgener), but acnucgener is not in a position to re-create the subsequence because it does not scan A's features table that defines the subseqs. File xxx.lost, created by connectindex, contains the name of seq. A, so running program processft with this file completes the database update by re-creating the subseq.
Another case is when a subsequence-associated feature table entry is added to a sequence without changing its date. Program connectindex does not detect this kind of change. The solution is to run listlostfeat that detects all missing subsequences from an acnuc database, and then processft on its output, to create these missing subseqs.
Usage:
connectindex -h
gives a summary of program arguments
connectindex -update -basename base_name [-gz gzdirname] [-threads n] -divfile divlist
where
base_name: base name of a series of output files to be created by the program
gzdirname: name of directory where gzip'ed flat files sit. Compressed files are read from this directory and decompressed to the $gcgacnuc directory.
n: optional number of parallel threads to use
divlist: name of file containing list of all division names, one per line
In update mode, five output files are created. File disparu.mne lists names of sequences present in indices but absent from flat files. File base_name.1 lists new seqs (present in flat, absent in indices). File base_name.2 lists modified seqs (seq date or length or subsequences differ between indices and flat files). File base_name.lost lists names of seqs to be processed later by program processft because their features table changed. File base_name.address gives division names and file offsets of all new or changed sequences; it is to be used as an argument of program acnucgener.
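The way sequences are sorted into the first three output files can be sketched in Python as set comparisons between the indices and the flat files. This is only an illustration under simplifying assumptions: the function classify_update and its dictionary inputs are hypothetical, each sequence is reduced to a (date, length) pair, and the subsequence comparison, the .lost file and the .address file are left out.

```python
def classify_update(index_entries, flat_entries):
    """Sort sequence names into update categories.

    Both arguments map a sequence name to a (date, length) tuple.
    Assumption: subsequence differences are not modelled here.
    """
    in_index = set(index_entries)
    in_flat = set(flat_entries)
    return {
        # present in indices but absent from flat files
        "disparu.mne": sorted(in_index - in_flat),
        # present in flat files, absent from indices
        "base_name.1": sorted(in_flat - in_index),
        # present in both, but date or length differ
        "base_name.2": sorted(name for name in in_index & in_flat
                              if index_entries[name] != flat_entries[name]),
    }


index_entries = {"SEQA": ("01-JAN-2001", 1200), "SEQB": ("05-MAR-2002", 800)}
flat_entries = {"SEQB": ("09-SEP-2003", 800), "SEQC": ("10-OCT-2003", 450)}
print(classify_update(index_entries, flat_entries))
```

The sequence names and dates in the usage lines are made-up sample data.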
connectindex -install [-gz gzdirname] [-threads n] -divfile divlist
connectindex [-threads n] -scan=number div1 div2 ...
where number is the number of following division names
The update/install modes can also be obtained by running the program without arguments and replying to a program dialog.
In update mode, the dialog replies are
u
f or g (for flat or GCG formatted division files, respectively)
base_name (base name of a series of output files to be created by the program)
number (number of divisions in the acnuc database)
xxx (names of these divisions on successive lines, without extension)
In install mode, the dialog replies are
i
f or g (for flat or GCG formatted division files, respectively)
y or n (if y an additional dialog item is needed)
new_div_name (only if previous reply was y, a new division with this name is created in index files)
number (number of divisions in the acnuc database)
xxx (names of these divisions on successive lines, without extension)
usage newordalphab ...wait for termination with message "Normal end" on stdout.
listtoaddress names_file output_file
where
names_file : file of names of seqs to be processed.
listtoaddress mylist.names mylist.address
supold mylist.names -mmap
acnucgener a mylist.address -mmap ksub -mmap kshrt
flattoaddress outfname flatfname...
where
outfname : name of output file with division names & file offsets of all entries present in flat files
flattoaddress new.address flat1.dat flat2.dat
crenewdiv flat1
crenewdiv flat2
acnucgener a new.address
supold names_file [-mmap]
where
names_file : file of names of seqs to be removed, one per line
smjytload is an interactive program for creating, renaming, or deleting such names. It can also modify the label of these names.
smjytload is useful to create new sequence types, so that the corresponding subsequences are created by program acnucgener. Each type has a code and a label. Its code is the feature name, converted to uppercase (e.g., CDS, EXON, INTRON). Its label must begin with ".XX" where XX are the two letters used to construct subsequence names (e.g., .PE for CDS to get xxxx.PE1 as a subseq name); the rest of the label may describe the type.
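The ".XX" label convention can be illustrated with a small Python sketch that builds a subsequence name from a parent sequence name, a type label and a rank. The function subseq_name is hypothetical, the numeric suffix is inferred only from the xxxx.PE1 example above, and the descriptive part of the sample label is made up; the names actually produced by acnucgener may instead take their extension from /gene=, /standard_name= or /protein_id= values.

```python
def subseq_name(parent_name, type_label, rank):
    """Build a name such as 'xxxx.PE1' from a type label beginning with '.XX'."""
    # The label must start with a dot followed by two letters.
    if len(type_label) < 3 or type_label[0] != "." or not type_label[1:3].isalpha():
        raise ValueError("type label must begin with '.XX' (a dot and two letters)")
    # Only the first three characters take part in the name;
    # the rest of the label is free-form description.
    return f"{parent_name}{type_label[:3]}{rank}"


print(subseq_name("xxxx", ".PE protein coding region", 1))  # xxxx.PE1
```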
smjytload is also useful to correct journal codes (remove duplicates for example).
compressnewdiv reads a series of division files (typically only those holding daily updates), compresses them in place by removing their unindexed portions, and updates pointers to all data that changed place in these files.
Usage:
compressnewdiv division_name...
where
division_name : one or several names of division files to be compressed in place
Access by sequence name in a large database will be faster if constant hsub is a prime number of the same order of magnitude as the total number of seqs in the database. The same holds for constant hkwsp and the number of keywords.
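A suitable value for such a constant can be found with a few lines of Python. The sketch below returns the smallest prime at or above an estimated sequence count; the function names are hypothetical, and the choice of "smallest prime not below the estimate" is just one reasonable reading of the advice above.

```python
def is_prime(n):
    """Trial-division primality test, adequate for values of this size."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    divisor = 3
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 2
    return True


def next_prime(n):
    """Return the smallest prime greater than or equal to n."""
    candidate = max(n, 2)
    while not is_prime(candidate):
        candidate += 1
    return candidate


# Hypothetical estimate: about one million sequences in the database.
print(next_prime(1000000))  # 1000003
```

With that estimate, the value would be passed to the program through its hsub=xx argument, i.e. as hsub=1000003.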
There are four optional arguments to this program:
-partial : instructs the program not to delete synonyms existing in the current acnuc classification but not in the input classification
-niveau : instructs the program to use taxonomic level information of the input classification as node label (used by databases such as Hovergen).
-setcode : instructs the program to create file ncbicodes.out, which summarizes the genetic code information present in the input files, and file setcode.dialog, formatted as input for the setcode program.
-keepall : instructs the program to create in the acnuc database all the species found in the input tree, even if no sequence is attached to them.
Option -h lists possible program options.
The dialog is
taxon name or stop (the program stops if stop is entered)
y or n (for mitochondrial or genomic genetic code info, respectively)
acnuc-genetic-code-id (an acnuc-defined genetic code id)
[loop back to asking taxon name]
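The dialog above lends itself to being prepared in advance as a text file. The Python sketch below builds such a reply sequence from a list of assignments. It is an illustration only: the function build_setcode_dialog is hypothetical, the taxon names and code ids are placeholders rather than real acnuc genetic code ids, and the exact layout of a real setcode.dialog file is assumed to be one reply per line.

```python
def build_setcode_dialog(assignments):
    """Return dialog replies, one per line, for a list of
    (taxon_name, is_mitochondrial, acnuc_code_id) tuples."""
    lines = []
    for taxon, is_mitochondrial, code_id in assignments:
        lines.append(taxon)
        # y selects mitochondrial, n selects genomic genetic code info
        lines.append("y" if is_mitochondrial else "n")
        lines.append(str(code_id))
    lines.append("stop")  # ends the loop over taxon names
    return "\n".join(lines) + "\n"


# Placeholder values: taxon names and ids are for illustration only.
replies = build_setcode_dialog([
    ("TAXON ONE", True, 2),
    ("TAXON TWO", False, 4),
])
print(replies, end="")
```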
Procedure setcodegenbank.com runs the setcode program with the setcode.dialog information. It thus applies the genetic code information present in the NCBI classification to all of an acnuc database.
The flow leading to a CDS subsequence with its correct genetic code in acnuc is as follows. Program readncbitaxo writes in acnuc the genetic code information given in files names.dmp/nodes.dmp as part of the label of any leaf node or any sequence-bearing node. Program acnucgener uses this information to assign the adequate genetic code number to any CDS subsequence it creates. But this flow fails when acnucgener creates a new species and associated subsequences, because the genetic code information is not yet available to the program at that point. Program setcode is thus useful to enforce coherent genetic code information throughout an acnuc database.
0 Orientation towards Species or Keywords
1 Creation of a node
2 Modification of the name and/or the label of a node
3 Creation of a branch
4 Move of a branch
5 State of a node
6 Delete a node or a synonym
7 Browse the tree
8 Create synonyms
9 List isolated or unused nodes and detect tree loops
10 Modify the order of descendants of a node
11 Remove all unused nodes
modnet can be used to correct a few branches or nodes in the classification of species. Program readncbitaxo is to be used for more extensive changes.
modnet is the main way to organize a series of keywords as a tree.
Operation bib should be done before operation aut to be efficient.
When dealing with species or keywords, nodes whose descendants in the tree are all unused nodes are also deleted.
This program allows high-ranking taxa to appear before low-ranking ones in the index file of species names, which makes the output of browsing the species tree cleaner. The same applies to the keywords index file.
knowndbfile: name of file with list of remotely accessible acnuc databases (environment variable raalist gives this name)
namedpipe: name of pipe to communicate with the racnucd daemon (environment variable raadisable gives this name)
dbname: name of database, taken from first column of knowndbfile
on | off: use off to make db unavailable, on to make it available
Example to set the swissprot database offline:
raadbstatus -f $raalist -p $raadisable -n swissprot off
Example to password-protect the nbrf database:
raadbstatus -f $raalist -a -n nbrf
Enter password: *******
Repeat password: *******