- INTRODUCTION
- ACNUC environment
ACNUC databases are made of a series of flat text files (called divisions) containing annotations and sequences and a series of index files allowing efficient access to sequence data. Two environment variables acnuc and gcgacnuc are used by all ACNUC programs to define the name of the directories where index and flat files are located, respectively.
The C ACNUC API contains a series of functions that carry out common basic operations and implement the full query language. Other operations require understanding the logical structure of ACNUC databases for direct navigation through index files. The header file dir_acnuc.h describes in detail the organization of all index files. The API can handle 'large' files via 64-bit file offsets, both for division and index files.
Usage of the C ACNUC API is as follows:
- Include file dir_acnuc.h as the first include statement of your program.
- Link with library libcacnuc.a (or libcacnucsol.a in the LBBE).
- Anonymous ftp access to C source files
All C sources are in the publicly accessible file acnucsoft.tar. The makefile therein builds several ACNUC programs, including query, the line-oriented ACNUC retrieval program, as well as the ACNUC library libcacnuc.a.
Thus a program prog.c placed in the same directory as the ACNUC source files using the ACNUC API can be compiled by :
gcc -o prog prog.c -L. -lcacnuc
- LBBE (i.e., Lyon) access to C source files and libraries
In the LBBE computer setup, all ACNUC software is in directory ~banques/csrc/ .
Thus a program prog.c using the ACNUC API can be compiled by :
gcc -o prog -I~banques/csrc prog.c -L~banques/csrc -lcacnucsol
- Simple C API example. See also the main acnuc header file dir_acnuc.h.
- CONSTANTS / GLOBAL VARIABLES / TYPEDEFs
- int L_MNEMO : the fixed length of (sub-)sequence names; may vary according to the database.
- int WIDTH_SP : the fixed length of species; may vary according to the database.
- int WIDTH_KW : the fixed length of keywords; may vary according to the database.
- int ACC_LENGTH : the max length of accession numbers; may vary according to the database.
- int WIDTH_AUT : the fixed length of author names; may vary according to the database.
- int WIDTH_BIB : the fixed length of references; may vary according to the database.
- int WIDTH_SMJ : the fixed length of SMJYT codes; may vary according to the database.
- int lrtxt : the fixed length of TEXT labels; may vary according to the database.
- constant WIDTH_MAX = 150 (≥ any of L_MNEMO, WIDTH_SP, WIDTH_KW, WIDTH_AUT, WIDTH_BIB, WIDTH_SMJ, ACC_LENGTH, lrtxt)
- int SUBINLNG : the number of SUBSEQ pointers in each LONGL record; may vary according to the database.
- DIR_FILE : A normally opaque struct type used for buffered and random access to ACNUC index files.
- kacc, kaut, kbib, kext, kkey, kloc, klng, kshrt, ksmj, kspec, ksub, ktxt : global variables of type pointer to DIR_FILE associated to each of the ACNUC index files named ACCESS, AUTHOR, BIBLIO, EXTRACT, KEYWORDS, LOCUS, LONGL, SHORTL, SMJYT, SPECIES, SUBSEQ, TEXT, respectively.
- nseq : total number of records in file SUBSEQ
(= maximum # of bits in a bit list of sequences)
- maxa : the largest total record number among files SPECIES and KEYWORDS
- lenbit : the largest among nseq and maxa (size in bits of a bit list that can contain either sequences, species, or keywords).
- lenw : size in int of a bitlist holding lenbit bits (useful to allocate a bitlist of sequences, species, or keywords).
- longa : size in int of a bitlist holding maxa bits (useful to allocate a bitlist of species or keywords).
- flat_format : TRUE when using text flat files; FALSE when using GCG files.
- genbank : TRUE iff annotations follow the GenBank syntax
- embl : TRUE iff annotations follow the EMBL syntax
- nbrf : TRUE iff annotations follow the PIR/NBRF syntax
- swissprot : TRUE iff annotations follow the SwissProt syntax
- divisions : rank of the last division file of the database (counting from 0, so there are divisions+1 divisions)
- char **gcgname : array of division names, all without extension
- int *annotopened : tells whether each division is currently opened
- FILE **divannot : array of streams associated with the currently opened divisions
- int hsub, hkwsp : parameters that control hashing of sequence, keywords and species names
- int hoffst : 3 means the new format that allows variable-length names; 2 means the old fixed-length format
- int must_swap_bytes : TRUE iff endianness of index files and of host computer differ; the API transparently accepts reading and writing index files in this case also.
- OPENING / CLOSING
- void acnucopen(void);
Opens the ACNUC database identified by the acnuc and gcgacnuc environment variables for full, read-only access.
- void simpleopen(void);
Opens the ACNUC database identified by the acnuc and gcgacnuc environment variables for partial, read-only access : only access by sequence name and to annotations and sequences is possible.
- void dir_acnucopen(char *db_access);
Opens the ACNUC database identified by the acnuc and gcgacnuc environment variables for full access. Access is read-only if db_access is the string "RO", or read/write if it is "WP".
- void dir_acnucclose(void);
Closes access to the current ACNUC database.
- ACCESS BY SEQUENCE NAME
ACNUC nucleotide databases contain parent sequences, that are regular database entries, and subsequences, that are one or several fragments of one or several parents as defined in a features table entry. Subsequences are named by adding to the parent name a dot and an extension (e.g., ECOTGP.TRPA).
- int gsnuml(char *name, int *length, int *frame, int *gencode);
- name : sequence name terminated with \0 (upper/lowercase accepted)
- *length : upon return, contains the sequence length
- *frame : ignored if NULL, or returned with reading frame (0, 1, 2)
- *gencode : ignored if NULL, or returned with id of genetic code (0=usual code)
- returned value : the rank in the database of the (sub)sequence named "name", or 0 if none exists
- int isenum(char *name);
- name : null-terminated sequence name (upper/lowercase accepted)
- returned value : the rank in the database of the (sub)sequence named "name", or 0 if none exists
- ACCESS TO SEQUENCES
int gfrag(int nsub, int first, int lfrag, char *dseq);
Gets any part of any (sub)sequence of the database.
- nsub : rank of (sub)sequence
- first : starting position (counting from 1) for sequence access
- lfrag : number of positions asked for
- dseq : upon return, null-terminated string filled with bases or aa read
dseq is allocated by the caller
fewer than lfrag positions can be read if the sequence end is reached
- returned value : actual number of residues read, or 0 if any error
int prep_extract(int usefilename, out_option option, char *fname, extract_option choice,
char *feature_name, char *bounds, char *min_bounds, char **message);
Prepares later extraction operations (performed by calling extract_1_seq on one or several sequences) by specifying extraction format, type, and output destination.
- usefilename : if TRUE, output goes to the file named in the fname argument; if FALSE, output goes through a caller-provided function
- option : one of these enum values: acnuc, gcg, fasta, analseq, flat, coordinates
- acnuc, gcg, fasta, analseq, flat: name of format of extracted sequences
- coordinates: the function outputs coordinates in parent sequences of target sequence fragments (rather than sequence data)
Output is formatted by series of rank=%d&start=%d&end=%d| giving the rank of parent sequence, start position and end position of each target fragment. Fragments related to a contiguous sequence occur on one line; line changes indicate distinct sequences. start > end indicates fragment is on complementary strand of parent sequence.
- fname : a filename if usefilename argument was TRUE, or a pointer to a struct stream_output that allows output to go through caller-provided function.
- choice : one of these enum values: simple, translate, fragment, feature, region
- simple: sequences are fully extracted by later calls to extract_1_seq
- translate: protein-coding sequences are translated (nothing gets extracted if applied to sequences of type != CDS; does not apply if option == coordinates)
- fragment: extracts any part of the processed sequences. The part is specified by the bounds and min_bounds arguments according to the syntax suggested by these examples:
132,1600 to extract from nucl. 132 to nucl 1600 of the sequence.
If applied to a subsequence, extraction is done in the parent seq
relatively to the subsequence start point.
-10,10 to extract from 10 nucl. BEFORE the 5' end of the sequence
to nucl. 10 of it. Useful only for subsequences, and produces
a fragment extracted from its parent sequence.
e-20,e+10 to extract from 20 nucl. BEFORE the 3' end of the sequence
to 10 nucl. AFTER its 3' end. Useful only for subsequences, and
produces a fragment extracted from its parent sequence.
-20,e+5 to extract from 20 nucl. BEFORE the 5' end of the sequence
to 5 nucl. AFTER its 3' end.
- feature: feature tables of sequences are scanned, each instance of the feature key given in feature_name argument is extracted; meaningful only for parent sequences (subsequences have no feature table)
- region: the fragment operation is applied to all entries of the specified kind
in the feature table of processed sequences. The bounds and min_bounds arguments specify
what part of feature data are extracted.
- feature_name : (useful only if choice is feature or region) a feature key (CDS, mRNA,...)
- bounds : (useful only if choice is fragment or region) see syntax above
- min_bounds : (useful only if choice is fragment or region) NULL or same syntax as bounds. When the sequence data is too short for this quantity to be extracted, nothing is extracted. When the available sequence data is between min_bounds and bounds, the extracted sequence data is extended by N's to the desired length. NULL is the same as setting min_bounds = bounds.
- message : upon return, pointer to an error-describing message
- returned value : 0 if OK, != 0 if error
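The bounds syntax shown in the examples above (132,1600 or -10,10 or e-20,e+10 or -20,e+5) can be checked before it is handed to prep_extract. The following is only a minimal sketch of one way to read such strings; struct bound, parse_one_bound and parse_bounds are hypothetical names that are not part of the ACNUC API, and the interpretation of the 'e' prefix as "relative to the 3' end" is taken from the examples above.

```c
#include <stdlib.h>

/* Hypothetical representation of one end of a "bounds" string. */
struct bound {
    int from_end;   /* 1 if written with an 'e' prefix (relative to the 3' end) */
    long offset;    /* signed value that follows the optional 'e' */
};

/* Parses one bound such as "132", "-10", "e-20" or "e+10".
   Returns a pointer just past the parsed text, or NULL on syntax error. */
static const char *parse_one_bound(const char *p, struct bound *b)
{
    char *end;

    b->from_end = 0;
    if (*p == 'e') {
        b->from_end = 1;
        p++;
    }
    b->offset = strtol(p, &end, 10);
    if (end == p) return NULL;      /* no number found */
    return end;
}

/* Parses "first,last". Returns 0 if OK, non-zero on syntax error. */
int parse_bounds(const char *text, struct bound *first, struct bound *last)
{
    const char *p = parse_one_bound(text, first);

    if (p == NULL || *p != ',') return 1;
    p = parse_one_bound(p + 1, last);
    if (p == NULL || *p != '\0') return 1;
    return 0;
}
```

With this sketch, "e-20,e+10" yields from_end = 1 for both ends with offsets -20 and +10, while "132" alone is rejected because the comma and second bound are missing.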
typedef int (*writefunction)(const char *, void *);
struct stream_output {
void *stream;
writefunction outonelinef;
};
All extraction output is sent to caller-provided function outonelinef that is called with one line of output data as first argument and with opaque data pointer stream as second.
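As an illustration of the callback mechanism just described, here is a minimal sketch of a caller-provided function that accumulates output lines in memory. The typedef and struct are copied from the declarations above so that the sketch compiles on its own (a real program would get them from the ACNUC headers); struct line_buffer and collect_line are hypothetical names, and the return convention (0 = OK) is an assumption, since the text does not say how the return value is used.

```c
#include <string.h>

/* Declarations copied from the text above. */
typedef int (*writefunction)(const char *, void *);
struct stream_output {
    void *stream;
    writefunction outonelinef;
};

/* Hypothetical accumulator passed as the opaque "stream" pointer. */
struct line_buffer {
    char text[256];
    size_t used;    /* number of characters stored, excluding the final null */
    int lines;      /* number of lines received so far */
};

/* Caller-provided function: appends one line of output to the buffer.
   Returns 0 if the line was stored, 1 if it did not fit (assumed convention). */
int collect_line(const char *line, void *stream)
{
    struct line_buffer *buf = (struct line_buffer *)stream;
    size_t len = strlen(line);

    if (buf->used + len + 1 > sizeof(buf->text)) return 1;
    memcpy(buf->text + buf->used, line, len + 1);   /* copies the final null too */
    buf->used += len;
    buf->lines++;
    return 0;
}
```

A struct stream_output whose stream field points to a struct line_buffer and whose outonelinef field is collect_line would then be passed as the fname argument of prep_extract, with usefilename set to FALSE.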
int extract_1_seq(int seqnum, char *bounds);
Extracts sequence of rank seqnum according to extraction rules given by previous call to prep_extract.
- seqnum : rank of (sub)sequence to process
- bounds : (only for fragment or region operations) same as bounds argument of prep_extract call
void fin_extract(void);
To be called once after all calls to extract_1_seq to close the extraction output.
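When the coordinates output option is used, each output line is a series of rank=%d&start=%d&end=%d| items, as described under prep_extract. The sketch below shows one possible way to read such a line back into numbers; struct frag_coord, parse_coordinates_line and is_complementary are hypothetical helper names, not ACNUC functions.

```c
#include <stdio.h>

/* Hypothetical holder for one fragment of a coordinates line. */
struct frag_coord {
    int rank;    /* rank of the parent sequence */
    int start;   /* start position in the parent */
    int end;     /* end position in the parent */
};

/* Parses up to max items of the form rank=%d&start=%d&end=%d| from one line.
   Returns the number of fragments stored in out. */
int parse_coordinates_line(const char *line, struct frag_coord *out, int max)
{
    int count = 0;

    while (count < max) {
        int consumed = 0;
        int fields = sscanf(line, "rank=%d&start=%d&end=%d|%n",
                            &out[count].rank, &out[count].start,
                            &out[count].end, &consumed);
        /* consumed stays 0 when the closing '|' is missing */
        if (fields != 3 || consumed == 0) break;
        line += consumed;
        count++;
    }
    return count;
}

/* start > end indicates a fragment on the complementary strand. */
int is_complementary(const struct frag_coord *f)
{
    return f->start > f->end;
}
```

Since fragments of one contiguous sequence share a line, calling parse_coordinates_line once per received line gives the fragment list of one target sequence.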
- ACCESS TO SEQUENCE ANNOTATIONS
For a parent sequence, the only possible access is to its first annotation line and to following lines.
For a subsequence, the only possible access is to the first line of the corresponding FEATURE (e.g., CDS, tRNA, etc...) and to following lines.
Moreover, access to a previously accessed annotation line is possible provided the address of this line, returned by the next_annots64 function, is memorized.
- void seq_to_annots64(int numseq, off_t *faddr, int *div);
This function gives the caller the information needed to access the first annotation line of a (sub)sequence.
- numseq : rank of parent or subsequence
- *faddr, *div : upon return, couple of data used to access annotations via the read_annots64 function.
- char *read_annots64(off_t faddr, int div);
Returns in static memory the annotation line addressed by the faddr and div arguments. Trailing \n and spaces are removed.
To access following annotation lines, use :
- char *next_annots64(NULL);
Returns the annotation line following the last one read.
- char *next_annots64(off_t *pfaddr);
This alternative call is useful to allow re-access to an annotation line, later in the program. First, read this line with next_annots64 and a non-NULL argument, memorize the off_t value obtained upon return, and use this value as the faddr argument of a call to read_annots64 any time later. The necessary div argument is the same for any annotation line of one sequence.
- char *short_descr(int seqnum, char *text, int maxlen);
- seqnum : (sub)sequence rank
- text : upon return, char string filled with a short sequence description built with the sequence name and, for a parent sequence, from DE/DEFINITION lines, and for a subsequence, from corresponding "qualifiers".
- maxlen : max memory size for text
- returned value : pointer to text
- char *short_descr_p(int seqnum, char *text, int maxlen);
same as short_descr for a parent sequence;
for a subsequence, applies short_descr to its main parent.
- int read_loc_qualif(int isub, char *location, int maxlocat, char *type, char *qualifiers, int maxqualif);
To return location, qualifiers or feature key of a subsequence.
- isub: rank of subsequence
- location: NULL or memory to receive the subsequence's location
- maxlocat: typically sizeof(location), useless if location is NULL
- type: NULL or memory to receive the subsequence's feature key
- qualifiers: NULL or memory to receive the subsequence's qualifiers
- maxqualif: typically sizeof(qualifiers), useless if qualifiers is NULL
- returned value: FALSE if OK; TRUE if isub is not a subsequence or if not enough memory to receive all required data.
- TRANSLATION / GENETIC CODES
- char codaa(char *codon, int code);
- codon : pointer to trinucleotide (e.g. acu, GGT)
- code : genetic code id (e.g., computed by gsnuml, or 0 for the usual code)
- returned value : the corresponding amino acid on one character
- char init_codon_to_aa(char *codon, int gc);
- codon : pointer to initiation codon (e.g. aug, GTG)
- gc : genetic code id
- returned value : the corresponding amino acid on one character using the initiation codon rule of the genetic code.
- char *get_code_descr(int code);
- code : genetic code id (e.g., computed by gsnuml)
- returned value : string <= 60 chars describing how this genetic code differs from the usual one (e.g. AGR=* AUA=M UGA=W )
- char *translate_cds(int seqnum);
Complete translation, returned in malloc'ed memory, of sequence of rank seqnum (often a subsequence) using the sequence's genetic code and its rule concerning the initiation codon.
- char translate_init_codon(int seqnum, int gc, int codon_start /* 1, 2, or 3 */);
returns in one char the translation of the initiation codon of sequence of rank seqnum using the genetic code of id gc and the offset codon_start for correct reading frame.
- int get_ncbi_gc_number(int gc);
returns the NCBI id of the genetic code with ACNUC id gc
- int get_acnuc_gc_number(int ncbi_gc);
returns the ACNUC id of the genetic code with NCBI id ncbi_gc
returns 0 (=usual code) if not found.
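To make concrete what a codon-to-amino-acid lookup such as codaa does for the usual code, here is a self-contained sketch using the standard genetic code table. It is not the ACNUC implementation: std_codon_to_aa and base_index are hypothetical names, only the usual code (id 0) is covered, and returning 'X' for an unrecognized codon is an assumption of this sketch.

```c
#include <ctype.h>

/* Maps a base to its index in T, C, A, G order; U is treated as T.
   Returns -1 for any other character. */
static int base_index(char c)
{
    switch (toupper((unsigned char)c)) {
    case 'T': case 'U': return 0;
    case 'C': return 1;
    case 'A': return 2;
    case 'G': return 3;
    default:  return -1;
    }
}

/* Translates one trinucleotide (e.g. "acu", "GGT") with the standard code.
   Returns the one-letter amino acid, '*' for a stop codon, 'X' if unknown. */
char std_codon_to_aa(const char *codon)
{
    /* Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG */
    static const char table[] =
        "FFLLSSSSYY**CC*W"
        "LLLLPPPPHHQQRRRR"
        "IIIMTTTTNNKKSSRR"
        "VVVVAAAADDEEGGGG";
    int i, j, k;

    i = base_index(codon[0]);
    if (i < 0) return 'X';
    j = base_index(codon[1]);
    if (j < 0) return 'X';
    k = base_index(codon[2]);
    if (k < 0) return 'X';
    return table[16 * i + 4 * j + k];
}
```

Variant codes of the kind reported by get_code_descr (e.g. UGA=W) would amount to changing a few entries of such a table.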
- ACCESS BY SPECIES, KEYWORD, AUTHOR, REFERENCE, ACC. NUMBER, etc...
- int iknum(char *name,
DIR_FILE *fp);
- name : taxon or keyword name (null-terminated string ignoring case)
- fp : kkey for keyword or
kspec for a taxon name
- returned value : rank of name, or 0 if it does not exist
- int fcode(DIR_FILE *fp, char *key, int lcompar);
- fp : kacc, kaut, ksmj, kbib for accession-number, author-name, SMJYT, or reference, respectively
- key : string to search (case is ignored)
- lcompar : number of used characters in key during search
- returned value : rank of found key in corresponding index file, or 0 if key does not exist.
- int shkseq(char *name, int *bitlist, int oper);
- name : taxon or keyword name (null-terminated string ignoring case); can contain @ characters to indicate wildcards.
- bitlist : integer array of size at least lenw to be filled upon return with the bitlist of seqs attached to all taxa or keywords placed below name in the species or keywords trees.
- oper : (input) 1 for species, 2 for host, 3 for keywords.
- returned value : 1 when OK; 2 when nothing matches name in index file
- void sel_seqs_1_node(DIR_FILE *kan, int recnum, int *bitlist, int host);
- kan : kspec for species or kkey for keywords
- recnum : rank in index file addressed by kan of a species or a keyword
- bitlist : integer array of size at least lenw to be filled upon return with the bitlist of seqs attached to all taxa or keywords placed below the node of rank recnum in the species or keywords trees. Normally transmitted empty (= all 0s) by caller.
- host: TRUE iff kan==kspec and host sequences of taxon are expected
- int taxidtosp(int tid);
- return value : rank in file SPECIES of the taxon of taxID tid, or 0 if no such taxon.
- int sptotaxid(char *taxname, int rank);
- taxname: NULL or taxon name (case is not significant)
- rank: (used only if taxname == NULL) the acnuc rank of a taxon
- return value : ncbi taxon ID of given taxon or acnuc taxon rank, or 0 if no such taxon exists or if this taxon has no ncbi given taxon ID.
- void descen(DIR_FILE *kan, int recnum, int *bitlist);
- kan : kspec for species or kkey for keywords
- recnum : starting record rank in file addressed by kan
- bitlist : integer array of size at least longa to be filled upon return with the bitlist of taxa or of keywords placed below node of rank recnum in the species or keywords trees.
- char *get_ancestor_taxon(char *name, int rank, int *pancestor);
- name: NULL or a taxon name (case is not significant)
- rank: (only if name == NULL) a taxon rank in index file SPECIES
- pancestor: NULL, or returned filled with the rank in SPECIES of ancestor of name/rank
- return value: the name of the ancestor in static memory ("Root" when name is at tree top)
or NULL if not enough memory or name does not exist.
C code to find seqs attached to an accession no.:
char access[] = "M00001";
int num, point, seq;
num = fcode(kacc, access, ACC_LENGTH);
if(num == 0) return; /* this accession no does not exist */
readacc(num);
point = pacc->plsub;
while(point != 0) {
readshrt(point);
seq = pshrt->val; /* seq is the rank of a sequence attached to given acc no. */
point = pshrt->next;
}
C code to find seqs attached to a taxon, taxID or keyword
char my_taxon[] = "Bovidae"; /* case ignored */
char my_kw[] = "ribosomal protein"; /* case ignored */
int tid = 284813 ; /* taxon id of Encephalitozoon cuniculi */
int num, err, *list, numsp;
list = (int *)calloc(lenw, sizeof(int));
err = shkseq(my_taxon, list, 1);
if(err == 2) return; /* taxon does not exist */
num = 1;
while( (num = irbit(list, num, nseq)) != 0) {
/* here num is the rank of a seq attached to taxon my_taxon */
}
memset(list, 0, lenw * sizeof(int)); /* sel_seqs_1_node expects an empty list */
numsp = taxidtosp(tid);
if(numsp != 0) sel_seqs_1_node(kspec, numsp, list, FALSE);
num = 1;
while( (num = irbit(list, num, nseq)) != 0) {
/* here num is the rank of a seq attached to taxID tid */
}
err = shkseq(my_kw, list, 3);
if(err == 2) return; /* keyword does not exist */
num = 1;
while( (num = irbit(list, num, nseq)) != 0) {
/* here num is the rank of a seq attached to keyword my_kw */
}
free(list);
C code to find all keywords attached to a sequence
int num, kw, point;
num = isenum("ECOTGP"); /* get rank of starting sequence name */
readsub(num);
point = psub->plkey;
while(point != 0) {
readshrt(point);
kw = pshrt->val; /* here kw is the rank of an attached keyword */
point = pshrt->next;
}
C code to find keywords placed below one keyword in the keyword tree
int kw, *liste_kw, num;
liste_kw = (int *)malloc(longa * sizeof(int));
kw = iknum("division names", kkey); /* get rank of starting keyword */
if(kw == 0) return; /* keyword does not exist */
descen(kkey, kw, liste_kw);
/* list liste_kw contains all keywords placed below starting keyword in the tree
of keywords, including itself */
bit0(liste_kw, kw); /* remove starting keyword from list */
num = 1;
while((num = irbit(liste_kw, num, maxa)) != 0) {
readkey(num); /* here num is the rank of a descending
keyword in the tree of keywords */
}
C code to find all species below one taxon in the taxon tree
int sp, *liste_sp, num;
liste_sp = (int *)malloc(longa * sizeof(int));
sp = iknum("Mammalia", kspec); /* starting taxon */
if(sp == 0) return; /* taxon does not exist */
descen(kspec, sp, liste_sp);
/* list liste_sp contains all taxa placed below starting taxon in the tree of taxa,
including itself */
num = 1;
while((num = irbit(liste_sp, num, maxa)) != 0) {
readspec(num); /* here num is the rank of a descending
taxon in the tree of taxa */
if(pspec->plsub == 0) bit0(liste_sp, num);
/* if a taxon has no associated seq, remove it from list */
}
- ACCESS BY THE QUERY LANGUAGE
Other global variables :
- int tlist = 50 : total number of usable bitlists
- int defoccup[] : array giving the occupancy state of bitlists, TRUE when occupied.
- int *defbitlist : array holding all (occupied and free) bitlists; this array is pre-allocated by the API; each bitlist is lenw int-long and kth bitlist begins at defbitlist + k * lenw
- char *deflnames[] : array of names of bitlists, converted to uppercase, malloc'ed when created, and free'ed when deleted.
- int deflocus[] : array indicating whether bitlists contain parent sequences only (TRUE) or both parent and subsequences (FALSE).
- char defgenre[] : array indicating the type of bitlists; 'S', sequences; 'E', species; 'K' keywords.
- int defllen[] : array giving the number of elements in each bitlist
Query language API
- #include "requete_acnuc.h"
necessary when following functions are used
- void prep_acnuc_requete(void);
call this once before using the proc_requete function any number of times
- int proc_requete(char *query, char message[100], char *listname, int *listrank);
computes the bitlist of sequences (sometimes species or keywords) that match a query;
- query : the query string, for example sp=homo sapiens ou sp=bos taurus
- message : upon return, and in case of error, filled with an error describing message
- listname : (input) name to be given, after conversion to uppercase, to the bitlist to be constructed; if a list with this (uppercase only) name already exists, the list will be replaced by the new one.
- listrank : upon return, points to the rank of the created bitlist,
so that defbitlist + (*listrank)*lenw points to the beginning of this list.
- returned value : 0 if OK, != 0 indicates error.
- void free_list(int num);
frees bitlist of rank num for use by future queries.
Query API usage example
Here is a commented example of usage. It boils down to :
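#include "dir_acnuc.h"
#include "requete_acnuc.h"
char query[] = "sp=bos taurus", listname[] = "mylist", message[100];
int listrank, num, *blist;
acnucopen();
prep_acnuc_requete();
/* apply function proc_requete to the query string */
if(proc_requete(query, message, listname, &listrank) != 0) return; /* message describes the error */
/* scan the bitlist produced by this function */
blist = defbitlist + listrank * lenw;
num = 1;
while((num = irbit(blist, num, nseq)) != 0) { /* num is the rank of a matching sequence */ }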
Query language
All ACNUC queries can be processed by the proc_requete function. The query language defines several selection criteria and operations between lists of elements matching criteria. It creates mainly lists of sequences, but also lists of species (or, more generally, taxa) and of keywords.
Selection criteria are : (no space before the = sign)
- SP=taxon : seqs attached to taxon or any other below in tree; @ wildcard possible
- TID=id : seqs attached to given numerical NCBI's taxon id
- H=taxon : seqs whose host is taxon or any other below in tree; @ wildcard possible
- K=keyword : seqs attached to keyword or any other below in tree; @ wildcard possible
- T=type : seqs of specified type
- J=journal_name : seqs published in journal specified using defined journal code
- R=refcode : seqs from reference specified such as in jcode/volume/page (e.g., JMB/13/5432)
- AU=name : seqs from references having specified author (only last name, no initial)
- AC=accession_no : seqs attached to specified accession number
- N=seq_name : seqs of given name (ID or LOCUS); @ wildcard possible
- NS=taxon_name : taxon of given name; @ wildcard possible
- NK=keyword_name : keyword of given name; @ wildcard possible
- Y=year : seqs published in specified year; > and < can be used instead of =
- O=organelle : seqs from specified organelle named following defined code (e.g., chloroplast)
- M=molecule : seqs from specified molecule as named in ID or LOCUS annotation records
- ST=status : seqs from specified data class (EMBL) or review level (UniProt)
- F=file_name : seqs whose names are in given file, one name per line
- FA=file_name : seqs attached to accession numbers in given file, one number per line
- FK=file_name : produces the list of keywords named in given file, one keyword per line
- FS=file_name : produces the list of species named in given file, one species per line
- list_name : the named list that must have been previously constructed
Operators are : (always followed and preceded by spaces or parentheses)
- AND or ET : intersection of the 2 list operands
- OR or OU : union of the 2 list operands
- NOT or NO : complementation of the single list operand
- PAR or ME : compute the list of parent seqs of members of the single list operand
- SUB or FI : add subsequences of members of the single list operand
- PS : project to species: list of species attached to member sequences of the operand list
- PK : project to keywords: list of keywords attached to member sequences of the operand list
- UN : unproject: list of seqs attached to members of the species or keywords list operand
- SD : compute the list of species placed in the tree below the members of the species list operand
- KD : compute the list of keywords placed in the tree below the members of the keywords list operand
The query language is case insensitive except where filenames occur. Parentheses can be used to specify the range of operators. Three operators (AND, OR, NOT) can be ambiguous because they can also occur within valid criterion values. Such ambiguities can be resolved by bracketing elementary selection criteria between double quotes. For example:
"sp=Beak and feather disease virus" and "au=ritchie"
- READING/WRITING ACNUC INDEX FILES
Macros or functions are devoted to the reading of one record for each index file in C structures that are always accessible through global variables.
Function or macro           File      Pntr to record  DIR_FILE name
void readacc(int recnum);   ACCESS    pacc            kacc
void readsub(int recnum);   SUBSEQ    psub            ksub
void readloc(int recnum);   LOCUS     ploc            kloc
void readshrt(int recnum);  SHORTL    pshrt           kshrt
void readlng(int recnum);   LONGL     plng            klng
void readext(int recnum);   EXTRACT   pext            kext
void readsmj(int recnum);   SMJYT     psmj            ksmj
void readaut(int recnum);   AUTHOR    paut            kaut
void readbib(int recnum);   BIBLIO    pbib            kbib
void readkey(int recnum);   KEYWORDS  pkey            kkey
void readspec(int recnum);  SPECIES   pspec           kspec
void readtxt(int recnum);   TEXT      ptxt            ktxt
Writing is done similarly with macros writeacc, writesub, etc...
dir_acnuc.h details the structure associated to records of each ACNUC index files. For example, readsub(n) reads the nth record of file SUBSEQ into the following C structure pointed to by global variable psub :
struct rsub { /* SUBSEQ : one record for each (sub)sequence */
int length, /* seq length; or 0 if record was deleted */
type, /* to SMJYT, for seq type */
pext, /* if > 0 this is a subsequence, pext points to EXTRACT for list of exons;
if <= 0 this is a parent sequence, -pext points to LONGL for list of subseqs */
plkey, /* to SHORTL for list of keywords */
plinf, /* if parent sequence, plinf points to LOCUS for corresponding record;
if subsequence, points to SHORTL for list of address of start of annotations;
this list contains only one element to be combined with the division rank
for access to annotations */
phase, /* 100 * code_number + reading_frame_0_1_2 */
h; /* to SUBSEQ for next record with same hashing value or 0 */
char name[1]; /* sequence name padded by spaces to L_MNEMO chars */
} *psub;
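The phase field of this structure packs two values, as its comment states (100 * code_number + reading_frame). A small sketch of how the two parts can be separated and recombined follows; struct phase_parts, split_phase and join_phase are hypothetical helpers written for this illustration, not ACNUC functions.

```c
/* Hypothetical holder for the two values packed in the phase field. */
struct phase_parts {
    int code_number;     /* genetic code number */
    int reading_frame;   /* 0, 1 or 2 */
};

/* Separates phase = 100 * code_number + reading_frame into its parts. */
struct phase_parts split_phase(int phase)
{
    struct phase_parts p;

    p.code_number = phase / 100;
    p.reading_frame = phase % 100;
    return p;
}

/* Recombines the two parts into a phase value. */
int join_phase(int code_number, int reading_frame)
{
    return 100 * code_number + reading_frame;
}
```

For instance, a phase value of 302 corresponds to code number 3 and reading frame 2.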
Two functions allow reading and writing the first record of each index file which differs from all other records by holding the total record number in the index:
- int read_first_rec(DIR_FILE *fp, int *endsort);
- fp : variable associated to an index file
- *endsort : returned with the rank of the last alphabetically sorted record; ignored if NULL
- return value : total record number in index file (counted from 1)
- void write_first_rec(DIR_FILE *fp, int total, int endsort);
Update the total record count in an index file
- fp : variable associated to an index file
- total : total record number in index file
- endsort: rank of the last alphabetically sorted record or 0 if not sorted at all (applies to ksub, kaut, kbib, kacc, ksmj only).
Index files contain fixed-length-space-padded strings. These are therefore not C strings because they are not ended by a null byte. A true C string is obtained as follows:
char nom[L_MNEMO + 1];
memcpy(nom, psub->name, L_MNEMO); nom[L_MNEMO] = 0; trim_key(nom);
Conversely, to write a C string name to an ACNUC index file buffer, do :
padtosize(psub->name, name, L_MNEMO);
this may affect other fields of the structure that should therefore be filled after.
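The two conversions above can be pictured with the following standalone sketch, which works on plain character arrays rather than on ACNUC record buffers. pad_field and field_to_cstring are hypothetical names; unlike padtosize as documented in this text, pad_field writes exactly width characters and no final null, a deliberate simplification of this sketch.

```c
#include <string.h>

/* Copies name into field, truncated or padded with spaces to exactly
   width characters. No final null is written. */
void pad_field(char *field, const char *name, size_t width)
{
    size_t len = strlen(name);

    if (len > width) len = width;
    memcpy(field, name, len);
    memset(field + len, ' ', width - len);
}

/* Builds a true C string from a space-padded field of width characters.
   out must hold at least width + 1 chars. Returns the resulting length. */
size_t field_to_cstring(char *out, const char *field, size_t width)
{
    size_t len = width;

    while (len > 0 && field[len - 1] == ' ') len--;   /* drop trailing spaces */
    memcpy(out, field, len);
    out[len] = '\0';
    return len;
}
```

The second function combines into one step what the memcpy, null termination and trim_key calls do in the example above.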
Reading example :
int num, type;
char seqname[] = "ecotgp.trpa";
#define LCODE sizeof(psmj->name)
char code[LCODE + 1];
num = isenum(seqname); /* get the seq rank from its name */
readsub(num); /* read SUBSEQ record of rank num into buffer pointed to by psub */
type = psub->type; /* this field indicates the seq type */
readsmj(type); /* read SMJYT record corresponding to type */
memcpy(code, psmj->name, LCODE );/*prepare a C string from the name field of the SMJYT record*/
code[LCODE] = 0;
trim_key(code);
printf("type of sequence %s is %s\n", seqname, code);
- USING BIT LISTS
Bitlists are used to handle lists of sequences, species or keywords. List elements are represented by their rank, that is, the number of the corresponding record in the ACNUC index files. Ranks are computed by gsnuml or isenum for sequences and by iknum for species or keywords. Bitlists are arrays of integers. The range of rank values begins at 2 because index file records are numbered starting from 1 and record # 1 is reserved for holding the file's total record number.
- Allocation of an empty list:
int *mylist;
mylist = (int *)calloc(lenw, sizeof(int));
for a species or keyword list, longa can be used instead of lenw.
- void bit1(int *mylist, int num) : adds element of rank num to list mylist.
bit1(mylist, num);
- void bit0(int *mylist, int num) : removes element of rank num from list mylist.
bit0(mylist, num);
- int testbit(int *mylist, int num) : tests for presence of element of rank num in list mylist.
if( testbit(mylist, num) ) { num is present } else { num is absent }
- int irbit(int *mylist, int from, int last) : loop over all elements of a list.
int num = 1;
while ( ( num = irbit(mylist, num, lenbit) ) != 0) { work with element of rank num }
for a species or keywords list, lenbit can be replaced by maxa.
- Empty a bitlist
memset(mylist, 0, lenw * sizeof(int));
- void ou(int *result, int *list1, int *list2, int nwords) : Add two lists.
ou(result, list1, list2, lenw); /* replace lenw by longa for species or keywords lists */
List result, to be allocated before, will contain elements of list1 and those of list2, and can be one of list1 or list2.
- void et(int *result, int *list1, int *list2, int nwords) : Intersection of two lists.
et(result, list1, list2, lenw); /* replace lenw by longa for species or keywords lists */
List result, to be allocated before, will contain elements common to both list1 and list2, and can be one of list1 or list2.
- void non(int *result, int *list1, int nwords): complementation of a list.
Combine "non" with "et" to remove from a list the elements of another list:
non(result, list2, lenw);
et(result, list1, result, lenw);
List result, to be allocated before, will contain elements of list1 absent from list2.
- int bcount(int *mylist, int maxbits): count the number of elements in a list.
int nbr = bcount(mylist, lenbit);
- void lngbit(int recnum, int *blist): reads a long list from ACNUC indexes as a bitlist:
recnum: record number of the start of a long list
blist: a preallocated sequence bitlist
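To make the bitlist model of this section concrete, here is a standalone sketch of a bit array indexed by rank. It is not the library's code: all sk_ names are hypothetical, the placement of rank num at bit (num - 1) is an assumption (the actual layout used by bit1, bit0 and irbit is not described here), unsigned words are used instead of int to keep the bit operations well defined, and sk_next is assumed to return the first element strictly after its from argument, which matches the num = 1 start of the loops shown above.

```c
#include <limits.h>

#define SK_WORD_BITS ((int)(sizeof(unsigned) * CHAR_BIT))

/* Number of words needed to hold nbits bits (the role played by lenw or longa). */
int sk_words_for_bits(int nbits)
{
    return (nbits + SK_WORD_BITS - 1) / SK_WORD_BITS;
}

/* Adds element of rank num (num >= 1) to the list. */
void sk_bit1(unsigned *list, int num)
{
    list[(num - 1) / SK_WORD_BITS] |= 1u << ((num - 1) % SK_WORD_BITS);
}

/* Removes element of rank num from the list. */
void sk_bit0(unsigned *list, int num)
{
    list[(num - 1) / SK_WORD_BITS] &= ~(1u << ((num - 1) % SK_WORD_BITS));
}

/* Returns 1 if element of rank num is in the list, 0 otherwise. */
int sk_testbit(const unsigned *list, int num)
{
    return (int)((list[(num - 1) / SK_WORD_BITS] >> ((num - 1) % SK_WORD_BITS)) & 1u);
}

/* Returns the smallest rank r with from < r <= last present in the list, or 0. */
int sk_next(const unsigned *list, int from, int last)
{
    int num;

    for (num = from + 1; num <= last; num++)
        if (sk_testbit(list, num)) return num;
    return 0;
}

/* Counts the elements of rank 1..maxbits present in the list. */
int sk_count(const unsigned *list, int maxbits)
{
    int num, total = 0;

    for (num = 1; num <= maxbits; num++)
        total += sk_testbit(list, num);
    return total;
}
```

A production version would scan whole words rather than single bits; the bit-by-bit loops are kept here for readability.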
- UTILITY FUNCTIONS
- char complementer_base(char nucl);
- nucl : a character, normally one of aAcCgGtTuUrRyYnN
- returned value : the complementary base (lowercase, n if nucl is unknown char)
- void complementer_seq(char *seq, int length);
In-place complementation and inversion of a sequence: each base is complemented and the order of the bases is reversed.
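As a rough sketch of what the two complementation functions above compute, the code below complements single bases and reverse-complements a buffer in place. The names sk_complement_base and sk_complement_seq are hypothetical stand-ins, not the library code. Only the bases listed in the description are handled, and anything else maps to n.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Return the lowercase complement of nucl, or 'n' for an unknown character. */
static char sk_complement_base(char nucl)
{
    switch (tolower((unsigned char)nucl)) {
    case 'a': return 't';
    case 'c': return 'g';
    case 'g': return 'c';
    case 't':
    case 'u': return 'a';
    case 'r': return 'y';   /* purine <-> pyrimidine */
    case 'y': return 'r';
    default:  return 'n';
    }
}

/* Complement and reverse the first length characters of seq, in place. */
static void sk_complement_seq(char *seq, int length)
{
    int i = 0, j = length - 1;
    assert(seq != NULL && length >= 0);
    while (i < j) {
        /* swap the two ends, complementing both */
        char tmp = sk_complement_base(seq[i]);
        seq[i] = sk_complement_base(seq[j]);
        seq[j] = tmp;
        i++;
        j--;
    }
    if (i == j)             /* middle base of an odd-length sequence */
        seq[i] = sk_complement_base(seq[i]);
}
```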
- void padtosize(char *pname, char *name, int length);
Pads a string with spaces (or truncates it) to a given length.
- pname : upon return, string made from name padded/truncated to length (must be large enough to hold final null and must not overlap string name)
- name : unchanged input string
- length : length that pname has upon return
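The padding rule described above can be sketched as follows. The function sk_padtosize is a hypothetical re-implementation written only from this description, not the library code.

```c
#include <assert.h>
#include <string.h>

/* Copy name into pname, truncated or padded with spaces to exactly length
   characters, then add the final null. pname must hold length + 1 bytes and
   must not overlap name. */
static void sk_padtosize(char *pname, const char *name, int length)
{
    int n;
    assert(pname != NULL && name != NULL && length >= 0);
    n = (int)strlen(name);
    if (n > length)
        n = length;                               /* truncate */
    memcpy(pname, name, (size_t)n);
    memset(pname + n, ' ', (size_t)(length - n)); /* pad with spaces */
    pname[length] = '\0';
}
```

Padding of this kind is what the hashing functions hashmn and hasnum expect of their name argument.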
- int strcmptrail(char *s1, int l1, char *s2, int l2);
String comparison limited to lengths l1 and l2 and ignoring terminal spaces.
With s2==NULL and l2==0, s1 can be compared to a string of spaces only.
The return value follows the strcmp convention (negative, zero, or positive).
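A comparison that ignores terminal spaces can be sketched as below. The names sk_effective_len and sk_strcmptrail are hypothetical, and stopping at a null character found before the length limit is an assumption rather than documented behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Length of s within limit once trailing spaces are dropped; 0 for NULL. */
static int sk_effective_len(const char *s, int limit)
{
    int n = 0;
    if (s == NULL)
        return 0;
    while (n < limit && s[n] != '\0')
        n++;
    while (n > 0 && s[n - 1] == ' ')
        n--;
    return n;
}

/* Compare s1 (at most l1 chars) with s2 (at most l2 chars), ignoring
   terminal spaces. Returns <0, 0 or >0 like strcmp. */
static int sk_strcmptrail(const char *s1, int l1, const char *s2, int l2)
{
    int e1, e2, n, i;
    assert(l1 >= 0 && l2 >= 0);
    e1 = sk_effective_len(s1, l1);
    e2 = sk_effective_len(s2, l2);
    n = e1 < e2 ? e1 : e2;
    for (i = 0; i < n; i++) {
        if (s1[i] != s2[i])
            return (unsigned char)s1[i] < (unsigned char)s2[i] ? -1 : 1;
    }
    if (e1 == e2)
        return 0;
    return e1 < e2 ? -1 : 1;   /* the shorter string sorts first */
}
```

In this sketch, a call with s2 == NULL and l2 == 0 returns 0 exactly when s1 holds only spaces within its first l1 characters.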
- void majuscules(char *name);
Converts name to upper case in place (applies toupper to every character).
- int trim_key(char *name);
removes trailing spaces from name, returns resulting length.
- void compact(char *string);
removes all space characters from string.
- int hashmn(char *seqname);
Returns the hash value, in range [1..hsub], of seqname, which must have been padded with spaces to L_MNEMO characters.
- int hasnum(char *spkwname, int len);
Returns the hash value, in range [1..hkwsp], of the species or keyword name, which must have been padded with spaces to len (typically WIDTH_SP or WIDTH_KW) characters.
- enum endianness endian_test(void);
returns the host computer's endianness using enum endianness {big_endian, little_endian}.
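One common way to detect the host byte order is to look at the first byte in memory of a known integer. The sketch below uses that technique. The names sk_endianness and sk_endian_test are hypothetical, and the approach is not necessarily the one used by endian_test.

```c
#include <assert.h>
#include <string.h>

enum sk_endianness { sk_big_endian, sk_little_endian };

/* Store the integer 1 and inspect its first byte in memory:
   a little-endian host puts the low-order byte first. */
static enum sk_endianness sk_endian_test(void)
{
    const unsigned int probe = 1;
    unsigned char first;
    assert(sizeof probe > 1);   /* the test needs a multi-byte integer */
    memcpy(&first, &probe, 1);
    return first == 1 ? sk_little_endian : sk_big_endian;
}
```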
- SIMULTANEOUS ACCESS TO SEVERAL ACNUC DATABASES
- int chg_acnuc(char *acnucvar, char *gcgacnucvar);
Sets the values of environment variables acnuc and gcgacnuc to direct the API to a desired database.
Returns TRUE iff not enough memory.
- void *store_acnuc_status(void);
Memorizes the data needed to access an opened ACNUC database.
Returns NULL iff not enough memory.
- void set_current_acnuc_db(void *db);
Directs the API to a database whose access data was previously memorized by store_acnuc_status.
- int sizeof_acnuc_status(void);
Returns the byte size of the memorized data structure.
Usage example:
#include "dir_acnuc.h"
#include <stdio.h>
#include <stdlib.h>
/* declare prototypes */
int chg_acnuc(char *acnucvar, char *gcgacnucvar);
void *store_acnuc_status(void);
void set_current_acnuc_db(void *db);
int main(void)
{
    /* declare a void * for each used database */
    void *db1, *db2;
    /* buffer for 60 residues plus the final null (size is an assumption) */
    char seq[61];
    /* open + memorize access to 1st database */
    chg_acnuc("/banques0/genbank/index", "/banques0/genbank/flat_files");
    acnucopen();
    db1 = store_acnuc_status();
    if(db1 == NULL) {
        /* not enough memory */
        exit(ERREUR);
    }
    /* open + memorize access to 2nd database */
    chg_acnuc("/banques0/swissprot/index", "/banques0/swissprot/flat_files");
    acnucopen();
    db2 = store_acnuc_status();
    if(db2 == NULL) {
        /* not enough memory */
        exit(ERREUR);
    }
    /* directs the API to the 1st database */
    set_current_acnuc_db(db1);
    /* now access to the 1st database is possible */
    gfrag(2, 1, 60, seq);
    readsub(2);
    printf("%.16s %s\n", psub->name, seq);
    /* directs the API to the 2nd database */
    set_current_acnuc_db(db2);
    /* now access to the 2nd database is possible */
    gfrag(2, 1, 60, seq);
    readsub(2);
    printf("%.16s %s\n", psub->name, seq);
    return 0;
}
- DATABASE MANAGEMENT FUNCTIONS
- int dir_set_mmap(DIR_FILE *kan);
(Unix only) Attempts to place the whole of the index file referred to by kan in virtual memory, through the mmap system call, for faster access. The API for access to the mmap'ed index file is unchanged. Returns a non-zero value if mmap was impossible. In that case I/O operations can still be performed, but through simple read/write calls.
- void delseq(int nsub);
Completely removes the (sub)sequence of rank nsub from the database.
- void addhsh(int recnum, DIR_FILE *kan);
adds record of rank recnum to hashing structure of index file kan (can be ksub, kspec, or kkey).
- void suphsh(int recnum, DIR_FILE *kan);
Removes the record of rank recnum from the hashing structure of index file kan.
- void dir_acnucflush(void);
flushes to disk all changes to ACNUC index files
- int mdshrt(DIR_FILE *kan, int nrec, int offset, int val, int *newplist);
Modification of a short list
- kan : index file containing the starting address of the short list:
kloc, ksub, kbib, kacc, kaut, kspec, kkey
- nrec : rank in kan of the record containing the list starting address
- offset : position within record of the starting address of short list;
>0 indicates addition to list, <0 indicates suppression from list
- val : value to be added or suppressed
- *newplist : if not NULL, upon return pointer to start of modified short list
- return value : 1 if ok, 2 if error
- int mdlng(DIR_FILE *kan, int nrec, int offset, int val, int *newplist);
Modification of a long list
- kan : index file containing the starting address of the long list:
ksub,ksmj,kspec,kkey
- nrec : rank in kan of the record containing the list starting address
- offset : position within record of the starting address of long list;
>0 indicates addition to list, <0 indicates suppression from list
- val : value to be added or suppressed
- *newplist : if not NULL, upon return pointer to start of modified long list
- return value : 1 if ok, 2 if error
- int crespecies(char *ascend, char *name);
Creation of a species or taxon name
- ascend : name of taxon under which to place the newly created taxon in the tree (if NULL, new taxon is placed at root of tree)
- name : name of taxon or species to create (no creation if name already exists)
- return value : rank of newly created taxon
- int crekeyword(char *ascend, char *name);
Creation of a keyword
- ascend : name of keyword under which to place the newly created keyword in the tree (if NULL, new keyword is placed at root of tree)
- name : name of keyword to create (no creation if name already exists)
- return value : rank of newly created keyword
- void cre_new_division(char *name);
Creation of a new flat or gcg file division.
- name : name of the division file (without extension, for example gbnew)
- int addshrt(int point, int value);
- int addlng(int point, int value);
Adds a value to a short or to a long list.
- point : rank of the record where the list begins in index file SHORTL (short list) or LONGL (long list)
- value : value to be added to the list
- return value : 1 when OK; 2 when the value was already present in the list.
- int supshrt(int point, int value);
- int suplng(int point, int value);
Removes a value from a short or a long list.
- point : rank of the record where the list begins in index file SHORTL (short list) or LONGL (long list)
- value : value to be removed from the list
- return value : 1 when OK; 2 when the list becomes empty after suppression; 3 when the value was not present in the list.
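The return codes of the add and remove functions above can be illustrated with a small in-memory list that follows the same conventions. This is only a sketch: the type sk_list and the functions sk_find, sk_add and sk_sup are hypothetical, and the real functions operate on records of index files SHORTL and LONGL rather than on an array.

```c
#include <assert.h>

#define SK_MAX 16

/* Hypothetical in-memory list: n values stored in val[0..n-1]. */
struct sk_list { int n; int val[SK_MAX]; };

/* Return the position of value in the list, or -1 if absent. */
static int sk_find(const struct sk_list *l, int value)
{
    int i;
    for (i = 0; i < l->n; i++)
        if (l->val[i] == value)
            return i;
    return -1;
}

/* Mirrors addshrt/addlng: 1 when OK, 2 when value was already present. */
static int sk_add(struct sk_list *l, int value)
{
    if (sk_find(l, value) >= 0)
        return 2;
    assert(l->n < SK_MAX);
    l->val[l->n++] = value;
    return 1;
}

/* Mirrors supshrt/suplng: 1 when OK, 2 when the list becomes empty,
   3 when value was not present. */
static int sk_sup(struct sk_list *l, int value)
{
    int i = sk_find(l, value);
    if (i < 0)
        return 3;
    l->val[i] = l->val[--l->n];   /* move the last value into the hole */
    return l->n == 0 ? 2 : 1;
}
```

A caller would typically treat code 2 from a removal as a signal that the list no longer holds any value, and code 3 as a request that changed nothing.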
- int cretaxids(void);
Fully computes and writes index file TAXIDS by reading all id:#| data from species labels.
Returns 0 iff no error.
- void write_quick_meres(void);
Writes index file MERES. Must be called after having closed the modified acnuc db.