ACNUC C programming interface

ACNUC C Application Programming Interface

header file: dir_acnuc.h ---- full source code: acnucsoft.tar

INTRODUCTION
CONSTANTS / GLOBAL VARIABLES / TYPEDEFs
OPENING / CLOSING: acnucopen, simpleopen, dir_acnucopen, dir_acnucclose.
ACCESS BY SEQUENCE NAME: gsnuml, isenum.
ACCESS TO SEQUENCES: gfrag.
ACCESS TO SEQUENCE ANNOTATIONS: seq_to_annots, read_annots, next_annots, short_descr, short_descr_p.
TRANSLATION / GENETIC CODES: codaa, init_codon_to_aa, translate_cds, translate_init_codon, get_ncbi_gc_number, get_acnuc_gc_number, get_code_descr.
ACCESS BY SPECIES, KEYWORD, AUTHOR, REFERENCE, ACC. NO, etc: iknum, fcode, shkseq, descen, sel_seqs_1_node
ACCESS BY THE QUERY LANGUAGE
1. Other global variables: tlist, defbitlist, defoccup, deflnames, deflocus, defgenre, defllen.
2. Query language API: prep_acnuc_requete, proc_requete, free_list.
3. Query API usage example
4. Query language
READING/WRITING ACNUC INDEX FILES: readacc, readsub,... , writeacc, writesub,... , read_first_rec, write_first_rec.
USING BIT LISTS: bit1, bit0, testbit, irbit, ou, et, non, bcount, lngbit.
UTILITY FUNCTIONS: compact, complementer_base, complementer_seq, hashmn, hasnum, majuscules, padtosize, strcmptrail, trim_key.
SIMULTANEOUS ACCESS TO SEVERAL ACNUC DATABASES: chg_acnuc, store_acnuc_status, set_current_acnuc_db, sizeof_acnuc_status.
DATABASE MANAGEMENT FUNCTIONS: addshrt, addlng, supshrt, suplng, mdshrt, mdlng, cre_new_division, crespecies, crekeyword, addhsh, suphsh, delseq, dir_set_mmap, dir_acnucflush.

INTRODUCTION
1. ACNUC environment
  ACNUC databases are made of a series of flat text files (called divisions) containing annotations and sequences and a series of index files allowing efficient access to sequence data. Two environment variables acnuc and gcgacnuc are used by all ACNUC programs to define the name of the directories where index and flat files are located, respectively.
  The C ACNUC API contains a series of functions that carry out common basic operations and the full query language. Other operations require understanding the logical structure of ACNUC databases for direct navigation through index files. The header file dir_acnuc.h describes in detail the organization of all index files. The API can handle 'large' files via 64-bit file offsets, both for division and index files.
  Usage of the C ACNUC API is as follows:
  - Include file dir_acnuc.h as the first include statement of your program.
  - Link with library libcacnuc.a (or libcacnucsol.a in the LBBE).
2. Anonymous ftp access to C source files
  All C sources are in the publicly accessible file acnucsoft.tar . The makefile therein builds several ACNUC programs, including query, the line-oriented ACNUC retrieval program, as well as the ACNUC library libcacnuc.a. Thus a program prog.c that uses the ACNUC API and is placed in the same directory as the ACNUC source files can be compiled with:
  gcc -o prog prog.c -L. -lcacnuc
3. LBBE (i.e., Lyon) access to C source files and libraries
  In the LBBE computer setup, all ACNUC software is in directory ~banques/csrc/ . Thus a program prog.c using the ACNUC API can be compiled with:
  gcc -o prog -I~banques/csrc prog.c -L~banques/csrc -lcacnucsol
4. Simple C API example. See also the main acnuc header file dir_acnuc.h.
CONSTANTS / GLOBAL VARIABLES / TYPEDEFs
- constant L_MNEMO : the fixed length of (sub-)sequence names.
- constant WIDTH_KS : the fixed length of species and keywords.
- int ACC_LENGTH : the max length of accession numbers; may vary according to the database.
- DIR_FILE : A normally opaque struct type used for buffered and random access to ACNUC index files.
- kacc, kaut, kbib, kext, kkey, kloc, klng, kshrt, ksmj, kspec, ksub, ktxt : global variables of type pointer to DIR_FILE associated to each of the ACNUC index files named ACCESS, AUTHOR, BIBLIO, EXTRACT, KEYWORDS, LOCUS, LONGL, SHORTL, SMJYT, SPECIES, SUBSEQ, TEXT, respectively.
- nseq : total number of records in file SUBSEQ (= maximum # of bits in a bit list of sequences)
- maxa : the largest total record number among files SPECIES and KEYWORDS
- lenbit : the largest among nseq and maxa (size in bits of a bit list that can contain either sequences, species, or keywords).
- lenw : size in int of a bitlist holding lenbit bits (useful to allocate a bitlist of sequences, species, or keywords).
- longa : size in int of a bitlist holding maxa bits (useful to allocate a bitlist of species or keywords).
- flat_format : TRUE when using text flat files; FALSE when using GCG files.
- genbank : TRUE iff annotations follow the GenBank syntax
- embl : TRUE iff annotations follow the EMBL syntax
- nbrf : TRUE iff annotations follow the PIR/NBRF syntax
- swissprot : TRUE iff annotations follow the SwissProt syntax
- divisions : rank of the last division file of the database (counting from 0, so there are divisions+1 divisions)
- char **gcgname : array of division names, all without extension
- int *annotopened : tells whether each division is currently opened
- FILE **divannot : array of streams associated with the currently opened divisions
- int hsub, hkwsp : parameters that control hashing of sequence, keyword and species names
OPENING / CLOSING
- void acnucopen(void);
  Opens the ACNUC database identified by the acnuc and gcgacnuc environment variables for full, read-only access.
- void simpleopen(void);
  Opens the ACNUC database identified by the acnuc and gcgacnuc environment variables for partial, read-only access : only access by sequence name and to annotations and sequences is possible.
- void dir_acnucopen(char *db_access);
  Opens the ACNUC database identified by the acnuc and gcgacnuc environment variables for full access. Access is read-only if db_access is the string "RO", or read/write if it is the string "WP".
- void dir_acnucclose(void);
  Closes access to the current ACNUC database.
ACCESS BY SEQUENCE NAME
ACNUC nucleotide databases contain parent sequences, that are regular database entries, and subsequences, that are one or several fragments of one or several parents as defined in a features table entry. Subsequences are named by adding to the parent name a dot and an extension (e.g., ECOTGP.TRPA).
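As a small illustration of this naming convention, the following sketch splits a name at its first dot into a parent part and an extension. The function split_seq_name and its buffer handling are assumptions made for this example and are not part of the ACNUC API; the sketch also assumes that a parent name never contains a dot.

```c
#include <string.h>

/* Hypothetical helper (not part of the ACNUC API): splits "PARENT.EXT"
   at its first dot. parent receives the part before the dot (or the whole
   name if there is no dot) and ext the part after it (or "").
   Returns 1 if name has a non-empty extension, i.e., looks like a
   subsequence name, 0 otherwise, and -1 if a buffer is too small. */
int split_seq_name(const char *name, char *parent, size_t psize,
                   char *ext, size_t esize)
{
    const char *dot = strchr(name, '.');
    size_t plen = dot ? (size_t)(dot - name) : strlen(name);
    const char *e = dot ? dot + 1 : "";

    if (psize == 0 || esize == 0) return -1;
    if (plen >= psize || strlen(e) >= esize) return -1;
    memcpy(parent, name, plen);
    parent[plen] = '\0';
    strcpy(ext, e);
    return *e != '\0';
}
```

With the name ECOTGP.TRPA, parent is filled with ECOTGP and ext with TRPA.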
- int gsnuml(char *name, int *length, int *frame, int *gencode);
  - name : sequence name terminated with \0 (upper/lowercase accepted)
  - *length : upon return, contains the sequence length
  - *frame : ignored if NULL, or returned with reading frame (0, 1, 2)
  - *gencode : ignored if NULL, or returned with id of genetic code (0=usual code)
  - returned value : the rank in the database of the (sub)sequence named "name", or 0 if none exists
- int isenum(char *name);
  - name : null-terminated sequence name (upper/lowercase accepted)
  - returned value : the rank in the database of the (sub)sequence named "name", or 0 if none exists
ACCESS TO SEQUENCES
- int gfrag(int nsub, int first, int lfrag, char *dseq);
  - nsub : rank of (sub)sequence
  - first : starting position (counting from 1) for sequence access
  - lfrag : number of positions asked for
  - dseq : upon return, null-terminated string filled with bases or aa read
    dseq is allocated by the caller and must have room for lfrag residues plus the terminating null
    fewer than lfrag positions may be read if the sequence end is reached
  - returned value : actual number of residues read, or 0 if any error
- void rdnuc(int point_nuc, int mlong);
  Never use, except with PIR/NBRF to access punctuation between residues.
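
The gfrag contract described above (1-based start, caller-allocated buffer, possibly fewer residues than requested at the sequence end, 0 on error) lends itself to reading a long sequence in fixed-size chunks. The sketch below shows that loop against a stand-in, mock_gfrag, which serves residues from a string in memory. The names mock_gfrag, count_residue and CHUNK are invented for this example; real code would call gfrag with a sequence rank instead.

```c
#include <string.h>

#define CHUNK 8  /* illustrative chunk size; real code would use a larger one */

/* Stand-in for gfrag (an assumption for this sketch, not the ACNUC function):
   copies up to lfrag residues of seq, starting at 1-based position first,
   into dseq as a null-terminated string. Returns the number of residues
   copied, or 0 on error or when first is beyond the sequence end. */
static int mock_gfrag(const char *seq, int first, int lfrag, char *dseq)
{
    int total = (int)strlen(seq);
    int n;

    if (first < 1 || first > total || lfrag < 1) return 0;
    n = total - first + 1;          /* residues left from position first */
    if (n > lfrag) n = lfrag;
    memcpy(dseq, seq + first - 1, (size_t)n);
    dseq[n] = '\0';
    return n;
}

/* Counts occurrences of residue c by scanning the sequence chunk by chunk. */
int count_residue(const char *seq, char c)
{
    char buf[CHUNK + 1];            /* CHUNK residues + terminating null */
    int first = 1, n, i, count = 0;

    while ((n = mock_gfrag(seq, first, CHUNK, buf)) > 0) {
        for (i = 0; i < n; i++)
            if (buf[i] == c) count++;
        first += n;                 /* next chunk starts after the last residue read */
    }
    return count;
}
```

The loop ends when the stand-in returns 0, which covers both the end of the sequence and an error.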
ACCESS TO SEQUENCE ANNOTATIONS
For a parent sequence, the only possible access is to its first annotation line and to following lines.
For a subsequence, the only possible access is to the first line of the corresponding FEATURE (e.g., CDS, tRNA, etc...) and to following lines.
Moreover, access to a previously accessed annotation line is possible provided the address of this line, returned by the next_annots function, is memorized.
- void seq_to_annots(int numseq, long *faddr, int *div);
  This function gives the caller the information needed to access the first annotation line of a (sub)sequence.
  - numseq : rank of parent or subsequence
  - *faddr, *div : upon return, couple of data used to access annotations via the read_annots function.
- char *read_annots(long faddr, int div);
  Returns in static memory the annotation line addressed by the faddr and div arguments. Trailing \n and spaces are removed.
  To access following annotation lines, use :
- char *next_annots(NULL);
  Returns the annotation line following the last one read.
- char *next_annots(long *pfaddr);
  This alternative call is useful to allow re-access to an annotation line, later in the program. First, read this line with next_annots and a non-NULL argument, memorize the long value obtained upon return, and use this value as the faddr argument of a call to read_annots any time later. The necessary div argument is the same for any annotation line of one sequence.
- char *short_descr(int seqnum, char *text, int maxlen);
  - seqnum : (sub)sequence rank
  - text : upon return, char string filled with a short sequence description built with the sequence name and, for a parent sequence, from DE/DEFINITION lines, and for a subsequence, from corresponding "qualifiers".
  - maxlen : max memory size for text
  - returned value : pointer to text
- char *short_descr_p(int seqnum, char *text, int maxlen);
  same as short_descr for a parent sequence;
  for a subsequence, applies short_descr to its main parent.
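
The idea of memorizing a line address for later re-access can be illustrated with the standard C stream functions. The sketch below is a self-contained model, not the ACNUC implementation: sketch_next_line plays the role of next_annots with a non-NULL argument, and sketch_read_line that of read_annots. Both names, and the use of ftell/fseek on a single stream, are assumptions of this example.

```c
#include <stdio.h>
#include <string.h>

/* Removes a trailing newline and trailing spaces from line. */
static void strip_trailing(char *line)
{
    size_t n = strlen(line);
    while (n > 0 && (line[n - 1] == '\n' || line[n - 1] == ' '))
        line[--n] = '\0';
}

/* Reads the next line of fp into buf. If paddr is not NULL, *paddr receives
   the offset of the start of that line. Returns buf, or NULL at end of file. */
char *sketch_next_line(FILE *fp, char *buf, int size, long *paddr)
{
    long addr = ftell(fp);

    if (fgets(buf, size, fp) == NULL) return NULL;
    if (paddr != NULL) *paddr = addr;
    strip_trailing(buf);
    return buf;
}

/* Re-reads the line that starts at offset addr. */
char *sketch_read_line(FILE *fp, long addr, char *buf, int size)
{
    if (fseek(fp, addr, SEEK_SET) != 0) return NULL;
    return sketch_next_line(fp, buf, size, NULL);
}
```

A caller keeps the long value obtained from sketch_next_line and passes it to sketch_read_line at any later time to get the same line again.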
TRANSLATION / GENETIC CODES
- char codaa(char *codon, int code);
  - codon : pointer to trinucleotide (e.g. acu, GGT)
  - code : genetic code id (e.g., computed by gsnuml, or 0 for the usual code)
  - returned value : the corresponding amino acid on one character
- char init_codon_to_aa(char *codon, int gc);
  - codon : pointer to initiation codon (e.g. aug, GTG)
  - gc : genetic code id
  - returned value : the corresponding amino acid on one character using the initiation codon rule of the genetic code.
- char *get_code_descr(int code);
  - code : genetic code id (e.g., computed by gsnuml)
  - returned value : string <= 60 chars describing how this genetic code differs from the usual one (e.g. AGR=* AUA=M UGA=W )
- char *translate_cds(int seqnum);
  Complete translation, returned in malloc'ed memory, of sequence of rank seqnum (often a subsequence) using the sequence's genetic code and its rule concerning the initiation codon.
- char translate_init_codon(int seqnum, int gc, int codon_start /* 1, 2, or 3 */);
  returns in one char the translation of the initiation codon of sequence of rank seqnum using the genetic code of id gc and the offset codon_start for correct reading frame.
- int get_ncbi_gc_number(int gc);
  returns the NCBI id of the genetic code with ACNUC id gc
- int get_acnuc_gc_number(int ncbi_gc);
  returns the ACNUC id of the genetic code with NCBI id ncbi_gc
  returns 0 (=usual code) if not found.
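
To show what codon translation involves, here is a self-contained sketch limited to the standard genetic code. It is not the ACNUC codaa function: the name sketch_codaa_standard, the absence of a code argument, and the choice to return 'X' for a codon with an unrecognized base are all assumptions of this example.

```c
#include <ctype.h>

/* Maps a base to its index in the order T/U, C, A, G; returns -1 otherwise. */
static int base_index(char b)
{
    switch (tolower((unsigned char)b)) {
    case 't': case 'u': return 0;
    case 'c': return 1;
    case 'a': return 2;
    case 'g': return 3;
    default:  return -1;
    }
}

/* Illustrative stand-in (not the ACNUC codaa): translates one codon with the
   standard genetic code only. Returns the one-letter amino acid, '*' for a
   stop codon, or 'X' if the codon contains a base other than a, c, g, t, u. */
char sketch_codaa_standard(const char *codon)
{
    /* Standard code, codons enumerated with bases in the order T, C, A, G. */
    static const char table[] =
        "FFLLSSSSYY**CC*W"
        "LLLLPPPPHHQQRRRR"
        "IIIMTTTTNNKKSSRR"
        "VVVVAAAADDEEGGGG";
    int i1 = base_index(codon[0]);
    int i2 = i1 < 0 ? -1 : base_index(codon[1]);
    int i3 = i2 < 0 ? -1 : base_index(codon[2]);

    if (i3 < 0) return 'X';
    return table[16 * i1 + 4 * i2 + i3];
}
```

Both DNA and RNA spellings are accepted in either case, as in the codon examples above (acu, GGT).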

ACCESS BY SPECIES, KEYWORD, AUTHOR, REFERENCE, ACC. NUMBER, etc...

int iknum(char *name, DIR_FILE *fp);
- name : taxon or keyword name (null-terminated string ignoring case)
- fp : kkey for keyword or kspec for a taxon name
- returned value : rank of name, or 0 if it does not exist
int fcode(DIR_FILE *fp, char *key, int lcompar);
- fp : kacc, kaut, ksmj, kbib for accession-number, author-name, SMJYT, or reference, respectively
- key : string to search (case is ignored)
- lcompar : number of used characters in key during search
- returned value : rank of found key in corresponding index file, or 0 if key does not exist.
int shkseq(char *name, int *bitlist, int oper);
- name : taxon or keyword name (null-terminated string ignoring case); can contain @ characters to indicate wildcards.
- bitlist : integer array of size at least lenw to be filled upon return with the bitlist of seqs attached to all taxa or keywords placed below name in the species or keywords trees.
- oper : (input) 1 for species, 2 for host, 3 for keywords.
- returned value : 1 when OK; 2 when nothing matches name in index file
void sel_seqs_1_node(DIR_FILE *kan, int recnum, int *bitlist, int host);
- kan : kspec for species or kkey for keywords
- recnum : rank in index file addressed by kan of a species or a keyword
- bitlist : integer array of size at least lenw to be filled upon return with the bitlist of seqs attached to all taxa or keywords placed below name in the species or keywords trees. Normally transmitted empty (= all 0s) by caller.
- host: TRUE iff kan==kspec and host sequences of taxon are expected
void descen(DIR_FILE *kan, int recnum, int *bitlist);
- kan : kspec for species or kkey for keywords
- recnum : starting record rank in file addressed by kan
- bitlist : integer array of size at least longa to be filled upon return with the bitlist of taxa or of keywords placed below node of rank recnum in the species or keywords trees.

C code to find seqs attached to an accession no.:


	char access[] = "M00001";
	int num, point, seq;
	num = fcode(kacc, access, ACC_LENGTH);
	if(num == 0) return; /* this accession no does not exist */
	readacc(num);
	point = pacc->plsub;
	while(point != 0) {
		readshrt(point);
		seq = pshrt->val; /* seq is the rank of a sequence attached to given acc no. */
		point = pshrt->next;
		}
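
The loop above follows a list chained through record ranks: each SHORTL record carries a value and the rank of the next record, and rank 0 ends the chain. The following self-contained model reproduces that traversal over an in-memory array. The names struct sketch_rec and collect_values are made up for this sketch; only the val and next fields mirror the pshrt fields used above.

```c
/* Model of a short-list record: a value and the rank of the next record
   in the same list (0 marks the end of the list). */
struct sketch_rec {
    int val;
    int next;
};

/* Walks the list starting at record rank point and copies its values into
   out, stopping after maxout values. Returns the number of values stored.
   Ranks index the recs array directly; entry 0 is unused because rank 0
   means "no record". */
int collect_values(const struct sketch_rec *recs, int point, int *out, int maxout)
{
    int n = 0;

    while (point != 0 && n < maxout) {
        out[n++] = recs[point].val;   /* analogous to seq = pshrt->val */
        point = recs[point].next;     /* analogous to point = pshrt->next */
    }
    return n;
}
```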

C code to find seqs attached to a taxon or keyword


	char my_taxon[] = "Bovidae"; /* case ignored */
	char my_kw[] = "ribosomal protein"; /* case ignored */
	int num, err, *list;

	list = (int *)malloc(lenw * sizeof(int));
	err = shkseq(my_taxon, list, 1);
	if(err == 2) return; /* taxon does not exist */
	num = 1;
	while( (num = irbit(list, num, nseq)) != 0) {
		/* here num is the rank of a seq attached to taxon my_taxon */
		}

	err = shkseq(my_kw, list, 3);
	if(err == 2) return; /* keyword does not exist */
	num = 1;
	while( (num = irbit(list, num, nseq)) != 0) {
		/* here num is the rank of a seq attached to keyword my_kw */
		}

	free(list);

C code to find all keywords attached to a sequence


	int num, kw, point;

	num = isenum("ECOTGP"); /* get rank of starting sequence name */
	readsub(num);
	point = psub->plkey;
	while(point != 0) {
		readshrt(point);
		kw = pshrt->val; /* here kw is the rank of an attached keyword */
		point = pshrt->next;
		}

C code to find keywords placed below one keyword in the keyword tree


	int kw, *liste_kw, num;

	liste_kw = (int *)malloc(longa * sizeof(int));
	kw = iknum("division names", kkey); /* get rank of starting keyword */
	if(kw == 0) return; /* keyword does not exist */
	descen(kkey, kw, liste_kw);
	/* list liste_kw contains all keywords placed below starting keyword in the tree 
	of keywords, including itself */
	bit0(liste_kw, kw); /* remove starting keyword from list */
	num = 1;
	while((num = irbit(liste_kw, num, maxa)) != 0) {
		readkey(num); /* here num is the rank of a descending 
			keyword in the tree of keywords */
		}

C code to find all species below one taxon in the taxon tree


	int sp, *liste_sp, num;

	liste_sp = (int *)malloc(longa * sizeof(int));
	sp = iknum("Mammalia", kspec); /* starting taxon */
	if(sp == 0) return; /* taxon does not exist */
	descen(kspec, sp, liste_sp);
	/* list liste_sp contains all taxa placed below starting taxon in the tree of taxa, 
	including itself */
	num = 1;
	while((num = irbit(liste_sp, num, maxa)) != 0) {
		readspec(num); /* here num is the rank of a descending 
			taxon in the tree of taxa */
		if(pspec->plsub == 0) bit0(liste_sp, num);
		/* if a taxon has no associated seq, remove it from list */
		}

ACCESS BY THE QUERY LANGUAGE
Other global variables :
- int tlist = 50 : total number of usable bitlists
- int defoccup[] : array giving the occupancy state of bitlists, TRUE when occupied.
- int *defbitlist : array holding all (occupied and free) bitlists; this array is pre-allocated by the API; each bitlist is lenw int-long and k^th bitlist begins at defbitlist + k * lenw
- char *deflnames[] : array of names of bitlists, converted to uppercase, malloc'ed when created, and free'ed when deleted.
- int deflocus[] : array indicating whether bitlists contain parent sequences only (TRUE) or both parent and subsequences (FALSE).
- char defgenre[] : array indicating the type of bitlists; 'S', sequences; 'E', species; 'K' keywords.
- int defllen[] : array giving the number of elements in each bitlist
Query language API
- #include "requete_acnuc.h"
  necessary when following functions are used
- void prep_acnuc_requete(void);
  call this once before using the proc_requete function any number of times
- int proc_requete(char *query, char message[100], char *listname, int *listrank);
  computes the bitlist of sequences (sometimes species or keywords) that match a query;
  - query : the query string, for example sp=homo sapiens ou sp=bos taurus
  - message : upon return, and in case of error, filled with a message describing the error
  - listname : (input) name to be given, after conversion to uppercase, to the bitlist to be constructed; if a list with this (uppercase only) name already exists, the list will be replaced by the new one.
  - listrank : upon return, points to the rank of the created bitlist, so that defbitlist + (*listrank)*lenw points to the beginning of this list.
  - returned value : 0 if OK, != 0 indicates error.
- void free_list(int num);
  frees bitlist of rank num for use by future queries.
Query API usage example
Here is a commented example of usage. It boils down to :
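```
#include "dir_acnuc.h"
#include "requete_acnuc.h"

char query[] = "sp=homo sapiens ou sp=bos taurus", listname[] = "mylist", message[100];
int rank, num = 1;

acnucopen();
prep_acnuc_requete();
/* apply function proc_requete to the query string */
if(proc_requete(query, message, listname, &rank) != 0) return; /* message describes the error */
/* scan the bitlist produced by this function */
while((num = irbit(defbitlist + rank * lenw, num, nseq)) != 0) { /* num is the rank of a matching seq */ }
```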
Query language
All ACNUC queries can be processed by the proc_requete function. The query language defines several selection criteria and operations between bitlists of elements matching criteria. It creates mainly bitlists of sequences, but also bitlists of species, or, more generally, taxa, and of keywords. The query language is case insensitive.
Selection criteria are : (no space before the = sign)
- SP=taxon : seqs attached to taxon or any other below in tree; @ wildcard possible
- K=keyword : seqs attached to keyword or any other below in tree; @ wildcard possible
- T=type : seqs of specified type
- J=journal_name : seqs published in journal specified using defined journal code
- R=refcode : seqs from reference specified such as in jcode/volume/page (e.g., JMB/13/5432)
- AU=name : seqs from references having specified author (only last name, no initial)
- AC=accession_no : seqs attached to specified accession number
- N=seq_name : seqs of given name (ID or LOCUS); @ wildcard possible
- Y=year : seqs published in specified year; > and < can be used instead of =
- O=organelle : seqs from specified organelle named following defined code (e.g., chloroplast)
- M=molecule : seqs from specified molecule as named in ID or LOCUS annotation records
- F=file_name : seqs whose names are in given file, one name per line
- FA=file_name : seqs attached to accession numbers in given file, one number per line
- FK=file_name : produces the bitlist of keywords named in given file, one keyword per line
- FS=file_name : produces the bitlist of species named in given file, one species per line
- list_name : the named bitlist that must have been previously constructed
Operators are : (always followed and preceded by a space)
- ET : intersection of the 2 bitlist operands
- OU : union of the 2 bitlist operands
- NO : complementation of the single bitlist operand
- ME : compute the bitlist of parent seqs of members of the single bitlist operand
- FI : add subsequences of members of the single bitlist operand
- PS : compute the bitlist of species attached to member sequences of the operand bitlist
- PK : compute the bitlist of keywords attached to member sequences of the operand bitlist
- UN : compute the bitlist of seqs attached to members of the species or keywords bitlist operand
- SD : compute the bitlist of species placed in the tree below the members of the species bitlist operand
- KD : compute the bitlist of keywords placed in the tree below the members of the keywords bitlist operand
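
Because criteria take no space before the = sign while operators must be surrounded by spaces, query strings are easy to assemble programmatically. The sketch below joins several taxon names into a union of SP= criteria. The helper build_species_union is hypothetical and written for this example only; its result would be passed as the query argument of proc_requete.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes "sp=NAME1 ou sp=NAME2 ..." into query.
   Returns 0 on success, -1 if n < 1 or if query (of size qsize) is too small. */
int build_species_union(const char *const *names, int n, char *query, size_t qsize)
{
    size_t used = 0;
    int i, w;

    if (n < 1 || qsize == 0) return -1;
    query[0] = '\0';
    for (i = 0; i < n; i++) {
        /* operators are always preceded and followed by a space */
        w = snprintf(query + used, qsize - used, "%ssp=%s",
                     i > 0 ? " ou " : "", names[i]);
        if (w < 0 || (size_t)w >= qsize - used) return -1;
        used += (size_t)w;
    }
    return 0;
}
```

Called with the two names homo sapiens and bos taurus, it produces the example query given earlier for proc_requete.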

READING/WRITING ACNUC INDEX FILES
For each index file, a macro or function reads one record into a C structure that is always accessible through a global pointer variable.


Function or macro           File      Pntr to record  DIR_FILE name
void readacc(int recnum);   ACCESS    pacc            kacc
void readsub(int recnum);   SUBSEQ    psub            ksub
void readloc(int recnum);   LOCUS     ploc            kloc
void readshrt(int recnum);  SHORTL    pshrt           kshrt
void readlng(int recnum);   LONGL     plng            klng
void readext(int recnum);   EXTRACT   pext            kext
void readsmj(int recnum);   SMJYT     psmj            ksmj
void readaut(int recnum);   AUTHOR    paut            kaut
void readbib(int recnum);   BIBLIO    pbib            kbib
void readkey(int recnum);   KEYWORDS  pkey            kkey
void readspec(int recnum);  SPECIES   pspec           kspec
void readtxt(int recnum);   TEXT      ptxt            ktxt

Writing is done similarly with macros writeacc, writesub, etc...

Two functions allow reading and writing the first record of each index file which differs from all other records by holding the total record number in the index:

int read_first_rec(DIR_FILE *fp, int *endsort);
- fp : variable associated to an index file
- *endsort : returned with the rank of the last alphabetically sorted record; ignored if NULL
- return value : total record number in index file (counted from 1)
void write_first_rec(DIR_FILE *fp, int total, int endsort);
Update the total record count in an index file
- fp : variable associated to an index file
- total : total record number in index file
- endsort: rank of the last alphabetically sorted record or 0 if not sorted at all (applies to ksub, kaut, kbib, kacc, ksmj only).

dir_acnuc.h details the structure of records of all ACNUC index files, example:

struct rsub {     /* SUBSEQ : one record for each (sub)sequence */
    char name[L_MNEMO];
    int length, /* seq length; or 0 if record was deleted */
	type, /* to SMJYT, for seq type */
	pext, /* if > 0 this is a subsequence, pext points to EXTRACT for list of exons;
	   	if <= 0 this is a parent sequence, -pext points to LONGL for list of subseqs */
	plkey, /* to SHORTL for list of keywords */
	plinf, /* if parent sequence, plinf points to LOCUS for corresponding record;
	   	 if subsequence, points to SHORTL for list of address of start of annotations; 
	   	 this list contains only one element to be combined with the division rank
	   	 for access to annotations */
	phase, /* 100 * code_number + reading_frame_0_1_2 */
	h; /* to SUBSEQ for next record with same hashing value or 0  */
	} *psub;
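
The comments in this structure describe two packed encodings: phase combines the genetic code number and the reading frame, and the sign of pext distinguishes subsequences from parent sequences. A minimal sketch of how they might be unpacked follows. The helper names are invented for this example, and the helpers take plain int values rather than reading psub.

```c
/* Genetic code number from a phase value (phase = 100 * code + frame). */
int phase_to_code(int phase)  { return phase / 100; }

/* Reading frame (0, 1 or 2) from a phase value. */
int phase_to_frame(int phase) { return phase % 100; }

/* 1 if pext designates a subsequence (pext > 0), 0 for a parent (pext <= 0). */
int is_subsequence(int pext)  { return pext > 0; }
```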

Index files contain fixed-length, space-padded strings. These are not C strings because they are not terminated by a null byte. A true C string is obtained as follows:
char nom[L_MNEMO + 1]; memcpy(nom, psub->name, L_MNEMO); nom[L_MNEMO] = 0; trim_key(nom);

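Conversely, to write a C string name to an ACNUC index file buffer, do :
padtosize(psub->name, name, L_MNEMO);
Because padtosize also writes a final null after the padded string, this may overwrite the beginning of the next field of the structure; the other fields should therefore be filled afterwards.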
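
Both conversions can be tried outside ACNUC with the small model below. FIELD_LEN stands in for L_MNEMO (its value here is arbitrary), and field_to_cstring and cstring_to_field are illustrative stand-ins for the memcpy/trim_key and padtosize steps. Unlike padtosize as documented under UTILITY FUNCTIONS, cstring_to_field writes exactly FIELD_LEN bytes and no terminating null.

```c
#include <string.h>

#define FIELD_LEN 16   /* arbitrary width chosen for this sketch */

/* Copies a space-padded, non-terminated field into a C string and removes
   trailing spaces. dst must hold FIELD_LEN + 1 bytes. Returns the length. */
int field_to_cstring(char *dst, const char *field)
{
    int n = FIELD_LEN;

    memcpy(dst, field, FIELD_LEN);
    while (n > 0 && dst[n - 1] == ' ') n--;
    dst[n] = '\0';
    return n;
}

/* Fills a FIELD_LEN-byte field from a C string, truncating or padding with
   spaces as needed. No terminating null is written. */
void cstring_to_field(char *field, const char *src)
{
    size_t n = strlen(src);

    if (n > FIELD_LEN) n = FIELD_LEN;
    memcpy(field, src, n);
    memset(field + n, ' ', FIELD_LEN - n);
}
```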

Reading example :

int num, type;
char seqname[] = "ecotgp.trpa";
#define LCODE sizeof(psmj->name)
char code[LCODE +  1];

num = isenum(seqname); /* get the seq rank from its name */
readsub(num); /* read SUBSEQ record of rank num into buffer pointed to by psub */
type = psub->type; /* this field indicates the seq type */
readsmj(type); /* read SMJYT record corresponding to type */
memcpy(code, psmj->name, LCODE );/*prepare a C string from the name field of the SMJYT record*/
code[LCODE] = 0;
trim_key(code);
printf("type of sequence %s is %s\n", seqname, code);

USING BIT LISTS
Bitlists are used to handle lists of sequences, species or keywords. List elements are represented by their rank. Ranks are the numbers in the ACNUC index files of corresponding records. Ranks are computed by gsnuml or isenum for sequences and iknum for species or keywords. Bitlists are arrays of integers. The range of rank values begins at 2 because index file records are numbered starting from 1 and record # 1 is reserved for holding the file's total record number.
- Allocation of an empty list:
  int *mylist; mylist = (int *)calloc(lenw, sizeof(int));
  for a species or keyword list, longa can be used instead of lenw.
- void bit1(int *mylist, int num) : adds element of rank num to list mylist.
  bit1(mylist, num);
- void bit0(int *mylist, int num) : removes element of rank num from list mylist.
  bit0(mylist, num);
- int testbit(int *mylist, int num) : tests for presence of element of rank num in list mylist.
  if( testbit(mylist, num) ) { num is present } else { num is absent }
- int irbit(int *mylist, int from, int last) : loop over all elements of a list.
  int num = 1; while ( ( num = irbit(mylist, num, lenbit) ) != 0) { work with element of rank num }
  for a species or keywords list, lenbit can be replaced by maxa.
- Empty a bitlist
  memset(mylist, 0, lenw * sizeof(int));
- void ou(int *result, int *list1, int *list2, int nwords) : Add two lists.
  ou(result, list1, list2, lenw); /* replace lenw by longa for species or keywords lists */
  List result, to be allocated before, will contain elements of list1 and those of list2, and can be one of list1 or list2.
- void et(int *result, int *list1, int *list2, int nwords) : Intersection of two lists.
  et(result, list1, list2, lenw); /* replace lenw by longa for species or keywords lists */
  List result, to be allocated before, will contain elements common to both list1 and list2, and can be one of list1 or list2.
- void non(int *result, int *list1, int nwords): complementation of a list.
  Combine "non" with "et" to remove from a list the elements of another list:
  non(result, list2, lenw); et(result, list1, result, lenw);
  List result, to be allocated before, will contain elements of list1 absent from list2.
- int bcount(int *mylist, int maxbits): count the number of elements in a list.
  int nbr = bcount(mylist, lenbit);
- void lngbit(int recnum, int *blist): reads a long list from ACNUC indexes as a bitlist:
  recnum: record number of the start of a long list
  blist: a preallocated sequence bitlist
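
To make these operations concrete, here is a self-contained model of a bitlist. It is a sketch under assumptions, not the ACNUC implementation: the sk_ names are invented, the mapping of rank num to bit num of the array is a guess, unsigned elements are used instead of int to keep shifts well defined, and sk_irbit implements the semantics suggested by the loop idiom above (return the smallest present rank greater than from and not greater than last, or 0 when there is none).

```c
#include <limits.h>

#define SK_BITS ((int)(sizeof(unsigned) * CHAR_BIT))  /* bits per array element */

/* Number of array elements needed to hold ranks 0..maxrank. */
int sk_words(int maxrank) { return maxrank / SK_BITS + 1; }

void sk_bit1(unsigned *list, int num) { list[num / SK_BITS] |= 1u << (num % SK_BITS); }
void sk_bit0(unsigned *list, int num) { list[num / SK_BITS] &= ~(1u << (num % SK_BITS)); }
int sk_testbit(const unsigned *list, int num)
{
    return (int)((list[num / SK_BITS] >> (num % SK_BITS)) & 1u);
}

/* Smallest rank r with from < r <= last that is present in list, or 0. */
int sk_irbit(const unsigned *list, int from, int last)
{
    int r;
    for (r = from + 1; r <= last; r++)
        if (sk_testbit(list, r)) return r;
    return 0;
}

/* Number of present ranks in 1..maxbits. */
int sk_bcount(const unsigned *list, int maxbits)
{
    int r, n = 0;
    for (r = 1; r <= maxbits; r++) n += sk_testbit(list, r);
    return n;
}

/* Union and intersection over nwords elements; res may alias an operand. */
void sk_ou(unsigned *res, const unsigned *a, const unsigned *b, int nwords)
{
    int i;
    for (i = 0; i < nwords; i++) res[i] = a[i] | b[i];
}
void sk_et(unsigned *res, const unsigned *a, const unsigned *b, int nwords)
{
    int i;
    for (i = 0; i < nwords; i++) res[i] = a[i] & b[i];
}
```

The bit-by-bit loops favor clarity over speed; a production implementation would more likely work one word at a time.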
UTILITY FUNCTIONS
- char complementer_base(char nucl);
  - nucl : a character, normally one of aAcCgGtTuUrRyYnN
  - returned value : the complementary base (lowercase, n if nucl is unknown char)
- void complementer_seq(char *seq, int length);
  In place complementation (and inversion) of a sequence.
- void padtosize(char *pname, char *name, int length);
  Completes a string to given length by adding spaces
  - pname : upon return, string made from name padded/truncated to length (must be large enough to hold final null and must not overlap string name)
  - name : unchanged input string
  - length : length that pname has upon return
- int strcmptrail(char *s1, int l1, char *s2, int l2);
  String comparison limited to lengths l1 and l2 and ignoring terminal spaces.
  With s2==NULL and l2==0, s1 can be compared to a string of spaces only.
  Returns as strcmp.
- void majuscules(char *name);
  applies toupper to all of name.
- int trim_key(char *name);
  removes trailing spaces from name, returns resulting length.
- void compact(char *string);
  removes all space characters from string.
- int hashmn(char *seqname);
  returns the hashing value in range [1..hsub] of the seqname that must have been padded by spaces to L_MNEMO characters.
- int hasnum(char *spkwname);
  returns the hashing value in range [1..hkwsp] of the species or keyword name that must have been padded by spaces to WIDTH_KS characters.
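
As an illustration of what in-place complementation with inversion means, the following self-contained sketch reverse-complements a nucleotide string. The sk_ functions are stand-ins written for this example, not the ACNUC complementer_base and complementer_seq; in particular, returning t (rather than u) as the complement of a is an assumption.

```c
#include <ctype.h>

/* Stand-in for complementer_base: lowercase complement of a, c, g, t/u,
   r, y and n (any case); any other character gives 'n'. */
char sk_complement_base(char nucl)
{
    switch (tolower((unsigned char)nucl)) {
    case 'a': return 't';
    case 'c': return 'g';
    case 'g': return 'c';
    case 't': case 'u': return 'a';
    case 'r': return 'y';
    case 'y': return 'r';
    default:  return 'n';
    }
}

/* In-place reverse complement of the first length characters of seq. */
void sk_complement_seq(char *seq, int length)
{
    int i = 0, j = length - 1;

    while (i < j) {
        char tmp = sk_complement_base(seq[i]);
        seq[i++] = sk_complement_base(seq[j]);
        seq[j--] = tmp;
    }
    if (i == j) seq[i] = sk_complement_base(seq[i]);  /* middle base of odd length */
}
```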

SIMULTANEOUS ACCESS TO SEVERAL ACNUC DATABASES

int chg_acnuc(char *acnucvar, char *gcgacnucvar);
Sets the values of the environment variables acnuc and gcgacnuc to direct the API to the desired database.
Returns TRUE iff not enough memory.
void *store_acnuc_status(void);
Memorizes data relative to access to an opened ACNUC database.
Returns NULL iff not enough memory.
void set_current_acnuc_db(void *db);
Directs the API to a database whose access data were previously memorized by store_acnuc_status.
int sizeof_acnuc_status(void);
Returns the byte size of the memorized data structure.

Usage example:

#include "dir_acnuc.h"

/* declare prototypes */
int chg_acnuc(char *acnucvar, char *gcgacnucvar);
void *store_acnuc_status(void);
void set_current_acnuc_db(void *db);

/* declare a void * for each used database, and a buffer for sequence data */
void *db1, *db2;
char seq[61]; /* 60 residues + terminating null */

/* open + memorize access to 1st database */
chg_acnuc("/banques0/genbank/index", "/banques0/genbank/flat_files");
acnucopen();
db1 = store_acnuc_status();
if(db1 == NULL) {
	/* not enough memory */
	exit(ERREUR);
	}

/* open + memorize access to 2nd database */
chg_acnuc("/banques0/swissprot/index", "/banques0/swissprot/flat_files");
acnucopen();
db2 = store_acnuc_status();
if(db2 == NULL) {
	exit(ERREUR);
	}

/* directs the API to the 1st database */
set_current_acnuc_db(db1);
/* now access to the 1st database is possible */
gfrag(2, 1, 60, seq);
readsub(2);
printf("%.16s %s\n", psub->name, seq);

/* directs the API to the 2nd database */
set_current_acnuc_db(db2);
/* now access to the 2nd database is possible */
gfrag(2, 1, 60, seq);
readsub(2);
printf("%.16s %s\n", psub->name, seq);

DATABASE MANAGEMENT FUNCTIONS
- int dir_set_mmap(DIR_FILE *kan);
  (unix only) Attempts to place the whole index file designated by kan in virtual memory, through the mmap system call, for faster access. The API for access to the mmap'ed index file is unchanged. Returns != 0 if mmap was impossible; I/O operations remain possible in that case, but go through ordinary read/write calls.
- void delseq(int nsub);
  complete suppression of (sub)sequence of rank nsub from database.
- void addhsh(int recnum, DIR_FILE *kan);
  adds record of rank recnum to hashing structure of index file kan (can be ksub, kspec, or kkey).
- void suphsh(int recnum, DIR_FILE *kan);
  removes record of rank recnum from hashing structure of index file kan
- void dir_acnucflush(void);
  flushes to disk all changes to ACNUC index files
- int mdshrt(DIR_FILE *kan, int nrec, int offset, int val, int *newplist);
  Modification of a short list
  - kan : index file containing the starting address of the short list: kloc, ksub, kbib, kacc, kaut, kspec, kkey
  - nrec : rank in kan of the record containing the list starting address
  - offset : position within record of the starting address of short list;
    >0 indicates addition to list, <0 indicates suppression from list
  - val : value to be added or suppressed
  - *newplist : if not NULL, upon return pointer to start of modified short list
  - return value : 1 if ok, 2 if error
- int mdlng(DIR_FILE *kan, int nrec, int offset, int val, int *newplist);
  Modification of a long list
  - kan : index file containing the starting address of the long list: ksub,ksmj,kspec,kkey
  - nrec : rank in kan of the record containing the list starting address
  - offset : position within record of the starting address of long list;
    >0 indicates addition to list, <0 indicates suppression from list
  - val : value to be added or suppressed
  - *newplist : if not NULL, upon return pointer to start of modified long list
  - return value : 1 if ok, 2 if error
- int crespecies(char *ascend, char *name);
  Creation of a species or taxon name
  - ascend : name of taxon under which to place the newly created taxon in the tree (if NULL, new taxon is placed at root of tree)
  - name : name of taxon or species to create (no creation if name already exists)
  - return value : rank of newly created taxon
- int crekeyword(char *ascend, char *name);
  Creation of a keyword
  - ascend : name of keyword under which to place the newly created keyword in the tree (if NULL, new keyword is placed at root of tree)
  - name : name of keyword to create (no creation if name already exists)
  - return value : rank of newly created keyword
- void cre_new_division(char *name);
  Creation of a new flat or gcg file division.
  - name : name of the division file (without extension, example: gbnew)
- int addshrt(int point, int value);
- int addlng(int point, int value);
  Adds a value to a short or to a long list.
  - point : rank of the record where the list begins in index file SHORTL (short list) or LONGL (long list)
  - value : value to be added to the list
  - return value : 1 when OK; 2 when the value was already present in the list.
- int supshrt(int point, int value);
- int suplng(int point, int value);
  Removes a value from a short or a long list.
  - point : rank of the record where the list begins in index file SHORTL (short list) or LONGL (long list)
  - value : value to be removed from the list
  - return value : 1 when OK; 2 when the list becomes empty after suppression; 3 when the value was not present in the list.
