ACNUC C Application Programming Interface

header file: dir_acnuc.h ---- full source code: acnucsoft.tar

Contents :

  1. INTRODUCTION
    1. ACNUC environment
    2. Anonymous ftp access to C source files
    3. LBBE (i.e., Lyon) access to C source files and libraries
    4. Simple C API example
  2. CONSTANTS / GLOBAL VARIABLES / TYPEDEFs
  3. OPENING / CLOSING: acnucopen, simpleopen, dir_acnucopen, dir_acnucclose.
  4. ACCESS BY SEQUENCE NAME: gsnuml, isenum.
  5. ACCESS TO SEQUENCES: gfrag, prep_extract, extract_1_seq, fin_extract.
  6. ACCESS TO SEQUENCE ANNOTATIONS: seq_to_annots64, read_annots64, next_annots64, short_descr, short_descr_p, read_loc_qualif.
  7. TRANSLATION / GENETIC CODES: codaa, init_codon_to_aa, translate_cds, translate_init_codon, get_ncbi_gc_number, get_acnuc_gc_number, get_code_descr.
  8. ACCESS BY SPECIES, KEYWORD, AUTHOR, REFERENCE, ACC. NO, etc: iknum, fcode, shkseq, descen, sel_seqs_1_node, taxidtosp, sptotaxid, get_ancestor_taxon
    1. C code to find seqs attached to an accession no.
    2. C code to find seqs attached to a taxon, taxID or keyword
    3. C code to find all keywords attached to a sequence
    4. C code to find keywords placed below one keyword in the keyword tree
    5. C code to find all species below one taxon in the taxon tree
  9. ACCESS BY THE QUERY LANGUAGE
    1. Other global variables: tlist, defbitlist, defoccup, deflnames, deflocus, defgenre, defllen.
    2. Query language API: prep_requete, proc_requete, free_list.
    3. Query API usage example
    4. Query language
  10. READING/WRITING ACNUC INDEX FILES: readacc, readsub,... , writeacc, writesub,... , read_first_rec, write_first_rec.
  11. USING BIT LISTS: bit1, bit0, testbit, irbit, ou, et, non, bcount, lngbit.
  12. UTILITY FUNCTIONS: compact, complementer_base, complementer_seq, endian_test, hashmn, hasnum, majuscules, padtosize, strcmptrail, trim_key.
  13. SIMULTANEOUS ACCESS TO SEVERAL ACNUC DATABASES: chg_acnuc, store_acnuc_status, set_current_acnuc_db, sizeof_acnuc_status.
  14. DATABASE MANAGEMENT FUNCTIONS: addshrt, addlng, supshrt, suplng, mdshrt, mdlng, cre_new_division, crespecies, crekeyword, cretaxids, addhsh, suphsh, delseq, dir_set_mmap, dir_acnucflush, write_quick_meres.



  1. INTRODUCTION
    1. ACNUC environment
      ACNUC databases are made of a series of flat text files (called divisions) containing annotations and sequences and a series of index files allowing efficient access to sequence data. Two environment variables acnuc and gcgacnuc are used by all ACNUC programs to define the name of the directories where index and flat files are located, respectively.

      The C ACNUC API contains a series of functions that carry out common basic operations and implement the full query language. Other operations require understanding the logical structure of ACNUC databases for direct navigation through index files. The header file dir_acnuc.h describes in detail the organization of all index files. The API can handle 'large' files via 64-bit file offsets, both for division and index files.

      Usage of the C ACNUC API is as follows:
      - Include file dir_acnuc.h as the first include statement of your program.
      - Link with library libcacnuc.a (or libcacnucsol.a in the LBBE).

    2. Anonymous ftp access to C source files
      All C sources are in the publicly accessible file acnucsoft.tar. The makefile it contains builds several ACNUC programs, including query, the line-oriented ACNUC retrieval program. The makefile also builds the ACNUC library libcacnuc.a. Thus a program prog.c that uses the ACNUC API and is placed in the same directory as the ACNUC source files can be compiled with:
      gcc -o prog prog.c -L. -lcacnuc

    3. LBBE (i.e., Lyon) access to C source files and libraries
      In the LBBE computer setup, all ACNUC software is in directory ~banques/csrc/ . Thus a program prog.c using the ACNUC API can be compiled with:
      gcc -o prog -I~banques/csrc prog.c -L~banques/csrc -lcacnucsol

    4. Simple C API example. See also the main acnuc header file dir_acnuc.h.


  2. CONSTANTS / GLOBAL VARIABLES / TYPEDEFs


  3. OPENING / CLOSING


  4. ACCESS BY SEQUENCE NAME
    ACNUC nucleotide databases contain parent sequences, which are regular database entries, and subsequences, which are one or several fragments of one or several parents as defined in a features table entry. Subsequences are named by adding to the parent name a dot and an extension (e.g., ECOTGP.TRPA).
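    The naming rule above can be illustrated by a minimal sketch that separates a name into its parent part and its extension. The helper split_seq_name is hypothetical and is not part of the ACNUC API; it assumes that the parent name itself contains no dot.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the ACNUC API.
   Copies the parent part of a sequence name (the text before the first dot)
   into parent, which must hold at least parent_size bytes.
   Returns a pointer to the extension inside name, or NULL when name has
   no dot and is therefore taken to be a parent sequence name. */
const char *split_seq_name(const char *name, char *parent, size_t parent_size)
{
    const char *dot = strchr(name, '.');
    size_t len = (dot != NULL) ? (size_t)(dot - name) : strlen(name);

    assert(parent_size > 0);
    if (len >= parent_size) len = parent_size - 1;   /* truncate if needed */
    memcpy(parent, name, len);
    parent[len] = '\0';
    return (dot != NULL) ? dot + 1 : NULL;
}
```

    With the name ECOTGP.TRPA, this sketch stores ECOTGP in parent and returns a pointer to TRPA.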

  5. ACCESS TO SEQUENCES
    int gfrag(int nsub, int first, int lfrag, char *dseq);
    Gets any part of any (sub)sequence of the database.
    int prep_extract(int usefilename, out_option option, char *fname, extract_option choice, char *feature_name, char *bounds, char *min_bounds, char **message);
    Prepares later extraction operations (performed by calling extract_1_seq on one or several sequences) by specifying extraction format, type, and output destination.
    typedef int (*writefunction)(const char *, void *);
    struct stream_output {
    	void *stream;
    	writefunction outonelinef;
    	};
    All extraction output is sent to caller-provided function outonelinef that is called with one line of output data as first argument and with opaque data pointer stream as second.
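    As an illustration of this callback mechanism, here is a minimal sketch of a caller-provided function that accumulates output lines in memory. The typedef and struct are copied from the text above; struct line_buffer and collect_line are hypothetical names. The return convention (0 on success) and the presence of a newline in each line are assumptions, and how the structure is handed to prep_extract is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Declarations copied from the text above. */
typedef int (*writefunction)(const char *, void *);
struct stream_output {
    void *stream;
    writefunction outonelinef;
};

/* Hypothetical opaque data: a fixed-size buffer that accumulates lines. */
struct line_buffer {
    char text[256];
    size_t used;     /* number of characters stored so far */
    int nlines;      /* number of lines received so far */
};

/* Hypothetical callback: appends one line to the buffer passed as the
   opaque pointer. Returning 0 on success and 1 on overflow is an
   assumption, not the documented convention. */
int collect_line(const char *line, void *stream)
{
    struct line_buffer *buf = stream;
    size_t len;

    assert(line != NULL && buf != NULL);
    len = strlen(line);
    if (buf->used + len >= sizeof buf->text) return 1;   /* no room left */
    memcpy(buf->text + buf->used, line, len + 1);        /* copies the null too */
    buf->used += len;
    buf->nlines++;
    return 0;
}
```

    A struct stream_output is then filled with the address of a struct line_buffer as stream and collect_line as outonelinef.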

    int extract_1_seq(int seqnum, char *bounds);
    Extracts sequence of rank seqnum according to extraction rules given by previous call to prep_extract.

    void fin_extract(void);
    To be called once after all calls to extract_1_seq to close the extraction output.

  6. ACCESS TO SEQUENCE ANNOTATIONS
    For a parent sequence, the only possible access is to its first annotation line and to following lines.
    For a subsequence, the only possible access is to the first line of the corresponding FEATURE (e.g., CDS, tRNA, etc...) and to following lines.
    Moreover, access to a previously accessed annotation line is possible provided the address of this line, returned by the next_annots64 function, is memorized.

  7. TRANSLATION / GENETIC CODES

  8. ACCESS BY SPECIES, KEYWORD, AUTHOR, REFERENCE, ACC. NUMBER, etc...

    C code to find seqs attached to an accession no.:

    
    	char access[] = "M00001";
    	int num, point, seq;
    	num = fcode(kacc, access, ACC_LENGTH);
    	if(num == 0) return; /* this accession no does not exist */
    	readacc(num);
    	point = pacc->plsub;
    	while(point != 0) {
    		readshrt(point);
    		seq = pshrt->val; /* seq is the rank of a sequence attached to given acc no. */
    		point = pshrt->next;
    		}
    

    C code to find seqs attached to a taxon, taxID or keyword

    
    	char my_taxon[] = "Bovidae"; /* case ignored */
    	char my_kw[] = "ribosomal protein"; /* case ignored */
    	int tid = 284813 ; /* taxon id of Encephalitozoon cuniculi */
    	int num, err, *list, numsp;
    
    	list = (int *)calloc(lenw, sizeof(int));
    	err = shkseq(my_taxon, list, 1);
    	if(err == 2) return; /* taxon does not exist */
    	num = 1;
    	while( (num = irbit(list, num, nseq)) != 0) {
    		/* here num is the rank of a seq attached to taxon my_taxon */
    		}
    
    	numsp = taxidtosp(tid);
    	if(numsp != 0) { /* scan the list only if the taxID exists */
    		sel_seqs_1_node(kspec, numsp, list, FALSE);
    		num = 1;
    		while( (num = irbit(list, num, nseq)) != 0) {
    			/* here num is the rank of a seq attached to taxID tid */
    			}
    		}
    
    	err = shkseq(my_kw, list, 3);
    	if(err == 2) return; /* keyword does not exist */
    	num = 1;
    	while( (num = irbit(list, num, nseq)) != 0) {
    		/* here num is the rank of a seq attached to keyword my_kw */
    		}
    
    	free(list);
    

    C code to find all keywords attached to a sequence

    
    	int num, kw, point;
    
    	num = isenum("ECOTGP"); /* get rank of starting sequence name */
    	readsub(num);
    	point = psub->plkey;
    	while(point != 0) {
    		readshrt(point);
    		kw = pshrt->val; /* here kw is the rank of an attached keyword */
    		point = pshrt->next;
    		}
    

    C code to find keywords placed below one keyword in the keyword tree

    
    	int kw, *liste_kw, num;
    
    	liste_kw = (int *)malloc(longa * sizeof(int));
    	kw = iknum("division names", kkey); /* get rank of starting keyword */
    	if(kw == 0) return; /* keyword does not exist */
    	descen(kkey, kw, liste_kw);
    	/* list liste_kw contains all keywords placed below starting keyword in the tree 
    	of keywords, including itself */
    	bit0(liste_kw, kw); /* remove starting keyword from list */
    	num = 1;
    	while((num = irbit(liste_kw, num, maxa)) != 0) {
    		readkey(num); /* here num is the rank of a descending 
    			keyword in the tree of keywords */
    		}
    

    C code to find all species below one taxon in the taxon tree

    
    	int sp, *liste_sp, num;
    
    	liste_sp = (int *)malloc(longa * sizeof(int));
    	sp = iknum("Mammalia", kspec); /* starting taxon */
    	if(sp == 0) return; /* taxon does not exist */
    	descen(kspec, sp, liste_sp);
    	/* list liste_sp contains all taxa placed below starting taxon in the tree of taxa, 
    	including itself */
    	num = 1;
    	while((num = irbit(liste_sp, num, maxa)) != 0) {
    		readspec(num); /* here num is the rank of a descending 
    			taxon in the tree of taxa */
    		if(pspec->plsub == 0) bit0(liste_sp, num);
    		/* if a taxon has no associated seq, remove it from list */
    		}
    

  9. ACCESS BY THE QUERY LANGUAGE

    Other global variables :

    Query language API

    Query API usage example
    Here is a commented example of usage. It boils down to :

    #include "dir_acnuc.h"
    #include "requete_acnuc.h"
    acnucopen();
    prep_acnuc_requete();
    /* then apply function proc_requete to the query string */
    /* and scan the bitlist produced by this function */
    

    Query language
    All ACNUC queries can be processed by the proc_requete function. The query language defines several selection criteria and operations between lists of elements matching these criteria. It mainly creates lists of sequences, but also lists of species (or, more generally, taxa) and of keywords.

    Selection criteria are : (no space before the = sign)


    Operators are : (always followed and preceded by spaces or parentheses)

    The query language is case insensitive except where filenames occur. Parentheses can be used to specify the scope of operators. Three operators (AND, OR, NOT) can be ambiguous because they can also occur within valid criterion values. Such ambiguities can be resolved by enclosing elementary selection criteria in double quotes. For example:

    "sp=Beak and feather disease virus" and "au=ritchie"
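    A program that assembles such a query from separate criteria can add the double quotes itself. The following is a minimal sketch; build_and_query is a hypothetical helper, not part of the ACNUC API, and it simply refuses criteria that already contain a double quote.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of the ACNUC API.
   Writes "crit1" and "crit2" (each criterion in double quotes) into dest.
   Returns 0 on success, 1 if a criterion contains a double quote or if
   the result does not fit in size bytes. */
int build_and_query(char *dest, size_t size, const char *crit1, const char *crit2)
{
    int n;

    assert(dest != NULL && size > 0);
    if (strchr(crit1, '"') != NULL || strchr(crit2, '"') != NULL) return 1;
    n = snprintf(dest, size, "\"%s\" and \"%s\"", crit1, crit2);
    return (n >= 0 && (size_t)n < size) ? 0 : 1;
}
```

    Called with the two criteria of the example above, the sketch produces the same query string.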


  10. READING/WRITING ACNUC INDEX FILES
    For each index file, a macro or function reads one record into a C structure that is always accessible through a global variable.
    
    Function or macro           File      Pointer to record  DIR_FILE name
    void readacc(int recnum);   ACCESS    pacc               kacc
    void readsub(int recnum);   SUBSEQ    psub               ksub
    void readloc(int recnum);   LOCUS     ploc               kloc
    void readshrt(int recnum);  SHORTL    pshrt              kshrt
    void readlng(int recnum);   LONGL     plng               klng
    void readext(int recnum);   EXTRACT   pext               kext
    void readsmj(int recnum);   SMJYT     psmj               ksmj
    void readaut(int recnum);   AUTHOR    paut               kaut
    void readbib(int recnum);   BIBLIO    pbib               kbib
    void readkey(int recnum);   KEYWORDS  pkey               kkey
    void readspec(int recnum);  SPECIES   pspec              kspec
    void readtxt(int recnum);   TEXT      ptxt               ktxt
    
    Writing is done similarly with macros writeacc, writesub, etc...
    dir_acnuc.h details the structure associated with the records of each ACNUC index file. For example, readsub(n) reads the nth record of file SUBSEQ into the following C structure pointed to by global variable psub :
    struct rsub {     /* SUBSEQ : one record for each (sub)sequence */
        int length, /* seq length; or 0 if record was deleted */
    	type, /* to SMJYT, for seq type */
    	pext, /* if > 0 this is a subsequence, pext points to EXTRACT for list of exons;
    	   	if <= 0 this is a parent sequence, -pext points to LONGL for list of subseqs */
    	plkey, /* to SHORTL for list of keywords */
    	plinf, /* if parent sequence, plinf points to LOCUS for corresponding record;
    	   	 if subsequence, points to SHORTL for list of address of start of annotations; 
    	   	 this list contains only one element to be combined with the division rank
    	   	 for access to annotations */
    	phase, /* 100 * code_number + reading_frame_0_1_2 */
    	h; /* to SUBSEQ for next record with same hashing value or 0  */
        char name[1]; /* sequence name padded by spaces to L_MNEMO chars */
        } *psub; 
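    The packed fields described in the comments of this structure can be decoded with simple integer arithmetic. Here is a minimal sketch based only on those comments; the three helper functions are hypothetical and are not part of the ACNUC API.

```c
#include <assert.h>

/* Hypothetical helpers, not part of the ACNUC API. They only apply the
   encodings stated in the comments of struct rsub. */

/* Genetic code number stored in phase (phase = 100 * code_number + frame). */
int phase_to_code(int phase)
{
    assert(phase >= 0);
    return phase / 100;
}

/* Reading frame (0, 1 or 2) stored in phase. */
int phase_to_frame(int phase)
{
    assert(phase >= 0);
    return phase % 100;
}

/* Non-zero when pext denotes a subsequence (pext > 0);
   zero when it denotes a parent sequence (pext <= 0). */
int pext_is_subsequence(int pext)
{
    return pext > 0;
}
```

    For instance, a phase value of 402 decodes to code number 4 and reading frame 2.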

    Two functions, read_first_rec and write_first_rec, read and write the first record of each index file, which differs from all other records by holding the total number of records in the index.

    Index files contain fixed-length-space-padded strings. These are therefore not C strings because they are not ended by a null byte. A true C string is obtained as follows:
    char nom[L_MNEMO + 1];
    memcpy(nom, psub->name, L_MNEMO); nom[L_MNEMO] = 0; trim_key(nom);

    Conversely, to write a C string name to an ACNUC index file buffer, do :
    padtosize(psub->name, name, L_MNEMO);
    This may affect other fields of the structure, which should therefore be filled afterwards.
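    To make the fixed-length, space-padded convention concrete, here is a self-contained sketch of both conversions. pad_field and field_to_cstring are hypothetical stand-ins written for illustration; they are not the ACNUC functions padtosize and trim_key, whose exact behavior may differ (this sketch, for example, never writes outside the field).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins, not the ACNUC functions. */

/* Copies src into a fixed-width field and pads it with spaces.
   Writes exactly width bytes and no terminating null; a longer src
   is truncated. */
void pad_field(char *field, const char *src, size_t width)
{
    size_t len;

    assert(field != NULL && src != NULL);
    len = strlen(src);
    if (len > width) len = width;
    memcpy(field, src, len);
    memset(field + len, ' ', width - len);
}

/* Converts a space-padded field into a C string with trailing spaces
   removed. dest must hold at least width + 1 bytes. */
void field_to_cstring(char *dest, const char *field, size_t width)
{
    size_t len = width;

    assert(dest != NULL && field != NULL);
    while (len > 0 && field[len - 1] == ' ') len--;   /* drop the padding */
    memcpy(dest, field, len);
    dest[len] = '\0';
}
```

    Padding a name into a field and converting the field back yields the original name, provided the name fits in the field and does not end with a space.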

    Reading example :

    int num, type;
    char seqname[] = "ecotgp.trpa";
    #define LCODE sizeof(psmj->name)
    char code[LCODE +  1];
    
    num = isenum(seqname); /* get the seq rank from its name */
    readsub(num); /* read SUBSEQ record of rank num into buffer pointed to by psub */
    type = psub->type; /* this field indicates the seq type */
    readsmj(type); /* read SMJYT record corresponding to type */
    memcpy(code, psmj->name, LCODE );/*prepare a C string from the name field of the SMJYT record*/
    code[LCODE] = 0;
    trim_key(code);
    printf("type of sequence %s is %s\n", seqname, code);
    


  11. USING BIT LISTS
    Bitlists are used to handle lists of sequences, species or keywords. List elements are represented by their rank. Ranks are the record numbers of the corresponding records in the ACNUC index files. Ranks are computed by gsnuml or isenum for sequences and by iknum for species or keywords. Bitlists are arrays of integers. The range of rank values begins at 2 because index file records are numbered starting from 1 and record # 1 is reserved for holding the file's total record number.
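    The idea of a bitlist, one bit per rank stored in an array of integers, can be sketched as follows. The bl_ functions are hypothetical stand-ins and are not the ACNUC functions bit1, bit0, testbit, irbit and bcount; the bit layout (rank r stored in bit r % width of word r / width) and the use of unsigned integers are assumptions of this sketch.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_WORD ((int)(sizeof(unsigned) * CHAR_BIT))

/* Hypothetical stand-ins, not the ACNUC functions. */

/* Puts rank in the list. */
void bl_set(unsigned *list, int rank)
{
    assert(rank >= 0);
    list[rank / BITS_PER_WORD] |= 1u << (rank % BITS_PER_WORD);
}

/* Removes rank from the list. */
void bl_clear(unsigned *list, int rank)
{
    assert(rank >= 0);
    list[rank / BITS_PER_WORD] &= ~(1u << (rank % BITS_PER_WORD));
}

/* Returns 1 if rank is in the list, 0 otherwise. */
int bl_test(const unsigned *list, int rank)
{
    assert(rank >= 0);
    return (int)((list[rank / BITS_PER_WORD] >> (rank % BITS_PER_WORD)) & 1u);
}

/* Returns the first rank in the list that is > after and <= max_rank,
   or 0 if there is none. */
int bl_next(const unsigned *list, int after, int max_rank)
{
    int rank;

    for (rank = after + 1; rank <= max_rank; rank++)
        if (bl_test(list, rank)) return rank;
    return 0;
}

/* Counts the ranks in the list between 1 and max_rank. */
int bl_count(const unsigned *list, int max_rank)
{
    int rank, total = 0;

    for (rank = 1; rank <= max_rank; rank++)
        total += bl_test(list, rank);
    return total;
}
```

    A scan over the whole list starts with rank 1 and calls bl_next repeatedly until it returns 0, so that the first candidate rank examined is 2.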

  12. UTILITY FUNCTIONS


  13. SIMULTANEOUS ACCESS TO SEVERAL ACNUC DATABASES

    Usage example:

    #include "dir_acnuc.h"
    
    /* declare prototypes */
    int chg_acnuc(char *acnucvar, char *gcgacnucvar);
    void *store_acnuc_status(void);
    void set_current_acnuc_db(void *db);
    
    /* declare a void * for each used database */
    void *db1, *db2;
    /* buffer for the 60 residues read by gfrag plus a terminating null (assumed) */
    char seq[61];
    
    /* open + memorize access to 1st database */
    chg_acnuc("/banques0/genbank/index", "/banques0/genbank/flat_files");
    acnucopen();
    db1 = store_acnuc_status();
    if(db1 == NULL) {
    	/* not enough memory */
    	exit(ERREUR);
    	}
    
    /* open + memorize access to 2nd database */
    chg_acnuc("/banques0/swissprot/index", "/banques0/swissprot/flat_files");
    acnucopen();
    db2 = store_acnuc_status();
    if(db2 == NULL) {
    	exit(ERREUR);
    	}
    
    /* directs the API to the 1st database */
    set_current_acnuc_db(db1);
    /* now access to the 1st database is possible */
    gfrag(2, 1, 60, seq);
    readsub(2);
    printf("%.16s %s\n", psub->name, seq);
    
    /* directs the API to the 2nd database */
    set_current_acnuc_db(db2);
    /* now access to the 2nd database is possible */
    gfrag(2, 1, 60, seq);
    readsub(2);
    printf("%.16s %s\n", psub->name, seq);
    


  14. DATABASE MANAGEMENT FUNCTIONS