ACNUC FORTRAN Application Programming Interface
ACCESSING SEQUENCES OF AN ACNUC DATABASE FROM USER FORTRAN PROGRAMS
Your own FORTRAN programs can read sequences or subsequences
(e.g. protein, tRNA or rRNA genes) from an acnuc database
using the following API.
The same interface works with all acnuc databases and structures
(GenBank, EMBL, SwissProt or NBRF/PIR).
Seven subroutines/functions (GSNUML, GSNUMLPHA, GFRAG, LIBSUB, GOPEN, CODAA, CLOSEACNUC)
are provided for your programs to use.
Basically, starting with the sequence name, use subroutine
GSNUML to obtain the sequence length and number in the database, and
subroutine GFRAG to read its bases or amino acids, or a fragment of the
sequence. You can also use routine LIBSUB to obtain a short textual
description of the sequence. Protein translation using the appropriate
reading frame and genetic code is also possible (see example 3, below).
Call GOPEN once at the beginning of your program to gain access to acnuc;
CLOSEACNUC may be used to close the acnuc database when needed.
Subroutine GSNUML: to get sequence or sub-sequence number and length from its name
CHARACTER NAME*16
CALL GSNUML(NAME,NUM,LENGTH)
NAME: character string *16 containing the sequence
name.
NUM: upon return, the sequence number in the database,
or 0 if NAME is not an existing sequence name.
LENGTH: upon return, the sequence length in nucleotides.
Subroutine GSNUMLPHA: to get sequence or sub-sequence number, length,
reading frame and genetic code from its name
CHARACTER NAME*16
CALL GSNUMLPHA(NAME,NUM,LENGTH,FRAME,CODE)
NAME: character string *16 containing the sequence
name.
NUM: upon return, the sequence number in the database,
or 0 if NAME is not an existing sequence name.
LENGTH: upon return, the sequence length in nucleotides.
FRAME: upon return, the reading frame (0,1,2) of the coding sequence
CODE: upon return, the genetic code id (0 for standard code)
Subroutine GFRAG: to read all or part of a sequence or a sub-sequence
CHARACTER SEQ*`some_adequate_length'
CALL GFRAG(NUM,IFIRST,LFRAG,SEQ)
NUM: the sequence number (returned by GSNUML).
IFIRST: the position in the sequence of the 1st base to
be read.
LFRAG: the number of bases to be read, starting at
position IFIRST. Upon return, LFRAG contains
the number of bases actually read. It can be
smaller than the input LFRAG value if
IFIRST+LFRAG-1 is larger than the sequence length.
LFRAG is returned as 0 in case of error (illegal
sequence number, length of SEQ too short,
illegal IFIRST value).
SEQ: a character string of length greater than LFRAG
that will contain upon return the bases read.
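Because LFRAG is both an input and an output argument, it should be passed as a variable rather than as a literal constant. The following is a minimal sketch of a guarded call, assuming the sequence name used in the examples of this document exists in the database; the buffer size of 200 and the request for 100 bases are arbitrary illustration values.

```fortran
      program rdfrag
      character name*16,seq*200
      integer num,length,ifirst,lfrag
      call gopen
      name='ecotgp.trpa'
      call gsnuml(name,num,length)
      if(num.eq.0)stop 'invalid sequence name'
c lfrag is modified by gfrag: pass a variable, not a constant
      ifirst=1
      lfrag=100
      call gfrag(num,ifirst,lfrag,seq)
c lfrag=0 signals an error (bad num, bad ifirst, seq too short)
      if(lfrag.eq.0)stop 'gfrag failed'
c fewer bases than requested: the end of the sequence was reached
      if(lfrag.lt.100)print *,'only',lfrag,' bases available'
      print *,seq(1:lfrag)
      call closeacnuc
      end
```

Only seq(1:lfrag) is meaningful after the call; the rest of SEQ should not be relied upon.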
Subroutine LIBSUB: to get a short description of a sequence or sub-sequence
CHARACTER LIBEL*80
CALL LIBSUB(NUM,LIBEL)
NUM: the sequence number (returned by GSNUML).
LIBEL: character*80 string returned with the sequence or
sub-sequence name and a short description of it.
Subroutine GOPEN: To gain access to ACNUC files.
CALL GOPEN
Call it once at the beginning of the program.
Subroutine CLOSEACNUC: To close access to ACNUC files.
CALL CLOSEACNUC
Function CODAA: translates 3 bases into an amino-acid using a given genetic code
CHARACTER CODAA*1,RESIDUE*1,CODON*3
INTEGER GEN_CODE
RESIDUE=CODAA(CODON,GEN_CODE)
CODON: a 3-base codon
GEN_CODE: an integer specifying the genetic code in use (see example 3
below for detailed description of its usage)
0 denotes the `standard' genetic code
RESIDUE: a one-character amino acid (* is returned for a stop codon)
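As an illustration of the calling convention, here is a small sketch that translates a literal 9-base string codon by codon with the standard code (GEN_CODE = 0). It assumes that GOPEN should be called first, as for the other routines, and that CODAA accepts lowercase bases, which example 3 below suggests since it passes GFRAG output directly to CODAA. The program name and the test string are made up for the example.

```fortran
      program trdemo
      character dna*9,prot*3
      character*1 codaa
      integer i,j,code
      call gopen
c three codons, the last one being a stop codon
      dna='atggcttaa'
c 0 denotes the standard genetic code
      code=0
      j=0
      do 10 i=1,len(dna)-2,3
         j=j+1
         prot(j:j)=codaa(dna(i:i+2),code)
   10 continue
c prot(1:j) now holds one residue per codon, * for a stop codon
      print *,prot(1:j)
      call closeacnuc
      end
```

Note that CODAA must be declared as CHARACTER*1 in the calling program unit, otherwise implicit typing would treat it as a numeric function.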
EXAMPLES
c declarations: sequence names MUST be 16 characters long
      character name*16,seq*5000,libel*80
c open the necessary files
      call gopen
c process for example ECOTGP.TRPA subsequence
c (the name may be given in upper or lower case)
      name='ecotgp.trpa'
      call gsnuml(name,num,length)
      if(num.eq.0)stop 'invalid sequence name'
c get and print a short textual description of it
      call libsub(num,libel)
      print*,libel
c example 1: read the complete sequence in memory
      call gfrag(num,1,length,seq)
      if(length.eq.0)stop 'sequence is too long for string seq'
c example 2: read it successively by pieces of k bases
c (k is chosen by the caller and must fit in string seq)
      do 10 i=1,length,k
      l=k
      call gfrag(num,i,l,seq)
c     .
c     . process the l bases read in seq(1:l)
c     . generally l=k, except possibly for the last piece
c     .
   10 continue
c example 3: translate a protein coding region using the
c appropriate genetic code and reading frame:
c the protein sequence will be in string PROT using the 1-letter code
      IMPLICIT INTEGER(A-Z) !everything not character string is integer
      CHARACTER NAME*16,SEQ*3000,PROT*1000
      CHARACTER*1 CODAA     !prepare for using function CODAA
      CALL GOPEN            !open access to the database
      NAME='HUMMTCG.PE1'    !example: a human mt protein gene
c obtain the reading frame (0, 1, or 2)
c and the genetic code as known by ACNUC
      CALL GSNUMLPHA(NAME,NUM,LENGTH,FRAME,CODE)
      IF(NUM.EQ.0)STOP 'invalid sequence name'
c read the complete sequence (note the use of variable FRAME);
c upon return LENGTH is the number of bases actually read
      CALL GFRAG(NUM,FRAME+1,LENGTH,SEQ)
      J=0
      DO 1 I=1,LENGTH-2,3
      J=J+1
c function CODAA translates a codon using the code specified by CODE
    1 PROT(J:J)=CODAA(SEQ(I:I+2),CODE)
      LPROT=J  !LPROT=length of protein sequence in string PROT
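When a coding sequence may be longer than the string buffer, translation can be done piece by piece, provided each piece holds a whole number of codons. The sketch below is one possible way to combine the piecewise reading of example 2 with the translation of example 3. The piece size of 300 bases, the buffer sizes and the program name are arbitrary choices for illustration, not part of the ACNUC interface.

```fortran
      program trchnk
      implicit integer(a-z)
      character name*16,seq*512,pep*100
      character*1 codaa
      call gopen
      name='hummtcg.pe1'
      call gsnumlpha(name,num,length,frame,code)
      if(num.eq.0)stop 'invalid sequence name'
c read 300 bases (100 whole codons) per call, starting in frame
      do 20 first=frame+1,length,300
         l=300
         call gfrag(num,first,l,seq)
         if(l.eq.0)stop 'gfrag error'
c translate the whole codons found in seq(1:l)
         n=0
         do 10 i=1,l-2,3
            n=n+1
            pep(n:n)=codaa(seq(i:i+2),code)
   10    continue
c print the residues of this piece (at most 100)
         if(n.gt.0)print *,pep(1:n)
   20 continue
      call closeacnuc
      end
```

Because the piece size is a multiple of 3, every piece starts on a codon boundary; only the last piece can end with an incomplete codon, which the inner loop ignores.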
Notes: (1) GFRAG returns lowercase nucleotides for GenBank and EMBL,
and uppercase for NBRF.
(2) The GFRAG subroutine contains a large internal buffer,
so sequences can be read in small pieces, if needed,
without penalty.
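Since the letter case returned by GFRAG depends on the database (note 1), programs that compare bases may want to normalize it first. The following self-contained sketch shows one way to do so; the subroutine UPCASE and the demonstration program are hypothetical helpers written for this illustration and are not part of the ACNUC library.

```fortran
      program updemo
      character seq*12
c mixed-case input, padded with blanks to 12 characters
      seq='atgGCTtaa'
      call upcase(seq)
      print *,seq
      end

      subroutine upcase(s)
c convert the letters a-z of s to uppercase in place;
c all other characters (blanks, *, digits) are left unchanged
      character s*(*)
      character lower*26,upper*26
      integer i,k
      data lower/'abcdefghijklmnopqrstuvwxyz'/
      data upper/'ABCDEFGHIJKLMNOPQRSTUVWXYZ'/
      do 10 i=1,len(s)
         k=index(lower,s(i:i))
         if(k.gt.0)s(i:i)=upper(k:k)
   10 continue
      end
```

In a program using the interface, one would call UPCASE on seq(1:lfrag) right after GFRAG returns.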
USING THE FORTRAN ACNUC INTERFACE UNDER UNIX
User-written FORTRAN programs that use the above-defined ACNUC API must
be linked with the ACNUC C library, libcacnuc.a, for example:
f77 -o myprog myprog.f -L. -lcacnuc
The C library is built by downloading the C source code
and then running:
tar xf acnucsoft.tar
make libcacnuc.a
The environment variables acnuc and gcgacnuc are used by all ACNUC programs.