ACUTS Database Format


The format used for the ACUTS database is as close as possible to the EMBL format. However, ACUTS differs from classical sequence databases in that each entry does not describe a single sequence, but a set of sequences that share some conserved elements. Hence we decided to store different types of information in distinct files:

Sequences

Sequences are stored in FASTA format (see example). WARNING: sequences comprise not only the conserved regions, but also the non-conserved untranslated sequences that surround them.

Alignments

Sequence alignments are stored in two different formats:

Annotations

Different types of data are included in the annotation files (see example). Each ACUTS entry is composed of lines. Different types of lines, each with its own format, are used to record the various data that make up the entry.

Each line begins with a two-character line code, which indicates the type of data contained in the line. The current line types and line codes, and the order in which they appear in an entry, are shown below:

    ID     - Identification.
    AC     - Accession number(s).
    DT     - Date.
    DE     - Description.
    AG     - Age of the conserved element.
    LO     - Location of the conserved element.
    KW     - Keywords.
    CC     - Comments or notes.
    RN     - Reference number.
    RC     - Reference comments.
    RX     - Reference cross-references.
    RA     - Reference authors.
    RT     - Reference title.
    RL     - Reference location.
    SN     - Sequence number.
    SI     - Sequence identification.
    SL     - Sequence length.
    OS     - Organism species.
    TX     - Organism taxonomic group.
    DR     - Database cross-references.
    FT     - Feature table data.
    //     - Termination line.
Some entries do not contain all of the line types, and some line types occur many times in a single entry. Each entry must begin with an identification line (ID) and end with a terminator line (//). In addition, the following line types are always present in an entry: AC (once), DT (twice), DE (1 or more), AG (once), LO (once), SN (2 or more), SI (2 or more), SL (2 or more), OS (2 or more), TX (2 or more), DR (2 or more). The other line types (RC, RN, RX, RA, RT, RL, CC, KW and FT) are optional. A detailed description of each line type is given in the next section of this document.
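
The occurrence rules above can be checked mechanically. The following Python sketch is an illustration only and is not part of the ACUTS distribution; the function name check_entry and the REQUIRED table are assumptions made for this example. It counts the line codes of one entry and reports every rule that is not met.

```python
from collections import Counter

# Occurrence rules from the paragraph above: code -> (minimum, maximum or None).
REQUIRED = {
    "AC": (1, 1), "DT": (2, 2), "DE": (1, None), "AG": (1, 1), "LO": (1, 1),
    "SN": (2, None), "SI": (2, None), "SL": (2, None),
    "OS": (2, None), "TX": (2, None), "DR": (2, None),
}


def check_entry(lines):
    """Return a list of problems found in one entry (empty if none)."""
    problems = []
    # The line code is the first two characters of every non-blank line.
    codes = [line[:2] for line in lines if line.strip()]
    if not codes or codes[0] != "ID":
        problems.append("entry does not start with an ID line")
    if not codes or codes[-1] != "//":
        problems.append("entry does not end with a // line")
    counts = Counter(codes)
    for code, (lowest, highest) in REQUIRED.items():
        n = counts[code]
        if n < lowest or (highest is not None and n > highest):
            problems.append(f"{code}: found {n} line(s)")
    return problems
```

An empty result means that the entry satisfies the counting rules; it does not validate the content of the individual lines.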

The two-character line type code which begins each line is always followed by three blanks, so that the actual information begins with the sixth character. Information is not extended beyond character position 80.
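
As an illustration of this column layout, the Python sketch below splits one line into its line code and its data. The function name split_line is an assumption made for this example, and the handling of lines that carry a code but no data (such as the terminator line) is a choice of the sketch rather than a rule stated here.

```python
def split_line(line):
    """Split one entry line into (line_code, data)."""
    line = line.rstrip("\n")
    if len(line) > 80:
        raise ValueError("line extends beyond character position 80")
    code = line[:2]
    if code == "//":
        # The terminator line carries no data.
        return code, ""
    # Characters 3 to 5 must be blank; the data starts at character 6.
    if line[2:5].strip():
        raise ValueError("line code must be followed by three blanks")
    return code, line[5:]
```

For example, split_line("AC   CU00321; CU05348;") returns the pair ("AC", "CU00321; CU05348;").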




                          THE DIFFERENT LINE TYPES



    1 The ID line

    The ID  (IDentification) line is always the first line of an entry. The
    general form of the ID line is:

    ID   ENTRY_NAME   SEQUENCES:   #; ALIGNED BASES:  #.


         1.1 Entry Name

    The first  item on  the ID line is the entry name of the sequence. This
    name is  a useful  means of  identifying an entry.  The  entry  name
    consists of up to 14 uppercase alphanumeric characters.

    ACUTS uses  a general  purpose  naming  convention  which  can  be
    symbolized as X_Y, where

    X  is a mnemonic code of at most 8 alphanumeric characters representing
       the gene name. Examples: B2MG is for Beta-2-microglobulin, HBA is
       for Hemoglobin alpha chain and INS is for Insulin.

    The `_' sign serves as a separator.

    Y  is a mnemonic code of at most 5 alphanumeric characters
       representing the location of the conserved element:

       5UT: 5'UTR               (5' transcribed untranslated region)
       5FL: 5'flank             (5' untranscribed region)
       5NC: 5'non-coding region (5'flank + 5'UTR)
       IN#: intron number #     (e.g. IN12 for the 12th intron)
       3UT: 3'UTR               (from the stop codon to the polyA site)
       3FL: 3'flank             (3' of the polyA site)
       3NC: 3'non-coding region (3'flank + 3'UTR)
       NCR: non-coding region   (any non-coding region)
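
    The X_Y convention can be decoded with a few lines of Python. This is
    a sketch only: the function name parse_entry_name and the wording of
    the returned descriptions are assumptions made for this example.

```python
import re

# Location codes listed above; IN# (intron number) is handled separately.
LOCATION_CODES = {
    "5UT": "5'UTR", "5FL": "5'flank", "5NC": "5'non-coding region",
    "3UT": "3'UTR", "3FL": "3'flank", "3NC": "3'non-coding region",
    "NCR": "non-coding region",
}


def parse_entry_name(name):
    """Split an entry name X_Y into (gene_code, location_description)."""
    gene, sep, loc = name.partition("_")
    if not sep or not re.fullmatch(r"[A-Z0-9]{1,8}", gene):
        raise ValueError(f"bad gene code in {name!r}")
    intron = re.fullmatch(r"IN(\d+)", loc)
    if intron:
        return gene, f"intron {int(intron.group(1))}"
    if loc not in LOCATION_CODES:
        raise ValueError(f"unknown location code in {name!r}")
    return gene, LOCATION_CODES[loc]
```

    With this sketch, parse_entry_name("ACTAC_3UT") gives ("ACTAC", "3'UTR").
    A name such as HBA_IN12 (made up here for illustration) would be read
    as intron 12 of the HBA gene.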




         1.2 Number of sequences

    The second  item on  the ID  line indicates the number of sequences
    available for that entry.


         1.3 Alignment length

    The third  item on  the ID  line indicates whether sequences have 
    been aligned or not, and in the former case, the total length of the 
    alignment with gaps. 


         1.4 Examples of identification lines

    Two examples of ID lines are shown below:

    ID   ACTAC_3UT      SEQUENCES:   6; ALIGNED BASES:  1929.
    ID   ACTB_5NC       SEQUENCES:   5; NOT ALIGNED.
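
    Both forms of the ID line can be read with one regular expression.
    The Python sketch below is an illustration, not an official parser;
    the names ID_LINE and parse_id_line are assumptions, and the pattern
    is deliberately tolerant about the number of blanks between items.

```python
import re

ID_LINE = re.compile(
    r"ID\s+(?P<name>\S+)\s+SEQUENCES:\s*(?P<nseq>\d+);\s*"
    r"(?:ALIGNED BASES:\s*(?P<length>\d+)|NOT ALIGNED)\.\s*$"
)


def parse_id_line(line):
    """Return (entry_name, number_of_sequences, alignment_length_or_None)."""
    m = ID_LINE.match(line)
    if m is None:
        raise ValueError(f"not a valid ID line: {line!r}")
    length = m.group("length")
    # The length is None when the sequences are flagged as NOT ALIGNED.
    return m.group("name"), int(m.group("nseq")), int(length) if length else None
```

    Applied to the two examples above, the function returns
    ("ACTAC_3UT", 6, 1929) and ("ACTB_5NC", 5, None).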


    2 The AC line

    The AC  (ACcession number)  line lists the accession numbers associated
    with an entry. An example of an accession number line is shown below:

    AC   CU00321; CU05348;

    The accession  numbers are  separated by  semicolons and  the  list  is
    terminated by a semicolon. If necessary, more than one AC line will be
    used.  All  ACUTS  sequence   entries  currently  have  only  one
    accession number.
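
    A minimal Python sketch for collecting the accession numbers of an
    entry, including the case of several AC lines, is given here. The
    function name parse_ac_lines is an assumption made for this example.

```python
def parse_ac_lines(lines):
    """Collect accession numbers, in order, from the AC lines of an entry."""
    accessions = []
    for line in lines:
        if line.startswith("AC"):
            # Data starts at character 6; items are terminated by semicolons.
            accessions.extend(a.strip() for a in line[5:].split(";") if a.strip())
    return accessions
```

    The first element of the returned list is the accession number that
    should be cited.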

    The purpose  of accession  numbers  is  to  provide  a  stable  way  of
    identifying entries  from release to release. It is sometimes necessary
    for reasons  of consistency  to change  the names  of the  entries, for
    example, to ensure that related entries have similar names. However, an
    accession number  is always conserved, and therefore allows unambiguous
    citation of ACUTS entries.

    Researchers who  wish to  cite entries  in  their  publications  should
    always cite the first accession number.


    3 The DT line

    The DT  (DaTe) lines show the date of entry or last modification of the
    sequence entry. The format of the DT lines is:

    DT   DD-MMM-YEAR (COMMENT)

    where `DD' is the day, `MMM' the month, and `YEAR' the year. The
    comment portion of the line indicates the action taken on that date.
    There are ALWAYS two DT lines in each entry, each associated with a
    specific comment:

    -  The first  DT line  indicates when  the entry  first appeared in the
       data bank. The associated comment is `CREATED'.
    -  The second  DT line  indicates  when  the  data  was  last
       modified. The associated comment is `LAST UPDATE'.


    Example of a block of DT lines:

    DT   09-JUL-1996  (CREATED)
    DT   09-JUL-1996  (LAST UPDATE)
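
    The DT lines can be converted to calendar dates as sketched below in
    Python. The sketch assumes that `MMM' is always an uppercase English
    three-letter month abbreviation, as in the example; the names MONTHS,
    DT_LINE and parse_dt_line are assumptions made for this illustration.

```python
import re
from datetime import date

# Assumed month abbreviations (English, uppercase), mapped to month numbers.
MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

DT_LINE = re.compile(r"DT\s+(\d{2})-([A-Z]{3})-(\d{4})\s+\((.+)\)\s*$")


def parse_dt_line(line):
    """Return (date, comment) for one DT line."""
    m = DT_LINE.match(line)
    if m is None or m.group(2) not in MONTHS:
        raise ValueError(f"not a valid DT line: {line!r}")
    day, month, year, comment = m.groups()
    return date(int(year), MONTHS[month], int(day)), comment
```

    A lookup table is used instead of locale-dependent date parsing so
    that the result does not depend on the settings of the machine.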


    4 The DE line

    The DE  (DEscription) lines  contain  general  descriptive  information
    about the  sequence stored. This information is generally sufficient to
    identify the sequence precisely. The format of the DE lines is:

    DE   DESCRIPTION.

    The description  is given  in ordinary  English and  is free-format. In
    some cases,  more than  one DE line is required; in this case, the text
    is divided  only between  words and only the last DE line is terminated
    by a period.

    Two examples of description lines are given here:


    DE   ACTIN, CYTOPLASMIC BETA, PROMOTER AND FIRST INTRON.

    DE   BRAIN-SPECIFIC RECEPTOR-TYPE PROTEIN-TYROSINE KINASE 
    DE   (BSK/HEK7/CEK7/EHK-1) 3'UTR.




    5 The AG line

    The AG  (AGe) line  indicates the age of the conserved element 
    (i.e. the approximate time of divergence between the species in
    which the conserved element is detected) and the corresponding
    speciation event.

    Example:

    AG   310 MYRS (MAMMALIA/SAUROPSIDA).
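
    The age and the speciation event can be extracted from the AG line as
    in the following Python sketch. It assumes that every AG line follows
    the layout of the example above (age, the unit MYRS, then two group
    names separated by a slash); the names AG_LINE and parse_ag_line are
    assumptions made for this illustration.

```python
import re

AG_LINE = re.compile(r"AG\s+(\d+)\s+MYRS\s+\(([^/()]+)/([^/()]+)\)\.\s*$")


def parse_ag_line(line):
    """Return (age_in_myrs, (group_1, group_2)) for one AG line."""
    m = AG_LINE.match(line)
    if m is None:
        raise ValueError(f"not a valid AG line: {line!r}")
    return int(m.group(1)), (m.group(2).strip(), m.group(3).strip())
```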


    6 The LO line

    The LO  (LOcation) line  indicates the location of the conserved 
    element (5'flank, 5'UTR, intron, 3'UTR, 3'flank, etc.).


    Example:

    LO   3'UTR (STOP CODON TO POLYADENYLATION SITE).



    7 The KW line

    The KW  (KeyWord) lines  provide  information  which  can  be  used  to
    generate cross-reference  indexes of  the  sequence  entries  based  on
    functional, structural,  or other  categories. The  keywords chosen for
    each entry serve as a subject reference for the sequence. Often several
    KW lines  are necessary  for a single entry. The format of the KW lines
    is:

    KW   KEYWORD[; KEYWORD...].

    More than  one keyword  may be listed on each KW line; the keywords are
    separated by  semicolons, and the last keyword is followed by a period.
    Keywords may  consist of  more than one word (they may contain blanks),
    but are never split between lines. An example of a KW line is:

    KW   EYE LENS PROTEIN; ACETYLATION.

    The order  of the  keywords is not significant. The above example could
    also have been written:

    KW   ACETYLATION; EYE LENS PROTEIN.
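
    A possible way to collect the keywords of an entry is sketched below
    in Python. The function name parse_kw_lines is an assumption made for
    this example. Because keywords are never split between lines, each KW
    line can be processed on its own.

```python
def parse_kw_lines(lines):
    """Return the keywords found on the KW lines of an entry."""
    keywords = []
    for line in lines:
        if not line.startswith("KW"):
            continue
        # Drop the final period, then split on the semicolon separators.
        text = line[5:].strip().rstrip(".")
        keywords.extend(k.strip() for k in text.split(";") if k.strip())
    return keywords
```

    Since the order of the keywords is not significant, the two example
    lines above give the same result once the lists are compared as sets.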

    8 The CC line

    The CC  lines are  free text  comments on the entry, and may be used to
    convey any useful information. The comments always appear below the
    last reference line and are grouped together in comment blocks, a block
    being made of 1 or more comment lines. The first line of a block
    is marked with the characters `-!-'.

    The format of a comment block is:

    CC   -!- FIRST LINE OF A COMMENT BLOCK.
    CC       SECOND AND SUBSEQUENT LINES OF A COMMENT BLOCK.

    A major proportion of the comment blocks are arranged according to what
    we designate as `topics'. The format of a comment block which belongs
    to a `topic' is:


    CC   -!- TOPIC: FREE TEXT DESCRIPTION.

    The current topics are:

    PROTEIN FUNCTION :             General description of the function(s) 
                                   of the protein encoded by the gene.
    PROTEIN SUBCELLULAR LOCATION : Description of  the subcellular  location 
                                   of the mature protein product.
    GENE EXPRESSION :              Description of the expression pattern of the 
                                   gene.
    mRNA SUBCELLULAR LOCATION :    Description of  the subcellular  location
                                   of the mRNA.
    BEST SCORE :                   Similarity score of the most conserved
                                   element.
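
    The grouping of CC lines into blocks and topics can be sketched in
    Python as follows. The names TOPICS and parse_cc_blocks are
    assumptions made for this example, and the comment texts used in the
    usage note are invented for illustration.

```python
TOPICS = {
    "PROTEIN FUNCTION", "PROTEIN SUBCELLULAR LOCATION", "GENE EXPRESSION",
    "mRNA SUBCELLULAR LOCATION", "BEST SCORE",
}


def parse_cc_blocks(lines):
    """Return a list of (topic_or_None, text), one item per comment block."""
    blocks = []
    for line in lines:
        if not line.startswith("CC"):
            continue
        text = line[2:].strip()
        if text.startswith("-!-"):
            # The marker `-!-' opens a new comment block.
            blocks.append(text[3:].strip())
        elif blocks:
            # Any other CC line continues the current block.
            blocks[-1] += " " + text
    result = []
    for block in blocks:
        topic, sep, rest = block.partition(":")
        if sep and topic.strip() in TOPICS:
            result.append((topic.strip(), rest.strip()))
        else:
            result.append((None, block))
    return result
```

    For instance, the invented line "CC   -!- BEST SCORE: 193." would be
    returned as the pair ("BEST SCORE", "193."), while a block without a
    recognised topic is returned with None in place of the topic.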

    9 The reference (RN, RX, RA, RT, RL) lines

    These lines  comprise the  literature citations  within ACUTS. The
    citations indicate  the papers from which the data has been abstracted.
    The reference  lines for  a given  citation occur  in a  block, and are
    always in  the order RN, RX, RA, RT, RL. Within each such reference
    block the  RN line occurs once, the RX line occurs zero or more 
    times, and the RA, RL and RT lines each occur one or more times. If
    several references are given, there will be a reference block for each.

    An example of a complete reference is:

    RN   [1]
    RX   MEDLINE; 88217501.
    RA   Lohse P., Arnold H.H.;
    RT   "The down-regulation of the chicken cytoplasmic beta actin during 
    RT   myogenic differentiation does not require the gene promoter but 
    RT   involves the 3' end of the gene";
    RL   Nucleic Acids Res. 16:2787-803(1988).

    The formats of the individual lines are explained below.

         9.1 The RN line

    The RN  (Reference Number)  line gives  a  sequential  number  to  each
    reference citation  in an  entry. This  number is  used to indicate the
    reference in  comments and  feature table  notes. The  format of the RN
    line is:

    RN   [N]

    where N  denotes the nth reference for this entry. The reference number
    is always enclosed in square brackets.



         9.2 The RX line

    The RX  (Reference cross-reference)  line is  an optional line which is
    used to  indicate the  identifier assigned to a specific reference in a
    bibliographic database. The format of the RX line is:

    RX   BIBLIOGRAPHIC_DATABASE_NAME; IDENTIFIER.

    where the valid bibliographic database names and their associated
    identifiers are:

    Name:       MEDLINE
    Database:   Medline from the National Library of Medicine (NLM)
    Identifier: Eight digit Medline Unique Identifier (UID)

    Example of RX line:

    RX   MEDLINE; 91002678.

         9.3 The RA line

    The RA (Reference Author) lines list the authors of the paper (or other
    work) cited.  All of  the authors  are included,  and are listed in the
    order given  in the  paper. The names are listed surname first followed
    by a  blank followed by initial(s) with periods. The authors' names are
    separated by commas and terminated by a semicolon. Author names are not
    split between lines. An example of the use of RA lines is shown below:

    RA   YANOFSKY C., PLATT T., CRAWFORD I.P., NICHOLS B.P., CHRISTIE G.E.,
    RA   HOROWITZ H., VAN CLEEMPUT M., WU A.M.;

    As many RA lines as necessary are included for each reference.
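
    The author list can be rebuilt from the RA lines as in the Python
    sketch below. The function name parse_ra_lines is an assumption made
    for this example; the sketch relies on the rule that author names are
    never split between lines.

```python
def parse_ra_lines(lines):
    """Return a list of (surname, initials) from the RA lines of a reference."""
    text = " ".join(line[5:].strip() for line in lines if line.startswith("RA"))
    # Names are separated by commas; the whole list ends with a semicolon.
    names = [n.strip() for n in text.rstrip(";").split(",") if n.strip()]
    # The initials are the last blank-separated item of each name.
    return [tuple(n.rsplit(" ", 1)) for n in names]
```

    Splitting at the last blank keeps compound surnames intact, so that
    "VAN CLEEMPUT M." gives ("VAN CLEEMPUT", "M.").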

         9.4 The RT line

    The RT (Reference Title) lines list the title of the paper (or other
    work) cited.  

    As many RT lines as necessary are included for each reference.

         9.5 The RL line

    The RL  (Reference Location)  lines contain  the conventional  citation
    information for  the reference.  In general,  the RL  lines  alone  are
    sufficient to find the paper in question.

    a) Journal citations

    The RL  line for  a journal citation includes the journal abbreviation,
    the volume number, the page range, and the year. The format for such an
    RL line is:

    RL   JOURNAL VOL:PP-PP(YEAR).

    Journal names  are abbreviated according to the conventions used by the
    National Library  of Medicine  (NLM) and  are based on the existing ISO
    and ANSI  standards. A  list of  the abbreviations  currently in use is
    given in the SWISS-PROT document file JOURLIST.TXT.

    An example of an RL line is:

    RL   J. MOL. BIOL. 168:321-331(1983).

    When a reference is made to a paper which is `in press' at the time
    when the data bank is released, the page range, and possibly the
    volume number, are indicated as `0' (zero). An example of an RL line of
    this type is shown here:

    RL   NUCLEIC ACIDS RES. 22:0-0(1994).
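
    Journal citations, including the `in press' form, can be recognised
    with a regular expression. The following Python sketch is an
    illustration only; the names RL_JOURNAL and parse_journal_rl and the
    keys of the returned dictionary are assumptions made for this example.
    Lines in any of the other RL formats described below are not matched.

```python
import re

RL_JOURNAL = re.compile(
    r"RL\s+(?P<journal>.+?)\s+(?P<volume>\d+):(?P<first>\d+)-(?P<last>\d+)"
    r"\((?P<year>\d{4})\)\.\s*$"
)


def parse_journal_rl(line):
    """Return a dict describing a journal citation, or None if no match."""
    m = RL_JOURNAL.match(line)
    if m is None:
        return None
    first, last = int(m.group("first")), int(m.group("last"))
    return {
        "journal": m.group("journal"),
        "volume": int(m.group("volume")),
        "pages": (first, last),
        "year": int(m.group("year")),
        # A page range of 0-0 marks a paper that was in press.
        "in_press": first == 0 and last == 0,
    }
```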

    b) Book citations

    A variation  of the RL line format is used for papers found in books or
    other similar publications, which are cited as shown below:

    RL   (IN) THE ENZYMES, 3RD ED., VOL.11, PART A, BOYER P.D., ED.,
    RL   PP.397-547, ACADEMIC PRESS, NEW YORK, (1975).

    The first RL line contains the designation `(IN)', which indicates that
    this is  a  book  reference.  These  citations  generally  include  the
    following  information:  the  title  of  the  book,  the  name  of  the
    editor(s), the  page range,  the publisher  name, the  city where it is
    published, and the year of publication (which is always shown in
    parentheses).

    c) Unpublished results

    RL lines for unpublished results follow the format shown in the
    following example:

    RL   UNPUBLISHED RESULTS, CITED BY:
    RL   ULRICH E.L., KROGMANN D.W., MARKLEY J.L.;
    RL   J. BIOL. CHEM. 257:9356-9364(1982).

    d) Unpublished observations

    For unpublished observations the format of the RL line is:

    RL   UNPUBLISHED OBSERVATIONS (MMM-YEAR).

    Where `MMM' is the month and `YEAR' is the year.

    We use the `unpublished observations' RL line to cite communications by
    scientists to  SWISS-PROT of unpublished information concerning various
    aspects of a sequence entry.

    e) Thesis

    For Ph.D. theses the format of the RL line is:

    RL   THESIS (YEAR), INSTITUTION_NAME, COUNTRY.


    An example of such a line is given here:

    RL   THESIS (1972), GEORGE WASHINGTON UNIVERSITY, U.S.A.




    10 The sequence (SN, SI, SL, OS, TX, DR) lines


    These lines describe the sequences and their origin.
    The sequence lines for a given sequence occur in a block, and are
    always in the order SN, SI, SL, OS, TX, DR. Within each such sequence
    block the SN, SI, SL, OS and TX lines occur once, and the DR line occurs
    one or more times. One sequence block is given for each sequence.

    An example of a complete sequence block is:

    SN   [1]
    SI   ACTB_3UT.1.HUMAN
    SL   646.
    OS   HOMO SAPIENS (HUMAN).
    TX   MAMMALIA.
    DR   EMBL; M10277. gene; beta-actin
    DR   EMBL; X63432. mRNA; 
    DR   EMBL; X00351. mRNA; 
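
    Because every block starts with an SN line, the sequence lines of an
    entry can be grouped as in the Python sketch below. The function name
    split_sequence_blocks and the dictionary layout are assumptions made
    for this example.

```python
def split_sequence_blocks(lines):
    """Group SN/SI/SL/OS/TX/DR lines into one dict per sequence."""
    blocks = []
    for line in lines:
        code, data = line[:2], line[5:].strip()
        if code == "SN":
            # An SN line opens a new sequence block.
            blocks.append({"SN": data, "DR": []})
        elif code in ("SI", "SL", "OS", "TX") and blocks:
            # Repeated lines of the same type (e.g. a long OS) are joined.
            blocks[-1][code] = (blocks[-1].get(code, "") + " " + data).strip()
        elif code == "DR" and blocks:
            blocks[-1]["DR"].append(data)
    return blocks
```

    Lines that occur before the first SN line are ignored by this sketch.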

    The formats of the individual lines are explained below.

         10.1 The SN line

    The SN  (Sequence Number)  line gives  a  sequential  number  to  each
    sequence  in an  entry. This  number is  used to indicate the
    sequence in  comments and  feature table  notes. The  format of the SN
    line is:

    SN   [N]

    where N  denotes the nth sequence for this entry. The sequence number
    is always enclosed in square brackets.


         10.2 The SI line

    The SI  (Sequence Identification)  line gives  the name of the sequence.
    ACUTS uses  a general  purpose  naming  convention  which  can  be
    symbolized as ID.N.SP, where


    ID is the name of the ACUTS entry.

    The `.' sign serves as a separator.

    N  is a number (useful to distinguish paralogous sequences from the
       same species).

    SP is a  mnemonic species identification code of at most 5 alphanumeric
       characters representing  the biological  source of the sequence. This
       code is  generally made  of the first three letters of the genus and
       the first two letters of the species. Example: NAJNI is for
       Naja nivea.

       However, for  species commonly  encountered in  the data bank, self-
       explanatory codes  are used.  There are 9 of those codes. They are:
       BOVIN for  Bovine, CHICK  for Chicken, HORSE for Horse, HUMAN for 
       Human, MOUSE for Mouse,  PIG for Pig, RABIT for Rabbit, RAT for Rat, 
       SHEEP for Sheep.

    The names of all the presently defined species identification codes are
    listed in the SWISS-PROT document file SPECLIST.TXT.

    Examples of complete sequence names are: MHC_5FL.1.MOUSE
    for the 5'flank sequence of the mouse myosin heavy chain gene,
    MYC_3UT.2.XENLA for the 3'UTR of the second Xenopus laevis c-myc gene.
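
    The ID.N.SP convention, and the general rule for building species
    codes, can be sketched in Python as follows. The names SI_NAME,
    parse_sequence_name and default_species_code are assumptions made for
    this example. The second function implements only the general rule
    (three letters of the genus, two of the species) and does not know
    the self-explanatory codes such as HUMAN or MOUSE.

```python
import re

SI_NAME = re.compile(
    r"(?P<entry>[A-Z0-9]+_[A-Z0-9]+)\.(?P<n>\d+)\.(?P<species>[A-Z0-9]{1,5})"
)


def parse_sequence_name(name):
    """Split a sequence name ID.N.SP into (entry_name, number, species_code)."""
    m = SI_NAME.fullmatch(name)
    if m is None:
        raise ValueError(f"not a valid sequence name: {name!r}")
    return m.group("entry"), int(m.group("n")), m.group("species")


def default_species_code(genus, species):
    """General rule: first 3 letters of the genus + first 2 of the species."""
    return (genus[:3] + species[:2]).upper()
```

    With this sketch, parse_sequence_name("MYC_3UT.2.XENLA") returns
    ("MYC_3UT", 2, "XENLA"), and default_species_code("Naja", "nivea")
    returns "NAJNI".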

         10.3 The SL line

    The SL  (Sequence Length)  line gives  the length of the sequence.

         10.4 The OS line

    The OS  (Organism Species) line specifies the organism which was the
    source of  the stored  sequence. In the rare case where all the species
    information will  not fit  on a  single line  more than  one OS line is
    used. The last OS line is terminated by a period.

    The species designation consists, in most cases, of the Latin genus and
    species designation  followed by the English name (in parentheses). 

    Examples of OS lines are shown here:

    OS   HOMO SAPIENS (HUMAN).
    OS   NAJA NAJA (INDIAN COBRA).


         10.5 The TX line


    The TX  (TaXonomic) line indicates the taxonomic group to which
    the organism belongs.

    Example:

    TX   SAUROPSIDA (BIRDS AND REPTILES).


    The taxonomic groups that have been considered for the comparative
    analysis are:

    MAMMALIA
    SAUROPSIDA (BIRDS AND REPTILES)
    AMPHIBIA
    ACTINOPTERYGII (BONY FISHES)
    CHONDRICHTHYES (CARTILAGINOUS FISHES)
    CEPHALOCHORDATA
    UROCHORDATA
    ECHINODERMS


         10.6 The DR line


    The DR  (Database  cross-Reference)  lines  are  used  as  pointers to
    the original sequence entries from the EMBL/Genbank/DDBJ  nucleotide
    sequence database. For a given locus, there may be several redundant
    sequences in EMBL/Genbank/DDBJ.


    The format of the DR line is:

    DR   EMBL; ACCNUM. SEQTYPE; DEFINITION.


    where

    ACCNUM is the accession number
    SEQTYPE indicates whether the sequence is a mRNA or a genomic fragment
    DEFINITION gives the original sequence definition.


    Examples:

    DR   EMBL; X80130. mRNA; alpha-cardiac actin
    DR   EMBL; X02212. gene; alpha-cardiac actin
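
    A DR line can be split into its four items as in the following Python
    sketch. The names DR_LINE and parse_dr_line are assumptions made for
    this example. The pattern accepts an empty definition and an optional
    final period, since both occur in the examples of this document.

```python
import re

DR_LINE = re.compile(
    r"DR\s+(?P<db>[^;]+);\s*(?P<acc>[^.\s]+)\.\s*"
    r"(?P<seqtype>[^;]+);\s*(?P<definition>.*?)\.?\s*$"
)


def parse_dr_line(line):
    """Return (database, accession_number, sequence_type, definition)."""
    m = DR_LINE.match(line)
    if m is None:
        raise ValueError(f"not a valid DR line: {line!r}")
    return (m.group("db").strip(), m.group("acc"),
            m.group("seqtype").strip(), m.group("definition"))
```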


    11 The FT lines


    The format of the FT (FeaTures) lines is the one used by the 
    EMBL data library. Positions indicated in the feature table 
    correspond to POSITIONS IN THE ALIGNMENT (gaps included).

    We introduced new feature keys to describe highly conserved regions
    (HCR). Different types of HCRs are described, according to their age:

    HCR310: conserved at least since mammalia/sauropsida divergence
    HCR335: conserved at least since amniota/amphibia divergence
    HCR400: conserved at least since actinopterygii/sarcopterygii divergence
    HCR450: conserved at least since agnatha/gnathostoma divergence
    HCR500: conserved at least since cephalochordata/vertebrates divergence
    HCR520: conserved at least since echinoderms/chordata divergence


    We also introduced two qualifiers to describe these HCRs:

    /average_identity="value %"
    /score="value"


    Example:

    FT   HCR400          203..280
    FT                   /name="HCR400_5"
    FT                   /average_identity="70.3 %"
    FT                   /score="193"
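
    The HCR features of an entry can be read as in the Python sketch
    below. It is a simplified illustration and not a full EMBL feature
    table parser: it assumes that a feature key starts at character 6,
    that the location is a simple START..END range, and that each
    qualifier fits on one line. The function name parse_ft_lines is an
    assumption made for this example.

```python
import re


def parse_ft_lines(lines):
    """Return a list of features with key, start, end and qualifiers."""
    features = []
    for line in lines:
        if not line.startswith("FT"):
            continue
        body = line[5:]
        if body[:1].strip():
            # A non-blank character 6 starts a new feature: key and location.
            key, location = body.split()
            start, end = (int(x) for x in location.split(".."))
            features.append({"key": key, "start": start, "end": end,
                             "qualifiers": {}})
        else:
            # Otherwise the line holds a /name="value" qualifier.
            m = re.match(r'\s*/(\w+)="(.*)"\s*$', body)
            if m and features:
                features[-1]["qualifiers"][m.group(1)] = m.group(2)
    return features
```

    The start and end values are positions in the alignment, so they
    cannot be used directly as coordinates in the individual sequences.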


    The "average_identity" is calculated as follows:

       average_identity = match / aln_length

    where:
       match = number of identical residues
       aln_length = length of the aligned region (gaps included)


    The "score" is calculated as follows:

       score = 5 * match - 4 * mismatch

    where:
       match = number of identical residues
       mismatch = number of non-identical residues (NB: 1 gap = 1 mismatch)
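
    The two formulas can be written in Python as shown below. The sketch
    works on a pair of aligned sequences of equal length; how the values
    are combined when an entry contains more than two sequences is not
    specified here and is left out. The function names and the use of `-'
    as the gap character are assumptions made for this example.

```python
def count_columns(seq_a, seq_b, gap="-"):
    """Return (match, mismatch) for two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    match = sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != gap)
    # Every other column, including columns with a gap, is a mismatch.
    mismatch = len(seq_a) - match
    return match, mismatch


def hcr_score(match, mismatch):
    """score = 5 * match - 4 * mismatch"""
    return 5 * match - 4 * mismatch


def average_identity(match, aln_length):
    """average_identity = match / aln_length (a fraction between 0 and 1)."""
    return match / aln_length
```

    For the invented pair ACGT-A and ACCTTA there are 4 matches and 2
    mismatches (one substitution and one gap), which gives a score of
    5 * 4 - 4 * 2 = 12 and an identity of 4 / 6.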


    12 The XX lines


    XX lines contain no data and are present in the ACUTS
    Database only to improve readability of an entry when it is printed
    or displayed on a terminal screen.



