The format used for the ACUTS database is as close as possible to the EMBL format. However, ACUTS differs from classical sequence databases in that each entry does not describe a single sequence, but a set of sequences that share some conserved elements. Hence we decided to store different types of information in distinct files:
Each line begins with a two-character line code, which indicates the type of data contained in the line. The current line types and line codes, in the order in which they appear in an entry, are shown below:
ID - Identification.
AC - Accession number(s).
DT - Date.
DE - Description.
AG - Age of the conserved element.
LO - Location of the conserved element.
KW - Keywords.
CC - Comments or notes.
RN - Reference number.
RC - Reference comments.
RX - Reference cross-references.
RA - Reference authors.
RT - Reference title.
RL - Reference location.
SN - Sequence number.
SI - Sequence identification.
SL - Sequence length.
OS - Organism species.
TX - Organism taxonomic group.
DR - Database cross-references.
FT - Feature table data.
// - Termination line.
Some entries do not contain all of the line types, and some line types
occur many times in a single entry. Each entry must begin with an
identification line (ID) and end with a terminator line (//). In
addition, the following line types are always present in an entry: AC
(once), DT (twice), DE (1 or more), AG (once), LO (once), SN (2 or
more), SI (2 or more), SL (2 or more), OS (2 or more), TX (2 or more),
DR (2 or more). The other line types (RC, RN, RX, RA, RT, RL, CC,
KW and FT) are optional. A detailed description of each line type is
given in the next section of this document.
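The occurrence rules above can be checked mechanically. The following is a minimal sketch in Python, not part of the ACUTS distribution: the function name `check_entry` and the two tables are assumptions made for illustration, and the input is assumed to be the list of line codes of one entry, in order. AC is only given a minimum, because the AC section allows more than one AC line if necessary.

```python
from collections import Counter

# Minimum number of occurrences per entry, taken from the paragraph above.
MINIMUM = {"AC": 1, "DT": 2, "DE": 1, "AG": 1, "LO": 1,
           "SN": 2, "SI": 2, "SL": 2, "OS": 2, "TX": 2, "DR": 2}
# Line types stated to occur a fixed number of times.
EXACT = {"DT": 2, "AG": 1, "LO": 1}


def check_entry(codes):
    """Return a list of problems found in the line codes of one entry."""
    problems = []
    if not codes or codes[0] != "ID":
        problems.append("entry must begin with an ID line")
    if not codes or codes[-1] != "//":
        problems.append("entry must end with a // line")
    counts = Counter(codes)
    for code, minimum in MINIMUM.items():
        if counts[code] < minimum:
            problems.append(f"{code}: expected at least {minimum}, "
                            f"found {counts[code]}")
    for code, exact in EXACT.items():
        if counts[code] > exact:
            problems.append(f"{code}: expected {exact}, found {counts[code]}")
    return problems
```

An empty list means that the entry satisfies the rules listed above; the order of the lines inside the entry is not checked by this sketch.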
The two-character line type code which begins each line is always followed by three blanks, so that the actual information begins with the sixth character. Information is not extended beyond character position 80.
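As an illustration of this layout, the sketch below splits a line into its line code and its data. The function name `split_line` is an assumption, and the special handling of the `//` and `XX` lines (which carry no data) is a choice made for this example.

```python
def split_line(line):
    """Split one ACUTS line into (line_code, data)."""
    line = line.rstrip("\n")
    if len(line) > 80:
        raise ValueError("line extends beyond character position 80")
    code = line[:2]
    # Terminator and spacer lines carry no data.
    if code in ("//", "XX"):
        return code, ""
    if line[2:5] != "   ":
        raise ValueError("line code must be followed by three blanks")
    # The actual information begins with the sixth character.
    return code, line[5:]
```

For instance, `split_line("AC   CU00321; CU05348;")` yields the code `AC` and the data `CU00321; CU05348;`.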
THE DIFFERENT LINE TYPES
1 The ID line
The ID (IDentification) line is always the first line of an entry. The
general form of the ID line is:
ID ENTRY_NAME SEQUENCES: #; ALIGNED BASES: #.
1.1 Entry Name
The first item on the ID line is the entry name of the sequence. This
name is a useful means of identifying an entry. The entry name
consists of up to 14 uppercase alphanumeric characters.
ACUTS uses a general purpose naming convention which can be
symbolized as X_Y, where
X is a mnemonic code of at most 8 alphanumeric characters representing
the gene name. Examples: B2MG is for Beta-2-microglobulin, HBA is
for Hemoglobin alpha chain and INS is for Insulin.
The `_' sign serves as a separator.
Y is a mnemonic code of at most 5 alphanumeric characters
representing the location of the conserved element:
5UT: 5'UTR (5' transcribed untranslated region)
5FL: 5'flank (5'untranscribed region)
5NC: 5'non-coding region (5'flank + 5'UTR)
IN#: intron number # (e.g. IN12 for the 12th intron)
3UT: 3'UTR (from the stop codon to the polyA site)
3FL: 3'flank (3' of the polyA site)
3NC: 3'non-coding region (3'flank + 3'UTR)
NCR: non-coding region (any non-coding region)
1.2 Number of sequences
The second item on the ID line indicates the number of sequences
available for that entry.
1.3 Alignment length
The third item on the ID line indicates whether sequences have
been aligned or not, and in the former case, the total length of the
alignment with gaps.
1.4 Examples of identification lines
Two examples of ID lines are shown below:
ID ACTAC_3UT SEQUENCES: 6; ALIGNED BASES: 1929.
ID ACTB_5NC SEQUENCES: 5; NOT ALIGNED.
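A possible way to read an ID line is sketched below in Python. The regular expression and the names `ID_PATTERN` and `parse_id_line` are assumptions made for this example; the pattern simply follows the general form and the X_Y naming convention described above, and tolerates any amount of white space between items.

```python
import re

# X (gene, at most 8 characters), `_', Y (location, at most 5 characters).
ID_PATTERN = re.compile(
    r"^ID\s+(?P<name>[A-Z0-9]{1,8}_[A-Z0-9]{1,5})\s+"
    r"SEQUENCES:\s*(?P<nseq>\d+);\s+"
    r"(?:ALIGNED BASES:\s*(?P<length>\d+)|NOT ALIGNED)\.$"
)


def parse_id_line(line):
    """Return the items of an ID line as a dictionary."""
    m = ID_PATTERN.match(line.strip())
    if m is None:
        raise ValueError("not a valid ID line: " + line)
    gene, location = m["name"].split("_")
    length = m["length"]
    return {
        "entry_name": m["name"],
        "gene": gene,
        "location": location,
        "sequences": int(m["nseq"]),
        # None means that the sequences have not been aligned.
        "alignment_length": int(length) if length else None,
    }
```

Applied to the two examples above, the sketch reports an alignment length of 1929 for ACTAC_3UT and `None` for ACTB_5NC.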
2 The AC line
The AC (ACcession number) line lists the accession numbers associated
with an entry. An example of an accession number line is shown below:
AC CU00321; CU05348;
The accession numbers are separated by semicolons and the list is
terminated by a semicolon. If necessary, more than one AC line will be
used. All ACUTS sequence entries currently have only one
accession number.
The purpose of accession numbers is to provide a stable way of
identifying entries from release to release. It is sometimes necessary
for reasons of consistency to change the names of the entries, for
example, to ensure that related entries have similar names. However, an
accession number is always conserved, and therefore allows unambiguous
citation of ACUTS entries.
Researchers who wish to cite entries in their publications should
always cite the first accession number.
3 The DT line
The DT (DaTe) lines show the date of entry or last modification of the
sequence entry. The format of the DT lines is:
DT DD-MMM-YEAR (COMMENT)
where `DD' is the day, `MMM' the month, and `YEAR' the year. The
comment portion of the line indicates the action taken on that date.
There are ALWAYS two DT lines in each entry, each associated
with a specific comment:
- The first DT line indicates when the entry first appeared in the
data bank. The associated comment is `CREATED'.
- The second DT line indicates when the data was last
modified. The associated comment is `LAST UPDATE'.
Example of a block of DT lines:
DT 09-JUL-1996 (CREATED)
DT 09-JUL-1996 (LAST UPDATE)
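The DT format can be converted to a calendar date as in the following sketch. The names `DT_PATTERN`, `MONTHS` and `parse_dt_line` are assumptions; an explicit month table is used here rather than a locale-dependent date parser, on the assumption that `MMM' is always the three-letter English abbreviation in upper case.

```python
import re
from datetime import date

MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
          "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12}

DT_PATTERN = re.compile(
    r"^DT\s+(\d{2})-([A-Z]{3})-(\d{4})\s+\((CREATED|LAST UPDATE)\)$"
)


def parse_dt_line(line):
    """Return (date, comment) for one DT line."""
    m = DT_PATTERN.match(line.strip())
    if m is None:
        raise ValueError("not a valid DT line: " + line)
    day, month, year, comment = m.groups()
    if month not in MONTHS:
        raise ValueError("unknown month: " + month)
    return date(int(year), MONTHS[month], int(day)), comment
```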
4 The DE line
The DE (DEscription) lines contain general descriptive information
about the sequence stored. This information is generally sufficient to
identify the sequence precisely. The format of the DE lines is:
DE DESCRIPTION.
The description is given in ordinary English and is free-format. In
some cases, more than one DE line is required; in this case, the text
is divided only between words and only the last DE line is terminated
by a period.
Two examples of description lines are given here:
DE ACTIN, CYTOPLASMIC BETA, PROMOTER AND FIRST INTRON.
DE BRAIN-SPECIFIC RECEPTOR-TYPE PROTEIN-TYROSINE KINASE
DE (BSK/HEK7/CEK7/EHK-1) 3'UTR.
5 The AG line
The AG (AGe) line indicates the age of the conserved element
(i.e. the approximate time of divergence between the species in
which the conserved element is detected) and the corresponding
speciation event.
Example:
AG 310 MYRS (MAMMALIA/SAUROPSIDA).
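The AG line of the example can be decomposed as sketched below. The pattern is inferred from this single example only, so the assumption that the age is always an integer followed by MYRS and by two group names separated by a slash may not hold for every entry; the names `AG_PATTERN` and `parse_ag_line` are hypothetical.

```python
import re

AG_PATTERN = re.compile(
    r"^AG\s+(?P<age>\d+)\s+MYRS\s+\((?P<first>[^/]+)/(?P<second>[^)]+)\)\.$"
)


def parse_ag_line(line):
    """Return (age in million years, first group, second group)."""
    m = AG_PATTERN.match(line.strip())
    if m is None:
        raise ValueError("not a valid AG line: " + line)
    return int(m["age"]), m["first"], m["second"]
```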
6 The LO line
The LO (LOcation) line indicates the location of the conserved
element (5'flank, 5'UTR, intron, 3'UTR, 3'flank, etc.).
Example:
LO 3'UTR (STOP CODON TO POLYADENYLATION SITE).
7 The KW line
The KW (KeyWord) lines provide information which can be used to
generate cross-reference indexes of the sequence entries based on
functional, structural, or other categories. The keywords chosen for
each entry serve as a subject reference for the sequence. Often several
KW lines are necessary for a single entry. The format of the KW lines
is:
KW KEYWORD[; KEYWORD...].
More than one keyword may be listed on each KW line; the keywords are
separated by semicolons, and the last keyword is followed by a period.
Keywords may consist of more than one word (they may contain blanks),
but are never split between lines. An example of a KW line is:
KW EYE LENS PROTEIN; ACETYLATION.
The order of the keywords is not significant. The above example could
also have been written:
KW ACETYLATION; EYE LENS PROTEIN.
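The rules above translate into a short routine. This is a sketch under the stated rules only (semicolon separators, final period, keywords never split between lines); the function name `parse_keywords` is an assumption.

```python
def parse_keywords(kw_lines):
    """Return the keywords found in a block of KW lines."""
    # Drop the line code and the blanks that follow it, then join the lines.
    text = " ".join(line[2:].strip() for line in kw_lines)
    # Only the last keyword is followed by a period.
    if text.endswith("."):
        text = text[:-1]
    return [kw.strip() for kw in text.split(";") if kw.strip()]
```

Since the order of the keywords is not significant, the results for the two example lines above compare equal when converted to sets.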
8 The CC line
The CC lines are free text comments on the entry, and may be used to
convey any useful information. The comments always appear below the
last reference line and are grouped together in comment blocks, a block
being made of 1 or more comment lines. The first line of a block
is marked with the characters `-!-'.
The format of a comment block is:
CC -!- FIRST LINE OF A COMMENT BLOCK.
CC SECOND AND SUBSEQUENT LINES OF A COMMENT BLOCK.
A major proportion of the comment blocks are arranged according to what
we designate as `topics'. The format of a comment block which belongs
to a `topic' is:
CC -!- TOPIC: FREE TEXT DESCRIPTION.
The current topics are:
PROTEIN FUNCTION : General description of the function(s)
of the protein encoded by the gene.
PROTEIN SUBCELLULAR LOCATION : Description of the subcellular location
of the mature protein product.
GENE EXPRESSION : Description of the expression pattern of the
gene.
mRNA SUBCELLULAR LOCATION : Description of the subcellular location
of the mRNA.
BEST SCORE : Similarity score of the most conserved
element.
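Comment blocks and topics can be recovered as in the sketch below. The function name `parse_comment_blocks` and the `TOPICS` set are assumptions made for this example; a block whose text does not start with one of the topics listed above is returned with a topic of `None`.

```python
TOPICS = {
    "PROTEIN FUNCTION",
    "PROTEIN SUBCELLULAR LOCATION",
    "GENE EXPRESSION",
    "mRNA SUBCELLULAR LOCATION",
    "BEST SCORE",
}


def parse_comment_blocks(cc_lines):
    """Return a list of (topic or None, text) pairs, one per comment block."""
    blocks = []
    for line in cc_lines:
        text = line[2:].strip()
        if text.startswith("-!-"):
            # `-!-' marks the first line of a new block.
            blocks.append(text[3:].strip())
        elif blocks:
            blocks[-1] += " " + text
    result = []
    for block in blocks:
        topic, sep, body = block.partition(":")
        if sep and topic.strip() in TOPICS:
            result.append((topic.strip(), body.strip()))
        else:
            result.append((None, block))
    return result
```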
9 The reference (RN, RX, RA, RT, RL) lines
These lines comprise the literature citations within ACUTS. The
citations indicate the papers from which the data has been abstracted.
The reference lines for a given citation occur in a block, and are
always in the order RN, RX, RA, RT, RL. Within each such reference
block the RN line occurs once, the RX line occurs zero or more
times, and the RA, RL and RT lines each occur one or more times. If
several references are given, there will be a reference block for each.
An example of a complete reference is:
RN [1]
RX MEDLINE; 88217501.
RA Lohse P., Arnold H.H.;
RT "The down-regulation of the chicken cytoplasmic beta actin during
RT myogenic differentiation does not require the gene promoter but
RT involves the 3' end of the gene";
RL Nucleic Acids Res. 16:2787-803(1988).
The formats of the individual lines are explained below.
9.1 The RN line
The RN (Reference Number) line gives a sequential number to each
reference citation in an entry. This number is used to indicate the
reference in comments and feature table notes. The format of the RN
line is:
RN [N]
where N denotes the nth reference for this entry. The reference number
is always enclosed in square brackets.
9.2 The RX line
The RX (Reference cross-reference) line is an optional line which is
used to indicate the identifier assigned to a specific reference in a
bibliographic database. The format of the RX line is:
RX BIBLIOGRAPHIC_DATABASE_NAME; IDENTIFIER.
where the valid bibliographic database names and their associated
identifier are:
Name: MEDLINE
Database: Medline from the National Library of Medicine (NLM)
Identifier: Eight digit Medline Unique Identifier (UID)
Example of RX line:
RX MEDLINE; 91002678.
9.3 The RA line
The RA (Reference Author) lines list the authors of the paper (or other
work) cited. All of the authors are included, and are listed in the
order given in the paper. The names are listed surname first followed
by a blank followed by initial(s) with periods. The authors' names are
separated by commas and terminated by a semicolon. Author names are not
split between lines. An example of the use of RA lines is shown below:
RA YANOFSKY C., PLATT T., CRAWFORD I.P., NICHOLS B.P., CHRISTIE G.E.,
RA HOROWITZ H., VAN CLEEMPUT M., WU A.M.;
As many RA lines as necessary are included for each reference.
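The author list can be rebuilt from the RA lines as sketched below. The function name `parse_authors` is an assumption; the sketch relies on the rules stated above (surname first, a blank, then the initials; names separated by commas; list terminated by a semicolon).

```python
def parse_authors(ra_lines):
    """Return a list of (surname, initials) pairs from a block of RA lines."""
    text = " ".join(line[2:].strip() for line in ra_lines)
    text = text.rstrip(";")
    authors = []
    for name in text.split(","):
        name = name.strip()
        if name:
            # The initials follow the last blank; the surname may contain blanks.
            surname, _, initials = name.rpartition(" ")
            authors.append((surname, initials))
    return authors
```

With the two RA lines of the example above, the sketch returns eight authors, the seventh being `("VAN CLEEMPUT", "M.")`.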
9.4 The RT line
The RT (Reference Title) lines list the title of the paper (or other
work) cited.
As many RT lines as necessary are included for each reference.
9.5 The RL line
The RL (Reference Location) lines contain the conventional citation
information for the reference. In general, the RL lines alone are
sufficient to find the paper in question.
a) Journal citations
The RL line for a journal citation includes the journal abbreviation,
the volume number, the page range, and the year. The format for such an
RL line is:
RL JOURNAL VOL:PP-PP(YEAR).
Journal names are abbreviated according to the conventions used by the
National Library of Medicine (NLM) and are based on the existing ISO
and ANSI standards. A list of the abbreviations currently in use is
given in the SWISS-PROT document file JOURLIST.TXT.
An example of an RL line is:
RL J. MOL. BIOL. 168:321-331(1983).
When a reference is made to a paper which is `in press' at the time
when the data bank is released, the page range, and possibly the
volume number, are indicated as '0' (zero). An example of such an
RL line is shown here:
RL NUCLEIC ACIDS RES. 22:0-0(1994).
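A journal citation in this format can be decomposed as sketched below. The names `RL_JOURNAL` and `parse_journal_rl` are assumptions, and treating a page range of 0-0 as the marker of an `in press' paper is an interpretation of the convention described above.

```python
import re

RL_JOURNAL = re.compile(
    r"^RL\s+(?P<journal>.+?)\s+(?P<volume>\d+):"
    r"(?P<first>\d+)-(?P<last>\d+)\((?P<year>\d{4})\)\.$"
)


def parse_journal_rl(line):
    """Return the items of a journal RL line as a dictionary."""
    m = RL_JOURNAL.match(line.strip())
    if m is None:
        raise ValueError("not a journal citation: " + line)
    first, last = int(m["first"]), int(m["last"])
    return {
        "journal": m["journal"],
        "volume": int(m["volume"]),
        "pages": (first, last),
        "year": int(m["year"]),
        # A page range of 0-0 is used for papers that are in press.
        "in_press": first == 0 and last == 0,
    }
```

Book citations, unpublished results, unpublished observations and theses use other layouts and are rejected by this pattern.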
b) Book citations
A variation of the RL line format is used for papers found in books or
other similar publications, which are cited as shown below:
RL (IN) THE ENZYMES, 3RD ED., VOL.11, PART A, BOYER P.D., ED.,
RL PP.397-547, ACADEMIC PRESS, NEW YORK, (1975).
The first RL line contains the designation `(IN)', which indicates that
this is a book reference. These citations generally include the
following information: the title of the book, the name of the
editor(s), the page range, the publisher name, the city where it is
published, and the year of publication (which is always shown in
parentheses).
c) Unpublished results
RL lines for unpublished results follow the format shown in the
following example:
RL UNPUBLISHED RESULTS, CITED BY:
RL ULRICH E.L., KROGMANN D.W., MARKLEY J.L.;
RL J. BIOL. CHEM. 257:9356-9364(1982).
d) Unpublished observations
For unpublished observations the format of the RL line is:
RL UNPUBLISHED OBSERVATIONS (MMM-YEAR).
Where `MMM' is the month and `YEAR' is the year.
We use the `unpublished observations' RL line to cite communications by
scientists to SWISS-PROT of unpublished information concerning various
aspects of a sequence entry.
e) Thesis
For Ph.D. theses the format of the RL line is:
RL THESIS (YEAR), INSTITUTION_NAME, COUNTRY.
An example of such a line is given here:
RL THESIS (1972), GEORGE WASHINGTON UNIVERSITY, U.S.A.
10 The sequence (SN, SI, SL, OS, TX, DR) lines
These lines describe the sequences and their origin.
The sequence lines for a given sequence occur in a block, and are
always in the order SN, SI, SL, OS, TX, DR. Within each such sequence
block the SN, SI, SL, OS and TX lines occur once, the DR line occurs
one or more times. One sequence block is given for each sequence.
An example of a complete sequence block is:
SN [1]
SI ACTB_3UT.1.HUMAN
SL 646.
OS HOMO SAPIENS (HUMAN).
TX MAMMALIA.
DR EMBL; M10277. gene; beta-actin
DR EMBL; X63432. mRNA;
DR EMBL; X00351. mRNA;
The formats of the individual lines are explained below.
10.1 The SN line
The SN (Sequence Number) line gives a sequential number to each
sequence in an entry. This number is used to indicate the
sequence in comments and feature table notes. The format of the SN
line is:
SN [N]
where N denotes the nth sequence for this entry. The sequence number
is always enclosed in square brackets.
10.2 The SI line
The SI (Sequence Identification) line gives the name of the sequence.
ACUTS uses a general purpose naming convention which can be
symbolized as ID.N.SP, where
ID is the name of the ACUTS entry.
The `.' sign serves as a separator.
N is a number (useful to distinguish paralogous sequences from the
same species).
SP is a mnemonic species identification code of at most 5 alphanumeric
characters representing the biological source of the sequence. This
code is generally made of the first three letters of the genus and
the first two letters of the species. Example: NAJNI is for
Naja nivea.
However, for species commonly encountered in the data bank, self-
explanatory codes are used. There are 9 of those codes. They are:
BOVIN for Bovine, CHICK for Chicken, HORSE for Horse, HUMAN for
Human, MOUSE for Mouse, PIG for Pig, RABIT for Rabbit, RAT for Rat,
SHEEP for Sheep.
All the presently defined species identification codes are
listed in the SWISS-PROT document file SPECLIST.TXT.
Examples of complete sequence names are: MHC_5FL.1.MOUSE
for the 5'flank sequence of the mouse myosin heavy chain gene,
MYC_3UT.2.XENLA for the 3'UTR of the second Xenopus laevis c-myc gene.
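The ID.N.SP convention can be handled as in the following sketch. Both function names are assumptions. `default_species_code` only implements the general rule (three letters of the genus, two of the species); the self-explanatory codes such as HUMAN or MOUSE and any other exception would have to be looked up in SPECLIST.TXT.

```python
def parse_sequence_name(name):
    """Split a sequence name of the form ID.N.SP into its parts."""
    entry, number, species = name.split(".")
    gene, location = entry.split("_")
    return {
        "entry": entry,
        "gene": gene,
        "location": location,
        "number": int(number),
        "species": species,
    }


def default_species_code(latin_name):
    """Build a code from the genus (3 letters) and the species (2 letters)."""
    genus, species = latin_name.split()[:2]
    return (genus[:3] + species[:2]).upper()
```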
10.3 The SL line
The SL (Sequence Length) line gives the length of the sequence.
10.4 The OS line
The OS (Organism Species) line specifies the organism which was the
source of the stored sequence. In the rare case where all the species
information will not fit on a single line, more than one OS line is
used. The last OS line is terminated by a period.
The species designation consists, in most cases, of the Latin genus and
species designation followed by the English name (in parentheses).
Examples of OS lines are shown here:
OS HOMO SAPIENS (HUMAN).
OS NAJA NAJA (INDIAN COBRA).
10.5 The TX line
The TX (TaXonomic) line indicates the taxonomic group to which
the organism belongs.
Example:
TX SAUROPSIDA (BIRDS AND REPTILES).
The taxonomic groups that have been considered for the comparative
analysis are:
MAMMALIA
SAUROPSIDA (BIRDS AND REPTILES)
AMPHIBIA
ACTINOPTERYGII (BONY FISHES)
CHONDRICHTHYES (CARTILAGINOUS FISHES)
CEPHALOCHORDATA
UROCHORDATA
ECHINODERMS
10.6 The DR line
The DR (Database cross-Reference) lines are used as pointers to
the original sequence entries from the EMBL/GenBank/DDBJ nucleotide
sequence database. For a given locus, there may be several redundant
sequences in EMBL/GenBank/DDBJ.
The format of the DR line is:
DR EMBL; ACCNUM. SEQTYPE; DEFINITION.
where
ACCNUM is the accession number
SEQTYPE indicates whether the sequence is an mRNA or a genomic fragment
DEFINITION gives the original sequence definition.
Examples:
DR EMBL; X80130. mRNA; alpha-cardiac actin
DR EMBL; X02212. gene; alpha-cardiac actin
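A DR line in this format can be split into its items as sketched below. The names `DR_PATTERN` and `parse_dr_line` are assumptions. Since some DR lines carry no definition after the sequence type, the definition is treated as optional, and a final period is removed when present.

```python
import re

DR_PATTERN = re.compile(
    r"^DR\s+(?P<db>[^;]+);\s*(?P<acc>[^.\s]+)\.\s*"
    r"(?P<seqtype>[^;]+);\s*(?P<definition>.*)$"
)


def parse_dr_line(line):
    """Return (database, accession number, sequence type, definition)."""
    m = DR_PATTERN.match(line.strip())
    if m is None:
        raise ValueError("not a valid DR line: " + line)
    definition = m["definition"].strip()
    if definition.endswith("."):
        definition = definition[:-1]
    return m["db"].strip(), m["acc"], m["seqtype"].strip(), definition
```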
11 The FT lines
The format of the FT (FeaTures) lines is the one used by the
EMBL data library. Positions indicated in the feature table
correspond to POSITIONS IN THE ALIGNMENT (gaps included).
We introduced new feature keys to describe highly conserved regions
(HCR). Different types of HCRs are described, according to their age:
HCR310: conserved at least since mammalia/sauropsida divergence
HCR335: conserved at least since amniota/amphibia divergence
HCR400: conserved at least since actinopterygii/sarcopterygii divergence
HCR450: conserved at least since agnatha/gnathostoma divergence
HCR500: conserved at least since cephalochordata/vertebrates divergence
HCR520: conserved at least since echinoderms/chordata divergence
We also introduced two qualifiers to describe these HCRs:
/average_identity="value %"
/score="value"
Example:
FT HCR400 203..280
FT /name="HCR400_5"
FT /average_identity="70.3 %"
FT /score="193"
The "average_identity" is expressed as a percentage and is calculated
from the following ratio:
average_identity = match / aln_length
where:
match = number of identical residues
aln_length = length of the aligned region (gaps included)
The "score" is calculated as follows:
score = 5 * match - 4 * mismatch
where:
match = number of identical residues
mismatch = number of non-identical residues (NB: 1 gap = 1 mismatch)
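The two formulas can be applied to a pair of aligned sequences as in the sketch below. Several points are assumptions made for this example and are not stated in this document: the function name `hcr_statistics`, the use of `-` as the gap symbol, the restriction to a single pair of sequences (how values are combined over more than two sequences is not described here), and the conversion of the identity ratio to a percentage.

```python
def hcr_statistics(seq_a, seq_b, gap="-"):
    """Return (average identity in %, score) for two aligned sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("aligned sequences must have the same non-zero length")
    # A column counts as a match only if both residues are identical
    # and neither of them is a gap.
    match = sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != gap)
    aln_length = len(seq_a)          # gaps included
    mismatch = aln_length - match    # 1 gap = 1 mismatch
    average_identity = 100.0 * match / aln_length
    score = 5 * match - 4 * mismatch
    return average_identity, score
```

For the two hypothetical aligned fragments `ACGT-ACGTA` and `ACGTTACGAA`, there are 8 matches and 2 mismatches over 10 columns, which gives an identity of 80 % and a score of 5 * 8 - 4 * 2 = 32.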
12 The XX lines
XX lines contain no data and are present in the ACUTS
Database only to improve readability of an entry when it is printed
or displayed on a terminal screen.