Exercise 2 - Retrieve Information in Sequence Databases

Aim: retrieve information on human insulin

1- Search for vertebrate insulin genes in EMBL

Perform the search with WWW-Query. Select the options Search for sequences and Nucleotide databank, then select EMBL in the scrolling list. Compose your query with the following criteria:

DEFAULT     Keyword     insulin
AND         Species     vertebrata
AND         Type        CDS 
AND NOT     Keyword     partial

List name   ins

The criterion AND Type = CDS selects only the protein-coding regions, and the criterion AND NOT Keyword = partial excludes incomplete sequences.

Note the number of sequences retrieved, then go back to the WWW-Query page and run the following query:
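The way the four criteria combine can be sketched in a few lines of Python. This is only an illustration of the boolean logic (AND, AND NOT), not the way WWW-Query is implemented; the Entry class, the select function and the entry names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """A toy database entry (hypothetical structure, for illustration only)."""
    name: str
    keywords: frozenset
    lineage: frozenset   # the species and all its parent taxa
    seq_type: str


def select(entries, keyword, species, seq_type, excluded_keyword):
    """Keyword AND Species AND Type AND NOT Keyword, applied to each entry."""
    return [
        e.name
        for e in entries
        if keyword in e.keywords
        and species in e.lineage
        and e.seq_type == seq_type
        and excluded_keyword not in e.keywords
    ]


# Made-up entries: only ENTRY_A satisfies all four criteria.
entries = [
    Entry("ENTRY_A", frozenset({"insulin"}), frozenset({"vertebrata"}), "CDS"),
    Entry("ENTRY_B", frozenset({"insulin", "partial"}), frozenset({"vertebrata"}), "CDS"),
    Entry("ENTRY_C", frozenset({"insulin"}), frozenset({"vertebrata"}), "tRNA"),
    Entry("ENTRY_D", frozenset({"insulin"}), frozenset({"insecta"}), "CDS"),
]

ins = select(entries, "insulin", "vertebrata", "CDS", "partial")
print(ins)
```

Each criterion removes entries from the list, so the order in which the AND and AND NOT criteria are written does not change the final result.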

DEFAULT     Keyword     @insulin@
AND         Species     vertebrata
AND         Type        CDS 
AND NOT     Keyword     partial

List name   ins2

What do you notice? Do all the sequences code for insulin? What do you conclude about the use of keywords in general sequence databases?
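To reason about the difference between the two queries, here is a minimal Python sketch. It assumes that the @ character stands for any run of characters in a keyword, which is what the second query suggests; the function name keyword_matches and the keyword list are hypothetical.

```python
import re


def keyword_matches(pattern, keyword):
    """Return True if keyword matches pattern, where '@' is assumed to be a wildcard."""
    # Escape the literal parts and join them with '.*' in place of each '@'.
    regex = ".*".join(re.escape(part) for part in pattern.split("@"))
    return re.fullmatch(regex, keyword, flags=re.IGNORECASE) is not None


# Made-up keywords of the kind that may be attached to database entries.
keywords = ["insulin", "proinsulin", "insulin-like growth factor", "glucagon"]

exact = [k for k in keywords if keyword_matches("insulin", k)]
wild = [k for k in keywords if keyword_matches("@insulin@", k)]
print(exact)
print(wild)
```

Under this assumption the wildcard query also picks up keywords that merely contain the word insulin, which is worth keeping in mind when comparing the lists ins and ins2.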

2- Access data on the human insulin sequence

Go back to the WWW-Query page and enter the following query, which retrieves a single human insulin sequence:

DEFAULT     Name        HSINS01
AND         Type        CDS 

List name   ins3

Click on the HTML link that gives access to the coding sequence of the gene. By looking at the features (line CDS_pept), what can you conclude about the structure of the insulin gene?
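If the location of the coding sequence is written in the usual feature-table style, for example join(a..b,c..d), the segments can be read programmatically. The sketch below is an illustration under that assumption; the function names and the coordinates are hypothetical and are not those of HSINS01.

```python
import re


def parse_location(location):
    """Return the (start, end) pairs found in '12..34' or 'join(12..34,56..78)'."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def describe(location):
    """Summarise a location: coding segments, gaps between them, total coding length."""
    segments = parse_location(location)
    gaps = [
        (segments[i][1] + 1, segments[i + 1][0] - 1)
        for i in range(len(segments) - 1)
    ]
    coding_length = sum(end - start + 1 for start, end in segments)
    return {"segments": segments, "gaps": gaps, "coding_length": coding_length}


# Hypothetical coordinates, chosen only to show the arithmetic.
summary = describe("join(101..250,401..550)")
print(summary)
```

A location made of several segments means that the coding sequence is interrupted on the mother sequence; the gaps between segments are the regions that are not part of the coding sequence.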

Click on the name of the mother sequence (HSINS01) to access it. Then click on the link MEDLINE; 82221404 to get the corresponding bibliographic reference in Medline.

Go back to the mother sequence and click on the link /db_xref="SWISS-PROT:P01308" to retrieve the corresponding entry in SWISS-PROT. In the SWISS-PROT annotations, find the features table and note the locations of the insulin B and A chains and of the C peptide.
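Once you have noted the positions from the features table, the corresponding stretches of the protein can be cut out with a short Python helper. This sketch assumes that feature positions are 1-based and inclusive; the function name extract_region, the toy sequence and the coordinates are hypothetical and are not the insulin values.

```python
def extract_region(sequence, start, end):
    """Return residues start..end of sequence (1-based, both ends included)."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError(f"invalid region {start}..{end} for length {len(sequence)}")
    # Python slices are 0-based and exclude the end index.
    return sequence[start - 1:end]


# Toy sequence and made-up feature coordinates, for illustration only.
toy_protein = "MKTAYIAKQRQISFVK"
toy_features = {"region 1": (3, 6), "region 2": (10, 16)}

regions = {
    name: extract_region(toy_protein, start, end)
    for name, (start, end) in toy_features.items()
}
print(regions)
```

To apply it to the exercise, replace the toy sequence with the SWISS-PROT sequence and the toy coordinates with the positions you noted for the B chain, the A chain and the C peptide.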

Click on the link PRODOM; P01308 to retrieve information on the domain structure of insulin in PRODOM. Once the graphical page is loaded, click on the blue box containing a red dot at the left of the page: you will access the list of all proteins sharing at least one homologous domain with human insulin. Are all these proteins insulins? Note that if you click on the box containing the picture corresponding to the insulin domain, you will access the alignment of this domain.

Go back to the SWISS-PROT entry and click on the link PROSITE; PS00262 to retrieve information on the insulin protein signature in PROSITE. Click on the link PDOC00235 to see the textual description of the insulin signature.

3- Retrieve the sequence from the server

To retrieve the sequence, go back to the results page of the query that returned a single human insulin sequence. If this page is no longer in the cache of your Web browser, run again the query described at the beginning of section 2. Once you have the results page with only the HSINS01.INS sequence, click on the Retrieve button. The page that appears lets you retrieve all the sequences stored in a list. We need the insulin protein sequence in Fasta format, so make the following selections:

Sequence:   proteins
Format:     Fasta
Mode:       direct sending

List name:  ins3

Once the sequence appears in the window of your Web browser, save it on your computer in text format.
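To check the saved file, the Fasta text can be read back with a few lines of Python. This is a minimal sketch that assumes the usual Fasta layout (a header line starting with >, followed by one or more sequence lines); the function name parse_fasta and the example record are hypothetical.

```python
def parse_fasta(text):
    """Return a dict mapping each record name to its sequence."""
    records = {}
    name = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records[name] = "".join(chunks)
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


# Made-up record, for illustration only (not the insulin sequence).
example = ">EXAMPLE1 hypothetical protein\nMKTAYIAK\nQRQISFVK\n"
parsed = parse_fasta(example)
print(parsed)
```

To use it on your own file, read the file content into a string (for instance with open(...).read()) and pass it to parse_fasta; the length of the returned sequence can then be compared with the length given in the SWISS-PROT entry.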
