Searching for distant homologs with the PFTOOLS





Objective: retrieve homologues of human insulin in the genome of the nematode C. elegans, using profiles.  

Profile searches are based on 3 steps:

1- Compare your query sequence to databases using a basic similarity search tool (e.g. BLAST), to identify a first set of homologues
2- Align these homologues (e.g. with clustal), and build a profile
3- Compare this profile to databases to identify more distantly related homologues

Steps 2 and 3 can be iterated as long as new homologues are identified.
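
One way to decide when to stop iterating is to compare the sequence identifiers found in two successive rounds. The sketch below is not part of the PFTOOLS. It assumes you have saved one identifier per line for each round; the function name new_hits and the file names round1.ids and round2.ids are hypothetical.

```shell
# new_hits PREVIOUS CURRENT
# Print the identifiers present in CURRENT but absent from PREVIOUS.
# Both arguments are plain-text files with one sequence ID per line
# (an assumed format chosen for this illustration).
new_hits() {
    prev_sorted=$(mktemp)
    cur_sorted=$(mktemp)
    sort -u "$1" > "$prev_sorted"
    sort -u "$2" > "$cur_sorted"
    # comm -13 keeps only the lines that are unique to the second file
    comm -13 "$prev_sorted" "$cur_sorted"
    rm -f "$prev_sorted" "$cur_sorted"
}

# Example (hypothetical file names):
#   new_hits round1.ids round2.ids
```

If the output is empty, the last round found no new homologues and the iteration can stop.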

Step 1: Retrieve insulin homologues with BLASTP. This time we will use another web server: BLAST at PBIL. Parameters: Description: 10000, Alignment: 10000

Step 2: To exclude partial and artificial insulin sequences, filter the BLAST output with the following parameters (button "Filter by taxon/keyword/date") (***):


Step 3: Select sequences to be included in the profile. Given the size of the insulin family, we will only select a subset of about 20-30 insulin homologues. It is important to include distantly related homologues in the profile (i.e. to maximise the sequence diversity in the profile). However, take care to select only significant matches (E-value < 0.05) (***)
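
The selection itself is done in the web interface, but the E-value criterion can also be checked on the command line. The following is only a sketch: it assumes you have copied the hits into a text file with the sequence identifier in column 1 and the E-value in column 2. This is a hypothetical format, not the native PBIL output, and the function name significant_hits is also an assumption.

```shell
# significant_hits TABLE [CUTOFF]
# Print the IDs (column 1) of hits whose E-value (column 2) is below CUTOFF.
# CUTOFF defaults to 0.05, the threshold used in this step.
significant_hits() {
    # "+ 0" forces a numeric comparison, so values such as 1e-50 are handled
    awk -v cutoff="${2:-0.05}" 'NF >= 2 && ($2 + 0) < (cutoff + 0) { print $1 }' "$1"
}

# Example (hypothetical file name):
#   significant_hits blast_hits.txt
```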
 

Step 4: Align sequences on your local computer with clustalx (default parameters) (***)
 

Step 5: Look at the alignment with seaview. Select the conserved regions (menu "Sites" in seaview), and save these regions in MSF format. -> insfam.msf (***)

Step 6: compute the profile (***)

 pfmake -1 insfam.msf $BLOSUM62 > insfam.prf

Step 7 (optional): Retrieve all nematode proteins in SwissProt-TrEMBL using WWW-Query at PBIL, and then extract sequences in FASTA format.

NB: this file has already been prepared for you: $CELPEP


Step 8: compare the profile to the database of C. elegans proteins (CELPEP):


pfsearch -af insfam.prf $CELPEP | sort -nr  > insfam.pfs1

Step 9: Assess the statistical significance:

A set of random sequences with the same length and the same amino-acid and dipeptide composition as the CELPEP sequences was generated with the program shuffle (from the HMMER package):

shuffle -d CELPEP > CELRAND

Repeat the search against the shuffled database:


pfsearch -af insfam.prf $CELRAND | sort -nr  > insfam.pfs2
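
One simple way to use the two result files is to take the best score obtained against the shuffled sequences as an empirical cutoff, and to keep only the real matches that score higher. The sketch below assumes that every pfsearch output line starts with a numeric score, which is what the sort -nr in the commands above relies on. The function name hits_above_random is hypothetical and is not a PFTOOLS command.

```shell
# hits_above_random REAL SHUFFLED
# Print the lines of REAL whose score (column 1) is higher than the best
# score found in SHUFFLED. Assumes column 1 of both files is numeric.
hits_above_random() {
    # Best score observed against the shuffled database
    cutoff=$(awk 'NF { v = $1 + 0; if (!seen || v > max) { max = v; seen = 1 } }
                  END { if (seen) print max }' "$2")
    # Keep only the real matches that beat this cutoff.
    # Note: if SHUFFLED contains no scores, the cutoff falls back to 0.
    awk -v cutoff="$cutoff" 'NF && ($1 + 0) > (cutoff + 0)' "$1"
}

# Example with the files produced above:
#   hits_above_random insfam.pfs1 insfam.pfs2
```

Matches that score no better than the best random match cannot be distinguished from chance by this criterion.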




NB: if you want to install this software on your own computer, see the ISREC profile page, the HMMER package, and http://pbil.univ-lyon1.fr/alignment.html .