Bioinformatics practical
Exercise 4: Gene prediction
Objective: identify genes in genomic sequences
We want to identify genes within a chicken genomic sequence and to determine the precise positions of their exons, introns, translation
initiation sites and translation termination sites.
We will use three different approaches to identify genes. You will have to compare the results of the three
methods, try to understand the discrepancies between the different predictions, and decide which one is the most reliable.
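One way to compare the predictions is to put the exon coordinates reported by two methods side by side and flag which exons agree exactly, which overlap but have different boundaries, and which appear in only one prediction. The sketch below is only an illustration: the function names and the coordinates are hypothetical, and it assumes that each prediction has already been reduced by hand to a list of 1-based inclusive (start, end) positions on the genomic sequence.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # 1-based inclusive (start, end) on the genomic sequence


def overlap(a: Exon, b: Exon) -> int:
    """Number of bases shared by two inclusive intervals (0 if disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def compare_exons(reference: List[Exon], other: List[Exon]) -> Dict[str, List[Exon]]:
    """Classify each exon of `other` relative to `reference`.

    exact   : same start and end as a reference exon
    shifted : overlaps a reference exon but with different boundaries
    unique  : no overlap with any reference exon
    """
    report: Dict[str, List[Exon]] = {"exact": [], "shifted": [], "unique": []}
    for exon in other:
        if exon in reference:
            report["exact"].append(exon)
        elif any(overlap(exon, ref) > 0 for ref in reference):
            report["shifted"].append(exon)
        else:
            report["unique"].append(exon)
    return report


# Hypothetical coordinates, for illustration only (not real results)
ab_initio = [(100, 250), (400, 520), (900, 1000)]
est_based = [(100, 250), (410, 520)]

print(compare_exons(est_based, ab_initio))
```

Exons classed as "shifted" point to a disagreement on a splice site or on the start/stop codon, and "unique" exons are the ones to check first against the EST and protein evidence.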
First of all, it is necessary to identify and mask repeated sequences.
1- Ab initio methods
2- Search for transcribed regions (EST, cDNA) in genomic DNA
- Search for ESTs matching the genomic sequence (masked): MEGABLAST (NCBI) (use database est_others)
- Select all matching ESTs (at least 98% identity, > 70 bp) from chicken (Gallus gallus), and retrieve sequences from NCBI BLAST output
- Assemble ESTs with CAP3 (save Contigs in a text file)
- Align each cDNA to the genomic DNA with SIM4
3- Comparative approach: search for protein-coding regions by similarity
- Search for proteins matching the genomic sequence (masked) with BLASTX (translated BLAST searches): BLAST (NCBI)
- Retrieve the closest homologue
- Align this protein to the genomic DNA with GENEWISE (Sanger), or GENEWISE (Pasteur) (parameter: genewise protein to genomic DNA)
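The EST selection criterion of approach 2 (at least 98% identity, > 70 bp) can be applied automatically once the hits are saved as text. The sketch below is a minimal illustration, assuming the hits were exported in the standard 12-column tab-separated BLAST tabular layout (query, subject, % identity, alignment length, ...). The function name and the example rows are hypothetical, and the species filter (Gallus gallus) is not covered because this layout does not include the organism by default.

```python
import csv
import io
from typing import Iterable, List


def select_est_hits(lines: Iterable[str],
                    min_identity: float = 98.0,
                    min_length: int = 70) -> List[str]:
    """Return the subject IDs of hits with identity >= min_identity
    and alignment length > min_length, without duplicates."""
    selected: List[str] = []
    for row in csv.reader(lines, delimiter="\t"):
        # Skip empty lines and comment lines
        if not row or row[0].startswith("#"):
            continue
        # Columns 2, 3 and 4 of the tabular layout (assumed, see lead-in)
        subject, identity, length = row[1], float(row[2]), int(row[3])
        if identity >= min_identity and length > min_length:
            if subject not in selected:
                selected.append(subject)
    return selected


# Hypothetical hit table, for illustration only
example = io.StringIO(
    "# hypothetical hits\n"
    "genomic\tEST_A\t99.50\t420\t2\t0\t100\t519\t1\t420\t0.0\t760\n"
    "genomic\tEST_B\t97.00\t300\t9\t0\t600\t899\t1\t300\t1e-150\t500\n"
    "genomic\tEST_C\t100.00\t60\t0\t0\t950\t1009\t1\t60\t1e-25\t111\n"
    "genomic\tEST_D\t98.00\t71\t1\t0\t1200\t1270\t1\t71\t1e-30\t120\n"
    "genomic\tEST_A\t98.60\t150\t2\t0\t700\t849\t421\t570\t1e-70\t260\n"
)

kept = select_est_hits(example)
print(kept)
```

The retained identifiers are the ESTs whose sequences should then be retrieved and assembled with CAP3.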
Useful links:
ORF Finder
Now repeat the same exercise with two other chicken genomic sequences:
genomic1 genomic2
see results