Previous Up Next

5  Looking for CpG islands



In mammal genomes, methylation of cytosine in dinucleotides CG (called CpG) entails low frequency of such dinucleotides, excepted in small segments, called CpG islands (see [PDM01]). A usual way to detect CpG islands is to compute the ratio number of observed CpG/number of expected CpG on sliding windows.

To goal of this part is to segment a mouse sequence to reveal CpG islands. The modelling used has two states: model of CpG island vs model of methylated sequence. Both models are learned from the proportions of words in sets of sequences, respectively in files mus_cpg.fa and mus_tem.fa.

First, we use the MPP algorithm and compute the segmentation probability to build a maximum prediction partition.

After, we proceed an HMM analysis. For this, we introduce the information that in mouse genome, CpG islands are in average 1000 bases long, and are spaced by in average 125,000 bases.

At the end, we draw the partitions computed by these methods with some computed by CpG specialised methods.


  1. We count the 1|1-proportions on the Lsequence built from both sets of sequences.
    import lsequence
    ls_cpg=lsequence.Lsequence(fic="mus_cpg.fa")
    ls_tem=lsequence.Lsequence(fic="mus_tem.fa")
    import compte
    c_cpg=compte.Compte()
    for s in ls_cpg:
        c_cpg.add_seq(s,2)
    
    c_tem=compte.Compte()
    for s in ls_tem:
        c_tem.add_seq(s,2)
    
    p_cpg=c_cpg.strip().prop(lprior=1,lpost=1)
    print p_cpg
    p_tem=c_tem.strip().prop(lprior=1,lpost=1)
    print p_tem
    


  2. We have to remove the N letters:
    p_cpg=c_cpg.restrict_to('ACGT').strip().prop(lprior=1,lpost=1)
    print p_cpg
    p_tem=c_tem.restrict_to('ACGT').strip().prop(lprior=1,lpost=1)
    print p_tem
    


  3. Then we build a Lproportion for our modelling. We introduce the inter-state probabilities for the HMM:
    import lcompte
    lp=lcompte.Lproportion()
    lp[0]=p_cpg       # state 0 for cpg islands
    lp[1]=p_tem       # state 1 for sequences between of cpg islands
    lp.g_inter(0,0,1-1/1000.0)
    lp.g_inter(0,1,1/1000.0)
    lp.g_inter(1,0,1/125000.0)
    lp.g_inter(1,1,1-1/125000.0)
    print lp
    


  4. To build partitions of the studied sequence, we have use a Lexique.
    import sequence
    s_mus=sequence.Sequence(fic="mus2.fa")
    import lexique
    lx_hmm=lexique.Lexique()
    lx_hmm.read_Lprop(lp)
    


  5. We compute the segmentation probabilities on the sequence from the models used to build the sequence, up to 200 classes.
    lprob=lx_hmm.probability(s_mus,200)
    


  6. We compute the logarithm of bayesian factor: the probability of the sequence under the segmentations in k-segments for all k>=n, divided by the probability of the sequence under the segmentations in k-segments for all k<n. And we select the biggest number of segments that is over 3.2.
    import math
    def expmean(l): #computes the logarithm of the mean of exponentials
        m=max(l)
        x=0
        for i in l:
            x+=math.exp(i-m)
        return(math.log(x/len(l))+m)
    
    lbay=[]
    for i in range(1,len(lprob)):
        lbay.append(expmean(lprob[i:])-expmean(lprob[:i]))
    nbc=lbay.index(filter(lambda x: x>=3.2,lbay)[-1])+2
    print nbc
    


  7. We compute the MPPcorresponding to the Lexique.; and we draw it, with the height of each arc proportional with the G+C density.
    
    import parti_simp
    ps=parti_simp.Parti_simp()
    ps.mpp(s_mus, lx_hmm, nbc+30)
    


  8. We want to draw the partitioning such that the height of each arc is proportional to the density in C+G. So, we use a Lexique with the descriptor C+G.
    lx_cg=lexique.Lexique(str="+(CG)")
    ps.draw_nf("mus_ps_cg.ps",func=lambda s: lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))
    


  9. And we keep the most relevant partition:
    p_nbc=ps[nbc-1]    
    


  10. The next partition is computed via Viterbi algorithm.
    import partition
    p_vit=partition.Partition()
    p_vit.viterbi(s_mus,lx_hmm)
    print p_vit
    


  11. The following one is built via forward-backward algorithm, by the most probable state on each position. Then, before, the Matrice of the log-probabilities of the states must be computed.
    import matrice
    m_fb=matrice.Matrice()
    m_fb.fb(s_mus,lx_hmm)
    p_fb=partition.Partition()
    p_fb.read_Matrice(m_fb)
    print p_fb
    


  12. And we draw the partitions:
    p_nbc.draw_nf("mus_nbc_cg.ps",func=lambda s: lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))
    p_vit.draw_nf("mus_vit_cg.ps",func=lambda s: lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))
    p_fb.draw_nf("mus_fb_cg.ps",func=lambda s: lx_cg.prediction(s_mus,deb=s.deb(),fin=s.fin())/len(s))
    


  13. We want to draw these partitions such that the height of each arc is proportional to the value:

    number of observed CpG
    number of expected CpG
    =
    number of CpG.length of the segment
    number of C.number of G


    We use a Lexique with 3 descriptors: C G and word CG, and define the function on a lexique, a sequence, and a segment, that returns the ratio.
    lx_oe=lexique.Lexique(str="0:C 1:G 2:`CG'")
    def height(l,sq, sg):
        dl=l.ls_evalue(sq,deb=sg.deb(),fin=sg.fin())
        return float(dl[2]*len(sg))/(dl[1]*dl[0])
    
    
    And we draw the partitions:
    p_nbc.draw_nf("mus_nbc_oe.ps",func=lambda s: height(lx_oe,s_mus,s))
    p_vit.draw_nf("mus_vit_oe.ps",func=lambda s: height(lx_oe,s_mus,s))
    p_fb.draw_nf("mus_fb_oe.ps",func=lambda s: height(lx_oe,s_mus,s))
    


  14. We want to compare those methods with some specifically made for CpG islands detection: CpGproD [PM01], CpGplot [LGLP92] and CpG Island Searcher [TJ02]. As the outputs of these programs have different syntaxes, they are converted in several files. Those files are, respectively, cpgprod.part, cpgplot.part, and cpgis.part.

    They are all put un a Parti_simp, with the ones computed with HMM algorithms:
    pscmp=parti_simp.Parti_simp()
    pscmp.append(partition.Partition(fic="cpgprod.part"))
    pscmp[-1].s_name("cpgprod")
    pscmp.append(partition.Partition(fic="cpgplot.part"))
    pscmp[-1].s_name("cpgplot")
    pscmp.append(partition.Partition(fic="cpgis.part"))
    pscmp[-1].s_name("cpgis")
    pscmp.append(p_fb)
    pscmp[-1].s_name("fb")
    pscmp.append(p_vit)
    pscmp[-1].s_name("vit")
    pscmp.append(p_nbc)
    pscmp[-1].s_name("mpp")
    pscmp.draw_nf("tous_cpg.ps",seg=[0])
    

Previous Up Next